How I Actually Use AI in My Research: A Solo Researcher's Full Pipeline

A while back I wrote a piece arguing that vibe coding is just tool use, and tool use is not shameful. I stand by it. But rereading it, I noticed I’d spent the whole essay defending the fact that I use AI and almost no words on how I use it. That’s the more useful thing to share, and it’s also the more honest one: it’s easy to defend a practice in the abstract and quietly do something sloppier in private. So this is the practice, laid out plainly.

One thing to set up front, because it shapes everything: I’m a solo researcher. No grant, no program manager, no deadline handing me a problem. That means my projects don’t start the way funded projects do. Nothing is pushing me. A project starts because I noticed something — and the whole pipeline below is really just the story of what happens to that “something” on its way to becoming a paper, and where the AI is standing at each point.

The short version, if you only read one line: I hand the AI a lot of production and almost none of the judgment. Where that line falls, exactly, is the rest of the post.

It starts with noticing something

There’s no kickoff meeting. A project begins when I notice an interesting phenomenon, or spot a pattern worth continuing from an old project, or — this one happens more than you’d think — catch something that doesn’t fit, some incongruity in a project I’m already running. The dissonance is often the best seed.

Before I open any LLM, I summarize for myself what I think I’ve found. This sounds trivial and isn’t. The summary is mine, in my own words, before the tool gets a vote. The observation and the instinct behind it are the part I’m actually good at, and they’re the part I want to keep uncontaminated.

Turning the instinct concrete — and why hallucination doesn’t matter here

Then I open an LLM, and the mode here is brainstorming. But not the way people usually picture it. I’m not asking it to come up with ideas. I’m using it to make my idea concrete.

The loop is: it throws out a version it thinks is well-defined, and I correct it — with my instinct, repeatedly — until the thing on the screen matches what I actually meant. Then I lock that version in. The AI’s job is to fire a specific, concrete shot for me to shoot at. My job is to keep dragging it back into the shape my intuition is pointing at. The anchor and the referee are both me; what it provides is a starting point that’s wrong in useful ways.

Here’s the part worth pausing on, and I want to state it carefully so it isn’t misread: at this stage, hallucination is not harmless in general. It is harmless only because I am not treating the output as fact. The model is not being asked to report what is true. It is being asked to generate disposable formulations for me to reject, repair, or sharpen. In concept formation, working on something original, there is no correct answer yet for it to deviate from, and everything it says is a proposal I’m going to revise anyway. “It might be making things up” isn’t a bug when your default posture toward every line is “correct it.” Hallucination becomes dangerous later, when the output claims to correspond to an external fact: a citation, a numerical result, a dependency, a theorem, or the scope of a claim. Up here, none of those are in play yet. That’s the whole reason it’s safe, and it’s the only reason.

What I lock in at the end of this is a concept — a mechanism or direction I now understand cleanly. Not code yet. Just a thing I’ve thought all the way through.

Novelty check and landing — where the AI fetches but I decide

Once the concept is firm, two checks happen, and the division of labor is sharp.

Novelty check. I ask the LLM to go pull related papers off the web. Then I download them and read them myself. Whether the idea is viable, whether someone already did it — that judgment runs on my own sense of the field, not on the model’s summary. The LLM is hands and feet here, not a brain. It moves the material in front of me; what to trust and how to read it stays with me.

Landing ability. Because I’ve run projects before, I can usually tell whether a concrete-enough idea will actually land academically — whether it can become real work, real results. I do this one myself, almost always. (I’ve been burned letting a model wander into territory I didn’t authorize, so this stays on my desk.) Ideas that score high on landing get prioritized; low-landing ones I keep alive on a slow, periodic burn rather than dropping.

If there’s a single thread running through this whole pipeline, it shows up first right here: the AI produces candidates and fetches materials, but the line of judgment — is this true, is it worth doing, is it the right shape — I never hand over.

The planning doc: an anchor outside the conversation

Once I’ve decided a project is worth doing, I have the LLM generate a planning document as a markdown file, and then I edit it into shape.

This file is the project’s source of truth. Here’s why it matters more than it looks: a conversation is disposable and drifts. Context fills up, the model starts forgetting what we said thirty turns ago, the thread wanders. So I don’t store the project’s direction in the conversation. I externalize it into a doc. I can throw away the entire chat session and still pick the project back up from this file.

I want to be honest that this isn’t a habit I backed into. I’ve built KV-cache eviction methods and worked on model architecture, so I’ve never had the luxury of pretending context is reliable. I know what it is: finite, lossy, with attention that drifts. Someone who works on cache eviction for a living is not going to trust a chat window to remember things. So from the very start I treated the planning doc as a manual external memory — evicting the state that needs to persist out of the volatile context and into a place that won’t get overwritten. I’m not jotting notes while I work. I’m managing the model’s memory with the instincts of someone who designs memory systems.

The doc is LLM-drafted, me-edited — the same split as everything else: the model lowers the cost of getting a first version down, I keep authority over the content.
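A minimal sketch of what "pick the project back up from this file" could look like in practice, assuming the plan lives in a single markdown file. The function name, the size limit, and the wording of the preamble are illustrative assumptions, not something the workflow above prescribes.

```python
from pathlib import Path


def build_session_preamble(plan_path: Path, max_chars: int = 20_000) -> str:
    """Return text to paste at the top of a fresh chat session.

    The planning doc is treated as the only persistent state: nothing from
    the previous conversation is carried over.
    """
    plan = plan_path.read_text(encoding="utf-8")
    if len(plan) > max_chars:
        # Refuse rather than truncate: a silently cut source of truth
        # is worse than an explicit error the author has to resolve.
        raise ValueError(
            f"{plan_path.name} is {len(plan)} characters; trim it by hand "
            f"to stay under {max_chars}."
        )
    header = (
        "The markdown below is the project's source of truth. "
        "If anything in this conversation conflicts with it, the document wins.\n\n"
    )
    return header + plan
```

The one deliberate choice here is failing loudly on an oversized plan instead of trimming it automatically, since deciding what to drop from the plan is itself a judgment call.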

Building the code: pay attention where it costs you

A lot of my projects need a lot of code, so in practice I just ask the LLM to scaffold a first version, and then I go read it.

And here’s something that catches people off guard: as long as I’m not reproducing someone else’s work, I mostly don’t have to worry about hallucination here either. Take a renderer like Mitsuba — there’s no single “correct” way to write against it. There’s no reference answer for the output to deviate from. Even if the code carries some technical debt, this is research code; it runs to produce a result and gets archived. The debt is a real defect with a bounded cost — it makes the code less pleasant to reuse, it doesn’t corrupt the science. Not worth organizing my tooling around. (The exception proves the rule: if I were reproducing a published result, suddenly there’s a reference to deviate from, and hallucination is back on the table.)

But I do read what it wrote, and I don’t spread my attention evenly. I concentrate it where being wrong has consequences:

Anything numerical — I’ll read carefully, check the actual values it used, make sure they’re right.
External resources — I verify it didn’t invent a dependency or pull something that doesn’t exist.

The places with no reference answer and no downstream consequence, I let it run and skim once. The places with an external fact to check, or where an error would leak into a conclusion, I scrutinize. Verification isn’t sprinkled uniformly; it’s spent where a mistake would actually cost something.
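The dependency check in particular is cheap to partly automate. The sketch below uses only the standard library (`ast` and `importlib.util.find_spec`) to list top-level imports in a generated file that do not resolve in the current environment. The helper names are hypothetical, and the check is deliberately narrow: it says nothing about whether a module that does resolve actually has the functions the code calls.

```python
import ast
import importlib.util


def top_level_imports(source: str) -> set[str]:
    """Collect the top-level package names imported by a piece of source code."""
    names: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                names.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # level > 0 means a relative import (from . import x); skip those.
            if node.level == 0 and node.module:
                names.add(node.module.split(".")[0])
    return names


def unresolved_imports(source: str) -> list[str]:
    """Return imported package names that cannot be found in this environment."""
    missing = []
    for name in sorted(top_level_imports(source)):
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            spec = None
        if spec is None:
            missing.append(name)
    return missing
```

A name showing up in the result means either the package is not installed or it never existed; telling those two apart is still a manual lookup.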

The one hard rule: never let it do the statistics

This is the line I don’t bend on. The LLM is never allowed to compute statistics itself. Every statistical number it produces must come from a script it wrote and then ran — never from it just telling me a figure.

The reason is not philosophical; it’s common sense. The model is bad at arithmetic, anyone doing deep learning knows this, and I’m not going to gamble on it, especially when running a script costs almost nothing. Look at the trade: betting on a directly reported number saves you the trivial cost of writing a script; if you lose, you’ve shipped a wrong statistic that may already be sitting inside a conclusion. The payoff is laughably asymmetric. There’s nothing to think about.

So the rule isn’t really “statistics versus non-statistics.” The real line is whether the output will survive as an archived conclusion. Throwaway, in-process stuff it can hand me directly. But anything that might get saved and later treated as a result of this project has to go through the checkable middle layer — a script you can read, rerun, and audit. And I keep all of it: the raw data, and every execution script, so the whole chain from raw input to final number stays traceable.
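As an illustration of that checkable middle layer, here is a minimal sketch of a statistics script, assuming the raw data is a CSV file with one numeric column of interest. The file layout, column name, and output fields are assumptions made for the example. The point is only that every number comes from code that can be rerun, and that the output records which raw file it came from.

```python
import csv
import hashlib
import io
import json
import statistics
from pathlib import Path


def summarize(raw_csv: Path, column: str) -> dict:
    """Compute summary statistics for one column and record their provenance."""
    raw_bytes = raw_csv.read_bytes()
    reader = csv.DictReader(io.StringIO(raw_bytes.decode("utf-8"), newline=""))
    values = [float(row[column]) for row in reader]
    return {
        "source_file": raw_csv.name,
        # Hash of the exact bytes read, so the number can be tied to its input.
        "source_sha256": hashlib.sha256(raw_bytes).hexdigest(),
        "column": column,
        "n": len(values),
        "mean": statistics.mean(values),
        # Sample standard deviation (n - 1 denominator); needs at least 2 values.
        "stdev": statistics.stdev(values),
    }


def save_summary(result: dict, out_path: Path) -> None:
    """Write the result as JSON so it can be archived next to the raw data."""
    out_path.write_text(json.dumps(result, indent=2) + "\n", encoding="utf-8")
```

Calling `save_summary(summarize(Path("runs.csv"), "score"), Path("runs_summary.json"))` (both file names hypothetical) leaves a small JSON file that can be kept alongside the script and the raw data, which is what keeps the chain from input to final number traceable.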

Statistics is just the most dangerous instance of this, because a fabricated statistic is exactly the kind of error that’s fluent, confident, internally consistent, and passes every surface check — the worst kind, the kind that gets into print. (I’ve generated errors like that unaided, so I have no illusion this is a machine-only failure. The fix is the same either way: force the work through a process you can inspect, and trust the chain, not anyone’s word for it.)

Math modeling: walk the derivation together

When there’s modeling to do, I tend to discuss the math with the LLM first — what formulas to use, what derivation paths might work.

And here hallucination is, again, not a big problem — for a specific reason. A derivation is something you walk through one step at a time, together. There’s no room for a conclusion to jump. The danger with these models is the leap: a conclusion that forms first and then gets dressed in plausible-looking mechanism. But step-by-step derivation gives the leap nowhere to hide. I’m present at every step; it can’t suddenly land on a result I haven’t watched it reach.
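One optional aid for spotting a leap, offered as a sketch and not as part of the workflow described here: if each step of a derivation is written as a function of the same variable, consecutive steps can be spot-checked numerically at random points. The toy identity and the function name below are made up for illustration. A passing check is not a proof; a failing one points at the exact step where the chain breaks.

```python
import math
import random


def first_broken_step(steps, trials=200, tol=1e-9, seed=0):
    """Return the index of the first step that disagrees with the one before it.

    Returns None if every consecutive pair agrees at all sampled points.
    """
    rng = random.Random(seed)
    points = [rng.uniform(-10.0, 10.0) for _ in range(trials)]
    for i in range(1, len(steps)):
        for x in points:
            if not math.isclose(steps[i - 1](x), steps[i](x), rel_tol=tol, abs_tol=tol):
                return i
    return None


# Toy derivation: (x + 1)^2 - (x - 1)^2 = 4x, written out step by step.
good = [
    lambda x: (x + 1) ** 2 - (x - 1) ** 2,
    lambda x: (x * x + 2 * x + 1) - (x * x - 2 * x + 1),
    lambda x: 4 * x,
]

# Same start and same final line, but the middle step has a sign error:
# the conclusion is right, yet the step shown does not lead to it.
leap = [
    lambda x: (x + 1) ** 2 - (x - 1) ** 2,
    lambda x: (x * x + 2 * x + 1) - (x * x + 2 * x + 1),
    lambda x: 4 * x,
]
```

Here `first_broken_step(good)` finds no disagreement, while `first_broken_step(leap)` flags step 1, even though the last line of both chains is identical.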

Honest note on cost: for the more original formulas, I’ll send them out for an external validity check. But a lot of my projects don’t invent new formulas at all — they’re mostly derivation. So the verification cost here is genuinely fine. Not large, not a burden. I mention that because I don’t want this whole post to read as “I’m so rigorous that using AI is exhausting.” It isn’t. Most of the time the load is light precisely because most of the work is derivation, not invention.

Writing the paper: outsource the language, never the claim

The paper is the last stretch, and two facts about my situation set the approach. First, I’m not a native English speaker. Second, journals now put author responsibility for content front and center — the content is on me, not on any tool.

So the workflow is: I write a version first, the AI polishes the language, and then I go back and check it paragraph by paragraph, translating each one back to confirm it didn’t move my content. That last step is the whole point. Polishing is exactly where an accident happens — the model quietly rewrites a point or adds something I never said, tucked inside prose that reads beautifully. The paragraph-by-paragraph back-translation is the wall against that. Language I’ll outsource; meaning I won’t.
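Part of that check can be given a mechanical first pass. The sketch below assumes the draft and the polished text use blank lines between paragraphs and keep the same paragraph count; it flags any number that appears in a polished paragraph but not in the matching draft paragraph. The regular expression and function names are assumptions for the example. It only catches numeric drift, so it sits in front of the paragraph-by-paragraph reading and does not replace it.

```python
import re

# Integers or decimals, optionally followed by a percent sign.
NUMBER = re.compile(r"\d+(?:\.\d+)?%?")


def split_paragraphs(text: str) -> list[str]:
    """Split on blank lines and drop empty paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def new_numbers(draft: str, polished: str) -> list[tuple[int, list[str]]]:
    """List (paragraph index, numbers) that the polished text added."""
    d, p = split_paragraphs(draft), split_paragraphs(polished)
    if len(d) != len(p):
        # Merged or split paragraphs need a human to realign them first.
        raise ValueError(f"paragraph count changed: {len(d)} -> {len(p)}")
    report = []
    for i, (before, after) in enumerate(zip(d, p)):
        added = sorted(set(NUMBER.findall(after)) - set(NUMBER.findall(before)))
        if added:
            report.append((i, added))
    return report
```

An empty report means no new figures were introduced, not that the meaning survived; a rewritten claim with no numbers in it passes straight through.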

There’s a second thing I keep on my side of the table, and it’s the more important one: scope and claim control. The model over-claims. Plainly: the data only supports saying this much, and it’ll extrapolate the conclusion well past that. So I drive scope and claims, pulling every statement back to what the data actually holds up.

This isn’t a language problem, which is why it can’t ride along with the polishing. How strongly a result can be stated is a question about the evidence, and the evidence is the one thing the model doesn’t have. It doesn’t know how many runs I did, where the boundaries are, what’s still unverified. So the paper-writing split ends up clean: the AI handles the language, where it’s genuinely strong; I keep the content, the meaning, and the size of the claims. Back-translation guards against “it changed what I meant,” and driving scope guards against “it said more than I can.”

What the whole thing has in common

Lay the pipeline end to end — notice something, make the instinct concrete, novelty-check, judge the landing, anchor it in a doc, scaffold the code, lock down the statistics, walk the math, write the paper — and the same shape shows up at every station.

The AI is always doing production: firing concrete first drafts, fetching papers, scaffolding code, drafting logs and plans, polishing prose. And at every station I keep the same single thing for myself: the judgment. Which shape is right. Whether it’s worth doing. Whether the number is real. How big the claim can be. I’ll let the model produce almost anything. I won’t let it decide whether to believe the result.

That’s the honest version of “tool use is not shameful.” It’s not that the tool does no harm: it over-claims, it invents dependencies, it wanders into territory I didn’t authorize. It’s that none of those failure modes touch the part I never handed over. Use the tool for what the tool is good at, keep the judgment, and there’s nothing to be embarrassed about.

A note for completeness, in the spirit of not overselling: this is how I work in 2026, as one solo researcher in my particular corner. The places where I keep the AI out — original formulas, the size of a claim, what counts as a result — are where the line sits for me, now. Lines move. The habit I’d actually defend isn’t the specific border; it’s keeping a border at all, and knowing which side each task belongs on.