In March of 2025, my employer mandated that everyone use AI for coding. Agentic coding had not yet arrived. I could vibe code some slop but I couldn't get it to move TypeScript import statements from one file to another without the agent getting confused. The models evolved. People tried elaborate "prompt engineering" or Ralph loops but, still, agentic coding hadn't yet arrived.
Finally, in February of 2026 Anthropic released Opus 4.6. This is when agentic coding arrived. There were a few other inflection points but 4.6 was the first model that knew how to code without prompt engineering and without too many hacks. It was fast. It was very good at responding to feedback (more on that in a minute).
Code factories have not yet arrived
Even Steve Yegge admits that one cannot build an autonomous code factory today. We still need skilled, human software engineers to do more than just write the specs -- they need to guide the architecture or at least dip in and out to review crucial parts of the system. Anthropic says they are approaching 99% generated code and yet they still have to review most code before shipping. We will get there. The 2389 team looks very close to building a code factory.
Wait, do we even want a code factory?
Yes, absolutely. Most "things" in the world just so happen to be powered by software. Imagine if everyone everywhere could build software at the speed of thought.
Conversely, I don't think the world needs generative AI art. Human artists aren't scarce and they aren't even expensive. Software engineers are scarce and they are very expensive. Software is a lot different from art. It either works or it doesn't. It's a set of primitives. You can even make art by writing software. Most commonly, we write software to give people tools.
If LLMs are a good fit for anything, it's software. If you've worked at any software company before AI (like I have) then you will probably agree that the to-do lists never got fully completed, no matter how big the team. There is just so much software we all want to build.
AI assisted coding
The closest thing to a code factory right now is what I think of as AI assisted coding but more commonly referred to as agentic coding. It's writing code at the speed of thought (after waiting for the result, heh). It works so well that I can't imagine ever going back.
This is an exciting time to be an experienced software engineer. We know how to architect software, how to test it, debug it, monitor it, and scale it. For the first time ever, we can keep doing all that without having to type out all the code.
But how? Funny enough, just the same way we did it before but faster. I'll explain.
Automated testing
Agents know how to run tests. Just like for humans, a suite of tests covering all existing functionality ensures the agent doesn't break things as they change the code. I always provide tests for agents, no matter how big or small the project.
Agents can write tests very quickly. When I tell an agent to use TDD (test-driven development) it already knows what I mean. Just like for humans, this simple act of specifying the behavior and setting out to pass a failing test focuses the agent and encourages it to implement a concise and simple solution (well, sometimes).
Agents can't write tests very well, so I have to explain to them (through skills or prompts) how to simulate user actions in a test, how to create facades and safe mock objects via static typing, and so on -- basic test hygiene.
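To make that concrete, here's the kind of typed facade I nudge agents toward -- a minimal sketch with hypothetical names, where `implements` keeps the fake honest:

```typescript
// A facade over some external service, plus a typed fake for tests.
// Names are hypothetical; the point is that the compiler forces the
// fake to stay in sync with the real interface.
interface PaymentGateway {
  charge(customerId: string, cents: number): Promise<{ ok: boolean }>;
}

class FakePaymentGateway implements PaymentGateway {
  charges: Array<{ customerId: string; cents: number }> = [];

  async charge(customerId: string, cents: number): Promise<{ ok: boolean }> {
    this.charges.push({ customerId, cents }); // record instead of calling out
    return { ok: true };
  }
}
```

Tests then assert on `charges` instead of spying on internals, which keeps them readable and safe to refactor.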
The more feedback loops the better
Everything helps focus the agent: lint errors, static type failures, test failures, automated code reviews, etc. This will be a key ingredient in the inevitable code factory. I've also found that agents do really well when I give them everything they need to run and debug their own code. When agents get stuck, I ask them to just make the debugging tools they need to figure it out. And they do.
For CI (continuous integration), providing agents with the tools to automatically detect and fix errors gets me to a mergeable pull request quicker. However, CI is the slowest type of feedback loop -- a better feedback loop is a script the agent can run locally and quickly. The quicker the better.
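A minimal sketch of what that local script can look like, assuming a TypeScript project with eslint, tsc, and vitest (substitute your own commands):

```typescript
// check.ts -- a fast, local feedback loop the agent runs after every change.
import { spawnSync } from "node:child_process";

const steps: Array<[string, string[]]> = [
  ["npx", ["eslint", "."]],     // lint errors
  ["npx", ["tsc", "--noEmit"]], // static type failures
  ["npx", ["vitest", "run"]],   // test failures
];

for (const [cmd, args] of steps) {
  const { status } = spawnSync(cmd, args, { stdio: "inherit" });
  if (status !== 0) process.exit(status ?? 1); // fail fast, keep the loop tight
}
console.log("all checks passed");
```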
Code factory primitives
I've been acting as an orchestrator of my own agentic code factory to figure out what the primitives are. I want to control the quality. I'm not fully automating everything (yet). I'm trying to stay close to the metal.
I've been using pi because it's very fast and very hackable and lets me experiment with different automations and try different models. Orchestrating a code factory is exhilarating and sometimes exhausting.
Autonomous coding
The most obvious code factory primitive is letting agents autonomously complete each coding task from start to finish without human input. The pi system prompt is very minimal. It focuses the agent on completing the task without interruptions.
I run pi in a simple sandbox-exec to make sure it only writes to the current directory since that's the only place it needs to put code. I had to set up exceptions for writing global state and chose locations like ~/.caches and ~/.local. I also allow some git locking files for worktrees. That's pretty much it.
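Roughly, the wrapper looks like this -- a sketch only, since the exact SBPL profile and allowed paths depend on your tooling:

```typescript
// sandboxed-pi.ts -- run pi under macOS sandbox-exec, writable only in the
// project directory plus a few global-state exceptions (paths are assumptions).
import { spawnSync } from "node:child_process";
import { homedir } from "node:os";

const profile = `
(version 1)
(allow default)
(deny file-write*)
(allow file-write*
  (subpath "${process.cwd()}")      ; the only place code belongs
  (subpath "${homedir()}/.caches")  ; global caches
  (subpath "${homedir()}/.local")   ; tool state
  (literal "/dev/null"))
`;

// Child processes inherit the sandbox, so everything pi spawns is covered too.
spawnSync("sandbox-exec", ["-p", profile, "pi", ...process.argv.slice(2)], {
  stdio: "inherit",
});
```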
Since I don't expose my API tokens to the Internet, exfiltration isn't my main worry. I worry about agents being too clever with their fixes. I worry about supply chain attacks. I've seen agents get blocked by the sandbox when trying to change global configs to get around what they perceive as tooling problems. It's not perfect and I'm sure a rogue agent could escape the sandbox if it wanted to.
Anthropic obviously had a hard time figuring this out. Sandboxing is hard. I feel like Claude Code's permission model will go down as the worst mistake they've ever made. Asking the user to babysit each write operation isn't just a waste of time, it's hard for users to actually understand what they are about to allow. It's difficult to configure the right allowances.
Also, with pi, I want to lean on rich, composable extensions -- I need a sandbox that applies to all processes and their subprocesses. One could also do this with a virtual machine or container.
The brain
I type /brain to see what tasks I have in progress and switch between them. An automated code factory might have something like a prioritized task queue but since I use GitHub projects for that, /brain is more like an orchestrator: do this, do that, remind myself what I'm waiting on, and so on.
Review loop
Whenever I start a task with a specification (the prompt), I use /queue, an extension that queues a series of follow-up prompts after the agent is finished. I use this to prompt a thorough self-review: reminders of things to check, like did you make sure all tests pass, all types compile, there are no lint errors, and all code is formatted? Agents are forgetful.
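In spirit, the queue is nothing more than an ordered list of prompts drained after the main task completes (illustrative sketch, not pi's actual extension API):

```typescript
// Hypothetical handle to the running agent; pi's real extension API differs.
declare const agent: { run(prompt: string): Promise<void> };

const followUps = [
  "Review your full diff and fix anything questionable.",
  "Do all tests pass? Do all types compile? Any lint errors? Is everything formatted?",
  "Verify every new test with the break / fix strategy.",
];

for (const prompt of followUps) {
  await agent.run(prompt); // send one follow-up, wait for it to finish
}
```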
Since agents can't write tests very well I also prompt for basic test hygiene:
- Verify all tests by using the break / fix strategy (because agents "forget" to use TDD)
- Validate test setup / teardown. All teardowns must be atomic.
- Ensure each test has an explicit setup for the exact scenario it's testing
- Make sure all test assertions will fail with a message showing the reader exactly what went wrong (see the sketch after this list)
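For example, that last point in practice (hypothetical module under test, using node's built-in test runner):

```typescript
import assert from "node:assert";
import { test } from "node:test";
import { parseDuration } from "./duration"; // hypothetical function under test

test("parses minutes", () => {
  const seconds = parseDuration("5m");
  // On failure the reader sees input, expected, and actual -- not just "failed".
  assert.strictEqual(
    seconds,
    300,
    `parseDuration("5m") returned ${seconds}, expected 300`
  );
});
```

The break / fix strategy then means temporarily breaking `parseDuration`, confirming the test fails with that exact message, and restoring it.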
And so on. Just standard software engineering. I tell the agent to trace external user inputs and ensure they are sanitized, along with other security checks. I tell it not to use try / catch for flow control, not to write a bunch of dumb comments, etc. Theoretically an agent should know all this because of my AGENTS.md and skills but it gets bogged down with context and ignores them.
A simple queue of prompts is surprisingly effective. Agents get lazy when they review their own work, though. A fancier code factory might have multiple review loops running on various models, subagents seeded with specialist context, subagents to judge the feedback before applying it, etc.
Project isolation
Agents need a way to work on the same repository simultaneously. I use /worktree to create a git worktree or I use a different extension that wraps my company's internal monorepo tool. If a software project cannot be isolated easily with worktrees you might need virtual machines (local or in the cloud). Anyone who has worked on software before AI probably already did something like this to multi-task their work.
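My worktree extension boils down to something like this (the branch and path naming here are assumptions, not pi's conventions):

```typescript
// new-task.ts -- one isolated worktree per task so parallel agents
// never trample each other's files.
import { execSync } from "node:child_process";

const task = process.argv[2]; // e.g. "fix-login-redirect"
if (!task) throw new Error("usage: new-task <task-name>");

execSync(`git worktree add ../repo-${task} -b agent/${task}`, {
  stdio: "inherit",
});
console.log(`agent workspace ready at ../repo-${task}`);
```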
Human code review
Ah, here we are. This is the reason we don't yet have fully autonomous coding factories for production software. I do not trust my agent. I want to see the code. I made /git for this: a wrapper around delta that lets me page through a diff and ask the agent to fix things as I read through it.
Will we ever let agents ship highly sensitive, money-making production code without us ever reading it? Yes, probably one day. I am pretty sure Anthropic is trying hard to do this internally. I can imagine a battery of tests and elaborate agent critic loops to maybe get us there.
For vibe coded prototypes I don't read the code but I always run the code. The future of code factories will probably look something like that: the human orchestrator will focus on actually running the thing which may mean just auditing its functional test suite.
Honestly, I wouldn't miss code review. It's tedious. I have to compile code in my head. I have to trace the execution and think of all the things that could go wrong. It's a valuable thought exercise, though.
When not to use AI
Claude Code wants me to write a skill for everything. This causes the agent to burn tokens figuring out what commands to run. The pi tool instead lets me write extensions like scripts: they just execute code. They can prompt the agent through an extension API if needed. This is a better model.
LLMs are non-deterministic so, really, they are best avoided wherever a deterministic script will do. I want my agent to generate code but I don't need it to figure out what git branch I'm on or how to create pull requests with gh. I can just write scripts to figure that out with standard introspection.
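For example, the branch-and-PR chore is a deterministic two-liner; no model required:

```typescript
// Plain introspection: same result every time, zero tokens.
import { execSync } from "node:child_process";

const branch = execSync("git rev-parse --abbrev-ref HEAD").toString().trim();

// --fill lets gh derive the PR title and body from the commits.
execSync(`gh pr create --head ${branch} --fill`, { stdio: "inherit" });
```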