
Why Batch Size Matters More Than Velocity

Will Raymer

Almost everyone’s first real experience with AI goes something like this: you hear it can do something amazing, so you ask it to do something amazing. “Plan me a trip to Bali.” “Build me a Stripe integration.” It gets 70% of the way there on its own, you end up with a C-minus version of what you wanted, and you spend the next two hours figuring it out yourself.

So the second time, you try again, starting smaller — “When’s a good time to visit Bali?” “Create a checkout session for the monthly subscription” — and, with a longer conversation and a bit of luck, you iterate your way toward the result everyone’s been breathlessly posting about.

Most people learn that pattern pretty quickly. But lazy big-batch prompting is still so tempting. It’s 10pm, the workday is done, you’re half-watching Make Some Noise, maybe nursing a drink, poking at a side project with (hopefully) low stakes. It’s peak “tell Claude to do something I should know damn well it’s not going to figure out on the first pass, then get frustrated when the result is exactly what I deserve” territory. (With an agentic harness and `--dangerously-skip-permissions`, you can even work toward the same unusable outcome while you’re sleeping!)

The facile answer is “prompt better,” and there’s a whole cottage industry of tools to help us all write better prompts. But better prompting won’t save us: the underlying problem is batch size.

The wrong lever

A lot of the conversation about AI and coding is about velocity. CEOs are reading headlines that AI lets everyone code 10x faster, and most organizations get far enough to see some genuinely promising examples — a PM writes a dashboard in an afternoon that the whole support team uses every day, or an engineer prototypes a feature that would have taken a quarter to even prioritize. Claude Code, Cursor, Copilot — they’re all optimized for throughput, and they’re genuinely impressive at it.

But for a lot of software teams, this ends up looking a bit like a greyhound running around the living room. The theoretical top speed is real, but without the right conditions, higher velocity just means breaking things faster. Resolving the bottleneck isn’t a function of throughput — it’s a function of batch size: how much work travels together as a unit. A 1,000-line PR and ten 100-line PRs represent the same total work, but they have radically different outcomes, and not just because one is easier to review.

Donald Reinertsen wrote a whole book about this in 2009, The Principles of Product Development Flow, which you should go read if you build software for a living; it’s really, really insightful. One of his central claims is that smaller batches reduce queue length (how much stuff is waiting to get done), cycle time (how long stuff takes to do), and variability (how unpredictable the whole process is) — all without adding capacity.
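Reinertsen’s cycle-time claim is easy to see with back-of-envelope arithmetic. The sketch below is illustrative, not from the book: it assumes a single reviewer working through batches in order at an arbitrary rate of 100 lines per hour, and compares how long, on average, finished work waits before it ships.

```python
# Illustrative only: same total work, different batch sizes, very
# different average time-to-ship. The 100 lines/hour review rate is
# an arbitrary assumption.

def average_ship_time(batch_sizes: list[int], lines_per_hour: int = 100) -> float:
    """Mean time (hours) until each batch is reviewed and shipped,
    with one reviewer processing batches in order."""
    clock = 0.0
    ship_times = []
    for lines in batch_sizes:
        clock += lines / lines_per_hour  # review time for this batch
        ship_times.append(clock)         # batch ships when its review ends
    return sum(ship_times) / len(ship_times)

one_big = average_ship_time([1000])        # 10.0 hours: nothing ships early
ten_small = average_ship_time([100] * 10)  # 5.5 hours on average
```

Same 1,000 lines, same reviewer, same throughput — but the small batches deliver value almost twice as fast on average, because the first slice ships after one hour instead of ten.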

The cafeteria problem

Here’s an example Reinertsen uses: imagine a school with 500 students and a cafeteria that seats 200. They have an obvious capacity problem.

A lot of software organizations look at that situation, or its equivalent, and decide they need to build two more cafeterias. Add headcount! Spin up more infrastructure! Increase capacity!

Most schools just set three lunchtimes.

The solution is so intuitive that most of us see it immediately when we’re looking at an example outside our own work. The cafeteria gets better throughput, with no new capacity. The difference is that smaller batches fit through the existing system without piling up into queues. Ceremonies notwithstanding, this is what agile is really supposed to be about — making the unit of improvement small enough that you can iterate on it before the context goes stale.

Anyone who’s reviewed a 2,000-line pull request understands this viscerally. You open the PR, you scroll, you keep scrolling, your eyes glaze over, you pray for death. We’ve known as an industry for a long time that big PRs get fewer substantive comments than small ones.

This has gotten worse in the AI era. Over the last year, a wave of high-profile projects have changed their contribution policies in response — curl, tldraw, Ghostty, LLVM, among others. The problem isn’t AI per se; it’s that AI makes it trivially easy to blithely diagnose a “known issue” in an open-source codebase and generate big, hard-to-review, specious fixes.

DORA’s take

Google’s DORA research team has been scientifically studying software delivery performance for a decade (since well before they worked for Google), and their 2025 report identified seven “foundational capabilities” that predict whether AI actually improves organizational outcomes. One of them is small-batch discipline.

They found that when teams ship in small batches, their AI productivity gains are much more likely to translate into organizational outcomes. Teams that ship in big batches see individual metrics improve (tasks completed, PRs opened) while organizational metrics stay flat or get worse (cycle time, defect rates). The gains get lost somewhere between the individual and the team.

This makes sense if you think about what big batches do to a system. They create queues, queues create bottlenecks, and bottlenecks create context loss. By the time a big PR (especially among dozens of other big PRs) gets reviewed, everyone has forgotten why it was written and what it does. The reviewer doesn’t have context, the author has moved on, and the whole thing becomes a high-stakes chore rather than a collaboration.

Decomposition

So if batch size is the lever, how do you pull it?

The instinct with AI is to think big: “Build me an auth system.” “Implement an e-commerce platform.” “Refactor this monolith.” The alternative is decomposition — breaking work down before you start coding. Not “build the auth system,” but “log in with email and password,” then “deliver one-time PINs,” then “reset password via email.” Each piece is small enough to review, test, and understand. Each piece can at least be validated independently, and with a little feature flagging, even shipped.
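A minimal sketch of what that decomposition looks like operationally, assuming a simple in-process flag store (the flag names and the `FLAGS` dict are hypothetical, not a real feature-flag library): each auth slice merges independently, and the flag controls when users see it.

```python
# Hypothetical sketch: each decomposed slice ships behind its own flag.
# Flag names and the FLAGS store are illustrative, not a real library.

FLAGS = {
    "login_email_password": True,   # slice 1: shipped and enabled
    "login_one_time_pin": False,    # slice 2: merged, dark-launched
    "password_reset_email": False,  # slice 3: not started
}

def is_enabled(flag: str) -> bool:
    """Return whether a decomposed slice is live for users."""
    return FLAGS.get(flag, False)

def available_login_methods() -> list[str]:
    """Each slice was implemented and reviewed independently;
    the flag only controls exposure."""
    methods = []
    if is_enabled("login_email_password"):
        methods.append("password")
    if is_enabled("login_one_time_pin"):
        methods.append("pin")
    return methods
```

The point isn’t the flag mechanism — it’s that every slice is merged, tested, and reviewable on its own timeline, so no single PR has to carry the whole auth system.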

This is where requirements come in. As much as the product management discipline is about the “why” of software development (for good reason, customer centricity is important, don’t take away my PM card!), a soundly decomposed “what” is still an extremely valuable expression of craft. A well-factored requirement is a unit of work: specific enough to implement, small enough to test, clear enough to review. When you decompose at the requirements level, you’re controlling batch size before any code gets written.

“Build a Stripe integration” produces a massive, poorly understood diff that is hard to test, reason about, and review. “Handle the webhook for successful payment” produces a 100-line PR where you (or, let’s be real, your AI code reviewer) can actually catch the edge case.
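For a sense of scale, here is roughly what that single-requirement slice looks like — a sketch, not a production integration. The signature check follows Stripe’s documented scheme (HMAC-SHA256 over `"{timestamp}.{payload}"` from the `Stripe-Signature` header); the `fulfill_order` helper is hypothetical.

```python
# Sketch of "handle the webhook for successful payment" as one small slice.
# Signature verification follows Stripe's documented scheme; fulfill_order
# is a hypothetical placeholder for the real fulfillment step.
import hashlib
import hmac
import json

def verify_signature(payload: bytes, sig_header: str, secret: str) -> bool:
    """Check a Stripe-Signature header of the form "t=<ts>,v1=<hmac>"."""
    parts = dict(p.split("=", 1) for p in sig_header.split(","))
    signed = f"{parts['t']}.".encode() + payload
    expected = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, parts["v1"])

def handle_payment_webhook(payload: bytes, sig_header: str, secret: str) -> str:
    """One requirement, one handler: react to a successful payment."""
    if not verify_signature(payload, sig_header, secret):
        return "rejected"
    event = json.loads(payload)
    if event.get("type") == "payment_intent.succeeded":
        fulfill_order(event["data"]["object"])  # hypothetical fulfillment step
        return "fulfilled"
    return "ignored"  # other event types belong to other slices

def fulfill_order(payment_intent: dict) -> None:
    """Placeholder for the actual fulfillment logic."""
    pass
```

Fifty-odd lines with one obvious seam (`fulfill_order`), one failure mode to review (signature rejection), and one event type in scope — a reviewer can actually hold all of it in their head.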

What this looks like in practice

At Popover, we’ve started thinking of requirements as batch-size control. Each requirement is a thin vertical slice — something you can implement, test, and ship independently. When we work with AI, we try to hand it a few requirements at a time instead of an epic, and we get work that’s actually understandable.

In our actual agile practice, that means the scope of a typical unit of work keeps shrinking — these days, usually a day or less. We started out with weekly sprint goals, but we’re closer to a kanban-style daily flow now. Claude can produce a lot of code in a day, so anything bigger starts to look like a big unresolved accumulation of risk, at least through Reinertsen goggles.

This means we’re having chattier Claude Code sessions than the “build me everything” approach, and it also means we need to find other ways to stay connected to the big picture. But it’s more productive overall, because we’re not throwing away half the output, and we’re not spending hours untangling code we don’t understand. The cafeteria doesn’t need more seats; it needs more seatings.


This is the philosophy behind how we work and what we’re building at dot•requirements — requirements that decompose naturally into testable slices, readable by humans and AI alike.

© 2026 Popover AI Ltd.