CR/EATED: Lightweight Prioritization for AI-Driven Development Teams

Will Raymer, 5/22/2026

I went to a PM meetup this week, one of those events where you end up at a table with five strangers all trying to figure out the same things, and I asked the folks at my table where their post-AI bottlenecks were. The answers were fairly consistent: decision-making and delivery.

Delivery problems are reasonably well understood. They’re hard, and they’re surfacing as a constraint now that AI has widened upstream bottlenecks like coding, but the modern DevOps playbook has been around and effective for a decade. But decision fatigue — that’s popping up all over the place in some new and unusual shapes.

The daily tsunami of decision-making that comes with managing AI agents has gotten some significant attention on the engineering side, mostly in terms of cognitive load and cognitive debt. It’s not hard to draw the line from a mile-long backlog of PRs to decision fatigue, and I wrote a version of that story in Cognitive Debt.

The product side of the decision-making problem has gotten less attention, but I think it matters just as much. Now that it’s next-to-free to write code, the question of “should we build it?” isn’t constrained nearly as much by “can we build it?” As a result, we have to ask and answer “should we?” for orders-of-magnitude more ideas.

The denominator problem

The rigorous answer to “what should we build next?” is to find the ratio of value to effort. Donald Reinertsen calls this WSJF (weighted shortest job first); Roadmaps Relaunched calls it ROI Scorecards; other folks call it RICE. Whatever the name, the principle is: if you sequence your work by value ÷ effort, you maximize the value you deliver over time. Like, mathematically.
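To make the “mathematically” part concrete, here is a minimal sketch. It is not taken from any of the frameworks named above. It assumes one common formalization: every unshipped item costs you its weekly value for each week until it ships, and items are built one at a time. The job names, values, and efforts are made up for illustration.

```python
from itertools import permutations

# Hypothetical backlog: (name, value_per_week, effort_in_weeks).
JOBS = [
    ("A", 3, 1),
    ("B", 8, 4),
    ("C", 2, 4),
    ("D", 5, 5),
]


def delay_cost(order):
    """Total cost of delay: each job accrues its weekly value until it ships."""
    elapsed = 0
    total = 0
    for _name, value, effort in order:
        elapsed += effort          # this job ships once its effort is spent
        total += value * elapsed   # value lost while we waited for it
    return total


# Sequence by value ÷ effort, highest ratio first.
by_ratio = sorted(JOBS, key=lambda job: job[1] / job[2], reverse=True)

# Brute-force every possible ordering to find the cheapest one.
best = min(permutations(JOBS), key=delay_cost)

print([job[0] for job in by_ratio])
print(delay_cost(by_ratio), delay_cost(best))
```

With these made-up numbers, the ratio ordering is A, B, D, C, and its delay cost (121) matches the cheapest ordering found by brute force. Minimizing the cost of waiting is the same thing as delivering the most value over time, under the assumptions stated above.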

The trouble with this framework nowadays is that development effort doesn’t constrain many teams anymore, at least not in the ways we’ve traditionally measured it. Twenty years of agile coaches taught us to use relative units of effort like story points. Claude Code still dutifully informs me that my next feature will take a few weeks to build, even though I know that once we work out the design, it will be code-complete in less time than it takes to make another cup of coffee. “A bigger team with more velocity” went from a wish-list item to a communication burden. And even thinking relatively stops being helpful when a 100x difference in development time just marks the distinction between 3 minutes and 5 hours.

We spent years learning to estimate on a dimension that’s now useless. So how do we measure cost?

CR/EATED

The framework I’ve landed on for my own product work is CR/EATED: (Cost of Delay + Relevance) over (Entities + APIs + Testing + Experience + Doubt). The numerator quantifies value, kept intentionally simple. The denominator is where the framework has something new to say — it focuses on the things that the AI firehose now produces in excess, and offers a structured way of asking “what is going to make this hard to review, ship, and live with once it’s coded?”

The numerator, briefly

The numerator is relatively unexciting and unopinionated. If you want to express value with RICE or ICE or a homegrown rubric, go ahead — the framework doesn’t depend on the two dimensions I’ve picked.

A little guidance before you roll your own, though: I’ve found that granular value rubrics are a lot less useful than they look. When teams try to break down “how valuable is this” into overly detailed categories, prioritization debates often end up litigating the categories rather than the substantive disagreement over strategy. If your CEO doesn’t think a feature is worthwhile, a rubric is not the instrument that’s going to change her mind. The sweet spot is one level more disciplined than “how excited are we about this?” It’s granular enough that ideas don’t win on sheer exuberance, but not so granular that we spend twenty minutes arguing about the difference between “risk reduction” and “opportunity enablement” (categories from a real, widely used framework!).

The two dimensions I’ve oriented on are these, the CR of CR/EATED:

C: Cost of Delay — should we work on this now? How much does it cost us to defer this work until the next time we re-evaluate priorities? This is the Reinertsen classic: “If you only quantify one thing, quantify the cost of delay.”

R: Relevance — should we work on this? Is the solution a good fit for our product, our company, our customers, and where we’re trying to go? The world is full of real, important problems that are nevertheless not a you-problem.

The denominator

Each part of the denominator shares three properties:

First, each one is a genuine bottleneck in agentic development — something that introduces meaningful cognitive load when shipping a feature, that AI can’t trivialize.

Second, each one is estimation-friendly — a human (or agent) can read a high-level feature request and produce a finger-in-the-air score. If you can’t roughly size it before you start, it won’t help you prioritize.

Third, each one is retrospective-friendly — a human (or agent) can review a pull request and empirically assess it. The core improvement mechanism of agile estimation is the inspect-adapt cycle, where velocity is measured against estimates, and we need to be able to follow the same pattern here.

With those tests in mind:

E: Entities. How much new complexity does this solution introduce to our domain model? How much are we changing the “nouns” of the system, or adding new ones?

Practical considerations include: an unexpected database schema migration.
Cognitive considerations include: everyone who understands how the system works today is going to have to update their mental model.

A: APIs. How much new complexity does the solution introduce to the boundaries of, and within, the system? How much are we changing the data contracts that the system and its consumers depend on, or adding new ones? (This doesn’t need to be strictly external APIs; you can decide what boundaries matter to you.)

Practical considerations include: “this will be a breaking change to our SDK” or “we need to coordinate with the mobile team’s release window.”
Cognitive considerations include: everyone loosely coupled to this surface has something new to evaluate, and everyone tightly coupled to it needs actual advance warning.

T: Testing. How hard is this going to be to test — not just at what scope, but with what tools? Where in our testing pyramid (or honeycomb, or trophy, or whatever your preferred shape) will the behavior actually get validated? How much exploratory and acceptance testing do you want to do? How likely is it to introduce regressions, and how likely is it to regress in the future?

Practical considerations include: “who needs to sign off on this before it ships?”
Cognitive considerations include: how soundly you’ll sleep the night after you ship it.

E: Experience. How much newness will this introduce for users? Not strictly UI — any noticeable change in user-expected behavior counts, because even beloved changes take real effort to learn and teach.

Practical considerations include: discoverability, documentation, and the volume of support calls you’re about to generate.
Cognitive considerations include: whether you’re bolting more onto a UI than it can bear without more serious design revision.

D: Doubt. How uncertain are the other scores in this rubric? What are their risks? Doubt is a place to express what you don’t know — which matters in two different ways depending on who’s estimating. For folks who tend toward overconfidence, it forces epistemic humility, a reminder that estimates carry error bars. For folks who tend toward underconfidence, it’s a piece of psychological safety, a place to put “I’m not sure about this” without it reading as “I don’t want to do it.” Either way, it’s a way of making uncertainty visible alongside the other measurements.
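As a rough sketch of what the retrospective-friendly property could look like in practice, the snippet below compares the scores estimated up front with the scores assigned after reviewing the finished pull request, and reports the average gap per dimension. Everything here is an assumption for illustration: the history data is invented, the names `HISTORY` and `estimation_bias` are hypothetical, and scores are assumed to be small integers.

```python
from statistics import mean

DIMENSIONS = ("entities", "apis", "testing", "experience", "doubt")

# Hypothetical history: (scores estimated up front, scores assigned at PR review).
HISTORY = [
    ({"entities": 1, "apis": 2, "testing": 2, "experience": 1, "doubt": 1},
     {"entities": 1, "apis": 2, "testing": 4, "experience": 1, "doubt": 2}),
    ({"entities": 3, "apis": 3, "testing": 3, "experience": 2, "doubt": 2},
     {"entities": 3, "apis": 4, "testing": 4, "experience": 2, "doubt": 2}),
    ({"entities": 2, "apis": 1, "testing": 2, "experience": 3, "doubt": 1},
     {"entities": 2, "apis": 1, "testing": 3, "experience": 2, "doubt": 1}),
]


def estimation_bias(history):
    """Mean of (retrospective - estimate) per dimension.

    A positive number means the team tends to underestimate that dimension.
    """
    return {
        dim: mean(actual[dim] - estimate[dim] for estimate, actual in history)
        for dim in DIMENSIONS
    }


bias = estimation_bias(HISTORY)
for dim, gap in sorted(bias.items(), key=lambda item: item[1], reverse=True):
    print(f"{dim:<11} {gap:+.2f}")
```

In this invented history, Testing shows the largest positive gap, which would be a hint to score Testing a little higher on the next round of estimates.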

Applying the framework

I’ve been using a simple 1-5 scale for each dimension, where 1 is very low, 3 is significant, and 5 is very high. From there, you can use either the sum or the average of CR and EATED before dividing. The ranking works out the same either way, since the two versions differ only by a constant factor, but I like using the average because it means a CR/EATED score of 1 is roughly break-even: high value/high cost, low value/low cost, or somewhere on the line between those two points. That means I know that work that scores above 1 is disproportionately valuable for its cost, and work that scores below 1 is disproportionately expensive for its value.
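Written out, the relationship between the sum form and the average form looks like this. The symbols are shorthand introduced here, not the author’s notation: C and R are the two value scores, and E_1, A, T, E_2, D are Entities, APIs, Testing, Experience, and Doubt.

```latex
\[
\text{CR/EATED}_{\text{avg}}
  = \frac{(C + R)/2}{(E_1 + A + T + E_2 + D)/5}
  = \frac{5}{2}\cdot\frac{C + R}{E_1 + A + T + E_2 + D}
  = \frac{5}{2}\,\text{CR/EATED}_{\text{sum}}
\]
```

Because the two forms differ only by the constant 5/2, they always sort a backlog in the same order. The only thing that moves is the break-even point: 1 on the average form corresponds to 0.4 on the sum form.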

Some people like non-linear sequences like Fibonacci for this type of estimation. I use 1-5 because I’m sick of explaining to stakeholders why we can’t score something a 4, but the framework works perfectly well with any quantified, relative scale.

A few examples:

Example 1: “Add a ‘remember me’ checkbox to the login form.”

A user-requested feature, sitting in the backlog.

CoD (2) — a small steady trickle of “why do I have to keep logging in” feedback
Relevance (3) — universally expected at this point
Entities (1) — no new model; the existing session TTL handles it
APIs (1) — one optional flag on the existing login endpoint
Testing (1) — the session model already does this
Experience (1) — additive checkbox, pattern users already know
Doubt (1) — no real unknowns

CR/EATED = 2.5 ÷ 1.0 = 2.5. The framework surfaces that this is an easy win.

Example 2: “Add SAML SSO for enterprise customers.”

Sales is escalating; multiple deals are gated on it.

CoD (5) — deals are blocked now, and every week of delay leaks revenue
Relevance (4) — directly fits the product’s go-to-market motion
Entities (4) — each customer’s identity provider, their group-to-permission mappings, and auto-created user accounts
APIs (4) — a new authentication flow plus the configuration and admin interface to set it up per customer
Testing (5) — most of the integration’s behavior lives inside customers’ IdPs; correctness depends on configurations you’ll never see
Experience (3) — admin-facing config plus end-user redirect flows
Doubt (4) — every new customer introduces an IdP configuration you haven’t encountered, and the unknowns don’t end at launch

CR/EATED = 4.5 ÷ 4.0 = 1.125. The framework surfaces high value, high cost. The cost is helpfully measured on dimensions that you can articulate when the SVP of Sales brings you a Lovable prototype and asks why AI can’t build this overnight.

Example 3: “Build our own MFA mobile app instead of supporting TOTP.”

A board member asked why competitors have their own MFA apps and we don’t.

CoD (1) — TOTP works fine today; users already have authenticator apps they trust
Relevance (2) — fits an auth-focused product in theory, but no customer has asked for it
Entities (3) — device registration records, push notification tokens, app-version compatibility tracking
APIs (4) — push notification infrastructure, app-to-server protocol, key rotation
Testing (4) — device fragmentation across iOS and Android, and testing against real push delivery is famously unreliable
Experience (4) — users have to install yet another auth app when they already trust Google Authenticator
Doubt (4) — App Store policy changes, push deliverability degrading over time, ongoing platform maintenance in perpetuity

CR/EATED = 1.5 ÷ 3.8 = 0.39. Likely sits at the bottom of any backlog it’s in.

Put together, those three become a ranked backlog: knock out the easy win first (2.5), tackle the hard-but-important thing next (1.125), and worry about the vanity project if you get to it (0.39). The inflection point at 1 is a useful sanity check, but the framework’s math gives you a sequence, not a verdict on any given item.
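The arithmetic behind that ranking can be sketched in a few lines of Python. The scores are copied from the three examples; the names `BACKLOG` and `cr_eated` and the short item labels are illustrative, not part of the framework. The sketch uses the average form on the 1-5 scale.

```python
from statistics import mean

# Scores copied from the three examples, on the 1-5 scale.
# Layout: (CoD, Relevance), (Entities, APIs, Testing, Experience, Doubt).
BACKLOG = {
    "Remember-me checkbox": ((2, 3), (1, 1, 1, 1, 1)),
    "SAML SSO": ((5, 4), (4, 4, 5, 3, 4)),
    "Custom MFA app": ((1, 2), (3, 4, 4, 4, 4)),
}


def cr_eated(value_scores, cost_scores):
    """Average of the CR scores divided by the average of the EATED scores."""
    return mean(value_scores) / mean(cost_scores)


# Highest score first: most value per unit of downstream cost.
ranked = sorted(
    ((name, cr_eated(cr, eated)) for name, (cr, eated) in BACKLOG.items()),
    key=lambda item: item[1],
    reverse=True,
)

for name, score in ranked:
    print(f"{score:.3f}  {name}")
```

Keeping the inputs in one place like this also makes the “tell me where you think we got the inputs wrong” conversation easy: change a single score and re-run to see whether the order actually moves.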

Why this helps, in two ways

The obvious benefit is that an AI-relevant measure for “what does this cost?” makes prioritization itself easier. Effort estimation is increasingly meaningless, but “I think the API change is bigger than you do” is still a more productive conversation than “I just don’t think it’s worth doing right now,” and once the inputs are aligned, the outputs are highly legible and easy to defend. “If you don’t agree with the priorities, tell me where you think we got the inputs wrong” is a great way to redirect a yes-no debate toward a substantive discussion.

But the less-obvious benefit is one that might matter even more, especially as we all get used to what AI-first software development looks like at scale: the framework biases the organization toward features that hit your other bottlenecks less hard. A feature with a high CR/EATED score isn’t just the right thing to build; it’s also something that will be easier to review, easier to ship, and easier to live with. Prioritization stops being a one-shot ranking exercise and starts compounding into delivery velocity and lower long-term cognitive load, because each well-chosen feature avoids creating the kind of complexity that future features would otherwise have to wade through.

The denominator was never only about cost. It was about what cost was signaling — how much of the system this change would touch, how much risk it carried, how much downstream work it implied. AI didn’t kill that signal; we just have to trace the new constraints.

If “where’s our post-AI bottleneck?” has been the question lurking under your roadmap, that’s the conversation we’re having at dot•requirements. What makes a feature cheap to test and easy to live with is mostly the same discipline: clear requirements that travel with the code, stay legible to humans and AI, and stay honest about what the software is supposed to do.