
Framework

A.G.E.N.T. — a five-phase method for shipping production AI agents.

Each phase produces a specific artifact. No phase begins until the previous artifact exists. Published here so you can pressure-test the method before we talk.

01

Why this framework exists

The dominant AI methodology was built for boards. Agent builds need a different shape.

The dominant enterprise AI methodology was built for boards. Portfolio assessments, capability maturity models, pilot-to-scale arcs, Centers of Excellence, change management workstreams. It optimizes for slide decks presented to CEOs.

That works for transformation programs. It fails for agent builds, for four reasons.

Agent capability is determined by tool design, eval loops, and prompt iteration. Not by org structure. The work that determines whether the agent ships is engineering work, not governance work.

Pilot success is a misleading signal. The demo path works. The production path is full of edge cases the demo never touched. A methodology that celebrates pilot completion is celebrating the wrong thing.

The failure modes are silent. Models drift. Prompts get edited. Input distributions shift. The agent passes yesterday's evals and fails today's reality, and nobody knows until a customer complains. Headline risk is the wrong frame.

Centralizing agent expertise in a Center of Excellence separates it from the domain context that makes agents work. The team closest to the workflow is the team that knows what good looks like. Move the expertise away from them and you lose the thing that makes the agent ship.

A.G.E.N.T. is the method I run because it's shaped around what actually determines whether an agent ships and stays shipped.

02

The five phases

The five phases are sequential. Each one produces an artifact. The artifact is what proves the phase is done.

Phase 01

Anchor

Commit to one bounded workflow with a measurable baseline.

One workflow. One owner. One human or team currently doing it, so the agent has a baseline to be compared against. No portfolio analysis, no domain rewiring, no AI strategy. If you can't point at a single workflow and describe its inputs, outputs, and current cost, the engagement isn't ready.

Artifact

A one-page workflow spec containing the trigger event, the inputs, the outputs, the current baseline (who does this today, how long it takes, error rate if known), and the success criteria that define what 'good enough to deploy' means in numbers.
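To make the spec's shape concrete, here is a minimal sketch of the same fields as structured data. The field names, types, and example values are illustrative assumptions, not part of the method; the real artifact is a one-page written document, not code.

```python
from dataclasses import dataclass, field

# Illustrative only: field names, types, and example thresholds are assumptions.
# The actual artifact is a one-page written spec; this just shows what it pins down.
@dataclass
class WorkflowSpec:
    trigger_event: str                  # what kicks off a run
    inputs: list[str]                   # what the agent receives each run
    outputs: list[str]                  # what the agent must produce each run
    baseline_owner: str                 # who does this today
    baseline_minutes_per_item: float    # how long it takes today
    baseline_error_rate: float | None   # None if unknown
    success_criteria: dict[str, float] = field(default_factory=dict)
    # e.g. {"accuracy": 0.95, "p95_latency_s": 30.0, "cost_per_run_usd": 0.10}
```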

Failure mode it prevents

Scope creep during anchoring. The conversation drifts from 'this workflow' to 'and it should also handle…'. Anchoring fails the moment the workflow becomes plural. The artifact exists to make scope concrete and signed off in writing, so week six doesn't relitigate week one.

Self-check

Can I describe the workflow in two sentences without using the words 'platform,' 'intelligence,' or 'automation'?

Phase 02

Ground

Build the eval set before building the agent.

Thirty to fifty real examples of the workflow with correct outputs, drawn from your historical data. What "working" means gets defined in measurable terms: accuracy, latency, cost per run, escalation rate, whatever the workflow demands. This is the most-skipped step in real agent projects. Skipping it means you can't tell if the agent is improving, regressing, or just lucky on the demo path.

Artifact

A runnable eval harness. Not a spreadsheet, a program. A script your team can execute that runs the agent against the eval set and produces a scorecard. The harness ships in week three and stays with you.
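A minimal sketch of what that harness can look like, assuming a JSONL eval set of input/expected pairs and exact-match scoring. The file name, record shape, and scoring rule are placeholders; a real harness scores whatever the workflow spec defines as correct.

```python
import json
import statistics

def run_agent(example_input):
    # Wire this to the actual agent under test.
    raise NotImplementedError

def score(expected, actual) -> float:
    # Placeholder rule; real scoring follows the success criteria from Anchor.
    return 1.0 if actual == expected else 0.0

def main(path="evalset.jsonl"):
    scores = []
    with open(path) as f:
        for line in f:
            example = json.loads(line)
            scores.append(score(example["expected"], run_agent(example["input"])))
    # The scorecard: how many examples, and how many the agent got right.
    print(f"examples: {len(scores)}  accuracy: {statistics.mean(scores):.2%}")

if __name__ == "__main__":
    main()
```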

Failure mode it prevents

Treating evals as QA. Evals are the specification. The eval set is the contract the agent is being built to satisfy. If the eval set is wrong, the agent will be wrong. Building the harness first forces the spec to be honest before any code gets written against it.

Self-check

If the agent passes the eval set perfectly, does the workflow actually work in production? If no, the eval set is incomplete.

Phase 03

Engineer

Build the minimum agent that passes the eval.

Tool design, prompt iteration, state management, error recovery. The architecture is the smallest one that can plausibly pass. Nodes get added when an eval failure demands it, not before. The most common failure in agent engineering is over-architecture. Splitting fusable operations across nodes. Adding state fields nothing reads. Routing tool calls without loop-back edges. Premature abstraction. Each one adds surface area without improving eval scores.
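As a rough illustration of how small that architecture can be, here is a single-loop sketch: one model call, a tool dispatch table, and a loop-back edge that feeds tool results into the next model call. The call_model stub and the TOOLS table are hypothetical stand-ins, not a specific framework or client.

```python
# Hypothetical tool table; real tools come from the workflow spec.
TOOLS = {
    "lookup_order": lambda order_id: {"order_id": order_id, "status": "shipped"},
}

def call_model(messages):
    # Stand-in for the real LLM call. Assumed contract: returns either
    # {"tool": name, "args": {...}} or {"final": "answer text"}.
    raise NotImplementedError

def run_agent(task: str, max_steps: int = 8):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        decision = call_model(messages)
        if "final" in decision:
            return decision["final"]
        result = TOOLS[decision["tool"]](**decision["args"])
        # Loop-back edge: the tool result goes straight into the next model call.
        messages.append({"role": "tool", "content": str(result)})
    return "escalate: step budget exhausted"
```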

Artifact

A working agent that scores acceptably on the eval set, against the success criteria from Anchor, with traces visible for every run.

Failure mode it prevents

Optimizing for the demo, not the eval. Agents that look impressive in a meeting and degrade under real traffic are the default outcome of demo-driven engineering. The eval is what disciplines the build.

Self-check

If I removed any single node, tool, or state field, would the eval score drop? If not, remove it.

Phase 04

iNstrument

Production observability before production traffic.

Every agent run produces traces. Failures get captured with enough context to reproduce. Cost per run is tracked. Drift is detectable, meaning the eval suite runs on a schedule against production samples and someone gets alerted when scores move. Without instrumentation, the agent silently rots. Models change, prompts get edited, tool APIs shift, input distributions drift. The agent can be passing yesterday's evals and failing today's reality, and nobody knows. Mid-market clients almost never have an observability story for AI systems. This phase is where that story gets built.
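One way drift detection can look in practice, sketched under assumptions: a scheduled job re-scores a sample of recent production runs against the eval criteria and alerts when the score regresses. The trace store, the grader, the threshold, and the alert channel are all placeholders to swap for whatever you actually run.

```python
import statistics

REGRESSION_THRESHOLD = 0.90  # assumed target taken from the Anchor success criteria

def load_recent_traces(n=50):
    # Placeholder: pull n recent production runs from the trace store.
    raise NotImplementedError

def rejudge(trace) -> float:
    # Placeholder: score one production run against the eval criteria.
    raise NotImplementedError

def send_alert(message: str):
    print(f"ALERT: {message}")  # swap for Slack, PagerDuty, email, etc.

def main():
    traces = load_recent_traces()
    sample_score = statistics.mean(rejudge(t) for t in traces)
    if sample_score < REGRESSION_THRESHOLD:
        send_alert(f"eval score on production sample dropped to {sample_score:.2%}")

if __name__ == "__main__":
    main()
```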

Artifact

A dashboard your operator checks daily, an alerting rule that fires on regression, and a runbook for what to do when it fires.

Failure mode it prevents

Treating observability as DevOps overhead instead of as the agent's nervous system. If the operator can't tell you yesterday's failure rate without running a query, the instrumentation isn't done.

Self-check

If the agent's accuracy dropped ten percent tomorrow, how would my team find out? If the answer is 'a customer would tell us,' instrumentation isn't done.

Phase 05

Transfer

Hand off operational ownership to your team.

Not training in the LLM sense. Not training in the corporate-LMS sense. Transfer of ownership: who edits the prompts, who extends the eval set when new failure modes appear, who decides when to escalate to a model upgrade, who owns the runbook. The agent is not done when it ships. It's done when your team can run it without me. If your team can't extend the eval set, the agent will degrade the first time the workflow shifts.

Artifact

Your team operating the agent independently, with documented protocols for adding examples to the eval set, editing prompts and re-running evals, responding to alerts, and deciding when to call me back.
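As an illustration, the add-an-example protocol can be as small as appending a record and re-running the harness. This sketch assumes the JSONL eval set and a harness file name carried over from the Ground sketch above; both names are placeholders.

```python
import json
import subprocess

# A new failure case observed in production, written up with the correct output.
new_case = {
    "input": "the production input the agent just got wrong",
    "expected": "the output a domain expert says is correct",
}

# Append it so the case is part of the contract from now on.
with open("evalset.jsonl", "a") as f:
    f.write(json.dumps(new_case) + "\n")

# Re-run the harness (assumed file name) and fail loudly if it errors.
subprocess.run(["python", "eval_harness.py"], check=True)
```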

Failure mode it prevents

Heroic dependency. The agent works because I'm available on Slack. Transfer has failed if every operational decision still routes through me. The protocols exist so they don't have to.

Self-check

If I took a four-week vacation tomorrow, would the agent keep working and your team keep improving it? If not, transfer isn't done.

03

How the phases sequence in a 90-day engagement

The five phases map to three acts.

In practice the phases overlap. Engineer and iNstrument blur. Instrumentation often gets stubbed during engineering, then hardened in the final act. The phases are artifacts to deliver, not weeks on a calendar.

Act 1

Weeks 1 to 3

Anchor and Ground

Workflow spec signed off. Eval harness running. Your team can score the agent before there is an agent.

Act 2

Weeks 4 to 9

Engineer

Working agent passes the eval. Iterations are visible in traces. Architecture stays small.

Act 3

Weeks 10 to 13

iNstrument and Transfer

Dashboard live. Alerts tuned. Your team operates the agent.

04

What the framework is not

A.G.E.N.T. is bounded on purpose. Three things it deliberately doesn't try to do.

Not an AI strategy framework.

It doesn't help you decide which workflow to automate, which model family to bet on, or how to organize your AI function. It helps you ship the workflow you've already picked. If you're earlier than that, the engagement isn't ready.

Not a methodology you license or certify into.

No partners, no certified practitioners, no playbook for sale. I run A.G.E.N.T. on the engagements I take. The framework is published so you can evaluate the method, not so you can implement it without me.

Not a substitute for engineering judgment.

The framework structures the work. The work still requires someone who can read traces, edit prompts, and recognize when an architecture is over-engineered. A.G.E.N.T. makes good engineering legible. It doesn't replace it.

Book a discovery call

The right shape. Or the wrong fit.

If you're evaluating A.G.E.N.T. for a workflow you've already identified, the discovery call is where we figure out whether the method fits — or doesn't.

Forty-five minutes, and you'll know whether to move forward.

Free · No pitch deck · Go or no-go on the call
MavenSolutions

One workflow. One agent. 90 days. Then your team owns it.

© 2026 MavenEcommerce Inc. dba MavenSolutions

Andrew Korolov · principal AI engineer