
Playbook · Architecture

How to build an AI agent.

Why most agent projects fail at step one, and what the eight-step process actually buys you when you take it seriously.

The reason most agent projects fail is not that the framework is wrong, the model is wrong, or the prompt is wrong. The reason is that the people building them did not write down, in one sentence, what success would look like and how they would know. Everything downstream is decoration on top of that missing line.

The eight steps below are not new. They have been written down in roughly this shape by everyone who has shipped an agent for the past two years. What is worth saying again, and what most versions of the list omit, is what each step actually buys you, and what you lose when you skip it.

§01 · Define the job

Step one is not "set up your repo." Step one is writing one sentence that says what the agent does, for whom, and how you will know it is working. That sentence has to contain a number, because without one the project is unfalsifiable, and an unfalsifiable agent will collect optimism without ever earning it.

The reason this step is hard, and the reason most teams skip it, is that the sentence is a forcing function. If you cannot write it, you do not have a project. You have a hope. The cost of admitting that early is small. The cost of admitting it after three months of building is large.
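One way to make the sentence honest is to write it down as data with the number attached. A minimal sketch, with an invented support-triage example; the field names and the 90% target are illustrative, not prescribed:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class JobDefinition:
    """One sentence, made checkable: what, for whom, and the number."""
    does: str       # what the agent does
    for_whom: str   # who it serves
    metric: str     # how success is measured
    target: float   # the number that makes it falsifiable

    def is_met(self, measured: float) -> bool:
        return measured >= self.target

# Hypothetical example job definition
job = JobDefinition(
    does="triages inbound support tickets",
    for_whom="the tier-1 support team",
    metric="fraction of tickets routed correctly",
    target=0.90,
)
```

If you cannot fill in `target`, you have found the missing line before writing any code.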

§02 · Design the brain

The system prompt is what differentiates two agents that have the same model, the same tools, and the same orchestration. It is also the cheapest part of the stack to improve. An hour spent rewriting it usually outperforms a week spent migrating frameworks.

What goes in: the role in the first sentence, the things the agent must never do, one or two worked examples of the kind of reasoning you want, and a description of who the user is and what they already know. The negative space matters as much as the positive.

When that page of text grows past about thirty lines, it has stopped being a prompt and started being a skill. At that point you should give it a name, version it, and treat it as an artifact. Skills are the unit of compounding for prompt work, and prompt work is where most of the actual value lives.
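What "treat it as an artifact" can look like in practice is a sketch like the following: a named, versioned wrapper around the prompt text, so logs can say which prompt produced which run. The skill name, version scheme, and prompt body are all illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Skill:
    """A prompt that outgrew its inline form: named, versioned, diffable."""
    name: str
    version: str
    body: str  # the system prompt text itself

    def render(self) -> str:
        # Stamp the version in, so a logged run can be traced to its prompt.
        return f"# skill: {self.name} v{self.version}\n{self.body}"

# Hypothetical skill built from the ingredients above: role first,
# a hard guardrail, one worked example.
triage = Skill(
    name="ticket-triage",
    version="1.2.0",
    body=(
        "You are a support-ticket triage assistant for tier-1 agents.\n"
        "Never guess an account ID; ask for it.\n"
        "Example: 'I was charged twice' -> route to the billing queue."
    ),
)
```

The point of the wrapper is the diff: when the prompt changes, the version changes, and the change shows up in review like any other artifact.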

§03 · Pick the model

The decision is not "which model is best." It is "which model is the smallest one that passes my evals," and the answer depends on the step. Routing and classification rarely need the largest model. Synthesis sometimes does. Reasoning over long context usually does.

A useful default layout is Haiku for routing, Sonnet for the bulk of the work, and Opus only on the steps that produce the kind of output you cannot get from Sonnet at any prompt budget. The token-cost difference between tiers is real, and at any meaningful volume it shows up in the monthly bill.

The other discipline worth keeping is to pin the model version when you ship. Models improve quietly and regress quietly, and the asymmetry of those two outcomes is the whole reason to gate upgrades behind your eval suite.
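Both disciplines, tier-per-step routing and pinning, fit in one small table. A sketch, with made-up pinned model IDs standing in for real dated snapshots:

```python
# Pinned model IDs per step. Upgrades happen by editing this table after
# the eval suite passes, never by silently tracking a "latest" alias.
# The IDs below are illustrative placeholders, not real model names.
PINNED_MODELS = {
    "routing": "haiku-2025-01-15",
    "synthesis": "sonnet-2025-02-01",
    "long_context_reasoning": "opus-2025-02-01",
}

def model_for(step: str) -> str:
    """Smallest model known to pass evals for this step; fail loudly otherwise."""
    try:
        return PINNED_MODELS[step]
    except KeyError:
        raise ValueError(f"no pinned model for step {step!r}; run evals first")
```

Failing loudly on an unknown step is deliberate: a new step should not inherit a model by accident, it should earn one through the eval suite.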

§04 · Add tools

A tool is not free. Every additional tool widens the decision space the model has to reason over, which costs tokens, latency, and sometimes correctness. The right number of tools is the smallest set that covers your happy path, and the right way to find that set is to start with three and add a fourth only when an actual failure case demands it.

Most production agents end up with somewhere between three and seven tools. The temptation to ship with thirty is real, and the cost of giving in is paid in slow, confused, expensive runs that look fine in the demo and fall apart on the third real ticket.
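A minimal registry makes the discipline visible: three tools cover the happy path, and adding a fourth means touching this file in review. The tool names and toy bodies below are illustrative:

```python
# Minimal viable tool set: start with three, add a fourth only when an
# actual production failure demands it. All names here are hypothetical.
TOOLS = {}

def tool(name: str, description: str):
    """Register a function as a tool the model may call."""
    def register(fn):
        TOOLS[name] = {"description": description, "fn": fn}
        return fn
    return register

@tool("search_tickets", "Find existing tickets matching a query.")
def search_tickets(query):
    # stand-in for a real ticket-system search
    return [t for t in ["login bug", "billing dispute"] if query in t]

@tool("get_ticket", "Fetch one ticket by ID.")
def get_ticket(ticket_id):
    return {"id": ticket_id, "status": "open"}

@tool("reply", "Draft a reply on a ticket.")
def reply(ticket_id, text):
    return f"drafted reply on {ticket_id}"

assert len(TOOLS) == 3  # the happy path needs no more
```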

Prefer MCP servers when you can. They are standardized, the ecosystem of well-maintained ones is growing, and they save you the maintenance cost of integrations you did not need to write yourself.

§05 · Give it memory

Memory in agents is the area where the gap between hype and practice is largest. The version most teams ship with is conversation context, which the model already gives you. The version they reach for first is a vector database, which they almost never need.

The version that actually pays back is in between. Markdown notes. Structured digest files. A row in a SQL table for every interaction. These are unglamorous, traceable, and they cover ninety percent of what people mean when they say "the agent should remember." A vector database is what you graduate to when the simpler version provably fails, not before.

The harder design question, the one that decides whether the memory system actually helps, is what you choose to forget. An agent that remembers everything ends up confused about what matters. The forgetting policy is more load-bearing than the storage choice.
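The row-in-a-SQL-table version, forgetting policy included, fits in a page. A sketch using an in-memory SQLite table; the schema and the 30-day horizon are illustrative choices, not requirements:

```python
import sqlite3
import time

# One row per interaction, plus an explicit forgetting policy.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE memory (ts REAL, user_id TEXT, summary TEXT)")

def remember(user_id, summary, ts=None):
    db.execute("INSERT INTO memory VALUES (?, ?, ?)",
               (ts if ts is not None else time.time(), user_id, summary))

def forget(older_than_days=30):
    """The load-bearing part: drop rows past the horizon. Returns rows dropped."""
    cutoff = time.time() - older_than_days * 86400
    cur = db.execute("DELETE FROM memory WHERE ts < ?", (cutoff,))
    return cur.rowcount

def recall(user_id, limit=5):
    rows = db.execute(
        "SELECT summary FROM memory WHERE user_id = ? ORDER BY ts DESC LIMIT ?",
        (user_id, limit))
    return [r[0] for r in rows]
```

Everything here is traceable: you can open the table and read what the agent "knows," which is exactly what a vector store makes harder.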

§06 · Orchestrate

Orchestration is the discipline of designing for failure before it happens. Your agent will fail. Your tool will time out. Your provider will rate-limit you in the middle of a 30-step task. The question is whether the system was built with that assumption baked in, or whether it was built optimistically and is now being patched after each new outage.

The pieces are not exotic. Retries with exponential backoff on every external call. Idempotent tool calls so a retry does not double-charge somebody. Checkpoints you can resume from so a long task does not have to start over after step twenty-seven. Human in the loop on the steps where the cost of being wrong is high enough to justify the latency.

A flat agent loop is fine for a prototype. A real graph earns its place the first time you have branching, parallel work, or a step that needs review.

§07 · Build the interface

The interface is the agent. A good agent in a bad interface is a worse product than a bad agent in a good one, because nobody uses the first one. The lesson most teams learn the hard way is that shipping four mediocre interfaces, one in each channel they could think of, is worse than shipping one excellent interface in the channel their users already live in.

◆ pull quote

Pick the channel your users are already in. The other three channels are vanity.

§08 · Test and improve

Evals are the unglamorous part of agent work and the most important part. The teams that compound advantage over time are not the ones with the best model or the most tools. They are the ones with eval suites that catch regressions before users do, and they treat regressions in those suites the way they treat regressions in their normal test suites. The ship gets blocked. The bug gets fixed. The set grows.

The starting set is small. Ten inputs and their expected outputs is enough to catch most prompt regressions. The set grows when you find a bug in production and add it as input eleven. By month six you have fifty inputs, and the discipline of running them on every change is what turns a fragile prototype into something that can take a model upgrade without falling over.
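The whole mechanism is small enough to sketch: a golden set of input/expected pairs and a gate that blocks the ship on any regression. The cases and the agent stub are invented for illustration:

```python
# A golden set that starts at ten cases and grows by one each time
# production finds a bug. The two cases below are illustrative.
GOLDEN_SET = [
    {"input": "I can't log in", "expected": "auth"},
    {"input": "Why was I charged twice?", "expected": "billing"},
    # ...more cases, added one bug at a time
]

def agent(text):
    # stand-in for the real agent under test
    return "billing" if "charged" in text.lower() else "auth"

def run_evals(cases):
    """Returns (passed, total) over the golden set."""
    passed = sum(agent(c["input"]) == c["expected"] for c in cases)
    return passed, len(cases)

def gate_ship(cases):
    passed, total = run_evals(cases)
    if passed < total:
        raise SystemExit(f"eval regression: {passed}/{total} passed; ship blocked")
```

Wiring `gate_ship` into CI is what turns the golden set from a spreadsheet into a regression gate.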

The eight steps are not the moat. The work of doing each one honestly is.

◇ summary · field notes
$ vibgineer summarize how-to-build-an-ai-agent
  1. 01
    Define the job
    • explicit user
    • explicit metric
    • one written sentence
    • falsifiable
  2. 02
    Design the brain
    • role and worldview
    • guardrails as text
    • worked examples
    • convert to skill
  3. 03
    Pick the model
    • eval-first
    • tier-aware routing
    • pin on ship
    • upgrade behind tests
  4. 04
    Add tools
    • minimal viable set
    • MCP over custom
    • audit cost of each
    • measure decision width
  5. 05
    Give it memory
    • episodic vs persistent
    • structured before semantic
    • forgetting policy
    • cost of recall
  6. 06
    Orchestrate
    • failure-first design
    • idempotency
    • checkpoints
    • human in the loop
  7. 07
    Build the interface
    • one channel
    • users where they are
    • quality over coverage
    • interface is product
  8. 08
    Test and improve
    • golden set
    • regression gates
    • eval-driven iteration
    • compound advantage
✓ 8 steps · the work is the moat, not the steps.