How-to · April 27, 2026 · 12 min read · Updated May 1, 2026

How to build an AI agent (operator's guide, 2026)

A 9-step playbook for shipping a production AI agent — scope, tools, model choice, the loop, evaluation, human-in-the-loop, deploy. Code-light, opinionated, no toy demos.

Interactive demo: agents.dgcore multi-agent workflow (planner → two executors → verifier, with CRM, Calendar, Email, and Slack tools). Agents in run: 4. Cost per execution: $0.18.
The takeaway
Skim this if you only have 30 seconds.
  1. You need three things: a model with tool calling, a small set of tools (5–15 max), and a loop that runs until the goal is met or a hard stop is hit.
  2. Pick the smallest scope that delivers value. "AI receptionist that books meetings" beats "general assistant" every time.
  3. Most production agents we ship are plain code calling Claude or GPT in a loop — frameworks (LangGraph, CrewAI) help when single-agent ReAct is provably insufficient.
  4. Evaluation is the moat. Run on 50–200 real cases before declaring done. Aim for 85%+ task success on a narrow scope.
  5. Human-in-the-loop on irreversible actions (sending external emails, billing, deploys) is non-negotiable. Skip this and you ship a liability.

We've shipped roughly 30 AI agents into production over the last 18 months — for agencies, coaches, and ecommerce ops teams. About half of them work. The half that doesn't work fails in predictable ways. This post is the playbook for the half that works, plus the failure modes we now design around. Agents are one of five production patterns we cover in what is AI automation? The 5 patterns that run in production — start there if you're still deciding whether the agent shape is the right fit at all.

This is not a toy "Hello World agent" tutorial. It covers the shape of the work, the decision points, and the actual failure modes: code-light, opinionated, no demos.

Before you start: you should already know what an AI agent is. If you don't, read what is an AI agent first — this guide assumes you understand the loop + tools concept.

Editorial illustration: a central rounded-square brain shape connected by orange-coral arrows to a small grid of five tool icons (wrench, database, envelope, calendar, key) and a small log/feedback panel at the bottom.
Minimum viable agent: model + tool registry + loop + evaluator.

Step 1 — Pick the smallest scope that delivers value

The single most common reason AI agents fail in production is over-scoped goals. "An AI assistant that handles all my email" is a feature roadmap, not a project. "An AI agent that drafts replies to inbound sales inquiries using context from HubSpot" is a project.

A good scope passes three tests:

  1. You can describe success in one sentence ("85% of inbound sales emails get a draft reply within 30 seconds with the right context attached").
  2. There are 5–15 tool calls maximum involved in the typical run.
  3. A wrong action is recoverable — either reversible cheaply, or human-reviewed before it goes external.

When in doubt, scope down. You can always expand later. You can't un-ship a half-broken agent that emailed the wrong customer the wrong contract.

Step 2 — Define your tools before writing prompts

A tool is a function the model can call. Each one has a name, a description, a parameter schema, and a return shape.

The three rules:

  1. Each tool does one thing. Not "manage_crm" — that's a router, not a tool. Use lookup_contact, update_contact, create_deal as separate tools.
  2. Tool descriptions are written for the model, not for humans. The model reads these to decide when to call them. Be specific: "Use this when you need the lifetime spend of a customer by their email address." Not "Customer lookup".
  3. Parameter schemas are strict. Required fields, enums, regex constraints — anything you can express in JSON schema, do.

A typical production agent has 5–15 tools. Below 5 means you're missing capability or the agent is too narrow to be useful. Above 15 means the model is going to get confused about which to call.

Step 3 — Pick the model

As of mid-2026, frontier models are production-grade for agent work. Pick by tool-calling reliability first, cost second, niche capability last.

Model picks for agent work (mid-2026)

| Model | Cost tier | Strengths | Use for |
| --- | --- | --- | --- |
| Claude Sonnet 4.6 | Mid | Best-in-class tool calling, long-context reasoning over tool outputs | Default for production agents |
| Claude Opus 4.7 | High | Deepest reasoning, complex multi-step planning | Hardest agents · architectural decisions |
| GPT-4-class (4.1 / 5) | Mid–High | Comparable tool calling, native Assistants API, structured output benchmarks | OpenAI-native stacks · structured outputs |
| Gemini 2.5 Pro | Mid | Generous context window, multimodal, decent tool use | Multimodal agents · long-context summarization |
| Claude Haiku 4.5 | Low | Fast, cheap, surprisingly strong tool calling | High-volume narrow agents (triage, classification) |
| GPT-4o-mini | Low | Cheap, OpenAI-native | High-volume narrow agents |
| Open-weights (Llama, Qwen) | Self-host | Full control, no per-call cost | Skip for production agents — tool-calling gap is real |

Model choice is rarely your bottleneck. Spend the budget on better tool definitions and evaluation instead.

Step 4 — Write the loop

The agent loop is six steps. Build the happy path first, then handle each edge case as it bites you in evals.

Pseudocode for the happy path:

  1. Initialize: messages = [{ role: "system", content: SYSTEM_PROMPT }, { role: "user", content: GOAL }]
  2. Call model with messages and tool definitions.
  3. If the response says "stop" (no tool calls) — return the final answer.
  4. If the response says "call tool X with args Y" — execute the tool, capture the result.
  5. Append the tool call and the tool result to messages.
  6. Goto step 2. Hard-stop after N iterations (default 12) or M wallclock seconds (default 60).

That's it. Fewer than 100 lines of code in TypeScript or Python. The complexity comes from edge cases: what to do when a tool errors, when the model hallucinates a tool that doesn't exist, when it gets stuck calling the same tool repeatedly.
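As a runnable sketch of the six steps above, here is the happy path with the model call abstracted behind a `call_model` function; in production you would replace it with your provider's SDK (for example the Anthropic Messages API). The message shapes, limits, and the hallucinated-tool handling are illustrative assumptions, not a specific vendor's protocol:

```python
import time

MAX_ITERATIONS = 12       # hard-stop on loop count
MAX_WALLCLOCK_SECS = 60   # hard-stop on elapsed time

def run_agent(system_prompt, goal, tools, call_model):
    """Minimal agent loop. `call_model(messages, tools)` must return either
    {"stop": True, "answer": str} or {"stop": False, "tool": name, "args": dict}.
    `tools` maps tool names to plain Python callables."""
    messages = [{"role": "system", "content": system_prompt},
                {"role": "user", "content": goal}]
    started = time.monotonic()
    for _ in range(MAX_ITERATIONS):
        if time.monotonic() - started > MAX_WALLCLOCK_SECS:
            raise TimeoutError("agent exceeded wallclock budget")
        response = call_model(messages, tools)
        if response["stop"]:                      # no tool call: we're done
            return response["answer"]
        tool_name, args = response["tool"], response["args"]
        if tool_name not in tools:                # hallucinated tool: fail loudly
            raise ValueError(f"model called unknown tool {tool_name!r}")
        result = tools[tool_name](**args)
        # Append both the call and its result so the model sees them next turn.
        messages.append({"role": "assistant",
                         "content": f"call {tool_name}({args})"})
        messages.append({"role": "tool", "content": str(result)})
    raise RuntimeError("agent hit the iteration hard-stop")
```

The hard-stops and the unknown-tool check are the two edge cases worth wiring in before your first eval run; the rest can wait until they actually bite.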

Frameworks that help

  • Claude Agent SDK / OpenAI Assistants API — managed loop, fastest to ship.
  • LangGraph — explicit state graphs; best when you need branching paths, retries, or multi-agent.
  • CrewAI — multi-agent teams with role coordination; usually overkill for single-purpose agents.
  • Vercel AI SDK — clean TypeScript wrapper, good for Next.js apps.

Honest take: 70% of our production agents are plain code calling the model API directly. Frameworks add value when you need state machines or coordination — they add ceremony when you don't.

Step 5 — Write the system prompt last

Counter-intuitive, but it consistently works in production: write the tools first, then write the system prompt. Tools constrain what the agent can do; the prompt directs how it should think about doing it. Writing the prompt first leads to vague tool definitions that the prompt has to compensate for — and prompts that compensate for vague tools are among the most common reasons agents misbehave in production. A good system prompt has four sections:

  1. Role and goal — "You are an AI sales assistant for [company]. Your job is to draft replies to inbound sales inquiries."
  2. Rules — "Never quote pricing without checking the latest_pricing tool. Never schedule meetings outside business hours. Always include a calendar link."
  3. Tool guidance — "Use lookup_contact first to check if the sender is in the CRM. Use draft_reply to compose responses, never send directly."
  4. Stop conditions — "Once you've drafted a reply and saved it, stop. Do not attempt to send it."

Keep it under 600 tokens. Long prompts hurt performance more than they help — the model isn't reading every word as carefully as you want it to.
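One way to keep the four sections and the token budget honest is to assemble the prompt from named parts and check the budget at build time. A sketch with illustrative content; the word-based token estimate (~0.75 words per token for English prose) is a rough heuristic, not a real tokenizer:

```python
# Four named sections; all content here is illustrative.
SECTIONS = {
    "role_and_goal": "You are an AI sales assistant for Acme. "
                     "Draft replies to inbound sales inquiries.",
    "rules": "Never quote pricing without checking the latest_pricing tool. "
             "Never schedule meetings outside business hours. "
             "Always include a calendar link.",
    "tool_guidance": "Use lookup_contact first to check if the sender is in "
                     "the CRM. Use draft_reply to compose responses; "
                     "never send directly.",
    "stop_conditions": "Once you've drafted a reply and saved it, stop. "
                       "Do not attempt to send it.",
}

SYSTEM_PROMPT = "\n\n".join(SECTIONS.values())

def approx_tokens(text: str) -> int:
    # Rough heuristic: ~0.75 words per token for English prose.
    return round(len(text.split()) / 0.75)

# Enforce the budget at build time, not after a production incident.
assert approx_tokens(SYSTEM_PROMPT) < 600, "system prompt over budget"
```

For a real budget check, swap `approx_tokens` for your provider's token-counting endpoint or tokenizer.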

Step 6 — Build evaluation before scaling

Demos lie. The first agent run that "looks good" will fool you. The second one that fails will fool you in the other direction. Evaluation cuts through both.

  1. Collect 50–200 real-world cases (anonymized customer emails, support tickets, whatever the agent will see in production).
  2. For each case, define what the agent should do. Doesn't need to be perfect — just a directional ground truth.
  3. Run the agent on every case. Log inputs, tool calls, outputs.
  4. Score each run: success / partial / fail. Be strict. Failure includes "got the right answer but used 8 tool calls when 3 would have done it".
  5. Look at failures. Cluster by failure mode. The fix for "wrong tool argument" is different from "agent looped infinitely".

Production-ready bar for a narrow-scope agent: 85%+ task success rate. Below that, scope down or fix tools/prompts.
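The scoring harness for the steps above fits in a few lines. A sketch, assuming your agent is a callable and you supply a `score_fn` that returns one of the three verdicts; note the strict definition of success, matching the "be strict" rule:

```python
from collections import Counter

def evaluate(agent, cases, score_fn):
    """Run the agent on every case and score each run.
    `score_fn(case, output)` must return 'success', 'partial', or 'fail'."""
    scores = []
    for case in cases:
        output = agent(case["input"])
        scores.append(score_fn(case, output))
    return scores

def success_rate(scores):
    """Strict task-success rate: only 'success' counts; 'partial' is a failure."""
    counts = Counter(scores)
    total = sum(counts.values())
    return counts["success"] / total if total else 0.0

# Illustrative gate against the 85% production bar:
# scores = evaluate(my_agent, eval_cases, my_score_fn)
# assert success_rate(scores) >= 0.85
```

Keep the raw per-case scores, not just the aggregate: clustering failures by mode (step 5 above) is where the actual fixes come from.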

Step 7 — Add human-in-the-loop on irreversible actions

Reversible actions: drafting an email, looking up data, classifying a lead, scoring a record. Let the agent run.

Irreversible actions: sending external email, posting to social, charging a card, deploying code, deleting data. Add a human approval step.

The cheapest implementation: the agent generates the action, posts a Slack message with "Approve / Reject" buttons, and waits for human input before executing. Adds 30 seconds of friction; eliminates 95% of catastrophic failure modes.
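The gate itself is a small piece of code around tool execution. A sketch, assuming a set of action names you've classified as irreversible and a blocking `request_approval` callback (in practice, the Slack message with Approve/Reject buttons); the names are illustrative:

```python
# Actions a human must approve before execution; illustrative names.
IRREVERSIBLE = {"send_email", "charge_card", "deploy", "delete_data",
                "post_social"}

def execute_with_gate(action, args, tools, request_approval):
    """Run reversible actions directly; route irreversible ones through a human.
    `request_approval(action, args)` posts the proposed action somewhere humans
    look (e.g. Slack) and blocks until it returns True (approve) or False."""
    if action in IRREVERSIBLE:
        if not request_approval(action, args):
            return {"status": "rejected", "action": action}
    return {"status": "done", "result": tools[action](**args)}
```

Wire this in at the single point where the loop executes tools, so no future tool addition can bypass the gate.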

Step 8 — Observability before deploy

Production agents fail in subtle ways. You will not catch them by reading logs after the fact. Set up:

  • Per-run trace: every tool call, with inputs, outputs, latency, model usage tokens.
  • Per-task success metric: did the agent achieve its goal? Logged automatically where possible (e.g. did the email get drafted?).
  • Cost per run: aggregate token usage × model price. Spike alerts when this jumps.
  • Loop length distribution: track how many iterations the agent takes. Sudden increase in average iterations = agent is getting stuck more often.

Tools we use: Langfuse for tracing, plain Postgres for run logs, simple Grafana dashboard. Sometimes just a Notion database when volumes are low.
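A per-run trace that covers the first, third, and fourth bullets can be a single record written to your run log. A sketch; the prices are illustrative placeholders, so check your provider's current price sheet before trusting the cost numbers:

```python
from dataclasses import dataclass, field

# Illustrative prices in USD per 1M tokens; use your provider's real rates.
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}

@dataclass
class RunTrace:
    """One record per agent run: every tool call, token usage, derived cost."""
    tool_calls: list = field(default_factory=list)
    input_tokens: int = 0
    output_tokens: int = 0

    def record_tool_call(self, name, args, result, latency_ms):
        self.tool_calls.append({"tool": name, "args": args,
                                "result": result, "latency_ms": latency_ms})

    @property
    def cost_usd(self):
        return (self.input_tokens * PRICE_PER_MTOK["input"]
                + self.output_tokens * PRICE_PER_MTOK["output"]) / 1_000_000

    @property
    def loop_length(self):
        return len(self.tool_calls)
```

Persist one `RunTrace` per run (Postgres row, Langfuse trace, whatever you have) and the spike alerts and loop-length distribution fall out of simple queries over it.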

Step 9 — Deploy

Production agents ship in one of three trigger patterns. The agent code is the same in all three cases; only the trigger differs.

Webhook-triggered

External event (email arrives, form submitted, ticket created) → webhook → agent runs → result posted back. Simplest pattern. Used for inbox triage, lead routing, support tier-1.

Scheduled

Cron triggers the agent to run on a recurring schedule. Used for daily reports, weekly summaries, batch enrichment.

User-invoked

A human triggers the agent via a button in an internal tool, Slack command, or API. Used for one-off research, on-demand analysis, drafting tools.

All three run on standard infrastructure: Vercel/Cloudflare/AWS Lambda functions for webhook and user-invoked, GitHub Actions or Railway scheduled tasks for cron. Nothing exotic.

Common failure modes

  • Tool-call hallucination — agent calls a tool that doesn't exist. Fix: tighter system prompt + log + fail loudly.
  • Infinite loop — agent keeps calling the same tool to recover from an error. Fix: hard step limit + retry-after-N-attempts logic.
  • Wrong tool, right shape — agent calls update_contact when it should have called create_contact. Fix: clearer tool descriptions.
  • Goal drift — agent solves a different problem than asked. Fix: stricter system prompt with explicit stop conditions.
  • Context blow-up — tool results are huge and the model context fills up. Fix: summarize tool outputs before appending, or use a separate "tool result store" with references.
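The infinite-loop fix above can be made concrete with a small detector run inside the loop. A sketch under the assumption that you track tool calls as (name, args) pairs; the window size is an illustrative default:

```python
def is_stuck(tool_calls, window=3):
    """Flag the infinite-loop failure mode: the same tool called with the
    same arguments `window` times in a row. Run this after each tool call
    and bail out (or escalate to a human) when it fires."""
    if len(tool_calls) < window:
        return False
    recent = tool_calls[-window:]
    return all(call == recent[0] for call in recent)
```

This catches the retry cycle earlier than the hard step limit does, and cheaper: the step limit burns 12 model calls, the detector burns 3.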

What you don't need

  • A custom-trained model. Frontier models do tool calling well enough.
  • Vector databases for general agent work. RAG belongs in agents that need long-term memory; most don't.
  • Multi-agent architectures. Single agent + tools handles 80% of production cases.
  • Heavy frameworks if you're building one focused agent. Plain code is often clearer.

Realistic timeline

  • Day 1–2: Scope, tool definitions, system prompt v1, smoke test.
  • Day 3–7: Build evaluation set, iterate on prompt + tools until you hit 80%+ on the eval.
  • Day 8–14: Deploy to a controlled cohort, add observability, fix the failures that production reveals.
  • Day 15+: Scale up, expand scope incrementally.

Two to three weeks for a focused, production-grade agent is realistic. Anything claiming "build your AI agent in an afternoon" is selling you a demo, not a tool.

▶ Q&A

Frequently asked.

Pulled from real "people also ask" data on these topics — answered honestly, in our own voice.

Q.01

How can I create my own AI agent?

The minimum recipe: pick a frontier LLM (Claude Sonnet 4.6 or GPT-4-class), define 5–15 tools as JSON-schema-validated functions, write a system prompt with role + rules + stop conditions, and run a loop that calls the model, executes tool calls, and feeds results back until the goal is met. Plain code in Python or TypeScript works for most cases.

Q.02

Can I build AI agents without coding?

Yes — n8n, Make, Zapier, and dedicated agent platforms (CustomGPT, Sema4, Lindy) let non-developers build basic agents visually. The trade-off: less control over the loop, fewer tools, and you hit a ceiling on complex agents. For production-grade work in narrow domains, no-code is fine. For anything load-bearing, write the code.

Q.03

What are the 4 types of AI agents?

In classical AI: simple reflex agents (pure stimulus-response), model-based reflex agents (maintain internal state), goal-based agents (plan toward objectives), and utility-based agents (optimize for a measurable outcome). Modern LLM-driven agents are usually goal-based or utility-based, with a learning component layered on top.

Q.04

What is the 10/20/70 rule for AI?

A common framing in enterprise AI: 10% of project success comes from the model/algorithm, 20% from the tooling/platform, and 70% from the people, processes, and change management around it. Translation for AI agents: getting the tech right is necessary but not sufficient — what kills most projects is unclear scope, missing evaluation, and resistance from the humans whose work is being changed.

Q.05

How long does it take to build an AI agent?

A focused, production-grade agent for a narrow use case takes 2–3 weeks of build time and 1–2 weeks of supervised rollout. A toy agent for a demo takes a weekend. The difference is evaluation, observability, and the human-in-the-loop work that keeps it from misbehaving in production.

Q.06

What programming language is best for AI agents?

TypeScript and Python are the two real options. Python has slightly more ecosystem (LangGraph, CrewAI, AutoGen are Python-first). TypeScript wins if your stack is already JavaScript or if you're shipping the agent as part of a Next.js / Cloudflare Worker / Vercel deployment. Both are first-class with the Anthropic and OpenAI SDKs.

▶ Editor's note

Want this built, not just explained?

Book a strategy call. We'll map your stack, find the highest-leverage automation, and quote a 60-day plan.