Last quarter our AI SDR was booking 14% positive replies on cold outbound. This quarter the same prompt is getting 6%. Models drifted, spam filters tightened, and the patterns that won replies in January don't win them in April. Here's the version we're running right now: what changed and why.
This is a Playbook entry, which means it's dated by design. Read it for the structure; assume the specifics are obsolete in 6 months.
The 4-part structure
Every cold-outbound prompt we ship has the same four sections. Skip any one and quality drops noticeably.
- Role + constraints — who the agent is, what tone it uses, what it's not allowed to say.
- Reference example — one fully-written sample email that shows the model what "good" looks like.
- Lead context — the verified data we have on this specific person and their company.
- Output spec — JSON schema with subject + body, length limits, and "decline if you can't personalize" gate.
No system prompt longer than 600 tokens. No "you are an expert sales professional with 20 years of experience" — that adds noise without changing output.
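For concreteness, here's a minimal sketch of how the four sections compose into one prompt. The section labels, the `PromptParts` shape, and the `buildPrompt` helper are illustrative assumptions, not our production template.

```typescript
// Illustrative only: compose the four sections into one system prompt.
// Labels and this helper are assumptions, not our exact production template.
type PromptParts = {
  role: string;                          // Part 1a: who the agent is
  constraints: string[];                 // Part 1b: the hard bans
  referenceEmail: string;                // Part 2: one fully-written sample
  leadContext: Record<string, unknown>;  // Part 3: verified data only
  outputSpec: string;                    // Part 4: JSON schema + gates
};

function buildPrompt(p: PromptParts): string {
  return [
    `ROLE\n${p.role}`,
    `CONSTRAINTS\n${p.constraints.map((c) => `- ${c}`).join("\n")}`,
    `REFERENCE EXAMPLE\n${p.referenceEmail}`,
    `LEAD CONTEXT\n${JSON.stringify(p.leadContext, null, 2)}`,
    `OUTPUT SPEC\n${p.outputSpec}`,
  ].join("\n\n");
}
```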
Part 1 — Role + constraints
The role section is short and concrete. Bad role: "You're an AI sales rep that writes engaging cold emails." Good role:
"You draft a single short cold email from one of our team to a researched lead. Tone: factual, deferential, brief. Length: 90–140 words. Goal: get a 15-min call."
Then the constraints section. We list 5–8 hard bans. The current list:
- Never say "Hope this finds you well", "I hope you're doing well", or any synonym.
- Never claim to have used the recipient's product unless we tell you we did.
- Never compliment the recipient's LinkedIn or recent post. Generic compliments are spam-flagged.
- Never quote a metric for their company that we don't provide in lead context.
- Never sign with a fake job title. Use only the role we provide.
- Never include a calendar link in the first email. Reply-driven CTA only.
- Never use em-dashes ( — ) in the body. They are a tell for LLM-written email.
That last one matters. Em-dashes were the #1 detection signal in our test runs. Banning them cut "looks AI-written" flags in human review by 40%.
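The bans are also cheap to enforce in code after generation, before a draft hits the send queue. A rough sketch; the regexes are illustrative and only catch the literal phrasings listed above, so treat it as a backstop, not the enforcement mechanism.

```typescript
// Rough post-generation lint: flag drafts that violate the hard bans.
// Patterns are illustrative and only catch literal phrasings of the bullets above.
const BANNED_PATTERNS: { label: string; pattern: RegExp }[] = [
  { label: "hope-this-finds-you-well", pattern: /hope (this|you('re| are))[^.]*well/i },
  { label: "em-dash", pattern: /\u2014/ },                       // em-dash anywhere in the body
  { label: "linkedin-compliment", pattern: /your (recent )?(linkedin )?post/i },
  { label: "calendar-link", pattern: /calendly\.com|cal\.com/i }, // example scheduler domains
];

function banViolations(body: string): string[] {
  return BANNED_PATTERNS
    .filter(({ pattern }) => pattern.test(body))
    .map(({ label }) => label);
}
```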
Part 2 — Reference example (the part most prompts skip)
This is the section that does the most heavy lifting and gets skipped most often. Instead of describing tone in adjectives, paste one fully-written email that we'd be proud to send. The model imitates structure better than it follows abstract style guidance.
Our current reference example:
"Hi {name},
Saw {company} just opened a third clinic in {city}. We help med-spa groups stitch together booking, intake, and reactivation in one system without the per-seat HubSpot tax.
Worth 15 minutes to compare? If not, ignore this and good luck with the launch."
Three things to notice:
- It uses one specific fact about the company (the third-clinic detail). The fact must come from lead context, not be invented.
- It states value in concrete terms ("without the per-seat HubSpot tax") not platitudes ("transform your business").
- It explicitly invites the recipient to ignore it. This is counterintuitive. It performs better than aggressive CTAs because it removes the salesy pressure that triggers reflexive deletion.
Part 3 — Lead context (verified data only)
This is where we paste the lead's actual data, not a search summary. Hallucinated context is the single biggest failure mode of cold outbound at scale.
Our lead context block has exactly four fields:
- verified_name — first name, confirmed from LinkedIn or company site.
- verified_company — current employer, confirmed within 30 days.
- verified_role — current job title, confirmed within 30 days.
- verified_signal — one specific fact (recent hire, funding, expansion, product launch) with a source URL we logged.
If verified_signal is empty, the prompt is told to decline drafting and return {"draft": null, "reason": "no signal"}. We don't send generic "checking in" emails. That alone cut our send volume by 35% and improved reply rate dramatically.
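Since Zod already validates the output (Part 4, below), the same approach works for the lead context going in. A sketch under the four fields above; the schema name and the exact shape of the signal object (fact plus source URL) are our assumptions for illustration.

```typescript
import { z } from "zod";

// Sketch of the lead-context contract. Field names mirror the four bullets above;
// the signal shape (fact + source_url) is an assumption for illustration.
const LeadContext = z.object({
  verified_name: z.string().min(1),
  verified_company: z.string().min(1),
  verified_role: z.string().min(1),
  verified_signal: z
    .object({ fact: z.string().min(1), source_url: z.string().url() })
    .nullable(), // null => the prompt declines: {"draft": null, "reason": "no signal"}
});
```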
Part 4 — Output spec
JSON output, validated by Zod on receipt. Anything else is rejected and the lead goes to a human review queue.
"{ subject: string (max 60 chars), body: string (90–140 words), confidence: number 0–1, reason_to_skip?: string }"
We require the model to fill in `confidence` from 0 to 1. Below 0.6, the email goes to human review. Below 0.3, it doesn't send at all. The model is honest about its own confidence more often than you'd expect — surprisingly useful gating signal.
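Here's roughly what the receipt-side validation and the confidence gate look like. A sketch using the numbers above; the word-count check is a crude whitespace split rather than a tokenizer, and the schema and function names are illustrative.

```typescript
import { z } from "zod";

// Sketch of the output contract plus the confidence gate. Thresholds match the prose;
// the word count is a crude whitespace split, not a tokenizer.
const Draft = z.object({
  subject: z.string().max(60),
  body: z.string().refine(
    (b) => {
      const words = b.trim().split(/\s+/).length;
      return words >= 90 && words <= 140;
    },
    { message: "body must be 90-140 words" },
  ),
  confidence: z.number().min(0).max(1),
  reason_to_skip: z.string().optional(),
});

function routeDraft(draft: z.infer<typeof Draft>): "send" | "human_review" | "drop" {
  if (draft.confidence < 0.3) return "drop";          // below 0.3: never sends
  if (draft.confidence < 0.6) return "human_review";  // below 0.6: human review
  return "send";
}
```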
What changed in March 2026
Three changes from the previous version of this prompt:
- Model swap — moved from GPT-4-class to Claude Sonnet 4.6. Reply rate up ~18% on same lists. Anthropic's training data update apparently includes more recent business context, and Sonnet 4.6 follows the "ban these phrases" instructions much more reliably.
- Em-dash ban — added after testing showed em-dashes were the highest-correlation signal for "this looks AI-written" feedback in human review.
- Decline gate — the "return null if no signal" rule was new. Previously the model would invent signals when context was thin. Now it gracefully declines.
Where it still fails
Three scenarios where this prompt produces meh-to-bad output:
- Deeply niche technical verticals — semiconductor, defense, biotech regulatory. The model doesn't have enough domain context to make the value prop sound informed. Considering vertical-specific prompts as a fix.
- Very senior buyers (CXO/founder) — the deferential tone reads as generic. Senior buyers want either no email or a sharply opinionated one. We're testing a second prompt variant for senior-only segments.
- Reply-to-reply (second touch) — this prompt is for cold first touches only. Reply handling is a different agent with different goals (continue conversation, surface objections, route handoff).
How to adapt this for your own outbound
The structure (4 parts) is portable. The specifics (em-dash ban, Sonnet 4.6, decline gate) need to be re-tested for your audience and your model.
Two starting moves:
- Run a 50-lead test with two versions: your full prompt and the same prompt with the reference example removed. The full version should win by 10–30%. If it doesn't, your reference example is bad. Rewrite it.
- Run a 100-email batch through human review and ask the reviewer "does this look AI-written?" Cluster the "yes" cases by what tipped them off. That list becomes your bans for the next revision; a rough tally script is sketched below.
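The second move is easy to script once review notes are structured. A rough sketch, assuming reviewers tag each flagged email with the tell that gave it away; the `ReviewNote` shape and the cutoff are hypothetical.

```typescript
// Tally reviewer "tells": the most common ones become next revision's ban candidates.
// ReviewNote is a hypothetical shape; minCount is an arbitrary cutoff.
type ReviewNote = { looksAiWritten: boolean; tell?: string };

function banCandidates(notes: ReviewNote[], minCount = 3): string[] {
  const counts = new Map<string, number>();
  for (const note of notes) {
    if (note.looksAiWritten && note.tell) {
      counts.set(note.tell, (counts.get(note.tell) ?? 0) + 1);
    }
  }
  return [...counts.entries()]
    .filter(([, count]) => count >= minCount)
    .sort((a, b) => b[1] - a[1])
    .map(([tell]) => tell);
}
```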
If you want this set up end-to-end on your stack — agent, prompt, eval harness, queue — that's exactly what we ship as our AI SDR build. Book a call or read the AI SDR product page.
