Forge — Evaluation
Evaluations/Cold outreach — short · v14

Side-by-side evaluation

5 cases × 3 providers · run 12 minutes ago by Devon T.
Aggregate pass rate
claude-3.5-sonnet · 5 / 5 · 100%, held since v12
gpt-4.1 · 3 / 5 · -2 vs v13, regressed
llama-3.1-70b · 4 / 5 · Length case fails
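The aggregate figures above follow from the per-case results listed further down the page. A minimal sketch of that tally, assuming the outcomes are held as one list of booleans per provider; the `results` layout and the `pass_rate` helper are hypothetical and are not Forge's API:

```python
# Hypothetical layout: one boolean per test case, in the order the page lists
# them (first name, word limit, banned words, closing question, filler opener).
results = {
    "claude-3.5-sonnet": [True, True, True, True, True],
    "gpt-4.1": [True, True, False, False, True],
    "llama-3.1-70b": [True, False, True, True, True],
}


def pass_rate(outcomes):
    """Return (passed, total) for one provider's list of outcomes."""
    return sum(outcomes), len(outcomes)


for provider, outcomes in results.items():
    passed, total = pass_rate(outcomes)
    print(f"{provider}: {passed} / {total}")
```

Keeping the per-case booleans rather than only the totals makes it possible to see which case caused a regression between prompt versions.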
Test case · claude-3.5-sonnet · gpt-4.1 · llama-3.1-70b
Greets by first name
assert contains("{{persona.first_name}}")
Pass
Hi Anna, Saw the note in Harborline's October changelog about self-serve onboarding being up 24%...
Cost · $0.0021 · Latency · 1.4s
Pass
Hey Anna, Loved your latest changelog — your team is clearly cooking on self-serve onboarding...
Cost · $0.0019 · Latency · 0.9s
Pass
Hi Anna, I noticed your team shared that self-serve onboarding was up 24% last quarter — that's a meaningful shift...
Cost · $0.0011 · Latency · 2.1s
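The first-name check above depends on the `{{persona.first_name}}` placeholder being filled in before the substring test runs. A rough sketch of how such a check might work, assuming a nested dictionary as the persona context; `render`, this `contains` signature, and the context layout are illustrative assumptions, not Forge's implementation:

```python
import re


def render(template, context):
    """Replace {{dotted.path}} placeholders with values from a nested dict."""
    def lookup(match):
        value = context
        for key in match.group(1).split("."):
            value = value[key]
        return str(value)

    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}", lookup, template)


def contains(output, template, context):
    """True if the rendered template appears anywhere in the output."""
    return render(template, context) in output


# Hypothetical persona context; the name "Anna" is taken from the outputs above.
context = {"persona": {"first_name": "Anna"}}
output = "Hi Anna, Saw the note in Harborline's October changelog"
print(contains(output, "{{persona.first_name}}", context))
```

A plain substring test like this would also pass on a longer name that merely starts with "Anna", so a stricter evaluator might match on word boundaries instead.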
Stays under 90 words
assert word_count() <= 90
Pass · 76w
Hi Anna, Your changelog mentioned a 24% jump in self-serve onboarding last quarter. That kind of shift usually means your team is suddenly fielding earlier-stage questions...
Pass · 84w
Hey Anna — quick one. Your October changelog called out a 24% jump in self-serve onboarding...
Fail · 134w
Hi Anna, I hope this finds you well. I came across the Harborline October changelog and was particularly struck by your team's announcement around the 24% quarter-over-quarter increase in self-serve onboarding signups...
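The word-limit check above can be approximated by counting whitespace-separated tokens. This is a sketch under that assumption only: the page does not say how `word_count()` tokenizes, and `under_limit` is a hypothetical helper name.

```python
def word_count(text):
    """Count whitespace-separated tokens in the text."""
    return len(text.split())


def under_limit(text, limit=90):
    """Return (passed, count) so a report can show both, e.g. 'Fail · 134w'."""
    count = word_count(text)
    return count <= limit, count


print(word_count("Hi Anna, I hope this finds you well."))
```

Under this rule a standalone dash or other punctuation surrounded by spaces counts as a word, so totals may differ slightly from the ones the page reports.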
Avoids the words "synergy" and "leverage"
assert not contains("synergy", "leverage")
Pass
...We work with CS leads who want to sort that new mix without expanding headcount.
Fail — "synergy"
...wanted to leverage this moment to share how we help leaders find synergy across CS and product.
Pass
...We help customer success teams handle a new shape of question without adding people.
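The assertion for this case bans two terms, and the failing excerpt above contains both of them even though the failure label names only "synergy". A minimal sketch that reports every banned term found, assuming whole-word, case-insensitive matching; `banned_terms_found` is a hypothetical helper, and the real `contains` may do a plain substring test instead:

```python
import re


def banned_terms_found(output, *terms):
    """Return the banned terms present in the output, in the order given."""
    found = []
    for term in terms:
        # \b keeps "leverage" from matching inside a longer word.
        if re.search(rf"\b{re.escape(term)}\b", output, flags=re.IGNORECASE):
            found.append(term)
    return found


excerpt = (
    "wanted to leverage this moment to share how we help leaders "
    "find synergy across CS and product."
)
print(banned_terms_found(excerpt, "synergy", "leverage"))
```

Returning the full list rather than stopping at the first hit gives the prompt author every word to remove in one pass.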
Ends with one open question
judge: ends with one specific question, not a meeting ask
Pass
What's the part of that 24% that's costing your team the most time right now?
Fail — meeting ask
Would 15 minutes next Tuesday work to walk you through it?
Pass
Which question type from that new onboarding cohort is hitting your team hardest?
No filler opener
judge: does not open with "I hope this finds you well"
Pass
Hi Anna, Saw the note in your October changelog...
Pass
Hey Anna — quick one...
Pass
Hi Anna, I noticed your team shared that self-serve onboarding...
Total tokens
14,820
Total cost
$0.0287
Reproducibility
prompt v14 pinned · params recorded · finished 12m ago