| Test case | claude-3.5-sonnet | gpt-4.1 | llama-3.1-70b |
|---|---|---|---|
| Greets by first name<br>assert contains("{{persona.first_name}}") | Pass<br>Hi Anna,<br>Saw the note in Harborline's October changelog about self-serve onboarding being up 24%...<br>Cost: $0.0021 · Latency: 1.4s | Pass<br>Hey Anna,<br>Loved your latest changelog — your team is clearly cooking on self-serve onboarding...<br>Cost: $0.0019 · Latency: 0.9s | Pass<br>Hi Anna,<br>I noticed your team shared that self-serve onboarding was up 24% last quarter — that's a meaningful shift...<br>Cost: $0.0011 · Latency: 2.1s |
| Stays under 90 words<br>assert word_count() <= 90 | Pass · 76w<br>Hi Anna,<br>Your changelog mentioned a 24% jump in self-serve onboarding last quarter. That kind of shift usually means your team is suddenly fielding earlier-stage questions... | Pass · 84w<br>Hey Anna — quick one. Your October changelog called out a 24% jump in self-serve onboarding... | Fail · 134w<br>Hi Anna,<br>I hope this finds you well. I came across the Harborline October changelog and was particularly struck by your team's announcement around the 24% quarter-over-quarter increase in self-serve onboarding signups... |
| Avoids the word "synergy"<br>assert not contains("synergy", "leverage") | Pass<br>...We work with CS leads who want to sort that new mix without expanding headcount. | Fail · "synergy"<br>...wanted to leverage this moment to share how we help leaders find synergy across CS and product. | Pass<br>...We help customer success teams handle a new shape of question without adding people. |
| Ends with one open question<br>judge: ends with one specific question, not a meeting ask | Pass<br>What's the part of that 24% that's costing your team the most time right now? | Fail · meeting ask<br>Would 15 minutes next Tuesday work to walk you through it? | Pass<br>Which question type from that new onboarding cohort is hitting your team hardest? |
| No filler opener<br>judge: does not open with "I hope this finds you well" | Pass<br>Hi Anna, Saw the note in your October changelog... | Pass<br>Hey Anna — quick one... | Pass<br>Hi Anna, I noticed your team shared that self-serve onboarding... |

This evaluation re-runs every case against the three selected providers. The previous run's results are kept in version history.
5 cases × 3 providers · 15 result rows.
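
The three `assert` cases in the table are deterministic, so they can be approximated with plain string checks. The sketch below is a minimal illustration and not the tool's actual implementation: the helper names mirror the assertions shown above, while `not_contains`, `CASES`, `evaluate`, whole-word matching, and whitespace-based word counting are all assumptions. The two `judge:` cases need a model call and are left out.

```python
import re

def contains(text: str, needle: str) -> bool:
    """Case-insensitive substring check (assumed semantics)."""
    return needle.lower() in text.lower()

def not_contains(text: str, *banned: str) -> bool:
    """True only if none of the banned words appear as whole words (assumed semantics)."""
    return not any(
        re.search(rf"\b{re.escape(word)}\b", text, re.IGNORECASE) for word in banned
    )

def word_count(text: str) -> int:
    """Count whitespace-separated tokens (assumed definition of a word)."""
    return len(text.split())

# Hypothetical registry: case name -> check taking (output, persona).
CASES = {
    "greets_by_first_name": lambda out, persona: contains(out, persona["first_name"]),
    "under_90_words": lambda out, persona: word_count(out) <= 90,
    "avoids_banned_words": lambda out, persona: not_contains(out, "synergy", "leverage"),
}

def evaluate(outputs: dict[str, str], persona: dict[str, str]) -> dict[tuple[str, str], bool]:
    """Return one pass/fail result per (case, provider) pair."""
    return {
        (case, provider): check(outputs[provider], persona)
        for case, check in CASES.items()
        for provider in outputs
    }
```

Calling `evaluate` with one output string per provider yields `len(CASES) * len(outputs)` result rows, the same cases × providers shape as the grid above.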