What benchmark would you build for “reply quality” in SDR generation? [D]
Working on evaluating some AI-generated outbound (SDR-style emails plus follow-ups) and running into a weird problem. Everyone talks about better personalisation or higher reply rates, but when you actually try to benchmark quality it gets messy fast.
A few things we've looked at (rough composite sketch after the list):
a) reply rate (obvious, but noisy, with a delayed signal)
b) positive vs negative replies (hard to label cleanly at scale)
c) factual accuracy about the prospect/company
d) how much editing a human has to do before sending
e) whether the message sounds human enough not to trigger spam radar
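If we went composite, the obvious starting point is a weighted blend of the five signals above. Rough Python sketch; the weights, field names, and thresholds are all made up and would need fitting against human judgments:

```python
from dataclasses import dataclass

@dataclass
class MessageSignals:
    replied: bool            # (a) got any reply at all (delayed, noisy)
    reply_sentiment: float   # (b) -1..1, from a labeler or classifier
    factual_errors: int      # (c) wrong claims about the prospect/company
    edit_fraction: float     # (d) 0..1, share of the draft changed before send
    spam_score: float        # (e) 0..1, from a spam/AI-detector heuristic

def composite_score(s: MessageSignals) -> float:
    """Blend the five signals into one 0..1 score.
    Weights are invented; fit them against human preference data."""
    score = 0.30 * (1.0 if s.replied else 0.0)
    score += 0.25 * (s.reply_sentiment + 1.0) / 2.0  # rescale -1..1 -> 0..1
    score += 0.20 * (1.0 if s.factual_errors == 0 else 0.0)
    score += 0.15 * (1.0 - s.edit_fraction)
    score += 0.10 * (1.0 - s.spam_score)
    return score
```

The hard part isn't the arithmetic: (a) and (b) arrive weeks after (c) through (e), so a composite like this is really two benchmarks stitched together, one offline and one live.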
The issue, for me at least: none of these fully captures “this is a good outbound message”. You can optimise for reply rate and end up with clickbaity nonsense. You can optimise for accuracy and get something technically correct but completely dead. Right now the most practical metric internally is probably time-to-approve/send after human review, but that feels like a proxy, not the thing itself (rough sketch after the bullets).

If you had to build a proper benchmark here, what would you optimise for? This seems like one of those problems where everyone says the metric isn't important, but it looks like the core element to me.
- single metric or composite?
- offline eval vs live campaign data?
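On the approve/time-to-send proxy: a cheaper offline stand-in is measuring how much of the draft actually survives human review. Minimal sketch using stdlib difflib (`edit_fraction` is a made-up helper, and character-level similarity is crude, but it's free):

```python
import difflib

def edit_fraction(draft: str, sent: str) -> float:
    """Share of the draft a human changed before sending:
    1 - similarity ratio, so 0.0 = sent as-is, ~1.0 = full rewrite."""
    return 1.0 - difflib.SequenceMatcher(None, draft, sent).ratio()

# lightly edited draft -> low score; heavy rewrite -> high score
print(edit_fraction(
    "Hi Dana, saw your team is hiring SDRs...",
    "Hi Dana, noticed your team is hiring SDRs...",
))
```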
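And on offline eval vs live campaign data: live reply rates are brutally noisy at typical SDR volumes, which is worth quantifying before trusting per-variant comparisons. This is just the standard Wilson score interval; the example numbers are invented:

```python
import math

def wilson_interval(replies: int, sends: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for an observed reply rate."""
    if sends == 0:
        return (0.0, 0.0)
    p = replies / sends
    denom = 1 + z**2 / sends
    center = (p + z**2 / (2 * sends)) / denom
    margin = z * math.sqrt(p * (1 - p) / sends + z**2 / (4 * sends**2)) / denom
    return (center - margin, center + margin)

# 12 replies on 300 sends -> roughly (0.023, 0.069), a ~3x spread,
# so small live tests can't separate variants reliably
print(wilson_interval(12, 300))
```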