From Machine Learning
ClawBench: Can AI Agents Complete Everyday Online Tasks? 153 tasks, 144 live websites, best model at 33.3% [R]
Our take
ClawBench is a benchmark that assesses AI browser agents on 153 real-world tasks across 144 live websites. Unlike synthetic benchmarks, it evaluates agents on actual production sites rather than in simulated environments. The best model, Claude Sonnet 4.6, reaches a success rate of only 33.3%, which shows how much difficulty current agents still have with complex tasks. ClawBench records five layers of behavioral data and provides a human ground truth for every task. Findings and resources are available at claw-bench.com.
We introduce ClawBench, a benchmark that evaluates AI browser agents on 153 real-world everyday tasks across 144 live websites. Unlike synthetic benchmarks, ClawBench tests agents on actual production platforms.
Key findings:
- The best model (Claude Sonnet 4.6) achieves a success rate of only 33.3%
- GLM-5 (Zhipu AI) comes second at 24.2% — surprisingly strong for a text-only model
- Finance and Academic tasks are easier (50% for the best model); Travel and Dev tasks are much harder
- No model exceeds 50% in any category — there's a long way to go
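To make the headline percentages concrete, here is a minimal sketch that converts them into approximate task counts. It assumes the overall rate is simply solved tasks divided by all 153 tasks, with each task counted once and weighted equally; the post does not state how the rate is aggregated, so treat the counts as implied rather than reported. The function name `implied_successes` is made up for this example.

```python
TOTAL_TASKS = 153  # task count stated in the post

# Overall success rates reported in the post, as percentages.
reported = {"Claude Sonnet 4.6": 33.3, "GLM-5": 24.2}


def implied_successes(rate_percent: float, total: int = TOTAL_TASKS) -> int:
    """Nearest whole number of solved tasks consistent with a reported rate.

    Assumption: rate = solved / total, one attempt per task, equal weights.
    """
    return round(rate_percent / 100 * total)


counts = {model: implied_successes(rate) for model, rate in reported.items()}

for model, solved in counts.items():
    # Recompute the percentage from the integer count as a sanity check.
    print(f"{model}: ~{solved}/{TOTAL_TASKS} = {solved / TOTAL_TASKS:.1%}")
```

Under that assumption, 33.3% corresponds to about 51 of 153 tasks and 24.2% to about 37 of 153, so the gap between first and second place is roughly 14 tasks.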
What makes ClawBench different:
- Tasks on real live websites, not sandboxed environments
- 5 layers of behavioral data: session replay, screenshots, HTTP traffic, agent reasoning traces, browser actions
- Request interceptor blocks the final HTTP request before irreversible actions (payments, bookings), enabling safe evaluation
- Human ground-truth for every task
- Agentic evaluator with step-level traceable diagnostics
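The request interceptor idea can be illustrated with a small decision function. This is a rough sketch, not ClawBench's implementation: the path hints, the `Decision` type, and the `decide` function are all hypothetical, and a real interceptor would likely need per-site rules rather than generic substring matching.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical path fragments that mark an irreversible final step.
# Illustrative only; these are not taken from ClawBench.
IRREVERSIBLE_PATH_HINTS = ("/checkout/confirm", "/payment", "/booking/submit")

# HTTP methods that can change server-side state.
STATE_CHANGING_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


@dataclass(frozen=True)
class Decision:
    block: bool
    reason: str


def decide(method: str, url: str) -> Decision:
    """Block state-changing requests whose path looks like a final commit."""
    if method.upper() not in STATE_CHANGING_METHODS:
        return Decision(False, "read-only method")
    path = urlparse(url).path.lower()
    for hint in IRREVERSIBLE_PATH_HINTS:
        if hint in path:
            return Decision(True, f"path matches {hint!r}")
    return Decision(False, "no irreversible hint in path")


# The final purchase request is blocked; browsing and cart updates pass through.
print(decide("POST", "https://shop.example/checkout/confirm"))
print(decide("GET", "https://shop.example/payment/methods"))
print(decide("POST", "https://shop.example/cart/add"))
```

In a browser automation setup, a function like this could be called from a network routing hook (for example a Playwright `page.route` handler) that aborts blocked requests and lets the rest continue. The agent can then carry a task all the way to the last step without a payment or booking actually being submitted.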
Resources:
- Paper: https://arxiv.org/abs/2604.08523
- Website (interactive leaderboard + trace viewer): https://claw-bench.com
- Dataset: https://huggingface.co/datasets/NAIL-Group/ClawBench
- GitHub: https://github.com/reacher-z/ClawBench
- PyPI: pip install clawbench-eval
Happy to answer any questions! We're actively looking for feedback on task selection and evaluation methodology.