April 14, 2026•2 min read•from Machine Learning

ClawBench: Can AI Agents Complete Everyday Online Tasks? 153 tasks, 144 live websites, best model at 33.3% [R]

Our take

ClawBench is a benchmark that assesses AI browser agents on 153 real-world tasks across 144 live websites. Unlike synthetic benchmarks, it evaluates agents on actual production sites. The best model, Claude Sonnet 4.6, reaches a success rate of only 33.3%, which shows how far current agents are from reliably completing everyday online tasks. Every task comes with five layers of behavioral data and a human ground truth. The findings and resources are available at claw-bench.com.

We introduce ClawBench, a benchmark that evaluates AI browser agents on 153 real-world everyday tasks across 144 live websites. Unlike synthetic benchmarks, ClawBench tests agents on actual production platforms.

Key findings:

The best model (Claude Sonnet 4.6) achieves only a 33.3% success rate
GLM-5 (Zhipu AI) comes second at 24.2%, which is surprisingly strong for a text-only model
Finance and Academic tasks are easier (50% for the best model); Travel and Dev tasks are much harder
No model exceeds 50% in any category, so there's a long way to go
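
For a sense of scale, the headline percentages can be turned back into approximate task counts. This is only a sketch: it assumes every model is scored pass/fail on all 153 tasks, which the post does not state explicitly, and the helper names are made up for illustration.

```python
TOTAL_TASKS = 153  # task count stated in the post


def implied_successes(rate_percent, total=TOTAL_TASKS):
    """Nearest whole number of solved tasks for a reported success rate."""
    return round(rate_percent / 100 * total)


def as_percent(successes, total=TOTAL_TASKS):
    """Success rate in percent, rounded to one decimal like the post."""
    return round(100 * successes / total, 1)


best = implied_successes(33.3)    # Claude Sonnet 4.6
second = implied_successes(24.2)  # GLM-5

# Round-trip check: the implied counts reproduce the reported rates.
print(best, as_percent(best))
print(second, as_percent(second))
```

Under that assumption the reported rates are consistent with roughly 51 and 37 solved tasks out of 153.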

What makes ClawBench different:

Tasks on real live websites, not sandboxed environments
5 layers of behavioral data: session replay, screenshots, HTTP traffic, agent reasoning traces, browser actions
A request interceptor blocks the final HTTP request of irreversible actions (payments, bookings), enabling safe evaluation on live sites
Human ground-truth for every task
Agentic evaluator with step-level traceable diagnostics
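
To make the request-interceptor idea concrete, here is a minimal sketch of how such a gate could work. It is not ClawBench's implementation: the class, the path hints, and the blocking rule are all hypothetical, and a real interceptor would hook into the browser's network layer rather than take a method and URL directly.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse

# Hypothetical path fragments that mark an irreversible endpoint.
IRREVERSIBLE_PATH_HINTS = ("/checkout", "/pay", "/book", "/order")
# Only state-changing methods are candidates for blocking.
STATE_CHANGING_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


@dataclass
class RequestInterceptor:
    path_hints: tuple = IRREVERSIBLE_PATH_HINTS
    blocked: list = field(default_factory=list)

    def should_block(self, method, url):
        """True for a state-changing request to an irreversible-looking path."""
        path = urlparse(url).path.lower()
        return method.upper() in STATE_CHANGING_METHODS and any(
            hint in path for hint in self.path_hints
        )

    def handle(self, method, url, body=None):
        """Record and drop blocked requests; let everything else through."""
        if self.should_block(method, url):
            self.blocked.append({"method": method.upper(), "url": url, "body": body})
            return "blocked"
        return "forwarded"


interceptor = RequestInterceptor()
print(interceptor.handle("GET", "https://shop.example/checkout"))
print(interceptor.handle("POST", "https://shop.example/checkout/confirm", body="qty=1"))
```

In this sketch the agent can browse all the way to the checkout page, but the final submit is dropped and kept for later inspection. Substring matching on paths is deliberately crude (it would also catch `/bookmarks`), so a real rule set would need to be more precise.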

Resources:

Paper: https://arxiv.org/abs/2604.08523
Website (interactive leaderboard + trace viewer): https://claw-bench.com
Dataset: https://huggingface.co/datasets/NAIL-Group/ClawBench
GitHub: https://github.com/reacher-z/ClawBench
PyPI: pip install clawbench-eval

Happy to answer any questions! We're actively looking for feedback on task selection and evaluation methodology.

submitted by /u/Extreme_Play_8554

Related Articles

Open-source 9-task benchmark for coding-agent retrieval augmentation. Per-task deltas +0.010 to +0.320, all evals reproducible [P]

Sharing an open-source benchmark suite (paper-lantern-challenges) that measures coding-agent performance with vs without retrieval-augmented technique selection across 9 everyday software tasks. Disclosure: I'm the author of the retrieval system under test (paperlantern.ai/code); the artifact being shared here is the benchmark suite itself, not the product. Every prompt, agent code path, and prediction file is in the repo and reproducible.

Setup. Same coding agent (Claude Opus 4.6 as the planner, Gemini Flash 3 as the task model), same input data, same evaluation scripts across all 9 tasks: test generation (mutation score), text-to-SQL (execution accuracy), PDF extraction, contract extraction, PR review, text classification, few-shot prompt selection, LLM routing, summarization evaluation. Independent variable: whether the agent could call a retrieval tool over CS literature before writing its solution. One pass per task, no retries, no manual filtering of outputs.

Task selection. Tasks were chosen to span the everyday-engineering surface a coding agent actually faces, not specialized ML scenarios. Selection criteria: (1) unambiguous quantitative metric, (2) baseline performance well below ceiling, (3) standard datasets where they exist, (4) eval reproducible on a free Gemini API key in roughly 10 minutes per task.

Eval methodology. Each task uses its task-standard quantitative metric (mutation score for test_generation, execution accuracy for text_to_sql, F1 on labeled spans for the extraction tasks, weighted F1 for classification, etc.). Full per-task scripts and dataset choices are in the repo - one directory per task, evaluate.py as the entry point, README.md per task documenting methodology and dataset.

Retrieval setup. The "with retrieval" agent has access to three tool calls: explore_approaches(problem) returns ranked candidate techniques from the literature, deep_dive(technique) returns implementation steps and known failure modes for a chosen technique, compare_approaches(candidates) is for side-by-side when multiple options look viable. The agent decides when and how often to call them. Latency is roughly 20s per call; results cache across sessions. The baseline agent has none of these tools, otherwise identical scaffolding.

Comparability. Both agents share the same task-specific user prompt; the only system-prompt difference is the retrieval agent's tool-call grammar. Predictions and per-task prompts are diffable in the repo (baseline/ and with_pl/ subdirectories per task).

Results.

Task                  Baseline  With retrieval  Delta
extraction_contracts  0.444     0.764           +0.320
extraction_schemas    0.318     0.572           +0.254
test_generation       0.625     0.870           +0.245
classification        0.505     0.666           +0.161
few_shot              0.193     0.324           +0.131
code_review           0.351     0.395           +0.044
text_to_sql           0.650     0.690           +0.040
routing               0.744     0.761           +0.017
summeval              0.623     0.633           +0.010

The test-generation delta came from the agent discovering mutation-aware prompting - the techniques are MuTAP and MUTGEN - which enumerate every AST-level mutation of the target and require one test per mutation. Baseline wrote generic tests from pretrain priors. The contract extraction delta came from BEAVER (section-level relevance scoring) and PAVE (post-extraction validation), both 2026 techniques that post-date the agent's training. 10 of the 15 most-cited sources across the experiments were published in 2025 or later, which is the conservative argument for why retrieval matters: the agent could not have reached these techniques from parametric memory.

Failure modes. Self-refinement hurt text-to-SQL (the agent second-guessed correct queries after reading work on SQL ambiguity). Two suggested techniques (DyT, SeeDNorm) were architecture-incompatible in the autoresearch experiment and got discarded. Retrieval surfaces better options, not guaranteed wins.

Reproducibility. Every prompt, every line of agent code, every prediction file, every eval script is in the repo. Each task directory has a README documenting methodology and an approach.md showing exactly what the retrieval surfaced and which technique the agent chose.

Repo: https://github.com/paperlantern-ai/paper-lantern-challenges
Writeup with detailed per-task discussion: https://www.paperlantern.ai/blog/coding-agent-benchmarks

Happy to share additional design choices in comments.

submitted by /u/kalpitdixit
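
The per-task deltas quoted in the related post's Results can be re-derived from its baseline and with-retrieval scores. The snippet below is just a sanity check on the numbers as posted; the dictionary layout and function name are illustrative and are not taken from the paper-lantern-challenges repo.

```python
# (baseline, with_retrieval) scores copied from the post's Results section.
RESULTS = {
    "extraction_contracts": (0.444, 0.764),
    "extraction_schemas": (0.318, 0.572),
    "test_generation": (0.625, 0.870),
    "classification": (0.505, 0.666),
    "few_shot": (0.193, 0.324),
    "code_review": (0.351, 0.395),
    "text_to_sql": (0.650, 0.690),
    "routing": (0.744, 0.761),
    "summeval": (0.623, 0.633),
}


def deltas(results):
    """Absolute improvement per task, rounded to three decimals like the post."""
    return {task: round(w - b, 3) for task, (b, w) in results.items()}


d = deltas(RESULTS)
mean_delta = sum(d.values()) / len(d)

# Print tasks from largest to smallest improvement.
for task, delta in sorted(d.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{task:22s} {delta:+.3f}")
print(f"mean delta: {mean_delta:+.3f}")
```

The recomputed values match the posted range of +0.010 to +0.320. Since the post reports one pass per task with no retries, these are single-run differences rather than averages over seeds.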