From Machine Learning

[P] PhAIL (phail.ai) – an open benchmark for robot AI on real hardware. Best model: 5% of human throughput, needs help every 4 minutes.

Our take

Introducing PhAIL (phail.ai), an open benchmark that assesses robot AI on real hardware using production metrics rather than simulation results. After a year of investigating how well vision-language-action (VLA) models handle commercial tasks, we built a blind evaluation on the DROID platform for bin-to-bin order picking, a common industrial operation. The benchmark reports Units Per Hour (UPH) and Mean Time Between Failures (MTBF). Explore the full dataset, video, and telemetry at phail.ai.

I spent the last year trying to answer a simple question: how good are VLA models on real commercial tasks? Not demos, not simulation, not success rates on 10 tries. Actual production metrics on real hardware.

I couldn't find honest numbers anywhere, so I built a benchmark.

Setup: DROID platform, bin-to-bin order picking – one of the most common warehouse and industrial operations. Four models fine-tuned on the same real-robot dataset, evaluated blind (the operator doesn't know which model is running). We measure Units Per Hour (UPH) and Mean Time Between Failures (MTBF) – the metrics operations people actually use.
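For readers unfamiliar with the two metrics, here is a minimal sketch of how they could be computed from per-run logs. The `Run` schema and its field names are hypothetical, and the post does not state exactly what PhAIL counts as a failure or whether recovery time is included, so treat the definitions below as assumptions.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One evaluation episode (hypothetical schema, not PhAIL's telemetry format)."""
    duration_s: float   # wall-clock length of the run, in seconds
    units_moved: int    # items successfully moved bin to bin
    failures: int       # events that needed a human to step in


def units_per_hour(runs):
    """Total units moved divided by total run time in hours."""
    total_s = sum(r.duration_s for r in runs)
    return sum(r.units_moved for r in runs) * 3600 / total_s


def mtbf_minutes(runs):
    """Total run time in minutes divided by the number of failures."""
    total_failures = sum(r.failures for r in runs)
    if total_failures == 0:
        return float("inf")  # no failure observed yet
    return sum(r.duration_s for r in runs) / 60 / total_failures


# Made-up runs for illustration only; these are not PhAIL results.
example = [Run(duration_s=600, units_moved=10, failures=2),
           Run(duration_s=1200, units_moved=20, failures=1)]
print(units_per_hour(example))  # 60.0
print(mtbf_minutes(example))    # 10.0
```

Pooling time and counts across runs, rather than averaging per-run rates, keeps short runs from being over-weighted.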

Results (full data with video and telemetry for every run at phail.ai):

| Model | UPH | MTBF |
|---|---|---|
| OpenPI (pi0.5) | 65 | 4.0 min |
| GR00T | 60 | 3.5 min |
| ACT | 44 | 2.8 min |
| SmolVLA | 18 | 1.2 min |
| Teleop / Finetuning (human controlling same robot) | 330 | - |
| Human hands | 1,331 | - |

The difference between OpenPI and GR00T is not statistically significant at current episode counts – we're collecting more runs.
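The post does not say which significance test was used. As one possible approach, a permutation test on per-episode throughput makes few distributional assumptions. The sketch below is illustrative only: the function name is hypothetical and the sample values are made up, not PhAIL data.

```python
import random


def permutation_p_value(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference of means."""
    rng = random.Random(seed)
    pooled = list(a) + list(b)
    n_a = len(a)

    def mean_gap(values):
        left, right = values[:n_a], values[n_a:]
        return abs(sum(left) / len(left) - sum(right) / len(right))

    observed = mean_gap(pooled)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # randomly reassign episodes to the two models
        if mean_gap(pooled) >= observed:
            hits += 1
    # The +1 counts the observed labelling itself, so p is never exactly 0.
    return (hits + 1) / (n_resamples + 1)


# Made-up per-episode UPH values, for illustration only (not PhAIL data).
model_a = [62, 70, 58, 66, 71, 63]
model_b = [59, 64, 55, 61, 67, 54]
print(permutation_p_value(model_a, model_b))
```

With only a handful of episodes per model, a small gap in mean UPH will usually produce a large p-value, which is consistent with the plan to collect more runs.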

The teleop baseline is the fairer comparison: same hardware, human in the loop. That's a 5x gap to the best model (330 vs. 65 UPH), and it's almost entirely policy quality – the robot can physically move much faster than any model commands it to. The human-hands number is what warehouse operators compare against when deciding whether to deploy.
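The gaps can be checked directly from the reported UPH figures. This is a small sketch using only the numbers in the results table; the helper name `share_of` is made up for illustration.

```python
# UPH values copied from the results table above.
MODEL_UPH = {"OpenPI (pi0.5)": 65, "GR00T": 60, "ACT": 44, "SmolVLA": 18}
TELEOP_UPH = 330
HUMAN_HANDS_UPH = 1331


def share_of(baseline_uph, model_uph):
    """Model throughput as a fraction of a baseline's throughput."""
    return model_uph / baseline_uph


for name, uph in MODEL_UPH.items():
    print(f"{name:15s} "
          f"{share_of(TELEOP_UPH, uph):6.1%} of teleop, "
          f"{share_of(HUMAN_HANDS_UPH, uph):6.1%} of human hands")
```

For the best model this gives roughly 19.7% of teleop throughput (the 5x gap) and 4.9% of human-hands throughput (the 5% in the title).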

The MTBF numbers are arguably more telling than UPH. At 4 minutes between failures, "autonomous operation" means a full-time babysitter. Reliability needs to cross a threshold before autonomy has economic value.
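To make the babysitting cost concrete, MTBF can be converted into interventions per hour and into units moved between failures. This is a rough sketch that assumes failures are spread evenly and that UPH and MTBF are measured against the same clock, which the post does not confirm. The 8-hour shift length is an assumption for illustration.

```python
# Figures copied from the results table above.
MODEL_UPH = {"OpenPI (pi0.5)": 65, "GR00T": 60, "ACT": 44, "SmolVLA": 18}
MODEL_MTBF_MIN = {"OpenPI (pi0.5)": 4.0, "GR00T": 3.5, "ACT": 2.8, "SmolVLA": 1.2}
SHIFT_HOURS = 8  # assumed shift length, not from the post


def interventions_per_hour(mtbf_min):
    """Expected human interventions per hour of operation."""
    return 60 / mtbf_min


def units_between_failures(uph, mtbf_min):
    """Average number of units moved before the next failure."""
    return uph * mtbf_min / 60


for name, mtbf in MODEL_MTBF_MIN.items():
    per_hour = interventions_per_hour(mtbf)
    print(f"{name:15s} {per_hour:5.1f} interventions/h, "
          f"{per_hour * SHIFT_HOURS:5.0f} per shift, "
          f"{units_between_failures(MODEL_UPH[name], mtbf):4.1f} units between failures")
```

Under these assumptions the best model needs about 15 interventions per hour and moves only about 4 units between failures.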

Every run is public with synced video and telemetry. Fine-tuning dataset, training scripts, and submission pathway are all open. If you think your model or fine-tuning recipe can do better, submit a checkpoint.

What models are we missing? We're adding NVIDIA DreamZero next. If you have a checkpoint that works on DROID hardware, submit it – or tell us what you'd want to see evaluated. What tasks beyond pick-and-place would be the real test for general-purpose manipulation?

submitted by /u/svertix