June 11, 2026•3 min read•from Machine Learning

Routing LLMs by task verifiability: a small experiment (n=120, 3 models) inspired by Karpathy's framework [D]

Our take

A small-scale experiment (n=120 tasks, 3 models) inspired by Karpathy’s framework explores routing LLMs by task verifiability. On high-verifiability tasks, such as code unit tests and structured data extraction, a smaller model (Mistral 3 8B) came close to frontier-model pass rates when paired with a verifier and one retry. On low-verifiability tasks, the capability gap remained substantial. This suggests that a verifier makes such tasks more tractable for weaker models, though the author stresses that the result is directional rather than conclusive.

The recent experiment on task verifiability and LLM performance, shared by /u/DragonfruitAlone4497, offers a compelling, albeit preliminary, test of an idea gaining traction in the AI community. Building on Karpathy's framework, which categorizes tasks by how mechanically checkable their outputs are, the author’s investigation suggests a tangible link between verifiability and relative ease of execution, even for smaller models. This resonates with the broader conversation about augmenting less powerful LLMs with robust verification mechanisms, narrowing the performance gap with frontier models in specific, high-verifiability domains. Considering Hugging Face's recent relaunch (see "Introducing Papers Without Code"), this highlights the growing importance of accessible research and experimentation in opening AI development to a wider range of practitioners. The findings also echo concerns raised in discussions such as "Anthropic's new model Fable will silently handicap work on LLMs", particularly about how design choices can shape LLM research and adoption.

The results, though admittedly "messy" and limited by a small sample size (n=120), point in a consistent direction. The expected performance disparity persisted in low-verifiability tasks like creative summarization, but a smaller model, Mistral 3 8B, when coupled with a verifier (in this case, JSON Schema and regexes) and a single retry, reached pass rates close to those of larger models like Claude Sonnet 4.6 and GPT 5.5 on high-verifiability tasks such as code unit tests and structured data extraction. The anecdote about the ambiguous JSON schema, which initially depressed Sonnet’s extraction score, is a useful reminder that a verification system is only as good as its underlying rules and constraints. A verification layer is not a magic bullet; it requires careful design and ongoing maintenance to remain effective. The consistent hallucination observed in multi-hop reasoning, where retries did not help and the gap between models remained large, underscores the limits of verification in addressing fundamental reasoning deficits.

The implications of this work extend beyond improving the efficiency of smaller models. It suggests a potentially significant shift in how we approach LLM development and deployment. Rather than solely pursuing ever-larger models, we might see a greater focus on building specialized systems that use verification to ensure accuracy and reliability within narrow domains. This aligns with a broader trend toward modularity and specialization in AI, where smaller, more targeted models are combined to achieve complex goals. The fact that the author, who works at an LLM infrastructure company, ran this experiment on their own time points to a growing interest in exploring these avenues outside formal research settings. Moreover, the acknowledgement of limitations (the small sample size, the simplicity of the verifier, and the weak control over prompt length) shows a commendable commitment to transparency, a crucial element in fostering trust and accelerating progress.

Looking ahead, the key question is whether these findings can be replicated and scaled. A tenfold increase in the sample size, as the author suggests, would significantly bolster the confidence in these initial results. Further exploration of different verification techniques – particularly constrained decoding – and a more controlled approach to prompt engineering are also essential. Ultimately, this experiment provides a valuable, albeit early, glimpse into a future where AI systems are not just powerful, but also demonstrably reliable, particularly in contexts where accuracy is paramount. The ability to effectively harness verifiability as a lever for improving LLM performance represents a significant step toward realizing that vision, promising more accessible and trustworthy AI applications across a wider range of industries.

Full disclosure: this is directional, not a paper. n=120 tasks, one internal evaluator, not peer reviewed. I work at an LLM infrastructure company. This experiment was done on my own time and is not a company claim.

Karpathy's framework classifies tasks by verifiability. Can output be mechanically checked? High verifiability tasks like code compilation and structured JSON extraction are safer because the verifier catches errors. Low verifiability tasks like creative writing are riskier.

I wondered if high verifiability tasks are also easier in practice. Can a weaker model do them as well as a frontier model if the verifier catches mistakes?

Setup was 120 tasks across four categories. Code unit tests, structured extraction, multi hop reasoning, creative summarization. Three models: Claude Sonnet 4.6, GPT 5.5, local Mistral 3 8B via vLLM 0.6.3. Pass rate for the first two, human rating 1 to 5 for the last two.

Results were messy.

Code unit tests: Sonnet 4.6 94%, GPT 5.5 91%, Mistral 3 8B 87%. With one retry Mistral 3 hit 95%. That surprised me. I expected the gap to be bigger.
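The "one retry" setup can be sketched as a generate, verify, retry loop. This is a minimal illustration, not the harness used in the post: `generate_with_verifier`, `is_json_object`, and `make_flaky_model` are hypothetical names, and the toy model simply fails once before returning valid JSON.

```python
import json
from typing import Callable, Optional, Tuple


def generate_with_verifier(
    generate: Callable[[str], str],
    verify: Callable[[str], bool],
    prompt: str,
    max_retries: int = 1,
) -> Tuple[Optional[str], int]:
    """Return (first output that passes `verify`, attempts used).

    The output is None if every attempt fails verification.
    """
    attempts = 0
    for _ in range(max_retries + 1):
        attempts += 1
        output = generate(prompt)
        if verify(output):
            return output, attempts
    return None, attempts


def is_json_object(text: str) -> bool:
    """Toy verifier: accept only text that parses as a JSON object."""
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False


def make_flaky_model() -> Callable[[str], str]:
    """Hypothetical stand-in for a model: bad output first, valid JSON after."""
    calls = {"count": 0}

    def model(prompt: str) -> str:
        calls["count"] += 1
        return "Sure! Here is the JSON" if calls["count"] == 1 else '{"ok": true}'

    return model


if __name__ == "__main__":
    result, attempts = generate_with_verifier(
        make_flaky_model(), is_json_object, "extract fields", max_retries=1
    )
    print(result, attempts)
```

Whether the retry prompt should include the verifier's error message is a design choice the post does not specify; this sketch resends the same prompt.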

Structured extraction: Sonnet 4.6 97%, GPT 5.5 94%, Mistral 3 8B 89%. With retry 96%. Also closer than I expected.
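One way to read the retry numbers is to compare them with what a single retry would give if failures were independent across attempts. The sketch below does that arithmetic with the first-pass and retry rates quoted above; the independence assumption is mine, not something the post claims.

```python
def pass_rate_if_independent(first_pass: float, retries: int = 1) -> float:
    """Pass rate after `retries` extra attempts, assuming each attempt
    fails independently with probability (1 - first_pass)."""
    fail = 1.0 - first_pass
    return 1.0 - fail ** (retries + 1)


# (task, Mistral 3 8B first-pass rate, observed rate with one retry), from the post
REPORTED = [
    ("code unit tests", 0.87, 0.95),
    ("structured extraction", 0.89, 0.96),
]

for task, first_pass, observed in REPORTED:
    predicted = pass_rate_if_independent(first_pass, retries=1)
    print(f"{task}: independent-retry prediction {predicted:.3f}, observed {observed:.2f}")
```

Under independence the predictions are about 98.3% and 98.8%, above the observed 95% and 96%. That gap would be consistent with some tasks failing repeatedly rather than at random, although with a sample this small the difference could also be noise.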

But here is where it got weird. Sonnet 4.6 initially scored worse than GPT 5.5 on structured extraction, which made no sense. Turns out our JSON schema had an ambiguous nested array that confused Claude's tool use parser. Fixing the schema brought Sonnet to 98%, but I kept the original numbers in the table because the mistake is part of the story. Your verifier is only as good as your schema.
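To make "your verifier is only as good as your schema" concrete, here is a minimal sketch of a structural check plus a regex. The post does not show its schema, so the field names (`invoice_id`, `line_items`, `amount`) and the ID pattern are hypothetical, and the hand-written checks stand in for a real JSON Schema validator. The point is that the rule for the nested array is stated explicitly: each element must be an object, so an array of arrays is rejected with a specific message instead of being left open to interpretation.

```python
import json
import re
from typing import List

# Hypothetical ID format, used only for illustration.
INVOICE_ID = re.compile(r"INV-\d{6}")


def verify_extraction(raw: str) -> List[str]:
    """Return a list of problems; an empty list means the output passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top level must be an object"]

    problems: List[str] = []

    invoice_id = data.get("invoice_id")
    if not isinstance(invoice_id, str) or not INVOICE_ID.fullmatch(invoice_id):
        problems.append("invoice_id must match INV-<6 digits>")

    items = data.get("line_items")
    if not isinstance(items, list) or not items:
        problems.append("line_items must be a non-empty array")
    else:
        for i, item in enumerate(items):
            if not isinstance(item, dict):
                # Explicitly reject nested arrays and scalars.
                problems.append(
                    f"line_items[{i}] must be an object, got {type(item).__name__}"
                )
                continue
            amount = item.get("amount")
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(amount, bool) or not isinstance(amount, (int, float)):
                problems.append(f"line_items[{i}].amount must be a number")
    return problems


if __name__ == "__main__":
    good = '{"invoice_id": "INV-000123", "line_items": [{"amount": 12.5}]}'
    nested = '{"invoice_id": "INV-000123", "line_items": [[{"amount": 12.5}]]}'
    print(verify_extraction(good))
    print(verify_extraction(nested))
```

Returning a list of messages rather than a boolean makes it easier to notice when many failures share one cause, which is how a schema problem tends to show up.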

Multi hop reasoning: Sonnet 4.6 78%, GPT 5.5 71%, Mistral 3 8B 51%. Retry didn't help. The model would hallucinate reasoning paths consistently. This is where the capability gap was real.

Creative summarization: Sonnet 4.6 4.2 out of 5, GPT 5.5 3.9 out of 5, Mistral 3 8B 3.1 out of 5. Expected.

Interpretation: high verifiability tasks seem simpler in the sense that weaker model plus verifier can approach frontier performance. Low verifiability tasks show the expected gap.
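If this interpretation holds, a routing policy could look roughly like the sketch below. It is an assumption-laden illustration, not something tested in the post: the model labels are placeholders, the HIGH/LOW assignment per category is my reading of the results, and unknown categories fall back to the frontier model as the conservative default.

```python
from dataclasses import dataclass
from enum import Enum


class Verifiability(Enum):
    HIGH = "high"
    LOW = "low"


# Category names follow the post; the HIGH/LOW labels are this sketch's assumption.
CATEGORY_VERIFIABILITY = {
    "code_unit_tests": Verifiability.HIGH,
    "structured_extraction": Verifiability.HIGH,
    "multi_hop_reasoning": Verifiability.LOW,
    "creative_summarization": Verifiability.LOW,
}


@dataclass(frozen=True)
class Route:
    model: str          # placeholder label, not a real model identifier
    use_verifier: bool
    max_retries: int


def route(category: str) -> Route:
    """Pick a model tier from the task category's verifiability."""
    level = CATEGORY_VERIFIABILITY.get(category, Verifiability.LOW)
    if level is Verifiability.HIGH:
        # Cheap model first; the verifier decides whether to retry.
        return Route(model="small-local-model", use_verifier=True, max_retries=1)
    # No mechanical check available, so rely on the stronger model.
    return Route(model="frontier-model", use_verifier=False, max_retries=0)


if __name__ == "__main__":
    for name in CATEGORY_VERIFIABILITY:
        print(name, route(name))
```

A fuller version would probably escalate to the frontier model when the small model exhausts its retries, but that behavior is not covered by the experiment.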

Limitations: n=120 is tiny. Need 10x for confidence. Our verifier is just JSON Schema plus regexes. Constrained decoding might change the calculus entirely. I also didn't control for prompt length well. Any prompt over 8k tokens was excluded because Mistral 3 8B degrades near its limit, which probably skewed the sample.
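The "need 10x" point can be illustrated with a Wilson score interval for a pass rate. The post does not give the per-category split, so the sketch assumes roughly 30 tasks per category (120 across four) and compares that with 300; both sample sizes are assumptions for illustration.

```python
import math
from typing import Tuple


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def intervals_overlap(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """True if the two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


if __name__ == "__main__":
    # First-pass code unit test rates from the post: Mistral 3 8B 87%, Sonnet 4.6 94%.
    for n in (30, 300):  # assumed per-category sample sizes
        small = wilson_interval(0.87, n)
        frontier = wilson_interval(0.94, n)
        print(
            f"n={n}: 87% -> ({small[0]:.3f}, {small[1]:.3f}), "
            f"94% -> ({frontier[0]:.3f}, {frontier[1]:.3f}), "
            f"overlap={intervals_overlap(small, frontier)}"
        )
```

Under these assumptions the two intervals overlap heavily at n=30 and only just separate at n=300, which is in line with the estimate that about ten times more tasks are needed before the gaps can be trusted.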

submitted by /u/DragonfruitAlone4497