The Structured Output Benchmark (SOB) - validates both JSON parse and value accuracy [R]

Our take

Introducing the Structured Output Benchmark (SOB), a comprehensive tool designed to validate both JSON parsing and value accuracy. Unlike existing benchmarks that primarily assess JSON schema and type pass rates, SOB addresses the critical issue of inaccurate JSON values, such as hallucinated totals or misordered arrays. By measuring seven key metrics—including value accuracy, path recall, and faithfulness—SOB sets a new standard for evaluating structured outputs.

Current structured output benchmarks only validate the pass rate for JSON schema and types. In practice, the more common problem is inaccurate JSON values.

For example, a hallucinated `total_price` number when extracting values from an invoice, or an array ordered incorrectly because of inaccurate date mapping.
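To make the failure mode concrete, here is a minimal sketch of an output that parses and passes a type check but still carries a wrong value. The invoice values, the `expected_types` mapping, and the `passes_type_check` helper are all made up for illustration and are not taken from the benchmark's code or dataset.

```python
import json

# Hypothetical ground truth for an invoice (values invented for illustration).
ground_truth = {"invoice_id": "INV-001", "total_price": 118.50}

# Hypothetical model output: valid JSON, right keys, right types, wrong total.
model_output = '{"invoice_id": "INV-001", "total_price": 181.50}'

# Minimal stand-in for a schema: the expected Python type for each key.
expected_types = {"invoice_id": str, "total_price": float}


def passes_type_check(obj: dict, types: dict) -> bool:
    """True if obj has exactly the expected keys and each value has the expected type."""
    return obj.keys() == types.keys() and all(
        isinstance(obj[key], expected) for key, expected in types.items()
    )


parsed = json.loads(model_output)  # parsing succeeds
structurally_ok = passes_type_check(parsed, expected_types)
values_ok = parsed == ground_truth

print(structurally_ok)  # True: a structure-only check is satisfied
print(values_ok)        # False: total_price does not match the ground truth
```

A check that stops at parsing and types would count this response as a pass, even though the extracted total is wrong.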

The Structured Output Benchmark measures 7 key metrics rather than only JSON schema pass rate:

  • Value Accuracy (primary): exact leaf-value match against verified ground truth
  • JSON Pass Rate, Type Safety, Path Recall, Structure Coverage (structural)
  • Faithfulness: are values grounded in context or hallucinated?
  • Perfect Response: every single leaf value correct
  • Modalities: text, image and audio
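As a rough illustration of how leaf-level metrics like the ones above could be computed, the sketch below flattens both the prediction and the ground truth into path-to-leaf maps and compares them. This is an assumption about one reasonable way to define the metrics, not the benchmark's actual implementation. The `flatten` and `score` helpers, the path syntax, and the sample data are hypothetical.

```python
from typing import Any


def flatten(obj: Any, prefix: str = "") -> dict[str, Any]:
    """Map every leaf value to a path such as 'items[0].date'."""
    leaves: dict[str, Any] = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            child = f"{prefix}.{key}" if prefix else str(key)
            leaves.update(flatten(value, child))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            leaves.update(flatten(value, f"{prefix}[{index}]"))
    else:
        leaves[prefix] = obj
    return leaves


def score(predicted: Any, truth: Any) -> dict[str, float]:
    """Assumed definitions: each score is a fraction of ground-truth leaves."""
    pred_leaves = flatten(predicted)
    true_leaves = flatten(truth)
    total = len(true_leaves)
    if total == 0:
        raise ValueError("ground truth has no leaf values")
    # Paths from the ground truth that also exist in the prediction.
    found = sum(1 for path in true_leaves if path in pred_leaves)
    # Paths whose predicted value exactly equals the ground-truth value.
    correct = sum(
        1
        for path, value in true_leaves.items()
        if path in pred_leaves and pred_leaves[path] == value
    )
    return {
        "value_accuracy": correct / total,
        "path_recall": found / total,
        "perfect_response": float(correct == total),
    }


# Made-up example: one wrong total and one missing date.
truth = {
    "total_price": 30.0,
    "items": [
        {"name": "pen", "date": "2024-01-02"},
        {"name": "ink", "date": "2024-03-05"},
    ],
}
predicted = {
    "total_price": 35.0,
    "items": [{"name": "pen", "date": "2024-01-02"}, {"name": "ink"}],
}

result = score(predicted, truth)
print(result)
# {'value_accuracy': 0.6, 'path_recall': 0.8, 'perfect_response': 0.0}
```

Because array indices are part of each path, a misordered array shows up as value mismatches in this sketch, which matches the ordering problem described earlier in the post.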

Overall results

Open source is doing pretty well, with GLM 4.7 coming in at number 2, right below GPT 5.4.

JSON-pass vs Value-Accuracy gap

What's interesting here is that while most models hit 90%+ on JSON schema pass, all of them drop significantly on value accuracy.

Overall best by modality

Full breakdown blog: https://interfaze.ai/blog/introducing-structured-output-benchmark
Full leaderboard: https://interfaze.ai/leaderboards/structured-output-benchmark
Paper: https://interfaze.ai/sob_paper.pdf (Pending arXiv)

The full breakdown goes deeper into the different modalities, how we designed the dataset, and how we performed the benchmark. All code and the dataset are open source 😄

Our goal is to be the best general model for deterministic tasks, and a key aspect of determinism is a controllable and consistent output structure. The first step to making structured output better is to measure it and to hold ourselves and the industry to the highest standard.

submitted by /u/404llm