
AutoBe benchmark: structured harness narrows frontier-vs-local gap in backend generation [D]

Our take

AutoBe benchmarks end-to-end backend generation: a single natural language request becomes six outputs, from requirements analysis through a type-safe SDK. Because each phase runs through structured function calling, the harness narrows the quality gap between frontier and local models. GLM 5 tops this run, and several local models achieve 100% compile success.

AutoBe is a benchmark for end-to-end backend generation. One natural language request produces six outputs: requirements analysis, ERD, OpenAPI spec, E2E tests, NestJS implementation, and a type-safe SDK. Each phase fills a predefined AST via structured function calling rather than generating unstructured code. The scoring rubric is 100 points, driven entirely by static analysis, so the same artifact scores the same regardless of who reruns it.
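
To make "fill a predefined AST via structured function calling" concrete, here is a minimal TypeScript sketch. The node shape, tool name, and validation rules are illustrative assumptions, not AutoBe's actual interfaces; the point is that the model populates a typed structure through a function call and a static check can then score the artifact deterministically.

```typescript
// Illustrative sketch (not AutoBe's real schema): a phase asks the model to
// fill one typed AST node via a function-call tool, then validates it
// statically before any code is emitted.

// A predefined AST fragment for one API operation the model must populate.
interface ApiOperationNode {
  method: "GET" | "POST" | "PUT" | "DELETE";
  path: string;                 // e.g. "/articles/{id}"
  summary: string;
  requestBodyType?: string;     // name of a DTO type defined elsewhere in the AST
  responseType: string;         // name of a DTO type defined elsewhere in the AST
}

// The tool definition handed to the model, instead of free-form code generation.
const defineApiOperationTool = {
  name: "define_api_operation",
  description: "Fill in one API operation node of the backend AST.",
  parameters: {
    type: "object",
    properties: {
      method: { type: "string", enum: ["GET", "POST", "PUT", "DELETE"] },
      path: { type: "string" },
      summary: { type: "string" },
      requestBodyType: { type: "string" },
      responseType: { type: "string" },
    },
    required: ["method", "path", "summary", "responseType"],
  },
};

// Static validation: the same artifact yields the same errors no matter who reruns it.
function validateOperation(node: ApiOperationNode, knownTypes: Set<string>): string[] {
  const errors: string[] = [];
  if (!node.path.startsWith("/")) errors.push(`path must start with "/": ${node.path}`);
  if (!knownTypes.has(node.responseType)) errors.push(`unknown response type: ${node.responseType}`);
  if (node.requestBodyType && !knownTypes.has(node.requestBodyType)) {
    errors.push(`unknown request body type: ${node.requestBodyType}`);
  }
  return errors;
}
```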

The headline finding is that scores cluster tightly. GLM 5 tops the benchmark run. qwen3.5-27b sits directly behind frontier models. Several local models produced enterprise-scale backends with 100% compile success. The author's interpretation: once the harness is structured, backend-generation quality is constrained more by harness design than by model prestige.

The cost contrast is significant. A full benchmark run at frontier pricing ($5/M input tokens) costs $1,000-$1,500 per model. The next benchmark round plans to filter to models priced at $0.25/M input or runnable on a 64GB unified-memory laptop, a cut that would still include most of the models that clustered near the top.
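
As a rough sanity check on those numbers (the token volume below is inferred from the quoted prices, not a figure the author reports):

```typescript
// $1,000-$1,500 at $5 per million input tokens implies roughly 200-300M input
// tokens per full run. The same volume at the proposed $0.25/M cutoff:
const impliedTokensM = [1000 / 5, 1500 / 5];              // [200, 300] million tokens
const costAtCutoff = impliedTokensM.map((m) => m * 0.25); // [$50, $75] per model
console.log(impliedTokensM, costAtCutoff);
```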

The honest caveat from the author: this uses four reference projects and may favor models that comply well with procedural function-calling instructions. How well these results generalize beyond well-structured benchmark fixtures is still an open question.

Does your experience with structured function-calling in production tasks align with benchmark findings like these?

submitted by /u/jimmytoan


Tagged with

#AutoBe#backend generation#structured function calling#requirements analysis#ERD#OpenAPI spec#E2E tests