June 17, 2026•1 min read•from Machine Learning

Cleo: trying to fit full analyst behavior in a 2B model [P]

Our take

Cleo is a 2B-parameter finetune of Qwen3.5-2B-Base that aims to deliver analyst-like behavior in a compact model. Many industrial “chatbots” rely on simpler text-to-SQL models; Cleo instead trains, evaluates, and runs inference in a single structured harness. That unified setup lets it search over candidate queries using live execution evidence and co-design components such as the SQL safety layer and dialect handling. The model, harness, and datasets are all open-source and available on GitHub and Hugging Face. For readers exploring reinforcement learning on a limited budget, the author also recommends ECHO, described in a recent paper.

Cleo, a 2B-parameter model aimed at analyst-like behavior, is an interesting data point for AI-powered data interaction. The author's quip that half of industrial chatbots are text-to-SQL models in disguise (and the other half RAG) is tongue-in-cheek, but it is plausibly close to the truth. The project highlights a critical point: performance isn't solely about model size. It’s about how well the model is integrated into a well-designed system. This echoes the sentiment in “Open weights are not enough: we need open training frameworks for research and better algorithms”: accessible weights are only one piece of the puzzle, and the surrounding infrastructure and training methodology matter just as much. Training, evaluating, and running inference within the *same* structured harness, as Cleo does, means the model is optimized against the exact contract it will see in production, which allows tighter optimization and a more cohesive overall system.

What’s particularly noteworthy is Cleo’s approach to query handling. Existing systems often rely on model likelihood alone to select queries, whereas Cleo brings live execution evidence into the decision. The post reports no benchmark numbers, but in principle this should improve accuracy and reliability, since a query that fails or returns nothing can be ruled out however likely the model considers it. Furthermore, co-designing the model contract, SQL safety layer, and clarification behavior as a single system, something that is hard to achieve with disparate components, reflects a level of system thinking often lacking in AI development. Because the harness and datasets are open-source along with the model, others can build on this foundation. This aligns with the broader trend toward open-source AI, as highlighted in “Source code for LLMs”, which underscores the value of accessible codebases for collaborative research. Cleo is a practical attempt to show that sophisticated data analysis doesn't necessarily require massive, computationally expensive models when the model is paired with a thoughtful, unified architecture.
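To make the idea of execution evidence concrete, here is a minimal Python sketch of execution-guided query selection. It is not Cleo's code: the function names, the log-probability values, and the selection rule (drop candidates that error or return no rows, prefer the result that most candidates agree on, break ties by model likelihood) are all assumptions chosen for illustration, and SQLite stands in for whatever database the real harness targets.

```python
import sqlite3
from collections import Counter


def run_candidate(conn, sql):
    """Execute one candidate query; return (ok, rows_or_error_message)."""
    try:
        return True, conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        return False, str(exc)


def select_query(conn, candidates):
    """Pick a query from (sql, logprob) pairs using execution evidence.

    Hypothetical rule, not Cleo's: discard candidates that fail or return
    no rows, group the rest by result set, prefer the largest agreeing
    group, and break ties by model log-probability.
    """
    survivors = []
    for sql, logprob in candidates:
        ok, result = run_candidate(conn, sql)
        if ok and result:
            # Order-insensitive fingerprint of the returned rows.
            key = tuple(sorted(map(repr, result)))
            survivors.append((sql, logprob, key))
    if not survivors:
        return None
    votes = Counter(key for _, _, key in survivors)
    best = max(survivors, key=lambda s: (votes[s[2]], s[1]))
    return best[0]


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "EU", 10.0), (2, "EU", 5.0), (3, "US", 7.5)],
    )
    candidates = [
        # Highest likelihood, but references a column that does not exist.
        ("SELECT SUM(amt) FROM orders WHERE region = 'EU'", -0.1),
        ("SELECT SUM(amount) FROM orders WHERE region = 'EU'", -0.7),
        ("SELECT TOTAL(amount) FROM orders WHERE region = 'EU'", -0.9),
    ]
    return select_query(conn, candidates)
```

In this toy case the most likely candidate is rejected because it fails to execute, and the two remaining candidates agree on the result, so the more likely of the two is returned. Likelihood-only selection would have picked the broken query.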

The implications of Cleo extend beyond cost savings, though resource efficiency is certainly a key benefit. Achieving robust performance with a smaller model opens the door to deployment in resource-constrained environments, such as edge devices or systems with limited computational power. That kind of accessibility matters for broader adoption. The challenges faced by those working with embedded ML, as discussed in “Embedded/edge ML folks: what actually eats the most time, getting data, or cleaning/labeling it (time series sensor data, not computer vision/audio)?”, often revolve around data quality and preparation. Cleo targets SQL analytics rather than sensor data, but its habit of folding system-level concerns into a single design may suggest a way to ease some of those bottlenecks. By focusing on a unified system, it shifts the emphasis from chasing ever-larger models to optimizing the entire data processing pipeline.

Ultimately, Cleo’s success hinges on its ability to prove its utility in real-world applications. While the technical details are impressive, the true test will be its impact on data analyst productivity and decision-making. The project illuminates a crucial direction for the field: moving beyond simply scaling model size and instead focusing on the intelligent integration of AI within practical, structured workflows. As we continue to grapple with the complexities of data management, will we see more projects prioritizing system-level optimization and architectural coherence over brute-force model scaling, and what new approaches to unified training and inference will emerge as a result?

Hello all!

Half of all industrial "chatbots" are just text-to-SQL models in a trenchcoat (and the other half RAG!). I wanted to explore just how small you could make these models if you trained, evaluated, and ran inference in the exact same structured harness, leading to Cleo: a Qwen3.5-2B-Base finetune.

Currently, some features of Cleo that are only possible/useful in a unified harness are:

Training on the exact same gather, repair, and answer contract it uses at inference time
Searching over candidate queries with live execution evidence, not just model likelihood
Co-designing the model contract, SQL safety layer, dialect handling, timeouts, and clarification behavior as one system
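As a rough illustration of how a safety layer, a timeout, and a repair step can live in one loop, here is a small Python sketch against SQLite. Everything in it is hypothetical and not taken from the Cleo repo: the names `is_safe`, `execute_guarded`, and `answer_with_repair`, the keyword denylist, and the stub `propose` callable that stands in for the model.

```python
import re
import sqlite3
import time

# Deliberately over-strict denylist; a real layer would likely parse the SQL.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|attach|detach|pragma|replace|vacuum)\b",
    re.IGNORECASE,
)


def is_safe(sql):
    """Allow a single read-only statement that starts with SELECT or WITH."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped or FORBIDDEN.search(stripped):
        return False
    return stripped.lower().startswith(("select", "with"))


def execute_guarded(conn, sql, timeout_s=2.0):
    """Run sql only if it passes is_safe; interrupt it after timeout_s seconds."""
    if not is_safe(sql):
        return {"ok": False, "error": "rejected by safety layer"}
    deadline = time.monotonic() + timeout_s
    # SQLite calls the handler every 1000 VM instructions; a nonzero
    # return value aborts the running query.
    conn.set_progress_handler(lambda: 1 if time.monotonic() > deadline else 0, 1000)
    try:
        return {"ok": True, "rows": conn.execute(sql).fetchall()}
    except sqlite3.Error as exc:
        return {"ok": False, "error": str(exc)}
    finally:
        conn.set_progress_handler(None, 1)  # remove the handler again


def answer_with_repair(conn, propose, question, max_attempts=3):
    """Ask propose(question, feedback) for SQL, feeding errors back as feedback."""
    feedback = None
    for _ in range(max_attempts):
        sql = propose(question, feedback)
        result = execute_guarded(conn, sql)
        if result["ok"]:
            return sql, result["rows"]
        feedback = result["error"]
    return None, None


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])

    # Stub model: the first attempt uses a bad column, the retry fixes it.
    def propose(question, feedback):
        return "SELECT SUM(y) FROM t" if feedback is None else "SELECT SUM(x) FROM t"

    return answer_with_repair(conn, propose, "What is the total of x?")
```

The point of the sketch is that the model only ever sees one contract: propose SQL, receive either rows or an error string, try again. If training rollouts go through the same `execute_guarded` path as production, the model learns to repair exactly the errors it will meet at inference time.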

Everything is completely open-source, including the harness, model, and datasets.

GitHub: https://github.com/Dreeseaw/cleo

Hugging Face model: https://huggingface.co/dreeseaw/cleo

PS: If you're also resource-constrained and trying to do RL like me, I would highly recommend experimenting with ECHO: https://arxiv.org/abs/2605.24517

submitted by /u/Dreeseaw

