2 min read · from Machine Learning

Fearless Concurrency on the GPU: Safe GPU inference in Rust, competitive with vLLM/SGLang [R]

Our take

“Fearless Concurrency on the GPU,” a new paper on safe GPU inference in Rust, is now available on arXiv. To address the growing challenge of trusting AI-generated GPU code, cuTile Rust uses Rust’s ownership model to guarantee memory safety and data-race freedom by construction. Grout, the inference engine the authors built on it with Hugging Face, is competitive with vLLM and SGLang at batch-1 decode, reaching 171 tok/s for Qwen3-4B on an RTX 5090. For related model optimization techniques, see our recent article "How torch.compile() achieves massive speedups."

The recent paper "Fearless Concurrency on the GPU: Safe GPU inference in Rust" and the accompanying cuTile Rust project take a significant step toward addressing a growing concern in AI: trust. As AI-generated code becomes more prevalent, code is being produced faster than it can be confidently verified, especially in the complex and error-prone domain of GPU programming. Similar reliability questions come up around compilation-based optimization, as discussed in "How does torch.compile() achieve massive speedups despite highly optimized NumPy functions?". cuTile Rust's answer is to rely on Rust’s ownership and borrow checking to guarantee memory safety and data-race freedom *at compile time*, a level of assurance rarely seen in GPU kernel development. The integrity of computations matters just as much in data-intensive projects such as "Built a Global AQ (PM2.5) Forecaster ML Model".

The core innovation is the tile-based programming model, which extends Rust’s safety guarantees across the GPU launch boundary. Traditional GPU programming means working directly with low-level CUDA, which is error-prone and hard to automate. cuTile Rust abstracts away much of this complexity, letting developers (or automated code generation tools) write kernels with familiar single-threaded semantics. The kernels lower to CUDA Tile IR, and the compiler maps them to thread blocks while Rust's safety checks still apply. The reported performance, competitive with vLLM and SGLang at batch-1 decode on both consumer (RTX 5090) and data-center (B200) hardware, suggests that safety need not carry a significant performance cost. On a B200, the safe GEMM kernel is within 0.3% of a hand-written low-level version. This moves GPU development away from a reactive debugging cycle and toward preventing these errors before the code runs.

Grout, the Qwen3 inference engine built on cuTile Rust, is currently limited to batch-1 decoding, a small set of supported models, and NVIDIA GPUs, and many of its kernels still use the unsafe path. Even so, it serves as a proof of concept and a concrete target for future development. The project is building a library of safe kernels in the cutile-kernels crate and invites contributions of safe variants, a community-driven way to expand the ecosystem of trusted GPU code. The team is open about these limitations, which sets realistic expectations. Because existing kernels can be migrated to safe variants, the approach can be adopted within existing workflows rather than requiring a complete rewrite.

Looking ahead, cuTile Rust raises a key question: will compiler-verified GPU code become standard practice in AI development? Growing reliance on AI-generated code, together with increasingly complex GPU architectures, makes strong safety guarantees more urgent. Being able to deploy and scale AI models without worrying about hidden memory corruption or data races would remove a major source of failures. Tools like cuTile Rust shift the focus from *writing* GPU code to *trusting* it.

I maintain cuTile Rust and just posted the paper "Fearless Concurrency on the GPU."

As more GPU code gets AI-generated, the bottleneck moves from writing it to trusting it. cuTile Rust lets you write or generate GPU kernels whose memory safety and data-race freedom are verified by the compiler, through Rust's ownership and borrow checking. You get those guarantees by construction. It's a tile-based programming model that lowers to CUDA Tile IR, carrying Rust's ownership model across the launch boundary. You partition a mutable output into disjoint mutable sub-tensors, pass inputs as shared references, and write tile kernels with single-threaded semantics that the compiler maps to thread blocks.
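To make the ownership pattern concrete, here is a CPU-only analogy in plain Rust that uses only the standard library. It is not cuTile Rust's API: `add_tile`, `launch_add`, and the use of scoped threads as a stand-in for thread blocks are all assumptions made for illustration. The sketch only shows how splitting a mutable output into disjoint `&mut` tiles and passing inputs as shared `&` references lets the borrow checker rule out data races.

```rust
use std::thread;

/// Tile "kernel" with single-threaded semantics (hypothetical name).
/// It has exclusive access to one output tile and read-only access to inputs.
fn add_tile(out: &mut [f32], a: &[f32], b: &[f32]) {
    for ((o, x), y) in out.iter_mut().zip(a).zip(b) {
        *o = *x + *y;
    }
}

/// Hypothetical launcher: partitions `out` into disjoint mutable tiles and
/// runs one tile kernel per scoped thread (a CPU stand-in for a thread block).
fn launch_add(out: &mut [f32], a: &[f32], b: &[f32], tile: usize) {
    assert!(tile > 0);
    assert_eq!(out.len(), a.len());
    assert_eq!(out.len(), b.len());
    thread::scope(|s| {
        // `chunks_mut` yields non-overlapping `&mut` slices, so no two
        // threads can ever write to the same element.
        let tiles = out.chunks_mut(tile).zip(a.chunks(tile)).zip(b.chunks(tile));
        for ((o, x), y) in tiles {
            s.spawn(move || add_tile(o, x, y));
        }
    });
}

fn main() {
    let a: Vec<f32> = (0..10).map(|i| i as f32).collect();
    let b = vec![1.0_f32; 10];
    let mut out = vec![0.0_f32; 10];
    launch_add(&mut out, &a, &b, 4);
    println!("{:?}", out);
}
```

Handing the same output tile to two threads in this sketch would be rejected at compile time, because a `&mut` slice cannot be moved into two closures. That is the CPU-side version of the guarantee the post describes carrying across the launch boundary.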

End to end, we built Grout, a Qwen3 inference engine, on cuTile Rust with Hugging Face. At batch-1 decode it reaches 171 tok/s for Qwen3-4B on an RTX 5090 and 82 tok/s for Qwen3-32B on a B200, competitive with vLLM and SGLang. Batch-1 decode is memory-bandwidth-bound, and Grout's throughput is consistent with our HBM roofline analysis.
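For readers unfamiliar with roofline reasoning, a back-of-the-envelope version can be sketched as follows. This is not the paper's analysis. It assumes each decoded token streams all model weights from device memory exactly once, and it ignores KV-cache and activation traffic. The 2 bytes per parameter, the nominal parameter counts, and the peak bandwidths (about 1.79 TB/s for an RTX 5090, about 8 TB/s for a B200, taken from public spec sheets) are assumptions, not figures from the post.

```rust
/// Upper bound on batch-1 decode throughput in tokens/s, assuming every
/// generated token reads all weights from device memory exactly once.
fn roofline_tok_per_s(params: f64, bytes_per_param: f64, bandwidth_bytes_per_s: f64) -> f64 {
    bandwidth_bytes_per_s / (params * bytes_per_param)
}

fn main() {
    // ASSUMED inputs: nominal parameter counts, spec-sheet peak bandwidth.
    // Only the measured tok/s values come from the post.
    let cases = [
        ("Qwen3-4B on RTX 5090", 4.0e9, 1.792e12, 171.0),
        ("Qwen3-32B on B200", 32.0e9, 8.0e12, 82.0),
    ];
    for (name, params, bandwidth, measured) in cases {
        // ASSUMED precision: 2 bytes per parameter (f16/bf16 weights).
        let bound = roofline_tok_per_s(params, 2.0, bandwidth);
        println!(
            "{}: bound {:.0} tok/s, measured {:.0} tok/s ({:.0}% of bound)",
            name,
            bound,
            measured,
            100.0 * measured / bound
        );
    }
}
```

Under these assumptions the measured throughput sits below the bandwidth bound in both cases, which is what a memory-bandwidth-bound workload should show. The exact ratio depends on the real weight precision and on the traffic this estimate leaves out.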

Many of Grout's kernels still use the unsafe path today, but they can be migrated to safe variants, providing a verifiable target for generated kernels. We've started a collection of such kernels in the cutile-kernels crate in the repo. If this is your thing, contributing safe variants helps grow a library of safe, high-performance kernels that future kernel synthesis can draw from.

On the kernel side, the safety is effectively free. On a B200 the safe GEMM is within 0.3% of a hand-written low-level version (~92% of dense f16 peak), and element-wise hits ~7 TB/s, matching cuTile Python within measurement noise.

Some additional caveats worth noting: Grout is batch-1 with a small set of supported models (a research case study, not a drop-in server), it's NVIDIA-only (lowers to Tile IR), and GEMM still slightly trails cuBLAS at some sizes.

- Paper: https://arxiv.org/abs/2606.15991
- Code: https://github.com/nvlabs/cutile-rs
- Grout: https://github.com/huggingface/grout

Hope you enjoy the paper and learn something new! Happy to answer any questions :)

submitted by /u/Exciting_Suspect9088