AI-generated CUDA kernels silently break training and inference [R]

Our take

Last month, NVIDIA unveiled SOL-ExecBench, a benchmark featuring 235 production CUDA kernels sourced from projects like DeepSeek and Qwen. However, when integrating several top-ranked AI-generated kernels into real-world training and inference workloads, unexpected failures emerged. Notably, a kernel designed for the fused embedding-gradient + RMSNorm backward pass caused loss divergence in a small transformer training loop, despite passing the benchmark verification.

The post marks a critical juncture for AI-generated CUDA kernels. SOL-ExecBench, with its 235 production CUDA kernels drawn from high-profile projects, reflects a clear push toward more efficient and better-performing AI models. As the post demonstrates, though, dropping top-ranked AI-generated submissions into production workloads can fail in ways the benchmark does not catch.

The crux of the issue is numerical precision. The fused embedding-gradient + RMSNorm backward pass is particularly revealing: the kernel passed the benchmark's verifier, yet it failed in a real training run because its embedding-gradient half accumulates in bf16 rather than fp32. That gap shows how hard it is to deploy AI-generated code safely, and it raises questions about what we assume when moving from benchmark verification to practical use. The root cause is also easy to miss while debugging, so researchers can end up questioning their dataset, their architecture, or their original hypothesis instead of the implementation.

These AI-generated kernels also point to a broader trend in machine learning. As organizations lean on AI to optimize their workflows, they need to scrutinize not just the performance metrics of new tools but also how reliably those tools behave across different input distributions and training configurations.

The implications of these findings are significant. They serve as a reminder that while innovation in AI and machine learning is critical, the path to seamless integration is often riddled with unexpected hurdles. It is vital for developers and researchers to maintain a healthy skepticism toward AI-generated solutions, ensuring robust testing in practical environments before widespread adoption. The article's findings illustrate that the promise of AI in enhancing productivity is contingent upon not just performance, but also consistency and reliability in real-world applications.

Looking ahead, it will be fascinating to see how the industry responds to these challenges. Will future iterations of AI-generated kernels incorporate lessons learned from these real-world failures? As we continue to push the boundaries of what AI can accomplish, the focus on user outcomes and practical reliability must remain at the forefront of our development efforts. The journey to harnessing AI's full potential is complex, but it's one that invites exploration and innovation at every turn.

Last month NVIDIA released SOL-ExecBench, a new benchmark of 235 production CUDA kernels lifted from DeepSeek, Qwen, Gemma, and Kimi. We took several top-ranked AI-generated submissions and tried using them in production workloads. Many of them broke, sometimes in surprising ways.

One of those kernels is the fused embedding-gradient + RMSNorm backward pass, which runs at the end of every transformer training step. We took the fastest submission on the benchmark for it and dropped it into the training loop of a small transformer. The kernel had passed the benchmark's verifier with room to spare. But in our training run, the loss diverged and never recovered.

We started debugging. Replace the dataset distribution with uniformly sampled tokens, and the divergence vanishes. Swap SGD for AdamW, and it also vanishes.

This is the worst kind of bug for research. Symptoms and masks both look exactly like "the idea didn't work". It's the type of bug that can make researchers spend a long time debugging without knowing what's at fault: the dataset? the research idea? the architecture? or the implementation itself?

Turns out, the actual bug is that the embedding-gradient half of the kernel accumulates in bf16 instead of fp32. Embedding backward sums many small gradient contributions into each token's row of the embedding matrix. With uniform random tokens the contributions spread evenly and bf16 precision is enough. In real text, a handful of token IDs end up with thousands of contributions: the small ones round to zero against the growing accumulator, and the high-frequency rows drift. AdamW's per-parameter normalization absorbs the resulting multiplicative bias, so under AdamW the same drift is invisible in the loss.
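The mechanism above can be reproduced in a few lines of pure Python. This is a minimal sketch under stated assumptions, not the kernel from the benchmark: bf16 is emulated in software with round-to-nearest-even, every gradient contribution is a constant 1e-3, a "cold" row receives 4 contributions and a "hot" row receives 4000, and the optimizer comparison uses plain Adam without weight decay. The function names and all of these numbers are hypothetical and chosen only for illustration.

```python
import math
import struct


def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round to the nearest bfloat16 value (round-to-nearest-even).

    Emulated by keeping the top 16 bits of the float32 encoding. NaN, inf,
    and values near the float32 maximum are not handled in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias, ties go to even
    bits &= 0xFFFF0000                   # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def accumulate_row(n_contribs: int, grad: float, acc_round) -> float:
    """Sum n_contribs copies of grad, rounding the accumulator after each add."""
    g = to_bf16(grad)  # assumption: incoming gradients are already bf16
    acc = 0.0
    for _ in range(n_contribs):
        acc = acc_round(acc + g)
    return acc


def sgd_total_update(grads, lr=1e-3):
    """Total parameter movement under plain SGD: proportional to the gradient."""
    return sum(lr * g for g in grads)


def adam_total_update(grads, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Total parameter movement under Adam (weight decay omitted)."""
    m = v = total = 0.0
    for t, g in enumerate(grads, start=1):
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)  # bias-corrected first moment
        v_hat = v / (1 - b2 ** t)  # bias-corrected second moment
        total += lr * m_hat / (math.sqrt(v_hat) + eps)
    return total


# Cold row (few contributions): bf16 and fp32 accumulators agree closely.
cold_bf16 = accumulate_row(4, 1e-3, to_bf16)
cold_f32 = accumulate_row(4, 1e-3, to_f32)

# Hot row (many contributions): the bf16 accumulator stalls at 0.5, where
# each contribution is smaller than half the spacing between bf16 values.
hot_bf16 = accumulate_row(4000, 1e-3, to_bf16)
hot_f32 = accumulate_row(4000, 1e-3, to_f32)

print("cold row:", cold_bf16, cold_f32)
print("hot row: ", hot_bf16, hot_f32)

# Feed the same (biased vs. correct) gradient for 10 steps.
print("SGD :", sgd_total_update([hot_bf16] * 10), sgd_total_update([hot_f32] * 10))
print("Adam:", adam_total_update([hot_bf16] * 10), adam_total_update([hot_f32] * 10))
```

In this toy setup the hot row's bf16 sum is roughly an eighth of the fp32 sum, so the SGD update shrinks by the same factor, while the Adam update is nearly unchanged because the gradient's scale cancels in the ratio m_hat / sqrt(v_hat). A real kernel sees varying gradients and a bias that changes over time, so this only illustrates the direction of the effect.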

The other broken submissions had different bug shapes (all interesting). More examples in our blogpost.

submitted by /u/laginimaineb