May 18, 2026•3 min read•from Machine Learning

Rewriting model inference with CUDA kernels: the bottleneck was not just GEMM [P]

Our take

The author describes a CUDA-first inference runtime for small-batch, real-time machine learning workloads that rewrites the model inference path directly in C++/CUDA kernels instead of relying on a conventional graph runtime. The post argues that latency bottlenecks extend beyond any single GEMM operation. The Motus world-model case, at about 1.3 s end-to-end on the baseline path with a ~100 ms target, illustrates how much of the cost sits in runtime overhead around the math. For related work, see our recent article on "Witchcraft," a fast local semantic search built on top of SQLite.

In the ever-evolving landscape of machine learning and artificial intelligence, the quest for optimization remains a central theme for developers and researchers. The recent article on rewriting model inference with CUDA kernels underscores a critical insight: the bottleneck in small-batch runtime performance isn't merely about slow General Matrix Multiply (GEMM) operations, but rather about the inefficiencies in the surrounding infrastructure. This perspective resonates with our ongoing discussions about innovative solutions to common challenges in AI, similar to the insights shared in articles like Released a free 9.8M doc Indic multilingual corpus — Hindi, Bengali, Tamil, Telugu + 7 more (CC0, HuggingFace) and Witchcraft, fast local semantic search on top of SQLite — both of which emphasize the importance of rethinking traditional methods to unlock greater efficiency and capability.

The author’s approach of using C++/CUDA to rewrite the model inference path directly is particularly noteworthy. This method reflects a growing trend among developers to move beyond generic frameworks like PyTorch or TensorRT, which may not be optimized for specific use cases, particularly in real-time machine learning scenarios where batch sizes are typically one. The findings highlight that latency is not just a factor of mathematical computation, but also of the fragmented small kernels, layout transitions, and the overhead from quantization and dequantization. This is a crucial point for teams working on AI applications in robotics, autonomous systems, and advanced machine learning tasks, where every millisecond counts.

Moreover, the finding that lower precision does not automatically translate into performance gains challenges a common assumption. The post reports that FP8 has been consistently useful while FP4 results are more mixed, which encourages developers to measure end-to-end latency before committing to a precision strategy. This aligns with our broader narrative about the need for a deeper understanding of technology as we push towards more sophisticated applications in AI. As discussed in the "No new paper under review in TMLR since May 09?" article, the demand for innovation in AI frameworks is palpable, and this exploration into CUDA-based inference may signal a shift in how we approach model optimization.

The implications of this work extend beyond immediate performance enhancements. As more developers adopt similar strategies, we could witness a shift in the tools and methodologies employed across the industry. Rethinking the inference pipeline could lead to more tailored solutions that enhance productivity and responsiveness, especially in real-time applications. The author’s challenge to reconsider when to utilize generic compilers versus custom optimizations is a crucial dialogue that could shape future research and development paths.

As we look ahead, it is essential to consider how these insights will influence the broader AI ecosystem. Will we see a shift towards more customized, performance-oriented approaches to machine learning model deployment? The ongoing exploration of CUDA and similar technologies may very well pave the way for transformative changes in the efficiency and effectiveness of AI applications. This is a space worth watching closely, as the potential for innovation continues to unfold.

I’ve been working on a CUDA-first inference runtime for small-batch / realtime ML workloads.

The core idea is simple: instead of treating PyTorch / TensorRT / generic graph runtimes as the main execution path, I rewrite the model inference path directly with C++/CUDA kernels.

This started from robotics / VLA workloads, but the problem is more general.

In small-batch inference, the bottleneck is often not just a single slow GEMM. A lot of latency comes from the runtime glue around the math:

fragmented small kernels
norm / residual / activation boundaries
quantize / dequantize overhead
layout transitions
Python / runtime scheduling
graph compiler fusion failures
precision conversion around FP8 / FP4 regions

For cloud LLM serving, batching can hide a lot of this.

For robotics, VLA, world models, and other realtime workloads, batch size is usually 1. There is nowhere to hide. Every launch, sync, and format boundary shows up directly in latency.
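
A toy cost model makes the batching argument concrete. The sketch below assumes a fixed cost per kernel launch and compute that grows linearly with batch size. The launch count, per-launch overhead, and compute time are illustrative placeholders, not FlashRT measurements.

```python
def batch_latency_ms(num_launches, launch_overhead_us, compute_ms_per_sample, batch_size=1):
    """Toy model: launch overhead is paid once per batch, compute once per sample."""
    overhead_ms = num_launches * launch_overhead_us / 1000.0
    return overhead_ms + compute_ms_per_sample * batch_size


def overhead_share(num_launches, launch_overhead_us, compute_ms_per_sample, batch_size=1):
    """Fraction of the batch latency spent on launch overhead rather than math."""
    overhead_ms = num_launches * launch_overhead_us / 1000.0
    total_ms = batch_latency_ms(num_launches, launch_overhead_us,
                                compute_ms_per_sample, batch_size)
    return overhead_ms / total_ms


# Hypothetical numbers, chosen only to show the shape of the problem.
LAUNCHES, OVERHEAD_US, COMPUTE_MS = 1500, 5.0, 10.0

for batch in (1, 8, 64):
    share = overhead_share(LAUNCHES, OVERHEAD_US, COMPUTE_MS, batch)
    print(f"batch={batch:3d}  overhead share={share:.1%}")

# At batch size 1 the only lever left is the overhead term itself,
# for example by fusing kernels so that fewer launches are needed.
fused_ms = batch_latency_ms(300, OVERHEAD_US, COMPUTE_MS)
print(f"batch=1 with 300 launches instead of {LAUNCHES}: {fused_ms:.1f} ms")
```

In this model the overhead share shrinks as the batch grows, which is the sense in which batching "hides" the glue. Real launches overlap with execution and are not strictly additive, so treat this as intuition rather than a predictor.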

Some current results from my implementation:

Model / workload	Hardware	FlashRT latency / throughput
Pi0.5	Jetson Thor	~44 ms
Pi0	Jetson Thor	~46 ms
GROOT N1.6	Jetson Thor	~41–45 ms
Pi0.5	RTX 5090	~17.6 ms
GROOT N1.6	RTX 5090	~12.5–13.1 ms
Pi0-FAST	RTX 5090	~2.39 ms/token
Qwen3.6 27B	RTX 5090	~129 tok/s with NVFP4
Motus / Wan-style world model	RTX 5090	~1.3s baseline → targeting ~100ms E2E
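
The table mixes latency (ms), per-token latency (ms/token), and throughput (tok/s). A small conversion helper, sketched here with hypothetical function names, puts the rows on a common footing. The "inferences per second" figure assumes back-to-back inference with nothing else in the loop, which a real control stack will not achieve exactly.

```python
def per_second(milliseconds_per_item):
    """Convert a per-item latency in milliseconds to items per second."""
    return 1000.0 / milliseconds_per_item


def ms_per_item(items_per_second):
    """Convert a throughput in items per second to milliseconds per item."""
    return 1000.0 / items_per_second


# Figures taken from the table above.
print(f"Pi0-FAST, 2.39 ms/token      ~ {per_second(2.39):.0f} tok/s")
print(f"Qwen3.6 27B, 129 tok/s       ~ {ms_per_item(129):.2f} ms/token")
print(f"Pi0.5 on Jetson Thor, 44 ms  ~ {per_second(44):.1f} inferences/s")
```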

The Motus / world-model case is especially interesting.

The baseline path is around 1.3s end-to-end. The target is ~100ms E2E, but the hard part is not simply “use a faster GEMM”. The bottlenecks are VAE, joint attention, launch fragmentation, and a large amount of glue around the actual math.
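
Amdahl's law gives a quick sanity check on why a faster GEMM alone cannot close this gap. Going from ~1.3 s to ~100 ms is a 13× end-to-end speedup. The sketch below computes how large a share of baseline time a single component would need to hold for that to be reachable by accelerating only that component. The 50% GEMM share in the example is a hypothetical value, not a profile of Motus.

```python
def amdahl_speedup(fraction, local_speedup):
    """End-to-end speedup when `fraction` of the time gets `local_speedup` faster."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


def min_fraction_for(target_speedup):
    """Smallest time share a component must hold for an infinitely fast
    replacement of that component alone to reach `target_speedup`."""
    return 1.0 - 1.0 / target_speedup


target = 1300.0 / 100.0  # ~1.3 s baseline to ~100 ms target
print(f"required end-to-end speedup: {target:.0f}x")
print(f"share one component must hold to get there alone: {min_fraction_for(target):.1%}")

# Hypothetical profile: GEMM is half of the baseline and becomes 4x faster.
print(f"GEMM at 50% of time, 4x faster: {amdahl_speedup(0.5, 4.0):.2f}x end to end")
```

Unless GEMM accounts for more than roughly 92% of the baseline, no GEMM speedup reaches 13×, which is consistent with the post's focus on VAE, joint attention, and launch fragmentation.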

One lesson from this work: lower precision is not automatically a win.

FP8 has been consistently useful. FP4 / NVFP4 is more mixed. It can help memory footprint and some large GEMM regions, but if the FP4 region is small, discontinuous, or surrounded by conversion / scaling overhead, the end-to-end speedup can be tiny.

For example, in some VLA / world-model paths, FP4 over FP8 only gives a few percent latency improvement unless the region is large and deeply fused.
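
A back-of-the-envelope estimate shows how this happens. The sketch below expresses everything as a fraction of the FP8 end-to-end latency: the share of time spent in the region moved to FP4, the speedup of that region, and the extra conversion / scaling cost at its boundaries. All three inputs in the examples are hypothetical values picked for illustration.

```python
def relative_latency(region_fraction, region_speedup, conversion_fraction):
    """Latency of the FP4 path relative to the FP8 baseline (1.0 means no change).

    region_fraction:     share of FP8 latency spent in the region moved to FP4
    region_speedup:      how much faster that region runs in FP4
    conversion_fraction: added conversion / scaling cost, as a share of FP8 latency
    """
    return (1.0 - region_fraction) + region_fraction / region_speedup + conversion_fraction


def break_even_conversion(region_fraction, region_speedup):
    """Conversion cost (share of FP8 latency) at which FP4 stops being a win."""
    return region_fraction * (1.0 - 1.0 / region_speedup)


# Hypothetical case 1: small, isolated FP4 region with noticeable conversion cost.
small = relative_latency(0.20, 1.6, 0.03)
# Hypothetical case 2: large, deeply fused FP4 region with little conversion cost.
large = relative_latency(0.70, 1.6, 0.01)

print(f"small region: {1.0 - small:.1%} lower latency")
print(f"large region: {1.0 - large:.1%} lower latency")
print(f"small region breaks even at {break_even_conversion(0.20, 1.6):.1%} conversion cost")
```

The same per-region speedup gives a few percent in the first case and a much larger gain in the second, because the region's share of total time and the boundary cost dominate the outcome.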

This changed how I think about inference optimization.

For large-batch cloud serving, generic runtimes and batching are often enough.

For realtime small-batch inference, the runtime overhead becomes the workload.

Curious if others have seen similar behavior with torch.compile, TensorRT, XLA, Triton, or custom CUDA kernels.

At what point do you stop trying to make a generic compiler optimize the model, and just rewrite the inference path directly?

Implementation: https://github.com/LiangSu8899/FlashRT

submitted by /u/Diligent-End-2711

A hackable compiler to generate efficient fused GPU kernels for AI models [P]

The modern ML (LLM) compiler stack is brutal. TVM is 500K+ lines of C++. PyTorch piles Dynamo, Inductor, and Triton on top of each other. I built a hackable LLM compiler from scratch and am documenting the process. It takes a small model (TinyLlama, Qwen2.5-7B) and lowers it to a sequence of CUDA kernels through six IRs.

Currently, on RTX 5090, the emitted FP32 kernels run at geomean 1.11× vs PyTorch eager and 1.20× vs torch.compile, with full-block parity on TinyLlama-128 and Qwen2.5-7B at seq=128. Wins on small reductions / SDPA / kv-projections (up to 4.7×); losses on dense matmul at seq=512.

Part 1 took an RMSNorm layer end-to-end and walked the upper half of that pipeline in detail. This second part closes the gap and explains Tile IR, Kernel IR, and associated lowering rules in depth.

Full article: A Principled ML Compiler Stack in 5,000 Lines of Python

The article focuses on producing a GPU schedule for an operation written in loop-nest form (Loop IR). Example for RMSNorm:

```python
v0 = reciprocal(2048)
for a0 in 0..32:  # free
    for a1 in 0..2048:  # reduce
        in2 = load x[0, a0, a1]
        v1 = multiply(in2, in2)
        acc0 <- add(acc0, v1)
    v2 = multiply(acc0, v0)
    v3 = add(v2, 1e-06)
    v4 = rsqrt(v3)
    for a2 in 0..2048:  # free
        in3 = load x[0, a0, a2]
        in4 = load p_weight[a2]
        v5 = multiply(in3, v4)
        v6 = multiply(v5, in4)
        merged_n0[0, a0, a2] = v6
```

The stack mimics a sequence of optimization steps a CUDA engineer would perform when optimizing kernels: stage inputs to smem, reduce bank conflicts, increase occupancy, and so on.

```diff
LoopOp
   │
   ▼
[001] tileify — lift outer free Loops to thread axes
[002] chunk_matmul_k — chunk the K reduce into K-outer × K-inner (intra-CTA)
[003] split_matmul_k — promote the K-outer chunk loop into a grid dimension
[004] cooperative_reduce — let multiple threads share one reduce; tree-merge with Combine
[005] blockify_launch — pick block extents; partition free axes into BLOCK and THREAD
[006] chunk_reduce — chunk non-matmul reduces so their Loads fit in shared memory
[007] stage_inputs — hoist hot input slabs into Stage nodes
[008] register_tile — replicate the inner tile so each thread owns a register block
[009] permute_register_tile — reorder the register strip so bank-conflicting loads land on far columns
[010] double_buffer — promote K-outer Stages to BufferedStage (ping-pong)
[011] tma_copy — narrow eligible BufferedStages to TmaBufferedStage (sm_90+)
[012] split_inner_for_swizzle — split the inner cache axis of a TmaBufferedStage for swizzle
[013] async_copy — narrow the rest to AsyncBufferedStage (cp.async, sm_80+)
[014] pad_smem — pad shared-memory strides to break bank conflicts
[015] pipeline_k_outer — rotate the K-outer loop into prologue/steady-state/epilogue (cp.async + TMA)
[016] mark_unroll — annotate small inner loops for #pragma unroll
   │
   ▼
TileOp (fully scheduled)
```

Each stage can be reproduced with a CLI command. For example, the stage_inputs pass stages input buffers into smem if possible and if there is a benefit in doing that (inputs are being read multiple times within CTA). To see it, the following command can be used:

```bash
deplodock compile \
  -c "torch.nn.RMSNorm(2048)(torch.randn(1,32,2048))" \
  --ir tile -vv \
  | awk '/^>>> t:007/,/^<<< t:007/'
```

```diff
t:007_stage_inputs
@@ matched at rms_norm (in-place) @@
@@ -2,6 +2,7 @@
 v0 = reciprocal(2048)
 Tile(axes=(a0:256=THREAD, a1:32=BLOCK)):
+  x_smem = Stage(x, origin=(0, a1, 0), slab=(a2:2048@2))
   StridedLoop(a2 = a0; < 2048; += 256):  # reduce
-    in2 = load x[0, a1, a2]
+    in2 = load x_smem[a2]
     v1 = multiply(in2, in2)
     acc0 <- add(acc0, v1)
@@ -11,5 +12,5 @@
     v4 = rsqrt(v3)
   StridedLoop(a2 = a0; < 2048; += 256):  # free
-    in3 = load x[0, a1, a2]
+    in3 = load x_smem[a2]
     in4 = load p_weight[a2]
     v5 = multiply(in3, v4)
<<< t:007_stage_inputs
```

The final CUDA kernel for the RMSNorm layer:

```bash
deplodock compile \
  -c "torch.nn.RMSNorm(2048)(torch.randn(1,32,2048))" \
  --target sm_120 --ir cuda
```

```c
extern "C" __global__ __launch_bounds__(256) void k_rms_norm_reduce(
    const float* x, const float* p_weight, float* rms_norm) {
  float v0 = 1.0f / 2048.0f;
  int a1 = blockIdx.x;
  int a0 = threadIdx.x;
  int lane = threadIdx.x & 31;
  int warp = threadIdx.x >> 5;
  float acc0 = 0.0f;
  __shared__ float x_smem[2048];
  for (int x_smem_flat = a0; x_smem_flat < 2048; x_smem_flat += 256) {
    float x_smem_v = x[a1 * 2048 + x_smem_flat];
    x_smem[x_smem_flat] = x_smem_v;
  }
  __syncthreads();
  for (int a2 = a0; a2 < 2048; a2 += 256) {
    float in2 = x_smem[a2];
    float v1 = in2 * in2;
    acc0 += v1;
  }
  float acc0_w = acc0;
  acc0_w = acc0_w + __shfl_xor_sync(0xffffffff, acc0_w, 16);
  acc0_w = acc0_w + __shfl_xor_sync(0xffffffff, acc0_w, 8);
  acc0_w = acc0_w + __shfl_xor_sync(0xffffffff, acc0_w, 4);
  acc0_w = acc0_w + __shfl_xor_sync(0xffffffff, acc0_w, 2);
  acc0_w = acc0_w + __shfl_xor_sync(0xffffffff, acc0_w, 1);
  __shared__ float acc0_smem[8];
  if (lane == 0) {
    acc0_smem[warp] = acc0_w;
  }
  __syncthreads();
  for (int s = 4; s > 0; s >>= 1) {
    if (warp < s) {
      acc0_smem[warp] = acc0_smem[warp] + acc0_smem[warp + s];
    }
    __syncthreads();
  }
  float acc0_b = acc0_smem[0];
  float v2 = acc0_b * v0;
  float v3 = v2 + 1e-06f;
  float v4 = rsqrtf(v3);
  for (int a2 = a0; a2 < 2048; a2 += 256) {
    float in3 = x_smem[a2];
    float in4 = p_weight[a2];
    float v5 = in3 * v4;
    float v6 = v5 * in4;
    rms_norm[a1 * 2048 + a2] = v6;
  }
}
```

submitted by /u/NoVibeCoding
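
For checking a kernel like the RMSNorm one above, a plain reference for one row is handy. The sketch below is not part of the compiler. It is a pure-Python stand-in with hypothetical function names that follows the same arithmetic as the Loop IR (mean of squares, add 1e-06, reciprocal square root, scale by the weight). A second variant mimics the kernel's reduction order: 256 strided per-thread partial sums followed by a pairwise tree merge. Python floats are double precision, so agreement with the FP32 kernel would only hold up to a tolerance.

```python
import math


def rms_norm_reference(x_row, weight, eps=1e-6):
    """Straightforward RMSNorm over one row: x * rsqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x_row) / len(x_row)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x_row, weight)]


def rms_norm_blocked(x_row, weight, eps=1e-6, threads=256):
    """Same math, but reduced in the order a 256-thread block would use.

    `threads` must be a power of two for the pairwise merge below.
    """
    # Each simulated thread t accumulates elements t, t + threads, t + 2*threads, ...
    partial = [sum(v * v for v in x_row[t::threads]) for t in range(threads)]
    # Pairwise tree merge, standing in for the warp shuffles and the
    # shared-memory reduction across warps.
    while len(partial) > 1:
        half = len(partial) // 2
        partial = [partial[i] + partial[i + half] for i in range(half)]
    inv_rms = 1.0 / math.sqrt(partial[0] / len(x_row) + eps)
    return [v * inv_rms * w for v, w in zip(x_row, weight)]


# Deterministic test row of width 2048, matching the layer size in the example.
x_row = [((i * 37) % 11 - 5) * 0.25 for i in range(2048)]
weight = [1.0 + (i % 3) * 0.5 for i in range(2048)]

ref = rms_norm_reference(x_row, weight)
blk = rms_norm_blocked(x_row, weight)
max_diff = max(abs(a - b) for a, b in zip(ref, blk))
print(f"max difference between reduction orders: {max_diff:.3e}")
```

The two variants differ only in summation order, so any difference comes from floating-point reassociation. That is the same effect to allow for when comparing a fused kernel against a framework implementation.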


Related Articles