May 27, 2026•1 min read•from Machine Learning

Cross-Platform Fused MoE Dispatch in Triton: Portable Expert Routing Without CUDA [R]

Our take

The preprint "Cross-Platform Fused MoE Dispatch in Triton" presents TritonMoE, a Mixture-of-Experts inference kernel written entirely in OpenAI Triton and designed to run on NVIDIA and AMD GPUs without vendor-specific code. Its fused gate+up GEMM reduces global memory traffic by 35%, and the kernel reaches 89-131% of Megablocks throughput at inference batch sizes of up to 512 tokens on A100. For related reading, see our article "AI-generated CUDA kernels silently break training and inference."

The preprint describes a fused Mixture-of-Experts (MoE) dispatch kernel written in OpenAI's Triton that runs on both NVIDIA and AMD GPUs without vendor-specific code. Baselines such as Megablocks rely on hand-tuned CUDA, which ties a deployment to NVIDIA hardware; a single portable kernel removes that constraint and makes MoE inference easier to deploy across different computing environments. The work is also relevant to recent discussion of AI-generated CUDA kernels that can break training and inference, covered in our article "AI-generated CUDA kernels silently break training and inference."

At the core of TritonMoE is a fused gate+up GEMM (General Matrix Multiply): both SwiGLU projections are computed from shared input tile loads, which reduces global memory traffic by 35%. The preprint reports 89-131% of Megablocks throughput at inference batch sizes of up to 512 tokens on an A100. The same kernel also runs unchanged on AMD's MI300X, so one code path can serve both hardware ecosystems instead of tying users to a single vendor.

The preprint also states its limitations. The kernel falls behind Megablocks at batch sizes of 2048 tokens and above, and it degrades with 64 or more experts under extreme routing skew. These constraints matter for high-throughput serving and for models with many experts, so the reported gains apply mainly to small and medium inference batches.

The broader significance of this work is reduced vendor lock-in. Because the kernel needs no vendor-specific code, organizations can choose between NVIDIA and AMD hardware without maintaining separate kernel implementations. It will be interesting to see how this influences the AI landscape, particularly community collaboration on shared, portable kernels.

In conclusion, TritonMoE is both a technical result and a step toward more portable AI infrastructure. Its limitations at large batch sizes and high expert counts show that portability still carries trade-offs, so further development in this area is worth watching. How will portable kernels like this one shape AI deployments across industries?

New preprint. A Mixture-of-Experts inference kernel (TritonMoE) written entirely in OpenAI Triton, targeting portability across NVIDIA and AMD without vendor-specific code.

Highlights:

A fused gate+up GEMM computes both SwiGLU projections from shared tile loads, eliminating 35% of global memory traffic.
89-131% of Megablocks throughput at inference batch sizes (up to 512 tokens) on A100; the same kernel runs on MI300X unchanged.
Limitations: falls behind at 2048+ tokens, and degrades with 64+ experts under extreme routing skew.
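
The fused gate+up idea in the first highlight can be illustrated with a small pure-Python reference. This is only a sketch under stated assumptions, not the TritonMoE kernel: the function names, the `block_k` tile size, and the list-of-lists matrices are hypothetical, and a real Triton kernel would keep the two accumulators in registers and tile over M and N as well. What it shows is that each K-tile of the input is read once and feeds both the gate and the up accumulators, with SiLU applied at the end.

```python
import math


def silu(v):
    # SiLU (swish) activation: v * sigmoid(v)
    return v / (1.0 + math.exp(-v))


def swiglu_unfused(x, w_gate, w_up):
    """Reference path: two separate GEMMs with materialised intermediates."""
    m, k, n = len(x), len(w_gate), len(w_gate[0])
    gate = [[sum(x[i][p] * w_gate[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]
    up = [[sum(x[i][p] * w_up[p][j] for p in range(k)) for j in range(n)]
          for i in range(m)]
    return [[silu(gate[i][j]) * up[i][j] for j in range(n)] for i in range(m)]


def swiglu_fused(x, w_gate, w_up, block_k=2):
    """Fused path: every K-tile of x is loaded once and used by both GEMMs.

    Returns the output and the number of input tile loads (illustrative only).
    """
    m, k, n = len(x), len(w_gate), len(w_gate[0])
    acc_gate = [[0.0] * n for _ in range(m)]  # would live in registers on a GPU
    acc_up = [[0.0] * n for _ in range(m)]
    x_tile_loads = 0
    for k0 in range(0, k, block_k):
        k1 = min(k0 + block_k, k)
        x_tile = [row[k0:k1] for row in x]  # single shared load of the input tile
        x_tile_loads += 1
        for i in range(m):
            for j in range(n):
                for p in range(k1 - k0):
                    a = x_tile[i][p]
                    acc_gate[i][j] += a * w_gate[k0 + p][j]
                    acc_up[i][j] += a * w_up[k0 + p][j]
    # Activation and elementwise product happen on the accumulators directly,
    # so no gate/up intermediate buffer is written out.
    out = [[silu(acc_gate[i][j]) * acc_up[i][j] for j in range(n)]
           for i in range(m)]
    return out, x_tile_loads
```

In the unfused path the input is traversed once per projection and the gate and up results are stored before the activation; the fused path traverses the input once and never stores those intermediates, which is the kind of traffic the highlight refers to.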

Paper: https://arxiv.org/abs/2605.23911

Code: https://github.com/bassrehab/triton-kernels

Writeup with benchmarks: https://subhadipmitra.com/blog/2026/fused-moe-dispatch-triton/

submitted by /u/bassrehab

[P] Fused MoE Dispatch in Pure Triton: Beating CUDA-Optimized Megablocks at Inference Batch Sizes

I built a fused MoE dispatch kernel in pure Triton that handles the full forward pass for Mixture-of-Experts models. No CUDA, no vendor-specific code. On Mixtral-8x7B (A100), it beats Stanford's Megablocks at inference-relevant batch sizes (131% at 32 tokens, 124% at 128 tokens). At larger batches Megablocks' hand-tuned CUDA pulls ahead as expected.

Two main contributions:

Fused gate+up projection: both GEMMs share the same input tile load, with SiLU computed in registers. Eliminates ~470MB of intermediate buffers per forward pass (35% memory traffic reduction).
Block-scheduled grouped GEMM: a precomputed block_id to (expert_id, offset) mapping handles variable-sized expert batches in a single kernel launch without padding.

Tested across Mixtral-8x7B, DeepSeek-V3 (256 experts), and Qwen2-MoE. Full test suite passes on AMD MI300X with zero code changes.

Code: https://github.com/bassrehab/triton-kernels

Writeup: https://subhadipmitra.com/blog/2026/fused-moe-dispatch-triton/

submitted by /u/bassrehab
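
The block-scheduled grouped GEMM described above relies on a precomputed block_id to (expert_id, offset) mapping. The following is a minimal host-side sketch of how such a mapping could be built, not the author's implementation. It assumes tokens have already been sorted by expert into one contiguous buffer and that `offset` means the starting row in that buffer; the function names and the `block_m` parameter are hypothetical.

```python
def build_block_schedule(tokens_per_expert, block_m):
    """Return a list where index block_id holds (expert_id, row_offset).

    Assumption: tokens are sorted by expert, so expert e owns a contiguous
    range of rows. Experts with zero tokens get no blocks, and no rows are
    padded: the last block of an expert is simply shorter.
    """
    schedule = []
    expert_start = 0
    for expert_id, count in enumerate(tokens_per_expert):
        for local in range(0, count, block_m):
            schedule.append((expert_id, expert_start + local))
        expert_start += count
    return schedule


def rows_for_block(schedule, tokens_per_expert, block_m, block_id):
    """Rows a given block would process, clipped at its expert's boundary."""
    expert_id, offset = schedule[block_id]
    expert_end = sum(tokens_per_expert[:expert_id + 1])
    return expert_id, range(offset, min(offset + block_m, expert_end))
```

With a schedule like this, a single launch over `len(schedule)` blocks can cover every expert: each block looks up its entry, selects that expert's weights, and masks off rows past the expert's boundary instead of padding every expert to the same size.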
