from Machine Learning
[P] Fused MoE Dispatch in Pure Triton: Beating CUDA-Optimized Megablocks at Inference Batch Sizes
Our take
This post introduces a fused MoE dispatch kernel written in pure Triton for Mixture-of-Experts models, with no CUDA or vendor-specific code. On Mixtral-8x7B (A100), the author reports that it beats Stanford's Megablocks at inference batch sizes (131% at 32 tokens and 124% at 128 tokens), while Megablocks' hand-tuned CUDA pulls ahead at larger batches. The two main contributions are a fused gate+up projection that reduces memory traffic by 35%, and a block-scheduled grouped GEMM that handles variable-sized expert batches in a single kernel launch without padding. Full details and code are [here](https://github.com/bassrehab/triton-kernels).
I built a fused MoE dispatch kernel in pure Triton that handles the full forward pass for Mixture-of-Experts models. No CUDA, no vendor-specific code.
On Mixtral-8x7B (A100), it beats Stanford's Megablocks at inference-relevant batch sizes (131% at 32 tokens, 124% at 128 tokens). At larger batches, Megablocks' hand-tuned CUDA pulls ahead, as expected.
Two main contributions:
- Fused gate+up projection - both GEMMs share the same input tile load, SiLU computed in registers. Eliminates ~470MB of intermediate buffers per forward pass (35% memory traffic reduction).
- Block-scheduled grouped GEMM - precomputed block_id to (expert_id, offset) mapping handles variable-sized expert batches in a single kernel launch without padding.
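To make the two ideas above concrete, here is a rough pure-Python sketch of their semantics. It is not the Triton kernel from the repo. The function names (`build_block_schedule`, `fused_gate_up_row`, `moe_gate_up`), the `(expert_id, row_offset, rows)` tuple layout, and the assumption that tokens are already sorted by expert are all hypothetical, and the activation is assumed to be the usual SwiGLU form `silu(x @ W_gate) * (x @ W_up)`.

```python
import math


def build_block_schedule(tokens_per_expert, block_m):
    """Precompute block_id -> (expert_id, row_offset, rows).

    row_offset indexes into a token array that is assumed to be sorted by
    expert. The last block of each expert may be partial (rows < block_m),
    so no padding rows are needed; a real kernel would mask them instead.
    """
    schedule = []
    expert_start = 0
    for expert_id, count in enumerate(tokens_per_expert):
        for local in range(0, count, block_m):
            rows = min(block_m, count - local)
            schedule.append((expert_id, expert_start + local, rows))
        expert_start += count
    return schedule


def silu(v):
    return v / (1.0 + math.exp(-v))


def fused_gate_up_row(x, w_gate, w_up):
    """x: [K]; w_gate, w_up: [K][N]. Returns silu(x @ w_gate) * (x @ w_up).

    Each x[k] is read once and feeds both accumulators, mirroring the
    shared input tile load. The gate and up results live only in the
    accumulators and are never written out as separate buffers.
    """
    n = len(w_gate[0])
    gate_acc = [0.0] * n
    up_acc = [0.0] * n
    for k, xk in enumerate(x):
        for j in range(n):
            gate_acc[j] += xk * w_gate[k][j]
            up_acc[j] += xk * w_up[k][j]
    return [silu(g) * u for g, u in zip(gate_acc, up_acc)]


def moe_gate_up(sorted_tokens, tokens_per_expert, w_gate, w_up, block_m):
    """One pass over the schedule stands in for a single kernel launch;
    each schedule entry plays the role of one program/block."""
    out = [None] * len(sorted_tokens)
    for expert_id, offset, rows in build_block_schedule(tokens_per_expert, block_m):
        for r in range(offset, offset + rows):
            out[r] = fused_gate_up_row(
                sorted_tokens[r], w_gate[expert_id], w_up[expert_id]
            )
    return out
```

For example, `build_block_schedule([5, 0, 3], 4)` yields three blocks: two for expert 0 (one full, one with a single row) and one for expert 2, with the empty expert 1 contributing no blocks at all.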
Tested across Mixtral-8x7B, DeepSeek-V3 (256 experts), and Qwen2-MoE. Full test suite passes on AMD MI300X with zero code changes.
Code: https://github.com/bassrehab/triton-kernels
Writeup: https://subhadipmitra.com/blog/2026/fused-moe-dispatch-triton/