•1 min read•from Machine Learning
[P] Gemma 4 running on NVIDIA B200 and AMD MI355X from the same inference stack, 15% throughput gain over vLLM on Blackwell
Our take
Google DeepMind released Gemma 4 today in two models: a 31B dense model and a 26B MoE variant, both with a 256K context length and an architecture aimed at efficiency and long-context quality. Both are natively multimodal (text, image, video). The post reports both models running on MAX across NVIDIA B200 and AMD MI355X from the same inference stack, with 15% higher output throughput than vLLM on B200. A free playground for trying Gemma 4 without setup is available at https://www.modular.com
Google DeepMind dropped Gemma 4 today:
- Gemma 4 31B: dense, 256K context, redesigned architecture targeting efficiency and long-context quality
- Gemma 4 26B A4B: MoE, 26B total / 4B active per forward pass, 256K context
Both are natively multimodal (text, image, video, dynamic resolution).
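To make the "26B total / 4B active" figure concrete, here is a rough back-of-the-envelope sketch. It uses the rounded parameter counts from the list above and assumes 16-bit weights, which the post does not state; the helper names are made up for illustration.

```python
# Rounded figures from the post; real parameter counts will differ slightly.
TOTAL_PARAMS = 26e9    # 26B total parameters
ACTIVE_PARAMS = 4e9    # 4B active per forward pass
BYTES_PER_PARAM = 2    # ASSUMPTION: 16-bit weights (e.g. bf16), not stated in the post


def active_fraction(active: float, total: float) -> float:
    """Share of the parameters used in a single forward pass."""
    return active / total


def weight_memory_gb(params: float, bytes_per_param: int) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes), ignoring KV cache and activations."""
    return params * bytes_per_param / 1e9


if __name__ == "__main__":
    print(f"active fraction: {active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS):.1%}")
    print(f"all weights:     {weight_memory_gb(TOTAL_PARAMS, BYTES_PER_PARAM):.0f} GB")
    print(f"active weights:  {weight_memory_gb(ACTIVE_PARAMS, BYTES_PER_PARAM):.0f} GB")
```

Under these assumptions only about 15% of the weights take part in each forward pass, but all 26B still have to be resident in memory, since which experts are active varies from token to token.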
We got both running on MAX on launch day across NVIDIA B200 and AMD MI355X from the same stack. On B200 we're seeing 15% higher output throughput vs. vLLM (happy to share more on methodology if useful).
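The benchmark methodology isn't given here, so the following is only a sketch of what an "output throughput" comparison usually boils down to: generated tokens divided by wall-clock time, compared as a ratio. The function names and the numbers are illustrative placeholders, not measurements from either stack.

```python
def output_throughput(output_tokens: int, wall_seconds: float) -> float:
    """Generated (output) tokens per second over a benchmark run."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return output_tokens / wall_seconds


def relative_gain(candidate: float, baseline: float) -> float:
    """Fractional improvement of candidate over baseline (0.15 means 15% higher)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return candidate / baseline - 1.0


if __name__ == "__main__":
    # PLACEHOLDER numbers, chosen only to show what a 15% gain looks like.
    baseline = output_throughput(output_tokens=100_000, wall_seconds=50.0)
    candidate = output_throughput(output_tokens=115_000, wall_seconds=50.0)
    print(f"baseline:  {baseline:.0f} tok/s")
    print(f"candidate: {candidate:.0f} tok/s")
    print(f"gain:      {relative_gain(candidate, baseline):.0%}")
```

For a comparison like this to be meaningful, both stacks would need the same model, precision, prompt set, output lengths, and concurrency, which is presumably what the offered methodology details would cover.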
Free playground if you want to test without spinning anything up: https://www.modular.com/#playground