[P] Gemma 4 running on NVIDIA B200 and AMD MI355X from the same inference stack, 15% throughput gain over vLLM on Blackwell

Our take

Google DeepMind released Gemma 4 today in two variants: a 31B dense model and a 26B MoE model with 4B parameters active per forward pass. Both have a 256K context length and are natively multimodal (text, image, video). The post's authors report running both models on MAX on launch day across NVIDIA B200 and AMD MI355X from the same inference stack, with 15% higher output throughput than vLLM on B200. A free playground for trying the models without any setup is at https://www.modular.com/#playground

Google DeepMind dropped Gemma 4 today:

Gemma 4 31B: dense, 256K context, redesigned architecture targeting efficiency and long-context quality

Gemma 4 26B A4B: MoE, 26B total / 4B active per forward pass, 256K context
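To put the "26B total / 4B active" figure in perspective, here is a quick back-of-the-envelope sketch. It uses only the two parameter counts quoted above; the function name is just for illustration. As a rough rule of thumb (an assumption, not a measurement), per-token compute tracks the active parameters while the memory footprint tracks the total.

```python
# Figures taken from the post: 26B total parameters, 4B active per forward pass.
TOTAL_PARAMS_B = 26.0
ACTIVE_PARAMS_B = 4.0


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the model's parameters used on a single forward pass."""
    if total_b <= 0:
        raise ValueError("total_b must be positive")
    return active_b / total_b


fraction = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
print(f"active share: {fraction:.1%}")  # active share: 15.4%
```

So roughly one parameter in six or seven is exercised per token, which is where the MoE variant's efficiency comes from.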

Both are natively multimodal (text, image, video, dynamic resolution).

We got both running on MAX on launch day across NVIDIA B200 and AMD MI355X from the same stack. On B200 we're seeing 15% higher output throughput vs. vLLM (happy to share more on methodology if useful).
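For anyone unsure what "15% higher output throughput" means in practice, below is a minimal sketch of how that kind of number is usually derived: generated tokens divided by wall-clock time, compared as a relative gain over a baseline. This is not the benchmark methodology behind the figure above; the helper names are hypothetical and the inputs are placeholder values, not measured results.

```python
def output_throughput(output_tokens: int, wall_seconds: float) -> float:
    """Generated (output) tokens per second over a whole benchmark run."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return output_tokens / wall_seconds


def relative_gain(candidate: float, baseline: float) -> float:
    """Fractional improvement of candidate over baseline (0.15 means 15%)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return candidate / baseline - 1.0


# Placeholder inputs for illustration only, NOT measurements from the post.
baseline_tps = output_throughput(output_tokens=100_000, wall_seconds=50.0)
candidate_tps = output_throughput(output_tokens=100_000, wall_seconds=40.0)
print(f"gain: {relative_gain(candidate_tps, baseline_tps):.0%}")  # gain: 25%
```

A fair comparison also pins the model, precision, prompt and output lengths, and concurrency to the same values on both stacks, since each of those shifts the result.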

Free playground if you want to test without spinning anything up: https://www.modular.com/#playground

submitted by /u/carolinedfrasca