Why SSMs struggle in parameter-constrained training: empirical findings at 25M parameters [R]

Our take

This piece presents empirical findings from the Parameter Golf competition on why state space models (SSMs) struggle in parameter-constrained training. After roughly three weeks of experiments, the author argues that SSMs are structurally disadvantaged relative to transformers under tight limits on training time and artifact size. Two observations stand out: SSM in_proj weights compress up to 3.26x worse than attention QKV weights under LZMA, and two architectural changes that looked like wins at SP4096 reversed direction at SP8192. The write-up also covers three kernel-level experiments on the Mamba-3 Triton kernels.

After ~3 weeks of experimentation in OpenAI's Parameter Golf competition, I wrote up why SSMs are structurally disadvantaged relative to transformers in a time- and size-constrained regime (10 min training, 16MB artifact, 25M parameters) on 8xH100s: https://mradassaad.github.io/posts/why-ssms-struggle-in-parameter-golf/
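
To get a feel for how tight these limits are, the budget can be turned into bits per parameter. This is a back-of-the-envelope sketch, not a calculation from the post: it assumes "16MB" means 16 x 10^6 bytes and that the whole artifact is spent on weights. The names are illustrative.

```python
# Back-of-the-envelope budget for the regime described above.
# Assumptions (not from the post): "16MB" is decimal megabytes, and
# the whole artifact is spent on compressed weights.
ARTIFACT_BYTES = 16 * 10**6
N_PARAMS = 25 * 10**6


def bits_per_param(artifact_bytes: int, n_params: int) -> float:
    """Average number of compressed bits available per parameter."""
    return 8 * artifact_bytes / n_params


budget = bits_per_param(ARTIFACT_BYTES, N_PARAMS)
print(f"{budget:.2f} bits per parameter")  # 5.12 under these assumptions

# Same budget if "16MB" were binary mebibytes instead.
budget_mib = bits_per_param(16 * 2**20, N_PARAMS)
print(f"{budget_mib:.2f} bits per parameter (MiB reading)")  # 5.37
```

Under either reading the budget is well below one byte per weight, so how well each tensor compresses after quantization decides how many parameters actually fit.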

Main findings:

SSM in_proj weights compress up to 3.26x worse than attention QKV under LZMA, directly taxing the compressed parameter budget
Architectural wins validated at SP4096 flipped sign at SP8192: two configs that looked like clean wins at the smaller vocabulary reversed direction at the target vocabulary
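
A compression gap like the one in the first finding can be measured by quantizing each weight group, running it through LZMA, and comparing compressed bits per weight. The sketch below uses only the Python standard library and synthetic Gaussian tensors as stand-ins. The tensor names, the int8 scheme, and the two quantization scales are assumptions made for illustration, not the post's setup or its explanation for the gap.

```python
import lzma
import random


def quantize_int8(values, scale):
    """Round floats to int8 codes with a fixed scale, clamped to [-127, 127]."""
    out = bytearray()
    for v in values:
        q = max(-127, min(127, round(v / scale)))
        out.append(q & 0xFF)  # store as a two's-complement byte
    return bytes(out)


def compressed_bits_per_weight(raw: bytes) -> float:
    """LZMA-compressed size in bits, divided by the number of weights."""
    return 8 * len(lzma.compress(raw, preset=9)) / len(raw)


rng = random.Random(0)
n = 200_000

# Hypothetical stand-ins: one tensor whose values span few quantization
# steps, one whose values span many. Real weights would be loaded here.
narrow_q = quantize_int8([rng.gauss(0.0, 1.0) for _ in range(n)], scale=0.5)
wide_q = quantize_int8([rng.gauss(0.0, 1.0) for _ in range(n)], scale=0.05)

narrow_bpw = compressed_bits_per_weight(narrow_q)
wide_bpw = compressed_bits_per_weight(wide_q)
ratio = wide_bpw / narrow_bpw

print(f"narrow tensor: {narrow_bpw:.2f} bits/weight")
print(f"wide tensor:   {wide_bpw:.2f} bits/weight")
print(f"wide tensor costs {ratio:.2f}x more bits per weight")
```

Replacing the synthetic lists with the actual quantized in_proj and QKV tensors would give a per-group ratio of the kind quoted above. The synthetic ratio printed here says nothing about the real one.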

Also includes three kernel-level experiments on the Mamba-3 Triton kernels: a backward fusion attempt that was numerically exact but 16% slower due to SMEM pressure, a torch.compile quantizer bug that cost 5.5 mBPB, and a mixed-precision dynamics protection that recovered 0.8 mBPB at negligible size cost.
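
The deltas above are quoted in mBPB, read here as thousandths of a bit per byte (an assumption; the post does not define the unit in this excerpt). The following minimal sketch shows the standard conversion from a per-token cross-entropy loss in nats to bits per byte, and how a difference between two runs becomes an mBPB figure. All numeric values are hypothetical; only the size of the example delta is chosen to mirror the one reported.

```python
import math


def bits_per_byte(loss_nats_per_token: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean cross-entropy (nats/token) to bits per byte of raw text."""
    bits_per_token = loss_nats_per_token / math.log(2)
    return bits_per_token * n_tokens / n_bytes


def delta_mbpb(bpb_a: float, bpb_b: float) -> float:
    """Difference between two BPB scores, in thousandths of a bit per byte."""
    return (bpb_a - bpb_b) * 1000.0


# Hypothetical evaluation: 1,000,000 tokens covering 4,000,000 bytes of text.
example_bpb = bits_per_byte(loss_nats_per_token=3.3, n_tokens=1_000_000, n_bytes=4_000_000)
print(f"{example_bpb:.4f} BPB")

# Hypothetical pair of runs whose scores differ by 0.0055 BPB.
regression = delta_mbpb(1.2055, 1.2000)
print(f"{regression:.1f} mBPB")  # 5.5
```

Because the tokens-to-bytes ratio enters the conversion, BPB scores stay comparable across tokenizers with different vocabulary sizes, which per-token loss does not.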

submitted by /u/mradassaad
