I built a Mamba1 variant I call SM1 with d_state=1 that runs on Blackwell in pure PyTorch [P]

Our take

SM1 is a variant of Mamba1 with d_state=1 that runs on Blackwell in pure PyTorch. It replaces the selective scan with two native PyTorch operations and uses 16 times less scan memory than Mamba1 with d_state=16. The inference state for a 130M-parameter model is about 14,080 floats. The model is currently being trained on 163K MIDI files, approximately 2.5B tokens.

The Scalar Mamba1 (SM1) variant, described in the post by TechnoVoyager, is aimed at computational efficiency for deep learning models. Replacing the selective scan with two native PyTorch operations simplifies the computation and sharply reduces scan memory, which makes it a compelling option for users working in constrained environments, particularly when training on large datasets.

The significance of SM1's design lies in its exact closed-form solution to the d_state=1 recurrence. According to the author, the result is not an approximation: it matches sequential computation up to floating-point precision, and the implementation is straightforward. The 16x reduction in scan memory compared to Mamba1 with d_state=16 speaks to the growing need for models that run well within tight hardware limits.

Moreover, the choice to train SM1 on 163K MIDI files, amounting to approximately 2.5 billion tokens, highlights the model's potential in creative domains such as music generation and analysis. By fitting within the memory limits of the RTX 5060 Ti, SM1 illustrates the progress being made to make powerful AI tools more accessible to a wider range of users, including those who may not have access to high-end computational resources. This democratization of technology is crucial for fostering innovation and encouraging exploration among budding developers and researchers who wish to experiment with AI in their respective fields.

Looking ahead, the implications of SM1 extend beyond its immediate technical benefits. As AI continues to permeate various industries, the ability to simplify complex computations without sacrificing accuracy will be crucial. The trend towards more efficient models can inspire further innovations that prioritize user experience and accessibility, aligning perfectly with the needs of modern data-driven environments. As we witness the evolution of tools like SM1, we are compelled to ask: how will these advancements shape the future of AI applications across diverse fields, and what new possibilities might emerge from this intersection of technology and creativity? The answers may redefine our understanding of what is achievable in data management and analysis.

On Windows, mamba-ssm is not easily available, and it doesn't compile for sm_120. SM1 (Scalar Mamba1) replaces the entire selective scan with two native PyTorch ops:

L = torch.cumprod(dA, dim=1)

h = L * (h0.unsqueeze(1) + torch.cumsum(dBx / L.clamp(min=1e-6), dim=1))

y = h * C

This is the exact closed-form solution to the d_state=1 recurrence, obtained via variation of parameters. It is not an approximation: it is identical to sequential computation up to floating-point precision. d_state=2 breaks it; d_state=1 is the boundary where the closed form exists.
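A minimal sketch for checking the closed form against the step-by-step recurrence h_t = dA_t * h_{t-1} + dBx_t. The tensor shapes, the function names `sm1_closed_form` and `sm1_sequential`, and the random test inputs are assumptions for illustration; they are not taken from the post.

```python
import torch

def sm1_closed_form(dA, dBx, C, h0):
    # Assumed shapes: dA, dBx (B, T, F); C (B, T, 1); h0 (B, F).
    L = torch.cumprod(dA, dim=1)  # running product of the decay factors
    h = L * (h0.unsqueeze(1) + torch.cumsum(dBx / L.clamp(min=1e-6), dim=1))
    return h * C, h

def sm1_sequential(dA, dBx, C, h0):
    # Reference: apply h_t = dA_t * h_{t-1} + dBx_t one step at a time.
    h, states = h0, []
    for t in range(dA.shape[1]):
        h = dA[:, t] * h + dBx[:, t]
        states.append(h)
    h_all = torch.stack(states, dim=1)
    return h_all * C, h_all

torch.manual_seed(0)
B, T, F = 2, 32, 8  # hypothetical sizes
# Decay factors kept in [0.9, 1.0) so the cumulative product stays above 1e-6.
dA = 0.9 + 0.1 * torch.rand(B, T, F, dtype=torch.float64)
dBx = torch.randn(B, T, F, dtype=torch.float64)
C = torch.randn(B, T, 1, dtype=torch.float64)
h0 = torch.randn(B, F, dtype=torch.float64)

y_closed, h_closed = sm1_closed_form(dA, dBx, C, h0)
y_seq, h_seq = sm1_sequential(dA, dBx, C, h0)
max_err = (y_closed - y_seq).abs().max().item()
print(f"max abs difference: {max_err:.3e}")
```

The inputs are chosen so the cumulative product never falls below the 1e-6 clamp. If it does fall below that value (long sequences with strong decay), the clamp changes the divisor and the two paths no longer agree, so a check like this is worth running at the sequence lengths actually used.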

The Mamba1 scan intermediates have shape (B, T, F, S). SM1 eliminates S entirely, so it uses 16x less scan memory than a Mamba1 with d_state=16. The inference state for a 130M-param model is about 14,080 floats (56 KB), with no KV cache and O(1) cost per token forever.
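The arithmetic behind these figures can be sketched as follows. The batch, sequence, and feature sizes are hypothetical placeholders (the post does not give them), and 4 bytes per float assumes fp32 state; only the 14,080-float count and the d_state values come from the post.

```python
def scan_intermediate_elems(batch, seq_len, features, d_state):
    # Element count of a (B, T, F, S) scan intermediate.
    return batch * seq_len * features * d_state

# Hypothetical sizes, chosen only to illustrate the ratio.
B, T, F = 4, 2048, 1536

mamba1_elems = scan_intermediate_elems(B, T, F, d_state=16)
sm1_elems = scan_intermediate_elems(B, T, F, d_state=1)
ratio = mamba1_elems // sm1_elems  # independent of B, T, F

state_floats = 14_080            # figure quoted in the post
state_bytes = state_floats * 4   # assuming 4-byte fp32 values

print(f"scan memory ratio: {ratio}x")
print(f"inference state: {state_bytes} bytes (~{state_bytes / 1000:.0f} KB)")
```

The ratio depends only on d_state, since B, T, and F cancel, which is why the 16x figure holds regardless of batch size or sequence length.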

I am currently training it on 163K MIDI files, roughly 2.5B tokens in my custom format. The 130M-param model fits in under half of the 16 GB on my RTX 5060 Ti.

submitted by /u/TechnoVoyager