An Update on Matrix Recurrent Units, an Attention Alternative [R]
Our take
The ongoing search for alternatives to attention mechanisms in sequence modeling continues to yield intriguing, if sometimes incremental, results. The recent revisiting of Matrix Recurrent Units (MRUs) by researcher mikayahlevi offers a valuable contribution to this conversation, even if the current iteration doesn’t quite dethrone attention. MRUs, as described, are a linear-time sequence architecture that aims to sidestep the computational cost of traditional attention. It's interesting to see this exploration alongside discussions of human-in-the-loop annotation platforms ("Recommendations for speech annotation tools") and of potential errors in published work on AI models ("Found a potential mistake in an ICLR 2026 blogpost"), which together highlight the multifaceted nature of AI development, from practical tooling to rigorous model evaluation. The core concept (transforming embeddings into matrices, cumulatively multiplying them across the sequence, and then transforming the resulting matrices back into vectors) is a clever attempt to encode sequential information efficiently, a pursuit that resonates with the broader effort to optimize AI performance.
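To make the core concept concrete, here is a minimal, dependency-free Python sketch. It assumes the simplest input-state construction mentioned in the original post (reshape the embedding into a square matrix and add the identity) and a left-to-right multiplication order. The function names are hypothetical and this is not the author's implementation. The second function illustrates why associativity permits a parallel scan, using a plain Hillis-Steele scan as a stand-in for whatever scan the repo actually uses.

```python
# Illustrative sketch only; names are hypothetical, not from the author's repo.

def matmul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def to_state_matrix(vec, n):
    """Reshape a length n*n vector into an n x n matrix and add the identity."""
    assert len(vec) == n * n
    return [[vec[i * n + j] + (1.0 if i == j else 0.0) for j in range(n)]
            for i in range(n)]

def flatten(mat):
    """Turn an n x n matrix back into a length n*n vector."""
    return [x for row in mat for x in row]

def mru_sequential(embeddings, n):
    """Cumulative matrix product over the sequence, one step at a time."""
    outputs, state = [], None
    for vec in embeddings:
        m = to_state_matrix(vec, n)
        state = m if state is None else matmul(state, m)  # assumed order: earlier @ later
        outputs.append(flatten(state))
    return outputs

def mru_scan(embeddings, n):
    """Same result via a Hillis-Steele inclusive scan.

    Each pass combines position t with position t - offset. Within a pass,
    all positions are independent, which is what makes it parallelizable.
    """
    states = [to_state_matrix(v, n) for v in embeddings]
    offset = 1
    while offset < len(states):
        states = [states[t] if t < offset else matmul(states[t - offset], states[t])
                  for t in range(len(states))]
        offset *= 2
    return [flatten(s) for s in states]

# Tiny usage example: three tokens with 4-dimensional embeddings (2 x 2 states).
demo = [[0.5, 0.0, 0.25, -0.5], [0.0, 1.0, 0.5, 0.0], [1.0, -1.0, 0.0, 0.5]]
seq_out = mru_sequential(demo, 2)
scan_out = mru_scan(demo, 2)
assert all(abs(a - b) < 1e-9 for s, p in zip(seq_out, scan_out) for a, b in zip(s, p))
```

On real hardware each pass of the scan would run as one batched matrix multiply, so a sequence of length T is covered in about log2(T) passes rather than T sequential steps.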
The iterative refinement of the MRU’s input state matrix creation methods, documented in the post, is particularly instructive. The initial struggles with training instability, and the subsequent experimentation with skew-symmetric matrices, LDU factors, and QR decompositions, reveal the complexities of architectural design and the importance of empirical validation. The observation that a simple scalar factor correction worsened results, which the author reads as a sign that the unscaled model was “cheating” on the toy dataset, underscores the need for careful evaluation practices and a healthy skepticism toward early successes. This aligns with the ongoing need for robust benchmarks and scrutiny within the field, as seen in discussions such as "Non-deterministic Vulnerability Detection Benchmark System". While current results on the TinyStories dataset suggest MRUs aren’t yet a direct replacement for attention in generative language modeling, the author’s account of the algorithm’s strengths and weaknesses (it is lightweight and fast, but has a much lower storage capacity) provides a valuable perspective.
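The scalar-factor remark rests on a short identity. Assuming each input state matrix $M_s$ is divided by a nonzero scalar $c_s$ (the exact choice of scalar is not given here, so $c_s$ is a placeholder), the cumulative product up to position $t$ becomes:

```latex
\prod_{s=1}^{t} \frac{M_s}{c_s}
  \;=\; \Bigl(\prod_{s=1}^{t} c_s\Bigr)^{-1} \prod_{s=1}^{t} M_s
```

Scalars commute with matrices, so they factor out of the product regardless of multiplication order. This is the sense in which dividing the input states "should only affect the output states by scaling them".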
The author’s proposed application of MRUs to query and key vectors within attention mechanisms, effectively using them for rotations in higher dimensions, represents a promising avenue for future research. This demonstrates a willingness to adapt and integrate the MRU concept into existing architectures rather than pursuing a complete replacement, a pragmatic and potentially fruitful approach. The comparative analysis with other linear-time models, including a linear transformer, provides context for understanding the MRU’s position within the broader landscape of sequence modeling techniques. The exploration of different matrix transformations to influence state dependencies – the contrast between shear transformations (critical for the MRU's performance) and rotations (which seemingly hinder learning) – opens up intriguing questions about the fundamental nature of sequential information processing in neural networks.
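The distinction between rotations and shears, and the idea of non-commutative rotations in more than two dimensions, can be illustrated with a small pure-Python sketch. It builds 3x3 orthogonal matrices with the Cayley map of a skew-symmetric matrix and compares them with a unit-triangular shear. The helper names and the specific parametrization are assumptions chosen for illustration; the author's actual input-state constructions may differ.

```python
# Illustrative sketch only; names are hypothetical, not from the author's repo.

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transpose(a):
    return [list(col) for col in zip(*a)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def inverse(a):
    """Gauss-Jordan inverse with partial pivoting (fine for tiny matrices)."""
    n = len(a)
    aug = [list(row) + ident_row for row, ident_row in zip(a, identity(n))]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def cayley(x, y, z):
    """Map three free parameters to a 3x3 orthogonal matrix.

    A is skew-symmetric (A^T = -A), so I - A is always invertible and
    Q = (I - A)^{-1} (I + A) satisfies Q^T Q = I.
    """
    a = [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]
    eye = identity(3)
    i_minus = [[eye[r][c] - a[r][c] for c in range(3)] for r in range(3)]
    i_plus = [[eye[r][c] + a[r][c] for c in range(3)] for r in range(3)]
    return matmul(inverse(i_minus), i_plus)

def shear(k):
    """Unit upper-triangular shear; not orthogonal for k != 0."""
    return [[1.0, k, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

def max_abs_diff(a, b):
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))

q1 = cayley(0.3, 0.0, 0.0)   # rotation in the y-z plane
q2 = cayley(0.0, 0.5, 0.0)   # rotation in the x-z plane
s = shear(0.7)

orth_err = max_abs_diff(matmul(transpose(q1), q1), identity(3))  # near zero: Q is orthogonal
shear_err = max_abs_diff(matmul(transpose(s), s), identity(3))   # clearly nonzero
commute_gap = max_abs_diff(matmul(q1, q2), matmul(q2, q1))       # nonzero: order matters
```

The orthogonal matrices preserve vector norms by construction, while the shear does not. And unlike the 2D rotations used by RoPE, which all commute within a plane, the two 3D rotations here give different results depending on the order in which they are applied.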
Ultimately, the MRU experiment serves as a reminder that innovation rarely follows a linear path. While this particular iteration may not have achieved its initial ambitious goal, the insights gained – concerning matrix state manipulation, training stability, and the trade-offs between computational efficiency and storage capacity – are valuable contributions to the field. The question now becomes: can the MRU's unique characteristics be leveraged in niche applications or hybrid architectures where its strengths outweigh its limitations? Exploring this potential, and encouraging further research into alternative sequence modeling paradigms, remains crucial for advancing the state of the art in AI.
I recently revisited my matrix recurrent units algorithm (the MRU), a novel linear-time sequence architecture I created as an alternative to attention. I explain it in depth at the repo, but the gist is that the MRU works by transforming the embedding into an input state matrix, cumulatively multiplying the matrices across the sequence dimension to get the output state matrix, and then transforming the matrices back into a vector. In order to make the MRU efficient on DL hardware, I created a parallel scan by utilizing the operation's associativity.

About a year ago, I shared my project on Reddit (I've since renamed my account), with good results on the toy dataset shakespeare-char. A commenter asked what steps were taken to bound the matrix states, and another commenter found that training was inherently unstable on more comprehensive datasets. I addressed these by experimenting with different methods to create the input state matrix. Originally, I simply reshaped the input vector into a matrix and added the identity. Since then, I've implemented the following methods:
I found that these fixes prevented loss spikes with varying tradeoffs. Interestingly, the scalar factor method led to worse results. Dividing the input states should only affect the output states by scaling them, indicating that the unscaled model was "cheating" on the toy dataset by learning a simple scalar decay pattern instead of more complex relationships. Also, using the Cayley Map or matrix exponential to force the input states to be orthogonal surprisingly mostly prevented the model from learning information about the sequence, performing closer to the FFN than the Cayley QR method. The poor performance of orthogonal matrices indicates that the ability to learn shear transformations might be critical for the model. Possibly, rotations enforce dependence on the previous state, whereas shearing allows the model to adjust the state more independently of the previous state.

Above are the train loss and validation loss on the shakespeare-char dataset for a small MRU LM, transformer, and FFN, with the embedding, state, key, and value size set to 256. The MRU LM has a single MRU layer and 4 MLPs, the transformer has a single attention layer and 4 MLPs, and the FFN only has 4 MLPs. I only used a single sequence-mixing layer in order to isolate the effect of the MRU.

Finally, I moved to a larger dataset, trying to replicate https://huggingface.co/roneneldan/TinyStories-33M by training a baseline GPT-2 model and a model with attention replaced with the MRU. I ended up quitting the training runs early, but the loss curves seem to already conclusively show that the MRU performs worse on this task. For the creation of the MRU's input state matrices, I used the method of creating LDU factors, since it has the best performance. Above is the validation loss for a transformer and an LM using the MRU with the same hyperparameters and dimensions as the Hugging Face model card. The official TinyStories model was trained for 20 epochs, which corresponds to about 200k steps.
In order to compare it to other linear-time models, I also briefly trained a linear transformer, using the algorithm described in "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention".

I think that my research shows that the MRU likely doesn't work as a direct replacement for attention for generative language modeling, but I've already laid the groundwork for this algorithm. The MRU has dramatically different strengths and weaknesses compared to other algorithms such as attention, state space models, traditional RNNs, and fast weight programmers. It performs significantly more cumulative computation along the sequence (as opposed to the computation for each token being independent), is significantly more lightweight and hence faster, but also has a much lower storage capacity.

I believe that the MRU's alternative uses should still be explored. One usage of the MRU could be applying it to query and key vectors of attention. Similar to RoPE, it would rotate chunks of the vectors, but it would be able to rotate chunks in greater than two dimensions and with dynamic and non-commutative angles. This is one of many applications of the algorithm which I will continue to research, and I hope that others are interested in its applications as well. If you're interested, reach out to me at [mikayahlevi@gmail.com](mailto:mikayahlevi@gmail.com), Reddit, GitHub, or any other platform you can find me at.