From Machine Learning

An Update on Matrix Recurrent Units, an Attention Alternative [R]

Our take

Recent research revisits the Matrix Recurrent Unit (MRU), a novel linear-time sequence architecture proposed as an alternative to attention mechanisms. Initial experiments on the Shakespeare dataset showed promise, but subsequent challenges regarding stability and matrix state bounding prompted extensive exploration of input state matrix creation methods.

The search for alternatives to attention in sequence modeling continues to yield intriguing, if sometimes incremental, results. The recent revisiting of Matrix Recurrent Units (MRUs) by researcher mikayahlevi is a valuable contribution to this conversation, even if the current iteration doesn't dethrone attention. MRUs are a linear-time sequence architecture that aims to sidestep the quadratic cost of standard attention. The post appears alongside discussions of human-in-the-loop annotation platforms ("Recommendations for speech annotation tools") and of potential errors in published work ("Found a potential mistake in an ICLR 2026 blogpost"), which together highlight the range of AI development, from practical tooling to rigorous model evaluation. The core concept (transforming embeddings into matrices, cumulatively multiplying them, and then transforming the result back into a vector) is a clever attempt to encode sequential information efficiently, a pursuit that resonates with the broader effort to optimize AI performance.

The iterative refinement of the MRU's input state matrix creation methods, documented in the post, is particularly instructive. The initial struggles with training instability, and the subsequent experiments with skew-symmetric matrices, LDU factors, and QR decompositions, reveal the complexities of architectural design and the importance of empirical validation. The observation that a simple determinant-correcting scalar factor worsened results, suggesting the unscaled model was "cheating" on the toy dataset, underscores the need for careful evaluation and healthy skepticism toward early successes. This aligns with the ongoing need for robust benchmarks and scrutiny within the field, as seen in the discussion "Non-deterministic Vulnerability Detection Benchmark System". While current results on the TinyStories dataset suggest MRUs aren't yet a direct replacement for attention in generative language modeling, the author's account of the algorithm's strengths and weaknesses (it is lightweight and fast, but has a much lower storage capacity) provides a valuable perspective.

The author's proposed application of MRUs to the query and key vectors within attention, where they would act like RoPE but rotate chunks in more than two dimensions, is a promising avenue for future research. It shows a willingness to integrate the MRU concept into existing architectures rather than pursue a complete replacement, a pragmatic and potentially fruitful approach. The comparison with other linear-time models, including a linear transformer, gives context for the MRU's position within the broader landscape of sequence modeling techniques. The contrast between matrix transformations, with shear transformations possibly critical for the MRU's performance and pure rotations seemingly hindering learning, raises intriguing questions about how neural networks process sequential information.

Ultimately, the MRU experiment serves as a reminder that innovation rarely follows a linear path. While this particular iteration may not have achieved its initial ambitious goal, the insights gained – concerning matrix state manipulation, training stability, and the trade-offs between computational efficiency and storage capacity – are valuable contributions to the field. The question now becomes: can the MRU's unique characteristics be leveraged in niche applications or hybrid architectures where its strengths outweigh its limitations? Exploring this potential, and encouraging further research into alternative sequence modeling paradigms, remains crucial for advancing the state of the art in AI.

An Update on Matrix Recurrent Units, an Attention Alternative [R]

I recently revisited my matrix recurrent unit algorithm (the MRU), a novel linear-time sequence architecture I created as an alternative to attention. I explain it in depth in the repo, but the gist is that the MRU works by transforming the embedding into an input state matrix, cumulatively multiplying the matrices across the sequence dimension to get the output state matrices, and then transforming the matrices back into vectors. To make the MRU efficient on DL hardware, I implemented a parallel scan that exploits the associativity of matrix multiplication.
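As a rough illustration of the mechanism described above, here is a minimal NumPy sketch. It is not the repo's implementation: the function names are hypothetical, the input states use the simple "reshape and add the identity" construction, the output is just the flattened state, right-multiplication order is assumed, and the scan is a plain Hillis-Steele scan (O(T log T) work) rather than whatever scan the repo uses.

```python
import numpy as np

def to_input_states(x, n):
    """Reshape each length-n*n embedding into an n x n matrix and add the identity."""
    return x.reshape(x.shape[0], n, n) + np.eye(n)

def sequential_scan(mats):
    """Reference: inclusive cumulative matrix product, one step per token."""
    out = np.empty_like(mats)
    acc = np.eye(mats.shape[-1])
    for t in range(mats.shape[0]):
        acc = acc @ mats[t]  # assumed order: new matrices multiply on the right
        out[t] = acc
    return out

def parallel_scan(mats):
    """Hillis-Steele inclusive scan: O(log T) batched matmul steps."""
    out = mats.copy()
    shift = 1
    while shift < out.shape[0]:
        # Each position absorbs the partial product ending `shift` steps earlier.
        # The right-hand side is fully evaluated before the assignment.
        out[shift:] = out[:-shift] @ out[shift:]
        shift *= 2
    return out

def mru_sketch(x, n):
    """Embedding -> input states -> cumulative product -> flattened output vectors."""
    states = parallel_scan(to_input_states(x, n))
    return states.reshape(x.shape[0], n * n)

rng = np.random.default_rng(0)
x = 0.1 * rng.standard_normal((7, 16))  # 7 tokens, 16-dim embeddings, 4 x 4 states
y = mru_sketch(x, 4)
print(y.shape)  # (7, 16)
```

The parallel version only works because matrix multiplication is associative: partial products can be grouped in any way as long as the left-to-right order is preserved.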

About a year ago, I shared my project on Reddit (I've since renamed my account), with good results on the toy dataset shakespeare-char. One commenter asked what steps I had taken to bound the matrix states, and another found that training was inherently unstable on more comprehensive datasets. I addressed both issues by experimenting with different methods of creating the input state matrix. Originally, I simply reshaped the input vector into a matrix and added the identity. Since then, I've implemented the following methods:

  • Using the elements of the vector to fill a skew-symmetric matrix and using the matrix exponential or the Cayley map to generate an orthogonal matrix.
  • Filling LDU factors with elements from the vector and using an activation function on D to enforce a determinant of 1.
  • Creating a QR product, by using the matrix exponential or Cayley map to create the orthogonal matrix Q and filling the upper-triangular matrix R with elements from the vector.
  • Dividing the matrix by a determinant-correcting scalar factor computed from its determinant.
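The first three constructions in the list can be sketched in NumPy as follows. This is an illustration under assumptions, not the repo's code: the helper names are hypothetical, only the Cayley map is shown (not the matrix exponential), the way vector elements are assigned to matrix entries is arbitrary, and the activation on D (an exponential of mean-centered values, so the diagonal multiplies to 1) is one guess at what "an activation function on D" could look like.

```python
import numpy as np

def skew_from_vector(v, n):
    """Fill the strict upper triangle with v, then antisymmetrize: A^T = -A."""
    a = np.zeros((n, n))
    a[np.triu_indices(n, k=1)] = v
    return a - a.T

def cayley(a):
    """Cayley map (I - A)^{-1} (I + A): orthogonal whenever A is skew-symmetric."""
    eye = np.eye(a.shape[0])
    return np.linalg.solve(eye - a, eye + a)

def ldu_state(v, n):
    """Unit-triangular L and U plus a positive diagonal D whose product is 1."""
    k = n * (n - 1) // 2
    lower = np.eye(n)
    lower[np.tril_indices(n, k=-1)] = v[:k]
    upper = np.eye(n)
    upper[np.triu_indices(n, k=1)] = v[k:2 * k]
    z = v[2 * k:]
    d = np.exp(z - z.mean())  # assumed activation: prod(d) = exp(0) = 1
    return lower @ np.diag(d) @ upper

def qr_state(v, n):
    """Orthogonal Q from the Cayley map times a freely filled upper-triangular R."""
    k = n * (n - 1) // 2
    q_mat = cayley(skew_from_vector(v[:k], n))
    r = np.zeros((n, n))
    r[np.triu_indices(n, k=0)] = v[k:]
    return q_mat @ r

n = 4
rng = np.random.default_rng(0)
v = rng.standard_normal(n * n)  # one n*n-dimensional input vector

q = cayley(skew_from_vector(v[: n * (n - 1) // 2], n))
m_ldu = ldu_state(v, n)
m_qr = qr_state(v, n)
```

Note that the LDU and QR parameterizations both consume exactly n*n values per matrix, while the purely orthogonal one only uses n(n-1)/2.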

I found that these fixes prevented loss spikes, with varying tradeoffs. Interestingly, the scalar factor method led to worse results. Dividing the input states by a scalar should only rescale the output states, which suggests that the unscaled model was "cheating" on the toy dataset by learning a simple scalar decay pattern instead of more complex relationships. Also, using the Cayley map or matrix exponential to force the input states to be orthogonal surprisingly prevented the model from learning most information about the sequence: it performed closer to the FFN than to the Cayley QR method. The poor performance of orthogonal matrices suggests that the ability to learn shear transformations might be critical for the model. Possibly, rotations enforce dependence on the previous state, whereas shearing allows the model to adjust the state more independently of the previous state.
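The claim that scalar division only rescales the outputs can be checked numerically. The sketch below assumes the correcting factor is |det|^(1/n), which is one plausible reading and not necessarily the exact factor used in the repo; all names are illustrative.

```python
import numpy as np

def cumulative_products(mats):
    """Inclusive cumulative matrix product along the sequence axis."""
    out = np.empty_like(mats)
    acc = np.eye(mats.shape[-1])
    for t, m in enumerate(mats):
        acc = acc @ m
        out[t] = acc
    return out

rng = np.random.default_rng(0)
T, n = 6, 3
mats = np.eye(n) + 0.1 * rng.standard_normal((T, n, n))

# Assumed correcting factor: |det|^(1/n), so each normalized matrix has |det| = 1.
scales = np.abs(np.linalg.det(mats)) ** (1.0 / n)
normalized = mats / scales[:, None, None]

raw_out = cumulative_products(mats)
norm_out = cumulative_products(normalized)

# Scalars commute with matrices, so they factor out of the product:
# raw_out[t] = (scales[0] * ... * scales[t]) * norm_out[t]
running_scale = np.cumprod(scales)
```

Since the two outputs differ only by the running scalar, any advantage of the unscaled model has to come from that scalar, which is consistent with the "scalar decay" interpretation above.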

https://preview.redd.it/9ebh98q6uo8h1.png?width=2528&format=png&auto=webp&s=03ccef7f9b90762281aba31ab88af0368e273f69

https://preview.redd.it/fkkud7q6uo8h1.png?width=2528&format=png&auto=webp&s=5e9a2ef2b0e4319990950f16aa0648adebc2c360

Above are the train loss and validation loss on the shakespeare-char dataset for a small MRU LM, transformer, and FFN, with the embedding, state, key, and value size set to 256. The MRU LM has a single MRU layer and 4 MLPs, the transformer has a single attention layer and 4 MLPs, and the FFN only has 4 MLPs. I only used a single sequence-mixing layer in order to isolate the effect of the MRU.

Finally, I moved to a larger dataset, trying to replicate https://huggingface.co/roneneldan/TinyStories-33M by training a baseline GPT-2 model and a model with attention replaced by the MRU. I ended up stopping the training runs early, but the loss curves already seem to show clearly that the MRU performs worse on this task. To create the MRU's input state matrices, I used the LDU factor method, since it performed best.

https://preview.redd.it/p2uh1pyfuo8h1.png?width=2528&format=png&auto=webp&s=d6406574e0275f1aad52e89cca6462fd55116fcd

Above is the validation loss for a transformer and an LM using the MRU, with the same hyperparameters and dimensions as the Hugging Face model card. The official TinyStories model was trained for 20 epochs, which corresponds to about 200k steps. To compare the MRU with other linear-time models, I also briefly trained a linear transformer, using the algorithm described in "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention".
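For readers unfamiliar with the linear transformer baseline, here is a minimal single-head NumPy sketch of causal linear attention in the style of that paper, using the elu(x) + 1 feature map. It only illustrates the recurrence and is not the training code behind the plots; function names and shapes are illustrative.

```python
import numpy as np

def feature_map(x):
    """elu(x) + 1: a strictly positive feature map."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def causal_linear_attention(q, k, v):
    """Recurrent form: a constant-size state (S, z) is updated once per token."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    S = np.zeros((q.shape[1], v.shape[1]))  # running sum of outer(phi(k_j), v_j)
    z = np.zeros(q.shape[1])                # running sum of phi(k_j)
    out = np.empty_like(v)
    for t in range(q.shape[0]):
        S += np.outer(phi_k[t], v[t])
        z += phi_k[t]
        out[t] = (phi_q[t] @ S) / (phi_q[t] @ z)
    return out

def causal_quadratic_reference(q, k, v):
    """Same result computed with an explicit T x T causal score matrix."""
    a = np.tril(feature_map(q) @ feature_map(k).T)
    return (a @ v) / a.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
T, d = 6, 4
q = rng.standard_normal((T, d))
k = rng.standard_normal((T, d))
v = rng.standard_normal((T, d))
out = causal_linear_attention(q, k, v)
```

The state here is an additive sum of outer products, whereas the MRU state is a multiplicative product of matrices, which is one way to see why the two linear-time models behave so differently.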

I think that my research shows that the MRU likely doesn't work as a direct replacement for attention for generative language modeling, but I've already laid the groundwork for this algorithm. The MRU has dramatically different strengths and weaknesses compared to other algorithms such as attention, state space models, traditional RNNs, and fast weight programmers. It performs significantly more cumulative computation along the sequence (as opposed to the computation for each token being independent), is significantly more lightweight and hence faster, but also has a much lower storage capacity. I believe that the MRU's alternative uses should still be explored.

One use of the MRU could be applying it to the query and key vectors of attention. Similar to RoPE, it would rotate chunks of the vectors, but it would be able to rotate chunks in more than two dimensions and with dynamic, non-commutative angles. This is one of many applications of the algorithm that I will continue to research, and I hope that others are interested in its applications as well. If you're interested, reach out to me at [mikayahlevi@gmail.com](mailto:mikayahlevi@gmail.com), Reddit, GitHub, or any other platform where you can find me.
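One speculative way to read the RoPE-like idea is sketched below in NumPy. Everything here is an assumption for illustration: the post does not specify how the rotations are generated or applied, so this sketch builds one orthogonal matrix per token with the Cayley map from hypothetical per-token parameters, takes their cumulative product, and applies it to a single query/key chunk.

```python
import numpy as np

def skew_from_vector(v, n):
    """Fill the strict upper triangle with v, then antisymmetrize."""
    a = np.zeros((n, n))
    a[np.triu_indices(n, k=1)] = v
    return a - a.T

def cayley(a):
    """Cayley map: orthogonal output for skew-symmetric input."""
    eye = np.eye(a.shape[0])
    return np.linalg.solve(eye - a, eye + a)

def cumulative_rotations(params, n):
    """R_t = O_0 @ O_1 @ ... @ O_t, with each O_t orthogonal (hypothetical setup)."""
    rots = []
    acc = np.eye(n)
    for p in params:
        acc = acc @ cayley(skew_from_vector(p, n))
        rots.append(acc)
    return np.stack(rots)

rng = np.random.default_rng(0)
T, n = 5, 4
params = 0.5 * rng.standard_normal((T, n * (n - 1) // 2))  # assumed to come from the tokens
q = rng.standard_normal((T, n))  # one n-dimensional chunk of each query
k = rng.standard_normal((T, n))  # one n-dimensional chunk of each key

R = cumulative_rotations(params, n)
q_rot = np.einsum("tij,tj->ti", R, q)  # rotate each query chunk by R_t
k_rot = np.einsum("tij,tj->ti", R, k)  # rotate each key chunk by R_t
scores = q_rot @ k_rot.T
```

Under these assumptions, the score between positions i and j (j ≤ i) equals q_i^T (O_{j+1} ... O_i)^T k_j, so it depends only on the rotations between the two positions, analogous to how RoPE scores depend only on relative position. Rotations also preserve the norms of the query and key chunks.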

submitted by /u/mikayahlevi