2 min read, from Machine Learning

High Dimensional, Dynamic Rotary Positional Embedding [P]

Our take

High Dimensional, Dynamic Rotary Positional Embedding (HDD-RoPE) builds on a cumulative matrix product to treat position as multidimensional rather than linear. In initial results with a GPT-2-like model trained on TinyStories, validation loss converges faster than for a baseline transformer using xPos. The accompanying GitHub repository covers the math and architecture details and lets you replicate the findings.

Positional embeddings remain an active area of work in large language models (LLMs). Mikayah Levi's High Dimensional, Dynamic Rotary Positional Embedding (HDD-RoPE), detailed in a recent Reddit post, builds on Rotary Positional Embedding (RoPE), a widely used technique that encodes relative position by rotating queries and keys. Related efforts to improve model capability and efficiency include [Find the best open-source OCR models in one place at Papers with Code] and [Compiling Agentic Workflows into LLM Weights]. HDD-RoPE's core idea, treating position within a sequence as multidimensional rather than linear, could help a model capture more complex relationships within text.

Standard RoPE, as Levi explains, splits the queries and keys into pairs of dimensions and rotates each pair at a predefined rate. HDD-RoPE extends this in two ways: chunks can be any size, and the amount of rotation is data-dependent. The model can therefore learn *how* to advance positions based on the information in the current layer's activations. The reported results on TinyStories, where validation loss converges faster than for a baseline transformer using xPos, provide early support for the approach. TinyStories is not representative of all downstream tasks, but faster convergence suggests a possible efficiency gain, which matters more as models grow. The GitHub repository makes the results replicable and opens the technique to outside scrutiny and refinement.

The idea matters beyond training speed. Multidimensional position challenges the assumption, built into many existing architectures, that a sequence is a simple linear progression. If a token can hold a position within learned constructs such as paragraphs or sentences, the model may gain a richer view of semantic relationships and dependencies. That could help in tasks that depend on context, such as long-form generation, complex reasoning, and dialogue, though the post does not test these. Related work on efficiency and deployment includes [Kuma: compiling PyTorch models into self-contained WebGPU executables].

Looking ahead, the key question is how HDD-RoPE scales to larger models and more complex datasets. The TinyStories result is encouraging, but real-world applicability depends on performance in more demanding settings. Further research could explore the best chunk size for different tasks and whether data-dependent rotation generalizes to other embedding techniques. HDD-RoPE is a promising direction for learning richer positional representations.

High Dimensional, Dynamic Rotary Positional Embedding [P]

At the end of my last post, I presented an idea: what if I used the core of my last project, the cumulative matrix product, and repurposed it as a positional embedding?

I just finished fleshing out the math behind HDD-RoPE and training a model with this positional embedding algorithm, and the results are excellent. When trained on the TinyStories dataset, the validation loss converges noticeably faster than that of the baseline transformer trained using xPos.

[Figure: a GPT-2-like model trained on TinyStories with hyperparameters copied from https://huggingface.co/roneneldan/TinyStories-33M (n_blocks=4, d_model=d_k=d_v=768)]

The repo at https://github.com/mikayahlevi/hdd-rope/ allows you to replicate the results and goes in depth about the math and details of the architecture.

Standard RoPE breaks the queries and keys into groups of two and rotates each pair at a predefined rate. This lets the model learn relative position by observing the change of basis between the queries and keys. Pairs make intuitive sense for a linear sequence: a two-dimensional chunk rotates with a single degree of freedom, which corresponds to position advancing along one dimension.
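To make the relative-position property concrete, here is a minimal sketch of standard RoPE in plain Python. It is not taken from the repo: the function names, the `base=10000.0` frequency schedule, and the example vectors are assumptions chosen for illustration. It shows that the query-key dot product depends only on the offset between the two positions.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by an angle proportional to `pos`."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        # Each pair has its own fixed rate (the common RoPE schedule, assumed here).
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Arbitrary example query and key vectors (two pairs each).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# The same offset (3 positions apart) at two different absolute positions.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 105), rope(k, 102))
print(abs(score_a - score_b) < 1e-9)
```

The two scores agree up to floating-point error because rotating the query by its position and the key by its position leaves only the difference of the two angles in the dot product.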
HDD-RoPE moves past this intuition and instead treats position within a sequence as multidimensional. The chunks can therefore be any size, such as 4 in the TinyStories example. A four-dimensional chunk has 4 choose 2 = 6 axes of rotation (one per pair of coordinates), which gives a 6-dimensional position. Essentially, we're saying that a token doesn't just lie at a position within the sequence, but at a position within any construct the model can learn, such as a paragraph or sentence.
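The count of 6 can be checked directly. The sketch below enumerates the coordinate planes of a 4-dimensional chunk and builds one rotation from six angles. Composing one plane rotation per pair is an assumption made for illustration, as are all the function names; the repo may parameterize the rotation differently.

```python
import math
from itertools import combinations

def rotation_planes(n):
    """All coordinate planes (i, j) of an n-dimensional chunk."""
    return list(combinations(range(n), 2))

def identity(n):
    return [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def plane_rotation(n, i, j, angle):
    """n x n identity with a rotation by `angle` in the (i, j) plane."""
    m = identity(n)
    c, s = math.cos(angle), math.sin(angle)
    m[i][i], m[i][j] = c, -s
    m[j][i], m[j][j] = s, c
    return m

def chunk_rotation(n, angles):
    """Compose one plane rotation per angle (illustrative parameterization)."""
    planes = rotation_planes(n)
    if len(angles) != len(planes):
        raise ValueError("need one angle per rotation plane")
    r = identity(n)
    for (i, j), angle in zip(planes, angles):
        r = matmul(r, plane_rotation(n, i, j, angle))
    return r

planes = rotation_planes(4)
print(len(planes))  # 4 choose 2 planes

# Six arbitrary angles, one per plane.
R = chunk_rotation(4, [0.1, -0.4, 0.25, 0.7, -0.3, 0.05])

# A rotation matrix is orthogonal, so R times its transpose is the identity.
RRt = matmul(R, [list(col) for col in zip(*R)])
```

With chunk size 2 the same code yields a single plane and a single angle, which is the standard RoPE case.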
To facilitate this, I also make the amount of rotation along each axis data-dependent, so that the model can learn how to advance the positions based on information stored in the current layer's activations.
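One way to picture data-dependent rotation combined with a cumulative matrix product is sketched below. Each token produces its own rotation, and a running product of those rotations is applied to the queries and keys. This is a reading of the post, not the repo's implementation: the multiplication order, the plane-rotation parameterization, and `angles_from_activation` (a stand-in for a learned projection) are all assumptions.

```python
import math
from itertools import combinations

N = 4  # chunk size; 4 matches the TinyStories example in the post
PLANES = list(combinations(range(N), 2))  # the 6 rotation planes

def identity(n):
    return [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def matvec(m, v):
    return [sum(x * y for x, y in zip(row, v)) for row in m]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def step_rotation(angles):
    """One token's rotation: a product of plane rotations (assumed form)."""
    r = identity(N)
    for (i, j), angle in zip(PLANES, angles):
        g = identity(N)
        c, s = math.cos(angle), math.sin(angle)
        g[i][i], g[i][j], g[j][i], g[j][j] = c, -s, s, c
        r = matmul(r, g)
    return r

def angles_from_activation(x):
    """Hypothetical stand-in for a learned map from activations to 6 angles."""
    return [math.tanh(x[i] + x[j]) for i, j in PLANES]

def cumulative_rotations(activations):
    """P_t = R_0 R_1 ... R_t, one matrix per token."""
    p, out = identity(N), []
    for x in activations:
        p = matmul(p, step_rotation(angles_from_activation(x)))
        out.append(p)
    return out

def score(q, k, P, i, j):
    """Logit between a query at position i and a key at position j."""
    return dot(matvec(P[i], q), matvec(P[j], k))

# Arbitrary example activations (one 4-dimensional chunk per token).
acts = [[0.2, -0.1, 0.4, 0.0], [1.0, 0.3, -0.5, 0.2],
        [-0.7, 0.6, 0.1, 0.9], [0.3, 0.3, -0.2, -0.4]]
q, k = [0.5, -1.0, 0.3, 0.8], [1.2, 0.1, -0.4, 0.6]

P = cumulative_rotations(acts)
base = score(q, k, P, 3, 1)

# Changing the activation at position 0 changes every P_t, but the logit
# between positions 3 and 1 only sees the rotations at positions 2 and 3.
acts_changed = [[9.0, -3.0, 2.0, 1.0]] + acts[1:]
changed = score(q, k, P=cumulative_rotations(acts_changed), i=3, j=1)
print(abs(base - changed) < 1e-9)
```

Under these assumptions the shared prefix of the two cumulative products cancels, so the logit depends only on the rotations accumulated between the key and the query. That is the data-dependent analogue of RoPE's relative offset.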

If you would like to learn more, please check out the repo. I formalize the math and lay out a roadmap.

submitted by /u/mikayahlevi