May 27, 2026•2 min read•from Machine Learning

EMA-Gated Temporal Sequence Compression in Vision Transformers [P]

Our take

NeuroFlow is a framework for Vision Transformer video inference that targets the compute spent on regions of a video that do not change. By tracking semantic surprise in embedding space, it drops stationary tokens before the encoder and reports a 55.8x wall-clock speedup on high-resolution (1792p) video at about 97% embedding fidelity, without requiring fine-tuning. The project describes two variants: Dual-Memory Reconstruction, which is training-free, and a speed-oriented variant that uses sparse manifold distillation.

NeuroFlow targets a known inefficiency of Vision Transformers on video: much of the computation goes to stationary background that does not change between frames. It physically removes redundant background tokens through a dynamic routing framework, and the reported 55.8x wall-clock speedup is achieved without fine-tuning. If the result holds up, it matters most for real-time applications where both speed and fidelity are constraints.

At the core of NeuroFlow is an Exponential Moving Average (EMA) of patch-level embeddings, used to track semantic surprise in embedding space. This targets the mismatch between O(N^2) self-attention, whose cost grows quadratically with the number of tokens N, and natural video streams, which are highly redundant from frame to frame. The headline figure is 97% embedding fidelity, which describes how well the output embeddings are preserved and should not be read as a classification accuracy.

The implications extend beyond the headline numbers. If content-adaptive token pruning of this kind holds up under independent evaluation, it shifts attention from raw model capacity to how much computation a given input actually requires. Dropping tokens that carry no new information could make real-time video processing practical in more settings, from entertainment to surveillance.

The training-free variant, Dual-Memory Reconstruction, is also notable: it is reported to retain 92.4% of dense zero-shot top-1 accuracy at 84.0% token sparsity without modifying any weights. Because no retraining is involved, a gate of this kind could in principle be added to an existing pretrained encoder, which would shorten the path to deployment. Whether this holds across other backbones and video domains remains to be shown.

Looking ahead, it will be fascinating to observe how these advancements influence the broader landscape of AI-driven technology. Will we see a shift in focus from the sheer power of models to their efficiency and adaptability? As organizations strive to harness the full potential of AI, the ability to manage resources wisely—both computationally and in terms of data—will be paramount. The trajectory of developments like NeuroFlow could very well set the stage for a future where intelligent systems are not just powerful, but also remarkably efficient in delivering insights and actions that drive user success.

Vision Transformers waste 90% of their compute recalculating stationary asphalt. NeuroFlow tracks semantic surprise in embedding space, physically eliminating background tokens before the encoder.

Result: 55.8x wall-clock speedup for ViTs on high-res video (1792p) with 97% fidelity. No fine-tuning required.

NeuroFlow is a dynamic routing framework for Vision Transformer video inference. It exploits temporal redundancy by tracking per-patch semantic surprise via an Exponential Moving Average (EMA) of patch-level embeddings, addressing the architectural mismatch between O(N^2) self-attention and highly redundant natural video streams.
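The gating idea can be illustrated with a minimal NumPy sketch. This is not the NeuroFlow implementation: the function name `ema_gate`, the use of cosine distance as the "surprise" score, the decay `alpha`, the `threshold`, and the choice to update the EMA for every patch are all assumptions made for illustration. It only shows how a per-patch running average can flag which tokens changed enough to be worth recomputing.

```python
import numpy as np

def ema_gate(frames, alpha=0.9, threshold=0.1):
    """Hypothetical EMA gate over per-patch embeddings.

    frames: array of shape (T, N, D) with T frames, N patches, D dims.
    Returns a boolean mask of shape (T, N); True means "keep this patch".
    """
    num_frames, num_patches, _ = frames.shape
    ema = frames[0].astype(float)          # running average, one row per patch
    keep = np.zeros((num_frames, num_patches), dtype=bool)
    keep[0] = True                         # no history yet, so keep every patch

    for t in range(1, num_frames):
        x = frames[t]
        # Assumed surprise score: cosine distance between patch and its EMA.
        dot = np.sum(x * ema, axis=-1)
        norms = np.linalg.norm(x, axis=-1) * np.linalg.norm(ema, axis=-1)
        surprise = 1.0 - dot / (norms + 1e-8)
        keep[t] = surprise > threshold     # stationary patches are dropped
        # Assumed update rule: blend the new embedding into the average.
        ema = alpha * ema + (1.0 - alpha) * x
    return keep
```

Only the rows of a frame where the mask is True would be passed on to the encoder, so the sequence length seen by self-attention shrinks with the amount of motion in the scene.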

Key Contributions

Architecture C (Dual-Memory Reconstruction): A completely training-free inference engine that combines a Layer 0 Retinal Gate with a Layer 12 Cortical Cache. It achieves 71.55% zero-shot top-1 accuracy at 84.0% token sparsity on SigLIP, retaining 92.4% of dense accuracy without modifying any weights.
Architecture B (Extreme Wall-Clock Speedup): Physically eliminates stationary tokens before the encoder. With sparse manifold distillation, it reduces 1792p SigLIP 2 inference from 678 ms to 11.9 ms—a 55.80× wall-clock speedup at 97.37% embedding fidelity.
LLM Ablation: Characterises the architectural boundaries of applying similarity-gated bypass to autoregressive language models (Phi-3-mini), demonstrating 0% token drift in syntactically constrained generation.
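To give a rough sense of why token sparsity matters for O(N^2) self-attention, the short calculation below estimates the relative cost of the quadratic attention term at the 84.0% sparsity quoted above. It is a back-of-the-envelope sketch, not a reproduction of the reported timings: the helper `relative_attention_cost` is hypothetical, and it ignores the parts of a transformer that scale linearly in N, the gating overhead, and any reconstruction step.

```python
def relative_attention_cost(sparsity):
    """Fraction of dense quadratic attention cost left after dropping tokens.

    sparsity: fraction of tokens removed (0.84 means 84% are dropped).
    Assumes cost is proportional to the square of the kept token count.
    """
    kept_fraction = 1.0 - sparsity
    return kept_fraction ** 2

cost = relative_attention_cost(0.84)    # roughly 0.0256 of the dense cost
reduction = 1.0 / cost                  # roughly a 39x smaller quadratic term
print(f"relative cost: {cost:.4f}, reduction: {reduction:.1f}x")
```

Measured wall-clock gains depend on much more than this term, so the figure should be read only as an illustration of the quadratic scaling argument.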

Code and paper: https://github.com/ynnk-research/-NeuroFlow

submitted by /u/Bobby-Ly
