I built a tool that shows you what GPT-2 is "thinking" in real time as it generates: a 3D graph of concept activations per token [R]

Our take

Introducing AXON, a groundbreaking tool that visualizes what GPT-2 "thinks" in real-time as it generates language. By leveraging a Sparse Autoencoder to decode the model's residual stream, AXON creates a dynamic 3D graph of concept activations per token. This innovative approach reveals fascinating insights, such as how geography and language features light up moments before specific tokens are generated, enhancing our understanding of model behavior.

The recent development of AXON, a tool that visualizes the inner workings of the GPT-2 model in real time, represents a significant leap forward in the field of mechanistic interpretability. By using a Sparse Autoencoder (SAE) to decompose the model's residual stream into human-interpretable features, the creator has opened a window into the model's thought process as it generates text. This initiative echoes discussions in our ongoing exploration of data annotation and the need for transparency in AI systems, as seen in articles like [Comparing data annotation platforms [D]](/post/comparing-data-annotation-platforms-d-cmpdavtky03pjs0glvzsmbpph) and the implications of recent conference proceedings [ICML Proceedings-only [D]](/post/icml-proceedings-only-d-cmpdav8qb03nvs0glgxk4sapc).

At its core, AXON displays a live 3D force graph that evolves as GPT-2 generates tokens. Each node represents an SAE feature, such as "European geography" or "capital cities", while the connections between nodes show which features co-activate on the same token. For instance, given the prompt "The capital of France is", geography, proper-noun, and completion-pattern features light up before the token "Paris" is generated. Watching this happen makes concrete how the model's next-token prediction draws on contextual cues already present in the prompt.

The implications of this tool extend beyond mere curiosity. It underscores the growing need for interpretability in AI systems, particularly as they become more integrated into decision-making processes across various sectors. As we strive for accountability in AI, tools like AXON can serve as critical resources for researchers and practitioners alike, allowing for a more nuanced understanding of model behavior. This aligns with the broader conversation about transparency in AI, which is crucial for developing trust and ensuring ethical applications. The questions raised by AXON's findings may prompt further research into the meaningfulness of the co-activation edges—are they mere noise, or do they reveal deeper insights into the model's operations?

Moreover, AXON can be pointed at other models, such as Pythia and Gemma, as long as a pretrained SAE is available for them, which broadens its usefulness for mechanistic interpretability. As AI technologies evolve, the ability to switch between architectures while keeping the same interpretive tooling will be valuable. This flexibility could foster collaborative research, echoing the sentiments expressed in our article on the challenges of maintaining consistency in data processes (ECCV 2026).

Looking ahead, the development of AXON poses important questions about the future of AI interpretability. Will this tool inspire a wave of innovations aimed at demystifying complex models, or will it remain a standalone venture? As we continue to explore the intersections of AI technology and human understanding, the insights gained from tools like AXON could play a pivotal role in shaping the next generation of data management and AI applications. The journey into mechanistic interpretability is just beginning, and the potential for transformative insights is vast.

Been going down a mechanistic interpretability rabbit hole for the past few weeks and ended up building this thing called AXON.

The idea: every time GPT-2 generates a token, its residual stream gets passed through a Sparse Autoencoder (Joseph Bloom's pretrained SAE). The SAE decomposes it into human-interpretable features, things like "European geography", "capital cities", "French language", and the backend streams those to the browser over WebSocket, where they show up as a live 3D force graph.
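To make the decomposition step concrete, here is a minimal sketch of what an SAE encoder pass looks like, in plain Python on toy numbers. It assumes the common ReLU form, `ReLU((x - b_dec) @ W_enc + b_enc)`; whether the pretrained SAE used here subtracts the decoder bias or uses a different activation is an assumption, and the names `sae_encode` and `top_features` are made up for illustration, not taken from AXON or SAELens.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]  # shape [d_model][n_features]


def sae_encode(resid: Vector, w_enc: Matrix, b_enc: Vector, b_dec: Vector) -> Vector:
    """Toy SAE encoder: ReLU((resid - b_dec) @ W_enc + b_enc).

    Assumption: a standard ReLU sparse autoencoder; real pretrained SAEs
    may differ in how they handle biases or normalisation.
    """
    centered = [x - b for x, b in zip(resid, b_dec)]
    acts = []
    for j in range(len(b_enc)):
        pre = b_enc[j] + sum(centered[i] * w_enc[i][j] for i in range(len(centered)))
        acts.append(max(0.0, pre))  # ReLU keeps only positively firing features
    return acts


def top_features(acts: Vector, k: int) -> List[Tuple[int, float]]:
    """Return up to k (feature_id, activation) pairs, strongest first."""
    active = [(i, a) for i, a in enumerate(acts) if a > 0.0]
    active.sort(key=lambda pair: pair[1], reverse=True)
    return active[:k]


# Toy example: d_model = 2, n_features = 3 (real SAEs are far larger).
resid = [1.0, 2.0]
w_enc = [[1.0, 0.0, -1.0],
         [0.0, 1.0, -1.0]]
b_enc = [0.0, -0.5, 0.0]
b_dec = [0.0, 0.0]

acts = sae_encode(resid, w_enc, b_enc, b_dec)
strongest = top_features(acts, k=2)
```

Only the features that survive the ReLU (and, in this sketch, a top-k cut) would be worth sending to the browser for a given token; the cut-off is a design choice, not something stated above.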

Nodes = SAE features. Edges = features that fired together on the same token. Node brightness = activation strength. The whole graph evolves token by token.
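The graph rules above can be sketched as a small data structure that is updated once per token. This is an illustrative sketch, not AXON's actual code: the class name `CoactivationGraph`, the choice to dim features that did not fire on the latest token to zero, and the use of a running co-firing count as edge weight are all assumptions.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Tuple


class CoactivationGraph:
    """Nodes are feature ids; edges link features that fired on the same token."""

    def __init__(self) -> None:
        self.brightness: Dict[int, float] = {}
        self.edge_counts: Dict[Tuple[int, int], int] = defaultdict(int)

    def update(self, active: List[Tuple[int, float]]) -> None:
        """Fold one token's active (feature_id, activation) pairs into the graph."""
        # Assumption: brightness reflects only the most recent token,
        # so previously seen features are dimmed before the update.
        for fid in self.brightness:
            self.brightness[fid] = 0.0
        for fid, act in active:
            self.brightness[fid] = act
        # Every unordered pair of features active on this token gets an edge.
        ids = sorted(fid for fid, _ in active)
        for a, b in combinations(ids, 2):
            self.edge_counts[(a, b)] += 1


graph = CoactivationGraph()
graph.update([(3, 1.0), (1, 0.5)])             # token 1: features 1 and 3 fire
graph.update([(1, 2.0), (3, 0.2), (7, 0.1)])   # token 2: features 1, 3 and 7 fire
```

Storing each edge under a sorted `(low_id, high_id)` key keeps the graph undirected, so the pair (1, 3) and the pair (3, 1) land on the same edge.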

What surprised me most: type "The capital of France is" and you can literally watch geography features, proper noun features, and completion-pattern features light up before the word "Paris" even gets generated. It's not what the model outputs that's interesting; it's what's happening right before it decides.

Stack: TransformerLens + SAELens on the backend, FastAPI WebSocket for streaming, Three.js + 3d-force-graph on the frontend. Runs on CPU (~800ms/token) or GPU (~35ms on a 4050). Labels come from Neuronpedia's API and get cached locally.

You can also swap in other models — GPT-2 medium/large/xl, Pythia variants, Gemma-2-2B — as long as there's a pretrained SAE for it in SAELens.

GitHub: https://github.com/09Catho/axon

Would love feedback and stars, especially from anyone who's worked with SAEs before. I'm curious whether the co-activation edges are actually meaningful or just noise at this layer.
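One rough way to probe the noise question, offered as a sketch rather than anything AXON does: compare how often two features fire together against how often they would if they fired independently. The ratio `P(a, b) / (P(a) * P(b))`, usually called lift, sits near 1 for pairs that co-fire about as often as chance predicts and rises well above 1 for pairs that co-fire more often. The function name and input format below are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Set, Tuple


def coactivation_lift(tokens: List[Set[int]]) -> Dict[Tuple[int, int], float]:
    """Lift for every feature pair that co-fired at least once.

    tokens: one set of active feature ids per generated token.
    """
    n = len(tokens)
    single: Counter = Counter()
    pair: Counter = Counter()
    for active in tokens:
        single.update(active)
        pair.update(combinations(sorted(active), 2))
    return {
        (a, b): (count / n) / ((single[a] / n) * (single[b] / n))
        for (a, b), count in pair.items()
    }


# Toy data: features 1 and 2 always fire together, feature 3 fires alone.
lift = coactivation_lift([{1, 2}, {1, 2}, {3}, {3}])
```

This is not a significance test, and with only a handful of tokens the ratio is dominated by sampling noise, so it would need a large, varied set of prompts before reading much into any single edge.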

submitted by /u/Financial_World_9730