3 min read · from Machine Learning

Contrastive targeted SFT as a mechinterp method - has anyone mapped causal dependency interactions this way? [D]

Our take

The poster is exploring contrastive targeted Supervised Fine-Tuning (SFT) as a mechanistic interpretability method. The approach aims to map causal dependency interactions within large language models. The methodology involves iteratively training a model, tracing circuit interactions via ablation, and using those findings to inform subsequent training strategies, forming a closed loop for capability development. If it works, it could enable more targeted training by identifying upstream and downstream relationships between dimensions; the post describes a plan and early judge scores, not final results.

The recent post detailing an approach to understanding and refining large language models (LLMs) through targeted Supervised Fine-Tuning (SFT) and contrastive analysis is genuinely intriguing. The author’s self-taught, experiment-driven methodology, which seeks to map causal dependencies within a 31B model, echoes a growing need for explainability and control in increasingly complex AI systems. The core idea, iteratively refining a model by tracing circuit interactions and using that knowledge to inform subsequent training, is a departure from treating the model as a “black box”. This work builds upon existing research in circuit discovery and targeted SFT, but the key differentiator lies in the closed-loop feedback system, where mechanistic interpretability findings directly shape the training process. We’ve seen similar explorations of probing and analysis, such as in “How do you analyze the relative "strength" of probes? [R]”, which highlights the challenges in interpreting probe signals, and the pursuit of understanding model “strength” aligns with the author's goal of mapping internal capabilities. The challenge of distinguishing direct versus indirect effects of ablation, and the exploration of activation steering as a diagnostic tool, are promising avenues for future investigation; they echo the emphasis on rigorous evaluation and understanding model behavior in “Is ACL now irrelevant? [D]”.

The author’s proposed methodology addresses a critical gap in our current understanding of LLMs. While we've made strides in improving their performance, we often lack a clear picture of *how* they achieve these results. This lack of transparency hinders our ability to debug, control, and ultimately trust these models. The concept of building a causal dependency graph, where each node represents a capability dimension and the edges represent dependencies, offers a useful framework for visualizing and manipulating the model's internal workings. The planned testing of compositional ability through prompts requiring causal chaining is also a smart approach to validating the accuracy of the dependency graph. It's a move beyond isolated dimension scoring and towards understanding how dimensions interact. The difficulty in pinpointing direct versus indirect causal links is a valid concern, and the author's consideration of multi-layer ablation is a sensible initial approach.

The inherent beauty of this approach is its experimental nature. The author's willingness to self-teach and validate their findings through rigorous experimentation is commendable, especially given the relative lack of established methodology in this area. The proposed contrastive training strategy – pitting examples with deep and shallow representations of a dimension against each other – is a clever way to isolate the circuit responsible for that dimension. The potential for optimizing training order based on the causal graph is a significant advantage, allowing for more efficient and targeted fine-tuning. While the author acknowledges the possibility of reinventing existing methods, their focus on a closed-loop system and the combination of circuit tracing, ablation, and activation steering presents a uniquely holistic approach to understanding and controlling LLMs.

Looking ahead, the success of this methodology hinges on several factors. The robustness of the judge across 40 domains will be crucial in accurately identifying and scoring capability dimensions. Developing practical methods for resolving direct versus indirect dependencies in ablation experiments will be essential for building a reliable causal graph. Furthermore, the integration of this approach with existing LLM training pipelines could unlock significant efficiency gains and improve model controllability. The question remains: can this iterative, circuit-tracing approach become a standard practice for developing and refining LLMs, moving us closer to a future where we can truly understand and control the inner workings of these powerful AI systems?

Hi all, I've been running experiments on targeted SFT for specific capability dimensions on a 31B model. After a small training run to prime the model slightly in the direction I want, I ran a judge across 40 domains, scoring six independent quality dimensions. One dimension consistently scored weakest across five runs.

I am now training contrastive variants from the same checkpoint: examples with that dimension deep vs. examples with it deliberately shallow, with everything else held the same. The plan is to diff the two resulting checkpoints to locate the circuit, then ablate those heads and measure which OTHER dimensions degrade.
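To make the checkpoint-diff step concrete, here is a minimal Python sketch under stated assumptions. It ranks attention heads by how far their weights moved between the "deep" and "shallow" checkpoints. The function name, the `(layer, head)` keying, and the toy numbers are all hypothetical; a real checkpoint would first need its tensors split per head. Weight deltas are only a crude proxy, and comparing activations on held-out prompts is another option.

```python
import math

def rank_heads_by_weight_delta(deep_ckpt, shallow_ckpt, top_k=5):
    """Rank heads by the L2 norm of their weight difference between checkpoints.

    Both checkpoints are hypothetical dicts mapping (layer, head) to a flat
    list of that head's weights; a real checkpoint would need reshaping first.
    """
    scores = []
    for key, w_deep in deep_ckpt.items():
        w_shallow = shallow_ckpt[key]
        norm = math.sqrt(sum((a - b) ** 2 for a, b in zip(w_deep, w_shallow)))
        scores.append((key, norm))
    # Largest change first: these heads are the candidates for ablation.
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:top_k]

# Toy data: 3 layers x 4 heads, 16 weights per head (all values made up).
shallow = {(l, h): [0.01 * i for i in range(16)] for l in range(3) for h in range(4)}
deep = {key: list(w) for key, w in shallow.items()}
deep[(1, 2)] = [w + 0.5 for w in deep[(1, 2)]]  # pretend training moved this head

top = rank_heads_by_weight_delta(deep, shallow, top_k=3)
print(top[0])
```

In the toy data only layer 1, head 2 differs, so it ranks first. With real checkpoints, many heads will move a little, so the cutoff for "part of the circuit" is a judgment call.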

The idea is that if ablating dimension A's circuit causes dimension B's judge score to drop, there's a causal dependency in the network: B reads from A's residual stream output. If I can do this for each dimension, I can build a causal dependency graph of how capabilities relate inside the model.
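As a rough sketch of how ablation results could be turned into graph edges, the Python below records an edge A -> B whenever ablating A's circuit drops B's judge score by at least a threshold. The dimension names, scores, and the `min_drop` threshold are made-up assumptions; in practice the threshold should sit above the judge's run-to-run noise. An edge found this way may still be direct or indirect.

```python
def build_dependency_edges(baseline, ablated, min_drop=0.5):
    """Return {source: {target: drop}} for every score drop of at least min_drop.

    baseline: judge score per dimension with no ablation.
    ablated:  ablated[source][target] is target's score after ablating
              the circuit attributed to source.
    """
    edges = {}
    for source, scores in ablated.items():
        for target, score in scores.items():
            if target == source:
                continue  # a circuit hurting its own dimension is not a dependency
            drop = baseline[target] - score
            if drop >= min_drop:
                edges.setdefault(source, {})[target] = drop
    return edges

# Hypothetical judge scores for three dimensions (all numbers invented).
baseline = {"A": 7.0, "B": 7.0, "C": 6.0}
ablated = {
    "A": {"A": 3.0, "B": 5.5, "C": 5.0},
    "B": {"A": 7.0, "B": 3.0, "C": 5.0},
    "C": {"A": 7.0, "B": 7.0, "C": 2.0},
}

edges = build_dependency_edges(baseline, ablated)
print(edges)
```

Here ablating A hurts both B and C, and ablating B hurts C, so the sketch yields edges A -> B, A -> C, and B -> C.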

I would then use that graph to determine the optimal training order for future rounds: train upstream nodes first, which should also tell me which downstream nodes then get a better signal.
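Choosing an "upstream first" order from a dependency graph is a topological sort. A minimal Python sketch, assuming the graph is acyclic and using the standard-library `graphlib` module (Python 3.9+); the dimension names and edges are hypothetical:

```python
from graphlib import CycleError, TopologicalSorter

def training_order(edges):
    """Return dimensions ordered so every upstream node precedes its dependents.

    edges maps an upstream dimension to the dimensions that depend on it.
    TopologicalSorter wants the inverse (node -> predecessors), so invert first.
    """
    predecessors = {}
    for upstream, downstreams in edges.items():
        predecessors.setdefault(upstream, set())
        for downstream in downstreams:
            predecessors.setdefault(downstream, set()).add(upstream)
    return list(TopologicalSorter(predecessors).static_order())

# Hypothetical graph: A feeds B and C, and B feeds C.
order = training_order({"A": ["B", "C"], "B": ["C"]})
print(order)

# Mutual dependencies have no valid order; graphlib reports them as a cycle.
try:
    training_order({"A": ["B"], "B": ["A"]})
    found_cycle = False
except CycleError:
    found_cycle = True
```

If the measured graph contains cycles (A degrades B and B degrades A), a strict ordering does not exist, and those dimensions would probably need to be trained together or the weaker edge dropped.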

A few specific questions:

  1. Has anyone done iterative targeted SFT guided by circuit tracing between rounds, and/or used contrastive approaches to localize regions of the network? I can find papers on circuit discovery and papers on targeted SFT separately, which somewhat validate the idea, but nothing on the closed loop where mechinterp findings from one round determine the training strategy for the next. I also can't find work on which circuits interact with each other in isolated scenarios, or on how training in specific directions in a specific order changes behavior.
  2. For the contrastive ablation, does anyone have tips on what works best here, or on what could bring out more analysis?
  3. When tracing downstream dependencies via ablation, how do you distinguish direct from indirect effects? If ablating circuit A degrades dimension C, that could be A > C directly or A > B > C through an intermediate. Does anyone have a practical method for resolving this beyond ablating at multiple layers?
  4. After elemental training rounds, I plan to test whether dimensions compose naturally by running prompts that require causal chaining between two dimensions. For pairs that fail, I'm considering activation steering (injecting both dimension vectors simultaneously) as a diagnostic: if steering fixes it, it's possibly a routing problem; if not, it could be a capability gap. Has anyone combined steering with fine-tuning diagnostics like this?
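On question 3, one commonly discussed idea is in the spirit of activation patching and causal mediation analysis: ablate A, but restore B's activation to its clean value, and see how much of C's degradation remains. What remains is attributed to the direct A > C path; the rest went through B. The Python below is a toy sketch on a made-up scalar linear model, not a real transformer; all function names and weights are assumptions for illustration. In a real nonlinear network the split is only approximate.

```python
def run_toy_model(a, w_ab, w_ac, w_bc, b_override=None):
    """Toy stand-in for a forward pass: A feeds B, and both A and B feed C."""
    b = w_ab * a if b_override is None else b_override
    c = w_ac * a + w_bc * b
    return b, c

def split_effect(a_clean, w_ab, w_ac, w_bc):
    """Split the effect of ablating A on C into direct and via-B parts."""
    b_clean, c_clean = run_toy_model(a_clean, w_ab, w_ac, w_bc)
    # Plain ablation: zero A and let B change as a consequence.
    _, c_ablated = run_toy_model(0.0, w_ab, w_ac, w_bc)
    # Ablate A but patch B back to its clean value, closing the A > B > C path.
    _, c_b_restored = run_toy_model(0.0, w_ab, w_ac, w_bc, b_override=b_clean)
    total = c_clean - c_ablated
    direct = c_clean - c_b_restored
    indirect = total - direct
    return total, direct, indirect

# Case 1: A only reaches C through B (w_ac = 0).
via_b = split_effect(a_clean=2.0, w_ab=1.0, w_ac=0.0, w_bc=3.0)
# Case 2: A reaches C directly and B does not matter (w_bc = 0).
direct_only = split_effect(a_clean=2.0, w_ab=1.0, w_ac=0.5, w_bc=0.0)
print(via_b, direct_only)
```

In the first case restoring B fully recovers C, so the whole effect is indirect; in the second, restoring B changes nothing, so the whole effect is direct.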

For context, I don't have an ML background. I am self-taught through running experiments, but from what I am learning purely from first-principles understanding and experiments, it feels like if you can map these circuits and their direct, second-order, third-order, and higher-order interactions in isolated directions (say, for a group of related strengths/weaknesses you're directly trying to isolate and steer), wouldn't this potentially be a way to isolate circuits for stronger training runs? Btw, if anyone has any general topics or links that are super interesting around anything related to this, I'd be fascinated to see and learn about them!

If there's established methodology for any of this that I'm reinventing badly, I'd genuinely appreciate being pointed to it. I am so fascinated by this. It seems that if you could eventually solve this problem, you could get better behaviour control, or targeted understanding, more easily?

submitted by /u/Substantial_Diver469