
Disillusionment with mechanistic interpretability research [D]


Hey all, apologies if this is the wrong place to post this. I'm currently an undergrad computer scientist who got swept up in the mechanistic interpretability wave c. 2024 (sparse autoencoders, attribution graphs) and found it generally promising (and still do). That said, a lot of the new research out of Anthropic (which I understand to be the mech interp house) doesn't sit well with me.
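For anyone unfamiliar, the sparse-autoencoder idea is just: learn an overcomplete dictionary of features that reconstructs a model's activations, with a sparsity penalty so only a few features fire per input. Here's a minimal toy sketch in numpy (my own illustration with made-up dimensions, not any lab's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 8, 32  # activation size, dictionary size (overcomplete)

# Randomly initialized encoder/decoder weights (untrained, for illustration)
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode activations x into sparse features, then reconstruct."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU -> sparse feature activations
    x_hat = f @ W_dec + b_dec               # linear reconstruction
    return f, x_hat

x = rng.normal(size=(4, d_model))  # a batch of fake activations
f, x_hat = sae_forward(x)

# Training would minimize reconstruction error plus an L1 sparsity penalty:
l1_coeff = 1e-3
loss = np.mean((x - x_hat) ** 2) + l1_coeff * np.abs(f).mean()
print(f.shape, x_hat.shape)
```

The point is that the features `f` are an explicit, inspectable bottleneck, which is what made the approach feel like genuine interpretability to me.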

They recently published a blogpost on so-called "natural language autoencoders": training one LLM to compress activations into a natural-language description and another LLM to reconstruct the activations from that description. This seems extremely suspect to me. For starters, it's a black-box technique (which makes the claim that it helps us understand model internals very weak), and they don't compare basic metrics (FVE, reconstruction error) against SAE baselines. Moreover, the paper mentions so-called "confabulations", cases where the "activation verbalizer" module just makes things up when explaining the activations, which to me defeats the entire purpose of the concept, since at test time you may never know whether an explanation is confabulated.
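To be concrete about the missing comparison: both metrics are cheap to compute for any reconstruction method, so the absence of the numbers is what bothers me. A sketch of what I mean (pure numpy; the "reconstruction" here is a noisy stand-in, not either method's actual output):

```python
import numpy as np

def reconstruction_metrics(x, x_hat):
    """Return (MSE, FVE) for reconstructions x_hat of activations x."""
    mse = np.mean((x - x_hat) ** 2)
    total_var = np.mean((x - x.mean(axis=0)) ** 2)
    fve = 1.0 - mse / total_var  # 1.0 = perfect, 0.0 = no better than the mean
    return mse, fve

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 16))                    # fake activations
x_hat = x + rng.normal(scale=0.1, size=x.shape)   # slightly noisy reconstruction

mse, fve = reconstruction_metrics(x, x_hat)
print(round(float(mse), 3), round(float(fve), 3))
```

If an NL autoencoder reported these side by side with an SAE trained on the same activations, we'd at least know how much fidelity the natural-language bottleneck costs.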

Granted, the blogpost acknowledges most of these issues, and they do seem to achieve good results on a misaligned-model auditing benchmark (though the utility of this again seems dubious to me; I've never been one for AI x-risk arguments). But overall it seems that Anthropic, especially recently, care less about interpretability than about scalable alignment/oversight, and are happy to sacrifice the former if it means better progress on the so-called control problem. Given how closely the field seems to track Anthropic's movements, I'm concerned that this is where mech interp is heading.

Let me know if this is the wrong place to post this.

submitted by /u/Carbon1674


Tagged with

#mechanistic interpretability#sparse autoencoders#attribution graphs#natural language autoencoders#black box technique#reconstruction error#confabulations#activation verbalizer#model internals#FVE#misaligned model auditing#AI x-risk arguments