4 Comments
Sebastian Bordt

great post!

The Column Space

I agree with:

> If interpretability methods cannot yield concrete predictions about real-world neural network phenomena, we are not studying the model; we are reading tea leaves

But I just don't think the antecedent is satisfied. We often do validate interpretability insights by predicting NN behavior, and especially behavior when the NN's internal representations are altered in some way.

I'm quite fond of my work https://arxiv.org/abs/2506.10138, which predicts the actions of a reinforcement learner 50 steps in advance by looking at particular parts of the latent state (see Fig. 4).
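
To give a flavour of the kind of prediction I mean, here's a toy sketch, not the code from the paper; `latents` and `actions` are placeholder names for data you'd record from rollouts:

```python
# Toy sketch: fit a linear probe mapping the agent's latent state at time t
# to the action it ends up taking at time t + 50. Placeholder data layout:
# latents has shape (T, d), actions has shape (T,) with discrete action ids.
import numpy as np
from sklearn.linear_model import LogisticRegression

K = 50  # how far ahead we try to predict

def fit_future_action_probe(latents, actions, k=K):
    X = np.asarray(latents)[:-k]   # latent state at time t
    y = np.asarray(actions)[k:]    # action actually taken at time t + k
    probe = LogisticRegression(max_iter=1000).fit(X, y)
    # In practice you'd report accuracy on held-out rollouts, not in-sample.
    return probe, probe.score(X, y)
```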

Circuits and concepts are hypotheses and are treated as such.

Suraj Srinivas

Thanks for your comment!

We do validate our insights, but AFAIK the validation is often limited in scope. For instance, steering methods do have somewhat predictable behaviour, but the effect is often very weak (https://arxiv.org/abs/2411.04430). I wouldn't update my mental picture of model behaviour based on weak validation results.
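
Just to be concrete about the kind of intervention I have in mind, here's a minimal activation-steering sketch (assuming a GPT-2-style HuggingFace model where `model.transformer.h[layer_idx]` is a transformer block; the layer index and scale are illustrative, not from any particular paper):

```python
# Minimal sketch: add a fixed direction to the residual stream at one layer
# via a forward hook, so generations are nudged toward whatever concept the
# steering vector encodes. Assumes a GPT-2-style module layout.
import torch

def add_steering_hook(model, layer_idx, steering_vector, alpha=4.0):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * steering_vector.to(hidden.dtype).to(hidden.device)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return model.transformer.h[layer_idx].register_forward_hook(hook)

# handle = add_steering_hook(model, layer_idx=12, steering_vector=v)
# ... generate and inspect behaviour, then handle.remove() to undo the edit.
```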

I haven't looked at your paper closely, but if your explanation can accurately predict behaviour for your task, then the abstraction clearly works for that specific task!

The Column Space

I see, makes sense.

> I wouldn't update my mental picture of model behaviour based on weak validation results.

Yeah, for sure not.

I guess whether it is weak or not depends on the particular paper. I was under the impression that folks generally validate their interventions, but I guess that's not necessarily the case.