great post!
I agree with:
> If interpretability methods cannot yield concrete predictions about real-world neural network phenomena, we are not studying the model; we are reading tea leaves
But I just don't think the antecedent is satisfied. We often validate interpretability insights by predicting NN behavior, and especially behavior when we alter the NN's internal representations in some way.
I'm quite fond of my work https://arxiv.org/abs/2506.10138, which predicts the actions of a reinforcement learner 50 steps in advance by looking at particular parts of its latent state (see Fig. 4).
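For concreteness, here's a minimal sketch of the kind of prediction I mean: a linear probe trained to predict the action taken k steps in the future from a slice of the latent state. The data, the chosen dimensions, and the probe itself are placeholders, not the actual setup from the paper.

```python
# Minimal sketch (not the paper's code): train a linear probe that predicts
# the action an agent will take k steps later from a hypothesized slice of
# its current latent state. Latents and actions here are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
T, d_latent, n_actions, k = 5000, 64, 4, 50    # assumed trajectory length, dims, horizon

latents = rng.normal(size=(T, d_latent))        # stand-in for recorded latent states
actions = rng.integers(0, n_actions, size=T)    # stand-in for recorded actions

# Pair each latent state with the action taken k steps later.
X, y = latents[:-k], actions[k:]

# Restrict the probe to a hypothesized sub-slice of the latent state
# ("particular parts of the latent state").
slice_idx = np.arange(0, 16)                    # assumed relevant dimensions
probe = LogisticRegression(max_iter=1000).fit(X[:, slice_idx], y)

# In-sample accuracy for brevity; a real evaluation would use held-out episodes.
print("probe accuracy (k=50):", probe.score(X[:, slice_idx], y))
```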
Circuits and concepts are hypotheses and are treated as such.
Thanks for your comment!
We do validate our insights, but AFAIK the validation is often fairly limited in scope. For instance, steering methods do produce somewhat predictable behaviour, but the effect is often very weak (https://arxiv.org/abs/2411.04430). I wouldn't update my mental picture of model behaviour based on weak validation results.
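To be concrete about what I mean by a steering intervention, here's a minimal sketch: add a fixed direction to one layer's activations via a forward hook. The model, layer, steering vector, and scale are all illustrative, not the method from the linked paper.

```python
# Minimal sketch of activation steering on a generic PyTorch model;
# the hooked layer, direction, and strength are arbitrary assumptions.
import torch
import torch.nn as nn

model = nn.Sequential(             # stand-in for a real network
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

steer_vec = torch.randn(64)        # hypothetical "concept" direction
alpha = 4.0                        # steering strength

def add_steering(module, inputs, output):
    # Shift this layer's activations along the chosen direction.
    return output + alpha * steer_vec

handle = model[2].register_forward_hook(add_steering)  # hook the second Linear layer

x = torch.randn(8, 32)
steered_logits = model(x)          # forward pass with the intervention applied
handle.remove()                    # clean up; later passes are unsteered
print(steered_logits.shape)
```

Validating the insight then amounts to checking whether the steered outputs change in the predicted direction, and by how much.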
I haven't looked at your paper closely, but if your explanation can accurately predict behaviour, then the abstraction clearly works for your specific task!
I see, makes sense.
> I wouldn't update my mental picture of model behaviour based on weak validation results.
Yeah, for sure not.
I guess whether the validation is weak or not depends on the particular paper. I was under the impression that folks generally validate their interventions, but I guess that's not necessarily the case.