Discussion about this post

Sebastian Bordt:

great post!

The Column Space:

I agree with:

> If interpretability methods cannot yield concrete predictions about real-world neural network phenomena, we are not studying the model; we are reading tea leaves

But I just don't think the antecedent holds. We often validate interpretability insights by turning them into predictions of NN behavior, especially behavior observed when altering the network's internal representations in some way.

I'm quite fond of my own work, https://arxiv.org/abs/2506.10138, which predicts the actions of a reinforcement learner 50 steps in advance by looking at particular parts of the latent state (see Fig. 4).
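For concreteness, here is a minimal, hypothetical sketch of the general idea of probing a latent state to predict actions far in the future. It is not the method from the linked paper; the linear probe, the data shapes, and the `fit_future_action_probe` helper are all illustrative assumptions.

```python
# Hypothetical sketch (not the method of arXiv:2506.10138): fit a linear probe
# that maps the agent's latent state at time t to the action taken at t + horizon.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_future_action_probe(latents, actions, horizon=50):
    """Fit a probe predicting the action taken `horizon` steps after each latent state."""
    X = latents[:-horizon]   # latent state at time t
    y = actions[horizon:]    # action actually taken at t + horizon
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X, y)
    return probe

# Toy usage with synthetic data: 1000 steps, 32-dim latents, 4 discrete actions.
rng = np.random.default_rng(0)
latents = rng.normal(size=(1000, 32))
actions = rng.integers(0, 4, size=1000)
probe = fit_future_action_probe(latents, actions, horizon=50)
print("train accuracy:", probe.score(latents[:-50], actions[50:]))
```

If such a probe predicts real behavior well above chance, that is exactly the kind of concrete, falsifiable prediction the quoted passage asks for.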

Circuits and concepts are hypotheses and are treated as such.

