Interesting point. I agree with the overall premise that encouraging smoothness is a good thing to do from an interpretability perspective! I'm also very big on the idea of encouraging it for CoT monitorability / LLM faithfulness in general.
In terms of framing, I would say that the problem is less about models that "agree at infinitely many points" and more about models that "agree nearly everywhere of interest".
In both of those papers from the course notes, LIME/Gradients are basically measuring how the model behaves in OOD regions that we don't actually care about (either unlikely tabular examples or arbitrary directions off of the image manifold), so we can change its behavior in this OOD regime without noticing. I guess this ties in with your paper about off-manifold robustness.
Final thought: You said "We train models to minimize loss on outputs, yet we rely on internals—features and gradients—to interpret behavior." Have you ever thought about trying to tackle some of these problems through looking at the location of the model on the loss-landscape instead? I was talking with Jesse Hoogland, founder of https://timaeus.co/ a while back, and he was saying his company has a few projects in this direction.
> the problem is less about models that "agree at infinitely many points" and more about models that "agree nearly everywhere of interest"
Somewhat agree, but I think it is interesting to think of the strongest / simplest possible statement that can be made ("infinitely many points") to reason about deficiencies, rather than the sufficient condition ("agree nearly everywhere of interest") where there is a lot of scope for nuance. :D
> Have you ever thought about trying to tackle some of these problems through looking at the location of the model on the loss-landscape instead?
I like data attribution partly for this reason: it makes no appeal to model internals at all. I'll look at timaeus though, thanks for the suggestion!
Cool perspective! Does the multiplicity problem persist in interpretability methods like integrated gradients that have a specific reference comparison point?
Also, do you have any resources for implementing some of these solutions for building more robust models?
Thanks! The multiplicity problem persists in all gradient-based explanation methods, without exception, because they all rest on the same underlying fact.
We have the codebase of our 2023 NeurIPS paper here: https://github.com/tml-tuebingen/pags
But basically, you can use any robust training method you like!
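To make the multiplicity point concrete, here is a minimal toy sketch in Python (NumPy). It is not the construction from the paper or the linked codebase: the functions `f` and `g`, the frequency `K`, the zero baseline, and the helper `integrated_gradients` are all assumptions chosen for illustration. The two functions agree on infinitely many inputs (every point with `x[0] = n*pi/K`), yet both their plain gradients and their integrated gradients differ at such a point.

```python
import numpy as np

K = 4.0  # hypothetical frequency of the perturbation


def f(x):
    return x[0] + x[1]


def grad_f(x):
    return np.array([1.0, 1.0])


def g(x):
    # Equals f wherever sin(K * x[0]) == 0, i.e. on the lines x[0] = n * pi / K.
    return x[0] + x[1] + np.sin(K * x[0]) * x[1]


def grad_g(x):
    return np.array([
        1.0 + K * np.cos(K * x[0]) * x[1],
        1.0 + np.sin(K * x[0]),
    ])


def integrated_gradients(grad_fn, x, baseline, steps=1000):
    # Midpoint Riemann-sum approximation of the path integral of the
    # gradient along the straight line from baseline to x.
    alphas = (np.arange(steps) + 0.5) / steps
    avg_grad = np.mean(
        [grad_fn(baseline + a * (x - baseline)) for a in alphas], axis=0
    )
    return (x - baseline) * avg_grad


x = np.array([np.pi / K, 1.0])  # a point where f and g agree
baseline = np.zeros(2)          # assumed reference point

ig_f = integrated_gradients(grad_f, x, baseline)
ig_g = integrated_gradients(grad_g, x, baseline)

print("outputs:   ", f(x), g(x))
print("gradients: ", grad_f(x), grad_g(x))
print("int. grads:", ig_f, ig_g)
```

Both models give the same output at `x` and at the baseline, so the integrated-gradients attributions sum to the same total for each; how that total is split across the two features is what differs. Fixing a reference point constrains the sum of the attributions, not the individual values.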