The Reification Fallacy: Interpretability Studies Imaginary Entities
... but pretends that they are real
Deep learning is grounded in the tangible: weights, loss functions, and datasets—entities that are accessible, strictly defined, and measurable. Yet, interpretability research has drifted into the study of “imaginary quantities,” constructs that exist primarily in the researcher’s mind rather than the model’s computation.
We spend enormous effort estimating feature importance, identifying concepts via sparse autoencoders (SAEs), and isolating circuits. We frame this work as “estimating” these values, treating them as latent physical constants awaiting discovery. This is a reification fallacy: we treat constructs of our own making as if they were objects inside the model.
This is problematic for two reasons. First, the quantities are ill-defined. Despite thousands of papers on concept-based methods, “concept” lacks a formal definition, and “feature importance” has no single ground-truth definition either. We are estimating undefined variables. Second, extraction is brittle. Attribution methods, whether gradient-based, perturbation-based, or masking-based, return answers that depend heavily on the chosen baseline or perturbation scheme; SAE concepts fluctuate wildly with architecture, sparsity penalty, and random seed. These quantities are not intrinsic to the model; they are artifacts of the external information we inject to find them.
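To make the baseline dependence concrete, here is a minimal sketch, assuming a toy PyTorch model and a hypothetical input rather than any published setup: an integrated-gradients-style attribution computed twice on the same model and input, swapping only the reference point we inject.

```python
import torch

torch.manual_seed(0)

# Toy stand-in for a trained network; in practice this would be the model under study.
model = torch.nn.Sequential(
    torch.nn.Linear(4, 8), torch.nn.ReLU(), torch.nn.Linear(8, 1)
)
x = torch.tensor([[0.9, -0.3, 0.5, 0.1]])  # hypothetical input

def integrated_gradients(model, x, baseline, steps=64):
    # Average the gradient along the straight path from baseline to input,
    # then scale by (input - baseline).
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, 1)
    path = baseline + alphas * (x - baseline)  # (steps, n_features)
    path.requires_grad_(True)
    model(path).sum().backward()
    return (x - baseline) * path.grad.mean(dim=0)

ig_zero = integrated_gradients(model, x, torch.zeros_like(x))
ig_noise = integrated_gradients(model, x, torch.randn_like(x))  # arbitrary alternative baseline

# Same model, same input; the "important" features can still come out ranked differently.
print("zero-baseline ranking  :", ig_zero.abs().argsort(descending=True))
print("random-baseline ranking:", ig_noise.abs().argsort(descending=True))
```

Nothing about the model changes between the two calls; only the reference point we supply does, yet the importance ranking we report can shift with it.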
This reliance on imaginary quantities is not inherently fatal: physics routinely relies on unobservables like gravity or potential energy. However, these constructs are scientifically valid solely because they serve a function: predicting real, measurable phenomena. Their “reality” is derived entirely from their predictive utility.
Interpretability fails to build this bridge. Identifying a “circuit” is frequently treated as the terminal goal rather than an intermediate step toward solving real problems. When real-world validation does occur, it is often tenuous or restricted to toy settings (e.g., faithfulness scores computed under unrealistic perturbations, or interventions with weak effects). If an abstraction is technically valid but effectively useless for predicting real-world model behavior, it is scientifically null.
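For concreteness, this is roughly what such toy-setting validation looks like, as a minimal sketch with a hypothetical model and made-up attribution scores rather than any specific benchmark: mask the top-attributed features and report how much the output moves. The masked inputs sit off the data distribution, which is exactly the “unrealistic perturbation” worry.

```python
import torch

torch.manual_seed(0)

model = torch.nn.Sequential(
    torch.nn.Linear(4, 8), torch.nn.ReLU(), torch.nn.Linear(8, 1)
)
x = torch.tensor([[0.9, -0.3, 0.5, 0.1]])      # hypothetical input
scores = torch.tensor([[0.7, 0.1, 0.5, 0.2]])  # attributions from some method

def deletion_drop(model, x, scores, k=2, mask_value=0.0):
    # Replace the k highest-scoring features with a mask value and measure
    # how much the model's output changes. The masked input is typically
    # far from anything the model sees in deployment.
    top = scores.abs().topk(k, dim=-1).indices
    x_masked = x.clone()
    x_masked.scatter_(-1, top, mask_value)
    with torch.no_grad():
        return (model(x) - model(x_masked)).item()

print("output drop after masking top-2 features:", deletion_drop(model, x, scores))
```

A large drop is then read as evidence that the attribution was “faithful,” even though the measurement is made on inputs the model was never meant to handle.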
If interpretability methods cannot yield concrete predictions about real-world neural network phenomena, we are not studying the model; we are reading tea leaves. If we cannot faithfully validate these abstractions after years of research, we must consider the possibility that the abstractions themselves are wrong.


great post!
I agree with:
> If interpretability methods cannot yield concrete predictions about real-world neural network phenomena, we are not studying the model; we are reading tea leaves
But I just don't think the antecedent holds. We often do validate interpretability insights by predicting NN behavior, and especially behavior when altering the NN's internal representations in some way.
I'm quite fond of my work, https://arxiv.org/abs/2506.10138, which predicts the actions of a reinforcement learner 50 steps in advance by looking at particular parts of the latent state (see Fig. 4).
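Not the paper's actual pipeline, just a generic sketch of the validation pattern I mean, with made-up shapes and random stand-in data: probe a slice of the latent state at time t and score how well it predicts the action taken k steps later. On real rollouts, above-chance accuracy is exactly the kind of concrete behavioral prediction the post asks for; on this random stand-in data it will sit near chance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
T, d_latent, n_actions, k = 5000, 64, 4, 50  # made-up rollout length and sizes

# Stand-in rollout: latent states over time and the action taken at each step.
latents = rng.normal(size=(T, d_latent))
actions = rng.integers(0, n_actions, size=T)

# Probe only a particular slice of the latent state (here: the first 8 dims)
# and ask whether it predicts the action taken k steps in the future.
probe = LogisticRegression(max_iter=1000).fit(latents[:-k, :8], actions[k:])
print("k-step action prediction accuracy:", probe.score(latents[:-k, :8], actions[k:]))
```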
Circuits and concepts are hypotheses and are treated as such.