The Reification Fallacy: Interpretability Studies Imaginary Entities
... but pretends that they are real
Deep learning is grounded in the tangible: weights, loss functions, and datasets—entities that are accessible, strictly defined, and measurable. Yet, interpretability research has drifted into the study of “imaginary quantities,” constructs that exist primarily in the researcher’s mind rather than the model’s computation.
We spend enormous effort estimating feature importance, identifying concepts via sparse autoencoders (SAEs), and isolating circuits. We frame this work as “estimating” these values, treating them as latent physical constants awaiting discovery. This is a reification fallacy: we treat constructs of our own making as if they were objects inside the model.
This is problematic for two reasons. First, the quantities are ill-defined. Despite thousands of papers on concept-based methods, “concept” lacks a formal definition, and “feature importance” has no single ground-truth definition either. We are estimating undefined variables. Second, extraction is brittle. Attribution methods, whether gradient-based, perturbation-based, or masking-based, return answers that depend heavily on the chosen baseline or perturbation scheme; SAE concepts fluctuate wildly with architecture, sparsity penalty, and random seed. These quantities are not intrinsic to the model; they are artifacts of the external information we inject to find them.
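To make the baseline dependence concrete, here is a minimal sketch, assuming a toy PyTorch model and a hypothetical input rather than any published setup: an integrated-gradients-style attribution computed twice on the same model and input, swapping only the reference point we inject.

```python
import torch

torch.manual_seed(0)

# Toy stand-in for a trained network; in practice this would be the model under study.
model = torch.nn.Sequential(
    torch.nn.Linear(4, 8), torch.nn.ReLU(), torch.nn.Linear(8, 1)
)
x = torch.tensor([[0.9, -0.3, 0.5, 0.1]])  # hypothetical input

def integrated_gradients(model, x, baseline, steps=64):
    # Average the gradient along the straight path from baseline to input,
    # then scale by (input - baseline).
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, 1)
    path = baseline + alphas * (x - baseline)  # (steps, n_features)
    path.requires_grad_(True)
    model(path).sum().backward()
    return (x - baseline) * path.grad.mean(dim=0)

ig_zero = integrated_gradients(model, x, torch.zeros_like(x))
ig_noise = integrated_gradients(model, x, torch.randn_like(x))  # arbitrary alternative baseline

# Same model, same input; the "important" features can still come out ranked differently.
print("zero-baseline ranking  :", ig_zero.abs().argsort(descending=True))
print("random-baseline ranking:", ig_noise.abs().argsort(descending=True))
```

Nothing about the model changes between the two calls; only the reference point we supply does, yet the importance ranking we report can shift with it.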
This reliance on imaginary quantities is not inherently fatal: physics routinely relies on unobservables like gravity or potential energy. However, these constructs are scientifically valid solely because they serve a function: predicting real, measurable phenomena. Their “reality” is derived entirely from their predictive utility.
Interpretability fails to build this bridge. Identifying a “circuit” is frequently treated as the terminal goal rather than an intermediate step toward solving real problems. When real-world validation does occur, it is often tenuous or restricted to toy settings (e.g., faithfulness scores computed under unrealistic perturbations, or interventions with weak effects). If an abstraction is technically valid but effectively useless for predicting real-world model behavior, it is scientifically null.
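For concreteness, this is roughly what such toy-setting validation looks like, as a minimal sketch with a hypothetical model and made-up attribution scores rather than any specific benchmark: mask the top-attributed features and report how much the output moves. The masked inputs sit off the data distribution, which is exactly the “unrealistic perturbation” worry.

```python
import torch

torch.manual_seed(0)

model = torch.nn.Sequential(
    torch.nn.Linear(4, 8), torch.nn.ReLU(), torch.nn.Linear(8, 1)
)
x = torch.tensor([[0.9, -0.3, 0.5, 0.1]])      # hypothetical input
scores = torch.tensor([[0.7, 0.1, 0.5, 0.2]])  # attributions from some method

def deletion_drop(model, x, scores, k=2, mask_value=0.0):
    # Replace the k highest-scoring features with a mask value and measure
    # how much the model's output changes. The masked input is typically
    # far from anything the model sees in deployment.
    top = scores.abs().topk(k, dim=-1).indices
    x_masked = x.clone()
    x_masked.scatter_(-1, top, mask_value)
    with torch.no_grad():
        return (model(x) - model(x_masked)).item()

print("output drop after masking top-2 features:", deletion_drop(model, x, scores))
```

A large drop is then read as evidence that the attribution was “faithful,” even though the measurement is made on inputs the model was never meant to handle.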
If interpretability methods cannot yield concrete predictions about real-world neural network phenomena, we are not studying the model; we are reading tea leaves. If we cannot faithfully validate these abstractions after years of research, we must consider the possibility that the abstractions themselves are wrong.


great post!
I agree with:
> If interpretability methods cannot yield concrete predictions about real-world neural network phenomena, we are not studying the model; we are reading tea leaves
But I just don't think the antecedent holds. We often do validate interpretability insights by predicting NN behavior, and especially behavior when altering the NN's internal representations in some way.
I'm quite fond of my work, https://arxiv.org/abs/2506.10138, which predicts the actions of a reinforcement learner 50 steps in advance by looking at particular parts of the latent state (see Fig. 4).
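Not the paper's actual pipeline, just a generic sketch of the validation pattern I mean, with made-up shapes and random stand-in data: probe a slice of the latent state at time t and score how well it predicts the action taken k steps later. On real rollouts, above-chance accuracy is exactly the kind of concrete behavioral prediction the post asks for; on this random stand-in data it will sit near chance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
T, d_latent, n_actions, k = 5000, 64, 4, 50  # made-up rollout length and sizes

# Stand-in rollout: latent states over time and the action taken at each step.
latents = rng.normal(size=(T, d_latent))
actions = rng.integers(0, n_actions, size=T)

# Probe only a particular slice of the latent state (here: the first 8 dims)
# and ask whether it predicts the action taken k steps in the future.
probe = LogisticRegression(max_iter=1000).fit(latents[:-k, :8], actions[k:])
print("k-step action prediction accuracy:", probe.score(latents[:-k, :8], actions[k:]))
```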
Circuits and concepts are hypotheses and are treated as such.