Inside the Black Box:
The Science of AI Interpretability

We have built some of the most powerful computational systems in history. We do not know how they work. Mechanistic interpretability is the field trying to find out — and the answers are stranger than expected.

There is an uncomfortable fact at the centre of modern AI: the systems making consequential decisions in medicine, law, finance, and national security are, at the level of their internal mechanisms, opaque to their builders. We know their inputs and their outputs. We can measure their accuracy, their biases, their failure rates. We can steer their behaviour with prompts and fine-tuning. What we cannot do, in any rigorous sense, is open one up and read what it is computing.

This is not merely a philosophical inconvenience. It is a safety problem, an accountability problem, and increasingly an engineering bottleneck. The field of AI interpretability — specifically its more mechanistic branch — is attempting to solve it from first principles. The progress made in the last three years has been genuine and, in places, surprising.

Two traditions

Interpretability research divides into two broad traditions with different goals and different methods.

Post-hoc explainability asks: given a model that has already made a decision, can we generate a human-readable explanation of it? Techniques like LIME (Local Interpretable Model-Agnostic Explanations)[1] and SHAP (SHapley Additive exPlanations)[2] produce feature attributions — scores indicating which input features most influenced a model's output. These are useful for auditing and debugging but are proxies, not ground truth: they describe the model's input-output behaviour, not its internal computation.
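To make "feature attribution" concrete, the sketch below computes exact Shapley values for a three-input toy function by averaging each feature's marginal contribution over every ordering of the features. The function `model`, the all-zeros baseline, and the helper `shapley_values` are illustrative assumptions and not part of the SHAP library, which in practice uses approximations or model-specific shortcuts because exact enumeration grows factorially with the number of features.

```python
from itertools import permutations


def model(x):
    # Hypothetical toy model: one additive term and one interaction term.
    return 2.0 * x[0] + x[1] * x[2]


def shapley_values(f, x, baseline):
    """Exact Shapley attributions of f(x) relative to f(baseline)."""
    n = len(x)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous_output = f(current)
        # Switch features from baseline to actual value one at a time
        # and credit each feature with the change it causes.
        for i in order:
            current[i] = x[i]
            new_output = f(current)
            totals[i] += new_output - previous_output
            previous_output = new_output
    return [t / len(orderings) for t in totals]


phi = shapley_values(model, [1.0, 1.0, 1.0], [0.0, 0.0, 0.0])
print(phi)
```

The attributions sum to the gap between the model's output and the baseline output, and the interaction term is split evenly between its two inputs. Nothing in the scores reveals that the model multiplied those inputs together, which is the sense in which attributions describe behaviour and not computation.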

Mechanistic interpretability asks a harder question: what are the actual algorithms implemented inside the network? Not approximations or proxies — the literal computational structure. This requires reading the weights and activations of neural networks and reverse-engineering what they compute. It is essentially neuroscience applied to artificial systems.[3]

Features and circuits

The foundational unit of mechanistic interpretability is the feature: a direction in a neural network's activation space that corresponds to a human-interpretable concept. In the simplest case that direction lines up with a single neuron. Early work by Olah et al. at OpenAI, later continued at Anthropic, showed that individual neurons in vision models reliably activate for specific visual concepts such as curves and high-low frequency boundaries.[4] Related work on vision-language models found multimodal neurons that respond to both visual and textual representations of the same concept (e.g., a neuron that activates for the word "cat" and for images of cats).

Groups of neurons connected by weights that implement a specific algorithm are called circuits. The circuits framework asks: can we identify the specific subgraph of a neural network responsible for a particular behaviour, and reverse-engineer the algorithm it computes?

Early circuit-level analyses of transformers produced remarkable results. Wang et al. (2022) identified the circuit responsible for indirect object identification in GPT-2 small: the algorithm the model uses to determine, in a sentence like "When Mary and John went to the store, John gave a drink to ___", that the blank should be filled with "Mary."[5] The circuit was surprisingly small (26 attention heads) and implemented a recognisable algorithm: find the names mentioned earlier, inhibit the one that is repeated (the subject of the giving action), and output the name that remains.
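The algorithm reported by Wang et al. can be caricatured in a few lines of symbolic code. The sketch below is only an illustration of the logic, using the example prompt from the paper ("When Mary and John went to the store, John gave a drink to", expected completion "Mary"). The function name `ioi_sketch` and its list-of-names input are hypothetical: the real circuit operates on token activations through attention heads, not on Python lists.

```python
def ioi_sketch(names_in_prompt):
    """Return the predicted indirect object, or None if it is ambiguous."""
    seen = set()
    duplicated = set()
    # Step 1: notice which names occur more than once
    # (loosely the role the paper assigns to duplicate-token heads).
    for name in names_in_prompt:
        if name in seen:
            duplicated.add(name)
        seen.add(name)
    # Step 2: inhibit the duplicated name, i.e. the subject
    # (loosely the role of the S-inhibition heads).
    remaining = [n for n in dict.fromkeys(names_in_prompt) if n not in duplicated]
    # Step 3: copy the single remaining name to the output
    # (loosely the role of the name-mover heads).
    return remaining[0] if len(remaining) == 1 else None


# Names in order of appearance in the example prompt.
prediction = ioi_sketch(["Mary", "John", "John"])
print(prediction)
```

When no name is repeated the sketch returns `None`, reflecting that the inhibition step has nothing to act on.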

The superposition problem

The most significant theoretical discovery of recent mechanistic interpretability work is the phenomenon of superposition: neural networks represent more features than they have neurons, by encoding multiple features in overlapping directions across the activation space.

The intuition is geometric. A neural network layer with 512 neurons has 512 orthogonal directions — 512 features that could be cleanly separated. But if the model needs to represent thousands of concepts, and most of them are only rarely active simultaneously, it can pack many more features into the same space by assigning each a non-orthogonal direction, tolerating a small amount of interference between rarely co-occurring features.
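The geometric claim is easy to check numerically. The sketch below draws four times as many random unit directions as there are dimensions and measures how much they overlap. The sizes (256 directions in 64 dimensions), the random seed, and the helper names are arbitrary assumptions chosen to keep the example fast in plain Python. Trained networks learn their directions and do not sample them at random, so this shows only that the space has room to spare.

```python
import math
import random


def random_unit_vector(dim, rng):
    # A normalised Gaussian sample is uniformly distributed on the unit sphere.
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def interference_stats(n_features, dim, seed=0):
    """Mean and max absolute cosine similarity between distinct directions."""
    rng = random.Random(seed)
    vecs = [random_unit_vector(dim, rng) for _ in range(n_features)]
    overlaps = []
    for i in range(n_features):
        for j in range(i + 1, n_features):
            cosine = sum(a * b for a, b in zip(vecs[i], vecs[j]))
            overlaps.append(abs(cosine))
    return sum(overlaps) / len(overlaps), max(overlaps)


# Four times more feature directions than dimensions.
mean_overlap, max_overlap = interference_stats(n_features=256, dim=64)
print(f"mean |cos| = {mean_overlap:.3f}, max |cos| = {max_overlap:.3f}")
```

No pair of directions is exactly orthogonal, but the typical overlap is small, and it shrinks further as the dimension grows. That small overlap is the interference a model tolerates in exchange for representing more features than it has neurons.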

Anthropic's 2022 paper Toy Models of Superposition formalised this and showed that it emerges naturally from gradient descent when models are trained to represent more features than their dimensionality allows, provided those features are sparse.[6] The implication for interpretability is uncomfortable: individual neurons are not the right unit of analysis. A single neuron participates in representing many features simultaneously. Reading neural network internals neuron by neuron means reading at the wrong level of abstraction.
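A hand-built example shows why neurons stop being meaningful units under superposition. The sketch below places five features in a two-neuron layer, arranged as a regular pentagon, one of the geometries reported in Toy Models of Superposition, and reads them back with a ReLU. The weights here are written by hand, not trained, and the unit-length directions and the particular bias value are simplifying assumptions made for illustration.

```python
import math

N_FEATURES, N_NEURONS = 5, 2

# W[neuron][feature]: feature k points at angle 72 * k degrees.
W = [
    [math.cos(2 * math.pi * k / N_FEATURES) for k in range(N_FEATURES)],
    [math.sin(2 * math.pi * k / N_FEATURES) for k in range(N_FEATURES)],
]


def encode(x):
    """Compress 5 feature values into 2 neuron activations (h = W x)."""
    return [sum(W[n][k] * x[k] for k in range(N_FEATURES)) for n in range(N_NEURONS)]


def decode(h, bias):
    """Read the features back out: ReLU(W^T h + bias)."""
    return [
        max(0.0, sum(W[n][k] * h[n] for n in range(N_NEURONS)) + bias)
        for k in range(N_FEATURES)
    ]


# How many features does each neuron carry a non-zero weight for?
features_per_neuron = [sum(1 for w in row if abs(w) > 1e-9) for row in W]

# A sparse input: only feature 2 is active.
x = [0.0, 0.0, 1.0, 0.0, 0.0]
h = encode(x)
# Assumed bias: just large enough to cancel the strongest positive
# interference, which comes from the two neighbouring directions.
bias = -math.cos(2 * math.pi / N_FEATURES)
x_hat = decode(h, bias)
print(features_per_neuron, [round(v, 3) for v in x_hat])
```

Each neuron carries weight for four or five features, so neither has a single meaning, yet the sparse input is still read out on the correct feature alone. The recovered value is shrunk by the bias, which is the price this particular sketch pays for suppressing interference.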

"Neural networks are not like computer programs, where you can inspect variables and understand what each one means. They are more like brains — systems whose computation is distributed, overlapping, and not designed to be human-readable."

— Elhage et al., Toy Models of Superposition, Anthropic, 2022[6]

Sparse autoencoders: a path through superposition

The most promising current approach to navigating superposition is the sparse autoencoder (SAE). The idea is to train a separate neural network that takes the activations of the model being studied and decomposes them into a larger set of sparse features — a dictionary of concepts in which each activation can be expressed as a small number of active features from a large vocabulary.
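In outline, a sparse autoencoder is an encoder with a ReLU, a decoder whose rows act as the dictionary, and a loss that trades reconstruction error against an L1 penalty on the feature activations. The sketch below shows only the forward pass and the loss on a hand-set, two-dimensional example. The function names, the tied weights, the zero biases, and the penalty coefficient are assumptions for illustration. Plain Python lists keep it dependency-free, whereas real implementations use a tensor library and learn the weights by gradient descent.

```python
def matvec(matrix, vector):
    return [sum(w * c for w, c in zip(row, vector)) for row in matrix]


def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode x into non-negative feature activations, then reconstruct it."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre_activations = [p + b for p, b in zip(matvec(w_enc, centered), b_enc)]
    features = [max(0.0, p) for p in pre_activations]
    # Reconstruction is a weighted sum of dictionary directions plus a bias.
    x_hat = list(b_dec)
    for activation, direction in zip(features, w_dec):
        for i, d in enumerate(direction):
            x_hat[i] += activation * d
    return features, x_hat


def sae_loss(x, features, x_hat, l1_coeff):
    """Squared reconstruction error plus an L1 sparsity penalty."""
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(f) for f in features)
    return reconstruction + l1_coeff * sparsity


# Hypothetical dictionary: 3 features for a 2-dimensional activation.
DICTIONARY = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
features, x_hat = sae_forward(
    x=[2.0, 0.0],
    w_enc=DICTIONARY,  # tied encoder and decoder weights, for simplicity
    b_enc=[0.0, 0.0, 0.0],
    w_dec=DICTIONARY,
    b_dec=[0.0, 0.0],
)
loss = sae_loss([2.0, 0.0], features, x_hat, l1_coeff=0.1)
print(features, x_hat, loss)
```

Only one of the three dictionary features fires for this input, and the reconstruction is exact, so the entire loss comes from the sparsity penalty. The dictionary has more entries than the activation has dimensions, which is what lets an SAE assign separate features to directions that overlap in the original space.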

Anthropic's 2023 paper Towards Monosemanticity applied SAEs to a one-layer transformer and recovered clean, human-interpretable features (abstract concepts, syntactic roles, specific named entities) that were invisible at the neuron level.[7] A 2024 follow-up scaled the approach to Claude 3 Sonnet, a production model, and found millions of interpretable features corresponding to concepts ranging from "the Golden Gate Bridge" to "expressions of frustration."

Crucially, the features were not just interpretable in isolation — they could be causally intervened on. Artificially activating the "Golden Gate Bridge" feature caused the model to identify itself as the bridge when asked. This is not merely a correlation; it is a causal mechanism. The model's expressed identity could be modified by directly manipulating the relevant internal feature.
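A causal intervention of this kind amounts to moving the model's activation along the feature's direction and letting the rest of the forward pass run as usual. The sketch below shows the vector arithmetic only. The function name `clamp_feature`, the three-dimensional activation, and the target value are hypothetical. It simplifies the published experiments, which clamp a feature's activation inside the sparse autoencoder, and it assumes the feature direction has unit length.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def clamp_feature(activation, direction, target):
    """Shift `activation` along a unit-length `direction` so that its
    projection onto that direction equals `target`."""
    current = dot(activation, direction)
    return [a + (target - current) * d for a, d in zip(activation, direction)]


# Hypothetical values: a 3-dimensional activation and a unit feature direction.
activation = [0.5, -1.0, 2.0]
feature_direction = [0.6, 0.8, 0.0]

steered = clamp_feature(activation, feature_direction, target=10.0)
print(steered, dot(steered, feature_direction))
```

Only the component along the feature direction changes, and everything orthogonal to it is left untouched. That is why a change in downstream behaviour can be attributed to that one feature.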

Interpretability and safety

The connection between interpretability and alignment is direct. If we cannot inspect a model's internal representations, we cannot verify that it is pursuing the goals we intend rather than a proxy that happens to correlate with them during training. Behavioural testing alone — asking a model questions and observing its answers — cannot distinguish a genuinely aligned model from one that has learned to appear aligned under evaluation.

Mechanistic interpretability offers, in principle, a path to alignment verification that goes beyond behaviour: examining the model's actual representations of goals, values, and plans. Anthropic has described interpretability as a core component of their safety strategy, alongside Constitutional AI and other alignment techniques.[8]

The gap between current capability and what would be needed for genuine safety verification is large. Current SAE-based approaches can identify interpretable features at the level of concepts; they cannot yet read out a model's goals or intentions in any reliable way. Scaling interpretability techniques to frontier models with hundreds of billions of parameters is an active and unsolved engineering challenge.

The regulatory dimension

Interpretability is not only a research question. The EU AI Act mandates transparency and explainability for high-risk AI systems, with provisions that, depending on implementation, may require some form of interpretability for AI used in hiring, credit, healthcare, and law enforcement.[9] The US Executive Order on AI (2023), since rescinded in January 2025, similarly emphasised transparency requirements.

Whether current interpretability techniques satisfy regulatory requirements for explainability is unclear — partly because the regulations are vague about what "explainability" means at a technical level, and partly because the gap between post-hoc attribution methods (which are widely deployed) and mechanistic interpretability (which is not yet production-ready) is poorly understood outside the research community.

Where the field stands

Mechanistic interpretability in 2026 is where neuroscience was in the early twentieth century: we have identified some of the basic units of computation, we can trace some circuits, and we have a growing theoretical framework. We are nowhere near a complete picture of what any frontier model is computing.

The progress has been real. The sparse autoencoder breakthrough gave the field its first scalable approach to decomposing model activations into interpretable features. Circuit analysis has produced genuine algorithmic understanding of specific model behaviours. The theoretical framework of superposition, features, circuits, and universality (the hypothesis that different networks learn similar features and circuits) gives researchers a vocabulary for asking precise questions about neural network internals.

What remains to be done is most of it: scaling to frontier models, moving from identifying features to understanding plans, and moving from explaining single behaviours to characterising a model's global goals. The field is small relative to its importance. It may be the most consequential research area in AI that most people have not heard of.


References

  1. Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). "Why Should I Trust You?": Explaining the Predictions of Any Classifier. KDD 2016. arXiv:1602.04938. arxiv.org/abs/1602.04938
  2. Lundberg, S. M., & Lee, S-I. (2017). A Unified Approach to Interpreting Model Predictions. NeurIPS 2017. arXiv:1705.07874. arxiv.org/abs/1705.07874
  3. Olah, C. (2022). Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases. Transformer Circuits Thread. transformer-circuits.pub
  4. Olah, C. et al. (2020). Zoom In: An Introduction to Circuits. Distill. distill.pub/2020/circuits/zoom-in/
  5. Wang, K. et al. (2022). Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small. arXiv:2211.00593. arxiv.org/abs/2211.00593
  6. Elhage, N. et al. (2022). Toy Models of Superposition. Transformer Circuits Thread. transformer-circuits.pub/2022/toy_model/
  7. Bricken, T. et al. (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Anthropic. transformer-circuits.pub/2023/monosemantic-features/
  8. Anthropic. (2023). Core Views on AI Safety: When, Why, What, and How. anthropic.com/news/core-views-on-ai-safety
  9. Hamon, R. et al. (2020). Robustness and Explainability of Artificial Intelligence. EUR 30040 EN, Publications Office of the European Union. doi.org/10.2760/57493