Mechanistic Interpretability | Adrià Garriga-Alonso

Towards Automated Circuit Discovery for Mechanistic Interpretability

Recent work in mechanistic interpretability has reverse-engineered nontrivial behaviors of transformer models. These contributions required considerable effort and researcher intuition, which makes it difficult to apply the same methods to understand …

Causal Scrubbing: a method for rigorously testing interpretability hypotheses

This sequence introduces causal scrubbing, a principled approach for evaluating the quality of mechanistic interpretations. The key insight behind this work is that mechanistic interpretability hypotheses can be thought of as defining what activations inside a neural network can be resampled without affecting behavior. Accordingly, causal scrubbing tests interpretability hypotheses via behavior-preserving resampling ablations—converting hypotheses into distributions over activations that should preserve behavior, and checking if behavior is actually preserved.
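The resampling idea can be illustrated with a toy sketch. This is not the implementation from the sequence: the two-stage "model", the `hidden` and `readout` functions, and the `preserved_fraction` helper are all hypothetical names invented for this example. A hypothesis is represented here as a feature of the input that the hidden activation is claimed to matter through. The activation is then replaced by the activation from any other input that agrees on that feature, and we measure how often the output is unchanged. For simplicity the sketch averages over all admissible donor inputs instead of sampling them, and it compares exact outputs, where a real evaluation would more likely compare expected loss or another performance metric.

```python
# Toy illustration of a behavior-preserving resampling ablation.
# Everything here is a hypothetical stand-in, not the sequence's code.

def hidden(x):
    """Intermediate activation of the toy model: the sum of both inputs."""
    a, b = x
    return a + b

def readout(h):
    """Final stage of the toy model: parity of the hidden activation."""
    return h % 2

def model(x):
    return readout(hidden(x))

def preserved_fraction(dataset, feature):
    """Average rate at which the output survives resampling `hidden`.

    For each input x, the hidden activation is replaced by the activation
    of every donor input y with feature(y) == feature(x). We average the
    agreement over all such donors, which equals the expectation under
    uniform resampling.
    """
    total = 0.0
    for x in dataset:
        donors = [y for y in dataset if feature(y) == feature(x)]
        agree = sum(readout(hidden(y)) == model(x) for y in donors)
        total += agree / len(donors)
    return total / len(dataset)

dataset = [(a, b) for a in range(4) for b in range(4)]

# Hypothesis 1: the activation matters only through the parity of a + b.
good = preserved_fraction(dataset, lambda x: (x[0] + x[1]) % 2)

# Hypothesis 2 (wrong): the activation matters only through the parity of a.
bad = preserved_fraction(dataset, lambda x: x[0] % 2)

print(good, bad)
```

Under the first hypothesis every admissible donor yields the same output, so behavior is fully preserved. Under the second, half of the donors flip the output, which signals that the hypothesis leaves out something the model relies on.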