Abstract: A key goal of current mechanistic interpretability research in NLP is to find linear features (also called “feature vectors”) for transformers: directions in activation space corresponding to concepts that are used by a given model in its computation. Present state-of-the-art methods for finding linear features require large amounts of labelled data — both laborious to acquire and computationally expensive to utilize. In this work, we introduce a novel method, called “observable propagation” (in short: ObsProp), for finding linear features used by transformer language models in computing a given task — using almost no data. Our paradigm centers on the concept of observables, linear functionals corresponding to given tasks. We then introduce a mathematical theory for the analysis of feature vectors: we provide theoretical motivation for why LayerNorm nonlinearities do not affect the direction of feature vectors; we also introduce a similarity metric between feature vectors called the coupling coefficient which estimates the degree to which one feature’s output correlates with another’s. We use ObsProp to perform extensive qualitative investigations into several tasks, including gendered occupational bias, political party prediction, and programming language detection. Our results suggest that ObsProp surpasses traditional approaches for finding feature vectors in the low-data regime, and that ObsProp can be used to better understand the mechanisms responsible for bias in large language models. Code for experiments can be found at this link.
Analyzing Linear Features in Transformer Models
In natural language processing (NLP), understanding how transformer models arrive at their predictions remains a challenge. Mechanistic interpretability research aims to open up these black-box models by identifying linear features, also called feature vectors: directions in activation space that correspond to the concepts a model uses in its computation.
Current state-of-the-art methods for finding linear features require large amounts of labeled data, which is laborious to acquire and computationally expensive to use. The paper introduces a method called “observable propagation” (ObsProp) that finds the linear features a transformer language model uses for a given task while requiring almost no data.
ObsProp centers on the concept of observables, which are linear functionals corresponding to given tasks. Alongside the method, the authors introduce a mathematical theory for analyzing feature vectors, including theoretical motivation for why LayerNorm nonlinearities do not affect the direction of these vectors.
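To make the idea of an observable concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the four-token vocabulary, the toy unembedding matrix W_U, and the choice of observable (the logit of "she" minus the logit of "he") are all hypothetical. It only shows the basic linear-algebra fact that a linear functional on the logits can be pulled back through a linear map to a single direction in activation space.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def matvec(M, x):
    """Matrix-vector product, with M stored as a list of rows."""
    return [dot(row, x) for row in M]

def transpose(M):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]

# Hypothetical toy setup: 4 tokens, 3 hidden dimensions.
VOCAB = ["he", "she", "the", "code"]
W_U = [  # assumed unembedding: logits = W_U @ x
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 2.0],
    [0.5, 0.5, 0.0],
    [-1.0, 0.0, 1.0],
]

# Observable: a linear functional on the logits,
# here logit("she") - logit("he").
n = [-1.0, 1.0, 0.0, 0.0]

# Pulling the observable back through the linear map gives a
# direction in activation space: feature = W_U^T n.
feature = matvec(transpose(W_U), n)

# For any activation x, applying the observable to the logits equals
# taking the inner product of x with the feature vector.
x = [0.3, -0.2, 0.7]
via_logits = dot(n, matvec(W_U, x))
via_feature = dot(feature, x)
```

Because every step here is linear, no data is needed to obtain the direction: it follows from the weights and the chosen observable alone. How the method handles the nonlinear parts of a real transformer is the subject of the paper's theory and is not reproduced in this sketch.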
Additionally, the authors introduce a coupling coefficient as a similarity metric between feature vectors. This coefficient estimates the extent to which one feature’s output correlates with another’s, enabling deeper insights into how different features interact within the model.
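This summary does not give a formula for the coupling coefficient, so the sketch below uses an assumed stand-in: the quantity (y1 · y2) / ||y1||^2, which is the least-squares slope of the output y2 · x against the output y1 · x when the inputs x are isotropic. The function names and example vectors are hypothetical. Cosine similarity is included for comparison, since it is symmetric while this coefficient is not.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def coupling(y1, y2):
    """Assumed definition: (y1 . y2) / ||y1||^2.

    Roughly, how much the output of y2 is expected to change per unit
    change in the output of y1, for isotropic inputs.
    """
    return dot(y1, y2) / dot(y1, y1)

def cosine(y1, y2):
    """Cosine similarity, which ignores the norms of both vectors."""
    return dot(y1, y2) / math.sqrt(dot(y1, y1) * dot(y2, y2))

# Hypothetical feature vectors.
y1 = [1.0, 0.0, 0.0]
y2 = [2.0, 2.0, 0.0]

c12 = coupling(y1, y2)   # output of y2 per unit output of y1
c21 = coupling(y2, y1)   # output of y1 per unit output of y2
cos = cosine(y1, y2)     # same value in either direction
```

Under this assumed definition the coefficient depends on the order of its arguments and on the vectors' norms, which is what lets it speak about the size of one feature's output relative to another's rather than only about the angle between them.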
The authors apply ObsProp in extensive qualitative investigations of several tasks, including gendered occupational bias, political party prediction, and programming language detection. The results suggest that ObsProp surpasses traditional approaches for finding feature vectors in the low-data regime, and that it can be used to better understand the mechanisms responsible for bias in large language models.
This research points to new possibilities for interpreting NLP models and may offer a useful tool for studying bias and fairness concerns. By reducing the amount of data needed to find linear features, ObsProp could help researchers better understand how transformer models make predictions and identify potential areas for improvement.
To further support reproducibility and enable future research, the authors provide code for the experiments at the following link.