Logit Prisms: Decomposing Transformer Outputs for Mechanistic Interpretability
Introduction
Figure 1: An illustration of a “logit” prism decomposing a logit into different components (generated by DALL-E)
The logit lens (nostalgebraist 2020) is a simple yet powerful tool for understanding how transformer models (Vaswani et al. 2017; Brown et al. 2020) make decisions. In this work, we extend the logit lens approach in a mathematically rigorous and effective way. By treating certain parts of the network activations as constants, we can leverage the linear properties within the...
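As a rough illustration of the idea of holding part of the activations constant, here is a minimal sketch on a toy example. The excerpt does not say which activations are frozen, so the choice of an RMS-style normalization scale is an assumption made here for illustration only. All names (`parts`, `W_U`, `scale`) and all values are hypothetical, and no real model is involved. Once the scale is treated as a constant, the map from the residual stream to the logits is linear, so the logits split exactly into one contribution per residual component.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rms(x, eps=1e-6):
    """Root mean square of a vector, used here as a normalization scale."""
    return math.sqrt(sum(v * v for v in x) / len(x) + eps)


random.seed(0)
d_model, vocab, n_parts = 8, 5, 3

# Hypothetical toy data: three contributions to the residual stream
# (think embedding, one attention output, one MLP output) and an
# unembedding matrix W_U of shape (vocab, d_model).
parts = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_parts)]
W_U = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(vocab)]

# The residual stream is the sum of its contributions.
resid = [sum(p[i] for p in parts) for i in range(d_model)]

# Full computation: normalize the residual stream, then unembed.
scale = rms(resid)
logits = matvec(W_U, [v / scale for v in resid])

# Decomposition: reuse the same `scale` as a fixed constant (assumption),
# so each component can be pushed through the linear map on its own.
per_part = [matvec(W_U, [v / scale for v in p]) for p in parts]
recombined = [sum(pl[t] for pl in per_part) for t in range(vocab)]

# The per-component logits should add back up to the full logits.
max_err = max(abs(a - b) for a, b in zip(logits, recombined))
print("max reconstruction error:", max_err)
```

Each row of `per_part` can then be read as how much one component pushes each token's logit up or down, under the frozen-scale assumption.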
Read more at neuralblog.github.io