"Logit Prisms Enhance Transformer Model Interpretability by Decomposing Outputs into Individual Components"

Logit Prisms: Decomposing Transformer Outputs for Mechanistic Interpretability

Introduction

Figure 1: An illustration of a “logit” prism decomposing logit into different components (generated by DALL-E)

The logit lens (nostalgebraist 2020) is a simple yet powerful tool for understanding how transformer models (Vaswani et al. 2017; Brown et al. 2020) make decisions. In this work, we extend the logit lens approach in a mathematically rigorous and effective way. By treating certain parts of the network activations as constants, we can leverage the linear properties within the...
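
The idea of holding some activations constant to expose linear structure can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It rests on assumptions that the text above does not state: that the final residual vector is a sum of component vectors, that the quantity held constant is the scale of the final normalization, and that any normalization gain and bias are folded into the unembedding matrix. The names `components`, `W_U`, `norm_scale`, `decompose_logits`, and `full_logits` are hypothetical.

```python
import math
import random


def matvec(x, W):
    """Multiply a row vector x (length d) by a matrix W (d rows, v columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def norm_scale(x, eps=1e-5):
    """Standard deviation term used by a layer-norm style normalization."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return math.sqrt(var + eps)


def full_logits(components, W_U):
    """Reference path: sum the components, normalize, then unembed."""
    d = len(components[0])
    residual = [sum(c[i] for c in components) for i in range(d)]
    mean = sum(residual) / d
    scale = norm_scale(residual)
    return matvec([(v - mean) / scale for v in residual], W_U)


def decompose_logits(components, W_U):
    """Per-component logit contributions, with the normalization scale frozen.

    Assumption: the scale is computed once from the full residual and then
    treated as a constant, so every remaining step (centering, division by a
    constant, unembedding) is linear in each component.
    """
    d = len(components[0])
    residual = [sum(c[i] for c in components) for i in range(d)]
    scale = norm_scale(residual)  # held constant for all components
    parts = []
    for c in components:
        mean_c = sum(c) / d  # centering is linear, so it applies per component
        parts.append(matvec([(v - mean_c) / scale for v in c], W_U))
    return parts


# Toy data: 3 hypothetical components, model width 8, vocabulary size 5.
random.seed(0)
components = [[random.gauss(0, 1) for _ in range(8)] for _ in range(3)]
W_U = [[random.gauss(0, 1) for _ in range(5)] for _ in range(8)]

parts = decompose_logits(components, W_U)
total = [sum(p[j] for p in parts) for j in range(5)]
reference = full_logits(components, W_U)
```

Under these assumptions the per-component contributions in `parts` sum to the logits from the ordinary forward path, so each component's share of any single logit can be read off directly.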