News Score: Score the News, Sort the News, Rewrite the Headlines

Logit Prisms: Decomposing Transformer Outputs for Mechanistic Interpretability

Introduction

Figure 1: An illustration of a “logit” prism decomposing logit into different components (generated by DALL-E)

The logit lens (nostalgebraist 2020) is a simple yet powerful tool for understanding how transformer models (Vaswani et al. 2017; Brown et al. 2020) make decisions. In this work, we extend the logit lens approach in a mathematically rigorous and effective way. By treating certain parts of the network activations as constants, we can leverage the linear properties within the...
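The idea of holding part of the computation constant so that the rest becomes linear can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: the component names, the dimensions, the random unembedding matrix `W_U`, and the simplified RMS-style normalization are all assumptions made for illustration. It freezes the normalization scale and then checks that the output logits equal the sum of per-component contributions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 8, 5  # hypothetical sizes

# Hypothetical residual-stream components (names are illustrative only).
components = {
    "embed": rng.normal(size=d_model),
    "attn": rng.normal(size=d_model),
    "mlp": rng.normal(size=d_model),
}
W_U = rng.normal(size=(d_model, vocab))  # hypothetical unembedding matrix

# The residual stream is the sum of all components.
residual = sum(components.values())

# Simplified RMS-style normalization (an assumption). The scale is computed
# once from the full residual and then treated as a constant.
scale = 1.0 / np.sqrt(np.mean(residual ** 2) + 1e-6)

# Logits computed from the full residual stream.
full_logits = (residual * scale) @ W_U

# With the scale frozen, the map from residual to logits is linear,
# so each component's contribution can be computed separately.
contributions = {name: (vec * scale) @ W_U for name, vec in components.items()}
total = sum(contributions.values())

print(np.allclose(full_logits, total))
```

Because the scale is fixed, the per-component logit vectors in `contributions` add up to `full_logits`, which is what lets each component be inspected on its own.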

Read more at neuralblog.github.io
