Circuit Tracing: New Method Reveals Computational Pathways in Language Models, Aiding Interpretability

Circuit Tracing: Revealing Computational Graphs in Language Models

Contents
Architecture
From Cross-Layer Transcoder to Replacement Model
The Local Replacement Model
Constructing an Attribution Graph for a Prompt
Learning from Attribution Graphs
Understanding and Labeling Features
Grouping Features into Supernodes
Validating Attribution Graph Hypotheses with Interventions
Localizing Important Layers
Factual Recall Case Study
Addition Case Study
Global Weights in Addition
Cross-Layer Transcoder Evaluation
Attribution Graph Evaluation
Evaluating Mechanistic Faith...