
GitHub - PaulPauls/llama3_interpretability_sae: A complete end-to-end pipeline for LLM interpretability with sparse autoencoders (SAEs) using Llama 3.2, written in pure PyTorch and fully reproducible.

Llama 3 Interpretability with Sparse Autoencoders

Source: OpenAI - Extracting Concepts from GPT-4

Project Overview

Modern LLMs encode concepts by superimposing multiple features onto the same neurons and then interpreting them by taking into account the linear superposition of all neurons in a layer. This practice of giving each neuron multiple interpretable meanings, activated depending on the context of other neuron activations, is called superposition. Sparse Autoencoders (SAEs) are models that...
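For intuition, here is a minimal sketch of what such an SAE can look like in PyTorch: a linear encoder into a wider feature space with a ReLU, a linear decoder back to the activation space, and a loss that combines reconstruction error with an L1 sparsity penalty. The layer sizes, penalty coefficient, and random stand-in activations are illustrative assumptions, not the repository's actual configuration.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Sketch of an SAE trained on LLM activations (sizes are assumptions)."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, activations: torch.Tensor):
        # Encode dense activations into a wider, sparse feature space.
        features = torch.relu(self.encoder(activations))
        # Decode back, so each feature maps to a direction in activation space.
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(reconstruction, activations, features, l1_coeff: float = 1e-3):
    # Reconstruction error keeps features faithful to the original activations;
    # the L1 term pushes most feature activations toward zero (sparsity).
    mse = torch.mean((reconstruction - activations) ** 2)
    sparsity = l1_coeff * features.abs().mean()
    return mse + sparsity

# Example: one training step on a batch of captured layer activations.
sae = SparseAutoencoder(d_model=2048, d_hidden=16384)
acts = torch.randn(32, 2048)  # stand-in for a batch of Llama 3.2 activations
recon, feats = sae(acts)
loss = sae_loss(recon, acts, feats)
loss.backward()
```

The wide hidden layer and sparsity penalty are what let individual SAE features pick out single, interpretable concepts that the underlying neurons only represent in superposition.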

Read more at github.com
