Tied Crosscoders: Explaining Chat Behavior from Base Model
Abstract

We are interested in model-diffing: finding what is new in the chat model compared to the base model. One way of doing this is training a crosscoder, which here just means training an SAE on the concatenation of the activations at a given layer of the base and chat models. When training this crosscoder, we find some latents whose decoder vector mostly helps reconstruct the base-model activation and barely affects the reconstruction of the chat-model activation. These we call base-e...
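To make the setup concrete, here is a minimal sketch of a crosscoder viewed as an SAE over concatenated activations, assuming PyTorch, a ReLU encoder with an L1 sparsity penalty, and a decoder-norm-based criterion for spotting latents that mostly reconstruct the base activation. The class and function names, dimensions, and loss coefficients are illustrative assumptions, not the post's exact training recipe.

```python
import torch
import torch.nn as nn

class Crosscoder(nn.Module):
    """SAE trained on the concatenation [a_base; a_chat] of per-layer activations."""
    def __init__(self, d_model: int, n_latents: int):
        super().__init__()
        self.encoder = nn.Linear(2 * d_model, n_latents)       # reads the concatenated pair
        self.decoder = nn.Linear(n_latents, 2 * d_model)       # writes both reconstructions
        self.d_model = d_model

    def forward(self, a_base: torch.Tensor, a_chat: torch.Tensor):
        x = torch.cat([a_base, a_chat], dim=-1)
        f = torch.relu(self.encoder(x))                         # sparse latent activations
        x_hat = self.decoder(f)
        a_base_hat, a_chat_hat = x_hat.split(self.d_model, dim=-1)
        return f, a_base_hat, a_chat_hat

def crosscoder_loss(model, a_base, a_chat, l1_coeff=1e-3):
    # Reconstruct both models' activations from the shared latents, plus an L1 sparsity term.
    f, a_base_hat, a_chat_hat = model(a_base, a_chat)
    recon = ((a_base_hat - a_base) ** 2).sum(-1) + ((a_chat_hat - a_chat) ** 2).sum(-1)
    return (recon + l1_coeff * f.abs().sum(-1)).mean()

def chat_decoder_fraction(model):
    # Each latent's decoder vector splits into a base half and a chat half; a latent whose
    # chat-half norm is near zero mostly reconstructs the base activation only -- the kind
    # of latent the abstract describes.
    W = model.decoder.weight                      # shape (2*d_model, n_latents)
    base_norm = W[: model.d_model].norm(dim=0)
    chat_norm = W[model.d_model :].norm(dim=0)
    return chat_norm / (base_norm + chat_norm + 1e-8)
```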