Tied Crosscoders: Explaining Chat Behavior from Base Model
Abstract

We are interested in model-diffing: finding what is new in the chat model compared to the base model. One way of doing this is training a crosscoder, which here just means training an SAE on the concatenation of the activations at a given layer of the base and chat models. When training this crosscoder, we find some latents whose decoder vector mostly helps reconstruct the base-model activation and barely affects the reconstruction of the chat-model activation. These we call base-e...
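To make the setup concrete, here is a minimal sketch of a crosscoder viewed as an SAE over concatenated activations, assuming PyTorch, a ReLU encoder with an L1 sparsity penalty, and a decoder-norm-based criterion for spotting latents that mostly reconstruct the base activation. The class and function names, dimensions, and loss coefficients are illustrative assumptions, not the post's exact training recipe.

```python
import torch
import torch.nn as nn

class Crosscoder(nn.Module):
    """SAE trained on the concatenation [a_base; a_chat] of per-layer activations."""
    def __init__(self, d_model: int, n_latents: int):
        super().__init__()
        self.encoder = nn.Linear(2 * d_model, n_latents)       # reads the concatenated pair
        self.decoder = nn.Linear(n_latents, 2 * d_model)       # writes both reconstructions
        self.d_model = d_model

    def forward(self, a_base: torch.Tensor, a_chat: torch.Tensor):
        x = torch.cat([a_base, a_chat], dim=-1)
        f = torch.relu(self.encoder(x))                         # sparse latent activations
        x_hat = self.decoder(f)
        a_base_hat, a_chat_hat = x_hat.split(self.d_model, dim=-1)
        return f, a_base_hat, a_chat_hat

def crosscoder_loss(model, a_base, a_chat, l1_coeff=1e-3):
    # Reconstruct both models' activations from the shared latents, plus an L1 sparsity term.
    f, a_base_hat, a_chat_hat = model(a_base, a_chat)
    recon = ((a_base_hat - a_base) ** 2).sum(-1) + ((a_chat_hat - a_chat) ** 2).sum(-1)
    return (recon + l1_coeff * f.abs().sum(-1)).mean()

def chat_decoder_fraction(model):
    # Each latent's decoder vector splits into a base half and a chat half; a latent whose
    # chat-half norm is near zero mostly reconstructs the base activation only -- the kind
    # of latent the abstract describes.
    W = model.decoder.weight                      # shape (2*d_model, n_latents)
    base_norm = W[: model.d_model].norm(dim=0)
    chat_norm = W[model.d_model :].norm(dim=0)
    return chat_norm / (base_norm + chat_norm + 1e-8)
```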