Transformers Represent Belief State Geometry in their Residual Stream
16th Apr 2024. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Produced while being an affiliate at PIBBSS[1]. The work was done initially with funding from a Lightspeed Grant, and then continued while at PIBBSS. Work done in collaboration with @Paul Riechers, @Lucas Teixeira, @Alexander Gietelink Oldenziel, and Sarah Marzen. Paul was a MATS scholar during some portion of this work. Thanks to Paul, Lucas, Alexander, Sarah, and @Gu...
Read more at lesswrong.com