News Score: Score the News, Sort the News, Rewrite the Headlines

Alignment is not free: How model upgrades can silence your confidence signals | Variance

The Flattening Calibration Curve

The post-training process can bias a language model's behavior when it encounters content that violates its safety guidelines. As noted in OpenAI's GPT-4 system card, model calibration rarely survives post-training, leaving models that are extremely confident even when they are wrong.¹ For our use case, we often see this behavior with the side effect of biasing language model outputs towards violations, which can result in wasted...
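Calibration here means that a model's stated confidence tracks its actual accuracy; a common way to quantify the mismatch is Expected Calibration Error (ECE). The sketch below, with synthetic data chosen for illustration, shows how a flattened calibration curve (high confidence regardless of correctness) inflates ECE:

```python
# Illustrative sketch: measuring calibration with Expected Calibration Error (ECE).
# A well-calibrated model's confidence matches its accuracy; post-training often
# flattens this relationship, leaving confidences high even on wrong answers.
# All numbers below are synthetic, for illustration only.

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence, then take the weighted average of
    |accuracy - mean confidence| across bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / total) * abs(accuracy - avg_conf)
    return ece

# Reasonably calibrated: 90% confident, right 4 times out of 5.
calibrated = expected_calibration_error([0.9] * 5, [1, 1, 1, 1, 0])
# Overconfident: 95% confident, right only 2 times out of 5.
overconfident = expected_calibration_error([0.95] * 5, [1, 0, 1, 0, 0])
print(round(calibrated, 2), round(overconfident, 2))  # prints "0.1 0.55"
```

A lower ECE means confidence is a more trustworthy signal; when post-training pushes every answer's confidence toward the ceiling, the second case is what the calibration curve degrades into.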

Read more at variance.co
