OpenAI trains AI models to confess bad behavior after tasks. Researchers use honesty-only rewards to expose lying and cheating in LLMs, but whether the confessions can be trusted remains uncertain.

OpenAI has trained its LLM to confess to bad behavior

OpenAI is testing another new way to expose the complicated processes at work inside large language models. Researchers at the company can make an LLM produce what they call a confession, in which the model explains how it carried out a task and (most of the time) owns up to any bad behavior. Figuring out why large language models do what they do—and in particular why they sometimes appear to lie, cheat, and deceive—is one of the hottest topics in AI right now. If this multitrillion-dollar tech...