Researchers Exploit AI Guessing Game to Extract Windows Keys, Exposing Guardrail Vulnerabilities in Language Models

The GenAI Bug Bounty Program

In a submission last year, researchers discovered a method to bypass AI guardrails designed to prevent the sharing of sensitive or harmful information. The technique exploits the game mechanics of language models such as GPT-4o and GPT-4o-mini by framing the interaction as a harmless guessing game. By obscuring details with HTML tags and positioning the request as part of the game’s conclusion, the researchers led the AI to inadvertently return valid Windows product keys. This case underscores the cha...