A security researcher builds a vulnerable app with a Firebase exploit and spends $1,500 testing LLMs against it: GPT-5.5 solves 70%, DeepSeek V4 Pro 30%, and Claude models 20%, while the others fail because of refusals or the wrong focus.

I built a vulnerable app and spent $1,500 seeing if LLMs could hack it

As part of my work, I do security research on various apps and websites. I wanted to see whether LLMs could reproduce a common class of exploits I’ve found in multiple apps, so I built a fake React Native app in Expo with a Python backend. It’s a book review app, and the goal is to find a flag in a user’s private reviews. If you would like to try solving it yourself before I spoil it, here’s a ZIP of the APK and the challenge description each LLM was fed. It looks like this:

Full exploit details (spoilers...