Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine
Large language models (LLMs), exemplified by Generative Pre-trained Transformer 4 (GPT-4) [1], have achieved remarkable performance on various biomedical tasks [2], including summarizing medical evidence [3], assisting in literature search [4,5], answering medical examination questions [6,7,8,9], and matching patients to clinical trials [10]. However, most of these LLMs are unimodal, using only free-text context, whereas clinical tasks often require integrating narrative descriptions with multiple types of data, such as medical images.