GPT-4, 16 Percentage Points Higher Diagnostic Accuracy Than Human Doctors
A new report indicates that artificial intelligence (AI) has surpassed humans in a test of medical diagnosis. In the experiment it cites, OpenAI's AI model GPT-4 outperformed human doctors, fueling expectations that the era of the 'AI doctor' is now within sight.
According to the "AI Index 2025" report released on the 8th (local time) by the Stanford University Institute for Human-Centered Artificial Intelligence (HAI), GPT-4 demonstrated a diagnostic accuracy 16 percentage points higher than human doctors in tests based on clinical cases. The report stated, "Overall, GPT-4 alone showed the highest and most consistent diagnostic performance." It added, "In contrast, human doctors alone had lower performance, but when collaborating with AI, the results varied greatly depending on how the AI was utilized."
The experiment described in the AI Index 2025 report gave six challenging patient cases to both GPT-4 and 50 clinicians in the United States (26 specialists and 24 residents). It compared the diagnostic accuracy of three groups: "GPT-4 alone," "human doctors collaborating with GPT-4," and "human doctors alone." The first comparison was "GPT-4 vs. human doctors," and the second was "human doctors collaborating with GPT-4 vs. human doctors alone."

GPT-4, 16 Percentage Points Higher Diagnostic Accuracy Than Human Doctors
As a result, the median accuracy of the GPT-4-alone group was 92%, which the report puts at 16 percentage points above human doctors working alone (the 16-point gap is the figure as reported, not a simple subtraction of the two medians). The median is the value exactly in the middle when the data are arranged in order. The median accuracy of doctors collaborating with GPT-4 (76%) was only 2 percentage points higher than that of doctors working alone (74%), a difference that was not statistically significant. Accuracy was evaluated independently by two internal medicine specialists who did not take part in the experiment, using predetermined criteria. They scored each diagnosis without knowing who had made it.
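The difference between a "percentage point" gap and a relative "percent" change, and the meaning of a median, can be illustrated with a short Python sketch. This is only an illustration under stated assumptions: the helper function names and the list of sample scores are hypothetical and do not come from the report; only the 76% and 74% medians are taken from the figures quoted above.

```python
from statistics import median


def percentage_point_gap(a_pct: float, b_pct: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return a_pct - b_pct


def relative_change_pct(a_pct: float, b_pct: float) -> float:
    """How much larger a_pct is than b_pct, expressed as a percent of b_pct."""
    return (a_pct - b_pct) / b_pct * 100


# Hypothetical per-case scores (NOT from the report), used only to show
# that the median is the middle value of the sorted data.
illustrative_scores = [61, 70, 74, 81, 88]
middle_value = median(illustrative_scores)

# Medians quoted in the article: doctors with GPT-4 vs. doctors alone.
with_ai, doctors_alone = 76, 74
gap = percentage_point_gap(with_ai, doctors_alone)
rel = relative_change_pct(with_ai, doctors_alone)

print(f"median of illustrative scores: {middle_value}")
print(f"gap: {gap} percentage points; relative change: {rel:.1f}%")
```

Here the 2-percentage-point gap corresponds to a relative change of roughly 2.7%, which is why the two units should not be used interchangeably.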
This report is significant because it points to a shift in the role of AI within medical practice. AI is already widely used in robotic surgery, medical data analysis, and cancer screening. Until now, however, its role has largely been limited to assisting doctors' decision-making.
With the AI Index, considered the world's most authoritative AI white paper, now reporting that generative AI models like GPT-4 outperform human doctors in diagnosis, there are growing expectations that AI doctors will soon become a common sight in hospitals.
The report stated, "These experimental results show that GPT-4's diagnostic performance is the highest and most consistent overall," and added, "When AI collaborates with doctors, outcomes vary depending on the individual doctor's judgment and ability to utilize the AI." It also noted, "Recent studies have shown that AI outperforms medical professionals in areas such as cancer detection and identifying critically ill patients, and the scope of AI's application is expanding from simple diagnosis to more complex clinical decision-making."

AI widely adopted in robotic surgery, data analysis, cancer screening, and more
Additionally, on "MedQA," a representative benchmark for measuring the clinical knowledge of AI models, the top score reached 96.0% last year, up a remarkable 28.4 percentage points from 67.6% in 2022. MedQA is built from questions at the level of the United States Medical Licensing Examination and is used to evaluate the clinical knowledge of AI.
The report stated, "There are research findings suggesting that collaboration between AI and doctors can yield the best results, making this an important area for future research." However, it also cautioned, "There are concerns regarding the inherent risks of AI systems themselves, such as the 'hallucination' problem of generating false information or unpredictable errors, raising issues of reliability and safety. Therefore, policy measures considering these risk factors are necessary."
As the diagnostic capabilities of AI in the medical field rapidly improve, discussions about the future of medical professions are under way in South Korea as well. In a report released in February titled "AI and the Korean Economy," the Bank of Korea stated, "AI is not simply replacing human labor, but in high-risk fields such as medicine, it is likely to play a role in complementing human judgment," and added, "In particular, advances in AI have the potential to improve the quality of medical services."