70. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V.
Comparative analysis of the performance of ChatGPT and medical students outside their examination phase
Introduction: State-run medical licensing examinations in Germany are highly demanding and require extensive preparation. While medical expertise remains central to licensure, patients increasingly seek advice from artificial intelligence (AI) models such as ChatGPT. This shift raises the question of whether AI can accurately convey essential medical knowledge. Previous international studies (e.g., China [1], US [2], Poland [3], UK [4]) have compared AI performance with that of certified professionals, often through indirect comparisons against historical exam averages [5]. However, no studies have directly compared AI with medical students assessed outside their exam preparation phase, a comparison that could provide insight into knowledge retention. This study aims to directly compare the performance of large language models (LLMs) with that of medical students beyond their examination phase.
Methods: An anonymized survey was conducted at a German medical school among students in the clinical stage of their studies (typically 140 to 160 students). Participants answered 10 single-choice questions randomly selected from a pre-filtered pool of past German medical licensing examination (M1) items. Questions were selected for clinical relevance and moderate difficulty; items with chemical or mathematical content were excluded. The same questions were answered by ChatGPT-3.5, ChatGPT-4, and ChatGPT-4 mini. Performance was compared in terms of the number of correct responses. Additionally, the corrected discrimination coefficient was calculated for each item, measuring how well each question differentiated between higher- and lower-performing participants.
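The abstract does not state how the corrected discrimination coefficient was computed; a common definition is the corrected item-total correlation, i.e., the Pearson correlation between an item's 0/1 score and the total score of the remaining items. The following minimal Python sketch illustrates this calculation under that assumption; the function name and the toy data are hypothetical, not taken from the study.

import numpy as np

def corrected_discrimination(responses):
    """Corrected item-total correlation per item.

    responses: (n_students, n_items) array with 0/1 entries.
    Each coefficient is the Pearson correlation between an item's scores
    and the total score of all other items (the item itself is excluded
    from the total, hence "corrected").
    """
    totals = responses.sum(axis=1)
    n_items = responses.shape[1]
    coeffs = np.empty(n_items)
    for j in range(n_items):
        rest = totals - responses[:, j]  # total score without item j
        coeffs[j] = np.corrcoef(responses[:, j], rest)[0, 1]
    return coeffs

# Hypothetical toy data: 6 students x 3 items
responses = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 1],
    [0, 0, 0],
    [1, 0, 1],
])
print(corrected_discrimination(responses).round(2))

Under this definition, a value such as the reported 0.28 indicates that students who answered the item correctly also tended to score higher on the remaining items.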
Results: Of the 143 participants (median age 22), 129 were in the 5th semester, and the rest were in later semesters. About 40% identified as male and 55% as female. Students answered a median of 7 out of 10 questions correctly (range: 1–10). All AI models answered 9 out of 10 questions correctly. The only question missed by AI was answered correctly by 35% of students (50/143) and had the second-highest discrimination coefficient (0.28), indicating it effectively differentiated student performance.
Discussion: LLMs outperformed medical students who were beyond their exam preparation phase, suggesting superior knowledge retention. However, limitations include the modest sample size and preselection of questions, which may not fully reflect the exam's difficulty. Importantly, the AI's one incorrect answer had a high discrimination coefficient, highlighting a possible gap in AI understanding for nuanced or complex content. These findings suggest that while AI models show strong factual recall, they are not without limitations and should be supplemented by human oversight, particularly in high-stakes or ambiguous clinical contexts.
Conclusion: AI models demonstrate strong performance in retaining and applying medical knowledge, even surpassing students outside active exam preparation. However, their occasional errors, especially on discriminative questions, underline the need for caution. Further research is necessary to evaluate AI utility in real-world medical education and clinical decision-making, ensuring ethical and responsible integration.
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.
References
[1] Zong H, Li J, Wu E, Wu R, Lu J, Shen B. Performance of ChatGPT on Chinese national medical licensing examinations: a five-year examination evaluation study for physicians, pharmacists and nurses. BMC Med Educ. 2024;24:143. DOI: 10.1186/s12909-024-05125-7
[2] Gilson A, Safranek CW, Huang T, Socrates V, Chi L, Taylor RA, et al. How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med Educ. 2023;9:e45312. DOI: 10.2196/45312
[3] Suwała S, Szulc P, Guzowski C, Kaminska B, Dorobiała J, Wojciechowska K, et al. ChatGPT-3.5 passes Poland’s medical final examination-Is it possible for ChatGPT to become a doctor in Poland? SAGE Open Med. 2024;12:20503121241257777. DOI: 10.1177/20503121241257777
[4] Vij O, Calver H, Myall N, Dey M, Kouranloo K. Evaluating the competency of ChatGPT in MRCP Part 1 and a systematic literature review of its capabilities in postgraduate medical assessments. PLoS One. 2024;19:e0307372. DOI: 10.1371/journal.pone.0307372
[5] Liu M, Okuhara T, Chang X, Shirabe R, Nishiie Y, Okada H, et al. Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis. J Med Internet Res. 2024;26:e60807. DOI: 10.2196/60807



