70. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V.
Better ask ChatGPT?! Evaluating the performance of Large Language Models in identifying common misinterpretations of statistical tests, p-values, confidence intervals, and statistical power
Introduction: As the development of Artificial Intelligence (AI) continues to accelerate, Large Language Models (LLMs) are impacting research and education, offering both great opportunities and challenges [1]. One of the many opportunities is the easy and immediate accessibility of LLMs to students and professionals involved in medical research, a field where statistical advice is often requested. In line with this, a recent study explored the integration of LLMs into statistical consulting [2]. However, one of the many challenges associated with LLMs is the quality of their output, which can be incorrect and is shaped by training data of unknown quality. This raises the question of how LLMs handle common misinterpretations of statistical concepts that may be embedded in their training data. In this study, we investigated how three different LLMs handle 26 common misconceptions about statistical tests, p-values, confidence intervals, and power, depending on the quality of the prompt.
Methods: The 26 misinterpretations were taken from a widely cited article by Greenland and colleagues and can be grouped into four thematic areas: single p-values (15 statements), p-value comparisons and predictions (4), confidence intervals (5), and power (2) [3]. Each statement was presented to Llama 3.1 8B Instruct, Llama 3.1 70B Instruct, and GPT-4 with either a well-designed or a poorly designed prompt. The well-designed prompt placed the statistical statement in the overarching context of null hypothesis significance testing; the poorly designed prompt only asked whether the statement was true or false, without providing any context. The LLMs' performance was evaluated with regard to their decision (correct/incorrect) and their explanation.
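The following minimal Python sketch illustrates the evaluation setup described above; the prompt wording, the model identifiers, and the query_model helper are illustrative assumptions rather than the authors' exact materials, and the model responses were subsequently rated manually.

# Minimal sketch of the prompting loop; all names and prompt texts are
# illustrative assumptions, not the study's actual materials.
WELL_DESIGNED = (
    "In the context of null hypothesis significance testing, is the following "
    "statement about statistical tests true or false? Explain your answer.\n"
    "Statement: {statement}"
)
POORLY_DESIGNED = "Is the following statement true or false?\nStatement: {statement}"

MODELS = ["llama-3.1-8b-instruct", "llama-3.1-70b-instruct", "gpt-4"]


def query_model(model: str, prompt: str) -> str:
    """Hypothetical wrapper around the respective model API or local inference."""
    raise NotImplementedError


def run_experiment(statements: list[str]) -> list[dict]:
    """Present every statement to every model under both prompt conditions."""
    records = []
    for model in MODELS:
        for condition, template in (("well-designed", WELL_DESIGNED),
                                    ("poorly designed", POORLY_DESIGNED)):
            for statement in statements:
                answer = query_model(model, template.format(statement=statement))
                # Decision (correct/incorrect rejection) and explanation quality
                # are rated by the evaluators afterwards.
                records.append({"model": model, "condition": condition,
                                "statement": statement, "answer": answer})
    return records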
Results: With well-designed prompts, the rate of correct rejections was 100.00% for Llama 8B and 96.15% for both Llama 70B and GPT-4. The explanations provided by the models (Llama 8B/Llama 70B/GPT-4) were fully correct in 80.76%/88.46%/84.62% of the cases, respectively. Across the LLMs, well-designed prompts resulted in an average of 97.43% correct rejections and 84.61% fully correct explanations. With poorly designed prompts, the rate of correct rejections was 65.38% for both Llama 8B and Llama 70B, and 88.46% for GPT-4, while the proportion of correct explanations was 38.46%/53.85%/69.23% for the three models; on average, this resulted in 73.07% correct rejections and 53.85% correct explanations. The poor performance with suboptimal prompts was particularly evident for the topics of p-value comparisons and predictions (correct decision rate: 66.67%), confidence intervals (60.00%), and power (50.00%), while single p-values were less affected (82.22%). The experiment was replicated repeatedly, and the results are reported.
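As a sanity check, the cross-model averages quoted above follow directly from the per-model percentages; the short Python sketch below reproduces them, assuming equal weighting of the three models and 26 statements per condition.

# Reproduce the reported cross-model averages from the per-model rates.
# Values are taken from the Results text; with 26 statements per condition,
# 96.15% corresponds to 25/26 and 65.38% to 17/26 (an inference, not stated).
from statistics import mean

well_rejections = [100.00, 96.15, 96.15]    # Llama 8B, Llama 70B, GPT-4
well_explanations = [80.76, 88.46, 84.62]
poor_rejections = [65.38, 65.38, 88.46]
poor_explanations = [38.46, 53.85, 69.23]

print(round(mean(well_rejections), 2))    # 97.43
print(round(mean(well_explanations), 2))  # 84.61
print(round(mean(poor_rejections), 2))    # 73.07
print(round(mean(poor_explanations), 2))  # 53.85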
Discussion: The results indicate that generative AI can effectively detect common statistical misinterpretations when prompted correctly, though not perfectly. However, effective prompting is a non-trivial task that may require domain expertise, as it involves identifying context that is both adequate and effective for LLMs. Most importantly, performance depended more on the quality of the prompt than on the model itself, which aligns with prior research on LLM instruction principles [4].
Conclusion: Our study highlights the capabilities of LLMs and underscores the importance of well-designed prompts to ensure reliable AI-assisted statistical interpretation.
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.
References
[1] Rahman MdM, Watanobe Y. ChatGPT for Education and Research: Opportunities, Threats, and Strategies. Applied Sciences. 2023;13:5783. DOI: 10.3390/app13095783
[2] Fichtner UA, Knaus J, Graf E, Koch G, Sahlmann J, Stelzer D, et al. Exploring the potential of large language models for integration into an academic statistical consulting service – the EXPOLS study protocol. PLoS ONE. 2024;19(12):e0308375. DOI: 10.1371/journal.pone.0308375
[3] Greenland S, Senn SJ, Rothman KJ, et al. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol. 2016;31(4):337-350. DOI: 10.1007/s10654-016-0149-3
[4] Bsharat SM, Myrzakhan A, Shen Z. Principled instructions are all you need for questioning LLaMA-1/2, GPT-3.5/4 [Preprint]. arXiv. 2024. DOI: 10.48550/arXiv.2312.16171



