70. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V.

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS)
07.-11.09.2025
Jena


Meeting Abstract

Generating High-Quality Multiple-Choice Questions Using Small Language Models and Adaptive Agentic Infrastructure

Michael Größler 1
Lilly Marie Düsterbeck 1
Graziella Credidio 1
Layla Tabea Riemann 1
1University Medical Center Hamburg-Eppendorf, Institute for Applied Medical Informatics, Hamburg, Germany

Introduction: We introduce a novel approach for generating high-quality multiple-choice questions (MCQs) within KiMED, an AI-based learning platform designed to support medical students in reviewing lecture material and preparing for coursework. KiMED is a personalized learning web application that dynamically adjusts question difficulty, topical emphasis, and explanation depth based on individual student performance and curricular context.

A key challenge in this domain lies in balancing strict legal and quality requirements with limited computational resources [1]. Erroneous material could potentially lead to lawsuits, which demands high-quality question generation, while computational constraints arise from the substantial resource demands of deploying language models locally. Scaling such a service to support simultaneous use by many students further intensifies these demands. Our approach addresses these challenges by using small language models within a highly adaptive agentic framework. Focusing initially on biochemistry, it lays the groundwork for expanding into other subjects and for scaling personalized learning tools in resource-constrained educational environments.

Methods: Our system employs the GPTSwarm framework [2] on top of a custom RAG pipeline [3], [4] for document retrieval, modeling language agents as directed acyclic graphs with functional nodes. Each node handles a task such as concept extraction, distractor generation, or answer validation, while edges manage information flow within and between agents. Agents form a swarm in which both node prompts and inter-agent communication patterns are optimized automatically. Optimization occurs at two levels: node optimization refines prompt instructions, and edge optimization adjusts inter-agent information sharing. Both processes are guided by reinforcement learning and task-specific feedback. The framework operates effectively with small language models, which suits our available resources. The initial data are based on a combination of textbooks, slides, transcripts of those slides, and course scripts. To assess our approach, we created a biochemistry MCQ dataset comprising three sets of 50 questions each: AI-generated using prompt engineering alone as a baseline, AI-generated with our framework, and human-authored. Two domain experts evaluated each question with binary scores across ten criteria, including clarity, relevance, grammatical correctness, and distractor quality [5]. Scores were averaged per criterion over both experts and summed, yielding a final score ranging from 0 to 10.
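The evaluation scoring described above (binary ratings on ten criteria, averaged per criterion over the experts, then summed to a 0–10 score) can be sketched in a few lines. This is an illustrative reconstruction, not the authors' implementation; the criterion names and data layout are assumptions.

```python
def final_score(expert_ratings):
    """Compute the 0-10 question score from per-expert binary ratings.

    expert_ratings: list of dicts, one per expert, each mapping a
    criterion name to a binary score (0 or 1) for a single question.
    Each criterion is averaged over the experts, then the per-criterion
    means are summed, so ten criteria yield a score between 0 and 10.
    """
    n_experts = len(expert_ratings)
    criteria = expert_ratings[0].keys()
    return sum(
        sum(ratings[c] for ratings in expert_ratings) / n_experts
        for c in criteria
    )
```

A question rated 1 on every criterion by both experts would score 10.0; full agreement on five criteria and full disagreement on the other five would score 7.5. Set-level results (e.g., the reported 4.7 vs. 8.7 vs. 8.9) would then be means of these per-question scores over each set of 50 questions.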

Results: The baseline AI-generated questions achieved a final score of 4.7, establishing a lower benchmark. Human-authored questions scored 8.9, and our enhanced AI pipeline attained 8.7. Furthermore, the pipeline demonstrated better alignment with criteria such as topic centrality and relevance to learning objectives than the human-authored questions. However, generating suitable distractors remained more difficult for the AI system. These results indicate that AI-generated MCQs in biochemistry can effectively support educators in developing high-quality questions.

Conclusion: This work demonstrates the potential of a multi-agent, graph-optimized approach to automated MCQ generation in medical education. By using small language models within the GPTSwarm framework, we enable efficient and adaptive MCQ generation for personalized learning. The code for the platform and the agent-based optimization will be released as open source to enable expansion to other topics and medical faculties.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

[1] Ali F, Talat H. AI Integration in MCQ Development: Assessing Quality in Medical Education: A Systematic Review. L&S. 2024;5(3):14. DOI: 10.37185/LnS.1.1.643
[2] Zhuge M, Wang W, Kirsch L, Faccio F, Khizbullin D, Schmidhuber J. GPTSwarm: Language Agents as Optimizable Graphs [Preprint]. arXiv. 2024. DOI: 10.48550/arXiv.2402.16823
[3] Wu F, Li Z, Wei F, Li Y, Ding B, Gao J. Talk to Right Specialists: Routing and Planning in Multi-agent System for Question Answering [Preprint]. arXiv. 2025. DOI: 10.48550/arXiv.2501.07813
[4] Gao Y, Xiong Y, Gao X, Jia K, Pan J, Bi Y, et al. Retrieval-Augmented Generation for Large Language Models: A Survey [Preprint]. arXiv. 2024. DOI: 10.48550/arXiv.2312.10997
[5] Wang J, Xiao R, Tseng YJ. Generating AI Literacy MCQs: A Multi-Agent LLM Approach. In: Proceedings of the 56th ACM Technical Symposium on Computer Science Education V 2. 2025. p. 1651–2. DOI: 10.1145/3641555.3705189