Accuracy and Knowledge Base Evaluation of ChatGPT-4o, Gemini-2.0-Flash, and DeepSeek-V3 in Metabolic and Bariatric Surgery: an Expert-Rated Blinded Study

Hany, M.; Zidan, M. H.; Parmar, C.; Shahmiri, S. S.; Altabbaa, H.; El-Shamarka, A.; Amgad, A.; Abdelkhalek, I. M.; Assal, A. A.; Abdou, M. E.; Kermansaravi, M.; Olmi, S.; Chiappetta, S.

doi:10.1007/s11695-026-08562-z

Background: Large language models (LLMs) are increasingly applied in medicine; however, their accuracy in guideline-driven, high-stakes specialties, such as metabolic and bariatric surgery (MBS), remains uncertain. This study evaluates the performance of ChatGPT-4o, Gemini 2.0 Flash, and DeepSeek-V3 in generating guideline-concordant responses to MBS clinical questions.

Methods: Thirty standardized, guideline-based MBS questions were presented to each model. Responses were presented in randomized order, anonymized (blinded as Model A/B/C), and evaluated by 93 MBS experts using a validated 0–3 scale (0 = inaccurate; 3 = fully guideline-concordant). A repeated-measures ANOVA with Bonferroni correction tested model differences; reliability was assessed with Cronbach’s α and intraclass correlation coefficients (ICC).

Results: DeepSeek-V3 achieved the highest mean score (2.44 ± 0.40), followed by ChatGPT-4o (1.79 ± 0.46) and Gemini 2.0 Flash (1.63 ± 0.47) (p < 0.001). Fully guideline-concordant ratings (score = 3) were most frequent for DeepSeek-V3 (80%) vs. ChatGPT-4o (0%) and Gemini 2.0 Flash (3.3%). Internal consistency was excellent (α > 0.90), and inter-rater reliability was strong (ICC > 0.88). When mapped against the QUEST evaluation framework, the study addressed Quality and Understanding but did not fully capture Expression, Safety, or Trust dimensions.

Conclusions: DeepSeek-V3 outperformed ChatGPT-4o and Gemini 2.0 Flash in generating guideline-concordant responses in MBS. These results highlight the need for ongoing, domain-focused validation before clinical use.
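For readers unfamiliar with the internal-consistency statistic named in the Methods, the following is a minimal sketch of how Cronbach's α can be computed from a ratings matrix. It is not the authors' analysis code. The function name `cronbach_alpha`, the layout of the matrix (rows as rated responses, columns as raters treated as "items"), and the toy numbers are all assumptions made for illustration; none of the values come from the study.

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's alpha for a ratings matrix.

    ratings: list of rows. Each row is one rated subject (assumed here to be
    one model response); each column is one item (assumed here to be one rater).
    """
    k = len(ratings[0])  # number of items (columns)
    if k < 2:
        raise ValueError("need at least two items (columns)")

    # Sum of the sample variances of each column.
    columns = list(zip(*ratings))
    item_variance_sum = sum(variance(col) for col in columns)

    # Sample variance of the per-row total scores.
    total_variance = variance([sum(row) for row in ratings])
    if total_variance == 0:
        raise ValueError("total scores have zero variance; alpha is undefined")

    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical 0-3 ratings: four responses scored by two raters.
toy_ratings = [
    [2, 3],
    [1, 1],
    [3, 2],
    [0, 1],
]
alpha = cronbach_alpha(toy_ratings)
```

Values of α close to 1 indicate that the columns vary together; the abstract reports α > 0.90 for the expert ratings.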

Accuracy and Knowledge Base Evaluation of ChatGPT-4o, Gemini-2.0-Flash, and DeepSeek-V3 in Metabolic and Bariatric Surgery: an Expert-Rated Blinded Study / Hany, M., Zidan, M.H., Parmar, C., Shahmiri, S.S., Altabbaa, H., El-Shamarka, A., Amgad, A., Abdelkhalek, I.M., Assal, A.A., Abdou, M.E., Kermansaravi, M., Olmi, S., Chiappetta, S. - In: OBESITY SURGERY. - ISSN 0960-8923. - 36:5(2026), pp. 2160-2171. [10.1007/s11695-026-08562-z]