Accuracy and Knowledge Base Evaluation of ChatGPT-4o, Gemini-2.0-Flash, and DeepSeek-V3 in Metabolic and Bariatric Surgery: an Expert-Rated Blinded Study / Hany, M., Zidan, M.H., Parmar, C., Shahmiri, S.S., Altabbaa, H., El-Shamarka, A., Amgad, A., Abdelkhalek, I.M., Assal, A.A., Abdou, M.E., Kermansaravi, M., Borbely, Y., Nijs, Y., Yang, W., Charalampakis, V., Salles, V.J.A., Bilecik, T., Poghosyan, T., Hassab, T., Pintar, T., et al. - In: OBESITY SURGERY. - ISSN 0960-8923. - (2026). [10.1007/s11695-026-08562-z]
Accuracy and Knowledge Base Evaluation of ChatGPT-4o, Gemini-2.0-Flash, and DeepSeek-V3 in Metabolic and Bariatric Surgery: an Expert-Rated Blinded Study
Olmi S.; Chiappetta S.
2026-01-01
Abstract
Background: Large language models (LLMs) are increasingly applied in medicine; however, their accuracy in guideline-driven, high-stakes specialties, such as metabolic and bariatric surgery (MBS), remains uncertain. This study evaluates the performance of ChatGPT-4o, Gemini 2.0 Flash, and DeepSeek-V3 in generating guideline-concordant responses to MBS clinical questions. Methods: Thirty standardized, guideline-based MBS questions were presented to each model. Responses were randomized in order, anonymized (blinded as Model A/B/C), and evaluated by 93 MBS experts using a validated 0–3 scale (0 = inaccurate; 3 = fully guideline-concordant). A repeated-measures ANOVA with Bonferroni correction tested model differences; reliability was assessed with Cronbach’s α and intraclass correlation coefficients (ICC). Results: DeepSeek-V3 achieved the highest mean score (2.44 ± 0.40), followed by ChatGPT-4o (1.79 ± 0.46) and Gemini 2.0 Flash (1.63 ± 0.47) (p < 0.001). Fully guideline-concordant ratings (score = 3) were most frequent for DeepSeek (80%) vs. ChatGPT (0%) and Gemini (3.3%). Internal consistency was excellent (α > 0.90), and inter-rater reliability was strong (ICC > 0.88). When mapped against the QUEST evaluation framework, the study addressed Quality and Understanding but did not fully capture Expression, Safety, or Trust dimensions. Conclusions: DeepSeek-V3 outperformed ChatGPT-4o and Gemini 2.0 Flash in generating guideline-concordant responses in MBS. These results highlight the need for ongoing, domain-focused validation before clinical use.
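
For readers unfamiliar with the internal-consistency statistic named in the Methods, the following is a minimal sketch of the standard Cronbach's α formula, α = k/(k−1) · (1 − Σ s_i² / s_T²), where k is the number of columns, s_i² the sample variance of each column, and s_T² the sample variance of the row totals. It is not the authors' analysis code. The function name `cronbach_alpha`, the layout (rows as rated responses, columns as raters), and the toy scores are all assumptions made for illustration; the study's actual data arrangement and software are not stated in the abstract.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a rectangular table of scores.

    Assumed layout (hypothetical, not taken from the study):
    each row is one rated response, each column is one rater.
    """
    k = len(rows[0])  # number of columns (raters in this sketch)
    if k < 2:
        raise ValueError("need at least two columns")
    columns = list(zip(*rows))
    # Sum of the per-column sample variances.
    item_variance_sum = sum(variance(col) for col in columns)
    # Sample variance of the per-row totals.
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


if __name__ == "__main__":
    # Made-up 0-3 scores, for illustration only (not study data).
    toy_scores = [
        [3, 3, 2],
        [1, 2, 1],
        [2, 2, 3],
        [0, 1, 0],
    ]
    print(round(cronbach_alpha(toy_scores), 3))
```

Values near 1 indicate that the columns rank the rows consistently; the abstract reports α > 0.90 for the expert ratings.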


