Reproducibility in medico-legal personal injury assessment should be regarded as a fundamental prerequisite for transparent and equitable compensation. However, inter-rater agreement is rarely examined or quantified, even within a single regulatory framework. This study aimed to evaluate intra-system reproducibility in the Italian setting and to describe the behavior of a large language model (LLM). Using twenty case synopses derived from court-appointed expert reports in civil personal injury proceedings, we collected blinded assessments from fifteen human raters and from one LLM regarding the percentage of permanent impairment (biological damage) and the degree of impairment-related suffering (moral suffering). Human raters were stratified by experience level (experts > 20 years, specialists < 10 years, final-year trainees). Inter-rater agreement among human assessors for biological damage was high (overall ICC 0.914; 95% CI 0.826-0.973) and displayed an experience-related gradient. The LLM showed close concordance with the court-appointed expert's biological damage scores (bias +0.05; Pearson's r 0.95) but assigned substantially higher ratings of moral suffering than human assessors. Overall, biological damage evaluations showed high reproducibility, increasing with experience, whereas moral suffering ratings diverged systematically between human and LLM outputs.
Inter-rater reproducibility in medico-legal injury assessment: a pilot study with an exploratory comparison to a large language model / Blandino, A., Pacchioni, F., Muccino, E.A., Tambone, V., De Micco, F., Antonini, C., Zoja, R., Travaini, G.V. - In: INTERNATIONAL JOURNAL OF LEGAL MEDICINE. - ISSN 0937-9827. - (2026). [Epub ahead of print] [10.1007/s00414-026-03842-w]
Inter-rater reproducibility in medico-legal injury assessment: a pilot study with an exploratory comparison to a large language model
Blandino A. (first author); Pacchioni F.; Antonini C.; Travaini G. V.
2026-01-01


