Comparison of Multi-site Neuroimaging Data Harmonization Techniques for Machine Learning Applications

Sampaio, I. W.; Tassi, E.; Bellani, M.; Benedetti, F.; Poletti, S.; Spalletta, G.; Piras, F.; Bianchi, A. M.; Brambilla, P.; Maggioni, E.

doi:10.1109/EUROCON56442.2023.10198911

Multi-site datasets have become widely accessible to the research community, and their use in machine learning (ML) analyses is valued because it increases statistical power and improves model generalization. Nonetheless, variability associated with data sources can confound ML models, so site effects should be removed beforehand in a data processing stage. In the present study, we explore multi-site harmonization from an ML analysis perspective. Using a multi-site neuroimaging dataset composed of healthy controls and subjects with bipolar disorder, we compared the efficacy of site harmonization based on linear regression and on the ComBat model, either applied to the entire dataset or adapted to the cross-validation (CV) framework used to evaluate ML models. We then trained a support vector machine (SVM) model for diagnosis classification and analyzed the impact of the harmonization strategies on the model's performance. The diagnosis classification AUC-ROC was comparable across harmonization strategies. This evidence indicates that CV-based ComBat harmonizes multi-center data effectively while avoiding information leakage from the test sets, supporting the use of this strategy in ML analyses.
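The idea of adapting harmonization to the cross-validation framework can be illustrated with a minimal sketch. It is not the authors' implementation: it uses synthetic data, covers only a simplified linear-regression variant (additive site offsets, no covariates such as diagnosis or age, and none of ComBat's scale or empirical Bayes terms), and all function and variable names (`fit_site_effects`, `apply_site_effects`, `site_gap`) are hypothetical. The point it shows is that site parameters are estimated on the training folds only and then applied unchanged to the held-out fold.

```python
import numpy as np

def fit_site_effects(X_train, site_train, n_sites):
    """Estimate additive site offsets by least squares, using training data only.

    Assumes every site appears at least once in the training fold.
    """
    # One-hot design without intercept: each coefficient row is a site mean.
    design = np.eye(n_sites)[site_train]
    beta, *_ = np.linalg.lstsq(design, X_train, rcond=None)
    # Express each site mean relative to the training grand mean.
    return beta - X_train.mean(axis=0)

def apply_site_effects(X, site, offsets):
    """Remove previously estimated site offsets from any data (train or test)."""
    return X - offsets[site]

def site_gap(X, site, n_sites):
    """Largest between-site difference in feature means (a crude site-effect measure)."""
    means = np.stack([X[site == s].mean(axis=0) for s in range(n_sites)])
    return float((means.max(axis=0) - means.min(axis=0)).max())

# Synthetic example: 300 subjects, 4 features, 3 sites with different shifts.
rng = np.random.default_rng(0)
n_subjects, n_features, n_sites = 300, 4, 3
site = rng.integers(0, n_sites, size=n_subjects)
true_shift = np.array([0.0, 2.0, -1.5])
X = rng.normal(size=(n_subjects, n_features)) + true_shift[site][:, None]

# 5-fold CV: harmonization parameters never see the held-out fold.
folds = np.array_split(rng.permutation(n_subjects), 5)
X_h = np.empty_like(X)
for k, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
    offsets = fit_site_effects(X[train_idx], site[train_idx], n_sites)
    X_h[test_idx] = apply_site_effects(X[test_idx], site[test_idx], offsets)
    # A classifier would be trained here on the harmonized training fold
    # and evaluated on X_h[test_idx].

gap_before = site_gap(X, site, n_sites)
gap_after = site_gap(X_h, site, n_sites)
print(f"site gap before: {gap_before:.2f}, after (out-of-fold): {gap_after:.2f}")
```

Fitting the offsets on the entire dataset instead would let the held-out subjects influence the transformation applied to them, which is the information leakage the CV-based strategy is meant to avoid.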

Comparison of Multi-site Neuroimaging Data Harmonization Techniques for Machine Learning Applications / Sampaio, I.W., Tassi, E., Bellani, M., Benedetti, F., Poletti, S., Spalletta, G., Piras, F., Bianchi, A.M., Brambilla, P., Maggioni, E. - (2023), pp. 307-312. (20th International Conference on Smart Technologies, EUROCON 2023 ita 2023) [10.1109/EUROCON56442.2023.10198911].