Clusters that are not there: An R tutorial and a Shiny app to quantify a priori inferential risks when using clustering methods


Clustering methods are increasingly used in social science research. Generally, researchers use them to infer the existence of qualitatively different types of individuals within a larger population, thus unveiling previously "hidden" heterogeneity. Depending on the clustering technique, however, valid inference requires some conditions and assumptions. Common risks include not only failing to detect existing clusters due to a lack of power but also revealing clusters that do not exist in the population. Simple data simulations suggest that under conditions of sample size, number, correlation and skewness of indicators that are frequently encountered in applied psychological research, commonly used clustering methods are at a high risk of detecting clusters that are not there. Generally, this is due to some violations of assumptions that are not usually considered critical in psychology. The present article illustrates a simple R tutorial and a Shiny app (for those who are not familiar with R) that allow researchers to quantify a priori inferential risks when performing clustering methods on their own data. Doing so is suggested as a much-needed preliminary sanity check, because conditions that inflate the number of detected clusters are very common in applied psychological research scenarios.
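The risk described in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' R tutorial or Shiny app: it is written in Python with the standard library only, it uses a single indicator instead of several correlated ones, and all function names (`bic_one_cluster`, `bic_two_clusters`, `false_cluster_rate`, `draw_symmetric`, `draw_skewed`) are hypothetical. Every sample is drawn from one homogeneous population, so any sample for which BIC prefers two normal components over one counts as a false detection. The log-normal shape parameter of 0.6 and the sample size of 300 are arbitrary assumptions chosen for illustration.

```python
import math
import random


def log_normal_pdf(x, mean, sd):
    """Log density of a normal distribution at x."""
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)


def bic_one_cluster(data):
    """BIC of a single normal distribution (2 free parameters)."""
    n = len(data)
    mean = sum(data) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
    loglik = sum(log_normal_pdf(x, mean, sd) for x in data)
    return -2.0 * loglik + 2 * math.log(n)


def bic_two_clusters(data, n_iter=50):
    """BIC of a two-component normal mixture fitted by EM (5 free parameters)."""
    n = len(data)
    ordered = sorted(data)
    mean_all = sum(data) / n
    sd_all = math.sqrt(sum((x - mean_all) ** 2 for x in data) / n)
    sd_floor = 0.1 * sd_all  # assumption: keeps a component from collapsing
    half = n // 2
    # Start from a median split: one component per half of the sorted data.
    means = [sum(ordered[:half]) / half, sum(ordered[half:]) / (n - half)]
    sds = [sd_all, sd_all]
    weights = [0.5, 0.5]

    def component_logs(x):
        # Log of (weight * density) for each of the two components.
        return [math.log(weights[k]) + log_normal_pdf(x, means[k], sds[k])
                for k in (0, 1)]

    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to component 0.
        resp = []
        for x in data:
            a, b = component_logs(x)
            resp.append(1.0 / (1.0 + math.exp(min(b - a, 700.0))))
        # M-step: update weight, mean and standard deviation of each component.
        for k in (0, 1):
            r = resp if k == 0 else [1.0 - p for p in resp]
            total = max(sum(r), 1e-9)
            weights[k] = max(total / n, 1e-9)
            means[k] = sum(ri * x for ri, x in zip(r, data)) / total
            var = sum(ri * (x - means[k]) ** 2 for ri, x in zip(r, data)) / total
            sds[k] = max(math.sqrt(var), sd_floor)

    loglik = 0.0
    for x in data:
        a, b = component_logs(x)
        top = max(a, b)
        loglik += top + math.log(math.exp(a - top) + math.exp(b - top))
    return -2.0 * loglik + 5 * math.log(n)


def false_cluster_rate(draw, n=300, n_reps=20, seed=1):
    """Share of one-population samples for which BIC prefers two clusters."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_reps):
        data = [draw(rng) for _ in range(n)]
        if bic_two_clusters(data) < bic_one_cluster(data):
            hits += 1
    return hits / n_reps


def draw_symmetric(rng):
    """One observation from a standard normal population."""
    return rng.gauss(0.0, 1.0)


def draw_skewed(rng):
    """One observation from a single, positively skewed (log-normal) population."""
    return rng.lognormvariate(0.0, 0.6)


if __name__ == "__main__":
    print("symmetric indicator:", false_cluster_rate(draw_symmetric))
    print("skewed indicator:   ", false_cluster_rate(draw_skewed))
```

Comparing the two printed rates shows how skewness alone, with no real subgroups in the population, can lead a normal mixture model to favour a two-cluster solution. Extending the sketch to several correlated indicators or to other clustering methods, as the article does, would require a multivariate model and is outside the scope of this example.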

Clusters that are not there: An R tutorial and a Shiny app to quantify a priori inferential risks when using clustering methods / Toffalini, E., Gambarota, F., Perugini, A., Girardi, P., Tobia, V., Altoè, G., Giofrè, D., Feraco, T. - In: International Journal of Psychology. - ISSN 0020-7594. - 59:6(2024), pp. 1183-1198. [10.1002/ijop.13246]


Gambarota F. (second author); Perugini A.; Girardi P.; Tobia V.; Altoè G.; Giofrè D. (penultimate author); Feraco T. (last author)

2024-01-01



Year: 2024

Keywords: Cluster analysis; Data simulation; Machine learning; Mixture models; k-means

Appears in types: 1.1 Journal article

Files in this item:

File	Access	Type	License	Size	Format
2024_Toffalini, Gambarota, Perugini, Girardi, Tobia, Altoè, Giofrè, Psicostat Core Team & Feraco,.pdf	Open access	Publisher's PDF (version published by the publisher)	Creative Commons	5.56 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11768/170196
