ClustOfVar-based approach for unsupervised learning: Reading of synthetic variables with sociological data


Abstract


This paper proposes an original data mining method for unsupervised learning that replaces traditional factor analysis with variable clustering. Variable clustering groups together variables that are strongly related to each other, i.e. that carry the same information. We recently proposed the ClustOfVar method, which is specifically devoted to variable clustering and handles numeric and categorical variables alike. It simultaneously provides homogeneous clusters of variables and their corresponding synthetic variables, each of which can be read as a kind of gradient. In this algorithm, the homogeneity of a cluster is measured against its synthetic variable: by the squared Pearson correlation for numeric variables and by the correlation ratio for categorical variables. The method was tested on categorical data on French farmers and their perception of the environment. The synthetic variables offered an original way of identifying how farmers reconfigured the questions put to them.
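The homogeneity criterion described in the abstract can be illustrated with a minimal Python sketch. It assumes the synthetic variable of a cluster is already available as a list of scores (in ClustOfVar itself it is computed by the package, which is not reproduced here); the function names and the toy data are hypothetical and serve illustration only.

```python
from collections import defaultdict
from statistics import fmean


def squared_pearson(x, y):
    """Squared Pearson correlation r^2 between two numeric sequences."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


def correlation_ratio(categories, y):
    """Correlation ratio eta^2: share of the variance of y explained by the categories."""
    my = fmean(y)
    groups = defaultdict(list)
    for c, v in zip(categories, y):
        groups[c].append(v)
    between = sum(len(g) * (fmean(g) - my) ** 2 for g in groups.values())
    total = sum((v - my) ** 2 for v in y)
    return between / total


def homogeneity(numeric_vars, categorical_vars, synthetic):
    """Sum of r^2 (numeric) and eta^2 (categorical) against the synthetic variable."""
    return (sum(squared_pearson(x, synthetic) for x in numeric_vars)
            + sum(correlation_ratio(z, synthetic) for z in categorical_vars))


# Toy data (hypothetical): the synthetic variable is assumed given, not estimated here.
synthetic = [1.0, 2.0, 3.0, 4.0]
income = [2.0, 4.0, 6.0, 8.0]      # numeric variable, perfectly correlated
opinion = ["a", "a", "b", "b"]     # categorical variable

h = homogeneity([income], [opinion], synthetic)
print(round(h, 3))
```

A cluster whose variables are all strongly linked to the synthetic variable yields a homogeneity close to the number of variables in the cluster; weakly linked variables pull the value down.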

DOI Code: 10.1285/i20705948v8n2p170

Keywords: environment; variable clustering; ClustOfVar; synthetic variables; typology of farmers

References


Abdallah, H. and Saporta, G. (1998). Classification d’un ensemble de variables qualitatives. Revue de Statistique Appliquée, 46(4):5–26.

Arabie, P. and Hubert, L. (1994). Cluster analysis in marketing research. In Bagozzi, R. P., editor, Advanced methods of marketing research, pages 160–189. Blackwell, Cambridge, MA.

Burton, R. J. F. (2014). The influence of farmer demographic characteristics on environmental behaviour: A review. Journal of Environmental Management, 135:19–26.

Candau, J., Deuffic, P., Ginelli, L., Lewis, N., and Lyser, S. (2005). La prise en compte de l’environnement par les agriculteurs. Résultats d’enquête. Rapport d’étude, Cemagref.

Charrad, M. and Ben Ahmed, M. (2011). Simultaneous Clustering: A Survey. In Pattern Recognition and Machine Intelligence. Springer Berlin / Heidelberg.

Chavent, M., Kuentz, V., Liquet, B., and Saracco, J. (2011). ClustOfVar: An R Package for the Clustering of Variables. In The R User Conference.

Chavent, M., Kuentz-Simonet, V., Liquet, B., and Saracco, J. (2012a). ClustOfVar: An R Package for the Clustering of Variables. Journal of Statistical Software, 50(13):1–16.

Chavent, M., Kuentz-Simonet, V., and Saracco, J. (2012b). Orthogonal rotation in PCAMIX. Advances in Data Analysis and Classification.

Dhillon, I., Marcotte, E., and Roshan, U. (2003). Diametrical Clustering for Identifying Anticorrelated Gene Clusters. Bioinformatics, 19(13):1612–1619.

Kiers, H. (1991). Simple structure in component analysis techniques for mixtures of qualitative and quantitative variables. Psychometrika, 56(2):197–212.

Lerman, I. (1990). Foundations of the likelihood linkage analysis classification method. Applied Stochastic Models and Data Analysis, 7(1):63–76.

Lerman, I. (1993). Likelihood linkage analysis classification method: An example treated by hand. Biochimie, 75(5):379–397.

SAS Institute Inc. (2013). The VARCLUS procedure. In SAS/STAT® 13.1 User’s Guide. SAS Institute Inc., Cary, NC.

Vichi, M. and Kiers, H. A. L. (2001). Factorial k-means analysis for two-way data. Computational Statistics & Data Analysis, 37(1):49–64.

Vichi, M. and Saporta, G. (2009). Clustering and Disjoint Principal Component Analysis. Computational Statistics & Data Analysis, 53(8):3194–3208.

Vigneau, E. and Chen, M. (2015). ClustVarLV: Clustering of Variables Around Latent Variables. R package version 1.3.2.

Vigneau, E. and Qannari, E. (2003). Clustering of variables around latent components. Communications in Statistics - Simulation and Computation, 32(4):1131–1150.




Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Italy License.