Exploring essential variables for successful and unsuccessful football teams in the "Big Five" with multivariate supervised techniques


This research proposes multivariate techniques to identify the game actions that contribute to the final league position of football teams. The study uses data from the "Big Five" teams that competed in the top divisions of the Bundesliga, Premier League, LaLiga, Ligue 1, and Serie A in the 2018-2019 season. Principal component analysis is used to detect outliers and to provide a preliminary overview. The statistically significant game actions of the top and bottom teams were studied using three supervised multivariate techniques: partial least squares discriminant analysis, random forest, and logistic regression. The partial least squares discriminant analysis model best identifies the variables with the most statistically significant contribution to a team's success or failure. The results were compared with those obtained using univariate two-sample tests (such as Student's t-test or the Mann-Whitney test), demonstrating the advantages of multivariate approaches over univariate ones. The results indicate that the top teams are strong in both attack and defence, with a notably high number of attacking actions; by contrast, the bottom teams have weak defences and few offensive actions.

DOI Code: 10.1285/i20705948v15n1p249

Keywords: multivariate methods, two-sample tests, partial least squares discriminant analysis (PLS-DA), random forest (RF), logistic regression (LR), game actions


This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Italy License.