Statistically enhanced learning for modeling and prediction of tennis matches at Grand Slam tournaments


Abstract


In this manuscript, we concentrate on a specific type of covariate, which we call statistically enhanced, for modeling men's tennis matches at Grand Slam tournaments. Our goal is to assess whether these enhanced covariates can improve statistical learning approaches, in particular with regard to predictive performance. For this purpose, several regression and machine learning model classes are compared with and without such features. Specifically, we consider three statistically enhanced variables, namely the Elo rating along with two different player age variables. This concept has already been applied successfully in football, where additional team ability parameters obtained from separate statistical models improved predictive performance.
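As an illustration of such an enhanced covariate, the following minimal R sketch derives Elo ratings sequentially from historical match results. The data frame matches and its columns winner_id and loser_id are hypothetical placeholders, and the constants (K = 32, initial rating 1500, scale 400) are the usual Elo defaults rather than values taken from this work.

```r
# Hypothetical sketch: sequential Elo ratings as a statistically enhanced covariate.
# `matches` is assumed to be ordered chronologically with columns `winner_id`, `loser_id`.
elo_ratings <- function(matches, k = 32, start = 1500) {
  ratings <- numeric(0)                                    # named vector of current ratings
  get_rating <- function(id) if (id %in% names(ratings)) ratings[[id]] else start

  for (i in seq_len(nrow(matches))) {
    w <- as.character(matches$winner_id[i])
    l <- as.character(matches$loser_id[i])
    rw <- get_rating(w); rl <- get_rating(l)
    e_w <- 1 / (1 + 10^((rl - rw) / 400))                  # expected winning probability of the winner
    ratings[[w]] <- rw + k * (1 - e_w)                     # winner gains rating points
    ratings[[l]] <- rl - k * (1 - e_w)                     # loser loses the same amount
  }
  ratings
}
```

The pre-match rating difference between the two opponents can then be added as an additional covariate to any of the compared model classes.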

In addition, different interpretable machine learning (IML) tools are used to gain insights into the factors that drive the match outcomes predicted by complex machine learning models such as the random forest. Specifically, partial dependence plots (PDPs) and individual conditional expectation (ICE) plots are employed to improve the interpretability of the most promising ML model in this work. Furthermore, we compare different regression and machine learning approaches with respect to several predictive performance measures, such as the classification rate, the predictive Bernoulli likelihood, and the Brier score. This comparison is carried out on external test data using cross-validation, rolling window, and expanding window strategies.
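The following R sketch illustrates, under assumed data structures, how such an analysis could look: a probability forest fitted with the ranger package, the Brier score evaluated on hold-out data, and ICE/PDP curves obtained with the pdp package. The objects train_df and test_df, the binary factor response win, and the covariate elo_diff are hypothetical names, not the variables used in the paper.

```r
# Hypothetical sketch, not the authors' pipeline.
library(ranger)   # fast random forests
library(pdp)      # partial dependence and ICE curves

# Probability forest for the binary match outcome
rf_fit <- ranger(win ~ ., data = train_df, probability = TRUE, num.trees = 500)

# Hold-out winning probabilities and Brier score:
# mean squared difference between predicted probability and the 0/1 outcome
p_hat <- predict(rf_fit, data = test_df)$predictions[, "1"]
y     <- as.numeric(as.character(test_df$win))
brier <- mean((p_hat - y)^2)

# ICE curves (one per training observation) and their average, the PDP,
# for the Elo-difference covariate; pred.fun returns P(win = 1) per row
pred_win <- function(object, newdata) predict(object, data = newdata)$predictions[, "1"]
ice_elo  <- partial(rf_fit, pred.var = "elo_diff", pred.fun = pred_win, train = train_df)
plotPartial(ice_elo)
```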

Keywords: Grand Slam tournaments, tennis matches, machine learning, prediction, statistically enhanced covariates, interpretable machine learning, expanding window

References


Angelini, G., Candila, V., and De Angelis, L. (2022). Weighted Elo rating for tennis match predictions. European Journal of Operational Research, 297(1):120–132.

Apley, D. W. and Zhu, J. (2020). Visualizing the effects of predictor variables in black box supervised learning models. Journal of the Royal Statistical Society Series B: Statistical Methodology, 82(4):1059–1086.

Auret, L. and Aldrich, C. (2012). Interpretation of nonlinear relationships between process variables by use of random forests. Minerals Engineering, 35:27–42.

Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford University Press.

Breiman, L. (1996a). Bagging predictors. Machine Learning, 24:123–140.

Breiman, L. (1996b). Heuristics of instability and stabilization in model selection. The Annals of Statistics, 24(6):2350–2383.

Breiman, L. (2001). Random forests. Machine Learning, 45:5–32.

Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Wadsworth International Group, Belmont, CA.

Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78:1–3.

Buhamra, N., Groll, A., and Brunner, S. (2024). Modeling and prediction of tennis matches at Grand Slam tournaments. Journal of Sports Analytics, 10(1):17–33.

Buhamra, N., Groll, A., and Gerharz, A. (2025). Comparing modern machine learning approaches for modeling tennis matches at Grand Slam tournaments. Journal of Sports Analytics. Under review.

Eilers, P. H. C. and Marx, B. D. (2021). Practical smoothing: The joys of P-splines. Cambridge University Press.

Eilers, P. H. C. and Marx, B. D. (1996). Flexible smoothing with B-splines and penalties. Statistical Science, 11:89–121.

Fahrmeir, L. and Tutz, G. (2001). Multivariate Statistical Modelling Based on Generalized Linear Models. Springer-Verlag, New York, 2nd edition.

Felice, F., Ley, C., Groll, A., and Bordas, S. (2023). Statistically enhanced learning: a feature engineering framework to boost (any) learning algorithms. arXiv preprint arXiv:2306.17006.

Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. The Annals of Statistics, 29(5):1189–1232.

Gao, Z. and Kowalczyk, A. (2021). Random forest model identifies serve strength as a key predictor of tennis match outcome. Journal of Sports Analytics, 7(4):255–262.

Goldstein, A., Kapelner, A., Bleich, J., and Pitkin, E. (2015). Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation. Journal of Computational and Graphical Statistics, 24(1):44–65.

Greenwell, B. M. (2017). pdp: An R package for constructing partial dependence plots. The R Journal, 9(1):421–436.

Groll, A., Ley, C., Schauberger, G., and Van Eetvelde, H. (2019a). A hybrid random forest to predict soccer matches in international tournaments. Journal of Quantitative Analysis in Sports, 15(4):271–287.

Kovalchik, S. (2019). deuce: resources for analysis of professional tennis data. R package version 1.4.

LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553):436–444.

Ley, C., Van de Wiele, T., and Van Eetvelde, H. (2019). Ranking soccer teams on the basis of their current strength: A comparison of maximum likelihood approaches. Statistical Modelling, 19(1):55–73.

Molnar, C. (2020). Interpretable Machine Learning. Lulu.com.

Molnar, C., Freiesleben, T., König, G., Herbinger, J., Reisinger, T., Casalicchio, G., Wright, M. N., and Bischl, B. (2023). Relating the partial dependence plot and permutation feature importance to the data generating process. In World Conference on Explainable Artificial Intelligence, pages 456–479. Springer.

Nelder, J. A. and Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the Royal Statistical Society, A 135:370–384.

R Core Team (2024). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144.

Schauberger, G. and Groll, A. (2018). Predicting matches in international football tournaments with random forests. Statistical Modelling, 18(5-6):460–482.

Sipko, M. and Knottenbelt, W. (2015). Machine learning for the prediction of professional tennis matches. MEng Computing final year project, Imperial College London.

Somboonphokkaphan, A., Phimoltares, S., and Lursinsap, C. (2009). Tennis winner prediction based on time-series history with neural modeling. In Proceedings of the International MultiConference of Engineers and Computer Scientists, volume 1, pages 18–20. Citeseer.

Vaughan Williams, L., Liu, C., Dixon, L., and Gerrard, H. (2021). How well do Elo-based ratings predict professional tennis matches? Journal of Quantitative Analysis in Sports, 17(2):91–105.

Weston, D. (2014). Using age statistics to gain a tennis betting edge. http://www.pinnacle.com/en/betting-articles/Tennis/atp-players-tipping-point/LMPJF7BY7BKR2EY.

Whiteside, D., Cant, O., Connolly, M., and Reid, M. (2017). Monitoring hitting load in tennis using inertial sensors and machine learning. International Journal of Sports Physiology and Performance, 12(9):1212–1217.

Wickham, H. and Chang, W. (2016). ggplot2: Create elegant data visualisations using the grammar of graphics. R package version 2.1.

Wilkens, S. (2021). Sports prediction and betting models in the machine learning age: The case of tennis. Journal of Sports Analytics, 7(2):99–117.

Wood, S. N. (2017). Generalized Additive Models: An Introduction with R. Chapman & Hall/CRC, London, 2nd edition.

Wright, M. N. and Ziegler, A. (2017). ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, 77(1):1–17.

