Predictions of European basketball match results with machine learning algorithms

IF 0.6 Q4 HOSPITALITY, LEISURE, SPORT & TOURISM

Journal of Sports Analytics Pub Date : 2023-03-31 DOI:10.3233/jsa-220639

Tzai Lampis, Ntzoufras Ioannis, Vassalos Vasilios, Dimitriou Stavrianna


Citations: 1

Abstract

The goal of this paper is to build and compare methods for the prediction of the final outcomes of basketball games. In this study, we analyzed data from four different European tournaments: Euroleague, Eurocup, Greek Basket League and Spanish Liga ACB. The data-set consists of information collected from box scores of 5214 games for the period of 2013-2018. The predictions obtained by our implemented methods and models were compared with a “vanilla” model using only the team-name information of each game. In our analysis, we have included new performance indicators constructed by using historical statistics, key performance indicators and measurements from three rating systems (Elo, PageRank, pi-rating). For these three rating systems and every tournament under consideration, we tune the rating system parameters using specific training data-sets. These new game features improve our predictions and can be easily obtained for any basketball league. Our predictions were obtained by implementing three different statistical and machine learning algorithms: logistic regression, random forest, and extreme gradient boosting trees. Moreover, we report predictions based on the combination of these algorithms (ensemble learning). We evaluate our predictions using three predictive measures: Brier score, accuracy and F1-score. In addition, we evaluate the performance of our algorithms with three different prediction scenarios (full-season, mid-season, and play-offs predictive evaluation). For the mid-season and the play-offs scenarios, we further explore whether incorporating additional results from previous seasons in the learning data-set enhances the predictive performance of the implemented models and algorithms. Concerning the results, there is no clear winner among the machine learning algorithms, since they provide nearly identical predictions with only small differences. However, models with the predictors suggested in this paper outperform the “vanilla” model by 3-5% in terms of accuracy. Another conclusion from our results for the play-offs scenarios is that it is not necessary to embed outcomes from previous seasons in our training data-set. Using data from the current season, most of the time, leads to efficient, accurate parameter learning and well-behaved prediction models. Moreover, the Greek league is the least balanced tournament in terms of competitiveness, since all our models achieve high predictive accuracy (78% for the best-performing model). The second least balanced league is the Spanish one, with accuracy reaching 72%, while for the two European tournaments the prediction accuracy is considerably lower (about 69%). Finally, we present the most important features by counting the percentage of appearance in every machine learning algorithm for every one of the three analyses. From this analysis, we may conclude that the best predictors are the rating systems (pi-rating, PageRank, and Elo) and the current form performance indicators (e.g., the two most frequent ones are the game score of Hollinger and the floor impact counter).
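As an illustration of two ingredients named in the abstract, the sketch below shows a standard Elo rating update and the three evaluation measures (Brier score, accuracy, F1-score) for a binary home-win outcome. It is not the authors' implementation: the K-factor of 20, the 400-point scale, the optional home-advantage term, the 0.5 decision threshold, and all function names are assumptions chosen for the example. The abstract states only that the rating parameters were tuned per tournament and does not give their values.

```python
def elo_expected(r_home, r_away, home_adv=0.0):
    """Expected home-win probability under the standard Elo logistic curve.

    The 400-point scale and the home_adv offset are illustrative assumptions.
    """
    return 1.0 / (1.0 + 10 ** ((r_away - (r_home + home_adv)) / 400.0))


def elo_update(r_home, r_away, home_won, k=20.0, home_adv=0.0):
    """Return updated (home, away) ratings after one game.

    k is a hypothetical K-factor; the paper tunes such parameters per league.
    """
    p_home = elo_expected(r_home, r_away, home_adv)
    delta = k * ((1.0 if home_won else 0.0) - p_home)
    return r_home + delta, r_away - delta


def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and outcome (0/1)."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def accuracy(probs, outcomes, threshold=0.5):
    """Share of games whose thresholded prediction matches the outcome."""
    preds = [1 if p >= threshold else 0 for p in probs]
    return sum(int(a == b) for a, b in zip(preds, outcomes)) / len(outcomes)


def f1_score(probs, outcomes, threshold=0.5):
    """F1 for the positive class (home win), returning 0.0 if undefined."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for a, b in zip(preds, outcomes) if a == 1 and b == 1)
    fp = sum(1 for a, b in zip(preds, outcomes) if a == 1 and b == 0)
    fn = sum(1 for a, b in zip(preds, outcomes) if a == 0 and b == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    # Two equally rated teams; the home team wins.
    print(elo_update(1500.0, 1500.0, home_won=True))
    # Toy predicted home-win probabilities and observed outcomes.
    probs, outcomes = [0.8, 0.3, 0.6], [1, 0, 0]
    print(brier_score(probs, outcomes), accuracy(probs, outcomes), f1_score(probs, outcomes))
```

A lower Brier score indicates better-calibrated probabilities, whereas accuracy and F1-score depend only on the thresholded prediction, which is one reason to report all three together.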

