Development and internal validation of machine learning prognostic models of sports injuries using self-reported data in athletics (track and field): The influence of quantity and quality of features.
Authors: Spyridon Iatropoulos, Pierre-Eddy Dandrieux, Pascal Edouard, Laurent Navarro
Journal: Journal of Sports Sciences, pp. 1-15 (published 2025-06-13)
DOI: 10.1080/02640414.2025.2517971 (https://doi.org/10.1080/02640414.2025.2517971)
Abstract
This study compared the performance of machine learning prognostic models of sports injury when trained on (i) baseline data (i.e. collected once) vs. monitoring data (i.e. collected frequently over a period), (ii) raw monitoring data vs. time-integrating engineered features of the same data, and (iii) different numbers of features. Self-reported data collected from track and field athletes over 39 weeks during a previous randomised controlled trial constituted the dataset for model development. Baseline features, monitoring features, and two time-integrating feature engineering strategies were employed. Seven machine learning algorithms were trained with different groups and numbers of features and validated internally with bootstrapping. The models' discrimination was compared statistically using t-tests or Mann-Whitney tests (α = 0.00026). A dataset of 4537 cases, including 149 injuries, was derived from 165 athletes. Monitoring features outperformed baseline features in 5 out of 7 algorithms (p < 0.00026). The two feature engineering strategies showed marginal differences (1-8%) in 4 out of 7 algorithms (p < 0.00026). Larger numbers of features consistently improved performance for 6 out of 7 algorithms. Developing injury prediction ML models based on self-reported data in athletics seems promising, but their performance is highly influenced by the quality and quantity of features.
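The abstract does not name the two time-integrating feature engineering strategies, so the following is only a minimal sketch of what such features can look like. It assumes a rolling mean and an exponentially weighted moving average (EWMA) as illustrative stand-ins. The function names, the weekly training hours, the 3-week window, and the smoothing factor of 0.5 are all hypothetical and are not taken from the study.

```python
from typing import List


def rolling_mean(values: List[float], window: int) -> List[float]:
    """Mean of the current and up to `window - 1` preceding observations.

    Only past and current values are used, so a feature computed for
    week t never looks ahead of week t.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    out = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def ewma(values: List[float], alpha: float) -> List[float]:
    """Exponentially weighted moving average, seeded with the first value."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    out: List[float] = []
    for i, value in enumerate(values):
        if i == 0:
            out.append(value)
        else:
            # Blend the new observation with the previous smoothed value.
            out.append(alpha * value + (1 - alpha) * out[-1])
    return out


# Hypothetical weekly training hours for one athlete (illustrative only).
weekly_hours = [6.0, 8.0, 10.0, 4.0]

rolling_3wk = rolling_mean(weekly_hours, window=3)  # assumed 3-week window
ewma_half = ewma(weekly_hours, alpha=0.5)           # assumed smoothing factor

print(rolling_3wk)
print(ewma_half)
```

Both transforms summarise an athlete's recent history in a single value per week, which is the general idea behind a time-integrating feature as opposed to a raw weekly report.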
About the journal:
The Journal of Sports Sciences has an international reputation for publishing articles of a high standard and is both Medline and Clarivate Analytics-listed. It publishes research on various aspects of the sports and exercise sciences, including anatomy, biochemistry, biomechanics, performance analysis, physiology, psychology, sports medicine and health, as well as coaching and talent identification, kinanthropometry and other interdisciplinary perspectives.
The emphasis of the Journal is on the human sciences, broadly defined and applied to sport and exercise. Besides experimental work in human responses to exercise, the subjects covered will include human responses to technologies such as the design of sports equipment and playing facilities, research in training, selection, performance prediction or modification, and stress reduction or manifestation. Manuscripts considered for publication include those dealing with original investigations of exercise, validation of technological innovations in sport or comprehensive reviews of topics relevant to the scientific study of sport.