Development and internal validation of machine learning prognostic models of sports injuries using self-reported data in athletics (track and field): The influence of quantity and quality of features.

Impact factor 2.5; Region 2 (Medicine); JCR Q2, SPORT SCIENCES

Journal of Sports Sciences Pub Date : 2025-06-13 DOI:10.1080/02640414.2025.2517971

Spyridon Iatropoulos, Pierre-Eddy Dandrieux, Pascal Edouard, Laurent Navarro


Citations: 0

Abstract

This study aimed to compare the performance of sports injury prognostic machine learning models when trained on (i) baseline data (i.e. collected once) vs. monitoring data (i.e. collected frequently over a period), (ii) raw monitoring data vs. time-integrating engineered features of the same data, and (iii) different numbers of features. Self-reported data collected over 39 weeks during a previous randomised controlled trial in track and field athletes constituted the dataset for model development. Baseline features, monitoring features, and two time-integrating feature engineering strategies were employed. Seven machine learning algorithms were trained with different groups and numbers of features and validated internally with bootstrapping. The models' discrimination was statistically compared using t-tests or Mann-Whitney tests (α = 0.00026). A dataset of 4537 cases including 149 injuries was derived from 165 athletes. Monitoring features outperformed baseline features in 5 out of 7 algorithms (p < 0.00026). The two feature engineering strategies showed marginal differences (1-8%) in 4 out of 7 algorithms (p < 0.00026). Larger numbers of features consistently improved performance for 6 out of 7 algorithms. Developing injury prediction ML models based on self-reported data in athletics seems promising, but model performance is highly influenced by the quality and quantity of features.
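The abstract does not say which two time-integrating strategies were used, so the following is only a minimal sketch of what "time-integrating engineered features" of weekly monitoring data could look like. It assumes two common choices, a trailing-window mean and an exponentially weighted moving average; the function names, the `weekly_hours` values, and the window and smoothing parameters are all hypothetical and are not taken from the paper.

```python
from statistics import mean


def trailing_mean(values, window):
    """Average of up to `window` weeks preceding each week.

    The current week is excluded, so the feature only uses information
    available before the week whose outcome is being predicted.
    The first week has no history and gets None.
    """
    features = []
    for t in range(len(values)):
        history = values[max(0, t - window):t]
        features.append(mean(history) if history else None)
    return features


def trailing_ewma(values, alpha):
    """Exponentially weighted average of all preceding weeks.

    `alpha` (0 < alpha <= 1) is the weight of the most recent past week.
    As above, the current week is excluded and the first week gets None.
    """
    features = []
    state = None  # running summary of the weeks seen so far
    for value in values:
        features.append(state)
        state = value if state is None else alpha * value + (1 - alpha) * state
    return features


# Hypothetical self-reported weekly training hours for one athlete.
weekly_hours = [4.0, 6.0, 8.0, 2.0, 0.0]

print(trailing_mean(weekly_hours, window=2))
print(trailing_ewma(weekly_hours, alpha=0.5))
```

Both functions turn a raw weekly series into one value per week that summarises recent history, which is the general idea the abstract contrasts with using the raw monitoring values directly.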


Source journal

Journal of Sports Sciences (Social Sciences: Sport Sciences)

CiteScore: 6.30

Self-citation rate: 2.90%

Articles published: 147

Review time: 12 months

About the journal: The Journal of Sports Sciences has an international reputation for publishing articles of a high standard and is both Medline and Clarivate Analytics-listed. It publishes research on various aspects of the sports and exercise sciences, including anatomy, biochemistry, biomechanics, performance analysis, physiology, psychology, sports medicine and health, as well as coaching and talent identification, kinanthropometry and other interdisciplinary perspectives. The emphasis of the Journal is on the human sciences, broadly defined and applied to sport and exercise. Besides experimental work on human responses to exercise, the subjects covered include human responses to technologies such as the design of sports equipment and playing facilities, research in training, selection, performance prediction or modification, and stress reduction or manifestation. Manuscripts considered for publication include those dealing with original investigations of exercise, validation of technological innovations in sport, or comprehensive reviews of topics relevant to the scientific study of sport.