A Data-Driven Approach to Predicting Recreational Activity Participation Using Machine Learning.

Research Quarterly for Exercise and Sport. Pub Date: 2024-12-01; Epub Date: 2024-06-14; DOI: 10.1080/02701367.2024.2343815

Seungbak Lee, Minsoo Kang

Pages: 873-885. Publication type: Journal Article.


Abstract

Purpose: Given the popularity of recreational activities, this study aimed to develop prediction models for recreational activity participation and to identify the key factors affecting that participation.

Methods: A total of 12,712 participants aged 20 years or older were selected from the National Health and Nutrition Examination Survey (NHANES) from 2011 to 2018. The mean age of the sample was 46.86 years (±16.97), and the sample comprised 6,721 males and 5,991 females. The predictors included demographic, physical-related, and lifestyle variables. The study developed 42 prediction models using six machine learning methods: logistic regression, Support Vector Machine (SVM), decision tree, random forest, eXtreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM). The relative importance of each variable was evaluated using permutation feature importance.

Results: LightGBM was the most effective algorithm for predicting recreational activity participation (accuracy: .838, precision: .783, recall: .967, F1-score: .865, AUC: .826). Prediction performance increased when the demographic and lifestyle datasets were used together. Permutation feature importance computed on the top models identified education level and moderate-to-vigorous physical activity (MVPA) as essential variables.

Conclusion: These findings demonstrate the potential of a data-driven approach using machine learning in a recreational discipline. In addition, the study interpreted the prediction models through feature importance analysis to address the limited interpretability of machine learning.
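The abstract names permutation feature importance as the method for ranking variables but gives no implementation details. The following is a minimal sketch of the general technique, not the authors' code. It assumes accuracy as the scoring metric, uses a hand-written threshold rule (`toy_model`) as a stand-in for a fitted classifier such as LightGBM, and runs on hypothetical synthetic data rather than NHANES. All function and variable names are illustrative.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(1 for row, y in zip(rows, labels) if model(row) == y)
    return hits / len(rows)


def permutation_importance(model, rows, labels, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle one column; this breaks its link to the labels
            # while leaving every other column untouched.
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            permuted = [
                row[:col] + [value] + row[col + 1:]
                for row, value in zip(rows, shuffled)
            ]
            drops.append(baseline - accuracy(model, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical toy data: column 0 determines the label, column 1 is noise.
data_rng = random.Random(42)
rows = [[data_rng.random(), data_rng.random()] for _ in range(500)]
labels = [int(row[0] > 0.5) for row in rows]


def toy_model(row):
    """Stand-in for a fitted classifier (assumption, not the paper's model)."""
    return int(row[0] > 0.5)


importances = permutation_importance(toy_model, rows, labels)
for name, score in zip(["informative", "noise"], importances):
    print(f"{name}: {score:.3f}")
```

A feature the model relies on shows a large accuracy drop when shuffled, while a feature the model ignores shows a drop of zero. In the study, the same idea would rank variables such as education level and MVPA by how much shuffling each one degrades the top models.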
