Predicting cancer risk using machine learning on lifestyle and genetic data.

IF 3.9 | Zone 2 (multidisciplinary journals) | Q1, Multidisciplinary Sciences

Scientific Reports | Pub Date: 2025-08-19 | DOI: 10.1038/s41598-025-15656-8

Mohamed Abdelmoaty Ahmed, Ahmed AbdelMoety, Asmaa Mohamed Ahmed Soliman

Scientific Reports, vol. 15, no. 1, p. 30458. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12365227/pdf/

Citations: 0

Abstract


Cancer remains one of the leading causes of mortality worldwide, where early detection significantly improves patient outcomes and reduces treatment burden. This study investigates the application of Machine Learning (ML) techniques to predict cancer risk based on a combination of genetic and lifestyle factors. A structured dataset of 1,200 patient records was used, comprising features such as age, gender, Body Mass Index (BMI), smoking status, alcohol intake, physical activity, genetic risk level, and personal history of cancer. A full end-to-end ML pipeline was implemented, encompassing data exploration, preprocessing, feature scaling, model training, and evaluation using stratified cross-validation and a separate test set. Nine supervised learning algorithms were evaluated and compared, including Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), Support Vector Machines (SVMs), and several ensemble methods. Among these, Categorical Boosting (CatBoost) achieved the highest predictive performance, with a test accuracy of 98.75% and an F1-score of 0.9820, outperforming both traditional and other advanced models. Feature importance analysis confirmed the strong influence of cancer history, genetic risk, and smoking status on prediction outcomes. The findings highlight the effectiveness of boosting-based ensemble models in capturing complex interactions within health data and support their potential use in personalized cancer risk assessment. This research underscores the value of integrating genetic and modifiable lifestyle variables into predictive modeling to enhance early detection and preventive healthcare strategies.
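The abstract mentions evaluation with stratified cross-validation, a separate test set, and accuracy and F1-score as metrics. As a minimal sketch of those ideas only, and not the authors' pipeline, the standard-library Python below deals sample indices into class-balanced folds and computes the two metrics. The 1,200-record size comes from the abstract; the 30% positive rate, the fold counts, the seeds, and the helper names `stratified_folds`, `accuracy`, and `f1_score` are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign each sample index to one of k folds, keeping class ratios similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold receives its share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 1,200 records with an ASSUMED 30% positive rate.
labels = [1] * 360 + [0] * 840

# Hold out one of five stratified folds as the separate test set ...
outer = stratified_folds(labels, k=5, seed=42)
test_idx = outer[0]
train_idx = [i for fold in outer[1:] for i in fold]

# ... and build stratified cross-validation folds on the remainder.
# Note: cv_folds holds positions within train_idx, not original indices.
train_labels = [labels[i] for i in train_idx]
cv_folds = stratified_folds(train_labels, k=5, seed=7)
```

Dealing each class round-robin keeps the positive rate nearly identical across folds, which matters when one class is much rarer than the other. F1 is reported alongside accuracy for the same reason: on imbalanced data, accuracy alone can look high even when the positive class is poorly detected.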


Source journal

Scientific Reports (Natural Science Disciplines)

CiteScore: 7.50

Self-citation rate: 4.30%

Articles published: 19567

Review time: 3.9 months

Journal description: We publish original research from all areas of the natural sciences, psychology, medicine and engineering. You can learn more about what we publish by browsing our specific scientific subject areas below or explore Scientific Reports by browsing all articles and collections. Scientific Reports has a 2-year impact factor of 4.380 (2021) and is the 6th most-cited journal in the world, with more than 540,000 citations in 2020 (Clarivate Analytics, 2021).

• Engineering: Engineering covers all aspects of engineering, technology, and applied science. It plays a crucial role in the development of technologies to address some of the world's biggest challenges, helping to save lives and improve the way we live.

• Physical sciences: Physical sciences are those academic disciplines that aim to uncover the underlying laws of nature, often written in the language of mathematics. It is a collective term for areas of study including astronomy, chemistry, materials science and physics.

• Earth and environmental sciences: Earth and environmental sciences cover all aspects of Earth and planetary science and broadly encompass solid Earth processes, surface and atmospheric dynamics, Earth system history, climate and climate change, marine and freshwater systems, and ecology. It also considers the interactions between humans and these systems.

• Biological sciences: Biological sciences encompass all the divisions of natural sciences examining various aspects of vital processes. The concept includes anatomy, physiology, cell biology, biochemistry and biophysics, and covers all organisms, from microorganisms to animals and plants.

• Health sciences: The health sciences study health, disease and healthcare. This field of study aims to develop knowledge, interventions and technology for use in healthcare to improve the treatment of patients.