Identifying Key Predictors of Smoking Cessation Success: Text-Based Feature Selection Using a Large Language Model

Thuy T T Le, Jiongxuan Yang, Zimo Zhao, Kaidi Zhang, Wenjun Li, Yan Hu
Citations: 0

What is already known on this topic: Generative AI models have recently been applied to assess their potential for addressing various tobacco-related problems, such as detecting tobacco products in social media videos and facilitating smoking cessation. However, their use in identifying the most important predictors of tobacco use behaviors from survey data remains unexplored.

What this study adds: GPT-4.1 successfully assigned high-quality importance scores to survey variables for predicting 12-month smoking abstinence among current smokers over a two-year period. It did so using only the text descriptions of the survey variables, without access to the actual survey data. Based on these importance scores, GPT-4.1 can help identify the most critical variables for predicting smoking cessation success.

How this study might affect research practice or policy: This study demonstrates GPT-4.1's ability to perform feature selection, paving the way for future exploration of this approach to other tobacco-related problems.

Abstract

Background: The most effective way to reduce mortality and morbidity among current smokers is to quit smoking. Although about half of smokers attempted to quit, only one-tenth succeeded in 2022.

Objective: To identify key predictors of smoking cessation success to inform cessation interventions and increase quitting rates.

Methods: We analyzed data from waves 5 and 6 of the Population Assessment of Tobacco and Health (PATH) study (December 2018 to November 2021). Using OpenAI's GPT-4.1, we identified the top 45 variables from wave 5 that are highly predictive of 12-month smoking abstinence in wave 6, based on descriptions of survey variables. We then validated the predictive power of the GPT-4.1-selected variables by comparing the performance of eXtreme Gradient Boosting (XGBoost) trained on different sets of variables. Finally, we derived insights into the top 10 variables, ranked according to their SHapley Additive exPlanations values.
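The description-only selection step in Methods can be pictured with a small sketch. It assumes the language model has already returned one importance score per variable as a JSON object. The variable names, descriptions, scores, prompt wording, and the helpers `build_prompt` and `select_top_k` are all hypothetical; they are not taken from the study, and no real API is called.

```python
import json

# Hypothetical variable descriptions (illustrative only, not PATH codebook entries).
DESCRIPTIONS = {
    "var_a": "Number of days smoked cigarettes in the past 30 days",
    "var_b": "Minutes from waking to first cigarette",
    "var_c": "Favorite color",
    "var_d": "Concern about health harms of smoking",
}


def build_prompt(descriptions, outcome):
    """Assemble a prompt asking for a 0-100 importance score per variable."""
    lines = [
        f"Rate each survey variable from 0 to 100 for how well it predicts: {outcome}.",
        "Answer with a JSON object mapping variable name to score.",
        "",
    ]
    lines += [f"- {name}: {text}" for name, text in descriptions.items()]
    return "\n".join(lines)


def select_top_k(response_text, known_names, k):
    """Parse the JSON reply and return the k highest-scoring known variables."""
    scores = json.loads(response_text)
    # Ignore any names in the reply that are not in the codebook.
    valid = {name: float(s) for name, s in scores.items() if name in known_names}
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(valid, key=lambda name: (-valid[name], name))
    return ranked[:k]


prompt = build_prompt(DESCRIPTIONS, "12-month smoking abstinence at follow-up")

# Stand-in for the model's reply; a real run would send `prompt` to the model.
fake_reply = json.dumps({"var_a": 92, "var_b": 88, "var_c": 3, "var_d": 70, "var_x": 99})

top = select_top_k(fake_reply, set(DESCRIPTIONS), k=3)
print(top)
```

Only the text of each description reaches the prompt, which mirrors the point that selection happens without access to the survey responses themselves. Filtering out unknown names such as `var_x` is a design choice of this sketch, guarding against a reply that mentions variables outside the codebook.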

Results: The performance of XGBoost trained on all possible wave 5 variables and on the 45 selected variables was almost identical (AUC: 0.749 vs 0.752, respectively). The top 10 variables included past 30-day smoking frequency, minutes from waking up to smoking first cigarette, important people's views on tobacco use, prevalence of tobacco use among close associates, daily electronic nicotine product use, emotional dependence, and health harm concerns.
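For readers unfamiliar with the AUC figures compared in Results, the following is a minimal sketch of how an AUC can be computed from predicted scores and used to compare two models. It uses the pairwise-ranking definition in plain Python rather than any particular library; the labels and scores are made-up toy values, not study data, and `roc_auc` is a hypothetical helper name.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data: 1 = abstinent at follow-up, 0 = still smoking (made-up values).
labels = [1, 0, 1, 0, 0, 1]
scores_all_vars = [0.9, 0.2, 0.6, 0.7, 0.1, 0.8]  # hypothetical model on all variables
scores_selected = [0.8, 0.3, 0.5, 0.6, 0.2, 0.9]  # hypothetical model on a selected subset

auc_all = roc_auc(labels, scores_all_vars)
auc_selected = roc_auc(labels, scores_selected)
print(round(auc_all, 3), round(auc_selected, 3))
```

In this toy case both score sets rank 8 of the 9 positive-negative pairs correctly, so the two AUCs coincide at 8/9. A near-identical AUC on a much smaller variable set is the pattern the abstract reports.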

Conclusion: This study demonstrates the ability of OpenAI's GPT-4.1 to identify the top 45 PATH wave 5 variables associated with 12-month smoking abstinence using only their descriptions. This approach could help researchers design more effective survey questionnaires and improve the efficiency of data collection.
