Yiming Xue, Yunzheng Zhu, Luoting Zhuang, YongKyung Oh, Ricky Taira, Denise R Aberle, Ashley E Prosper, William Hsu, Yannan Lin

medRxiv : the preprint server for health sciences. Published 2025-06-20. DOI: 10.1101/2025.06.18.25329870
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12204286/pdf/
SmokeBERT: A BERT-based Model for Quantitative Smoking History Extraction from Clinical Narratives to Improve Lung Cancer Screening.
Tobacco use is a critical risk factor for diseases such as cancer and cardiovascular disorders. While electronic health records can capture categorical smoking statuses accurately, granular quantitative details, such as pack years and years since quitting, are often embedded in clinical narratives. This information is crucial for assessing disease risk and determining eligibility for lung cancer screening (LCS). Existing natural language processing (NLP) tools excel at identifying smoking statuses but struggle to extract detailed quantitative data. To address this, we developed SmokeBERT, a fine-tuned BERT-based model optimized for extracting detailed smoking histories. Evaluations against a state-of-the-art rule-based NLP model demonstrated its superior performance on F1 scores (0.97 vs. 0.88 on the hold-out test set) and identification of LCS-eligible patients (e.g., 98% vs. 60% for ≥20 pack years). Future work includes creating a multilingual, language-agnostic version of SmokeBERT by incorporating datasets in multiple languages, exploring ensemble methods, and testing on larger datasets.
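The ≥20 pack-year threshold cited above is derived from the quantities the model extracts (packs per day and years smoked). As a minimal illustrative sketch, not the authors' code, the arithmetic behind that eligibility check could look like the following; the function names are hypothetical, and full LCS eligibility also involves age and years-since-quitting criteria not modeled here:

```python
def pack_years(packs_per_day: float, years_smoked: float) -> float:
    """Pack years = average packs smoked per day x years smoked."""
    return packs_per_day * years_smoked

def meets_pack_year_threshold(packs_per_day: float, years_smoked: float,
                              threshold: float = 20.0) -> bool:
    """Check only the >=20 pack-year criterion referenced in the abstract."""
    return pack_years(packs_per_day, years_smoked) >= threshold

# e.g., 1 pack/day for 25 years -> 25 pack years, which meets the threshold
```

Extracting `packs_per_day` and `years_smoked` reliably from free-text narratives is exactly the step the paper targets with SmokeBERT; the rule-based baseline's 60% identification rate for ≥20 pack years suggests these inputs are often missed when parsed with patterns alone.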