Natural language processing and machine learning to enable automatic extraction and classification of patients' smoking status from electronic medical records.

Impact factor: 1.5 · Zone 4 (Medicine) · JCR Q2 (Medicine, General & Internal)

Upsala Journal of Medical Sciences. Pub Date: 2020-11-01. Epub Date: 2020-07-22. DOI: 10.1080/03009734.2020.1792010

Andrea Caccamisi, Leif Jørgensen, Hercules Dalianis, Mats Rosenlund

Volume 125, issue 4, pages 316-324. Publication type: Journal Article.

Citations: 16

Abstract

Background: The electronic medical record (EMR) offers unique possibilities for clinical research, but some important patient attributes are not readily available due to its unstructured properties. We applied text mining using machine learning to enable automatic classification of unstructured information on smoking status from Swedish EMR data.

Methods: Data on patients' smoking status from EMRs were used to develop 32 predictive models, trained in Weka on a database of 85,000 classified sentences while varying sentence frequency, classifier type, tokenization, and attribute selection. The models were evaluated by F-score and accuracy on out-of-sample test data comprising 8,500 sentences. An error weight matrix was used to select the best model: a weight was assigned to each type of misclassification and applied to each model's confusion matrix. The best-performing model was then compared with a rule-based method.
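
The evaluation and selection step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the three-class layout, the weight values, the two confusion matrices, and all function names are hypothetical, and the support-weighted average used for the F-score is an assumption, since the abstract does not state how per-class scores were combined.

```python
from typing import Dict, List

Matrix = List[List[float]]  # rows = true class, columns = predicted class


def accuracy(confusion: Matrix) -> float:
    """Fraction of sentences on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def weighted_f_score(confusion: Matrix) -> float:
    """Per-class F1 averaged by class support (assumed averaging scheme)."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    score = 0.0
    for i in range(k):
        tp = confusion[i][i]
        support = sum(confusion[i])                       # TP + FN
        predicted = sum(confusion[r][i] for r in range(k))  # TP + FP
        f1 = 2 * tp / (support + predicted) if support + predicted else 0.0
        score += f1 * support / total
    return score


def weighted_error(confusion: Matrix, weights: Matrix) -> float:
    """Element-wise product of confusion counts and misclassification weights, summed."""
    if len(confusion) != len(weights) or any(
        len(c) != len(w) for c, w in zip(confusion, weights)
    ):
        raise ValueError("confusion and weight matrices must have the same shape")
    return sum(
        c * w
        for c_row, w_row in zip(confusion, weights)
        for c, w in zip(c_row, w_row)
    )


def select_best_model(confusions: Dict[str, Matrix], weights: Matrix) -> str:
    """Name of the model with the lowest weighted error."""
    return min(confusions, key=lambda name: weighted_error(confusions[name], weights))


# Hypothetical weights: zero on the diagonal, larger for more harmful confusions.
WEIGHTS: Matrix = [
    [0, 1, 3],
    [1, 0, 2],
    [3, 2, 0],
]

# Hypothetical confusion matrices for two candidate models (100 sentences per class).
CONFUSIONS: Dict[str, Matrix] = {
    "model_a": [[92, 2, 6], [3, 94, 3], [6, 4, 90]],
    "model_b": [[88, 11, 1], [6, 92, 2], [1, 5, 94]],
}

if __name__ == "__main__":
    for name, cm in CONFUSIONS.items():
        print(name, accuracy(cm), weighted_f_score(cm), weighted_error(cm, WEIGHTS))
    print("selected:", select_best_model(CONFUSIONS, WEIGHTS))
```

In this made-up data, `model_a` has slightly higher accuracy, but `model_b` makes fewer heavily weighted mistakes and is therefore selected. That is the kind of trade-off an error weight matrix is meant to expose.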

Results: The best-performing model was based on the Support Vector Machine (SVM) Sequential Minimal Optimization (SMO) classifier using a combination of unigrams and bigrams as tokens. Varying sentence frequency and attribute selection did not improve model performance. SMO achieved 98.14% accuracy and an F-score of 0.981, versus 79.32% and 0.756 for the rule-based model.
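
To make the tokenization concrete, the sketch below builds combined unigram and bigram features from a sentence. It is an illustration under stated assumptions rather than the study's pipeline: the study used Weka, whereas this uses plain Python; the function names, the lowercasing, the regular-expression word splitting, and the English example sentence are all hypothetical.

```python
import re
from collections import Counter
from typing import List


def ngram_tokens(sentence: str, n_min: int = 1, n_max: int = 2) -> List[str]:
    """Return word n-grams for n_min <= n <= n_max, ordered by n, then by position."""
    # \w is Unicode-aware for str patterns, so letters such as å, ä, ö are kept.
    words = re.findall(r"\w+", sentence.lower())
    tokens: List[str] = []
    for n in range(n_min, n_max + 1):
        for i in range(len(words) - n + 1):
            tokens.append(" ".join(words[i:i + n]))
    return tokens


def bag_of_ngrams(sentence: str) -> Counter:
    """Count each unigram and bigram; such counts can feed a classifier like an SVM."""
    return Counter(ngram_tokens(sentence))


if __name__ == "__main__":
    # Hypothetical example sentence; the study's data were Swedish clinical text.
    print(ngram_tokens("Patient does not smoke"))
```

For the example sentence, the unigram `smoke` alone does not distinguish a smoker from a non-smoker, while the bigram `not smoke` carries the negation. This is one plausible reason a unigram-plus-bigram representation can outperform unigrams alone.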

Conclusion: A model using machine-learning algorithms to automatically classify patients' smoking status was successfully developed. Such algorithms may enable automatic assessment of smoking status and other unstructured data directly from EMRs without manual classification of complete case notes.


Source journal

Upsala Journal of Medical Sciences (Medicine: General & Internal)

CiteScore: 5.60

Self-citation rate: 0.00%

Review time: 6-12 weeks

Journal description: Upsala Journal of Medical Sciences is published for the Upsala Medical Society. It has been published since 1865 and is one of the oldest medical journals in Sweden. The journal publishes clinical and experimental original works in the medical field. Although focusing on regional issues, the journal always welcomes contributions from outside Sweden. Specially extended issues are published occasionally, dealing with special topics, congress proceedings and academic dissertations.