Identifying Breast Cancer Distant Recurrences from Electronic Health Records Using Machine Learning.

Impact factor: 5.9 · JCR Q1 (Computer Science)

Journal of Healthcare Informatics Research · Pub Date: 2019-01-01 · Epub Date: 2019-04-08 · DOI: 10.1007/s41666-019-00046-3

Zexian Zeng, Liang Yao, Ankita Roy, Xiaoyu Li, Sasa Espino, Susan E Clare, Seema A Khan, Yuan Luo

Pages: 283-299

Citations: 22

Abstract

Accurately identifying distant recurrences of breast cancer from electronic health records (EHR) is important for both clinical care and secondary analysis. Although multiple applications have been developed for computational phenotyping in breast cancer, distant recurrence identification still relies heavily on manual chart review. In this study, we aim to develop a model that identifies distant recurrences in breast cancer using clinical narratives and structured data from the EHR. We applied MetaMap to extract features from clinical narratives and retrieved structured clinical data from the EHR. Using these features, we trained a support vector machine model to identify distant recurrences in breast cancer patients. We trained the model on 1,396 double-annotated subjects and validated it on 599 double-annotated subjects. In addition, we validated the model on a set of 4,904 single-annotated subjects as a generalization test. In the held-out test and the generalization test, we obtained F-measure scores of 0.78 and 0.74 and area under the curve (AUC) scores of 0.95 and 0.93, respectively. To explore the representation learning utility of deep neural networks, we designed multiple convolutional neural networks and multilayer neural networks to identify distant recurrences. On the same held-out test set and generalization test set, we obtained F-measure scores of 0.79 ± 0.02 and 0.74 ± 0.004 and AUC scores of 0.95 ± 0.002 and 0.95 ± 0.01, respectively. Our model can accurately and efficiently identify distant recurrences in breast cancer by combining features extracted from unstructured clinical narratives and structured clinical data.
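
The abstract reports two evaluation metrics, F-measure and AUC. As a minimal sketch of how these are commonly computed for a binary classifier, the following uses only the Python standard library. It is not the authors' evaluation code: the function names `f_measure` and `auc`, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration.

```python
from typing import Sequence


def f_measure(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined)
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and classifier scores, invented for illustration only
y_true = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.4]

# Hypothetical decision threshold of 0.5 turns scores into hard predictions
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(f"F-measure: {f_measure(y_true, y_pred):.3f}")
print(f"AUC:       {auc(y_true, scores):.3f}")
```

The F-measure depends on the chosen threshold, while the AUC depends only on how the scores rank positives against negatives. That is one reason a model can show a high AUC alongside a more modest F-measure, as in the figures reported above.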


Source journal

Journal of Healthcare Informatics Research (Computer Science: Computer Science Applications)

CiteScore: 13.60

Self-citation rate: 1.70%

Publication count

Journal description: Journal of Healthcare Informatics Research serves as a publication venue for innovative technical contributions highlighting analytics, systems, and human factors research in healthcare informatics. The journal is concerned with the application of computer science principles, information science principles, information technology, and communication technology to address problems in healthcare and everyday wellness. It highlights the most cutting-edge technical contributions in computing-oriented healthcare informatics. The journal covers three major tracks: (1) analytics: data analytics, knowledge discovery, and predictive modeling; (2) systems: building healthcare informatics systems (e.g., architecture, framework, design, engineering, and application); (3) human factors: understanding users or context, interface design, health behavior, and user studies of healthcare informatics applications.

Topics include but are not limited to:
· healthcare software architecture, framework, design, and engineering
· electronic health records
· medical data mining
· predictive modeling
· medical information retrieval
· medical natural language processing
· healthcare information systems
· smart health and connected health
· social media analytics
· mobile healthcare
· medical signal processing
· human factors in healthcare
· usability studies in healthcare
· user-interface design for medical devices and healthcare software
· health service delivery
· health games
· security and privacy in healthcare
· medical recommender systems
· healthcare workflow management
· disease profiling and personalized treatment
· visualization of medical data
· intelligent medical devices and sensors
· RFID solutions for healthcare
· healthcare decision analytics and support systems
· epidemiological surveillance systems and intervention modeling
· consumer and clinician health information needs, seeking, sharing, and use
· semantic Web, linked data, and ontology
· collaboration technologies for healthcare
· assistive and adaptive ubiquitous computing technologies
· statistics and quality of medical data
· healthcare delivery in developing countries
· health systems modeling and simulation
· computer-aided diagnosis