Selective prediction for extracting unstructured clinical data

Akshay Swaminathan, Ivan Lopez, William Wang, Ujwal Srivastava, Edward Tran, A. Bhargava-Shah, Janet Y Wu, Alexander L Ren, Kaitlin Caoili, Brandon Bui, L. Alkhani, Susan Lee, Nathan Mohit, Noel Seo, N. Macedo, Winson Cheng, Charles Liu, Reena Thomas, Jonathan H Chen, O. Gevaert

Journal of the American Medical Informatics Association (JAMIA). Published 2022-11-18. doi: 10.1101/2022.11.15.22282368. Citations: 2.

Abstract
Background: Electronic health records (EHRs) are a large data source for outcomes research, but the majority of EHR data is unstructured (e.g., free text of clinical notes) and not readily amenable to computational methods. Current approaches to handling unstructured data, such as manual abstraction, structured proxy variables, and model-assisted abstraction, are time-consuming, do not scale, and require clinical domain expertise. This paper aims to determine whether selective prediction, which gives a model the option to abstain from generating a prediction, can improve the accuracy and efficiency of unstructured clinical data abstraction.

Methods: We trained selective prediction models to identify the presence of four distinct clinical variables in free-text pathology reports: a primary cancer diagnosis of glioblastoma (GBM, n = 659), resection of rectal adenocarcinoma (RRA, n = 601), and two procedures for resection of rectal adenocarcinoma: abdominoperineal resection (APR, n = 601) and low anterior resection (LAR, n = 601). Data were manually abstracted from pathology reports and used to train L1-regularized logistic regression models on term frequency-inverse document frequency (TF-IDF) features. Data points that the model could not predict with high certainty were manually abstracted.

Findings: All four selective prediction models achieved test-set sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) above 0.91. Selective prediction led to sizable gains in automation, reducing manual chart abstraction by 57% to 95% across the four outcomes. For the GBM classifier, the selective prediction model improved sensitivity (0.94 to 0.96), specificity (0.79 to 0.96), PPV (0.89 to 0.98), and NPV (0.88 to 0.91) compared with a non-selective classifier.

Interpretation: Selective prediction using utility-based probability thresholds can facilitate unstructured data extraction by giving "easy" charts to a model and "hard" charts to human abstractors, increasing efficiency while maintaining or improving accuracy.
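To make the workflow concrete, the following is a minimal sketch of the approach described in the abstract, written with scikit-learn. It is not the authors' code: the toy reports, labels, regularization strength, and probability thresholds are illustrative assumptions, and the paper derives its thresholds from a utility function over misclassification and abstention costs rather than fixing them by hand.

```python
# Minimal sketch of selective prediction for chart abstraction.
# Toy data and threshold values are assumptions for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy pathology-report snippets and binary labels (1 = GBM present).
reports = [
    "final diagnosis: glioblastoma, IDH-wildtype, WHO grade IV",
    "low-grade glioma without features of glioblastoma",
    "glioblastoma with necrosis and microvascular proliferation",
    "meningioma, WHO grade I, no evidence of malignancy",
]
labels = [1, 0, 1, 0]

# TF-IDF features feeding an L1-regularized logistic regression,
# mirroring the model class described in the abstract.
model = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(penalty="l1", solver="liblinear", C=10.0),
)
model.fit(reports, labels)

# Selective prediction: accept the model's call only when the predicted
# probability clears an upper or lower threshold; otherwise abstain and
# route the chart to a human abstractor. The values below are arbitrary;
# the paper chooses them via a utility-based analysis.
LOWER, UPPER = 0.2, 0.8

def selective_predict(texts):
    probs = model.predict_proba(texts)[:, 1]
    decisions = []
    for p in probs:
        if p >= UPPER:
            decisions.append(("positive", p))
        elif p <= LOWER:
            decisions.append(("negative", p))
        else:
            decisions.append(("abstain -> manual review", p))
    return decisions

for text, (decision, p) in zip(reports, selective_predict(reports)):
    print(f"{p:.2f}  {decision}  |  {text[:45]}")
```

In this setup, only charts whose predicted probability falls outside the (LOWER, UPPER) band are automated; everything in between is flagged for manual abstraction, which is what lets the model handle the "easy" charts while humans handle the "hard" ones.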