基于电子病历的肺结核临床辅助决策模型

IF 4 3区生物学 Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Biodata Mining Pub Date : 2023-03-16 DOI:10.1186/s13040-023-00328-y

Mengying Wang, Cuixia Lee, Zhenhao Wei, Hong Ji, Yingyun Yang, Cheng Yang

{"title":"基于电子病历的肺结核临床辅助决策模型","authors":"Mengying Wang, Cuixia Lee, Zhenhao Wei, Hong Ji, Yingyun Yang, Cheng Yang","doi":"10.1186/s13040-023-00328-y","DOIUrl":null,"url":null,"abstract":"Background: Tuberculosis is a dangerous infectious disease with the largest number of reported cases in China every year. Preventing missed diagnosis has an important impact on the prevention, treatment, and recovery of tuberculosis. The earliest pulmonary tuberculosis prediction models mainly used traditional image data combined with neural network models. However, a single data source tends to miss important information, such as primary symptoms and laboratory test results, that is available in multi-source data like medical records and tests. In this study, we propose a multi-stream integrated pulmonary tuberculosis diagnosis model based on structured and unstructured multi-source data from electronic health records. With the limited number of lung specialists and the high prevalence of tuberculosis, the application of this auxiliary diagnosis model can make substantial contributions to clinical settings.Methods: The subjects were patients at the respiratory department and infectious cases department of a large comprehensive hospital in China between 2015 to 2020. A total of 95,294 medical records were selected through a quality control process. Each record contains structured and unstructured data. First, numerical expressions of features for structured data were created. Then, feature engineering was performed through decision tree model, random forest, and GBDT. Features were included in the feature exclusion set as per their weights in descending order. When the importance of the set was higher than 0.7, this process was concluded. Finally, the contained features were used for model training. In addition, the unstructured free-text data was segmented at the character level and input into the model after indexing. Tuberculosis prediction was conducted through a multi-stream integration tuberculosis diagnosis model (MSI-PTDM), and the evaluation indices of accuracy, AUC, sensitivity, and specificity were compared against the prediction results of XGBoost, Text-CNN, Random Forest, SVM, and so on.Results: Through a variety of characteristic engineering methods, 20 characteristic factors, such as main complaint hemoptysis, cough, and test erythrocyte sedimentation rate, were selected, and the influencing factors were analyzed using the Chinese diagnostic standard of pulmonary tuberculosis. The area under the curve values for MSI-PTDM, XGBoost, Text-CNN, RF, and SVM were 0.9858, 0.9571, 0.9486, 0.9428, and 0.9429, respectively. The sensitivity, specificity, and accuracy of MSI-PTDM were 93.18%, 96.96%, and 96.96%, respectively. The MSI-PTDM prediction model was installed at a doctor workstation and operated in a real clinic environment for 4 months. A total of 692,949 patients were monitored, including 484 patients with confirmed pulmonary tuberculosis. The model predicted 440 cases of pulmonary tuberculosis. The positive sample recognition rate was 90.91%, the false-positive rate was 9.09%, the negative sample recognition rate was 96.17%, and the false-negative rate was 3.83%.Conclusions: MSI-PTDM can process sparse data, dense data, and unstructured text data concurrently. The model adds a feature domain vector embedding the medical sparse features, and the single-valued sparse vectors are represented by multi-dimensional dense hidden vectors, which not only enhances the feature expression but also alleviates the side effects of sparsity on the model training. However, there may be information loss when features are extracted from text, and adding the processing of original unstructured text makes up for the error within the above process to a certain extent, so that the model can learn data more comprehensively and effectively. In addition, MSI-PTDM also allows interaction between features, considers the combination effect between patient features, adds more complex nonlinear calculation considerations, and improves the learning ability of the model. It has been verified using a test set and via deployment within an actual outpatient environment.","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"16 1","pages":"11"},"PeriodicalIF":4.0000,"publicationDate":"2023-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10022184/pdf/","citationCount":"0","resultStr":"{\"title\":\"Clinical assistant decision-making model of tuberculosis based on electronic health records.\",\"authors\":\"Mengying Wang, Cuixia Lee, Zhenhao Wei, Hong Ji, Yingyun Yang, Cheng Yang\",\"doi\":\"10.1186/s13040-023-00328-y\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Background: Tuberculosis is a dangerous infectious disease with the largest number of reported cases in China every year. Preventing missed diagnosis has an important impact on the prevention, treatment, and recovery of tuberculosis. The earliest pulmonary tuberculosis prediction models mainly used traditional image data combined with neural network models. However, a single data source tends to miss important information, such as primary symptoms and laboratory test results, that is available in multi-source data like medical records and tests. In this study, we propose a multi-stream integrated pulmonary tuberculosis diagnosis model based on structured and unstructured multi-source data from electronic health records. With the limited number of lung specialists and the high prevalence of tuberculosis, the application of this auxiliary diagnosis model can make substantial contributions to clinical settings.Methods: The subjects were patients at the respiratory department and infectious cases department of a large comprehensive hospital in China between 2015 to 2020. A total of 95,294 medical records were selected through a quality control process. Each record contains structured and unstructured data. First, numerical expressions of features for structured data were created. Then, feature engineering was performed through decision tree model, random forest, and GBDT. Features were included in the feature exclusion set as per their weights in descending order. When the importance of the set was higher than 0.7, this process was concluded. Finally, the contained features were used for model training. In addition, the unstructured free-text data was segmented at the character level and input into the model after indexing. Tuberculosis prediction was conducted through a multi-stream integration tuberculosis diagnosis model (MSI-PTDM), and the evaluation indices of accuracy, AUC, sensitivity, and specificity were compared against the prediction results of XGBoost, Text-CNN, Random Forest, SVM, and so on.Results: Through a variety of characteristic engineering methods, 20 characteristic factors, such as main complaint hemoptysis, cough, and test erythrocyte sedimentation rate, were selected, and the influencing factors were analyzed using the Chinese diagnostic standard of pulmonary tuberculosis. The area under the curve values for MSI-PTDM, XGBoost, Text-CNN, RF, and SVM were 0.9858, 0.9571, 0.9486, 0.9428, and 0.9429, respectively. The sensitivity, specificity, and accuracy of MSI-PTDM were 93.18%, 96.96%, and 96.96%, respectively. The MSI-PTDM prediction model was installed at a doctor workstation and operated in a real clinic environment for 4 months. A total of 692,949 patients were monitored, including 484 patients with confirmed pulmonary tuberculosis. The model predicted 440 cases of pulmonary tuberculosis. The positive sample recognition rate was 90.91%, the false-positive rate was 9.09%, the negative sample recognition rate was 96.17%, and the false-negative rate was 3.83%.Conclusions: MSI-PTDM can process sparse data, dense data, and unstructured text data concurrently. The model adds a feature domain vector embedding the medical sparse features, and the single-valued sparse vectors are represented by multi-dimensional dense hidden vectors, which not only enhances the feature expression but also alleviates the side effects of sparsity on the model training. However, there may be information loss when features are extracted from text, and adding the processing of original unstructured text makes up for the error within the above process to a certain extent, so that the model can learn data more comprehensively and effectively. In addition, MSI-PTDM also allows interaction between features, considers the combination effect between patient features, adds more complex nonlinear calculation considerations, and improves the learning ability of the model. It has been verified using a test set and via deployment within an actual outpatient environment.\",\"PeriodicalId\":48947,\"journal\":{\"name\":\"Biodata Mining\",\"volume\":\"16 1\",\"pages\":\"11\"},\"PeriodicalIF\":4.0000,\"publicationDate\":\"2023-03-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10022184/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Biodata Mining\",\"FirstCategoryId\":\"99\",\"ListUrlMain\":\"https://doi.org/10.1186/s13040-023-00328-y\",\"RegionNum\":3,\"RegionCategory\":\"生物学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"MATHEMATICAL & COMPUTATIONAL BIOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Biodata Mining","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1186/s13040-023-00328-y","RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MATHEMATICAL & COMPUTATIONAL BIOLOGY","Score":null,"Total":0}

引用次数: 0

摘要

背景:结核病是中国每年报告病例最多的危险传染病。预防漏诊对结核病的预防、治疗和康复具有重要影响。早期的肺结核预测模型主要采用传统图像数据与神经网络模型相结合的方法。然而，单一数据源往往会遗漏重要信息，如主要症状和实验室测试结果，而这些信息可在医疗记录和测试等多源数据中获得。在本研究中，我们提出了一个基于结构化和非结构化电子病历多源数据的多流集成肺结核诊断模型。由于肺病专科医生的数量有限，结核病的患病率很高，应用这种辅助诊断模型可以为临床环境做出实质性的贡献。方法:选取2015 - 2020年在国内某大型综合性医院呼吸科和感染科就诊的患者为研究对象。通过质量控制程序共选择了95,294份医疗记录。每个记录包含结构化和非结构化数据。首先，建立了结构化数据特征的数值表达式。然后，通过决策树模型、随机森林和GBDT进行特征工程。特征按照权重由高到低的顺序被包含在特征排除集中。当集合的重要性大于0.7时，该过程结束。最后，将包含的特征用于模型训练。此外，将非结构化的自由文本数据在字符级进行分割，并在索引后输入到模型中。通过多流集成结核病诊断模型(MSI-PTDM)进行结核病预测，并将准确率、AUC、灵敏度、特异性等评价指标与XGBoost、Text-CNN、Random Forest、SVM等预测结果进行比较。结果:通过多种特征工程方法，筛选出主要主诉咯血、咳嗽、红细胞沉降试验等20个特征因素，并采用中国肺结核诊断标准对其影响因素进行分析。MSI-PTDM、XGBoost、Text-CNN、RF和SVM的曲线下面积分别为0.9858、0.9571、0.9486、0.9428和0.9429。MSI-PTDM的敏感性为93.18%，特异性为96.96%，准确性为96.96%。MSI-PTDM预测模型安装在医生工作站，在真实的临床环境中运行4个月。总共监测了692,949例患者，其中包括484例确诊肺结核患者。该模型预测了440例肺结核病例。阳性样本识别率为90.91%，假阳性率为9.09%，阴性样本识别率为96.17%，假阴性率为3.83%。结论:MSI-PTDM可以同时处理稀疏数据、密集数据和非结构化文本数据。该模型增加了嵌入医学稀疏特征的特征域向量，将单值稀疏向量用多维密隐向量表示，既增强了特征表达，又减轻了稀疏性对模型训练的副作用。但是，从文本中提取特征时可能存在信息丢失的问题，加入对原始非结构化文本的处理在一定程度上弥补了上述过程中的错误，使模型能够更全面、更有效地学习数据。此外，MSI-PTDM还允许特征之间的交互，考虑了患者特征之间的组合效应，增加了更复杂的非线性计算考虑，提高了模型的学习能力。它已经通过测试集和实际门诊环境的部署进行了验证。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

Clinical assistant decision-making model of tuberculosis based on electronic health records.

查看原文本刊更多论文

Clinical assistant decision-making model of tuberculosis based on electronic health records.

Background: Tuberculosis is a dangerous infectious disease with the largest number of reported cases in China every year. Preventing missed diagnosis has an important impact on the prevention, treatment, and recovery of tuberculosis. The earliest pulmonary tuberculosis prediction models mainly used traditional image data combined with neural network models. However, a single data source tends to miss important information, such as primary symptoms and laboratory test results, that is available in multi-source data like medical records and tests. In this study, we propose a multi-stream integrated pulmonary tuberculosis diagnosis model based on structured and unstructured multi-source data from electronic health records. With the limited number of lung specialists and the high prevalence of tuberculosis, the application of this auxiliary diagnosis model can make substantial contributions to clinical settings.

Methods: The subjects were patients at the respiratory department and infectious cases department of a large comprehensive hospital in China between 2015 to 2020. A total of 95,294 medical records were selected through a quality control process. Each record contains structured and unstructured data. First, numerical expressions of features for structured data were created. Then, feature engineering was performed through decision tree model, random forest, and GBDT. Features were included in the feature exclusion set as per their weights in descending order. When the importance of the set was higher than 0.7, this process was concluded. Finally, the contained features were used for model training. In addition, the unstructured free-text data was segmented at the character level and input into the model after indexing. Tuberculosis prediction was conducted through a multi-stream integration tuberculosis diagnosis model (MSI-PTDM), and the evaluation indices of accuracy, AUC, sensitivity, and specificity were compared against the prediction results of XGBoost, Text-CNN, Random Forest, SVM, and so on.

Results: Through a variety of characteristic engineering methods, 20 characteristic factors, such as main complaint hemoptysis, cough, and test erythrocyte sedimentation rate, were selected, and the influencing factors were analyzed using the Chinese diagnostic standard of pulmonary tuberculosis. The area under the curve values for MSI-PTDM, XGBoost, Text-CNN, RF, and SVM were 0.9858, 0.9571, 0.9486, 0.9428, and 0.9429, respectively. The sensitivity, specificity, and accuracy of MSI-PTDM were 93.18%, 96.96%, and 96.96%, respectively. The MSI-PTDM prediction model was installed at a doctor workstation and operated in a real clinic environment for 4 months. A total of 692,949 patients were monitored, including 484 patients with confirmed pulmonary tuberculosis. The model predicted 440 cases of pulmonary tuberculosis. The positive sample recognition rate was 90.91%, the false-positive rate was 9.09%, the negative sample recognition rate was 96.17%, and the false-negative rate was 3.83%.

Conclusions: MSI-PTDM can process sparse data, dense data, and unstructured text data concurrently. The model adds a feature domain vector embedding the medical sparse features, and the single-valued sparse vectors are represented by multi-dimensional dense hidden vectors, which not only enhances the feature expression but also alleviates the side effects of sparsity on the model training. However, there may be information loss when features are extracted from text, and adding the processing of original unstructured text makes up for the error within the above process to a certain extent, so that the model can learn data more comprehensively and effectively. In addition, MSI-PTDM also allows interaction between features, considers the combination effect between patient features, adds more complex nonlinear calculation considerations, and improves the learning ability of the model. It has been verified using a test set and via deployment within an actual outpatient environment.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Biodata Mining MATHEMATICAL & COMPUTATIONAL BIOLOGY-

CiteScore

7.90

自引率

0.00%

发文量

审稿时长

23 weeks

期刊介绍： BioData Mining is an open access, open peer-reviewed journal encompassing research on all aspects of data mining applied to high-dimensional biological and biomedical data, focusing on computational aspects of knowledge discovery from large-scale genetic, transcriptomic, genomic, proteomic, and metabolomic data. Topical areas include, but are not limited to: -Development, evaluation, and application of novel data mining and machine learning algorithms. -Adaptation, evaluation, and application of traditional data mining and machine learning algorithms. -Open-source software for the application of data mining and machine learning algorithms. -Design, development and integration of databases, software and web services for the storage, management, retrieval, and analysis of data from large scale studies. -Pre-processing, post-processing, modeling, and interpretation of data mining and machine learning results for biological interpretation and knowledge discovery.