Identification of Syndrome Types in Patients With Pancreatic Cancer From Free Text in Electronic Medical Records: Model Development and Validation.

Impact factor: 2.0; JCR quartile: Q3 (Health Care Sciences & Services)

JMIR Formative Research. Pub Date: 2025-10-03. DOI: 10.2196/70602

He Ba, Haojie Du, Chienshan Cheng, Yuan Zhang, Linjie Ruan, Zhen Chen

JMIR Formative Research 2025;9:e70602. Publication type: Journal Article.


Abstract

Background: Syndrome differentiation is crucial in traditional Chinese medicine (TCM) diagnosis and treatment, but it heavily relies on expert experience, limiting systematic standardization.

Objective: This study developed and validated a BERT (bidirectional encoder representations from transformers)-based model, the traditional Chinese medicine pancreatic cancer syndrome differentiation bidirectional encoder representations from transformers (TCMPCSD-BERT), using in-house pancreatic cancer medical records, to digitalize expert knowledge and support standardized syndrome differentiation in TCM.

Methods: A retrospective dataset of pancreatic cancer cases (2011-2024) from Fudan University Shanghai Cancer Center was annotated into 4 TCM syndrome types by 2 experts (Cohen κ=0.913). The proposed TCMPCSD-BERT model was compared with conventional models (long short-term memory and text convolutional neural network) embedded in TCM diagnostic tools and with large language models (LLMs; ChatGPT-4o, Kimi, Ernie Bot 4.0 Turbo, and Zhipu Qingyan) under a prompt engineering framework. Performance evaluation on in-house data was supplemented with attention visualizations and integrated gradients analyses for interpretability. The McNemar test assessed classification accuracy differences, while bootstrap 95% CIs quantified statistical uncertainty and stability. The Welch t test (2-tailed) was used to evaluate mean differences between TCMPCSD-BERT and the comparator models.
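
The two resampling-based comparisons named in the Methods can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the abstract does not say which McNemar variant was used (the exact binomial form is assumed here), how many bootstrap resamples were drawn (2000 is an assumption), or how the CIs were formed (the percentile method is assumed). The function names and toy labels are hypothetical.

```python
import math
import random


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on paired predictions.

    Returns (b, c, p_value): b counts cases only model A classifies
    correctly, c counts cases only model B classifies correctly.
    """
    b = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a == t and m != t)
    c = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a != t and m == t)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree on correctness
    k = min(b, c)
    # Under H0 the discordant pairs follow Binomial(n, 0.5).
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


def bootstrap_accuracy_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy (records resampled with replacement)."""
    rng = random.Random(seed)
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    n = len(correct)
    stats = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_boot))
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Toy data (hypothetical): 100 records, 4 syndrome classes coded 0..3.
y_true = [i % 4 for i in range(100)]
pred_a = list(y_true)  # model A: all correct
pred_b = [(t + 1) % 4 if i < 10 else t for i, t in enumerate(y_true)]  # B: 10 errors

b, c, p_value = mcnemar_exact(y_true, pred_a, pred_b)
ci_low, ci_high = bootstrap_accuracy_ci(y_true, pred_b)
print(b, c, p_value)
print(ci_low, ci_high)
```

Only the discordant pairs (records where exactly one model is correct) enter the McNemar statistic, which is why it suits paired comparisons on a shared test set. The Welch t test mentioned alongside it is not sketched here.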

Results: Among 6830 records, case counts were damp-heat syndrome (n=1694), spleen-deficiency syndrome (n=1185), damp-heat with spleen-deficiency syndrome (n=1178), and others (n=2773). On the test set, the McNemar test showed significantly higher accuracy for TCMPCSD-BERT than the 3 baseline models and generally better performance than LLMs. In all comparisons, TCMPCSD-BERT achieved higher mean macroprecision, macrorecall, macro-F₁-score, and accuracy, with nonoverlapping 95% bootstrap CIs and significant Welch t test results (P<.01). The model achieved a macroprecision of 0.935 (95% CI 0.918-0.951), macrorecall of 0.921 (95% CI 0.900-0.942), macro-F₁-score of 0.927 (95% CI 0.908-0.945), and accuracy of 0.919 (95% CI 0.899-0.939). Attention visualizations suggested the model could capture less common TCM term associations, while integrated gradients highlighted high-attribution diagnostic features (eg, "gray-white stool" 0.933 in damp-heat syndrome; "indigestion" 1.204 in spleen-deficiency syndrome). Misclassification analyses indicated challenges in handling overlapping or atypical symptom presentations. Compared with LLMs, web-based platforms, and diagnostic instruments, TCMPCSD-BERT appeared to provide relatively higher accuracy, interpretability, and efficiency in processing long unstructured texts for syndrome differentiation.
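
To make the macro-averaged metrics concrete, the sketch below computes them for a 4-class problem. It assumes the common definition in which precision, recall, and F₁ are computed per class and then averaged without weighting; the abstract does not state the exact formula used. The class counts come from the Results; the function name, label strings, and toy predictions are hypothetical.

```python
def macro_scores(y_true, y_pred, labels):
    """Unweighted mean of per-class precision, recall, and F1, plus accuracy."""
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    k = len(labels)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return {
        "macro_precision": sum(precisions) / k,
        "macro_recall": sum(recalls) / k,
        "macro_f1": sum(f1s) / k,
        "accuracy": accuracy,
    }


# Class counts reported in the Results; shares show the imbalance.
counts = {
    "damp-heat": 1694,
    "spleen-deficiency": 1185,
    "damp-heat with spleen-deficiency": 1178,
    "others": 2773,
}
total = sum(counts.values())  # 6830
shares = {name: n / total for name, n in counts.items()}

# Toy predictions (hypothetical), two records per class.
DH, SD, MIX, OT = counts.keys()
y_true = [DH, DH, SD, SD, MIX, MIX, OT, OT]
y_pred = [DH, DH, SD, MIX, MIX, MIX, OT, DH]
scores = macro_scores(y_true, y_pred, list(counts))
print(total, shares)
print(scores)
```

Because each class contributes equally to a macro average, a minority class such as damp-heat with spleen-deficiency (1178 of 6830 records) weighs as much as the "others" class (2773 records). That makes macro metrics a reasonable complement to accuracy under the data imbalance the authors note.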

Conclusions: The TCMPCSD-BERT model shows potential for automated syndrome differentiation from unstructured clinical texts and broader application in TCM. Based on this study, it appears to improve operability over 4-diagnostic instruments and web-based platforms, and offers greater stability and accuracy than LLMs in specific tasks. However, these findings should be interpreted cautiously, given the subjectivity of syndrome definitions, data imbalance, and reliance on preprocessed, expert-annotated data. Further studies involving larger and more diverse populations are needed to validate its generalizability and support its broader application in real-world settings.
