LKAN: LLM-Based Knowledge-Aware Attention Network for Clinical Staging of Liver Cancer

IF 6.8 | Zone 2, Medicine | Q1 | COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Journal of Biomedical and Health Informatics Pub Date : 2024-10-11 DOI:10.1109/JBHI.2024.3478809

Ya Li;Xuecong Zheng;Jiaping Li;Qingyun Dai;Chang-Dong Wang;Min Chen

IEEE Journal of Biomedical and Health Informatics, vol. 29, no. 4, pp. 3007-3020. Journal Article. https://ieeexplore.ieee.org/document/10713996/

Citations: 0

Abstract

Clinical staging of liver cancer (CSoLC), an important indicator for evaluating primary liver cancer (PLC), is key in the diagnosis, treatment, and rehabilitation of liver cancer. In China, CSoLC currently follows the China liver cancer (CNLC) staging, which clinicians usually assess from radiology reports. Inferring clinical information from unstructured radiology reports can therefore provide auxiliary decision support for clinicians. The key to this challenging task is guiding the model to attend to staging-related words or sentences, and three issues arise: 1) Imbalanced categories: early- and mid-stage liver cancer symptoms are subtle, so more data is available for the end stage. 2) Domain sensitivity of liver cancer data: the liver cancer dataset contains substantial domain knowledge, leading to out-of-vocabulary issues and reduced classification accuracy. 3) Free-text and lengthy reports: radiology reports sparsely describe various lesions using domain-specific terms, making it hard to mine staging-related information. To address these, this article proposes a large language model (LLM)-based Knowledge-aware Attention Network (LKAN) for CSoLC. First, to maintain semantic consistency, an LLM and a rule-based algorithm are integrated to generate more diverse and reasonable data. Second, pre-training is performed on an unlabeled radiology corpus to introduce domain knowledge for subsequent representation learning. Third, attention is improved by incorporating both global and local features to guide the model's focus on staging-relevant information. Compared with the baseline models, LKAN achieves the best results, with 90.3% Accuracy, 90.0% Macro_F1 score, and 90.0% Macro_Recall. Code is available at https://github.com/xczhh/Supplemental-Material.
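The abstract reports Accuracy, Macro_F1, and Macro_Recall. Macro averaging gives every stage equal weight regardless of how many reports it has, which matters when categories are imbalanced. The following is a minimal sketch of how these three metrics are conventionally computed; it is not the authors' evaluation code. The function name `macro_scores` and the toy stage labels "I", "II", "III" are assumptions made for illustration only.

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Return (accuracy, macro_recall, macro_f1) for two equal-length label lists."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    # Per-class counts of true positives, false positives, and false negatives.
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but the true class was different
            fn[t] += 1  # true class t was missed

    labels = sorted(set(y_true) | set(y_pred))
    recalls, f1s = [], []
    for c in labels:
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        recalls.append(recall)
        f1s.append(f1)

    accuracy = sum(tp.values()) / len(y_true)
    # Macro averaging: unweighted mean over classes, so rare stages count equally.
    return accuracy, sum(recalls) / len(labels), sum(f1s) / len(labels)


# Hypothetical toy labels, not data from the paper.
y_true = ["I", "I", "II", "III", "III", "III"]
y_pred = ["I", "II", "II", "III", "III", "I"]
accuracy, macro_recall, macro_f1 = macro_scores(y_true, y_pred)
print(f"accuracy={accuracy:.3f} macro_recall={macro_recall:.3f} macro_f1={macro_f1:.3f}")
```

Because each class contributes equally to the macro averages, a model that performs well only on the most frequent stage scores high on accuracy but low on Macro_Recall and Macro_F1.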


Source journal

IEEE Journal of Biomedical and Health Informatics (categories: COMPUTER SCIENCE, INFORMATION SYSTEMS; COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS)

CiteScore: 13.60

Self-citation rate: 6.50%

Publications: 1151

About the journal: IEEE Journal of Biomedical and Health Informatics publishes original papers presenting recent advances where information and communication technologies intersect with health, healthcare, life sciences, and biomedicine. Topics include acquisition, transmission, storage, retrieval, management, and analysis of biomedical and health information. The journal covers applications of information technologies in healthcare, patient monitoring, preventive care, early disease diagnosis, therapy discovery, and personalized treatment protocols. It explores electronic medical and health records, clinical information systems, decision support systems, medical and biological imaging informatics, wearable systems, body area/sensor networks, and more. Integration-related topics like interoperability, evidence-based medicine, and secure patient data are also addressed.