Transformer-based Named Entity Recognition for Clinical Cancer Drug Toxicity by Positive-unlabeled Learning and KL Regularizers
Authors: Weixin Xie, Jiayu Xu, Chengkui Zhao, Jin Li, Shuangze Han, Tianyu Shao, Limei Wang, Weixing Feng
Journal: Current Bioinformatics (Impact Factor 2.4; JCR Q3, Biochemical Research Methods)
Publication date: 2024-02-04
Publication type: Journal Article
DOI: 10.2174/0115748936278299231213045441 (https://doi.org/10.2174/0115748936278299231213045441)
Abstract
Background: With increasing rates of polypharmacy, surveillance of clinical drug toxicity has become an important concern. Named Entity Recognition (NER) is essential for extracting drug safety information from the biomedical literature. Deep learning models, especially pre-trained language models, have recently made significant progress on NER tasks. However, these methods depend on large volumes of high-quality manually annotated data, which are labor-intensive and inefficient to produce.

Objective: To improve NER prediction performance without relying on manually annotated data.

Methods: Instead of relying on manually annotated data, this study proposes a transformer-based Positive-Unlabeled Learning (PULearning) approach with adaptive learning and applies it to a clinical cancer drug toxicity corpus. To improve prediction precision, relative position embeddings are used in the transformer encoder. A composite loss function integrates two Kullback-Leibler (KL) regularizers to align with the PULearning assumptions.

Results: Through adaptive sampling, the approach reaches the expected NER performance using only unlabeled data and named entity dictionaries. The model achieves an overall F1-score of 0.819, with F1-scores of 0.841, 0.801, and 0.815 for DRUG, CANCER, and TOXI entities, respectively.

Conclusion: A comprehensive analysis of the results validates the effectiveness of the approach in comparison with existing PULearning methods on biomedical NER tasks. A visualization of the associations among the three identified entity types is also provided as a reference for querying their interrelationships.
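The abstract mentions a composite loss with KL regularizers for PU learning but does not give its form. As a rough illustration only, the sketch below assumes a non-negative PU risk estimator combined with a single KL term that pulls the mean predicted entity rate on unlabeled tokens toward an assumed class prior. The function names, the `prior` and `kl_weight` parameters, and the choice of regularizer are all hypothetical and should not be read as the authors' implementation.

```python
import math


def bernoulli_kl(p: float, q: float, eps: float = 1e-12) -> float:
    """KL(Bernoulli(p) || Bernoulli(q)), with both parameters clamped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def nll(prob: float, eps: float = 1e-12) -> float:
    """Negative log-likelihood of an event that was assigned probability `prob`."""
    return -math.log(max(prob, eps))


def pu_loss_with_kl(pos_probs, unl_probs, prior, kl_weight=1.0):
    """Illustrative PU loss (assumed form, not taken from the paper).

    pos_probs: predicted P(entity) for tokens matched by the entity dictionary
    unl_probs: predicted P(entity) for unlabeled tokens
    prior:     assumed fraction of entity tokens among unlabeled tokens
    """
    # Risk of treating dictionary-matched tokens as positive.
    risk_pos = sum(nll(p) for p in pos_probs) / len(pos_probs)
    # Risk of treating those same tokens as negative (correction term).
    risk_pos_as_neg = sum(nll(1.0 - p) for p in pos_probs) / len(pos_probs)
    # Risk of treating unlabeled tokens as negative.
    risk_unl_as_neg = sum(nll(1.0 - p) for p in unl_probs) / len(unl_probs)
    # Non-negative correction: the estimated negative-class risk is clamped at zero.
    risk_neg = max(0.0, risk_unl_as_neg - prior * risk_pos_as_neg)
    pu_risk = prior * risk_pos + risk_neg
    # KL regularizer: assumed prior versus mean predicted entity rate on unlabeled tokens.
    mean_unl = sum(unl_probs) / len(unl_probs)
    return pu_risk + kl_weight * bernoulli_kl(prior, mean_unl)


if __name__ == "__main__":
    loss = pu_loss_with_kl([0.9, 0.8], [0.1, 0.2, 0.3], prior=0.4)
    print(f"illustrative loss: {loss:.4f}")
```

In this sketch the KL term vanishes when the model's average prediction on unlabeled tokens matches the assumed prior and grows as the two diverge, which is one common way to encode a class-prior assumption in PU learning.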
About the journal:
Current Bioinformatics aims to publish all the latest and outstanding developments in bioinformatics. Each issue contains a series of timely in-depth reviews and mini-reviews, research papers, and guest-edited thematic issues written by leaders in the field, covering a wide range of topics in the integration of biology with computer and information science.
The journal focuses on advances in computational molecular/structural biology, encompassing areas such as computing in biomedicine and genomics, computational proteomics and systems biology, and metabolic pathway engineering. Developments in these fields have direct implications for key issues related to health care, medicine, genetic disorders, development of agricultural products, renewable energy, environmental protection, etc.