An open-set semi-supervised multi-task learning framework for context classification in biomedical texts
Difei Tang, Thomas Yu Chow Tam, Haomiao Luo, Cheryl A. Telmer, Natasa Miskov-Zivanov
Journal of Biomedical Informatics, volume 169, Article 104886. Published 2025-07-27.
DOI: 10.1016/j.jbi.2025.104886
URL: https://www.sciencedirect.com/science/article/pii/S1532046425001157
Abstract
Objective
In biomedical research, knowledge about the relationships between entities, including genes, proteins, and drugs, is vital for elucidating complex biological processes and intracellular pathway mechanisms. While natural language processing (NLP) methods have shown great success in biomedical relation extraction (RE), extracted relations often lack contextual information such as cell type, cell line, and intracellular location. Previous studies treated this problem as a post hoc context-relation association task, an approach limited by the absence of a gold-standard corpus and prone to error propagation. To address these challenges, we propose CELESTA (Context Extraction through LEarning with Semi-supervised multi-Task Architecture), an open-set semi-supervised multi-task learning (OSSL-MTL) framework for biomedical context classification.
Methods
We designed a multi-task learning (MTL) architecture that integrates semi-supervised learning (SSL) strategies to leverage unlabeled data containing both in-distribution (ID) and out-of-distribution (OOD) examples. We created a large-scale dataset consisting of five context classification tasks by curating two large Biological Expression Language (BEL) corpora and annotating them with our new entity span annotation method. Additionally, we developed an OOD detector to distinguish between ID and OOD instances within the unlabeled data, and we applied data augmentation with an external database to enrich our dataset.
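To make the idea of filtering unlabeled data before pseudo-labeling concrete, the following is a minimal sketch and not the CELESTA implementation. It assumes a maximum-softmax-probability rule as a stand-in OOD score (the abstract does not describe the actual detector), and the function names, task names, and threshold values are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Decision:
    is_ood: bool                 # True if the example is treated as out-of-distribution
    pseudo_label: Optional[int]  # class index, or None if no pseudo-label is assigned


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one task head's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_pseudo_labels(
    task_logits: Dict[str, List[float]],
    ood_threshold: float = 0.5,         # hypothetical value
    confidence_threshold: float = 0.9,  # hypothetical value
) -> Dict[str, Decision]:
    """Decide, per task head, how one unlabeled example is used.

    Assumption: a low maximum softmax probability marks the example as OOD.
    This is a simple stand-in rule, not the detector from the paper.
    """
    decisions: Dict[str, Decision] = {}
    for task, logits in task_logits.items():
        probs = softmax(logits)
        best = max(range(len(probs)), key=lambda i: probs[i])
        if probs[best] < ood_threshold:
            # No known class fits well: exclude from SSL for this task.
            decisions[task] = Decision(is_ood=True, pseudo_label=None)
        elif probs[best] < confidence_threshold:
            # Looks in-distribution but is too uncertain to pseudo-label.
            decisions[task] = Decision(is_ood=False, pseudo_label=None)
        else:
            decisions[task] = Decision(is_ood=False, pseudo_label=best)
    return decisions


if __name__ == "__main__":
    # Hypothetical logits from three task heads for a single unlabeled sentence.
    example = {
        "location": [5.0, 0.0, 0.0],
        "disease": [0.1, 0.0, 0.0],
        "cell_type": [2.0, 0.0, 0.0],
    }
    for task, decision in select_pseudo_labels(example).items():
        print(task, decision)
```

In this sketch each task head decides independently, so one sentence can contribute a pseudo-label to one task while being excluded as OOD for another.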
Results
Extensive experiments show that our framework significantly improves context classification performance. Our best OSSL-MTL models achieve F1 scores of 77.75% and 82.87% on the location and disease classification tasks, respectively, while the SSL-MTL models without OOD detection perform best for cell line and cell type classification. The OOD detection experiment confirms that the OOD detector effectively identifies unknown categories while maintaining ID accuracy. Qualitative analysis shows improved extraction of implicit contexts compared to baseline models.
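For reference, the F1 score is the harmonic mean of precision and recall. The symbols below (TP, FP, FN for true positives, false positives, and false negatives) are standard notation and not taken from the abstract, which also does not state whether the reported scores are macro- or micro-averaged across classes.

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2PR}{P + R} = \frac{2\,TP}{2\,TP + FP + FN}
```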
Conclusion
Our analysis demonstrates the effectiveness of the CELESTA framework in improving context classification and extracting contextual information with high accuracy. The newly created dataset and code are publicly available on GitHub (https://github.com/pitt-miskov-zivanov-lab/CELESTA).
About the journal:
The Journal of Biomedical Informatics reflects a commitment to high-quality original research papers, reviews, and commentaries in the area of biomedical informatics methodology. Although we publish articles motivated by applications in the biomedical sciences (for example, clinical medicine, health care, population health, and translational bioinformatics), the journal emphasizes reports of new methodologies and techniques that have general applicability and that form the basis for the evolving science of biomedical informatics. Articles on medical devices; evaluations of implemented systems (including clinical trials of information technologies); or papers that provide insight into a biological process, a specific disease, or treatment options would generally be more suitable for publication in other venues. Papers on applications of signal processing and image analysis are often more suitable for biomedical engineering journals or other informatics journals, although we do publish papers that emphasize the information management and knowledge representation/modeling issues that arise in the storage and use of biological signals and images. System descriptions are welcome if they illustrate and substantiate the underlying methodology that is the principal focus of the report and an effort is made to address the generalizability and/or range of application of that methodology. Note also that, given the international nature of JBI, papers that deal with specific languages other than English, or with country-specific health systems or approaches, are acceptable for JBI only if they offer generalizable lessons that are relevant to the broad JBI readership, regardless of their country, language, culture, or health system.