Hybrid X-Linker: Automated Data Generation and Extreme Multi-label Ranking for Biomedical Entity Linking
Pedro Ruas, Fernando Gallego, Francisco J. Veredas, Francisco M. Couto
arXiv - CS - Digital Libraries, published 2024-07-08
https://doi.org/arxiv-2407.06292
Abstract
State-of-the-art deep learning entity linking methods rely on extensive
human-labelled data, which is costly to acquire. Current datasets are limited
in size, leading to inadequate coverage of biomedical concepts and diminished
performance when applied to new data. In this work, we propose to automatically
generate large-scale training datasets, which makes it possible to apply
approaches originally developed for extreme multi-label ranking to the
biomedical entity linking task. We propose the
hybrid X-Linker pipeline that includes different modules to link disease and
chemical entity mentions to concepts in the MEDIC and the CTD-Chemical
vocabularies, respectively. X-Linker was evaluated on several biomedical
datasets: BC5CDR-Disease, BioRED-Disease, NCBI-Disease, BC5CDR-Chemical,
BioRED-Chemical, and NLM-Chem, achieving top-1 accuracies of 0.8307, 0.7969,
0.8271, 0.9511, 0.9248, and 0.7895, respectively. X-Linker demonstrated
superior performance in three datasets: BC5CDR-Disease, NCBI-Disease, and
BioRED-Chemical. In contrast, SapBERT outperformed X-Linker in the remaining
three datasets. Both models rely only on the mention string for their
operations. The source code of X-Linker and its associated data are publicly
available for performing biomedical entity linking without requiring
pre-labelled entities with identifiers from specific knowledge organization
systems.
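
The top-1 accuracies reported above can be read as the fraction of mentions whose highest-ranked candidate concept matches the gold identifier. The following is a minimal sketch of that metric only, not the X-Linker evaluation code; the function name `top1_accuracy` and the placeholder identifiers (`C1`, `C2`, ...) are assumptions made for illustration and do not come from the paper or its datasets.

```python
from typing import Sequence


def top1_accuracy(
    ranked_candidates: Sequence[Sequence[str]],
    gold_ids: Sequence[str],
) -> float:
    """Fraction of mentions whose first-ranked candidate equals the gold identifier.

    ranked_candidates[i] is the candidate list for mention i, best first.
    A mention with an empty candidate list counts as a miss.
    """
    if len(ranked_candidates) != len(gold_ids):
        raise ValueError("one candidate list is required per gold identifier")
    if not gold_ids:
        return 0.0
    hits = sum(
        1
        for candidates, gold in zip(ranked_candidates, gold_ids)
        if candidates and candidates[0] == gold
    )
    return hits / len(gold_ids)


# Toy data with hypothetical concept identifiers (not taken from any dataset).
predictions = [["C1", "C2"], ["C3"], ["C4", "C5"]]
gold = ["C1", "C3", "C5"]
score = top1_accuracy(predictions, gold)  # two of the three mentions are hits
```

In this toy case the third mention has the correct concept at rank 2, so it does not count toward top-1 accuracy even though a top-k metric with k >= 2 would count it.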