Knowledge-Based Biomedical Word Sense Disambiguation with Neural Concept Embeddings

Proceedings. IEEE International Symposium on Bioinformatics and Bioengineering. Pub Date: 2017-10-01. Epub Date: 2018-01-11. DOI: 10.1109/BIBE.2017.00-61

Akm Sabbir, Antonio Jimeno-Yepes, Ramakanth Kavuluru

Volume 2017, pp. 163-170. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5792196/pdf/nihms919324.pdf


Abstract

Biomedical word sense disambiguation (WSD) is an important intermediate task in many natural language processing applications such as named entity recognition, syntactic parsing, and relation extraction. In this paper, we employ knowledge-based approaches that also exploit recent advances in neural word/concept embeddings to improve over the state of the art in biomedical WSD, using the public MSH WSD dataset [1] as the test set. Our methods involve weak supervision: we do not use any hand-labeled examples for WSD to build our prediction models; however, we employ an existing concept mapping program, MetaMap, to obtain our concept vectors. Over the MSH WSD dataset, our linear-time method (linear in the number of senses and the number of words in the test instance) achieves an accuracy of 92.24%, which is a 3% improvement over the best known results [2] obtained via unsupervised means. A more expensive approach that we developed relies on a nearest neighbor framework and achieves an accuracy of 94.34%, essentially cutting the error rate in half. Employing dense vector representations learned from unlabeled free text has recently been shown to benefit many language processing tasks, and our efforts show that biomedical WSD is no exception to this trend. For a complex and rapidly evolving domain such as biomedicine, building labeled datasets for larger sets of ambiguous terms may be impractical. Here, we show that weak supervision that leverages recent advances in representation learning can rival supervised approaches in biomedical WSD. However, external knowledge bases (here, sense inventories) play a key role in the improvements achieved.
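The abstract does not spell out how the linear-time method scores senses, so the following is only a minimal sketch of one common way such a method could work: average the word vectors of the test instance into a context vector, then pick the candidate sense whose concept vector has the highest cosine similarity to it. The function names (`cosine`, `disambiguate`), the sense identifiers, and the 3-dimensional toy vectors are all hypothetical and chosen for illustration; they are not taken from the paper, MetaMap, or the MSH WSD dataset.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom > 0 else 0.0


def disambiguate(context_words, candidate_senses, word_vecs, concept_vecs):
    """Pick the candidate sense closest to the averaged context vector.

    Work is proportional to (number of context words) + (number of senses),
    which is one reading of "linear time" in the abstract (an assumption).
    """
    # Look up each context word once; words without a vector are skipped.
    vecs = [word_vecs[w] for w in context_words if w in word_vecs]
    if not vecs:
        return None  # nothing to compare against
    context = np.mean(vecs, axis=0)
    # Score each candidate sense once against the context vector.
    return max(candidate_senses, key=lambda s: cosine(context, concept_vecs[s]))


# Hypothetical toy embeddings (made up for this example only).
word_vecs = {
    "patient": np.array([0.9, 0.1, 0.0]),
    "virus": np.array([1.0, 0.0, 0.1]),
    "winter": np.array([0.0, 1.0, 0.0]),
    "temperature": np.array([0.1, 0.9, 0.1]),
}
concept_vecs = {
    "SENSE_COMMON_COLD": np.array([1.0, 0.1, 0.0]),
    "SENSE_LOW_TEMPERATURE": np.array([0.0, 1.0, 0.1]),
}

senses = list(concept_vecs)
print(disambiguate(["patient", "virus"], senses, word_vecs, concept_vecs))
print(disambiguate(["winter", "temperature"], senses, word_vecs, concept_vecs))
```

In this sketch no hand-labeled WSD examples are needed at prediction time: the only inputs are pretrained word vectors, concept vectors, and a sense inventory listing the candidate senses for the ambiguous term, which mirrors the weak-supervision setting described above.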


