MedNorm: A Corpus and Embeddings for Cross-terminology Medical Concept Normalisation

Proceedings of the Fourth Social Media Mining for Health Applications (#SMM4H) Workshop & Shared Task. Pub Date: 2019. DOI: 10.18653/v1/W19-3204

M. Belousov, W. Dixon, G. Nenadic


Citations: 4

Abstract


MedNorm: A Corpus and Embeddings for Cross-terminology Medical Concept Normalisation

The medical concept normalisation task aims to map textual descriptions to standard terminologies such as SNOMED-CT or MedDRA. Existing publicly available datasets annotated using different terminologies cannot simply be merged and utilised, and therefore become less valuable when developing machine learning-based concept normalisation systems. To address this, we designed a data harmonisation pipeline and engineered a corpus of 27,979 textual descriptions simultaneously mapped to both MedDRA and SNOMED-CT, sourced from five publicly available datasets across the biomedical and social media domains. The pipeline can be used in the future to integrate new datasets into the corpus and could also be applied to related data curation tasks. We also describe a method to merge different terminologies into a single concept graph that preserves their relations, and demonstrate that a representation learning approach based on random walks on a graph can efficiently encode both hierarchical and equivalence relations and capture semantic similarities not only between concepts inside a given terminology but also between concepts from different terminologies. We believe that making a corpus and embeddings for cross-terminology medical concept normalisation available to the research community will contribute to a better understanding of the task.
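The idea of a merged concept graph explored by random walks can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the concept identifiers are hypothetical placeholders rather than real MedDRA or SNOMED-CT codes, the function names are invented, and the uniform, DeepWalk-style walk is only one possible choice, since the abstract does not say which random-walk algorithm the authors used.

```python
import random
from collections import defaultdict

# Hypothetical toy concepts; these identifiers are placeholders, not real codes.
HIERARCHY = [  # (child, parent) relations inside a single terminology
    ("MEDDRA:headache", "MEDDRA:nervous_system_disorder"),
    ("MEDDRA:migraine", "MEDDRA:headache"),
    ("SNOMED:headache", "SNOMED:pain_of_head_region"),
    ("SNOMED:migraine", "SNOMED:headache"),
]
EQUIVALENT = [  # assumed cross-terminology equivalence mappings
    ("MEDDRA:headache", "SNOMED:headache"),
    ("MEDDRA:migraine", "SNOMED:migraine"),
]


def build_graph(hierarchy, equivalent):
    """Merge both relation types into one undirected adjacency map."""
    graph = defaultdict(set)
    for a, b in list(hierarchy) + list(equivalent):
        graph[a].add(b)
        graph[b].add(a)
    # Sorted neighbour lists keep the walks reproducible for a fixed seed.
    return {node: sorted(neighbours) for node, neighbours in graph.items()}


def random_walks(graph, walk_length=5, walks_per_node=10, seed=0):
    """Generate uniform random walks starting from every node."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        for start in sorted(graph):
            walk = [start]
            while len(walk) < walk_length:
                # Step to a uniformly chosen neighbour of the current node.
                walk.append(rng.choice(graph[walk[-1]]))
            walks.append(walk)
    return walks


graph = build_graph(HIERARCHY, EQUIVALENT)
walks = random_walks(graph)
```

Because equivalence edges link the two terminologies, a walk that starts at a MedDRA node can cross into SNOMED-CT nodes. Treating each walk as a "sentence" and feeding the walks to a skip-gram model would then tend to place equivalent and hierarchically close concepts near each other in the embedding space.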
