Medical Concept Representation Learning from Multi-source Data.

IJCAI : proceedings of the conference Pub Date : 2019-07-01 DOI:10.24963/ijcai.2019/680

Tian Bai, Brian L Egleston, Richard Bleicher, Slobodan Vucetic

Volume 2019, pages 4897-4903. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7047512/pdf/nihms-1558151.pdf


Abstract

Representing words as low-dimensional vectors is very useful in many natural language processing tasks. This idea has been extended to the medical domain, where medical codes listed in medical claims are represented as vectors to facilitate exploratory analysis and predictive modeling. However, depending on the type of medical provider, medical claims can use medical codes from different ontologies or from a combination of ontologies, which complicates learning of the representations. To make proper use of such multi-source medical claim data, we propose an approach that represents medical codes from different ontologies in the same vector space. We first modify the Pointwise Mutual Information (PMI) measure of similarity between the codes. We then develop a new negative sampling method for the word2vec model that implicitly factorizes the modified PMI matrix. The new approach was evaluated on the code cross-reference problem, which aims at identifying similar codes across different ontologies. In our experiments, we evaluated cross-referencing between the ICD-9 and CPT medical code ontologies. Our results indicate that vector representations of codes learned by the proposed approach provide superior cross-referencing when compared to several existing approaches.
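
As a rough illustration of the quantities the abstract refers to, the sketch below computes the standard, unmodified PMI between codes that co-occur in the same claim, then picks the highest-PMI code from the other ontology as a naive cross-reference. It is not the modified PMI or the negative sampling method proposed in the paper, neither of which the abstract specifies. The claims, the code labels (`ICD9:dx_a`, `CPT:proc_a`, and so on), and the function names are all hypothetical placeholders.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical toy claims; the code labels are placeholders, not real codes.
claims = [
    {"ICD9:dx_a", "CPT:proc_a"},
    {"ICD9:dx_a", "CPT:proc_a", "CPT:proc_b"},
    {"ICD9:dx_b", "CPT:proc_c"},
    {"ICD9:dx_b", "CPT:proc_c", "ICD9:dx_c"},
    {"ICD9:dx_c", "CPT:proc_b"},
]


def pmi_table(claims):
    """Standard PMI, log(p(a, b) / (p(a) * p(b))), with claim-level counts."""
    n = len(claims)
    single = Counter(code for claim in claims for code in claim)
    pair = Counter(
        p for claim in claims for p in combinations(sorted(claim), 2)
    )
    return {
        (a, b): math.log(count * n / (single[a] * single[b]))
        for (a, b), count in pair.items()
    }


def best_cross_reference(code, table):
    """Return the co-occurring code from a different ontology with highest PMI."""
    ontology = code.split(":")[0]
    candidates = {}
    for (a, b), score in table.items():
        if code in (a, b):
            other = b if a == code else a
            if other.split(":")[0] != ontology:
                candidates[other] = score
    return max(candidates, key=candidates.get) if candidates else None


table = pmi_table(claims)
print(best_cross_reference("ICD9:dx_a", table))
```

Raw PMI only scores pairs that were observed together in at least one claim. Learning vectors for all codes in one shared space, as the abstract describes, is what allows similarity to be estimated between codes from different ontologies that rarely or never appear in the same claim.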


