Scalability of Piecewise Synonym Identification in Integration of SNOMED into the UMLS
Kuo-Chuan Huang, J. Geller, M. Halper, Gai Elhanan, Y. Perl
International Journal of Computational Models and Algorithms in Medicine, 2(3), 26-45, July-September 2011. DOI: 10.4018/jcmam.2011070103
Citations: 1
Abstract
Synonym identification during source terminology integration into the Unified Medical Language System (UMLS) is a labor-intensive task needed for every new release of the source. The piecewise synonym (PWS) methodology was previously used for the integration of a small source. The goal of this paper is to determine whether (1) the piecewise synonym methodology with two control parameters scales to a much larger terminology (a subset of SNOMED CT), (2) the control parameters are necessary to make the methodology viable, and (3) the control parameters lead to any loss of matching results. Additional methods for limiting the size of the dictionary used in the PWS generation methodology are also applied. The authors’ methodology discovered 41% of concepts not found by string matching. The necessity and effectiveness of the control parameters were confirmed. Furthermore, when comparing the results of experiments with and without control parameters, no matches were lost.
[...] of string matching methods alone is not sufficient to solve this problem. One setting in which the terminology integration task has to be performed on a massive scale is every new release of the Unified Medical Language System (UMLS). The UMLS (Bodenreider, 2004; Humphreys, Lindberg, Schoolman, & Barnett, 1998; Lindberg, Humphreys, & McCray, 1993) is a large terminological database containing biomedical terms from many source terminologies. At every new release of the UMLS (currently biannually), new source terminologies are integrated. Updates or new versions of old source terminologies are also reintegrated into the UMLS.
For example, the Gene Ontology (GO) (Gene Ontology Consortium, 2010), originally integrated into the UMLS in 2004 (Lomax & McCray, 2004), had about 25,000 concepts in UMLS version 2008AA. However, the number of GO concepts increased to more than 48,000 in 2008AB.
The UMLS aims to overcome two problems: the distribution of useful biomedical information among disparate databases and systems, and the variety of ways in which the same concept is expressed in different sources. The UMLS contains terminologies from different medical domains, forming a large terminological repository to solve those two problems (Cimino, 1998; Humphreys et al., 1998). However, because the repository is large, the fact that “the same concept may be expressed in many different ways in different sources” (Humphreys et al., 1998) becomes a difficulty when integrating a new source terminology into the UMLS. It is sometimes difficult to match a term from a new source with the correct concept in the UMLS, even with the help of lexical tools provided by the National Library of Medicine, such as MetaMap and Norm (Cantor et al., 2003). Thus, one major problem during UMLS source integration is identifying the terms and associated concepts from the new source that already exist in the UMLS.
In the UMLS 2008AB version, the terminological repository, called the Metathesaurus (Schuyler, Hole, Tuttle, & Sherertz, 1993; Tuttle et al., 1990), contains 147 source terminologies with more than 2 million concepts and over 9 million terms (U. S. National Library of Medicine, 2010c). Among these sources, SNOMED CT (Systematized Nomenclature of Medicine – Clinical Terms) (IHTSO, 2010) may be considered one of the most important, due to two factors. First, SNOMED CT (July 2008 version) contains roughly 380k concepts and is the largest (English) UMLS source measured in terms. Second, contrary to many other UMLS sources, SNOMED CT has a rich structure based on a formal model, namely a version of Description Logic (Campbell, Das, & Musen, 1994; Spackman, 2001). For a report on the original integration of SNOMED CT into the UMLS, see Fung et al. (2005). Like the UMLS, SNOMED CT is updated twice a year, and changes need to be migrated into new releases of the UMLS.
In this paper, we continue our study of two non-syntactic techniques for finding new synonyms for given multi-word terms (Huang, Geller, Halper, & Cimino, 2007; Huang, Geller, Halper, Perl, & Xu, 2009), namely extraction and substitution, which together define the piecewise synonym (PWS) methodology, but we focus on their scalability. Extraction and substitution are used together with string matching. An informal explanation of extraction and substitution follows; a precise description is given in the Background Section.
In the extraction (preprocessing) stage, new synonyms are generated from existing multi-word UMLS synonyms. The result of this preprocessing stage is a dictionary of synonyms, which we call the Generalized Synonym Dictionary (Huang et al., 2007). For example, in the UMLS the terms “Artificial lens” and “Prosthetic lens” are synonyms. Extraction (preprocessing) eliminates the common word “lens” and postulates that “Artificial” and “Prosthetic” are synonyms. This fact is stored in the Generalized Synonym Dictionary.
In the substitution stage, a term from the new source is decomposed into its constituent words, and all applicable synonyms (from the UMLS and the Generalized Synonym Dictionary) are retrieved for each word individually. New terms are then created by substituting these synonyms for the original single words in the term, while preserving the word order. Note that some original words may remain at their positions during substitution, but not all of them: at least one word must be replaced.
For example, with the synonyms (Artificial, Pros[...]
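The informal description of extraction and substitution can be illustrated with a minimal Python sketch. It is not the authors' implementation: the precise rules are given in the paper's Background Section, which is not part of this excerpt, and the two control parameters are not modeled here. The sketch assumes that extraction only pairs terms of equal length that differ in exactly one word position, that matching is case-insensitive, and that the function names `extract_synonyms` and `substitute` as well as the input term "Artificial heart valve" are hypothetical.

```python
from collections import defaultdict
from itertools import product


def extract_synonyms(synonym_groups):
    """Build a word-level synonym dictionary from groups of synonymous terms.

    Simplifying assumption (not from the paper): two terms contribute a pair
    only if they have the same number of words and differ in exactly one
    position; the two words at that position are postulated to be synonyms.
    """
    dictionary = defaultdict(set)
    for group in synonym_groups:
        tokenized = [term.lower().split() for term in group]
        for i, a in enumerate(tokenized):
            for b in tokenized[i + 1:]:
                if len(a) != len(b):
                    continue
                diff = [(x, y) for x, y in zip(a, b) if x != y]
                if len(diff) == 1:
                    x, y = diff[0]
                    dictionary[x].add(y)
                    dictionary[y].add(x)
    return dictionary


def substitute(term, dictionary):
    """Generate candidate terms by replacing words with synonyms, order preserved."""
    words = term.lower().split()
    # Each position may keep its original word or take any known synonym.
    options = [[w] + sorted(dictionary.get(w, ())) for w in words]
    candidates = []
    for combo in product(*options):
        # Skip the combination in which every original word was kept.
        if list(combo) != words:
            candidates.append(" ".join(combo))
    return candidates


# "Artificial lens" and "Prosthetic lens" are the synonyms from the text.
gsd = extract_synonyms([["Artificial lens", "Prosthetic lens"]])
# Hypothetical new-source term, used only for illustration.
print(substitute("Artificial heart valve", gsd))
```

In this sketch the generated candidates would then be passed to string matching against existing UMLS terms. The number of candidates grows with the product of the per-word synonym counts, which suggests why limiting dictionary size matters at SNOMED CT scale.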