Discriminative Optimization of String Similarity and Its Application to Biomedical Abbreviation Clustering

2011 10th International Conference on Machine Learning and Applications and Workshops, Pub Date: 2011-12-18, DOI: 10.1109/ICMLA.2011.58

Atsuko Yamaguchi, Yasunori Yamamoto, Jin-Dong Kim, T. Takagi, A. Yonezawa



Abstract

Many string similarity measures have been developed to deal with the variety of expressions in natural language texts. Given the abundance of such measures, the choice of a measure and its parameters should be considered carefully to maximize performance on a given task. During a preliminary experiment to find the best measure and parameters for clustering terms to improve our life science abbreviation dictionary, we found that the character sequences of chemical names have different characteristics from those of other terms. Based on this observation, we experimented with four string similarity measures to test the hypothesis that "chemical names have a different morphology, and thus their similarity should be computed differently from that of other terms." The experimental results show that edit distance is the best measure for chemical names, and that the discriminative application of string similarity methods to chemical and non-chemical names may be a simple but effective way to improve the performance of term clustering.
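To make the idea of discriminative application concrete, the following is a minimal sketch and not the authors' implementation. Edit distance is the only measure the abstract names; the character-bigram Dice coefficient used for non-chemical terms, the `is_chemical` predicate, the normalization to [0, 1], and the rule that a mixed pair falls back to the non-chemical measure are all assumptions made for illustration.

```python
from typing import Callable


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def bigram_dice(a: str, b: str) -> float:
    """Dice coefficient over character-bigram sets (assumed alternative measure)."""
    grams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    grams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    if not grams_a and not grams_b:
        # Strings shorter than two characters have no bigrams.
        return 1.0 if a == b else 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def discriminative_similarity(
    a: str,
    b: str,
    is_chemical: Callable[[str], bool],  # hypothetical classifier supplied by the caller
) -> float:
    """Use edit similarity when both terms are chemical names, else bigram Dice."""
    if is_chemical(a) and is_chemical(b):
        return edit_similarity(a, b)
    # Assumption: any pair with a non-chemical term uses the non-chemical measure.
    return bigram_dice(a, b)
```

A clustering step would call `discriminative_similarity` for each pair of candidate terms and group those whose score exceeds a threshold tuned separately for each measure. How chemical names are detected is left to the caller here, since the abstract does not describe it.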
