Discovering interchangeable words from string databases

2007 2nd International Conference on Digital Information Management. Pub Date: 2007-10-01. DOI: 10.1109/ICDIM.2007.4444195

Marco A. Alvarez, SeungJin Lim


Citations: 1

Abstract



This paper presents a solution to the problem of finding interchangeable words in an input collection of strings. Interchangeable words are words that can be substituted for one another in phrases or free text without changing the meaning of the text. Under restricted conditions, pairs of interchangeable words can be useful for data deduplication, copy detection, and software localization, among other tasks. Calculating the degree of interchangeability involves two steps: accurately computing the semantic similarity between pairs of words, and searching for candidate pairs in the overall search space imposed by the input collection. The solution presented in this paper consists of a search method for candidate pairs that uses the Levenshtein distance algorithm and a novel algorithm, SSA, for calculating the semantic similarity between words. The proposed solution was implemented and tested in a real-world application involving a string message database from a software development company. The system was used to build an ontology with clusters of interchangeable words.
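The abstract does not say how the Levenshtein distance is applied in the candidate search, so the following Python sketch shows only one plausible reading, not the authors' method. It assumes the distance is computed over word tokens rather than characters, and that two messages of equal length at distance 1 yield one candidate word pair. The function names `levenshtein` and `candidate_pairs` and the sample messages are hypothetical. SSA is not reproduced here because the abstract gives no details of it.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of tokens)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def candidate_pairs(strings):
    """Assumed candidate search: word pairs taken from messages that are
    exactly one word substitution apart (token-level distance 1)."""
    tokenized = [s.lower().split() for s in strings]
    pairs = set()
    for s, t in combinations(tokenized, 2):
        # Equal token counts plus distance 1 means exactly one position differs.
        if len(s) == len(t) and levenshtein(s, t) == 1:
            (u, v), = [(x, y) for x, y in zip(s, t) if x != y]
            pairs.add(tuple(sorted((u, v))))
    return pairs


# Hypothetical sample messages, not taken from the paper's database.
messages = [
    "Delete the selected file",
    "Remove the selected file",
    "Save the selected file as",
]
print(candidate_pairs(messages))
```

A surface-level search like this only proposes candidates. A pair such as "delete" and "save" would pass the same filter if the messages had equal length, which is why the abstract pairs the search with a separate semantic similarity step.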
