Approximate matching of textual domain attributes for information source integration

Information Quality in Information Systems. Pub Date: 2005-06-17. DOI: 10.1145/1077501.1077516

A. Koeller, Vinay Keelara


Citations: 3

Abstract


A key problem in the integration of information sources is the identification of related attributes or objects across independent sources. Inferring such meta-information from source data (rather than a priori available meta-data, such as attribute names) is sometimes possible. For example, existing algorithms attempt to integrate information sources by finding patterns such as Inclusion Dependencies (INDs) across them. However, INDs are based on exact set inclusion and are thus very strict patterns that rarely hold across independent real-world databases.

We propose two error-tolerant measures, termed Similarity Score and Distribution Score, that help identify related attributes across two independent databases, based on similarities in their data. Those measures specifically address the problem of identifying semantic relationships between textual attributes of databases that have few or no equal values.

We also present implementations of those measures and some experimental results.
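To make the strictness of exact set inclusion concrete, the following minimal sketch contrasts an exact IND check with a simple error-tolerant overlap on textual values. It is not the Similarity Score or Distribution Score proposed in the paper, whose definitions are not given in the abstract; the function names, the 0.8 threshold, the use of Python's `difflib`, and the sample data are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def holds_ind(a_values, b_values):
    """Exact inclusion dependency A ⊆ B: every distinct value of A occurs in B."""
    return set(a_values) <= set(b_values)


def approx_inclusion(a_values, b_values, threshold=0.8):
    """Fraction of distinct A values that have a near-equal value in B.

    Hypothetical stand-in for an error-tolerant measure; NOT the paper's
    Similarity Score. Two strings count as near-equal when their
    case-insensitive difflib similarity ratio reaches `threshold`.
    """
    a_set, b_set = set(a_values), set(b_values)
    if not a_set:
        return 1.0  # an empty attribute is trivially included
    matched = sum(
        1
        for a in a_set
        if any(
            SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold
            for b in b_set
        )
    )
    return matched / len(a_set)


# Made-up sample data: the same cities, spelled slightly differently.
cities_a = ["New York", "Los Angeles", "Chicago"]
cities_b = ["new york", "Los Angeles", "Chicgo", "Houston"]

exact = holds_ind(cities_a, cities_b)            # fails on case and typo differences
tolerant = approx_inclusion(cities_a, cities_b)  # tolerates them
```

On this sample the exact check fails because of a case difference and a typo, while the tolerant measure still treats the two attributes as related. The pairwise comparison is quadratic in the number of distinct values, so it only suits small illustrations.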
