Top-k String Similarity Joins

32nd International Conference on Scientific and Statistical Database Management. Pub Date: 2020-07-07. DOI: 10.1145/3400903.3400922

Shuyao Qi, Panagiotis Bouros, N. Mamoulis



Abstract


Top-k joins have been extensively studied in relational databases as ranking operations where every object has, among others, at least one ranking attribute. However, the focus has mostly been on the case where the join attributes are of primitive data types (e.g., numerical values) and the join predicate is equality. In this work, we consider string objects that are assigned such ranking attributes, or simply scores. Given two collections of string objects and a string similarity measure (e.g., edit distance), we introduce the top-k string similarity join, which returns the k pairs of objects that are sufficiently similar with respect to a similarity threshold ϵ and have the highest combined score, computed by a monotone aggregate function γ (e.g., SUM). Such a join operation finds application in data integration, data cleaning, and de-duplication scenarios, and in emerging scientific fields such as bioinformatics. We investigate how existing top-k join methods can be adapted and optimized for this join, taking into account the semantics and the special characteristics of string similarity joins. We present techniques that avoid computing the entire string join, and indexing that enables pruning candidates with respect to both the string join and the ranking component of the query. Our extensive experimental analysis demonstrates the efficiency of our methodology by comparing solutions that prioritize either the ranking or the join component with solutions that handle both components of the query at the same time.
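To make the query semantics concrete, the following is a minimal brute-force sketch of the join as the abstract defines it. It is not the method proposed in the paper: it computes the entire string join, which is exactly what the paper's techniques aim to avoid. The function names, the `(string, score)` input layout, and the default choice of SUM for γ are assumptions made for illustration only.

```python
import heapq
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def topk_string_similarity_join(R, S, k, eps, gamma=lambda x, y: x + y):
    """Naive baseline (hypothetical helper, not the paper's algorithm).

    R and S are lists of (string, score) pairs. Returns up to k tuples
    (combined_score, r, s) with edit_distance(r, s) <= eps, ordered by
    descending combined score; gamma is assumed to be monotone.
    """
    qualifying = (
        (gamma(r_score, s_score), r, s)
        for (r, r_score), (s, s_score) in product(R, S)
        if edit_distance(r, s) <= eps
    )
    return heapq.nlargest(k, qualifying)


if __name__ == "__main__":
    R = [("smith", 9), ("jones", 5), ("brown", 7)]
    S = [("smyth", 8), ("jonas", 6), ("green", 10)]
    print(topk_string_similarity_join(R, S, k=2, eps=1))
```

The baseline evaluates the similarity predicate on every pair before ranking, so its cost grows with |R| × |S| edit-distance computations. Ties in the combined score are broken by string order here, which is an arbitrary choice of this sketch.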
