Snoopy: Effective and Efficient Semantic Join Discovery via Proxy Columns

IF 10.4 · Zone 2 (Computer Science) · JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Knowledge and Data Engineering · Pub Date: 2025-02-24 · DOI: 10.1109/TKDE.2025.3545176

Yuxiang Guo, Yuren Mao, Zhonghao Hu, Lu Chen, Yunjun Gao

Volume 37, Issue 5, pages 2971-2985 · https://ieeexplore.ieee.org/document/10902104/

Citations: 0

Abstract

Semantic join discovery, which aims to find columns in a table repository with high semantic joinability to a query column, is crucial for dataset discovery. Existing methods fall into two categories: cell-level methods and column-level methods. However, neither ensures both effectiveness and efficiency. Cell-level methods, which compute joinability by counting cell matches between columns, achieve ideal effectiveness but suffer from poor efficiency. In contrast, column-level methods, which determine joinability only by computing the similarity of column embeddings, are efficient but suffer from poor effectiveness because of three issues in their column embeddings: (i) the semantics-joinability gap, (ii) size limit, and (iii) permutation sensitivity. To address these issues, this paper proposes computing column embeddings via proxy columns; furthermore, a novel column-level semantic join discovery framework, Snoopy, is presented, leveraging proxy-column-based embeddings to bridge effectiveness and efficiency. Specifically, the proposed column embeddings are derived from the implicit column-to-proxy-column relationships, which are captured by a lightweight approximate-graph-matching-based column projection. To acquire good proxy columns for guiding the column projection, we introduce a rank-aware contrastive learning paradigm. Extensive experiments on four real-world datasets demonstrate that Snoopy outperforms SOTA column-level methods by 16% in Recall@25 and 10% in NDCG@25, and achieves superior efficiency: it is at least 5 orders of magnitude faster than cell-level solutions and 3.5× faster than existing column-level methods.
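To make the cell-level versus column-level contrast in the abstract concrete, here is a minimal Python sketch. It is not the Snoopy method and is not taken from the paper. It assumes one common formulation of semantic joinability (the fraction of query cells with at least one candidate cell above a cosine threshold `tau`), and it uses plain mean pooling as a stand-in for a column embedding. The function names, the threshold value, and the toy 2-D cell vectors are all hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cell_level_joinability(query: List[str], candidate: List[str],
                           cell_vec: Dict[str, Vector], tau: float = 0.9) -> float:
    """Assumed definition: share of query cells that have at least one
    candidate cell with cosine similarity >= tau. The nested loop compares
    every cell pair, which is why this style of method scales poorly."""
    if not query:
        return 0.0
    matched = sum(
        1 for q in query
        if any(cosine(cell_vec[q], cell_vec[c]) >= tau for c in candidate)
    )
    return matched / len(query)


def mean_pool(column: List[str], cell_vec: Dict[str, Vector]) -> List[float]:
    """Stand-in column embedding: the average of the cell vectors.
    (Snoopy derives embeddings from proxy columns instead; not shown here.)"""
    vectors = [cell_vec[c] for c in column]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def column_level_score(query: List[str], candidate: List[str],
                       cell_vec: Dict[str, Vector]) -> float:
    """One vector comparison per column pair, regardless of column size."""
    return cosine(mean_pool(query, cell_vec), mean_pool(candidate, cell_vec))


# Hand-made toy embeddings (hypothetical values, for illustration only).
cell_vec = {
    "NYC": (1.0, 0.0),
    "New York": (0.98, 0.2),
    "Paris": (0.0, 1.0),
    "Tokyo": (-1.0, 0.0),
}
query = ["NYC", "Paris"]
candidate = ["New York", "Tokyo"]

print("cell-level joinability:", cell_level_joinability(query, candidate, cell_vec))
print("column-level score:", column_level_score(query, candidate, cell_vec))
```

The cell-level function compares every query cell with every candidate cell, while the column-level function compares one pooled vector per column. The pooled score need not track the cell-level joinability, which is the kind of gap the abstract attributes to existing column embeddings.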


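Source journal

IEEE Transactions on Knowledge and Data Engineering (Engineering and Technology; Engineering: Electrical and Electronic)

CiteScore: 11.70

Self-citation rate: 3.40%

Articles published: 515

Review time: 6 months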

About the journal: The IEEE Transactions on Knowledge and Data Engineering covers knowledge and data engineering aspects of computer science, artificial intelligence, electrical engineering, computer engineering, and related fields. It provides an interdisciplinary platform for disseminating new developments in knowledge and data engineering and explores the practicality of these concepts in both hardware and software. Specific areas covered include knowledge-based and expert systems, AI techniques for knowledge and data management, tools and methodologies, distributed processing, real-time systems, architectures, data management practices, database design, query languages, security, fault tolerance, statistical databases, algorithms, performance evaluation, and applications.