Unicorn: A Unified Multi-Tasking Matching Model

ACM SIGMOD Record Pub Date : 2024-05-14 DOI:10.1145/3665252.3665263

Ju Fan, Jianhong Tu, Guoliang Li, Peng Wang, Xiaoyong Du, Xiaofeng Jia, Song Gao, Nan Tang



Abstract

Data matching, which decides whether two data elements (e.g., string, tuple, column, or knowledge graph entity) are the "same" (a.k.a. a match), is a key concept in data integration. The prevailing practice is to build task-specific or even dataset-specific solutions, which are hard to generalize and forgo the opportunity to share knowledge learned across datasets and tasks. In this paper, we propose Unicorn, a unified model that supports common data matching tasks. Building such a unified model is challenging because input data elements come in heterogeneous formats and different tasks have different matching semantics. To address these challenges, Unicorn employs one generic Encoder that converts any pair of data elements (a, b) into a learned representation, and uses a Matcher, a binary classifier, to decide whether a matches b. To align the matching semantics of multiple tasks, Unicorn adopts a mixture-of-experts model that enhances the learned representation. We conduct extensive experiments using 20 datasets on 7 well-studied data matching tasks, and find that our unified model achieves better performance on most tasks and on average, compared with state-of-the-art specific models trained separately for individual tasks and datasets. Moreover, Unicorn can also serve new matching tasks well through zero-shot learning.
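The Encoder, mixture-of-experts, and Matcher pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the hashed bag-of-tokens encoder, the dimensions, the softmax gate, the tanh experts, and every name below (`serialize`, `encode_pair`, `ToyUnifiedMatcher`, `DIM`, `N_EXPERTS`) are assumptions chosen for illustration, and the weights are random rather than trained, so the resulting probability carries no meaning beyond showing the data flow.

```python
import math
import random
import zlib

DIM = 16        # size of the toy pair representation (assumption)
N_EXPERTS = 3   # number of experts in the toy mixture (assumption)


def serialize(element):
    """Flatten a string, tuple/list, or dict into one text sequence."""
    if isinstance(element, dict):
        return " ".join(f"{k} {v}" for k, v in element.items())
    if isinstance(element, (list, tuple)):
        return " ".join(str(v) for v in element)
    return str(element)


def encode_pair(a, b):
    """Stand-in Encoder: L2-normalized hashed bag-of-tokens for the pair (a, b)."""
    vec = [0.0] * DIM
    tokens = serialize(a).lower().split() + ["[SEP]"] + serialize(b).lower().split()
    for tok in tokens:
        # crc32 is deterministic across runs, unlike the built-in hash() on strings.
        vec[zlib.crc32(tok.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class ToyUnifiedMatcher:
    """Encoder -> mixture-of-experts -> binary Matcher, with untrained weights."""

    def __init__(self, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.gate = rand_matrix(N_EXPERTS, DIM)
        self.experts = [rand_matrix(DIM, DIM) for _ in range(N_EXPERTS)]
        self.matcher_w = rand_matrix(1, DIM)[0]
        self.matcher_b = 0.0

    def gate_weights(self, rep):
        """Softmax gate: one non-negative weight per expert, summing to 1."""
        return softmax(matvec(self.gate, rep))

    def enhance(self, rep):
        """Gate-weighted sum of the expert outputs for one pair representation."""
        weights = self.gate_weights(rep)
        outputs = [[math.tanh(v) for v in matvec(e, rep)] for e in self.experts]
        return [sum(w * out[i] for w, out in zip(weights, outputs)) for i in range(DIM)]

    def match_probability(self, a, b):
        """Matcher: logistic classifier applied to the enhanced representation."""
        rep = self.enhance(encode_pair(a, b))
        logit = sum(w * x for w, x in zip(self.matcher_w, rep)) + self.matcher_b
        return 1.0 / (1.0 + math.exp(-logit))


if __name__ == "__main__":
    model = ToyUnifiedMatcher(seed=0)
    # A dict-shaped record and a tuple-shaped record go through the same code path.
    left = {"name": "Apple iPhone 13", "price": 799}
    right = ("iPhone 13", "Apple", 799.0)
    print(model.match_probability(left, right))
```

The point of the sketch is that heterogeneous inputs (strings, tuples, key-value records) are serialized into one sequence, so a single encoder and a single classifier can be shared across tasks, while the gate lets different pairs weight the experts differently.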
