AnyMatch -- Efficient Zero-Shot Entity Matching with a Small Language Model

Zeyu Zhang, Paul Groth, Iacer Calixto, Sebastian Schelter
arXiv - CS - Databases, published 2024-09-06. DOI: arxiv-2409.04073 (https://doi.org/arxiv-2409.04073)
Citations: 0

Abstract

Entity matching (EM) is the problem of determining whether two records refer to the same real-world entity, which is crucial in data integration, e.g., for product catalogs or address databases. A major drawback of many EM approaches is their dependence on labelled examples. We thus focus on the challenging setting of zero-shot entity matching, where no labelled examples are available for an unseen target dataset. Recently, large language models (LLMs) have shown promising results for zero-shot EM, but their low throughput and high deployment cost limit their applicability and scalability. We revisit the zero-shot EM problem with AnyMatch, a small language model fine-tuned in a transfer learning setup. We propose several novel data selection techniques to generate fine-tuning data for our model, e.g., by selecting difficult pairs to match via an AutoML filter, by generating additional attribute-level examples, and by controlling label imbalance in the data. We conduct an extensive evaluation of the prediction quality and deployment cost of our model, in comparison with thirteen baselines on nine benchmark datasets. We find that AnyMatch provides competitive prediction quality despite its small parameter size: it achieves the second-highest F1 score overall, and outperforms several other approaches that employ models with hundreds of billions of parameters. Furthermore, our approach exhibits major cost benefits: the average prediction quality of AnyMatch is within 4.4% of the state-of-the-art method MatchGPT with the proprietary trillion-parameter model GPT-4, yet AnyMatch requires four orders of magnitude fewer parameters and incurs a 3,899 times lower inference cost (in dollars per 1,000 tokens).
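
The abstract names two of the data selection steps (attribute-level examples and control of label imbalance) without giving their details. The following is only a rough Python sketch of what such steps could look like, not the authors' implementation. The serialization format, the rule that attribute pairs inherit the label of their record pair, the negative-to-positive ratio, and all function names are assumptions made for illustration.

```python
import random
from typing import Dict, List, Tuple

Record = Dict[str, str]
Pair = Tuple[Record, Record, int]  # (left record, right record, label: 1 = match)


def serialize(record: Record) -> str:
    """Flatten a record into one string (hypothetical format, not from the paper)."""
    return " ".join(f"COL {name} VAL {value}" for name, value in record.items())


def attribute_level_examples(pairs: List[Pair]) -> List[Pair]:
    """Derive single-attribute pairs from record pairs.

    Assumption: each attribute pair simply inherits the label of the
    record pair it came from; attributes missing or empty on either
    side are skipped.
    """
    examples: List[Pair] = []
    for left, right, label in pairs:
        for name, value in left.items():
            other = right.get(name, "")
            if value and other:
                examples.append(({name: value}, {name: other}, label))
    return examples


def balance_labels(pairs: List[Pair], max_neg_per_pos: float = 2.0,
                   seed: int = 0) -> List[Pair]:
    """Downsample non-matches so that there are at most
    max_neg_per_pos negatives per positive (the ratio is an assumed knob)."""
    positives = [p for p in pairs if p[2] == 1]
    negatives = [p for p in pairs if p[2] == 0]
    limit = int(max_neg_per_pos * len(positives))
    if len(negatives) > limit:
        negatives = random.Random(seed).sample(negatives, limit)
    return positives + negatives


if __name__ == "__main__":
    match = ({"title": "iPhone 12", "brand": "Apple"},
             {"title": "Apple iPhone 12", "brand": "Apple"}, 1)
    non_match = ({"title": "iPhone 12", "brand": "Apple"},
                 {"title": "Galaxy S21", "brand": "Samsung"}, 0)
    data = [match] * 2 + [non_match] * 10
    balanced = balance_labels(data)
    print(len(balanced), "pairs after balancing")
    print(serialize(match[0]))
```

In a fine-tuning pipeline, the serialized left and right records would be joined into one model input; how AnyMatch does this, and how its AutoML filter picks difficult pairs, is not stated in the abstract and is left out of the sketch.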