AnyMatch -- Efficient Zero-Shot Entity Matching with a Small Language Model

arXiv - CS - Databases Pub Date : 2024-09-06 DOI:arxiv-2409.04073

Zeyu Zhang, Paul Groth, Iacer Calixto, Sebastian Schelter


Citations: 0

Abstract


AnyMatch -- Efficient Zero-Shot Entity Matching with a Small Language Model

Entity matching (EM) is the problem of determining whether two records refer to the same real-world entity, which is crucial in data integration, e.g., for product catalogs or address databases. A major drawback of many EM approaches is their dependence on labelled examples. We thus focus on the challenging setting of zero-shot entity matching, where no labelled examples are available for an unseen target dataset. Recently, large language models (LLMs) have shown promising results for zero-shot EM, but their low throughput and high deployment cost limit their applicability and scalability. We revisit the zero-shot EM problem with AnyMatch, a small language model fine-tuned in a transfer learning setup. We propose several novel data selection techniques to generate fine-tuning data for our model, e.g., by selecting difficult pairs to match via an AutoML filter, by generating additional attribute-level examples, and by controlling label imbalance in the data. We conduct an extensive evaluation of the prediction quality and deployment cost of our model, comparing it to thirteen baselines on nine benchmark datasets. We find that AnyMatch provides competitive prediction quality despite its small parameter size: it achieves the second-highest F1 score overall, and outperforms several other approaches that employ models with hundreds of billions of parameters. Furthermore, our approach exhibits major cost benefits: the average prediction quality of AnyMatch is within 4.4% of the state-of-the-art method MatchGPT with the proprietary trillion-parameter model GPT-4, yet AnyMatch requires four orders of magnitude fewer parameters and incurs a 3,899 times lower inference cost (in dollars per 1,000 tokens).
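The three data selection ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no details, so every function name, the sample records in `PAIRS`, the 0.5 threshold, and the negative-to-positive ratio are assumptions made for illustration. In particular, a simple token-level Jaccard baseline stands in for the AutoML filter, and attribute-level examples are derived only from matching pairs, which is one plausible reading of the technique.

```python
import random

def values_text(record):
    """Join all attribute values of a record (a dict) into one string."""
    return " ".join(str(v) for v in record.values())

def jaccard(a, b):
    """Token-level Jaccard similarity of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def hard_pairs(pairs, threshold=0.5):
    """Keep only the pairs that a cheap baseline misclassifies.

    Assumption: a Jaccard threshold stands in for the AutoML model.
    """
    kept = []
    for left, right, label in pairs:
        similarity = jaccard(values_text(left), values_text(right))
        predicted = int(similarity >= threshold)
        if predicted != label:
            kept.append((left, right, label))
    return kept

def attribute_level_examples(pairs):
    """Emit one positive example per shared attribute of each matching pair.

    Assumption: non-matching pairs are skipped, because their individual
    attributes (e.g. the brand) may still agree.
    """
    out = []
    for left, right, label in pairs:
        if label == 1:
            out.extend((str(left[k]), str(right[k]), 1) for k in left if k in right)
    return out

def rebalance(examples, max_neg_per_pos=2.0, seed=0):
    """Downsample negatives to at most max_neg_per_pos per positive."""
    pos = [e for e in examples if e[2] == 1]
    neg = [e for e in examples if e[2] == 0]
    keep = min(len(neg), int(max_neg_per_pos * len(pos)))
    return pos + random.Random(seed).sample(neg, keep)

# Hypothetical labelled pairs from a source dataset: (left, right, label).
PAIRS = [
    ({"title": "sony wh-1000xm4 headphones", "brand": "sony"},
     {"title": "sony wh-1000xm4 headphones black", "brand": "sony"}, 1),
    ({"title": "apple iphone 12", "brand": "apple"},
     {"title": "iphone12 by apple inc", "brand": "apple inc"}, 1),
    ({"title": "canon eos 80d body", "brand": "canon"},
     {"title": "canon eos 90d body", "brand": "canon"}, 0),
    ({"title": "dell xps 13", "brand": "dell"},
     {"title": "logitech mouse", "brand": "logitech"}, 0),
]
```

With this sample data, `hard_pairs(PAIRS)` keeps the iPhone pair (a match with little token overlap) and the Canon pair (a non-match with high overlap), which are the kinds of examples a difficulty filter is meant to surface. In a real setup the baseline would be a trained classifier rather than a fixed threshold.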
