VersaMatch: Ontology Matching with Weak Supervision

Proc. VLDB Endow. Pub Date: 2023-02-01 DOI: 10.14778/3583140.3583148

Jonathan Fürst, Mauricio Fadel Argerich, Bin Cheng

Pages: 1305-1318

Citations: 2

Abstract

Ontology matching is crucial to data integration for cross-silo data sharing and has mainly been addressed with heuristic and machine learning (ML) methods. Heuristic methods are often inflexible and hard to extend to new domains, while ML methods rely on substantial amounts of labeled training data that are hard to obtain. To overcome these limitations, we propose VersaMatch, a flexible, weakly supervised ontology matching system. VersaMatch employs various weak supervision sources, such as heuristic rules, pattern matching, and external knowledge bases, to produce labels from a large amount of unlabeled data for training a discriminative ML model. For prediction, VersaMatch uses a novel ensemble model that combines the weak supervision sources with the discriminative model to support generalization while retaining high precision. Our ensemble method boosts end model performance by 4 points compared to a traditional weak-supervision baseline. In addition, compared to state-of-the-art ontology matchers, VersaMatch achieves an overall 4-point improvement in F1 score across 26 ontology combinations from different domains. For recently released, in-the-wild datasets, VersaMatch beats the next best matchers by 9 points in F1. Furthermore, its core weak-supervision logic can easily be improved by adding more knowledge sources and collecting more unlabeled data for training.
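The general idea of weak supervision with an ensemble at prediction time can be illustrated with a small sketch. This is not VersaMatch's implementation: the labeling functions, the toy synonym table, the majority-vote aggregation, and the fallback rule in `ensemble_predict` are all assumptions chosen for illustration, and `model_score` stands in for any trained discriminative model.

```python
from typing import Callable, List, Tuple

MATCH, NO_MATCH, ABSTAIN = 1, 0, -1
Pair = Tuple[str, str]  # names of two ontology elements to compare


def normalize(name: str) -> str:
    """Lowercase a name and drop separators such as '_' or '-'."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def lf_exact_name(pair: Pair) -> int:
    """Heuristic rule: identical normalized names are a match."""
    return MATCH if normalize(pair[0]) == normalize(pair[1]) else ABSTAIN


# Hypothetical stand-in for an external knowledge base of synonyms.
SYNONYMS = {frozenset({"author", "writer"}), frozenset({"zipcode", "postalcode"})}


def lf_synonym(pair: Pair) -> int:
    """Knowledge-base rule: known synonym pairs are a match."""
    key = frozenset({normalize(pair[0]), normalize(pair[1])})
    return MATCH if key in SYNONYMS else ABSTAIN


def lf_no_shared_chars(pair: Pair) -> int:
    """Pattern rule: names sharing no characters are a non-match."""
    shared = set(normalize(pair[0])) & set(normalize(pair[1]))
    return NO_MATCH if not shared else ABSTAIN


LFS: List[Callable[[Pair], int]] = [lf_exact_name, lf_synonym, lf_no_shared_chars]


def majority_vote(votes: List[int]) -> int:
    """Aggregate votes, ignoring abstentions; ties count as abstain."""
    active = [v for v in votes if v != ABSTAIN]
    matches = sum(active)  # MATCH is 1 and NO_MATCH is 0
    if not active or 2 * matches == len(active):
        return ABSTAIN
    return MATCH if 2 * matches > len(active) else NO_MATCH


def weak_label(pair: Pair) -> int:
    """Noisy label for an unlabeled pair, usable as training data."""
    return majority_vote([lf(pair) for lf in LFS])


def ensemble_predict(pair: Pair, model_score: Callable[[Pair], float],
                     threshold: float = 0.5) -> int:
    """Trust the weak sources when they vote; otherwise use the model."""
    label = weak_label(pair)
    if label != ABSTAIN:
        return label
    return MATCH if model_score(pair) >= threshold else NO_MATCH
```

In this sketch, `weak_label` would be applied to unlabeled pairs to build a training set, and `ensemble_predict` falls back to the trained model only when every labeling function abstains, which is one simple way to keep the rules' precision while letting the model generalize.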

