Alfa: active learning for graph neural network-based semantic schema alignment

Venkata Vamsikrishna Meduri, Abdul Quamar, Chuan Lei, Xiao Qin, Berthold Reinwald
The VLDB Journal, vol. 20, no. 1. Published 2023-11-21. DOI: 10.1007/s00778-023-00822-z
Citations: 0

Abstract


Semantic schema alignment aims to match elements across a pair of schemas based on their semantic representation. It is a key primitive for data integration that facilitates the creation of a common data fabric across heterogeneous data sources. Deep learning approaches such as graph representation learning have shown promise for effective alignment of semantically rich schemas, often captured as ontologies. Most of these approaches are supervised and require large amounts of labeled training data, which is expensive in terms of cost and manual labor. Active learning (AL) techniques can alleviate this issue by intelligently choosing the data to be labeled utilizing a human-in-the-loop approach, while minimizing the amount of labeled training data required. However, existing active learning techniques are limited in their ability to utilize the rich semantic information from underlying schemas. Therefore, they cannot drive effective and efficient sample selection for human labeling that is necessary to scale to larger datasets. In this paper, we propose Alfa, an active learning framework to overcome these limitations. Alfa exploits the schema element properties as well as the relationships between schema elements (structure) to drive a novel ontology-aware sample selection and label propagation algorithm for training highly accurate alignment models. We propose semantic blocking to scale to larger datasets without compromising model quality. Our experimental results across three real-world datasets show that (1) Alfa leads to a substantial reduction (27–82%) in the cost of human labeling, (2) semantic blocking reduces label skew up to 40× without adversely affecting model quality and scales AL to large datasets, and (3) sample selection achieves comparable schema matching quality (90% F1-score) to models trained on the entire set of available training data. We also show that Alfa outperforms the state-of-the-art ontology alignment system, BERTMap, in terms of (1) 10× shorter time per AL iteration and (2) requiring half of the AL iterations to achieve the highest convergent F1-score.
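The human-in-the-loop paradigm the abstract describes can be illustrated with a minimal pool-based active-learning loop using uncertainty sampling. This is a generic sketch of the AL paradigm only, not Alfa's ontology-aware sample selection or label propagation; the `train`, `predict_proba`, and `oracle` callables are hypothetical stand-ins for a real alignment model and a human labeler.

```python
def uncertainty(prob):
    # Least-confidence score for a binary match probability:
    # higher means the model is less certain about this pair.
    return 1.0 - max(prob, 1.0 - prob)

def active_learning_loop(pool, oracle, train, predict_proba,
                         budget=10, batch_size=2):
    """pool: unlabeled candidate element pairs;
    oracle: human labeler (callable returning 0/1);
    train: fits a model on the labeled set so far;
    predict_proba: returns P(match) for one pair under the model."""
    labeled = []
    pool = list(pool)
    while budget > 0 and pool:
        model = train(labeled)
        # Rank unlabeled pairs by model uncertainty, most uncertain first.
        ranked = sorted(pool,
                        key=lambda x: uncertainty(predict_proba(model, x)),
                        reverse=True)
        for x in ranked[:min(batch_size, budget)]:
            labeled.append((x, oracle(x)))   # human-in-the-loop labeling
            pool.remove(x)
            budget -= 1
    return labeled

# Toy usage: 20 candidate pairs; the stand-in model is confident on
# even-indexed pairs and uncertain on odd-indexed ones.
pairs = list(range(20))
toy_train = lambda labeled: None
toy_proba = lambda model, x: 0.9 if x % 2 == 0 else 0.45
toy_oracle = lambda x: 1 if x % 2 == 0 else 0
picked = active_learning_loop(pairs, toy_oracle, toy_train, toy_proba,
                              budget=6, batch_size=3)
```

Under this setup the loop spends its entire labeling budget on the odd-indexed (most uncertain) pairs, which is the point of AL: human effort goes where the model is least sure, minimizing the labeled data required.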
