The VLDB Journal最新文献_第6页

A multi-facet analysis of BERT-based entity matching models 基于bert的实体匹配模型的多层面分析

The VLDB Journal Pub Date : 2023-11-29 DOI: 10.1007/s00778-023-00824-x

Matteo Paganelli, Donato Tiano, Francesco Guerra

{"title":"A multi-facet analysis of BERT-based entity matching models","authors":"Matteo Paganelli, Donato Tiano, Francesco Guerra","doi":"10.1007/s00778-023-00824-x","DOIUrl":"https://doi.org/10.1007/s00778-023-00824-x","url":null,"abstract":"State-of-the-art Entity Matching approaches rely on transformer architectures, such as BERT, for generating highly contextualized embeddings of terms. The embeddings are then used to predict whether pairs of entity descriptions refer to the same real-world entity. BERT-based EM models demonstrated to be effective, but act as black-boxes for the users, who have limited insight into the motivations behind their decisions. In this paper, we perform a multi-facet analysis of the components of pre-trained and fine-tuned BERT architectures applied to an EM task. The main findings resulting from our extensive experimental evaluation are (1) the fine-tuning process applied to the EM task mainly modifies the last layers of the BERT components, but in a different way on tokens belonging to descriptions of matching/non-matching entities; (2) the special structure of the EM datasets, where records are pairs of entity descriptions, is recognized by BERT; (3) the pair-wise semantic similarity of tokens is not a key knowledge exploited by BERT-based EM models; (4) fine-tuning SBERT, a pre-trained version of BERT on the sentence similarity task, i.e., a task close to EM, does not allow the model to largely improve the effectiveness and to learn different forms of knowledge. Approaches customized for EM, such as Ditto and SupCon, seem to rely on the same knowledge as the other transformer-based models. Only the contrastive learning training allows SupCon to learn different knowledge from matching and non-matching entity descriptions; (5) the fine-tuning process based on a binary classifier does not allow the model to learn key distinctive features of the entity descriptions.\u0000","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"171 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138544046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Givens rotations for QR decomposition, SVD and PCA over database joins 给出了QR分解、SVD和PCA在数据库连接上的旋转

The VLDB Journal Pub Date : 2023-11-23 DOI: 10.1007/s00778-023-00818-9

Dan Olteanu, Nils Vortmeier, Ɖorđe Živanović

{"title":"Givens rotations for QR decomposition, SVD and PCA over database joins","authors":"Dan Olteanu, Nils Vortmeier, Ɖorđe Živanović","doi":"10.1007/s00778-023-00818-9","DOIUrl":"https://doi.org/10.1007/s00778-023-00818-9","url":null,"abstract":"This article introduces FiGaRo, an algorithm for computing the upper-triangular matrix in the QR decomposition of the matrix defined by the natural join over relational data. FiGaRo ’s main novelty is that it pushes the QR decomposition past the join. This leads to several desirable properties. For acyclic joins, it takes time linear in the database size and independent of the join size. Its execution is equivalent to the application of a sequence of Givens rotations proportional to the join size. Its number of rounding errors relative to the classical QR decomposition algorithms is on par with the database size relative to the join output size. The QR decomposition lies at the core of many linear algebra computations including the singular value decomposition (SVD) and the principal component analysis (PCA). We show how FiGaRo can be used to compute the orthogonal matrix in the QR decomposition, the SVD and the PCA of the join output without the need to materialize the join output. A suite of experiments validate that FiGaRo can outperform both in runtime performance and numerical accuracy the LAPACK library Intel MKL by a factor proportional to the gap between the sizes of the join output and input.\u0000","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"31 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-11-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138544049","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 2

A survey on the evolution of stream processing systems 流处理系统发展综述

The VLDB Journal Pub Date : 2023-11-22 DOI: 10.1007/s00778-023-00819-8

Marios Fragkoulis, Paris Carbone, Vasiliki Kalavri, Asterios Katsifodimos

引用次数: 37

Alfa: active learning for graph neural network-based semantic schema alignment 基于图神经网络的语义模式对齐的主动学习

The VLDB Journal Pub Date : 2023-11-21 DOI: 10.1007/s00778-023-00822-z

Venkata Vamsikrishna Meduri, Abdul Quamar, Chuan Lei, Xiao Qin, Berthold Reinwald

{"title":"Alfa: active learning for graph neural network-based semantic schema alignment","authors":"Venkata Vamsikrishna Meduri, Abdul Quamar, Chuan Lei, Xiao Qin, Berthold Reinwald","doi":"10.1007/s00778-023-00822-z","DOIUrl":"https://doi.org/10.1007/s00778-023-00822-z","url":null,"abstract":"Semantic schema alignment aims to match elements across a pair of schemas based on their semantic representation. It is a key primitive for data integration that facilitates the creation of a common data fabric across heterogeneous data sources. Deep learning approaches such as graph representation learning have shown promise for effective alignment of semantically rich schemas, often captured as ontologies. Most of these approaches are supervised and require large amounts of labeled training data, which is expensive in terms of cost and manual labor. Active learning (AL) techniques can alleviate this issue by intelligently choosing the data to be labeled utilizing a human-in-the-loop approach, while minimizing the amount of labeled training data required. However, existing active learning techniques are limited in their ability to utilize the rich semantic information from underlying schemas. Therefore, they cannot drive effective and efficient sample selection for human labeling that is necessary to scale to larger datasets. In this paper, we propose Alfa, an active learning framework to overcome these limitations. Alfa exploits the schema element properties as well as the relationships between schema elements (structure) to drive a novel ontology-aware sample selection and label propagation algorithm for training highly accurate alignment models. We propose semantic blocking to scale to larger datasets without compromising model quality. Our experimental results across three real-world datasets show that (1) Alfa leads to a substantial reduction (27–82%) in the cost of human labeling, (2) semantic blocking reduces label skew up to 40(times ) without adversely affecting model quality and scales AL to large datasets, and (3) sample selection achieves comparable schema matching quality (90% F1-score) to models trained on the entire set of available training data. We also show that Alfa outperforms the state-of-the-art ontology alignment system, BERTMap, in terms of (1) 10(times ) shorter time per AL iteration and (2) requiring half of the AL iterations to achieve the highest convergent F1-score.","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138544050","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

AutoML in heavily constrained applications 在严格约束的应用程序中的自动化

The VLDB Journal Pub Date : 2023-11-17 DOI: 10.1007/s00778-023-00820-1

Felix Neutatz, Marius Lindauer, Ziawasch Abedjan

引用次数: 0

Efficient and robust active learning methods for interactive database exploration 用于交互式数据库探索的高效、健壮的主动学习方法

The VLDB Journal Pub Date : 2023-11-16 DOI: 10.1007/s00778-023-00816-x

Enhui Huang, Yanlei Diao, Anna Liu, Liping Peng, Luciano Di Palma

{"title":"Efficient and robust active learning methods for interactive database exploration","authors":"Enhui Huang, Yanlei Diao, Anna Liu, Liping Peng, Luciano Di Palma","doi":"10.1007/s00778-023-00816-x","DOIUrl":"https://doi.org/10.1007/s00778-023-00816-x","url":null,"abstract":"There is an increasing gap between fast growth of data and the limited human ability to comprehend data. Consequently, there has been a growing demand of data management tools that can bridge this gap and help the user retrieve high-value content from data more effectively. In this work, we propose an interactive data exploration system as a new database service, using an approach called “explore-by-example.” Our new system is designed to assist the user in performing highly effective data exploration while reducing the human effort in the process. We cast the explore-by-example problem in a principled “active learning” framework. However, traditional active learning suffers from two fundamental limitations: slow convergence and lack of robustness under label noise. To overcome the slow convergence and label noise problems, we bring the properties of important classes of database queries to bear on the design of new algorithms and optimizations for active learning-based database exploration. Evaluation results using real-world datasets and user interest patterns show that our new system, both in the noise-free case and in the label noise case, significantly outperforms state-of-the-art active learning techniques and data exploration systems in accuracy while achieving the desired efficiency for interactive data exploration.","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138544086","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Supporting secure dynamic alert zones using searchable encryption and graph embedding 支持使用可搜索加密和图形嵌入的安全动态警报区域

The VLDB Journal Pub Date : 2023-07-18 DOI: 10.1007/s00778-023-00803-2

Sina Shaham, Gabriel Ghinita, Cyrus Shahabi

{"title":"Supporting secure dynamic alert zones using searchable encryption and graph embedding","authors":"Sina Shaham, Gabriel Ghinita, Cyrus Shahabi","doi":"10.1007/s00778-023-00803-2","DOIUrl":"https://doi.org/10.1007/s00778-023-00803-2","url":null,"abstract":"Location-based alerts have gained increasing popularity in recent years, whether in the context of healthcare (e.g., COVID-19 contact tracing), marketing (e.g., location-based advertising), or public safety. However, serious privacy concerns arise when location data are used in clear in the process. Several solutions employ searchable encryption (SE) to achieve secure alerts directly on encrypted locations. While doing so preserves privacy, the performance overhead incurred is high. We focus on a prominent SE technique in the public-key setting–hidden vector encryption, and propose a graph embedding technique to encode location data in a way that significantly boosts the performance of processing on ciphertexts. We show that the optimal encoding is NP-hard, and we provide three heuristics that obtain significant performance gains: gray optimizer, multi-seed gray optimizer and scaled gray optimizer. Furthermore, we investigate the more challenging case of dynamic alert zones, where the area of interest changes over time. Our extensive experimental evaluation shows that our solutions can significantly improve computational overhead compared to existing baselines.\u0000","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"19 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138544051","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Time series indexing by dynamic covering with cross-range constraints 通过跨范围约束的动态覆盖进行时间序列索引

The VLDB Journal Pub Date : 2020-05-28 DOI: 10.1007/s00778-020-00614-9

Tao Sun, Hongbo Liu, S. McLoone, Shaoxiong Ji, Xindong Wu

引用次数: 3

A game-based framework for crowdsourced data labeling 基于游戏的众包数据标注框架

The VLDB Journal Pub Date : 2020-05-19 DOI: 10.1007/s00778-020-00613-w

Jingru Yang, Ju Fan, Zhewei Wei, Guoliang Li, Tongyu Liu, Xiaoyong Du

引用次数: 7