The VLDB Journal: Latest Articles

A multi-facet analysis of BERT-based entity matching models
The VLDB Journal Pub Date : 2023-11-29 DOI: 10.1007/s00778-023-00824-x
Matteo Paganelli, Donato Tiano, Francesco Guerra
Abstract: State-of-the-art Entity Matching (EM) approaches rely on transformer architectures, such as BERT, for generating highly contextualized embeddings of terms. The embeddings are then used to predict whether pairs of entity descriptions refer to the same real-world entity. BERT-based EM models have proven effective, but they act as black boxes for users, who have limited insight into the motivations behind their decisions. In this paper, we perform a multi-facet analysis of the components of pre-trained and fine-tuned BERT architectures applied to an EM task. The main findings resulting from our extensive experimental evaluation are (1) the fine-tuning process applied to the EM task mainly modifies the last layers of the BERT components, but in a different way on tokens belonging to descriptions of matching/non-matching entities; (2) the special structure of the EM datasets, where records are pairs of entity descriptions, is recognized by BERT; (3) the pair-wise semantic similarity of tokens is not a key knowledge exploited by BERT-based EM models; (4) fine-tuning SBERT, a pre-trained version of BERT on the sentence similarity task, i.e., a task close to EM, does not allow the model to largely improve the effectiveness and to learn different forms of knowledge. Approaches customized for EM, such as Ditto and SupCon, seem to rely on the same knowledge as the other transformer-based models. Only the contrastive learning training allows SupCon to learn different knowledge from matching and non-matching entity descriptions; (5) the fine-tuning process based on a binary classifier does not allow the model to learn key distinctive features of the entity descriptions.
Citations: 0
Givens rotations for QR decomposition, SVD and PCA over database joins
The VLDB Journal Pub Date : 2023-11-23 DOI: 10.1007/s00778-023-00818-9
Dan Olteanu, Nils Vortmeier, Đorđe Živanović
Abstract: This article introduces FiGaRo, an algorithm for computing the upper-triangular matrix in the QR decomposition of the matrix defined by the natural join over relational data. FiGaRo's main novelty is that it pushes the QR decomposition past the join. This leads to several desirable properties. For acyclic joins, it takes time linear in the database size and independent of the join size. Its execution is equivalent to the application of a sequence of Givens rotations proportional to the join size. Its number of rounding errors relative to the classical QR decomposition algorithms is on par with the database size relative to the join output size. The QR decomposition lies at the core of many linear algebra computations including the singular value decomposition (SVD) and the principal component analysis (PCA). We show how FiGaRo can be used to compute the orthogonal matrix in the QR decomposition, the SVD and the PCA of the join output without the need to materialize the join output. A suite of experiments validates that FiGaRo can outperform the LAPACK library Intel MKL in both runtime performance and numerical accuracy by a factor proportional to the gap between the sizes of the join output and input.
Citations: 2
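For readers unfamiliar with the building block named in the abstract above, the following is a minimal sketch of the classical, dense Givens-rotation procedure that computes the upper-triangular factor R by zeroing sub-diagonal entries one at a time. It is not FiGaRo itself and does not operate over joins; the function name `givens_r` and the sample matrix are illustrative assumptions.

```python
import math


def givens_r(matrix):
    """Return the upper-triangular factor R of a QR decomposition.

    Classical dense algorithm (illustrative only, not FiGaRo): each Givens
    rotation mixes the pivot row j with a lower row i so that entry (i, j)
    becomes zero.
    """
    a = [[float(value) for value in row] for row in matrix]
    rows, cols = len(a), len(a[0])
    for j in range(min(rows, cols)):
        # Eliminate the entries below the diagonal in column j, bottom-up.
        for i in range(rows - 1, j, -1):
            x, y = a[j][j], a[i][j]
            if y == 0.0:
                continue  # already zero, no rotation needed
            r = math.hypot(x, y)
            c, s = x / r, y / r
            # Columns left of j are already zero in both rows.
            for k in range(j, cols):
                top, bottom = a[j][k], a[i][k]
                a[j][k] = c * top + s * bottom
                a[i][k] = -s * top + c * bottom
    return a


if __name__ == "__main__":
    # Approximately [[5, 4], [0, 3]], up to floating-point rounding.
    print(givens_r([[3, 0], [4, 5]]))
```

Because rotations are orthogonal, the result satisfies R^T R = A^T A, which gives a quick sanity check on any input.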
A survey on the evolution of stream processing systems
The VLDB Journal Pub Date : 2023-11-22 DOI: 10.1007/s00778-023-00819-8
Marios Fragkoulis, Paris Carbone, Vasiliki Kalavri, Asterios Katsifodimos
Abstract: Stream processing has been an active research field for more than 20 years, but it is now witnessing its prime time due to recent successful efforts by the research community and numerous worldwide open-source communities. This survey provides a comprehensive overview of fundamental aspects of stream processing systems and their evolution in the functional areas of out-of-order data management, state management, fault tolerance, high availability, load management, elasticity, and reconfiguration. We review noteworthy past research findings, outline the similarities and differences between the first (’00–’10) and second (’11–’23) generation of stream processing systems, and discuss future trends and open problems.
Citations: 37
Alfa: active learning for graph neural network-based semantic schema alignment
The VLDB Journal Pub Date : 2023-11-21 DOI: 10.1007/s00778-023-00822-z
Venkata Vamsikrishna Meduri, Abdul Quamar, Chuan Lei, Xiao Qin, Berthold Reinwald
Abstract: Semantic schema alignment aims to match elements across a pair of schemas based on their semantic representation. It is a key primitive for data integration that facilitates the creation of a common data fabric across heterogeneous data sources. Deep learning approaches such as graph representation learning have shown promise for effective alignment of semantically rich schemas, often captured as ontologies. Most of these approaches are supervised and require large amounts of labeled training data, which is expensive in terms of cost and manual labor. Active learning (AL) techniques can alleviate this issue by intelligently choosing the data to be labeled utilizing a human-in-the-loop approach, while minimizing the amount of labeled training data required. However, existing active learning techniques are limited in their ability to utilize the rich semantic information from underlying schemas. Therefore, they cannot drive effective and efficient sample selection for human labeling that is necessary to scale to larger datasets. In this paper, we propose Alfa, an active learning framework to overcome these limitations. Alfa exploits the schema element properties as well as the relationships between schema elements (structure) to drive a novel ontology-aware sample selection and label propagation algorithm for training highly accurate alignment models. We propose semantic blocking to scale to larger datasets without compromising model quality. Our experimental results across three real-world datasets show that (1) Alfa leads to a substantial reduction (27–82%) in the cost of human labeling, (2) semantic blocking reduces label skew up to 40× without adversely affecting model quality and scales AL to large datasets, and (3) sample selection achieves comparable schema matching quality (90% F1-score) to models trained on the entire set of available training data. We also show that Alfa outperforms the state-of-the-art ontology alignment system, BERTMap, in terms of (1) 10× shorter time per AL iteration and (2) requiring half of the AL iterations to achieve the highest convergent F1-score.
Citations: 0
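The abstract above describes active learning as choosing which data to send to a human labeler. As a generic illustration only, here is a sketch of uncertainty sampling, a common baseline selection strategy. It is not Alfa's ontology-aware algorithm, and the function name, probabilities, and batch size are hypothetical.

```python
def uncertainty_sampling(probabilities, batch_size):
    """Pick the unlabeled items the current model is least sure about.

    `probabilities[i]` is the model's predicted probability that candidate
    pair i is a match. Items closest to 0.5 are selected first.
    Illustrative baseline only; not the selection strategy of any paper.
    """
    ranked = sorted(
        range(len(probabilities)),
        key=lambda index: abs(probabilities[index] - 0.5),
    )
    return ranked[:batch_size]


if __name__ == "__main__":
    # Hypothetical match probabilities for five candidate schema-element pairs.
    scores = [0.95, 0.52, 0.10, 0.45, 0.70]
    print(uncertainty_sampling(scores, batch_size=2))  # [1, 3]
```

In a full loop, the selected indices would be labeled by a person, added to the training set, and the model retrained before the next round of selection.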
AutoML in heavily constrained applications
The VLDB Journal Pub Date : 2023-11-17 DOI: 10.1007/s00778-023-00820-1
Felix Neutatz, Marius Lindauer, Ziawasch Abedjan
Abstract: Optimizing a machine learning pipeline for a task at hand requires careful configuration of various hyperparameters, typically supported by an AutoML system that optimizes the hyperparameters for the given training dataset. Yet, depending on the AutoML system’s own second-order meta-configuration, the performance of the AutoML process can vary significantly. Current AutoML systems cannot automatically adapt their own configuration to a specific use case. Further, they cannot compile user-defined application constraints on the effectiveness and efficiency of the pipeline and its generation. In this paper, we propose Caml, which uses meta-learning to automatically adapt its own AutoML parameters, such as the search strategy, the validation strategy, and the search space, for a task at hand. The dynamic AutoML strategy of Caml takes user-defined constraints into account and obtains constraint-satisfying pipelines with high predictive performance.
Citations: 0
Efficient and robust active learning methods for interactive database exploration
The VLDB Journal Pub Date : 2023-11-16 DOI: 10.1007/s00778-023-00816-x
Enhui Huang, Yanlei Diao, Anna Liu, Liping Peng, Luciano Di Palma
Abstract: There is an increasing gap between the fast growth of data and the limited human ability to comprehend data. Consequently, there has been a growing demand for data management tools that can bridge this gap and help the user retrieve high-value content from data more effectively. In this work, we propose an interactive data exploration system as a new database service, using an approach called “explore-by-example.” Our new system is designed to assist the user in performing highly effective data exploration while reducing the human effort in the process. We cast the explore-by-example problem in a principled “active learning” framework. However, traditional active learning suffers from two fundamental limitations: slow convergence and lack of robustness under label noise. To overcome the slow convergence and label noise problems, we bring the properties of important classes of database queries to bear on the design of new algorithms and optimizations for active learning-based database exploration. Evaluation results using real-world datasets and user interest patterns show that our new system, both in the noise-free case and in the label noise case, significantly outperforms state-of-the-art active learning techniques and data exploration systems in accuracy while achieving the desired efficiency for interactive data exploration.
Citations: 0
Supporting secure dynamic alert zones using searchable encryption and graph embedding
The VLDB Journal Pub Date : 2023-07-18 DOI: 10.1007/s00778-023-00803-2
Sina Shaham, Gabriel Ghinita, Cyrus Shahabi
Abstract: Location-based alerts have gained increasing popularity in recent years, whether in the context of healthcare (e.g., COVID-19 contact tracing), marketing (e.g., location-based advertising), or public safety. However, serious privacy concerns arise when location data are used in the clear in the process. Several solutions employ searchable encryption (SE) to achieve secure alerts directly on encrypted locations. While doing so preserves privacy, the performance overhead incurred is high. We focus on a prominent SE technique in the public-key setting, hidden vector encryption, and propose a graph embedding technique to encode location data in a way that significantly boosts the performance of processing on ciphertexts. We show that the optimal encoding is NP-hard, and we provide three heuristics that obtain significant performance gains: gray optimizer, multi-seed gray optimizer and scaled gray optimizer. Furthermore, we investigate the more challenging case of dynamic alert zones, where the area of interest changes over time. Our extensive experimental evaluation shows that our solutions can significantly reduce computational overhead compared to existing baselines.
Citations: 0
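The heuristic names in the abstract above suggest Gray codes. Assuming that reading, the sketch below shows the standard binary-reflected Gray code, in which consecutive integers map to codewords that differ in exactly one bit. It does not reproduce the paper's optimizers or any encryption step, and the helper names are illustrative.

```python
def gray_code(n):
    """Return the binary-reflected Gray code of a non-negative integer n."""
    return n ^ (n >> 1)


def hamming_distance(a, b):
    """Count the bit positions in which integers a and b differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    # Hypothetical example: label 8 consecutive cells with 3-bit codewords.
    codes = [gray_code(i) for i in range(8)]
    print([format(code, "03b") for code in codes])
    # Every pair of consecutive codewords differs in exactly one bit.
    print(all(hamming_distance(a, b) == 1 for a, b in zip(codes, codes[1:])))
```

The one-bit-difference property is what makes such an encoding attractive when nearby items should share most of their bits.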
Time series indexing by dynamic covering with cross-range constraints
The VLDB Journal Pub Date : 2020-05-28 DOI: 10.1007/s00778-020-00614-9
Tao Sun, Hongbo Liu, S. McLoone, Shaoxiong Ji, Xindong Wu
Pages: 1365-1384
Citations: 3
A game-based framework for crowdsourced data labeling
The VLDB Journal Pub Date : 2020-05-19 DOI: 10.1007/s00778-020-00613-w
Jingru Yang, Ju Fan, Zhewei Wei, Guoliang Li, Tongyu Liu, Xiaoyong Du
Pages: 1311-1336
Citations: 7