Proc. VLDB Endow.: Latest Articles

Sparkly: A Simple yet Surprisingly Strong TF/IDF Blocker for Entity Matching
Proc. VLDB Endow. Pub Date : 2023-02-01 DOI: 10.14778/3583140.3583163
Derek Paulsen, Yash Govind, A. Doan
Abstract: Blocking is a major task in entity matching. Numerous blocking solutions have been developed, but as far as we can tell, blocking using the well-known tf/idf measure has received virtually no attention. Yet, when we experimented with tf/idf blocking using Lucene, we found it did quite well. So in this paper we examine tf/idf blocking in depth. We develop Sparkly, which uses Lucene to perform top-k tf/idf blocking in a distributed share-nothing fashion on a Spark cluster. We develop techniques to identify good attributes and tokenizers that can be used to block on, making Sparkly completely automatic. We perform extensive experiments showing that Sparkly outperforms 8 state-of-the-art blockers. Finally, we provide an in-depth analysis of Sparkly's performance, regarding both recall/output size and runtime. Our findings suggest that (a) tf/idf blocking needs more attention, (b) Sparkly forms a strong baseline that future blocking work should compare against, and (c) future blocking work should seriously consider top-k blocking, which helps improve recall, and a distributed share-nothing architecture, which helps improve scalability, predictability, and extensibility.
Citations: 3
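To make the idea of top-k tf/idf blocking concrete, here is a minimal single-machine sketch in plain Python. It is not Sparkly itself: the abstract says Sparkly relies on Lucene and Spark, whereas this illustration uses a hand-rolled inverted index, a whitespace tokenizer, and a simple `log(1 + n/df)` idf weighting. All of these, and the function names `build_index` and `top_k_block`, are assumptions made for illustration only.

```python
import heapq
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumed tokenizer: lowercase + whitespace split (illustrative choice).
    return text.lower().split()


def build_index(records):
    """Build an inverted index over `records` (dict: record id -> text).

    Returns (index, idf), where index maps token -> [(record id, tf*idf)].
    """
    n = len(records)
    term_freqs = {rid: Counter(tokenize(text)) for rid, text in records.items()}
    doc_freq = Counter()
    for tf in term_freqs.values():
        doc_freq.update(tf.keys())
    # Assumed idf variant; real engines use more elaborate scoring.
    idf = {t: math.log(1 + n / d) for t, d in doc_freq.items()}
    index = defaultdict(list)
    for rid, tf in term_freqs.items():
        for token, count in tf.items():
            index[token].append((rid, count * idf[token]))
    return index, idf


def top_k_block(query_text, index, idf, k):
    """Return the k indexed records with the highest tf/idf dot-product score."""
    scores = defaultdict(float)
    for token, count in Counter(tokenize(query_text)).items():
        if token not in idf:
            continue  # token never seen in the indexed table
        weight = count * idf[token]
        for rid, doc_weight in index[token]:
            scores[rid] += weight * doc_weight
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


if __name__ == "__main__":
    table_a = {
        "a1": "apple iphone 12 black",
        "a2": "samsung galaxy s21",
        "a3": "apple macbook pro",
    }
    index, idf = build_index(table_a)
    # Candidate matches in table A for one record of table B.
    print(top_k_block("iphone 12 apple", index, idf, k=2))
```

Keeping only the k best-scoring candidates per record bounds the output size, and because each probe record is scored independently, the probing step is the kind of work that can be spread across workers without shared state.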
Bringing Compiling Databases to RISC Architectures
Proc. VLDB Endow. Pub Date : 2023-02-01 DOI: 10.14778/3583140.3583142
F. Gruber, Maximilian Bandle, A. Engelke, Thomas Neumann, Jana Giceva
Abstract: Current hardware development greatly influences the design decisions of modern database systems. For many modern performance-focused database systems, query compilation emerged as an integral part and different approaches for code generation evolved, making use of standard compilers, general-purpose compiler libraries, or domain-specific code generators. However, development primarily focused on the dominating x86-64 server architecture but neglected current hardware developments towards other CPU architectures like ARM and other RISC architectures. Therefore, we explore the design space of code generation in database systems considering a variety of state-of-the-art compilation approaches with a set of qualitative and quantitative metrics. Based on our findings, we have developed a new code generator called FireARM for AArch64-based systems in our database system, Umbra. We identify general as well as architecture-specific challenges for custom code generation in databases and provide potential solutions to abstract or handle them. Furthermore, we present an extensive evaluation of different compilation approaches in Umbra on a wide variety of x86-64 and ARM machines. In particular, we compare quantitative performance characteristics such as compilation latency and query throughput. Our results show that using standard languages and compiler infrastructures reduces the barrier to employing query compilation and allows for high performance on big data sets, while domain-specific code generators can achieve a significantly lower compilation overhead and allow for better targeting of new architectures.
Citations: 3
Efficient Approximation of Certain and Possible Answers for Ranking and Window Queries over Uncertain Data (Extended version)
Proc. VLDB Endow. Pub Date : 2023-02-01 DOI: 10.48550/arXiv.2302.08676
Su Feng, Boris Glavic, Oliver Kennedy
Abstract: Uncertainty arises naturally in many application domains due to, e.g., data entry errors and ambiguity in data cleaning. Prior work in incomplete and probabilistic databases has investigated the semantics and efficient evaluation of ranking and top-k queries over uncertain data. However, most approaches deal with top-k and ranking in isolation and represent uncertain input data and query results using separate, incompatible data models. We present an efficient approach for under- and over-approximating results of ranking, top-k, and window queries over uncertain data. Our approach integrates well with existing techniques for querying uncertain data, is efficient, and is to the best of our knowledge the first to support windowed aggregation. We design algorithms for physical operators for uncertain sorting and windowed aggregation, and implement them in PostgreSQL. We evaluated our approach on synthetic and real-world datasets, demonstrating that it outperforms all competitors, and often produces more accurate results.
Citations: 0
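As a rough illustration of what under- and over-approximating a top-k result can mean, the sketch below models each uncertain value as an interval `(lo, hi)` and derives bounds on each tuple's sort position. This interval model, the ascending sort order, and the function names `position_bounds` and `top_k_answers` are assumptions chosen for illustration; the paper's actual data model and PostgreSQL operators are not described in the abstract and may differ.

```python
def position_bounds(intervals):
    """Bound each tuple's 0-based position in an ascending sort.

    `intervals` maps tuple id -> (lo, hi), the range the uncertain value
    may take. Returns tuple id -> (min_position, max_position).
    """
    bounds = {}
    for tid, (lo, hi) in intervals.items():
        # Tuples that sort before `tid` in every possible world.
        certainly_before = sum(
            1 for oid, (_, other_hi) in intervals.items()
            if oid != tid and other_hi < lo
        )
        # Tuples that could sort before `tid` in at least one world
        # (ties are assumed to be broken arbitrarily).
        possibly_before = sum(
            1 for oid, (other_lo, _) in intervals.items()
            if oid != tid and other_lo <= hi
        )
        bounds[tid] = (certainly_before, possibly_before)
    return bounds


def top_k_answers(intervals, k):
    """Return (certain, possible) sets of tuple ids for a top-k query.

    `certain` under-approximates the answer (in the top k in every world);
    `possible` over-approximates it (may be in the top k in some world).
    """
    bounds = position_bounds(intervals)
    certain = {tid for tid, (_, max_pos) in bounds.items() if max_pos < k}
    possible = {tid for tid, (min_pos, _) in bounds.items() if min_pos < k}
    return certain, possible


if __name__ == "__main__":
    data = {"a": (1, 2), "b": (3, 5), "c": (4, 6)}
    print(position_bounds(data))
    print(top_k_answers(data, k=2))
```

In this toy model the gap between the two sets is exactly the set of tuples whose membership in the result depends on how the uncertainty resolves.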
VersaMatch: Ontology Matching with Weak Supervision
Proc. VLDB Endow. Pub Date : 2023-02-01 DOI: 10.14778/3583140.3583148
Jonathan Fürst, Mauricio Fadel Argerich, Bin Cheng
Abstract: Ontology matching is crucial to data integration for cross-silo data sharing and has been mainly addressed with heuristic and machine learning (ML) methods. While heuristic methods are often inflexible and hard to extend to new domains, ML methods rely on substantial and hard-to-obtain amounts of labeled training data. To overcome these limitations, we propose VersaMatch, a flexible, weakly-supervised ontology matching system. VersaMatch employs various weak supervision sources, such as heuristic rules, pattern matching, and external knowledge bases, to produce labels from a large amount of unlabeled data for training a discriminative ML model. For prediction, VersaMatch develops a novel ensemble model combining the weak supervision sources with the discriminative model to support generalization while retaining a high precision. Our ensemble method boosts end model performance by 4 points compared to a traditional weak-supervision baseline. In addition, compared to state-of-the-art ontology matchers, VersaMatch achieves an overall 4-point performance improvement in F1 score across 26 ontology combinations from different domains. For recently released, in-the-wild datasets, VersaMatch beats the next best matchers by 9 points in F1. Furthermore, its core weak-supervision logic can easily be improved by adding more knowledge sources and collecting more unlabeled data for training.
Citations: 2
Robust Query Driven Cardinality Estimation under Changing Workloads
Proc. VLDB Endow. Pub Date : 2023-02-01 DOI: 10.14778/3583140.3583164
Parimarjan Negi, Ziniu Wu, Andreas Kipf, Nesime Tatbul, Ryan Marcus, S. Madden, Tim Kraska, Mohammad Alizadeh
Abstract: Query-driven cardinality estimation models learn from a historical log of queries. They are lightweight, having low storage requirements, fast inference and training, and are easily adaptable for any kind of query. Unfortunately, such models can suffer unpredictably bad performance under workload drift, i.e., if the query pattern or data changes. This makes them unreliable and hard to deploy. We analyze the reasons why models become unpredictable due to workload drift, and introduce modifications to the query representation and neural network training techniques to make query-driven models robust to the effects of workload drift. First, we emulate workload drift in queries involving some unseen tables or columns by randomly masking out some table or column features during training. This forces the model to make predictions with missing query information, relying more on robust features based on up-to-date DBMS statistics that are useful even when query or data drift happens. Second, we introduce join bitmaps, which extend sampling-based features to be consistent across joins using ideas from sideways information passing. Finally, we show how both of these ideas can be adapted to handle data updates. We show significantly greater generalization than past works across different workloads and databases. For instance, a model trained with our techniques on a simple workload (JOBLight-train), with 40k synthetically generated queries of at most 3 tables each, is able to generalize to the much more complex Join Order Benchmark, which includes queries with up to 16 tables, and improve query runtimes by 2× over PostgreSQL. We show similar robustness results with data updates, and across other workloads. We discuss the situations where we expect, and see, improvements, as well as more challenging workload drift scenarios where these techniques do not improve much over PostgreSQL. However, even in the most challenging scenarios, our models never perform worse than PostgreSQL, while standard query-driven models can get much worse than PostgreSQL.
Citations: 5
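The first technique in the abstract, randomly masking table or column features during training, can be sketched as a small data-augmentation step. The feature layout below (a dict with `table_onehot`, `column_onehot`, and `stats` entries), the zero mask value, and the function name `mask_query_features` are hypothetical; the abstract does not specify the paper's actual query representation.

```python
import random

# Hypothetical feature groups that identify specific tables/columns.
MASKABLE_GROUPS = ("table_onehot", "column_onehot")


def mask_query_features(query, mask_prob, rng):
    """Return a copy of `query` with identity features randomly zeroed out.

    Each maskable group is zeroed with probability `mask_prob`, so the model
    must sometimes predict from the statistics-based features alone. The
    `stats` group is never masked.
    """
    masked = {name: list(values) for name, values in query.items()}
    for group in MASKABLE_GROUPS:
        if group in masked and rng.random() < mask_prob:
            masked[group] = [0.0] * len(masked[group])
    return masked


if __name__ == "__main__":
    rng = random.Random(0)
    query = {
        "table_onehot": [1.0, 0.0, 1.0],
        "column_onehot": [0.0, 1.0],
        "stats": [0.25, 1200.0],  # e.g. estimates taken from DBMS statistics
    }
    # Apply a fresh random mask each time the query is used for training.
    for _ in range(3):
        print(mask_query_features(query, mask_prob=0.5, rng=rng))
```

Masking would only be applied at training time; at inference the full feature vector is used, and a query over an unseen table simply looks like one of the masked training examples.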
Elpis: Graph-Based Similarity Search for Scalable Data Science
Proc. VLDB Endow. Pub Date : 2023-02-01 DOI: 10.14778/3583140.3583166
Ilias Azizi, Karima Echihabi, Themis Palpanas
Abstract: The recent popularity of learned embeddings has fueled the growth of massive collections of high-dimensional (high-d) vectors that model complex data. Finding similar vectors in these collections is at the core of many important and practical data science applications. The data series community has developed tree-based similarity search techniques that outperform state-of-the-art methods on large collections of both data series and generic high-d vectors, in all scenarios except for no-guarantees ng-approximate search, where graph-based approaches designed by the high-d vector community achieve the best performance. However, building graph-based indexes is extremely expensive both in time and space. In this paper, we bring these two worlds together, study the corresponding solutions and their performance behavior, and propose ELPIS, a new strong baseline that takes advantage of the best features of both to achieve a superior performance in terms of indexing and ng-approximate search in-memory. ELPIS builds the index 3x-8x faster than competitors, using 40% less memory. It also achieves a high recall of 0.99, up to 2x faster than the state-of-the-art methods, and answers 1-NN queries up to one order of magnitude faster.
Citations: 10
Dotori: A Key-Value SSD Based KV Store
Proc. VLDB Endow. Pub Date : 2023-02-01 DOI: 10.14778/3583140.3583167
Carl Duffy, Jaehoon Shim, Sang-Hoon Kim, Jin-Soo Kim
Abstract: Key-value SSDs (KVSSDs) represent a major shift in the storage stack design, with numerous potential benefits. Despite this, their lack of native features critical to operation in real-world scenarios hinders their adoption, and these benefits go unrealized. Moreover, simply adapting existing key-value stores to run on KVSSDs proves underwhelming, as KVSSDs operate at lower raw device performance when compared to modern block SSDs. This paper introduces Dotori. Dotori is a KVSSD-based key-value store that provides much needed functionality in a KVSSD through an upper layer in the host, and takes advantage of the unique KVSSD interface to enable further gains in functionality and performance. At the core of Dotori is a novel B+tree design that is only practical when the underlying storage device is a KVSSD. We test Dotori with an enterprise-grade KVSSD against state-of-the-art block SSD based key-value stores through a range of micro-benchmarks and real-world workloads. Despite low KVSSD raw device performance, Dotori achieves superior performance to these block-device based key-value stores while also showing significant gains in other important metrics.
Citations: 0
A Design Space Exploration and Evaluation for Main-Memory Hash Joins in Storage Class Memory
Proc. VLDB Endow. Pub Date : 2023-02-01 DOI: 10.14778/3583140.3583144
Wentao Huang, Yunhong Ji, Xuan Zhou, Bin He, K. Tan
Abstract: In this paper, we seek to perform a rigorous experimental study of main-memory hash joins in storage class memory (SCM). In particular, we perform a design space exploration in real SCM for two state-of-the-art join algorithms: partitioned hash join (PHJ) and non-partitioned hash join (NPHJ), and identify the most crucial factors to implement an SCM-friendly join. Moreover, we present a rigorous evaluation with a broad spectrum of workloads for both joins and provide an in-depth analysis for choosing the most suitable algorithm in a real SCM environment. With the most extensive experimental analysis to date, we maintain that although there is no one universal winner in all scenarios, PHJ is generally superior to NPHJ in real SCM.
Citations: 1
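For readers unfamiliar with the two join families compared above, the following sketch contrasts their logical structure in Python. It only shows the control flow: NPHJ builds a single hash table, while PHJ first partitions both inputs by a hash of the key and then joins matching partitions. The function names, the tuple layout `(key, payload)`, and the default partition count are illustrative assumptions, and nothing here models SCM behavior or the tuning factors studied in the paper.

```python
from collections import defaultdict


def non_partitioned_hash_join(build, probe):
    """NPHJ: build one hash table over `build`, then probe it with `probe`.

    Both inputs are lists of (key, payload) tuples.
    Returns a list of (key, build_payload, probe_payload).
    """
    table = defaultdict(list)
    for key, payload in build:
        table[key].append(payload)
    return [
        (key, build_payload, probe_payload)
        for key, probe_payload in probe
        for build_payload in table.get(key, [])
    ]


def partitioned_hash_join(build, probe, num_partitions=4):
    """PHJ: partition both inputs by key hash, then join partition pairs.

    Each per-partition hash table is smaller than the single NPHJ table,
    which is the property partitioned joins exploit for memory locality.
    """
    build_parts = [[] for _ in range(num_partitions)]
    probe_parts = [[] for _ in range(num_partitions)]
    for row in build:
        build_parts[hash(row[0]) % num_partitions].append(row)
    for row in probe:
        probe_parts[hash(row[0]) % num_partitions].append(row)
    result = []
    for build_part, probe_part in zip(build_parts, probe_parts):
        result.extend(non_partitioned_hash_join(build_part, probe_part))
    return result


if __name__ == "__main__":
    r = [(1, "a"), (2, "b"), (2, "c")]
    s = [(2, "x"), (3, "y"), (1, "z")]
    print(sorted(non_partitioned_hash_join(r, s)))
    print(sorted(partitioned_hash_join(r, s)))
```

Both functions return the same set of joined rows; they differ only in how the work is laid out in memory, which is the dimension such an evaluation explores.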
Efficient Distributed Transaction Processing in Heterogeneous Networks
Proc. VLDB Endow. Pub Date : 2023-02-01 DOI: 10.14778/3583140.3583153
Qian Zhang, Jingyao Li, Hong-wei Zhao, Quanqing Xu, Wei Lu, Jinliang Xiao, Fusheng Han, Chuanhui Yang, Xiaoyong Du
Abstract: Countrywide and worldwide business, like gaming and social networks, drives the popularity of inter-data-center transactions. To support inter-data-center transaction processing and data center fault tolerance simultaneously, existing protocols suffer from significant performance degradation due to high-latency and unstable networks. In this paper, we propose RedT, a novel distributed transaction processing protocol that works in heterogeneous networks. In detail, nodes within a data center are inter-connected via the RDMA-capable network and nodes across data centers are inter-connected via TCP/IP networks. RedT extends two-phase commit (2PC) by decomposing transactions into sub-transactions in terms of the data center granularity, and proposing a pre-write-log mechanism that is able to reduce the number of inter-data-center round-trips from a maximum of 6 to 2. Extensive evaluation against state-of-the-art protocols shows that RedT can achieve up to 1.57× higher throughput and 0.56× lower latency.
Citations: 1
Explaining Dataset Changes for Semantic Data Versioning with Explain-Da-V (Technical Report)
Proc. VLDB Endow. Pub Date : 2023-01-30 DOI: 10.48550/arXiv.2301.13095
Roee Shraga, Renée J. Miller
Abstract: In multi-user environments in which data science and analysis are collaborative, multiple versions of the same datasets are generated. While managing and storing data versions has received some attention in the research literature, the semantic nature of such changes has remained under-explored. In this work, we introduce Explain-Da-V, a framework aiming to explain changes between two given dataset versions. Explain-Da-V generates explanations that use data transformations to explain changes. We further introduce a set of measures that evaluate the validity, generalizability, and explainability of these explanations. We empirically show, using an adapted existing benchmark and a newly created benchmark, that Explain-Da-V generates better explanations than existing data transformation synthesis methods.
Citations: 0