The VLDB Journal: Latest Articles

Efficient algorithms for reachability and path queries on temporal bipartite graphs
The VLDB Journal Pub Date : 2024-05-23 DOI: 10.1007/s00778-024-00854-z
Kai Wang, Minghao Cai, Xiaoshuang Chen, Xuemin Lin, Wenjie Zhang, Lu Qin, Ying Zhang
Citations: 0
Discovering approximate implicit domain orders through order dependencies
The VLDB Journal Pub Date : 2024-05-21 DOI: 10.1007/s00778-024-00847-y
Reza Karegar, Melicaalsadat Mirsafian, P. Godfrey, Lukasz Golab, M. Kargar, Divesh Srivastava, Jaroslaw Szlichta
Citations: 0
Similarity-driven and task-driven models for diversity of opinion in crowdsourcing markets
The VLDB Journal Pub Date : 2024-05-17 DOI: 10.1007/s00778-024-00853-0
Chen Jason Zhang, Yunrui Liu, Pengcheng Zeng, Ting Wu, Lei Chen, Pan Hui, Fei Hao
Abstract: The recent boom in crowdsourcing has opened up a new avenue for utilizing human intelligence in the realm of data analysis. This innovative approach provides a powerful means for connecting online workers to tasks that cannot effectively be done solely by machines or conducted by professional experts due to cost constraints. Within the field of social science, four elements are required to construct a sound crowd: Diversity of Opinion, Independence, Decentralization and Aggregation. However, while the other three components have already been investigated and implemented in existing crowdsourcing platforms, ‘Diversity of Opinion’ has not been functionally enabled yet. From a computational point of view, constructing a wise crowd necessitates quantitatively modeling diversity and taking it into account. There are usually two paradigms in a crowdsourcing marketplace for worker selection: building a crowd to wait for tasks to come and selecting workers for a given task. We propose similarity-driven and task-driven models for both paradigms. Also, we develop efficient and effective algorithms for recruiting a limited number of workers with optimal diversity in both models. To validate our solutions, we conduct extensive experiments using both synthetic and real datasets.
Citations: 0
Efficient and effective algorithms for densest subgraph discovery and maintenance
The VLDB Journal Pub Date : 2024-05-08 DOI: 10.1007/s00778-024-00855-y
Yichen Xu, Chenhao Ma, Yixiang Fang, Zhifeng Bao
Abstract: The densest subgraph problem (DSP) is of great significance due to its wide applications in different domains. Meanwhile, diverse requirements in various applications lead to different density variants for DSP. Unfortunately, existing DSP algorithms cannot be easily extended to handle those variants efficiently and accurately. To fill this gap, we first unify different density metrics into a generalized density definition. We further propose a new model, c-core, to locate the general densest subgraph and show its advantage in accelerating the search process. Extensive experiments show that our c-core-based optimization can provide up to three orders of magnitude speedup over baselines. Methods for maintenance of c-core location are designed to accelerate updates on dynamic graphs. Moreover, we study an important variant of DSP under a size constraint, namely the densest-at-least-k-subgraph (DalkS) problem. We propose an algorithm based on graph decomposition, and it is likely to give a solution that is at least 0.8 of the optimal density in our experiments, while the state-of-the-art method can only ensure a solution with a density of at least 0.5 of the optimal density. Our experiments show that our DalkS algorithm can achieve at least 0.99 of the optimal density for over one-third of all possible size constraints. In addition, we develop an approximation algorithm for the DalkS problem that can be more efficient than the state-of-the-art algorithm while keeping the same approximation ratio of 1/3.
Citations: 0
Lero: applying learning-to-rank in query optimizer
The VLDB Journal Pub Date : 2024-04-25 DOI: 10.1007/s00778-024-00850-3
Xingguang Chen, Rong Zhu, Bolin Ding, Sibo Wang, Jingren Zhou
Citations: 0
Hyper-distance oracles in hypergraphs
The VLDB Journal Pub Date : 2024-04-19 DOI: 10.1007/s00778-024-00851-2
Giulia Preti, Gianmarco De Francisci Morales, Francesco Bonchi
Abstract: We study point-to-point distance estimation in hypergraphs, where the query is parameterized by a positive integer s, which defines the required level of overlap for two hyperedges to be considered adjacent. To answer s-distance queries, we first explore an oracle based on the line graph of the given hypergraph and discuss its limitations: the line graph is typically orders of magnitude larger than the original hypergraph. We then introduce HypED, a landmark-based oracle with a predefined size, built directly on the hypergraph, thus avoiding the materialization of the line graph. Our framework can approximately answer vertex-to-vertex, vertex-to-hyperedge, and hyperedge-to-hyperedge s-distance queries for any value of s. A key observation at the basis of our framework is that as s increases, the hypergraph becomes more fragmented. We show how this can be exploited to improve the placement of landmarks, by identifying the s-connected components of the hypergraph. For this latter task, we devise an efficient algorithm based on the union-find technique and a dynamic inverted index. We experimentally evaluate HypED on several real-world hypergraphs and demonstrate its versatility in answering s-distance queries for different values of s. Our framework answers such queries in fractions of a millisecond while allowing fine-grained control of the trade-off between index size and approximation error at creation time. Finally, we demonstrate the usefulness of the s-distance oracle in two applications, namely hypergraph-based recommendation and the approximation of the s-closeness centrality of vertices and hyperedges in the context of protein-protein interactions.
Citations: 0
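The hypergraph abstract above mentions finding s-connected components with union-find and an inverted index. The following is only a minimal sketch of that general idea, not the HypED algorithm: it assumes two hyperedges are s-adjacent when they share at least s vertices, uses a static (not dynamic) inverted index, and the function name `s_connected_components` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def s_connected_components(hyperedges, s):
    """Group hyperedge indices into s-connected components.

    Assumption: two hyperedges are s-adjacent when they share >= s vertices.
    """
    edges = [set(e) for e in hyperedges]
    parent = list(range(len(edges)))

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Inverted index: vertex -> indices of the hyperedges that contain it.
    index = defaultdict(list)
    for i, edge in enumerate(edges):
        for v in edge:
            index[v].append(i)

    # Count shared vertices for every pair of hyperedges that co-occur.
    overlap = defaultdict(int)
    for containing in index.values():
        for a, b in combinations(containing, 2):
            overlap[(a, b)] += 1

    # Merge the pairs whose overlap reaches the threshold s.
    for (a, b), shared in overlap.items():
        if shared >= s:
            union(a, b)

    groups = defaultdict(list)
    for i in range(len(edges)):
        groups[find(i)].append(i)
    return sorted(groups.values())
```

Running it on a small hypergraph such as `[{1, 2, 3}, {2, 3, 4}, {4, 5}, {5, 6}]` with s = 1, 2, 3 illustrates the fragmentation the abstract describes: the number of components grows as s increases.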
Special issue on “Machine learning and databases”
The VLDB Journal Pub Date : 2024-04-17 DOI: 10.1007/s00778-024-00848-x
Matthias Boehm, Nesime Tatbul
Citations: 0
Data distribution tailoring revisited: cost-efficient integration of representative data
The VLDB Journal Pub Date : 2024-04-12 DOI: 10.1007/s00778-024-00849-w
Jiwon Chang, Bohan Cui, Fatemeh Nargesian, Abolfazl Asudeh, H. V. Jagadish
Abstract: Data scientists often develop data sets for analysis by drawing upon available data sources. A major challenge is ensuring that the data set used for analysis adequately represents relevant demographic groups or other variables. Whether data is obtained from an experiment or a data provider, a single data source may not meet the desired distribution requirements. Therefore, combining data from multiple sources is often necessary. The data distribution tailoring (DT) problem aims to cost-efficiently collect a unified data set from multiple sources. In this paper, we present major optimizations and generalizations to previous algorithms for this problem. In situations when group distributions are known in sources, we present a novel algorithm, RatioColl, that outperforms the existing algorithm, based on the coupon collector’s problem. If distributions are unknown, we propose decaying exploration rate multi-armed-bandit algorithms that, unlike the existing algorithm used for unknown DT, do not require prior information. Through theoretical analysis and extensive experiments, we demonstrate the effectiveness of our proposed algorithms.
Citations: 0
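For readers unfamiliar with the coupon collector's problem referenced in the data distribution tailoring abstract above, here is a small background sketch. It covers only the classic textbook case (n equally likely coupon types, expected draws n · H_n) and does not reproduce RatioColl or any algorithm from the paper; the function name `expected_draws` is hypothetical.

```python
from fractions import Fraction


def expected_draws(n):
    """Expected number of uniform random draws needed to see all n coupon types.

    Classic result: n * H_n, where H_n = 1 + 1/2 + ... + 1/n.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    harmonic = sum(Fraction(1, k) for k in range(1, n + 1))
    return n * harmonic
```

Exact fractions are used so the result carries no floating-point error; wrapping the return value in `float(...)` gives a decimal estimate.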
Stochastic gradient descent without full data shuffle: with applications to in-database machine learning and deep learning systems
The VLDB Journal Pub Date : 2024-04-12 DOI: 10.1007/s00778-024-00845-0
Lijie Xu, Shuang Qiu, Binhang Yuan, Jiawei Jiang, Cedric Renggli, Shaoduo Gan, Kaan Kara, Guoliang Li, Ji Liu, Wentao Wu, Jieping Ye, Ce Zhang
Abstract: Modern machine learning (ML) systems commonly use stochastic gradient descent (SGD) to train ML models. However, SGD relies on random data order to converge, which usually requires a full data shuffle. For in-DB ML systems and deep learning systems with large datasets stored on block-addressable secondary storage such as HDD and SSD, this full data shuffle leads to low I/O performance: the data shuffling time can be even longer than the training itself, due to massive random data accesses. To balance the convergence rate of SGD (which favors data randomness) and its I/O performance (which favors sequential access), previous work has proposed several data shuffling strategies. In this paper, we first perform an empirical study on existing data shuffling strategies, showing that these strategies suffer from either low performance or low convergence rate. To solve this problem, we propose a simple but novel two-level data shuffling strategy named CorgiPile, which can avoid a full data shuffle while maintaining comparable convergence rate of SGD as if a full shuffle were performed. We further theoretically analyze the convergence behavior of CorgiPile and empirically evaluate its efficacy in both in-DB ML and deep learning systems. For in-DB ML systems, we integrate CorgiPile into PostgreSQL by introducing three new physical operators with optimizations. For deep learning systems, we extend single-process CorgiPile to multi-process CorgiPile for the parallel/distributed environment and integrate it into PyTorch. Our evaluation shows that CorgiPile can achieve comparable convergence rate with the full-shuffle-based SGD for both linear models and deep learning models. For in-DB ML with linear models, CorgiPile is 1.6× to 12.8× faster than two state-of-the-art systems, Apache MADlib and Bismarck, on both HDD and SSD. For deep learning models on ImageNet, CorgiPile is 1.5× faster than PyTorch with full data shuffle.
Citations: 0
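The SGD abstract above describes a two-level shuffle that avoids a full data shuffle. The abstract does not give the details, so the following is a sketch of one plausible reading, not the CorgiPile implementation: block order is shuffled first, then tuples are shuffled inside a small in-memory buffer of blocks. The name `two_level_shuffle` and the `buffer_blocks` parameter are assumptions for illustration.

```python
import random


def two_level_shuffle(blocks, buffer_blocks, seed=None):
    """Yield tuples in a partially randomized order without a full shuffle.

    blocks        : list of lists, each inner list standing in for one storage block
    buffer_blocks : how many blocks fit in the in-memory buffer (assumed parameter)
    """
    if buffer_blocks < 1:
        raise ValueError("buffer_blocks must be at least 1")
    rng = random.Random(seed)

    # Level 1: randomize the order in which whole blocks are read.
    order = list(range(len(blocks)))
    rng.shuffle(order)

    for start in range(0, len(order), buffer_blocks):
        buffer = []
        for b in order[start:start + buffer_blocks]:
            buffer.extend(blocks[b])  # each block is still read sequentially
        # Level 2: randomize tuples within the buffered blocks only.
        rng.shuffle(buffer)
        yield from buffer
```

Each tuple is emitted exactly once per pass, and only `buffer_blocks` blocks are held in memory at a time, which is the trade-off between randomness and sequential access that the abstract discusses.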
Hilogx: noise-aware log-based anomaly detection with human feedback
The VLDB Journal Pub Date : 2024-03-28 DOI: 10.1007/s00778-024-00843-2
Abstract: Log-based anomaly detection is essential for maintaining system reliability. Although existing log-based anomaly detection approaches perform well in certain experimental systems, they are ineffective in real-world industrial systems with noisy log data. This paper focuses on mitigating the impact of noisy log data. To this end, we first conduct an empirical study on the system logs of four large-scale industrial software systems. Through the study, we find five typical noise patterns that are the root causes of unsatisfactory results of existing anomaly detection models. Based on the study, we propose HiLogx, a noise-aware log-based anomaly detection approach that integrates human knowledge to identify these noise patterns and further modify the anomaly detection model with human feedback. Experimental results on four large-scale industrial software systems and two open datasets show that our approach achieves improvements of over 30% in precision and 15% in recall on average.
Citations: 0