Proceedings of the 2021 International Conference on Management of Data: Latest Articles, Page 3

Bidirectionally Densifying LSH Sketches with Empty Bins

Proceedings of the 2021 International Conference on Management of Data Pub Date : 2021-06-09 DOI: 10.1145/3448016.3452833

Peng Jia, Pinghui Wang, Junzhou Zhao, Shuo Zhang, Yiyan Qi, Min Hu, Chao Deng, X. Guan

Abstract: As an efficient tool for approximate similarity computation and search, Locality Sensitive Hashing (LSH) has been widely used in many research areas, including databases, data mining, information retrieval, and machine learning. Classical LSH methods typically perform hundreds or even thousands of hashing operations when computing the LSH sketch for each input item (e.g., a set or a vector); this cost is too high, and even impractical, for applications that must process data in real time. To address this issue, several fast methods such as OPH and BCWS have been proposed to compute LSH sketches efficiently; however, these methods may generate many sketches with empty bins, which can introduce large errors in similarity estimation and limit their use for fast similarity search. To solve this issue, we propose a novel densification method, BiDens. Compared with existing densification methods, BiDens more efficiently fills a sketch's empty bins with values of its non-empty bins in either the forward or the backward direction. Furthermore, it densifies empty bins in a way that satisfies the densification principle (i.e., the LSH property). Theoretical analysis and experimental results on similarity estimation, fast similarity search, and kernel linearization using real-world datasets demonstrate that BiDens is up to 106 times faster than state-of-the-art methods while achieving the same or even better accuracy.
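To make the empty-bin problem concrete, the following is a minimal Python sketch, not the BiDens algorithm itself. It builds a one-permutation-style sketch (one hash per item, minimum value kept per bin) and then fills each empty bin from the nearest non-empty bin, walking forward or backward according to a bit derived from the bin index. The function names `oph_sketch`, `densify`, and `_hash`, the use of BLAKE2b as the hash, and the direction rule are all assumptions chosen for illustration; the abstract does not specify how BiDens selects values or directions.

```python
import hashlib

EMPTY = None  # marker for a bin that received no item


def _hash(x, seed):
    """Deterministic 64-bit hash of (seed, x); hypothetical helper."""
    data = f"{seed}:{x}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def oph_sketch(items, k, seed=0):
    """One hash per item: the hash picks a bin, and each bin keeps its minimum."""
    sketch = [EMPTY] * k
    for x in items:
        h = _hash(x, seed)
        bin_idx, value = h % k, h // k
        if sketch[bin_idx] is EMPTY or value < sketch[bin_idx]:
            sketch[bin_idx] = value
    return sketch


def densify(sketch, seed=0):
    """Fill every empty bin from the nearest non-empty bin in one direction.

    The direction (forward or backward, circularly) depends only on the bin
    index and the seed, so two sketches are always filled consistently.
    This is an illustrative stand-in, not the rule used by BiDens.
    """
    k = len(sketch)
    if all(v is EMPTY for v in sketch):
        return list(sketch)  # nothing to borrow from
    out = list(sketch)
    for i in range(k):
        if sketch[i] is not EMPTY:
            continue
        step = 1 if _hash(i, seed + 1) % 2 == 0 else -1
        j = i
        while sketch[j] is EMPTY:  # terminates: some bin is non-empty
            j = (j + step) % k
        out[i] = sketch[j]
    return out


def estimate_similarity(a, b):
    """Fraction of bins on which two densified sketches agree."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)
```

With a small input set and many bins, most bins of `oph_sketch` stay empty; after `densify`, every bin holds a value, and the non-empty bins are left unchanged. Whether a fill rule of this kind preserves the LSH property is exactly the question the densification principle addresses, and this sketch makes no such claim.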

Citations: 8

Deep Data Integration

Proceedings of the 2021 International Conference on Management of Data Pub Date : 2021-06-09 DOI: 10.1145/3448016.3460534

W. Tan

Citations: 3

EquiTensors

Proceedings of the 2021 International Conference on Management of Data Pub Date : 2021-06-09 DOI: 10.1145/3448016.3452777

A. Yan, Bill Howe

Abstract: Neural methods are state-of-the-art for urban prediction problems such as transportation resource demand, accident risk, crowd mobility, and public safety. Model performance can be improved by integrating exogenous features from open data repositories (e.g., weather, housing prices, and traffic), but these uncurated sources are often too noisy, incomplete, and biased to use directly. We propose to learn integrated representations, called EquiTensors, from heterogeneous datasets that can be reused across a variety of tasks. We align datasets to a consistent spatio-temporal domain, then describe an unsupervised model based on convolutional denoising autoencoders to learn shared representations. We extend this core integrative model with adaptive weighting to prevent certain datasets from dominating the signal. To combat discriminatory bias, we use adversarial learning to remove correlations with a sensitive attribute (e.g., race or income). Experiments with 23 input datasets and 4 real applications show that EquiTensors could help mitigate the effects of the sensitive information embodied in the biased data. Meanwhile, applications using EquiTensors outperform models that ignore exogenous features and are competitive with "oracle" models that use hand-selected datasets.

Citations: 2

Cohesive Subgraph Search over Big Heterogeneous Information Networks: Applications, Challenges, and Solutions

Proceedings of the 2021 International Conference on Management of Data Pub Date : 2021-06-09 DOI: 10.1145/3448016.3457538

Yixiang Fang, Kai Wang, Xuemin Lin, Wenjie Zhang

Abstract: With the advent of a wide spectrum of recent applications, querying heterogeneous information networks (HINs) has received a great deal of attention from both the academic and industrial communities. HINs involve objects (vertices) and links (edges) that are classified into multiple types; examples include bibliography networks, knowledge networks, and user-item networks in e-business. An important component of these HINs is the cohesive subgraph, that is, a subgraph containing vertices that are densely connected internally. Searching for cohesive subgraphs over HINs has found many real applications, such as community search, product recommendation, and fraud detection. Consequently, how to design effective cohesive subgraph models and how to search cohesive subgraphs efficiently on large HINs have become important research topics in the era of big data. In this tutorial, we first highlight the importance of cohesive subgraph search over HINs in various applications and the unique challenges that need to be addressed. Subsequently, we conduct a thorough review of existing work on cohesive subgraph search over HINs. Then, we analyze and compare the models and solutions in this work. Finally, we point out new research directions. We believe that this tutorial not only helps researchers better understand existing cohesive subgraph search models and solutions, but also provides them with insights for future study.

Citations: 21

Interactive Search for One of the Top-k

Proceedings of the 2021 International Conference on Management of Data Pub Date : 2021-06-09 DOI: 10.1145/3448016.3457322

Weicheng Wang, R. C. Wong, Min Xie

Abstract: When a large dataset is given, it is not desirable for a user to read all tuples in the dataset one by one to find satisfactory tuples. The traditional top-k query finds the best k tuples (i.e., the top-k tuples) w.r.t. the user's preference. However, in practice, it is difficult for a user to specify his/her preference explicitly. We study how to enhance the top-k query with user interaction. Specifically, we ask a user several questions, each of which consists of two tuples and asks the user to indicate which one s/he prefers. Based on the feedback, the user's preference is learned implicitly and one of the top-k tuples w.r.t. the learned preference is returned. Here, instead of directly following the top-k query and returning all the top-k tuples, which requires heavy user effort during the interaction (e.g., answering many questions), we reduce the output size to strike a trade-off between the user effort and the output size. To achieve this, we present an algorithm 2D-PI, which asks an asymptotically optimal number of questions in a 2-dimensional space, and two algorithms HD-PI and RH with provable performance guarantees in a d-dimensional space (d >= 2), which focus on the number of questions asked and the execution time, respectively. Experiments were conducted on synthetic and real datasets, showing that our algorithms outperform existing ones by asking fewer questions in less time to return satisfactory tuples.

Citations: 8

iTurboGraph

Proceedings of the 2021 International Conference on Management of Data Pub Date : 2021-06-09 DOI: 10.1145/3448016.3457243

Seongyun Ko, Taesung Lee, Kijae Hong, Wonseok Lee, In Seo, Jiwon Seo, Wook-Shin Han

Abstract: With the rise of streaming data for dynamic graphs, large-scale graph analytics meets a new requirement of incremental computation, because the larger the graph, the higher the cost of updating the analytics results by re-execution. A dynamic graph consists of an initial graph $G$ and graph mutation updates $\Delta G$ of edge insertions or deletions. Given a query $Q$, its results $Q(G)$, and updates $\Delta G$ to $G$, incremental graph analytics computes updates $\Delta Q$ such that $Q(G \cup \Delta G) = Q(G) \cup \Delta Q$, where $\cup$ is a union operator. In this paper, we consider the problem of large-scale incremental neighbor-centric graph analytics (NGA). We address the limitations of previous systems: lack of usability, due to the difficulty of programming incremental algorithms for NGA, and limited scalability and efficiency, due to the overheads of maintaining intermediate results for graph traversals in NGA. First, we propose a domain-specific language, $L_{NGA}$, and develop its compiler for intuitive programming of NGA, automatic query incrementalization, and query optimizations. Second, we define Graph Streaming Algebra as a theoretical foundation for scalable processing of incremental NGA. We introduce the concept of Nested Graph Windows and model graph traversals as the generation of walk streams. Lastly, we present a system, iTurboGraph, which efficiently processes incremental NGA for large graphs. Comprehensive experiments show that it effectively avoids costly re-executions and efficiently updates the analytics results with reduced I/O and computation.
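The incremental identity stated in the abstract, that the result on the updated graph equals the old result united with a delta, can be illustrated with a toy neighbor-centric query. The Python sketch below is not iTurboGraph's engine or its language; it assumes a directed graph, insert-only updates, and a hypothetical query that returns all two-edge walks. The names `two_hop_walks` and `delta_two_hop_walks` are invented for this example.

```python
from collections import defaultdict


def two_hop_walks(edges):
    """Q(G): all walks (u, v, w) such that u->v and v->w are edges of G."""
    out = defaultdict(set)
    for u, v in edges:
        out[u].add(v)
    return {(u, v, w) for u, v in edges for w in out[v]}


def delta_two_hop_walks(old_edges, new_edges):
    """Delta Q for edge insertions: walks that use at least one new edge.

    Only the neighborhoods of the inserted edges are visited, so the walks
    already in Q(G) are not recomputed.
    """
    all_edges = set(old_edges) | set(new_edges)
    out, inc = defaultdict(set), defaultdict(set)
    for u, v in all_edges:
        out[u].add(v)
        inc[v].add(u)
    delta = set()
    for a, b in new_edges:
        for w in out[b]:   # new edge used as the first hop
            delta.add((a, b, w))
        for u in inc[a]:   # new edge used as the second hop
            delta.add((u, a, b))
    return delta


old = {(1, 2), (2, 3)}
new = {(3, 1), (2, 4)}
# Full re-execution and incremental update agree on the final result.
assert two_hop_walks(old | new) == two_hop_walks(old) | delta_two_hop_walks(old, new)
```

Deletions are deliberately left out: removing an edge can invalidate existing results, so a plain set union no longer suffices, which is part of what makes a general incremental system harder than this sketch.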

Citations: 5

DFI: The Data Flow Interface for High-Speed Networks

Proceedings of the 2021 International Conference on Management of Data Pub Date : 2021-06-09 DOI: 10.1145/3448016.3452816

Lasse Thostrup, Jan Skrzypczak, Matthias Jasny, Tobias Ziegler, Carsten Binnig

Citations: 17

One WITH RECURSIVE is Worth Many GOTOs

Proceedings of the 2021 International Conference on Management of Data Pub Date : 2021-06-09 DOI: 10.1145/3448016.3457272

Denis Hirn, Torsten Grust

Citations: 15

Data Summarization with Hierarchical Taxonomy

Proceedings of the 2021 International Conference on Management of Data Pub Date : 2021-06-09 DOI: 10.1145/3448016.3450578

Xuliang Zhu

Citations: 0

Multiple Dynamic Outlier-Detection from a Data Stream by Exploiting Duality of Data and Queries

Proceedings of the 2021 International Conference on Management of Data Pub Date : 2021-06-09 DOI: 10.1145/3448016.3452810

Susik Yoon, Yooju Shin, Jae-Gil Lee, B. Lee

Abstract: Real-time outlier detection from a data stream has become increasingly important in the current hyperconnected world. This paper focuses on an important yet unaddressed challenge in continuous outlier detection: the multiplicity and dynamicity of queries. This challenge arises from various contexts of outliers evolving over time, but the state-of-the-art algorithms cannot handle the challenge effectively, as they can only process a fixed set of outlier detection queries for each data point separately. In this paper, we propose a novel algorithm, abbreviated as MDUAL, based on a new idea called duality-based unified processing. The underlying rationale is to exploit the duality of data and queries so that a group of similar data points is processed together by a group of similar queries incrementally. Two main techniques embodying the idea, data-query grouping and prioritized group processing, are employed. Comprehensive experiments showed that MDUAL runs 216 to 221 times faster while consuming 11 to 13 times less memory than the state-of-the-art algorithms through its efficient and effective handling of the multiplicity-dynamicity challenge.

Citations: 10