{"title":"Bidirectionally Densifying LSH Sketches with Empty Bins","authors":"Peng Jia, Pinghui Wang, Junzhou Zhao, Shuo Zhang, Yiyan Qi, Min Hu, Chao Deng, X. Guan","doi":"10.1145/3448016.3452833","DOIUrl":"https://doi.org/10.1145/3448016.3452833","url":null,"abstract":"As an efficient tool for approximate similarity computation and search, Locality Sensitive Hashing (LSH) has been widely used in many research areas including databases, data mining, information retrieval, and machine learning. Classical LSH methods typically require hundreds or even thousands of hashing operations to compute the LSH sketch of each input item (e.g., a set or a vector); this cost is too high, and even impractical, for applications that must process data in real time. To address this issue, several fast methods such as OPH and BCWS have been proposed to efficiently compute the LSH sketches; however, these methods may generate many sketches with empty bins, which may introduce large errors for similarity estimation and also limit their usage for fast similarity search. To solve this issue, we propose a novel densification method, BiDens. Compared with existing densification methods, BiDens more efficiently fills a sketch's empty bins with values of its non-empty bins in either the forward or backward direction. Furthermore, it densifies empty bins so as to satisfy the densification principle (i.e., the LSH property). 
Theoretical analysis and experimental results on similarity estimation, fast similarity search, and kernel linearization using real-world datasets demonstrate that our BiDens is up to 106 times faster than state-of-the-art methods while achieving the same or even better accuracy.","PeriodicalId":360379,"journal":{"name":"Proceedings of the 2021 International Conference on Management of Data","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132642162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep Data Integration","authors":"W. Tan","doi":"10.1145/3448016.3460534","DOIUrl":"https://doi.org/10.1145/3448016.3460534","url":null,"abstract":"We are witnessing the widespread adoption of deep learning techniques as avant-garde solutions to different computational problems in recent years. In data integration, the use of deep learning techniques has helped establish several state-of-the-art results in long standing problems, including information extraction, entity matching, data cleaning, and table understanding. In this talk, I will reflect on the strengths of deep learning and how that has helped move the needle in data integration. I will also discuss a few challenges associated with solutions based on deep learning techniques and describe some opportunities for the data management community.","PeriodicalId":360379,"journal":{"name":"Proceedings of the 2021 International Conference on Management of Data","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132830195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EquiTensors","authors":"A. Yan, Bill Howe","doi":"10.1145/3448016.3452777","DOIUrl":"https://doi.org/10.1145/3448016.3452777","url":null,"abstract":"Neural methods are state-of-the-art for urban prediction problems such as transportation resource demand, accident risk, crowd mobility, and public safety. Model performance can be improved by integrating exogenous features from open data repositories (e.g., weather, housing prices, traffic, etc.), but these uncurated sources are often too noisy, incomplete, and biased to use directly. We propose to learn integrated representations, called EquiTensors, from heterogeneous datasets that can be reused across a variety of tasks. We align datasets to a consistent spatio-temporal domain, then describe an unsupervised model based on convolutional denoising autoencoders to learn shared representations. We extend this core integrative model with adaptive weighting to prevent certain datasets from dominating the signal. To combat discriminatory bias, we use adversarial learning to remove correlations with a sensitive attribute (e.g., race or income). Experiments with 23 input datasets and 4 real applications show that EquiTensors could help mitigate the effects of the sensitive information embodied in the biased data. 
Meanwhile, applications using EquiTensors outperform models that ignore exogenous features and are competitive with \"oracle\" models that use hand-selected datasets.","PeriodicalId":360379,"journal":{"name":"Proceedings of the 2021 International Conference on Management of Data","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124568292","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cohesive Subgraph Search over Big Heterogeneous Information Networks: Applications, Challenges, and Solutions","authors":"Yixiang Fang, Kai Wang, Xuemin Lin, Wenjie Zhang","doi":"10.1145/3448016.3457538","DOIUrl":"https://doi.org/10.1145/3448016.3457538","url":null,"abstract":"With the advent of a wide spectrum of recent applications, querying heterogeneous information networks (HINs) has received a great deal of attention from both the academic and industrial communities. HINs involve objects (vertices) and links (edges) that are classified into multiple types; examples include bibliography networks, knowledge networks, and user-item networks in E-business. An important component of these HINs is the cohesive subgraph, i.e., a subgraph containing vertices that are densely connected internally. Searching cohesive subgraphs over HINs has found many real applications, such as community search, product recommendation, and fraud detection. Consequently, how to design effective cohesive subgraph models and how to efficiently search cohesive subgraphs on large HINs have become important research topics in the era of big data. In this tutorial, we first highlight the importance of cohesive subgraph search over HINs in various applications and the unique challenges that need to be addressed. Subsequently, we conduct a thorough review of existing work on cohesive subgraph search over HINs. Then, we analyze and compare the models and solutions in these works. Finally, we point out new research directions. 
We believe that this tutorial not only helps researchers to have a better understanding of existing cohesive subgraph search models and solutions, but also provides them insights for future study.","PeriodicalId":360379,"journal":{"name":"Proceedings of the 2021 International Conference on Management of Data","volume":"132 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114666020","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Interactive Search for One of the Top-k","authors":"Weicheng Wang, R. C. Wong, Min Xie","doi":"10.1145/3448016.3457322","DOIUrl":"https://doi.org/10.1145/3448016.3457322","url":null,"abstract":"When a large dataset is given, it is not desirable for a user to read all tuples one-by-one in the whole dataset to find satisfied tuples. The traditional top-k query finds the best k tuples (i.e., the top-k tuples) w.r.t. the user's preference. However, in practice, it is difficult for a user to specify his/her preference explicitly. We study how to enhance the top-k query with user interaction. Specifically, we ask a user several questions, each of which consists of two tuples and asks the user to indicate which one s/he prefers. Based on the feedback, the user's preference is learned implicitly and one of the top-k tuples w.r.t. the learned preference is returned. Because returning all the top-k tuples, as the traditional top-k query does, requires heavy user effort during the interaction (e.g., answering many questions), we reduce the output size to strike a trade-off between the user effort and the output size. To achieve this, we present an algorithm 2D-PI, which asks an asymptotically optimal number of questions in a 2-dimensional space, and two algorithms HD-PI and RH with provable performance guarantees in a d-dimensional space (d >= 2), which focus on the number of questions asked and the execution time, respectively. 
Experiments were conducted on synthetic and real datasets, showing that our algorithms outperform existing ones by asking fewer questions within less time to return satisfied tuples.","PeriodicalId":360379,"journal":{"name":"Proceedings of the 2021 International Conference on Management of Data","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125743855","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"iTurboGraph","authors":"Seongyun Ko, Taesung Lee, Kijae Hong, Wonseok Lee, In Seo, Jiwon Seo, Wook-Shin Han","doi":"10.1145/3448016.3457243","DOIUrl":"https://doi.org/10.1145/3448016.3457243","url":null,"abstract":"With the rise of streaming data for dynamic graphs, large-scale graph analytics meets a new requirement of Incremental Computation because the larger the graph, the higher the cost for updating the analytics results by re-execution. A dynamic graph consists of an initial graph G and graph mutation updates ΔG of edge insertions or deletions. Given a query Q, its results Q(G), and updates ΔG to G, incremental graph analytics computes updates ΔQ such that Q(G ∪ ΔG) = Q(G) ∪ ΔQ, where ∪ is a union operator. In this paper, we consider the problem of large-scale incremental neighbor-centric graph analytics (NGA). We address the limitations of previous systems: lack of usability due to the difficulties in programming incremental algorithms for NGA, and limited scalability and efficiency due to the overheads in maintaining intermediate results for graph traversals in NGA. First, we propose a domain-specific language, L_NGA, and develop its compiler for intuitive programming of NGA, automatic query incrementalization, and query optimizations. Second, we define Graph Streaming Algebra as a theoretical foundation for scalable processing of incremental NGA. We introduce a concept of Nested Graph Windows and model graph traversals as the generation of walk streams. Lastly, we present a system, iTurboGraph, which efficiently processes incremental NGA for large graphs. 
Comprehensive experiments show that it effectively avoids costly re-executions and efficiently updates the analytics results with reduced IO and computations.","PeriodicalId":360379,"journal":{"name":"Proceedings of the 2021 International Conference on Management of Data","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126139364","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DFI: The Data Flow Interface for High-Speed Networks","authors":"Lasse Thostrup, Jan Skrzypczak, Matthias Jasny, Tobias Ziegler, Carsten Binnig","doi":"10.1145/3448016.3452816","DOIUrl":"https://doi.org/10.1145/3448016.3452816","url":null,"abstract":"In this paper, we propose the Data Flow Interface (DFI) as a way to make it easier for data processing systems to exploit high-speed networks without the need to deal with the complexity of RDMA. By lifting the level of abstraction, DFI factors out much of the complexity of network communication and makes it easier for developers to declaratively express how data should be efficiently routed to accomplish a given distributed data processing task. As we show in our experiments, DFI is able to support a wide variety of data-centric applications with high performance at a low complexity for the applications.","PeriodicalId":360379,"journal":{"name":"Proceedings of the 2021 International Conference on Management of Data","volume":"170 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129525857","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"One WITH RECURSIVE is Worth Many GOTOs","authors":"Denis Hirn, Torsten Grust","doi":"10.1145/3448016.3457272","DOIUrl":"https://doi.org/10.1145/3448016.3457272","url":null,"abstract":"PL/SQL integrates an imperative statement-by-statement style of programming with the plan-based evaluation of SQL queries. The disparity of both leads to friction at runtime, slowing PL/SQL execution down significantly. This work describes a compiler from PL/SQL UDFs to plain SQL queries. Post-compilation, evaluation entirely happens on the SQL side of the fence. With the friction gone, we observe execution times to improve by about a factor of 2, even for complex UDFs. The compiler builds on techniques long established by the programming language community. In particular, it uses trampolined style to compile arbitrarily nested iterative control flow in PL/SQL into SQL's recursive common table expressions.","PeriodicalId":360379,"journal":{"name":"Proceedings of the 2021 International Conference on Management of Data","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129727656","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data Summarization with Hierarchical Taxonomy","authors":"Xuliang Zhu","doi":"10.1145/3448016.3450578","DOIUrl":"https://doi.org/10.1145/3448016.3450578","url":null,"abstract":"Data summarization has wide applications in the real world, e.g., attribute filtering, image set labeling, and personalized recommendation. In this work, we study a new problem, HSD, which is to summarize a dataset using k concepts in a hierarchical taxonomy. Unlike existing work on whole-hierarchy summarization, we focus on the accurate coverage of the given query set Q. The objective is to cover more items in Q and fewer items not in Q. To tackle it, we first propose a dynamic programming based algorithm on the tree hierarchy, which is a simple instance of the HSD problem. Furthermore, for HDAGs, we propose a heuristic method that assigns each vertex to one of its in-neighbors and then applies the tree algorithm to the result. The experimental results confirm the quality of our methods on both tree and HDAG datasets.","PeriodicalId":360379,"journal":{"name":"Proceedings of the 2021 International Conference on Management of Data","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129869508","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multiple Dynamic Outlier-Detection from a Data Stream by Exploiting Duality of Data and Queries","authors":"Susik Yoon, Yooju Shin, Jae-Gil Lee, B. Lee","doi":"10.1145/3448016.3452810","DOIUrl":"https://doi.org/10.1145/3448016.3452810","url":null,"abstract":"Real-time outlier detection from a data stream has become increasingly important in the current hyperconnected world. This paper focuses on an important yet unaddressed challenge in continuous outlier detection: the multiplicity and dynamicity of queries. This challenge arises from various contexts of outliers evolving over time, but the state-of-the-art algorithms cannot handle the challenge effectively, as they can only process a fixed set of outlier detection queries for each data point separately. In this paper, we propose a novel algorithm, abbreviated as MDUAL, based on a new idea called duality-based unified processing. The underlying rationale is to exploit the duality of data and queries so that a group of similar data points are processed together by a group of similar queries incrementally. Two main techniques embodying the idea, data-query grouping and prioritized group processing, are employed. 
Comprehensive experiments showed that MDUAL runs 216 to 221 times faster while consuming 11 to 13 times less memory than the state-of-the-art algorithms through its efficient and effective handling of the multiplicity-dynamicity challenge.","PeriodicalId":360379,"journal":{"name":"Proceedings of the 2021 International Conference on Management of Data","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128328359","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}