{"title":"Exploring Finer Granularity within the Cores: Efficient (k,p)-Core Computation","authors":"Chen Zhang, Fan Zhang, W. Zhang, Boge Liu, Ying Zhang, Lu Qin, Xuemin Lin","doi":"10.1109/ICDE48307.2020.00023","DOIUrl":"https://doi.org/10.1109/ICDE48307.2020.00023","url":null,"abstract":"In this paper, we propose and study a novel cohesive subgraph model, named (k,p)-core, which is a maximal subgraph where each vertex has at least k neighbors and at least a fraction p of its neighbors in the subgraph. The model is motivated by the finding that each user in a community should have at least a certain fraction p of neighbors inside the community to ensure user engagement, especially for users with large degrees. Meanwhile, the uniform degree constraint k, as applied in the k-core model, guarantees a minimum level of user engagement in a community, and is especially effective for users with small degrees. We propose an O(m) algorithm to compute a (k,p)-core with given k and p, and an O(dm) algorithm to decompose a graph by (k,p)-cores, where m is the number of edges in the graph G and d is the degeneracy of G. A space-efficient index is designed for time-optimal (k,p)-core query processing. Novel techniques are proposed for the maintenance of the (k,p)-core index under graph dynamics. 
Extensive experiments on eight real-life datasets demonstrate that our (k,p)-core model is effective and the algorithms are efficient.","PeriodicalId":6709,"journal":{"name":"2020 IEEE 36th International Conference on Data Engineering (ICDE)","volume":"14 1","pages":"181-192"},"PeriodicalIF":0.0,"publicationDate":"2020-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77833439","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
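The O(m) computation admits a simple peeling reading: repeatedly delete any vertex whose remaining degree drops below k or below a p fraction of its original degree, and cascade deletions to its neighbors. Below is a minimal sketch of that reading (function and variable names are ours, not the paper's; the paper's index-based and decomposition algorithms are more involved):

```python
from collections import deque

def kp_core(adj, k, p):
    """Peeling sketch for the (k,p)-core: iteratively delete vertices
    with fewer than k surviving neighbors, or with less than a p
    fraction of their original neighbors surviving."""
    alive = set(adj)
    deg0 = {v: len(adj[v]) for v in adj}   # original degrees
    deg = dict(deg0)                       # degrees in the surviving subgraph

    def violates(v):
        return deg[v] < k or deg[v] < p * deg0[v]

    queue = deque(v for v in alive if violates(v))
    while queue:
        v = queue.popleft()
        if v not in alive:
            continue
        alive.remove(v)
        for u in adj[v]:                   # cascade to neighbors
            if u in alive:
                deg[u] -= 1
                if violates(u):
                    queue.append(u)
    return alive
```

On a triangle {0,1,2} with a pendant vertex 3 attached to 0, `kp_core(adj, 2, 0.5)` peels vertex 3 and keeps the triangle.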
{"title":"ColumnSGD: A Column-oriented Framework for Distributed Stochastic Gradient Descent","authors":"Zhipeng Zhang, Wentao Wu, Jiawei Jiang, Lele Yu, B. Cui, Ce Zhang","doi":"10.1109/ICDE48307.2020.00134","DOIUrl":"https://doi.org/10.1109/ICDE48307.2020.00134","url":null,"abstract":"Distributed machine learning (ML) has triggered tremendous research interest in recent years. Stochastic gradient descent (SGD) is one of the most popular algorithms for training ML models, and has been implemented in almost all distributed ML systems, such as Spark MLlib, Petuum, MXNet, and TensorFlow. However, current implementations often incur huge communication and memory overheads when it comes to large models. One important reason for this inefficiency is the row-oriented scheme (RowSGD) that existing systems use to partition the training data, which forces them to adopt a centralized model management strategy that leads to a vast amount of data exchange over the network. We propose a novel, column-oriented scheme (ColumnSGD) that partitions training data by columns rather than by rows. As a result, the ML model can be partitioned by columns as well, leading to a distributed configuration where individual data and model partitions can be collocated on the same machine. Following this locality property, we develop a simple yet powerful computation framework that significantly reduces communication overheads and memory footprints compared to RowSGD, for large-scale ML models such as generalized linear models (GLMs) and factorization machines (FMs). We implement ColumnSGD on top of Apache Spark, and study its performance both analytically and experimentally. 
Experimental results on both public and real-world datasets show that ColumnSGD is up to 930x faster than MLlib, 63x faster than Petuum, and 14x faster than MXNet.","PeriodicalId":6709,"journal":{"name":"2020 IEEE 36th International Conference on Data Engineering (ICDE)","volume":"15 1","pages":"1513-1524"},"PeriodicalIF":0.0,"publicationDate":"2020-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78915073","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
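The column-oriented layout can be illustrated with a toy, single-process simulation: data and model are sharded by feature columns, and the only cross-shard traffic per example is a scalar partial dot product. This sketch assumes a logistic-regression objective and round-robin column assignment (our choices, not necessarily the paper's); the real system runs shards on separate Spark executors:

```python
import math

def column_sgd(X, y, n_workers=2, lr=0.1, epochs=50):
    """Toy simulation of column-partitioned SGD for logistic regression:
    each 'worker' holds a disjoint slice of the feature columns and the
    matching model slice, exchanges only scalar partial dot products,
    and updates its own slice locally."""
    n, d = len(X), len(X[0])
    cols = [list(range(w, d, n_workers)) for w in range(n_workers)]  # column shards
    w_parts = [{j: 0.0 for j in part} for part in cols]              # model shards
    for _ in range(epochs):
        for i in range(n):
            # each worker computes a partial dot product over its columns
            partials = [sum(w_parts[wk][j] * X[i][j] for j in cols[wk])
                        for wk in range(n_workers)]
            pred = 1.0 / (1.0 + math.exp(-sum(partials)))  # only scalars aggregated
            g = pred - y[i]                                # logistic-loss gradient
            for wk in range(n_workers):                    # purely local updates
                for j in cols[wk]:
                    w_parts[wk][j] -= lr * g * X[i][j]
    return w_parts
```

On a trivially separable two-point dataset, the merged shards recover a separating direction.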
{"title":"Stochastic Origin-Destination Matrix Forecasting Using Dual-Stage Graph Convolutional, Recurrent Neural Networks","authors":"Jilin Hu, B. Yang, Chenjuan Guo, Christian S. Jensen, Hui Xiong","doi":"10.1109/ICDE48307.2020.00126","DOIUrl":"https://doi.org/10.1109/ICDE48307.2020.00126","url":null,"abstract":"Origin-destination (OD) matrices are used widely in transportation and logistics to record the travel cost (e.g., travel speed or greenhouse gas emission) between pairs of OD regions during different intervals within a day. We model a travel cost as a distribution because when traveling between a pair of OD regions, different vehicles may travel at different speeds even during the same interval, e.g., due to different driving styles or different waiting times at intersections. This yields stochastic OD matrices. We consider an increasingly pertinent setting where a set of vehicle trips is used for instantiating OD matrices. Since the trips may not cover all OD pairs for each interval, the resulting OD matrices are likely to be sparse. We then address the problem of forecasting complete, near-future OD matrices from sparse, historical OD matrices. To solve this problem, we propose a generic learning framework that (i) employs matrix factorization and graph convolutional neural networks to contend with the data sparseness while capturing spatial correlations and that (ii) captures spatio-temporal dynamics via recurrent neural networks extended with graph convolutions. 
Empirical studies using two taxi trajectory data sets offer detailed insight into the properties of the framework and indicate that it is effective.","PeriodicalId":6709,"journal":{"name":"2020 IEEE 36th International Conference on Data Engineering (ICDE)","volume":"20 1","pages":"1417-1428"},"PeriodicalIF":0.0,"publicationDate":"2020-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77863087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
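The sparseness-handling ingredient (i) can be illustrated in isolation: factor a sparse OD matrix over its observed entries only, then use the factors to fill the gaps. The sketch omits the graph-convolutional and recurrent components entirely, and all names and hyperparameters are illustrative, not the paper's:

```python
import random

def complete_od(matrix, rank=1, lr=0.05, epochs=4000, seed=0):
    """Fill missing (None) entries of a small OD matrix by gradient-descent
    matrix factorization over the observed entries: fit matrix ~ U @ V^T,
    then read predictions for every cell from U @ V^T."""
    rng = random.Random(seed)
    n, m = len(matrix), len(matrix[0])
    U = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(m)]
    obs = [(i, j, matrix[i][j]) for i in range(n) for j in range(m)
           if matrix[i][j] is not None]
    for _ in range(epochs):
        for i, j, x in obs:
            err = sum(U[i][r] * V[j][r] for r in range(rank)) - x
            for r in range(rank):
                ui, vj = U[i][r], V[j][r]   # use pre-update values for both
                U[i][r] -= lr * err * vj
                V[j][r] -= lr * err * ui
    return [[sum(U[i][r] * V[j][r] for r in range(rank)) for j in range(m)]
            for i in range(n)]
```

For the rank-1-consistent matrix [[1, 2], [2, ?]] the factors reconstruct the missing entry near 4.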
{"title":"An Anomaly Detection System for the Protection of Relational Database Systems against Data Leakage by Application Programs","authors":"Daren Fadolalkarim, E. Bertino, Asmaa Sallam","doi":"10.1109/ICDE48307.2020.00030","DOIUrl":"https://doi.org/10.1109/ICDE48307.2020.00030","url":null,"abstract":"Application programs are a possible source of attacks to databases as attackers might exploit vulnerabilities in a privileged database application. They can perform code-injection or code-reuse attacks in order to steal sensitive data. However, as such attacks very often result in changes in the program’s behavior, program monitoring techniques represent an effective defense to detect on-going attacks. One such technique is monitoring the library/system calls that the application program issues while running. In this paper, we propose AD-PROM, an Anomaly Detection system that protects relational database systems against malicious or compromised application PROgraMs that aim to steal data. AD-PROM tracks calls executed by application programs on data extracted from a database. The system operates in two phases. The first phase statically and dynamically analyzes the behavior of the application in order to build profiles representing the application’s normal behavior. AD-PROM analyzes the control and data flow of the application program (i.e., static analysis), and builds a hidden Markov model trained by the program traces (i.e., dynamic analysis). During the second phase, the program execution is monitored in order to detect anomalies that may represent data leakage attempts. We have implemented AD-PROM and carried out experiments to assess its performance. 
The results showed that our system is highly accurate in detecting changes in the application programs’ behaviors and has very low false positive rates.","PeriodicalId":6709,"journal":{"name":"2020 IEEE 36th International Conference on Data Engineering (ICDE)","volume":"87 20 1","pages":"265-276"},"PeriodicalIF":0.0,"publicationDate":"2020-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84043435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
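The two-phase profile-then-monitor workflow can be sketched with a deliberately simplified profile: a set of observed call-to-call transitions rather than AD-PROM's trained hidden Markov model (the HMM scores whole traces probabilistically; this stand-in only flags never-seen transitions):

```python
def build_profile(traces):
    """Phase 1 (simplified): record every call-to-call transition
    observed in normal runs of the application."""
    profile = set()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            profile.add((a, b))
    return profile

def is_anomalous(trace, profile):
    """Phase 2: flag a monitored run whose trace contains a transition
    never seen while profiling -- a possible code-injection/reuse symptom."""
    return any((a, b) not in profile for a, b in zip(trace, trace[1:]))
```

A run that issues a `write` where profiling only ever saw `read` after `open` would be flagged.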
{"title":"JODA: A Vertically Scalable, Lightweight JSON Processor for Big Data Transformations","authors":"Nico Schäfer, S. Michel","doi":"10.1109/ICDE48307.2020.00155","DOIUrl":"https://doi.org/10.1109/ICDE48307.2020.00155","url":null,"abstract":"We describe the demonstration of JODA (Json On Demand Analytics), an approach to handling large amounts of JSON documents in a vertically scalable manner. With JODA, the user can import, filter, transform, aggregate, group, and export documents with a simple PIG-style query language, offering fast execution speed. This is achieved by utilizing a multithreaded architecture over disjoint, read-only containers of data that are processed in parallel, similar to what RDDs are to Spark. Containers are augmented with auxiliary information like Bloom filters and adaptive indices, and all containers are processed in parallel by individual threads. By avoiding locks, latches, and synchronization beyond simple thread pooling, we do not risk contention and therefore maximize resource utilization. The demonstration scenarios aim at engaging visitors with several data analytics tasks around large, real-world datasets that are to be solved with the help of JODA, and further give insights into system internals and the installation/configuration process.","PeriodicalId":6709,"journal":{"name":"2020 IEEE 36th International Conference on Data Engineering (ICDE)","volume":"8 1","pages":"1726-1729"},"PeriodicalIF":0.0,"publicationDate":"2020-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89918367","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
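The lock-free container model can be illustrated with a toy filter: split the documents into disjoint, read-only containers and process each in its own thread, concatenating results afterwards. Bloom filters, adaptive indices, and JODA's query language are omitted; this only shows the shared-nothing threading pattern:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_filter(docs, predicate, n_containers=4):
    """Split docs into disjoint read-only containers and filter each in
    its own thread; no locks are needed because containers never share
    mutable state and results are merged only at the end."""
    containers = [docs[i::n_containers] for i in range(n_containers)]
    with ThreadPoolExecutor(max_workers=n_containers) as pool:
        parts = pool.map(lambda c: [d for d in c if predicate(d)], containers)
    return [d for part in parts for d in part]
```

Note that results arrive container by container, so the output order differs from the input order unless re-sorted.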
{"title":"The Art of Efficient In-memory Query Processing on NUMA Systems: a Systematic Approach","authors":"Puya Memarzia, S. Ray, V. Bhavsar","doi":"10.1109/ICDE48307.2020.00073","DOIUrl":"https://doi.org/10.1109/ICDE48307.2020.00073","url":null,"abstract":"Data analytics systems commonly utilize in-memory query processing techniques to achieve better throughput and lower latency. Modern computers increasingly rely on Non-Uniform Memory Access (NUMA) architectures to achieve scalability. A key drawback of NUMA architectures is that many existing software solutions are not aware of the underlying NUMA topology and thus do not take full advantage of the hardware. Modern operating systems are designed to provide basic support for NUMA systems. However, default system configurations are typically sub-optimal for large data analytics applications. Additionally, rewriting the application from the ground up is not always feasible. In this work, we evaluate a variety of strategies that aim to accelerate memory-intensive data analytics workloads on NUMA systems. Our findings indicate that the operating system default configurations can be detrimental to query performance. We analyze the impact of different memory allocators, memory placement strategies, thread placement, and kernel-level load balancing and memory management mechanisms. With extensive experimental evaluation, we demonstrate that the methodical application of these techniques can be used to obtain significant speedups in four commonplace in-memory query processing tasks, on three different hardware architectures. Furthermore, we show that these strategies can improve the performance of five popular database systems running a TPC-H workload. 
Lastly, we summarize our findings in a decision flowchart for practitioners.","PeriodicalId":6709,"journal":{"name":"2020 IEEE 36th International Conference on Data Engineering (ICDE)","volume":"12 1","pages":"781-792"},"PeriodicalIF":0.0,"publicationDate":"2020-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91372173","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Picube for Fast Exploration of Large Datasets","authors":"Wenxiao Fu","doi":"10.1109/ICDE48307.2020.00246","DOIUrl":"https://doi.org/10.1109/ICDE48307.2020.00246","url":null,"abstract":"Hierarchical aggregation supports fast exploration of large datasets by pre-aggregating data into a multi-scale data structure. Although the pre-aggregation is done offline, it can be quite expensive, and the resulting data structure can be extremely large. When the data is multi-dimensional, this is greatly compounded. Data-cube-based approaches can result in extremely large cube structures, aggregating more than is needed. Other approaches do not aggregate enough, and so do not offer the necessary flexibility for dimension-wise roll-ups and drill-downs. We design a hierarchical data structure for aggregation that strikes a balance, and provides enough flexibility for different exploration scenarios with low-cost construction and reasonable size. Inductive aggregation is a methodology to compute levels of aggregations efficiently, while the resulting data structure supports smooth data exploration. Inspired by this, we propose a partitioned, inductively aggregated data cube, picube. A framework we call the stratum space is presented along with the model to express the dependencies across aggregation levels. 
Optimization choices are discussed, providing good design tradeoffs between storing and querying the data.","PeriodicalId":6709,"journal":{"name":"2020 IEEE 36th International Conference on Data Engineering (ICDE)","volume":"229 1","pages":"2069-2073"},"PeriodicalIF":0.0,"publicationDate":"2020-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85577506","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
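The inductive-aggregation idea, computing each coarser level from the level directly below it rather than from the raw data, can be sketched on a 1-D sum pyramid. This does not model picube's partitioning or stratum space; it only shows why multi-level pre-aggregation is cheap:

```python
def inductive_levels(values, fanout=2):
    """Build a pyramid of aggregation levels inductively: level i+1 sums
    groups of `fanout` cells of level i, so each level costs O(size of
    the previous level) instead of O(size of the raw data)."""
    levels = [list(values)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([sum(prev[i:i + fanout])
                       for i in range(0, len(prev), fanout)])
    return levels
```

A drill-down then just steps to a finer level of the pyramid; a roll-up steps to a coarser one.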
{"title":"Effective and Efficient Truss Computation over Large Heterogeneous Information Networks","authors":"Yixing Yang, Yixiang Fang, Xuemin Lin, W. Zhang","doi":"10.1109/ICDE48307.2020.00083","DOIUrl":"https://doi.org/10.1109/ICDE48307.2020.00083","url":null,"abstract":"Recently, the topic of truss computation has gained plenty of attention, where the k-truss of a graph is the maximum subgraph in which each edge participates in at least (k-2) triangles. Existing solutions mainly focus on homogeneous networks, where vertices are of the same type, and thus cannot be applied to heterogeneous information networks (HINs), which consist of multi-typed and interconnected objects, such as the bibliographic networks and knowledge graphs. In this paper, we study the problem of truss computation over HINs, which aims to find groups of vertices that are of the same type and densely connected. To model the relationship between two vertices of the same type, we adopt the well-known concept of meta-path, which is a sequence of vertex types and edge types between two given vertex types. We then introduce two kinds of HIN triangles for three vertices, regarding a specific meta-path P. The first one requires that each pair of vertices is connected by an instance of P, while the second one also has such a connectivity constraint but further requires that the three instances of P form a circle. Based on these two kinds of triangles, we propose two corresponding HIN truss models. We further develop efficient truss computation algorithms. 
We have performed extensive experiments on five real large HINs, and the results show that the proposed solutions are highly effective and efficient.","PeriodicalId":6709,"journal":{"name":"2020 IEEE 36th International Conference on Data Engineering (ICDE)","volume":"144 1","pages":"901-912"},"PeriodicalIF":0.0,"publicationDate":"2020-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77546871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
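Both HIN truss models build on the classic peeling skeleton of the homogeneous k-truss: repeatedly delete edges that participate in fewer than (k-2) triangles. The HIN variants count meta-path-based triangles instead of plain ones; the sketch below shows only the homogeneous case:

```python
from collections import defaultdict

def k_truss(edges, k):
    """Homogeneous k-truss by peeling: delete every edge whose support
    (number of triangles containing it, i.e. common neighbors of its
    endpoints) is below k-2, and iterate to a fixpoint."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    changed = True
    while changed:
        changed = False
        for u in list(adj):
            for v in list(adj[u]):
                if u < v:
                    support = len(adj[u] & adj[v])  # triangles through (u, v)
                    if support < k - 2:
                        adj[u].discard(v)
                        adj[v].discard(u)
                        changed = True
    return {(u, v) for u in adj for v in adj[u] if u < v}
```

On a 4-clique with one pendant edge, the 4-truss keeps exactly the clique's six edges.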
{"title":"TrajMesa: A Distributed NoSQL Storage Engine for Big Trajectory Data","authors":"Ruiyuan Li, Huajun He, Rubin Wang, Sijie Ruan, Y. Sui, Jie Bao, Yu Zheng","doi":"10.1109/ICDE48307.2020.00224","DOIUrl":"https://doi.org/10.1109/ICDE48307.2020.00224","url":null,"abstract":"Trajectory data is very useful for many urban applications. However, due to its spatio-temporal and high-volume properties, it is challenging to manage trajectory data. Existing trajectory data management frameworks suffer from scalability problems, and only support limited trajectory queries. This paper proposes a holistic distributed NoSQL trajectory storage engine, TrajMesa, based on GeoMesa, an open-source indexing toolkit for spatio-temporal data. TrajMesa adopts a novel storage schema, which reduces the storage size tremendously. We also devise novel indexing key designs, and propose a set of pruning strategies. TrajMesa can support plentiful queries efficiently, including ID-temporal queries, spatial range queries, similarity queries, and k-NN queries. Experimental results demonstrate the query efficiency and scalability of TrajMesa.","PeriodicalId":6709,"journal":{"name":"2020 IEEE 36th International Conference on Data Engineering (ICDE)","volume":"3 1","pages":"2002-2005"},"PeriodicalIF":0.0,"publicationDate":"2020-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73705074","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
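A common ingredient of GeoMesa-style indexing key designs is a space-filling curve, which maps nearby points to lexicographically nearby row keys so that range queries touch contiguous key ranges. A minimal Z-order (bit-interleaving) sketch is below; TrajMesa's actual keys also fold in time and trajectory IDs, which this does not attempt:

```python
def z_order_key(x, y, bits=16):
    """Interleave the bits of two grid coordinates into one integer key
    (a Z-order curve): x's bits land on even positions, y's on odd ones,
    so points that share high-order coordinate bits share key prefixes."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # x bit -> even position
        key |= ((y >> i) & 1) << (2 * i + 1)   # y bit -> odd position
    return key
```

For example, (3, 0) and (0, 3) map to different keys (0b101 and 0b1010) even though they sum the same bits, because the interleaving preserves which dimension each bit came from.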
{"title":"Billion-scale Recommendation with Heterogeneous Side Information at Taobao","authors":"A. Pfadler, Huan Zhao, Jizhe Wang, Lifeng Wang, Pipei Huang, Lee","doi":"10.1109/ICDE48307.2020.00148","DOIUrl":"https://doi.org/10.1109/ICDE48307.2020.00148","url":null,"abstract":"In recent years, embedding models based on the skip-gram algorithm have been widely applied to real-world recommendation systems (RSs). When designing embedding-based methods for recommendation at Taobao, there are three main challenges: scalability, sparsity and cold start. The first problem is inherently caused by the extremely large numbers of users and items (in the order of billions), while the remaining two problems are caused by the fact that most items have only very few (or none at all) user interactions. To address these challenges, in this work, we present a flexible and highly scalable Side Information (SI) enhanced Skip-Gram (SISG) framework, which is deployed at Taobao. SISG overcomes the drawbacks of existing embedding-based models by modeling user metadata and capturing asymmetries of user behavior. Furthermore, as training SISG can be performed using any SGNS implementation, we present our production deployment of SISG on a custom-built word2vec engine, which allows us to compute item and SI embedding vectors for billion-scale sets of products in a joint semantic space on a daily basis. 
Finally, using offline and online experiments we demonstrate the significant superiority of SISG over our previously deployed framework, EGES, and a well-tuned CF, as well as present evidence supporting our scalability claims.","PeriodicalId":6709,"journal":{"name":"2020 IEEE 36th International Conference on Data Engineering (ICDE)","volume":"59 1","pages":"1667-1676"},"PeriodicalIF":0.0,"publicationDate":"2020-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74644457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
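Since the abstract notes that SISG can be trained with any SGNS implementation, the textbook skip-gram-with-negative-sampling update is sketched below for reference. This is not Taobao's engine, and the embedding-table layout (plain dicts of id to vector) is an assumption for illustration:

```python
import math

def sgns_step(center, context, negatives, emb_in, emb_out, lr=0.025):
    """One SGNS gradient step: pull the input vector of `center` toward
    the output vector of the observed `context`, and push it away from
    the output vectors of the sampled `negatives`."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    v = emb_in[center]
    grad_v = [0.0] * len(v)
    for item, label in [(context, 1.0)] + [(neg, 0.0) for neg in negatives]:
        u = emb_out[item]
        g = sigmoid(dot(v, u)) - label       # d loss / d score
        for i in range(len(v)):
            grad_v[i] += g * u[i]            # accumulate before updating u
            u[i] -= lr * g * v[i]            # update output vector in place
    for i in range(len(v)):
        v[i] -= lr * grad_v[i]               # update input vector in place
```

Repeated steps increase the inner product between a center item and its observed context while decreasing it for negatives.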