2020 IEEE 36th International Conference on Data Engineering (ICDE): Latest Papers (Page 3)

Summarizing Hierarchical Multidimensional Data

2020 IEEE 36th International Conference on Data Engineering (ICDE) Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00081

Alexandra Kim, L. Lakshmanan, D. Srivastava

Abstract: Data scientists typically analyze and extract insights from large multidimensional data sets such as US census data, enterprise sales data, and so on. But before sophisticated machine learning and statistical methods are employed, it is useful to build and explore concise summaries of the data set. While a variety of summaries have been proposed over the years, the goal of creating a concise summary of multidimensional data that can provide worst-case accuracy guarantees has remained elusive. In this paper, we propose Tree Summaries, which attain this challenging goal over arbitrary hierarchical multidimensional data sets. Intuitively, a Tree Summary is a weighted "embedded tree" in the lattice that is the cross-product of the dimension hierarchies; individual data values can be efficiently estimated by looking up the weight of their unique closest ancestor in the Tree Summary. We study the problems of generating lossless as well as (given a desired worst-case accuracy guarantee a) lossy Tree Summaries. We develop a polynomial-time algorithm that constructs the optimal (i.e., most concise) Tree Summary for each of these problems; this is a surprising result given the NP-hardness of constructing a variety of other optimal summaries over multidimensional data. We complement our analytical results with an empirical evaluation of our algorithm, and demonstrate with a detailed set of experiments on real and synthetic data sets that our algorithm outperforms prior methods in terms of conciseness of summaries or accuracy of estimation.

Pages: 877-888
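
The closest-ancestor lookup described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single dimension hierarchy rather than the full lattice, and the names `PARENT`, `SUMMARY`, and `estimate`, along with all values, are hypothetical.

```python
from typing import Dict, Optional

# Hypothetical toy hierarchy for one dimension: child -> parent (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "All": None,
    "West": "All",
    "East": "All",
    "CA": "West",
    "WA": "West",
    "NY": "East",
}

# Hypothetical summary: only some nodes of the hierarchy carry a weight.
SUMMARY: Dict[str, float] = {"All": 10.0, "West": 40.0, "WA": 25.0}


def estimate(node: str) -> float:
    """Return the weight of the closest ancestor-or-self of `node` found in SUMMARY."""
    current: Optional[str] = node
    while current is not None:
        if current in SUMMARY:
            return SUMMARY[current]
        current = PARENT[current]  # walk one level up the hierarchy
    raise KeyError(f"no ancestor of {node!r} is in the summary")


print(estimate("CA"))  # falls back to the weight stored at "West"
print(estimate("WA"))  # "WA" is itself in the summary
print(estimate("NY"))  # falls back to the root "All"
```

With one hierarchy the ancestor chain is unique by construction; in the multidimensional setting the abstract describes, uniqueness of the closest ancestor comes from the summary being an embedded tree in the lattice.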

Citations: 9

Scaling Out Schema-free Stream Joins

2020 IEEE 36th International Conference on Data Engineering (ICDE) Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00075

Damjan Gjurovski, S. Michel

Citations: 0

Efficient Locality-Sensitive Hashing Over High-Dimensional Data Streams

2020 IEEE 36th International Conference on Data Engineering (ICDE) Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00220

Chengcheng Yang, Dong Deng, Shuo Shang, Ling Shao

Citations: 5

User-driven Error Detection for Time Series with Events

2020 IEEE 36th International Conference on Data Engineering (ICDE) Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00070

Kim-Hung Le, Paolo Papotti

Abstract: Anomalies are pervasive in time series data, such as sensor readings. Existing methods for anomaly detection cannot distinguish between anomalies that represent data errors, such as incorrect sensor readings, and notable events, such as the watering action in soil monitoring. In addition, the quality performance of such detection methods highly depends on the configuration parameters, which are dataset specific. In this work, we exploit active learning to detect both errors and events in a single solution that aims at minimizing user interaction. For this joint detection, we introduce an algorithm that accurately detects and labels anomalies with a non-parametric concept of neighborhood and probabilistic classification. Given a desired quality, the confidence of the classification is then used as termination condition for the active learning algorithm. Experiments on real and synthetic datasets demonstrate that our approach achieves F-score above 80% in detecting errors by labeling 2 to 5 points in one data series. We also show the superiority of our solution compared to the state-of-the-art approaches for anomaly detection. Finally, we demonstrate the positive impact of our error detection methods in downstream data repairing algorithms.

Pages: 745-757

Citations: 6

G-thinker: A Distributed Framework for Mining Subgraphs in a Big Graph

2020 IEEE 36th International Conference on Data Engineering (ICDE) Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00122

Da Yan, Guimu Guo, Md Mashiur Rahman Chowdhury, M. Tamer Özsu, Wei-Shinn Ku, John C.S. Lui

Citations: 29

Preserving Contextual Information in Relational Matrix Operations

2020 IEEE 36th International Conference on Data Engineering (ICDE) Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00197

O. Dolmatova, Nikolaus Augsten, Michael H. Böhlen

Citations: 4

Doubleheader Logging: Eliminating Journal Write Overhead for Mobile DBMS

2020 IEEE 36th International Conference on Data Engineering (ICDE) Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00111

Sehyeon Oh, Wook-Hee Kim, Jihye Seo, Hyeonho Song, S. Noh, Beomseok Nam

Abstract: Various transactional systems use out-of-place updates such as logging or copy-on-write mechanisms to update data in a failure-atomic manner. Such out-of-place update methods double the I/O traffic due to back-up copies in the database layer and quadruple the I/O traffic due to the file system journaling. In mobile systems, transaction sizes of mobile apps are known to be tiny and transactions run at low concurrency. For such mobile transactions, legacy out-of-place update methods such as WAL are sub-optimal. In this work, we propose a crash-consistent in-place update logging method, doubleheader logging (DHL), for SQLite. DHL prevents previous consistent records from being lost by performing a copy-on-write inside the database page and co-locating the metadata-only journal information within the page. This is done, in turn, with minimal sacrifice to page utilization. DHL is similar to when journaling is disabled, in the sense that it incurs almost no additional overhead in terms of both I/O and computation. Our experimental results show that DHL outperforms other logging methods such as out-of-place update write-ahead logging (WAL) and in-place update multi-version B-tree (MVBT).

Pages: 1237-1248

Citations: 1

SLED: Semi-supervised Locally-weighted Ensemble Detector

2020 IEEE 36th International Conference on Data Engineering (ICDE) Pub Date : 2020-04-01 DOI: 10.1109/icde48307.2020.00183

Shuxiang Zhang, David Tse Jung Huang, G. Dobbie, Yun Sing Koh

Abstract: Concept drift detection refers to the process of detecting changes in the underlying distribution of data. Interest in the data stream mining community has increased, because of their role in improving the performance of online learning algorithms. Over the years, a myriad of drift detection methods have been proposed. However, most of these methods are single detectors, which usually work well only with a single type of drift. In this research, we propose a semi-supervised locally-weighted ensemble detector (SLED), where the relative performance among its base detectors is characterized by a set of weights learned in a semi-supervised manner. The aim of this technique is to effectively deal with both abrupt and gradual concept drifts. In our experiments, SLED is configured with ten well-known drift detectors. To evaluate the performance of SLED, we compare it with single detectors as well as state-of-the-art ensemble methods on both synthetic and real-world datasets using different performance measures. The experimental results show that SLED has fewer false positives, higher precision, and higher Matthews correlation coefficient while maintaining reasonably good performance for other measures.

Pages: 1838-1841
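
The general idea of combining base detectors through weights can be sketched as a weighted vote. This is a generic illustration, not SLED's actual combination rule or weight-learning procedure, which the abstract does not specify. The function name `weighted_drift_vote`, the fixed weights, and the 0.5 threshold are all assumptions.

```python
from typing import Sequence


def weighted_drift_vote(
    signals: Sequence[bool],
    weights: Sequence[float],
    threshold: float = 0.5,
) -> bool:
    """Flag drift when the weighted share of firing detectors exceeds `threshold`."""
    if len(signals) != len(weights):
        raise ValueError("signals and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Sum the weights of the detectors that currently signal a drift.
    fired = sum(w for s, w in zip(signals, weights) if s)
    return fired / total > threshold


# Hypothetical weights: the first detector is trusted more than the other two.
weights = [3.0, 1.0, 1.0]
print(weighted_drift_vote([True, False, False], weights))  # share 3/5 is above 0.5
print(weighted_drift_vote([False, True, True], weights))   # share 2/5 is below 0.5
```

In this sketch the weights are given up front; in the approach the abstract describes they would instead be learned in a semi-supervised manner.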

Citations: 2

ML-based Cross-Platform Query Optimization

2020 IEEE 36th International Conference on Data Engineering (ICDE) Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00132

Zoi Kaoudi, Jorge-Arnulfo Quiané-Ruiz, Bertty Contreras-Rojas, Rodrigo Pardo-Meza, Anis Troudi, S. Chawla

Abstract: Cost-based optimization is widely known to suffer from a major weakness: administrators spend a significant amount of time to tune the associated cost models. This problem only gets exacerbated in cross-platform settings as there are many more parameters that need to be tuned. In the era of machine learning (ML), the first step to remedy this problem is to replace the cost model of the optimizer with an ML model. However, such a solution brings in two major challenges. First, the optimizer has to transform a query plan to a vector millions of times during plan enumeration, incurring a very high overhead. Second, a lot of training data is required to effectively train the ML model. We overcome these challenges in Robopt, a novel vector-based optimizer we have built for Rheem, a cross-platform system. Robopt not only uses an ML model to prune the search space but also bases the entire plan enumeration on a set of algebraic operations that operate on vectors, which are a natural fit to the ML model. This leads to both speed-up and scale-up of the enumeration process by exploiting modern CPUs via vectorization. We also accompany Robopt with a scalable training data generator for building its ML model. Our evaluation shows that (i) the vector-based approach is more efficient and scalable than simply using an ML model and (ii) Robopt matches and, in some cases, improves Rheem’s cost-based optimizer in choosing good plans without requiring any tuning effort.

Pages: 1489-1500

Citations: 15

ForkBase: Immutable, Tamper-evident Storage Substrate for Branchable Applications

2020 IEEE 36th International Conference on Data Engineering (ICDE) Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00153

Qian Lin, Kaiyuan Yang, Tien Tuan Anh Dinh, Qingchao Cai, Gang Chen, B. Ooi, Pingcheng Ruan, Sheng Wang, Zhongle Xie, Meihui Zhang, Olafs Vandans

Citations: 6