{"title":"Prompt: Dynamic Data-Partitioning for Distributed Micro-batch Stream Processing Systems","authors":"A. S. Abdelhamid, Ahmed R. Mahmood, Anas Daghistani, Walid G. Aref","doi":"10.1145/3318464.3389713","DOIUrl":"https://doi.org/10.1145/3318464.3389713","url":null,"abstract":"Advances in real-world applications require high-throughput processing over large data streams. Micro-batching has been proposed to support the needs of these applications. In micro-batching, the processing and batching of the data are interleaved, where the incoming data tuples are first buffered as data blocks, and then are processed collectively using parallel function constructs (e.g., Map-Reduce). The size of a micro-batch is set to guarantee a certain response-time latency that is to conform to the application's service-level agreement. In contrast to tuple-at-a-time data stream processing, micro-batching has the potential to sustain higher data rates. However, existing micro-batch stream processing systems use basic data-partitioning techniques that do not account for data skew and variable data rates. Load-awareness is necessary to maintain performance and to enhance resource utilization. A new data partitioning scheme termed Prompt is presented that leverages the characteristics of the micro-batch processing model. In the batching phase, a frequency-aware buffering mechanism is introduced that progressively maintains run-time statistics, and provides online key-based sorting as data tuples arrive. Because achieving optimal data partitioning is NP-Hard in this context, a workload-aware greedy algorithm is introduced that partitions the buffered data tuples efficiently for the Map stage. In the processing phase, a load-aware distribution mechanism is presented that balances the size of the input to the Reduce stage without incurring inter-task communication overhead. Moreover, Prompt elastically adapts resource consumption according to workload changes. Experimental results using real and synthetic data sets demonstrate that Prompt is robust against fluctuations in data distribution and arrival rates. Furthermore, Prompt achieves up to 200% improvement in system throughput over state-of-the-art techniques without degradation in latency.","PeriodicalId":436122,"journal":{"name":"Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data","volume":"219 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122853080","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automating Incremental and Asynchronous Evaluation for Recursive Aggregate Data Processing","authors":"Qiange Wang, Yanfeng Zhang, Hao Wang, Liang Geng, Rubao Lee, Xiaodong Zhang, Ge Yu","doi":"10.1145/3318464.3389712","DOIUrl":"https://doi.org/10.1145/3318464.3389712","url":null,"abstract":"In database and large-scale data analytics, recursive aggregate processing plays an important role, which is generally implemented under a framework of incremental computing and executed synchronously and/or asynchronously. We identify three barriers in existing recursive aggregate data processing. First, the processing scope is largely limited to monotonic programs. Second, checking on conditions for monotonicity and correctness for async processing is sophisticated and manually done. Third, execution engines may be suboptimal due to separation of sync and async execution. In this paper, we lay an analytical foundation for conditions to check if a recursive aggregate program that is monotonic or even non-monotonic can be executed incrementally and asynchronously with its correct result. We design and implement a condition verification tool that can automatically check if a given program satisfies the conditions. We further propose a unified sync-async engine to execute these programs for high performance. To integrate all these effective methods together, we have developed a distributed Datalog system, called PowerLog. Our evaluation shows that PowerLog can outperform three representative Datalog systems on both monotonic and non-monotonic recursive programs.","PeriodicalId":436122,"journal":{"name":"Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data","volume":"14 41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124746728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TensorFlow Data Validation: Data Analysis and Validation in Continuous ML Pipelines","authors":"Emily Caveness, C. PaulSuganthanG., Zhuo Peng, N. Polyzotis, Sudip Roy, Martin A. Zinkevich","doi":"10.1145/3318464.3384707","DOIUrl":"https://doi.org/10.1145/3318464.3384707","url":null,"abstract":"Machine Learning (ML) research has primarily focused on improving the accuracy and efficiency of the training algorithms while paying much less attention to the equally important problem of understanding, validating, and monitoring the data fed to ML. Irrespective of the ML algorithms used, data errors can adversely affect the quality of the generated model. This indicates that we need to adopt a data-centric approach to ML that treats data as a first-class citizen, on par with algorithms and infrastructure which are the typical building blocks of ML pipelines. In this demonstration we showcase TensorFlow Data Validation (TFDV), a scalable data analysis and validation system for ML that we have developed at Google and recently open-sourced. This system is deployed in production as an integral part of TFX - an end-to-end machine learning platform at Google. It is used by hundreds of product teams at Google and has received significant attention from the open-source community as well.","PeriodicalId":436122,"journal":{"name":"Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125660492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analysis of Database Search Systems with THOR","authors":"Theofilos Belmpas, Orest Gkini, G. Koutrika","doi":"10.1145/3318464.3384679","DOIUrl":"https://doi.org/10.1145/3318464.3384679","url":null,"abstract":"Numerous search systems have been implemented that allow users to pose unstructured queries over databases without the need to use a query language, such as SQL. Unfortunately, the landscape of efforts is fragmented with no clear sight of which system is best, and what open challenges we should pursue in our research. To help towards this direction, we present THOR that makes 4 important contributions: a query benchmark, a framework for comparing different systems, several search system implementations, and a highly interactive tool for comparing different search systems.","PeriodicalId":436122,"journal":{"name":"Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128786422","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Systems and ML: When the Sum is Greater than Its Parts","authors":"I. Stoica","doi":"10.1145/3318464.3393817","DOIUrl":"https://doi.org/10.1145/3318464.3393817","url":null,"abstract":"BIOGRAPHY: Ion Stoica is a Professor in the EECS Department at the University of California at Berkeley, and the Director of RISELab (https://rise.cs.berkeley.edu/). He is currently doing research on cloud computing and AI systems. Past work includes Apache Spark, Apache Mesos, Tachyon, Chord DHT, and Dynamic Packet State (DPS). He is an ACM Fellow and has received numerous awards, including the Mark Weiser Award (2019), SIGOPS Hall of Fame Award (2015), the SIGCOMM Test of Time Award (2011), and the ACM doctoral dissertation award (2001). He also co-founded three companies, Anyscale (2019), Databricks (2013) and Conviva (2006).","PeriodicalId":436122,"journal":{"name":"Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126228652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Rethinking Message Brokers on RDMA and NVM","authors":"Hendrik Makait","doi":"10.1145/3318464.3384403","DOIUrl":"https://doi.org/10.1145/3318464.3384403","url":null,"abstract":"Over the last years, message brokers have become an important part of enterprise systems. As microservice architectures gain popularity and the need to analyze data produced by these services grows, companies increasingly rely on message brokers to orchestrate the flow of events between different applications as well as between data-producing services and streaming engines that analyze the data in real-time.","PeriodicalId":436122,"journal":{"name":"Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131107972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"STAR: A Distributed Stream Warehouse System for Spatial Data","authors":"Zhida Chen, G. Cong, Walid G. Aref","doi":"10.1145/3318464.3384699","DOIUrl":"https://doi.org/10.1145/3318464.3384699","url":null,"abstract":"The proliferation of mobile phones and location-based services gives rise to an explosive growth of spatial data. This spatial data contains valuable information, and calls for data stream warehouse systems that can provide real-time analytical results with the latest integrated spatial data. In this demonstration, we present the STAR (Spatial Data Stream Warehouse) system. STAR is a distributed in-memory spatial data stream warehouse system that provides low-latency and up-to-date analytical results over a fast spatial data stream. STAR supports a rich set of aggregate queries for spatial data analytics, e.g., contrasting the frequencies of spatial objects that appear in different spatial regions, or showing the most frequently mentioned topics being tweeted in different cities. STAR processes aggregate queries by maintaining distributed materialized views. Additionally, STAR supports dynamic load adjustment that makes STAR scalable and adaptive. We demonstrate STAR on top of Amazon EC2 clusters using real data sets.","PeriodicalId":436122,"journal":{"name":"Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131315570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BOOMER: A Tool for Blending Visual P-Homomorphic Queries on Large Networks","authors":"Yinglong Song, Huey-Eng Chua, S. Bhowmick, Byron Choi, Shuigeng Zhou","doi":"10.1145/3318464.3384680","DOIUrl":"https://doi.org/10.1145/3318464.3384680","url":null,"abstract":"The paradigm of interleaving (i.e. blending) visual subgraph query formulation and processing by exploiting the latency offered by the GUI brings in several potential benefits such as superior system response time (SRT) and opportunities to enhance usability of graph databases. Recent efforts at implementing this paradigm are focused on subgraph isomorphism-based queries, which are often restrictive in many real-world graph applications. In this demonstration, we present a novel system called BOOMER to realize this paradigm on more generic but complex bounded 1-1 p-homomorphic(BPH) queries on large networks. Intuitively, a BPH query maps an edge of the query to bounded paths in the data graph. We demonstrate various innovative features of BOOMER, its flexibility, and its promising performance.","PeriodicalId":436122,"journal":{"name":"Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128096986","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MemFlow: Memory-Aware Distributed Deep Learning","authors":"Neil Band","doi":"10.1145/3318464.3384416","DOIUrl":"https://doi.org/10.1145/3318464.3384416","url":null,"abstract":"As the number of layers and the amount of training data increases, the trend is to train deep neural networks in parallel across devices. In such scenarios, neural network training is increasingly bottlenecked by high memory requirements posed by intermediate results, or feature maps, that are produced during the forward pass and consumed during the backward pass. We recognize that the best-performing device parallelization configurations should consider memory usage in addition to the canonical metric of computation time. Towards this we introduce MemFlow, an optimization framework for distributed deep learning that performs joint optimization over memory usage and computation time when searching for a parallelization strategy. MemFlow consists of: (i) a task graph with memory usage estimates; (ii) a memory-aware execution simulator; and (iii) a Markov Chain Monte Carlo search algorithm that considers various degrees of recomputation i.e., discarding feature maps during the forward pass and recomputing them during the backward pass. Our experiments demonstrate that under memory constraints, MemFlow can readily locate valid and superior parallelization strategies unattainable with previous frameworks.","PeriodicalId":436122,"journal":{"name":"Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data","volume":"52 6","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114023454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Azure SQL Database Always Encrypted","authors":"Panagiotis Antonopoulos, A. Arasu, Kunal D. Singh, Ken Eguro, Nitish Gupta, Rajat Jain, R. Kaushik, Hanuma Kodavalla, Donald Kossmann, Nikolas Ogg, Ravishankar Ramamurthy, J. Szymaszek, J. Trimmer, K. Vaswani, R. Venkatesan, M. Zwilling","doi":"10.1145/3318464.3386141","DOIUrl":"https://doi.org/10.1145/3318464.3386141","url":null,"abstract":"This paper presents Always Encrypted, a recently released feature of Microsoft SQL Server that uses column granularity encryption to provide cryptographic data protection guarantees. Always Encrypted can be used to outsource database administration while keeping the data confidential from an administrator, including cloud operators. The first version of Always Encrypted was released in Azure SQL Database and as part of SQL Server 2016, and supported equality operations over deterministically encrypted columns. The second version, released as part of SQL Server 2019, uses an enclave running within a trusted execution environment to provide richer functionality that includes comparison and string pattern matching for an IND-CPA-secure (randomized) encryption scheme. We present the security, functionality, and design of Always Encrypted, and provide a performance evaluation using the TPC-C benchmark.","PeriodicalId":436122,"journal":{"name":"Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125000385","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}