{"title":"Prompt: Dynamic Data-Partitioning for Distributed Micro-batch Stream Processing Systems","authors":"A. S. Abdelhamid, Ahmed R. Mahmood, Anas Daghistani, Walid G. Aref","doi":"10.1145/3318464.3389713","DOIUrl":"https://doi.org/10.1145/3318464.3389713","url":null,"abstract":"Advances in real-world applications require high-throughput processing over large data streams. Micro-batching has been proposed to support the needs of these applications. In micro-batching, the processing and batching of the data are interleaved, where the incoming data tuples are first buffered as data blocks, and then are processed collectively using parallel function constructs (e.g., Map-Reduce). The size of a micro-batch is set to guarantee a certain response-time latency that is to conform to the application's service-level agreement. In contrast to tuple-at-a-time data stream processing, micro-batching has the potential to sustain higher data rates. However, existing micro-batch stream processing systems use basic data-partitioning techniques that do not account for data skew and variable data rates. Load-awareness is necessary to maintain performance and to enhance resource utilization. A new data partitioning scheme termed Prompt is presented that leverages the characteristics of the micro-batch processing model. In the batching phase, a frequency-aware buffering mechanism is introduced that progressively maintains run-time statistics, and provides online key-based sorting as data tuples arrive. Because achieving optimal data partitioning is NP-Hard in this context, a workload-aware greedy algorithm is introduced that partitions the buffered data tuples efficiently for the Map stage. In the processing phase, a load-aware distribution mechanism is presented that balances the size of the input to the Reduce stage without incurring inter-task communication overhead. Moreover, Prompt elastically adapts resource consumption according to workload changes. Experimental results using real and synthetic data sets demonstrate that Prompt is robust against fluctuations in data distribution and arrival rates. Furthermore, Prompt achieves up to 200% improvement in system throughput over state-of-the-art techniques without degradation in latency.","PeriodicalId":436122,"journal":{"name":"Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data","volume":"219 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122853080","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automating Incremental and Asynchronous Evaluation for Recursive Aggregate Data Processing","authors":"Qiange Wang, Yanfeng Zhang, Hao Wang, Liang Geng, Rubao Lee, Xiaodong Zhang, Ge Yu","doi":"10.1145/3318464.3389712","DOIUrl":"https://doi.org/10.1145/3318464.3389712","url":null,"abstract":"In database and large-scale data analytics, recursive aggregate processing plays an important role, which is generally implemented under a framework of incremental computing and executed synchronously and/or asynchronously. We identify three barriers in existing recursive aggregate data processing. First, the processing scope is largely limited to monotonic programs. Second, checking on conditions for monotonicity and correctness for async processing is sophisticated and manually done. Third, execution engines may be suboptimal due to separation of sync and async execution. In this paper, we lay an analytical foundation for conditions to check if a recursive aggregate program that is monotonic or even non-monotonic can be executed incrementally and asynchronously with its correct result. We design and implement a condition verification tool that can automatically check if a given program satisfies the conditions. We further propose a unified sync-async engine to execute these programs for high performance. To integrate all these effective methods together, we have developed a distributed Datalog system, called PowerLog. Our evaluation shows that PowerLog can outperform three representative Datalog systems on both monotonic and non-monotonic recursive programs.","PeriodicalId":436122,"journal":{"name":"Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data","volume":"14 41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124746728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TensorFlow Data Validation: Data Analysis and Validation in Continuous ML Pipelines","authors":"Emily Caveness, C. PaulSuganthanG., Zhuo Peng, N. Polyzotis, Sudip Roy, Martin A. Zinkevich","doi":"10.1145/3318464.3384707","DOIUrl":"https://doi.org/10.1145/3318464.3384707","url":null,"abstract":"Machine Learning (ML) research has primarily focused on improving the accuracy and efficiency of the training algorithms while paying much less attention to the equally important problem of understanding, validating, and monitoring the data fed to ML. Irrespective of the ML algorithms used, data errors can adversely affect the quality of the generated model. This indicates that we need to adopt a data-centric approach to ML that treats data as a first-class citizen, on par with algorithms and infrastructure which are the typical building blocks of ML pipelines. In this demonstration we showcase TensorFlow Data Validation (TFDV), a scalable data analysis and validation system for ML that we have developed at Google and recently open-sourced. This system is deployed in production as an integral part of TFX - an end-to-end machine learning platform at Google. It is used by hundreds of product teams at Google and has received significant attention from the open-source community as well.","PeriodicalId":436122,"journal":{"name":"Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125660492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analysis of Database Search Systems with THOR","authors":"Theofilos Belmpas, Orest Gkini, G. Koutrika","doi":"10.1145/3318464.3384679","DOIUrl":"https://doi.org/10.1145/3318464.3384679","url":null,"abstract":"Numerous search systems have been implemented that allow users to pose unstructured queries over databases without the need to use a query language, such as SQL. Unfortunately, the landscape of efforts is fragmented with no clear sight of which system is best, and what open challenges we should pursue in our research. To help towards this direction, we present THOR that makes 4 important contributions: a query benchmark, a framework for comparing different systems, several search system implementations, and a highly interactive tool for comparing different search systems.","PeriodicalId":436122,"journal":{"name":"Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128786422","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Systems and ML: When the Sum is Greater than Its Parts","authors":"I. Stoica","doi":"10.1145/3318464.3393817","DOIUrl":"https://doi.org/10.1145/3318464.3393817","url":null,"abstract":"BIOGRAPHY: Ion Stoica is a Professor in the EECS Department at the University of California at Berkeley, and the Director of RISELab (https://rise.cs.berkeley.edu/). He is currently doing research on cloud computing and AI systems. Past work includes Apache Spark, Apache Mesos, Tachyon, Chord DHT, and Dynamic Packet State (DPS). He is an ACM Fellow and has received numerous awards, including the Mark Weiser Award (2019), SIGOPS Hall of Fame Award (2015), the SIGCOMM Test of Time Award (2011), and the ACM doctoral dissertation award (2001). He also co-founded three companies, Anyscale (2019), Databricks (2013) and Conviva (2006).","PeriodicalId":436122,"journal":{"name":"Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126228652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Rethinking Message Brokers on RDMA and NVM","authors":"Hendrik Makait","doi":"10.1145/3318464.3384403","DOIUrl":"https://doi.org/10.1145/3318464.3384403","url":null,"abstract":"Over the last years, message brokers have become an important part of enterprise systems. As microservice architectures gain popularity and the need to analyze data produced by these services grows, companies increasingly rely on message brokers to orchestrate the flow of events between different applications as well as between data-producing services and streaming engines that analyze the data in real-time.","PeriodicalId":436122,"journal":{"name":"Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131107972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"STAR: A Distributed Stream Warehouse System for Spatial Data","authors":"Zhida Chen, G. Cong, Walid G. Aref","doi":"10.1145/3318464.3384699","DOIUrl":"https://doi.org/10.1145/3318464.3384699","url":null,"abstract":"The proliferation of mobile phones and location-based services gives rise to an explosive growth of spatial data. This spatial data contains valuable information, and calls for data stream warehouse systems that can provide real-time analytical results with the latest integrated spatial data. In this demonstration, we present the STAR (Spatial Data Stream Warehouse) system. STAR is a distributed in-memory spatial data stream warehouse system that provides low-latency and up-to-date analytical results over a fast spatial data stream. STAR supports a rich set of aggregate queries for spatial data analytics, e.g., contrasting the frequencies of spatial objects that appear in different spatial regions, or showing the most frequently mentioned topics being tweeted in different cities. STAR processes aggregate queries by maintaining distributed materialized views. Additionally, STAR supports dynamic load adjustment that makes STAR scalable and adaptive. We demonstrate STAR on top of Amazon EC2 clusters using real data sets.","PeriodicalId":436122,"journal":{"name":"Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131315570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BOOMER: A Tool for Blending Visual P-Homomorphic Queries on Large Networks","authors":"Yinglong Song, Huey-Eng Chua, S. Bhowmick, Byron Choi, Shuigeng Zhou","doi":"10.1145/3318464.3384680","DOIUrl":"https://doi.org/10.1145/3318464.3384680","url":null,"abstract":"The paradigm of interleaving (i.e. blending) visual subgraph query formulation and processing by exploiting the latency offered by the GUI brings in several potential benefits such as superior system response time (SRT) and opportunities to enhance usability of graph databases. Recent efforts at implementing this paradigm are focused on subgraph isomorphism-based queries, which are often restrictive in many real-world graph applications. In this demonstration, we present a novel system called BOOMER to realize this paradigm on more generic but complex bounded 1-1 p-homomorphic(BPH) queries on large networks. Intuitively, a BPH query maps an edge of the query to bounded paths in the data graph. We demonstrate various innovative features of BOOMER, its flexibility, and its promising performance.","PeriodicalId":436122,"journal":{"name":"Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128096986","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MemFlow: Memory-Aware Distributed Deep Learning","authors":"Neil Band","doi":"10.1145/3318464.3384416","DOIUrl":"https://doi.org/10.1145/3318464.3384416","url":null,"abstract":"As the number of layers and the amount of training data increases, the trend is to train deep neural networks in parallel across devices. In such scenarios, neural network training is increasingly bottlenecked by high memory requirements posed by intermediate results, or feature maps, that are produced during the forward pass and consumed during the backward pass. We recognize that the best-performing device parallelization configurations should consider memory usage in addition to the canonical metric of computation time. Towards this we introduce MemFlow, an optimization framework for distributed deep learning that performs joint optimization over memory usage and computation time when searching for a parallelization strategy. MemFlow consists of: (i) a task graph with memory usage estimates; (ii) a memory-aware execution simulator; and (iii) a Markov Chain Monte Carlo search algorithm that considers various degrees of recomputation i.e., discarding feature maps during the forward pass and recomputing them during the backward pass. Our experiments demonstrate that under memory constraints, MemFlow can readily locate valid and superior parallelization strategies unattainable with previous frameworks.","PeriodicalId":436122,"journal":{"name":"Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data","volume":"52 6","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114023454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Azure SQL Database Always Encrypted","authors":"Panagiotis Antonopoulos, A. Arasu, Kunal D. Singh, Ken Eguro, Nitish Gupta, Rajat Jain, R. Kaushik, Hanuma Kodavalla, Donald Kossmann, Nikolas Ogg, Ravishankar Ramamurthy, J. Szymaszek, J. Trimmer, K. Vaswani, R. Venkatesan, M. Zwilling","doi":"10.1145/3318464.3386141","DOIUrl":"https://doi.org/10.1145/3318464.3386141","url":null,"abstract":"This paper presents Always Encrypted, a recently released feature of Microsoft SQL Server that uses column granularity encryption to provide cryptographic data protection guarantees. Always Encrypted can be used to outsource database administration while keeping the data confidential from an administrator, including cloud operators. The first version of Always Encrypted was released in Azure SQL Database and as part of SQL Server 2016, and supported equality operations over deterministically encrypted columns. The second version, released as part of SQL Server 2019, uses an enclave running within a trusted execution environment to provide richer functionality that includes comparison and string pattern matching for an IND-CPA-secure (randomized) encryption scheme. We present the security, functionality, and design of Always Encrypted, and provide a performance evaluation using the TPC-C benchmark.","PeriodicalId":436122,"journal":{"name":"Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125000385","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}