{"title":"Cross-system NoSQL data transformations with NotaQL","authors":"Johannes Schildgen, Thomas Lottermann, S. Deßloch","doi":"10.1145/2926534.2926535","DOIUrl":"https://doi.org/10.1145/2926534.2926535","url":null,"abstract":"The rising adoption of NoSQL technology in enterprises causes a heterogeneous landscape of different data stores. Different stores provide distinct advantages and disadvantages, making it necessary for enterprises to facilitate multiple systems for specific purposes. This resulting polyglot persistence is difficult to handle for developers since some data needs to be replicated and aggregated between different and within the same stores. Currently, there are no uniform tools to perform these data transformations since all stores feature different APIs and data models. In this paper, we present the transformation language NotaQL that allows cross-system data transformations. These transformations are output-oriented, meaning that the structure of a transformation script is similar to that of the output. Besides, we provide an aggregation-centric approach, which makes aggregation operations as easy as possible.","PeriodicalId":393776,"journal":{"name":"Proceedings of the 3rd ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond","volume":"223 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132641970","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deterministic load balancing for parallel joins","authors":"Paraschos Koutris, Nivetha Singara Vadivelu","doi":"10.1145/2926534.2926536","DOIUrl":"https://doi.org/10.1145/2926534.2926536","url":null,"abstract":"We study the problem of distributing the tuples of a relation to a number of processors organized in an r-dimensional hypercube, which is an important task for parallel join processing. In contrast to previous work, which proposed randomized algorithms for the task, we ask here the question of how to construct efficient deterministic distribution strategies that can optimally load balance the input relation. We first present some general lower bounds on the load for any dimension; these bounds depend not only on the size of the relation, but also on the maximum frequency of each value in the relation. We then construct an algorithm for the case of 1 dimension that is optimal within a constant factor, and an algorithm for the case of 2 dimensions that is optimal within a polylogarithmic factor. Our 2-dimensional algorithm is based on an interesting connection with the vector load balancing problem, a well-studied problem that generalizes classic load balancing.","PeriodicalId":393776,"journal":{"name":"Proceedings of the 3rd ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127910723","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bridging the gap: towards optimization across linear and relational algebra","authors":"Andreas Kunft, Alexander B. Alexandrov, Asterios Katsifodimos, V. Markl","doi":"10.1145/2926534.2926540","DOIUrl":"https://doi.org/10.1145/2926534.2926540","url":null,"abstract":"Advanced data analysis typically requires some form of pre-processing in order to extract and transform data before processing it with machine learning and statistical analysis techniques. Pre-processing pipelines are naturally expressed in dataflow APIs (e.g., MapReduce, Flink, etc.), while machine learning is expressed in linear algebra with iterations. Programmers therefore perform end-to-end data analysis utilizing multiple programming paradigms and systems. This impedance mismatch not only hinders productivity but also prevents optimization opportunities, such as sharing of physical data layouts (e.g., partitioning) and data structures among different parts of a data analysis program. The goal of this work is twofold. First, it aims to alleviate the impedance mismatch by allowing programmers to author complete end-to-end programs in one engine-independent language that is automatically parallelized. Second, it aims to enable joint optimizations over both relational and linear algebra. To achieve this goal, we present the design of Lara, a deeply embedded language in Scala which enables authoring scalable programs using two abstract data types (DataBag and Matrix) and control flow constructs. Programs written in Lara are compiled to an intermediate representation (IR) which enables optimizations across linear and relational algebra. The IR is finally used to compile code for different execution engines.","PeriodicalId":393776,"journal":{"name":"Proceedings of the 3rd ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond","volume":"92 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127134090","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On exploring efficient shuffle design for in-memory MapReduce","authors":"Harunobu Daikoku, H. Kawashima, O. Tatebe","doi":"10.1145/2926534.2926538","DOIUrl":"https://doi.org/10.1145/2926534.2926538","url":null,"abstract":"MapReduce is commonly used as a way of big data analysis in many fields. Shuffling, the inter-node data exchange phase of MapReduce, has been reported as the major bottleneck of the framework. Acceleration of shuffling has been studied in literature, and we raise two questions in this paper. The first question pertains to the effect of Remote Direct Memory Access (RDMA) on the performance of shuffling. RDMA enables one machine to read and write data on the local memory of another and has been known to be an efficient data transfer mechanism. Does the pure use of RDMA affect the performance of shuffling? The second question is the data transfer algorithm to use. There are two types of shuffling algorithms for the conventional MapReduce implementations: Fully-Connected and more sophisticated algorithms such as Pairwise. Does the data transfer algorithm affect the performance of shuffling? To answer these questions, we designed and implemented yet another MapReduce system from scratch in C/C++ to gain the maximum performance and to reserve design flexibility. For the first question, we compared RDMA shuffling based on rsocket with the one based on IPoIB. The results of experiments with GroupBy showed that RDMA accelerates map+shuffle phase by around 50%. For the second question, we first compared our in-memory system with Apache Spark to investigate whether our system performed more efficiently than the existing system. Our system demonstrated performance improvement by a factor of 3.04 on Word Count, and by a factor of 2.64 on BiGram Count as compared to Spark. Then, we compared the two data exchange algorithms, Fully-Connected and Pairwise. The results of experiments with BiGram Count showed that Fully-Connected without RDMA was 13% more efficient than Pairwise with RDMA. We conclude that it is necessary to overlap map and shuffle phases to gain performance improvement. The reason of the relatively small percentage of improvement can be attributed to the time-consuming insertions of key-value pairs into the hash-map in the map phase.","PeriodicalId":393776,"journal":{"name":"Proceedings of the 3rd ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124850305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Faucet: a user-level, modular technique for flow control in dataflow engines","authors":"Andrea Lattuada, Frank McSherry, Zaheer Chothia","doi":"10.1145/2926534.2926544","DOIUrl":"https://doi.org/10.1145/2926534.2926544","url":null,"abstract":"This document presents Faucet, a modular flow control approach for distributed data-parallel dataflow engines with support for arbitrary (cyclic) topologies. When compared to existing backpressure techniques Faucet has the following differentiating characteristics: (i) the implementation only relies on existing progress information exposed by the system and does not require changes to the underlying dataflow system, (ii) it can be applied selectively to certain parts of the dataflow graph, and (iii) it is designed to support a wide variety of use cases, topologies and workloads. We demonstrate Faucet on an example computation for efficiently determining a cyclic join of relations, whose variability in rates of produced and consumed tuples challenges the flow control techniques employed by systems like Storm, Heron, and Spark. Our implementation, prototyped in Timely Dataflow, introduces flow control at critical locations in the computation, keeping the computation stable and resource-bound while introducing at most 20% runtime overhead over an unconstrained implementation. Our experience is that the information Timely Dataflow provides to user logic is sufficient for a variety of flow control and scheduling tasks, and merits further investigation.","PeriodicalId":393776,"journal":{"name":"Proceedings of the 3rd ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125641213","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tight bounds on one- and two-pass MapReduce algorithms for matrix multiplication","authors":"Prakash V. Ramanan, A. Nagar","doi":"10.1145/2926534.2926542","DOIUrl":"https://doi.org/10.1145/2926534.2926542","url":null,"abstract":"We study one- and two-pass mapReduce algorithms for multiplying two matrices. First, consider one-pass algorithms. In the literature, there is a tight bound for the tradeoff between communication cost and parallelism. It measures communication cost using the replication rate r, and measures parallelism by reducer size q. It gives a tight bound on qr for multiplying dense square matrices. We extend it in two different ways: First, to sparse rectangular matrices; second, to a different measure of parallelism, namely, reducer workload w. We present tight bounds on qr and wr2, for multiplying sparse rectangular matrices. We also show that the lower bound on qr follows from the lower bound on wr2; so, the lower bound on wr2 is stronger. Next, consider two-pass algorithms. It has been shown that, for a given reducer size, the two-pass algorithm has less communication cost than the one-pass algorithm. We present tight bounds on qfrfrs and wfr2frs, for multiplying dense rectangular matrices; the subscripts f and s correspond to the first and second pass, respectively. Also, using our bound on qfrfrs, we present a tight bound on the total communication cost as a function of qf. Our lower bounds hold for the class of two-pass algorithms that perform all the real number multiplications in the first pass.","PeriodicalId":393776,"journal":{"name":"Proceedings of the 3rd ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121537439","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DFA minimization in map-reduce","authors":"G. Grahne, Shahab Harrafi, Iraj Hedayati, A. Moallemi","doi":"10.1145/2926534.2926537","DOIUrl":"https://doi.org/10.1145/2926534.2926537","url":null,"abstract":"We describe Map-Reduce implementations of two of the most prominent DFA minimization methods, namely Moore's and Hopcroft's algorithms. Our analysis shows that the one based on Hopcroft's algorithm is more efficient, both in terms of running time and communication cost. This is validated by our extensive experiments on various types of DFA's, with up to 217 states. It also turns out that both algorithms are sensitive to skewed input, the Hopcroft's algorithm being intrinsically so.","PeriodicalId":393776,"journal":{"name":"Proceedings of the 3rd ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond","volume":"345 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132351969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Toward elastic memory management for cloud data analytics","authors":"Jingjing Wang, M. Balazinska","doi":"10.1145/2926534.2926541","DOIUrl":"https://doi.org/10.1145/2926534.2926541","url":null,"abstract":"We present several key elements towards elastic memory management in modern big data systems. The goal of our approach is to avoid out-of-memory failures without over-provisioning but also to avoid garbage-collection overheads when possible.","PeriodicalId":393776,"journal":{"name":"Proceedings of the 3rd ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126956376","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Model-centric computation abstractions in machine learning applications","authors":"Bingjing Zhang, Bo Peng, J. Qiu","doi":"10.1145/2926534.2926539","DOIUrl":"https://doi.org/10.1145/2926534.2926539","url":null,"abstract":"We categorize parallel machine learning applications into four types of computation models and propose a new set of model-centric computation abstractions. This work sets up parallel machine learning as a combination of training data-centric and model parameter-centric processing. The analysis uses Latent Dirichlet Allocation (LDA) as an example, and experimental results show that an efficient parallel model update pipeline can achieve similar or higher model convergence speed compared with other work.","PeriodicalId":393776,"journal":{"name":"Proceedings of the 3rd ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127899899","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Some pairs problems","authors":"J. Ullman, Jonathan Ullman","doi":"10.1145/2926534.2926543","DOIUrl":"https://doi.org/10.1145/2926534.2926543","url":null,"abstract":"A common form of MapReduce application involves discovering relationships between certain pairs of inputs. Similarity joins serve as a good example of this type of problem, which we call a \"some-pairs\" problem. In the framework of [4], algorithms are measured by the tradeoff between reducer size (maximum number of inputs a reducer can handle) and the replication rate (average number of reducers to which an input must be sent. There are two obvious approaches to solving some-pairs problems in general. We show that no general-purpose MapReduce algorithm can beat both of these two algorithms in the worst case. We then explore a recursive algorithm for solving some-pairs problems and heuristics for beating the lower bound on common instances of the some-pairs class of problems.","PeriodicalId":393776,"journal":{"name":"Proceedings of the 3rd ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122331611","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}