Title: FlameStream
Authors: I. Kuralenok, Artem Trofimov, Nikita Marshalkin, Boris Novikov
DOI: 10.1145/3206333.3209273
Abstract: Exactly-once semantics without high latency overhead is still hard to achieve in state-of-the-art stream processing systems. We introduce a model that provides exactly-once semantics through a lightweight optimistic approach for obtaining determinism and idempotence, and we show its feasibility with a prototype.
{"title":"MapRDD","authors":"Zhenyu Li, Stephen Jarvis","doi":"10.1145/3206333.3206335","DOIUrl":"https://doi.org/10.1145/3206333.3206335","url":null,"abstract":"The Resilient Distributed Dataset (RDD) is the core memory abstraction behind the popular data-analytic framework Apache Spark. We present an extension to the Resilient Distributed Dataset for map transformations, that we call MapRDD, which takes advantage of the underlying relations between records in the parent and child datasets, in order to achieve random-access of individual records in a partition. The design is complemented by a new MemoryStore, which manages data sampling and data transfers asynchronously. We use the ImageNet dataset to demonstrate that: (I) The initial data loading phase is redundant and can be completely avoided; (II) Sampling on the CPU can be entirely overlapped with training on the GPU to achieve near full occupancy; (III) CPU processing cycles and memory usage can be reduced by more than 90%, allowing other applications to be run simultaneously; (IV) Constant training step time can be achieved, regardless of the size of the partition, for up to 1.3 million records in our experiments. We expect to obtain the same improvements in other RDD transformations via further research on finer-grained implicit & explicit dataset relations.","PeriodicalId":253916,"journal":{"name":"Proceedings of the 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129379205","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Exploiting Data Partitioning To Provide Approximate Results
Authors: Bruhathi Sundarmurthy, Paraschos Koutris, J. Naughton
DOI: 10.1145/3206333.3206337
Abstract: Co-hash partitioning is a popular partitioning strategy in distributed query processing, in which tables are co-located using join predicates. In this paper, we study the benefits of co-hash partitioning for obtaining approximate answers.
{"title":"Six Pass MapReduce Implementation of Strassen's Algorithm for Matrix Multiplication","authors":"Prakash V. Ramanan","doi":"10.1145/3206333.3206336","DOIUrl":"https://doi.org/10.1145/3206333.3206336","url":null,"abstract":"Consider the multiplication of two n x n matrices. A straight-forward sequential algorithm for computing the product takes Θ(n3) time. Strassen [21] presented an algorithm that takes Θ(nlg 7) time; lg denotes logarithm to the base 2; lg 7 is about 2.81. Now, consider the implementation of these two algorithms (straightforward and Strassen) in the mapReduce framework. Several papers have studied mapReduce implementations of the straight-forward algorithm; this algorithm can be implemented using a constant number (typically, one or two) of mapReduce passes. In this paper, we study the mapReduce implementation of Strassen's algorithm. If we unwind the recursion, Strassen's algorithm consists of three parts, Parts I--III. Direct mapReduce implementations of the three parts take lg n, 1 and lg n passes, respectively; total number of passes is 2 lg n + 1. In a previous paper [7], we showed that Part I can be implemented in 2 passes, with total work Θ(nlg 7), and reducer size and reducer workload o(n). In this paper, we show that Part III can be implemented in three passes. So, overall, Strassen's algorithm can be implemented in six passes, with total work Θ(nlg 7), and reducer size and reducer workload o(n).","PeriodicalId":253916,"journal":{"name":"Proceedings of the 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125423272","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distribution-Aware Stream Partitioning for Distributed Stream Processing Systems","authors":"Anil Pacaci, M. Tamer Özsu","doi":"10.1145/3206333.3206338","DOIUrl":"https://doi.org/10.1145/3206333.3206338","url":null,"abstract":"The performance of modern distributed stream processing systems is largely dependent on balanced distribution of the workload across cluster. Input streams with large, skewed domains pose challenges to these systems, especially for stateful applications. Key splitting, where state of a single key is partially maintained across multiple workers, is a simple yet effective technique to reduce load imbalance in such systems. However it comes with the cost of increased memory overhead which has been neglected by existing techniques so far. In this paper we present a novel stream partitioning algorithm for intra-operator parallelism which adapts to the underlying stream distribution in an online manner and provides near-optimal load imbalance with minimal memory overhead. Our technique relies on explicitly routing frequent items using a greedy heuristic which considers both load imbalance and space requirements. It uses hashing for in frequent items to keep the size of routing table small. Through extensive experimentation with real and synthetic datasets, we show that our proposed solution consistently provides near-optimal load imbalance and memory footprint over variety of distributions. Our experiments on Apache Storm show up to an order of magnitude increase in overall throughput and up to 80% space savings over state-of-the-art stream partitioning techniques.","PeriodicalId":253916,"journal":{"name":"Proceedings of the 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond","volume":"74 274 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125964717","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Latency-conscious dataflow reconfiguration","authors":"Moritz Hoffmann, Frank McSherry, Andrea Lattuada","doi":"10.1145/3206333.3206334","DOIUrl":"https://doi.org/10.1145/3206333.3206334","url":null,"abstract":"We propose a prototype incremental data migration mechanism for stateful distributed data-parallel dataflow engines with latency objectives. When compared to existing scaling mechanisms, our prototype has the following differentiating characteristics: (i) the mechanism provides tunable granularity for avoiding latency spikes, (ii) reconfigurations can be prepared ahead of time to avoid runtime coordination, and (iii) the implementation only relies on existing dataflow APIs and need not require system modifications. We demonstrate our proposal on example computations with varying amounts of state that needs to be migrated, which is a non-trivial task for systems like Dhalion and Flink. Our implementation, prototyped on Timely Dataflow, provides a scalable stateful operator template compatible with existing APIs that carefully reorganizes data to minimize migration overhead. Compared to naïve approaches we reduce service latencies by orders of magnitude.","PeriodicalId":253916,"journal":{"name":"Proceedings of the 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond","volume":"99 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123202767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Automatic Caching Decision for Scientific Dataflow Execution in Apache Spark
Authors: V. Gottin, Edward Pacheco, Jonas Dias, A. Ciarlini, B. Costa, Wagner Vieira, Y. M. Souto, Paulo F. Pires, F. Porto, J. G. Rittmeyer
DOI: 10.1145/3206333.3206339
Abstract: Demands for large-scale data analysis and processing have led to the development and widespread adoption of computing frameworks that leverage in-memory data processing, largely outperforming disk-based systems. One such framework is Apache Spark, which adopts a lazy-evaluation execution model: the execution of a transformation in a dataflow is delayed until its results are required by an action. Furthermore, a transformation's results are not kept in memory by default, so the same transformation must be re-executed whenever another action requires it. To spare unnecessary re-execution of entire pipelines of frequently referenced operations, Spark lets the programmer explicitly define cache operations that persist transformation results. However, many factors affect the efficiency of a cache in a dataflow, including the existence of other cache operations. Thus, even with a reasonably small number of transformations, choosing the optimal combination of cache operations poses a nontrivial problem, highlighted by the fact that intuitive strategies, especially when considered in isolation, may actually harm dataflow efficiency. In this work, we present an automatic procedure to compute a substantially optimal combination of cache operations given a dataflow definition and a simple cost model for the operations. Our results on an astronomy dataflow use case show that our algorithm is resilient to changes in the dataflow and cost model, and that it outperforms intuitive strategies, consistently deciding on a substantially optimal combination of caches.
{"title":"Adaptive MapReduce Similarity Joins","authors":"Samuel McCauley, Francesco Silvestri","doi":"10.1145/3206333.3206340","DOIUrl":"https://doi.org/10.1145/3206333.3206340","url":null,"abstract":"Similarity joins are a fundamental database operation. Given data sets S and R, the goal of a similarity join is to find all points x ∈ S and y ∈ R with distance at most r. Recent research has investigated how locality-sensitive hashing (LSH) can be used for similarity join, and in particular two recent lines of work have made exciting progress on LSH-based join performance. Hu, Tao, and Yi (PODS 17) investigated joins in a massively parallel setting, showing strong results that adapt to the size of the output. Meanwhile, Ahle, Aumüller, and Pagh (SODA 17) showed a sequential algorithm that adapts to the structure of the data, matching classic bounds in the worst case but improving them significantly on more structured data. We show that this adaptive strategy can be adapted to the parallel setting, combining the advantages of these approaches. In particular, we show that a simple modification to Hu et al.'s algorithm achieves bounds that depend on the density of points in the dataset as well as the total outsize of the output. Our algorithm uses no extra parameters over other LSH approaches (in particular, its execution does not depend on the structure of the dataset), and is likely to be efficient in practice.","PeriodicalId":253916,"journal":{"name":"Proceedings of the 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114472773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond","authors":"F. Afrati, J. Sroka, J. Hidders","doi":"10.1145/3206333","DOIUrl":"https://doi.org/10.1145/3206333","url":null,"abstract":"The papers in this volume were presented at the 3rd International Workshop on Algorithms and Systems for MapReduce and Beyond (BeyondMR 2016), held in San Francisco, CA, US on July 1, 2016. The workshop was co-located with ACM SIGMOD, and attracted 19 submissions, of which 10 were selected by the program committee for oral presentation and for publication in this volume. This corresponds to an acceptance rate of 53%, which indicates the high level of activity in the domain of the workshop and its ability to attract many good papers.","PeriodicalId":253916,"journal":{"name":"Proceedings of the 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127168047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}