Proceedings of the 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond: Latest Articles

FlameStream
I. Kuralenok, Artem Trofimov, Nikita Marshalkin, Boris Novikov
DOI: 10.1145/3206333.3209273 · Published: 2018-06-15
Abstract: Exactly-once semantics without high latency overhead is still hard to achieve within state-of-the-art stream processing systems. We introduce a model that provides exactly-once semantics using a lightweight optimistic approach to obtain determinism and idempotence, and we show its feasibility with a prototype.
Citations: 6
MapRDD
Zhenyu Li, Stephen Jarvis
DOI: 10.1145/3206333.3206335 · Published: 2018-06-15
Abstract: The Resilient Distributed Dataset (RDD) is the core memory abstraction behind the popular data-analytics framework Apache Spark. We present an extension to the RDD for map transformations, which we call MapRDD, that exploits the underlying relations between records in the parent and child datasets to achieve random access to individual records in a partition. The design is complemented by a new MemoryStore, which manages data sampling and data transfers asynchronously. We use the ImageNet dataset to demonstrate that: (I) the initial data-loading phase is redundant and can be completely avoided; (II) sampling on the CPU can be entirely overlapped with training on the GPU to achieve near-full occupancy; (III) CPU processing cycles and memory usage can be reduced by more than 90%, allowing other applications to run simultaneously; and (IV) constant training step time can be achieved regardless of the size of the partition, for up to 1.3 million records in our experiments. We expect to obtain the same improvements in other RDD transformations via further research on finer-grained implicit and explicit dataset relations.
Citations: 1
Exploiting Data Partitioning To Provide Approximate Results
Bruhathi Sundarmurthy, Paraschos Koutris, J. Naughton
DOI: 10.1145/3206333.3206337 · Published: 2018-06-15
Abstract: Co-hash partitioning is a popular partitioning strategy in distributed query processing, in which tables are co-located using join predicates. In this paper, we study the benefits of co-hash partitioning for obtaining approximate answers.
Citations: 5
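To illustrate the setting the abstract describes (a minimal sketch under assumed semantics, not the paper's method): under co-hash partitioning, tuples of two tables that agree on the join-key hash land in the same partition, so each partition can be joined locally, and the join of any single partition is a sound subset of the full join result. All function and variable names below are illustrative.

```python
from collections import defaultdict

def co_hash_partition(r_rows, s_rows, key_r, key_s, n_parts):
    """Co-locate tuples of R and S that share a join-key hash,
    so each partition can be joined without cross-node communication."""
    parts = [([], []) for _ in range(n_parts)]
    for row in r_rows:
        parts[hash(row[key_r]) % n_parts][0].append(row)
    for row in s_rows:
        parts[hash(row[key_s]) % n_parts][1].append(row)
    return parts

def local_join(part, key_r, key_s):
    """Hash join within one partition; over a subset of partitions this
    yields an approximate (sound but possibly incomplete) answer."""
    r_part, s_part = part
    index = defaultdict(list)
    for r in r_part:
        index[r[key_r]].append(r)
    return [(r, s) for s in s_part for r in index[s[key_s]]]
```

The union of `local_join` over all partitions equals the exact join; stopping after a few partitions gives the approximate result the paper studies.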
Six Pass MapReduce Implementation of Strassen's Algorithm for Matrix Multiplication
Prakash V. Ramanan
DOI: 10.1145/3206333.3206336 · Published: 2018-06-15
Abstract: Consider the multiplication of two n x n matrices. A straightforward sequential algorithm for computing the product takes Θ(n^3) time. Strassen [21] presented an algorithm that takes Θ(n^(lg 7)) time, where lg denotes logarithm to base 2; lg 7 is about 2.81. Now consider the implementation of these two algorithms (straightforward and Strassen) in the MapReduce framework. Several papers have studied MapReduce implementations of the straightforward algorithm, which can be implemented using a constant number (typically one or two) of MapReduce passes. In this paper, we study the MapReduce implementation of Strassen's algorithm. If we unwind the recursion, Strassen's algorithm consists of three parts, Parts I--III. Direct MapReduce implementations of the three parts take lg n, 1, and lg n passes, respectively, for a total of 2 lg n + 1 passes. In a previous paper [7], we showed that Part I can be implemented in two passes, with total work Θ(n^(lg 7)), and reducer size and reducer workload o(n). In this paper, we show that Part III can be implemented in three passes. So, overall, Strassen's algorithm can be implemented in six passes, with total work Θ(n^(lg 7)), and reducer size and reducer workload o(n).
Citations: 4
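For reference, the classic sequential recursion underlying the paper's three parts: Part I forms sums and differences of input blocks, Part II computes the seven recursive products, and Part III combines the products into the output blocks. This is a plain sequential sketch (assuming n is a power of two), not the six-pass MapReduce scheme itself.

```python
def strassen(A, B):
    """Strassen's recursion on n x n matrices (n a power of two):
    seven recursive products instead of eight."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    def quad(M, i, j):  # extract the (i, j) block of size h x h
        return [row[j*h:(j+1)*h] for row in M[i*h:(i+1)*h]]
    def add(X, Y):
        return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    def sub(X, Y):
        return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    A11, A12, A21, A22 = quad(A, 0, 0), quad(A, 0, 1), quad(A, 1, 0), quad(A, 1, 1)
    B11, B12, B21, B22 = quad(B, 0, 0), quad(B, 0, 1), quad(B, 1, 0), quad(B, 1, 1)
    # Part I (block sums/differences) feeding Part II (seven products):
    M1 = strassen(add(A11, A22), add(B11, B22))
    M2 = strassen(add(A21, A22), B11)
    M3 = strassen(A11, sub(B12, B22))
    M4 = strassen(A22, sub(B21, B11))
    M5 = strassen(add(A11, A12), B22)
    M6 = strassen(sub(A21, A11), add(B11, B12))
    M7 = strassen(sub(A12, A22), add(B21, B22))
    # Part III: combine the seven products into the four output blocks
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(sub(add(M1, M3), M2), M6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bot = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bot
```

In the MapReduce setting, each level of this recursion would naively cost a pass for Part I and for Part III, which is what the paper's two- and three-pass implementations avoid.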
Distribution-Aware Stream Partitioning for Distributed Stream Processing Systems
Anil Pacaci, M. Tamer Özsu
DOI: 10.1145/3206333.3206338 · Published: 2018-06-15
Abstract: The performance of modern distributed stream processing systems depends largely on balanced distribution of the workload across the cluster. Input streams with large, skewed key domains pose challenges to these systems, especially for stateful applications. Key splitting, in which the state of a single key is partially maintained across multiple workers, is a simple yet effective technique for reducing load imbalance in such systems. However, it comes at the cost of increased memory overhead, which existing techniques have so far neglected. In this paper, we present a novel stream partitioning algorithm for intra-operator parallelism that adapts to the underlying stream distribution in an online manner and provides near-optimal load imbalance with minimal memory overhead. Our technique explicitly routes frequent items using a greedy heuristic that considers both load imbalance and space requirements, and uses hashing for infrequent items to keep the routing table small. Through extensive experimentation with real and synthetic datasets, we show that our proposed solution consistently provides near-optimal load imbalance and memory footprint over a variety of distributions. Our experiments on Apache Storm show up to an order-of-magnitude increase in overall throughput and up to 80% space savings over state-of-the-art stream partitioning techniques.
Citations: 9
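The general shape of such a hybrid partitioner can be sketched as follows. This is an illustrative simplification, not the paper's exact heuristic: the frequency threshold and the two-way split of hot keys are assumptions made for the example.

```python
from collections import Counter

class KeySplitPartitioner:
    """Illustrative hybrid partitioner: a key whose observed frequency
    exceeds a threshold gets an explicit routing-table entry and is split
    across a small set of workers (each tuple goes to the currently
    least-loaded of them); all other keys are simply hashed, so the
    routing table stays small."""
    def __init__(self, n_workers, threshold=0.01):
        self.n = n_workers
        self.threshold = threshold   # assumed frequency cutoff for "hot" keys
        self.counts = Counter()
        self.total = 0
        self.load = [0] * n_workers
        self.routing = {}            # hot key -> list of candidate workers

    def route(self, key):
        self.counts[key] += 1
        self.total += 1
        if key not in self.routing and self.counts[key] > self.threshold * self.total:
            h = hash(key)
            # split the hot key over two workers (an assumed choice)
            self.routing[key] = [h % self.n, (h + 1) % self.n]
        if key in self.routing:
            # greedy: send to the least-loaded worker allowed for this key
            w = min(self.routing[key], key=lambda i: self.load[i])
        else:
            w = hash(key) % self.n   # infrequent keys: plain hashing
        self.load[w] += 1
        return w
```

Splitting a hot key across workers means downstream aggregation must merge partial state for that key, which is exactly the memory overhead the paper's heuristic weighs against load balance.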
Latency-conscious dataflow reconfiguration
Moritz Hoffmann, Frank McSherry, Andrea Lattuada
DOI: 10.1145/3206333.3206334 · Published: 2018-06-15
Abstract: We propose a prototype incremental data-migration mechanism for stateful distributed data-parallel dataflow engines with latency objectives. Compared to existing scaling mechanisms, our prototype has the following differentiating characteristics: (i) the mechanism provides tunable granularity for avoiding latency spikes, (ii) reconfigurations can be prepared ahead of time to avoid runtime coordination, and (iii) the implementation relies only on existing dataflow APIs and does not require system modifications. We demonstrate our proposal on example computations with varying amounts of state to be migrated, a non-trivial task for systems like Dhalion and Flink. Our implementation, prototyped on Timely Dataflow, provides a scalable stateful operator template, compatible with existing APIs, that carefully reorganizes data to minimize migration overhead. Compared to naïve approaches, we reduce service latencies by orders of magnitude.
Citations: 4
Automatic Caching Decision for Scientific Dataflow Execution in Apache Spark
V. Gottin, Edward Pacheco, Jonas Dias, A. Ciarlini, B. Costa, Wagner Vieira, Y. M. Souto, Paulo F. Pires, F. Porto, J. G. Rittmeyer
DOI: 10.1145/3206333.3206339 · Published: 2018-06-15
Abstract: Demands for large-scale data analysis and processing have led to the development and widespread adoption of computing frameworks that leverage in-memory data processing, largely outperforming disk-based processing systems. One such framework is Apache Spark, which adopts a lazy-evaluation execution model: the execution of a transformation in a dataflow is delayed until its results are required by an action. Furthermore, a transformation's results are not kept in memory by default, and the same transformation must be re-executed whenever required by another action. To spare unnecessary re-execution of entire pipelines of frequently referenced operations, Spark lets the programmer explicitly define a cache operation to persist transformation results. However, many factors affect the efficiency of a cache in a dataflow, including the existence of other cache operations. Thus, even with a reasonably small number of transformations, choosing the optimal combination of cache operations poses a nontrivial problem. The problem is highlighted by the fact that intuitive strategies -- especially when considered in isolation -- may actually harm dataflow efficiency. In this work, we present an automatic procedure to compute a substantially optimal combination of cache operations given a dataflow definition and a simple model of operation costs. Our results on an astronomy dataflow use case show that our algorithm is resilient to changes in the dataflow and cost model, and that it outperforms intuitive strategies, consistently deciding on a substantially optimal combination of caches.
Citations: 11
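The combinatorial flavor of the problem can be shown on a toy model (entirely our assumption, not the paper's cost model): a linear chain of transformations t1..tn, where an action on node i recomputes everything after the nearest cached ancestor, and each cached node pays a one-time materialization plus a storage penalty. Brute-force enumeration over cache subsets then finds the cheapest combination.

```python
from itertools import combinations

def total_cost(costs, actions, cached, cache_penalty=1.0):
    """Toy cost model for a linear dataflow t1..tn (0-indexed).
    `actions` lists the node each action reads; a cached node is
    computed once (from its own nearest cached ancestor) and reused."""
    total = 0.0
    for v in sorted(cached):
        start = max([c for c in cached if c < v], default=-1)
        total += sum(costs[start + 1:v + 1]) + cache_penalty
    for a in actions:
        start = max([c for c in cached if c <= a], default=-1)
        total += sum(costs[start + 1:a + 1])  # zero if `a` itself is cached
    return total

def best_cache_set(costs, actions, cache_penalty=1.0):
    """Exhaustively pick the cache combination with minimal total cost."""
    nodes = range(len(costs))
    return min((frozenset(s) for r in range(len(costs) + 1)
                for s in combinations(nodes, r)),
               key=lambda s: total_cost(costs, actions, s, cache_penalty))
```

For example, with costs [10, 1, 1] and three actions on the last node, caching only the final node (cost 13: one full computation plus the penalty) beats both no caching (36) and the "cache the expensive operation" intuition of caching node 0 (17), echoing the paper's point that isolated intuitive choices can be suboptimal.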
Adaptive MapReduce Similarity Joins
Samuel McCauley, Francesco Silvestri
DOI: 10.1145/3206333.3206340 · Published: 2018-04-16
Abstract: Similarity joins are a fundamental database operation. Given data sets S and R, the goal of a similarity join is to find all points x ∈ S and y ∈ R with distance at most r. Recent research has investigated how locality-sensitive hashing (LSH) can be used for similarity joins, and two recent lines of work have made exciting progress on LSH-based join performance. Hu, Tao, and Yi (PODS 17) investigated joins in a massively parallel setting, showing strong results that adapt to the size of the output. Meanwhile, Ahle, Aumüller, and Pagh (SODA 17) gave a sequential algorithm that adapts to the structure of the data, matching classic bounds in the worst case but improving on them significantly for more structured data. We show that this adaptive strategy can be carried over to the parallel setting, combining the advantages of both approaches. In particular, we show that a simple modification to Hu et al.'s algorithm achieves bounds that depend on the density of points in the dataset as well as the total size of the output. Our algorithm uses no extra parameters over other LSH approaches (in particular, its execution does not depend on the structure of the dataset), and is likely to be efficient in practice.
Citations: 10
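The LSH join primitive these works build on can be sketched for Hamming distance with bit-sampling hashes (a textbook sketch, not the paper's algorithm; the repetition and sample-size parameters are illustrative, not tuned): in each repetition, vectors are bucketed by their values on a random sample of bit positions, and colliding cross-relation pairs are verified exactly, so every reported pair is a true match.

```python
import random
from collections import defaultdict

def lsh_hamming_join(S, R, r, dim, n_reps=20, bits_per_band=8):
    """Candidate generation + exact verification for a similarity join
    under Hamming distance. Vectors are 0/1 tuples of length `dim`.
    Returns index pairs (i, j) with dist(S[i], R[j]) <= r that collided
    in at least one repetition (reported pairs are always correct; a
    true pair may be missed with probability shrinking in n_reps)."""
    rng = random.Random(42)  # fixed seed for reproducibility
    out = set()
    for _ in range(n_reps):
        positions = [rng.randrange(dim) for _ in range(bits_per_band)]
        buckets = defaultdict(lambda: ([], []))
        for i, x in enumerate(S):
            buckets[tuple(x[p] for p in positions)][0].append(i)
        for j, y in enumerate(R):
            buckets[tuple(y[p] for p in positions)][1].append(j)
        for s_ids, r_ids in buckets.values():
            for i in s_ids:
                for j in r_ids:
                    if sum(a != b for a, b in zip(S[i], R[j])) <= r:
                        out.add((i, j))
    return out
```

The cost of the verification loop is driven by how many non-matching pairs collide, which is exactly the quantity the adaptive, density-dependent bounds in the paper control.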
Proceedings of the 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond
F. Afrati, J. Sroka, J. Hidders
DOI: 10.1145/3206333 · Published: 2016-06-26
Abstract: The papers in this volume were presented at the 3rd International Workshop on Algorithms and Systems for MapReduce and Beyond (BeyondMR 2016), held in San Francisco, CA, US on July 1, 2016. The workshop was co-located with ACM SIGMOD, and attracted 19 submissions, of which 10 were selected by the program committee for oral presentation and for publication in this volume. This corresponds to an acceptance rate of 53%, which indicates the high level of activity in the domain of the workshop and its ability to attract many good papers.
Citations: 0