Proceedings of the Twenty-Fifth Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2013), July 23, 2013.

Balls-into-bins with nearly optimal load distribution
P. Berenbrink, K. Khodamoradi, Thomas Sauerwald, Alexandre O. Stauffer
DOI: https://doi.org/10.1145/2486159.2486191
Abstract: We consider sequential balls-into-bins processes that randomly allocate m balls into n bins. We analyze two allocation schemes that achieve a close-to-optimal maximum load of ⌈m/n⌉ + 1 and require only O(m) (expected) allocation time. These parameters should be compared with the classic d-choice process, which achieves a maximum load of m/n + (log log n)/(log d) + O(1) and requires m·d allocation time.
{"title":"Session details: Session 7","authors":"Robert Elsässer","doi":"10.1145/3250644","DOIUrl":"https://doi.org/10.1145/3250644","url":null,"abstract":"","PeriodicalId":353007,"journal":{"name":"Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures","volume":"326 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124606450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Locality-aware task management for unstructured parallelism: a quantitative limit study
Richard M. Yoo, C. Hughes, Changkyu Kim, Yen-kuang Chen, C. Kozyrakis
DOI: https://doi.org/10.1145/2486159.2486175
Abstract: As we increase the number of cores on a processor die, the on-chip cache hierarchies that support these cores are getting larger, deeper, and more complex. As a result, non-uniform memory access effects are now prevalent even on a single chip. To reduce execution time and energy consumption, data access locality should be exploited. This is especially important for task-based programming systems, where a scheduler decides when and where on the chip the code segments, i.e., tasks, should execute. Capturing locality for structured task parallelism has been done effectively, but the more difficult case, unstructured parallelism, remains largely unsolved: little quantitative analysis exists to demonstrate the potential of locality-aware scheduling and to guide future scheduler implementations in the most fruitful direction. This paper quantifies the potential of locality-aware scheduling for unstructured parallelism on three different many-core processors. Our simulation results of 32-core systems show that locality-aware scheduling can bring up to 2.39x speedup over a randomized schedule, and 2.05x speedup over a state-of-the-art baseline scheduling scheme. At the same time, a locality-aware schedule reduces average energy consumption by 55% and 47%, relative to the random and the baseline schedule, respectively. In addition, our 1024-core simulation results project that these benefits will only increase: compared to 32-core executions, we see up to 1.83x additional locality benefits. To capture this potential in a practical setting, we also perform a detailed scheduler design-space exploration to quantify the impact of different scheduling decisions. We further highlight the importance of locality-aware stealing, and demonstrate that a stealing scheme can exploit significant locality while performing load balancing. Over randomized stealing, our proposed scheme shows up to 2.0x speedup for stolen tasks.
Fast greedy algorithms in mapreduce and streaming
Ravi Kumar, Benjamin Moseley, Sergei Vassilvitskii, Andrea Vattani
DOI: https://doi.org/10.1145/2486159.2486168
Abstract: Greedy algorithms are practitioners' best friends: they are intuitive, simple to implement, and often lead to very good solutions. However, implementing greedy algorithms in a distributed setting is challenging since the greedy choice is inherently sequential, and it is not clear how to take advantage of the extra processing power. Our main result is a powerful sampling technique that aids in parallelization of sequential algorithms. We then show how to use this primitive to adapt a broad class of greedy algorithms to the MapReduce paradigm; this class includes maximum cover and submodular maximization subject to p-system constraints. Our method yields efficient algorithms that run in a logarithmic number of rounds, while obtaining solutions that are arbitrarily close to those produced by the standard sequential greedy algorithm. We begin with algorithms for modular maximization subject to a matroid constraint, and then extend this approach to obtain approximation algorithms for submodular maximization subject to knapsack or p-system constraints. Finally, we empirically validate our algorithms, and show that they achieve the same solution quality as standard greedy algorithms but run in substantially fewer rounds.
{"title":"IRIS: a robust information system against insider dos-attacks","authors":"Martina Eikel, C. Scheideler","doi":"10.1145/2486159.2486186","DOIUrl":"https://doi.org/10.1145/2486159.2486186","url":null,"abstract":"In this work we present the first scalable distributed information system, i.e., a system with low storage overhead, that is provably robust against Denial-of-Service (DoS) attacks by a current insider. We allow a current insider to have complete knowledge about the information system and to have the power to block any ξ-fraction of its servers by a DoS-attack, where ξ can be chosen up to a constant. The task of the system is to serve any collection of lookup requests with at most one per non-blocked server in an efficient way despite this attack. Previously, scalable solutions were only known for DoS-attacks of past insiders, where a past insider only has complete knowledge about some past time point t0 of the information system. Scheideler et al. [2, 3] showed that in this case it is possible to design an information system so that any information that was inserted or last updated after t0 is safe against a DoS-attack. But their constructions would not work at all for a current insider. The key idea behind our IRIS system is to make extensive use of coding. More precisely, we present two alternative distributed coding strategies with an at most logarithmic storage overhead that can handle up to a constant fraction of blocked servers.","PeriodicalId":353007,"journal":{"name":"Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132175365","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SybilCast: broadcast on the open airwaves (extended abstract)","authors":"Seth Gilbert, Chaodong Zheng","doi":"10.1145/2486159.2486172","DOIUrl":"https://doi.org/10.1145/2486159.2486172","url":null,"abstract":"Consider a scenario where many wireless users are attempting to download data from a single base station. While most of the users are honest, some users may be malicious and attempt to obtain more than their fair share of the bandwidth. One possible strategy for attacking the system is to simulate multiple fake identities, each of which is given its own equal share of the bandwidth. Such an attack is often referred to as a sybil attack. To counter such behavior, we propose SybilCast, a protocol for multichannel wireless networks that limits the number of fake identities, and in doing so, ensures that each honest user gets at least a constant fraction of their fair share of the bandwidth. As a result, each honest user can complete his or her data download in asymptotically optimal time. A key aspect of this protocol is balancing the rate at which new identities are admitted and the maximum number of fake identities that can co-exist, while keeping the overhead low. Besides sybil attacks, our protocol can also tolerate spoofing and jamming.","PeriodicalId":353007,"journal":{"name":"Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures","volume":"283 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127489017","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On-the-fly pipeline parallelism
I. Lee, C. Leiserson, T. Schardl, Jim Sukha, Zhunping Zhang
DOI: https://doi.org/10.1145/2486159.2486174
Abstract: Pipeline parallelism organizes a parallel program as a linear sequence of s stages. Each stage processes elements of a data stream, passing each processed data element to the next stage, and then taking on a new element before the subsequent stages have necessarily completed their processing. Pipeline parallelism is used especially in streaming applications that perform video, audio, and digital signal processing. Three out of 13 benchmarks in PARSEC, a popular software benchmark suite designed for shared-memory multiprocessors, can be expressed as pipeline parallelism. Whereas most concurrency platforms that support pipeline parallelism use a "construct-and-run" approach, this paper investigates "on-the-fly" pipeline parallelism, where the structure of the pipeline emerges as the program executes rather than being specified a priori. On-the-fly pipeline parallelism allows the number of stages to vary from iteration to iteration and dependencies to be data dependent. We propose simple linguistics for specifying on-the-fly pipeline parallelism and describe a provably efficient scheduling algorithm, the Piper algorithm, which integrates pipeline parallelism into a work-stealing scheduler, allowing pipeline and fork-join parallelism to be arbitrarily nested. The Piper algorithm automatically throttles the parallelism, precluding "runaway" pipelines. Given a pipeline computation with T1 work and T∞ span (critical-path length), Piper executes the computation on P processors in TP ≤ T1/P + O(T∞ + lg P) expected time. Piper also limits stack space, ensuring that it does not grow unboundedly with running time. We have incorporated on-the-fly pipeline parallelism into a Cilk-based work-stealing runtime system. Our prototype Cilk-P implementation exploits optimizations such as lazy enabling and dependency folding. We have ported the three PARSEC benchmarks that exhibit pipeline parallelism to run on Cilk-P. One of these, x264, cannot readily be executed by systems that support only construct-and-run pipeline parallelism. Benchmark results indicate that Cilk-P has low serial overhead and good scalability. On x264, for example, Cilk-P exhibits a speedup of 13.87 over its respective serial counterpart when running on 16 processors.
{"title":"Parallel rotor walks on finite graphs and applications in discrete load balancing","authors":"Hoda Akbari, P. Berenbrink","doi":"10.1145/2486159.2486178","DOIUrl":"https://doi.org/10.1145/2486159.2486178","url":null,"abstract":"We study the parallel rotor walk process, which works as follows: Consider a graph along with an arbitrary distribution of tokens over its nodes. Every node is equipped with a rotor that points to its neighbours in a fixed circular order. In each round, every node distributes all of its tokens using the rotor. One token is allocated to the neighbour pointed at by the rotor, then the rotor moves to the subsequent neighbour, and so on, until no token remains. The process can be considered as a deterministic analogue of a process in which tokens perform one independent random walk step in each round. We compare the distribution of tokens in the rotor walk process with expected distribution in the random walk model. The similarity between the two processes is measured by their discrepancy, which is the maximum difference between the corresponding distribution entries over all rounds and nodes. We analyze a lazy variation of rotor walks that simulates a random walk with loop probability of 1/2 on each node, and each node sends not all its tokens, but every other token in each round. Viewing the rotor walk as a load balancing process, we prove that the rotor walk falls in the class of bounded-error diffusion processes introduced in [11]. This gives us discrepancy bounds of O(log3/2 n) and O(1) for hypercube and r-dimensional torus with r=O(1), respectively, which improve over the best existing bounds of O(log2 n) and O(n1/r). Also, as a result of switching to the load balancing view, we observe that the existing load balancing results can be translated to rotor walk discrepancy bounds not previously noticed in the rotor walk literature. We also use the idea of rotor walks to propose and analyze a randomized rounding discrete load balancing process that achieves the same balancing quality as similar protocols [11, 3], but uses fewer number of random bits compared to [3], and avoids the negative load problem of [11].","PeriodicalId":353007,"journal":{"name":"Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130607690","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient online scheduling for deadline-sensitive jobs: extended abstract
Brendan Lucier, Ishai Menache, J. Naor, Jonathan Yaniv
DOI: https://doi.org/10.1145/2486159.2486187
Abstract: We consider mechanisms for online deadline-aware scheduling in large computing clusters. Batch jobs that run on such clusters often require guarantees on their completion time (i.e., deadlines). However, most existing scheduling systems implement fair-share resource allocation between users, an approach that ignores heterogeneity in job requirements and may cause deadlines to be missed. In our framework, jobs arrive dynamically and are characterized by their value and total resource demand (or an estimation thereof), along with their reported deadlines. The scheduler's objective is to maximize the aggregate value of jobs completed by their deadlines. We circumvent known lower bounds for this problem by assuming that the input has slack, meaning that any job could be delayed and still finish by its deadline. Under the slackness assumption, we design a preemptive scheduler with a constant-factor worst-case performance guarantee. Along the way, we pay close attention to practical aspects, such as runtime efficiency, data locality, and demand uncertainty. We evaluate the algorithm via simulations over real job traces taken from a large production cluster, and show that its actual performance is significantly better than other heuristics used in practice. We then extend our framework to handle provider commitments: the requirement that jobs admitted to service must be executed until completion. We prove that no algorithm can obtain worst-case guarantees when the commitment decision must be made upon job arrival. Nevertheless, we design efficient heuristics that commit on job admission, in the spirit of our basic algorithm. We show empirically that these heuristics perform just as well as (or better than) the original algorithm. Finally, we discuss how our scheduling framework can be used to design truthful scheduling mechanisms, motivated by applications to commercial public cloud offerings.
{"title":"Online batch scheduling for flow objectives","authors":"Sungjin Im, Benjamin Moseley","doi":"10.1145/2486159.2486161","DOIUrl":"https://doi.org/10.1145/2486159.2486161","url":null,"abstract":"Batch scheduling gives a powerful way of increasing the throughput by aggregating multiple homogeneous jobs. It has applications in large scale manufacturing as well as in server scheduling. In batch scheduling, when explained in the setting of server scheduling, the server can process requests of the same type up to a certain number simultaneously. Batch scheduling can be seen as capacitated broadcast scheduling, a popular model considered in scheduling theory. In this paper, we consider an online batch scheduling model. For this model we address flow time objectives for the first time and give positive results for average flow time, the k-norms of flow time and maximum flow time. For average flow time and the k-norms of flow time we show algorithms that are O(1)-competitive with a small constant amount of resource augmentation. For maximum flow time we show a 2-competitive algorithm and this is the best possible competitive ratio for any online algorithm.","PeriodicalId":353007,"journal":{"name":"Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123059629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}