Proceedings of the Twenty-Fifth Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2013), July 23, 2013.

Balls-into-bins with nearly optimal load distribution
P. Berenbrink, K. Khodamoradi, Thomas Sauerwald, Alexandre O. Stauffer
DOI: https://doi.org/10.1145/2486159.2486191
Abstract: We consider sequential balls-into-bins processes that randomly allocate m balls into n bins. We analyze two allocation schemes that achieve a close-to-optimal maximum load of ⌈m/n⌉ + 1 and require only O(m) (expected) allocation time. These parameters should be compared with the classic d-choice process, which achieves a maximum load of m/n + (log log n)/(log d) + O(1) and requires m·d allocation time.
{"title":"Session details: Session 7","authors":"Robert Elsässer","doi":"10.1145/3250644","DOIUrl":"https://doi.org/10.1145/3250644","url":null,"abstract":"","PeriodicalId":353007,"journal":{"name":"Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures","volume":"326 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124606450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Locality-aware task management for unstructured parallelism: a quantitative limit study
Richard M. Yoo, C. Hughes, Changkyu Kim, Yen-kuang Chen, C. Kozyrakis
DOI: https://doi.org/10.1145/2486159.2486175
Abstract: As we increase the number of cores on a processor die, the on-chip cache hierarchies that support these cores are getting larger, deeper, and more complex. As a result, non-uniform memory access effects are now prevalent even on a single chip. To reduce execution time and energy consumption, data access locality should be exploited. This is especially important for task-based programming systems, where a scheduler decides when and where on the chip the code segments, i.e., tasks, should execute. Capturing locality for structured task parallelism has been done effectively, but the more difficult case, unstructured parallelism, remains largely unsolved: little quantitative analysis exists to demonstrate the potential of locality-aware scheduling and to guide future scheduler implementations in the most fruitful direction. This paper quantifies the potential of locality-aware scheduling for unstructured parallelism on three different many-core processors. Our simulation results of 32-core systems show that locality-aware scheduling can bring up to 2.39x speedup over a randomized schedule, and 2.05x speedup over a state-of-the-art baseline scheduling scheme. At the same time, a locality-aware schedule reduces average energy consumption by 55% and 47%, relative to the random and the baseline schedule, respectively. In addition, our 1024-core simulation results project that these benefits will only increase: compared to 32-core executions, we see up to 1.83x additional locality benefits. To capture this potential in a practical setting, we also perform a detailed scheduler design-space exploration to quantify the impact of different scheduling decisions. We further highlight the importance of locality-aware stealing, and demonstrate that a stealing scheme can exploit significant locality while performing load balancing. Over randomized stealing, our proposed scheme shows up to 2.0x speedup for stolen tasks.
Fast greedy algorithms in mapreduce and streaming
Ravi Kumar, Benjamin Moseley, Sergei Vassilvitskii, Andrea Vattani
DOI: https://doi.org/10.1145/2486159.2486168
Abstract: Greedy algorithms are practitioners' best friends: they are intuitive, simple to implement, and often lead to very good solutions. However, implementing greedy algorithms in a distributed setting is challenging since the greedy choice is inherently sequential, and it is not clear how to take advantage of the extra processing power. Our main result is a powerful sampling technique that aids in parallelization of sequential algorithms. We then show how to use this primitive to adapt a broad class of greedy algorithms to the MapReduce paradigm; this class includes maximum cover and submodular maximization subject to p-system constraints. Our method yields efficient algorithms that run in a logarithmic number of rounds, while obtaining solutions that are arbitrarily close to those produced by the standard sequential greedy algorithm. We begin with algorithms for modular maximization subject to a matroid constraint, and then extend this approach to obtain approximation algorithms for submodular maximization subject to knapsack or p-system constraints. Finally, we empirically validate our algorithms, and show that they achieve the same solution quality as standard greedy algorithms but run in substantially fewer rounds.
{"title":"IRIS: a robust information system against insider dos-attacks","authors":"Martina Eikel, C. Scheideler","doi":"10.1145/2486159.2486186","DOIUrl":"https://doi.org/10.1145/2486159.2486186","url":null,"abstract":"In this work we present the first scalable distributed information system, i.e., a system with low storage overhead, that is provably robust against Denial-of-Service (DoS) attacks by a current insider. We allow a current insider to have complete knowledge about the information system and to have the power to block any ξ-fraction of its servers by a DoS-attack, where ξ can be chosen up to a constant. The task of the system is to serve any collection of lookup requests with at most one per non-blocked server in an efficient way despite this attack. Previously, scalable solutions were only known for DoS-attacks of past insiders, where a past insider only has complete knowledge about some past time point t0 of the information system. Scheideler et al. [2, 3] showed that in this case it is possible to design an information system so that any information that was inserted or last updated after t0 is safe against a DoS-attack. But their constructions would not work at all for a current insider. The key idea behind our IRIS system is to make extensive use of coding. More precisely, we present two alternative distributed coding strategies with an at most logarithmic storage overhead that can handle up to a constant fraction of blocked servers.","PeriodicalId":353007,"journal":{"name":"Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132175365","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SybilCast: broadcast on the open airwaves (extended abstract)","authors":"Seth Gilbert, Chaodong Zheng","doi":"10.1145/2486159.2486172","DOIUrl":"https://doi.org/10.1145/2486159.2486172","url":null,"abstract":"Consider a scenario where many wireless users are attempting to download data from a single base station. While most of the users are honest, some users may be malicious and attempt to obtain more than their fair share of the bandwidth. One possible strategy for attacking the system is to simulate multiple fake identities, each of which is given its own equal share of the bandwidth. Such an attack is often referred to as a sybil attack. To counter such behavior, we propose SybilCast, a protocol for multichannel wireless networks that limits the number of fake identities, and in doing so, ensures that each honest user gets at least a constant fraction of their fair share of the bandwidth. As a result, each honest user can complete his or her data download in asymptotically optimal time. A key aspect of this protocol is balancing the rate at which new identities are admitted and the maximum number of fake identities that can co-exist, while keeping the overhead low. Besides sybil attacks, our protocol can also tolerate spoofing and jamming.","PeriodicalId":353007,"journal":{"name":"Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures","volume":"283 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127489017","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On-the-fly pipeline parallelism
I. Lee, C. Leiserson, T. Schardl, Jim Sukha, Zhunping Zhang
DOI: https://doi.org/10.1145/2486159.2486174
Abstract: Pipeline parallelism organizes a parallel program as a linear sequence of s stages. Each stage processes elements of a data stream, passing each processed data element to the next stage, and then taking on a new element before the subsequent stages have necessarily completed their processing. Pipeline parallelism is used especially in streaming applications that perform video, audio, and digital signal processing. Three out of 13 benchmarks in PARSEC, a popular software benchmark suite designed for shared-memory multiprocessors, can be expressed as pipeline parallelism. Whereas most concurrency platforms that support pipeline parallelism use a "construct-and-run" approach, this paper investigates "on-the-fly" pipeline parallelism, where the structure of the pipeline emerges as the program executes rather than being specified a priori. On-the-fly pipeline parallelism allows the number of stages to vary from iteration to iteration and dependencies to be data dependent. We propose simple linguistics for specifying on-the-fly pipeline parallelism and describe a provably efficient scheduling algorithm, the Piper algorithm, which integrates pipeline parallelism into a work-stealing scheduler, allowing pipeline and fork-join parallelism to be arbitrarily nested. The Piper algorithm automatically throttles the parallelism, precluding "runaway" pipelines. Given a pipeline computation with T1 work and T∞ span (critical-path length), Piper executes the computation on P processors in TP ≤ T1/P + O(T∞ + lg P) expected time. Piper also limits stack space, ensuring that it does not grow unboundedly with running time. We have incorporated on-the-fly pipeline parallelism into a Cilk-based work-stealing runtime system. Our prototype Cilk-P implementation exploits optimizations such as lazy enabling and dependency folding. We have ported the three PARSEC benchmarks that exhibit pipeline parallelism to run on Cilk-P. One of these, x264, cannot readily be executed by systems that support only construct-and-run pipeline parallelism. Benchmark results indicate that Cilk-P has low serial overhead and good scalability. On x264, for example, Cilk-P exhibits a speedup of 13.87 over its respective serial counterpart when running on 16 processors.
{"title":"Parallel rotor walks on finite graphs and applications in discrete load balancing","authors":"Hoda Akbari, P. Berenbrink","doi":"10.1145/2486159.2486178","DOIUrl":"https://doi.org/10.1145/2486159.2486178","url":null,"abstract":"We study the parallel rotor walk process, which works as follows: Consider a graph along with an arbitrary distribution of tokens over its nodes. Every node is equipped with a rotor that points to its neighbours in a fixed circular order. In each round, every node distributes all of its tokens using the rotor. One token is allocated to the neighbour pointed at by the rotor, then the rotor moves to the subsequent neighbour, and so on, until no token remains. The process can be considered as a deterministic analogue of a process in which tokens perform one independent random walk step in each round. We compare the distribution of tokens in the rotor walk process with expected distribution in the random walk model. The similarity between the two processes is measured by their discrepancy, which is the maximum difference between the corresponding distribution entries over all rounds and nodes. We analyze a lazy variation of rotor walks that simulates a random walk with loop probability of 1/2 on each node, and each node sends not all its tokens, but every other token in each round. Viewing the rotor walk as a load balancing process, we prove that the rotor walk falls in the class of bounded-error diffusion processes introduced in [11]. This gives us discrepancy bounds of O(log3/2 n) and O(1) for hypercube and r-dimensional torus with r=O(1), respectively, which improve over the best existing bounds of O(log2 n) and O(n1/r). Also, as a result of switching to the load balancing view, we observe that the existing load balancing results can be translated to rotor walk discrepancy bounds not previously noticed in the rotor walk literature. We also use the idea of rotor walks to propose and analyze a randomized rounding discrete load balancing process that achieves the same balancing quality as similar protocols [11, 3], but uses fewer number of random bits compared to [3], and avoids the negative load problem of [11].","PeriodicalId":353007,"journal":{"name":"Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130607690","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient online scheduling for deadline-sensitive jobs: extended abstract
Brendan Lucier, Ishai Menache, J. Naor, Jonathan Yaniv
DOI: https://doi.org/10.1145/2486159.2486187
Abstract: We consider mechanisms for online deadline-aware scheduling in large computing clusters. Batch jobs that run on such clusters often require guarantees on their completion time (i.e., deadlines). However, most existing scheduling systems implement fair-share resource allocation between users, an approach that ignores heterogeneity in job requirements and may cause deadlines to be missed. In our framework, jobs arrive dynamically and are characterized by their value and total resource demand (or an estimation thereof), along with their reported deadlines. The scheduler's objective is to maximize the aggregate value of jobs completed by their deadlines. We circumvent known lower bounds for this problem by assuming that the input has slack, meaning that any job could be delayed and still finish by its deadline. Under the slackness assumption, we design a preemptive scheduler with a constant-factor worst-case performance guarantee. Along the way, we pay close attention to practical aspects, such as runtime efficiency, data locality, and demand uncertainty. We evaluate the algorithm via simulations over real job traces taken from a large production cluster, and show that its actual performance is significantly better than other heuristics used in practice. We then extend our framework to handle provider commitments: the requirement that jobs admitted to service must be executed until completion. We prove that no algorithm can obtain worst-case guarantees when the commitment decision must be made upon job arrival. Nevertheless, we design efficient heuristics that commit on job admission, in the spirit of our basic algorithm. We show empirically that these heuristics perform just as well as (or better than) the original algorithm. Finally, we discuss how our scheduling framework can be used to design truthful scheduling mechanisms, motivated by applications to commercial public cloud offerings.
{"title":"Online batch scheduling for flow objectives","authors":"Sungjin Im, Benjamin Moseley","doi":"10.1145/2486159.2486161","DOIUrl":"https://doi.org/10.1145/2486159.2486161","url":null,"abstract":"Batch scheduling gives a powerful way of increasing the throughput by aggregating multiple homogeneous jobs. It has applications in large scale manufacturing as well as in server scheduling. In batch scheduling, when explained in the setting of server scheduling, the server can process requests of the same type up to a certain number simultaneously. Batch scheduling can be seen as capacitated broadcast scheduling, a popular model considered in scheduling theory. In this paper, we consider an online batch scheduling model. For this model we address flow time objectives for the first time and give positive results for average flow time, the k-norms of flow time and maximum flow time. For average flow time and the k-norms of flow time we show algorithms that are O(1)-competitive with a small constant amount of resource augmentation. For maximum flow time we show a 2-competitive algorithm and this is the best possible competitive ratio for any online algorithm.","PeriodicalId":353007,"journal":{"name":"Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123059629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}