{"title":"Latency-Hiding Work Stealing: Scheduling Interacting Parallel Computations with Work Stealing","authors":"Stefan K. Muller, Umut A. Acar","doi":"10.1145/2935764.2935793","DOIUrl":"https://doi.org/10.1145/2935764.2935793","url":null,"abstract":"With the rise of multicore computers, parallel applications no longer consist solely of computational, batch workloads, but also include applications that may, for example, take input from a user, access secondary storage or the network, or perform remote procedure calls. Such operations can incur substantial latency, requiring the program to wait for a response. In the current state of the art, the theoretical models of parallelism and parallel scheduling algorithms do not account for latency. In this work, we extend the dag (Directed Acyclic Graph) model for parallelism to account for latency and present a work-stealing algorithm that hides latency to improve performance. This algorithm allows user-level threads to suspend without blocking the underlying worker, usually a system thread. When a user-level thread suspends, the algorithm switches to another thread. Using extensions of existing techniques as well as new technical devices, we bound the running time of our scheduler on a parallel computation. We also briefly present a prototype implementation of the algorithm and some preliminary empirical findings.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121305700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Investigating the Performance of Hardware Transactions on a Multi-Socket Machine","authors":"Trevor Brown, Alex Kogan, Yossi Lev, Victor Luchangco","doi":"10.1145/2935764.2935796","DOIUrl":"https://doi.org/10.1145/2935764.2935796","url":null,"abstract":"The introduction of hardware transactional memory (HTM) into commercial processors opens a door for designing and implementing scalable synchronization mechanisms. One example for such a mechanism is transactional lock elision (TLE), where lock-based critical sections are executed concurrently using hardware transactions. So far, the effectiveness of TLE and other HTM-based mechanisms has been assessed mostly on small, single-socket machines. This paper investigates the behavior of hardware transactions on a large two-socket machine. Using TLE as an example, we show that a system can scale as long as all threads run on the same socket, but a single thread running on a different socket can wreck performance. We identify the reason for this phenomenon, and present a simple adaptive technique that overcomes this problem by throttling threads as necessary to optimize system performance. Using extensive evaluation of multiple microbenchmarks and real applications, we demonstrate that our technique achieves the full performance of the system for workloads that scale across sockets, and avoids the performance degradation that cripples TLE for workloads that do not.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"197 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126749335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Brief Announcement: A QPTAS for Non-preemptive Speed-scaling","authors":"Sungjin Im, Maryam Shadloo","doi":"10.1145/2935764.2935824","DOIUrl":"https://doi.org/10.1145/2935764.2935824","url":null,"abstract":"Modern processors typically allow dynamic speed-scaling offering an effective trade-off between high throughput and energy efficiency. In a classical model, a processor/machine runs at speed s when consuming power sα where α >1 is a constant. Yao et al. [FOCS 1995] studied the problem of completing all jobs before their deadlines on a single machine with the minimum energy in their seminal work and gave a nice polynomial time algorithm. The influential work has been extended to various settings. In particular, the problem has been extensively studied in the presence of multiple machines as multi-core processors have become dominant computing units. However, when jobs must be scheduled non-preemptively, our understanding of the problem remains fairly unsatisfactory. Often, preempting a job is prohibited since it could be very costly. Previously, a O((wmax wmin)α)-approximation was known for the non-preemptive setting where wmax and wmin denote the maximum and minimum job sizes, respectively. Even when there is only one machine, the best known approximation factor had a dependency on α. In this paper, for any fixed α >1 and ε >0, we give the first (1+ε)-approximation for this problem on multiple machines which runs in nO(polylog (n)) time where n is the number of jobs to be scheduled.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114361054","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Brief Announcement: Improved Approximation Algorithms for Scheduling Co-Flows","authors":"S. Khuller, Manish Purohit","doi":"10.1145/2935764.2935809","DOIUrl":"https://doi.org/10.1145/2935764.2935809","url":null,"abstract":"Co-flow scheduling is a recent networking abstraction introduced to capture application-level communication patterns in datacenters. In this paper, we consider the offline co-flow scheduling problem with release times to minimize the total weighted completion time. Recently, Qiu, Stein and Zhong (SPAA, 2015) obtained the first constant approximation algorithms for this problem with a deterministic 67/3-approximation and a randomized (9 + 16√2)/3 ≅ 16.54-approximation. In this paper, we improve upon their algorithm to yield a deterministic 12-approximation algorithm. For the special case when all release times are zero, we obtain a deterministic 8-approximation and a randomized (3+2√2) ≅ 5.83-approximation.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"380 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116325806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RUBIC: Online Parallelism Tuning for Co-located Transactional Memory Applications","authors":"Amin Mohtasham, J. Barreto","doi":"10.1145/2935764.2935770","DOIUrl":"https://doi.org/10.1145/2935764.2935770","url":null,"abstract":"With the advent of Chip-Multiprocessors, Transactional Memory (TM) emerged as a powerful paradigm to simplify parallel programming. Unfortunately, as more cores become available in commodity systems, the scalability limits of a wide class of TM applications become more evident. Hence, online parallelism tuning techniques were proposed to adapt the optimal number of threads of TM applications. However, state-of-the-art solutions are exclusively tailored to single-process systems with relatively static workloads, exhibiting pathological behaviors in scenarios where multiple multi-threaded TM processes contend for the shared hardware resources. This paper proposes RUBIC, a novel parallelism tuning method for TM applications in both single and multi-process scenarios that overcomes the shortcomings of the preciously proposed solutions. RUBIC helps the co-running processes adapt their parallelism level so that they can efficiently space-share the hardware. When compared to previous online parallelism tuning solutions, RUBIC achieves unprecedented system-wide fairness and efficiency, both in single- and multi-process scenarios. Our evaluation with different workloads and scenarios shows that, on average, RUBIC enhances the overall performance by 26% with respect to the best-performing state-of-the-art online parallelism tuning techniques in multi-process scenarios, while incurring negligible overhead in single-process cases. RUBIC also exhibits unique features in converging to a fair and efficient state.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128179567","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Provably Good and Practically Efficient Parallel Race Detection for Fork-Join Programs","authors":"R. Utterback, Kunal Agrawal, Jeremy T. Fineman, I. Lee","doi":"10.1145/2935764.2935801","DOIUrl":"https://doi.org/10.1145/2935764.2935801","url":null,"abstract":"If a parallel program has determinacy race(s), different schedules can result in memory accesses that observe different values --- various race-detection tools have been designed to find such bugs. A key component of race detectors is an algorithm for series-parallel (SP) maintenance, which identifies whether two accesses are logically parallel. This paper describes an asymptotically optimal algorithm, called WSP-Order, for performing SP maintenance in programs with fork-join (or nested) parallelism. Given a fork-join program with T1 work and T∞ span, WSP-Order executes it while also maintaining SP relationships in O(T1/P + T∞) time on P processors, which is asymptotically optimal. At the heart of WSP-Order is a work-stealing scheduler designed specifically for SP maintenance. We also implemented C-RACER, a race-detector based on WSP-Order within the Cilk Plus runtime system, and evaluated its performance on five benchmarks. Empirical results demonstrate that when run sequentially, it performs almost as well as previous best sequential race detectors. More importantly, when run in parallel, it achieves almost as much speedup as the original program without race-detection.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131508808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Shuffles and Circuits: (On Lower Bounds for Modern Parallel Computation)","authors":"T. Roughgarden, Sergei Vassilvitskii, Joshua R. Wang","doi":"10.1145/2935764.2935799","DOIUrl":"https://doi.org/10.1145/2935764.2935799","url":null,"abstract":"The goal of this paper is to identify fundamental limitations on how efficiently algorithms implemented on platforms such as MapReduce and Hadoop can compute the central problems in the motivating application domains, such as graph connectivity problems. We introduce an abstract model of massively parallel computation, where essentially the only restrictions are that the \"fan-in\" of each machine is limited to s bits, where s is smaller than the input size n, and that computation proceeds in synchronized rounds, with no communication between different machines within a round. Lower bounds on the round complexity of a problem in this model apply to every computing platform that shares the most basic design principles of MapReduce-type systems. We prove that computations in our model that use few rounds can be represented as low-degree polynomials over the reals. This connection allows us to translate a lower bound on the (approximate) polynomial degree of a Boolean function to a lower bound on the round complexity of every (randomized) massively parallel computation of that function. These lower bounds apply even in the \"unbounded width\" version of our model, where the number of machines can be arbitrarily large. As one example of our general results, computing any non-trivial monotone graph property --- such as connectivity --- requires a super-constant number of rounds when every machine can accept only a sub-polynomial (in n) number of input bits s. Finally, we prove that, in two senses, our lower bounds are the best one could hope for. For the unbounded-width model, we prove a matching upper bound. Restricting to a polynomial number of machines, we show that asymptotically better lower bounds require proving that P ≠ NC1.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127825496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Clairvoyant Dynamic Bin Packing for Job Scheduling with Minimum Server Usage Time","authors":"Runtian Ren, Xueyan Tang","doi":"10.1145/2935764.2935775","DOIUrl":"https://doi.org/10.1145/2935764.2935775","url":null,"abstract":"The MinUsageTime Dynamic Bin Packing (DBP) problem targets at minimizing the accumulated usage time of all the bins in the packing process. It models the server acquisition and job scheduling issues in many cloud-based systems. Earlier work has studied MinUsageTime DBP in the non-clairvoyant setting where the departure time of each item is not known at the time of its arrival. In this paper, we investigate MinUsageTime DBP in the clairvoyant setting where the departure time of each item is known for packing purposes. We study both the offline and online versions of Clairvoyant MinUsageTime DBP. We present two approximation algorithms for the offline problem, including a 5-approximation Duration Descending First Fit algorithm and a 4-approximation Dual Coloring algorithm. For the online problem, we establish a lower bound of 1+√5/2 on the competitive ratio of any online packing algorithm. We propose two strategies of item classification for online packing, including a classify-by-departure-time strategy and a classify-by-duration strategy. We analyze the competitiveness of these strategies when they are applied to the classical First Fit packing algorithm. It is shown that both strategies can substantially reduce the competitive ratio for Clairvoyant MinUsageTime DBP compared to the original First Fit algorithm.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116767018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Cost of Unknown Diameter in Dynamic Networks","authors":"Haifeng Yu, Yuda Zhao, Irvan Jahja","doi":"10.1145/2935764.2935781","DOIUrl":"https://doi.org/10.1145/2935764.2935781","url":null,"abstract":"For dynamic networks with unknown diameter, we prove novel lower bounds on the time complexity of a range of basic distributed computing problems. Together with trivial upper bounds under dynamic networks with known diameter for these problems, our lower bounds show that the complexities of all these problems are sensitive to whether the diameter is known to the protocol beforehand: Not knowing the diameter increases the time complexities by a large poly(N) factor as compared to when the diameter is known, resulting in an exponential gap. Here N is the number of nodes in the network. Our lower bounds are obtained via communication complexity arguments and by reducing from the two-party DisjointnessCP problem. We further prove that sometimes this large poly(N) cost can be completely avoided if the protocol is given a good estimate of N. In other words, having such an estimate makes some problems no longer sensitive to unknown diameter.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132676932","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Multicore Path to Connectomics-on-Demand","authors":"A. Matveev, Yaron Meirovitch, Hayk Saribekyan, Wiktor Jakubiuk, Tim Kaler, Gergely Ódor, D. Budden, A. Zlateski, N. Shavit","doi":"10.1145/3155284.3018766","DOIUrl":"https://doi.org/10.1145/3155284.3018766","url":null,"abstract":"Connectomics is an emerging field of neurobiology that uses cutting edge machine learning and image processing to extract brain connectivity graphs from electron microscopy images. It has long been assumed that the processing of connectomics data will require mass storage and farms of CPUs and GPUs and will take months if not years. This talk shows the feasibility of designing a high-throughput connectomics-on-demand system that runs on a multicore machine with less than 100 cores and extracts connectomes at the terabyte per hour pace of modern electron microscopes. Building this system required solving algorithmic and performance engineering issues related to scaling machine learning on multicore architectures, and may have important lessons for other problem spaces in the natural sciences, where until now large distributed server or GPU farms seemed to be the only way to go.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116204820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}