{"title":"Latency-Hiding Work Stealing: Scheduling Interacting Parallel Computations with Work Stealing","authors":"Stefan K. Muller, Umut A. Acar","doi":"10.1145/2935764.2935793","DOIUrl":"https://doi.org/10.1145/2935764.2935793","url":null,"abstract":"With the rise of multicore computers, parallel applications no longer consist solely of computational, batch workloads, but also include applications that may, for example, take input from a user, access secondary storage or the network, or perform remote procedure calls. Such operations can incur substantial latency, requiring the program to wait for a response. In the current state of the art, the theoretical models of parallelism and parallel scheduling algorithms do not account for latency. In this work, we extend the dag (Directed Acyclic Graph) model for parallelism to account for latency and present a work-stealing algorithm that hides latency to improve performance. This algorithm allows user-level threads to suspend without blocking the underlying worker, usually a system thread. When a user-level thread suspends, the algorithm switches to another thread. Using extensions of existing techniques as well as new technical devices, we bound the running time of our scheduler on a parallel computation. We also briefly present a prototype implementation of the algorithm and some preliminary empirical findings.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121305700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Investigating the Performance of Hardware Transactions on a Multi-Socket Machine","authors":"Trevor Brown, Alex Kogan, Yossi Lev, Victor Luchangco","doi":"10.1145/2935764.2935796","DOIUrl":"https://doi.org/10.1145/2935764.2935796","url":null,"abstract":"The introduction of hardware transactional memory (HTM) into commercial processors opens a door for designing and implementing scalable synchronization mechanisms. One example for such a mechanism is transactional lock elision (TLE), where lock-based critical sections are executed concurrently using hardware transactions. So far, the effectiveness of TLE and other HTM-based mechanisms has been assessed mostly on small, single-socket machines. This paper investigates the behavior of hardware transactions on a large two-socket machine. Using TLE as an example, we show that a system can scale as long as all threads run on the same socket, but a single thread running on a different socket can wreck performance. We identify the reason for this phenomenon, and present a simple adaptive technique that overcomes this problem by throttling threads as necessary to optimize system performance. Using extensive evaluation of multiple microbenchmarks and real applications, we demonstrate that our technique achieves the full performance of the system for workloads that scale across sockets, and avoids the performance degradation that cripples TLE for workloads that do not.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"197 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126749335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Brief Announcement: A QPTAS for Non-preemptive Speed-scaling","authors":"Sungjin Im, Maryam Shadloo","doi":"10.1145/2935764.2935824","DOIUrl":"https://doi.org/10.1145/2935764.2935824","url":null,"abstract":"Modern processors typically allow dynamic speed-scaling offering an effective trade-off between high throughput and energy efficiency. In a classical model, a processor/machine runs at speed s when consuming power sα where α >1 is a constant. Yao et al. [FOCS 1995] studied the problem of completing all jobs before their deadlines on a single machine with the minimum energy in their seminal work and gave a nice polynomial time algorithm. The influential work has been extended to various settings. In particular, the problem has been extensively studied in the presence of multiple machines as multi-core processors have become dominant computing units. However, when jobs must be scheduled non-preemptively, our understanding of the problem remains fairly unsatisfactory. Often, preempting a job is prohibited since it could be very costly. Previously, a O((wmax wmin)α)-approximation was known for the non-preemptive setting where wmax and wmin denote the maximum and minimum job sizes, respectively. Even when there is only one machine, the best known approximation factor had a dependency on α. In this paper, for any fixed α >1 and ε >0, we give the first (1+ε)-approximation for this problem on multiple machines which runs in nO(polylog (n)) time where n is the number of jobs to be scheduled.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114361054","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Brief Announcement: Improved Approximation Algorithms for Scheduling Co-Flows","authors":"S. Khuller, Manish Purohit","doi":"10.1145/2935764.2935809","DOIUrl":"https://doi.org/10.1145/2935764.2935809","url":null,"abstract":"Co-flow scheduling is a recent networking abstraction introduced to capture application-level communication patterns in datacenters. In this paper, we consider the offline co-flow scheduling problem with release times to minimize the total weighted completion time. Recently, Qiu, Stein and Zhong (SPAA, 2015) obtained the first constant approximation algorithms for this problem with a deterministic 67/3-approximation and a randomized (9 + 16√2)/3 ≅ 16.54-approximation. In this paper, we improve upon their algorithm to yield a deterministic 12-approximation algorithm. For the special case when all release times are zero, we obtain a deterministic 8-approximation and a randomized (3+2√2) ≅ 5.83-approximation.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"380 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116325806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RUBIC: Online Parallelism Tuning for Co-located Transactional Memory Applications","authors":"Amin Mohtasham, J. Barreto","doi":"10.1145/2935764.2935770","DOIUrl":"https://doi.org/10.1145/2935764.2935770","url":null,"abstract":"With the advent of Chip-Multiprocessors, Transactional Memory (TM) emerged as a powerful paradigm to simplify parallel programming. Unfortunately, as more cores become available in commodity systems, the scalability limits of a wide class of TM applications become more evident. Hence, online parallelism tuning techniques were proposed to adapt the optimal number of threads of TM applications. However, state-of-the-art solutions are exclusively tailored to single-process systems with relatively static workloads, exhibiting pathological behaviors in scenarios where multiple multi-threaded TM processes contend for the shared hardware resources. This paper proposes RUBIC, a novel parallelism tuning method for TM applications in both single and multi-process scenarios that overcomes the shortcomings of the preciously proposed solutions. RUBIC helps the co-running processes adapt their parallelism level so that they can efficiently space-share the hardware. When compared to previous online parallelism tuning solutions, RUBIC achieves unprecedented system-wide fairness and efficiency, both in single- and multi-process scenarios. Our evaluation with different workloads and scenarios shows that, on average, RUBIC enhances the overall performance by 26% with respect to the best-performing state-of-the-art online parallelism tuning techniques in multi-process scenarios, while incurring negligible overhead in single-process cases. RUBIC also exhibits unique features in converging to a fair and efficient state.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128179567","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Provably Good and Practically Efficient Parallel Race Detection for Fork-Join Programs","authors":"R. Utterback, Kunal Agrawal, Jeremy T. Fineman, I. Lee","doi":"10.1145/2935764.2935801","DOIUrl":"https://doi.org/10.1145/2935764.2935801","url":null,"abstract":"If a parallel program has determinacy race(s), different schedules can result in memory accesses that observe different values --- various race-detection tools have been designed to find such bugs. A key component of race detectors is an algorithm for series-parallel (SP) maintenance, which identifies whether two accesses are logically parallel. This paper describes an asymptotically optimal algorithm, called WSP-Order, for performing SP maintenance in programs with fork-join (or nested) parallelism. Given a fork-join program with T1 work and T∞ span, WSP-Order executes it while also maintaining SP relationships in O(T1/P + T∞) time on P processors, which is asymptotically optimal. At the heart of WSP-Order is a work-stealing scheduler designed specifically for SP maintenance. We also implemented C-RACER, a race-detector based on WSP-Order within the Cilk Plus runtime system, and evaluated its performance on five benchmarks. Empirical results demonstrate that when run sequentially, it performs almost as well as previous best sequential race detectors. More importantly, when run in parallel, it achieves almost as much speedup as the original program without race-detection.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131508808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Shuffles and Circuits: (On Lower Bounds for Modern Parallel Computation)","authors":"T. Roughgarden, Sergei Vassilvitskii, Joshua R. Wang","doi":"10.1145/2935764.2935799","DOIUrl":"https://doi.org/10.1145/2935764.2935799","url":null,"abstract":"The goal of this paper is to identify fundamental limitations on how efficiently algorithms implemented on platforms such as MapReduce and Hadoop can compute the central problems in the motivating application domains, such as graph connectivity problems. We introduce an abstract model of massively parallel computation, where essentially the only restrictions are that the \"fan-in\" of each machine is limited to s bits, where s is smaller than the input size n, and that computation proceeds in synchronized rounds, with no communication between different machines within a round. Lower bounds on the round complexity of a problem in this model apply to every computing platform that shares the most basic design principles of MapReduce-type systems. We prove that computations in our model that use few rounds can be represented as low-degree polynomials over the reals. This connection allows us to translate a lower bound on the (approximate) polynomial degree of a Boolean function to a lower bound on the round complexity of every (randomized) massively parallel computation of that function. These lower bounds apply even in the \"unbounded width\" version of our model, where the number of machines can be arbitrarily large. As one example of our general results, computing any non-trivial monotone graph property --- such as connectivity --- requires a super-constant number of rounds when every machine can accept only a sub-polynomial (in n) number of input bits s. Finally, we prove that, in two senses, our lower bounds are the best one could hope for. For the unbounded-width model, we prove a matching upper bound. Restricting to a polynomial number of machines, we show that asymptotically better lower bounds require proving that P ≠ NC1.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127825496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Clairvoyant Dynamic Bin Packing for Job Scheduling with Minimum Server Usage Time","authors":"Runtian Ren, Xueyan Tang","doi":"10.1145/2935764.2935775","DOIUrl":"https://doi.org/10.1145/2935764.2935775","url":null,"abstract":"The MinUsageTime Dynamic Bin Packing (DBP) problem targets at minimizing the accumulated usage time of all the bins in the packing process. It models the server acquisition and job scheduling issues in many cloud-based systems. Earlier work has studied MinUsageTime DBP in the non-clairvoyant setting where the departure time of each item is not known at the time of its arrival. In this paper, we investigate MinUsageTime DBP in the clairvoyant setting where the departure time of each item is known for packing purposes. We study both the offline and online versions of Clairvoyant MinUsageTime DBP. We present two approximation algorithms for the offline problem, including a 5-approximation Duration Descending First Fit algorithm and a 4-approximation Dual Coloring algorithm. For the online problem, we establish a lower bound of 1+√5/2 on the competitive ratio of any online packing algorithm. We propose two strategies of item classification for online packing, including a classify-by-departure-time strategy and a classify-by-duration strategy. We analyze the competitiveness of these strategies when they are applied to the classical First Fit packing algorithm. It is shown that both strategies can substantially reduce the competitive ratio for Clairvoyant MinUsageTime DBP compared to the original First Fit algorithm.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116767018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Cost of Unknown Diameter in Dynamic Networks","authors":"Haifeng Yu, Yuda Zhao, Irvan Jahja","doi":"10.1145/2935764.2935781","DOIUrl":"https://doi.org/10.1145/2935764.2935781","url":null,"abstract":"For dynamic networks with unknown diameter, we prove novel lower bounds on the time complexity of a range of basic distributed computing problems. Together with trivial upper bounds under dynamic networks with known diameter for these problems, our lower bounds show that the complexities of all these problems are sensitive to whether the diameter is known to the protocol beforehand: Not knowing the diameter increases the time complexities by a large poly(N) factor as compared to when the diameter is known, resulting in an exponential gap. Here N is the number of nodes in the network. Our lower bounds are obtained via communication complexity arguments and by reducing from the two-party DisjointnessCP problem. We further prove that sometimes this large poly(N) cost can be completely avoided if the protocol is given a good estimate of N. In other words, having such an estimate makes some problems no longer sensitive to unknown diameter.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132676932","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Multicore Path to Connectomics-on-Demand","authors":"A. Matveev, Yaron Meirovitch, Hayk Saribekyan, Wiktor Jakubiuk, Tim Kaler, Gergely Ódor, D. Budden, A. Zlateski, N. Shavit","doi":"10.1145/3155284.3018766","DOIUrl":"https://doi.org/10.1145/3155284.3018766","url":null,"abstract":"Connectomics is an emerging field of neurobiology that uses cutting edge machine learning and image processing to extract brain connectivity graphs from electron microscopy images. It has long been assumed that the processing of connectomics data will require mass storage and farms of CPUs and GPUs and will take months if not years. This talk shows the feasibility of designing a high-throughput connectomics-on-demand system that runs on a multicore machine with less than 100 cores and extracts connectomes at the terabyte per hour pace of modern electron microscopes. Building this system required solving algorithmic and performance engineering issues related to scaling machine learning on multicore architectures, and may have important lessons for other problem spaces in the natural sciences, where until now large distributed server or GPU farms seemed to be the only way to go.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116204820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}