Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures: Latest Articles

Latency-Hiding Work Stealing: Scheduling Interacting Parallel Computations with Work Stealing
Stefan K. Muller, Umut A. Acar
{"title":"Latency-Hiding Work Stealing: Scheduling Interacting Parallel Computations with Work Stealing","authors":"Stefan K. Muller, Umut A. Acar","doi":"10.1145/2935764.2935793","DOIUrl":"https://doi.org/10.1145/2935764.2935793","url":null,"abstract":"With the rise of multicore computers, parallel applications no longer consist solely of computational, batch workloads, but also include applications that may, for example, take input from a user, access secondary storage or the network, or perform remote procedure calls. Such operations can incur substantial latency, requiring the program to wait for a response. In the current state of the art, the theoretical models of parallelism and parallel scheduling algorithms do not account for latency. In this work, we extend the dag (Directed Acyclic Graph) model for parallelism to account for latency and present a work-stealing algorithm that hides latency to improve performance. This algorithm allows user-level threads to suspend without blocking the underlying worker, usually a system thread. When a user-level thread suspends, the algorithm switches to another thread. Using extensions of existing techniques as well as new technical devices, we bound the running time of our scheduler on a parallel computation. 
We also briefly present a prototype implementation of the algorithm and some preliminary empirical findings.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121305700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 29
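The abstract above says that when a user-level thread suspends, the scheduler switches to another thread instead of blocking the worker. The following is a minimal single-worker sketch of that idea only, not the paper's multi-worker work-stealing algorithm. The task encoding (lists of `("work", n)` and `("wait", n)` steps) and both function names are assumptions made for illustration.

```python
import heapq
from collections import deque

def blocking_makespan(tasks):
    """Worker that blocks on every wait: all step lengths simply add up."""
    return sum(n for task in tasks for _, n in task)

def latency_hiding_makespan(tasks):
    """Worker that sets a waiting task aside and runs another ready task."""
    ready = deque((i, 0) for i in range(len(tasks)))  # (task index, next step)
    suspended = []  # min-heap of (wake time, task index, next step)
    clock = 0
    while ready or suspended:
        # Move every task whose latency has elapsed back to the ready queue.
        while suspended and suspended[0][0] <= clock:
            _, i, step = heapq.heappop(suspended)
            ready.append((i, step))
        if not ready:
            # Nothing to run: idle until the earliest suspended task wakes up.
            clock = suspended[0][0]
            continue
        i, step = ready.popleft()
        if step == len(tasks[i]):
            continue  # task finished
        kind, n = tasks[i][step]
        if kind == "work":
            clock += n
            ready.appendleft((i, step + 1))  # keep running the same task
        else:
            # Suspend without blocking: record the wake time and move on.
            heapq.heappush(suspended, (clock + n, i, step + 1))
    return clock

tasks = [
    [("work", 1), ("wait", 5), ("work", 1)],  # e.g. compute, wait for I/O, compute
    [("work", 3)],                            # pure computation
]
print(blocking_makespan(tasks))        # 10
print(latency_hiding_makespan(tasks))  # 7
```

In this toy instance the three units of computation in the second task overlap with the first task's five units of latency, which is the effect the abstract calls hiding latency.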
Investigating the Performance of Hardware Transactions on a Multi-Socket Machine
Trevor Brown, Alex Kogan, Yossi Lev, Victor Luchangco
{"title":"Investigating the Performance of Hardware Transactions on a Multi-Socket Machine","authors":"Trevor Brown, Alex Kogan, Yossi Lev, Victor Luchangco","doi":"10.1145/2935764.2935796","DOIUrl":"https://doi.org/10.1145/2935764.2935796","url":null,"abstract":"The introduction of hardware transactional memory (HTM) into commercial processors opens a door for designing and implementing scalable synchronization mechanisms. One example of such a mechanism is transactional lock elision (TLE), where lock-based critical sections are executed concurrently using hardware transactions. So far, the effectiveness of TLE and other HTM-based mechanisms has been assessed mostly on small, single-socket machines. This paper investigates the behavior of hardware transactions on a large two-socket machine. Using TLE as an example, we show that a system can scale as long as all threads run on the same socket, but a single thread running on a different socket can wreck performance. We identify the reason for this phenomenon, and present a simple adaptive technique that overcomes this problem by throttling threads as necessary to optimize system performance. 
Using extensive evaluation of multiple microbenchmarks and real applications, we demonstrate that our technique achieves the full performance of the system for workloads that scale across sockets, and avoids the performance degradation that cripples TLE for workloads that do not.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"197 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126749335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 26
Brief Announcement: A QPTAS for Non-preemptive Speed-scaling
Sungjin Im, Maryam Shadloo
{"title":"Brief Announcement: A QPTAS for Non-preemptive Speed-scaling","authors":"Sungjin Im, Maryam Shadloo","doi":"10.1145/2935764.2935824","DOIUrl":"https://doi.org/10.1145/2935764.2935824","url":null,"abstract":"Modern processors typically allow dynamic speed-scaling offering an effective trade-off between high throughput and energy efficiency. In a classical model, a processor/machine runs at speed s when consuming power s^α, where α > 1 is a constant. Yao et al. [FOCS 1995] studied the problem of completing all jobs before their deadlines on a single machine with the minimum energy in their seminal work and gave a nice polynomial time algorithm. The influential work has been extended to various settings. In particular, the problem has been extensively studied in the presence of multiple machines as multi-core processors have become dominant computing units. However, when jobs must be scheduled non-preemptively, our understanding of the problem remains fairly unsatisfactory. Often, preempting a job is prohibited since it could be very costly. Previously, an O((w_max/w_min)^α)-approximation was known for the non-preemptive setting where w_max and w_min denote the maximum and minimum job sizes, respectively. Even when there is only one machine, the best known approximation factor had a dependency on α. 
In this paper, for any fixed α > 1 and ε > 0, we give the first (1+ε)-approximation for this problem on multiple machines which runs in n^{O(polylog(n))} time where n is the number of jobs to be scheduled.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114361054","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
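The power model in the abstract above (power equal to speed raised to α) can be made concrete with a small calculation. The sketch below assumes a single job run at constant speed over a fixed time window; the function name `energy` and the numbers are illustrative and do not come from the paper.

```python
def energy(work, duration, alpha):
    """Energy to finish `work` units in `duration` time at constant speed.

    Speed is work / duration, power is speed ** alpha, energy is power * time.
    """
    speed = work / duration
    return duration * speed ** alpha

alpha = 3.0

# One job of 4 units of work in a window of length 2, at constant speed 2.
constant = energy(4, 2, alpha)

# Same job and window, but half the work is rushed in the first 0.5 time units.
uneven = energy(2, 0.5, alpha) + energy(2, 1.5, alpha)

print(constant)          # 16.0
print(uneven > constant) # True
```

Because the power function is convex for α > 1, running at an even speed costs less energy than rushing part of the work. This is why how jobs are placed in time, and whether they may be preempted, matters for the energy objective.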
Brief Announcement: Improved Approximation Algorithms for Scheduling Co-Flows
S. Khuller, Manish Purohit
{"title":"Brief Announcement: Improved Approximation Algorithms for Scheduling Co-Flows","authors":"S. Khuller, Manish Purohit","doi":"10.1145/2935764.2935809","DOIUrl":"https://doi.org/10.1145/2935764.2935809","url":null,"abstract":"Co-flow scheduling is a recent networking abstraction introduced to capture application-level communication patterns in datacenters. In this paper, we consider the offline co-flow scheduling problem with release times to minimize the total weighted completion time. Recently, Qiu, Stein and Zhong (SPAA, 2015) obtained the first constant approximation algorithms for this problem with a deterministic 67/3-approximation and a randomized (9 + 16√2/3) ≅ 16.54-approximation. In this paper, we improve upon their algorithm to yield a deterministic 12-approximation algorithm. For the special case when all release times are zero, we obtain a deterministic 8-approximation and a randomized (3+2√2) ≅ 5.83-approximation.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"380 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116325806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 28
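The approximation ratios quoted in the abstract above are easy to check numerically. The short calculation below evaluates the closed forms; the variable names are illustrative. Note that the decimal 16.54 is matched only when the division by 3 applies to the 16√2 term alone.

```python
import math

# Ratios attributed to the earlier work in the abstract.
prior_deterministic = 67 / 3
prior_randomized = 9 + 16 * math.sqrt(2) / 3

# Randomized ratio for the special case where all release times are zero.
zero_release_randomized = 3 + 2 * math.sqrt(2)

# Dividing the whole sum by 3 instead does not reproduce the quoted 16.54.
whole_sum_over_three = (9 + 16 * math.sqrt(2)) / 3

print(round(prior_deterministic, 2))      # 22.33
print(round(prior_randomized, 2))         # 16.54
print(round(zero_release_randomized, 2))  # 5.83
print(round(whole_sum_over_three, 2))     # 10.54
```

Seen this way, the improved deterministic ratio of 12 is below both earlier bounds.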
RUBIC: Online Parallelism Tuning for Co-located Transactional Memory Applications
Amin Mohtasham, J. Barreto
{"title":"RUBIC: Online Parallelism Tuning for Co-located Transactional Memory Applications","authors":"Amin Mohtasham, J. Barreto","doi":"10.1145/2935764.2935770","DOIUrl":"https://doi.org/10.1145/2935764.2935770","url":null,"abstract":"With the advent of Chip-Multiprocessors, Transactional Memory (TM) emerged as a powerful paradigm to simplify parallel programming. Unfortunately, as more cores become available in commodity systems, the scalability limits of a wide class of TM applications become more evident. Hence, online parallelism tuning techniques were proposed to adapt the optimal number of threads of TM applications. However, state-of-the-art solutions are exclusively tailored to single-process systems with relatively static workloads, exhibiting pathological behaviors in scenarios where multiple multi-threaded TM processes contend for the shared hardware resources. This paper proposes RUBIC, a novel parallelism tuning method for TM applications in both single and multi-process scenarios that overcomes the shortcomings of the previously proposed solutions. RUBIC helps the co-running processes adapt their parallelism level so that they can efficiently space-share the hardware. When compared to previous online parallelism tuning solutions, RUBIC achieves unprecedented system-wide fairness and efficiency, both in single- and multi-process scenarios. Our evaluation with different workloads and scenarios shows that, on average, RUBIC enhances the overall performance by 26% with respect to the best-performing state-of-the-art online parallelism tuning techniques in multi-process scenarios, while incurring negligible overhead in single-process cases. 
RUBIC also exhibits unique features in converging to a fair and efficient state.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128179567","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Provably Good and Practically Efficient Parallel Race Detection for Fork-Join Programs
R. Utterback, Kunal Agrawal, Jeremy T. Fineman, I. Lee
{"title":"Provably Good and Practically Efficient Parallel Race Detection for Fork-Join Programs","authors":"R. Utterback, Kunal Agrawal, Jeremy T. Fineman, I. Lee","doi":"10.1145/2935764.2935801","DOIUrl":"https://doi.org/10.1145/2935764.2935801","url":null,"abstract":"If a parallel program has determinacy race(s), different schedules can result in memory accesses that observe different values --- various race-detection tools have been designed to find such bugs. A key component of race detectors is an algorithm for series-parallel (SP) maintenance, which identifies whether two accesses are logically parallel. This paper describes an asymptotically optimal algorithm, called WSP-Order, for performing SP maintenance in programs with fork-join (or nested) parallelism. Given a fork-join program with T1 work and T∞ span, WSP-Order executes it while also maintaining SP relationships in O(T1/P + T∞) time on P processors, which is asymptotically optimal. At the heart of WSP-Order is a work-stealing scheduler designed specifically for SP maintenance. We also implemented C-RACER, a race-detector based on WSP-Order within the Cilk Plus runtime system, and evaluated its performance on five benchmarks. Empirical results demonstrate that when run sequentially, it performs almost as well as previous best sequential race detectors. 
More importantly, when run in parallel, it achieves almost as much speedup as the original program without race-detection.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131508808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 37
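The abstract above describes series-parallel (SP) maintenance as deciding whether two accesses are logically parallel. One classical way to answer that query, sketched here offline on a known parse tree, is to keep two total orders and compare them. This is not WSP-Order itself, which the abstract says maintains SP relationships during parallel execution. The tuple-based tree encoding and the function names are assumptions made for illustration.

```python
def orders(tree):
    """Return two rank maps over the leaves of a series-parallel parse tree.

    A tree is either a leaf name (str) or a tuple (kind, left, right), where
    kind is "S" (series: left runs before right) or "P" (parallel).
    The first order visits children left to right everywhere; the second
    visits the children of every "P" node right to left.
    """
    first, second = [], []

    def walk(node, out, flip_parallel):
        if isinstance(node, str):
            out.append(node)
            return
        kind, left, right = node
        if kind == "P" and flip_parallel:
            left, right = right, left
        walk(left, out, flip_parallel)
        walk(right, out, flip_parallel)

    walk(tree, first, False)
    walk(tree, second, True)
    return ({n: i for i, n in enumerate(first)},
            {n: i for i, n in enumerate(second)})

def logically_parallel(tree, u, v):
    """Two leaves are parallel exactly when the two orders disagree on them."""
    first, second = orders(tree)
    return (first[u] < first[v]) != (second[u] < second[v])

# a; then b and c forked in parallel; then d after the join.
program = ("S", "a", ("S", ("P", "b", "c"), "d"))
print(logically_parallel(program, "b", "c"))  # True
print(logically_parallel(program, "a", "d"))  # False
print(logically_parallel(program, "b", "d"))  # False
```

A race detector would combine such a query with a record of earlier accesses to each memory location, reporting a race when two logically parallel accesses touch the same location and at least one is a write.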
Shuffles and Circuits: (On Lower Bounds for Modern Parallel Computation)
T. Roughgarden, Sergei Vassilvitskii, Joshua R. Wang
{"title":"Shuffles and Circuits: (On Lower Bounds for Modern Parallel Computation)","authors":"T. Roughgarden, Sergei Vassilvitskii, Joshua R. Wang","doi":"10.1145/2935764.2935799","DOIUrl":"https://doi.org/10.1145/2935764.2935799","url":null,"abstract":"The goal of this paper is to identify fundamental limitations on how efficiently algorithms implemented on platforms such as MapReduce and Hadoop can compute the central problems in the motivating application domains, such as graph connectivity problems. We introduce an abstract model of massively parallel computation, where essentially the only restrictions are that the \"fan-in\" of each machine is limited to s bits, where s is smaller than the input size n, and that computation proceeds in synchronized rounds, with no communication between different machines within a round. Lower bounds on the round complexity of a problem in this model apply to every computing platform that shares the most basic design principles of MapReduce-type systems. We prove that computations in our model that use few rounds can be represented as low-degree polynomials over the reals. This connection allows us to translate a lower bound on the (approximate) polynomial degree of a Boolean function to a lower bound on the round complexity of every (randomized) massively parallel computation of that function. These lower bounds apply even in the \"unbounded width\" version of our model, where the number of machines can be arbitrarily large. As one example of our general results, computing any non-trivial monotone graph property --- such as connectivity --- requires a super-constant number of rounds when every machine can accept only a sub-polynomial (in n) number of input bits s. Finally, we prove that, in two senses, our lower bounds are the best one could hope for. For the unbounded-width model, we prove a matching upper bound. 
Restricting to a polynomial number of machines, we show that asymptotically better lower bounds require proving that P ≠ NC^1.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127825496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 69
Clairvoyant Dynamic Bin Packing for Job Scheduling with Minimum Server Usage Time
Runtian Ren, Xueyan Tang
{"title":"Clairvoyant Dynamic Bin Packing for Job Scheduling with Minimum Server Usage Time","authors":"Runtian Ren, Xueyan Tang","doi":"10.1145/2935764.2935775","DOIUrl":"https://doi.org/10.1145/2935764.2935775","url":null,"abstract":"The MinUsageTime Dynamic Bin Packing (DBP) problem aims to minimize the accumulated usage time of all the bins in the packing process. It models the server acquisition and job scheduling issues in many cloud-based systems. Earlier work has studied MinUsageTime DBP in the non-clairvoyant setting where the departure time of each item is not known at the time of its arrival. In this paper, we investigate MinUsageTime DBP in the clairvoyant setting where the departure time of each item is known for packing purposes. We study both the offline and online versions of Clairvoyant MinUsageTime DBP. We present two approximation algorithms for the offline problem, including a 5-approximation Duration Descending First Fit algorithm and a 4-approximation Dual Coloring algorithm. For the online problem, we establish a lower bound of (1+√5)/2 on the competitive ratio of any online packing algorithm. We propose two strategies of item classification for online packing, including a classify-by-departure-time strategy and a classify-by-duration strategy. We analyze the competitiveness of these strategies when they are applied to the classical First Fit packing algorithm. 
It is shown that both strategies can substantially reduce the competitive ratio for Clairvoyant MinUsageTime DBP compared to the original First Fit algorithm.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116767018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 30
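The abstract above names a Duration Descending First Fit algorithm for the offline problem. The sketch below is one plausible reading of that name: sort items by duration, longest first, and put each into the first bin that has room for it throughout its active interval. The paper's exact rules may differ. The `Item` class, the unit bin capacity, and all function names are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Item:
    arrival: float
    departure: float
    size: float  # fraction of a unit-capacity bin

def fits(bin_items, item, capacity=1.0):
    """Check that `item` fits in the bin at every moment of its interval."""
    # The bin's load only rises at arrivals, so checking the new item's
    # arrival and the arrivals that fall inside its interval is enough.
    points = [item.arrival] + [
        o.arrival for o in bin_items if item.arrival < o.arrival < item.departure
    ]
    for t in points:
        load = sum(o.size for o in bin_items if o.arrival <= t < o.departure)
        if load + item.size > capacity + 1e-9:
            return False
    return True

def duration_descending_first_fit(items):
    """Pack items, longest duration first, each into the first bin that fits."""
    bins = []
    by_duration = sorted(items, key=lambda it: it.departure - it.arrival,
                         reverse=True)
    for item in by_duration:
        for b in bins:
            if fits(b, item):
                b.append(item)
                break
        else:
            bins.append([item])  # no existing bin fits: open a new one
    return bins

def usage_time(bin_items):
    """Total time during which the bin holds at least one item."""
    total, start, end = 0.0, None, None
    for it in sorted(bin_items, key=lambda it: it.arrival):
        if end is None or it.arrival > end:
            if end is not None:
                total += end - start
            start, end = it.arrival, it.departure
        else:
            end = max(end, it.departure)
    if end is not None:
        total += end - start
    return total

items = [Item(0, 10, 0.6), Item(0, 4, 0.5), Item(5, 9, 0.4), Item(2, 3, 0.3)]
bins = duration_descending_first_fit(items)
print(len(bins))                         # 2
print(sum(usage_time(b) for b in bins))  # 14.0
```

In this instance the long item opens the first bin, and the two short items that do not overlap each other share it, so the second bin is needed only for the interval from 0 to 4.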
The Cost of Unknown Diameter in Dynamic Networks
Haifeng Yu, Yuda Zhao, Irvan Jahja
{"title":"The Cost of Unknown Diameter in Dynamic Networks","authors":"Haifeng Yu, Yuda Zhao, Irvan Jahja","doi":"10.1145/2935764.2935781","DOIUrl":"https://doi.org/10.1145/2935764.2935781","url":null,"abstract":"For dynamic networks with unknown diameter, we prove novel lower bounds on the time complexity of a range of basic distributed computing problems. Together with trivial upper bounds under dynamic networks with known diameter for these problems, our lower bounds show that the complexities of all these problems are sensitive to whether the diameter is known to the protocol beforehand: Not knowing the diameter increases the time complexities by a large poly(N) factor as compared to when the diameter is known, resulting in an exponential gap. Here N is the number of nodes in the network. Our lower bounds are obtained via communication complexity arguments and by reducing from the two-party DisjointnessCP problem. We further prove that sometimes this large poly(N) cost can be completely avoided if the protocol is given a good estimate of N. In other words, having such an estimate makes some problems no longer sensitive to unknown diameter.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132676932","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
A Multicore Path to Connectomics-on-Demand
A. Matveev, Yaron Meirovitch, Hayk Saribekyan, Wiktor Jakubiuk, Tim Kaler, Gergely Ódor, D. Budden, A. Zlateski, N. Shavit
{"title":"A Multicore Path to Connectomics-on-Demand","authors":"A. Matveev, Yaron Meirovitch, Hayk Saribekyan, Wiktor Jakubiuk, Tim Kaler, Gergely Ódor, D. Budden, A. Zlateski, N. Shavit","doi":"10.1145/3155284.3018766","DOIUrl":"https://doi.org/10.1145/3155284.3018766","url":null,"abstract":"Connectomics is an emerging field of neurobiology that uses cutting edge machine learning and image processing to extract brain connectivity graphs from electron microscopy images. It has long been assumed that the processing of connectomics data will require mass storage and farms of CPUs and GPUs and will take months if not years. This talk shows the feasibility of designing a high-throughput connectomics-on-demand system that runs on a multicore machine with less than 100 cores and extracts connectomes at the terabyte per hour pace of modern electron microscopes. Building this system required solving algorithmic and performance engineering issues related to scaling machine learning on multicore architectures, and may have important lessons for other problem spaces in the natural sciences, where until now large distributed server or GPU farms seemed to be the only way to go.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116204820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 16