ACM Transactions on Parallel Computing最新文献

筛选
英文 中文
Sixteen Heuristics for Joint Optimization of Performance, Energy, and Temperature in Allocating Tasks to Multi-Cores 多核任务分配中性能、能量和温度联合优化的16种启发式方法
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2016-08-08 DOI: 10.1145/2948973
Hafiz Fahad Sheikh, I. Ahmad
{"title":"Sixteen Heuristics for Joint Optimization of Performance, Energy, and Temperature in Allocating Tasks to Multi-Cores","authors":"Hafiz Fahad Sheikh, I. Ahmad","doi":"10.1145/2948973","DOIUrl":"https://doi.org/10.1145/2948973","url":null,"abstract":"Three-way joint optimization of performance (P), energy (E), and temperature (T) in scheduling parallel tasks to multiple cores poses a challenge that is staggering in its computational complexity. The goal of the PET optimized scheduling (PETOS) problem is to minimize three quantities: the completion time of a task graph, the total energy consumption, and the peak temperature of the system. Algorithms based on conventional multi-objective optimization techniques can be designed for solving the PETOS problem. But their execution times are exceedingly high and hence their applicability is restricted merely to problems of modest size. Exacerbating the problem is the solution space that is typically a Pareto front since no single solution can be strictly best along all three objectives. Thus, not only is the absolute quality of the solutions important but “the spread of the solutions” along each objective and the distribution of solutions within the generated tradeoff front are also desired. A natural alternative is to design efficient heuristic algorithms that can generate good solutions as well as good spreads -- note that most of the prior work in energy-efficient task allocation is predominantly single- or dual-objective oriented. Given a directed acyclic graph (DAG) representing a parallel program, a heuristic encompasses policies as to what tasks should go to what cores and at what frequency should that core operate. Various policies, such as greedy, iterative, and probabilistic, can be employed. However, the choice and usage of these policies can influence a heuristic towards a particular objective and can also profoundly impact its performance. This article proposes 16 heuristics that utilize various methods for task-to-core allocation and frequency selection. This article also presents a methodical classification scheme which not only categorizes the proposed heuristics but can also accommodate additional heuristics. Extensive simulation experiments compare these algorithms while shedding light on their strengths and tradeoffs.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"22 1","pages":"9:1-9:29"},"PeriodicalIF":1.6,"publicationDate":"2016-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90794321","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 16
Hypergraph Partitioning for Sparse Matrix-Matrix Multiplication 稀疏矩阵-矩阵乘法的超图划分
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2016-03-17 DOI: 10.1145/3015144
Grey Ballard, Alex Druinsky, Nicholas Knight, O. Schwartz
{"title":"Hypergraph Partitioning for Sparse Matrix-Matrix Multiplication","authors":"Grey Ballard, Alex Druinsky, Nicholas Knight, O. Schwartz","doi":"10.1145/3015144","DOIUrl":"https://doi.org/10.1145/3015144","url":null,"abstract":"We propose a fine-grained hypergraph model for sparse matrix-matrix multiplication (SpGEMM), a key computational kernel in scientific computing and data analysis whose performance is often communication bound. This model correctly describes both the interprocessor communication volume along a critical path in a parallel computation and also the volume of data moving through the memory hierarchy in a sequential computation. We show that identifying a communication-optimal algorithm for particular input matrices is equivalent to solving a hypergraph partitioning problem. Our approach is nonzero structure dependent, meaning that we seek the best algorithm for the given input matrices.\u0000 In addition to our three-dimensional fine-grained model, we also propose coarse-grained one-dimensional and two-dimensional models that correspond to simpler SpGEMM algorithms. We explore the relations between our models theoretically, and we study their performance experimentally in the context of three applications that use SpGEMM as a key computation. For each application, we find that at least one coarse-grained model is as communication efficient as the fine-grained model. We also observe that different applications have affinities for different algorithms.\u0000 Our results demonstrate that hypergraphs are an accurate model for reasoning about the communication costs of SpGEMM as well as a practical tool for exploring the SpGEMM algorithm design space.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"9 12 1","pages":"18:1-18:34"},"PeriodicalIF":1.6,"publicationDate":"2016-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90366897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 46
Time-Warp: Efficient Abort Reduction in Transactional Memory 时间扭曲:事务性内存中的有效中断减少
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2015-07-08 DOI: 10.1145/2775435
Nuno Diegues, P. Romano
{"title":"Time-Warp: Efficient Abort Reduction in Transactional Memory","authors":"Nuno Diegues, P. Romano","doi":"10.1145/2775435","DOIUrl":"https://doi.org/10.1145/2775435","url":null,"abstract":"The multicore revolution that took place one decade ago has turned parallel programming into a major concern for the mainstream software development industry. In this context, Transactional Memory (TM) has emerged as a simpler and attractive alternative to that of lock-based synchronization, whose complexity and error-proneness are widely recognized.\u0000 The notion of permissiveness in TM translates to only aborting a transaction when it cannot be accepted in any history that guarantees a target correctness criterion. This theoretically powerful property is often neglected by state-of-the-art TMs because it imposes considerable algorithmic costs. Instead, these TMs opt to maximize their implementation’s efficiency by aborting transactions under overly conservative conditions. As a result, they risk rejecting a significant number of safe executions.\u0000 In this article, we seek to identify a sweet spot between permissiveness and efficiency by introducing the Time-Warp Multiversion (TWM) algorithm. TWM is based on the key idea of allowing an update transaction that has performed stale reads (i.e., missed the writes of concurrently committed transactions) to be serialized by “committing it in the past,” which we call a time-warp commit. At its core, TWM uses a novel, lightweight validation mechanism with little computational overhead. TWM also guarantees that read-only transactions can never be aborted. Further, TWM guarantees Virtual World Consistency, a safety property that is deemed as particularly relevant in the context of TM.\u0000 We demonstrate the practicality of this approach through an extensive experimental study: we compare TWM with five other TMs, representative of typical alternative design choices, and on a wide variety of benchmarks. This study shows an average performance improvement across all considered workloads and TMs of 65% in high concurrency scenarios, with gains extending up to 9 × with the most favorable benchmarks. These results are a consequence of TWM’s ability to achieve drastic reduction of aborts in scenarios of nonminimal contention, while introducing little overhead (approximately 10%) in worst-case, synthetically designed scenarios (i.e., no contention or contention patterns that cannot be optimized using TWM).","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"32 1","pages":"12:1-12:44"},"PeriodicalIF":1.6,"publicationDate":"2015-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84859845","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 6
Supporting Time-Based QoS Requirements in Software Transactional Memory 在软件事务性内存中支持基于时间的QoS需求
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2015-07-08 DOI: 10.1145/2779621
Walther Maldonado, P. Marlier, P. Felber, J. Lawall, Gilles Muller, E. Rivière
{"title":"Supporting Time-Based QoS Requirements in Software Transactional Memory","authors":"Walther Maldonado, P. Marlier, P. Felber, J. Lawall, Gilles Muller, E. Rivière","doi":"10.1145/2779621","DOIUrl":"https://doi.org/10.1145/2779621","url":null,"abstract":"Software transactional memory (STM) is an optimistic concurrency control mechanism that simplifies parallel programming. However, there has been little interest in its applicability to reactive applications in which there is a required response time for certain operations. We propose supporting such applications by allowing programmers to associate time with atomic blocks in the form of deadlines and quality-of-service (QoS) requirements. Based on statistics of past executions, we adjust the execution mode of transactions by decreasing the level of optimism as the deadline approaches. In the presence of concurrent deadlines, we propose different conflict resolution policies. Execution mode switching mechanisms allow the meeting of multiple deadlines in a consistent manner, with potential QoS degradations being split fairly among several threads as contention increases, and avoiding starvation. Our implementation consists of extensions to an STM runtime that allow gathering statistics and switching execution modes. We also propose novel contention managers adapted to transactional workloads subject to deadlines. The experimental evaluation shows that our approaches significantly improve the likelihood of a transaction meeting its deadline and QoS requirement, even in cases where progress is hampered by conflicts and other concurrent transactions with deadlines.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"65 1","pages":"10:1-10:30"},"PeriodicalIF":1.6,"publicationDate":"2015-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88974963","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2
TRADE: Precise Dynamic Race Detection for Scalable Transactional Memory Systems 贸易:精确动态竞争检测可扩展的事务性内存系统
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2015-07-08 DOI: 10.1145/2786021
Gokcen Kestor, O. Unsal, A. Cristal, S. Tasiran
{"title":"TRADE: Precise Dynamic Race Detection for Scalable Transactional Memory Systems","authors":"Gokcen Kestor, O. Unsal, A. Cristal, S. Tasiran","doi":"10.1145/2786021","DOIUrl":"https://doi.org/10.1145/2786021","url":null,"abstract":"As other multithreaded programs, transactional memory (TM) programs are prone to race conditions. Previous work focuses on extending existing definitions of data race for lock-based applications to TM applications, which requires all transactions to be totally ordered “as if” serialized by a global lock. This approach poses implementation constraints on the STM that severely limits TM applications’ performance.\u0000 This article shows that forcing total ordering among all running transactions, while sufficient, is not necessary. We introduce an alternative data race definition, relaxed transactional data race, that requires ordering of only conflicting transactions. The advantages of our relaxed definition are twofold: First, unlike the previous definition, this definition can be applied to a wide range of TMs, including those that do not enforce transaction total ordering. Second, within a single execution, it exposes a higher number of data races, which considerably reduces debugging time. Based on this definition, we propose a novel and precise race detection tool for C/C++ TM applications (TRADE), which detects data races by tracking happens-before edges among conflicting transactions.\u0000 Our experiments reveal that TRADE precisely detects data races for STAMP applications running on modern STMs with overhead comparable to state-of-the-art race detectors for lock-based applications. Our experiments also show that in a single run, TRADE identifies several races not discovered by 10 separate runs of a race detection tool based on the previous data race definition.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"53 1","pages":"11:1-11:23"},"PeriodicalIF":1.6,"publicationDate":"2015-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82703218","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 3
Remote Memory Access Programming in MPI-3 MPI-3中的远程内存访问编程
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2015-07-08 DOI: 10.1145/2780584
T. Hoefler, James Dinan, R. Thakur, Brian W. Barrett, P. Balaji, W. Gropp, K. Underwood
{"title":"Remote Memory Access Programming in MPI-3","authors":"T. Hoefler, James Dinan, R. Thakur, Brian W. Barrett, P. Balaji, W. Gropp, K. Underwood","doi":"10.1145/2780584","DOIUrl":"https://doi.org/10.1145/2780584","url":null,"abstract":"The Message Passing Interface (MPI) 3.0 standard, introduced in September 2012, includes a significant update to the one-sided communication interface, also known as remote memory access (RMA). In particular, the interface has been extended to better support popular one-sided and global-address-space parallel programming models to provide better access to hardware performance features and enable new data-access modes. We present the new RMA interface and specify formal axiomatic models for data consistency and access semantics. Such models can help users reason about details of the semantics that are hard to extract from the English prose in the standard. It also fosters the development of tools and compilers, enabling them to automatically analyze, optimize, and debug RMA programs.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"108 1","pages":"9:1-9:26"},"PeriodicalIF":1.6,"publicationDate":"2015-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76116149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 94
Assessing General-Purpose Algorithms to Cope with Fail-Stop and Silent Errors 评估处理故障停止和静默错误的通用算法
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2014-11-16 DOI: 10.1145/2897189
A. Benoit, Aurélien Cavelan, Y. Robert, Hongyang Sun
{"title":"Assessing General-Purpose Algorithms to Cope with Fail-Stop and Silent Errors","authors":"A. Benoit, Aurélien Cavelan, Y. Robert, Hongyang Sun","doi":"10.1145/2897189","DOIUrl":"https://doi.org/10.1145/2897189","url":null,"abstract":"In this article, we combine the traditional checkpointing and rollback recovery strategies with verification mechanisms to cope with both fail-stop and silent errors. The objective is to minimize makespan and/or energy consumption. For divisible load applications, we use first-order approximations to find the optimal checkpointing period to minimize execution time, with an additional verification mechanism to detect silent errors before each checkpoint, hence extending the classical formula by Young and Daly for fail-stop errors only. We further extend the approach to include intermediate verifications, and to consider a bicriteria problem involving both time and energy (linear combination of execution time and energy consumption). Then, we focus on application workflows whose dependence graph is a linear chain of tasks. Here, we determine the optimal checkpointing and verification locations, with or without intermediate verifications, for the bicriteria problem. Rather than using a single speed during the whole execution, we further introduce a new execution scenario, which allows for changing the execution speed via Dynamic Voltage and Frequency Scaling (DVFS). In this latter scenario, we determine the optimal checkpointing and verification locations, as well as the optimal speed pairs for each task segment between any two consecutive checkpoints. Finally, we conduct an extensive set of simulations to support the theoretical study, and to assess the performance of each algorithm, showing that the best overall performance is achieved under the most flexible scenario using intermediate verifications and different speeds.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"85 1","pages":"13:1-13:36"},"PeriodicalIF":1.6,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89030481","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 34
Parallel Scheduling of Task Trees with Limited Memory 有限内存条件下任务树的并行调度
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2014-10-01 DOI: 10.1145/2779052
Lionel Eyraud-Dubois, L. Marchal, O. Sinnen, F. Vivien
{"title":"Parallel Scheduling of Task Trees with Limited Memory","authors":"Lionel Eyraud-Dubois, L. Marchal, O. Sinnen, F. Vivien","doi":"10.1145/2779052","DOIUrl":"https://doi.org/10.1145/2779052","url":null,"abstract":"This article investigates the execution of tree-shaped task graphs using multiple processors. Each edge of such a tree represents some large data. A task can only be executed if all input and output data fit into memory, and a data can only be removed from memory after the completion of the task that uses it as an input data. Such trees arise in the multifrontal method of sparse matrix factorization. The peak memory needed for the processing of the entire tree depends on the execution order of the tasks. With one processor, the objective of the tree traversal is to minimize the required memory. This problem was well studied, and optimal polynomial algorithms were proposed.\u0000 Here, we extend the problem by considering multiple processors, which is of obvious interest in the application area of matrix factorization. With multiple processors comes the additional objective to minimize the time needed to traverse the tree—that is, to minimize the makespan. Not surprisingly, this problem proves to be much harder than the sequential one. We study the computational complexity of this problem and provide inapproximability results even for unit weight trees. We design a series of practical heuristics achieving different trade-offs between the minimization of peak memory usage and makespan. Some of these heuristics are able to process a tree while keeping the memory usage under a given memory limit. The different heuristics are evaluated in an extensive experimental evaluation using realistic trees.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"25 1","pages":"13:1-13:37"},"PeriodicalIF":1.6,"publicationDate":"2014-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81771666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 31
Simple Parallel and Distributed Algorithms for Spectral Graph Sparsification 谱图稀疏化的简单并行和分布式算法
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2014-02-16 DOI: 10.1145/2948062
I. Koutis, S. Xu
{"title":"Simple Parallel and Distributed Algorithms for Spectral Graph Sparsification","authors":"I. Koutis, S. Xu","doi":"10.1145/2948062","DOIUrl":"https://doi.org/10.1145/2948062","url":null,"abstract":"We describe simple algorithms for spectral graph sparsification, based on iterative computations of weighted spanners and sampling. Leveraging the algorithms of Baswana and Sen for computing spanners, we obtain the first distributed spectral sparsification algorithm in the CONGEST model. We also obtain a parallel algorithm with improved work and time guarantees, as well as other natural distributed implementations. Combining this algorithm with the parallel framework of Peng and Spielman for solving symmetric diagonally dominant linear systems, we get a parallel solver that is significantly more efficient in terms of the total work.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"18 1","pages":"14:1-14:14"},"PeriodicalIF":1.6,"publicationDate":"2014-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81561436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 44
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信