Latest Publications in ACM Transactions on Parallel Computing

Groute: Asynchronous Multi-GPU Programming Model with Applications to Large-scale Graph Processing
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2020-08-01 DOI: 10.1145/3399730
Tal Ben-Nun, M. Sutton, Sreepathi Pai, K. Pingali
{"title":"Groute: Asynchronous Multi-GPU Programming Model with Applications to Large-scale Graph Processing","authors":"Tal Ben-Nun, M. Sutton, Sreepathi Pai, K. Pingali","doi":"10.1145/3399730","DOIUrl":"https://doi.org/10.1145/3399730","url":null,"abstract":"ing with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. © 2020 Association for Computing Machinery. 2329-4949/2020/06-ART18 $15.00 https://doi.org/10.1145/3399730 ACM Transactions on Parallel Computing, Vol. 7, No. 3, Article 18. Publication date: June 2020. 18:2 T. Ben-Nun et al. Fig. 1. Multi-GPU node schematics. via a low-latency, high-throughput bus (see Figure 1). These interconnects allow parallel applications to exchange data efficiently and to take advantage of the combined computational power and memory size of the GPUs, but may vary substantially between node types. Multi-GPU nodes are usually programmed using one of two methods. In the simple approach, each GPU is managed separately, using one process per device [19, 26]. Alternatively, a Bulk Synchronous Parallel (BSP) [42] programming model is used, in which applications are executed in rounds, and each round consists of local computation followed by global communication [6, 33]. The first approach is subject to overhead from various sources, such as the operating system, and requires a message-passing interface for communication. The BSP model, however, can introduce unnecessary serialization at the global barriers that implement round-based execution. Both programming methods may result in under-utilization of multi-GPU platforms, particularly for irregular applications, which may suffer from load imbalance and may have unpredictable communication patterns. In principle, asynchronous programming models can reduce some of those problems, because unlike in round-based communication, processors can compute and communicate autonomously without waiting for other processors to reach global barriers. However, there are few applications that exploit asynchronous execution, since their development requires an in-depth knowledge of the underlying architecture and communication network and involves performing intricate adaptations to the code. This article presents Groute, an asynchronous programming model and runtime environment [2] that can be used to develop a wide range of applications on multi-GPU systems. Based on concepts from low-level networking, Groute aims to overcome the programming complexity of asynchronous applications on multi-GPU and heterogeneous platforms. The communication constructs of Groute are simple, but they can be used to efficiently express programs that range from regular applications and BSP applications to nontrivial irregular algorithms. The asynchronous nature of the runtime environment also promotes load balancing, leading to better utilization of heterogeneous multi-GPU nodes. This article is an extended version of previously published work [7], where we explain the concepts in greater detail, consider newer multi-GPU topologies, and elaborate on the evaluated algorithms, as well as scalability considerations. 
","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"32 1","pages":"18:1-18:27"},"PeriodicalIF":1.6,"publicationDate":"2020-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79522401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
FEAST: A Lightweight Lock-free Concurrent Binary Search Tree
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2020-05-31 DOI: 10.1145/3391438
Aravind Natarajan, Arunmoezhi Ramachandran, N. Mittal
{"title":"FEAST: A Lightweight Lock-free Concurrent Binary Search Tree","authors":"Aravind Natarajan, Arunmoezhi Ramachandran, N. Mittal","doi":"10.1145/3391438","DOIUrl":"https://doi.org/10.1145/3391438","url":null,"abstract":"We present a lock-free algorithm for concurrent manipulation of a binary search tree (BST) in an asynchronous shared memory system that supports search, insert, and delete operations. In addition to read and write instructions, our algorithm uses (single-word) compare-and-swap (CAS) and bit-test-and-set (BTS) read-modify-write (RMW) instructions, both of which are commonly supported by many modern processors including Intel 64 and AMD64. In contrast to most of the existing concurrent algorithms for a binary search tree, our algorithm is edge-based rather than node-based. When compared to other concurrent algorithms for a binary search tree, modify (insert and delete) operations in our algorithm (a) work on a smaller section of the tree, (b) execute fewer RMW instructions, or (c) use fewer dynamically allocated objects. In our experiments, our lock-free algorithm significantly outperformed all other algorithms for a concurrent binary search tree especially when the contention was high. We also describe modifications to our basic lock-free algorithm so that the amortized complexity of any operation in the modified algorithm can be bounded by the sum of the tree height and the point contention to within a constant factor while preserving the other desirable features of our algorithm.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"30 1","pages":"10:1-10:64"},"PeriodicalIF":1.6,"publicationDate":"2020-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82801233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
CoREC: Scalable and Resilient In-memory Data Staging for In-situ Workflows
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2020-05-16 DOI: 10.1145/3391448
Shaohua Duan, P. Subedi, Philip E. Davis, K. Teranishi, H. Kolla, Marc Gamell, M. Parashar
{"title":"CoREC: Scalable and Resilient In-memory Data Staging for In-situ Workflows","authors":"Shaohua Duan, P. Subedi, Philip E. Davis, K. Teranishi, H. Kolla, Marc Gamell, M. Parashar","doi":"10.1145/3391448","DOIUrl":"https://doi.org/10.1145/3391448","url":null,"abstract":"ing with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. © 2020 Association for Computing Machinery. 2329-4949/2020/05-ART12 $15.00 https://doi.org/10.1145/3391448 ACM Transactions on Parallel Computing, Vol. 7, No. 2, Article 12. Publication date: May 2020.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"14 1","pages":"12:1-12:29"},"PeriodicalIF":1.6,"publicationDate":"2020-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78792111","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
ROC: A Reconfigurable Optical Computer for Simulating Physical Processes
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2020-04-02 DOI: 10.1145/3380944
Jeff Anderson, Engin Kayraklioglu, Shuai Sun, Joseph Crandall, Y. Alkabani, Vikram K. Narayana, V. Sorger, T. El-Ghazawi
{"title":"ROC: A Reconfigurable Optical Computer for Simulating Physical Processes","authors":"Jeff Anderson, Engin Kayraklioglu, Shuai Sun, Joseph Crandall, Y. Alkabani, Vikram K. Narayana, V. Sorger, T. El-Ghazawi","doi":"10.1145/3380944","DOIUrl":"https://doi.org/10.1145/3380944","url":null,"abstract":"Due to the end of Moore’s law and Dennard scaling, we are entering a new era of processors. Computing systems are increasingly facing power and performance challenges due to both deviceand circuit-related challenges with resistive and capacitive charging. Non-von Neumann architectures are needed to support future computations through innovative post-Moore’s law architectures. To enable these emerging architectures with high-performance and at ultra-low power, both parallel computation and inter-node communication on-the-chip can be supported using photons. To this end, we introduce ROC, a reconfigurable optical computer that can solve partial differential equations (PDEs). PDE solvers form the basis for many traditional simulation problems in science and engineering that are currently performed on supercomputers. Instead of solving problems iteratively, the proposed engine uses a resistive mesh architecture to solve a PDE in a single iteration (one-shot). Instead of using actual electrical circuits, the physical underlying hardware emulates such structures using a silicon-photonics mesh that splits light into separate pathways, allowing it to add or subtract optical power analogous to programmable resistors. The time to obtain the PDE solution then only depends on the time-of-flight of a photon through the programmed mesh, which can be on the order of 10’s of picoseconds given the millimeter-compact integrated photonic circuit. Numerically validated experimental results show that, over multiple configurations, ROC can achieve several orders of magnitude improvement over state-of-the-art GPUs when speed, power, and size are taken into account. Further, it comes within approximately 90% precision of current numerical solvers. As such, ROC can be a viable reconfigurable, approximate computer with the potential for more precise results when replacing silicon-photonics building blocks with nanoscale photonic lumped-elements.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"21 1","pages":"8:1-8:29"},"PeriodicalIF":1.6,"publicationDate":"2020-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78470620","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Scheduling Mutual Exclusion Accesses in Equal-Length Jobs
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2019-09-10 DOI: 10.1145/3342562
D. Kagaris, S. Dutta
{"title":"Scheduling Mutual Exclusion Accesses in Equal-Length Jobs","authors":"D. Kagaris, S. Dutta","doi":"10.1145/3342562","DOIUrl":"https://doi.org/10.1145/3342562","url":null,"abstract":"A fundamental problem in parallel and distributed processing is the partial serialization that is imposed due to the need for mutually exclusive access to common resources. In this article, we investigate the problem of optimally scheduling (in terms of makespan) a set of jobs, where each job consists of the same number <i>L</i> of unit-duration tasks, and each task either accesses exclusively one resource from a given set of resources or accesses a fully shareable resource. We develop and establish the optimality of a fast polynomial-time algorithm to find a schedule with the shortest makespan for any number of jobs and for any number of resources for the case of <i>L</i> = 2. In the notation commonly used for job-shop scheduling problems, this result means that the problem <i>J</i> |<i>d</i><sub><i>ij</i></sub>=1, <i>n</i><sub><i>j</i></sub> =2|<i>C</i><sub>max</sub> is polynomially solvable, adding to the polynomial solutions known for the problems <i>J</i>2 | <i>n</i><sub><i>j</i></sub> ≤ 2 | <i>C</i><sub>max</sub> and <i>J</i>2 | <i>d</i><sub><i>ij</i></sub> = 1 | <i>C</i><sub>max</sub> (whereas other closely related versions such as <i>J</i>2 | <i>n</i><sub><i>j</i></sub> ≤ 3 | <i>C</i><sub>max</sub>, <i>J</i>2 | <i>d</i><sub><i>ij</i></sub> ∈ { 1,2} | <i>C</i><sub>max</sub>, <i>J</i>3 | <i>n</i><sub><i>j</i></sub> ≤ 2 | <i>C</i><sub>max</sub>, <i>J</i>3 | <i>d</i><sub><i>ij</i></sub>=1 | <i>C</i><sub>max</sub>, and <i>J</i> |<i>d</i><sub><i>ij</i></sub>=1, <i>n</i><sub><i>j</i></sub> ≤ 3| <i>C</i><sub>max</sub> are all known to be NP-complete). For the general case <i>L</i> > 2 (i.e., for the job-shop problem <i>J</i> |<i>d</i><sub><i>ij</i></sub>=1, <i>n</i><sub><i>j</i></sub> =<i>L</i>> 2| <i>C</i><sub>max</sub>), we present a competitive heuristic and provide experimental comparisons with other heuristic versions and, when possible, with the ideal integer linear programming formulation.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"9 1","pages":"8:1-8:26"},"PeriodicalIF":1.6,"publicationDate":"2019-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86220203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
I/O Scheduling Strategy for Periodic Applications
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2019-09-10 DOI: 10.1145/3338510
G. Aupy, Ana Gainaru, Valentin Le Fèvre
{"title":"I/O Scheduling Strategy for Periodic Applications","authors":"G. Aupy, Ana Gainaru, Valentin Le Fèvre","doi":"10.1145/3338510","DOIUrl":"https://doi.org/10.1145/3338510","url":null,"abstract":"With the ever-growing need of data in HPC applications, the congestion at the I/O level becomes critical in supercomputers. Architectural enhancement such as burst buffers and pre-fetching are added to machines but are not sufficient to prevent congestion. Recent online I/O scheduling strategies have been put in place, but they add an additional congestion point and overheads in the computation of applications.\u0000 In this work, we show how to take advantage of the periodic nature of HPC applications to develop efficient periodic scheduling strategies for their I/O transfers. Our strategy computes once during the job scheduling phase a pattern that defines the I/O behavior for each application, after which the applications run independently, performing their I/O at the specified times. Our strategy limits the amount of congestion at the I/O node level and can be easily integrated into current job schedulers. We validate this model through extensive simulations and experiments on an HPC cluster by comparing it to state-of-the-art online solutions, showing that not only does our scheduler have the advantage of being de-centralized and thus overcoming the overhead of online schedulers, but also that it performs better than the other solutions, improving the application dilation up to 16% and the maximum system efficiency up to 18%.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"3 1","pages":"7:1-7:26"},"PeriodicalIF":1.6,"publicationDate":"2019-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84499785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
Modeling Universal Globally Adaptive Load-Balanced Routing
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2019-09-10 DOI: 10.1145/3349620
Md Atiqul Mollah, Wenqi Wang, Peyman Faizian, Md. Shafayat Rahman, Xin Yuan, S. Pakin, M. Lang
{"title":"Modeling Universal Globally Adaptive Load-Balanced Routing","authors":"Md Atiqul Mollah, Wenqi Wang, Peyman Faizian, Md. Shafayat Rahman, Xin Yuan, S. Pakin, M. Lang","doi":"10.1145/3349620","DOIUrl":"https://doi.org/10.1145/3349620","url":null,"abstract":"Universal globally adaptive load-balanced (UGAL) routing has been proposed for various interconnection networks and has been deployed in a number of current-generation supercomputers. Although UGAL-based schemes have been extensively studied, most existing results are based on either simulation or measurement. Without a theoretical understanding of UGAL, multiple questions remain: For which traffic patterns is UGAL most suited? In addition, what determines the performance of the UGAL-based scheme on a particular network configuration? In this work, we develop a set of throughput models for UGALbased on linear programming. We show that the throughput models are valid across the torus, Dragonfly, and Slim Fly network topologies. Finally, we identify a robust model that can accurately and efficiently predict UGAL throughput for a set of representative traffic patterns across different topologies. Our models not only provide a mechanism to predict UGAL performance on large-scale interconnection networks but also reveal the inner working of UGAL and further our understanding of this type of routing.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"47 1","pages":"9:1-9:23"},"PeriodicalIF":1.6,"publicationDate":"2019-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84908923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Scalable Deep Learning via I/O Analysis and Optimization
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2019-09-10 DOI: 10.1145/3331526
S. Pumma, Min Si, W. Feng, P. Balaji
{"title":"Scalable Deep Learning via I/O Analysis and Optimization","authors":"S. Pumma, Min Si, W. Feng, P. Balaji","doi":"10.1145/3331526","DOIUrl":"https://doi.org/10.1145/3331526","url":null,"abstract":"Scalable deep neural network training has been gaining prominence because of the increasing importance of deep learning in a multitude of scientific and commercial domains. Consequently, a number of researchers have investigated techniques to optimize deep learning systems. Much of the prior work has focused on runtime and algorithmic enhancements to optimize the computation and communication. Despite these enhancements, however, deep learning systems still suffer from scalability limitations, particularly with respect to data I/O. This situation is especially true for training models where the computation can be effectively parallelized, leaving I/O as the major bottleneck. In fact, our analysis shows that I/O can take up to 90% of the total training time. Thus, in this article, we first analyze LMDB, the most widely used I/O subsystem of deep learning frameworks, to understand the causes of this I/O inefficiency. Based on our analysis, we propose LMDBIO—an optimized I/O plugin for scalable deep learning. LMDBIO includes six novel optimizations that together address the various shortcomings in existing I/O for deep learning. Our experimental results show that LMDBIO significantly outperforms LMDB in all cases and improves overall application performance by up to 65-fold on a 9,216-core system.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"23 1","pages":"6:1-6:34"},"PeriodicalIF":1.6,"publicationDate":"2019-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87253170","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 23
A High-Quality and Fast Maximal Independent Set Implementation for GPUs
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2019-01-23 DOI: 10.1145/3291525
Martin Burtscher, Sindhu Devale, S. Azimi, J. Jaiganesh, Evan Powers
{"title":"A High-Quality and Fast Maximal Independent Set Implementation for GPUs","authors":"Martin Burtscher, Sindhu Devale, S. Azimi, J. Jaiganesh, Evan Powers","doi":"10.1145/3291525","DOIUrl":"https://doi.org/10.1145/3291525","url":null,"abstract":"Computing a maximal independent set is an important step in many parallel graph algorithms. This article introduces ECL-MIS, a maximal independent set implementation that works well on GPUs. It includes key optimizations to speed up computation, reduce the memory footprint, and increase the set size. Its CUDA implementation requires fewer than 30 kernel statements, runs asynchronously, and produces a deterministic result. It outperforms the maximal independent set implementations of Pannotia, CUSP, and IrGL on each of the 16 tested graphs of various types and sizes. On a Titan X GPU, ECL-MIS is between 3.9 and 100 times faster (11.5 times, on average). ECL-MIS running on the GPU is also faster than the parallel CPU codes Ligra, Ligra+, and PBBS running on 20 Xeon cores, which it outperforms by 4.1 times, on average. At the same time, ECL-MIS produces maximal independent sets that are up to 52% larger (over 10%, on average) compared to these preexisting CPU and GPU implementations. Whereas these codes produce maximal independent sets that are, on average, about 15% smaller than the largest possible such sets, ECL-MIS sets are less than 6% smaller than the maximum independent sets.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"26 1","pages":"8:1-8:27"},"PeriodicalIF":1.6,"publicationDate":"2019-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80137186","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
BARAN: Bimodal Adaptive Reconfigurable-Allocator Network-on-Chip
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2019-01-23 DOI: 10.1145/3294049
Amirhossein Mirhosseini, Mohammad Sadrosadati, F. Aghamohammadi, M. Modarressi, H. Sarbazi-Azad
{"title":"BARAN: Bimodal Adaptive Reconfigurable-Allocator Network-on-Chip","authors":"Amirhossein Mirhosseini, Mohammad Sadrosadati, F. Aghamohammadi, M. Modarressi, H. Sarbazi-Azad","doi":"10.1145/3294049","DOIUrl":"https://doi.org/10.1145/3294049","url":null,"abstract":"Virtual channels are employed to improve the throughput under high traffic loads in Networks-on-Chips (NoCs). However, they can impose non-negligible overheads on performance by prolonging clock cycle time, especially under low traffic loads where the impact of virtual channels on performance is trivial. In this article, we propose a novel architecture, called BARAN, that can either improve on-chip network performance or reduce its power consumption (depending on the specific implementation chosen), not both at the same time, when virtual channels are underutilized; that is, the average number of virtual channel allocation requests per cycle is lower than the number of total virtual channels. We also introduce a reconfigurable arbitration logic within the BARAN architecture that can be configured to have multiple latencies and, hence, multiple slack times. The increased slack times are then used to reduce the supply voltage of the routers or increase their clock frequency in order to reduce power consumption or improve the performance of the whole NoC system. The power-centric design of BARAN reduces NoC power consumption by 43.4% and 40.6% under CMP and GPU workloads, on average, respectively, compared to a baseline architecture while imposing negligible area and performance overheads. The performance-centric design of BARAN reduces the average packet latency by 45.4% and 42.1%, on average, under CMP and GPU workloads, respectively, compared to the baseline architecture while increasing power consumption by 39.7% and 43.7%, on average. Moreover, the performance-centric BARAN postpones the network saturation rate by 11.5% under uniform random traffic compared to the baseline architecture.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"17 1","pages":"11:1-11:29"},"PeriodicalIF":1.6,"publicationDate":"2019-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74432091","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10