PowerLyra: Differentiated Graph Computation and Partitioning on Skewed Graphs
Rong Chen, Jiaxin Shi, Yanzhe Chen, B. Zang, Haibing Guan, Haibo Chen
ACM Transactions on Parallel Computing, pages 13:1-13:39, January 23, 2019. DOI: https://doi.org/10.1145/3298989

Abstract: Natural graphs with skewed distributions raise unique challenges for distributed graph computation and partitioning. Existing graph-parallel systems usually use a "one-size-fits-all" design that processes all vertices uniformly: they either suffer from notable load imbalance and high contention on high-degree vertices (e.g., Pregel and GraphLab) or incur high communication cost and memory consumption even for low-degree vertices (e.g., PowerGraph and GraphX). In this article, we argue that the skewed distributions in natural graphs also necessitate differentiated processing of high-degree and low-degree vertices. We then introduce PowerLyra, a new distributed graph processing system that embraces the best of both worlds of existing graph-parallel systems. Specifically, PowerLyra uses centralized computation for low-degree vertices to avoid frequent communication and distributes the computation for high-degree vertices to balance workloads. PowerLyra further provides an efficient hybrid graph partitioning algorithm (hybrid-cut) that combines edge-cut (for low-degree vertices) and vertex-cut (for high-degree vertices) with heuristics. To improve the cache locality of inter-node graph accesses, PowerLyra also provides a locality-conscious data layout optimization. PowerLyra is implemented on the latest GraphLab and seamlessly supports various graph algorithms running in both synchronous and asynchronous execution modes. A detailed evaluation on three clusters using various graph-analytics and MLDM (Machine Learning and Data Mining) applications shows that PowerLyra outperforms PowerGraph by up to 5.53x (from 1.24x) on real-world graphs and up to 3.26x (from 1.49x) on synthetic graphs, and is much faster than other systems such as GraphX and Giraph while consuming much less memory. A port of hybrid-cut to GraphX further confirms the efficiency and generality of PowerLyra.
Lock Contention Management in Multithreaded MPI
A. Amer, Huiwei Lu, P. Balaji, Milind Chabbi, Yanjie Wei, J. Hammond, S. Matsuoka
ACM Transactions on Parallel Computing, pages 12:1-12:21, January 23, 2019. DOI: https://doi.org/10.1145/3275443

Abstract: In this article, we investigate contention management in lock-based thread-safe MPI libraries. Specifically, we make two assumptions: (1) locks are the only form of synchronization used to protect communication paths; and (2) contention occurs, and thus serialization is unavoidable. Our work distinguishes lock acquisitions by the work performed inside the critical section: productive vs. unproductive. Waiting for message reception without doing anything else inside a critical section is an example of an unproductive lock acquisition. We show that the high-throughput nature of modern scalable locking protocols translates into better communication progress for throughput-intensive MPI communication but negatively impacts latency-sensitive communication because of overzealous unproductive lock acquisition. To reduce unproductive lock acquisitions, we devised a method that promotes threads with productive work using a generic two-level priority locking protocol. Our results show that using a high-throughput protocol for productive work and a fair protocol for less productive code paths ensures the best tradeoff for fine-grained communication, whereas a fair protocol is sufficient for more coarse-grained communication. Although these efforts have been rewarding, scalability degradation remains significant. We discuss techniques that diverge from the pure locking model and offer the potential to further improve scalability.
An Autotuning Protocol to Rapidly Build Autotuners
Junhong Liu, Guangming Tan, Yulong Luo, Jiajia Li, Z. Mo, Ninghui Sun
ACM Transactions on Parallel Computing, pages 9:1-9:25, January 23, 2019. DOI: https://doi.org/10.1145/3291527

Abstract: Automatic performance tuning (autotuning) is an increasingly critical technique for achieving high, portable performance in exascale applications. However, constructing an autotuner from scratch remains a challenge, even for domain experts. In this work, we propose a performance tuning and knowledge management suite (PAK) to help rapidly build autotuners. To accommodate existing autotuning techniques, we present an autotuning protocol composed of an extractor, a producer, an optimizer, an evaluator, and a learner. To achieve modularity and reusability, we also define programming interfaces for each protocol component as the fundamental infrastructure, which provides a customizable mechanism for deploying knowledge mining in the performance database. PAK's usability is demonstrated on two important computational kernels: stencil computation and sparse matrix-vector multiplication (SpMV). Autotuners built on PAK show performance comparable to traditional autotuners with higher productivity, requiring just a few tens of lines of code written against our autotuning protocol.
Scheduling Dynamic Parallel Workload of Mobile Devices with Access Guarantees
Antonio Fernández, D. Kowalski, Miguel A. Mosteiro, Prudence W. H. Wong
ACM Transactions on Parallel Computing, pages 10:1-10:19, December 8, 2018. DOI: https://doi.org/10.1145/3291529

Abstract: We study a dynamic resource-allocation problem that arises in various parallel computing scenarios, such as mobile cloud computing, cloud computing systems, and Internet of Things systems. Generically, we model the architecture as client mobile devices and static base stations. Each client "arrives" at the system, uploads data to base stations by radio transmission, and then "leaves." The problem, called Station Assignment, is to assign clients to stations so that every client uploads its data under given restrictions: a target subset of stations, a maximum delay between transmissions, a volume of data to upload, and a maximum bandwidth per station. We study the solvability of Station Assignment under an adversary that controls the arrival and departure of clients, limited only by a maximum rate and burstiness of arrivals. We show upper and lower bounds on the rate and burstiness for various client arrival schedules and protocol classes. To the best of our knowledge, this is the first time that Station Assignment has been studied under adversarial arrivals and departures.
{"title":"New High Performance GPGPU Code Transformation Framework Applied to Large Production Weather Prediction Code","authors":"Michel Müller, T. Aoki","doi":"10.1145/3291523","DOIUrl":"https://doi.org/10.1145/3291523","url":null,"abstract":"We introduce “Hybrid Fortran,” a new approach that allows a high-performance GPGPU port for structured grid Fortran codes. This technique only requires minimal changes for a CPU targeted codebase, which is a significant advancement in terms of productivity. It has been successfully applied to both dynamical core and physical processes of ASUCA, a Japanese mesoscale weather prediction model with more than 150k lines of code. By means of a minimal weather application that resembles ASUCA’s code structure, Hybrid Fortran is compared to both a performance model as well as today’s commonly used method, OpenACC. As a result, the Hybrid Fortran implementation is shown to deliver the same or better performance than OpenACC, and its performance agrees with the model both on CPU and GPU. In a full-scale production run, using an ASUCA grid with 1581 × 1301 × 58 cells and real-world weather data in 2km resolution, 24 NVIDIA Tesla P100 running the Hybrid Fortran–based GPU port are shown to replace more than fifty 18-core Intel Xeon Broadwell E5-2695 v4 running the reference implementation—an achievement comparable to more invasive GPGPU rewrites of other weather models.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"31 1","pages":"7:1-7:42"},"PeriodicalIF":1.6,"publicationDate":"2018-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79334909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive Optimization Modeling of Preconditioned Conjugate Gradient on Multi-GPUs","authors":"Jiaquan Gao, Yu Wang, Jun Wang, Ronghua Liang","doi":"10.1145/2990849","DOIUrl":"https://doi.org/10.1145/2990849","url":null,"abstract":"The preconditioned conjugate gradient (PCG) algorithm is a well-known iterative method for solving sparse linear systems in scientific computations. GPU-accelerated PCG algorithms for large-sized problems have attracted considerable attention recently. However, on a specific multi-GPU platform, producing a highly parallel PCG implementation for any large-sized problem requires significant time because several manual steps are involved in adjusting the related parameters and selecting an appropriate storage format for the matrix block that is assigned to each GPU. This motivates us to propose adaptive optimization modeling of PCG on multi-GPUs, which mainly involves the following parts: (1) an optimization multi-GPU parallel framework of PCG and (2) the profile-based optimization modeling for each one of the main components of the PCG algorithm, including vector operation, inner product, and sparse matrix-vector multiplication (SpMV). Our model does not construct a new storage format or kernel but automatically and rapidly generates an optimal parallel PCG algorithm for any problem on a specific multi-GPU platform by integrating existing storage formats and kernels. We take a vector operation kernel, an inner-product kernel, and five popular SpMV kernels for an example to present the idea of constructing the model. Given that our model is general, independent of the problems, and dependent on the resources of devices, this model is constructed only once for each type of GPU. The experiments validate the high efficiency of our proposed model.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"26 1","pages":"16:1-16:33"},"PeriodicalIF":1.6,"publicationDate":"2016-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82616127","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Damaris: Addressing Performance Variability in Data Management for Post-Petascale Simulations
Matthieu Dorier, Gabriel Antoniu, F. Cappello, M. Snir, R. Sisneros, Orcun Yildiz, Shadi Ibrahim, T. Peterka, Leigh Orf
ACM Transactions on Parallel Computing, pages 15:1-15:43, December 26, 2016. DOI: https://doi.org/10.1145/2987371

Abstract: With exascale computing on the horizon, reducing performance variability in data management tasks (storage, visualization, analysis, etc.) is becoming a key challenge in sustaining high performance. This variability significantly impacts overall application performance at scale and its predictability over time.

In this article, we present Damaris, a system that leverages dedicated cores in multicore nodes to offload data management tasks, including I/O, data compression, scheduling of data movements, in situ analysis, and visualization. We evaluate Damaris with the CM1 atmospheric simulation and the Nek5000 computational fluid dynamics simulation on four platforms, including NICS's Kraken and NCSA's Blue Waters. Our results show that (1) Damaris fully hides the I/O variability as well as all I/O-related costs, making simulation performance predictable; (2) it increases sustained write throughput by a factor of up to 15 compared with standard I/O approaches; (3) it allows almost perfect scalability of the simulation up to over 9,000 cores, where state-of-the-art approaches fail to scale; and (4) it enables a seamless connection to the VisIt visualization software for in situ analysis and visualization that impacts neither the performance of the simulation nor its variability.

In addition, we extended our implementation of Damaris to also support dedicated nodes and conducted a thorough comparison of the two approaches, dedicated cores and dedicated nodes, for I/O tasks with the aforementioned applications.
{"title":"Transparently Space Sharing a Multicore Among Multiple Processes","authors":"T. Creech, R. Barua","doi":"10.1145/3001910","DOIUrl":"https://doi.org/10.1145/3001910","url":null,"abstract":"As hardware becomes increasingly parallel and the availability of scalable parallel software improves, the problem of managing multiple multithreaded applications (processes) becomes important. Malleable processes, which can vary the number of threads used as they run, enable sophisticated and flexible resource management. Although many existing applications parallelized for SMPs with parallel runtimes are in fact already malleable, deployed runtime environments provide no interface nor any strategy for intelligently allocating hardware threads or even preventing oversubscription. Prior research methods either depend on profiling applications ahead of time to make good decisions about allocations or do not account for process efficiency at all, leading to poor performance. None of these prior methods have been adapted widely in practice. This article presents the Scheduling and Allocation with Feedback (SCAF) system: a drop-in runtime solution that supports existing malleable applications in making intelligent allocation decisions based on observed efficiency without any changes to semantics, program modification, offline profiling, or even recompilation. Our existing implementation can control most unmodified OpenMP applications. Other malleable threading libraries can also easily be supported with small modifications without requiring application modification or recompilation.\u0000 In this work, we present the SCAF daemon and a SCAF-aware port of the GNU OpenMP runtime. We present a new technique for estimating process efficiency purely at runtime using available hardware counters and demonstrate its effectiveness in aiding allocation decisions.\u0000 We evaluated SCAF using NAS NPB parallel benchmarks on five commodity parallel platforms, enumerating architectural features and their effects on our scheme. We measured the benefit of SCAF in terms of sum of speedups improvement (a common metric for multiprogrammed environments) when running all benchmark pairs concurrently compared to equipartitioning—the best existing competing scheme in the literature. We found that SCAF improves on equipartitioning on four out of five machines, showing a mean improvement factor in sum of speedups of 1.04 to 1.11x for benchmark pairs, depending on the machine, and 1.09x on average.\u0000 Since we are not aware of any widely available tool for equipartitioning, we also compare SCAF against multiprogramming using unmodified OpenMP, which is the only environment available to end users today. SCAF improves on the unmodified OpenMP runtimes for all five machines, with a mean improvement of 1.08 to 2.07x, depending on the machine, and 1.59x on average.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"28 1","pages":"17:1-17:35"},"PeriodicalIF":1.6,"publicationDate":"2016-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88132169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Selecting Multiple Order Statistics with a Graphics Processing Unit
Jeffrey D. Blanchard, Erik Opavsky, Emircan Uysaler
ACM Transactions on Parallel Computing, pages 10:1-10:23, August 8, 2016. DOI: https://doi.org/10.1145/2948974

Abstract: Extracting a set of multiple order statistics from a huge data set provides important information about the distribution of the values in the full set of data. This article introduces an algorithm, bucketMultiSelect, for simultaneously selecting multiple order statistics with a graphics processing unit (GPU). Typically, when a large set of order statistics is desired, the vector is sorted. When the sorted version of the vector is not needed, bucketMultiSelect significantly reduces computation time by eliminating a large portion of the unnecessary operations involved in sorting. For large vectors, bucketMultiSelect returns thousands of order statistics in less time than sorting the vector, while typically using less memory. For vectors containing 2^28 values of type double, bucketMultiSelect selects the 101 percentile order statistics in less than 95ms and is more than 8x faster than sorting the vector with a GPU-optimized merge sort.
Compiling Affine Loop Nests for a Dynamic Scheduling Runtime on Shared and Distributed Memory
Roshan Dathathri, Ravi Teja Mullapudi, Uday Bondhugula
ACM Transactions on Parallel Computing, pages 12:1-12:28, August 8, 2016. DOI: https://doi.org/10.1145/2948975

Abstract: Current de facto parallel programming models such as OpenMP and MPI make it difficult to extract task-level dataflow parallelism as opposed to bulk-synchronous parallelism. Task-parallel approaches that use point-to-point synchronization between dependent tasks, in conjunction with dynamically scheduling dataflow runtimes, are thus becoming attractive. Although these approaches can extract good performance on both shared and distributed memory, there is little compiler support for them.

In this article, we describe the design of compiler-runtime interaction to automatically extract coarse-grained dataflow parallelism from affine loop nests for both shared- and distributed-memory architectures. We use techniques from the polyhedral compiler framework to extract tasks and to generate components of the runtime that dynamically schedule the generated tasks. The runtime includes a distributed, decentralized scheduler that dynamically schedules tasks on each node. The schedulers on different nodes cooperate through asynchronous point-to-point communication, and all of this is achieved by code automatically generated by the compiler. On a set of six representative affine loop nest benchmarks, running on 32 nodes with 8 threads each, our compiler-assisted runtime yields a geometric mean speedup of 143.6x (70.3x to 474.7x) over the sequential version and a geometric mean speedup of 1.64x (1.04x to 2.42x) over a state-of-the-art automatic parallelization approach that uses bulk synchronization. We also compare our system with past work that addresses some of these challenges on shared memory, and with an emerging runtime (Intel Concurrent Collections) that demands more programmer input and effort for parallelization. To the best of our knowledge, ours is the first automatic scheme that allows dynamic scheduling of affine loop nests on a cluster of multicores.