2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)最新文献_第7页

swDNN: A Library for Accelerating Deep Learning Applications on Sunway TaihuLight swDNN:一个加速神威太湖之光上深度学习应用的库

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.20

Jiarui Fang, H. Fu, Wenlai Zhao, Bingwei Chen, Weijie Zheng, Guangwen Yang

引用次数: 67

MRapid: An Efficient Short Job Optimizer on Hadoop MRapid:一个高效的Hadoop短作业优化器

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.100

Hong Zhang, Hai Huang, Liqiang Wang

引用次数: 23

ScalaIOExtrap: Elastic I/O Tracing and Extrapolation ScalaIOExtrap:弹性I/O跟踪和外推

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.45

Xiaoqing Luo, F. Mueller, P. Carns, John Jenkins, R. Latham, R. Ross, S. Snyder

引用次数: 10

General Purpose Task-Dependence Management Hardware for Task-Based Dataflow Programming Models 基于任务的数据流编程模型的通用任务依赖管理硬件

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.48

Xubin Tan, Jaume Bosch, Miquel Vidal Piñol, C. Álvarez, Daniel Jiménez-González, E. Ayguadé, M. Valero

{"title":"General Purpose Task-Dependence Management Hardware for Task-Based Dataflow Programming Models","authors":"Xubin Tan, Jaume Bosch, Miquel Vidal Piñol, C. Álvarez, Daniel Jiménez-González, E. Ayguadé, M. Valero","doi":"10.1109/IPDPS.2017.48","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.48","url":null,"abstract":"Task-based programming models such as OpenMP, IntelTBB and OmpSs offer the possibility of expressing dependences among tasks to drive their execution at runtime. Managing these dependences introduces noticeable overheads when targeting fine-grained tasks, diminishing the potential speedups or even introducing performance losses. To overcome this drawback, we present a general purpose hardware accelerator, Picos++, to manage the inter-task dependences efficiently in both time and energy. Our design also includes a novel nested task support. To this end, a new hardware/software co-design is presented to overcome the fact that nested tasks with dependences could result in system deadlocks due to the limited amount of resources in hardware task dependence managers. In this paper we describe a detailed implementation of this design and evaluate a parallel task-based programming model using Picos++ in a Linux embedded system with two ARM Cortex-A9 and a FPGA. The scalability and energy consumption of the real system implemented have been studied and compared against a software runtime. Even in a system limited to 2 threads, using Picos++ results in more than 1.8x speedup and 40% of energy savings in the most demanding parallelizations of real benchmarks. As a matter of fact, a hardware task dependence manager should be able to achieve much higher speedup and provide more energy savings with more threads.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132565940","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 12

Image-Domain Gridding on Graphics Processors 图形处理器上的图像域网格划分

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.68

B. Veenboer, M. Petschow, J. Romein

{"title":"Image-Domain Gridding on Graphics Processors","authors":"B. Veenboer, M. Petschow, J. Romein","doi":"10.1109/IPDPS.2017.68","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.68","url":null,"abstract":"Realizing the next generation of radio telescopes such as the Square Kilometre Array (SKA) requires both more efficient hardware and algorithms than today's technology provides. The recently introduced image-domain gridding (IDG) algorithm is a novel approach towards solving the most compute-intensive parts of creating sky images: gridding and degridding. It avoids the performance bottlenecks of traditional AW-projection gridding by applying instrumental and environmental corrections in the image domain instead of in the Fourier domain. In this paper, we present the first implementations of this new algorithm for CPUs and Graphics Processing Units (GPUs). A thorough performance analysis, in which we apply a modified roofline analysis, shows that our parallelization approaches and optimizations lead to nearly optimal performance on these architectures. The analysis also indicates that, by leveraging dedicated hardware to evaluate trigonometric functions, GPUs are both much faster and more energy efficient than regular CPUs. This makes IDG on GPUs a candidate for meeting the computational and energy efficiency constraints of future telescopes.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133156552","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 19

FlexVC: Flexible Virtual Channel Management in Low-Diameter Networks FlexVC:低直径网络中的灵活虚拟通道管理

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.110

Pablo Fuentes, E. Vallejo, R. Beivide, C. Minkenberg, M. Valero

{"title":"FlexVC: Flexible Virtual Channel Management in Low-Diameter Networks","authors":"Pablo Fuentes, E. Vallejo, R. Beivide, C. Minkenberg, M. Valero","doi":"10.1109/IPDPS.2017.110","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.110","url":null,"abstract":"Deadlock avoidance mechanisms for lossless lowdistance networks typically increase the order of virtual channel (VC) index with each hop. This restricts the number of buffer resources depending on the routing mechanism and limits performance due to an inefficient use. Dynamic buffer organizations increase implementation complexity and only provide small gains in this context because a significant amount of buffering needs to be allocated statically to avoid congestion. We introduce FlexVC, a simple buffer management mechanism which permits a more flexible use of VCs. It combines statically partitioned buffers, opportunistic routing and a relaxed distancebased deadlock avoidance policy. FlexVC mitigates Head-of-Line blocking and reduces up to 50% the memory requirements. Simulation results in a Dragonfly network show congestion reduction and up to 37.8% throughput improvement, outperforming more complex dynamic approaches. FlexVC merges different flows of traffic in the same buffers, which in some cases makes more difficult to identify the traffic pattern in order to support nonminimal adaptive routing. An alternative denoted FlexVCminCred improves congestion sensing for adaptive routing by tracking separately packets routed minimally and nonminimally, rising throughput up to 20.4% with 25% savings in buffer area.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121181802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 3

Eliminating Irregularities of Protein Sequence Search on Multicore Architectures 多核结构下蛋白质序列搜索的不规则性消除

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.120

Jing Zhang, Sanchit Misra, Hao Wang, Wu-chun Feng

{"title":"Eliminating Irregularities of Protein Sequence Search on Multicore Architectures","authors":"Jing Zhang, Sanchit Misra, Hao Wang, Wu-chun Feng","doi":"10.1109/IPDPS.2017.120","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.120","url":null,"abstract":"Finding regions of local similarity between biological sequences is a fundamental task in computational biology. BLAST is the most widely-used tool for this purpose, but it suffers from irregularities due to its heuristic nature. To achieve fast search, recent approaches construct the index from the database instead of the input query. However, database indexing introduces more challenges in the design of index structure and algorithm, especially for data access through the memory hierarchy on modern multicore processors. In this paper, based on existing heuristic algorithms, we design and develop a database indexed BLAST with the identical sensitivity as query indexed BLAST (i.e., NCBI-BLAST). Then, we identify that existing heuristic algorithms of BLAST can result in serious irregularities in database indexed search. To eliminate irregularities in BLAST algorithm, we propose muBLASTP, that uses multiple optimizations to improve data locality and parallel efficiency for multicore architectures and multi-node systems. Experiments on a single node demonstrate up to a 5.1-fold speedup over the multi-threaded NCBI BLAST. For the inter-node parallelism, we achieve nearly linear scaling on up to 128 nodes and gain up to 8.9-fold speedup over mpiBLAST.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123202715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 6

Respin: Rethinking Near-Threshold Multiprocessor Design with Non-volatile Memory 重新思考非易失性存储器的近阈值多处理器设计

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.109

Xiang Pan, Anys Bacha, R. Teodorescu

{"title":"Respin: Rethinking Near-Threshold Multiprocessor Design with Non-volatile Memory","authors":"Xiang Pan, Anys Bacha, R. Teodorescu","doi":"10.1109/IPDPS.2017.109","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.109","url":null,"abstract":"Near-threshold computing is emerging as a promising energy-efficient alternative for power-constrained environments. Unfortunately, aggressive reduction in supply voltage to the near-threshold range, albeit effective, faces a host of challenges. This includes higher relative leakage power and high error rates, particularly in dense SRAM structures such as on-chip caches. This paper presents an architecture that rethinks the cache hierarchy in near-threshold multiprocessors. Our design uses STT-RAM to implement all on-chip caches. STT-RAM has several advantages over SRAM at low voltages including low leakage, high density, and reliability. The design consolidates the private caches of near-threshold cores into shared L1 instruction/data caches organized in clusters. We find that our consolidated cache design can service more than 95% of incoming requests within a single cycle. We demonstrate that eliminating the coherence traffic associated with private caches results in a performance boost of 11%. In addition, we propose a hardware-based core management system that dynamically consolidates virtual cores into variable numbers of physical cores to increase resource efficiency. We demonstrate that this approach can save up to 33% in energy.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124655888","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 2

Parallelism and Garbage Collection Aware I/O Scheduler with Improved SSD Performance 提高SSD性能的并行性和垃圾收集感知I/O调度器

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.55

Jiayang Guo, Yimin Hu, Bo Mao, Suzhen Wu

引用次数: 23

Apollo: Reusable Models for Fast, Dynamic Tuning of Input-Dependent Code 阿波罗:用于快速动态调优输入依赖代码的可重用模型

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.38

D. Beckingsale, Olga Pearce, I. Laguna, T. Gamblin

{"title":"Apollo: Reusable Models for Fast, Dynamic Tuning of Input-Dependent Code","authors":"D. Beckingsale, Olga Pearce, I. Laguna, T. Gamblin","doi":"10.1109/IPDPS.2017.38","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.38","url":null,"abstract":"Increasing architectural diversity makes performance portability extremely important for parallel simulation codes. Emerging on-node parallelization frameworks such as Kokkos and RAJA decouple the work done in kernels from the parallelization mechanism, allowing for a single source kernel to be tuned for different architectures at compile time. However, computational demands in production applications change at runtime, and performance depends both on the architecture and the input problem, and tuning a kernel for one set of inputs may not improve its performance on another. The statically optimized versions need to be chosen dynamically to obtain the best performance. Existing auto-tuning approaches can handle slowly evolving applications effectively, but are too slow to tune highly input-dependent kernels. We developed Apollo, an auto-tuning extension for RAJA that uses pre-trained, reusable models to tune input-dependent code at runtime. Apollo is designed for highly dynamic applications; it generates sufficiently low-overhead code to tune parameters each time a kernel runs, making fast decisions. We apply Apollo to two hydrodynamics benchmarks and to a production multi-physics code, and show that it can achieve speedups from 1.2x to 4.8x.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125844252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 19