Title: Code Layout Optimization for Defensiveness and Politeness in Shared Cache
Authors: Pengcheng Li, Hao Luo, C. Ding, Ziang Hu, Handong Ye
DOI: 10.1109/ICPP.2014.24
Venue: 2014 43rd International Conference on Parallel Processing, 2014-10-18
Abstract: Code layout optimization seeks to reorganize the instructions of a program to better utilize the cache. On multicore processors, parallel execution improves throughput but may significantly increase cache contention, because co-running programs share the cache and, in the case of hyper-threading, the instruction cache. In this paper, we extend the reference affinity model for use in whole-program code layout optimization. We also implement the temporal relation graph (TRG) model used in prior work for comparison. For code reorganization, we have developed both function reordering and inter-procedural basic-block reordering. We implement the two models and the two transformations in the LLVM compiler. Experimental results on a set of benchmarks frequently show a 20% to 50% reduction in instruction cache misses. By better utilizing the shared cache, the new techniques magnify the throughput improvement of hyper-threading by 8%.
Title: Adaptive Configuration Selection for Power-Constrained Heterogeneous Systems
Authors: Peter E. Bailey, D. Lowenthal, Vignesh T. Ravi, B. Rountree, M. Schulz, B. Supinski
DOI: 10.1109/ICPP.2014.46
Venue: 2014 43rd International Conference on Parallel Processing, 2014-10-18
Abstract: As power becomes an increasingly important design factor in high-end supercomputers, future systems will likely operate with power limitations significantly below their peak power specifications. These limitations will be enforced through a combination of software and hardware power policies, which will filter down from the system level to individual nodes. Hardware is already moving in this direction by providing power-capping interfaces to the user. The power/performance trade-off at the node level is critical in maximizing the performance of power-constrained cluster systems, but is also complex because of the many interacting architectural features and accelerators that comprise the hardware configuration of a node. The key to solving this challenge is an accurate power/performance model that will aid in selecting the right configuration from a large set of available configurations. In this paper, we present a novel approach to generate such a model offline using kernel clustering and multivariate linear regression. Our model requires only two iterations to select a configuration, which provides a significant advantage over exhaustive search-based strategies. We apply our model to predict power and performance for different applications using arbitrary configurations, and show that our model, when used with hardware frequency-limiting, selects configurations with significantly higher performance at a given power limit than those chosen by frequency-limiting alone. When applied to a set of 36 computational kernels from a range of applications, our model accurately predicts power and performance: it maintains 91% of optimal performance while meeting power constraints 88% of the time. When the model violates a power constraint, it exceeds the constraint by only 6% on average, while simultaneously achieving 54% more performance than an oracle.
Title: HAND: A Hybrid Approach to Accelerate Non-contiguous Data Movement Using MPI Datatypes on GPU Clusters
Authors: Rong Shi, Xiaoyi Lu, S. Potluri, Khaled Hamidouche, Jie Zhang, D. Panda
DOI: 10.1109/ICPP.2014.31
Venue: 2014 43rd International Conference on Parallel Processing, 2014-10-18
Abstract: An increasing number of MPI applications are being ported to take advantage of the compute power offered by GPUs. Data movement continues to be the major bottleneck on GPU clusters, more so when data is non-contiguous, which is common in scientific applications. The existing techniques of optimizing MPI datatype processing, to improve performance of non-contiguous data movement, handle only certain data patterns efficiently while incurring overheads for the others. In this paper, we first propose a set of optimized techniques to handle different MPI datatypes. Next, we propose a novel framework (HAND) that enables hybrid and adaptive selection among different techniques and tuning to achieve better performance with all datatypes. Our experimental results using the modified DDTBench suite demonstrate up to a 98% reduction in datatype latency. We also apply this datatype-aware design on an N-Body particle simulation application. Performance evaluation of this application on a 64-GPU cluster shows that our proposed approach can achieve up to 80% and 54% increase in performance by using struct and indexed datatypes compared to the existing best design. To the best of our knowledge, this is the first attempt to propose a hybrid and adaptive solution that integrates all existing schemes to optimize arbitrary non-contiguous data movement using MPI datatypes on GPU clusters.
{"title":"Xentry: Hypervisor-Level Soft Error Detection","authors":"Xin Xu, R. C. Chiang, H. H. Huang","doi":"10.1109/ICPP.2014.43","DOIUrl":"https://doi.org/10.1109/ICPP.2014.43","url":null,"abstract":"Cloud data centers leverage virtualization to share commodity hardware resources, where virtual machines (VMs) achieve fault isolation by containing VM failures within the virtualization boundary. However, hypervisor failure induced by soft errors will most likely affect multiple, if not all, VMs on a single physical host. Existing fault detection techniques are not well equipped to handle such hypervisor failures. In this paper, we propose a new soft error detection framework, Xentry (a sentry on soft error for Xen), that focuses on limiting error propagation within and from the hypervisor. In particular, we have designed a VM transition detection technique to identify incorrect control flow before VM execution resumes, and a runtime detection technique to shorten detection latency. This framework requires no hardware modification and has been implemented in the Xen hypervisor. The experiment results show that Xentry incurs very small performance overhead and detects over 99% of the injected faults.","PeriodicalId":441115,"journal":{"name":"2014 43rd International Conference on Parallel Processing","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125916846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Reducing MapReduce Abstraction Costs for Text-centric Applications
Authors: Chun-Hung Hsiao, Michael J. Cafarella, S. Narayanasamy
DOI: 10.1109/ICPP.2014.13
Venue: 2014 43rd International Conference on Parallel Processing, 2014-10-18
Abstract: The MapReduce framework has become widely popular for programming large clusters, even though MapReduce jobs may use underlying resources relatively inefficiently. There has been substantial research in improving MapReduce performance for applications that were inspired by relational database queries, but almost none for text-centric applications, including inverted index construction, processing large log files, and so on. We identify two simple optimizations to improve MapReduce performance on text-centric tasks: frequency-buffering and spill-matcher. The former approach improves buffer efficiency for intermediate map outputs by identifying frequent keys, effectively shrinking the amount of work that the shuffle phase must perform. Spill-matcher is a runtime controller that improves parallelization of MapReduce framework background tasks. Together, our two optimizations improve the performance of text-centric applications by up to 39.1%. We demonstrate gains on both a small local cluster and Amazon's EC2 cloud service. Unlike other MapReduce optimizations, these techniques require no user code changes, and only small changes to the MapReduce system.
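The frequency-buffering idea described above can be sketched roughly as follows. This is an illustrative reading of the abstract, not the paper's implementation: the sampling pass, `top_k` cutoff, and all names are assumptions.

```python
from collections import Counter, defaultdict

def frequency_buffered_map(records, map_fn, sample_size=1000, top_k=100):
    """Map-side pre-aggregation for frequent keys (illustrative sketch).

    A small sample of map output identifies the hottest keys; values for
    those keys are combined in memory instead of being emitted
    individually, shrinking the data the shuffle phase must move.
    """
    records = list(records)
    # Pass 1: sample map output to estimate which keys are frequent.
    freq = Counter(k for rec in records[:sample_size] for k, _ in map_fn(rec))
    hot = {k for k, _ in freq.most_common(top_k)}

    buffered = defaultdict(int)   # in-memory combine for hot keys
    passthrough = []              # everything else goes straight to the shuffle
    for rec in records:
        for k, v in map_fn(rec):
            if k in hot:
                buffered[k] += v
            else:
                passthrough.append((k, v))
    return list(buffered.items()) + passthrough

# Word-count example: frequent words are combined before the shuffle.
lines = ["a b a", "a c b", "a a c"]
out = frequency_buffered_map(lines, lambda line: [(w, 1) for w in line.split()])
```

With a small `top_k` on real text, only genuinely hot keys (stop words, common tokens) are buffered, bounding the memory the combiner needs.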
Title: Hydra: Efficient Detection of Multiple Concurrency Bugs on Fused CPU-GPU Architecture
Authors: Zhuofang Dai, Haojun Wang, Weihua Zhang, Haibo Chen, B. Zang
DOI: 10.1109/ICPP.2014.42
Venue: 2014 43rd International Conference on Parallel Processing, 2014-10-18
Abstract: Detecting concurrency bugs, such as data races, atomicity violations, and order violations, is a cumbersome task for programmers. This situation is further exacerbated by the increasing number of cores in a single machine and the prevalence of threaded programming models. Unfortunately, many existing software-based approaches incur high runtime overhead or accuracy loss, while most hardware-based proposals focus on a specific type of bug and are thus inflexible for detecting a variety of concurrency bugs. In this paper, we propose Hydra, an approach that leverages the massive parallelism and programmability of a fused GPU architecture to simultaneously detect multiple types of concurrency bugs, including data races, atomicity violations, and order violations. Hydra instruments and collects program behavior on the CPU and transfers the traces to the GPU for bug detection through the on-chip interconnect. Furthermore, to achieve high speed, Hydra exploits a bloom filter to discard unnecessary detection traces. Hydra adds little hardware complexity, requires no changes to internal critical-path processor components such as the cache and its coherence protocol, and incurs only about 1.1% hardware overhead under a 32-core configuration. Experimental results show that Hydra introduces only about 0.35% overhead on average for detecting one type of bug and 0.92% overhead for simultaneously detecting multiple types, while matching the detection capability of a heavyweight software bug detector (e.g., Helgrind).
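The abstract names a bloom filter as Hydra's mechanism for discarding redundant detection traces. A minimal bloom filter used that way might look like the sketch below; the sizes, hash scheme, and trace format are illustrative assumptions, not Hydra's hardware design.

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter, as one might use to drop already-seen
    (address, thread) trace entries before shipping them to a detector.
    May report false positives, never false negatives."""
    def __init__(self, size=1 << 16, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = bytearray(size // 8)

    def _positions(self, item):
        # Derive `hashes` bit positions from salted SHA-256 digests.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

# Forward a trace record only the first time its signature is seen.
seen = BloomFilter()
traces = [("0xbeef", 1), ("0xbeef", 1), ("0xcafe", 2)]
forwarded = [t for t in traces
             if not seen.might_contain(t) and (seen.add(t) or True)]
```

Because lookups can yield false positives, this filter can only suppress traces that are safe to drop (e.g., repeated accesses already checked), which is why it trades a small detection risk for bandwidth.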
{"title":"SMARTH: Enabling Multi-pipeline Data Transfer in HDFS","authors":"Hong Zhang, Liqiang Wang, Hai Huang","doi":"10.1109/ICPP.2014.12","DOIUrl":"https://doi.org/10.1109/ICPP.2014.12","url":null,"abstract":"Hadoop is a popular open-source implementation of the MapReduce programming model to handle large data sets, and HDFS is one of Hadoop's most commonly used distributed file systems. Surprisingly, we found that HDFS is inefficient when handling upload of data files from client local file system, especially when the storage cluster is configured to use replicas. The root cause is HDFS's synchronous pipeline design. In this paper, we introduce an improved HDFS design called SMARTH. It utilizes asynchronous multi-pipeline data transfers instead of a single pipeline stop-and-wait mechanism. SMARTH records the actual transfer speed of data blocks and sends this information to the namenode along with periodic heartbeat messages. The namenode sorts datanodes according to their past performance and tracks this information continuously. When a client initiates an upload request, the namenode will send it a list of \"high performance\" datanodes that it thinks will yield the highest throughput for the client. By choosing higher performance datanodes relative to each client and by taking advantage of the multi-pipeline design, our experiments show that SMARTH significantly improves the performance of data write operations compared to HDFS. 
Specifically, SMARTH is able to improve the throughput of data transfer by 27-245% in a heterogeneous virtual cluster on Amazon EC2.","PeriodicalId":441115,"journal":{"name":"2014 43rd International Conference on Parallel Processing","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129788848","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
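The namenode-side bookkeeping the abstract describes — track reported block-transfer speeds from heartbeats, rank datanodes, hand the fastest to the client — can be sketched as below. The exponential smoothing, class, and field names are illustrative assumptions, not HDFS or SMARTH internals.

```python
class NamenodeTracker:
    """Sketch of SMARTH's ranking idea: keep a smoothed estimate of each
    datanode's reported block-transfer speed and return the fastest
    nodes first when a client requests an upload pipeline."""
    def __init__(self, alpha=0.3):
        self.alpha = alpha        # smoothing factor (assumed, not from the paper)
        self.speed = {}           # datanode -> smoothed MB/s

    def heartbeat(self, datanode, measured_mb_s):
        # Fold the newly reported speed into the running estimate.
        prev = self.speed.get(datanode, measured_mb_s)
        self.speed[datanode] = (1 - self.alpha) * prev + self.alpha * measured_mb_s

    def pick_pipeline(self, replicas=3):
        # Rank datanodes by estimated speed, fastest first.
        ranked = sorted(self.speed, key=self.speed.get, reverse=True)
        return ranked[:replicas]

nn = NamenodeTracker()
for node, mbps in [("dn1", 40), ("dn2", 120), ("dn3", 80), ("dn2", 110)]:
    nn.heartbeat(node, mbps)
pipeline = nn.pick_pipeline(2)
```

Smoothing rather than using the last raw sample keeps one slow transfer from demoting an otherwise fast datanode.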
{"title":"MiCA: Real-Time Mixed Compression Scheme for Large-Scale Distributed Monitoring","authors":"Bo Wang, Ying Song, Yuzhong Sun, Jun Liu","doi":"10.1109/ICPP.2014.53","DOIUrl":"https://doi.org/10.1109/ICPP.2014.53","url":null,"abstract":"Real-time monitoring, providing the real-time status information of servers, is indispensable for the management of distributed systems, e.g. failure detection and resource scheduling. The scalability of fine-grained monitoring faces more and more severe challenges with scaling up distributed systems. The real-time compression which suppresses remote information update to reduce continuous monitoring cost is a promising approach to address the scalability problem. In this paper, we present the Linear Compression Algorithm (LCA) which is the application of the linear filter to real-time monitoring. To our best knowledge, existing work and LCA only explores the correlations of values of each single metric at various times. We present a novel lightweight REal-time Compression Algorithm (ReCA) which employs discovery methods of the correlation among metrics to suppress remote information update in distributed monitoring. The compression algorithms mentioned above have limited compression power because they only explore either the correlations of values of each single metric at various times or that among metrics. Therefore, we propose the Mixed Compression Algorithm (MiCA) which explores both of the correlations to achieve higher compression ratio. We implement our algorithms and an existing compression algorithm denoted by CCA in a distributed monitoring system Ganglia and conduct extensive experiments. 
The experimental results show that LCA and ReCA have comparable compression ratios with CCA, that MiCA achieves up to 38.2%, 27% and 44.5% higher compression ratios than CCA, LCA and ReCA with negligible overhead, respectively, and that LCA, and ReCA can both increase the scalability of Ganglia about 1.5 times and MiCA can increase about 2.33 times under a mixed-load circumstance.","PeriodicalId":441115,"journal":{"name":"2014 43rd International Conference on Parallel Processing","volume":"111 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114487806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
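A linear filter for suppressing monitoring updates, as LCA is described, commonly works by having the agent and the collector share a linear extrapolation of the last transmitted points, with the agent sending a new sample only when the true value drifts beyond a tolerance. The sketch below is one such reading of the abstract, not the paper's algorithm; the tolerance and names are assumptions.

```python
def linear_filter_stream(samples, eps=1.0):
    """Transmit a (time, value) point only when it deviates more than
    `eps` from the linear extrapolation of the last two sent points.
    The collector reconstructs intermediate values by the same line."""
    sent = []
    for t, v in enumerate(samples):
        if len(sent) < 2:
            sent.append((t, v))          # need two points to extrapolate
            continue
        (t0, v0), (t1, v1) = sent[-2], sent[-1]
        predicted = v1 + (v1 - v0) * (t - t1) / (t1 - t0)
        if abs(v - predicted) > eps:
            sent.append((t, v))          # trend broke: transmit an update
    return sent

# A steadily climbing CPU metric needs almost no updates; a jump does.
cpu = [10, 12, 14, 16, 18, 40, 42, 44]
updates = linear_filter_stream(cpu, eps=1.0)
```

Here only 4 of 8 samples are transmitted: the steady ramp is predicted exactly, and updates are sent only around the jump, which is the compression effect the paper measures at scale.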
Title: Improving Multisite Workflow Performance Using Model-Based Scheduling
Authors: K. Maheshwari, Eun-Sung Jung, Jiayuan Meng, V. Vishwanath, R. Kettimuthu
DOI: 10.1109/ICPP.2014.22
Venue: 2014 43rd International Conference on Parallel Processing, 2014-10-18
Abstract: Workflows play an important role in expressing and executing scientific applications. In recent years, a variety of computational sites and resources have emerged, and users often have access to multiple resources that are geographically distributed. These computational sites are heterogeneous in nature, and the performance of different tasks in a workflow varies from one site to another. Additionally, users typically have a limited resource allocation at each site. In such cases, a judicious scheduling strategy is required to map tasks in the workflow to resources so that the workload is balanced among sites and data transfer overhead is minimized. Most existing systems either run the entire workflow at a single site, use naive approaches to distribute the tasks across sites, or leave it to the user to optimize the allocation of tasks to distributed resources. This results in a significant loss of productivity for a scientist. In this paper, we propose a multisite workflow scheduling technique that uses performance models to predict the execution time on different resources and dynamic probes to identify the achievable network throughput between sites. We evaluate our approach using real-world applications in a distributed environment using the Swift distributed execution framework and show that our approach improves the execution time by up to 60% compared to the default schedule.
Title: TRAM: Optimizing Fine-Grained Communication with Topological Routing and Aggregation of Messages
Authors: Lukasz Wesolowski, Ramprasad Venkataraman, Abhishek K. Gupta, Jae-Seung Yeom, K. Bisset, Yanhua Sun, Pritish Jetley, T. Quinn, L. Kalé
DOI: 10.1109/ICPP.2014.30
Venue: 2014 43rd International Conference on Parallel Processing, 2014-10-18
Abstract: Fine-grained communication in supercomputing applications often limits performance through high communication overhead and poor utilization of network bandwidth. This paper presents the Topological Routing and Aggregation Module (TRAM), a library that optimizes fine-grained communication performance by routing and dynamically combining short messages. TRAM collects units of fine-grained communication from the application and combines them into aggregated messages with a common intermediate destination. It routes these messages along a virtual mesh topology mapped onto the physical topology of the network. TRAM improves network bandwidth utilization and reduces communication overhead. It is particularly effective in optimizing patterns with global communication and large message counts, such as all-to-all and many-to-many, as well as sparse, irregular, dynamic, or data-dependent patterns. We demonstrate how TRAM improves performance through theoretical analysis and experimental verification using benchmarks and scientific applications. We present speedups on petascale systems of 6x for communication benchmarks and up to 4x for applications.
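The core aggregation step the TRAM abstract describes — buffer fine-grained items per intermediate destination and send one combined message when a buffer fills — can be sketched as follows. Routing over the virtual mesh and the real TRAM (Charm++) API are beyond this illustration; the buffer size and names are assumptions.

```python
from collections import defaultdict

class Aggregator:
    """Sketch of TRAM-style aggregation along one mesh dimension:
    fine-grained items headed for the same intermediate destination
    are buffered together and sent as one combined message."""
    def __init__(self, buffer_size=4):
        self.buffer_size = buffer_size
        self.buffers = defaultdict(list)
        self.sent_messages = []   # each entry models one network message

    def submit(self, intermediate_dest, item):
        buf = self.buffers[intermediate_dest]
        buf.append(item)
        if len(buf) >= self.buffer_size:   # buffer full: combine and send
            self.flush(intermediate_dest)

    def flush(self, dest):
        if self.buffers[dest]:
            self.sent_messages.append((dest, list(self.buffers[dest])))
            self.buffers[dest].clear()

agg = Aggregator(buffer_size=4)
for i in range(10):
    agg.submit(i % 2, i)          # items alternate between two destinations
for d in list(agg.buffers):       # drain partial buffers at the end
    agg.flush(d)
```

Ten fine-grained sends collapse into four combined messages here; at scale this is what amortizes per-message overhead and improves bandwidth utilization.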