Title: Code Layout Optimization for Defensiveness and Politeness in Shared Cache
Authors: Pengcheng Li, Hao Luo, C. Ding, Ziang Hu, Handong Ye
DOI: 10.1109/ICPP.2014.24
Venue: 2014 43rd International Conference on Parallel Processing, 2014-10-18
Abstract: Code layout optimization seeks to reorganize the instructions of a program to better utilize the cache. On multicore processors, parallel execution improves throughput but may significantly increase cache contention, because co-running programs share the cache and, in the case of hyper-threading, the instruction cache. In this paper, we extend the reference affinity model for use in whole-program code layout optimization. We also implement the temporal relation graph (TRG) model used in prior work for comparison. For code reorganization, we have developed both function reordering and inter-procedural basic-block reordering. We implement the two models and the two transformations in the LLVM compiler. Experimental results on a set of benchmarks frequently show a 20% to 50% reduction in instruction cache misses. By better utilizing the shared cache, the new techniques magnify the throughput improvement of hyper-threading by 8%.
Title: Adaptive Configuration Selection for Power-Constrained Heterogeneous Systems
Authors: Peter E. Bailey, D. Lowenthal, Vignesh T. Ravi, B. Rountree, M. Schulz, B. Supinski
DOI: 10.1109/ICPP.2014.46
Venue: 2014 43rd International Conference on Parallel Processing, 2014-10-18
Abstract: As power becomes an increasingly important design factor in high-end supercomputers, future systems will likely operate with power limitations significantly below their peak power specifications. These limitations will be enforced through a combination of software and hardware power policies, which will filter down from the system level to individual nodes. Hardware is already moving in this direction by providing power-capping interfaces to the user. The power/performance trade-off at the node level is critical in maximizing the performance of power-constrained cluster systems, but is also complex because of the many interacting architectural features and accelerators that comprise the hardware configuration of a node. The key to solving this challenge is an accurate power/performance model that will aid in selecting the right configuration from a large set of available configurations. In this paper, we present a novel approach to generate such a model offline using kernel clustering and multivariate linear regression. Our model requires only two iterations to select a configuration, which provides a significant advantage over exhaustive search-based strategies. We apply our model to predict power and performance for different applications using arbitrary configurations, and show that our model, when used with hardware frequency-limiting, selects configurations with significantly higher performance at a given power limit than those chosen by frequency-limiting alone. When applied to a set of 36 computational kernels from a range of applications, our model accurately predicts power and performance: it maintains 91% of optimal performance while meeting power constraints 88% of the time. When the model violates a power constraint, it exceeds the constraint by only 6% on average, while simultaneously achieving 54% more performance than an oracle.
Title: HAND: A Hybrid Approach to Accelerate Non-contiguous Data Movement Using MPI Datatypes on GPU Clusters
Authors: Rong Shi, Xiaoyi Lu, S. Potluri, Khaled Hamidouche, Jie Zhang, D. Panda
DOI: 10.1109/ICPP.2014.31
Venue: 2014 43rd International Conference on Parallel Processing, 2014-10-18
Abstract: An increasing number of MPI applications are being ported to take advantage of the compute power offered by GPUs. Data movement continues to be the major bottleneck on GPU clusters, more so when data is non-contiguous, which is common in scientific applications. The existing techniques of optimizing MPI datatype processing, to improve performance of non-contiguous data movement, handle only certain data patterns efficiently while incurring overheads for the others. In this paper, we first propose a set of optimized techniques to handle different MPI datatypes. Next, we propose a novel framework (HAND) that enables hybrid and adaptive selection among different techniques and tuning to achieve better performance with all datatypes. Our experimental results using the modified DDTBench suite demonstrate up to a 98% reduction in datatype latency. We also apply this datatype-aware design on an N-Body particle simulation application. Performance evaluation of this application on a 64-GPU cluster shows that our proposed approach can achieve up to 80% and 54% increase in performance by using struct and indexed datatypes compared to the existing best design. To the best of our knowledge, this is the first attempt to propose a hybrid and adaptive solution that integrates all existing schemes to optimize arbitrary non-contiguous data movement using MPI datatypes on GPU clusters.
{"title":"Xentry: Hypervisor-Level Soft Error Detection","authors":"Xin Xu, R. C. Chiang, H. H. Huang","doi":"10.1109/ICPP.2014.43","DOIUrl":"https://doi.org/10.1109/ICPP.2014.43","url":null,"abstract":"Cloud data centers leverage virtualization to share commodity hardware resources, where virtual machines (VMs) achieve fault isolation by containing VM failures within the virtualization boundary. However, hypervisor failure induced by soft errors will most likely affect multiple, if not all, VMs on a single physical host. Existing fault detection techniques are not well equipped to handle such hypervisor failures. In this paper, we propose a new soft error detection framework, Xentry (a sentry on soft error for Xen), that focuses on limiting error propagation within and from the hypervisor. In particular, we have designed a VM transition detection technique to identify incorrect control flow before VM execution resumes, and a runtime detection technique to shorten detection latency. This framework requires no hardware modification and has been implemented in the Xen hypervisor. The experiment results show that Xentry incurs very small performance overhead and detects over 99% of the injected faults.","PeriodicalId":441115,"journal":{"name":"2014 43rd International Conference on Parallel Processing","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125916846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Reducing MapReduce Abstraction Costs for Text-centric Applications
Authors: Chun-Hung Hsiao, Michael J. Cafarella, S. Narayanasamy
DOI: 10.1109/ICPP.2014.13
Venue: 2014 43rd International Conference on Parallel Processing, 2014-10-18
Abstract: The MapReduce framework has become widely popular for programming large clusters, even though MapReduce jobs may use underlying resources relatively inefficiently. There has been substantial research in improving MapReduce performance for applications that were inspired by relational database queries, but almost none for text-centric applications, including inverted index construction, processing large log files, and so on. We identify two simple optimizations to improve MapReduce performance on text-centric tasks: frequency-buffering and spill-matcher. The former approach improves buffer efficiency for intermediate map outputs by identifying frequent keys, effectively shrinking the amount of work that the shuffle phase must perform. Spill-matcher is a runtime controller that improves parallelization of MapReduce framework background tasks. Together, our two optimizations improve the performance of text-centric applications by up to 39.1%. We demonstrate gains on both a small local cluster and Amazon's EC2 cloud service. Unlike other MapReduce optimizations, these techniques require no user code changes, and only small changes to the MapReduce system.
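The frequency-buffering idea described above can be sketched roughly as follows. This is an illustrative reading of the abstract, not the paper's implementation: the sampling pass, `top_k` cutoff, and all names are assumptions.

```python
from collections import Counter, defaultdict

def frequency_buffered_map(records, map_fn, sample_size=1000, top_k=100):
    """Map-side pre-aggregation for frequent keys (illustrative sketch).

    A small sample of map output identifies the hottest keys; values for
    those keys are combined in memory instead of being emitted
    individually, shrinking the data the shuffle phase must move.
    """
    records = list(records)
    # Pass 1: sample map output to estimate which keys are frequent.
    freq = Counter(k for rec in records[:sample_size] for k, _ in map_fn(rec))
    hot = {k for k, _ in freq.most_common(top_k)}

    buffered = defaultdict(int)   # in-memory combine for hot keys
    passthrough = []              # everything else goes straight to the shuffle
    for rec in records:
        for k, v in map_fn(rec):
            if k in hot:
                buffered[k] += v
            else:
                passthrough.append((k, v))
    return list(buffered.items()) + passthrough

# Word-count example: frequent words are combined before the shuffle.
lines = ["a b a", "a c b", "a a c"]
out = frequency_buffered_map(lines, lambda line: [(w, 1) for w in line.split()])
```

With a small `top_k` on real text, only genuinely hot keys (stop words, common tokens) are buffered, bounding the memory the combiner needs.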
Title: Hydra: Efficient Detection of Multiple Concurrency Bugs on Fused CPU-GPU Architecture
Authors: Zhuofang Dai, Haojun Wang, Weihua Zhang, Haibo Chen, B. Zang
DOI: 10.1109/ICPP.2014.42
Venue: 2014 43rd International Conference on Parallel Processing, 2014-10-18
Abstract: Detecting concurrency bugs, such as data races, atomicity violations, and order violations, is a cumbersome task for programmers. This situation is further exacerbated by the increasing number of cores in a single machine and the prevalence of threaded programming models. Unfortunately, many existing software-based approaches incur high runtime overhead or accuracy loss, while most hardware-based proposals focus on a specific type of bug and are thus inflexible for detecting a variety of concurrency bugs. In this paper, we propose Hydra, an approach that leverages the massive parallelism and programmability of a fused GPU architecture to simultaneously detect multiple types of concurrency bugs, including data races, atomicity violations, and order violations. Hydra instruments and collects program behavior on the CPU and transfers the traces to the GPU for bug detection through the on-chip interconnect. Furthermore, to achieve high speed, Hydra exploits a bloom filter to discard unnecessary detection traces. Hydra adds little hardware complexity, requires no changes to internal critical-path processor components such as the cache and its coherence protocol, and incurs only about 1.1% hardware overhead under a 32-core configuration. Experimental results show that Hydra introduces only about 0.35% overhead on average for detecting one type of bug and 0.92% overhead for simultaneously detecting multiple types, while matching the detection capability of a heavyweight software bug detector (e.g., Helgrind).
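The abstract names a bloom filter as Hydra's mechanism for discarding redundant detection traces. A minimal bloom filter used that way might look like the sketch below; the sizes, hash scheme, and trace format are illustrative assumptions, not Hydra's hardware design.

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter, as one might use to drop already-seen
    (address, thread) trace entries before shipping them to a detector.
    May report false positives, never false negatives."""
    def __init__(self, size=1 << 16, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = bytearray(size // 8)

    def _positions(self, item):
        # Derive `hashes` bit positions from salted SHA-256 digests.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

# Forward a trace record only the first time its signature is seen.
seen = BloomFilter()
traces = [("0xbeef", 1), ("0xbeef", 1), ("0xcafe", 2)]
forwarded = [t for t in traces
             if not seen.might_contain(t) and (seen.add(t) or True)]
```

Because lookups can yield false positives, this filter can only suppress traces that are safe to drop (e.g., repeated accesses already checked), which is why it trades a small detection risk for bandwidth.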
{"title":"SMARTH: Enabling Multi-pipeline Data Transfer in HDFS","authors":"Hong Zhang, Liqiang Wang, Hai Huang","doi":"10.1109/ICPP.2014.12","DOIUrl":"https://doi.org/10.1109/ICPP.2014.12","url":null,"abstract":"Hadoop is a popular open-source implementation of the MapReduce programming model to handle large data sets, and HDFS is one of Hadoop's most commonly used distributed file systems. Surprisingly, we found that HDFS is inefficient when handling upload of data files from client local file system, especially when the storage cluster is configured to use replicas. The root cause is HDFS's synchronous pipeline design. In this paper, we introduce an improved HDFS design called SMARTH. It utilizes asynchronous multi-pipeline data transfers instead of a single pipeline stop-and-wait mechanism. SMARTH records the actual transfer speed of data blocks and sends this information to the namenode along with periodic heartbeat messages. The namenode sorts datanodes according to their past performance and tracks this information continuously. When a client initiates an upload request, the namenode will send it a list of \"high performance\" datanodes that it thinks will yield the highest throughput for the client. By choosing higher performance datanodes relative to each client and by taking advantage of the multi-pipeline design, our experiments show that SMARTH significantly improves the performance of data write operations compared to HDFS. 
Specifically, SMARTH is able to improve the throughput of data transfer by 27-245% in a heterogeneous virtual cluster on Amazon EC2.","PeriodicalId":441115,"journal":{"name":"2014 43rd International Conference on Parallel Processing","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129788848","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
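The namenode-side bookkeeping the abstract describes — track reported block-transfer speeds from heartbeats, rank datanodes, hand the fastest to the client — can be sketched as below. The exponential smoothing, class, and field names are illustrative assumptions, not HDFS or SMARTH internals.

```python
class NamenodeTracker:
    """Sketch of SMARTH's ranking idea: keep a smoothed estimate of each
    datanode's reported block-transfer speed and return the fastest
    nodes first when a client requests an upload pipeline."""
    def __init__(self, alpha=0.3):
        self.alpha = alpha        # smoothing factor (assumed, not from the paper)
        self.speed = {}           # datanode -> smoothed MB/s

    def heartbeat(self, datanode, measured_mb_s):
        # Fold the newly reported speed into the running estimate.
        prev = self.speed.get(datanode, measured_mb_s)
        self.speed[datanode] = (1 - self.alpha) * prev + self.alpha * measured_mb_s

    def pick_pipeline(self, replicas=3):
        # Rank datanodes by estimated speed, fastest first.
        ranked = sorted(self.speed, key=self.speed.get, reverse=True)
        return ranked[:replicas]

nn = NamenodeTracker()
for node, mbps in [("dn1", 40), ("dn2", 120), ("dn3", 80), ("dn2", 110)]:
    nn.heartbeat(node, mbps)
pipeline = nn.pick_pipeline(2)
```

Smoothing rather than using the last raw sample keeps one slow transfer from demoting an otherwise fast datanode.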
{"title":"MiCA: Real-Time Mixed Compression Scheme for Large-Scale Distributed Monitoring","authors":"Bo Wang, Ying Song, Yuzhong Sun, Jun Liu","doi":"10.1109/ICPP.2014.53","DOIUrl":"https://doi.org/10.1109/ICPP.2014.53","url":null,"abstract":"Real-time monitoring, providing the real-time status information of servers, is indispensable for the management of distributed systems, e.g. failure detection and resource scheduling. The scalability of fine-grained monitoring faces more and more severe challenges with scaling up distributed systems. The real-time compression which suppresses remote information update to reduce continuous monitoring cost is a promising approach to address the scalability problem. In this paper, we present the Linear Compression Algorithm (LCA) which is the application of the linear filter to real-time monitoring. To our best knowledge, existing work and LCA only explores the correlations of values of each single metric at various times. We present a novel lightweight REal-time Compression Algorithm (ReCA) which employs discovery methods of the correlation among metrics to suppress remote information update in distributed monitoring. The compression algorithms mentioned above have limited compression power because they only explore either the correlations of values of each single metric at various times or that among metrics. Therefore, we propose the Mixed Compression Algorithm (MiCA) which explores both of the correlations to achieve higher compression ratio. We implement our algorithms and an existing compression algorithm denoted by CCA in a distributed monitoring system Ganglia and conduct extensive experiments. 
The experimental results show that LCA and ReCA have comparable compression ratios with CCA, that MiCA achieves up to 38.2%, 27% and 44.5% higher compression ratios than CCA, LCA and ReCA with negligible overhead, respectively, and that LCA, and ReCA can both increase the scalability of Ganglia about 1.5 times and MiCA can increase about 2.33 times under a mixed-load circumstance.","PeriodicalId":441115,"journal":{"name":"2014 43rd International Conference on Parallel Processing","volume":"111 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114487806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
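A linear filter for suppressing monitoring updates, as LCA is described, commonly works by having the agent and the collector share a linear extrapolation of the last transmitted points, with the agent sending a new sample only when the true value drifts beyond a tolerance. The sketch below is one such reading of the abstract, not the paper's algorithm; the tolerance and names are assumptions.

```python
def linear_filter_stream(samples, eps=1.0):
    """Transmit a (time, value) point only when it deviates more than
    `eps` from the linear extrapolation of the last two sent points.
    The collector reconstructs intermediate values by the same line."""
    sent = []
    for t, v in enumerate(samples):
        if len(sent) < 2:
            sent.append((t, v))          # need two points to extrapolate
            continue
        (t0, v0), (t1, v1) = sent[-2], sent[-1]
        predicted = v1 + (v1 - v0) * (t - t1) / (t1 - t0)
        if abs(v - predicted) > eps:
            sent.append((t, v))          # trend broke: transmit an update
    return sent

# A steadily climbing CPU metric needs almost no updates; a jump does.
cpu = [10, 12, 14, 16, 18, 40, 42, 44]
updates = linear_filter_stream(cpu, eps=1.0)
```

Here only 4 of 8 samples are transmitted: the steady ramp is predicted exactly, and updates are sent only around the jump, which is the compression effect the paper measures at scale.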
Title: Improving Multisite Workflow Performance Using Model-Based Scheduling
Authors: K. Maheshwari, Eun-Sung Jung, Jiayuan Meng, V. Vishwanath, R. Kettimuthu
DOI: 10.1109/ICPP.2014.22
Venue: 2014 43rd International Conference on Parallel Processing, 2014-10-18
Abstract: Workflows play an important role in expressing and executing scientific applications. In recent years, a variety of computational sites and resources have emerged, and users often have access to multiple resources that are geographically distributed. These computational sites are heterogeneous in nature, and the performance of different tasks in a workflow varies from one site to another. Additionally, users typically have a limited resource allocation at each site. In such cases, a judicious scheduling strategy is required to map tasks in the workflow to resources so that the workload is balanced among sites and data transfer overhead is minimized. Most existing systems either run the entire workflow at a single site, use naive approaches to distribute the tasks across sites, or leave it to the user to optimize the allocation of tasks to distributed resources. This results in a significant loss of productivity for a scientist. In this paper, we propose a multisite workflow scheduling technique that uses performance models to predict the execution time on different resources and dynamic probes to identify the achievable network throughput between sites. We evaluate our approach using real-world applications in a distributed environment using the Swift distributed execution framework and show that our approach improves the execution time by up to 60% compared to the default schedule.
Title: TRAM: Optimizing Fine-Grained Communication with Topological Routing and Aggregation of Messages
Authors: Lukasz Wesolowski, Ramprasad Venkataraman, Abhishek K. Gupta, Jae-Seung Yeom, K. Bisset, Yanhua Sun, Pritish Jetley, T. Quinn, L. Kalé
DOI: 10.1109/ICPP.2014.30
Venue: 2014 43rd International Conference on Parallel Processing, 2014-10-18
Abstract: Fine-grained communication in supercomputing applications often limits performance through high communication overhead and poor utilization of network bandwidth. This paper presents the Topological Routing and Aggregation Module (TRAM), a library that optimizes fine-grained communication performance by routing and dynamically combining short messages. TRAM collects units of fine-grained communication from the application and combines them into aggregated messages with a common intermediate destination. It routes these messages along a virtual mesh topology mapped onto the physical topology of the network. TRAM improves network bandwidth utilization and reduces communication overhead. It is particularly effective in optimizing patterns with global communication and large message counts, such as all-to-all and many-to-many, as well as sparse, irregular, dynamic, or data-dependent patterns. We demonstrate how TRAM improves performance through theoretical analysis and experimental verification using benchmarks and scientific applications. We present speedups on petascale systems of 6x for communication benchmarks and up to 4x for applications.
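The core aggregation step the TRAM abstract describes — buffer fine-grained items per intermediate destination and send one combined message when a buffer fills — can be sketched as follows. Routing over the virtual mesh and the real TRAM (Charm++) API are beyond this illustration; the buffer size and names are assumptions.

```python
from collections import defaultdict

class Aggregator:
    """Sketch of TRAM-style aggregation along one mesh dimension:
    fine-grained items headed for the same intermediate destination
    are buffered together and sent as one combined message."""
    def __init__(self, buffer_size=4):
        self.buffer_size = buffer_size
        self.buffers = defaultdict(list)
        self.sent_messages = []   # each entry models one network message

    def submit(self, intermediate_dest, item):
        buf = self.buffers[intermediate_dest]
        buf.append(item)
        if len(buf) >= self.buffer_size:   # buffer full: combine and send
            self.flush(intermediate_dest)

    def flush(self, dest):
        if self.buffers[dest]:
            self.sent_messages.append((dest, list(self.buffers[dest])))
            self.buffers[dest].clear()

agg = Aggregator(buffer_size=4)
for i in range(10):
    agg.submit(i % 2, i)          # items alternate between two destinations
for d in list(agg.buffers):       # drain partial buffers at the end
    agg.flush(d)
```

Ten fine-grained sends collapse into four combined messages here; at scale this is what amortizes per-message overhead and improves bandwidth utilization.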