MIL: A language to build program analysis tools through static binary instrumentation
Andres Charif Rubial, Denis Barthou, Cédric Valensi, S. Shende, A. Malony, W. Jalby
20th Annual International Conference on High Performance Computing (HiPC 2013). DOI: 10.1109/HiPC.2013.6799106
Abstract: As software complexity increases, analyzing code behavior during execution becomes more important. Instrumentation techniques, which insert code directly into binaries, are essential for the program analyses used in debugging, runtime profiling, and performance evaluation. In the context of high-performance parallel applications, building an instrumentation framework is quite challenging. One difficulty is the need to capture both coarse-grain behavior, such as the execution time of different functions, and finer-grain actions, in order to pinpoint performance issues. In this paper, we propose MIL, a language for developing program analysis tools based on static binary instrumentation. The key feature of MIL is to ease the integration of static, global program analysis with instrumentation. We show how this enables both precise targeting of the code regions to analyze and a better understanding of the optimized program's behavior.
Transaction scheduling using conflict avoidance and Contention Intensity
M. Pereira, A. Baldassin, G. Araújo, L. E. Buzato
HiPC 2013. DOI: 10.1109/HiPC.2013.6799126
Abstract: In the last few years, Transactional Memories (TMs) have been shown to be a parallel programming model that effectively combines performance improvement with ease of programming. Moreover, the recent introduction of TM-based ISA extensions by major microprocessor manufacturers also seems to endorse TM as a programming model for today's parallel applications. One of the central issues in designing Software TM (STM) systems is identifying mechanisms and heuristics that minimize the contention arising from conflicting transactions. Although a number of mechanisms have been proposed to tackle contention, such techniques have limited scope, as conflict is avoided by either interrupting or serializing transaction execution, considerably impacting performance. To deal with this limitation, we previously proposed an effective, fully cooperative transaction scheduler, along with a conflict-avoidance heuristic, that replaces a conflicting transaction with another that has a lower conflict probability. This paper extends that framework and introduces a new heuristic, built by combining our previous conflict-avoidance technique with the Contention Intensity heuristic proposed by Yoo and Lee. Experimental results, obtained using the STMBench7 and STAMP benchmarks atop tinySTM, show that the proposed heuristic produces significant speedups compared to four other solutions.
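The Contention Intensity heuristic mentioned above can be sketched in a few lines: each thread keeps an exponential moving average of recent transaction outcomes and serializes when it rises past a threshold. This is a minimal illustrative sketch; the class name, `alpha`, and `threshold` values are assumptions for illustration, not values from the paper.

```python
# Hedged sketch of a Contention Intensity (CI) style heuristic: CI is an
# exponential moving average of transaction outcomes (1 = abort, 0 = commit).
# When CI exceeds a threshold, the scheduler serializes transactions instead
# of running them concurrently. alpha/threshold below are illustrative only.

class ContentionTracker:
    """Tracks per-thread contention via an exponential moving average."""

    def __init__(self, alpha=0.5, threshold=0.5):
        self.alpha = alpha          # weight given to past contention
        self.threshold = threshold  # above this, serialize transactions
        self.ci = 0.0               # current Contention Intensity

    def record(self, aborted):
        """Update CI after a commit (aborted=False) or an abort (aborted=True)."""
        outcome = 1.0 if aborted else 0.0
        self.ci = self.alpha * self.ci + (1.0 - self.alpha) * outcome

    def should_serialize(self):
        """High contention: hand the next transaction to a serial queue."""
        return self.ci > self.threshold
```

After repeated aborts CI climbs toward 1 and the tracker votes to serialize; commits pull it back toward 0, so concurrency resumes once contention subsides.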
Multi-tier energy buffering management for IDCs with heterogeneous energy storage devices
Z. Abbasi, Madhurima Pore, Ayan Banerjee, S. Gupta
HiPC 2013. DOI: 10.1109/HiPC.2013.6799104
Abstract: Energy buffering has been proposed to store renewable energy and low-cost electricity in Energy Storage Devices (ESDs) and use it judiciously to reduce the electricity bill of Internet data centers (IDCs). Recent research has considered long-term variation in electricity price, renewable power, and workload, and has shown the efficiency of energy buffering in reducing the electricity bill. However, these aspects of data centers exhibit both long- and short-term variation. Further, there is inherent heterogeneity in ESD physical characteristics (e.g., charging and discharging rates). We hypothesize that multi-tier energy buffering management can leverage the heterogeneity in ESD characteristics and better optimize the utilization of renewable energy and low-cost power in the presence of both short- and long-term variability in a data center. This paper presents an analytical study of a multi-tier workload and energy buffering management technique that frames each tier as an optimization problem and solves them online and proactively using Receding Horizon Control (RHC). Our study shows that multi-tier energy buffering management increases the utilization of renewables by up to two times compared to one-tier management.
LiPS: A cost-efficient data and task co-scheduler for MapReduce
M. Ehsan, Yao Chen, Hui Kang, R. Sion, Jennifer L. Wong
HiPC 2013. DOI: 10.1109/HiPC.2013.6799103
Abstract: We introduce LiPS, a new cost-efficient data and task co-scheduler for MapReduce in a cloud environment. By using linear programming to co-schedule data and tasks simultaneously, LiPS achieves a globally minimized dollar cost. We evaluated LiPS both analytically and on Amazon EC2 in order to measure actual dollar charges. The results were significant: LiPS saved 62-81% of the dollar costs compared with the Hadoop default scheduler and the delay scheduler, while also allowing users to fine-tune the cost-performance tradeoff.
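The core idea of co-scheduling data and tasks can be illustrated on a toy instance. LiPS solves this with a linear program; the sketch below instead brute-forces the same kind of objective, which is only feasible for tiny inputs. All names, node labels, and cost figures here are invented for illustration and do not come from the paper.

```python
# Toy illustration of joint data/task placement as dollar-cost minimization.
# For each data block we choose a node to store it and a node to run its task,
# paying a transfer cost when the task reads its block remotely. LiPS encodes
# this as a linear program; here we simply enumerate all placements.
from itertools import product

def co_schedule(blocks, nodes, storage_cost, compute_cost, transfer_cost):
    """Return (best_cost, placement) where placement[b] = (data_node, task_node).

    storage_cost[n] / compute_cost[n]: per-block dollar cost on node n.
    transfer_cost: extra dollars when a task reads its block from a remote node.
    """
    best_cost, best = float("inf"), None
    # Every block independently picks a (data node, task node) pair.
    for choice in product(product(nodes, repeat=2), repeat=len(blocks)):
        cost = 0.0
        for data_node, task_node in choice:
            cost += storage_cost[data_node] + compute_cost[task_node]
            if data_node != task_node:      # non-local read
                cost += transfer_cost
        if cost < best_cost:
            best_cost, best = cost, dict(zip(blocks, choice))
    return best_cost, best
```

With a high transfer cost, the minimizer co-locates each task with its block on the cheapest node, which is exactly the coupling between data placement and task placement that a joint formulation captures and per-task schedulers miss.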
A hybrid parallelization approach for high resolution operational flood forecasting
Swati Singhal, L. V. Real, Thomas George, Sandhya Aneja, Yogish Sabharwal
HiPC 2013. DOI: 10.1109/HiPC.2013.6799142
Abstract: Accurate and timely flood forecasts are becoming essential due to the increased incidence of flood-related disasters over the last few years. Such forecasts require a high-resolution integrated flood modeling approach. In this paper, we present an integrated flood forecasting system with an automated workflow spanning the weather modeling, surface runoff estimation, and water routing components. We primarily focus on the water routing process, the most compute-intensive phase, and present two parallelization strategies to scale it to large grid sizes. Specifically, we employ a nature-inspired decomposition of the simulation domain into watershed basins and propose a master-slave model of parallelization for distributed processing of the basins. We also propose an intra-basin shared-memory parallelization approach using OpenMP. Empirical evaluation of the proposed parallelization strategies indicates a potential for high speedups in certain scenarios (e.g., a speedup of 13× with 16 threads using OpenMP parallelization for the large Rio de Janeiro basin).
Speculative dynamic vectorization to assist static vectorization in a HW/SW co-designed environment
Rakesh Kumar, Alejandro Martínez, Antonio González
HiPC 2013. DOI: 10.1109/HiPC.2013.6799102
Abstract: Compiler-based static vectorization is widely used to extract data-level parallelism from computation-intensive applications. Static vectorization is very effective for traditional array-based applications. However, the compiler's inability to reorder ambiguous memory references severely limits vectorization opportunities, especially in pointer-rich applications. HW/SW co-designed processors provide an excellent opportunity to optimize applications at runtime: the availability of dynamic application behavior helps capture vectorization opportunities generally missed by compilers. This paper proposes to complement static vectorization with a speculative dynamic vectorizer in a HW/SW co-designed processor. We present a speculative dynamic vectorization algorithm that speculatively reorders ambiguous memory references to uncover vectorization opportunities. The hardware checks for any memory dependence violation due to speculative vectorization and takes corrective action in case of violation. Our experiments show that the combined (static + dynamic) vectorization approach provides a 2x performance benefit over static vectorization alone for SPECFP2006. Moreover, the dynamic vectorization scheme is as effective at vectorizing pointer-based applications as array-based ones, whereas compilers lose significant vectorization opportunities in pointer-based applications.
Loop level speculation in a task based programming model
Rahulkumar Gayatri, Rosa M. Badia, E. Ayguadé
HiPC 2013. DOI: 10.1109/HiPC.2013.6799132
Abstract: Uncountable loops (such as while-loops in C) and if-conditions are among the most common constructs in programming. While-loops are widely used, for example, to determine convergence in linear algebra algorithms or in goal-finding problems from graph algorithms. In general, while-loops are used whenever the loop iteration space (the number of iterations the loop executes) is unknown. Usually the execution of the next iteration of a while-loop is decided inside the current iteration (i.e., the execution of iteration i depends on values computed in iteration i-1). This precludes their parallel execution on today's ubiquitous multi-core architectures. In this paper, a technique is proposed to speculatively create parallel tasks from the next iterations before the current one completes. If consecutive loop iterations are only control dependent, then multiple iterations can be executed simultaneously; later in the execution path, the runtime system decides either to commit the results of such speculatively executed iterations or to undo the changes made by them. Data dependences within and between non-speculative and speculative work are honored to guarantee correctness. The proposed technique is implemented in SMPSs, a task-based dataflow programming model for shared-memory multiprocessor architectures. The approach is evaluated on a set of applications from graph algorithms and linear algebra. Results are promising, with an average speedup increase of 1.2x with 16 threads compared to non-speculative execution of the applications. The increase in speedup is significant, since the performance gain is achieved over already parallelized versions of the benchmarks.
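The commit-or-undo decision described above can be sketched sequentially: run a window of iterations speculatively, then commit them in order only up to the iteration that decides the loop exits, squashing the rest. This is a simplified simulation under assumptions (the `window` parameter and function names are invented); SMPSs would execute the speculative iterations as real parallel tasks rather than in a loop.

```python
# Sketch of loop-level speculation for a while-loop: iterations beyond the
# current one are executed speculatively, then committed in program order.
# Iterations that run past the loop's actual exit are discarded (rolled back).

def speculative_while(body, state, window=4, max_iters=1000):
    """body(state, i) -> (new_state, continue_flag); window = speculation depth."""
    committed = []
    i = 0
    while i < max_iters:
        # Speculatively run a whole window of iterations; each one sees the
        # state produced by its predecessor within the window.
        speculative, s = [], state
        for j in range(i, i + window):
            s, cont = body(s, j)
            speculative.append((s, cont))
        # Commit in order; everything after the first cont=False is squashed,
        # which models undoing the changes of mis-speculated iterations.
        for s, cont in speculative:
            committed.append(s)
            state = s
            if not cont:
                return state, committed
        i += window
    return state, committed
```

For a convergence-style loop (double a value until it reaches a bound), the window past the exit point is computed but never committed, mirroring the runtime's rollback of mis-speculated tasks.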
Conflict-free data access for multi-bank memory architectures using padding
Joar Sohl, Jian Wang, Andreas Karlsson, Dake Liu
HiPC 2013. DOI: 10.1109/HiPC.2013.6799112
Abstract: For high-performance computation, memory access is a major issue. Whether on a supercomputer, a GPGPU device, or an Application Specific Instruction set Processor (ASIP) for Digital Signal Processing (DSP), parallel execution is a necessity. A high rate of computation puts pressure on memory access, and it is often non-trivial to maximize the data rate to the execution units. Many algorithms that, from a computational point of view, could be implemented efficiently on parallel architectures fail to achieve significant speedups; very often the reason is that the available execution units are poorly utilized due to inefficient data access. This paper shows a method for improving the access time for data access sequences that are completely static, at the cost of extra memory, by resolving memory bank conflicts through padding. The method can be applied automatically, and it is shown to significantly reduce data access time for sorting and FFTs. Execution time is improved by up to a factor of 3.4 for the FFT and by up to a factor of 8 for sorting.
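The bank-conflict problem that padding solves is easy to demonstrate: with B banks and element e mapped to bank e mod B, striding through a row-major array whose row width is a multiple of B hits a single bank on every access. The sketch below (function name and sizes are illustrative, not from the paper) shows how one padding element per row restores conflict-free access.

```python
# Sketch of padding for multi-bank memories: bank(element) = element % banks.
# Reading column 0 of a row-major width x height array touches element
# row * width for each row; if width is a multiple of the bank count, every
# access lands in the same bank and is serialized. Padding each row with one
# unused element changes the stride and spreads accesses over all banks.

def banks_hit(width, height, banks, pad=0):
    """Distinct banks touched when reading column 0 of a (padded) array."""
    stride = width + pad
    return len({(row * stride) % banks for row in range(height)})
```

With 8 banks and 16-element rows, a column walk hits 1 bank unpadded and all 8 banks with a single element of padding per row, trading height extra elements of memory for fully parallel access, which is the static space-for-time trade the paper automates.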
Accelerating inclusion-based pointer analysis on heterogeneous CPU-GPU systems
Yu Su, Ding Ye, Jingling Xue
HiPC 2013. DOI: 10.1109/HiPC.2013.6799110
Abstract: This paper describes the first implementation of Andersen's inclusion-based pointer analysis for C programs on a heterogeneous CPU-GPU system, where both its CPU and GPU cores are used. As an important graph algorithm, Andersen's analysis is difficult to parallelise because it makes extensive modifications to the structure of the underlying graph, in a way that is highly input-dependent and statically hard to analyse. Existing parallel solutions run on either the CPU or the GPU but not both, leaving the underlying computational resources underutilised and making the ratios of CPU-only over GPU-only speedups for certain programs (i.e., graphs) unpredictable. We observe that a naive parallel solution of Andersen's analysis on a CPU-GPU system suffers from poor performance due to workload imbalance. We introduce a solution centered around a new dynamic workload distribution scheme. The novelty lies in prioritising the distribution of different types of workloads (i.e., the graph-rewriting rules in Andersen's analysis) to the CPU or GPU according to each processing unit's suitability for processing them. This scheme is effective when combined with synchronisation-free execution of tasks (i.e., graph-rewriting rules) and difference propagation of points-to information between the CPU and GPU. For a set of seven C benchmarks evaluated, our CPU-GPU solution outperforms (on average) (1) the CPU-only solution by 50.6%, (2) the GPU-only solution by 78.5%, and (3) an oracle solution that behaves as the faster of (1) and (2) on every benchmark by 34.6%.
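For readers unfamiliar with the algorithm being parallelized, Andersen's analysis computes points-to sets as a fixpoint over four inclusion constraints derived from C statements: p = &a (address-of), p = q (copy), p = *q (load), and *p = q (store). The naive sequential solver below is only a baseline sketch; the paper's contribution is distributing these rule applications across CPU and GPU, which this sketch does not attempt.

```python
# Minimal sequential fixpoint solver for Andersen's inclusion-based pointer
# analysis. pts[v] is the points-to set of variable v; the rules add
# inclusion edges until no set changes:
#   p = &a : a in pts[p]            p = q  : pts[p] ⊇ pts[q]
#   p = *q : pts[p] ⊇ pts[r] for each r in pts[q]
#   *p = q : pts[r] ⊇ pts[q] for each r in pts[p]

def andersen(address, copy, load, store, variables):
    pts = {v: set() for v in variables}
    for p, a in address:                       # p = &a
        pts[p].add(a)
    changed = True
    while changed:
        changed = False
        for p, q in copy:                      # p = q
            if not pts[q] <= pts[p]:
                pts[p] |= pts[q]; changed = True
        for p, q in load:                      # p = *q
            for r in list(pts[q]):
                if not pts[r] <= pts[p]:
                    pts[p] |= pts[r]; changed = True
        for p, q in store:                     # *p = q
            for r in list(pts[p]):
                if not pts[q] <= pts[r]:
                    pts[r] |= pts[q]; changed = True
    return pts
```

Because load and store rules consult points-to sets that are still growing, rule applications both read and rewrite the constraint graph, which is exactly the input-dependent mutation that makes the analysis hard to parallelize.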
Share-o-meter: An empirical analysis of KSM based memory sharing in virtualized systems
Shashank Rachamalla, Debadatta Mishra, Purushottam Kulkarni
HiPC 2013. DOI: 10.1109/HiPC.2013.6799096
Abstract: Content-based memory sharing in virtualized environments has proven to be a useful technique for over-commitment-based placement of virtual machines. The Kernel-based Virtual Machine (KVM) on Linux uses Kernel SamePage Merging (KSM) to identify and exploit sharing opportunities. In this paper, we present an analysis of page sharing across virtual machines by comparing the page sharing achieved by KSM to the total sharing opportunities presented by the virtual machines. We study the impact of different KSM configurations, system resources, and workload characteristics on the page sharing achieved by KSM. We also study the cost of sharing in terms of the CPU utilization overhead from copy-on-write page breaks that occur on KSM-shared pages. Our analysis is aimed at exploring the KSM configuration space to obtain desired sharing levels with minimal overheads for a given amount of system resources and workload characteristics. Our empirical analysis shows that workloads exhibiting different memory usage patterns require different KSM configuration parameters to achieve maximum savings. We quantify the levels of savings and associated costs for several workloads (individually and in combination) exhibiting different sharing opportunities and memory usage characteristics. Further, we demonstrate the need for adaptive configuration of KSM's aggressiveness based on changes in the total memory available for sharing and in memory usage characteristics.
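The "total sharing opportunity" that the paper compares KSM against can be estimated by grouping pages by content: every group of n identical pages across VMs needs only one physical copy, freeing n - 1 pages. The sketch below illustrates that upper bound only; KSM itself finds duplicates by scanning and comparing page contents in trees and backs merged pages with copy-on-write mappings, none of which is modeled here.

```python
# Back-of-the-envelope estimate of the page-sharing opportunity KSM exploits:
# pages with identical content across VMs can be backed by a single physical
# page. Pages are represented here as plain strings standing in for 4 KiB
# of content; each group of n identical pages frees n - 1 physical pages.
from collections import Counter

def mergeable_pages(vm_pages):
    """vm_pages: dict vm_name -> list of page contents. Returns pages freed."""
    counts = Counter(page for pages in vm_pages.values() for page in pages)
    return sum(n - 1 for n in counts.values() if n > 1)
```

Two VMs booted from the same image share zero-filled pages and common library pages, so the estimate is high; the paper's point is that what KSM actually achieves, and at what CPU cost, depends heavily on its scan-rate configuration relative to this opportunity.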