M. Laurenzano, Joshua Peraza, L. Carrington, Ananta Tiwari, W. A. Ward, R. Campbell. "A Static Binary Instrumentation Threading Model for Fast Memory Trace Collection." In: 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, pp. 741-745. doi:10.1109/SC.Companion.2012.101

Abstract: In order to achieve a high level of performance, data-intensive applications such as the real-time processing of surveillance feeds from unmanned aerial vehicles will require the strategic application of multi/many-core processors and coprocessors using a hybrid of inter-process message passing (e.g. MPI and SHMEM) and intra-process threading (e.g. pthreads and OpenMP). To facilitate program design decisions, memory traces gathered through binary instrumentation can be used to understand the low-level interactions between a data-intensive code and the memory subsystem of a multi-core processor or many-core coprocessor. Toward this end, this paper introduces the addition of threading support to PMaC's Efficient Binary Instrumentation Toolkit for Linux/x86 (PEBIL) and compares PEBIL's threading model to the threading models of two other popular Linux/x86 binary instrumentation platforms - Pin and Dyninst - on both theoretical and empirical grounds. The empirical comparisons are based on experiments which collect memory address traces for the OpenMP-threaded implementations of the NASA Advanced Supercomputing Parallel Benchmarks (NPBs). This work shows that the overhead of collecting full memory address traces for multithreaded programs is higher in PEBIL (7.7x) than in Pin (4.7x), both of which are significantly lower than in Dyninst (897x). This work also shows that PEBIL, uniquely, is able to take advantage of interval-based sampling of a memory address trace by rapidly disabling and re-enabling instrumentation at the transitions into and out of sampling periods, achieving significant decreases in the overhead of memory address trace collection. For collecting the memory address streams of each of the NPBs at a 10% sampling rate, PEBIL incurs an average slowdown of 2.9x, compared to 4.4x with Pin and 897x with Dyninst.
{"title":"Scalable Multi-Instance Learning Approach for Mapping the Slums of the World","authors":"Ranga Raju Vatsavai","doi":"10.1109/SC.Companion.2012.117","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.117","url":null,"abstract":"Remote sensing imagery is widely used in mapping thematic classes, such as, forests, crops, forests and other natural and man-made objects on the Earth. With the availability of very high-resolution satellite imagery, it is now possible to identify complex patterns such as formal and informal (slums) settlements. However, predominantly used single-instance learning algorithms that are widely used in thematic classification are not sufficient for recognizing complex settlement patterns. On the other hand, newer multi-instance learning schemes are useful in recognizing complex structures in images, but they are computationally expensive. In this paper, we present an adaptation of a multi-instance learning algorithm for informal settlement classification and its efficient implementation on shared memory architectures. Experimental evaluation shows that this approach is scalable and as well as accurate than commonly used single-instance learning algorithms.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"6 1","pages":"833-837"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86259966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yonghong Yan, J. Kemp, Xiaonan Tian, A. Malik, B. Chapman. "Performance and Power Characteristics of Matrix Multiplication Algorithms on Multicore and Shared Memory Machines." In: 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, pp. 626-632. doi:10.1109/SC.Companion.2012.87

Abstract: For many scientific applications, dense matrix multiplication is one of the most important and computation-intensive linear algebra operations. An efficient matrix multiplication on high performance and parallel computers requires optimizing how matrices are decomposed and exchanged between computational nodes, to reduce communication and synchronization overhead, and how the memory hierarchy within a node is exploited, to improve both spatial and temporal data locality. In this paper, we present our studies of the performance, cache behavior, and energy efficiency of multiple parallel matrix multiplication algorithms on a multicore desktop computer and a medium-size shared memory machine, both considered reference node sizes for building medium- and large-scale computational clusters for high performance computing in industry and national laboratories. Our results highlight both performance and energy efficiency, and also provide insight into the memory and resource pressure of these algorithms. We hope this can help users choose the appropriate implementation for their specific data sets when composing larger-scale scientific applications that use parallel matrix multiplication kernels on a node.
Sayan Ghosh, Terrence Liao, H. Calandra, B. Chapman. "Experiences with OpenMP, PGI, HMPP and OpenACC Directives on ISO/TTI Kernels." In: 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, pp. 691-700. doi:10.1109/SC.Companion.2012.95

Abstract: GPUs are slowly becoming ubiquitous devices in High Performance Computing, as their capability to enhance the performance per watt of compute-intensive algorithms, as compared to multicore CPUs, has been identified. The primary shortcoming of a GPU is usability, since vendor-specific APIs are quite different from existing programming languages, and substantial knowledge of the device and programming interface is required to optimize applications. Hence, a growing number of higher-level programming models are lately targeting GPUs to alleviate this problem. The ultimate goal for a high-level model is to expose an easy-to-use interface for the user to offload compute-intensive portions of code (kernels) to the GPU, and to tune the code according to the target accelerator to maximize overall performance with a reduced development effort. In this paper, we share our experiences with three notable high-level directive-based GPU programming models - PGI Accelerator, CAPS HMPP, and OpenACC (from both CAPS and PGI) - on an NVIDIA M2090 GPU. We analyze their performance and programmability on Isotropic (ISO) and Tilted Transversely Isotropic (TTI) finite difference kernels, which are primary components of the Reverse Time Migration (RTM) application used in oil and gas exploration for seismic imaging of the subsurface. When ported to a single GPU using these directives, we observe an average 1.5-1.8x improvement in performance for both the ISO and TTI kernels compared with optimized multi-threaded CPU implementations using OpenMP.
Umar Kalim, M. Gardner, Eric J. Brown, Wu-chun Feng. "Abstract: Cascaded TCP: BIG Throughput for BIG DATA Applications in Distributed HPC." In: 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, pp. 1420-1421. doi:10.1109/SC.Companion.2012.229

Abstract: Saturating high-capacity, high-latency paths is a challenge for vanilla TCP implementations. This is primarily due to congestion-control algorithms that adapt window sizes as acknowledgements are received. With large latencies, the congestion-control algorithms have to wait longer to respond to network conditions (e.g., congestion), and thus achieve less aggregate throughput. We argue that throughput can be improved if we reduce the impact of large end-to-end latencies by introducing layer-4 relays along the path. Such relays enable a cascade of TCP connections, each with lower latency, resulting in better aggregate throughput. This would directly benefit typical applications as well as BIG DATA applications in distributed HPC. We present empirical results supporting our hypothesis.
{"title":"Poster: GPU Accelerated Ultrasonic Tomography Using Propagation and Backpropagation Method","authors":"P. Bello, Yuanwei Jin, E. Lu","doi":"10.1109/SC.Companion.2012.249","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.249","url":null,"abstract":"This paper develops implementation strategy and method to accelerate the propagation and backpropagation (PBP) tomographic imaging algorithm using Graphic Processing Units (GPUs). The Compute Unified Device Architecture (CUDA) programming model is used to develop our parallelized algorithm since the CUDA model allows the user to interact with the GPU resources more efficiently than traditional shader methods. The results show an improvement of more than 80x when compared to the C/C++ version of the algorithm, and 515x when compared to the MATLAB version while achieving high quality imaging for both cases. We test different CUDA kernel configurations in order to measure changes in the processing-time of our algorithm. By examining the acceleration rate and the image quality, we develop an optimal kernel configuration that maximizes the throughput of CUDA implementation for the PBP method.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"os-44 1","pages":"1447"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87235876","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Network-Aware Object Storage Service","authors":"Shigetoshi Yokoyama, Nobukazu Yoshioka, Motonobu Ichimura","doi":"10.1109/SC.Companion.2012.79","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.79","url":null,"abstract":"This study describes a trial for establishing a network-aware object storage service. For scientific applications that need huge amounts of remotely stored data, the cloud infrastructure has functionalities to provide a service called `cluster as a service' and an inter-cloud object storage service. The scientific applications move from locations with constrained resources to locations where they can be executed practically. The inter-cloud object storage service has to be network-aware in order to perform well.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"74 1","pages":"556-561"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80624741","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bei Wang, S. Ethier, W. Tang, K. Ibrahim, Kamesh Madduri, Samuel Williams, L. Oliker, T. Williams. "Abstract: Advances in Gyrokinetic Particle in Cell Simulation for Fusion Plasmas to Extreme Scale." In: 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, pp. 1439-1440. doi:10.1109/SC.Companion.2012.243

Abstract: The gyrokinetic Particle-in-Cell (PIC) method has been successfully applied in studies of low-frequency microturbulence in magnetic fusion plasmas. While the excellent scaling of PIC codes on modern computing platforms is well established, significant challenges remain in achieving high on-chip concurrency on the new path to exascale systems. In addressing the associated issues, it is necessary to deal with the basic gather-scatter operation and the relatively low computational intensity of the PIC method. Significant advances have been achieved in optimizing gather-scatter operations in the gyrokinetic PIC method for next-generation multi-core CPU and GPU architectures. In particular, we will report on new techniques that improve locality, reduce memory conflicts, and efficiently utilize shared memory on GPUs. Performance benchmarks on two high-end computing platforms - the IBM BlueGene/Q (Mira) system at the Argonne Leadership Computing Facility (ALCF) and the Cray XK6 (Titan Dev) with the latest GPUs at the Oak Ridge Leadership Computing Facility (OLCF) - will be presented.
Prasanna Balaprakash, Darius Buntinas, Anthony Chan, Apala Guha, Rinku Gupta, S. Narayanan, A. Chien, P. Hovland, B. Norris. "Abstract: An Exascale Workload Study." In: 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, pp. 1463-1464. doi:10.1109/SC.Companion.2012.261

Abstract: Amdahl's law has been one of the factors shaping speedup in high performance computing over the last few decades. While Amdahl's approach of optimizing the 10% of the code where 90% of the execution time is spent has worked very well in the past, new challenges related to emerging exascale heterogeneous architectures, combined with stringent power and energy limitations, require a new architectural paradigm. The 10x10 approach is an effort in this direction. In this poster, we describe our initial steps and methodologies for defining and actualizing the 10x10 approach.
{"title":"A Parallel Unstructured Mesh Infrastructure","authors":"E. Seol, Cameron W. Smith, D. Ibanez, M. Shephard","doi":"10.1109/SC.Companion.2012.135","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.135","url":null,"abstract":"Two Department of Energy (DOE) office of Science's Scientific Discovery through Advanced Computing (SciDAC) Frameworks, Algorithms, and Scalable Technologies for Mathematics (FASTMath) software packages, Parallel Unstructured Mesh Infrastructure (PUMI) and Partitioning using Mesh Adjacencies (ParMA), are presented.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"26 1","pages":"1124-1132"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90905673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}