{"title":"Incremental and Parallel Analytics on Astrophysical Data Streams","authors":"D. Mishin, T. Budavári, A. Szalay, Yanif Ahmad","doi":"10.1109/SC.Companion.2012.130","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.130","url":null,"abstract":"Stream processing methods and online algorithms are increasingly appealing in the scientific and large-scale data management communities due to increasing ingestion rates of scientific instruments, the ability to produce and inspect results interactively, and the simplicity and efficiency of sequential storage access over enormous datasets. This article will showcase our experiences in using off-the-shelf streaming technology to implement incremental and parallel spectral analysis of galaxies from the Sloan Digital Sky Survey (SDSS) to detect a wide variety of galaxy features. The technical focus of the article is on a robust, highly scalable principal components analysis (PCA) algorithm and its use of coordination primitives to realize consistency as part of parallel execution. Our algorithm and framework can be readily used in other domains.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"54 1","pages":"1078-1086"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82369743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Designing a Collaborative Filtering Recommender on the Single Chip Cloud Computer","authors":"Aalap Tripathy, Atish Patra, S. Mohan, R. Mahapatra","doi":"10.1109/SC.Companion.2012.118","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.118","url":null,"abstract":"Fast response requirements for big-data applications on cloud infrastructures continues to grow. At the same time, many cores on-chip have now become a reality. These developments are set to redefine infrastructure nodes of cloud data centers in the future. For this to happen, parallel programming runtimes need to be designed for many-cores on chip as the target architecture. In this paper, we show that the commonly used MapReduce programming paradigm can be adapted to run on Intel's experimental single chip cloud computer (SCC) with 48-cores on chip. We demonstrate this using a Collaborative Filtering (CF) recommender system as an application. This is a widely used technique for information filtering to predict user's preference towards an unknown item from their past ratings. These systems are typically deployed in distributed clusters and operate on large apriori datasets. We address scalability with data partitioning, combining and sorting algorithms, maximize data locality to minimize communication cost within the SCC cores. We demonstrate ~2x speedup, ~94% lower energy consumption for benchmark workloads as compared to a distributed cluster of single and multi-processor nodes.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"49 1","pages":"838-847"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82857851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SAN Optimization for High Performance Storage with RDMA Data Transfer","authors":"Jae-Woo Choi, Youngjin Yu, Hyeonsang Eom, H. Yeom, Dongin Shin","doi":"10.1109/SC.Companion.2012.15","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.15","url":null,"abstract":"Today's server environments consist of many machines constructing clusters for distributed computing system or storage area networks (SAN) for effectively processing or saving enormous data. In these kinds of server environments, backend-storages are usually the bottleneck of the overall system. But it is not enough to simply replace the devices with better ones to exploit their performance benefits. In other words, proper optimizations are needed to fully utilize their performance gains. In this work, we first applied a high performance device as a backend-storage to the existing SAN solution, and found that it could not utilize the low latency and high bandwidth of the device, especially in case of small sized random I/O pattern even though a high speed network was used. To address this problem, we propose a new design that contains three optimizations: 1) removing software overheads to lower I/O latency; 2) parallelism to utilize the high bandwidth of the device; 3) temporal merge mechanism to reduce network overhead. We implemented them as a prototype and found that our solution makes substantial performance improvements in terms of both the latency and bandwidth.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"14 1","pages":"24-29"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82899841","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Abstract: Hybrid Breadth First Search Implementation for Hybrid-Core Computers","authors":"Kevin R. Wadleigh, John Amelio, K. Collins, G. Edwards","doi":"10.1109/SC.Companion.2012.184","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.184","url":null,"abstract":"Summary form only given. The Graph500 benchmark is designed to evaluate the suitability of supercomputing systems for graph algorithms, which are increasingly important in HPC. The timed Graph500 kernel, Breadth First Search, exhibits memory access patterns typical of these types of applications, with poor spatial locality and synchronization between multiple streams of execution. The Graph500 benchmark was ported to the Convey HC-2ex and MX-100, hybrid-core computers with an Intel host system and a coprocessor incorporating four reprogrammable Xilinx FPGAs. The computers contain a unique memory system designed to sustain high bandwidth for random memory accesses. The BFS kernel was implemented as a hybrid algorithm with concurrent processing on both the host and coprocessor. The early steps use a top-down algorithm on the host with results copied to coprocessor memory for use in a bottom-up algorithm. The coprocessor uses thousands of threads to traverse the graph. The resulting implementation runs at over 16 billion TEPS.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"5 1","pages":"1354-1354"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90383297","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The SDAV Software Frameworks for Visualization and Analysis on Next-Generation Multi-Core and Many-Core Architectures","authors":"Christopher M. Sewell, J. Meredith, K. Moreland, T. Peterka, David E. DeMarle, Li-Ta Lo, J. Ahrens, Robert Maynard, Berk Geveci","doi":"10.1109/SC.Companion.2012.36","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.36","url":null,"abstract":"This paper surveys the four software frameworks being developed as part of the visualization pillar of the SDAV (Scalable Data Management, Analysis, and Visualization) Institute, one of the SciDAC (Scientific Discovery through Advanced Computing) Institutes established by the ASCR (Advanced Scientific Computing Research) Program of the U.S. Department of Energy. These frameworks include EAVL (Extreme-scale Analysis and Visualization Library), DAX (Data Analysis at Extreme), DIY (Do It Yourself), and PISTON. The objective of these frameworks is to facilitate the adaptation of visualization and analysis algorithms to take advantage of the available parallelism in emerging multi-core and many-core hardware architectures, in anticipation of the need for such algorithms to be run in-situ with LCF (leadership-class facilities) simulation codes on supercomputers.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"78 1","pages":"206-214"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83940541","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Poster: High Performance GPU Accelerated TSP Solver","authors":"K. Rocki, R. Suda","doi":"10.1109/SC.Companion.2012.225","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.225","url":null,"abstract":"We are presenting a high performance GPU accelerated implementation of 2-opt local search algorithm for the Traveling Salesman Problem (TSP). GPU usage greatly decreases the time needed to optimize the route, however requires a complicated and well tuned implementation. With the increasing problem size, the time spent on comparing the graph edges grows significantly. We used instances from the TSPLIB library for for testing and our results show that by using our GPU algorithm, the time needed to perform a simple local search operation can be decreased approximately 5 to 45 times compared to parallel CPU code implementation using 6 cores. The code has been implemented in CUDA as well as in OpenCL and tested on NVIDIA and AMD devices. The experimental studies have shown that the optimization algorithm using the GPU local search converges from up to 300 times faster on average compared to the sequential CPU version, depending on the problem size. The main contributions of this work are the problem division scheme exploiting data locality which allows to solve arbitrarily big problem instances using GPU and the parallel implementation of the algorithm itself.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"10 1","pages":"1413-1414"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88065052","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Application of High Performance Computing to Solvency and Profitability Calculations for Life Assurance Contracts","authors":"Mark Tucker, J. M. Bull","doi":"10.1109/SC.Companion.2012.140","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.140","url":null,"abstract":"In the UK, pension providers are required by law to demonstrate solvency on a regular basis; the regulations governing how solvency is demonstrated are changing. Historically, it has been sufficient to report solvency using a single `best estimate' set of assumptions. The new regulations require a Monte Carlo approach to finding a worst-case scenario that requires computing power which is outside the systems currently available in the industry. This paper aims to show that the new regulations could be met by moving away from current actuarial valuation software packages and producing well-performing ab initio code, employing a variety of HPC techniques. Using a combination of algorithmic improvements, serial optimisations and multi-core parallelism, we demonstrate a performance improvement over commercial software of a factor of over 105. We show that this brings the Monte Carlo simulations within the bounds of practicality, and we suggest possibilities for further improvements, for example using clusters of GPUs. We also identify other possible use cases for high performance solvency and profitability calculations.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"56 1","pages":"1163-1170"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86829298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Poster: Acceleration of the BLAST Hydro Code on GPU","authors":"Tingxing Dong, T. Kolev, R. Rieben, V. Dobrev","doi":"10.1109/SC.Companion.2012.172","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.172","url":null,"abstract":"The BLAST code implements a high-order numerical algorithm that solves the equations of compressible hydrodynamics using the Finite Element Method in a moving Lagrangian frame. BLAST is coded in C++ and parallelized by MPI. We accelerate the most computationally intensive parts (80%-95%) of BLAST on an NVIDIA GPU with the CUDA programming model. Several 2D and 3D problems were tested and a maximum speedup of 4.3x was delivered. Our results demonstrate the validity and capability of GPU computing.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"53 1","pages":"1337-1337"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83570189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Case for Optimistic Coordination in HPC Storage Systems","authors":"P. Carns, K. Harms, D. Kimpe, R. Ross, J. Wozniak, L. Ward, M. Curry, Ruth Klundt, Geoff Danielson, Cengiz Karakoyunlu, J. Chandy, Bradley Settlemeyer, W. Gropp","doi":"10.1109/SC.Companion.2012.19","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.19","url":null,"abstract":"High-performance computing (HPC) storage systems rely on access coordination to ensure that concurrent updates do not produce incoherent results. HPC storage systems typically employ pessimistic distributed locking to provide this functionality in cases where applications cannot perform their own coordination. This approach, however, introduces significant performance overhead and complicates fault handling. In this work we evaluate the viability of optimistic conditional storage operations as an alternative to distributed locking in HPC storage systems. We investigate design strategies and compare the two approaches in a prototype object storage system using a parallel read/modify/write benchmark. Our prototype illustrates that conditional operations can be easily integrated into distributed object storage systems and can outperform standard coordination primitives for simple update workloads. Our experiments show that conditional updates can achieve over two orders of magnitude higher performance than pessimistic locking for some parallel read/modify/write workloads.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"34 1","pages":"48-53"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79485477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}