{"title":"Mixed-Precision Parallel Linear Programming Solver","authors":"Mujahed Eleyat, L. Natvig","doi":"10.1109/SBAC-PAD.2010.14","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2010.14","url":null,"abstract":"We use mixed-precision technique, which is used to exploit the high single precision performance of modern processors, to build the first sparse mixed-precision linear programming solver on the Cell BE processor. The technique is used to enhance the performance of an LP IPM-based solver by implementing mixed-precision sparse Cholesky factorization, the most time consuming part of LP solvers. Moreover, we implemented sparse matrix multiplication of the form required by the solver as it is also very time consuming for some LP problems. Implemented on the Cell BE processor (Playstation 3) and tested using Netlib data sets, our LP solver achieved a maximum speedup of 2.9 just by using the mixed-precision technique. Moreover, we found that some problems, especially in final iterations, result in ill-conditioned matrices where mixed-precision can not be used. As a result, the solver needs to switch to double-precision if a more accurate solution of an LP problem is required.","PeriodicalId":432670,"journal":{"name":"2010 22nd International Symposium on Computer Architecture and High Performance Computing","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115362392","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating Computational Fluid Dynamics on the IBM Blue Gene/P Supercomputer","authors":"P. Vezolle, Jerry Heyman, Bruce D. D'Amora, G. W. Braudaway, Karen A. Magerlein, J. Magerlein, Y. Fournier","doi":"10.1109/SBAC-PAD.2010.27","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2010.27","url":null,"abstract":"Computational Fluid Dynamics (CFD) is an increasingly important application domain for computational scientists. In this paper, we propose and analyze optimizations necessary to run CFD simulations consisting of multi-billion-cell mesh models on large processor systems. Our investigation leverages the general industrial Navier-Stokes open-source CFD application, Code_Saturne, developed by Electricité de France (EDF). Our work considers emerging processor features such as many-core, Symmetric Multi-threading (SMT), Single Instruction Multiple Data (SIMD), Transactional Memory, and Thread Level Speculation. Initially, we have targeted per-node performance improvements by reconstructing the code and data layouts to optimally use multiple threads. We present a general loop transformation that will enable the compiler to generate OpenMP threads effectively with minimal impact to overall code structure. A renumbering scheme for mesh faces is proposed to enhance thread-level parallelism and generally improve data locality. Performance results on IBM Blue Gene/P supercomputer and Intel Xeon Westmere cluster are included.","PeriodicalId":432670,"journal":{"name":"2010 22nd International Symposium on Computer Architecture and High Performance Computing","volume":"93 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127483008","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the Worst Case of Scheduling with Task Replication on Computational Grids","authors":"E. C. Xavier, Robson R. S. Peixoto","doi":"10.1109/SBAC-PAD.2010.24","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2010.24","url":null,"abstract":"We study the problem of scheduling tasks in a computational grid. We give analytical results for Work queue with Replication (WQR) based algorithms. There are several works presenting simulation results for scheduling algorithms for computational grid, but few provide analytical evidence of the quality of the solution of these algorithms. In this paper we show that under the TPCC metric cite{FujimotoH03} there is an optimal algorithm if the machines speed are predictable and tasks have the same length. If machines speed are not predictable we show an approximation result for the WQRxx algorithm and show that the result is tight. When tasks have different lengths the problem of minimizing the make span does not admit an approximation algorithm, even when machines speed are predictable. On the other hand, we show that the WQR based algorithm is a $m$-approximation when minimizing the TPCC in the unpredictable case, and this result is tight. To finish we show how to add replication to any scheduling algorithm using a simple interface and present computational simulations comparing the quality of the solutions of some well know algorithms with the addition of replication.","PeriodicalId":432670,"journal":{"name":"2010 22nd International Symposium on Computer Architecture and High Performance Computing","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133038758","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic Teams in OpenMP","authors":"J. Schönherr, Jan Richling, Hans-Ulrich Heiß","doi":"10.1109/SBAC-PAD.2010.36","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2010.36","url":null,"abstract":"While OpenMP conceptually allows to vary the degree of parallelism from one parallel region to the next in order to adapt to the system load, this might still be too coarse-grained in certain scenarios. Especially applications designed for parallelism may stay within one parallel region for a long time. This may lead either to an oversubscribed system where individual applications are not restricted in their degree of parallelism, or to an underutilized system, because individual applications are restricted to a too small degree of parallelism. In this paper, we tackle both problems by dynamically restricting the number of active threads within a parallel region without violating the OpenMP specification.","PeriodicalId":432670,"journal":{"name":"2010 22nd International Symposium on Computer Architecture and High Performance Computing","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122712719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Impact of I/O Coordination on a NFS-Based Parallel File System with Dynamic Reconfiguration","authors":"Rodrigo Kassick, F. Boito, P. Navaux","doi":"10.1109/SBAC-PAD.2010.32","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2010.32","url":null,"abstract":"The large gap between processing and I/O speed makes the storage infrastructure of a cluster a great bottleneck for HPC applications. Parallel File Systems propose a solution to this issue by distributing data onto several servers, dividing the load of I/O operations and increasing the available bandwidth. However, most parallel file systems use a fixed number of I/O servers defined during initialization and do not support addition of new resources as applications’ demands grow. With the execution of different applications at the same time, the concurrent access to these resources can impact the performance and aggravate the existing bottleneck. The dNFSp File System proposes a reconfiguration mechanism that aims to include new I/O resources as application’s demands grow. These resources are standard cluster nodes and are dedicated to a single application. This paper presents a study of the I/O performance of this reconfiguration mechanism under two circunstances: the use of several independent processes on a multi-core system or of a single centralized I/O process that coordinates the requests from all instances on a node. We show that the use of coordination can improve performance of applications with regular intervals between I/O phases. For applications with no such intervals, on the other hand, uncoordinated I/O presents better performance.","PeriodicalId":432670,"journal":{"name":"2010 22nd International Symposium on Computer Architecture and High Performance Computing","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124232056","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Control Scheme for a CGRA","authors":"M. A. Shami, A. Hemani","doi":"10.1109/SBAC-PAD.2010.12","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2010.12","url":null,"abstract":"Ability to instantiate low cost and agile FSMs that can implement an arbitrary parallelism and combine such FSMs in a chain and in a hierarchy is one of the key differentiating factors between the ASICs and MPSOCs. CGRAs that have been reported in literature, like MPSOCs, also lack this ASIC like ability. The downside of ASICs is their lack of reuse and high engineering cost. We present a CGRA architecture that retains the programmability of CGRA and yet has the ASIC like ability to construct a) arbitrarily parallel data-path/FSM combine, b) chain an arbitrary number of such FSMs and c) create a hierarchy of such chains. We present in detail the architecture of such a control scheme and illustrate its use for an example composed of FFT and FIRs. We quantify the benefits of our approach by benchmarking for energy-delay product against a) ASICs (4.8X worse), b) a state-of-the-art CGRA (4.58X better) and FPGAs (63.95X better).","PeriodicalId":432670,"journal":{"name":"2010 22nd International Symposium on Computer Architecture and High Performance Computing","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123114090","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Cache Replacement Policy Using Adaptive Insertion and Re-reference Prediction","authors":"Xi Zhang, Chongmin Li, Haixia Wang, Dongsheng Wang","doi":"10.1109/SBAC-PAD.2010.21","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2010.21","url":null,"abstract":"Previous research shows that LRU replacement policy is not efficient when applications exhibit a distant re-reference interval. Recently proposed RRIP policy improves performance for such workloads. However, RRIP lacks of access recency information, which may confuse the replacement policy to make accurate prediction. Consequently, RRIP is not robust for recency-friendly workloads. This paper proposes an Adaptive Insertion and Re-reference Prediction (AI-RRP) policy which evicts data based on both re-reference prediction value and the access recency information. To make the replacement policy more adaptive across different workloads and different phases during execution, Dynamic AI-RRP (DAI-RRP) is proposed which adjusts the insertion position and prediction value for different access patterns. Simulation results show DAI-RRP reduces CPI over LRU and Dynamic RRIP by an average of 8.3% and 4.1% respectively on a single-core processor with a 1MB 16-way set last-level cache (LLC). Evaluations on quad-core CMP with a 4MB shared LLC show that DAI-RRP outperforms LRU and Dynamic RRIP (DRRIP) on the weighted speedup metric by an average of 13.2% and 26.7% respectively. Furthermore, compred to LRU, DAI-RRP requires similar hardware, or even less hardware for high-associativity cache.","PeriodicalId":432670,"journal":{"name":"2010 22nd International Symposium on Computer Architecture and High Performance Computing","volume":"152 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121517506","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Dynamic Block Remapping Cache","authors":"Felipe Pedroni, A. D. Souza, C. Badue","doi":"10.1109/SBAC-PAD.2010.39","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2010.39","url":null,"abstract":"In this paper we present a new architecture of Level 2 (L2) cache – the Dynamic Block Remapping Cache (DBRC). DBRC mimics important characteristics of virtual memory systems to reduce the impact of L2 in system performance. Similar to virtual memory systems, the DBRC uses a hierarchy of tables to map blocks of L2 cache into blocks of physical memory. It also uses a Block-TLB to speedup accesses to previously performed block translations. We verified that the benefits of full associativity and the consequent possibility of employment of global block replacement algorithms allow hit rates higher than those of equivalent standard caches. We compare DBRC with standard caches in terms of miss rate, energy consumption and impact on the instruction-level parallelism (ILP) of a simulated superscalar processor. Our results show that DBRC outperforms standard caches in terms of miss rate, energy consumption and impact on ILP.","PeriodicalId":432670,"journal":{"name":"2010 22nd International Symposium on Computer Architecture and High Performance Computing","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133124240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sharing Resources for Performance and Energy Optimization of Concurrent Streaming Applications","authors":"A. Benoit, Paul Renaud-Goud, Yves Robert","doi":"10.1109/SBAC-PAD.2010.19","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2010.19","url":null,"abstract":"We aim at finding optimal mappings for concurrent streaming applications. Each application consists of a linear chain with several stages, and processes successive data sets in pipeline mode. The objective is to minimize the energy consumption of the whole platform, while satisfying given performance-related bounds on the period and latency of each application. The problem is to decide which processors to enroll, at which speed (or mode) to use them, and which stages they should execute. We distinguish two mapping categories, interval mappings without reuse, and fully arbitrary general mappings. On the theoretical side, we establish complexity results for this tri-criteria mapping problem (energy, period, latency). Furthermore, we derive an integer linear program that provides the optimal solution in the most general case. On the experimental side, we design polynomial-time heuristics, and assess their absolute performance thanks to the linear program. One main goal is to evaluate the impact of processor sharing on the quality of the solution.","PeriodicalId":432670,"journal":{"name":"2010 22nd International Symposium on Computer Architecture and High Performance Computing","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122003402","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distributed Evidence Propagation in Junction Trees","authors":"Yinglong Xia, V. Prasanna","doi":"10.1109/SBAC-PAD.2010.25","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2010.25","url":null,"abstract":"Evidence propagation is a major step in exact inference, a key problem in exploring probabilistic graphical models. In this paper, we propose a novel approach for evidence propagation on clusters. We decompose a junction tree into a set of sub trees, and then perform evidence propagation in the sub trees in parallel. The partially updated sub trees are merged after evidence collection. In addition, we propose a technique to explore tradeoff between overhead due to startup latency of message passing and bandwidth utilization efficiency. We implemented the proposed method on state-of-the-art clusters using MPI. Experimental results show that the proposed method exhibits superior performance compared with the baseline methods.","PeriodicalId":432670,"journal":{"name":"2010 22nd International Symposium on Computer Architecture and High Performance Computing","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121800289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}