R. Prabhakar, Shekhar Srikantaiah, C. Patrick, M. Kandemir
{"title":"Dynamic storage cache allocation in multi-server architectures","authors":"R. Prabhakar, Shekhar Srikantaiah, C. Patrick, M. Kandemir","doi":"10.1145/1654059.1654068","DOIUrl":"https://doi.org/10.1145/1654059.1654068","url":null,"abstract":"We introduce a dynamic and efficient shared cache management scheme, called Maxperf, that manages the aggregate cache space in multi-server storage architectures such that the service level objectives (SLOs) of concurrently executing applications are satisfied and any spare cache capacity is proportionately allocated according to the marginal gains of the applications to maximize performance. We use a combination of Neville's algorithm and linear-programming-model to discover the required storage cache partition size, on each server, for every application accessing that server. Experimental results show that our algorithm enforces partitions to provide stronger isolation to applications, meets application level SLOs even in the presence of dynamically changing storage cache requirements, and improves I/O latency of individual applications as well as the overall I/O latency significantly compared to two alternate storage cache management schemes, and a state-of-the-art single server storage cache management scheme extended to multi-server architecture.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127099584","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comparative study of one-sided factorizations with multiple software packages on multi-core hardware","authors":"E. Agullo, B. Hadri, H. Ltaief, J. Dongarra","doi":"10.1145/1654059.1654080","DOIUrl":"https://doi.org/10.1145/1654059.1654080","url":null,"abstract":"The emergence and continuing use of multi-core architectures require changes in the existing software and sometimes even a redesign of the established algorithms in order to take advantage of now prevailing parallelism. The Parallel Linear Algebra for Scalable Multi-core Architectures (PLASMA) is a project that aims to achieve both high performance and portability across a wide range of multi-core architectures. We present in this paper a comparative study of PLASMA's performance against established linear algebra packages (LAPACK and ScaLAPACK), against new approaches at parallel execution (Task Based Linear Algebra Subroutines - TBLAS), and against equivalent commercial software offerings (MKL, ESSL and PESSL). Our experiments were conducted on one-sided linear algebra factorizations (LU, QR and Cholesky) and used multi-core architectures (based on Intel Xeon EMT64 and IBM Power6). A performance improvement of 67% was for instance obtained on the Cholesky factorization of a matrix of order 4000, using 32 cores.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128856776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sparse matrix factorization on massively parallel computers","authors":"Anshul Gupta, S. Koric, Thomas George","doi":"10.1145/1654059.1654061","DOIUrl":"https://doi.org/10.1145/1654059.1654061","url":null,"abstract":"Direct methods for solving sparse systems of linear equations have a high asymptotic computational and memory requirements relative to iterative methods. However, systems arising in some applications, such as structural analysis, can often be too ill-conditioned for iterative solvers to be effective. We cite real applications where this is indeed the case, and using matrices extracted from these applications to conduct experiments on three different massively parallel architectures, show that a well designed sparse factorization algorithm can attain very high levels of performance and scalability. We present strong scalability results for test data from real applications on up to 8,192 cores, along with both analytical and experimental weak scalability results for a model problem on up to 16,384 cores---an unprecedented number for sparse factorization. For the model problem, we also compare experimental results with multiple analytical scaling metrics and distinguish between some commonly used weak scaling methods.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122955452","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enabling software management for multicore caches with a lightweight hardware support","authors":"Jiang Lin, Q. Lu, Xiaoning Ding, Zhao Zhang, Xiaodong Zhang, P. Sadayappan","doi":"10.1145/1654059.1654074","DOIUrl":"https://doi.org/10.1145/1654059.1654074","url":null,"abstract":"The management of shared caches in multicore processors is a critical and challenging task. Many hardware and OS-based methods have been proposed. However, they may be hardly adopted in practice due to their non-trivial overheads, high complexities, and/or limited abilities to handle increasingly complicated scenarios of cache contention caused by many-cores. In order to turn cache partitioning methods into reality in the management of multicore processors, we propose to provide an affordable and lightweight hardware support to coordinate with OS-based cache management policies. The proposed methods are scalable to many-cores, and perform comparably with other proposed hardware solutions, but have much lower overheads, therefore can be easily adopted in commodity processors. Having conducted extensive experiments with 37 multi-programming workloads, we show the effectiveness and scalability of the proposed methods. For example on 8-core systems, one of our proposed policies improves performance over LRU-based hardware cache management by 14.5% on average.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"236 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115326539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nagesh B. Lakshminarayana, Jaekyu Lee, Hyesoon Kim
{"title":"Age based scheduling for asymmetric multiprocessors","authors":"Nagesh B. Lakshminarayana, Jaekyu Lee, Hyesoon Kim","doi":"10.1145/1654059.1654085","DOIUrl":"https://doi.org/10.1145/1654059.1654085","url":null,"abstract":"Asymmetric (or Heterogeneous) Multiprocessors are becoming popular in the current era of multi-cores due to their power efficiency and potential performance and energy efficiency. However, scheduling of multithreaded applications in Asymmetric Multiprocessors is still a challenging problem. Scheduling algorithms for Asymmetric Multiprocessors must not only be aware of asymmetry in processor performance, but have to consider the characteristics of application threads also. In this paper, we propose a new scheduling policy, Age based scheduling, that assigns a thread with a larger remaining execution time to a fast core. Age based scheduling predicts the remaining execution time of threads based on their age, i.e., when the threads were created. These predictions are based on the insight that most threads that are created together tend to have similar execution durations. Using Age based scheduling, we improve the overall performance of several important multithreaded applications including Parsec and asymmetric benchmarks from Splash-II and Omp-SCR. Our evaluations show that Age based scheduling improves performance up to 37% compared to the state-of-the-art Asymmetric Multiprocessor scheduling policy and on average by 10.4% for the Parsec benchmarks. Our results also show that the Age based scheduling policy with profiling improves the average performance by 13.2% for the Parsec benchmarks.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128346809","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable massively parallel I/O to task-local files","authors":"W. Frings, F. Wolf, Ventsislav Petkov","doi":"10.1145/1654059.1654077","DOIUrl":"https://doi.org/10.1145/1654059.1654077","url":null,"abstract":"Parallel applications often store data in multiple task-local files, for example, to remember checkpoints, to circumvent memory limitations, or to record performance data. When operating at very large processor configurations, such applications often experience scalability limitations when the simultaneous creation of thousands of files causes metadataserver contention or simply when large file counts complicate file management or operations on those files even destabilize the file system. SIONlib is a parallel I/O library that addresses this problem by transparently mapping a large number of task-local files onto a small number of physical files via internal metadata handling and block alignment to ensure high performance. While requiring only minimal source code changes, SIONlib significantly reduces file creation overhead and simplifies file handling without penalizing read and write performance. We evaluate SIONlib's efficiency with up to 288 K tasks and report significant performance improvements in two application scenarios.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126513601","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluating similarity-based trace reduction techniques for scalable performance analysis","authors":"K. Mohror, K. Karavanic","doi":"10.1145/1654059.1654115","DOIUrl":"https://doi.org/10.1145/1654059.1654115","url":null,"abstract":"Event traces are required to correctly diagnose a number of performance problems that arise on today's highly parallel systems. Unfortunately, the collection of event traces can produce a large volume of data that is difficult, or even impossible, to store and analyze. One approach for compressing a trace is to identify repeating trace patterns and retain only one representative of each pattern. However, determining the similarity of sections of traces, i.e., identifying patterns, is not straightforward. In this paper, we investigate pattern-based methods for reducing traces that will be used for performance analysis. We evaluate the different methods against several criteria, including size reduction, introduced error, and retention of performance trends, using both benchmarks with carefully chosen performance behaviors, and a real application.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126534305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Lang, P. Carns, R. Latham, R. Ross, K. Harms, W. Allcock
{"title":"I/O performance challenges at leadership scale","authors":"S. Lang, P. Carns, R. Latham, R. Ross, K. Harms, W. Allcock","doi":"10.1145/1654059.1654100","DOIUrl":"https://doi.org/10.1145/1654059.1654100","url":null,"abstract":"Today's top high performance computing systems run applications with hundreds of thousands of processes, contain hundreds of storage nodes, and must meet massive I/O requirements for capacity and performance. These leadership-class systems face daunting challenges to deploying scalable I/O systems. In this paper we present a case study of the I/O challenges to performance and scalability on Intrepid, the IBM Blue Gene/P system at the Argonne Leadership Computing Facility. Listed in the top 5 fastest supercomputers of 2008, Intrepid runs computational science applications with intensive demands on the I/O system. We show that Intrepid's file and storage system sustain high performance under varying workloads as the applications scale with the number of processes.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115367534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GridBot: execution of bags of tasks in multiple grids","authors":"M. Silberstein, A. Sharov, D. Geiger, A. Schuster","doi":"10.1145/1654059.1654071","DOIUrl":"https://doi.org/10.1145/1654059.1654071","url":null,"abstract":"We present a holistic approach for efficient execution of bags-of-tasks (BOTs) on multiple grids, clusters, and volunteer computing grids virtualized as a single computing platform. The challenge is twofold: to assemble this compound environment and to employ it for execution of a mixture of throughput- and performance-oriented BOTs, with a dozen to millions of tasks each. Our generic mechanism allows per BOT specification of dynamic arbitrary scheduling and replication policies as a function of the system state, BOT execution state, and BOT priority. We implement our mechanism in the GridBot system and demonstrate its capabilities in a production setup. GridBot has executed hundreds of BOTs with over 9 million jobs during three months alone; these have been invoked on 25,000 hosts, 15,000 from the Superlink@Technion community grid and the rest from the Technion campus grid, local clusters, the Open Science Grid, EGEE, and the UW Madison pool.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128284463","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluating the impact of inaccurate information in utility-based scheduling","authors":"Alvin AuYoung, Amin Vahdat, A. Snoeren","doi":"10.1145/1654059.1654098","DOIUrl":"https://doi.org/10.1145/1654059.1654098","url":null,"abstract":"Proponents of utility-based scheduling policies have shown the potential for a 100--1400% increase in value-delivered to users when used in lieu of traditional approaches such as FCFS, backfill or priority queues. However, perhaps due to concerns about their potential fragility, these policies are rarely used in practice. We present an evaluation of a utility-based scheduling policy based upon real workload data from both an auction-based resource infrastructure, and a supercomputing cluster. We model potential sources of imperfect operating conditions for a utility-based policy: user uncertainty and wealth inequity. Through simulation, we find that while the value delivered by a utility-based policy can degrade to half that of traditional approaches in the worst case, the policy we study provides 20--100% improvement under realistic operating conditions. We conclude that future efforts in designing utility-based allocation mechanisms and policies must explicitly consider the fidelity of elicited job value information from users.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116339350","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}