{"title":"Instruction-level simulation of a cluster at scale","authors":"E. León, R. Riesen, A. Maccabe, P. Bridges","doi":"10.1145/1654059.1654063","DOIUrl":"https://doi.org/10.1145/1654059.1654063","url":null,"abstract":"Instruction-level simulation is necessary to evaluate new architectures. However, single-node simulation cannot predict the behavior of a parallel application on a supercomputer. We present a scalable simulator that couples a cycle-accurate node simulator with a supercomputer network model. Our simulator executes individual instances of IBM's Mambo PowerPC simulator on hundreds of cores. We integrated a NIC emulator into Mambo and model the network instead of fully simulating it. This decouples the individual node simulators and makes our design scalable. Our simulator runs unmodified parallel message-passing applications on hundreds of nodes. We can change network and detailed node parameters, inject network traffic directly into caches, and use different policies to decide when that is an advantage. This paper describes our simulator in detail, evaluates it, and demonstrates its scalability. We show its suitability for architecture research by evaluating the impact of cache injection on parallel application performance.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128553627","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive and scalable metadata management to support a trillion files","authors":"Jing Xing, Jin Xiong, Ninghui Sun, Jie Ma","doi":"10.1145/1654059.1654086","DOIUrl":"https://doi.org/10.1145/1654059.1654086","url":null,"abstract":"Nowadays more and more applications require file systems to efficiently maintain million or more files. How to provide high access performance with such a huge number of files and such large directories is a big challenge for cluster file systems. Limited by static directory structures, existing file systems will be prohibitively inefficient for this use. To address this problem, we present a scalable and adaptive metadata management system which aims to maintain a trillion files efficiently. Firstly, our system exploits an adaptive two-level directory partitioning based on extendible hashing to manage very large directories. Secondly, our system utilizes fine-grained parallel processing within a directory and greatly improves performance of file creation or deletion. Thirdly, our system uses multiple-layered metadata cache management which improves memory utilization on the servers. And finally, our system uses a dynamic loadbalance mechanism based on consistent hashing which enables our system to scale up and down easily. Our performance results on 32 metadata servers show that our user-level prototype implementation can create more than 74 thousand files per second and can get more than 270 thousand files' attributes per second in a single directory with 100 million files. Moreover, it delivers a peak throughput of more than 60 thousand file creates/second in a single directory with 1 billion files.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"365 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115906264","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A case for integrated processor-cache partitioning in chip multiprocessors","authors":"Shekhar Srikantaiah, R. Das, Asit K. Mishra, C. Das, M. Kandemir","doi":"10.1145/1654059.1654066","DOIUrl":"https://doi.org/10.1145/1654059.1654066","url":null,"abstract":"Existing cache partitioning schemes are designed in a manner oblivious to the implicit processor partitioning enforced by the operating system. This paper examines an operating system directed integrated processor-cache partitioning scheme that partitions both the available processors and the shared cache in a chip multiprocessor among different multi-threaded applications. Extensive simulations using a set of multiprogrammed workloads show that our integrated processor-cache partitioning scheme facilitates achieving better performance isolation as compared to state of the art hardware/software based solutions. Specifically, our integrated processor-cache partitioning approach performs, on an average, 20.83% and 14.14% better than equal partitioning and the implicit partitioning enforced by the underlying operating system, respectively, on the fair speedup metric on an 8 core system. We also compare our approach to processor partitioning alone and a state-of-the-art cache partitioning scheme and our scheme fares 8.21% and 9.19% better than these schemes on a 16 core system.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"244 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115960168","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SmartStore: a new metadata organization paradigm with semantic-awareness for next-generation file systems","authors":"Yu Hua, Hong Jiang, Yifeng Zhu, D. Feng, Lei Tian","doi":"10.1145/1654059.1654070","DOIUrl":"https://doi.org/10.1145/1654059.1654070","url":null,"abstract":"Existing storage systems using hierarchical directory tree do not meet scalability and functionality requirements for exponentially growing datasets and increasingly complex queries in Exabyte-level systems with billions of files. This paper proposes semantic-aware organization, called SmartStore, which exploits metadata semantics of files to judiciously aggregate correlated files into semantica-ware groups by using information retrieval tools. Decentralized design improves system scalability and reduces query latency for complex queries (range and top-k queries), which is conducive to constructing semantic-aware caching, and conventional filename-based query. SmartStore limits search scope of complex query to a single or a minimal number of semantically related groups and avoids or alleviates brute-force search in entire system. Extensive experiments using real-world traces show that SmartStore improves system scalability and reduces query latency over basic database approaches by one thousand times. To the best of our knowledge, this is the first study implementing complex queries in large-scale file systems.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117025410","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"42 TFlops hierarchical N-body simulations on GPUs with applications in both astrophysics and turbulence","authors":"T. Hamada, T. Narumi, Rio Yokota, K. Yasuoka, Keigo Nitadori, M. Taiji","doi":"10.1145/1654059.1654123","DOIUrl":"https://doi.org/10.1145/1654059.1654123","url":null,"abstract":"As an entry for the 2009 Gordon Bell price/performance prize, we present the results of two different hierarchical N-body simulations on a cluster of 256 graphics processing units (GPUs). Unlike many previous N-body simulations on GPUs that scale as O(N2), the present method calculates the O(N log N) treecode and O(N) fast multipole method (FMM) on the GPUs with unprecedented efficiency. We demonstrate the performance of our method by choosing one standard application -a gravitational N-body simulation- and one non-standard application -simulation of turbulence using vortex particles. The gravitational simulation using the treecode with 1,608,044,129 particles showed a sustained performance of 42.15 TFlops. The vortex particle simulation of homogeneous isotropic turbulence using the periodic FMM with 16,777,216 particles showed a sustained performance of 20.2 TFlops. The overall cost of the hardware was 228,912 dollars. The maximum corrected performance is 28.1TFlops for the gravitational simulation, which results in a cost performance of 124 MFlops/$. This correction is performed by counting the Flops based on the most efficient CPU algorithm. Any extra Flops that arise from the GPU implementation and parameter differences are not included in the 124 MFlops/$.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114376230","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enabling high-fidelity neutron transport simulations on petascale architectures","authors":"D. Kaushik, Micheal Smith, A. Wollaber, Barry F. Smith, A. Siegel, W. Yang","doi":"10.1145/1654059.1654128","DOIUrl":"https://doi.org/10.1145/1654059.1654128","url":null,"abstract":"The UNIC code is being developed as part of the DOE's Nuclear Energy Advanced Modeling and Simulation (NEAMS) program. UNIC is an unstructured, deterministic neutron transport code that allows a highly detailed description of a nuclear reactor. The primary goal of our simulation efforts is to reduce the uncertainties and biases in reactor design calculations by progressively replacing existing multilevel averaging (homogenization) techniques with more direct solution methods based on first principles. Since the neutron transport equation is seven dimensional (three in space, two in angle, one in energy, and one in time), these simulations are among the most memory and computationally intensive in all of computational science. In order to model the complex physics of a reactor core, billions of spatial elements, hundreds of angles, and thousands of energy groups are necessary, leading to problem sizes with petascale degrees of freedom. Therefore, these calculations exhaust memory resources on current and even next-generation architectures. In this paper, we present UNIC simulation results for two important representative problems in reactor design and analysis---PHENIX and ZPR-6. In each case, UNIC shows good weak scalability on up to 163,840 cores of Blue Gene/P (Argonne) and 122,800 cores of XT5 (Oak Ridge). While our current per processor performance is less than ideal, we demonstrate a clear ability to effectively utilize the leadership computing platforms. Over the coming months, we aim to improve the per processor performance while maintaining the high parallel efficiency by employing better algorithms such as spatial p- and h-multigrid preconditioners, optimized matrix-tensor operations, and weighted partitioning for better load balancing. Combining these additional algorithmic improvements with the availability of larger parallel machines should allow us to realize our long-term goal of explicit geometry coupled multiphysics reactor simulations. In the long run, these high-fidelity simulations will be able to replace expensive mockup experiments and reduce the uncertainty in crucial reactor design and operational parameters.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130483382","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Machine learning-based prefetch optimization for data center applications","authors":"Shih-Wei Liao, Tzu-Han Hung, Donald Nguyen, Chinyen Chou, Chia-Heng Tu, Hucheng Zhou","doi":"10.1145/1654059.1654116","DOIUrl":"https://doi.org/10.1145/1654059.1654116","url":null,"abstract":"Performance tuning for data centers is essential and complicated. It is important since a data center comprises thousands of machines and thus a single-digit performance improvement can significantly reduce cost and power consumption. Unfortunately, it is extremely difficult as data centers are dynamic environments where applications are frequently released and servers are continually upgraded. In this paper, we study the effectiveness of different processor prefetch configurations, which can greatly influence the performance of memory system and the overall data center. We observe a wide performance gap when comparing the worst and best configurations, from 1.4% to 75.1%, for 11 important data center applications. We then develop a tuning framework which attempts to predict the optimal configuration based on hardware performance counters. The framework achieves performance within 1% of the best performance of any single configuration for the same set of applications.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114579533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Increasing memory miss tolerance for SIMD cores","authors":"D. Tarjan, Jiayuan Meng, K. Skadron","doi":"10.1145/1654059.1654082","DOIUrl":"https://doi.org/10.1145/1654059.1654082","url":null,"abstract":"Manycore processors with wide SIMD cores are becoming a popular choice for the next generation of throughput oriented architectures. We introduce a hardware technique called \"diverge on miss\" that allows SIMD cores to better tolerate memory latency for workloads with non-contiguous memory access patterns. Individual threads within a SIMD \"warp\" are allowed to slip behind other threads in the same warp, letting the warp continue execution even if a subset of threads are waiting on memory. Diverge on miss can either increase the performance of a given design by up to a factor of 3.14 for a single warp per core, or reduce the number of warps per core needed to sustain a given level of performance from 16 to 2 warps, reducing the area per core by 35%.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128569852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HyperX: topology, routing, and packaging of efficient large-scale networks","authors":"Jung Ho Ahn, N. Binkert, A. Davis, M. McLaren, R. Schreiber","doi":"10.1145/1654059.1654101","DOIUrl":"https://doi.org/10.1145/1654059.1654101","url":null,"abstract":"In the push to achieve exascale performance, systems will grow to over 100,000 sockets, as growing cores-per-socket and improved single-core performance provide only part of the speedup needed. These systems will need affordable interconnect structures that scale to this level. To meet the need, we consider an extension of the hypercube and flattened butterfly topologies, the HyperX, and give an adaptive routing algorithm, DAL. HyperX takes advantage of high-radix switch components that integrated photonics will make available. Our main contributions include a formal descriptive framework, enabling a search method that finds optimal HyperX configurations; DAL; and a low cost packaging strategy for an exascale HyperX. Simulations show that HyperX can provide performance as good as a folded Clos, with fewer switches. We also describe a HyperX packaging scheme that reduces system cost. Our analysis of efficiency, performance, and packaging demonstrates that the HyperX is a strong competitor for exascale networks.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129314288","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance evaluation of NEC SX-9 using real science and engineering applications","authors":"Takashi Soga, A. Musa, Y. Shimomura, Ryusuke Egawa, K. Itakura, H. Takizawa, Koki Okabe, Hiroaki Kobayashi","doi":"10.1145/1654059.1654088","DOIUrl":"https://doi.org/10.1145/1654059.1654088","url":null,"abstract":"This paper describes a new-generation vector parallel supercomputer, NEC SX-9 system. The SX-9 processor has an outstanding core to achieve over 100Gflop/s, and a software-controllable on-chip cache to keep the high ratio of the memory bandwidth to the floating-point operation rate. Moreover, its large SMP nodes of 16 vector processors with 1.6Tflop/s performance and 1TB memory are connected with dedicated network switches, which can achieve inter-node communication at 128GB/s per direction. The sustained performance of the SX-9 processor is evaluated using six practical applications in comparison with conventional vector processors and the latest scalar processor such as Nehalem-EP. Based on the results, this paper discusses the performance tuning strategies for new-generation vector systems. An SX-9 system of 16 nodes is also evaluated by using the HPC challenge benchmark suite and a CFD code. Those evaluation results clarify the highest sustained performance and scalability of the SX-9 system.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125096729","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}