IEEE International. 2005 Proceedings of the IEEE Workload Characterization Symposium, 2005. Latest articles

Comprehensive throughput evaluation of LANs in clusters of PCs with Switchbench - or how to bring your switch to its knees

IEEE International. 2005 Proceedings of the IEEE Workload Characterization Symposium, 2005. Pub Date : 2005-11-07 DOI: 10.1109/IISWC.2005.1526012

F. Rauch

Abstract: Understanding the performance of parallel applications for prevalent clusters of commodity PCs is still not an easy task: One must understand performance characteristics of all subsystems in the cluster machine, besides the inherently required knowledge about the applications' behaviour. While there are already many benchmarks that characterise a single node's subsystems like CPU, memory and I/O, as well as a few to evaluate its network interface with point-to-point data streams, there are to the best of our knowledge no benchmarks available that characterise a cluster network or LAN as a whole. We present Switchbench (2005), a set of three microbenchmarks that thoroughly evaluate the throughput characteristics of networks for clusters. A first microbenchmark tests the basic processing limitations of the switches, by sending and receiving data concurrently at maximum throughputs on all network interfaces. A second microbenchmark tests arbitrary communication patterns by pairwise connecting nodes for high-speed throughput tests. A third and slightly more realistic microbenchmark executes an all-to-all personalised communication (AAPC) algorithm to test many different patterns and critical bisections in the network. The microbenchmarks already proved to be extremely useful in a previous study to experimentally quantify performance limitations in different networks of clusters of PCs with up to 128 nodes. We also establish the suitability of our microbenchmarks by comparing their results with two application benchmarks. The benchmarks consist of two C programs supported by shell scripts to start the programs on all nodes of the cluster with the correct execution parameters to automatically scale the workloads from a few nodes up to the full cluster size.

Citations: 1

A multi-level comparative performance characterization of SPECjbb2005 versus SPECjbb2000

IEEE International. 2005 Proceedings of the IEEE Workload Characterization Symposium, 2005. Pub Date : 2005-11-07 DOI: 10.1109/IISWC.2005.1526002

R. Morin, A. Kumar, E. Ilyina

Abstract: SPEC has released SPECjbb2005, a new server-side Java benchmark which supersedes SPECjbb2000. SPECjbb2005 is a substantial update to SPECjbb2000, intended to make the workload more representative based on current Java development practices. SPECjbb2000 has been in existence for about five years and it has been a valuable tool for optimizing the performance of commercial JVMs as well as supporting research activities. Since SPECjbb2005 replaces SPECjbb2000, it is important to understand the key differences between the two, as well as implications for JVM and hardware designers. In this paper, we present a comparative characterization of these two workloads based on detailed measurements on an Intel® Xeon™ processor-based commercial server. First, we describe key functional changes introduced in SPECjbb2005. Using low-intrusion application profiling tools, we compare application execution profiles. Through JVM monitoring tools, we compare JVM behavior including JIT optimization and garbage collection. Using operating system monitoring tools, we compare key system level metrics including CPU utilization. With the aid of processor performance monitoring events, we compare key architectural characteristics such as cache miss rates, memory/bus utilization, and branch behavior. Finally, we summarize key findings, provide recommendations to JVM developers and hardware designers, and suggest areas for future work.

Citations: 9

Understanding the causes of performance variability in HPC workloads

IEEE International. 2005 Proceedings of the IEEE Workload Characterization Symposium, 2005. Pub Date : 2005-11-07 DOI: 10.1109/IISWC.2005.1526010

David Skinner, William Kramer

Abstract: While most workload characterization focuses on application and architecture performance, the variability in performance also has wide ranging impacts on the users and managers of large scale computing resources. Performance variability, though secondary to absolute performance itself, can significantly detract from both the overall performance realized by parallel workloads and the suitability of a given architecture for a workload. In making choices about how to best match an HPC workload to an HPC architecture, most examinations focus primarily on application performance, often in terms of nominal or optimal performance. A practical concern which brackets the degree to which one can expect to see this performance in a multi-user production computing environment is the degree to which performance varies. Without an understanding of the performance variability exhibited by a computer for a given workload, in a practical sense, the effective performance that can be realized is still undetermined. In this work we examine both architectural and application causes of variability, quantify their impacts, and demonstrate performance gains realized by reducing variability.

Citations: 116

Detecting recurrent phase behavior under real-system variability

IEEE International. 2005 Proceedings of the IEEE Workload Characterization Symposium, 2005. Pub Date : 2005-11-07 DOI: 10.1109/IISWC.2005.1525997

C. Isci, M. Martonosi

Abstract: As computer systems become ever more complex and power hungry, research on dynamic on-the-fly system management and adaptations receives increasing attention. Such research relies on recognizing and responding to patterns or phases in application execution, which has therefore become an important and widely-studied research area. While application phase analysis has received significant attention, much of this attention thus far has focused on simulation-based studies. In these cycle-level simulations without indeterministic operating system intervention, applications display behavior that is repeatable from phase to phase and from run to run. A natural question, therefore, concerns how these phases appear in real system runs, where interrupts and time variability can influence the timing and behavior of the program. Our paper examines the phase behavior of applications running on real systems. The key goals of our work are to reliably discern and recover phase behavior in the face of application variability stemming from real system effects and time sampling. We propose a set of new, "transition-based" phase detection techniques. Our techniques can detect repeatable workload phase information from time-varying, real system measurements with less than 5% false alarm probabilities. In comparison to previous value-based detection methods, our transition-based techniques achieve on average 6x higher recurrent phase detection efficiency under real system variability.

Citations: 18

A study of Java virtual machine scalability issues on SMP systems

IEEE International. 2005 Proceedings of the IEEE Workload Characterization Symposium, 2005. Pub Date : 2005-11-07 DOI: 10.1109/IISWC.2005.1526008

Zhongbo Cao, Wei Huang, J.M. Chang

Abstract: This paper studies the scalability issues of the Java virtual machine (JVM) on symmetrical multiprocessing (SMP) systems. Using a cycle-accurate simulator, we evaluate the performance scaling of multithreaded Java benchmarks with the number of processors and application threads. By correlating low-level hardware performance data to two high-level software constructs, thread types and memory regions, we present in detail the performance analysis and study the potential performance impacts of memory system latencies and resource contentions on scalability. Several key findings are revealed through this paper. First, among the memory access latency components, the primary portion of memory stalls is produced by L2 cache misses and cache-to-cache transfers. Second, among the regions of memory, Java heap space produces most memory stalls. Additionally, a large majority of memory stalls occur in application threads, as opposed to other JVM threads. Furthermore, we find that increasing the number of processors or application threads, independently of each other, leads to increases in L2 cache miss ratio and cache-to-cache transfer ratio. This problem can be alleviated by using a thread-local heap or allocation buffer which can improve L2 cache performance. For certain benchmarks such as Raytracer, their cache-to-cache transfers, mainly dominated by false sharing, can be significantly reduced. Our experiments also show that a thread-local allocation buffer with a size between 16KB and 256KB often leads to optimal performance.

Citations: 7

Understanding ultra-scale application communication requirements

IEEE International. 2005 Proceedings of the IEEE Workload Characterization Symposium, 2005. Pub Date : 2005-11-07 DOI: 10.1109/IISWC.2005.1526015

Kamil Shoaib, J. Shalf, L. Oliker, David Skinner

Citations: 45

Reducing overheads for acquiring dynamic memory traces

IEEE International. 2005 Proceedings of the IEEE Workload Characterization Symposium, 2005. Pub Date : 2005-11-07 DOI: 10.1109/IISWC.2005.1526000

Xiaofeng Gao, M. Laurenzano, Beth Simon, A. Snavely

Abstract: Tools for acquiring dynamic memory address information for large scale applications are important for performance modeling, optimization, and for trace-driven simulation. However, straightforward use of binary instrumentation tools for such a fine-grained task as address tracing can cause astonishing slowdown in application run time. For example, in a large scale FY05 collaboration with the Department of Defense High Performance Computing Modernization Office (HPCMO), over 1 million processor hours were expended in order to gather address information on 7 parallel applications. In this paper, we discuss in detail the issues surrounding the performance of memory address acquisition using low-level binary instrumentation tracing. We present three techniques and optimizations to improve performance: 1) SimPoint-guided sampling, 2) instrumentation tool routine optimization, and 3) reduction of instrumentation points through static application analysis. The use of these three techniques together reduces instrumented application slowdown by an order of magnitude. The techniques are generally applicable and have been deployed in the MetaSim tracer, thereby enabling memory address acquisition for real-sized applications. We expect the optimizations reported here to reduce the HPCMO effort by approximately 80% in FY06.

Citations: 29

Accurate statistical approaches for generating representative workload compositions

IEEE International. 2005 Proceedings of the IEEE Workload Characterization Symposium, 2005. Pub Date : 2005-11-07 DOI: 10.1109/IISWC.2005.1526001

L. Eeckhout, R. Sundareswara, J. Yi, D. Lilja, Paul Schrater

Abstract: Composing a representative workload is a crucial step during the design process of a microprocessor. The workload should be composed in such a way that it is representative for the target domain of application and yet, the amount of redundancy in the workload should be minimized as much as possible in order not to overly increase the total simulation time. As a result, there is an important trade-off that needs to be made between workload representativeness and simulation accuracy versus simulation speed. Previous work used statistical data analysis techniques to identify representative benchmarks and corresponding inputs, also called a subset, from a large set of potential benchmarks and inputs. These methodologies measure a number of program characteristics on which principal components analysis (PCA) is applied before identifying distinct program behaviors among the benchmarks using cluster analysis. In this paper we propose independent components analysis (ICA) as a better alternative to PCA as it does not assume that the original data set has a Gaussian distribution, which allows ICA to better find the important axes in the workload space. Our experimental results using SPEC CPU2000 benchmarks show that ICA significantly outperforms PCA in that ICA achieves smaller benchmark subsets that are more accurate than those found by PCA.

Citations: 23

BioPerf: a benchmark suite to evaluate high-performance computer architecture on bioinformatics applications

IEEE International. 2005 Proceedings of the IEEE Workload Characterization Symposium, 2005. Pub Date : 2005-11-07 DOI: 10.1109/IISWC.2005.1526013

David A. Bader, Yue Li, Tao Li, Vipin Sachdeva

Abstract: The exponential growth in the amount of genomic data has spurred growing interest in large scale analysis of genetic information. Bioinformatics applications, which explore computational methods to allow researchers to sift through the massive biological data and extract useful information, are becoming increasingly important computer workloads. This paper presents BioPerf, a benchmark suite of representative bioinformatics applications to facilitate the design and evaluation of high-performance computer architectures for these emerging workloads. Currently, the BioPerf suite contains codes from 10 highly popular bioinformatics packages and covers the major fields of study in computational biology such as sequence comparison, phylogenetic reconstruction, protein structure prediction, and sequence homology & gene finding. We demonstrate the use of BioPerf by providing simulation points of pre-compiled Alpha binaries and with a performance study on IBM Power using IBM Mambo simulations cross-compared with Apple G5 executions. The BioPerf suite (available from www.bioperf.org) includes benchmark source code, input datasets of various sizes, and information for compiling and using the benchmarks. Our benchmark suite includes parallel codes where available.

Citations: 102

Workload characterization of biometric applications on Pentium 4 microarchitecture

IEEE International. 2005 Proceedings of the IEEE Workload Characterization Symposium, 2005. Pub Date : 2005-11-07 DOI: 10.1109/IISWC.2005.1526003

Chang-Burm Cho, A. Chande, Yue Li, Tao Li

Abstract: Biometric computing is a technique that uses physiological and behavioral characteristics of persons to identify and authenticate individuals. Due to the increasing demand on security, privacy and anti-terrorism, biometric applications represent rapidly growing computing workloads. However, very few results on the execution characteristics of these applications on state-of-the-art microprocessor and memory systems have been published so far. This paper proposes a suite of biometric applications and reports the results of a biometric workload characterization effort, focusing on various architecture features. To understand the impacts and implications of biometric workloads on the processor and memory architecture design, we contrast the characteristics of biometric workloads and the widely used SPEC 2000 integer benchmarks. Our experiments show that biometric applications typically show a small instruction footprint that can fit in the L1 instruction cache. The loads and stores account for more than 50% of the dynamic instructions. This indicates that biometric applications are data-centric in nature. Although biometric applications work across large-scale datasets to identify matched patterns, the active working sets of these workloads are usually small. As a result, prefetching and a large L2 cache effectively handle the data footprints of a majority of the studied benchmarks. Branch misprediction rate is less than 4% on all studied workloads. The IPC of the studied benchmarks ranges from 0.13 to 0.77, indicating that out-of-order superscalar execution is not quite efficient. The developed biometric benchmark suite (BMW) and input data sets are freely available and can be downloaded from http://www.ideal.ece.ufl.edu/BMW.

Citations: 10