Jeremy Lau, Erez Perelman, Greg Hamerly, T. Sherwood, B. Calder
{"title":"Motivation for Variable Length Intervals and Hierarchical Phase Behavior","authors":"Jeremy Lau, Erez Perelman, Greg Hamerly, T. Sherwood, B. Calder","doi":"10.1109/ISPASS.2005.1430568","DOIUrl":"https://doi.org/10.1109/ISPASS.2005.1430568","url":null,"abstract":"Most programs are repetitive, where similar behavior can be seen at different execution times. Proposed algorithms automatically group similar portions of a program's execution into phases, where the intervals in each phase have homogeneous behavior and similar resource requirements. These prior techniques focus on fixed length intervals (such as a hundred million instructions) to find phase behavior. Fixed length intervals can make a program's periodic phase behavior difficult to find, because the fixed interval length can be out of sync with the period of the program's actual phase behavior. In addition, a fixed interval length can only express one level of phase behavior. In this paper, we graphically show that there exists a hierarchy of phase behavior in programs and motivate the need for variable length intervals. We describe the changes applied to SimPoint to support variable length intervals. We finally conclude by providing an initial study into using variable length intervals to guide SimPoint","PeriodicalId":230669,"journal":{"name":"IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.","volume":"88 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121211676","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A High Performance, Energy Efficient GALS ProcessorMicroarchitecture with Reduced Implementation Complexity","authors":"Yongkang Zhu, D. Albonesi, A. Buyuktosunoglu","doi":"10.1109/ISPASS.2005.1430558","DOIUrl":"https://doi.org/10.1109/ISPASS.2005.1430558","url":null,"abstract":"As the costs and challenges of global clock distribution grow with each new microprocessor generation, a globally asynchronous, locally synchronous (GALS) approach becomes an attractive alternative. One proposed GALS approach, called a multiple clock domain (MCD) processor, achieves impressive energy savings for a relatively low performance cost. However, the approach requires separating the processor into four domains, including separating the integer and memory domains which complicates load scheduling, and the implementation of 32 voltage and frequency levels in each domain. In addition, the hardware-based control algorithm, though effective overall, produces a significant performance degradation for some applications. In this paper, we devise modifications to the MCD design that retain many of its benefits while greatly reducing the implementation complexity. We first determine that the synchronization channels that are most responsible for the MCD performance degradation are those involving cache access, and propose merging the integer and memory domains to virtually eliminate this overhead. We further propose significantly reducing the number of voltage levels, separating the reorder buffer into its own domain to permit front-end frequency scaling, separating the L2 cache to permit standard power optimizations to be used, and a new online algorithm that provides consistent results across our benchmark suite. The overall result is a significant reduction in the performance degradation of the original MCD approach and greater energy savings, with a greatly simplified microarchitecture that is much easier to implement","PeriodicalId":230669,"journal":{"name":"IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115230965","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Aashish Phansalkar, A. Joshi, L. Eeckhout, L. John
{"title":"Measuring Program Similarity: Experiments with SPEC CPU Benchmark Suites","authors":"Aashish Phansalkar, A. Joshi, L. Eeckhout, L. John","doi":"10.1109/ISPASS.2005.1430555","DOIUrl":"https://doi.org/10.1109/ISPASS.2005.1430555","url":null,"abstract":"It is essential that a subset of benchmark programs used to evaluate an architectural enhancement, is well distributed within the target workload space rather than clustered in specific areas. Past efforts for identifying subsets have primarily relied on using microarchitecture-dependent metrics of program performance, such as cycles per instruction and cache miss-rate. The shortcoming of this technique is that the results could be biased by the idiosyncrasies of the chosen configurations. The objective of this paper is to present a methodology to measure similarity of programs based on their inherent microarchitecture-independent characteristics which will make the results applicable to any microarchitecture. We apply our methodology to the SPEC CPU2000 benchmark suite and demonstrate that a subset of 8 programs can be used to effectively represent the entire suite. We validate the usefulness of this subset by using it to estimate the average IPC and L1 data cache miss-rate of the entire suite. The average IPC of 8-way and 16-way issue superscalar processor configurations could be estimated with 3.9% and 4.4% error respectively. This methodology is applicable not only to find subsets from a benchmark suite, but also to identify programs for a benchmark suite from a list of potential candidates. Studying the four generations of SPEC CPU benchmark suites, we find that other than a dramatic increase in the dynamic instruction count and increasingly poor temporal data locality, the inherent program characteristics have more or less remained the same","PeriodicalId":230669,"journal":{"name":"IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122693928","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating Multiprocessor Simulation with a Memory Timestamp Record","authors":"K. Barr, Heidi Pan, Michael Zhang, K. Asanović","doi":"10.1109/ISPASS.2005.1430560","DOIUrl":"https://doi.org/10.1109/ISPASS.2005.1430560","url":null,"abstract":"We introduce a fast and accurate technique for initializing the directory and cache state of a multiprocessor system based on a novel software structure called the memory timestamp record (MTR). The MTR is a versatile, compressed snapshot of memory reference patterns which can be rapidly updated during fast-forwarded simulation, or stored as part of a checkpoint. We evaluate MTR using a full-system simulation of a directory-based cache-coherent multiprocessor running a range of multithreaded workloads. Both MTR and a multiprocessor version of functional fast-forwarding (FFW) make similar performance estimates, usually within 15% of our detailed model. In addition to other benefits, we show that MTR has up to a 1.45x speedup over FFW, and a 7.7x speedup over our detailed baseline","PeriodicalId":230669,"journal":{"name":"IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123611723","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dataflow: A Complement to Superscalar","authors":"M. Budiu, Pedro V. Artigas, S. Goldstein","doi":"10.1109/ISPASS.2005.1430572","DOIUrl":"https://doi.org/10.1109/ISPASS.2005.1430572","url":null,"abstract":"There has been a resurgence of interest in dataflow architectures, because of their potential for exploiting parallelism with low overhead. In this paper we analyze the performance of a class of static dataflow machines on integer media and control-intensive programs and we explain why a dataflow machine, even with unlimited resources, does not always outperform a superscalar processor on general-purpose codes, under the assumption that both machines take the same time to execute basic operations. We compare a program-specific dataflow machine with unlimited parallelism to a superscalar processor running the same program. While the dataflow machines provide very good performance on most data-parallel programs, we show that the dataflow machine cannot always take advantage of the available parallelism. Using the dynamic critical path we investigate the mechanisms used by superscalar processors to provide a performance advantage and their impact on a dataflow model","PeriodicalId":230669,"journal":{"name":"IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115266577","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Anatomy and Performance of SSL Processing","authors":"Li Zhao, R. Iyer, S. Makineni, L. Bhuyan","doi":"10.1109/ISPASS.2005.1430574","DOIUrl":"https://doi.org/10.1109/ISPASS.2005.1430574","url":null,"abstract":"A wide spectrum of e-commerce (B2B/B2C), banking, financial trading and other business applications require the exchange of data to be highly secure. The Secure Sockets Layer (SSL) protocol provides the essential ingredients of secure communications - privacy, integrity and authentication. Though it is well-understood that security always comes at the cost of performance, these costs depend on the cryptographic algorithms. In this paper, we present a detailed description of the anatomy of a secure session. We analyze the time spent on the various cryptographic operations (symmetric, asymmetric and hashing) during the session negotiation and data transfer. We then analyze the most frequently used cryptographic algorithms (RSA, AES, DES, 3DES, RC4, MD5 and SHA-1). We determine the key components of these algorithms (setting up key schedules, encryption rounds, substitutions, permutations, etc) and determine where most of the time is spent. We also provide an architectural analysis of these algorithms, show the frequently executed instructions and discuss the ISA/hardware support that may be beneficial to improving SSL performance. We believe that the performance data presented in this paper is useful to performance analysts and processor architects to help accelerate SSL performance in future processors","PeriodicalId":230669,"journal":{"name":"IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115729967","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jeremy Lau, J. Sampson, Erez Perelman, Greg Hamerly, B. Calder
{"title":"The Strong correlation Between Code Signatures and Performance","authors":"Jeremy Lau, J. Sampson, Erez Perelman, Greg Hamerly, B. Calder","doi":"10.1109/ISPASS.2005.1430578","DOIUrl":"https://doi.org/10.1109/ISPASS.2005.1430578","url":null,"abstract":"A recent study examined the use of sampled hardware counters to create sampled code signatures. This approach is attractive because sampled code signatures can be quickly gathered for any application. The conclusion of their study was that there exists a fuzzy correlation between sampled code signatures and performance predictability. The paper raises the question of how much information is lost in the sampling process, and our paper focuses on examining this issue. We first focus on showing that there exists a strong correlation between code signatures and performance. We then examine the relationship between sampled and full code signatures, and how these affect performance predictability. Our results confirm that there is a fuzzy correlation found in recent work for the SPEC programs with sampled code signatures, but that a strong correlation exists with full code signatures. In addition, we propose converting the sampled instruction counts, used in the prior work, into sampled code signatures representing loop and procedure execution frequencies. These sampled loop and procedure code signatures allow phase analysis to more accurately and easily find patterns, and they correlate better with performance","PeriodicalId":230669,"journal":{"name":"IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126342115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
K. Albayraktaroglu, A. Jaleel, Xue Wu, Manoj Franklin, Bruce Jacob, C. Tseng, Donald Yeung
{"title":"BioBench: A Benchmark Suite of Bioinformatics Applications","authors":"K. Albayraktaroglu, A. Jaleel, Xue Wu, Manoj Franklin, Bruce Jacob, C. Tseng, Donald Yeung","doi":"10.1109/ISPASS.2005.1430554","DOIUrl":"https://doi.org/10.1109/ISPASS.2005.1430554","url":null,"abstract":"Recent advances in bioinformatics and the significant increase in computational power available to researchers have made it possible to make better use of the vast amounts of genetic data that has been collected over the last two decades. As the uses of genetic data expand to include drug discovery and development of gene-based therapies, bioinformatics is destined to take its place in the forefront of scientific computing application domains. Despite the clear importance of this field, common bioinformatics applications and their implication on microarchitectural design have received scant attention from the computer architecture community so far. The availability of a common set of bioinformatics benchmarks could be the first step to motivate further research in this crucial area. To this end, this paper presents BioBench, a benchmark suite that represents a diverse set of bioinformatics applications. The first version of BioBench includes applications from different application domains, with a particular emphasis on mature genomics applications. The applications in the benchmark are described briefly, and basic execution characteristics obtained on a real processor are presented. Compared to SPEC INT and SPEC FP benchmarks, applications in BioBench display a higher percentage of load/store instructions, almost negligible floating-point operation content, and higher IPC than either SPEC INT or SPEC FP applications. Our evaluation suggests that bioinformatics applications have distinctly different characteristics from the applications in both of the mentioned SPEC suites; and our findings indicate that bioinformatics workloads can benefit from architectural improvements to memory bandwidth and techniques that exploit their high levels of ILP. The entire BioBench suite and accompanying reference data will be made freely available to researchers","PeriodicalId":230669,"journal":{"name":"IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131367897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pro-active Page Replacement for Scientific Applications: A Characterization","authors":"M. Vilayannur, A. Sivasubramaniam, M. Kandemir","doi":"10.1109/ISPASS.2005.1430579","DOIUrl":"https://doi.org/10.1109/ISPASS.2005.1430579","url":null,"abstract":"Paging policies implemented by today's operating systems cause scientific applications to exhibit poor performance, when the application's working set does not fit in main memory. This has been typically attributed to the sub-optimal performance of LRU-like virtual-memory replacement algorithms. On one end of the spectrum, researchers in the past have proposed fully automated compiler-based techniques that provide crucial information on future access patterns (reuse-distances, release hints etc) of an application that can be exploited by the operating system to make intelligent prefetching and replacement decisions. Static techniques like the aforementioned can be quite accurate, but require that the source code be available and analyzable. At the other end of the spectrum, researchers have also proposed pure system-level algorithmic innovations to improve the performance of LRU-like algorithms, some of which are only interesting from the theoretical sense and may not really be implementable. Instead, in this paper we explore the possibility of tracking application's runtime behavior in the operating system, and find that there are several useful characteristics in the virtual memory behavior that can be anticipated and used to pro-actively manage physical memory usage. Specifically, we show that LRU-like replacement algorithms hold onto pages long after they outlive their usefulness and propose a new replacement algorithm that exploits the predictability of the application's page-fault patterns to reduce the number of page-faults. Our results demonstrate that such techniques can reduce page-faults by as much as 78% over both LRU and EELRU that is considered to be one of the state-of-the-art algorithms towards addressing the performance shortcomings of LRU. Further, we also present an implementable replacement algorithm within the operating system, that performs considerably better than the Linux kernel's replacement algorithm","PeriodicalId":230669,"journal":{"name":"IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130271902","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Trace-Driven Simulator For Palm OS Devices","authors":"Hyrum D. Carroll, J. Flanagan, Satish Baniya","doi":"10.1109/ISPASS.2005.1430570","DOIUrl":"https://doi.org/10.1109/ISPASS.2005.1430570","url":null,"abstract":"Due to the high cost of producing hardware prototypes, software simulators are typically used to determine the performance of proposed systems. To accurately represent a system with a simulator, the simulator inputs need to be representative of actual system usage. Trace-driven simulators that use logs of actual usage are generally preferred by researchers and developers to other types of simulators to determine expected performance. In this paper we explain the design and results of a trace-driven simulator for Palm OS devices capable of starting in a specified state and replaying a log of inputs originally generated on a handheld. We collect the user inputs with an acceptable amount of overhead while a device is executing real applications in normal operating environments. We based our simulator on the deterministic state machine model. The model specifies that two equivalent systems that start in the same state and have the same inputs applied, follow the same execution paths. By replaying the collected inputs we are able to collect traces and performance statistics from the simulator that are representative of actual usage with minimal perturbation. Our simulator can be used to evaluate various hardware modifications to Palm OS devices such as adding a cache. At the end of this paper we present an in-depth case study analyzing the expected memory performance from adding a cache to a Palm m515 device","PeriodicalId":230669,"journal":{"name":"IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133334652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}