T. J. Dysart, Branden J. Moore, Lambert Schaelicke, P. Kogge
{"title":"Cache implications of aggressively pipelined high performance microprocessors","authors":"T. J. Dysart, Branden J. Moore, Lambert Schaelicke, P. Kogge","doi":"10.1109/ISPASS.2004.1291364","DOIUrl":"https://doi.org/10.1109/ISPASS.2004.1291364","url":null,"abstract":"One of the major design decisions when developing a new microprocessor is determining the target pipeline depth and clock rate since both factors interact closely with one another. The optimal pipeline depth of a processor has been studied before, but the impact of the memory system on pipeline performance has received less attention. This study analyzes the effect of different level-1 cache designs across a range of pipeline depths to determine what role the memory system design plays in choosing a clock rate and pipeline depth for a microprocessor. The pipeline depths studied here range from those found in current processors to those predicted for future processors. For each pipeline depth a variety of level-1 cache sizes are simulated to explore the relationship between clock rate, pipeline depth, cache size and access latency. Results show that the larger caches afforded by shorter pipelines with slower clocks outperform longer pipelines with smaller caches and higher clock rates.","PeriodicalId":188291,"journal":{"name":"IEEE International Symposium on - ISPASS Performance Analysis of Systems and Software, 2004","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116020886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A co-phase matrix to guide simultaneous multithreading simulation","authors":"Michael Van Biesbrouck, T. Sherwood, B. Calder","doi":"10.1109/ISPASS.2004.1291355","DOIUrl":"https://doi.org/10.1109/ISPASS.2004.1291355","url":null,"abstract":"Several commercial processors have architectures that include support for simultaneous multithreading (SMT), yet there is still not a validated methodology for estimating the performance of an SMT machine that does not rely on full program simulation. To create an efficient sampling approach for SMT we must determine how far to fast-forward each individual thread between samples. The fast-forwarding distance for each thread will vary according to execution phases, thread interactions and changes to the architectural configuration. We examine using individual program phase information to guide SMT simulation. This is accomplished by creating what we call a co-phase matrix. The co-phase matrix is populated by collecting samples of the programs' phase combinations, and is used to guide fast-forwarding between samples. We show for 28 pairs of SPEC programs that using the co-phase matrix provides an average error rate of 4% while requiring that only 1% of the full simulation be performed. The methods are also validated using a variety of architectural configurations and four-threaded workloads.","PeriodicalId":188291,"journal":{"name":"IEEE International Symposium on - ISPASS Performance Analysis of Systems and Software, 2004","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124505966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spectral analysis for characterizing program power and performance","authors":"R. Joseph, M. Martonosi, Zhigang Hu","doi":"10.1109/ISPASS.2004.1291367","DOIUrl":"https://doi.org/10.1109/ISPASS.2004.1291367","url":null,"abstract":"Performance and power analysis in modern processors requires managing a large amount of complex information across many time-scales. For example, thermal control issues are a power subproblem with relevant time constants of millions of cycles or more, while the so-called dI/dT problem is also a power subproblem but occurs because of current variability on a much finer granularity: tens to hundreds of cycles. Likewise, for performance issues, program phase analysis for selecting simulation regions requires looking for periodicity on the order of millions of cycles or more, while some aspects of cache performance optimization require understanding repetitive patterns on much finer granularities. Fourier analysis allows one to transform a waveform into a sum of component (usually sinusoidal) waveforms in the frequency domain; in this way, the waveform's fundamental frequencies (periodicities of repetition) can be clearly identified. This paper shows how one can use Fourier analysis to produce frequency spectra for some of the time waveforms seen in processor execution. By working in the frequency domain, one can easily identify key application tendencies. For example, we demonstrate how to use spectral analysis to characterize the power behavior of real programs. As we show, this is useful for understanding both the temperature profile of a program and its voltage stability. These are particularly relevant issues for architects since thermal concerns and the dI/dT problem have significant influence on processor design. Frequency analysis can also be used to examine program performance. In particular, it can identify periodic occurrences of important microarchitectural events like cache misses. Overall, the paper demonstrates the value of using frequency analysis in different research efforts related to characterizing and optimizing application performance and power.","PeriodicalId":188291,"journal":{"name":"IEEE International Symposium on - ISPASS Performance Analysis of Systems and Software, 2004","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125735218","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"StatCache: a probabilistic approach to efficient and accurate data locality analysis","authors":"Erik Berg, Erik Hagersten","doi":"10.1109/ISPASS.2004.1291352","DOIUrl":"https://doi.org/10.1109/ISPASS.2004.1291352","url":null,"abstract":"The widening memory gap reduces performance of applications with poor data locality. Therefore, there is a need for methods to analyze data locality and help application optimization. In this paper we present StatCache, a novel sampling-based method for performing data-locality analysis on realistic workloads. StatCache is based on a probabilistic model of the cache, rather than a functional cache simulator. It uses statistics from a single run to accurately estimate miss ratios of fully-associative caches of arbitrary sizes and generate working-set graphs. We evaluate StatCache using the SPEC CPU2000 benchmarks and show that StatCache gives accurate results with a sampling rate as low as 10^-4. We also provide a proof-of-concept implementation, and discuss potentially very fast implementation alternatives.","PeriodicalId":188291,"journal":{"name":"IEEE International Symposium on - ISPASS Performance Analysis of Systems and Software, 2004","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129990447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient architectural design of high performance microprocessors","authors":"L. Eeckhout, K. D. Bosschere","doi":"10.1016/S0065-2458(03)61002-8","DOIUrl":"https://doi.org/10.1016/S0065-2458(03)61002-8","url":null,"abstract":"","PeriodicalId":188291,"journal":{"name":"IEEE International Symposium on - ISPASS Performance Analysis of Systems and Software, 2004","volume":"43 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130169501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deconstructing commit","authors":"Gordon B. Bell, Mikko H. Lipasti","doi":"10.1109/ISPASS.2004.1291357","DOIUrl":"https://doi.org/10.1109/ISPASS.2004.1291357","url":null,"abstract":"Many modern processors execute instructions out of their original program order to exploit instruction-level parallelism and achieve higher performance. However, even though instructions can execute in an arbitrary order, they must eventually commit, or retire from execution, in program order. This constraint provides a safety mechanism to ensure that mis-speculated instructions are not inadvertently committed, but can consume valuable processor resources and severely limit the degree of parallelism exposed in a program. We assert that such a constraint is overly conservative, and propose conditions under which it can be relaxed. This paper deconstructs the notion of commit in an out-of-order processor, and examines the set of necessary conditions under which instructions can be permitted to retire out of program order. It provides a detailed analysis of the frequency and relative importance of these conditions, and discusses microarchitectural modifications that relax the in-order commit requirement. Overall, we found that for a given set of processor resources our technique achieves speedups of up to 68% and 8% for floating point and integer benchmarks, respectively. Conversely, because out-of-order commit allows more efficient utilization of cycle-time limiting resources, it can alternatively enable simpler designs with potentially higher clock frequencies.","PeriodicalId":188291,"journal":{"name":"IEEE International Symposium on - ISPASS Performance Analysis of Systems and Software, 2004","volume":"97 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134336052","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic pretenuring schemes for generational garbage collection","authors":"Wei Huang, W. Srisa-an, J. M. Chang","doi":"10.1109/ISPASS.2004.1291365","DOIUrl":"https://doi.org/10.1109/ISPASS.2004.1291365","url":null,"abstract":"Previous research efforts have shown that pretenuring can potentially reduce the copying cost by allocating long-lived objects directly in the mature memory regions. To date, researchers have often employed profiling and static analysis to accurately select the objects that should be pretenured. However, little research effort has been spent on dynamic approaches for pretenuring objects. In this paper, we propose a novel approach that dynamically predicts object lifespan to assist with pretenuring selection. The proposed scheme performs dynamic pretenuring selection based on a feedback mechanism that records the lifespan of objects from each class during garbage collection invocations. This information is then used to pretenure objects in subsequent allocation requests. We experiment with two approaches, jumpstart feedback and continuous feedback, to collect tenuring information. The experimental results of selected benchmark programs show that our schemes can improve the garbage collection time of IBM's Jikes RVM by up to 37%, and improve the overall execution time by up to 28%.","PeriodicalId":188291,"journal":{"name":"IEEE International Symposium on - ISPASS Performance Analysis of Systems and Software, 2004","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133306946","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Communication breakdown: analyzing CPU usage in commercial Web workloads","authors":"Jaidev P. Patwardhan, A. Lebeck, Daniel J. Sorin","doi":"10.1109/ISPASS.2004.1291351","DOIUrl":"https://doi.org/10.1109/ISPASS.2004.1291351","url":null,"abstract":"There is increasing concern among developers that future Web servers running commercial workloads may be limited by network processing overhead in the CPU as 10Gb Ethernet becomes prevalent. We analyze CPU usage of real hardware running popular commercial workloads, with an emphasis on identifying networking overhead. Contrary to much popular belief, our experiments show that network processing is unlikely to be a problem for workloads that perform significant data processing. For the dynamic Web serving workloads we examine, networking overhead is negligible (3% or less), and data processing limits performance. However, for Web servers that serve static content, network processing can significantly impact performance (up to 25% of CPU cycles). With an analytical model, we calculate the maximum possible improvement in throughput due to protocol offload to be 50% for the static Web workloads.","PeriodicalId":188291,"journal":{"name":"IEEE International Symposium on - ISPASS Performance Analysis of Systems and Software, 2004","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125046296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Eccentric and fragile benchmarks","authors":"H. Vandierendonck, K. D. Bosschere","doi":"10.1109/ISPASS.2004.1291350","DOIUrl":"https://doi.org/10.1109/ISPASS.2004.1291350","url":null,"abstract":"Benchmarks are essential for computer architecture research and performance evaluation. Constructing a good benchmark suite is, however, non-trivial: it must be representative, show different types of behavior and the benchmarks should not be easily tweaked. This paper uses principal components analysis, a statistical data analysis technique, to detect differences in behavior between benchmarks. Two specific types of benchmarks are identified. Eccentric benchmarks have a behavior that differs significantly from the other benchmarks. They are useful to incorporate different types of behavior in a suite. Fragile benchmarks are weak benchmarks: their execution time is determined almost entirely by a single bottleneck. Removing that bottleneck reduces their execution time excessively. This paper argues that fragile benchmarks are not useful and shows how they can be detected by means of workload characterization techniques. These techniques are applied to the SPEC CPU95 and CPU2000 benchmark suites. It is shown that these suites contain both eccentric and fragile benchmarks. The notions of eccentric and fragile benchmarks are important when composing a benchmark suite and to guide the sub-setting of a benchmark suite.","PeriodicalId":188291,"journal":{"name":"IEEE International Symposium on - ISPASS Performance Analysis of Systems and Software, 2004","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115999654","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using cache mapping to improve memory performance of handheld devices","authors":"Rong-Chang Xu, Zhiyuan Li","doi":"10.1109/ISPASS.2004.1291362","DOIUrl":"https://doi.org/10.1109/ISPASS.2004.1291362","url":null,"abstract":"Processors such as the Intel StrongARM SA-1110 and the Intel XScale provide flexible control over the cache management to achieve better cache utilization. Programs can specify the cache mapping policy for each virtual page, i.e. mapping it to the main cache, the mini-cache, or neither. For the latter case, the page is marked as non-cacheable. In this paper, we use memory profiling to guide such page-based cache mapping. We model the cache mapping problem and prove that finding the optimal cache mapping is NP-hard. We then present a heuristic to select the mapping. Execution time measurement shows that our heuristic can improve the performance by 1% to 21% for a set of test programs. As a byproduct of performance enhancement, we also reduce energy consumption by 4% to 28%.","PeriodicalId":188291,"journal":{"name":"IEEE International Symposium on - ISPASS Performance Analysis of Systems and Software, 2004","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121320508","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}