{"title":"Performance evaluation of exclusive cache hierarchies","authors":"Ying Zheng, B. Davis, M. Jordan","doi":"10.1109/ISPASS.2004.1291359","DOIUrl":"https://doi.org/10.1109/ISPASS.2004.1291359","url":null,"abstract":"Memory hierarchy performance, specifically cache memory capacity, is a constraining factor in the performance of modern computers. This paper presents the results of two-level cache memory simulations and examines the impact of exclusive caching on system performance. Exclusive caching enables higher capacity with the same cache area by eliminating redundant copies. The experiments presented compare an exclusive cache hierarchy with an inclusive cache hierarchy utilizing similar L1 and L2 parameters. Experiments indicate that significant performance advantages can be gained for some benchmark through the use of an exclusive organization. The performance differences are illustrated using the L2 cache misses and execution time metrics. The most significant improvement shown is a 16% reduction in execution time, with an average reduction of 8% for the smallest cache configuration tested. With equal size victim buffer and victim cache for exclusive and inclusive cache hierarchies respectively, some benchmarks show increased execution time for exclusive caches because a victim cache can reduce conflict misses significantly while a victim buffer can introduce worst-case penalties. Considering the inconsistent performance improvement, the increased complexity of an exclusive cache hierarchy needs to be justified based upon the specifics of the application and system.","PeriodicalId":188291,"journal":{"name":"IEEE International Symposium on - ISPASS Performance Analysis of Systems and Software, 2004","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116059865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sockets Direct Protocol over InfiniBand in clusters: is it beneficial?","authors":"P. Balaji, S. Narravula, K. Vaidyanathan, S. Krishnamoorthy, Jiesheng Wu, D. Panda","doi":"10.1109/ISPASS.2004.1291353","DOIUrl":"https://doi.org/10.1109/ISPASS.2004.1291353","url":null,"abstract":"The Sockets Direct Protocol (SDP) had been proposed recently in order to enable sockets based applications to take advantage of the enhanced features provided by InfiniBand architecture. In this paper, we study the benefits and limitations of an implementation of SDP. We first analyze the performance of SDP based on a detailed suite of micro-benchmarks. Next, we evaluate it on two different real application domains: (1) A multitier data-center environment and (2) A Parallel Virtual File System (PVFS). Our micro-benchmark results show that SDP is able to provide up to 2.7 times better bandwidth as compared to the native sockets implementation over InfiniBand (IPoIB) and significantly better latency for large message sizes. Our experimental results also show that SDP is able to achieve a considerably higher performance (improvement of up to 2.4 times) as compared to IPoIB in the PVFS environment. In the data-center environment, SDP outperforms IPoIB for large file transfers inspite of currently being limited by a high connection setup time. However, this limitation is entirely implementation specific and as the InfiniBand software and hardware products are rapidly maturing, we expect this limitation to be overcome soon. Based on this, we have shown that the projected performance for SDP, without the connection setup time, can outperform IPoIB for small message transfers as well.","PeriodicalId":188291,"journal":{"name":"IEEE International Symposium on - ISPASS Performance Analysis of Systems and Software, 2004","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115475915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Effectiveness of simple memory models for performance prediction","authors":"I. Tuduce, T. Gross","doi":"10.1109/ISPASS.2004.1291361","DOIUrl":"https://doi.org/10.1109/ISPASS.2004.1291361","url":null,"abstract":"Many situations call for an estimation of the execution time of applications, e.g., during design or evaluation of computer systems. In this paper we focus on large applications where the execution times heavily depend on the performance of the memory system. Since such applications are computationally expensive, direct simulation is not an option and an analytical model is called for. This paper addresses this problem by developing and evaluating two simple analytical models. These models focus on an application's interaction with the memory system. Applications are characterized by their memory access types. A regular application has continuous and stride memory accesses. An irregular application has three memory access types: continuous accesses, accesses within the same L1/L2 cache line, and random accesses. The analytical models are combined with results from micro-benchmarking or with appropriate performance estimates of memory accesses to predict application performance, either on real or future machines. We apply these models to executions of CHARMM (Chemistry at HARvard Molecular Mechanics) - a scientific application written in FORTRAN, SMV (Symbolic Model Verifier) - coded in C++. For all three applications, the approaches described here produce results with 5% accuracy on average (compared to the effective run-time measured on a real SPARC system).","PeriodicalId":188291,"journal":{"name":"IEEE International Symposium on - ISPASS Performance Analysis of Systems and Software, 2004","volume":"160 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132564306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamically reducing pressure on the physical register file through simple register sharing","authors":"Liem Tran, Nicholas Nelson, Fung Ngai, S. Dropsho, Michael C. Huang","doi":"10.1109/ISPASS.2004.1291358","DOIUrl":"https://doi.org/10.1109/ISPASS.2004.1291358","url":null,"abstract":"Using register renaming and physical registers, modern microprocessors eliminate false data dependences from reuse of the instruction set defined registers (logical registers). High performance processors that have longer pipelines and a greater capacity to exploit instruction-level parallelism have more instructions in-flight and require more physical registers. Simultaneous multithreading architectures further exacerbate this register pressure. This paper evaluates two register sharing techniques for reducing register usage. The first technique dynamically combines physical registers having the same value the second technique combines the demand of several instructions updating the same logical register and share physical register storage among them. While similar techniques have been proposed previously, an important contribution of this paper is to exploit only special cases that provide most of the benefits of more general solutions but at a very low hardware complexity. Despite the simplicity, our design reduces the required number of physical registers by more than 10% on some applications, and provides almost half of the total benefits of an aggressive (complex) scheme. More importantly, we show the simpler design to reduce register pressure has significant performance effects in a simultaneous multithreaded (SMT) architecture where register availability can be a bottleneck. Our results show an average of 25.6% performance improvement for an SMT architecture with 160 registers or, equivalently, similar performance as an SMT with 200 registers (25% more) but no register sharing.","PeriodicalId":188291,"journal":{"name":"IEEE International Symposium on - ISPASS Performance Analysis of Systems and Software, 2004","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131318522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Characterization of the data access behavior for TPC-C traces","authors":"R. Bonilla-Lucas, P. Plachta, Aamer Sachedina, Daniel Jiménez-González, C. Zuzarte, J. Larriba-Pey","doi":"10.1109/ISPASS.2004.1291363","DOIUrl":"https://doi.org/10.1109/ISPASS.2004.1291363","url":null,"abstract":"In this paper, we look into the characteristics of the reference stream of TPC-C workloads from the buffer pool point of view. We analyze a trace coming from DB2 UDB version 8.1 fix pack 4 and compare it to a trace from DB2 UDB version 8.1 GA. We perform three types of analysis. A static analysis of the number of reads and writes for index and data pages. We conclude that index pages receive less references than data pages by are more frequently accessed individually. Then, we analyze how DB2 processes access those pages. Index pages have more references than data pages when accessed by more than one process. Finally, we understand the accesses along the life of a page. We conclude that there is a significant burstiness in the reference stream, where, each burst is caused by one process.","PeriodicalId":188291,"journal":{"name":"IEEE International Symposium on - ISPASS Performance Analysis of Systems and Software, 2004","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131286379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The future of simulation: A field of dreams","authors":"B. Calder, D. Citron, Y. Patt, James E. Smith","doi":"10.1109/ISPASS.2004.1291369","DOIUrl":"https://doi.org/10.1109/ISPASS.2004.1291369","url":null,"abstract":"Quantitative evaluation of next-generation computer architectures and processor enhancements is possible only by running simulations. However, since the insights that are gained through simulation are predicated on the accuracy of the simulation results, and since the design decisions for future processor architectures -- which cost billions of dollars to design and implement -- are based on those insights, periodic examination of the simulation process becomes a necessity, rather than a luxury. Accordingly, this panel discusses the deficiencies of existing simulators, benchmarks, and simulation methodologies and techniques, and, in addition, what future directions are available for each.","PeriodicalId":188291,"journal":{"name":"IEEE International Symposium on - ISPASS Performance Analysis of Systems and Software, 2004","volume":"213 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116288349","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Architectures and compilers for multimedia","authors":"W. Wolf","doi":"10.1109/ISPASS.2004.1291371","DOIUrl":"https://doi.org/10.1109/ISPASS.2004.1291371","url":null,"abstract":"Summary form only given. The article covers architectures and compilers for multimedia systems. Multimedia applications impose real-time constraints on continuous media; they also include a surprisingly wide variety of algorithms. Many multimedia systems also operate under power/energy constraints. As such, multimedia computing systems are an important area of interest for ISPASS. This tutorial targets individuals with experience in hardware and software but who have limited expertise in multimedia. We start with an introduction to multimedia algorithms such as video and audio compression since the characteristics of these algorithms help to shape measurement strategies and architectural decisions. We then cover modern multimedia architectures and compilation techniques relevant to those architectures. We conclude with a case study drawn from our own research - the design of a multiprocessor system-on-chip for real-time gesture recognition.","PeriodicalId":188291,"journal":{"name":"IEEE International Symposium on - ISPASS Performance Analysis of Systems and Software, 2004","volume":"127 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129733220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Structures for phase classification","authors":"Jeremy Lau, Stefan Schoenmackers, B. Calder","doi":"10.1109/ISPASS.2004.1291356","DOIUrl":"https://doi.org/10.1109/ISPASS.2004.1291356","url":null,"abstract":"Most programs are repetitive, where similar behavior can be seen at different execution times. Proposed algorithms automatically group these similar intervals of execution into phases, where all he intervals in a phase have homogeneous behavior and similar resource requirements. In this paper we examine different program structures for capturing phase behavior. The goal is to compare the size and accuracy of these structures for performing phase classification. We focus on profiling the frequency of program level structures that are independent from underlying architecture performance metrics. This allows the phase classification to be used across different hardware designs that support the same instruction set (ISA). We compare using basic blocks, loop branches, procedures, opcodes, register usage, and memory address information for guiding phase classification. We compare these different structures in terms of their ability to create homogeneous phases, and evaluate the accuracy of using these structures to pick simulation points for SimPoint.","PeriodicalId":188291,"journal":{"name":"IEEE International Symposium on - ISPASS Performance Analysis of Systems and Software, 2004","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132790511","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The BlueGene/L pseudo cycle-accurate simulator","authors":"Leonardo R. Bachega, J. Brunheroto, L. D. Rose, Pedro Mindlin, J. Moreira","doi":"10.1109/ISPASS.2004.1291354","DOIUrl":"https://doi.org/10.1109/ISPASS.2004.1291354","url":null,"abstract":"The design and development of a new computer system is a lengthy process, with a considerable amount of time elapsed between the beginning of development and first hardware availability. Hence, fast and reasonably accurate simulation of processor architecture has become critical as an enabling mechanism for software engineers to develop and tune system software and applications. In this paper, we present the time-stamped timing model extensions to the BlueGene/L functional simulator. These extensions were implemented to create a pseudo cycle-accurate simulator capable of providing tracing capabilities for detection of bottlenecks and for performance tuning of applications, before the actual hardware became available. Our validation tests, using the DAXPY kernel and the serial version of the NAS benchmarks, show that our pseudo cycle-accurate simulator provides timing information within 15% of the times measured using the actual BlueGene/L hardware. In addition, we present a couple of case studies, which describes how this simulator can be used for identification of performance bottlenecks and for application tuning.","PeriodicalId":188291,"journal":{"name":"IEEE International Symposium on - ISPASS Performance Analysis of Systems and Software, 2004","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133838133","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Compiler-directed physical address generation for reducing dTLB power","authors":"I. Kadayif, Partho Nath, M. Kandemir, A. Sivasubramaniam","doi":"10.1109/ISPASS.2004.1291368","DOIUrl":"https://doi.org/10.1109/ISPASS.2004.1291368","url":null,"abstract":"Address translation using the Translation Lookaside Buffer (TLB) consumes as much as 16% of the chip power on some processors because of its high associativity and access frequency. While prior work has looked into optimizing this structure at the circuit and architectural levels, this paper takes a different approach of optimizing its power by reducing the number of data TLB (dTLB) lookups for data references. The main idea is to keep translations in a set of translation registers, and intelligently use them in software to directly generate the physical addresses without going through the dTLB. The software has to work within the confines of the translation registers provided by the hardware, and has to maximize the reuse of such translations to be effective. We propose strategies and code transformations for achieving this in array-based and pointer-based codes, looking to optimize data accesses. Results with a suite of Spec95 array-based and pointer-based codes show dTLB energy savings of up to 73% and 88%, respectively, compared to directly using the dTLB for all references. Despite the small increase in instructions executed with our mechanisms, the approach can in fact provide performance benefits in certain cases.","PeriodicalId":188291,"journal":{"name":"IEEE International Symposium on - ISPASS Performance Analysis of Systems and Software, 2004","volume":"397 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132167124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}