IEEE International Symposium on - ISPASS Performance Analysis of Systems and Software, 2004: Latest Publications

Performance evaluation of exclusive cache hierarchies
Ying Zheng, B. Davis, M. Jordan
Abstract: Memory hierarchy performance, specifically cache memory capacity, is a constraining factor in the performance of modern computers. This paper presents the results of two-level cache memory simulations and examines the impact of exclusive caching on system performance. Exclusive caching enables higher capacity with the same cache area by eliminating redundant copies. The experiments presented compare an exclusive cache hierarchy with an inclusive cache hierarchy utilizing similar L1 and L2 parameters. Experiments indicate that significant performance advantages can be gained for some benchmarks through the use of an exclusive organization. The performance differences are illustrated using the L2 cache misses and execution time metrics. The most significant improvement shown is a 16% reduction in execution time, with an average reduction of 8% for the smallest cache configuration tested. With an equal-size victim buffer and victim cache for the exclusive and inclusive cache hierarchies, respectively, some benchmarks show increased execution time for exclusive caches because a victim cache can reduce conflict misses significantly while a victim buffer can introduce worst-case penalties. Considering the inconsistent performance improvement, the increased complexity of an exclusive cache hierarchy needs to be justified based upon the specifics of the application and system.
DOI: https://doi.org/10.1109/ISPASS.2004.1291359
Published: 2004-03-10
Citations: 55
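The capacity argument in the abstract above (an exclusive hierarchy holds no duplicate lines, so its effective capacity is L1 plus L2) can be illustrated with a toy model. This is only a sketch under stated assumptions: both levels are fully associative with LRU replacement, sizes are counted in cache lines, and the names `LRU` and `l2_misses` are hypothetical. It is not the simulator or the configurations used in the paper, and it does not model victim buffers or timing.

```python
from collections import OrderedDict


class LRU:
    """Fully associative cache with LRU replacement (toy model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def hit(self, addr):
        """Return True and mark addr most recently used if it is present."""
        if addr in self.lines:
            self.lines.move_to_end(addr)
            return True
        return False

    def insert(self, addr):
        """Insert addr as most recently used; return the evicted line or None."""
        self.lines[addr] = True
        self.lines.move_to_end(addr)
        if len(self.lines) > self.capacity:
            victim, _ = self.lines.popitem(last=False)
            return victim
        return None

    def remove(self, addr):
        self.lines.pop(addr, None)


def l2_misses(trace, l1_lines, l2_lines, exclusive):
    """Count accesses that miss in both levels and must go to memory."""
    l1, l2 = LRU(l1_lines), LRU(l2_lines)
    misses = 0
    for addr in trace:
        if l1.hit(addr):
            continue
        if exclusive:
            if addr in l2.lines:
                l2.remove(addr)  # line moves up to L1; no duplicate is kept
            else:
                misses += 1
            victim = l1.insert(addr)
            if victim is not None:
                l2.insert(victim)  # L1 victim is stored in L2; L2's own victim is dropped
        else:
            if not l2.hit(addr):
                misses += 1
                evicted = l2.insert(addr)
                if evicted is not None:
                    l1.remove(evicted)  # back-invalidate to keep L1 a subset of L2
            l1.insert(addr)
    return misses


# Cyclic sweep over 6 lines: fits in L1 + L2 (2 + 4) but not in L2 alone.
trace = list(range(6)) * 10
exclusive_misses = l2_misses(trace, l1_lines=2, l2_lines=4, exclusive=True)
inclusive_misses = l2_misses(trace, l1_lines=2, l2_lines=4, exclusive=False)
```

With this deliberately adversarial trace the exclusive organization only takes the cold misses, while the inclusive one thrashes. Real workloads fall between these extremes, which is consistent with the abstract's remark that the benefit depends on the benchmark.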
Sockets Direct Protocol over InfiniBand in clusters: is it beneficial?
P. Balaji, S. Narravula, K. Vaidyanathan, S. Krishnamoorthy, Jiesheng Wu, D. Panda
Abstract: The Sockets Direct Protocol (SDP) was recently proposed to enable sockets-based applications to take advantage of the enhanced features provided by the InfiniBand architecture. In this paper, we study the benefits and limitations of an implementation of SDP. We first analyze the performance of SDP based on a detailed suite of micro-benchmarks. Next, we evaluate it on two different real application domains: (1) a multitier data-center environment and (2) a Parallel Virtual File System (PVFS). Our micro-benchmark results show that SDP is able to provide up to 2.7 times better bandwidth as compared to the native sockets implementation over InfiniBand (IPoIB) and significantly better latency for large message sizes. Our experimental results also show that SDP is able to achieve a considerably higher performance (improvement of up to 2.4 times) as compared to IPoIB in the PVFS environment. In the data-center environment, SDP outperforms IPoIB for large file transfers in spite of currently being limited by a high connection setup time. However, this limitation is entirely implementation specific and as the InfiniBand software and hardware products are rapidly maturing, we expect this limitation to be overcome soon. Based on this, we have shown that the projected performance for SDP, without the connection setup time, can outperform IPoIB for small message transfers as well.
DOI: https://doi.org/10.1109/ISPASS.2004.1291353
Published: 2004-03-10
Citations: 73
Effectiveness of simple memory models for performance prediction
I. Tuduce, T. Gross
Abstract: Many situations call for an estimation of the execution time of applications, e.g., during design or evaluation of computer systems. In this paper we focus on large applications where the execution times heavily depend on the performance of the memory system. Since such applications are computationally expensive, direct simulation is not an option and an analytical model is called for. This paper addresses this problem by developing and evaluating two simple analytical models. These models focus on an application's interaction with the memory system. Applications are characterized by their memory access types. A regular application has continuous and stride memory accesses. An irregular application has three memory access types: continuous accesses, accesses within the same L1/L2 cache line, and random accesses. The analytical models are combined with results from micro-benchmarking or with appropriate performance estimates of memory accesses to predict application performance, either on real or future machines. We apply these models to executions of CHARMM (Chemistry at HARvard Molecular Mechanics) - a scientific application written in FORTRAN, SMV (Symbolic Model Verifier) - coded in C++. For all three applications, the approaches described here produce results with 5% accuracy on average (compared to the effective run-time measured on a real SPARC system).
DOI: https://doi.org/10.1109/ISPASS.2004.1291361
Published: 2004-03-10
Citations: 7
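The idea in the abstract above, combining per-access-type counts with micro-benchmark costs, can be sketched as a weighted sum. The abstract does not give the paper's actual equations, so the formula, the function name `predict_time_ns`, the access-type keys, and all numbers below are assumptions chosen for illustration only.

```python
def predict_time_ns(access_counts, cost_ns, compute_ns=0.0):
    """Estimate run time as compute time plus count * cost for each access type.

    access_counts: mapping of access type -> number of accesses (hypothetical profile)
    cost_ns:       mapping of access type -> cost per access in nanoseconds,
                   e.g. measured with a micro-benchmark on the target machine
    compute_ns:    time not attributed to memory accesses
    """
    missing = set(access_counts) - set(cost_ns)
    if missing:
        raise KeyError(f"no cost estimate for access types: {sorted(missing)}")
    memory_ns = sum(count * cost_ns[kind] for kind, count in access_counts.items())
    return compute_ns + memory_ns


# Made-up profile of an irregular application, using the three access
# types named in the abstract.
counts = {"continuous": 1000, "same_line": 500, "random": 10}
costs = {"continuous": 2.0, "same_line": 1.0, "random": 100.0}
estimate = predict_time_ns(counts, costs, compute_ns=500.0)
```

Swapping in a different `costs` table is how such a model would be pointed at a different (or not yet built) machine without re-profiling the application.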
Dynamically reducing pressure on the physical register file through simple register sharing
Liem Tran, Nicholas Nelson, Fung Ngai, S. Dropsho, Michael C. Huang
Abstract: Using register renaming and physical registers, modern microprocessors eliminate false data dependences from reuse of the instruction set defined registers (logical registers). High performance processors that have longer pipelines and a greater capacity to exploit instruction-level parallelism have more instructions in-flight and require more physical registers. Simultaneous multithreading architectures further exacerbate this register pressure. This paper evaluates two register sharing techniques for reducing register usage. The first technique dynamically combines physical registers having the same value; the second technique combines the demand of several instructions updating the same logical register and shares physical register storage among them. While similar techniques have been proposed previously, an important contribution of this paper is to exploit only special cases that provide most of the benefits of more general solutions but at a very low hardware complexity. Despite the simplicity, our design reduces the required number of physical registers by more than 10% on some applications, and provides almost half of the total benefits of an aggressive (complex) scheme. More importantly, we show that the simpler design for reducing register pressure has significant performance effects in a simultaneous multithreaded (SMT) architecture where register availability can be a bottleneck. Our results show an average of 25.6% performance improvement for an SMT architecture with 160 registers or, equivalently, similar performance as an SMT with 200 registers (25% more) but no register sharing.
DOI: https://doi.org/10.1109/ISPASS.2004.1291358
Published: 2004-03-10
Citations: 41
Characterization of the data access behavior for TPC-C traces
R. Bonilla-Lucas, P. Plachta, Aamer Sachedina, Daniel Jiménez-González, C. Zuzarte, J. Larriba-Pey
Abstract: In this paper, we look into the characteristics of the reference stream of TPC-C workloads from the buffer pool point of view. We analyze a trace coming from DB2 UDB version 8.1 fix pack 4 and compare it to a trace from DB2 UDB version 8.1 GA. We perform three types of analysis. First, we perform a static analysis of the number of reads and writes for index and data pages. We conclude that index pages receive fewer references than data pages but are more frequently accessed individually. Then, we analyze how DB2 processes access those pages. Index pages have more references than data pages when accessed by more than one process. Finally, we study the accesses along the life of a page. We conclude that there is a significant burstiness in the reference stream, where each burst is caused by one process.
DOI: https://doi.org/10.1109/ISPASS.2004.1291363
Published: 2004-03-10
Citations: 5
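The burstiness observation in the abstract above can be made concrete with a small trace analysis. The abstract does not define a burst precisely, so this sketch assumes one: a maximal run of consecutive references to the same page that all come from one process. The trace format `(process_id, page_id)` and the function name `bursts_per_page` are likewise hypothetical and are not the DB2 trace format.

```python
from collections import defaultdict


def bursts_per_page(trace):
    """Group each page's references into runs made by a single process.

    trace: iterable of (process_id, page_id) pairs in reference order.
    Returns {page_id: [(process_id, run_length), ...]}.
    """
    runs = defaultdict(list)  # page -> list of [process, length]
    for proc, page in trace:
        page_runs = runs[page]
        if page_runs and page_runs[-1][0] == proc:
            page_runs[-1][1] += 1  # same process again: extend the current burst
        else:
            page_runs.append([proc, 1])  # a different process starts a new burst
    return {page: [tuple(run) for run in page_runs] for page, page_runs in runs.items()}


# Made-up trace: process 1 touches page A three times, then process 2 takes over.
trace = [(1, "A"), (1, "A"), (2, "B"), (1, "A"), (2, "A"), (2, "A")]
bursts = bursts_per_page(trace)
```

Long runs relative to the number of distinct processes touching a page would indicate the kind of burstiness the abstract reports.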
The future of simulation: A field of dreams
B. Calder, D. Citron, Y. Patt, James E. Smith
Abstract: Quantitative evaluation of next-generation computer architectures and processor enhancements is possible only by running simulations. However, since the insights that are gained through simulation are predicated on the accuracy of the simulation results, and since the design decisions for future processor architectures -- which cost billions of dollars to design and implement -- are based on those insights, periodic examination of the simulation process becomes a necessity, rather than a luxury. Accordingly, this panel discusses the deficiencies of existing simulators, benchmarks, and simulation methodologies and techniques, and, in addition, what future directions are available for each.
DOI: https://doi.org/10.1109/ISPASS.2004.1291369
Published: 2004-03-10
Citations: 60
Architectures and compilers for multimedia
W. Wolf
Abstract: Summary form only given. The article covers architectures and compilers for multimedia systems. Multimedia applications impose real-time constraints on continuous media; they also include a surprisingly wide variety of algorithms. Many multimedia systems also operate under power/energy constraints. As such, multimedia computing systems are an important area of interest for ISPASS. This tutorial targets individuals with experience in hardware and software but who have limited expertise in multimedia. We start with an introduction to multimedia algorithms such as video and audio compression since the characteristics of these algorithms help to shape measurement strategies and architectural decisions. We then cover modern multimedia architectures and compilation techniques relevant to those architectures. We conclude with a case study drawn from our own research - the design of a multiprocessor system-on-chip for real-time gesture recognition.
DOI: https://doi.org/10.1109/ISPASS.2004.1291371
Published: 2004-03-10
Citations: 0
Structures for phase classification
Jeremy Lau, Stefan Schoenmackers, B. Calder
Abstract: Most programs are repetitive, where similar behavior can be seen at different execution times. Proposed algorithms automatically group these similar intervals of execution into phases, where all the intervals in a phase have homogeneous behavior and similar resource requirements. In this paper we examine different program structures for capturing phase behavior. The goal is to compare the size and accuracy of these structures for performing phase classification. We focus on profiling the frequency of program level structures that are independent from underlying architecture performance metrics. This allows the phase classification to be used across different hardware designs that support the same instruction set (ISA). We compare using basic blocks, loop branches, procedures, opcodes, register usage, and memory address information for guiding phase classification. We compare these different structures in terms of their ability to create homogeneous phases, and evaluate the accuracy of using these structures to pick simulation points for SimPoint.
DOI: https://doi.org/10.1109/ISPASS.2004.1291356
Published: 2004-03-10
Citations: 120
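The grouping of execution intervals into phases described in the abstract above can be sketched with frequency vectors over a program structure, here basic blocks. This is a simplified stand-in and not the paper's method or SimPoint's clustering: it assigns each interval to the first existing phase whose representative vector is within an assumed Manhattan-distance threshold, and otherwise opens a new phase. The function names, the threshold value, and the toy intervals are all hypothetical.

```python
from collections import Counter


def frequency_vector(interval):
    """Normalized execution frequency of each basic block ID in one interval."""
    counts = Counter(interval)
    total = sum(counts.values())
    return {block: n / total for block, n in counts.items()}


def manhattan(u, v):
    """Manhattan distance between two sparse vectors stored as dicts."""
    return sum(abs(u.get(k, 0.0) - v.get(k, 0.0)) for k in set(u) | set(v))


def classify_phases(intervals, threshold=0.5):
    """Label each interval with a phase ID using first-fit threshold matching."""
    representatives = []  # one frequency vector per phase (its first member)
    labels = []
    for interval in intervals:
        vec = frequency_vector(interval)
        for phase_id, rep in enumerate(representatives):
            if manhattan(vec, rep) <= threshold:
                labels.append(phase_id)
                break
        else:
            representatives.append(vec)
            labels.append(len(representatives) - 1)
    return labels


# Each interval lists the basic block IDs it executed (made-up data).
intervals = [
    ["a", "a", "b"],
    ["a", "b", "a"],
    ["c", "c", "d"],
    ["a", "a", "b"],
    ["c", "d", "c"],
]
labels = classify_phases(intervals)
```

Replacing basic block IDs with loop branches, procedures, or opcodes changes only what goes into each interval list, which is the kind of structure comparison the abstract describes.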
The BlueGene/L pseudo cycle-accurate simulator
Leonardo R. Bachega, J. Brunheroto, L. D. Rose, Pedro Mindlin, J. Moreira
Abstract: The design and development of a new computer system is a lengthy process, with a considerable amount of time elapsed between the beginning of development and first hardware availability. Hence, fast and reasonably accurate simulation of processor architecture has become critical as an enabling mechanism for software engineers to develop and tune system software and applications. In this paper, we present the time-stamped timing model extensions to the BlueGene/L functional simulator. These extensions were implemented to create a pseudo cycle-accurate simulator capable of providing tracing capabilities for detection of bottlenecks and for performance tuning of applications, before the actual hardware became available. Our validation tests, using the DAXPY kernel and the serial version of the NAS benchmarks, show that our pseudo cycle-accurate simulator provides timing information within 15% of the times measured using the actual BlueGene/L hardware. In addition, we present a couple of case studies, which describe how this simulator can be used for identification of performance bottlenecks and for application tuning.
DOI: https://doi.org/10.1109/ISPASS.2004.1291354
Published: 2004-03-10
Citations: 9
Compiler-directed physical address generation for reducing dTLB power
I. Kadayif, Partho Nath, M. Kandemir, A. Sivasubramaniam
Abstract: Address translation using the Translation Lookaside Buffer (TLB) consumes as much as 16% of the chip power on some processors because of its high associativity and access frequency. While prior work has looked into optimizing this structure at the circuit and architectural levels, this paper takes a different approach of optimizing its power by reducing the number of data TLB (dTLB) lookups for data references. The main idea is to keep translations in a set of translation registers, and intelligently use them in software to directly generate the physical addresses without going through the dTLB. The software has to work within the confines of the translation registers provided by the hardware, and has to maximize the reuse of such translations to be effective. We propose strategies and code transformations for achieving this in array-based and pointer-based codes, looking to optimize data accesses. Results with a suite of Spec95 array-based and pointer-based codes show dTLB energy savings of up to 73% and 88%, respectively, compared to directly using the dTLB for all references. Despite the small increase in instructions executed with our mechanisms, the approach can in fact provide performance benefits in certain cases.
DOI: https://doi.org/10.1109/ISPASS.2004.1291368
Published: 2004-03-10
Citations: 17
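The reuse of cached translations described in the abstract above can be illustrated by counting how many dTLB lookups a sequential array walk needs with and without a single translation register. This is a rough sketch, not the paper's compiler transformation: the 4 KiB page size, the single-register setup, and the function name `count_dtlb_lookups` are assumptions, and energy is not modeled.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def count_dtlb_lookups(addresses, use_translation_register):
    """Count dTLB lookups for a stream of virtual addresses.

    With use_translation_register=True, one register remembers the most
    recently translated virtual page, and accesses to that same page are
    assumed to form their physical address without consulting the dTLB.
    """
    lookups = 0
    cached_vpn = None  # virtual page number held in the translation register
    for vaddr in addresses:
        vpn = vaddr // PAGE_SIZE
        if use_translation_register and vpn == cached_vpn:
            continue  # translation reused from the register; no dTLB access
        lookups += 1
        cached_vpn = vpn
    return lookups


# Sequential walk over 2048 eight-byte elements: 16 KiB, i.e. four pages.
walk = [8 * i for i in range(2048)]
baseline = count_dtlb_lookups(walk, use_translation_register=False)
with_register = count_dtlb_lookups(walk, use_translation_register=True)
```

A pointer-chasing stream that jumps between pages on every access would see far less reuse from one register, which is why the abstract stresses that the software must maximize reuse within the registers the hardware provides.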