{"title":"Value predictors for reuse through speculation on traces","authors":"M. Pilla, P. Navaux, B. Childers, Amarildo T. da Costa, F. França","doi":"10.1109/CAHPC.2004.42","DOIUrl":"https://doi.org/10.1109/CAHPC.2004.42","url":null,"abstract":"Reusing dynamic sequences of instructions - i.e., traces - improves performance for many benchmarks. However, many traces are not reused because of unavailable inputs in the reuse test. Reuse through speculation on traces (RST) aims to increase the number of reused traces by predicting those inputs when necessary, with minimal additional hardware when compared to nonspeculative trace reuse. In this paper, we compare last n-value and stride-aware prediction for trace inputs. Last n-value prediction uses the last recorded values as predictions, while stride-aware prediction identifies and uses strides to compute new predictions. Stride-aware RST has a higher hardware cost than last n-value RST and has also the shortcoming of not allowing branches inside predicted traces. This paper aims to determine which scheme is the most beneficial for RST. We show that stride values are important for reuse in RST and that last n-value prediction works as well as the more sophisticated stride-aware approach with simpler hardware.","PeriodicalId":375288,"journal":{"name":"16th Symposium on Computer Architecture and High Performance Computing","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123282205","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The eDRAM based L3-cache of the BlueGene/L supercomputer processor node","authors":"M. Ohmacht, D. Hoenicke, R. Haring, A. Gara","doi":"10.1109/CAHPC.2004.40","DOIUrl":"https://doi.org/10.1109/CAHPC.2004.40","url":null,"abstract":"BlueGene/L is a supercomputer consisting of 64K dual-processor system-on-a-chip compute nodes, capable of delivering an arithmetic peak performance of 5.6Gflops per node. To match the memory speed to the high compute performance, the system implements an aggressive three-level on-chip cache hierarchy for each node. The implemented hierarchy offers high bandwidth and integrated prefetching on cache hierarchy levels 2 and 3 to reduce memory access time. The integrated L3-cache stores a total of 4MB of data, using multibank embedded DRAM. The 1024 bit wide data port of the embedded DRAM provides 22.4GB/s bandwidth to serve the speculative prefetching demands of the two processor cores and the Gigabit Ethernet DMA engine.","PeriodicalId":375288,"journal":{"name":"16th Symposium on Computer Architecture and High Performance Computing","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122966733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IATO: a flexible EPIC simulation environment","authors":"A. Darsch, André Seznec","doi":"10.1109/CAHPC.2004.20","DOIUrl":"https://doi.org/10.1109/CAHPC.2004.20","url":null,"abstract":"High-performance superscalar processors are designed with the help of complex simulation environments. The simulation infrastructure makes it possible to validate the processor instruction set and also contributes to the performance evaluation of the selected microarchitecture. Unfortunately, new architectures like EPIC are not properly supported in the research community. Due to its specificity, the EPIC architecture requires a new framework that gives the researcher an opportunity to explore the EPIC paradigm by characterizing the static and dynamic behavior of binary programs. In particular, this task is made difficult by the fact that the EPIC architecture defines a fully predicated ISA. This paper presents a novel simulation infrastructure, called IATO, that makes it possible to analyze, emulate and simulate the EPIC microarchitecture by using the IA64 ISA as the reference architecture. The novelty of the environment is that it provides both in-order and out-of-order cycle-accurate execution-driven simulators. In particular, the out-of-order simulator provides an innovative solution for the out-of-order execution of a fully predicated ISA.","PeriodicalId":375288,"journal":{"name":"16th Symposium on Computer Architecture and High Performance Computing","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134153949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance evaluation of a prototype distributed NFS server","authors":"R. Ávila, P. Navaux, P. Lombard, A. Lèbre, Y. Denneulin","doi":"10.1109/CAHPC.2004.33","DOIUrl":"https://doi.org/10.1109/CAHPC.2004.33","url":null,"abstract":"A high-performance file system is normally a key point for large cluster installations, where hundreds or even thousands of nodes frequently need to manage large volumes of data. While most solutions usually make use of dedicated hardware and/or specific distribution and replication protocols, the NFSP (NFS Parallel) project aims at improving performance within a standard NFS client/server system. In this paper we investigate the possibilities of a replication model for the NFS server, which is based on Lazy Release Consistency (LRC). A prototype has been built upon the user-level NFSv2 server and a performance evaluation is carried out.","PeriodicalId":375288,"journal":{"name":"16th Symposium on Computer Architecture and High Performance Computing","volume":"40 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131249846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cache filtering techniques to reduce the negative impact of useless speculative memory references on processor performance","authors":"O. Mutlu, Hyesoon Kim, D. N. Armstrong, Y. Patt","doi":"10.1109/CAHPC.2004.11","DOIUrl":"https://doi.org/10.1109/CAHPC.2004.11","url":null,"abstract":"High-performance processors employ aggressive speculation and prefetching techniques to increase performance. Speculative memory references caused by these techniques sometimes bring data into the caches that are not needed by correct execution. This paper proposes the use of the first-level caches as filters that predict the usefulness of speculative memory references. With the proposed technique, speculative memory references bring data only into the first-level caches rather than all levels in the cache hierarchy. The processor monitors the use of the cache blocks in the first-level caches and decides which blocks to keep in the cache hierarchy based on the usefulness of cache blocks. It is shown that a simple implementation of this technique usually outperforms inclusive and exclusive baseline cache hierarchies commonly used by today's processors and results in IPC performance improvements of up to 9.2% on the SPEC2000 integer benchmarks.","PeriodicalId":375288,"journal":{"name":"16th Symposium on Computer Architecture and High Performance Computing","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116026716","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ArchC: a systemC-based architecture description language","authors":"S. Rigo, G. Araújo, Marcus Bartholomeu, R. Azevedo","doi":"10.1109/CAHPC.2004.8","DOIUrl":"https://doi.org/10.1109/CAHPC.2004.8","url":null,"abstract":"This paper presents an architecture description language (ADL) called ArchC, which is an open-source SystemC-based language that is specialized for processor architecture description. Its main goal is to provide enough information, at the right level of abstraction, in order to allow users to explore and verify new architectures, by automatically generating software tools like simulators and co-verification interfaces. ArchC's key features are a storage-based co-verification mechanism that automatically checks the consistency of a refined ArchC model against a reference (functional) description, memory hierarchy modeling capability, the possibility of integration with other SystemC IPs and the automatic generation of high-level SystemC simulators. We have used ArchC to synthesize both functional and cycle-based simulators for the MIPS, Intel 8051 and SPARC V8 processors, as well as functional models of modern architectures like TMS320C62x, XScale and PowerPC.","PeriodicalId":375288,"journal":{"name":"16th Symposium on Computer Architecture and High Performance Computing","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116100815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving parallel execution time of sorting on heterogeneous clusters","authors":"C. Cérin, Michel Koskas, Hazem Fkaier, M. Jemni","doi":"10.1109/CAHPC.2004.21","DOIUrl":"https://doi.org/10.1109/CAHPC.2004.21","url":null,"abstract":"The aim of the paper is to introduce techniques to optimize the parallel execution time of sorting on heterogeneous platforms (processor speeds are related by a constant factor). We develop a constant-time technique for mastering processor load balancing and execution time in a heterogeneous environment. We develop an analytical model for the parallel execution time, supported by preliminary experimental results in the case of a 2-processor system. The computation of the solution is independent of the problem size. Consequently, there is no overhead regarding the sorting problem.","PeriodicalId":375288,"journal":{"name":"16th Symposium on Computer Architecture and High Performance Computing","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125132603","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A new migration model based on the evaluation of processes load and lifetime on heterogeneous computing environments","authors":"R. Mello, Luciano José Senger","doi":"10.1109/CAHPC.2004.2","DOIUrl":"https://doi.org/10.1109/CAHPC.2004.2","url":null,"abstract":"This paper presents a new model for evaluating the positive and negative impacts of process migration in environments composed of computers with heterogeneous capacities. In this model, a busy computer analyzes the occupation of each process and selects the most suitable one for migration. The analysis and selection are done through a migration factor. This factor reflects how much the busy computer will be freed and how much the destination computer will be overloaded by the migration of each process. The migrated processes are the ones whose migration factors improve the load balancing of the environment. The results of the experiments carried out demonstrate this model's contributions when compared to related work. The contribution is the decrease in average process response time, which means higher performance.","PeriodicalId":375288,"journal":{"name":"16th Symposium on Computer Architecture and High Performance Computing","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128137091","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Revisiting a BSP/CGM transitive closure algorithm","authors":"E. Cáceres, Cristiano C. A. Vieira","doi":"10.1109/CAHPC.2004.36","DOIUrl":"https://doi.org/10.1109/CAHPC.2004.36","url":null,"abstract":"We present a new BSP/CGM parallel algorithm for the transitive closure problem. Our algorithm uses O(n³/pα) local computation time with O(p/α) communication rounds, where α is the size in bits that can be stored in a primitive data item. For all the randomly generated graphs that were used in the tests, the number of communication rounds was bounded by log p\\α + 1. Our algorithm, even for the worst case, improves the previous results. The algorithm was implemented and the results show the efficiency and scalability of the presented algorithm and compare favorably with other parallel implementations.","PeriodicalId":375288,"journal":{"name":"16th Symposium on Computer Architecture and High Performance Computing","volume":"516 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133132622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A performance evaluation of ARM ISA extension for elliptic curve cryptography over binary finite fields","authors":"S. Bartolini, I. Branovic, R. Giorgi, E. Martinelli","doi":"10.1109/CAHPC.2004.5","DOIUrl":"https://doi.org/10.1109/CAHPC.2004.5","url":null,"abstract":"In this paper, we present an evaluation of possible ARM instruction set extension for elliptic curve cryptography (ECC) over binary finite fields GF(2^m). The use of elliptic curve cryptography is becoming common in embedded domain, where its reduced key size at a security level equivalent to standard public-key methods (such as RSA) allows for power consumption savings and more efficient operation. ARM processor was selected because it is widely used for embedded system applications. We developed an ECC benchmark set with three widely used public-key algorithms: Diffie-Hellman for key exchange, digital signature algorithm, as well as El-Gamal method for encryption/decryption. We analyzed the major bottlenecks at function level and evaluated the performance improvement, when we introduce some simple architectural support in the ARM ISA. Results of our experiments show that the use of a word-level multiplication instruction over binary field allows for an average 33% reduction of the total number of dynamically executed instructions, while execution time improves by the same amount when projective coordinates are used.","PeriodicalId":375288,"journal":{"name":"16th Symposium on Computer Architecture and High Performance Computing","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128169160","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}