10th International Symposium on High Performance Computer Architecture (HPCA'04): Latest Papers

Wavelet analysis for microprocessor design: experiences with wavelet-based dI/dt characterization

10th International Symposium on High Performance Computer Architecture (HPCA'04) Pub Date : 2004-02-14 DOI: 10.1109/HPCA.2004.10027

R. Joseph, Zhigang Hu, M. Martonosi

Abstract: As microprocessors become increasingly complex, the techniques used to analyze and predict their behavior must become increasingly rigorous. We apply wavelet analysis techniques to the problem of dI/dt estimation and control in modern microprocessors. While prior work has considered Bayesian phase analysis, Markov analysis, and other techniques to characterize hardware and software behavior, we know of no prior work using wavelets for characterizing computer systems. The dI/dt problem has been increasingly vexing in recent years, because of aggressive drops in supply voltage and increasingly large relative fluctuations in CPU current dissipation. Because the dI/dt problem has natural frequency dependence (it is worst in the mid-frequency range of roughly 50-200 MHz) it is natural to apply frequency-oriented techniques like wavelets to understand it. Our work proposes (i) an offline wavelet-based estimation technique that can accurately predict a benchmark's likelihood of causing voltage emergencies, and (ii) an online wavelet-based control technique that uses key wavelet coefficients to predict and avert impending voltage emergencies. The offline estimation technique works with roughly 0.94% error. The online control technique reduces false positives in dI/dt prediction, allowing voltage control to occur with less than 2.5% performance overhead on the SPEC benchmark suite.
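
As a rough illustration of why a wavelet decomposition suits a frequency-dependent problem like this one, the sketch below splits a per-cycle current trace into scales and reports the energy at each scale. The abstract does not name a wavelet basis, so the Haar basis, the function names, and the sample trace are all assumptions made for illustration, not the authors' method.

```python
import math

def haar_dwt(signal):
    """Full Haar decomposition of a sequence whose length is a power of two.

    Returns (approximation, details); details[0] holds the finest-scale
    (highest-frequency) coefficients, later entries hold coarser scales.
    The Haar basis is an assumption; the abstract does not specify one.
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    approx = list(signal)
    details = []
    root2 = math.sqrt(2.0)
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Differences capture variation at this scale, sums carry the rest.
        details.append([(a - b) / root2 for a, b in pairs])
        approx = [(a + b) / root2 for a, b in pairs]
    return approx[0], details

def energy_per_scale(details):
    """Sum of squared coefficients at each scale, finest scale first."""
    return [sum(c * c for c in level) for level in details]

# Hypothetical trace: current swinging every cycle (highest frequency).
trace = [1, -1] * 4
_, details = haar_dwt(trace)
print(energy_per_scale(details))
```

A trace that alternates every cycle puts all of its energy in the finest scale, while slower swings would show up in the coarser scales; a mid-frequency band such as the one the abstract mentions would correspond to a subset of intermediate scales.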

Citations: 29

Processor Aware Anticipatory Prefetching in Loops

10th International Symposium on High Performance Computer Architecture (HPCA'04) Pub Date : 2004-02-14 DOI: 10.1109/HPCA.2004.10029

Spiros Kalogeropulos, M. Rajagopalan, V. Rao, Yonghong Song, P. Tirumalai

Abstract: As microprocessor speeds increase, a large fraction of the execution time is often lost to cache miss penalties. This loss can be particularly severe in processors such as the UltraSPARC-IIICu which have in-order execution and block on cache misses. Such processors rely greatly on the compiler to reduce stalls and achieve high performance. This paper describes a compiler technique for software prefetching that is aware of the specific prefetch behaviors of the target processor. The implementation targets loops containing control-flow and strided or irregular memory access patterns. A two-phase locality analysis, capable of handling complex subscript expressions, is used for enhanced identification of prefetch candidates. Prefetch instructions are scheduled with careful consideration of the prefetch behaviors in the target system. Compared to a previous implementation, our technique produced performance improvements of 9% on the geometric mean, and up to 44% on individual tests, in Sun’s first UltraSPARC-IIICu based SPEC CPU2000 submission [5] and has been used in all later submissions to date.

Citations: 5

Reducing the Scheduling Critical Cycle Using Wakeup Prediction

10th International Symposium on High Performance Computer Architecture (HPCA'04) Pub Date : 2004-02-14 DOI: 10.1109/HPCA.2004.10016

Todd E. Ehrhart, Sanjay J. Patel

Abstract: For highest performance, a modern microprocessor must be able to determine if an instruction is ready in the same cycle in which it is to be selected for execution. This creates a cycle of logic involving wakeup and select. However, the time a static instruction spends waiting for wakeup shows little dynamic variance. This idea is used to build a machine where wakeup times are predicted, and instructions executed too early are replayed. This form of self-scheduling reduces the critical cycle by eliminating the wakeup logic at the expense of additional replays. However, replays and other pipeline effects affect the cost of misprediction. To solve this, an allowance is added to the predicted wakeup time to decrease the probability of a replay. This allowance may be associated with individual instructions or the global state, and is dynamically adjusted by a gradient-descent minimum-searching technique. When processor load is low, prediction may be more aggressive, increasing the chance of replays but also increasing performance, so the aggressiveness of the predictor is dynamically adjusted using processor load as a feedback parameter.
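
The interplay between predicted wakeup time, allowance, and replay can be sketched in a few lines. This is only a toy model: it uses a last-value predictor per static instruction and a simple add-one/subtract-one rule on a single global allowance, which stands in for (and is not) the gradient-descent search the abstract describes. The class name, method names, and parameters are hypothetical.

```python
class WakeupPredictor:
    """Toy wakeup-time predictor with a global allowance (illustrative only)."""

    def __init__(self, allowance=0, max_allowance=8):
        self.last_wait = {}              # pc -> last observed wait, in cycles
        self.allowance = allowance       # extra cycles added to every prediction
        self.max_allowance = max_allowance

    def predict(self, pc):
        # Static instructions show little variance, so reuse the last wait.
        return self.last_wait.get(pc, 0) + self.allowance

    def update(self, pc, actual_wait):
        """Record the true wait; return True if the instruction replayed."""
        predicted = self.predict(pc)
        replay = predicted < actual_wait          # issued too early
        if replay:
            # Be more conservative after a replay.
            self.allowance = min(self.allowance + 1, self.max_allowance)
        elif predicted > actual_wait and self.allowance > 0:
            # Waited longer than needed: be more aggressive.
            self.allowance -= 1
        self.last_wait[pc] = actual_wait
        return replay

p = WakeupPredictor()
print(p.update(0x40, 3), p.predict(0x40))
```

A larger allowance trades issue latency for fewer replays; the feedback rule here only shows the direction of that trade-off, not how the paper tunes it.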

Citations: 23

Architectural characterization of TCP/IP packet processing on the Pentium® M microprocessor

10th International Symposium on High Performance Computer Architecture (HPCA'04) Pub Date : 2004-02-14 DOI: 10.1109/HPCA.2004.10024

S. Makineni, R. Iyer

Abstract: A majority of the current and next generation server applications (Web services, e-commerce, storage, etc.) employ TCP/IP as the communication protocol of choice. As a result, the performance of these applications is heavily dependent on efficient TCP/IP packet processing within the termination nodes. This dependency becomes even greater as the bandwidth needs of these applications grow from 100 Mbps to 1 Gbps to 10 Gbps in the near future. Motivated by this, we focus on the following: (a) to understand the performance behavior of the various modes of TCP/IP processing, (b) to analyze the underlying architectural characteristics of TCP/IP packet processing and (c) to quantify the computational requirements of the TCP/IP packet processing component within realistic workloads. We achieve these goals by performing an in-depth analysis of packet processing performance on Intel's state-of-the-art low power Pentium® M microprocessor running the Microsoft Windows Server 2003 operating system. Some of our key observations are: (i) that the mode of TCP/IP operation can significantly affect the performance requirements, (ii) that transmit-side processing is largely compute-intensive as compared to receive-side processing, which is more memory-bound, and (iii) that the computational requirements for sending/receiving packets can form a substantial component (28% to 40%) of commercial server workloads. From our analysis, we also discuss architectural as well as stack-related improvements that can help achieve higher server network throughput and result in improved application performance.

Citations: 46

Creating converged trace schedules using string matching

10th International Symposium on High Performance Computer Architecture (HPCA'04) Pub Date : 2004-02-14 DOI: 10.1109/HPCA.2004.10012

S. Narayanasamy, Yuanfang Hu, S. Sair, B. Calder

Citations: 1

Accurate and complexity-effective spatial pattern prediction

10th International Symposium on High Performance Computer Architecture (HPCA'04) Pub Date : 2004-02-14 DOI: 10.1109/HPCA.2004.10010

Chi F. Chen, Se-Hyun Yang, B. Falsafi, Andreas Moshovos

Abstract: Recent research suggests that there are large variations in a cache's spatial usage, both within and across programs. Unfortunately, conventional caches typically employ fixed cache line sizes to balance the exploitation of spatial and temporal locality, and to avoid prohibitive cache fill bandwidth demands. The resulting inability of conventional caches to exploit spatial variations leads to suboptimal performance and unnecessary cache power dissipation. We describe the spatial pattern predictor (SPP), a cost-effective hardware mechanism that accurately predicts reference patterns within a spatial group (i.e., a contiguous region of data in memory) at runtime. The key observation enabling an accurate, yet low-cost, SPP design is that spatial patterns correlate well with instruction addresses and data reference offsets within a cache line. We require only a small amount of predictor memory to store the predicted patterns. Simulation results for a 64-Kbyte 2-way set-associative L1 data cache with 64-byte lines show that: (1) a 256-entry tag-less direct-mapped SPP can achieve, on average, a prediction coverage of 95%, over-predicting the patterns by only 8%, (2) assuming a 70 nm process technology, the SPP helps reduce leakage energy in the base cache by 41% on average, incurring less than 1% performance degradation, and (3) prefetching spatial groups of up to 512 bytes using SPP improves execution time by 33% on average and up to a factor of two.
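
The key observation, that patterns correlate with the instruction address and the reference offset within a line, suggests a small table indexed by those two values. The sketch below models a 256-entry tagless direct-mapped table as in the abstract; the XOR index function, the 8-byte word granularity, the bitmask encoding, and all names are assumptions made for illustration rather than details taken from the paper.

```python
class SpatialPatternPredictor:
    """Toy tagless, direct-mapped spatial pattern table (illustrative only)."""

    def __init__(self, entries=256, line_bytes=64, word_bytes=8):
        self.entries = entries
        self.line_bytes = line_bytes
        self.word_bytes = word_bytes
        self.words = line_bytes // word_bytes
        self.table = [None] * entries    # no tags: aliasing entries share a slot

    def _index(self, pc, addr):
        # Combine instruction address and word offset within the line.
        # The XOR hash is an assumption, not the paper's index function.
        offset = (addr % self.line_bytes) // self.word_bytes
        return ((pc >> 2) ^ offset) % self.entries

    def predict(self, pc, addr):
        """Return a bitmask of words predicted to be referenced."""
        pattern = self.table[self._index(pc, addr)]
        if pattern is None:
            return (1 << self.words) - 1  # no history: assume the whole line
        return pattern

    def train(self, pc, addr, observed_pattern):
        """Store the pattern observed when the spatial group was evicted."""
        self.table[self._index(pc, addr)] = observed_pattern

spp = SpatialPatternPredictor()
spp.train(0x1000, 0x8010, 0b00000110)
print(bin(spp.predict(0x1000, 0x9010)))
```

Because the index ignores the line address, a pattern learned on one line is reused when the same instruction touches the same offset in a different line, which is what lets a small table cover many lines.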

Citations: 110

Exploiting the cache capacity of a single-chip multi-core processor with execution migration

10th International Symposium on High Performance Computer Architecture (HPCA'04) Pub Date : 2004-02-14 DOI: 10.1109/HPCA.2004.10026

P. Michaud

Citations: 44

The thrifty barrier: energy-aware synchronization in shared-memory multiprocessors

10th International Symposium on High Performance Computer Architecture (HPCA'04) Pub Date : 2004-02-14 DOI: 10.1109/HPCA.2004.10018

Jian Li, José F. Martínez, Michael C. Huang

Citations: 126

Signature buffer: bridging performance gap between registers and caches

10th International Symposium on High Performance Computer Architecture (HPCA'04) Pub Date : 2004-02-14 DOI: 10.1109/HPCA.2004.10020

Lu Peng, J. Peir, K. Lai

Citations: 4

Program counter based techniques for dynamic power management

10th International Symposium on High Performance Computer Architecture (HPCA'04) Pub Date : 2004-02-14 DOI: 10.1109/HPCA.2004.10021

C. Gniady, Y. C. Hu, Yung-Hsiang Lu

Citations: 53