10th International Symposium on High Performance Computer Architecture (HPCA'04): Latest Papers

Wavelet analysis for microprocessor design: experiences with wavelet-based dI/dt characterization
R. Joseph, Zhigang Hu, M. Martonosi
{"title":"Wavelet analysis for microprocessor design: experiences with wavelet-based dI/dt characterization","authors":"R. Joseph, Zhigang Hu, M. Martonosi","doi":"10.1109/HPCA.2004.10027","DOIUrl":"https://doi.org/10.1109/HPCA.2004.10027","url":null,"abstract":"As microprocessors become increasingly complex, the techniques used to analyze and predict their behavior must become increasingly rigorous. We apply wavelet analysis techniques to the problem of dI/dt estimation and control in modern microprocessors. While prior work has considered Bayesian phase analysis, Markov analysis, and other techniques to characterize hardware and software behavior, we know of no prior work using wavelets for characterizing computer systems. The dI/dt problem has been increasingly vexing in recent years, because of aggressive drops in supply voltage and increasingly large relative fluctuations in CPU current dissipation. Because the dI/dt problem has natural frequency dependence (it is worst in the mid-frequency range of roughly 50-200 MHz) it is natural to apply frequency-oriented techniques like wavelets to understand it. Our work proposes (i) an offline wavelet-based estimation technique that can accurately predict a benchmark's likelihood of causing voltage emergencies, and (ii) an online wavelet-based control technique that uses key wavelet coefficients to predict and avert impending voltage emergencies. The offline estimation technique works with roughly 0.94% error. 
The online control technique reduces false positives in dI/dt prediction, allowing voltage control to occur with less than 2.5% performance overhead on the SPEC benchmark suite.","PeriodicalId":145009,"journal":{"name":"10th International Symposium on High Performance Computer Architecture (HPCA'04)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115091195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 29
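The abstract above argues that a frequency-oriented decomposition suits the dI/dt problem. As a rough illustration only, the sketch below applies a multi-level Haar wavelet transform to a synthetic per-cycle current trace and reports the energy at each scale. The Haar wavelet, the trace values, and the function names are assumptions made for this example; the abstract does not say which wavelet family or data format the authors use.

```python
import math

def haar_dwt(signal):
    """One-level Haar transform: returns (approximation, detail) coefficients."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_decompose(signal, levels):
    """Multi-level decomposition; details[0] is the finest (highest-frequency) scale."""
    details = []
    approx = list(signal)
    for _ in range(levels):
        approx, d = haar_dwt(approx)
        details.append(d)
    return approx, details

def energy_per_scale(details):
    """Sum of squared detail coefficients at each scale."""
    return [sum(c * c for c in d) for d in details]

# Hypothetical current trace (arbitrary units): a square wave with an 8-cycle period.
trace = ([10.0] * 4 + [2.0] * 4) * 4
approx, details = haar_decompose(trace, levels=3)
energies = energy_per_scale(details)
print(energies)
```

For this trace all detail energy lands in the third (coarsest) scale, because the current only changes at 4-cycle boundaries. A real characterization would look for energy concentrated in the scales that correspond to the mid-frequency band named in the abstract.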
Processor Aware Anticipatory Prefetching in Loops
Spiros Kalogeropulos, M. Rajagopalan, V. Rao, Yonghong Song, P. Tirumalai
{"title":"Processor Aware Anticipatory Prefetching in Loops","authors":"Spiros Kalogeropulos, M. Rajagopalan, V. Rao, Yonghong Song, P. Tirumalai","doi":"10.1109/HPCA.2004.10029","DOIUrl":"https://doi.org/10.1109/HPCA.2004.10029","url":null,"abstract":"As microprocessor speeds increase, a large fraction of the execution time is often lost to cache miss penalties. This loss can be particularly severe in processors such as the UltraSPARC-IIICu which have in-order execution and block on cache misses. Such processors rely greatly on the compiler to reduce stalls and achieve high performance. This paper describes a compiler technique for software prefetching that is aware of the specific prefetch behaviors of the target processor. The implementation targets loops containing control-flow and strided or irregular memory access patterns. A two phase locality analysis, capable of handling complex subscript expressions, is used for enhanced identification of prefetch candidates. Prefetch instructions are scheduled with careful consideration of the prefetch behaviors in the target system. 
Compared to a previous implementation, our technique produced performance improvements of 9% on the geometric mean, and up to 44% on individual tests, in Sun’s first UltraSPARC-IIICu based SPEC CPU2000 submission [5] and has been used in all later submissions to date.","PeriodicalId":145009,"journal":{"name":"10th International Symposium on High Performance Computer Architecture (HPCA'04)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122325564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Reducing the Scheduling Critical Cycle Using Wakeup Prediction
Todd E. Ehrhart, Sanjay J. Patel
{"title":"Reducing the Scheduling Critical Cycle Using Wakeup Prediction","authors":"Todd E. Ehrhart, Sanjay J. Patel","doi":"10.1109/HPCA.2004.10016","DOIUrl":"https://doi.org/10.1109/HPCA.2004.10016","url":null,"abstract":"For highest performance, a modern microprocessor must be able to determine if an instruction is ready in the same cycle in which it is to be selected for execution. This creates a cycle of logic involving wakeup and select. However, the time a static instruction spends waiting for wakeup shows little dynamic variance. This idea is used to build a machine where wakeup times are predicted, and instructions executed too early are replayed. This form of self-scheduling reduces the critical cycle by eliminating the wakeup logic at the expense of additional replays. However, replays and other pipeline effects affect the cost of misprediction. To solve this, an allowance is added to the predicted wakeup time to decrease the probability of a replay. This allowance may be associated with individual instructions or the global state, and is dynamically adjusted by a gradient-descent minimum-searching technique. 
When processor load is low, prediction may be more aggressive — increasing the chance of replays, but increasing performance, so the aggressiveness of the predictor is dynamically adjusted using processor load as a feedback parameter.","PeriodicalId":145009,"journal":{"name":"10th International Symposium on High Performance Computer Architecture (HPCA'04)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128316964","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 23
Architectural characterization of TCP/IP packet processing on the Pentium® M microprocessor
S. Makineni, R. Iyer
{"title":"Architectural characterization of TCP/IP packet processing on the Pentium® M microprocessor","authors":"S. Makineni, R. Iyer","doi":"10.1109/HPCA.2004.10024","DOIUrl":"https://doi.org/10.1109/HPCA.2004.10024","url":null,"abstract":"A majority of the current and next generation server applications (Web services, e-commerce, storage, etc.) employ TCP/IP as the communication protocol of choice. As a result, the performance of these applications is heavily dependent on the efficient TCP/IP packet processing within the termination nodes. This dependency becomes even greater as the bandwidth needs of these applications grow from 100 Mbps to 1 Gbps to 10 Gbps in the near future. Motivated by this, we focus on the following: (a) to understand the performance behavior of the various modes of TCP/IP processing, (b) to analyze the underlying architectural characteristics of TCP/IP packet processing and (c) to quantify the computational requirements of the TCP/IP packet processing component within realistic workloads. We achieve these goals by performing an in-depth analysis of packet processing performance on Intel's state-of-the-art low power Pentium® M microprocessor running the Microsoft Windows* Server 2003 operating system. Some of our key observations are - (i) that the mode of TCP/IP operation can significantly affect the performance requirements, (ii) that transmit-side processing is largely compute-intensive as compared to receive-side processing which is more memory-bound and (iii) that the computational requirements for sending/receiving packets can form a substantial component (28% to 40%) of commercial server workloads. 
From our analysis, we also discuss architectural as well as stack-related improvements that can help achieve higher server network throughput and result in improved application performance.","PeriodicalId":145009,"journal":{"name":"10th International Symposium on High Performance Computer Architecture (HPCA'04)","volume":"272 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115598243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 46
Creating converged trace schedules using string matching
S. Narayanasamy, Yuanfang Hu, S. Sair, B. Calder
{"title":"Creating converged trace schedules using string matching","authors":"S. Narayanasamy, Yuanfang Hu, S. Sair, B. Calder","doi":"10.1109/HPCA.2004.10012","DOIUrl":"https://doi.org/10.1109/HPCA.2004.10012","url":null,"abstract":"We focus on generating efficient software pipelined schedules for in-order machines, which we call converged trace schedules. For a candidate loop, we form a string of trace block identifiers by hashing together addresses of aggressively scheduled instructions from multiple iterations of a loop. In this process, the loop is unrolled and scheduled until we identify a repeating pattern in the string. Instructions corresponding to this repeating pattern form the kernel for our software pipelined schedule. We evaluate this approach to create aggressive schedules by using it in dynamic hardware and software optimization systems for an in-order architecture.","PeriodicalId":145009,"journal":{"name":"10th International Symposium on High Performance Computer Architecture (HPCA'04)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124502683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
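The abstract above describes unrolling and scheduling a loop until a repeating pattern shows up in a string of trace block identifiers. The sketch below shows one simple way such a check could look: it finds the shortest pattern that repeats back to back at the end of the identifier sequence. The function names, the use of Python's built-in `hash` on instruction addresses, and the `min_repeats` parameter are assumptions made for illustration, not the authors' algorithm.

```python
def block_id(instruction_addresses):
    """Hypothetical block identifier: hash the addresses scheduled in one block."""
    return hash(tuple(instruction_addresses))

def find_repeating_kernel(block_ids, min_repeats=2):
    """Return the shortest pattern that occurs min_repeats times in a row
    at the end of block_ids, or None if the schedule has not converged."""
    n = len(block_ids)
    for period in range(1, n // min_repeats + 1):
        pattern = block_ids[n - period:]
        if all(
            block_ids[n - (k + 1) * period : n - k * period] == pattern
            for k in range(min_repeats)
        ):
            return pattern
    return None

# Hypothetical schedule: two prologue blocks, then a kernel of three blocks seen twice.
ids = ["P0", "P1", "A", "B", "C", "A", "B", "C"]
kernel = find_repeating_kernel(ids)
print(kernel)
```

When the function returns `None`, the caller would unroll and schedule another iteration and try again; a returned pattern plays the role of the software-pipelined kernel.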
Accurate and complexity-effective spatial pattern prediction
Chi F. Chen, Se-Hyun Yang, B. Falsafi, Andreas Moshovos
{"title":"Accurate and complexity-effective spatial pattern prediction","authors":"Chi F. Chen, Se-Hyun Yang, B. Falsafi, Andreas Moshovos","doi":"10.1109/HPCA.2004.10010","DOIUrl":"https://doi.org/10.1109/HPCA.2004.10010","url":null,"abstract":"Recent research suggests that there are large variations in a cache's spatial usage, both within and across programs. Unfortunately, conventional caches typically employ fixed cache line sizes to balance the exploitation of spatial and temporal locality, and to avoid prohibitive cache fill bandwidth demands. The resulting inability of conventional caches to exploit spatial variations leads to suboptimal performance and unnecessary cache power dissipation. We describe the spatial pattern predictor (SPP), a cost-effective hardware mechanism that accurately predicts reference patterns within a spatial group (i.e., a contiguous region of data in memory) at runtime. The key observation enabling an accurate, yet low-cost, SPP design is that spatial patterns correlate well with instruction addresses and data reference offsets within a cache line. We require only a small amount of predictor memory to store the predicted patterns. 
Simulation results for a 64-Kbyte 2-way set-associative L1 data cache with 64-byte lines show that: (1) a 256-entry tag-less direct-mapped SPP can achieve, on average, a prediction coverage of 95%, over-predicting the patterns by only 8%, (2) assuming a 70 nm process technology, the SPP helps reduce leakage energy in the base cache by 41% on average, incurring less than 1% performance degradation, and (3) prefetching spatial groups of up to 512 bytes using SPP improves execution time by 33% on average and up to a factor of two.","PeriodicalId":145009,"journal":{"name":"10th International Symposium on High Performance Computer Architecture (HPCA'04)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128531051","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 110
Exploiting the cache capacity of a single-chip multi-core processor with execution migration
P. Michaud
{"title":"Exploiting the cache capacity of a single-chip multi-core processor with execution migration","authors":"P. Michaud","doi":"10.1109/HPCA.2004.10026","DOIUrl":"https://doi.org/10.1109/HPCA.2004.10026","url":null,"abstract":"We propose to modify a conventional single-chip multicore so that a sequential program can migrate from one core to another automatically during execution. The goal of execution migration is to take advantage of the overall on-chip cache capacity. We introduce the affinity algorithm, a method for distributing cache lines automatically on several caches. We show that on working-sets exhibiting a property called \"splittability\", it is possible to trade cache misses for migrations. Our experimental results indicate that the proposed method has a potential for improving the performance of certain sequential programs, without degrading significantly the performance of others.","PeriodicalId":145009,"journal":{"name":"10th International Symposium on High Performance Computer Architecture (HPCA'04)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122188916","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 44
The thrifty barrier: energy-aware synchronization in shared-memory multiprocessors
Jian Li, José F. Martínez, Michael C. Huang
{"title":"The thrifty barrier: energy-aware synchronization in shared-memory multiprocessors","authors":"Jian Li, José F. Martínez, Michael C. Huang","doi":"10.1109/HPCA.2004.10018","DOIUrl":"https://doi.org/10.1109/HPCA.2004.10018","url":null,"abstract":"Much research has been devoted to making microprocessors energy-efficient. However, little attention has been paid to multiprocessor environments where, due to the cooperative nature of the computation, the most energy-efficient execution in each processor may not translate into the most energy-efficient overall execution. We present the thrifty barrier, a hardware-software approach to saving energy in parallel applications that exhibit barrier synchronization imbalance. Threads that arrive early to a thrifty barrier pick among existing low-power processor sleep states based on predicted barrier stall time and other factors. We leverage the coherence protocol and propose small hardware extensions to achieve timely wake-up of these dormant threads, maximizing energy savings while minimizing the impact on performance.","PeriodicalId":145009,"journal":{"name":"10th International Symposium on High Performance Computer Architecture (HPCA'04)","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134341284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 126
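The abstract above says early-arriving threads pick a sleep state based on predicted barrier stall time. A minimal sketch of that kind of decision follows, assuming a simple energy model in which transition time is charged at active power and the remaining stall at the state's power. The state names, power figures, latencies, and the model itself are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SleepState:
    name: str
    power_watts: float    # power drawn while resident in the state
    transition_us: float  # combined entry + exit latency

def pick_sleep_state(states, predicted_stall_us, active_power_watts):
    """Return the state with the lowest estimated energy over the stall,
    or None if spinning at active power is estimated to be cheapest."""
    best = None
    best_energy = active_power_watts * predicted_stall_us
    for st in states:
        if st.transition_us >= predicted_stall_us:
            continue  # cannot enter and leave the state before the barrier releases
        energy = (active_power_watts * st.transition_us
                  + st.power_watts * (predicted_stall_us - st.transition_us))
        if energy < best_energy:
            best, best_energy = st, energy
    return best

# Hypothetical states and active power, for illustration only.
STATES = [
    SleepState("halt", power_watts=10.0, transition_us=1.0),
    SleepState("sleep", power_watts=3.0, transition_us=50.0),
    SleepState("deep", power_watts=0.5, transition_us=500.0),
]
ACTIVE_POWER = 30.0

for stall in (0.5, 20.0, 200.0, 50000.0):
    choice = pick_sleep_state(STATES, stall, ACTIVE_POWER)
    print(stall, choice.name if choice else "spin")
```

The sketch covers only the selection step. The timely wake-up through the coherence protocol that the abstract mentions is outside its scope.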
Signature buffer: bridging performance gap between registers and caches
Lu Peng, J. Peir, K. Lai
{"title":"Signature buffer: bridging performance gap between registers and caches","authors":"Lu Peng, J. Peir, K. Lai","doi":"10.1109/HPCA.2004.10020","DOIUrl":"https://doi.org/10.1109/HPCA.2004.10020","url":null,"abstract":"Data communications between producer instructions and consumer instructions through memory incur extra delays that degrade processor performance. We introduce a new storage media with a novel addressing mechanism to avoid address calculations. Instead of a memory address, each load and store is assigned a signature for accessing the new storage. A signature consists of the color of the base register along with its displacement value. A unique color is assigned to a register whenever the register is updated. When two memory instructions have the same signature, they address to the same memory location. This memory signature can be formed early in the processor pipeline. A small signature buffer, addressed by the memory signature, can be established to permit stores and loads bypassing normal memory hierarchy for fast data communication. Performance evaluations based on an Alpha 21264-like pipeline using SPEC2000 integer benchmarks show that an IPC (instruction-per-cycle) improvement of 13-18% is possible using a small 8-entry signature buffer.","PeriodicalId":145009,"journal":{"name":"10th International Symposium on High Performance Computer Architecture (HPCA'04)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128912776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
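The abstract above defines a signature as the color of the base register plus the displacement, with a fresh color assigned on every register update. The toy model below illustrates that addressing idea in software. The class and method names, the oldest-first replacement policy, and the use of integer colors are assumptions; the sketch also ignores cases the hardware must handle, such as two different base registers pointing at the same address.

```python
from collections import OrderedDict
from itertools import count

class SignatureBuffer:
    """Toy model: a small buffer addressed by (register color, displacement)."""

    def __init__(self, entries=8):
        self.entries = entries
        self.buffer = OrderedDict()    # signature -> stored value
        self.color = {}                # register name -> current color
        self._next_color = count(1)

    def update_register(self, reg):
        """Writing a register gives it a fresh color, so old signatures no longer match."""
        self.color[reg] = next(self._next_color)

    def signature(self, base_reg, displacement):
        if base_reg not in self.color:
            self.update_register(base_reg)
        return (self.color[base_reg], displacement)

    def store(self, base_reg, displacement, value):
        sig = self.signature(base_reg, displacement)
        self.buffer[sig] = value
        self.buffer.move_to_end(sig)
        if len(self.buffer) > self.entries:
            self.buffer.popitem(last=False)  # assumed policy: evict the oldest entry

    def load(self, base_reg, displacement):
        """Return the buffered value, or None to signal a fall back to normal memory."""
        return self.buffer.get(self.signature(base_reg, displacement))

sb = SignatureBuffer(entries=8)
sb.store("sp", 16, 42)
hit = sb.load("sp", 16)       # same base register color and displacement
sb.update_register("sp")      # the base register changes, so the color changes
miss = sb.load("sp", 16)
print(hit, miss)
```

Because the signature needs only the register color and the displacement, it can be formed without computing the effective address, which is the property the abstract highlights.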
Program counter based techniques for dynamic power management
C. Gniady, Y. C. Hu, Yung-Hsiang Lu
{"title":"Program counter based techniques for dynamic power management","authors":"C. Gniady, Y. C. Hu, Yung-Hsiang Lu","doi":"10.1109/HPCA.2004.10021","DOIUrl":"https://doi.org/10.1109/HPCA.2004.10021","url":null,"abstract":"Reducing energy consumption has become one of the major challenges in designing future computing systems. We propose a novel idea of using program counters to predict I/O activities in the operating system. We present a complete design of program-counter access predictor (PCAP) that dynamically learns the access patterns of applications and predicts when an I/O device can be shut down to save energy. PCAP uses path-based correlation to observe a particular sequence of program counters leading to each idle period, and predicts future occurrences of that idle period. PCAP differs from previously proposed shutdown predictors in its ability to: (1) correlate I/O operations to particular behavior of the applications and users, (2) carry prediction information across multiple executions of the applications, and (3) attain better energy savings while incurring low mispredictions.","PeriodicalId":145009,"journal":{"name":"10th International Symposium on High Performance Computer Architecture (HPCA'04)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128549250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 53
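The abstract above describes correlating a sequence of program counters with the idle period that follows it. The sketch below shows one simplified way to do this: it remembers the last few program counters that issued I/O, learns any path that was followed by a long idle period, and predicts a shutdown when that path recurs. The sliding-window path, the break-even threshold, and all names are assumptions made for illustration; they are not the PCAP design itself.

```python
from collections import deque

class PathShutdownPredictor:
    """Toy path-based predictor keyed on recent I/O-issuing program counters."""

    def __init__(self, path_length=2, breakeven_s=5.0):
        self.recent = deque(maxlen=path_length)  # most recent program counters
        self.learned = set()                     # paths that preceded long idle periods
        self.breakeven_s = breakeven_s           # hypothetical shutdown break-even time
        self.last_io_time = None

    def on_io(self, pc, now_s):
        """Record an I/O issued from pc at time now_s.
        Returns True if a long idle period is predicted to follow."""
        if self.last_io_time is not None:
            idle = now_s - self.last_io_time
            if idle >= self.breakeven_s and len(self.recent) == self.recent.maxlen:
                self.learned.add(tuple(self.recent))  # this path led to a long idle period
        self.recent.append(pc)
        self.last_io_time = now_s
        return tuple(self.recent) in self.learned

predictor = PathShutdownPredictor(path_length=2, breakeven_s=5.0)
events = [(0x10, 0.0), (0x20, 0.1), (0x10, 10.0), (0x20, 10.1), (0x30, 10.2)]
predictions = [predictor.on_io(pc, t) for pc, t in events]
print(predictions)
```

In this run the path (0x10, 0x20) is followed by a 9.9-second gap the first time, so the predictor flags it on its second occurrence. Persisting `learned` between runs would correspond to carrying prediction information across executions, as the abstract describes.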