IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.最新文献

PowerFITS: Reduce Dynamic and Static I-Cache Power Using Application Specific Instruction Set Synthesis PowerFITS:使用应用特定指令集合成降低动态和静态I-Cache功率

IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005. Pub Date : 2005-03-20 DOI: 10.1109/ISPASS.2005.1430557

A. Cheng, G. Tyson, T. Mudge

{"title":"PowerFITS: Reduce Dynamic and Static I-Cache Power Using Application Specific Instruction Set Synthesis","authors":"A. Cheng, G. Tyson, T. Mudge","doi":"10.1109/ISPASS.2005.1430557","DOIUrl":"https://doi.org/10.1109/ISPASS.2005.1430557","url":null,"abstract":"Power consumption, performance, area, and cost are critical concerns in designing microprocessors for embedded systems such as portable handheld computing and personal telecommunication devices. In previous work [A. Cheng et al., (2004)], we introduced the concept of framework-based instruction-set tuning synthesis (FITS), which is a new instruction synthesis paradigm that falls between a general-purpose embedded processor and a synthesized application specific processor (ASP). We address these design constraints through FITS by improving the code density. A FITS processor improves code density by tailoring the instruction set to the requirement of a target application to reduce the code size. This is achieved by replacing the fixed instruction and register decoding of general purpose embedded processor with programmable decoders that can achieve ASP performance, low power consumption, and compact chip area with the fabrication advantages of a mass produced single chip solution to amortize the cost. Instruction cache has been recognized as one of the most predominant source of power dissipation in a microprocessor. For instance, in Intel's StrongARMprocessor, 27% of total chip power loss goes into the instruction cache [J. Montanaro et al., (1996)]. In this paper, we demonstrate how FITS can be applied to improve the instruction cache power efficiency. Experimental results show that our synthesized instruction sets result in significant power reduction in the instruction cache compared to ARM instructions. For 21 benchmarks from the MiBench suite [M. Guthaus et al., (2001)], our simulation results indicate on average: a 49.4% saving for switching power; a 43.9% saving for internal power; a 14.9% saving for leakage power; a 46.6% saving for total cache power with up to 60.3% saving for peak power","PeriodicalId":230669,"journal":{"name":"IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.","volume":"198 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122843754","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 5

Architectural Characterization of Processor Affinity in Network Processing 网络处理中处理器亲和性的体系结构表征

IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005. Pub Date : 2005-03-20 DOI: 10.1109/ISPASS.2005.1430575

A. Foong, Jason M. Fung, D. Newell, S. Abraham, Peggy Irelan, Alex A. Lopez-Estrada

引用次数: 38

Fast, Accurate Microarchitecture Simulation Using Statistical Phase Detection 使用统计相位检测快速，准确的微架构仿真

IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005. Pub Date : 2005-03-20 DOI: 10.1109/ISPASS.2005.1430569

R. Srinivasan, Jeanine E. Cook, S. Cooper

{"title":"Fast, Accurate Microarchitecture Simulation Using Statistical Phase Detection","authors":"R. Srinivasan, Jeanine E. Cook, S. Cooper","doi":"10.1109/ISPASS.2005.1430569","DOIUrl":"https://doi.org/10.1109/ISPASS.2005.1430569","url":null,"abstract":"Simulation-based microarchitecture research is often hindered by the slow speed of simulators. In this work, we propose a novel statistical technique to identify highly representative unique behaviors or phases in a benchmark based on its IPC (instructions committed per cycle) trace. By simulating the timing of only the unique phases, the cycle-accurate simulation time for the SPEC suite is reduced from 5 months to 5 days, with a significant retention of the original dynamic behavior. Evaluation across many processor configurations within the same architecture family shows that the algorithm is robust. A cost function is provided that enables users to easily optimize the parameters of the algorithm for either simulation speed or accuracy depending on preference. A new measure is introduced to quantify the ability of a simulation speedup technique to retain behavior realized in the original workload. Unlike a first order statistic such as mean value, the newly introduced measure captures important differences in dynamic behavior between the complete and the sampled simulations","PeriodicalId":230669,"journal":{"name":"IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130238879","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 21

Performance Characterization of Java Applications on SMT Processors SMT处理器上Java应用程序的性能表征

IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005. Pub Date : 2005-03-20 DOI: 10.1109/ISPASS.2005.1430565

Wei Huang, Jiang Lin, Zhao Zhang, J. M. Chang

引用次数: 19

Studying Thermal Management for Graphics-Processor Architectures 研究图形处理器架构的热管理

IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005. Pub Date : 2005-03-20 DOI: 10.1109/ISPASS.2005.1430559

J. Sheaffer, K. Skadron, D. Luebke

引用次数: 48

On the provision of prioritization and soft qos in dynamically reconfigurable shared data-centers over infiniband 动态可重构共享数据中心中优先级和软qos的提供

IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005. Pub Date : 2005-03-20 DOI: 10.1109/ISPASS.2005.1430582

P. Balaji, S. Narravula, K. Vaidyanathan, Hyun-Wook Jin, D. Panda

{"title":"On the provision of prioritization and soft qos in dynamically reconfigurable shared data-centers over infiniband","authors":"P. Balaji, S. Narravula, K. Vaidyanathan, Hyun-Wook Jin, D. Panda","doi":"10.1109/ISPASS.2005.1430582","DOIUrl":"https://doi.org/10.1109/ISPASS.2005.1430582","url":null,"abstract":"In the past few years several researchers have proposed and configured data-centers providing multiple independent services, known as shared data-centers. For example, several ISPs and other Web service providers host multiple unrelated Web-sites on their data-centers allowing potential differentiation in the service provided to each of them. Such differentiation becomes essential in several scenarios in a shared data-center environment. In this paper, we extend our previously proposed scheme on dynamic re-configurability to allow service differentiation in the shared data-center environment. In particular, we point out the issues associated with the basic dynamic configurability scheme and propose two extensions to it, namely (i) dynamic reconfiguration with prioritization and (ii) dynamic reconfiguration with prioritization and QoS. Our experimental results show that our extensions can allow the dynamic reconfigurability scheme to attain a performance improvement of up to five times for high priority Web sites irrespective of any background low priority requests. Also, these extensions are able to significantly improve the performance of low priority requests when there are minimal or no high priority requests in the system. Further, they can achieve a similar performance as a static scheme with up to 43% lesser nodes in some cases","PeriodicalId":230669,"journal":{"name":"IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.","volume":"12 6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116817931","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 8

Intrinsic Checkpointing: A Methodology for Decreasing Simulation Time Through Binary Modification 内在检查点:一种通过二值修改减少仿真时间的方法

IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005. Pub Date : 2005-03-20 DOI: 10.1109/ISPASS.2005.1430561

J. Ringenberg, Chris Pelosi, D. Oehmke, T. Mudge

引用次数: 30

Balancing Performance and Reliability in the Memory Hierarchy 在内存层次结构中平衡性能和可靠性

IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005. Pub Date : 2005-03-20 DOI: 10.1109/ISPASS.2005.1430581

H. Asadi, Vilas Sridharan, M. Tahoori, D. Kaeli

引用次数: 115

Enhancing Multiprocessor Architecture Simulation Speed Using Matched-Pair Comparison 利用配对比较提高多处理器体系结构仿真速度

IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005. Pub Date : 2005-03-20 DOI: 10.1109/ISPASS.2005.1430562

M. Ekman, P. Stenström

{"title":"Enhancing Multiprocessor Architecture Simulation Speed Using Matched-Pair Comparison","authors":"M. Ekman, P. Stenström","doi":"10.1109/ISPASS.2005.1430562","DOIUrl":"https://doi.org/10.1109/ISPASS.2005.1430562","url":null,"abstract":"While cycle-level, full-system architecture simulation tools are capable of estimating performance at arbitrary accuracy, the time to simulate an entire application is usually prohibitive. Moreover, simulating multi-threaded applications further exacerbates this problem as most simulation tools are single-threaded. Recently, statistical sampling techniques, such as SMARTS, have managed to bring down the simulation time significantly by making it possible to only simulate about 1% of the code with sufficient accuracy. However, thousands of simulation points throughout the benchmark must still be simulated. First of all, we propose to use the well-established statistical method matched-pair comparison and motivate why this will bring down the number of simulation points needed to achieve a given accuracy. We apply it to single-processor as well as multiprocessor simulation and show that it is capable of reducing the number of needed simulation points by one order of magnitude. Secondly, since we apply the technique to single- as well as multiprocessors, we study for the first time the efficiency of statistical sampling techniques in multiprocessor systems to establish a baseline to compare with. We show theoretically and confirm experimentally, that while the instruction throughput vary significantly on each individual processor, the variability of instruction throughput across processors in a multiprocessor system decreases as we increase the number of processors for some important workloads. Thus, a factor of P fewer simulation points, where P is the number of processors, are needed to begin with when sampling is applied to multiprocessors","PeriodicalId":230669,"journal":{"name":"IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114972882","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 45

Reaping the Benefit of Temporal Silence to Improve Communication Performance 从暂时的沉默中获益，提高沟通表现

IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005. Pub Date : 2005-03-20 DOI: 10.1109/ISPASS.2005.1430580

Kevin M. Lepak, Mikko H. Lipasti

{"title":"Reaping the Benefit of Temporal Silence to Improve Communication Performance","authors":"Kevin M. Lepak, Mikko H. Lipasti","doi":"10.1109/ISPASS.2005.1430580","DOIUrl":"https://doi.org/10.1109/ISPASS.2005.1430580","url":null,"abstract":"Communication misses - those serviced by dirty data in remote caches - are a pressing performance limiter in shared-memory multiprocessors. Recent research has indicated that temporally silent stores can be exploited to substantially reduce such misses, either with coherence protocol enhancements (MESTI); by employing speculation to create atomic silent store-pairs that achieve speculative lock elision (SLE); or by employing load value prediction (LVP). We evaluate all three approaches utilizing full-system, execution-driven simulation, with scientific and commercial workloads, to measure performance. Our studies indicate that accurate detection of elision idioms for SLE is vitally important for delivering robust performance and appears difficult for existing commercial codes. Furthermore, common datapath issues in out-of-order cores cause barriers to speculation and therefore may cause SLE failures unless SLE-specific speculation mechanisms are added to the microarchitecture. We also propose novel prediction and silence detection mechanisms that enable the MESTI protocol to deliver robust performance for all workloads. Finally, we conduct a detailed execution-driven performance evaluation of load value prediction (LVP), another simple method for capturing the benefit of temporally silent stores. We show that while theoretically LVP can capture the greatest fraction of communication misses among all approaches, it is usually not the most effective at delivering performance. This occurs because attempting to hide latency by speculating at the consumer, i.e. predicting load values, is fundamentally less effective than eliminating the latency at the source, by removing the invalidation effect of stores. Applying each method, we observe performance changes in application benchmarks ranging from 1% to 14% for an enhanced version of MESTI, -1.0% to 9% for LVP, -3% to 9% for enhanced SLE, and 2% to 21% for combined techniques","PeriodicalId":230669,"journal":{"name":"IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129525314","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0