{"title":"Aestimo: a feedback-directed optimization evaluation tool","authors":"Paul Berube, J. N. Amaral","doi":"10.1109/ISPASS.2006.1620809","DOIUrl":"https://doi.org/10.1109/ISPASS.2006.1620809","url":null,"abstract":"Published studies that use feedback-directed optimization (FDO) techniques use either a single input for both training and performance evaluation, or a single input for training and a single input for evaluation. Thus an important question is if the FDO results published in the literature are sensitive to the training and testing input selection. Aestimo is a new evaluation tool that uses a workload of inputs to evaluate the sensitivity of specific code transformations to the choice of inputs in the training and testing phases. Aestimo uses optimization logs to isolate the effects of individual code transformations. It incorporates metrics to determine the effect of training input selection on individual compiler decisions. Besides describing the structure of Aestimo, this paper presents a case study that uses SPEC CINT2000 benchmark programs with the Open Research Compiler (ORC) to investigate the effect of training/testing input selection on in-lining and if-conversion. The experimental results indicate that: (1) training input selection affects the compiler decisions made for these code transformation; (2) the choice of training/testing inputs can have a significant impact on measured performance.","PeriodicalId":369192,"journal":{"name":"2006 IEEE International Symposium on Performance Analysis of Systems and Software","volume":"204 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123044568","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A statistical multiprocessor cache model","authors":"Erik Berg, Håkan Zeffer, Erik Hagersten","doi":"10.1109/ISPASS.2006.1620793","DOIUrl":"https://doi.org/10.1109/ISPASS.2006.1620793","url":null,"abstract":"The introduction of general-purpose microprocessors running multiple threads will put a focus on methods and tools helping a programmer to write efficient parallel applications. Such a tool should be fast enough to meet a software developer's need for short turn-around time, but also be accurate and flexible enough to provide trend-correct and intuitive feedback. This paper presents a novel sample-based method for analyzing the data locality of a multithreaded application. Very sparse data is collected during a single execution of the studied application. The architectural-independent information collected during the execution is fed to a mathematical memory-system model for predicting the cache miss ratio. The sparse data can be used to characterize the application's data locality with respect to almost any possible memory system, such as complicated multiprocessor multilevel cache hierarchies. Any combination of cache size, cache-line size and degree of sharing can be modeled. Each modeled design point takes only a fraction of a second to evaluate, even though the application from which the sampled data was collected may have executed for hours. This makes the tool not just usable for software developers, but also for hardware developers who need to evaluate a huge memory-system design space. The accuracy of the method is evaluated using a large number of commercial and technical multi-threaded applications. The result produced by the algorithm is shown to be consistent with results from a traditional (and much slower) architecture simulation.","PeriodicalId":369192,"journal":{"name":"2006 IEEE International Symposium on Performance Analysis of Systems and Software","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126687709","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Naveen Muralimanohar, K. Ramani, R. Balasubramonian
{"title":"Power efficient resource scaling in partitioned architectures through dynamic heterogeneity","authors":"Naveen Muralimanohar, K. Ramani, R. Balasubramonian","doi":"10.1109/ISPASS.2006.1620794","DOIUrl":"https://doi.org/10.1109/ISPASS.2006.1620794","url":null,"abstract":"The ever increasing demand for high clock speeds and the desire to exploit abundant transistor budgets have resulted in alarming increases in processor power dissipation. Partitioned (or clustered) architectures have been proposed in recent years to address scalability concerns in future billion-transistor microprocessors. Our analysis shows that increasing processor resources in a clustered architecture results in a linear increase in power consumption, while providing diminishing improvements in single-thread performance. To preserve high performance to power ratios, we claim that the power consumption of additional resources should be in proportion to the performance improvements they yield. Hence, in this paper, we propose the implementation of heterogeneous clusters that have varying delay and power characteristics. A cluster's performance and power characteristic is tuned by scaling its frequency and novel policies dynamically assign frequencies to clusters, while attempting to either meet a fixed power budget or minimize a metric such as Energy /spl times/ Delay/sup 2/ (ED/sup 2/). By increasing resources in a power-efficient manner, we observe an 11% improvement in ED/sup 2/ and a 22.4% average reduction in peak temperature, when compared to a processor with homogeneous units. Our proposed processor model also provides strategies to handle thermal emergencies that have a relatively low impact on performance.","PeriodicalId":369192,"journal":{"name":"2006 IEEE International Symposium on Performance Analysis of Systems and Software","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125080733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Pereira, Leonardo Silva, Wagner Meira Jr, W. Santos
{"title":"Assessing the impact of reactive workloads on the performance of Web applications","authors":"A. Pereira, Leonardo Silva, Wagner Meira Jr, W. Santos","doi":"10.1109/ISPASS.2006.1620805","DOIUrl":"https://doi.org/10.1109/ISPASS.2006.1620805","url":null,"abstract":"Designing systems with better performance and scalability is a real need to fulfill the user demands and generate profitable Web services. Being able to mimic user behavior and the workload they generate on the servers is fundamental to evaluate the performance of systems and their improvements. One aspect that is usually neglected by workload generators is the user reactivity, that is, how the users react to variable server response time. Further, it is not clear how the reactivity-related changes in the user generated workload affect the server and how these dependences converge. This paper addresses this problem by proposing, implementing, and validating a workload generator that accounts for reactivity while interacting with servers. Our workload generator is used, for instance, to generate workloads based on a TPC-W benchmark. These workloads are used to assess the impacts of reactivity on the performance of a Web application. The results show significant changes in terms of throughput and response time for the experiments, raising the possibility of improving the performance of Web systems considering user reactivity.","PeriodicalId":369192,"journal":{"name":"2006 IEEE International Symposium on Performance Analysis of Systems and Software","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133780146","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Simon Albert, Sven Kalms, Christian Weiss, A. Schramm
{"title":"Acquisition and evaluation of long DDR2-SDRAM access sequences","authors":"Simon Albert, Sven Kalms, Christian Weiss, A. Schramm","doi":"10.1109/ISPASS.2006.1620808","DOIUrl":"https://doi.org/10.1109/ISPASS.2006.1620808","url":null,"abstract":"Trace driven simulation is extensively used in memory system evaluation. Traditional measurement equipment such as logic analyzers currently lack of the capability to record long memory access sequences (e.g. multiple seconds or even entire benchmark runs) due to their limited sampling depth, without altering system behavior. This paper presents a system, that is capable of recording long access sequences in realtime without affecting system operation. For the first time, a classification of the SPEC CPU2000 benchmark suite along main memory access criteria is provided. Furthermore the impact of shared memory graphics on system performance affecting future system simulation methodology is investigated.","PeriodicalId":369192,"journal":{"name":"2006 IEEE International Symposium on Performance Analysis of Systems and Software","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132413068","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Branch trace compression for snapshot-based simulation","authors":"K. Barr, K. Asanović","doi":"10.1109/ISPASS.2006.1620787","DOIUrl":"https://doi.org/10.1109/ISPASS.2006.1620787","url":null,"abstract":"We present a scheme to compress branch trace information for use in snapshot-based microarchitecture simulation. The compressed trace can be used to warm any arbitrary branch predictor's state before detailed microarchitecture simulation of the snapshot. We show that compressed branch traces require less space than snapshots of concrete predictor state. Our branch-predictor based compression (BPC) technique uses a software branch predictor to provide an accurate model of the input branch trace, requiring only mispredictions to be stored in the compressed trace file. The decompressor constructs a matching software branch predictor to help reconstruct the original branch trace from the record of mispredictions. Evaluations using traces from the Journal of ILP branch predictor competition show we achieve compression rates of 0.013-0.72 bits/branch (depending on workload), which is up to 210/spl times/ better than gzip; up to 52/spl times/ better than the best general-purpose compression techniques; and up to 4.4/spl times/ better than recently-published, more general trace compression techniques. Moreover, BPC-compressed traces can be decompressed in less time than corresponding traces compressed with other methods.","PeriodicalId":369192,"journal":{"name":"2006 IEEE International Symposium on Performance Analysis of Systems and Software","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133690010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
T. Wenisch, Roland E. Wunderlich, B. Falsafi, J. Hoe
{"title":"Simulation sampling with live-points","authors":"T. Wenisch, Roland E. Wunderlich, B. Falsafi, J. Hoe","doi":"10.1109/ISPASS.2006.1620785","DOIUrl":"https://doi.org/10.1109/ISPASS.2006.1620785","url":null,"abstract":"Current simulation-sampling techniques construct accurate model state for each measurement by continuously warming large microarchitectural structures (e.g., caches and the branch predictor) while functionally simulating the billions of instructions between measurements. This approach, called functional warming, is the main performance bottleneck of simulation sampling and requires hours of runtime while the detailed simulation of the sample requires only minutes. Existing simulators can avoid functional simulation by jumping directly to particular instruction stream locations with architectural state checkpoints. To replace functional warming, these checkpoints must additionally provide microarchitectural model state that is accurate and reusable across experiments while meeting tight storage constraints. In this paper, we present a simulation-sampling framework that replaces functional warming with live-points without sacrificing accuracy. A live-point stores the bare minimum of functionally-warmed state for accurate simulation of a limited execution window while placing minimal restrictions on microarchitectural configuration. Live-points can be processed in random rather than program order, allowing simulation results and their statistical confidence to be reported while simulations are in progress. Our framework matches the accuracy of prior simulation-sampling techniques (i.e., /spl plusmn/3% error with 99.7% confidence), while estimating the performance of an 8-way out-of-order superscalar processor running SPEC CPU2000 in 91 seconds per benchmark, on average, using a 12 GB live-point library.","PeriodicalId":369192,"journal":{"name":"2006 IEEE International Symposium on Performance Analysis of Systems and Software","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116785635","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating architectural exploration using canonical instruction segments","authors":"Rose F. Liu, K. Asanović","doi":"10.1109/ISPASS.2006.1620786","DOIUrl":"https://doi.org/10.1109/ISPASS.2006.1620786","url":null,"abstract":"Detailed microarchitectural simulators are not well suited for exploring large design spaces due to their excessive simulation times. We introduce AXCIS, a framework for fast and accurate design space exploration. AXCIS achieves fast simulation times by exploiting repetitions in program behavior to reduce the number of instructions simulated. For each dynamic instruction encountered during an initial full run of a benchmark, AXCIS builds an instruction segment, which concisely represents performance-critical information. AXCIS then compresses the string of dynamic segments into a table of canonical instruction segments (CIST) to give a compact representation of the entire benchmark trace. Given a precompiled CIST and a target microarchitecture configuration, AXCIS can quickly and accurately estimate performance metrics such as instructions per cycle (IPC). For the SPEC CPU2000 benchmarks and all simulated configurations, AXCIS achieves an average IPC error of 2.6%. While cycle-accurate simulators can take many hours to simulate billions of dynamic instructions, AXCIS can complete the same simulation on the corresponding CIST within seconds.","PeriodicalId":369192,"journal":{"name":"2006 IEEE International Symposium on Performance Analysis of Systems and Software","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115547069","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Compiler-based adaptive fetch throttling for energy-efficiency","authors":"Huaping Wang, Yao Guo, I. Koren, C. M. Krishna","doi":"10.1109/ISPASS.2006.1620795","DOIUrl":"https://doi.org/10.1109/ISPASS.2006.1620795","url":null,"abstract":"Front-end instruction delivery accounts for a significant fraction of energy consumption in dynamically scheduled superscalar processors. Different front-end throttling techniques have been introduced to reduce the chip-wide energy consumption caused by redundant fetching. Hardware-based techniques, such as flow-based throttling, could reduce the energy consumption considerably, but with a high performance loss. On the other hand, compiler-based IPC-estimation-driven software fetch throttling (CFT) techniques result in relatively low performance degradation, which is desirable for high-performance processors. However, their energy savings are limited by the fact that they typically use a predefined fixed low IPC-threshold to control throttling. In this paper, we propose a compiler-based adaptive fetch throttling (CAFT) technique that allows changing the throttling threshold dynamically at runtime. Instead of using a fixed threshold, our technique uses the decode/issue difference (DID) to assist the fetch throttling decision based on the statically estimated IPC. Changing the threshold dynamically makes it possible to throttle at a higher estimated IPC, thus increasing the throttling opportunities and resulting in larger energy savings. We demonstrate that CAFT could increase the energy savings significantly compared to CFT, while preserving its benefit of low performance loss. Our simulation results show that the proposed technique doubles the energy-delay product (EDP) savings compared to the fixed threshold throttling and achieves a 6.1% average EDP saving.","PeriodicalId":369192,"journal":{"name":"2006 IEEE International Symposium on Performance Analysis of Systems and Software","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129461856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
R. Bell, Rajiv Bhatia, L. John, Jeffrey Stuecheli, J. Griswell, P. Tu, Louis Capps, A. Blanchard, Ravel Thai
{"title":"Automatic testcase synthesis and performance model validation for high performance PowerPC processors","authors":"R. Bell, Rajiv Bhatia, L. John, Jeffrey Stuecheli, J. Griswell, P. Tu, Louis Capps, A. Blanchard, Ravel Thai","doi":"10.1109/ISPASS.2006.1620800","DOIUrl":"https://doi.org/10.1109/ISPASS.2006.1620800","url":null,"abstract":"The latest high-performance IBM PowerPC microprocessor, the POWERS chip, poses challenges for performance model validation. The current state-of-the-art is to use simple hand-coded bandwidth and latency testcases, but these are not comprehensive for processors as complex as the POWER5 chip. Applications and benchmark suites such as SPEC CPU are difficult to set up or take too long to execute on functional models or even on detailed performance models. We present an automatic testcase synthesis methodology to address these concerns. By basing testcase synthesis on the workload characteristics of an application, source code is created that largely represents the performance of the application, but which executes in a fraction of the runtime. We synthesize representative PowerPC versions of the SPEC2000, STREAM, TPC-C and Java benchmarks, compile and execute them, and obtain an average IPC within 2.4% of the average IPC of the original benchmarks and with many similar average workload characteristics. The synthetic testcases often execute two orders of magnitude faster than the original applications, typically in less than 300K instructions, making performance model validation for today's complex processors feasible.","PeriodicalId":369192,"journal":{"name":"2006 IEEE International Symposium on Performance Analysis of Systems and Software","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128478805","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}