{"title":"Hybrid analytical-statistical modeling for efficiently exploring architecture and workload design spaces","authors":"L. Eeckhout, K. D. Bosschere","doi":"10.1109/PACT.2001.953285","DOIUrl":"https://doi.org/10.1109/PACT.2001.953285","url":null,"abstract":"Microprocessor design time and effort are getting impractical due to the huge number of simulations that need to be done to evaluate various processor configurations for various workloads. An early design stage methodology could be useful to efficiently cull huge design spaces to identify regions of interest to be further explored using more accurate simulations. The authors present an early design stage method that bridges the gap between analytical and statistical modeling. The hybrid analytical-statistical method presented is based on the observation that register traffic characteristics exhibit power law properties which allows its to fully characterize a workload with just a few parameters which is much more efficient than the collection of distributions that need to be specified in classical statistical modeling. We evaluate the applicability and the usefulness of this hybrid analytical-statistical modeling technique to efficiently and accurately cull huge architectural design spaces. In addition, we demonstrate that this hybrid analytical-statistical modeling technique can be used to explore the entire workload space by varying just a few workload parameters.","PeriodicalId":276650,"journal":{"name":"Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125004933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the stability of temporal data reference profiles","authors":"Trishul M. Chilimbi","doi":"10.1109/PACT.2001.953296","DOIUrl":"https://doi.org/10.1109/PACT.2001.953296","url":null,"abstract":"Growing computer system complexity has made program optimization based solely on static analyses increasingly difficult. Consequently; many code optimizations incorporate information from program execution profiles. Most memory system optimizations go further and rely primarily on profiles. This reliance on profiles makes off-line optimization effectiveness dependent on profile stability across multiple program runs. While code profiles such as basic block, edge, and branch profiles, have been shown to satisfy this requirement, the stability of data reference profiles, especially temporal data reference profiles that are necessary for cache level optimizations, has neither been studied nor established. This paper shows that temporal data reference profiles expressed in terms of hot data streams, which are data reference sequences that frequently repeat, are quite stable; an encouraging result for memory optimization research. Most hot data streams belong to one of two categories: those that appear in multiple runs with their data elements referenced in the same order, and those with the same set of elements referenced in a different order, and this category membership is extremely stable. In addition, the fraction of hot data streams that belong to the first category is quite large.","PeriodicalId":276650,"journal":{"name":"Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122174480","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A unified modulo scheduling and register allocation technique for clustered processors","authors":"J. M. Codina, Jesús Sánchez, Antonio González","doi":"10.1109/PACT.2001.953298","DOIUrl":"https://doi.org/10.1109/PACT.2001.953298","url":null,"abstract":"This work presents a modulo scheduling framework for clustered ILP processors that integrates the cluster assignment, instruction scheduling and register allocation steps in a single phase. This unified approach is more effective than traditional approaches based on sequentially performing some (or all) of the three steps, since it allows optimizing the global code generation problem instead of searching for optimal solutions to each individual step. Besides, it avoids the iterative nature of traditional approaches, which require repeated applications of the three steps until a valid solution is found. The proposed framework includes a mechanism to insert spill code on-the-fly and heuristics to evaluate the quality of partial schedules considering simultaneously inter-cluster communications, memory pressure and register pressure. Transformations that allow trading pressure on a type of resource for another resource are also included. We show that the proposed technique outperforms previously proposed techniques. For instance, the average speed-up for the SPECfp95 is 36% for a 4-cluster configuration.","PeriodicalId":276650,"journal":{"name":"Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114796074","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the efficiency of reductions in /spl mu/-SIMD media extensions","authors":"J. Corbal, R. Espasa, M. Valero","doi":"10.1109/PACT.2001.953290","DOIUrl":"https://doi.org/10.1109/PACT.2001.953290","url":null,"abstract":"Many important multimedia applications contain a significant fraction of reduction operations. Although, in general, multimedia applications are characterized for having high amounts of Data Level Parallelism, reductions and accumulations are difficult to parallelize and show a poor tolerance to increases in the latency of the instructions. This is specially significant for /spl mu/-SIMD extensions such as MMX or AltiVec. To overcome the problem of reductions in /spl mu/-SIMD ISAs, designers tend to include more and more complex instructions able to deal with the most common forms of reductions in multimedia. As long as the number of processor pipeline stages grows, the number of cycles needed to execute these multimedia instructions increases with every processor generation, severely compromising performance. The paper presents an in-depth discussion of how reductions/accumulations are performed in current /spl mu/-SIMD architectures and evaluates the performance trade-offs for near-future highly aggressive superscalar processors with three different styles of /spl mu/-SIMD extensions. We compare a MMX-like alternative to a MDMX-like extension that has packed accumulators to attack the reduction problem, and we also compare it to MOM, a matrix register ISA. We show that while packed accumulators present several advantages, they introduce artificial recurrences that severely degrade performance for processors with high number of registers and long latency operations. On the other hand, the paper demonstrates that longer SIMD media extensions such as MOM can take great advantage of accumulators by exploiting the associative parallelism implicit in reductions.","PeriodicalId":276650,"journal":{"name":"Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127320526","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Huiyang Zhou, Mark C. Toburen, E. Rotenberg, T. Conte
{"title":"Adaptive mode control: a static-power-efficient cache design","authors":"Huiyang Zhou, Mark C. Toburen, E. Rotenberg, T. Conte","doi":"10.1109/PACT.2001.953288","DOIUrl":"https://doi.org/10.1109/PACT.2001.953288","url":null,"abstract":"Lower threshold voltages in deep sub-micron technologies cause store leakage current, increasing static power dissipation. This trend, combined with the trend of larger/more cache memories dominating die area, has prompted circuit designers to develop SRAM cells with low-leakage operating modes (e.g., sleep mode). Sleep mode reduces static power dissipation but data stored in a sleeping cell is unreliable or lost. So, at the architecture level, there is interest in exploiting sleep mode to reduce static power dissipation while maintaining high performance. Current approaches dynamically control the operating mode of large groups of cache lines or even individual cache lines. However, the performance monitoring mechanism that controls the percentage of sleep-mode lines, and identifies particular lines for sleep mode, is somewhat arbitrary. There is no way to know what the performance could be with all cache lines active, so arbitrary miss rate targets are set (perhaps on a per-benchmark basis using profile information) and the control mechanism tracks these targets. We propose applying sleep mode only to the data store and not the tag store. By keeping the entire tag store active, the hardware knows what the hypothetical miss rate would be if all data lines were active and the actual miss rate can be made to precisely track it. Simulations show an average of 73% of I-cache lines and 54% of D-cache lines are put in sleep mode with an average IPC impact of only 1.7%, for 64KB caches.","PeriodicalId":276650,"journal":{"name":"Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122342936","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Recovery mechanism for latency misprediction","authors":"E. Morancho, J. Llabería, À. Olivé","doi":"10.1109/PACT.2001.953293","DOIUrl":"https://doi.org/10.1109/PACT.2001.953293","url":null,"abstract":"Signalling result availability from the functional units to the instruction scheduler can increase the cycle time and/or the effective latency of the instructions. The knowledge of all instruction latencies would allow the instruction scheduler to operate without the need for external signalling. However, the latency of some instructions is unknown; but, the scheduler can optimistically predict the latency of these instructions and speculatively issue their dependent instructions. Although prediction techniques have great performance potential, their gain can vanish due to misprediction handling. For instance, holding speculatively scheduled instructions in the issue queue reduces its capacity to lookahead for independent instructions. The paper evaluates a recovery mechanism for latency mispredictions that retains the speculatively issued instructions in a structure apart from the issue queue: the recovery buffer. When data becomes available after a latency misprediction, the dependent instructions will be re-issued from the recovery buffer. Moreover in order to simplify the reissue logic of the recovery buffer, the instructions will be recorded in issue order. On mispredictions, the recovery buffer increases the effective capacity of the issue queue to hold instructions waiting for operands. Our evaluations in integer benchmarks show that the recovery buffer mechanism reduces issue-queue size requirements by about 20-25%. Also, this mechanism is less sensitive to the verification delay than the recovery mechanism that retains the instructions in the issue queue.","PeriodicalId":276650,"journal":{"name":"Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131255200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bharat Chandramouli, J. Carter, Wilson C. Hsieh, S. Mckee
{"title":"A cost framework for evaluating integrated restructuring optimizations","authors":"Bharat Chandramouli, J. Carter, Wilson C. Hsieh, S. Mckee","doi":"10.1109/PACT.2001.953294","DOIUrl":"https://doi.org/10.1109/PACT.2001.953294","url":null,"abstract":"Loop transformations and array restructuring optimizations usually improve performance by increasing the memory locality of applications, but not always. For instance, loop and array restructuring can either complement or compete with one another. Previous research has proposed integrating loop and array restructuring, but there existed no analytic framework for determining how best to combine the optimizations for a given program. Since the choice of which optimizations to apply, alone or in combination, is highly application and input-dependent, a cost framework is needed if integrated restructuring is to be automated by an optimizing compiler. To this end, we develop a cost model that considers standard loop optimizations along with two potential forms of array restructuring: conventional copying-based restructuring and remapping-based restructuring that exploits a smart memory controller. We simulate eight applications on a variety of input sizes and with a variety of hand-applied restructuring optimizations. We find that employing a fixed strategy does not always deliver the best performance. Finally; our cost model accurately predicts the best combination of restructuring optimizations among those we examine, and yields performance within a geometric mean of 5% of the best combination across all benchmarks and input sizes.","PeriodicalId":276650,"journal":{"name":"Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131997594","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comparing and combining read miss clustering and software prefetching","authors":"Vijay S. Pai, S. Adve","doi":"10.1109/PACT.2001.953310","DOIUrl":"https://doi.org/10.1109/PACT.2001.953310","url":null,"abstract":"A recent latency tolerance technique, read-miss clustering, restructures code to send demand-miss references in parallel to the underlying memory system. An alternative, widely-used latency tolerance technique is software prefetching, which initiates data fetches ahead of expected demand-miss references by a certain distance. Since both techniques seem to target the same types of latencies and use the same system resources, it is unclear which technique is superior or if both can be combined. This paper shows that these two techniques are actually mutually beneficial, each helping to overcome limitations of the other: We perform our study for uniprocessor and multiprocessor configurations, in simulation and on a real machine (the Convex Exemplar). Compared to prefetching alone (the state-of-the-art implemented in systems today), the combination of the two techniques reduces the execution time by an average of 21% across all cases studied in simulation, and by an average of 16% for 5 out of 10 cases on the Exemplar. The combination sees execution time reductions relative to clustering alone averaging 15% for 6 out of 11 cases in simulation and 20% for 6 out of 10 cases on the Exemplar.","PeriodicalId":276650,"journal":{"name":"Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117079415","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using dataflow based context for accurate value prediction","authors":"Renju Thomas, M. Franklin","doi":"10.1109/PACT.2001.953292","DOIUrl":"https://doi.org/10.1109/PACT.2001.953292","url":null,"abstract":"We explore the reasons behind the rather low prediction accuracy of existing data value predictors. Our studies show that contexts formed only from the outcomes of the last several instances of a static instruction do not always encapsulate all of the information required for correct prediction. Complex interactions between data flow and control flow change the context in ways that result in predictability loss for a significant number of dynamic instructions. For improving the prediction accuracy, we propose the concept of using contexts derived from the predictable portions of the data flow graph. That is, the predictability of hard-to-predict instructions can be improved by taking advantage of the predictability of the easy-to-predict instructions that precede it in the data flow graph. We propose and investigate a run-time scheme for producing such an improved context from the predicted values of previous instructions. We also propose a novel predictor called dynamic dataflow-inherited speculative context (DDISC) based predictor for specifically predicting hard-to-predict instructions. Simulation results verify that the use of dataflow-based contexts yields significant improvements in prediction accuracies, ranging from, 35% to 99%. This translates to an overall prediction accuracy of 68% to 99.9%.","PeriodicalId":276650,"journal":{"name":"Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126022497","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reactive-associative caches","authors":"Brannon Batson, T. N. Vijaykumar","doi":"10.1109/PACT.2001.953287","DOIUrl":"https://doi.org/10.1109/PACT.2001.953287","url":null,"abstract":"While set-associative caches typically incur fewer misses than direct-mapped caches, set-associative caches have slower hit tithes. We propose the reactive-associative cache (r-a cache), which provides flexible associativity by placing most blocks in direct-mapped positions and reactively displacing only conflicting blocks to set-associative positions. The r-a cache uses way-prediction (like the predictive associative cache, PSA) to access displaced blocks on the initial probe. Unlike PSA, however, the r-a cache employs a novel feedback mechanism to prevent unpredictable blocks from being displaced. Reactive displacement and feedback allow the r-a cache to use a novel PC-based way-prediction and achieve high accuracy; without impractical block swapping as in column associative and group associative, and without relying on timing-constrained XOR way prediction. A one-port, 4-way r-a cache achieves up to 9% speedup over a direct-mapped cache and performs within 2% of an idealized 2-way set-associative, 1-cycle cache. A 4-way r-a cache achieves up to 13% speedup over a PSA cache, with both r-a and PSA rising the PC scheme. CACTI estimates that for sizes larger than 8KB, a 4-way r-a cache is within 1% of direct-mapped hit times, and 24% faster than a 2-way set-associative cache.","PeriodicalId":276650,"journal":{"name":"Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116079206","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}