2006 IEEE International Symposium on Performance Analysis of Systems and Software: latest articles

Characterizing the branch misprediction penalty

2006 IEEE International Symposium on Performance Analysis of Systems and Software Pub Date : 2006-03-19 DOI: 10.1109/ISPASS.2006.1620789

Stijn Eyerman, James E. Smith, L. Eeckhout

Citations: 59

Performance modeling and prediction for scientific Java applications

2006 IEEE International Symposium on Performance Analysis of Systems and Software Pub Date : 2006-03-19 DOI: 10.1109/ISPASS.2006.1620804

Rui Zhang, Zoran Budimlic, K. Kennedy

Abstract: With the expansion of the Internet, the grid has become an attractive platform for scientific computing. Java, with a platform-independent execution model and built-in support for distributed computing, is an inviting choice for implementation of applications intended for grid execution. Recent work has shown that an accurate performance model combined with a load-balancing scheduling strategy can significantly improve the performance of distributed applications on a heterogeneous computing platform, such as the grid. However, current performance modeling techniques are not suitable for Java applications, as the virtual machine execution model presents several difficulties: 1) a significant amount of time is spent on compilation at the beginning of the execution, 2) the virtual machine continuously profiles and recompiles the code during the execution, 3) garbage collection can have unpredictable effects on the memory hierarchy, 4) some applications can spend more time garbage collecting than computing for certain heap sizes, and 5) small variations in virtual machine implementation can have a large impact on the application's behavior. In this paper, we present a practical profile-based strategy for performance modeling of Java scientific applications intended for execution on the grid. We introduce two novel concepts for the Java execution model: point of predictability (PoP) and point of unpredictability (PoU). PoP accounts for the volatile nature of the effects of the virtual machine on execution time for small problem sizes. PoU accounts for the effects of garbage collection on certain applications that have a memory footprint that approaches the total heap size. We present an algorithm for determining PoP and PoU for Java applications, given the hardware platform, virtual machine and heap size. We also present a code-instrumentation-based mechanism for building the algorithm complexity model for a given application. We introduce a technique for calibrating this model that is able to accurately predict the execution time of Java programs for problem sizes between PoP and PoU. Our preliminary experiments show that these techniques can achieve load balancing with more than 90% average CPU utilization.

Citations: 3

Comparing simulation techniques for microarchitecture-aware floorplanning

2006 IEEE International Symposium on Performance Analysis of Systems and Software Pub Date : 2006-03-19 DOI: 10.1109/ISPASS.2006.1620792

Vidyasagar Nookala, Ying Chen, D. Lilja, S. Sapatnekar

Abstract: Due to the long simulation times of the reference input sets, microarchitects resort to alternative techniques to speed up cycle-accurate simulations. However, the reduction in the runtimes comes with an associated loss of accuracy in replicating the characteristics of the reference sets. In addition, the effect of these inaccuracies on the overall performance can vary across different microarchitecture optimizations or enhancements. In this work, we study and compare two such techniques, reduced input sets and statistical sampling, in the context of microarchitecture-aware floorplanning, a physical design stage, where the objective is to find an IPC-optimal global placement of the blocks of a microprocessor. The variation in IPC is due to the insertion of additional flip-flops on some across-chip wires of the processor that have multicycle delays in nanometer technology nodes. The objective of IPC-aware floorplanning is to minimize the amount of pipelining required by the system buses that are critical in determining the system performance. Our results indicate that, although the two techniques exhibit contrasting behavior in quantifying the criticality of bus latencies, the ensuing floorplanning optimization process results in almost identical performance improvements for both reduced input sets and sampling. The reason behind this is that, for discrete optimization problems such as IPC-aware floorplanning, a reasonably accurate relative ordering of performance bottlenecks is sufficient; absolute accuracy is not necessary.

Citations: 2

Revisiting the performance impact of branch predictor latencies

2006 IEEE International Symposium on Performance Analysis of Systems and Software Pub Date : 2006-03-19 DOI: 10.1109/ISPASS.2006.1620790

G. Loh

Citations: 12

Friendly fire: understanding the effects of multiprocessor prefetches

2006 IEEE International Symposium on Performance Analysis of Systems and Software Pub Date : 2006-03-19 DOI: 10.1109/ISPASS.2006.1620802

Natalie D. Enright Jerger, Eric L. Hill, Mikko H. Lipasti

Abstract: Modern processors attempt to overcome increasing memory latencies by anticipating future references and prefetching those blocks from memory. The behavior and possible negative side effects of prefetching schemes are fairly well understood for uniprocessor systems. However, in a multiprocessor system a prefetch can steal read and/or write permissions for shared blocks from other processors, leading to permission thrashing and overall performance degradation. In this paper, we present a taxonomy that classifies the effects of multiprocessor prefetches. We also present a characterization of the effects of four different hardware prefetching schemes - sequential prefetching, content-directed data prefetching, wrong path prefetching and exclusive prefetching - in a bus-based multiprocessor system. We show that accuracy and coverage are inadequate metrics for describing prefetching in a multiprocessor; rather, we also need to understand what fraction of prefetches interferes with remote processors. We present an upper bound on the performance of various prefetching algorithms if no harmful prefetches are issued, and suggest prefetch filtering schemes that can accomplish this goal.

Citations: 23

MESA: reducing cache conflicts by integrating static and run-time methods

2006 IEEE International Symposium on Performance Analysis of Systems and Software Pub Date : 2006-03-19 DOI: 10.1109/ISPASS.2006.1620803

Xiaoning Ding, Dimitrios S. Nikolopoulos, Song Jiang, Xiaodong Zhang

Abstract: The paper proposes MESA (Multicoloring with Embedded Skewed Associativity), a novel cache indexing scheme that integrates dynamic page coloring with static skewed associativity to reduce conflicts in L2/L3 caches with a small degree of associativity. MESA associates multiple cache pages (colors) with each virtual memory page and uses two-level skewed associativity, first to map a page to a different color in each bank of the cache, and then to disperse the lines of a page across the banks and within the colors of the page. MESA is a multi-grained cache indexing scheme that combines the best of two worlds, page coloring and skewed associativity. We also propose a novel cache management scheme based on page remapping, which uses cache miss imbalance between colors in each bank as the metric to track conflicts and trigger remapping. We evaluate MESA using 24 benchmarks from multiple application domains and with various degrees of sensitivity to conflict misses, on both an in-order issue processor (using complete system simulation) and an out-of-order issue processor (using SimpleScalar). MESA outperforms skewed associativity, prime modulo hashing, and dynamic page coloring schemes proposed earlier. Compared to a 4-way associative cache, MESA can provide as much as 76% improvement in IPC.

Citations: 12

Improved stride prefetching using extrinsic stream characteristics

2006 IEEE International Symposium on Performance Analysis of Systems and Software Pub Date : 2006-03-19 DOI: 10.1109/ISPASS.2006.1620801

H. Al-Sukhni, James Holt, D. Connors

Abstract: Stride-based prefetching mechanisms exploit regular streams of memory accesses to hide memory latency. While these mechanisms are effective, they can be improved by studying the properties of regular streams. As evidence of this, the establishment of metrics to quantify intrinsic characteristics of regular streams has been shown to enable software-based code optimizations. In this paper we extend previously identified regular stream metrics to quantify extrinsic characteristics of regular streams, and show how these new metrics can be employed to improve the efficiency of stride prefetching. The extrinsic metrics we introduce are stream affinity and stream density. Stream affinity enables prefetching for short streams that were previously ignored by stride prefetching mechanisms. Stream density enables a prioritization mechanism that dynamically selects amongst available streams in favor of those that promise more miss coverage, and provides thrashing control amongst several coexisting streams. Finally, we show that using intrinsic and extrinsic stream metrics in combination allows a novel hardware technique for controlling prefetch ahead distance (PAD) which dynamically adjusts the prefetch launch time to better enable timely prefetches while minimizing cache pollution. For a representative set of SPEC2K traces, our techniques consistently outperform our implementation of the closest previously reported stride-based prefetching technique.

Citations: 6
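
For readers unfamiliar with the baseline this entry builds on, the following is a minimal sketch of a conventional per-instruction stride detector. It is not the mechanism proposed in the paper: the stream affinity, stream density, and adaptive PAD techniques are not modeled. The class names, the `threshold` parameter, and the fixed `distance` parameter are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class StreamEntry:
    """State tracked for one load instruction (hypothetical layout)."""
    last_addr: int
    stride: int = 0
    confidence: int = 0


class StridePrefetcher:
    """Toy per-PC stride detector; an illustrative baseline only."""

    def __init__(self, threshold: int = 2, distance: int = 1) -> None:
        self.table: Dict[int, StreamEntry] = {}
        self.threshold = threshold  # stride repeats required before prefetching
        self.distance = distance    # fixed number of strides to run ahead

    def access(self, pc: int, addr: int) -> Optional[int]:
        """Record one access; return an address to prefetch, or None."""
        entry = self.table.get(pc)
        if entry is None:
            # First access from this instruction: nothing to compare against.
            self.table[pc] = StreamEntry(last_addr=addr)
            return None
        stride = addr - entry.last_addr
        if stride != 0 and stride == entry.stride:
            entry.confidence += 1   # same stride seen again
        else:
            entry.stride = stride   # new candidate stride, start over
            entry.confidence = 0
        entry.last_addr = addr
        if entry.confidence >= self.threshold:
            return addr + entry.stride * self.distance
        return None
```

With the default parameters, feeding addresses 100, 164, 228, 292 from the same instruction yields no prefetch for the first three accesses and a prefetch of address 356 on the fourth. A short stream ends before the confidence threshold is reached, which is the kind of case the abstract says stream affinity is meant to cover.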

Critical path analysis of the TRIPS architecture

2006 IEEE International Symposium on Performance Analysis of Systems and Software Pub Date : 2006-03-19 DOI: 10.1109/ISPASS.2006.1620788

R. Nagarajan, Xia Chen, Robert G. McDonald, D. Burger, S. Keckler

Abstract: Fast, accurate, and effective performance analysis is essential for the design of modern processor architectures and improving application performance. Recent trends toward highly concurrent processors make this goal increasingly difficult. Conventional techniques, based on simulators and performance monitors, are ill-equipped to analyze how a plethora of concurrent events interact and how they affect performance. Prior research has shown the utility of critical path analysis in solving this problem. This analysis abstracts the execution of a program with a dependence graph. With simple manipulations on the graph, designers can gain insights into the bottlenecks of a design. This paper extends critical path analysis to understand the performance of a next-generation, high-ILP architecture. The TRIPS architecture introduces new features not present in conventional superscalar architectures. We show how dependence constraints introduced by these features, specifically the execution model and operand communication links, can be modeled with a dependence graph. We describe a new algorithm that tracks critical path information at a fine-grained level and yet can deliver an order of magnitude (30x) improvement in performance over previously proposed techniques. Finally, we provide a breakdown of the critical path for a select set of benchmarks and show an example where we use this information to improve the performance of a heavily-hand-optimized program by as much as 11%.

Citations: 22
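
The dependence-graph abstraction mentioned in the abstract above can be illustrated with a generic longest-path computation over a weighted DAG. This is a textbook sketch, not the fine-grained tracking algorithm the paper describes. The function name `critical_path`, the edge format, and the example event names and latencies are assumptions made for illustration.

```python
from collections import defaultdict, deque


def critical_path(edges):
    """Return (length, path) of the longest path in a weighted DAG.

    edges: iterable of (src, dst, latency) tuples, where an edge means
    dst cannot start until latency cycles after src (assumed format).
    """
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for u, v, w in edges:
        succ[u].append((v, w))
        indeg[v] += 1
        nodes.update((u, v))
    if not nodes:
        return 0, []

    dist = {n: 0 for n in nodes}     # longest distance found so far
    pred = {n: None for n in nodes}  # predecessor on that longest path
    ready = deque(sorted(n for n in nodes if indeg[n] == 0))
    visited = 0
    while ready:                     # Kahn-style topological traversal
        u = ready.popleft()
        visited += 1
        for v, w in succ[u]:
            if dist[u] + w > dist[v]:
                dist[v] = dist[u] + w
                pred[v] = u
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    if visited != len(nodes):
        raise ValueError("dependence graph contains a cycle")

    end = max(sorted(nodes), key=dist.get)  # node that finishes last
    path = []
    while end is not None:                  # walk predecessors back to a source
        path.append(end)
        end = pred[end]
    return dist[path[0]], path[::-1]
```

For the hypothetical edges `("fetch", "decode", 1)`, `("decode", "load", 1)`, `("load", "add", 3)`, `("decode", "mul", 1)`, `("mul", "add", 2)`, the function returns `(5, ["fetch", "decode", "load", "add"])`: the load chain, not the multiply chain, bounds execution time, so shortening the multiply would not help.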

ATTILA: a cycle-level execution-driven simulator for modern GPU architectures

2006 IEEE International Symposium on Performance Analysis of Systems and Software Pub Date : 2006-03-19 DOI: 10.1109/ISPASS.2006.1620807

Victor Moya Del Barrio, Carlos González, Jordi Roca, Agustín Fernández, R. Espasa

Citations: 110

Modeling TCAM power for next generation network devices

2006 IEEE International Symposium on Performance Analysis of Systems and Software Pub Date : 2006-03-19 DOI: 10.1109/ISPASS.2006.1620796

B. Agrawal, T. Sherwood

Abstract: Applications in computer networks often require high throughput access to large data structures for lookup and classification. Many advanced algorithms exist to speed these search primitives on network processors, general purpose machines, and even custom ASICs. However, supporting these applications with standard memories requires very careful analysis of access patterns, and achieving worst case performance can be quite difficult and complex. A simple solution is often possible if a Ternary CAM (content addressable memory) is used to perform a fully parallel search across the entire data set. Unfortunately, this parallelism means that large portions of the chip are switching during each cycle, causing large amounts of power to be consumed. While researchers have begun to explore new ways of managing the power consumption, quantifying design alternatives is difficult due to a lack of available models. In this paper, we examine the structure inside a modern TCAM and present a simple, yet accurate, power model. We present techniques to estimate the dynamic power consumption of a large TCAM. We validate the model using industrial TCAM datasheets and prior published works. We present an extensive analysis of the model by varying various architectural parameters. We also describe how new network algorithms have the potential to address the growing problem of power management in next-generation network devices.

Citations: 117
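
As background for the TCAM entry above, the sketch below shows what a ternary match computes: each stored row has a value and a mask, and masked-out bits are "don't care". It is a functional illustration only and says nothing about the paper's power model. The function name `tcam_lookup`, the `(value, care_mask, result)` row format, and the 8-bit example table are hypothetical.

```python
from typing import List, Optional, Tuple

Entry = Tuple[int, int, str]  # (value, care_mask, result); assumed row format


def tcam_lookup(table: List[Entry], key: int) -> Optional[str]:
    """Return the result of the first row matching key, or None.

    A bit takes part in the comparison only where care_mask is 1; bits
    where care_mask is 0 are "don't care". Hardware compares every row
    in the same cycle and a priority encoder picks the lowest matching
    index. This loop reproduces only that outcome, not the parallelism
    or its switching cost.
    """
    for value, care_mask, result in table:
        if ((key ^ value) & care_mask) == 0:
            return result
    return None


# Hypothetical 8-bit prefix table, ordered from most to least specific.
EXAMPLE_TABLE: List[Entry] = [
    (0b10110000, 0b11110000, "port A"),   # matches 1011xxxx
    (0b10000000, 0b11000000, "port B"),   # matches 10xxxxxx
    (0b00000000, 0b00000000, "default"),  # matches anything
]
```

With `EXAMPLE_TABLE`, the key `0b10111010` resolves to `"port A"`, `0b10011111` to `"port B"`, and `0b01000000` to `"default"`. Every row is compared against every key, which is the source of the per-lookup switching activity the abstract identifies as the power problem.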