{"title":"Pinpointing data locality bottlenecks with low overhead","authors":"Xu Liu, J. Mellor-Crummey","doi":"10.1109/ISPASS.2013.6557169","DOIUrl":"https://doi.org/10.1109/ISPASS.2013.6557169","url":null,"abstract":"A wide gap exists between the speed of modern processors and memory subsystems. As a result, long latencies associated with fetching data from memory often significantly degrade execution performance. To aid with program tuning, application developers need tools that analyze memory access patterns and guide them in reusing data in the fastest levels of a system's memory hierarchy. In this paper, we describe a novel, efficient, and effective tool for data locality measurement and analysis. Unlike other tools, our tool uses both statistical PMU sampling to quantify the cost of data locality bottlenecks and cache simulation to compute reuse distance to diagnose the causes of locality problems. This approach enables us to collect rich information that provides insight into a program's data locality problems. Our tool attributes quantitative measurements of observed memory latency to program variables and dynamically allocated data, as well as code. Our tool identifies data touched by reuse pairs and the accesses involved, identified with their full calling context. Finally, our tool employs both sampling and parallelization to accelerate the computation of representative reuse distance information. Experiments show that with an overhead of only about 13%, our tool provides detailed insights that enabled us to make non-trivial improvements to memory-bound HPC benchmarks.","PeriodicalId":299172,"journal":{"name":"2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114290616","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A statistical machine learning based modeling and exploration framework for run-time cross-stack energy optimization","authors":"Changshu Zhang, A. Ravindran","doi":"10.1109/ISPASS.2013.6557161","DOIUrl":"https://doi.org/10.1109/ISPASS.2013.6557161","url":null,"abstract":"As the complexity of many-core processors grows, meeting performance, energy, temperature, reliability, and noise requirements under dynamically changing operating conditions requires run-time optimization of all parts of the computing stack - architecture, system software, and applications. Unfortunately, the combination of design parameters for the entire computing stack results in an operating space of millions of points that must be explored and evaluated at run-time. In this paper, we present a statistical machine learning (SML) based modeling framework that can be used to rapidly explore such vast operating spaces. We construct a multivariate adaptive regression spline (MARS) based model that uses a number of architecture and application parameters as predictor variables to predict performance and power. We then use a Pareto-front exploring evolutionary algorithm to determine operating points for optimal power and performance. The operating points constituting the Pareto front are stored in look-up tables for runtime use. The proposed framework is applied to an x264 video encoding application executing on a quad core processor. The microarchitectural predictor variables include core and cache parameters. The application predictor variables include the video resolution and the visual quality determined by the choice of the motion estimation algorithm. The model outputs the average frames per second (FPS) and the average power consumption. The MARS model has R2 values of 0.9657 and 0.9467 for FPS and power, respectively. For a video frame resolution of 480x320 and an FPS of 20, a power saving of 55% can be obtained by jointly tuning the microarchitectural parameters and the visual quality.","PeriodicalId":299172,"journal":{"name":"2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126496847","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Characterizing the microarchitectural side effects of operating system calls","authors":"A. Mayberry, Matthew Laquidara, C. Weems","doi":"10.1109/ISPASS.2013.6557158","DOIUrl":"https://doi.org/10.1109/ISPASS.2013.6557158","url":null,"abstract":"We measure the collateral effect of system calls on microarchitectural state using a validated, cycle-accurate simulator. Our results demonstrate that, in some cases, the disruption of user-mode performance is significant. This disruption varies by the operating system and even the kernel version in use.","PeriodicalId":299172,"journal":{"name":"2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"26 11","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120812748","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Synergistic coupling of SSD and hard disk for QoS-aware virtual memory","authors":"Ke Liu, Xuechen Zhang, K. Davis, Song Jiang","doi":"10.1109/ISPASS.2013.6557143","DOIUrl":"https://doi.org/10.1109/ISPASS.2013.6557143","url":null,"abstract":"With significant advantages in capacity, power consumption, and price, the solid state disk (SSD) has good potential to be employed as an extension of DRAM (memory), such that applications with large working sets could run efficiently on a modestly configured system. While initial results reported in recent works show promising prospects for this use of SSD by incorporating it into the management of virtual memory, frequent writes from write-intensive programs could quickly wear out the SSD, making the idea less practical. We propose a scheme, HybridSwap, that integrates a hard disk with an SSD for virtual memory management, synergistically achieving the advantages of both. In addition, HybridSwap can constrain performance loss caused by swapping according to user-specified QoS requirements. To minimize writes to the SSD without undue performance loss, HybridSwap sequentially swaps a set of pages of virtual memory to the hard disk if they are expected to be read together. Using a history of page access patterns, HybridSwap dynamically creates an out-of-memory virtual memory page layout on the swap space spanning the SSD and hard disk such that random reads are served by the SSD and sequential reads are asynchronously served by the hard disk with high efficiency. In practice HybridSwap can effectively exploit the aggregate bandwidth of the two devices to accelerate page swapping. We have implemented HybridSwap in a recent Linux kernel, version 2.6.35.7. Our evaluation with representative benchmarks, such as Memcached for key-value storage, and scientific programs from the ALGLIB cross-platform numerical analysis and data processing library, shows that the number of writes to the SSD can be reduced by 40% with the system's performance comparable to that with pure SSD swapping, and can satisfy a swapping-related QoS requirement as long as","PeriodicalId":299172,"journal":{"name":"2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128220275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Trace filtering of multithreaded applications for CMP memory simulation","authors":"Alejandro Rico, Alex Ramírez, M. Valero","doi":"10.1109/ISPASS.2013.6557160","DOIUrl":"https://doi.org/10.1109/ISPASS.2013.6557160","url":null,"abstract":"Recent works have shown that modelling the performance of out-of-order superscalar cores is doable using filtered memory traces for single-thread simulations. However, those techniques do not account for cache coherence actions, so they cannot be used reliably in multithreaded scenarios. In this paper, we leverage the structure of parallel applications to propose a simulation methodology that enables the use of filtered memory traces for the simulation of multithreaded applications on multicore architectures. In our experiments our proposal reduced the simulation error of state-of-the-art techniques by 39% on average, while only losing 9.5% of simulation speedup.","PeriodicalId":299172,"journal":{"name":"2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131045504","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"McSimA+: A manycore simulator with application-level+ simulation and detailed microarchitecture modeling","authors":"Jung Ho Ahn, Sheng Li, O. Seongil, N. Jouppi","doi":"10.1109/ISPASS.2013.6557148","DOIUrl":"https://doi.org/10.1109/ISPASS.2013.6557148","url":null,"abstract":"With their significant performance and energy advantages, emerging manycore processors have also brought new challenges to the architecture research community. Manycore processors are highly integrated complex system-on-chips with complicated core and uncore subsystems. The core subsystems can consist of a large number of traditional and asymmetric cores. The uncore subsystems have also become unprecedentedly powerful and complex with deeper cache hierarchies, advanced on-chip interconnects, and high-performance memory controllers. In order to conduct research for emerging manycore processor systems, a microarchitecture-level and cycle-level manycore simulation infrastructure is needed. This paper introduces McSimA+, a new timing simulation infrastructure, to meet these needs. McSimA+ models x86-based asymmetric manycore microarchitectures in detail for both core and uncore subsystems, including a full spectrum of asymmetric cores from single-threaded to multithreaded and from in-order to out-of-order, sophisticated cache hierarchies, coherence hardware, on-chip interconnects, memory controllers, and main memory. McSimA+ is an application-level+ simulator, offering a middle ground between a full-system simulator and an application-level simulator. Therefore, it enjoys the light weight of an application-level simulator and the full control of threads and processes as in a full-system simulator. This paper also explores an asymmetric clustered manycore architecture that can reduce the thread migration cost to achieve a noticeable performance improvement compared to a state-of-the-art asymmetric manycore architecture.","PeriodicalId":299172,"journal":{"name":"2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115062865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Non-determinism and overcount on modern hardware performance counter implementations","authors":"Vincent M. Weaver, D. Terpstra, S. Moore","doi":"10.1109/ISPASS.2013.6557172","DOIUrl":"https://doi.org/10.1109/ISPASS.2013.6557172","url":null,"abstract":"Ideal hardware performance counters provide exact deterministic results. Real-world performance monitoring unit (PMU) implementations do not always live up to this ideal. Events that should be exact and deterministic (such as retired instructions) show run-to-run variation and overcount on x86_64 machines, even when run in strictly controlled environments. These effects are non-intuitive to casual users and cause difficulties when strict determinism is desirable, such as when implementing deterministic replay or deterministic threading libraries. We investigate eleven different x86_64 CPU implementations and discover the sources of divergence from expected count totals. Of all the counter events investigated, we find only a few that exhibit enough determinism to be used without adjustment in deterministic execution environments. We also briefly investigate ARM, IA64, POWER, and SPARC systems and find that on these platforms the counter events have more determinism. We explore various methods of working around the limitations of the x86_64 events, but in many cases this is not possible and would require architectural redesign of the underlying PMU.","PeriodicalId":299172,"journal":{"name":"2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128261109","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Selecting benchmark combinations for the evaluation of multicore throughput","authors":"Ricardo A. Velásquez, P. Michaud, André Seznec","doi":"10.1109/ISPASS.2013.6557168","DOIUrl":"https://doi.org/10.1109/ISPASS.2013.6557168","url":null,"abstract":"Most high-performance processors today are able to execute multiple threads of execution simultaneously. Threads share processor resources, like the last-level cache, which may decrease throughput in a non-obvious way, depending on the threads' characteristics. Computer architects usually study multiprogrammed workloads by considering a set of benchmarks and some combinations of these benchmarks. Because detailed microarchitecture simulators are slow, we want a subset of combinations that is as small as possible, yet representative. However, there is no standard method for selecting such a sample, and different authors have used different methods. It is not clear how the choice of a particular sample impacts the conclusions of a study. We propose and compare different sampling methods for defining multiprogrammed workloads for computer architecture studies. We evaluate their effectiveness with a case study, the comparison of several multicore last-level cache replacement policies. We show that random sampling, the simplest method, is a possible way to define a representative workload sample, provided the sample is large enough. We propose a method for estimating the required sample size based on fast approximate simulation. We also propose a new method, workload stratification, which is very effective at reducing the sample size in situations where random sampling would require large samples.","PeriodicalId":299172,"journal":{"name":"2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132823582","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Increasing the Transparent Page Sharing in Java","authors":"Kazunori Ogata, Tamiya Onodera","doi":"10.1109/ISPASS.2013.6557144","DOIUrl":"https://doi.org/10.1109/ISPASS.2013.6557144","url":null,"abstract":"Improving memory utilization is important for increasing the efficiency of a cloud datacenter, since it increases the number of usable VMs. Memory over-commitment is a common technique for this purpose. Transparent Page Sharing (TPS) is a technique that improves utilization by sharing identical memory pages to reduce the total memory consumption. For a cloud datacenter, we might expect that TPS will reduce memory usage because VMs often execute the same OS and middleware and thus may have many identical pages. However, TPS is less effective for Java-based middleware because the Java VM finds it difficult to manage the layouts of internal data structures that depend on the execution of Java programs. This paper presents detailed breakdowns of the memory usage of KVM guest VMs executing a Java-based Web application server. We then propose increasing the amount of page sharing by utilizing a class sharing mechanism in the Java VM. Our approach reduced the measured physical memory for class metadata by up to 89.6% when using the Apache DayTrader benchmark running on four guest VMs in a KVM host machine.","PeriodicalId":299172,"journal":{"name":"2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129020833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ISA-independent workload characterization and its implications for specialized architectures","authors":"Y. Shao, D. Brooks","doi":"10.1109/ISPASS.2013.6557175","DOIUrl":"https://doi.org/10.1109/ISPASS.2013.6557175","url":null,"abstract":"Specialized architectures will become increasingly important as the computing industry demands more energy-efficient designs. The application-centric design style for these architectures is heavily dependent on workload characterization of intrinsic program characteristics, but at the same time these architectures are likely to be decoupled from legacy ISAs. In this work, we perform ISA-independent workload characterization for a variety of important intrinsic program characteristics relating to computation, memory, and control flow. The analysis is performed using a JIT compiler that emits ISA-independent instructions. We compare this analysis with an x86 trace and find that several of the analyses are highly sensitive to the ISA. We conclude that designers of specialized architectures must adopt ISA-independent workload characterization approaches.","PeriodicalId":299172,"journal":{"name":"2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130406663","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}