Latest publications from the 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)

Estimation-based profiling for code placement optimization in sensor network programs
Lipeng Wan, Qing Cao, Wenjun Zhou
DOI: https://doi.org/10.1109/ISPASS.2015.7095799 | Published: 2015-03-29
Abstract: In this work, we focus on applying profiling-guided code placement to programs running on resource-constrained sensor motes. Specifically, we model the execution of sensor network programs under nondeterministic inputs as discrete-time Markov processes, and propose a novel approach named Code Tomography that estimates the parameters of the Markov models reflecting the programs' dynamic execution behavior using only end-to-end timing information measured at the start and end points of each procedure. The parameters estimated by Code Tomography are fed back to compilers to optimize code placement so that the branch misprediction rate can be reduced.
Cited by: 2
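The abstract above models program execution as a discrete-time Markov process over program locations. As a rough illustration of the underlying idea (not the paper's actual estimator), the expected number of times each basic block executes follows directly from such a chain via its fundamental matrix; the block graph and transition probabilities below are hypothetical:

```python
import numpy as np

# Hypothetical 4-block control-flow graph modeled as an absorbing
# discrete-time Markov chain: blocks 0..2 are transient, block 3 (exit)
# is absorbing. Entry (i, j) is the probability of branching i -> j.
P = np.array([
    [0.0, 0.7, 0.3, 0.0],   # block 0: 70% to block 1, 30% to block 2
    [0.0, 0.0, 0.0, 1.0],   # block 1: always falls through to exit
    [0.0, 0.4, 0.0, 0.6],   # block 2: 40% to block 1, 60% to exit
    [0.0, 0.0, 0.0, 1.0],   # exit: absorbing
])

# Fundamental matrix N = (I - Q)^-1 over the transient sub-matrix Q
# gives the expected number of visits to each transient block.
Q = P[:3, :3]
N = np.linalg.inv(np.eye(3) - Q)
visits_from_entry = N[0]   # expected visits to blocks 0..2 from block 0
print(visits_from_entry)   # [1.0, 0.82, 0.3]
```

A compiler armed with such expected frequencies can lay out the hottest successor of each block on the fall-through path, which is the kind of placement decision the paper's estimated parameters feed into.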
Self-monitoring overhead of the Linux perf_event performance counter interface
Vincent M. Weaver
DOI: https://doi.org/10.1109/ISPASS.2015.7095789 | Published: 2015-03-29
Abstract: Most modern CPUs include hardware performance counters: architectural registers that allow programmers to gain low-level insight into system performance. Low-overhead access to these counters is necessary for accurate performance analysis, making the operating system interface critical to providing low-latency performance data. We investigate the overhead of self-monitoring performance counter measurements on the Linux perf_event interface. We find that default code (such as that used by PAPI) implementing the perf_event self-monitoring interface can have large overhead: up to an order of magnitude larger than the previously used perfctr and perfmon2 performance counter implementations. We investigate the causes of this overhead and find that with proper coding this overhead can be greatly reduced on recent Linux kernels.
Cited by: 39
Nyami: a synthesizable GPU architectural model for general-purpose and graphics-specific workloads
Jeffrey T. Bush, Philip Dexter, Timothy N. Miller, A. Carpenter
DOI: https://doi.org/10.1109/ISPASS.2015.7095803 | Published: 2015-03-29
Abstract: Graphics processing units (GPUs) continue to grow in popularity for general-purpose, highly parallel, high-throughput systems. This has forced GPU vendors to increase their focus on general purpose workloads, sometimes at the expense of the graphics-specific workloads. Using GPUs for general-purpose computation is a departure from the driving forces behind programmable GPUs that were focused on a narrow subset of graphics rendering operations. Rather than focus on purely graphics-related or general-purpose use, we have designed and modeled an architecture that optimizes for both simultaneously to efficiently handle all GPU workloads. In this paper, we present Nyami, a co-optimized GPU architecture and simulation model with an open-source implementation written in Verilog. This approach allows us to more easily explore the GPU design space in a synthesizable, cycle-precise, modular environment. An instruction-precise functional simulator is provided for co-simulation and verification. Overall, we assume a GPU may be used as a general-purpose GPU (GPGPU) or a graphics engine and account for this in the architecture's construction and in the options and modules selectable for synthesis and simulation. To demonstrate Nyami's viability as a GPU research platform, we exploit its flexibility and modularity to explore the impact of a set of architectural decisions. These include sensitivity to cache size and associativity, barrel and switch-on-stall multithreaded instruction scheduling, and software vs. hardware implementations of rasterization. Through these experiments, we gain insight into commonly accepted GPU architecture decisions, adapt the architecture accordingly, and give examples of the intended use as a GPU research tool.
Cited by: 18
Synchrotrace: synchronization-aware architecture-agnostic traces for light-weight multicore simulation
Siddharth Nilakantan, K. Sangaiah, A. More, G. Salvador, B. Taskin, Mark Hempstead
DOI: https://doi.org/10.1109/ISPASS.2015.7095813 | Published: 2015-03-29
Abstract: Trace-driven simulation of chip multiprocessor (CMP) systems offers many advantages over execution-driven simulation, such as reducing simulation time and complexity, and allowing portability and scalability. However, trace-based simulation approaches have encountered difficulty capturing and accurately replaying multi-threaded traces due to the inherent non-determinism in the execution of multi-threaded programs. In this work, we present SynchroTrace, a scalable, flexible, and accurate trace-based multi-threaded simulation methodology. The methodology captures synchronization- and dependency-aware, architecture-agnostic, multi-threaded traces and uses a replay mechanism that plays back these traces correctly. By recording synchronization events and dependencies in the traces, independent of the host architecture, the methodology is able to accurately model the non-determinism of multi-threaded programs for different platforms. We validate the SynchroTrace simulation flow by successfully achieving the equivalent results of a constraint-based design space exploration with the Gem5 Full-System simulator. The results from simulating benchmarks from PARSEC 2.1 and Splash-2 show that our trace-based approach with trace filtering has a peak speedup of 18.4x over simulation in Gem5 Full-System, with an average speedup of about 7.5x. We are also able to compress traces up to 74% of their original size with almost no impact on accuracy.
Cited by: 22
Graph-matching-based simulation-region selection for multiple binaries
Charles R. Yount, H. Patil, M. S. Islam, Aditya Srikanth
DOI: https://doi.org/10.1109/ISPASS.2015.7095784 | Published: 2015-03-29
Abstract: Comparison of simulation-based performance estimates of program binaries built with different compiler settings or targeted at variants of an instruction set architecture is essential for software/hardware co-design and similar engineering activities. Commonly-used sampling techniques for selecting simulation regions do not ensure that samples from the various binaries being compared represent the same source-level work, leading to biased speedup estimates and difficulty in comparative performance debugging. The task of creating equal-work samples is made difficult by differences between the structure and execution paths across multiple binaries, such as variations in libraries, in-lining, and loop-iteration counts. Such complexities are addressed in this work by first applying an existing graph-matching technique to call and loop graphs for multiple binaries for the same source program. Then, a new sequence-alignment algorithm is applied to execution traces from the various binaries, using the graph-matching results to define intervals of equal work. A basic-block profile generated for these matched intervals can then be used for phase-detection and simulation-region selection across all binaries simultaneously. The resulting selected simulation regions match both in number and the work done across multiple binaries. The application of this technique is demonstrated on binaries compiled for different Intel 64 Architecture instruction-set extensions. Quality metrics for speedup estimation and an example of applying the data for performance debugging are presented.
Cited by: 7
Characterization and cross-platform analysis of high-throughput accelerators
Keitarou Oka, Wenhao Jia, M. Martonosi, Koji Inoue
DOI: https://doi.org/10.1109/ISPASS.2015.7095797 | Published: 2015-03-29
Abstract: Today's computer systems often employ high-throughput accelerators (such as Intel Xeon Phi coprocessors and NVIDIA Tesla GPUs) to improve the performance of some applications or portions of applications. While such accelerators are useful for suitable applications, it remains challenging to predict which workloads will run well on these platforms and to predict the resulting performance trends for varying input. This paper provides detailed characterizations of such platforms across a range of programs and input sizes. Furthermore, we show opportunities for cross-platform performance analysis and comparison between Xeon Phi and Tesla. Our cross-platform comparison has three steps. First, we build Xeon Phi performance regression models as a function of important Xeon Phi performance counters to identify critical architectural resources that highly affect a benchmark's performance. Then, cross-platform Tesla performance regression models are built to relate the Tesla performance trends of the benchmark to the Xeon Phi performance counter measurements of the benchmark. Finally, we compare the counters most important for Xeon Phi models to those most important for Tesla's models; this reveals similarities and distinctions of dynamic application behaviors on the two platforms.
Cited by: 1
A study of mobile device utilization
Cao Gao, Anthony Gutierrez, M. Rajan, R. Dreslinski, T. Mudge, Carole-Jean Wu
DOI: https://doi.org/10.1109/ISPASS.2015.7095808 | Published: 2015-03-29
Abstract: Mobile devices are becoming more powerful and versatile than ever, calling for better embedded processors. Following the trend in desktop CPUs, microprocessor vendors are trying to meet such needs by increasing the number of cores in mobile device SoCs. However, increasing the core count does not translate proportionally into performance gains and power reduction. In the past, studies have shown that there exists little parallelism to be exploited by a multi-core processor in desktop platform applications, and many cores sit idle during runtime. In this paper, we investigate whether the same is true for current mobile applications. We analyze the behavior of a broad range of commonly used mobile applications on real devices. We measure their Thread Level Parallelism (TLP), which is the machine utilization over the non-idle runtime. Our results demonstrate that mobile applications utilize fewer than 2 cores on average, even with background applications running concurrently. We observe diminishing returns on TLP as the number of cores increases, and low TLP even in heavy-load scenarios. These studies suggest that having many powerful cores is over-provisioning. Further analysis of TLP behavior and big-little core energy efficiency suggests that current mobile workloads can benefit from an architecture that has the flexibility to accommodate both high performance and good energy-efficiency for different application phases.
Cited by: 60
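The TLP metric described in the abstract above is the average number of simultaneously active cores, computed only over non-idle time. A minimal sketch of that computation (the input distribution below is hypothetical, not data from the paper):

```python
# Thread Level Parallelism (TLP): average number of cores busy at once,
# measured only over intervals where at least one core is active.
def tlp(core_active_time):
    """core_active_time[k] = time during which exactly k cores are active."""
    busy = sum(t for k, t in enumerate(core_active_time) if k >= 1)
    if busy == 0:
        return 0.0  # device was entirely idle
    weighted = sum(k * t for k, t in enumerate(core_active_time))
    return weighted / busy

# Hypothetical trace of a quad-core device: 40% fully idle, 30% with one
# core active, 20% with two, 10% with three.
print(tlp([0.4, 0.3, 0.2, 0.1]))   # (0.3 + 0.4 + 0.3) / 0.6 ≈ 1.67
```

A value below 2 on a quad-core device, as in this hypothetical trace, is exactly the kind of result the study reports when arguing that many powerful cores are over-provisioned.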
Can RDMA benefit online data processing workloads on memcached and MySQL?
D. Shankar, Xiaoyi Lu, Jithin Jose, Md. Wasi-ur-Rahman, Nusrat S. Islam, D. Panda
DOI: https://doi.org/10.1109/ISPASS.2015.7095796 | Published: 2015-03-29
Abstract: At the onset of the widespread usage of social networking services in the Web 2.0/3.0 era, leveraging a distributed and scalable caching layer like Memcached is often invaluable to application server performance. Since a majority of the existing clusters today are equipped with modern high speed interconnects such as InfiniBand, which offer high bandwidth and low latency communication, there is potential to improve the response time and throughput of the application servers by taking advantage of advanced features like RDMA. We explore the potential of employing RDMA to improve the performance of Online Data Processing (OLDP) workloads on MySQL using Memcached for real-world web applications.
Cited by: 6
Analyzing communication models for distributed thread-collaborative processors in terms of energy and time
Benjamin Klenk, Lena Oden, H. Fröning
DOI: https://doi.org/10.1109/ISPASS.2015.7095817 | Published: 2015-03-29
Abstract: Accelerated computing has become pervasive for increasing computational power and energy efficiency in terms of GFLOPs/Watt. For application areas with the highest demands, for instance high performance computing, data warehousing, and high performance analytics, accelerators like GPUs or Intel's MICs are distributed throughout the cluster. Since current analyses and predictions show that data movement will be the main contributor to energy consumption, we are entering an era of communication-centric heterogeneous systems that are operating with hard power constraints. In this work, we analyze data movement optimizations for distributed heterogeneous systems based on CPUs and GPUs. Thread-collaborative processors like GPUs differ significantly in their execution model from general-purpose processors like CPUs, but available communication models are still designed and optimized for CPUs. Similar to heterogeneity in processing, heterogeneity in communication can have a huge impact on energy and time. To analyze this impact, we use multiple workloads with distinct properties regarding computational intensity and communication characteristics. We show for which workloads tailored communication models are essential, not only reducing execution time but also saving energy. Exposing the impact in terms of energy and time for communication-centric heterogeneous systems is crucial for future optimizations, and this work is a first step in this direction.
Cited by: 11
Hierarchical cycle accounting: a new method for application performance tuning
A. Nowak, D. Levinthal, W. Zwaenepoel
DOI: https://doi.org/10.1109/ISPASS.2015.7095790 | Published: 2015-03-29
Abstract: To address the growing difficulty of performance debugging on modern processors with increasingly complex micro-architectures, we present Hierarchical Cycle Accounting (HCA), a structured, hierarchical, architecture-agnostic methodology for the identification of performance issues in workloads running on these modern processors. HCA reports to the user the cost of a number of execution components, such as load latency, memory bandwidth, instruction starvation, and branch misprediction. A critical novel feature of HCA is that all cost components are presented in the same unit, core pipeline cycles. Their relative importance can therefore be compared directly. These cost components are furthermore presented in a hierarchical fashion, with architecture-agnostic components at the top levels of the hierarchy and architecture-specific components at the bottom. This hierarchical structure is useful in guiding the performance debugging effort to the places where it can be the most effective. For a given architecture, the cost components are computed based on the observation of architecture-specific events, typically provided by a performance monitoring unit (PMU), and using a set of formulas to attribute a certain cost in cycles to each event. The selection of what PMU events to use, their validation, and the derivation of the formulas are done offline by an architecture expert, thereby freeing the non-expert from the burdensome and error-prone task of directly interpreting PMU data. We have implemented the HCA methodology in Gooda, a publicly available tool. We describe the application of Gooda to the analysis of several workloads in wide use, showing how HCA's features facilitated performance debugging for these applications. We also describe the discovery of relevant bugs in Intel hardware and the Linux kernel as a result of using HCA.
Cited by: 13
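The key idea in the abstract above is that every cost component is expressed in one unit, core pipeline cycles, and arranged hierarchically so siblings at any level are directly comparable. A toy sketch of such an accounting structure (component names and cycle counts are illustrative, not Gooda's actual model):

```python
# Illustrative hierarchical cycle accounting: a nested dict where every
# leaf is a cost in core pipeline cycles, so any two nodes can be
# compared directly regardless of their depth in the hierarchy.
cycles = {
    "retiring": 410_000_000,
    "stalled": {
        "load_latency": 250_000_000,
        "memory_bandwidth": 90_000_000,
        "instruction_starvation": 30_000_000,
        "branch_misprediction": 20_000_000,
    },
}

def total(node):
    """Sum a (possibly nested) cycle-cost hierarchy."""
    if isinstance(node, dict):
        return sum(total(v) for v in node.values())
    return node

stall_cycles = total(cycles["stalled"])
print(stall_cycles / total(cycles))   # fraction of all cycles lost to stalls
```

Because every node is in cycles, a tuner can descend from the architecture-agnostic top level ("stalled") into architecture-specific leaves ("load_latency") and always know which component dominates.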