Latest publications from the 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)

Estimation-based profiling for code placement optimization in sensor network programs
Lipeng Wan, Qing Cao, Wenjun Zhou
DOI: https://doi.org/10.1109/ISPASS.2015.7095799 | Published: 2015-03-29
Abstract: In this work, we focus on applying profiling-guided code placement to programs running on resource-constrained sensor motes. Specifically, we model the execution of sensor network programs under nondeterministic inputs as discrete-time Markov processes, and propose a novel approach named Code Tomography that estimates the parameters of the Markov models reflecting the programs' dynamic execution behavior using only end-to-end timing information measured at the start and end points of each procedure. The parameters estimated by Code Tomography are fed back to compilers to optimize code placement so that the branch misprediction rate can be reduced.
Cited by: 2
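The abstract above models program execution as a discrete-time Markov process over program locations. As a rough illustration of the underlying idea (not the paper's actual estimator), the expected number of times each basic block executes follows directly from such a chain via its fundamental matrix; the block graph and transition probabilities below are hypothetical:

```python
import numpy as np

# Hypothetical 4-block control-flow graph modeled as an absorbing
# discrete-time Markov chain: blocks 0..2 are transient, block 3 (exit)
# is absorbing. Entry (i, j) is the probability of branching i -> j.
P = np.array([
    [0.0, 0.7, 0.3, 0.0],   # block 0: 70% to block 1, 30% to block 2
    [0.0, 0.0, 0.0, 1.0],   # block 1: always falls through to exit
    [0.0, 0.4, 0.0, 0.6],   # block 2: 40% to block 1, 60% to exit
    [0.0, 0.0, 0.0, 1.0],   # exit: absorbing
])

# Fundamental matrix N = (I - Q)^-1 over the transient sub-matrix Q
# gives the expected number of visits to each transient block.
Q = P[:3, :3]
N = np.linalg.inv(np.eye(3) - Q)
visits_from_entry = N[0]   # expected visits to blocks 0..2 from block 0
print(visits_from_entry)   # [1.0, 0.82, 0.3]
```

A compiler armed with such expected frequencies can lay out the hottest successor of each block on the fall-through path, which is the kind of placement decision the paper's estimated parameters feed into.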
Self-monitoring overhead of the Linux perf_event performance counter interface
Vincent M. Weaver
DOI: https://doi.org/10.1109/ISPASS.2015.7095789 | Published: 2015-03-29
Abstract: Most modern CPUs include hardware performance counters: architectural registers that allow programmers to gain low-level insight into system performance. Low-overhead access to these counters is necessary for accurate performance analysis, making the operating system interface critical to providing low-latency performance data. We investigate the overhead of self-monitoring performance counter measurements on the Linux perf_event interface. We find that default code (such as that used by PAPI) implementing the perf_event self-monitoring interface can have large overhead: up to an order of magnitude larger than the previously used perfctr and perfmon2 performance counter implementations. We investigate the causes of this overhead and find that with proper coding this overhead can be greatly reduced on recent Linux kernels.
Cited by: 39
Nyami: a synthesizable GPU architectural model for general-purpose and graphics-specific workloads
Jeffrey T. Bush, Philip Dexter, Timothy N. Miller, A. Carpenter
DOI: https://doi.org/10.1109/ISPASS.2015.7095803 | Published: 2015-03-29
Abstract: Graphics processing units (GPUs) continue to grow in popularity for general-purpose, highly parallel, high-throughput systems. This has forced GPU vendors to increase their focus on general purpose workloads, sometimes at the expense of the graphics-specific workloads. Using GPUs for general-purpose computation is a departure from the driving forces behind programmable GPUs that were focused on a narrow subset of graphics rendering operations. Rather than focus on purely graphics-related or general-purpose use, we have designed and modeled an architecture that optimizes for both simultaneously to efficiently handle all GPU workloads. In this paper, we present Nyami, a co-optimized GPU architecture and simulation model with an open-source implementation written in Verilog. This approach allows us to more easily explore the GPU design space in a synthesizable, cycle-precise, modular environment. An instruction-precise functional simulator is provided for co-simulation and verification. Overall, we assume a GPU may be used as a general-purpose GPU (GPGPU) or a graphics engine and account for this in the architecture's construction and in the options and modules selectable for synthesis and simulation. To demonstrate Nyami's viability as a GPU research platform, we exploit its flexibility and modularity to explore the impact of a set of architectural decisions. These include sensitivity to cache size and associativity, barrel and switch-on-stall multithreaded instruction scheduling, and software vs. hardware implementations of rasterization. Through these experiments, we gain insight into commonly accepted GPU architecture decisions, adapt the architecture accordingly, and give examples of the intended use as a GPU research tool.
Cited by: 18
Synchrotrace: synchronization-aware architecture-agnostic traces for light-weight multicore simulation
Siddharth Nilakantan, K. Sangaiah, A. More, G. Salvador, B. Taskin, Mark Hempstead
DOI: https://doi.org/10.1109/ISPASS.2015.7095813 | Published: 2015-03-29
Abstract: Trace-driven simulation of chip multiprocessor (CMP) systems offers many advantages over execution-driven simulation, such as reducing simulation time and complexity, and allowing portability and scalability. However, trace-based simulation approaches have encountered difficulty capturing and accurately replaying multi-threaded traces due to the inherent non-determinism in the execution of multi-threaded programs. In this work, we present SynchroTrace, a scalable, flexible, and accurate trace-based multi-threaded simulation methodology. The methodology captures synchronization- and dependency-aware, architecture-agnostic, multi-threaded traces and uses a replay mechanism that plays back these traces correctly. By recording synchronization events and dependencies in the traces, independent of the host architecture, the methodology is able to accurately model the non-determinism of multi-threaded programs for different platforms. We validate the SynchroTrace simulation flow by successfully achieving the equivalent results of a constraint-based design space exploration with the Gem5 Full-System simulator. The results from simulating benchmarks from PARSEC 2.1 and Splash-2 show that our trace-based approach with trace filtering has a peak speedup of 18.4x over simulation in Gem5 Full-System, with an average speedup of about 7.5x. We are also able to compress traces up to 74% of their original size with almost no impact on accuracy.
Cited by: 22
Graph-matching-based simulation-region selection for multiple binaries
Charles R. Yount, H. Patil, M. S. Islam, Aditya Srikanth
DOI: https://doi.org/10.1109/ISPASS.2015.7095784 | Published: 2015-03-29
Abstract: Comparison of simulation-based performance estimates of program binaries built with different compiler settings or targeted at variants of an instruction set architecture is essential for software/hardware co-design and similar engineering activities. Commonly-used sampling techniques for selecting simulation regions do not ensure that samples from the various binaries being compared represent the same source-level work, leading to biased speedup estimates and difficulty in comparative performance debugging. The task of creating equal-work samples is made difficult by differences between the structure and execution paths across multiple binaries, such as variations in libraries, in-lining, and loop-iteration counts. Such complexities are addressed in this work by first applying an existing graph-matching technique to call and loop graphs for multiple binaries for the same source program. Then, a new sequence-alignment algorithm is applied to execution traces from the various binaries, using the graph-matching results to define intervals of equal work. A basic-block profile generated for these matched intervals can then be used for phase-detection and simulation-region selection across all binaries simultaneously. The resulting selected simulation regions match both in number and the work done across multiple binaries. The application of this technique is demonstrated on binaries compiled for different Intel 64 Architecture instruction-set extensions. Quality metrics for speedup estimation and an example of applying the data for performance debugging are presented.
Cited by: 7
Characterization and cross-platform analysis of high-throughput accelerators
Keitarou Oka, Wenhao Jia, M. Martonosi, Koji Inoue
DOI: https://doi.org/10.1109/ISPASS.2015.7095797 | Published: 2015-03-29
Abstract: Today's computer systems often employ high-throughput accelerators (such as Intel Xeon Phi coprocessors and NVIDIA Tesla GPUs) to improve the performance of some applications or portions of applications. While such accelerators are useful for suitable applications, it remains challenging to predict which workloads will run well on these platforms and to predict the resulting performance trends for varying input. This paper provides detailed characterizations of such platforms across a range of programs and input sizes. Furthermore, we show opportunities for cross-platform performance analysis and comparison between Xeon Phi and Tesla. Our cross-platform comparison has three steps. First, we build Xeon Phi performance regression models as a function of important Xeon Phi performance counters to identify critical architectural resources that highly affect a benchmark's performance. Then, cross-platform Tesla performance regression models are built to relate the Tesla performance trends of the benchmark to the Xeon Phi performance counter measurements of the benchmark. Finally, we compare the counters most important for Xeon Phi models to those most important for Tesla's models; this reveals similarities and distinctions of dynamic application behaviors on the two platforms.
Cited by: 1
A study of mobile device utilization
Cao Gao, Anthony Gutierrez, M. Rajan, R. Dreslinski, T. Mudge, Carole-Jean Wu
DOI: https://doi.org/10.1109/ISPASS.2015.7095808 | Published: 2015-03-29
Abstract: Mobile devices are becoming more powerful and versatile than ever, calling for better embedded processors. Following the trend in desktop CPUs, microprocessor vendors are trying to meet such needs by increasing the number of cores in mobile device SoCs. However, increasing the core count does not translate proportionally into performance gains and power reduction. In the past, studies have shown that there exists little parallelism to be exploited by a multi-core processor in desktop platform applications, and many cores sit idle during runtime. In this paper, we investigate whether the same is true for current mobile applications. We analyze the behavior of a broad range of commonly used mobile applications on real devices. We measure their Thread Level Parallelism (TLP), which is the machine utilization over the non-idle runtime. Our results demonstrate that mobile applications utilize fewer than 2 cores on average, even with background applications running concurrently. We observe diminishing returns on TLP as the number of cores increases, and low TLP even in heavy-load scenarios. These studies suggest that having many powerful cores is over-provisioning. Further analysis of TLP behavior and big-little core energy efficiency suggests that current mobile workloads can benefit from an architecture that has the flexibility to accommodate both high performance and good energy-efficiency for different application phases.
Cited by: 60
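The TLP metric described in the abstract above is the average number of simultaneously active cores, computed only over non-idle time. A minimal sketch of that computation (the input distribution below is hypothetical, not data from the paper):

```python
# Thread Level Parallelism (TLP): average number of cores busy at once,
# measured only over intervals where at least one core is active.
def tlp(core_active_time):
    """core_active_time[k] = time during which exactly k cores are active."""
    busy = sum(t for k, t in enumerate(core_active_time) if k >= 1)
    if busy == 0:
        return 0.0  # device was entirely idle
    weighted = sum(k * t for k, t in enumerate(core_active_time))
    return weighted / busy

# Hypothetical trace of a quad-core device: 40% fully idle, 30% with one
# core active, 20% with two, 10% with three.
print(tlp([0.4, 0.3, 0.2, 0.1]))   # (0.3 + 0.4 + 0.3) / 0.6 ≈ 1.67
```

A value below 2 on a quad-core device, as in this hypothetical trace, is exactly the kind of result the study reports when arguing that many powerful cores are over-provisioned.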
Can RDMA benefit online data processing workloads on memcached and MySQL?
D. Shankar, Xiaoyi Lu, Jithin Jose, Md. Wasi-ur-Rahman, Nusrat S. Islam, D. Panda
DOI: https://doi.org/10.1109/ISPASS.2015.7095796 | Published: 2015-03-29
Abstract: At the onset of the widespread usage of social networking services in the Web 2.0/3.0 era, leveraging a distributed and scalable caching layer like Memcached is often invaluable to application server performance. Since a majority of the existing clusters today are equipped with modern high speed interconnects such as InfiniBand, which offer high bandwidth and low latency communication, there is potential to improve the response time and throughput of the application servers by taking advantage of advanced features like RDMA. We explore the potential of employing RDMA to improve the performance of Online Data Processing (OLDP) workloads on MySQL using Memcached for real-world web applications.
Cited by: 6
Analyzing communication models for distributed thread-collaborative processors in terms of energy and time
Benjamin Klenk, Lena Oden, H. Fröning
DOI: https://doi.org/10.1109/ISPASS.2015.7095817 | Published: 2015-03-29
Abstract: Accelerated computing has become pervasive for increasing computational power and energy efficiency in terms of GFLOPs/Watt. For application areas with the highest demands, for instance high performance computing, data warehousing, and high performance analytics, accelerators like GPUs or Intel's MICs are distributed throughout the cluster. Since current analyses and predictions show that data movement will be the main contributor to energy consumption, we are entering an era of communication-centric heterogeneous systems that are operating with hard power constraints. In this work, we analyze data movement optimizations for distributed heterogeneous systems based on CPUs and GPUs. Thread-collaborative processors like GPUs differ significantly in their execution model from general-purpose processors like CPUs, but available communication models are still designed and optimized for CPUs. Similar to heterogeneity in processing, heterogeneity in communication can have a huge impact on energy and time. To analyze this impact, we use multiple workloads with distinct properties regarding computational intensity and communication characteristics. We show for which workloads tailored communication models are essential, not only reducing execution time but also saving energy. Exposing the impact in terms of energy and time for communication-centric heterogeneous systems is crucial for future optimizations, and this work is a first step in this direction.
Cited by: 11
Hierarchical cycle accounting: a new method for application performance tuning
A. Nowak, D. Levinthal, W. Zwaenepoel
DOI: https://doi.org/10.1109/ISPASS.2015.7095790 | Published: 2015-03-29
Abstract: To address the growing difficulty of performance debugging on modern processors with increasingly complex micro-architectures, we present Hierarchical Cycle Accounting (HCA), a structured, hierarchical, architecture-agnostic methodology for the identification of performance issues in workloads running on these modern processors. HCA reports to the user the cost of a number of execution components, such as load latency, memory bandwidth, instruction starvation, and branch misprediction. A critical novel feature of HCA is that all cost components are presented in the same unit, core pipeline cycles. Their relative importance can therefore be compared directly. These cost components are furthermore presented in a hierarchical fashion, with architecture-agnostic components at the top levels of the hierarchy and architecture-specific components at the bottom. This hierarchical structure is useful in guiding the performance debugging effort to the places where it can be the most effective. For a given architecture, the cost components are computed based on the observation of architecture-specific events, typically provided by a performance monitoring unit (PMU), and using a set of formulas to attribute a certain cost in cycles to each event. The selection of what PMU events to use, their validation, and the derivation of the formulas are done offline by an architecture expert, thereby freeing the non-expert from the burdensome and error-prone task of directly interpreting PMU data. We have implemented the HCA methodology in Gooda, a publicly available tool. We describe the application of Gooda to the analysis of several workloads in wide use, showing how HCA's features facilitated performance debugging for these applications. We also describe the discovery of relevant bugs in Intel hardware and the Linux kernel as a result of using HCA.
Cited by: 13
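The key idea in the abstract above is that every cost component is expressed in one unit, core pipeline cycles, and arranged hierarchically so siblings at any level are directly comparable. A toy sketch of such an accounting structure (component names and cycle counts are illustrative, not Gooda's actual model):

```python
# Illustrative hierarchical cycle accounting: a nested dict where every
# leaf is a cost in core pipeline cycles, so any two nodes can be
# compared directly regardless of their depth in the hierarchy.
cycles = {
    "retiring": 410_000_000,
    "stalled": {
        "load_latency": 250_000_000,
        "memory_bandwidth": 90_000_000,
        "instruction_starvation": 30_000_000,
        "branch_misprediction": 20_000_000,
    },
}

def total(node):
    """Sum a (possibly nested) cycle-cost hierarchy."""
    if isinstance(node, dict):
        return sum(total(v) for v in node.values())
    return node

stall_cycles = total(cycles["stalled"])
print(stall_cycles / total(cycles))   # fraction of all cycles lost to stalls
```

Because every node is in cycles, a tuner can descend from the architecture-agnostic top level ("stalled") into architecture-specific leaves ("load_latency") and always know which component dominates.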