Lillian Pentecost, Udit Gupta, Elisa Ngan, J. Beyer, Gu-Yeon Wei, D. Brooks, M. Behrisch
{"title":"CHAMPVis: Comparative Hierarchical Analysis of Microarchitectural Performance","authors":"Lillian Pentecost, Udit Gupta, Elisa Ngan, J. Beyer, Gu-Yeon Wei, D. Brooks, M. Behrisch","doi":"10.1109/ProTools49597.2019.00013","DOIUrl":"https://doi.org/10.1109/ProTools49597.2019.00013","url":null,"abstract":"Performance analysis and optimization are essential tasks for hardware and software engineers. In the age of datacenter-scale computing, it is particularly important to conduct comparative performance analysis to understand discrepancies and limitations among different hardware systems and applications. However, there is a distinct lack of productive visualization tools for these comparisons. We present CHAMPVis, a web-based, interactive visualization tool that leverages the hierarchical organization of hardware systems to enable productive performance analysis. With CHAMPVis, users can make definitive performance comparisons across applications or hardware platforms. In addition, CHAMPVis provides methods to rank and cluster based on performance metrics to identify common optimization opportunities. Our thorough task analysis reveals three types of datacenter-scale performance analysis tasks: summarization, detailed comparative analysis, and interactive performance bottleneck identification. We propose techniques for each class of tasks including (1) 1-D feature space projection for similarity analysis; (2) Hierarchical parallel co-ordinates for comparative analysis; and (3) User interactions for rapid diagnostic queries to identify optimization targets. We evaluate CHAMPVis by analyzing standard datacenter applications and machine learning benchmarks in two different case studies.","PeriodicalId":418029,"journal":{"name":"2019 IEEE/ACM International Workshop on Programming and Performance Visualization Tools (ProTools)","volume":"107 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115181967","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Allen R. Sanderson, John A. Schmidt, A. Humphrey, M. Papka, R. Sisneros
{"title":"In Situ Visualization of Performance Metrics in Multiple Domains","authors":"Allen R. Sanderson, John A. Schmidt, A. Humphrey, M. Papka, R. Sisneros","doi":"10.1109/ProTools49597.2019.00014","DOIUrl":"https://doi.org/10.1109/ProTools49597.2019.00014","url":null,"abstract":"As application scientists develop and deploy simulation codes onto leadership-class computing resources, there is a need to instrument these codes to better understand performance to efficiently utilize these resources. This instrumentation may come from independent third-party tools that generate and store performance metrics or from custom instrumentation tools built directly into the application. The metrics collected are then available for visual analysis, typically in the domain in which they were collected. In this paper, we introduce an approach to visualize and analyze the performance metrics in situ in the context of the machine, application, and communication domains (MAC model) using a single visualization tool. This visualization model provides a holistic view of the application performance in the context of the resources where it is executing.","PeriodicalId":418029,"journal":{"name":"2019 IEEE/ACM International Workshop on Programming and Performance Visualization Tools (ProTools)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123395528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"[Copyright notice]","authors":"","doi":"10.1109/protools49597.2019.00002","DOIUrl":"https://doi.org/10.1109/protools49597.2019.00002","url":null,"abstract":"","PeriodicalId":418029,"journal":{"name":"2019 IEEE/ACM International Workshop on Programming and Performance Visualization Tools (ProTools)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125149698","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"[Title page]","authors":"","doi":"10.1109/protools49597.2019.00001","DOIUrl":"https://doi.org/10.1109/protools49597.2019.00001","url":null,"abstract":"","PeriodicalId":418029,"journal":{"name":"2019 IEEE/ACM International Workshop on Programming and Performance Visualization Tools (ProTools)","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122997009","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Level Performance Instrumentation for Kokkos Applications Using TAU","authors":"S. Shende, Nicholas Chaimov, A. Malony, N. Imam","doi":"10.1109/ProTools49597.2019.00012","DOIUrl":"https://doi.org/10.1109/ProTools49597.2019.00012","url":null,"abstract":"The TAU Performance System® provides a multi-level instrumentation strategy for instrumentation of Kokkos applications. Kokkos provides a performance portable API for expressing parallelism at the node level. TAU uses the Kokkos profiling system to expose performance factors using user-specified parallel kernel names for lambda functions or C++ functors. It can also use instrumentation at the OpenMP, CUDA, pthread, or other runtime levels to expose the implementation details giving a dual focus of higher-level abstractions as well as low-level execution dynamics. This multi-level instrumentation strategy adopted by TAU can highlight performance problems across multiple layers of the runtime system without modifying the application binary.","PeriodicalId":418029,"journal":{"name":"2019 IEEE/ACM International Workshop on Programming and Performance Visualization Tools (ProTools)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117028952","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jan-Patrick Lehr, A. Calotoiu, C. Bischof, F. Wolf
{"title":"Automatic Instrumentation Refinement for Empirical Performance Modeling","authors":"Jan-Patrick Lehr, A. Calotoiu, C. Bischof, F. Wolf","doi":"10.1109/ProTools49597.2019.00011","DOIUrl":"https://doi.org/10.1109/ProTools49597.2019.00011","url":null,"abstract":"The analysis of runtime performance is important during the development and throughout the life cycle of HPC applications. One important objective in performance analysis is to identify regions in the code that show significant runtime increase with larger problem sizes or more processes. One approach to identify such regions is to use empirical performance modeling, i.e., building performance models based on measurements. While the modeling itself has already been streamlined and automated, the generation of the required measurements is time consuming and tedious. In this paper, we propose an approach to automatically adjust the instrumentation to reduce overhead and focus the measurements to relevant regions, i.e., those that show increasing runtime with larger input parameters or increasing number of MPI ranks. Our approach employs Extra-P to generate performance models, which it then uses to extrapolate runtime and, finally, decide which functions should be kept for measurement. Also, the analysis expands the instrumentation, by heuristically adding functions based on static source-code features. We evaluate our approach using benchmarks from SPEC CPU 2006, SU2, and parallel MILC. The evaluation shows that our approach can filter functions of little interest and generate profiles that contain mostly relevant regions. For example, the overhead for SU2 can be improved automatically from 200% to 11% compared to filtered Score-P measurements.","PeriodicalId":418029,"journal":{"name":"2019 IEEE/ACM International Workshop on Programming and Performance Visualization Tools (ProTools)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124424296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Qinglei Cao, Yu Pei, T. Hérault, Kadir Akbudak, A. Mikhalev, G. Bosilca, H. Ltaief, D. Keyes, J. Dongarra
{"title":"Performance Analysis of Tile Low-Rank Cholesky Factorization Using PaRSEC Instrumentation Tools","authors":"Qinglei Cao, Yu Pei, T. Hérault, Kadir Akbudak, A. Mikhalev, G. Bosilca, H. Ltaief, D. Keyes, J. Dongarra","doi":"10.1109/ProTools49597.2019.00009","DOIUrl":"https://doi.org/10.1109/ProTools49597.2019.00009","url":null,"abstract":"This paper highlights the necessary development of new instrumentation tools within the PaRSEC task-based runtime system to leverage the performance of low-rank matrix computations. In particular, the tile low-rank (TLR) Cholesky factorization represents one of the most critical matrix operations toward solving challenging large-scale scientific applications. The challenge resides in the heterogeneous arithmetic intensity of the various computational kernels, which stresses PaRSEC's dynamic engine when orchestrating the task executions at runtime. Such irregular workload imposes the deployment of new scheduling heuristics to privilege the critical path, while exposing task parallelism to maximize hardware occupancy. To measure the effectiveness of PaRSEC's engine and its various scheduling strategies for tackling such workloads, it becomes paramount to implement adequate performance analysis and profiling tools tailored to fine-grained and heterogeneous task execution. This permits us not only to provide insights from PaRSEC, but also to identify potential applications' performance bottlenecks. These instrumentation tools may actually foster synergism between applications and PaRSEC developers for productivity as well as high-performance computing purposes. We demonstrate the benefits of these amenable tools, while assessing the performance of TLR Cholesky factorization from data distribution, communication-reducing and synchronization-reducing perspectives. This tool-assisted performance analysis results in three major contributions: a new hybrid data distribution, a new hierarchical TLR Cholesky algorithm, and a new performance model for tuning the tile size. The new TLR Cholesky factorization achieves an 8X performance speedup over existing implementations on massively parallel supercomputers, toward solving large-scale 3D climate and weather prediction applications.","PeriodicalId":418029,"journal":{"name":"2019 IEEE/ACM International Workshop on Programming and Performance Visualization Tools (ProTools)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122264437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
David Boehme, K. Huck, Jonathan Madsen, J. Weidendorfer
{"title":"The Case for a Common Instrumentation Interface for HPC Codes","authors":"David Boehme, K. Huck, Jonathan Madsen, J. Weidendorfer","doi":"10.1109/ProTools49597.2019.00010","DOIUrl":"https://doi.org/10.1109/ProTools49597.2019.00010","url":null,"abstract":"Lightweight timekeeping functionality for basic performance logging, regression testing, and anomaly detection is essential in HPC codes. We present the Caliper, TiMemory, and PerfStubs libraries that have recently been developed as common solutions for these tasks. Lightweight, always-on profiling solutions are typically built around user-defined instrumentation points, which can benefit a variety of use cases beyond application timekeeping. We argue for the creation of a tool-agnostic adapter layer to make these instrumentation points available to third-party tools, runtime systems, and system software.","PeriodicalId":418029,"journal":{"name":"2019 IEEE/ACM International Workshop on Programming and Performance Visualization Tools (ProTools)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131721087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Asvie: A Timing-Agnostic SVE Optimization Methodology","authors":"M. T. Cruz, Daniel Ruiz, Roxana Rusitoru","doi":"10.1109/ProTools49597.2019.00007","DOIUrl":"https://doi.org/10.1109/ProTools49597.2019.00007","url":null,"abstract":"As we are quickly approaching exascale and moving onwards towards the next challenge, we are exploring a wider range of technologies and architectures. The further out the timeframes considered, the less likely prototype hardware is available. A popular method of exploring new architectural extensions is to emulate them on existing platforms. The Arm Instruction Emulator (ArmIE) is such a tool, which we use on existing Armv8 platforms to run Arm's latest vector architecture, the Scalable Vector Extension (SVE). To aid with porting applications towards SVE, we developed an application optimization methodology based on ArmIE that uses timing-agnostic metrics to assess application quality. We show how we have successfully optimized the High Performance Conjugate Gradient (HPCG) High Performance Computing benchmark to SVE by using our methodology, resulting in a hand-optimized intrinsics-based version.","PeriodicalId":418029,"journal":{"name":"2019 IEEE/ACM International Workshop on Programming and Performance Visualization Tools (ProTools)","volume":"934 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123780368","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Calotoiu, Thomas Höhl, H. Mantel, Toni Nguyen, F. Wolf
{"title":"Designing Efficient Parallel Software via Compositional Performance Modeling","authors":"A. Calotoiu, Thomas Höhl, H. Mantel, Toni Nguyen, F. Wolf","doi":"10.1109/ProTools49597.2019.00008","DOIUrl":"https://doi.org/10.1109/ProTools49597.2019.00008","url":null,"abstract":"Performance models are powerful instruments for understanding the performance of parallel systems and uncovering their bottlenecks. Already during system design, performance models can help ponder alternatives. However, creating a performance model - whether theoretically or empirically - for an entire application that does not exist yet is challenging unless the interactions between all system components are well understood, which is often not the case during design. In this paper, we propose to generate performance models of full programs from performance models of their components using formal composition operators derived from parallel design patterns such as pipeline or task pool. As long as the design of the overall system follows such a pattern, its performance model can be predicted with reasonable accuracy without an actual implementation.","PeriodicalId":418029,"journal":{"name":"2019 IEEE/ACM International Workshop on Programming and Performance Visualization Tools (ProTools)","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131786663","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}