{"title":"HotGauge: A Methodology for Characterizing Advanced Hotspots in Modern and Next Generation Processors","authors":"Alexander Hankin, David Werner, M. Amiraski, Julien Sébot, Kaushik Vaidyanathan, Mark Hempstead","doi":"10.1109/IISWC53511.2021.00025","DOIUrl":"https://doi.org/10.1109/IISWC53511.2021.00025","url":null,"abstract":"On-chip thermal hotspots are becoming one of the primary design concerns for next generation processors. Industry chip design trends coupled with post-Dennard power density scaling has led to a stark increase in localized and application-dependent hotspots. These “advanced” hotspots cause a variety of adverse effects if untreated, ranging from dramatic performance loss, incorrect circuit operation, and reduced device lifespan. In the past, hotspots could be addressed with physical cooling systems and EDA tools; however, the severity of advanced hotspots is prohibitively high for conventional thermal regulation techniques alone. Fine-grained, architecture-level techniques are needed. To develop these new techniques, the architecture community needs the methods and metrics for simulating and characterizing advanced hotspots. This work presents a novel hotspot characterization methodology for modern and next generation processors which we have coined, HotGauge. HotGauge includes new methods and metrics to enable architects to build hotspot mitigation techniques of the future. To demonstrate the utility of HotGauge, we present a case study in which we characterize the hotspot behavior in a modern 7nm high-performance client CPU. We observe an average Time-until-hotspot (TUH) that is 2× shorter than in its 14nm cousin for many SPEC2006 benchmarks, and we observe TUH varies by up to 2 orders of magnitude between different benchmarks. The first hotspot arises after only 0.2 ms. We then use HotGauge to compare hotspot severity across different floorplans, and we observe that floorplanning-based hotspot mitigation techniques like area scaling are inadequate. To enable the broader community to conduct architecture-level hotspot mitigation research, HotGauge, along with all models used in the case study in this work, is publicly available at https://github.com/TuftsCompArchLab/HotGaugeand https://doi.org/10.5281/zenodo.5523504.","PeriodicalId":203713,"journal":{"name":"2021 IEEE International Symposium on Workload Characterization (IISWC)","volume":"135 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133437563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Boosting Microprocessor Efficiency: Circuit- and Workload-Aware Assessment of Timing Errors","authors":"Ioannis Tsiokanos, G. Papadimitriou, D. Gizopoulos, G. Karakonstantis","doi":"10.1109/IISWC53511.2021.00022","DOIUrl":"https://doi.org/10.1109/IISWC53511.2021.00022","url":null,"abstract":"Aggressive technology scaling and increased static and dynamic variability caused by process, temperature, voltage, and aging effects make nanometer circuits prone to timing errors which threaten system functionality. Accurately evaluating the impact of those circuit-level errors on the resilience of a CPU and the executed applications remains a first-class design issue. However, existing error assessment frameworks fail to accurately model the effects of timing errors because they neglect microarchitecture- and workload-dependent parameters that critically affect the error manifestation and propagation. This paper provides a novel, cross-layer framework that addresses the lack of a holistic methodology for the understanding of the full system impact of hardware timing errors as they propagate from the circuit-level through the microarchitecture up to the application software. The proposed microarchitecture-aware tool is able to realistically inject timing errors considering circuit and workload features, accurately assessing timing error effects on any application binary. We estimate the location (bit position and instruction) and the time (cycle) of the injected errors via a workload-aware error model which relies on post place-and-route dynamic timing analysis. We also leverage microarchitectural error injection to access the timing error reliability of a widely deployed pipelined processor under several workloads and voltage reduction levels. To evaluate the proposed tool, our fully automated toolflow is also configured to support timing error injection based on existing workload-agnostic error models. Evaluation results for various workloads and voltage reduction levels, show that our circuit- and workload-aware error injection model improves the accuracy of the error injection ratio by ~ 250× on average compared to workload-agnostic models. Finally, we quantify the degree to which various applications are prone to timing errors using an application vulnerability metric that can be used early in the design cycle to guide the adoption of energy-efficient error mitigation strategies.","PeriodicalId":203713,"journal":{"name":"2021 IEEE International Symposium on Workload Characterization (IISWC)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129289609","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Cost of Speculation: Revisiting Overheads in the V8 JavaScript Engine","authors":"Alberto Parravicini, René Müller","doi":"10.1109/IISWC53511.2021.00013","DOIUrl":"https://doi.org/10.1109/IISWC53511.2021.00013","url":null,"abstract":"JavaScript, while already widely used in web applications, is also gaining popularity beyond the web, such as on the server as part of Node.js or on the desktop with the Electron framework (e.g., in the Slack application). However, executing code written in this weakly and dynamically typed language efficiently, requires sophisticated Just-in-time (JIT) compilation. In this paper, we revisit the execution overheads of JavaScript. We perform a detailed analysis of the JetStream2 benchmark suite on Google's modern V8 JavaScript engine. We identify micro-architectural bottlenecks and runtime overheads that result from the speculations made by the JIT compiler when it generates machine code from JavaScript. We find that checks that verify assumptions made by the compiler have an average execution overhead of 8 %, 2-4x of what an earlier study reported on an older version of V8. For the check for Small Integers (SMI), we observe that the conditional branches are not the underlying cause but rather the computation of the condition itself. This indicates that these checks provide an attractive avenue for a HW/SW codesign solution. We present an extension for the ARMv8 instruction set that optimizes SMI load instructions and checks. We can improve execution time on SMI-heavy computations by up to 10 % on a prototype implementation of the new instructions in the gem5 simulator.","PeriodicalId":203713,"journal":{"name":"2021 IEEE International Symposium on Workload Characterization (IISWC)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123723744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Quantifying the Semantic Gap Between Serial and Parallel Programming","authors":"Xiaochun Zhang, Timothy M. Jones, Simone Campanoni","doi":"10.1109/IISWC53511.2021.00024","DOIUrl":"https://doi.org/10.1109/IISWC53511.2021.00024","url":null,"abstract":"Automatic parallelizing compilers are often constrained in their transformations because they must conservatively respect data dependences within the program. Developers, on the other hand, often take advantage of domain-specific knowledge to apply transformations that modify data dependences but respect the application's semantics. This creates a semantic gap between the parallelism extracted automatically by compilers and manually by developers. Although prior work has proposed programming language extensions to close this semantic gap, their relative contribution is unclear and it is uncertain whether compilers can actually achieve the same performance as manually parallelized code when using them. We quantify this semantic gap in a set of sequential and parallel programs and leverage these existing programming-language extensions to empirically measure the impact of closing it for an automatic parallelizing compiler. This lets us achieve an average speedup of 12.6× on an Intel-based 28-core machine, matching the speedup obtained by the manually parallelized code. Further, we apply these extensions to widely used sequential system tools, obtaining 7.1× speedup on the same system.","PeriodicalId":203713,"journal":{"name":"2021 IEEE International Symposium on Workload Characterization (IISWC)","volume":"2015 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134239741","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analyzing Tail Latency in Serverless Clouds with STeLLAR","authors":"Dmitrii Ustiugov, Theodor Amariucai, Boris Grot","doi":"10.1109/IISWC53511.2021.00016","DOIUrl":"https://doi.org/10.1109/IISWC53511.2021.00016","url":null,"abstract":"Serverless computing has seen rapid adoption because of its instant scalability, flexible billing model, and economies of scale. In serverless, developers structure their applications as a collection of functions invoked by various events like clicks, and cloud providers take responsibility for cloud infrastructure management. As with other cloud services, serverless deployments require responsiveness and performance predictability manifested through low average and tail latencies. While the average end-to-end latency has been extensively studied in prior works, existing papers lack a detailed characterization of the effects of tail latency in real-world serverless scenarios and their root causes. In response, we introduce STeLLAR, an open-source serverless benchmarking framework, which enables an accurate performance characterization of serverless deployments. STeLLAR is provider-agnostic and highly configurable, allowing the analysis of both end-to-end and per-component performance with minimal instrumentation effort. Using STeLLAR, we study three leading serverless clouds and reveal that storage accesses and bursty function invocation traffic are key factors impacting tail latency in modern serverless systems. Finally, we identify important factors that do not contribute to latency variability, such as the choice of language runtime.","PeriodicalId":203713,"journal":{"name":"2021 IEEE International Symposium on Workload Characterization (IISWC)","volume":"116 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123026980","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Characterizing Soft Error Vulnerability of CPUs Across Compiler Optimizations and Microarchitectures","authors":"G. Papadimitriou, D. Gizopoulos","doi":"10.1109/IISWC53511.2021.00021","DOIUrl":"https://doi.org/10.1109/IISWC53511.2021.00021","url":null,"abstract":"In this paper, we present a fine-grained characterization of the impact of transient faults (soft errors) on program execution for different compiler optimization levels and two out-of-order microarchitectures through extensive microarchitecture-level fault injection experiments. We evaluate how the different levels of compiler optimization impact the failure probability of the most important hardware structures in two different out-of-order Arm microarchitectures (Cortex-A15 and Cortex-A 72). We analyze 32 different executables: sources come from eight different benchmarks with large datasets, each one compiled with three different levels of compiler optimization (O1, O2, 03) and the baseline unoptimized code level (O0); execution times of the 32 binaries range from 72M cycles to 1.4B cycles. We show how the different compiler optimization levels affect the vulnerability of eight important hardware structures. We perform extensive soft error fault injection campaigns to measure with high statistical significance the Architectural Vulnerability Factor (AVF) of all hardware structures at each optimization level, and identify the structures whose vulnerability is more sensitive to compiler optimizations. Finally, we aggregate the vulnerabilities of the hardware structures into the overall failure rates of the microprocessor and complement with a performance-aware comparison of all optimization levels. The performance-aware vulnerability analysis shows that higher optimization levels counterbalance their increased vulnerability with the speedup the deliver. From the failure rates sole point of view, an unprotected design has variable behavior, however, when typical ECC protection is employed the O2 optimization level is consistently the most robust one, while for more recent microarchitectures, O1 can be equally robust to O2 which is not the case in older microarchitectures.","PeriodicalId":203713,"journal":{"name":"2021 IEEE International Symposium on Workload Characterization (IISWC)","volume":"30 13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125813356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ILLIXR: Enabling End-to-End Extended Reality Research","authors":"Muhammad Huzaifa, Rishi Desai, Samuel Grayson, Xutao Jiang, Ying Jing, Jae Lee, Fang Lu, Yihan Pang, Joseph Ravichandran, Finn Sinclair, Boyuan Tian, Hengzhi Yuan, Jeffrey Zhang, S. Adve","doi":"10.1109/IISWC53511.2021.00014","DOIUrl":"https://doi.org/10.1109/IISWC53511.2021.00014","url":null,"abstract":"An increasing number of edge systems have large computational demands, stringent resource constraints, and end-to-end quality-driven goodness metrics. Architects have embraced domain-specific accelerators to meet the demands of such systems. We make the case for research that shifts emphasis from domain-specific accelerators to domain-specific systems, with a consequent shift from evaluations using benchmarks that are collections of independent applications to those using testbeds that are full integrated systems. We describe extended reality (XR) as an exciting domain motivating such domain-specific systems research, but hampered by the lack of an end-to-end evaluation testbed. We present ILLIXR (Illinois Extended Reality testbed), the first fully open source XR system and research testbed. ILLIXR enables system innovations with end-to-end co-designed hardware, compiler, OS, and algorithm, and driven by end-user perceived quality-of-experience (QoE) metrics. Using ILLIXR, we perform the first comprehensive quantitative analysis of performance, power, and QoE for a complete XR system and its individual components. We describe several implications of our results that propel new directions in architecture, systems, and algorithm research for domain-specific systems in general, and XR in particular, all enabled by ILLIXR.","PeriodicalId":203713,"journal":{"name":"2021 IEEE International Symposium on Workload Characterization (IISWC)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126102200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Characterizing and Taming Resolution in Convolutional Neural Networks","authors":"Eddie Q. Yan, Liang Luo, L. Ceze","doi":"10.1109/IISWC53511.2021.00027","DOIUrl":"https://doi.org/10.1109/IISWC53511.2021.00027","url":null,"abstract":"Image resolution has a significant effect on the accuracy and computational, storage, and bandwidth costs of computer vision model inference. These costs are exacerbated when scaling out models to large inference serving systems and make image resolution an attractive target for optimization. However, the choice of resolution inherently introduces additional tightly coupled choices, such as image crop size, image detail, and compute kernel implementation that impact computational, storage, and bandwidth costs. Further complicating this setting, the optimal choices from the perspective of these metrics are highly dependent on the dataset and problem scenario. We characterize this tradeoff space, quantitatively studying the accuracy and efficiency tradeoff via systematic and automated tuning of image resolution, image quality and convolutional neural network operators. With the insights from this study, we propose a dynamic resolution mechanism that removes the need to statically choose a resolution ahead of time. Our evaluation shows that our dynamic resolution approach improves inference latency by 1.2×-1.7×, reduces data access volume by up to 20–30%, without affecting accuracy. We establish the dynamic resolution approach as a viable alternative to fine-tuning for a specific object scale to compensate for unknown crop sizes, which is the current state of the art.","PeriodicalId":203713,"journal":{"name":"2021 IEEE International Symposium on Workload Characterization (IISWC)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129969595","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Characterizing and Demystifying the Implicit Convolution Algorithm on Commercial Matrix-Multiplication Accelerators","authors":"Yangjie Zhou, Mengtian Yang, Cong Guo, Jingwen Leng, Yun Liang, Quan Chen, M. Guo, Yuhao Zhu","doi":"10.1109/IISWC53511.2021.00029","DOIUrl":"https://doi.org/10.1109/IISWC53511.2021.00029","url":null,"abstract":"Many of today's deep neural network accelerators, e.g., Google's TPU and NVIDIA's tensor core, are built around accelerating the general matrix multiplication (i.e., GEMM). However, supporting convolution on GEMM-based accelerators is not trivial. The naive method explicitly lowers the convolution to GEMM, commonly known as im2co1, which introduces significant performance and memory overhead. Existing implicit im2co1 algorithms require unscalable hardware and are inefficient in supporting important convolution variants such as strided convolution. In this paper, we propose a memory-efficient and hardware-friendly implicit im2co1 algorithm used by Google's TPU, which dynamically converts a convolution into a GEMM with practically zero performance and memory overhead, fully unleashing the power of GEMM engines. Through comprehensive experimental results, we quantitatively argue that this algorithm has been adopted in commercial closed-source platforms, and we are the first to describe its high-level idea and implementation details. Finally, we show that our algorithm can also be generally applied to Nvidia's Tensor Cores (TC), matching and out-performing the measured performance on TCs.","PeriodicalId":203713,"journal":{"name":"2021 IEEE International Symposium on Workload Characterization (IISWC)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133920924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}