Shubham Nema, Justin Kirschner, Debpratim Adak, S. Agarwal, Ben Feinberg, Arun Rodrigues, M. Marinella, Amro Awad
{"title":"Eris: Fault Injection and Tracking Framework for Reliability Analysis of Open-Source Hardware","authors":"Shubham Nema, Justin Kirschner, Debpratim Adak, S. Agarwal, Ben Feinberg, Arun Rodrigues, M. Marinella, Amro Awad","doi":"10.1109/ISPASS55109.2022.00027","DOIUrl":"https://doi.org/10.1109/ISPASS55109.2022.00027","url":null,"abstract":"As transistors have been scaled over the past decade, modern systems have become increasingly susceptible to faults. Increased transistor densities and lower capacitances make a particle strike more likely to cause an upset. At the same time, complex computer systems are increasingly integrated into safety-critical systems such as autonomous vehicles. These two trends make the study of system reliability and fault tolerance essential for modern systems. To analyze and improve system reliability early in the design process, new tools are needed for RTL fault analysis.This paper proposes Eris, a novel framework to identify vulnerable components in hardware designs through fault-injection and fault propagation tracking. Eris builds on ESSENT—a fast C/C++ RTL simulation framework—to provide fault injection, fault tracking, and control-flow deviation detection capabilities for RTL designs. To demonstrate Eris’ capabilities, we analyze the reliability of the open source Rocket Chip SoC by randomly injecting faults during thousands of runs on four microbenchmarks. As part of this analysis we measure the sensitivity of different hardware structures to faults based on the likelihood of a random fault causing silent data corruption, unrecoverable data errors, program crashes, and program hangs. We detect control flow deviations and determine whether or not they are benign. Additionally, using Eris’ novel fault-tracking capabilities we are able to find 78% more vulnerable components in the same number of simulations compared to RTL-based fault injection techniques without these capabilities. 
We will release Eris as an open-source tool to aid future research into processor reliability and hardening.","PeriodicalId":115391,"journal":{"name":"2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121011913","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
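As a rough illustration of the fault-injection idea the abstract describes (a minimal sketch, not Eris' actual implementation; all names here are hypothetical), a single-event upset can be modeled as flipping one random bit of a simulated register, and each run classified against a fault-free "golden" run:

```python
import random

def inject_bit_flip(value, width, rng=None):
    """Model a single-event upset by flipping one random bit
    of a `width`-bit register value."""
    rng = rng or random.Random()
    bit = rng.randrange(width)
    return value ^ (1 << bit)

def classify(golden_output, faulty_output, crashed, hung):
    """Bucket one injection run into outcome categories like
    those the abstract lists."""
    if hung:
        return "hang"
    if crashed:
        return "crash"
    if faulty_output != golden_output:
        return "silent data corruption"
    return "masked"  # the fault had no visible effect
```

Repeating this over thousands of randomized runs, as the abstract describes, yields a per-structure outcome distribution.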
Hadjer Benmeziane, S. Niar, Hamza Ouarnoughi, Kaoutar El Maghraoui
{"title":"Pareto Rank Surrogate Model for Hardware-aware Neural Architecture Search","authors":"Hadjer Benmeziane, S. Niar, Hamza Ouarnoughi, Kaoutar El Maghraoui","doi":"10.1109/ISPASS55109.2022.00040","DOIUrl":"https://doi.org/10.1109/ISPASS55109.2022.00040","url":null,"abstract":"Hardware-aware Neural Architecture Search (HWNAS) has recently gained much attention by automating the design of efficient deep learning models with tiny resources and reduced inference time requirements. However, HW-NAS inherits and exacerbates the expensive computational complexity of general NAS due to its significantly increased search spaces and more complex NAS evaluation component. To speed up HWNAS, existing efforts use surrogate models to predict a neural architecture’s accuracy and hardware performance on a specific platform. Thereby reducing the expensive training process and significantly reducing search time. We show that using multiple surrogate models to estimate the different objectives does not achieve the true Pareto front. Therefore, we propose HW-PRNAS, a novel Pareto Rank-preserving surrogate model. HWPR-NAS training is based on a new loss function that ranks the architectures according to their Pareto front. We evaluate our approach on seven different hardware platforms, including ASIC, FPGA, GPU and multi-cores. Our results show that we can achieve up to 2. 
5x speedup while achieving better Pareto-front results than state of the art surrogate models.","PeriodicalId":115391,"journal":{"name":"2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117055171","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
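To make the "true Pareto front" the abstract refers to concrete, here is a minimal sketch of Pareto dominance over (accuracy, latency) pairs, where higher accuracy and lower latency are better. This only illustrates the concept; it is not the paper's surrogate model or loss function:

```python
def dominates(a, b):
    """a, b are (accuracy, latency) pairs. a dominates b if it is no worse
    in both objectives and strictly better in at least one."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better

def pareto_front(points):
    """Return the non-dominated subset of (accuracy, latency) points."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

A rank-preserving surrogate, as the abstract describes it, would be trained so that its predicted scores keep architectures on this front ranked ahead of dominated ones.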
Mohammad Bakhshalipour, M. Likhachev, Phillip B. Gibbons
{"title":"RTRBench: A Benchmark Suite for Real-Time Robotics","authors":"Mohammad Bakhshalipour, M. Likhachev, Phillip B. Gibbons","doi":"10.1109/ispass55109.2022.00024","DOIUrl":"https://doi.org/10.1109/ispass55109.2022.00024","url":null,"abstract":"The emergence of “robotics in the wild” has triggered a wave of recent research in hardware and software to boost robots’ compute capabilities. Nevertheless, research in this area is hindered by the lack of a comprehensive benchmark suite.In this paper, we present RTRBench, a benchmark suite for robotic kernels. RTRBench includes 16 kernels, spanning the entire software pipeline of a wide swath of robots, all implemented in C++ for fast execution.Together with the suite, we conduct an evaluation of the workloads at the architecture level. We pinpoint the sources of inefficiencies in a modern robotic processor when executing the robotic kernels, along with the opportunities for improvements.The source code of the benchmark suite is available in https://cmu-roboarch.github.io/rtrbench/.","PeriodicalId":115391,"journal":{"name":"2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"86 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123558100","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dimitris Sartzetakis, G. Papadimitriou, D. Gizopoulos
{"title":"gpuFI-4: A Microarchitecture-Level Framework for Assessing the Cross-Layer Resilience of Nvidia GPUs","authors":"Dimitris Sartzetakis, G. Papadimitriou, D. Gizopoulos","doi":"10.1109/ISPASS55109.2022.00004","DOIUrl":"https://doi.org/10.1109/ISPASS55109.2022.00004","url":null,"abstract":"Pre-silicon reliability evaluation of processors is usually performed at the microarchitecture or at the software level. Recent studies on CPUs have, however, shown that software level approaches can mislead the soft error vulnerability assessment process and drive designers towards wrong error protection decisions. To avoid such pitfalls in the GPUs domain, the availability of microarchitecture level reliability assessment tools is of paramount importance. Although there are several publicly available frameworks for the reliability assessment of GPUs, they only operate at the software level, and do not consider the microarchitecture. This paper aims at accurate microarchitecture level GPU soft error vulnerability assessment. We introduce gpuFI-4: a detailed microarchitecture-level fault injection framework to assess the cross-layer vulnerability of hardware structures and entire GPU chips for single and multiple bit faults, built on top of the state-of-the-art simulator GPGPU-Sim 4.0. We employ gpuFI-4 for fault injection of soft errors on CUDA-enabled Nvidia GPU architectures. The target hardware structures that our framework analyzes are the register file, the shared memory, the LI data and texture caches and the L2 cache, altogether accounting for tens of MBs of on-chip GPU storage. We showcase the features of the tool reporting the vulnerability of three Nvidia GPU chip models: two different modem GPU architectures – RTX 2060 (Turing) and Quadro GV100 (Volta) – and an older generation – GTX Titan (Kepler), for both single-bit and triple-bit fault injections and for twelve different CUDA benchmarks that are simulated on the actual physical instruction set (SASS). 
Our experiments report the Architectural Vulnerability Factor (AVF) of the GPU chips (which can be only measured at the microarchitecture level) as well as their predicted Failures in Time (FIT) rate when technology information is incorporated in the assessment.","PeriodicalId":115391,"journal":{"name":"2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129923399","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
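A common way to estimate the AVF mentioned in the abstract is as the fraction of fault-injection runs on a structure that lead to any visible failure. The sketch below illustrates that statistic only; it is not gpuFI-4's code, and the outcome labels are hypothetical:

```python
def estimate_avf(outcomes):
    """Estimate the Architectural Vulnerability Factor of a hardware
    structure as the fraction of fault-injection runs whose outcome
    was not masked (i.e., the fault mattered)."""
    failures = sum(1 for o in outcomes if o != "masked")
    return failures / len(outcomes)
```

With enough randomized injections per structure (register file, shared memory, caches), this fraction converges toward the structure's vulnerability.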
{"title":"High-Performance Deployment of Text Detection Model: Compression and Hardware Platform considerations","authors":"Nupur Sumeet, Karan Rawat, M. Nambiar","doi":"10.1109/ISPASS55109.2022.00022","DOIUrl":"https://doi.org/10.1109/ISPASS55109.2022.00022","url":null,"abstract":"Network compression is often adopted for high throughput implementation on commercial accelerators. We propose a heuristic based approach to obtain compressed networks with a hardware-friendly architecture as an alternative to conventional NAS algorithms that are computationally expensive. The proposed compressed network introduces 142 $times$ memory-footprint reduction and provide throughput improvement of 5-8 $times$ on target hardware platforms, while retaining accuracy within 5% of the baseline trained model. We report performance acceleration on CPU, GPU, and FPGAs for a text detection task.","PeriodicalId":115391,"journal":{"name":"2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134356004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Geonhwa Jeong, Bikash Sharma, Nick Terrell, A. Dhanotia, Zhiwei Zhao, Niket Agarwal, A. Kejariwal, T. Krishna
{"title":"Understanding Data Compression in Warehouse-Scale Datacenter Services","authors":"Geonhwa Jeong, Bikash Sharma, Nick Terrell, A. Dhanotia, Zhiwei Zhao, Niket Agarwal, A. Kejariwal, T. Krishna","doi":"10.1109/ISPASS55109.2022.00028","DOIUrl":"https://doi.org/10.1109/ISPASS55109.2022.00028","url":null,"abstract":"Data compression has emerged as a promising technique to alleviate the memory, storage, and network cost with some associated compute overheads in warehouse-scale datacenter services. Despite being one of the most important components of the overall datacenter taxes, there has not been a comprehensive characterization of compression usage in data center workloads. In this work, we first provide a holistic characterization of compression as used by various warehouse-scale datacenter services at a global social media provider (Meta). Next, we deep dive into a few representative use cases of compression in the production environment and characterize compression usage of services while running live traffic.","PeriodicalId":115391,"journal":{"name":"2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"37 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134635017","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shalini Jain, Yashas Andaluri, S. VenkataKeerthy, Ramakrishna Upadrasta
{"title":"POSET-RL: Phase ordering for Optimizing Size and Execution Time using Reinforcement Learning","authors":"Shalini Jain, Yashas Andaluri, S. VenkataKeerthy, Ramakrishna Upadrasta","doi":"10.1109/ISPASS55109.2022.00012","DOIUrl":"https://doi.org/10.1109/ISPASS55109.2022.00012","url":null,"abstract":"The ever increasing memory requirements of several applications has led to increased demands which might not be met by embedded devices. Constraining the usage of memory in such cases is of paramount importance. It is important that such code size improvements should not have a negative impact on the runtime. Improving the execution time while optimizing for code size is a non-trivial but a significant task.The ordering of standard optimization sequences in modern compilers is fixed, and are heuristically created by the compiler domain experts based on their expertise. However, this ordering is sub-optimal, and does not generalize well across all the cases.We present a reinforcement learning based solution to the phase ordering problem, where the ordering improves both the execution time and code size. We propose two different approaches to model the sequences: one by manual ordering, and other based on a graph called Oz Dependence Graph (ODG). Our approach uses minimal data as training set, and is integrated with LLVM.We show results on x86 and AArch64 architectures on the benchmarks from SPEC-CPU 2006, SPEC-CPU 2017 and MiBench. 
We observe that the proposed model based on ODG outperforms the current Oz sequence both in terms of size and execution time by 6.19% and 11.99% in SPEC 2017 benchmarks, on an average.","PeriodicalId":115391,"journal":{"name":"2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130053107","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
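To see why phase ordering is a search problem at all, consider that the same passes in different orders can yield different code quality because passes enable one another. The toy sketch below searches orderings exhaustively against an invented pairwise-interaction cost model; the pass names and bonuses are hypothetical, and this is brute force, not the paper's RL approach:

```python
import itertools

# Hypothetical pass interactions, NOT real LLVM data: a bonus applies
# when the first pass of the pair runs before the second.
PAIR_BONUS = {("inline", "dce"): 2, ("gvn", "dce"): 1, ("inline", "gvn"): 1}

def cost(order):
    """Lower is better: subtract a bonus for each beneficial pair (a, b)
    with a ordered before b."""
    c = 10.0
    for i, a in enumerate(order):
        for b in order[i + 1:]:
            c -= PAIR_BONUS.get((a, b), 0)
    return c

passes = ["dce", "gvn", "inline"]
best = min(itertools.permutations(passes), key=cost)
```

Exhaustive search explodes factorially with the number of passes, which is why the paper turns to reinforcement learning to navigate the ordering space.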
S. Santos, T. R. Kepe, Francis B. Moreira, P. C. Santos, M. Alves
{"title":"Advancing Near-Data Processing with Precise Exceptions and Efficient Data Fetching","authors":"S. Santos, T. R. Kepe, Francis B. Moreira, P. C. Santos, M. Alves","doi":"10.1109/ISPASS55109.2022.00031","DOIUrl":"https://doi.org/10.1109/ISPASS55109.2022.00031","url":null,"abstract":"Near-Data Processing (NDP) modifies the traditional computer system design by placing logic near the memory, bringing computation to the data. One NDP approach places such elements on the logic layer of 3D-stacked memories to quickly access data while avoiding reliance on narrow buses and better accessing the parallelism these devices offer. However, NDP architectures often fail to fully leverage available memory resources. In this work, we propose adding an instruction buffer to a common NDP design with large vector instructions. This modification allows the NDP to fetch instruction operands out of program order and delegates some responsibility regarding precise exceptions to the near-data device. Our results show our modifications cause a reduction in execution time of up to 28% while consuming up to 25% less energy.","PeriodicalId":115391,"journal":{"name":"2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121446420","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Uday Kumar Reddy Vengalam, Anshujit Sharma, Michael C. Huang
{"title":"LoopIn: A Loop-Based Simulation Sampling Mechanism","authors":"Uday Kumar Reddy Vengalam, Anshujit Sharma, Michael C. Huang","doi":"10.1109/ispass55109.2022.00029","DOIUrl":"https://doi.org/10.1109/ispass55109.2022.00029","url":null,"abstract":"Understanding program behavior is at the heart of general-purpose architecture design. Whether we are testing a new design offline or making a design adapt to changing behavior online, a central assumption is that the test cases represent real workload in steady state. Typical computer programs have been known to exhibit patterns of runtime behavior that repeat during the course of their execution. Simulation and adaptation strategies all exploit this repetition to some extent. In this paper, we introduce a simple mechanism that is more explicit in identifying and exploiting behavior repetition at the granularity of (broadly defined) loops. The result is that a typical benchmark will be categorized into tens of loops. In terms of architectural simulations, this strategy will create a moderate number (on the orders of 100) of relatively short (tens of thousands of instructions) segments. There are two major benefits in our view. The first and more quantifiable benefit is that, the strategy requires less simulation and obtains increased accuracy compared to the commonly used SimPoint approach. Second, instead of depicting average statistics of an entire program, we can accurately describe intra-program behavior variation, which simple sampling strategies cannot. LoopIn produces many small simulation segments. In certain usage scenarios, microarchitectural state warm-up may be costly. 
In these cases, an existing tool BLRL can help create efficient warm-up arrangements.","PeriodicalId":115391,"journal":{"name":"2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"317 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123233085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Konstantin Levit-Gurevich, Alex Skaletsky, Michael Berezalsky, Yulia Kuznetcova, Hila Yakov
{"title":"Profiling Intel Graphics Architecture with Long Instruction Traces","authors":"Konstantin Levit-Gurevich, Alex Skaletsky, Michael Berezalsky, Yulia Kuznetcova, Hila Yakov","doi":"10.1109/ISPASS55109.2022.00001","DOIUrl":"https://doi.org/10.1109/ISPASS55109.2022.00001","url":null,"abstract":"In the process of developing software and hardware, profiling workloads is critical. Binary Instrumentation Technology plays a key role in this task for both x86 architecture and Intel Graphics Processing Units. The GTPin framework is the first tool that allows the profiling of graphics and compute kernels running on Intel GPUs. However, GTPin capabilities are less flexible than x86 profiling tools. In this paper, we introduce the concept of “gLIT” – Long Instruction Trace for Intel GPUs. Generated on real hardware, gLIT can be replayed on a simulator or an emulator running on the CPU device, and thus, can be easily profiled and analyzed “on the fly” with analysis tools of any complexity. Since the graphics devices are extremely parallel, the gLIT trace is, by definition, a multi-threaded trace, reflecting a kernel concurrently running hundreds of hardware threads. 
The ability to thoroughly profile and analyze workloads is critical for improving hardware and software readiness and creates new possibilities for academic research on Intel graphics devices.","PeriodicalId":115391,"journal":{"name":"2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"434 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126100647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}