Sofiane Chetoui, Rahul Shahi, Seif Abdelaziz, Abhinav Golas, Farrukh Hijaz, S. Reda
{"title":"ARBench: Augmented Reality Benchmark For Mobile Devices","authors":"Sofiane Chetoui, Rahul Shahi, Seif Abdelaziz, Abhinav Golas, Farrukh Hijaz, S. Reda","doi":"10.1109/ISPASS55109.2022.00035","DOIUrl":"https://doi.org/10.1109/ISPASS55109.2022.00035","url":null,"abstract":"This paper takes an important step towards the improvement of the AR mobile experience by designing and developing ARBench, the first Augmented Reality (AR) benchmark for mobile devices. ARBench incorporates different AR workloads that stress multiple hardware units of the SoC (CPU, GPU, DSP, etc), and measures the individual score for each AR workload. The proposed benchmark suite is then used to evaluate the AR performance of various commercial mobile devices, and their ability to support various functions of AR workloads.","PeriodicalId":115391,"journal":{"name":"2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114530114","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SGXGauge: A Comprehensive Benchmark Suite for Intel SGX","authors":"Sandeep Kumar, Abhisek Panda, S. Sarangi","doi":"10.1109/ISPASS55109.2022.00014","DOIUrl":"https://doi.org/10.1109/ISPASS55109.2022.00014","url":null,"abstract":"Trusted execution environments (TEEs) such as Intel SGX facilitate the secure execution of an application on untrusted machines. A plethora of work focuses on improving the performance of such environments necessitating the need for a standard, widely accepted benchmark suite. We present SGXGauge, a benchmark suite for SGX containing a diverse set of workloads from different domains. We also thoroughly characterize the behavior of the benchmark suite on a native platform and on a platform that uses a library OS-based shim layer (GrapheneSGX).","PeriodicalId":115391,"journal":{"name":"2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124317965","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yongqin Wang, G. Suh, Wenjie Xiong, Benjamin Lefaudeux, Brian Knott, M. Annavaram, Hsien-Hsin S. Lee
{"title":"Characterization of MPC-based Private Inference for Transformer-based Models","authors":"Yongqin Wang, G. Suh, Wenjie Xiong, Benjamin Lefaudeux, Brian Knott, M. Annavaram, Hsien-Hsin S. Lee","doi":"10.1109/ISPASS55109.2022.00025","DOIUrl":"https://doi.org/10.1109/ISPASS55109.2022.00025","url":null,"abstract":"In this work, we provide an in-depth characterization study of the performance overhead for running Transformer models with secure multi-party computation (MPC). MPC is a cryptographic framework for protecting both the model and input data privacy in the presence of untrusted compute nodes. Our characterization study shows that Transformers introduce several performance challenges for MPC-based private machine learning inference. First, Transformers rely extensively on “softmax” functions. While softmax functions are relatively cheap in a non-private execution, softmax dominates the MPC inference runtime, consuming up to 50% of the total inference runtime. Further investigation shows that computing the maximum, needed for providing numerical stability to softmax, is a key culprit for the increase in latency. Second, MPC relies on approximating non-linear functions that are part of the softmax computations, and the narrow dynamic ranges make optimizing softmax while maintaining accuracy quite difficult. Finally, unlike CNNs, Transformer-based NLP models use large embedding tables to convert input words into embedding vectors. Accesses to these embedding tables can disclose inputs; hence, additional obfuscation for embedding access patterns is required for guaranteeing the input privacy. One approach to hide address accesses is to convert an embedding table lookup into a matrix multiplication. However, this naive approach increases MPC inference runtime significantly. We then apply tensor-train (TT) decomposition, a lossy compression technique for representing embedding tables, and evaluate its performance on embedding lookups. 
We show the trade-off between performance improvements and the corresponding impact on model accuracy using detailed experiments.","PeriodicalId":115391,"journal":{"name":"2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126676864","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Qijing Huang, Charles Hong, J. Wawrzynek, Mahesh Subedar, Y. Shao
{"title":"Learning A Continuous and Reconstructible Latent Space for Hardware Accelerator Design","authors":"Qijing Huang, Charles Hong, J. Wawrzynek, Mahesh Subedar, Y. Shao","doi":"10.1109/ispass55109.2022.00041","DOIUrl":"https://doi.org/10.1109/ispass55109.2022.00041","url":null,"abstract":"The hardware design space is high-dimensional and discrete. Systematic and efficient exploration of this space has been a significant challenge. Central to this problem is the intractable search complexity that grows exponentially with the design choices and the discrete nature of the search space. This work investigates the feasibility of learning a meaningful low-dimensional continuous representation for hardware designs to reduce such complexity and facilitate the search process. We devise a variational autoencoder (VAE)-based design space exploration framework called VAESA, to encode the hardware design space in a compact and continuous representation. We show that black-box and gradient-based design space exploration algorithms can be applied to the latent space, and design points optimized in the latent space can be reconstructed to high-performance realistic hardware designs. Our experiments show that performing the design space search on the latent space consistently leads to the optimal design point under a fixed number of samples. 
In addition, the latent space can improve the sample efficiency of the original algorithm by 6.8× and can discover hardware designs that are up to 5% more efficient than the optimal design searched directly in the high-dimensional input space.","PeriodicalId":115391,"journal":{"name":"2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126203320","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sh. Sato, Kota Iizuka, N. Yoshifuji, Masaki Natsume
{"title":"VIPP: Validation-Included Precision-Parametric N-Body Benchmark Suite","authors":"Sh. Sato, Kota Iizuka, N. Yoshifuji, Masaki Natsume","doi":"10.1109/ISPASS55109.2022.00021","DOIUrl":"https://doi.org/10.1109/ISPASS55109.2022.00021","url":null,"abstract":"Many efforts have recently been made to analyze and validate floating-point errors, particularly in mixed-precision arithmetic. However, real-world applications in approximate computing typically incorporate both model-level approximation and arithmetic-level precision. It is crucial to analyze the combined effects of both precision parameters to the extent valid in terms of approximate algorithms. In this work, we develop a benchmark suite of the practical approximate solvers of various N-body problems that parameterize both N-body approximation and arithmetic precision. It involves precision criteria to prevent us from unrestricted reduced precision and serves as a testbed to analyze the combined effects of model-level approximation and arithmetic-level reduced precision. It would help the design of precision control in approximate computing.","PeriodicalId":115391,"journal":{"name":"2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124264986","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wenjie Liu, W. Heirman, Stijn Eyerman, Shoaib Akram, L. Eeckhout
{"title":"Scale-Model Architectural Simulation","authors":"Wenjie Liu, W. Heirman, Stijn Eyerman, Shoaib Akram, L. Eeckhout","doi":"10.1109/ISPASS55109.2022.00006","DOIUrl":"https://doi.org/10.1109/ISPASS55109.2022.00006","url":null,"abstract":"Computer architects extensively use simulation to steer future processor research and development. Simulating large-scale multicore processors is extremely time-consuming and is sometimes impossible because of simulation infrastructure constraints and/or simulation host compute and memory limitations. This paper proposes scale-model simulation, a novel methodology to predict large-scale multicore system performance. Scale-model simulation first constructs and simulates a scale model of the target system with reduced core count and shared resources. Target system performance is then predicted through machine-learning (ML) based extrapolation. Scale-model simulation predicts 32-core target system performance based on a single-core scale model with an average error of 8.0% and 15.8% for homogeneous and heterogeneous multiprogram workloads, respectively, while yielding a 28× simulation speedup.","PeriodicalId":115391,"journal":{"name":"2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134061406","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hyokeun Lee, Hyungsuk Kim, Seokbo Shim, Seungyong Lee, Dosun Hong, Hyuk-Jae Lee, Hyun Kim
{"title":"PCMCsim: An Accurate Phase-Change Memory Controller Simulator and its Performance Analysis","authors":"Hyokeun Lee, Hyungsuk Kim, Seokbo Shim, Seungyong Lee, Dosun Hong, Hyuk-Jae Lee, Hyun Kim","doi":"10.1109/ISPASS55109.2022.00043","DOIUrl":"https://doi.org/10.1109/ISPASS55109.2022.00043","url":null,"abstract":"With the growing demand for technology scaling and storage capacity in data centers, phase-change memory (PCM) has garnered attention as a next-generation nonvolatile memory (NVM). However, an accurate simulator that includes the necessary hardware features for PCM is not available, lagging behind current PCM technology. In this study, a functional and cycle-accurate PCM controller simulator, called PCMCsim, is presented to revitalize the related research. The proposed simulator incorporates necessary features for current PCM products and the latest DDR5 specifications. Based on rigorous performance analysis, this study characterizes bottlenecks of the PCM subsystem by sweeping hardware parameters, providing important takeaway messages to designers. Furthermore, the latency is significantly reduced by introducing a dedicated prefetcher into the address translation module. The proposed simulator is validated against a command trace made by a PCM product developer. 
We release our simulator as open-source software, except for industry-confidential features (https://github.com/harrylee365/pcmcsim-pub1ic).","PeriodicalId":115391,"journal":{"name":"2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133046503","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Marcos Horro, L. Pouchet, Gabriel Rodríguez, J. Touriño
{"title":"MARTA: Multi-configuration Assembly pRofiler and Toolkit for performance Analysis","authors":"Marcos Horro, L. Pouchet, Gabriel Rodríguez, J. Touriño","doi":"10.1109/ISPASS55109.2022.00008","DOIUrl":"https://doi.org/10.1109/ISPASS55109.2022.00008","url":null,"abstract":"Benchmarking to characterize specific software or hardware features is an error-prone, arduous and repetitive task. Designing a specialized experimental setup frequently requires writing new scripts or ad-hoc programs in order to properly exhibit interesting performance effects, using code changes and hardware events measurements. These artifacts may have limited reusability for subsequent experiments, since they are dependent on specific problems and, in some cases, platforms. To improve productivity and reproducibility of such experiments, which are often investigative in nature, we introduce MARTA: a fully customizable toolkit that aims to increase productivity by generating benchmark templates, compiling them, and profiling the regions of interest (RoI) specified using hardware events, and performing static code analysis. MARTA can also be applied on existing code regions of interest, it only requires to write a simple configuration file. In an orthogonal dimension, the system is able to run various statistical analyses on the measurements collected. MARTA uses data mining and machine learning or AI-based techniques for classification and regression, automatically extracting the features of the experimental setup which have the most impact on performance or whichever other metric of interest, given a large set of experiments and dimensions to consider. These post-processing tasks are valuable for deriving knowledge from experiments and are not included in most profiling tools. 
We also provide a set of cases of study to illustrate the ability of MARTA to conveniently create a reliable and reproducible setup for high-performance computing experiments, investigating three vastly different performance effects on modern processors.","PeriodicalId":115391,"journal":{"name":"2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123028715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jorge L. Ortiz, David Corbalán-Navarro, Juan L. Aragón, Antonio González
{"title":"MEGsim: A Novel Methodology for Efficient Simulation of Graphics Workloads in GPUs","authors":"Jorge L. Ortiz, David Corbalán-Navarro, Juan L. Aragón, Antonio González","doi":"10.1109/ispass55109.2022.00007","DOIUrl":"https://doi.org/10.1109/ispass55109.2022.00007","url":null,"abstract":"An important drawback of cycle-accurate microarchitectural simulators is that they are several orders of magnitude slower than the system they model. This becomes an important issue when simulations have to be repeated multiple times sweeping over the desired design space. In the specific context of graphics workloads, performing cycle-accurate simulations are even more demanding due to the high number of triangles that have to be shaded, lighted and textured to compose a single frame. As a result, simulating a few minutes of a video game sequence is extremely time-consuming. In this paper, we make the observation that collecting information about the vertices and primitives that are processed, along with the times that shader programs are invoked, allows us to characterize the activity performed on a given frame. Based on that, we propose a novel methodology for the efficient simulation of graphics workloads called MEGsim, an approach that is capable of accurately characterizing entire video sequences by using a small subset of selected frames which substantially drops the simulation time. 
For a set of popular Android games, we show that MEGsim achieves an average simulation speedup of 126×, achieving remarkably accurate results for the estimated final statistics, e.g., with average relative errors of just 0.84% for the total number of cycles, 0.99% for the number of DRAM accesses, 1.2% for the number of L2 cache accesses, and 0.86% for the number of L1 (tile cache) accesses.","PeriodicalId":115391,"journal":{"name":"2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131978566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"XFeatur: Hardware Feature Extraction for DNN Auto-tuning","authors":"J. Acosta, Andreas Diavastos, Antonio González","doi":"10.1109/ispass55109.2022.00013","DOIUrl":"https://doi.org/10.1109/ispass55109.2022.00013","url":null,"abstract":"In this work, we extend the auto-tuning process of the state-of-the-art TVM framework with XFeatur; a tool that extracts new meaningful hardware-related features that improve the quality of the representation of the search space and consequently improve the accuracy of its prediction algorithm. These new features provide information about the amount of thread-level parallelism, shared memory usage, register usage, dynamic instruction count and memory access dependencies. Optimizing ResNet-18 with the proposed features improves the quality of the search space representation by 63% on average and a maximum of 2× for certain tasks, while it reduces the tuning time by 9% (approximately 1.1 hours) and produces configurations that have equal or better performance (up to 92.7%) than the baseline.","PeriodicalId":115391,"journal":{"name":"2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"88 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125035424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}