Title: PARADISE - Post-Moore Architecture and Accelerator Design Space Exploration Using Device Level Simulation and Experiments
Authors: Dilip P. Vasudevan, George Michelogiannakis, D. Donofrio, J. Shalf
DOI: https://doi.org/10.1109/ISPASS.2019.00022
Published in: 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2019-03-24
Abstract: An increasing number of technologies are being proposed to preserve digital computing performance scaling as lithographic scaling slows. These technologies include new devices, specialized architectures, memories, and 3D integration. Currently, no end-to-end tool flow is available to rapidly perform architectural-level evaluation using device-level models for a variety of emerging technologies at once. We propose PARADISE, an open-source, comprehensive methodology for evaluating emerging technologies with a vertical simulation flow from the individual device level all the way up to the architectural level. To demonstrate its effectiveness, we use PARADISE to perform end-to-end simulation and analysis of heterogeneous architectures using CNFETs, TFETs, and NCFETs, along with multiple hardware designs. To demonstrate its accuracy, we show that PARADISE has only a 6% mean deviation for delay and 9% for power compared to previous studies using commercial synthesis tools.
Title: Characterization of Unnecessary Computations in Web Applications
Authors: Hossein Golestani, S. Mahlke, S. Narayanasamy
DOI: https://doi.org/10.1109/ISPASS.2019.00010
Published in: 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2019-03-24
Abstract: Web applications are widely used in many different daily activities, such as online shopping, navigation through maps, and social networking, in both desktop and mobile environments. Advances in technology, such as network connections, hardware platforms, and software design techniques, have empowered Web developers to design Web pages that are highly rich in content and engage users through an interactive experience. However, the performance of Web applications is not ideal today, and many users experience poor quality of service, including long page load times and irregular animations. One of the contributing factors to low performance is the very design of Web applications, and particularly of Web browsers. In this work, we argue that today's Web applications perform unnecessary computations, which are completely, or most likely, wasted. We first describe these potential unnecessary computations at a high level, and then design a profiler based on dynamic backward program slicing that detects them. Our profiler reveals that across four different websites, only 45% of dynamically executed instructions are useful in rendering the main page, on average. We then analyze and categorize the unnecessary computations. Our analysis shows that processing JavaScript code is the most notable category of unnecessary computation, specifically during page loading. Such computations are therefore either completely wasted or could be deferred to a later time, i.e., when they are actually needed, thereby providing higher performance and better energy efficiency.
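The profiler described above rests on dynamic backward program slicing: starting from the instructions that produced the rendered page, walk use-def edges backward; any executed instruction never reached is an unnecessary computation. A minimal sketch in Python, where the trace format, the `deps` use-def map, and the render-output criterion are all illustrative assumptions, not the paper's implementation:

```python
# Toy dynamic backward slice over an executed-instruction trace.
def backward_slice(deps, criterion):
    """deps maps an instruction id to the ids it reads from; returns
    every instruction the slicing criterion transitively depends on."""
    useful, stack = set(), list(criterion)
    while stack:
        inst = stack.pop()
        if inst in useful:
            continue
        useful.add(inst)
        stack.extend(deps.get(inst, ()))
    return useful

# A tiny trace: instructions 0..5; instruction 5 writes a pixel (the
# slice criterion); 3 and 4 compute a value the rendered page never reads.
deps = {5: [2], 2: [0, 1], 4: [3]}
useful = backward_slice(deps, [5])
wasted = {0, 1, 2, 3, 4, 5} - useful
print(sorted(useful), sorted(wasted))  # instructions 3 and 4 are unnecessary
```

Computing `len(useful)` over the full trace, divided by the trace length, would give the "fraction of useful instructions" figure the abstract reports.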
Title: RPPM: Rapid Performance Prediction of Multithreaded Workloads on Multicore Processors
Authors: S. D. Pestel, S. V. D. Steen, Shoaib Akram, L. Eeckhout
DOI: https://doi.org/10.1109/ISPASS.2019.00038
Published in: 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2019-03-01
Abstract: Analytical performance modeling is a useful complement to detailed cycle-level simulation for quickly exploring the design space in an early design stage. Mechanistic analytical modeling is particularly interesting, as it provides deep insight and does not require expensive offline profiling as empirical modeling does. Previous work in mechanistic analytical modeling, unfortunately, is limited to single-threaded applications running on single-core processors. This work proposes RPPM, a mechanistic analytical performance model for multithreaded applications on multicore hardware. RPPM collects microarchitecture-independent characteristics of a multithreaded workload to predict performance on a previously unseen multicore architecture. The profile needs to be collected only once to predict a range of processor architectures. We evaluate RPPM's accuracy against simulation and report a performance prediction error of 11.2% on average (23% at most). We demonstrate RPPM's usefulness for conducting design space exploration experiments as well as for analyzing parallel application performance.
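Mechanistic models of this family typically predict execution time as a base (dispatch-limited) term plus additive penalties for miss events, with the event counts coming from a microarchitecture-independent profile and the penalties from the target machine. A minimal sketch of that additive CPI-stack form, with made-up profile numbers that illustrate the approach rather than RPPM's actual equations:

```python
# Interval-style mechanistic model: cycles = insns * base_cpi
#                                           + sum(event_count * penalty).
def predict_cycles(insns, base_cpi, miss_events):
    """Base dispatch-limited cycles plus one penalty term per miss-event
    type; miss_events is a list of (count, penalty_cycles) pairs."""
    cycles = insns * base_cpi
    for count, penalty in miss_events:
        cycles += count * penalty
    return cycles

# Hypothetical one-thread profile: 1M instructions, plus miss counts
# that the target machine converts into cycle penalties.
profile = [(5_000, 200),    # LLC misses x DRAM access latency
           (20_000, 10)]    # branch mispredictions x pipeline refill
print(predict_cycles(1_000_000, 0.5, profile))  # 1,700,000 cycles
```

A multicore model like RPPM additionally has to account for synchronization (barriers, critical sections) and shared-resource contention; the single-thread stack above is only the starting point.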
Title: 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) ISPASS 2019
Authors: Matthew Halpern
DOI: https://doi.org/10.1109/ispass.2019.00004
Published in: 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2019-03-01
Abstract: (front matter; no abstract available)
Title: Timeloop: A Systematic Approach to DNN Accelerator Evaluation
Authors: A. Parashar, Priyanka Raina, Y. Shao, Yu-hsin Chen, Victor A. Ying, Anurag Mukkara, Rangharajan Venkatesan, Brucek Khailany, S. Keckler, J. Emer
DOI: https://doi.org/10.1109/ISPASS.2019.00042
Published in: 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2019-03-01
Abstract: This paper presents Timeloop, an infrastructure for evaluating and exploring the architecture design space of deep neural network (DNN) accelerators. Timeloop uses a concise and unified representation of the key architecture and implementation attributes of DNN accelerators to describe a broad space of hardware topologies. It can then emulate those topologies to generate an accurate projection of performance and energy efficiency for a DNN workload through a mapper that finds the best way to schedule operations and stage data on the specified architecture. This enables fair comparisons across different architectures and makes DNN accelerator design more systematic. This paper describes Timeloop's underlying models and algorithms in detail and shows results from case studies enabled by Timeloop, which provide interesting insights into the current state of DNN architecture design. In particular, they reveal that dataflow and memory hierarchy co-design plays a critical role in optimizing energy efficiency. They also reveal that, due to flexibility-versus-efficiency trade-offs, no single architecture yet achieves the best performance and energy efficiency across a diverse set of workloads. These results suggest possible directions for DNN accelerator research.
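The mapper idea, enumerating the legal ways to tile and schedule a loop nest and keeping the one with the lowest modeled cost, can be sketched for a tiny matrix multiply. The one-level buffer constraint and traffic model below are illustrative assumptions, far simpler than Timeloop's actual models:

```python
# Pick the output-column tile size Nt for C = A x B (M x K x N) that
# minimizes modeled DRAM traffic, subject to a buffer-capacity constraint.
import math

def dram_traffic(M, K, N, Nt):
    """B tiles (K x Nt) stay resident in the buffer; A is re-read once
    per B tile; C is written once. Traffic counted in elements."""
    n_tiles = math.ceil(N / Nt)
    return K * N + M * K * n_tiles + M * N  # read B + re-read A + write C

def best_mapping(M, K, N, buf_elems):
    """Exhaustive 'mapper': try every tile size that fits the buffer."""
    legal = [Nt for Nt in range(1, N + 1) if K * Nt <= buf_elems]
    return min(legal, key=lambda Nt: dram_traffic(M, K, N, Nt))

M, K, N, buf = 64, 64, 64, 2048
Nt = best_mapping(M, K, N, buf)
print(Nt, dram_traffic(M, K, N, Nt))  # largest legal tile wins: Nt = 32
```

Timeloop's real mapspace is vastly larger (per-level tilings, loop permutations, spatial unrolling), but the structure, enumerate legal mappings and rank them with an analytical traffic/energy model, is the same.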
Title: A Model Driven Approach Towards Improving the Performance of Apache Spark Applications
Authors: Kewen Wang, Mohammad Maifi Hasan Khan, Nhan Nguyen, S. Gokhale
DOI: https://doi.org/10.1109/ISPASS.2019.00036
Published in: 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2019-03-01
Abstract: Apache Spark applications often execute in multiple stages, where each stage consists of multiple tasks running in parallel. However, prior efforts noted that the execution times of different tasks within a stage can vary significantly for various reasons (e.g., inefficient partitioning of input data), and that tasks can be distributed unevenly across worker nodes for different reasons (e.g., data co-locality). While these problems are well known, it is nontrivial to predict and address them effectively. In this paper we present an analytical, model-driven approach that can predict the possibility of such problems by executing an application with a limited amount of input data, and can recommend ways to address the identified problems by repartitioning input data (in the case of the task straggler problem) and/or changing the locality configuration setting (in the case of the skewed task distribution problem). The novelty of our approach lies in automatically predicting the potential problems a priori, based on limited execution data, and recommending the locality setting and partition count. Our experimental results using nine Apache Spark applications on two different clusters show that our model-driven approach can predict these problems with high accuracy and improve performance by up to 71%.
Title: Full-System Simulation of Mobile CPU/GPU Platforms
Authors: Kuba Kaszyk, Harry Wagstaff, T. Spink, Björn Franke, M. O’Boyle, Bruno Bodin, Henrik Uhrenholt
DOI: https://doi.org/10.1109/ISPASS.2019.00015
Published in: 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2019-03-01
Abstract: Graphics Processing Units (GPUs) critically rely on a complex system software stack comprising kernel- and user-space drivers and just-in-time (JIT) compilers. Yet existing GPU simulators typically abstract away details of the software stack and GPU instruction set. Partly, this is because GPU vendors rarely release sufficient information about their latest GPU products. However, it is also due to the lack of an integrated CPU/GPU simulation framework that is complete and powerful enough to drive the complex GPU software environment. This has led to a situation where research on GPU architectures and compilers is largely based on outdated or greatly simplified architectures and software stacks, undermining the validity of the generated results. In this paper we develop a full-system simulation environment for a mobile platform, which enables users to run a complete and unmodified software stack for a state-of-the-art device powered by a mobile Arm CPU and a Mali-G71 GPU. We validate our simulator against a hardware implementation and Arm's stand-alone GPU simulator, achieving 100% architectural accuracy across all available toolchains. We demonstrate the capability of our GPU simulation framework by optimizing an advanced computer vision application using simulated statistics unavailable with other simulation approaches or physical GPU implementations. We demonstrate that performance optimizations for desktop GPUs trigger bottlenecks on mobile GPUs, and show the importance of efficient memory use.
Title: One Size Does Not Fit All: Quantifying and Exposing the Accuracy-Latency Trade-Off in Machine Learning Cloud Service APIs via Tolerance Tiers
Authors: Matthew Halpern, Behzad Boroujerdian, Todd W. Mummert, E. Duesterwald, V. Reddi
DOI: https://doi.org/10.1109/ISPASS.2019.00012
Published in: 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2019-03-01
Abstract: Today's cloud service architectures follow a "one size fits all" deployment strategy, where the same service version instantiation is provided to all end users. However, the consumer base is broad, and different applications have different accuracy and responsiveness requirements, which, as we demonstrate, renders the "one size fits all" approach inefficient in practice. We use a production-grade speech recognition engine, which serves several thousand users, and an open-source computer vision based system to illustrate the point. To overcome the limitations of the "one size fits all" approach, we propose Tolerance Tiers, where each MLaaS tier exposes an accuracy/responsiveness characteristic, and consumers can programmatically select a tier. We evaluate our proposal on the CPU-based automatic speech recognition (ASR) engine and on cutting-edge neural networks for image classification deployed on both CPUs and GPUs. The results show that our proposed approach provides an MLaaS cloud service architecture that can be tuned by the end API user or consumer to outperform the conventional "one size fits all" approach.
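Programmatic tier selection reduces to picking the lowest-latency tier that still satisfies the caller's accuracy bound. A minimal sketch, in which the tier names and accuracy/latency figures are invented for illustration and do not come from the paper:

```python
# Hypothetical tolerance tiers: (name, top-1 accuracy, p99 latency in ms).
TIERS = [
    ("fast",     0.88,  40),
    ("balanced", 0.93, 120),
    ("accurate", 0.97, 400),
]

def select_tier(min_accuracy):
    """Return the lowest-latency tier meeting the accuracy bound,
    or None if no tier is accurate enough."""
    ok = [t for t in TIERS if t[1] >= min_accuracy]
    return min(ok, key=lambda t: t[2]) if ok else None

print(select_tier(0.90)[0])  # 'balanced': fastest tier above 90% accuracy
```

An interactive application might call `select_tier(0.85)` to stay responsive, while a batch pipeline calls `select_tier(0.95)` and accepts the latency; that per-consumer choice is exactly what "one size fits all" precludes.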
Title: DeLTA: GPU Performance Model for Deep Learning Applications with In-Depth Memory System Traffic Analysis
Authors: Sangkug Lym, Donghyuk Lee, Mike O'Connor, Niladrish Chatterjee, M. Erez
DOI: https://doi.org/10.1109/ISPASS.2019.00041
Published in: 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2019-03-01
Abstract: Training convolutional neural networks (CNNs) requires intense compute throughput and high memory bandwidth. In particular, convolution layers account for the majority of the execution time of CNN training, and GPUs are commonly used to accelerate these layer workloads. Optimizing GPU designs for efficient CNN training acceleration requires accurately modeling how their performance improves as computing and memory resources are increased. We present DeLTA, the first analytical model that accurately estimates the traffic at each level of the GPU memory hierarchy while accounting for the complex reuse patterns of a parallel convolution algorithm. We demonstrate that our model is both accurate and robust across different CNNs and GPU architectures. We then show how this model can be used to carefully balance the scaling of different GPU resources for efficient CNN performance improvement.
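A first-order version of such traffic analysis counts a convolution layer's multiply-accumulates against its minimum DRAM traffic, then compares the resulting arithmetic intensity with the machine's balance point to classify the layer as compute- or memory-bound. The perfect-reuse, single-level model and the machine-balance number below are illustrative simplifications, not DeLTA's per-level model:

```python
# Minimum-traffic estimate for one convolution layer: each tensor is
# assumed to move over DRAM exactly once (perfect on-chip reuse).
def conv_layer_stats(N, C, K, H, W, R, S, bytes_per_el=2):
    """N batch, C in-channels, K filters, HxW output, RxS filter."""
    macs = N * K * C * H * W * R * S
    dram = bytes_per_el * (N * C * (H + R - 1) * (W + S - 1)  # input
                           + K * C * R * S                    # weights
                           + N * K * H * W)                   # output
    return macs, dram, macs / dram  # ops per DRAM byte

macs, dram, intensity = conv_layer_stats(32, 64, 64, 56, 56, 3, 3)
machine_balance = 50  # sustained FLOP per DRAM byte; illustrative number
print("memory-bound" if intensity < machine_balance else "compute-bound")
```

DeLTA's contribution is precisely that real convolution algorithms do not achieve this perfect reuse: the model tracks how much traffic each level of the hierarchy (L1, L2, DRAM) actually sees under the parallel algorithm's reuse pattern.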
Title: ISPASS 2019 Program Committee
DOI: https://doi.org/10.1109/ispass.2019.00008
Published in: 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2019-03-01

Program Committee members:
Tosiron Adegbija, University of Arizona
Junwhan Ahn, Google
J. Nelson Amaral, University of Alberta
Sarah Bird, Facebook
Simone Campanoni, Northwestern University
Trevor E. Carlson, National University of Singapore
Lizhong Chen, Oregon State University
Jason Clemons, NVIDIA
Jeanine Cook, Sandia National Laboratories
Stephan Diestelhorst, ARM
Stijn Eyerman, Intel
Dimitris Gizopoulos, University of Athens
Rajiv Gupta, University of California Riverside
Rui Hou, Institute of Information Engineering
Lizy John, University of Texas at Austin
David Kaeli, Northeastern University
Ulya Karpuzcu, University of Minnesota/Brown University
Omer Khan, University of Connecticut
Jangwoo Kim, Seoul National University
John Kim, Korea Advanced Institute of Science and Technology
Qiuyun Llull, VMware
Xiaosong Ma, Qatar Computing Research Institute
Andreas Moshovos, University of Toronto
Moriyoshi Ohara, IBM Research Tokyo
Michael Papamichael, Microsoft Research
Antonio J. Peña, Barcelona Supercomputing Center (BSC)
Ravi Soundararajan, VMware
Hyojin Sung, IBM Research
Radu Teodorescu, Ohio State University
Yash Ukidave, AMD
Yuhao Zhu, University of Rochester

External members:
Adrián Castelló, Universitat Jaume I (UJI)
Yang Hu, University of Texas at Dallas
Xiongchao Tang, Tsinghua University