Papers from the 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

Title: Quantifying Process Variations and Its Impacts on Smartphones
Authors: Guru Prasad Srinivasa, Scott Haseley, Geoffrey Challen, Mark Hempstead
DOI: https://doi.org/10.1109/ISPASS.2019.00019
Published: 2019-04-22
Abstract: Process variation can cause the performance and energy consumption of smartphones of the same model to vary significantly. While process variation has been studied in detail, its effects on smartphone performance have not been quantified and evaluated. In this work we study the performance and energy differences across 5 recent SoC generations caused by underlying process variation. We make two important contributions. First, we present a methodology for constructing a temperature-stabilized environment in which to perform repeatable power and performance measurements. Studying the power-performance characteristics of smartphones is difficult: running a benchmark back-to-back often produces significantly different results due to heat, and both device and ambient temperature play a significant role in determining performance and energy. Our methodology allows us to control for these factors and isolate the effects of the underlying process variation. We then apply our methodology to investigate the performance and energy characteristics of several recent generations of smartphone CPUs that result from process variation. Our results show that devices of the same model may exhibit differences of 10% in performance and 12% in energy over a fixed-duration workload.
{"title":"Parallelism Analysis of Prominent Desktop Applications: An 18- Year Perspective","authors":"Siying Feng, S. Pal, Yichen Yang, R. Dreslinski","doi":"10.1109/ISPASS.2019.00033","DOIUrl":"https://doi.org/10.1109/ISPASS.2019.00033","url":null,"abstract":"Improvements in clock speed and exploitation of Instruction-Level Parallelism (ILP) hit a roadblock during mid-2000s. This, coupled with the demise of Dennard scaling, led to the rise of multi-core machines. Today, multi-core processors are ubiquitous and architects have moved to specialization to work around the walls hit by single-core performance and chip Thermal Design Power (TDP). The pressure of innovation in the aftermath of Dennard scaling is shifting to software developers, who are required to write programs that make the most effective use of underlying hardware. This work presents quantitative and qualitative analyses of how software has evolved to reap the benefits of multi-core and heterogeneous computers, compared to state-of-the-art systems in 2000 and 2010. We study a wide spectrum of commonly-used applications on a state-of-the-art desktop machine and analyze two important metrics, Thread-Level Parallelism (TLP) and GPU utilization. We compare the results to prior work over the last two decades, which state that 2–3 CPU cores are sufficient for most applications and that the GPU is usually under-utilized. Our analyses show that the harnessed parallelism has improved and emerging workloads show good utilization of hardware resources. The average TLP across the applications we study is 3.1, with most applications attaining the maximum instantaneous TLP of 12 during execution. The GPU is over-provisioned for most applications, but workloads such as cryptocurrency mining utilize it to the fullest. Overall, we conclude that the effectiveness of software in utilizing the underlying hardware has improved, but still has scope for optimizations.","PeriodicalId":137786,"journal":{"name":"2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128688193","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: An Improved Dynamic Vertical Partitioning Technique for Semi-Structured Data
Authors: Sahel Sharify, A. W. Lu, Jin Chen, Arnamoy Bhattacharyya, Ali B. Hashemi, Nick Koudas, C. Amza
DOI: https://doi.org/10.1109/ISPASS.2019.00037
Published: 2019-03-24
Abstract: Semi-structured data such as JSON has become the de facto standard for data exchange on the Web. At the same time, relational support for JSON data poses new challenges due to the large number of attributes, sparse attributes, and dynamic changes in both workload and data set, all of which are typical of such data. In this paper, we address these challenges through a lightweight, in-memory relational database engine prototype and a flexible vertical partitioning algorithm that uses simple heuristics to adapt the data layout to the workload on the fly. Our experimental evaluation using the NoBench dataset for JSON data shows that we outperform Argo, a state-of-the-art data model that also maps the JSON data format onto relational databases, by a factor of 3. We also outperform Hyrise, a state-of-the-art vertical partitioning algorithm designed for in-memory databases, by 24%. Furthermore, our algorithm achieves around 40% better cache utilization and 35% better TLB utilization. Our experiments also show that our partitioning algorithm adapts to workload changes within a few seconds.
{"title":"Workload Characterization of Nondeterministic Programs Parallelized by STATS","authors":"E. A. Deiana, Simone Campanoni","doi":"10.1109/ISPASS.2019.00032","DOIUrl":"https://doi.org/10.1109/ISPASS.2019.00032","url":null,"abstract":"Chip Multiprocessors (CMP) are everywhere, from mobile systems, to servers. Thread Level Parallelism (TLP) is the characteristic of a program that makes use of the parallel cores of a CMP to generate performance. Despite all efforts for creating TLP, multiple cores are still underutilized even though we have been in the multicore era for more than a decade. Recently, a new approach called STATS has been proposed to generate additional TLP for complex and irregular nondeterministic programs. STATS allows a developer to describe application-specific information that its compiler uses to automatically generate a new source of TLP. This new source of TLP increases with the size of the input and it has the potential to generate scalable performance with the number of cores. Even though STATS obtains most of its potential, some of it is still unreached. This paper identifies and characterizes the sources of overhead that are currently blocking STATS parallelized programs to achieve their full potential. To this end, we characterized the workloads generated by the STATS compiler on a 28 core Intel-based machine (dual-socket). This paper shows that the performance loss is due to a combination of factors: some can be optimized via engineering efforts and some require a deeper evolution of STATS. We also highlight potential solutions to significantly reduce most of this overhead. Exploiting these insights will unblock scalable performance for the parallel binaries generated by STATS.","PeriodicalId":137786,"journal":{"name":"2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122718504","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: GeST: An Automatic Framework For Generating CPU Stress-Tests
Authors: Zacharias Hadjilambrou, Shidhartha Das, P. Whatmough, David M. Bull, Yiannakis Sazeides
DOI: https://doi.org/10.1109/ISPASS.2019.00009
Published: 2019-03-24
Abstract: This work presents GeST (Generator for Stress-Tests), a framework for automatically generating CPU stress-tests. The framework is based on genetic-algorithm search and can be used to maximize different target CPU metrics, such as power, temperature, instructions executed per cycle, and di/dt voltage noise. We demonstrate the generality and effectiveness of the framework by generating various workloads that stress the CPU power, thermal, and voltage margins more than both conventional benchmarks and manually written stress-tests. The key strengths of the framework are its extensibility and flexibility: the user can specify custom measurement and fitness functions as well as the CPU instructions used in the genetic-algorithm search. The paper demonstrates the framework's prowess by using it with both simple and complex fitness functions to generate stress-tests a) for various platform types, ranging from low-power mobile ARM CPUs to high-power x86 CPUs, and b) with different measurement instruments, such as oscilloscopes and software-accessible performance counters and sensors.
{"title":"Fast Modeling of the L2 Cache Reuse Distance Histograms from Software Traces","authors":"Jiancong Ge, Ming Ling","doi":"10.1109/ISPASS.2019.00025","DOIUrl":"https://doi.org/10.1109/ISPASS.2019.00025","url":null,"abstract":"As the speed gap between the CPU and the main memory keeps increasing, multi-level caches are widely used in modern processors to improve the memory access latency. Therefore, modeling behaviors of the downstream caches becomes a critical part of the processor performance evaluation. In this paper, we propose a fast, yet accurate, L2 cache reuse distance histogram model without time-consuming full simulations, which can be utilized to evaluate the L2 cache miss rate with the Random and LRU replacement policies. The inputs of our model only need to be profiled once and can be reused for evaluations of different L2 cache configurations. To evaluate our model, we compare the output L2 RDH from our model and that of gem5 cycle-accurate simulations. When used to evaluate the L2 cache miss rates, the average absolute error is 3% for SPEC2006 benchmarks.","PeriodicalId":137786,"journal":{"name":"2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122581680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: Hierarchical Page Eviction Policy for Unified Memory in GPUs
Authors: Qi Yu, B. Childers, Libo Huang, Cheng Qian, Zhiying Wang
DOI: https://doi.org/10.1109/ISPASS.2019.00027
Published: 2019-03-24
Abstract: The introduction of unified memory in discrete GPUs not only improves programmability but also enables memory oversubscription. However, it introduces high overhead when page faults occur. Therefore, when GPU memory is full, how to select eviction candidates becomes an important issue. The widely used LRU policy performs poorly for workloads with thrashing access patterns, and the advanced cache replacement policy RRIP incurs thrashing when applied directly to GPU memory. In this paper, we propose a hierarchical page eviction policy for GPU memory that relies on a software-managed page set chain to select eviction candidates. Results show that for 15 selected applications, our policy achieves average speedups of 1.44× and 1.2× over LRU when the oversubscription rate is 75% and 50%, respectively.

Title: Demystifying Bayesian Inference Workloads
Authors: Y. Wang, Yuhao Zhu, Glenn G. Ko, Brandon Reagen, Gu-Yeon Wei, D. Brooks
DOI: https://doi.org/10.1109/ISPASS.2019.00031
Published: 2019-03-24
Abstract: The recent surge of machine learning has motivated computer architects to focus intently on accelerating related workloads, especially deep learning. Deep learning has been the pillar algorithm behind the advancement of learning patterns from vast amounts of labeled data, i.e., supervised learning. For unsupervised learning, however, Bayesian methods often work better than deep learning. Bayesian modeling and inference work well with unlabeled or limited data, can leverage informative priors, and yield interpretable models. Despite being an important branch of machine learning, Bayesian inference has generally been overlooked by the architecture and systems communities. In this paper, we facilitate the study of Bayesian inference with the development of BayesSuite, a collection of seminal Bayesian inference workloads. We characterize the power and performance profiles of BayesSuite across a variety of current-generation processors and find significant diversity. Manually tuning and deploying Bayesian inference workloads requires a deep understanding of workload characteristics and hardware specifications. To address these challenges and provide high-performance, energy-efficient support for Bayesian inference, we introduce a scheduling and optimization mechanism that can be plugged into a system scheduler. We also propose a computation elision technique that further improves the performance and energy efficiency of the workloads by skipping computations that do not improve the quality of the inference. Our proposed techniques increase Bayesian inference performance by 5.8× on average over naive assignment and execution of the workloads.
{"title":"DSMM: A Dynamic Setting for Memory Management in Apache Spark","authors":"Suk-Joo Chae, Tae-Sun Chung","doi":"10.1109/ISPASS.2019.00024","DOIUrl":"https://doi.org/10.1109/ISPASS.2019.00024","url":null,"abstract":"Apache Spark (Spark) is a unified analytics engine for large-scale data processing. Unlike traditional data processing engines like Hadoop, Spark is a framework that caches data in memory. Therefore, memory management in Spark is importance. However, there are several factors that interfere with memory management. First, if users want to cache data in memory, they need to choose their own storage level. In this case, if they do not select the optimal storage level, Spark will be put a heavy burden on memory. Next, users need to select the ratio for spark memory directly within Spark. If they do not choose optimal ratio for spark memory, garbage collection overheads will be incurred. In this poster, we propose DSMM that dynamically select the above factors on the system for memory management. Our experimental result shows 13% execution time improvement as compared to standard Spark.","PeriodicalId":137786,"journal":{"name":"2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121086737","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: Racing to Hardware-Validated Simulation
Authors: Almutaz Adileh, Cecilia González-Alvarez, J. Ruiz, L. Eeckhout
DOI: https://doi.org/10.1109/ISPASS.2019.00014
Published: 2019-03-24
Abstract: Processor simulators rely on detailed timing models of the processor pipeline to evaluate performance. The diversity of real-world processor designs mandates building flexible simulators that expose parts of the underlying model to the user in the form of configurable parameters. Consequently, the accuracy of modeling a real processor relies both on the accuracy of the pipeline model itself and on the accuracy of adjusting the configuration parameters to match the modeled processor. Unfortunately, processor vendors publicly disclose only a subset of their design decisions, raising the probability of introducing specification inaccuracies when modeling these processors. Inaccurately tuned model parameters cause the simulated processor to deviate from the actual one; in the worst case, improper parameters may lead to imbalanced pipeline models that compromise the simulation output. Therefore, simulation models should be validated against hardware before being used for performance evaluation. As processors increase in complexity and diversity, validating a simulator model against real hardware becomes increasingly challenging and time-consuming. In this work, we propose a methodology for validating simulation models against real hardware. We create a framework that relies on micro-benchmarks to collect performance statistics on real hardware, and on machine-learning-based algorithms to fine-tune the unknown parameters based on the accumulated statistics. We overhaul the Sniper simulator to support the ARM AArch64 instruction-set architecture (ISA) and introduce two new timing models for ARM-based in-order and out-of-order cores. Using our proposed simulator validation framework, we tune the in-order and out-of-order models to match the performance of real-world implementations of the Cortex-A53 and Cortex-A72 cores with an average error of 7% and 15%, respectively, across a set of SPEC CPU2017 benchmarks.