Papers from the 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

Title: Quantifying Process Variations and Its Impacts on Smartphones
Authors: Guru Prasad Srinivasa, Scott Haseley, Geoffrey Challen, Mark Hempstead
DOI: https://doi.org/10.1109/ISPASS.2019.00019
Published: 2019-04-22
Abstract: Process variation can cause the performance and energy consumption of smartphones of the same model to vary significantly. While process variation has been studied in detail, its effects on smartphone performance have not been quantified and evaluated. In this work we study the performance and energy differences across 5 recent SoC generations caused by underlying process variation. We make two important contributions. First, we present a methodology for constructing a temperature-stabilized environment in which to perform repeatable power and performance measurements. Studying the power-performance characteristics of smartphones is difficult: running a benchmark back-to-back often produces significantly different results due to heat, and both device and ambient temperature play a significant role in determining performance and energy. Our methodology allows us to control for these factors and isolate the effects of the underlying process variation. We then apply our methodology to investigate the performance and energy characteristics of several recent generations of smartphone CPUs that result from process variation. Our results show that devices of the same model may exhibit differences of 10% in performance and 12% in energy over a fixed-duration workload.
{"title":"Parallelism Analysis of Prominent Desktop Applications: An 18- Year Perspective","authors":"Siying Feng, S. Pal, Yichen Yang, R. Dreslinski","doi":"10.1109/ISPASS.2019.00033","DOIUrl":"https://doi.org/10.1109/ISPASS.2019.00033","url":null,"abstract":"Improvements in clock speed and exploitation of Instruction-Level Parallelism (ILP) hit a roadblock during mid-2000s. This, coupled with the demise of Dennard scaling, led to the rise of multi-core machines. Today, multi-core processors are ubiquitous and architects have moved to specialization to work around the walls hit by single-core performance and chip Thermal Design Power (TDP). The pressure of innovation in the aftermath of Dennard scaling is shifting to software developers, who are required to write programs that make the most effective use of underlying hardware. This work presents quantitative and qualitative analyses of how software has evolved to reap the benefits of multi-core and heterogeneous computers, compared to state-of-the-art systems in 2000 and 2010. We study a wide spectrum of commonly-used applications on a state-of-the-art desktop machine and analyze two important metrics, Thread-Level Parallelism (TLP) and GPU utilization. We compare the results to prior work over the last two decades, which state that 2–3 CPU cores are sufficient for most applications and that the GPU is usually under-utilized. Our analyses show that the harnessed parallelism has improved and emerging workloads show good utilization of hardware resources. The average TLP across the applications we study is 3.1, with most applications attaining the maximum instantaneous TLP of 12 during execution. The GPU is over-provisioned for most applications, but workloads such as cryptocurrency mining utilize it to the fullest. Overall, we conclude that the effectiveness of software in utilizing the underlying hardware has improved, but still has scope for optimizations.","PeriodicalId":137786,"journal":{"name":"2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128688193","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: An Improved Dynamic Vertical Partitioning Technique for Semi-Structured Data
Authors: Sahel Sharify, A. W. Lu, Jin Chen, Arnamoy Bhattacharyya, Ali B. Hashemi, Nick Koudas, C. Amza
DOI: https://doi.org/10.1109/ISPASS.2019.00037
Published: 2019-03-24
Abstract: Semi-structured data such as JSON has become the de facto standard for data exchange on the Web. At the same time, relational support for JSON data poses new challenges due to the large number of attributes, sparse attributes, and dynamic changes in both workload and data set, all of which are typical of such data. In this paper, we address these challenges through a lightweight, in-memory relational database engine prototype and a flexible vertical partitioning algorithm that uses simple heuristics to adapt the data layout to the workload on the fly. Our experimental evaluation using the NoBench dataset for JSON data shows that we outperform Argo, a state-of-the-art data model that also maps the JSON data format onto relational databases, by a factor of 3. We also outperform Hyrise, a state-of-the-art vertical partitioning algorithm designed for in-memory databases, by 24%. Furthermore, our algorithm achieves around 40% better cache utilization and 35% better TLB utilization. Our experiments also show that our partitioning algorithm adapts to workload changes within a few seconds.
{"title":"Workload Characterization of Nondeterministic Programs Parallelized by STATS","authors":"E. A. Deiana, Simone Campanoni","doi":"10.1109/ISPASS.2019.00032","DOIUrl":"https://doi.org/10.1109/ISPASS.2019.00032","url":null,"abstract":"Chip Multiprocessors (CMP) are everywhere, from mobile systems, to servers. Thread Level Parallelism (TLP) is the characteristic of a program that makes use of the parallel cores of a CMP to generate performance. Despite all efforts for creating TLP, multiple cores are still underutilized even though we have been in the multicore era for more than a decade. Recently, a new approach called STATS has been proposed to generate additional TLP for complex and irregular nondeterministic programs. STATS allows a developer to describe application-specific information that its compiler uses to automatically generate a new source of TLP. This new source of TLP increases with the size of the input and it has the potential to generate scalable performance with the number of cores. Even though STATS obtains most of its potential, some of it is still unreached. This paper identifies and characterizes the sources of overhead that are currently blocking STATS parallelized programs to achieve their full potential. To this end, we characterized the workloads generated by the STATS compiler on a 28 core Intel-based machine (dual-socket). This paper shows that the performance loss is due to a combination of factors: some can be optimized via engineering efforts and some require a deeper evolution of STATS. We also highlight potential solutions to significantly reduce most of this overhead. Exploiting these insights will unblock scalable performance for the parallel binaries generated by STATS.","PeriodicalId":137786,"journal":{"name":"2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122718504","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: GeST: An Automatic Framework For Generating CPU Stress-Tests
Authors: Zacharias Hadjilambrou, Shidhartha Das, P. Whatmough, David M. Bull, Yiannakis Sazeides
DOI: https://doi.org/10.1109/ISPASS.2019.00009
Published: 2019-03-24
Abstract: This work presents GeST (Generator for Stress-Tests), a framework for automatically generating CPU stress-tests. The framework is based on genetic-algorithm search and can be used to maximize different target CPU metrics, such as power, temperature, instructions executed per cycle, and di/dt voltage noise. We demonstrate the generality and effectiveness of the framework by generating various workloads that stress the CPU power, thermal, and voltage margins more than both conventional benchmarks and manually written stress-tests. The key strengths of the framework are its extensibility and flexibility: the user can specify custom measurement and fitness functions as well as the CPU instructions used in the genetic-algorithm search. The paper demonstrates the framework's prowess by using it with both simple and complex fitness functions to generate stress-tests a) for various platform types, ranging from low-power mobile ARM CPUs to high-power x86 CPUs, and b) with different measurement instruments, such as oscilloscopes and software-accessible performance counters and sensors.
{"title":"Fast Modeling of the L2 Cache Reuse Distance Histograms from Software Traces","authors":"Jiancong Ge, Ming Ling","doi":"10.1109/ISPASS.2019.00025","DOIUrl":"https://doi.org/10.1109/ISPASS.2019.00025","url":null,"abstract":"As the speed gap between the CPU and the main memory keeps increasing, multi-level caches are widely used in modern processors to improve the memory access latency. Therefore, modeling behaviors of the downstream caches becomes a critical part of the processor performance evaluation. In this paper, we propose a fast, yet accurate, L2 cache reuse distance histogram model without time-consuming full simulations, which can be utilized to evaluate the L2 cache miss rate with the Random and LRU replacement policies. The inputs of our model only need to be profiled once and can be reused for evaluations of different L2 cache configurations. To evaluate our model, we compare the output L2 RDH from our model and that of gem5 cycle-accurate simulations. When used to evaluate the L2 cache miss rates, the average absolute error is 3% for SPEC2006 benchmarks.","PeriodicalId":137786,"journal":{"name":"2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122581680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: Hierarchical Page Eviction Policy for Unified Memory in GPUs
Authors: Qi Yu, B. Childers, Libo Huang, Cheng Qian, Zhiying Wang
DOI: https://doi.org/10.1109/ISPASS.2019.00027
Published: 2019-03-24
Abstract: The introduction of unified memory in discrete GPUs not only improves programmability but also enables memory oversubscription. However, it introduces high overhead when page faults occur. Therefore, when GPU memory is full, how to select eviction candidates becomes an important issue. The widely used LRU policy performs poorly for workloads with thrashing access patterns, and the advanced cache replacement policy RRIP incurs thrashing when applied directly to GPU memory. In this paper, we propose a hierarchical page eviction policy for GPU memory that relies on a software-managed page set chain to select eviction candidates. Results show that for 15 selected applications, our policy achieves average speedups of 1.44× and 1.2× over LRU when the oversubscription rate is 75% and 50%, respectively.

Title: Demystifying Bayesian Inference Workloads
Authors: Y. Wang, Yuhao Zhu, Glenn G. Ko, Brandon Reagen, Gu-Yeon Wei, D. Brooks
DOI: https://doi.org/10.1109/ISPASS.2019.00031
Published: 2019-03-24
Abstract: The recent surge of machine learning has motivated computer architects to focus intently on accelerating related workloads, especially deep learning. Deep learning has been the pillar algorithm behind the advancement of learning patterns from vast amounts of labeled data, i.e., supervised learning. For unsupervised learning, however, Bayesian methods often work better than deep learning. Bayesian modeling and inference work well with unlabeled or limited data, can leverage informative priors, and yield interpretable models. Despite being an important branch of machine learning, Bayesian inference has generally been overlooked by the architecture and systems communities. In this paper, we facilitate the study of Bayesian inference with the development of BayesSuite, a collection of seminal Bayesian inference workloads. We characterize the power and performance profiles of BayesSuite across a variety of current-generation processors and find significant diversity. Manually tuning and deploying Bayesian inference workloads requires a deep understanding of workload characteristics and hardware specifications. To address these challenges and provide high-performance, energy-efficient support for Bayesian inference, we introduce a scheduling and optimization mechanism that can be plugged into a system scheduler. We also propose a computation elision technique that further improves the performance and energy efficiency of the workloads by skipping computations that do not improve the quality of the inference. Our proposed techniques increase Bayesian inference performance by 5.8× on average over naive assignment and execution of the workloads.
{"title":"DSMM: A Dynamic Setting for Memory Management in Apache Spark","authors":"Suk-Joo Chae, Tae-Sun Chung","doi":"10.1109/ISPASS.2019.00024","DOIUrl":"https://doi.org/10.1109/ISPASS.2019.00024","url":null,"abstract":"Apache Spark (Spark) is a unified analytics engine for large-scale data processing. Unlike traditional data processing engines like Hadoop, Spark is a framework that caches data in memory. Therefore, memory management in Spark is importance. However, there are several factors that interfere with memory management. First, if users want to cache data in memory, they need to choose their own storage level. In this case, if they do not select the optimal storage level, Spark will be put a heavy burden on memory. Next, users need to select the ratio for spark memory directly within Spark. If they do not choose optimal ratio for spark memory, garbage collection overheads will be incurred. In this poster, we propose DSMM that dynamically select the above factors on the system for memory management. Our experimental result shows 13% execution time improvement as compared to standard Spark.","PeriodicalId":137786,"journal":{"name":"2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121086737","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: Racing to Hardware-Validated Simulation
Authors: Almutaz Adileh, Cecilia González-Alvarez, J. Ruiz, L. Eeckhout
DOI: https://doi.org/10.1109/ISPASS.2019.00014
Published: 2019-03-24
Abstract: Processor simulators rely on detailed timing models of the processor pipeline to evaluate performance. The diversity of real-world processor designs mandates building flexible simulators that expose parts of the underlying model to the user in the form of configurable parameters. Consequently, the accuracy of modeling a real processor relies both on the accuracy of the pipeline model itself and on the accuracy of adjusting the configuration parameters to match the modeled processor. Unfortunately, processor vendors publicly disclose only a subset of their design decisions, raising the probability of introducing specification inaccuracies when modeling these processors. Inaccurately tuned model parameters cause the simulated processor to deviate from the actual one; in the worst case, improper parameters may lead to imbalanced pipeline models that compromise the simulation output. Therefore, simulation models should be validated against hardware before being used for performance evaluation. As processors increase in complexity and diversity, validating a simulator model against real hardware becomes increasingly challenging and time-consuming. In this work, we propose a methodology for validating simulation models against real hardware. We create a framework that relies on micro-benchmarks to collect performance statistics on real hardware, and on machine-learning-based algorithms to fine-tune the unknown parameters based on the accumulated statistics. We overhaul the Sniper simulator to support the ARM AArch64 instruction-set architecture (ISA) and introduce two new timing models for ARM-based in-order and out-of-order cores. Using our proposed simulator validation framework, we tune the in-order and out-of-order models to match the performance of real-world implementations of the Cortex-A53 and Cortex-A72 cores with an average error of 7% and 15%, respectively, across a set of SPEC CPU2017 benchmarks.