"BHive: A Benchmark Suite and Measurement Framework for Validating x86-64 Basic Block Performance Models"
Yishen Chen, Ajay Brahmakshatriya, Charith Mendis, Alex Renda, Eric Hamilton Atkinson, O. Sýkora, Saman P. Amarasinghe, Michael Carbin
2019 IEEE International Symposium on Workload Characterization (IISWC), November 2019. DOI: 10.1109/IISWC47752.2019.9042166

Abstract: Compilers and performance engineers use hardware performance models to simplify program optimizations. Performance models provide a necessary abstraction over complex modern processors. However, constructing and maintaining a performance model can be onerous, given the numerous microarchitectural optimizations employed by modern processors. Despite their complexity and reported inaccuracy (e.g., deviating from native measurement by more than 30%), existing performance models, such as IACA and llvm-mca, have not been systematically validated, because there is no scalable machine-code profiler that can automatically obtain the throughput of arbitrary basic blocks while conforming to common modeling assumptions. In this paper, we present a novel profiler that can profile arbitrary memory-accessing basic blocks without any user intervention. We used this profiler to build BHive, a benchmark suite for the systematic validation of performance models of x86-64 basic blocks, and used BHive to evaluate four existing performance models: IACA, llvm-mca, Ithemal, and OSACA. We automatically cluster the basic blocks in the benchmark suite based on their utilization of CPU resources; using this clustering, the benchmark can give a detailed analysis of a performance model's strengths and weaknesses on different workloads (e.g., vectorized vs. scalar basic blocks). We additionally demonstrate that our dataset captures basic properties of two Google applications: Spanner and Dremel.
{"title":"Persistent Memory Workload Characterization: A Hardware Perspective*","authors":"Xiao Liu, Bhaskar Jupudi, P. Mehra, Jishen Zhao","doi":"10.1109/IISWC47752.2019.9042041","DOIUrl":"https://doi.org/10.1109/IISWC47752.2019.9042041","url":null,"abstract":"Persistent memory is a new tier of memory that functions as a hybrid of traditional storage systems and main memory. It combines the advantages of both: the data persistence property of storage with the byte-addressability and fast load/store interface of memory. As such, persistent memory provides direct data access without the performance and energy overhead of secondary storage access. Being at early stages of development, most previous persistent memory system designs are motivated and evaluated by software-based performance profiling and characterization. Yet by attaching on the processor-memory bus, persistent memory is managed by both system software and hardware control units in processors and memory devices. Therefore, understanding the hardware behavior is critical to unlocking the full potential of persistent memory. In this paper, we explore the performance interaction across applications, persistent memory system software, and hardware components, such as caching, address translation, buffering, and control logic in processors and memory systems. Based on our characterization results, we provide a set of implications and recommendations that can be used to optimize persistent memory system software and hardware designs.","PeriodicalId":121068,"journal":{"name":"2019 IEEE International Symposium on Workload Characterization (IISWC)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122366478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Evaluation of Non-Volatile Memory Based Last Level Cache Given Modern Use Case Behavior"
Alexander Hankin, Tomer Shapira, K. Sangaiah, Michael Lui, Mark Hempstead
2019 IEEE International Symposium on Workload Characterization (IISWC), November 2019. DOI: 10.1109/IISWC47752.2019.9042051

Abstract: To confront the memory wall and keep up with the demands of changing use cases, Non-Volatile Memories (NVMs) have begun to be considered as a replacement for SRAM in the Last Level Cache (LLC). Recent work has shown that the small cell size of NVMs like Spin-Torque Transfer RAM (STT-RAM) and Resistive RAM (RRAM) allows designers to build significantly denser LLCs than those with SRAM-based cells, in some cases storing up to 10× more data on-chip than before. As the working-set sizes of use cases grow with the advent of statistical inference (e.g., machine learning (ML) and artificial intelligence (AI)), more capacity close to the processor is necessary to keep up with the demand for performance and low power. Despite the growing potential of NVM-based LLCs, fundamental problems remain. First, the research community lacks a methodology for consistently modeling these devices, which leads to apples-to-oranges comparisons across NVM-based LLCs. Second, NVMs exhibit a key operational difference from SRAM: read/write asymmetry. The effects of this asymmetry on use-case performance and power are mostly unknown, with prior art relying only on total read and write counts and on limited sets of use cases. In this work we present two novel contributions: (1) a set of heuristics for modeling emerging NVM-based LLCs, and (2) a workload characterization framework that learns how architecture-agnostic features, like entropy and working-set size, affect the performance and power of an NVM-based LLC system for different use cases. In addition, we release our NVM cell models and make them publicly available online. Using our NVM-based LLC models, we show that NVM-based LLC energy use is up to an order of magnitude less than that of an SRAM-based LLC, while ED2P is generally on par. From our workload characterization framework, we show that for the AI use cases, energy and speedup are 99% correlated with write entropy and 90% correlated with write footprint and unique write footprint, while negligibly correlated with total read and write footprint.
{"title":"Faster than Flash: An In-Depth Study of System Challenges for Emerging Ultra-Low Latency SSDs","authors":"Sungjoon Koh, Junhyeok Jang, Changrim Lee, Miryeong Kwon, Jie Zhang, Myoungsoo Jung","doi":"10.1109/IISWC47752.2019.9042009","DOIUrl":"https://doi.org/10.1109/IISWC47752.2019.9042009","url":null,"abstract":"Emerging storage systems with new flash exhibit ultra-low latency (ULL) that can address performance disparities between DRAM and conventional solid state drives (SSDs) in the memory hierarchy. Considering the advanced low-latency characteristics, different types of I/O completion methods (polling/hybrid) and storage stack architecture (SPDK) are proposed. While these new techniques are expected to take costly software interventions off the critical path in ULL-applied systems, unfortunately no study exists to quantitatively analyze system-level characteristics and challenges of combining such newly-introduced techniques with real ULL SSDs. In this work, we comprehensively perform empirical evaluations with 800GB ULL SSD prototypes and characterize ULL behaviors by considering a wide range of I/O path parameters, such as different queues and access patterns. We then analyze the efficiencies and challenges of the polled-mode and hybrid polling I/O completion methods (added into Linux kernels 4.4 and 4.10, respectively) and compare them with the efficiencies of a conventional interrupt-based I/O path. In addition, we revisit the common expectations of SPDK by examining all the system resources and parameters. Finally, we demonstrate the challenges of ULL SSDs in a real SPDK-enabled server-client system. Based on the performance behaviors that this study uncovers, we also discuss several system implications, which are required to take a full advantage of ULL SSD in the future.","PeriodicalId":121068,"journal":{"name":"2019 IEEE International Symposium on Workload Characterization (IISWC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128632095","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"One Size Doesn't Fit All: Quantifying Performance Portability of Graph Applications on GPUs","authors":"Tyler Sorensen, Sreepathi Pai, A. Donaldson","doi":"10.1109/IISWC47752.2019.9042139","DOIUrl":"https://doi.org/10.1109/IISWC47752.2019.9042139","url":null,"abstract":"Hand-optimising graph algorithm code for different GPUs is particularly labour-intensive and error-prone, involving complex and ill-understood interactions between GPU chips, applications, and inputs. Although the generation of optimised variants has been automated through graph algorithm DSL compilers, these do not yet use an optimisation policy. Instead they defer to techniques like autotuning, which can produce good results, but at the expense of portability. In this work, we propose a methodology to automatically identify portable optimisation policies that can be tailored (“semi-specialised”) as needed over a combination of chips, applications and inputs. Using a graph algorithm DSL compiler that targets the OpenCL programming model, we demonstrate optimising graph algorithms to run in a portable fashion across a wide range of GPU devices for the first time. We use this compiler and its optimisation space as the basis for a large empirical study across 17 graph applications, 3 diverse graph inputs and 6 GPUs spanning multiple vendors. We show that existing automatic approaches for building a portable optimisation policy fall short on our dataset, providing trivial or biased results. Thus, we present a new statistical analysis which can characterise optimisations and quantify performance trade-offs at various degrees of specialisation. We use this analysis to quantify the performance tradeoffs as portability is sacrificed for specialisation across three natural dimensions: chip, application, and input. Compared to not optimising programs at all, a fully portable approach provides a $1.15times$ improvement in geometric mean performance, rising to $1.29 times$ when specialised to application and inputs (but not hardware). Furthermore, these semi-specialised optimisations provide insights into performance-critical features of specialisation. For example, optimisations specialised by chip reveal subtle, yet performance-critical, characteristics of various GPUs.","PeriodicalId":121068,"journal":{"name":"2019 IEEE International Symposium on Workload Characterization (IISWC)","volume":"216 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116370007","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Characterizing the Performance/Accuracy Tradeoff of High-Precision Applications via Auto-tuning*","authors":"Ruidong Gu, Paul Beata, M. Becchi","doi":"10.1109/IISWC47752.2019.9042137","DOIUrl":"https://doi.org/10.1109/IISWC47752.2019.9042137","url":null,"abstract":"Many scientific applications (e.g., molecular dynamics, climate modeling and astrophysical simulations) rely on floating-point arithmetic. Floating-point representation is by definition a finite approximation of real numbers, and thus it can lead to inaccuracy and reproducibility issues. To overcome these issues, existing work has proposed high-precision floating-point libraries to be used in scientific simulations, but they come at the cost of significant additional execution time. In this work we analyze performance and accuracy effects from tuning down groups of variables and operations guided by compile-time considerations. The goal of our tuning approach is to convert existing floating-point programs to mixed precision while balancing accuracy and performance. To this end, the tuner starts by maximizing accuracy through the use of a high-precision library and then achieves performance gains under a given error bound by incrementally tuning down groups of variables and operations from higher to lower precision (e.g., double precision). The approach provides input-data independence in its results by defining tuning strategies based on loop structures and the investigation of floating-point computation patterns. In addition, it has a smaller search space than exhaustive or bitonic search algorithms, leading to a significant reduction in tuning time, especially on larger, long-running applications. We tested our tuning on a computational fluid dynamics (CFD) application.","PeriodicalId":121068,"journal":{"name":"2019 IEEE International Symposium on Workload Characterization (IISWC)","volume":"126 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123989792","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Characterizing the Deployment of Deep Neural Networks on Commercial Edge Devices"
Ramyad Hadidi, Jiashen Cao, Yilun Xie, Bahar Asgari, T. Krishna, Hyesoon Kim
2019 IEEE International Symposium on Workload Characterization (IISWC), November 2019. DOI: 10.1109/IISWC47752.2019.9041955

Abstract: The great success of deep neural networks (DNNs) has significantly assisted humans in numerous applications such as computer vision, and DNNs are widely used in today's applications and systems. However, in-the-edge inference of DNNs remains a severe challenge, mainly because of the contradiction between the intensive resource requirements of DNNs and the tight resource availability of edge devices. Nevertheless, in-the-edge inference preserves privacy in several user-centric domains and applies in several scenarios with limited Internet connectivity (e.g., drones, robots, autonomous vehicles). This is why several companies have released specialized edge devices for accelerating DNN execution at the edge. Although preliminary studies have characterized such edge devices separately, a unified comparison under the same set of assumptions has not been performed. In this paper, we address this knowledge gap by characterizing several commercial edge devices on popular frameworks, using well-known convolutional neural networks (CNNs), a type of DNN. We analyze the impact of frameworks, their software stacks, and their implemented optimizations on the final performance, and we measure the energy consumption and temperature behavior of these edge devices.
"Trimming the Tail for Deterministic Read Performance in SSDs"
Nima Elyasi, Changho Choi, A. Sivasubramaniam, Jingpei Yang, V. Balakrishnan
2019 IEEE International Symposium on Workload Characterization (IISWC), November 2019. DOI: 10.1109/IISWC47752.2019.9042073

Abstract: With SSDs becoming commonplace in several customer-facing datacenter applications, there is a critical need to optimize for tail latencies, particularly of reads. In this paper, we conduct a systematic analysis, removing one bottleneck after another, to study the root causes behind long tail latencies on a state-of-the-art high-end SSD. Contrary to many prior observations, we find that Garbage Collection (GC) is not a key contributor; rather, the variance in queue lengths across the flash chips is the culprit. In particular, reads waiting behind long-latency writes, which have been the target of much study, are at the root of this problem. While write pausing/preemption has been proposed as a remedy, in this paper we explore a simpler alternative that leverages the existing RAID groups into which flash chips are organized: while a long-latency operation is ongoing, rather than waiting, a read can obtain its data by reconstructing it from the remaining chips of the group (including parity). However, this introduces additional reads, so we propose an adaptive scheduler called ATLAS that dynamically decides whether to wait or to reconstruct the data from the other chips. The resulting ATLAS optimization cuts the 99.99th-percentile read latency by as much as 10×, with a 4× reduction on average across a wide spectrum of workloads.
"SNU-NPB 2019: Parallelizing and Optimizing NPB in OpenCL and CUDA for Modern GPUs"
Youngdong Do, Hyungmo Kim, Pyeongseok Oh, Daeyoung Park, Jaejin Lee
2019 IEEE International Symposium on Workload Characterization (IISWC), November 2019. DOI: 10.1109/IISWC47752.2019.9041954

Abstract: Although GPUs are widely used as accelerators in heterogeneous systems for high-performance computing, few GPU benchmark suites are available, and they do not consider modern high-end GPU architectures. For this reason, we propose a benchmark suite, called SNU-NPB 2019, which is based on NPB 3.3.1 and written in both OpenCL and CUDA. We also introduce the code parallelization and optimization techniques for modern GPUs that are applied to our benchmark programs, together with their performance characteristics. We evaluate SNU-NPB 2019 on state-of-the-art, high-end GPUs and compare the results with the original NPB suite and SNU-NPB. Unlike SNU-NPB, it covers all the problem sizes available in the original NPB suite, and each application is fully optimized for modern GPU architectures. The evaluation results indicate that our parallelization and optimization techniques are quite effective. We expect that the techniques and the optimized code in our benchmark suite provide good model cases for parallelizing and optimizing OpenCL and CUDA code for modern GPUs.