Nishil Talati, Di Jin, Haojie Ye, Ajay Brahmakshatriya, Ganesh S. Dasika, S. Amarasinghe, T. Mudge, Danai Koutra, R. Dreslinski
{"title":"A Deep Dive Into Understanding The Random Walk-Based Temporal Graph Learning","authors":"Nishil Talati, Di Jin, Haojie Ye, Ajay Brahmakshatriya, Ganesh S. Dasika, S. Amarasinghe, T. Mudge, Danai Koutra, R. Dreslinski","doi":"10.5281/ZENODO.5555384","DOIUrl":"https://doi.org/10.5281/ZENODO.5555384","url":null,"abstract":"Machine learning on graph data has gained significant interest because of its applicability to various domains ranging from product recommendations to drug discovery. While there is rapid growth in the algorithmic community, the computer architecture community has so far focused on a subset of graph learning algorithms, including the Graph Convolutional Network (GCN) and a few others. In this paper, we study another, more scalable, graph learning algorithm based on random walks, which operates on dynamic input graphs and has attracted less attention in the architecture community compared to GCN. We propose high-performance CPU and GPU implementations of two important graph learning tasks that cover a broad class of applications, using random walks on continuous-time dynamic graphs: link prediction and node classification. We show that the resulting workload exhibits distinct characteristics, measured in terms of irregularity, core and memory utilization, and cache hit rates, compared to graph traversals, deep learning, and GCN. We further conduct an in-depth performance analysis focused on both algorithm and hardware to guide future software optimization and architecture exploration. The algorithm-focused study presents a rich trade-off space between algorithmic performance and runtime complexity to identify optimization opportunities. We find an optimal hyperparameter setting that strikes a balance in this trade-off space. 
Using this setting, we also perform a detailed microarchitectural characterization to analyze the hardware behavior of these applications and uncover execution bottlenecks, which include high cache misses and dependency-related stalls. The outcome of our study includes recommendations for further performance optimization and open-source implementations for future investigation.","PeriodicalId":203713,"journal":{"name":"2021 IEEE International Symposium on Workload Characterization (IISWC)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120960626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
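The temporal random-walk workload this abstract characterizes can be illustrated with a minimal sketch. This is not the authors' implementation; the edge-list layout and the non-decreasing-timestamp walk policy below are assumptions for illustration only.

```python
import random

def temporal_random_walk(edges, start, walk_len, seed=0):
    """Sample a time-respecting walk on a continuous-time dynamic graph.

    edges: dict mapping node -> list of (neighbour, timestamp) pairs.
    Successive edges have non-decreasing timestamps, so the walk never
    moves backwards in time.
    """
    rng = random.Random(seed)
    walk, node, t_now = [start], start, float("-inf")
    for _ in range(walk_len - 1):
        # Keep only edges that respect the current time, to stay causal.
        candidates = [(v, t) for (v, t) in edges.get(node, []) if t >= t_now]
        if not candidates:
            break  # the walk is stuck; temporal walks may end early
        node, t_now = rng.choice(candidates)
        walk.append(node)
    return walk
```

Walks sampled this way feed downstream link-prediction and node-classification models; the data-dependent neighbour filtering at each step is one source of the irregular memory behaviour and dependency stalls the paper measures.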
Mohsen Koohi Esfahani, Peter Kilpatrick, H. Vandierendonck
{"title":"Locality Analysis of Graph Reordering Algorithms","authors":"Mohsen Koohi Esfahani, Peter Kilpatrick, H. Vandierendonck","doi":"10.1109/IISWC53511.2021.00020","DOIUrl":"https://doi.org/10.1109/IISWC53511.2021.00020","url":null,"abstract":"A major challenge in processing real-world graphs stems from poor locality of memory accesses, and vertex reordering algorithms (RAs) have been proposed to improve locality by changing the order of memory accesses. While state-of-the-art RAs like SlashBurn, GOrder, and Rabbit-Order effectively speed up graph algorithms, their capabilities and disadvantages are not fully understood, mainly for three reasons: (1) the large size of datasets, (2) the lack of suitable measurement tools, and (3) disparate characteristics of graphs. The paucity of analysis has also inhibited the design of more efficient RAs. This paper unlocks this black box by introducing a number of tools, including: (1) a cache simulation technique for processing large graphs, (2) the Neighbour to Neighbour Average ID Distance (N2N AID) as a spatial locality metric, (3) the degree distributions of simulated cache miss rate and AID to investigate how the locality of different vertices is affected by RAs, and (4) “effective cache size” to measure how much of the cache capacity is used to support random accesses. We introduce (1) asymmetricity degree distribution, (2) degree range decomposition, and (3) push and pull locality to present a structural analysis of different types of real-world graphs by explaining their contrasting behaviours in confronting RAs. 
Finally, we propose a number of improvements for RAs using the analysis provided in this paper.","PeriodicalId":203713,"journal":{"name":"2021 IEEE International Symposium on Workload Characterization (IISWC)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125978038","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
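The spirit of the N2N AID metric named in this abstract can be conveyed with a short sketch. The authors' exact definition may differ; here, as an assumption for illustration, it is read as the average ID gap between consecutive neighbours in each sorted adjacency list, where a smaller score suggests better spatial locality after reordering.

```python
def n2n_aid(adj):
    """Average ID distance between consecutive neighbours.

    adj: dict mapping a vertex to an iterable of neighbour IDs.
    Returns 0.0 when no vertex has two or more neighbours.
    """
    total_gap, num_gaps = 0, 0
    for neighbours in adj.values():
        ns = sorted(neighbours)
        for a, b in zip(ns, ns[1:]):
            total_gap += b - a
            num_gaps += 1
    return total_gap / num_gaps if num_gaps else 0.0
```

A reordering algorithm that relabels each vertex's neighbours with nearby IDs drives this score down, which is the locality effect the paper analyzes.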
{"title":"Quantum Computing in the Cloud: Analyzing job and machine characteristics","authors":"Gokul Subramanian Ravi","doi":"10.1109/IISWC53511.2021.00015","DOIUrl":"https://doi.org/10.1109/IISWC53511.2021.00015","url":null,"abstract":"As the popularity of quantum computing continues to grow, quantum machine access over the cloud is critical to both academic and industry researchers across the globe. And as cloud quantum computing demands increase exponentially, the analysis of resource consumption and execution characteristics is key to efficient management of jobs and resources at both the vendor end and the client end. While the analysis of resource consumption and management is popular in the classical HPC domain, it is severely lacking for a more nascent technology like quantum computing. This paper is a first-of-its-kind academic study, analyzing various trends in job execution and resource consumption/utilization on quantum cloud systems. We focus on IBM Quantum systems and analyze characteristics over a two-year period, encompassing over 6000 jobs which contain over 600,000 quantum circuit executions and correspond to almost 10 billion “shots” or trials over 20+ quantum machines. Specifically, we analyze trends focused on, but not limited to, execution times on quantum machines, queuing/waiting times in the cloud, circuit compilation times, machine utilization, as well as the impact of job and machine characteristics on all of these trends. Our analysis identifies several similarities and differences with classical HPC cloud systems. 
Based on our insights, we make recommendations and contributions to improve the management of resources and jobs on future quantum cloud systems.","PeriodicalId":203713,"journal":{"name":"2021 IEEE International Symposium on Workload Characterization (IISWC)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115400337","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Francisco Muñoz-Martínez, José L. Abellán, M. Acacio, T. Krishna
{"title":"STONNE: Enabling Cycle-Level Microarchitectural Simulation for DNN Inference Accelerators","authors":"Francisco Muñoz-Martínez, José L. Abellán, M. Acacio, T. Krishna","doi":"10.1109/IISWC53511.2021.00028","DOIUrl":"https://doi.org/10.1109/IISWC53511.2021.00028","url":null,"abstract":"The design of specialized architectures for accelerating the inference procedure of Deep Neural Networks (DNNs) is a booming area of research nowadays. While first-generation rigid accelerator proposals used simple fixed dataflows tailored for dense DNNs, more recent architectures have argued for flexibility to efficiently support a wide variety of layer types, dimensions, and sparsity. As the complexity of these accelerators grows, the analytical models currently being used for design-space exploration are unable to capture execution-time subtleties, leading to inexact results in many cases as we demonstrate. This opens up a need for cycle-level simulation tools to allow for fast and accurate design-space exploration of DNN accelerators, and rapid quantification of the efficacy of architectural enhancements during the early stages of a design. To this end, we present STONNE (Simulation TOol of Neural Network/Engines), a cycle-level microarchitectural simulation framework that can plug into any high-level DNN framework as an accelerator device and perform full-model evaluation (i.e. we are able to simulate real, complete, unmodified DNN models) of state-of-the-art rigid and flexible DNN accelerators, both with and without sparsity support. 
As a proof of concept, we use STONNE in three use cases: i) a direct comparison of three dominant inference accelerators using real DNN models; ii) back-end extensions and iii) front-end extensions of the simulator to showcase the capability of STONNE to rapidly and precisely evaluate data-dependent optimizations.","PeriodicalId":203713,"journal":{"name":"2021 IEEE International Symposium on Workload Characterization (IISWC)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129232752","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Characterizing and Mitigating the I/O Scalability Challenges for Serverless Applications","authors":"Rohan Basu Roy, Tirthak Patel, Devesh Tiwari","doi":"10.1109/IISWC53511.2021.00018","DOIUrl":"https://doi.org/10.1109/IISWC53511.2021.00018","url":null,"abstract":"As the serverless computing paradigm becomes widespread, it is important to understand the I/O performance characteristics on serverless computing platforms. To the best of our knowledge, we provide the first study that analyzes the observed I/O performance characteristics, with some expected and some unexpected findings that reveal the hidden, complex interactions between the application I/O characteristics, the serverless computing platform, and the storage engines. The goal of this analysis is to provide data-driven guidelines to serverless programmers and system designers about the performance trade-offs and pitfalls of serverless I/O.","PeriodicalId":203713,"journal":{"name":"2021 IEEE International Symposium on Workload Characterization (IISWC)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126564285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jaewoong Chung, Dhruva R. Chakrabarti, C. Minh, Nilanjan Goswami, Ramkumar Shankar, Madhura Joshi, Tao Li, Yaozu Dong, Xudong Zheng, Xiantao Zhang, J. Dai, Jianhui Li, Xin Li, Gang Zhai, Haibing Guan, N. Bronson, Christos Kozyrakis, K. Olukotun, Lin Sun, Zhen Fang, Peng Li, Tao Wang, Ravishankar Iyer, R. Illikkal, Dong Liu, Damon Chandler, Yun-Cheng Tsai, Chia-Lin Yang, Zhiteng Huang, Jiangang Duan, H. Inoue, T. Nakatani, T. Nakaike, Rei Odaira, Maged M. Michael, Shuai Che, J. Sheaffer, Michael Boyer, Lukasz G. Szafaryn, Liang Wang, K. Skadron, Yohei Ueda, A. Buyuktosunoglu
{"title":"2021 IEEE International Symposium on Workload Characterization","authors":"Jaewoong Chung, Dhruva R. Chakrabarti, C. Minh, Nilanjan Goswami, Ramkumar Shankar, Madhura Joshi, Tao Li, Yaozu Dong, Xudong Zheng, Xiantao Zhang, J. Dai, Jianhui Li, Xin Li, Gang Zhai, Haibing Guan, N. Bronson, Christos Kozyrakis, K. Olukotun, Lin Sun, Zhen Fang, Peng Li, Tao Wang, Ravishankar Iyer, R. Illikkal, Dong Liu, Damon Chandler, Yun-Cheng Tsai, Chia-Lin Yang, Zhiteng Huang, Jiangang Duan, H. Inoue, T. Nakatani, T. Nakaike, Rei Odaira, Maged M. Michael, Shuai Che, J. Sheaffer, Michael Boyer, Lukasz G. Szafaryn, Liang Wang, K. Skadron, Yohei Ueda, A. Buyuktosunoglu","doi":"10.1109/iiswc11981.2006","DOIUrl":"https://doi.org/10.1109/iiswc11981.2006","url":null,"abstract":"","PeriodicalId":203713,"journal":{"name":"2021 IEEE International Symposium on Workload Characterization (IISWC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126050525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cross-Platform Performance Evaluation of Stateful Serverless Workflows","authors":"Narges Shahidi, J. Gunasekaran, M. Kandemir","doi":"10.1109/IISWC53511.2021.00017","DOIUrl":"https://doi.org/10.1109/IISWC53511.2021.00017","url":null,"abstract":"Serverless computing, with its inherent event-driven design along with instantaneous scalability due to cloud-provider managed infrastructure, is starting to become a de-facto model for deploying latency-critical user-interactive services. However, as much as they are suitable for event-driven services, their stateless nature is a major impediment for deploying long-running stateful applications. While commercial cloud providers offer a variety of solutions that combine serverless functions with intermediate storage to maintain application state, they are still far from optimized for deploying stateful applications at scale. More specifically, factors such as storage latency and scalability, network bandwidth, and deployment costs play a crucial role in determining whether current serverless applications are suitable for stateful workloads. In this paper, we evaluate two widely used stateful serverless offerings, Azure Durable Functions and AWS Step Functions, to quantify their effectiveness for implementing complex stateful workflows. We conduct a detailed measurement-driven characterization study with two real-world use cases, machine learning pipelines (inference and training) and video analytics, in order to characterize the different performance, latency, and cost tradeoffs. We observe from our experiments that AWS is suitable for workloads with a higher degree of parallelism, while Azure durable entities offer a simplified framework that enables quicker application development. Overall, AWS is 89% more expensive than Azure for the machine learning training application, while Azure is 2× faster than AWS for the machine learning inference application. 
Our results indicate that Azure Durable Functions is extremely inefficient at implementing parallel processing. Furthermore, we summarize the key findings from our characterization, which we believe to be insightful for any cloud tenant faced with choosing an appropriate cloud vendor and offering when deploying stateful workloads on serverless platforms.","PeriodicalId":203713,"journal":{"name":"2021 IEEE International Symposium on Workload Characterization (IISWC)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116240858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GoAT: Automated Concurrency Analysis and Debugging Tool for Go","authors":"Saeed Taheri, G. Gopalakrishnan","doi":"10.1109/IISWC53511.2021.00023","DOIUrl":"https://doi.org/10.1109/IISWC53511.2021.00023","url":null,"abstract":"The use of increasing levels of parallelism and concurrency in system design, especially in a feature-rich language such as Go, demands effective concurrency debugging techniques that are easy to deploy in practice. We present GoAT, a combined static and dynamic concurrency testing and analysis tool that facilitates the process of debugging real-world programs. Key ideas in GoAT include 1) automated dynamic tracing to capture the behavior of concurrency primitives, 2) systematic schedule space exploration to accelerate bug occurrence, and 3) deadlock detection with supplementary visualizations and reports. We also propose a set of coverage requirements that characterize the dynamic behavior of concurrency primitives and provide metrics to measure the quality of tests. Evaluation of GoAT on 68 curated real-world bug scenarios demonstrates that GoAT is significantly effective in detecting rare bugs, and its schedule perturbation method based on schedule yielding detects these bugs with fewer than three yields. 
These results together with the ease of deploying GoAT on real-world Go programs hold significant promise in the field-debugging of Go programs.","PeriodicalId":203713,"journal":{"name":"2021 IEEE International Symposium on Workload Characterization (IISWC)","volume":"242 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122062522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Omais Shafi, Chinmay Rai, Rijurekha Sen, Gayathri Ananthanarayanan
{"title":"Demystifying TensorRT: Characterizing Neural Network Inference Engine on Nvidia Edge Devices","authors":"Omais Shafi, Chinmay Rai, Rijurekha Sen, Gayathri Ananthanarayanan","doi":"10.1109/IISWC53511.2021.00030","DOIUrl":"https://doi.org/10.1109/IISWC53511.2021.00030","url":null,"abstract":"Edge devices are seeing tremendous growth in sensing and computational capabilities. Running state-of-the-art deep neural network (NN) based data processing on multi-core CPU processors, embedded Graphics Processing Units (GPU), Tensor Processing Units (TPU), Neural Processing Units (NPU), Deep Learning Accelerators (DLA) etc., edge devices are now able to handle heavy data computations with limited or without cloud connectivity. In addition to hardware resources, software frameworks that optimize a trained neural network (NN) model through weight clustering and pruning, weight and input-output quantization to fewer bits, fusing NN layers etc., for more efficient execution of NN inferences on edge platforms, play an important role in making machine learning at the edge (namely EdgeML) a reality. This paper is a first effort in characterizing these software frameworks for DNN inference optimizations on edge devices, especially edge GPUs, which are now ubiquitously used in embedded deep learning systems. The interactions between software optimizations and the underlying GPU hardware are carefully examined. As most NN optimization engines are proprietary software with internal details undocumented in the public domain, our empirical analysis on real embedded GPU platforms using a variety of widely used DNNs provides several interesting findings. We observe tremendous performance gain and non-negligible accuracy gain from the software optimizations, but also find highly unexpected non-deterministic behaviors, such as different outputs on the same inputs or increased execution latency for the same NN model on more powerful hardware platforms. 
Application developers using these proprietary software optimization engines would benefit from our analysis and the discussed implications of our findings, with examples from real applications like intelligent traffic intersection control and Advanced Driving Assistance Systems (ADAS). Our findings also have important implications for performance modeling and prediction research, which focuses on micro-architecture modeling based application performance prediction but should now additionally consider the optimization engines that this paper examines.","PeriodicalId":203713,"journal":{"name":"2021 IEEE International Symposium on Workload Characterization (IISWC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130340406","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cactus: Top-Down GPU-Compute Benchmarking using Real-Life Applications","authors":"Mahmood Naderan-Tahan, L. Eeckhout","doi":"10.1109/IISWC53511.2021.00026","DOIUrl":"https://doi.org/10.1109/IISWC53511.2021.00026","url":null,"abstract":"Benchmarking is the de facto standard for evaluating hardware architectures in academia and industry. While several benchmark suites targeting different application domains have been developed for CPU processors over many decades, benchmarking GPU architectures is not as mature. Since the introduction of GPUs for general-purpose computing, the purpose has been to accelerate (a) specific part(s) of the code, called (a) kernel(s). The initial GPU-compute benchmark suites, which are still widely used today, hence consist of relatively simple workloads that are composed of one or a few kernels with specific unambiguous execution characteristics. In contrast, we find that modern-day real-life GPU-compute applications are much more complex, consisting of many more kernels with differing characteristics. A fundamental question can hence be raised: are current benchmark suites still representative of modern real-life applications? In this paper, we introduce Cactus, a collection of widely used real-life open-source GPU-compute applications. The aim of this work is to offer a new perspective on GPU-compute benchmarking: while existing benchmark suites are designed in a bottom-up fashion (i.e., starting from kernels that are likely to perform well on GPUs), we perform GPU-compute benchmarking in a top-down fashion, starting from complex real-life applications that are composed of multiple kernels. We characterize the Cactus benchmarks by quantifying their kernel execution time distribution, by analyzing the workloads using the roofline model, by performing a performance metrics correlation analysis, and by classifying their constituent kernels through multi-dimensional data analysis. 
The overall conclusion is that the Cactus workloads execute many more kernels, include more diverse and more complex execution behavior, and cover a broader range of the workload space compared to the prevalently used benchmark suites. We hence believe that Cactus is a useful complement to the existing GPU-compute benchmarking toolbox.","PeriodicalId":203713,"journal":{"name":"2021 IEEE International Symposium on Workload Characterization (IISWC)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121022448","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
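The roofline analysis used in the Cactus characterization rests on a simple bound: attainable throughput is the minimum of peak compute and arithmetic intensity times peak memory bandwidth. A minimal sketch of that classification follows; the peak numbers used with it are placeholders, not measurements of any real GPU.

```python
def roofline_bound(arith_intensity, peak_flops, peak_bw):
    """Attainable FLOP/s for a kernel at a given FLOP/byte intensity."""
    return min(peak_flops, arith_intensity * peak_bw)

def is_memory_bound(arith_intensity, peak_flops, peak_bw):
    # Below the ridge point (peak_flops / peak_bw), the memory roof applies.
    return arith_intensity < peak_flops / peak_bw
```

Placing each kernel of an application on this model separates bandwidth-bound from compute-bound behaviour, which is how a roofline characterization summarizes a diverse kernel mix.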