2021 IEEE/ACM Redefining Scalability for Diversely Heterogeneous Architectures Workshop (RSDHA): Latest Publications

Platform Agnostic Streaming Data Application Performance Models
Clayton J. Faber, Tom Plano, Samatha Kodali, Zhili Xiao, Abhishek Dwaraki, J. Buhler, R. Chamberlain, A. Cabrera
DOI: 10.1109/rsdha54838.2021.00008 | Published: November 2021
Abstract: The mapping of computational needs onto execution resources is, by and large, a manual task, and users are frequently guided simply by intuition and past experience. We present a queueing-theory-based performance model for streaming data applications that takes steps towards a better understanding of resource-mapping decisions, thereby helping application developers make good mapping choices. The performance model (and associated cost model) is agnostic to the specific properties of the compute resource and application, characterizing both simply by their achievable data throughput. We illustrate the model with a pair of applications, one from computational biology and the other a classic machine learning problem.
Citations: 1
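The throughput-only characterization described in this abstract can be illustrated with a minimal queueing sketch. This is an assumption-laden illustration, not the authors' actual model: each pipeline stage is treated as an M/M/1 queue, end-to-end throughput is set by the bottleneck stage, and the stage names, rates, and prices are hypothetical.

```python
# Minimal sketch: a streaming pipeline where each compute resource is
# characterized only by its achievable throughput (items/sec). All
# numbers below are made up for illustration.

def pipeline_throughput(service_rates):
    """Sustained throughput of a linear pipeline is set by its bottleneck."""
    return min(service_rates)

def stage_latency(arrival_rate, service_rate):
    """Mean sojourn time of an M/M/1 queue: W = 1 / (mu - lambda)."""
    if arrival_rate >= service_rate:
        raise ValueError("stage is unstable: arrival rate >= service rate")
    return 1.0 / (service_rate - arrival_rate)

def mapping_cost(service_rates, price_per_sec):
    """Cost per processed item for a candidate resource mapping."""
    return sum(price_per_sec) / pipeline_throughput(service_rates)

# Compare two hypothetical mappings of a 3-stage application.
cpu_only = ([500.0, 120.0, 800.0], [0.01, 0.01, 0.01])   # items/s, $/s
with_fpga = ([500.0, 950.0, 800.0], [0.01, 0.05, 0.01])  # stage 2 on an FPGA
for name, (rates, prices) in [("cpu-only", cpu_only), ("with-fpga", with_fpga)]:
    print(name, "throughput:", pipeline_throughput(rates),
          "items/s, cost/item:", mapping_cost(rates, prices))
```

Under this sketch a mapping decision reduces to comparing bottleneck throughput and cost per item across candidate resource assignments, which is the spirit of a platform-agnostic model.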
[Copyright notice]
DOI: 10.1109/rsdha54838.2021.00002 | Published: November 2021
Citations: 0
ELIχR: Eliminating Computation Redundancy in CNN-Based Video Processing
Jordan Schmerge, Daniel Mawhirter, Connor Holmes, Jedidiah McClurg, Bo-Zong Wu
DOI: 10.1109/rsdha54838.2021.00010 | Published: November 2021
Abstract: Video processing frequently relies on applying convolutional neural networks (CNNs) for tasks including object tracking, real-time action classification, and image recognition. Due to complicated network design, processing even a single frame requires many operations, leading to low throughput and high latency. This process can be parallelized, but since consecutive frames have similar content, most of these operations produce identical results, leading to inefficient use of parallel hardware accelerators. In this paper, we present ELIχR, a software system that systematically addresses this computation-redundancy problem in an architecture-independent way, using two key techniques. First, ELIχR implements a lightweight change-propagation algorithm to automatically determine which data to recompute for each new frame based on changes in the input. Second, ELIχR implements a dynamic check that further reduces needed computation by leveraging special operators in the model (e.g., ReLU) and trading off accuracy for performance. We evaluate ELIχR on two real-world models, Inception V3 and ResNet-50, and two video streams. We show that ELIχR running on the CPU produces up to a 3.49X speedup (1.76X on average) compared with frame sampling, given the same accuracy and real-time processing requirements, and we describe how our approach can be applied in an architecture-independent way to improve CNN performance in heterogeneous systems.
Citations: 0
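The two techniques named in this abstract can be sketched in simplified 1-D form: (1) propagate only the input positions that changed between consecutive frames, and (2) use a ReLU-specific check to prune changes that cannot affect the output. The toy convolution shape and thresholds below are illustrative assumptions, not ELIχR's actual implementation.

```python
# (1) Change propagation: find which inputs differ between frames, then
# mark only the conv outputs whose receptive field touches a change.
# (2) ReLU check: a pre-activation that stays negative keeps producing
# zero after ReLU, so its change need not propagate further.

def changed_indices(prev_frame, new_frame, eps=1e-6):
    """Inputs that actually differ between consecutive frames."""
    return {i for i, (a, b) in enumerate(zip(prev_frame, new_frame))
            if abs(a - b) > eps}

def affected_outputs(changed, kernel_size, out_len):
    """For a 1-D conv, output j depends on inputs j .. j+k-1."""
    out = set()
    for i in changed:
        for j in range(max(0, i - kernel_size + 1), min(out_len, i + 1)):
            out.add(j)
    return out

def relu_prunable(old_pre, new_pre):
    """True if ReLU output is unchanged (still zero) despite the change."""
    return old_pre < 0 and new_pre < 0
```

With mostly-static frames, `affected_outputs` marks only a small fraction of each layer for recomputation, and the ReLU check shrinks that set further at the cost of exactness when values hover near zero.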
Energy Efficient Task Graph Execution Using Compute Unit Masking in GPUs
M. Chow, K. Ranganath, R. Lerias, Mika Shanela Carodan, Daniel Wong
DOI: 10.1109/rsdha54838.2021.00011 | Published: November 2021
Abstract: The frontiers of supercomputing are pushed by novel discrete accelerators. Accelerators such as GPUs are employed to speed up machine learning, scientific, and high-performance computing applications. However, it has become harder to extract additional parallelism from traditional workloads, which has driven growing interest in task graphs. AMD's Directed Acyclic Graph Execution Engine (DAGEE) lets the programmer define a workload as fine-grained tasks while the system handles the dependencies at a lower level. We evaluate DAGEE with the Winograd-Strassen matrix multiplication algorithm and show that it achieves an average 15.3% speedup over the traditional matrix multiplication algorithm. The increased parallelism under DAGEE, however, can increase contention among kernels. AMD allows the programmer to set the number of active Compute Units (CUs) through masking; this fine-grained scaling lets system software enable only the required number of CUs within a GPU. Using this mechanism, we develop a runtime that masks CUs for each task during task-graph execution and partitions tasks onto separate CUs, reducing overall contention and energy consumption. We show that our CU-masking runtime reduces energy by 18% on average.
Citations: 2
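The partitioning idea behind this runtime can be sketched as follows: concurrently ready tasks receive disjoint slices of the GPU's compute units, expressed as bitmasks (the form in which AMD's CU masking is configured). The task names, demand hints, and the 60-CU GPU below are assumptions for illustration, not values from the paper.

```python
# Hypothetical sketch of partitioning CUs among ready tasks of a task
# graph. Each task gets a contiguous, disjoint CU range, encoded as a
# bitmask, sized proportionally to a per-task demand hint.

def cu_mask(first_cu, num_cus):
    """Bitmask enabling CUs [first_cu, first_cu + num_cus)."""
    return ((1 << num_cus) - 1) << first_cu

def partition_cus(tasks, total_cus=60):
    """Split CUs among ready tasks proportionally to their demand hints."""
    total_demand = sum(d for _, d in tasks)
    masks, cursor = {}, 0
    for name, demand in tasks:
        share = max(1, (demand * total_cus) // total_demand)
        share = min(share, total_cus - cursor)  # never exceed the GPU
        masks[name] = cu_mask(cursor, share)
        cursor += share
    return masks

# Two concurrently runnable kernels with a 3:1 demand ratio.
masks = partition_cus([("gemm_a", 3), ("gemm_b", 1)])
```

Because the masks are disjoint, the two kernels cannot contend for the same CUs, which is the contention-reduction mechanism the abstract describes; energy savings would additionally require power-gating or clock effects not modeled here.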
Multi-accelerator Neural Network Inference in Diversely Heterogeneous Embedded Systems
Ismet Dagli, M. Belviranli
DOI: 10.1109/rsdha54838.2021.00006 | Published: November 2021
Abstract: Neural network inference (NNI) is commonly used in mobile and autonomous systems for latency-sensitive critical operations such as obstacle detection and avoidance. In addition to latency, energy consumption is an important factor in such workloads, since the battery is a limited resource in these systems. The energy and latency demands of critical workload execution can vary with the physical system state. For example, the remaining charge of a low battery should be prioritized for motor consumption in a quadcopter; on the other hand, if the quadcopter is flying through obstacles, latency-aware execution becomes the priority. Many recent mobile and autonomous systems-on-chip embed a diverse range of accelerators with varying power and performance characteristics, which can be exploited to achieve a fine trade-off between energy and latency. In this paper, we investigate multi-accelerator execution (MAE) on diversely heterogeneous embedded systems, where sub-components of a given workload, such as NNI, can be assigned to different types of accelerators to achieve a desired latency or energy goal. We first analyze the energy and performance characteristics of executing neural network layers on different types of accelerators. We then explore energy/performance trade-offs via layer-wise scheduling for NNI by considering different layer-to-PE mappings. Finally, we propose a customizable metric, called multi-accelerator execution gain (MAEG), to measure the energy or performance benefits of MAE for a given workload. Our empirical results on Jetson Xavier SoCs show that our methodology can provide up to 28% energy/performance trade-off benefit compared with assigning all layers to a single PE.
Citations: 3
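Layer-wise scheduling under an energy/latency trade-off knob can be sketched as below. The per-layer latency and energy numbers for each processing element (PE) are made up, and the relative-gain formula at the end is one plausible MAEG-style formulation, not necessarily the paper's exact metric.

```python
# cost[layer][pe] = (latency_ms, energy_mJ) -- hypothetical measurements
COST = [
    {"gpu": (4.0, 9.0), "dla": (6.0, 3.0)},   # conv block
    {"gpu": (1.0, 2.5), "dla": (3.0, 1.0)},   # fc block
]

def weighted(lat, en, w):
    """Scalarized cost: w favors latency, (1 - w) favors energy."""
    return w * lat + (1.0 - w) * en

def best_layerwise(w):
    """Pick the cheapest PE independently for each layer."""
    mapping, total = [], 0.0
    for layer in COST:
        pe = min(layer, key=lambda p: weighted(*layer[p], w))
        mapping.append(pe)
        total += weighted(*layer[pe], w)
    return mapping, total

def best_single_pe(w):
    """Best assignment when every layer must run on the same PE."""
    return min((sum(weighted(*layer[p], w) for layer in COST), p)
               for p in COST[0])

w = 0.5                                   # equal weight on latency and energy
mapping, multi = best_layerwise(w)
single, pe = best_single_pe(w)
gain = (single - multi) / single          # relative benefit of MAE
```

With these toy numbers the layer-wise schedule splits the network across the GPU and DLA and beats the best single-PE assignment, which is the kind of benefit MAEG is meant to quantify.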
Comparing LLC-Memory Traffic between CPU and GPU Architectures
Mohammad Alaul Haque Monil, Seyong Lee, J. Vetter, A. Malony
DOI: 10.1109/rsdha54838.2021.00007 | Published: November 2021
Abstract: The cache hierarchy in modern CPUs and GPUs is becoming increasingly complex, which makes it difficult to understand the interplay between memory access patterns and the cache hierarchy. Moreover, the details of different cache policies are not publicly available, so the research community relies on observation to understand the relationship between access patterns and cache behavior. Our previous studies delved into the different microarchitectures of Intel CPUs. In this study, GPUs from NVIDIA and AMD are considered. Even though the execution models of CPUs and GPUs are distinct, this study attempts to correlate the behavior of their cache hierarchies. Using the knowledge gathered from studying Intel CPUs, the similarities and dissimilarities between CPUs and GPUs are identified. Through model evaluation, this study provides a proof of concept that traffic between the last-level cache and memory can be predicted for sequential streaming and strided access patterns on GPUs.
Citations: 0
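A back-of-envelope traffic model of the kind this abstract validates: for streaming or strided reads, LLC-to-memory traffic is roughly the number of distinct cache lines touched times the line size. The 64-byte line and the example array parameters are assumptions, not figures from the paper.

```python
LINE = 64  # assumed bytes per cache line

def touched_lines(n_elems, elem_size, stride_elems):
    """Distinct cache lines read by a strided sweep over n_elems elements."""
    lines = {(i * stride_elems * elem_size) // LINE for i in range(n_elems)}
    return len(lines)

def predicted_traffic_bytes(n_elems, elem_size, stride_elems):
    """Predicted LLC<->memory read traffic for the sweep."""
    return touched_lines(n_elems, elem_size, stride_elems) * LINE

# Sequential streaming over 1024 8-byte elements touches each line once;
# a stride of 16 elements (128 bytes) puts every element on its own line.
seq = predicted_traffic_bytes(1024, 8, 1)
strided = predicted_traffic_bytes(1024, 8, 16)
```

The sketch captures why strided access inflates traffic (a full line is transferred per touched element), which is the observable behavior such a model predicts for both CPU and GPU last-level caches.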
Distributed Training for High Resolution Images: A Domain and Spatial Decomposition Approach
A. Tsaris, Jacob D. Hinkle, D. Lunga, P. Dias
DOI: 10.2172/1827010 | Published: September 2021
Abstract: In this work we developed two PyTorch libraries, built on the PyTorch RPC interface, for distributed deep learning on high-resolution images. The spatial-decomposition library enables distributed training on very large images that would otherwise not be possible on a single GPU. The domain-parallelism library enables distributed training across unlabeled data from multiple domains by leveraging the domain separation architecture. Both libraries were tested at moderate scale on the Summit supercomputer at Oak Ridge National Laboratory.
Citations: 1
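The spatial-decomposition idea can be sketched in one dimension: split a large image into per-worker tiles extended by a halo (overlap) wide enough to cover the receptive field, so each tile can be processed on its own GPU. The tile count and halo width below are illustrative; the actual libraries coordinate workers over PyTorch RPC, which is not shown.

```python
def tile_bounds(length, n_tiles, halo):
    """1-D tile [start, end) ranges, each extended by `halo` pixels of
    overlap so convolutions near tile edges see the same context they
    would in the undivided image."""
    base = length // n_tiles
    tiles = []
    for t in range(n_tiles):
        start = t * base
        end = (t + 1) * base if t < n_tiles - 1 else length  # last tile absorbs remainder
        tiles.append((max(0, start - halo), min(length, end + halo)))
    return tiles

# Split a 100-pixel axis across 4 workers with a 2-pixel halo.
tiles = tile_bounds(100, 4, 2)
```

Applying the same split along both image axes yields the 2-D tiles a spatial-decomposition scheme would distribute; after each layer, workers would exchange halo regions with their neighbors.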