Proceedings of the 14th Workshop on General Purpose Processing Using GPU: Latest Publications

Compiler-assisted scheduling for multi-instance GPUs
Pub Date: 2022-04-03 · DOI: 10.1145/3530390.3532734
C. Porter, Chao Chen, S. Pande
{"title":"Compiler-assisted scheduling for multi-instance GPUs","authors":"C. Porter, Chao Chen, S. Pande","doi":"10.1145/3530390.3532734","DOIUrl":"https://doi.org/10.1145/3530390.3532734","url":null,"abstract":"NVIDIA's Multi-Instance GPU (MIG) feature allows users to partition a GPU's compute and memory into independent hardware instances. MIG guarantees full isolation among co-executing kernels on the device, which boosts security and prevents performance interference-related degradation. Despite the benefits of isolation, however, certain workloads do not necessarily need such guarantees, and in fact enforcing such isolation can negatively impact the throughput of a group of processes. In this work we aim to relax the isolation property for certain types of jobs, and to show how this can dramatically boost throughput across a mixed workload consisting of jobs that demand isolation and others that do not. The number of MIG partitions is hardware-limited but configurable, and state-of-the-art workload managers cannot safely take advantage of unused and wasted resources inside a given partition. We show how a compiler and runtime system working in tandem can be used to pack jobs into partitions when isolation is not necessary. Using this technique we improve overall utilization of the device while still reaping the benefits of MIG's isolation properties. Our experimental results on NVIDIA A30s with a throughput-oriented workload show an average of 1.45x throughput improvement and 2.93x increase in GPU memory utilization over the Slurm workload manager. The presented framework is fully automatic and requires no changes to user code. Based on these results, we believe our scheme is a practical and strong advancement over state-of-the-art techniques currently employed for MIG.","PeriodicalId":442986,"journal":{"name":"Proceedings of the 14th Workshop on General Purpose Processing Using GPU","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124614615","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
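The packing idea is easiest to see at the CUDA level. Below is a minimal, illustrative sketch, not the paper's compiler/runtime (which does this automatically with no user-code changes): two independent kernels, here the hypothetical jobA and jobB, are launched into separate streams so they co-execute on the same device or MIG instance, trading isolation for utilization.

```cuda
// Illustration of kernel packing via streams; the paper's framework decides
// automatically when jobs may safely share a MIG partition.
#include <cuda_runtime.h>

__global__ void jobA(float *x, int n) {            // hypothetical job
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f;
}
__global__ void jobB(float *y, int n) {            // hypothetical job
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = y[i] + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    // Separate streams let the two jobs overlap on one (MIG) device
    // when neither requires full isolation from the other.
    cudaStream_t sA, sB;
    cudaStreamCreate(&sA);
    cudaStreamCreate(&sB);
    jobA<<<(n + 255) / 256, 256, 0, sA>>>(x, n);
    jobB<<<(n + 255) / 256, 256, 0, sB>>>(y, n);
    cudaDeviceSynchronize();

    cudaStreamDestroy(sA); cudaStreamDestroy(sB);
    cudaFree(x); cudaFree(y);
    return 0;
}
```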
Understanding wafer-scale GPU performance using an architectural simulator
Pub Date: 2022-04-03 · DOI: 10.1145/3530390.3532736
C. Thames, Hang Yan, Yifan Sun
{"title":"Understanding wafer-scale GPU performance using an architectural simulator","authors":"C. Thames, Hang Yan, Yifan Sun","doi":"10.1145/3530390.3532736","DOIUrl":"https://doi.org/10.1145/3530390.3532736","url":null,"abstract":"Wafer-Scale chips have the potential to break the die-size limitation and provide extreme performance scalability. Existing solutions have demonstrated the possibility of integrating multi-CPU and multi-GPU systems at a significantly larger scale on a wafer. This increased capability results in an increase in complexity in managing the memory and computing resources. To support the community studying wafer-scale systems, this paper develops an architectural simulator dedicated to modeling wafer-scale multi-device systems. Also, this work demonstrates an analysis of initial results from simulations on wafer-scale GPU systems, providing useful insight that can guide future system design.","PeriodicalId":442986,"journal":{"name":"Proceedings of the 14th Workshop on General Purpose Processing Using GPU","volume":"120 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132313140","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
ScaleServe
Pub Date: 2022-04-03 · DOI: 10.1145/3530390.3532735
Ali Jahanshahi, M. Chow, Daniel Wong
{"title":"ScaleServe","authors":"Ali Jahanshahi, M. Chow, Daniel Wong","doi":"10.1145/3530390.3532735","DOIUrl":"https://doi.org/10.1145/3530390.3532735","url":null,"abstract":"We present, ScaleServe, a scalable multi-GPU machine learning inference system that (1) is built on an end-to-end open-sourced software stack, (2) is hardware vendor-agnostic, and (3) is designed with modular components to provide users with ease to modify and extend various configuration knobs. ScaleServe also provides detailed performance metrics from different layers of the inference server which allow designers to pinpoint bottlenecks. We demonstrate ScaleServe's serving scalability with several machine learning tasks including computer vision and natural language processing on an 8-GPU server. The performance results for ResNet152 shows that ScaleServe is able to scale well on a multi-GPU platform.","PeriodicalId":442986,"journal":{"name":"Proceedings of the 14th Workshop on General Purpose Processing Using GPU","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126799826","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
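As an illustration of the basic scaling pattern such a server relies on (this is not ScaleServe's actual code; the request count and the stand-in "model" kernel are hypothetical), the sketch below round-robins inference requests across all visible GPUs:

```cuda
// Simplest multi-GPU serving pattern: round-robin request dispatch,
// one device context per request, synchronized at the end.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void infer(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 0.5f;    // hypothetical stand-in for a model
}

int main() {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    const int n = 1 << 16, nreq = 64;
    for (int r = 0; r < nreq; ++r) {
        int dev = r % ndev;              // round-robin across GPUs
        cudaSetDevice(dev);
        float *in, *out;
        cudaMalloc(&in, n * sizeof(float));
        cudaMalloc(&out, n * sizeof(float));
        infer<<<(n + 255) / 256, 256>>>(in, out, n);
        cudaFree(in); cudaFree(out);
    }
    for (int d = 0; d < ndev; ++d) {     // drain all devices
        cudaSetDevice(d);
        cudaDeviceSynchronize();
    }
    printf("dispatched %d requests across %d GPUs\n", nreq, ndev);
    return 0;
}
```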
Near LLC versus near main memory processing
Pub Date: 2022-04-03 · DOI: 10.1145/3530390.3532726
Hossein Bitalebi, Vahid Geraeinejad, M. Ebrahimi
{"title":"Near LLC versus near main memory processing","authors":"Hossein Bitalebi, Vahid Geraeinejad, M. Ebrahimi","doi":"10.1145/3530390.3532726","DOIUrl":"https://doi.org/10.1145/3530390.3532726","url":null,"abstract":"Emerging advanced applications, such as deep learning and graph processing, with enormous processing demand and massive memory requests call for a comprehensive processing system or advanced solutions to address these requirements. Near data processing is one of the promising structures targeting this goal. However, most recent studies have focused on processing instructions near the main memory data banks while ignoring the benefits of processing instructions near other memory hierarchy levels such as LLC. In this study, we investigate the near LLC processing structures, and compare it to the near main memory processing alternative, specifically in graphics processing units. We analyze these two structures on various applications in terms of performance and power. Results show a clear benefit of near LLC processing over near main memory processing in a class of applications. Further, we suggest an architecture, which could benefit from both near main memory and near LLC processing structures, but requiring the applications to be characterized in advance or at run time.","PeriodicalId":442986,"journal":{"name":"Proceedings of the 14th Workshop on General Purpose Processing Using GPU","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130459216","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Systematically extending a high-level code generator with support for tensor cores
Pub Date: 2022-04-03 · DOI: 10.1145/3530390.3532733
Lukas Siefke, Bastian Köpcke, S. Gorlatch, Michel Steuwer
{"title":"Systematically extending a high-level code generator with support for tensor cores","authors":"Lukas Siefke, Bastian Köpcke, S. Gorlatch, Michel Steuwer","doi":"10.1145/3530390.3532733","DOIUrl":"https://doi.org/10.1145/3530390.3532733","url":null,"abstract":"High-level code generators like Halide, Lift, and RISE make a compelling proposition: write programs in a simple high-level language and get high-performing GPU code \"for free\". They achieve this feat by restricting the input language to a specific domain (such as image and array processing in Halide) or to a fixed set of flexible parallel patterns (as Lift and RISE do). Implementing high-level code generators that produce high-performance code is challenging, specifically as the target hardware constantly evolves. In this paper, we discuss how we systematically extend the RISE high-level code generator with support for tensor cores, a specialized hardware feature of recent Nvidia GPUs. We highlight the design of RISE that makes it easily extensible by following a systematic bottom-up approach, that first, exposes the imperative tensor core API to the code generator, then, raises the abstractions to an internal low-level functional representation, that, finally, is targeted by a rewrite process that starts from a high-level functional program. Our experimental evaluation shows that RISE with support for tensor cores generates code of competitive performance to manually optimized CUDA code, which is only up to 36%, but on average only 10%, slower than Nvidia's highly optimized cuBLAS library, and clearly outperforms any code that does not exploit tensor cores.","PeriodicalId":442986,"journal":{"name":"Proceedings of the 14th Workshop on General Purpose Processing Using GPU","volume":"365 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132612434","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
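On Nvidia GPUs, the imperative tensor core interface the paper refers to is the CUDA WMMA API. The fragment below is a hand-written sketch of that API computing one 16x16x16 half-precision tile; it is not RISE-generated code, just the kind of primitive the rewrite process ultimately lowers to.

```cuda
// One warp computes a single 16x16x16 tile: C = A * B + C,
// using the warp-level matrix (WMMA) API that tensor cores expose.
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void wmma_tile(const half *A, const half *B, float *C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;

    wmma::fill_fragment(c, 0.0f);          // zero the accumulator
    wmma::load_matrix_sync(a, A, 16);      // leading dimension 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(c, a, b, c);            // tensor core multiply-accumulate
    wmma::store_matrix_sync(C, c, 16, wmma::mem_row_major);
}
```

Launched as `wmma_tile<<<1, 32>>>(A, B, C);`, i.e. one warp per tile; a full GEMM tiles the output matrix across many warps.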
Accelerating data transfer between host and device using idle GPU
Pub Date: 2022-04-03 · DOI: 10.1145/3530390.3532732
Yuya Tatsugi, Akira Nukada
{"title":"Accelerating data transfer between host and device using idle GPU","authors":"Yuya Tatsugi, Akira Nukada","doi":"10.1145/3530390.3532732","DOIUrl":"https://doi.org/10.1145/3530390.3532732","url":null,"abstract":"When running single-GPU applications on multi-GPU compute nodes, the remaining GPU devices are kept idle. We propose a novel technology to accelerate these single-GPU applications using the idle GPU devices. The data transfers between host and device are performed not only by the first GPU but also by the second GPU as well as the alternative route with PCI-Express and NV-Link connected to it. Our performance evaluations show the proposed method enables about twice data transfer speed as native single GPU case for large data sizes.","PeriodicalId":442986,"journal":{"name":"Proceedings of the 14th Workshop on General Purpose Processing Using GPU","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123688328","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
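A minimal sketch of the split-transfer idea follows (the buffer names and the pinned-host assumption are ours; the paper's framework performs the split transparently): half of the host buffer takes the direct PCI-Express path to GPU 0, while the other half is staged through GPU 1 and forwarded to GPU 0 over NVLink with a peer copy.

```cuda
// Split host->device transfer across two GPUs. Assumes `host` is pinned
// memory (cudaMallocHost) so the async copies actually overlap, and that
// GPU 0 and GPU 1 are peer-capable (e.g. NVLink-connected).
#include <cuda_runtime.h>

void split_h2d(const float *host, float *dst0, float *stage1, size_t n) {
    size_t half = n / 2;
    cudaStream_t s0, s1;

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);      // GPU0 <-> GPU1 peer mapping
    cudaStreamCreate(&s0);
    // First half: direct host -> GPU 0 over GPU 0's own PCIe link.
    cudaMemcpyAsync(dst0, host, half * sizeof(float),
                    cudaMemcpyHostToDevice, s0);

    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    cudaStreamCreate(&s1);
    // Second half: host -> GPU 1 staging buffer (GPU 1's PCIe link),
    // then GPU 1 -> GPU 0 over NVLink via a peer copy.
    cudaMemcpyAsync(stage1, host + half, (n - half) * sizeof(float),
                    cudaMemcpyHostToDevice, s1);
    cudaMemcpyPeerAsync(dst0 + half, 0, stage1, 1,
                        (n - half) * sizeof(float), s1);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0); cudaStreamDestroy(s1);
}
```

Because the two halves travel over disjoint PCIe links, the aggregate bandwidth can approach twice that of a single link for large buffers, which matches the roughly 2x speedup the abstract reports.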