Improving GPU performance in multimedia applications through FPGA based adaptive DMA controller

Impact factor: 0.8; JCR quartile: Q4 (Computer Science, Interdisciplinary Applications)

International Journal of Pervasive Computing and Communications. Pub Date: 2022-10-17. DOI: 10.1108/ijpcc-06-2022-0241

S. B, K. E.

Citations: 1

Abstract

Purpose
Deep learning techniques are now indispensable in domains such as health care, computer vision and cyber-security. These algorithms demand high data-transfer rates, but real hardware implementations face bottlenecks in achieving high-speed, low-latency synchronization. Although the direct memory access controller (DMAC) has attracted growing research attention as a means of bulk data transfer, existing direct memory access (DMA) systems still struggle to achieve high-speed communication. The purpose of this study is to develop an adaptively configured DMA architecture for bulk data transfer with high throughput and reduced delay.

Design/methodology/approach
The proposed methodology consists of a heterogeneous computing system that integrates specialized hardware and software. For the hardware, the authors propose a field-programmable gate array (FPGA)-based DMAC, which transfers data to the graphics processing unit (GPU) over PCI Express. The workload characterization technique is designed in Python and can be implemented on the Advanced RISC Machine (ARM) Cortex architecture with a suitable communication interface. This module offloads the input data streams to the FPGA and triggers the FPGA to control the flow of data to the GPU for efficient processing.

Findings
This paper evaluates a configurable, workload-based DMA controller that collects data from input devices and concurrently delivers it to the GPU over PCI Express, bypassing extraneous hardware and software copies and the bottlenecks they cause. It also investigates adaptive DMA memory buffer allocation and workload characterization techniques. Compared with existing DMA architectures, the proposed DMAC outperforms traditional DMA, achieving 96% throughput and 50% lower synchronization latency.

Originality/value
The proposed gated recurrent unit achieved 95.6% accuracy in classifying workloads as heavy, medium or normal. The proposed model outperformed the other algorithms considered, demonstrating its suitability for workload characterization.
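The idea of workload-driven buffer allocation described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives no buffer sizes, thresholds, or descriptor layout, so `BUFFER_BYTES`, the rate thresholds in `classify`, and the `DmaDescriptor` type are all hypothetical. A simple threshold rule stands in for the paper's gated recurrent unit classifier.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Workload(Enum):
    NORMAL = "normal"
    MEDIUM = "medium"
    HEAVY = "heavy"


# Hypothetical buffer sizes in bytes; the abstract states no concrete values.
BUFFER_BYTES = {
    Workload.NORMAL: 64 * 1024,
    Workload.MEDIUM: 512 * 1024,
    Workload.HEAVY: 4 * 1024 * 1024,
}


def classify(rate_mb_s: float) -> Workload:
    """Threshold stand-in for the paper's GRU classifier (thresholds assumed)."""
    if rate_mb_s >= 500.0:
        return Workload.HEAVY
    if rate_mb_s >= 100.0:
        return Workload.MEDIUM
    return Workload.NORMAL


@dataclass
class DmaDescriptor:
    offset: int  # byte offset into the input stream
    length: int  # number of bytes moved by this descriptor


def plan_transfer(total_bytes: int, workload: Workload) -> List[DmaDescriptor]:
    """Split a transfer into descriptors sized for the given workload class."""
    chunk = BUFFER_BYTES[workload]
    descriptors = []
    offset = 0
    while offset < total_bytes:
        length = min(chunk, total_bytes - offset)
        descriptors.append(DmaDescriptor(offset, length))
        offset += length
    return descriptors
```

Under these assumed values, a heavier workload class yields fewer, larger descriptors for the same payload, which is one plausible way an adaptive controller could trade per-transfer overhead against buffer memory.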

Source journal

International Journal of Pervasive Computing and Communications (Computer Science, Interdisciplinary Applications)

CiteScore

6.60

Self-citation rate

0.00%

Articles published