2022 30th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP): Latest Publications

dsODENet: Neural ODE and Depthwise Separable Convolution for Domain Adaptation on FPGAs
Hiroki Kawakami, Hirohisa Watanabe, K. Sugiura, Hiroki Matsutani
{"title":"dsODENet: Neural ODE and Depthwise Separable Convolution for Domain Adaptation on FPGAs","authors":"Hiroki Kawakami, Hirohisa Watanabe, K. Sugiura, Hiroki Matsutani","doi":"10.1109/pdp55904.2022.00031","DOIUrl":"https://doi.org/10.1109/pdp55904.2022.00031","url":null,"abstract":"High-performance deep neural network (DNN)-based systems are in high demand in edge environments. Due to its high computational complexity, it is challenging to deploy DNNs on edge devices with strict limitations on computational resources. In this paper, we derive a compact while highly-accurate DNN model, termed dsODENet, by combining recently-proposed parameter reduction techniques: Neural ODE (Ordinary Differential Equation) and DSC (Depthwise Separable Convolution). Neural ODE exploits a similarity between ResNet and ODE, and shares most of weight parameters among multiple layers, which greatly reduces the memory consumption. We apply dsODENet to a domain adaptation as a practical use case with image classification datasets. We also propose a resource-efficient FPGA-based design for dsODENet, where all the parameters and feature maps except for pre- and post-processing layers can be mapped onto onchip memories. It is implemented on Xilinx ZCU104 board and evaluated in terms of domain adaptation accuracy, training speed, FPGA resource utilization, and speedup rate compared to a software counterpart. The results demonstrate that dsODENet achieves comparable or slightly better domain adaptation accuracy compared to our baseline Neural ODE implementation, while the total parameter size without pre- and post-processing layers is reduced by 54.2% to 79.8%. Our FPGA implementation accelerates the inference speed by 27.9 times.","PeriodicalId":210759,"journal":{"name":"2022 30th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)","volume":"346 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126677288","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
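The abstract above describes how dsODENet folds repeated ResNet-style residual layers into a single ODE function built from depthwise separable convolutions. As a rough illustration of that weight-sharing idea (not the authors' implementation; the class names, channel counts, and Euler discretization below are assumptions), a minimal PyTorch-style sketch might look like this:

```python
# Illustrative sketch only: shows how one DSC-based ODE function can replace
# several ResNet layers by reusing the same weights at every integration step.
import torch
import torch.nn as nn

class SeparableConv2d(nn.Module):
    """Depthwise conv (one filter per channel) followed by a 1x1 pointwise conv."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class DsOdeBlock(nn.Module):
    """Euler-discretized ODE block: the same function f (one DSC layer) is applied
    at every step, so N "layers" share a single parameter set."""
    def __init__(self, channels, num_steps=4):
        super().__init__()
        self.f = nn.Sequential(SeparableConv2d(channels), nn.ReLU())
        self.num_steps = num_steps
        self.h = 1.0 / num_steps  # integration step size

    def forward(self, x):
        # x_{t+1} = x_t + h * f(x_t): ResNet-like residual updates with shared weights
        for _ in range(self.num_steps):
            x = x + self.h * self.f(x)
        return x

# Usage: one block stands in for num_steps residual layers
block = DsOdeBlock(channels=64, num_steps=4)
y = block(torch.randn(1, 64, 32, 32))
```

The point of the sketch is that the same parameters in `self.f` are reused at every integration step, so the parameter count stays constant as the number of "layers" (steps) grows, while the DSC factorization keeps each step cheap.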
Accelerating Distributed Deep Reinforcement Learning by In-Network Experience Sampling
Masaki Furukawa, Hiroki Matsutani
{"title":"Accelerating Distributed Deep Reinforcement Learning by In-Network Experience Sampling","authors":"Masaki Furukawa, Hiroki Matsutani","doi":"10.1109/pdp55904.2022.00020","DOIUrl":"https://doi.org/10.1109/pdp55904.2022.00020","url":null,"abstract":"A computing cluster that interconnects multiple compute nodes is used to accelerate distributed reinforcement learning based on DQN (Deep Q-Network). In distributed reinforcement learning, Actor nodes acquire experiences by interacting with a given environment and a Learner node optimizes their DQN model. Since data transfer between Actor and Learner nodes increases depending on the number of Actor nodes and their experience size, communication overhead between them is one of major performance bottlenecks. In this paper, their communication performance is optimized by using DPDK (Data Plane Development Kit). Specifically, DPDK-based low-latency experience replay memory server is deployed between Actor and Learner nodes interconnected with a 40GbE (40Gbit Ethernet) network. Evaluation results show that, as a network optimization technique, kernel bypassing by DPDK reduces network access latencies to a shared memory server by 32.7% to 58.9%. As another network optimization technique, an in-network experience replay memory server between Actor and Learner nodes reduces access latencies to the experience replay memory by 11.7% to 28.1% and communication latencies for prioritized experience sampling by 21.9% to 29.1%.","PeriodicalId":210759,"journal":{"name":"2022 30th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122317103","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
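The abstract describes a pipeline in which Actor nodes push experiences to a replay memory server and a Learner node samples them with priorities; the paper's contribution is placing that server in-network and accessing it with DPDK kernel bypassing. The following is only a conceptual, single-process sketch of the data flow (the ReplayServer class, its push/sample methods, and the placeholder experience tuples are illustrative assumptions; no DPDK or networking code is shown):

```python
# Conceptual sketch of the Actor -> replay server -> Learner data flow.
import random
from collections import deque

class ReplayServer:
    """Stands in for the shared experience replay memory server."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, experience, priority):
        # Actor -> server: store (state, action, reward, next_state, done)
        self.buffer.append((priority, experience))

    def sample(self, batch_size):
        # Server -> Learner: prioritized sampling, proportional to priority
        priorities = [p for p, _ in self.buffer]
        batch = random.choices(self.buffer, weights=priorities, k=batch_size)
        return [exp for _, exp in batch]

# Actors interact with the environment and push experiences ...
server = ReplayServer()
for step in range(1000):
    experience = ("state", "action", 1.0, "next_state", False)  # placeholder tuple
    server.push(experience, priority=1.0)

# ... while the Learner repeatedly samples batches to update its DQN model.
batch = server.sample(batch_size=32)
```

In the paper's setting, the push and sample calls cross a 40GbE network, which is why the reported latency reductions from DPDK and in-network placement matter.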