{"title":"A Highly Efficient Layout-Aware FPGA Overlay Accelerator Mapping Method","authors":"Tanvir Ahmed, Johannes Maximilian Kühn, Ken Namura","doi":"10.1109/MCSoC51149.2021.00046","DOIUrl":"https://doi.org/10.1109/MCSoC51149.2021.00046","url":null,"abstract":"FPGAs are gaining traction as a platform for accelerating applications that require both high performance and specialization. However, exploiting the maximum compute potential of FPGAs remains a critical and time-consuming task, usually requiring expert knowledge. Typically, designers seek to maximize the usage of hardened arithmetic blocks (DSPs, such as the DSP48 in Xilinx devices), but as their number is limited, the critical path quickly increases when portions are mapped to lookup tables (LUTs). To mitigate the DSP limitation and to maximize FPGA utilization, we propose combining FPGA overlay accelerators with a mapping method that efficiently exploits the FPGA's layout information and its resources. This mapping method relies on a two-step process: 1. extraction of architectural and layout information of the FPGA, 2. optimized placement of the processing elements (PEs) of the accelerator onto the FPGA resources. The placement step maps the PEs to DSPs and LUTs to reduce the critical path among PEs. We applied our method to implement a systolic array, a multiplier array, and a coarse-grained reconfigurable architecture (CGRA) on a Xilinx FPGA. 
The proposed method achieves a more than 14x increase in performance and energy efficiency over the vendor tool mapping, while also improving FPGA utilization by more than 1.5x compared to DSP-limited mappings.","PeriodicalId":166811,"journal":{"name":"2021 IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130648263","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CNN-based End-to-end Autonomous Driving on FPGA Using TVM and VTA","authors":"Toshihiro Uetsuki, Y. Okuyama, Jungpil Shin","doi":"10.1109/MCSoC51149.2021.00028","DOIUrl":"https://doi.org/10.1109/MCSoC51149.2021.00028","url":null,"abstract":"This paper presents a method for reducing inference time while maintaining inference accuracy in autonomous driving using TVM and the Versatile Tensor Accelerator (VTA) on a Field Programmable Gate Array (FPGA). We focus on end-to-end deep neural networks (DNNs) that directly compute the throttle and steering values of a car from camera images to realize autonomous driving. Such networks achieve high accuracy without relying on hand-engineered features. However, real-time implementation of autonomous driving DNNs in embedded systems is problematic due to limited computational resources and electric power. To address this problem, we implemented the network on an FPGA using TVM and VTA. We modified the network using TVM to (1) reduce the number of bits in the neural network parameters from float32 to int8, (2) schedule the matrix computation in hardware, and (3) optimize the operators, tensors, and hardware parameters to maximize the performance of the neural network at runtime. We measured the inference time and accuracy of the CPU and CPU+FPGA implementations on the same board. The experiment shows that the CPU+FPGA implementation reduced the inference time by 61%, with a 1% decrease in inference accuracy compared to the CPU implementation. 
We conclude that FPGA implementation of the end-to-end autonomous driving network can reduce the inference time and maintain the inference accuracy.","PeriodicalId":166811,"journal":{"name":"2021 IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130247241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
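The float32-to-int8 parameter reduction described in this abstract can be illustrated with a minimal sketch. This is not TVM's actual quantization pass, just the basic idea of symmetric per-tensor quantization: map the largest weight magnitude to the int8 range and round everything else to the nearest step.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization of float32 weights to int8 (sketch)."""
    scale = float(np.abs(weights).max()) / 127.0  # one float step per int8 unit
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 values from the int8 representation."""
    return q.astype(np.float32) * scale

# Quantization error is bounded by half a quantization step.
w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
err = float(np.abs(dequantize(q, s) - w).max())
```

Frameworks such as TVM additionally calibrate activation ranges and fuse the scales into the surrounding operators, but the storage and bandwidth saving (4 bytes down to 1 per parameter) already follows from this simple scheme.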
{"title":"Analyzable Publish-Subscribe Communication Through a Wait-Free FIFO Channel for MPSoC Real-Time Applications","authors":"Saeid Dehnavi, Dip Goswami, K. Goossens","doi":"10.1109/MCSoC51149.2021.00064","DOIUrl":"https://doi.org/10.1109/MCSoC51149.2021.00064","url":null,"abstract":"As a transparent communication protocol for concurrent distributed applications, the Publish-Subscribe (Pub-Sub) paradigm is a trending programming model in many recent industrial use cases in robotics and avionics. To apply the Pub-Sub programming model to safety-critical concurrent real-time applications in Multi-Processor System-on-Chip (MPSoC) environments, a non-blocking wait-free First-In-First-Out (FIFO) channel is a fundamental requirement. However, the approaches proposed in the literature have no proven real-time guarantees. In this paper, we propose a novel wait-free FIFO approach for single-producer-single-consumer core-to-core communication through shared memory. By analyzing the execution paths of each involved process, we prove that the execution time of each read/write operation is bounded by a Worst-Case Execution Time (WCET). Moreover, we define a Timed Automata model of our approach. Using the UPPAAL model checker, we prove freedom from deadlock and starvation. For the performance evaluation of the proposed approach, we apply a stochastic analysis technique to the defined UPPAAL model. Finally, we implement the proposed approach on the CompSOC platform as the underlying real-time MPSoC to show that the implementation conforms to the proposed formal model and to experimentally validate the formal properties. 
In an experimental evaluation on a CompSOC instance running at 40 MHz, the implementation achieves a throughput of 109K tokens per second.","PeriodicalId":166811,"journal":{"name":"2021 IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131058287","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
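The single-producer-single-consumer wait-free FIFO this abstract describes can be sketched as a classic Lamport-style ring buffer. This Python sketch is not the paper's shared-memory CompSOC implementation; it only illustrates why each operation has a bounded execution time: neither read nor write contains a loop or a lock, and each index is written by exactly one side.

```python
class SPSCFifo:
    """Wait-free single-producer single-consumer FIFO (illustrative sketch).

    `tail` is written only by the producer, `head` only by the consumer,
    so no locking is needed; a full/empty channel fails fast rather than
    blocking, which is what makes each operation's WCET boundable.
    """

    def __init__(self, capacity: int):
        self.buf = [None] * (capacity + 1)  # one slot is kept empty
        self.head = 0  # read index (consumer-owned)
        self.tail = 0  # write index (producer-owned)

    def write(self, token) -> bool:
        nxt = (self.tail + 1) % len(self.buf)
        if nxt == self.head:          # channel full: report failure, don't wait
            return False
        self.buf[self.tail] = token   # store the data first...
        self.tail = nxt               # ...then publish it by advancing tail
        return True

    def read(self):
        if self.head == self.tail:    # channel empty
            return None
        token = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)
        return token
```

On a real MPSoC the two indices would live in shared memory with appropriate ordering guarantees; the paper's contribution is proving WCET bounds and deadlock/starvation freedom for such a channel, which this sketch does not attempt.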
{"title":"Enhancing Autotuning Capability with a History Database","authors":"Younghyun Cho, J. Demmel, Hengrui Luo","doi":"10.1109/MCSoC51149.2021.00044","DOIUrl":"https://doi.org/10.1109/MCSoC51149.2021.00044","url":null,"abstract":"Autotuning is gaining importance for achieving the best possible performance on exascale applications. The performance of an autotuner usually depends on the amount of performance data collected for the application; however, collecting performance data for large-scale applications is often an expensive and daunting task. This paper presents an autotuner database, which we call a history database, for enhancing the reusability and reproducibility of performance data. The history database is built into a publicly available autotuner called GPTune, and it allows users to store performance data obtained from autotuning and to download historical performance data provided by the same or other users. The database not only allows reuse of the best available tuning results for widely used codes but also enables transfer learning that can leverage knowledge of pre-trained performance models. 
An evaluation shows that, for ScaLAPACK's PDGEQRF routine, a transfer learning approach using the history database can attain up to 33% better tuning results compared to single task learning without using prior knowledge, on 2,048 cores of NERSC's Cori supercomputer.","PeriodicalId":166811,"journal":{"name":"2021 IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114661179","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
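The reuse idea behind the history database can be sketched in a few lines. This is not GPTune's actual API; the structure below is a hypothetical, simplified stand-in showing how stored (task, configuration, runtime) records from earlier runs can warm-start tuning of a new, similar task instead of starting from scratch.

```python
# Hypothetical history database: a list of (task parameter, tuning
# configuration, measured runtime in seconds) records from past runs.
history = [
    (1000, {"block": 32}, 1.8),
    (1000, {"block": 64}, 1.2),
    (4000, {"block": 64}, 9.5),
    (4000, {"block": 128}, 7.1),
]

def best_known_config(task_param: int) -> dict:
    """Warm-start a new tuning run with the best-performing configuration
    recorded for the most similar previously tuned task."""
    nearest = min({t for t, _, _ in history}, key=lambda t: abs(t - task_param))
    runs = [(cfg, rt) for t, cfg, rt in history if t == nearest]
    return min(runs, key=lambda r: r[1])[0]
```

Real transfer learning, as used in the paper, goes further by fitting a surrogate performance model across tasks rather than copying a single configuration, but the data flow (query history, condition on it, tune) is the same.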
{"title":"A Heterogeneous Full-stack AI Platform for Performance Monitoring and Hardware-specific Optimizations","authors":"Zikang Zhou, Chao-ying Fu, Ruiqi Xie, Jun Han","doi":"10.1109/MCSoC51149.2021.00032","DOIUrl":"https://doi.org/10.1109/MCSoC51149.2021.00032","url":null,"abstract":"Many hardware accelerators have been proposed to accelerate DNN computation to meet real-time application requirements. However, constrained by the microarchitecture of the accelerators, the same neural network can show huge performance differences when deployed on different accelerators. This forces network designers to rethink the network structure from a hardware perspective; such a design effort is more likely to achieve good performance on the targeted accelerator. In this paper, in order to explore hardware-specific optimizations, we designed a full-stack heterogeneous evaluation platform with a monitoring function, based on the open-source neural network accelerator NVDLA and TVM. The evaluation platform integrates two processors with Arm and RISC-V instruction sets and a DNN accelerator, and DNNs built with common frameworks (PyTorch, Keras, ONNX, etc.) can be deployed on the platform to analyze their adaptability to the hardware through a simple process. Based on the platform, we conduct experiments to demonstrate how the neural network structure can affect the performance of a specific hardware design. The experimental results show that an ill-suited neural network structure causes additional data transfers on the target hardware, which are the main source of performance and energy degradation. The order of network operators, the width and depth of the network, and the number of operations unsupported by the accelerator all affect the performance of the network on a specific accelerator. 
Designers should therefore perform targeted optimizations for specific hardware deployments, and neural architecture search (NAS) should take these factors into account.","PeriodicalId":166811,"journal":{"name":"2021 IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"88 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115821501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Architecture to Enable Machine-Learning-Based Task Migration for Multi-Core Real-Time Systems","authors":"Octavio Delgadillo, Bernhard Blieninger, Juri Kuhn, U. Baumgarten","doi":"10.1109/MCSoC51149.2021.00066","DOIUrl":"https://doi.org/10.1109/MCSoC51149.2021.00066","url":null,"abstract":"ECU consolidation is an automotive trend that reduces the number of electronic devices in a vehicle to optimize resources and costs. However, its implementation introduces new challenges, especially in terms of safety. Research at our group is exploring task migration between different electronic control units (ECUs) to add redundancy and fail-safety capabilities to an automotive setup. In particular, we are exploring machine-learning-aided schedulability analysis strategies as a means of deciding which ECU a task should be mapped to. In this paper, we present the implementation of an architecture that allows testing different machine-learning techniques for schedulability analysis, enabling the deployment of tasks to the respective ECUs and simple migration of tasks between them. The architecture is based on a real-time operating system. The test system implements a mix of dummy tasks with constant execution times and an autonomous task with a variable execution time that interacts with a virtual environment. 
The architecture also allows collecting data on each task to verify whether the executed task sets are actually schedulable, as predicted by the machine-learning component.","PeriodicalId":166811,"journal":{"name":"2021 IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130309433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A function-rich FPGA system of camera image processing for video meeting","authors":"Takashi Odan, Takuto Kanamori, Kenji Kise","doi":"10.1109/MCSoC51149.2021.00013","DOIUrl":"https://doi.org/10.1109/MCSoC51149.2021.00013","url":null,"abstract":"Video conferencing is used as a means of meeting across remote places for research and education purposes. Video conference systems provide enhanced functions such as a virtual background, slide sharing, and non-verbal feedback. However, some computers cannot support these functions because video conference systems may require high-performance processors and the latest software stacks. In this paper, we describe a function-rich web camera system using an FPGA. We design and implement a core module named the video processing module (VPM) for video processing. The VPM provides various video processing functions suitable for video meetings, such as a virtual background, message display, and resizing of the human face. We show that our camera system can be implemented on a low-end FPGA. Our evaluations show that our camera system achieves 11.7 times lower latency than a software implementation, providing rich functions at a high frame rate of 60 fps at 1280×720 pixels.","PeriodicalId":166811,"journal":{"name":"2021 IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116895132","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic Service Recommendation Using Lightweight BERT-based Service Embedding in Edge Computing","authors":"Kungan Zeng, Incheon Paik","doi":"10.1109/MCSoC51149.2021.00035","DOIUrl":"https://doi.org/10.1109/MCSoC51149.2021.00035","url":null,"abstract":"With the rapid development of the Internet of Things (IoT), edge computing, and fog computing, many microservices are being created. Service recommendation in these distributed environments is an important issue for boosting the utilization of services, since service composition in edge and cloud computing has attracted increasing attention. However, directly applying traditional service recommendation methods in edge computing encounters several problems, such as insufficient computing resources and the need for dynamic updates of the recommendation system. This paper presents a deep learning-based approach to dynamic service recommendation that uses a lightweight BERT-based service embedding to address these problems. First, a lightweight BERT-based service embedding is proposed to learn a practical vector representation of each service based on invocation associations. Second, based on the service embeddings, a content-based filtering method is used to perform service recommendations. Next, a dynamic update process is implemented by fine-tuning the model. 
Finally, the experimental results show that our approach can perform service recommendations effectively.","PeriodicalId":166811,"journal":{"name":"2021 IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133159256","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
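The content-based filtering step of this approach can be sketched once service embeddings exist. The vectors below are made up for illustration; in the paper they would come from the lightweight BERT-based embedding model. Recommendation then reduces to ranking services by cosine similarity to the query service's vector.

```python
import numpy as np

# Hypothetical service embeddings (in the paper, produced by a lightweight
# BERT-based model trained on service invocation associations).
services = {
    "weather":  np.array([0.9, 0.1, 0.0]),
    "forecast": np.array([0.8, 0.2, 0.1]),
    "payment":  np.array([0.0, 0.1, 0.9]),
}

def recommend(query_name: str, top_k: int = 1) -> list:
    """Content-based filtering: rank other services by cosine similarity
    to the query service's embedding vector."""
    q = services[query_name]

    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    scores = {name: cos(q, v) for name, v in services.items() if name != query_name}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

The dynamic-update step in the paper corresponds to fine-tuning the embedding model and refreshing the vectors in `services`; the filtering logic itself is unchanged.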