{"title":"Software/Hardware Co-design for Multi-modal Multi-task Learning in Autonomous Systems","authors":"Cong Hao, Deming Chen","doi":"10.1109/AICAS51828.2021.9458577","DOIUrl":"https://doi.org/10.1109/AICAS51828.2021.9458577","url":null,"abstract":"Optimizing the quality of result (QoR) and the quality of service (QoS) of AI-empowered autonomous systems simultaneously is very challenging. First, there are multiple input sources, e.g., multi-modal data from different sensors, requiring diverse data preprocessing, sensor fusion, and feature aggregation. Second, there are multiple tasks that require various AI models to run simultaneously, e.g., perception, localization, and control. Third, the computing and control system is heterogeneous, composed of hardware components with varied features, such as embedded CPUs, GPUs, FPGAs, and dedicated accelerators. Therefore, autonomous systems essentially require multi-modal multi-task (MMMT) learning that is aware of hardware performance and implementation strategies. While MMMT learning has been attracting intensive research interest, its applications in autonomous systems are still underexplored. In this paper, we first discuss the opportunities of applying MMMT techniques in autonomous systems, and then discuss the unique challenges that must be solved. In addition, we discuss the necessity and opportunities of MMMT model and hardware co-design, which is critical for autonomous systems, especially those with power/resource-limited or heterogeneous platforms. We formulate the MMMT model and heterogeneous hardware implementation co-design as a differentiable optimization problem, with the objective of improving the solution quality and reducing the overall power consumption and critical path latency. 
We advocate for further explorations of MMMT in autonomous systems and software/hardware co-design solutions.","PeriodicalId":173204,"journal":{"name":"2021 IEEE 3rd International Conference on Artificial Intelligence Circuits and Systems (AICAS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130787018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Rasch, Diego Moreda, T. Gokmen, M. L. Gallo, F. Carta, Cindy Goldberg, Kaoutar El Maghraoui, A. Sebastian, V. Narayanan
{"title":"A Flexible and Fast PyTorch Toolkit for Simulating Training and Inference on Analog Crossbar Arrays","authors":"M. Rasch, Diego Moreda, T. Gokmen, M. L. Gallo, F. Carta, Cindy Goldberg, Kaoutar El Maghraoui, A. Sebastian, V. Narayanan","doi":"10.1109/AICAS51828.2021.9458494","DOIUrl":"https://doi.org/10.1109/AICAS51828.2021.9458494","url":null,"abstract":"We introduce the IBM ANALOG HARDWARE ACCELERATION KIT, a new, first-of-a-kind open-source toolkit to simulate analog crossbar arrays in a convenient fashion from within PYTORCH (freely available at https://github.com/IBM/aihwkit). The toolkit is under active development and is centered around the concept of an “analog tile” which captures the computations performed on a crossbar array. Analog tiles are building blocks that can be used to extend existing network modules with analog components and compose arbitrary artificial neural networks (ANNs) using the flexibility of the PYTORCH framework. Analog tiles can be conveniently configured to emulate a plethora of different analog hardware characteristics and their non-idealities, such as device-to-device and cycle-to-cycle variations, resistive device response curves, and weight and output noise. Additionally, the toolkit makes it possible to design custom unit cell configurations and to use advanced analog optimization algorithms such as Tiki-Taka. Moreover, the backward and update behavior can be set to “ideal” to enable hardware-aware training features for chips that target inference acceleration only. To evaluate the inference accuracy of such chips over time, we provide statistical programming noise and drift models calibrated on phase-change memory hardware. 
Our new toolkit is fully GPU accelerated and can be used to conveniently estimate the impact of material properties and non-idealities of future analog technology on the accuracy for arbitrary ANNs.","PeriodicalId":173204,"journal":{"name":"2021 IEEE 3rd International Conference on Artificial Intelligence Circuits and Systems (AICAS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122584372","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zishen Wan, Yuyang Zhang, A. Raychowdhury, Bo Yu, Yanjun Zhang, Shaoshan Liu
{"title":"An Energy-Efficient Quad-Camera Visual System for Autonomous Machines on FPGA Platform","authors":"Zishen Wan, Yuyang Zhang, A. Raychowdhury, Bo Yu, Yanjun Zhang, Shaoshan Liu","doi":"10.1109/AICAS51828.2021.9458486","DOIUrl":"https://doi.org/10.1109/AICAS51828.2021.9458486","url":null,"abstract":"From our past few years of commercial deployment experience, we have identified localization as a critical task in autonomous machine applications and a great acceleration target. In this paper, based on the observation that the visual frontend is a major performance and energy consumption bottleneck, we present our design and implementation of an energy-efficient hardware architecture for an ORB (Oriented FAST and Rotated BRIEF) based localization system on FPGAs. To support our multi-sensor autonomous machine localization system, we present hardware synchronization, frame-multiplexing, and parallelization techniques, which are integrated in our design. Compared to Nvidia TX1 and Intel i7, our FPGA-based implementation achieves $5.6\\times$ and $3.4\\times$ speedup, as well as $3.0\\times$ and $34.6\\times$ power reduction, respectively.","PeriodicalId":173204,"journal":{"name":"2021 IEEE 3rd International Conference on Artificial Intelligence Circuits and Systems (AICAS)","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114582794","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Thorir Mar Ingolfsson, Xiaying Wang, Michael Hersche, A. Burrello, L. Cavigelli, L. Benini
{"title":"ECG-TCN: Wearable Cardiac Arrhythmia Detection with a Temporal Convolutional Network","authors":"Thorir Mar Ingolfsson, Xiaying Wang, Michael Hersche, A. Burrello, L. Cavigelli, L. Benini","doi":"10.1109/AICAS51828.2021.9458520","DOIUrl":"https://doi.org/10.1109/AICAS51828.2021.9458520","url":null,"abstract":"Personalized ubiquitous healthcare solutions require energy-efficient wearable platforms that provide an accurate classification of bio-signals while consuming low average power for long-term battery-operated use. Single-lead electrocardiogram (ECG) signals provide the ability to detect, classify, and even predict cardiac arrhythmia. In this paper we propose a novel temporal convolutional network (TCN) that achieves high accuracy while still being feasible for wearable platform use. Experimental results on the ECG5000 dataset show that the TCN achieves an accuracy (94.2%) similar to that of the state-of-the-art (SoA) network while improving the balanced accuracy score by 16.5%. This accurate classification is done with $27\\times$ fewer parameters and $37\\times$ fewer multiply-accumulate operations. We test our implementation on two publicly available platforms, the STM32L475, which is based on ARM Cortex M4F, and the GreenWaves Technologies GAP8 on the GAPuino board, based on $1+8$ RISC-V CV32E40P cores. Measurements show that the GAP8 implementation respects the real-time constraints while consuming 0.10 mJ per inference. With 9.91 GMAC/s/W, it is $23.0\\times$ more energy-efficient and $46.85\\times$ faster than an implementation on the ARM Cortex M4F (0.43 GMAC/s/W). 
Overall, we obtain 8.1% higher accuracy while consuming $19.6\\times$ less energy and being $35.1\\times$ faster compared to a previous SoA embedded implementation.","PeriodicalId":173204,"journal":{"name":"2021 IEEE 3rd International Conference on Artificial Intelligence Circuits and Systems (AICAS)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133286121","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}