{"title":"Distributed Neural Networks using TensorFlow over Multicore and Many-Core Systems","authors":"Jagadish Kumar Ranbirsingh, Hanke Kimm, H. Kimm","doi":"10.1109/MCSoC.2019.00022","DOIUrl":"https://doi.org/10.1109/MCSoC.2019.00022","url":null,"abstract":"This paper focuses on distributed deep learning models that simulate the HAR (Human Activity Recognition) data set from the UCI machine learning Repository. The proposed deep learning LSTM (Long Short-Term Memory) model works with the TensorFlow framework using the Python 3 programming language which supports the distributed architecture. In order to simulate the distributed deep learning models over different multicore and many-core systems, two hardware platforms are built; the first one is equipped with a Raspberry Pi cluster with 16 Pi 3 model B+ boards which each having 1 GB of RAM and 32 GB flash storage. The second platform is houses an Octa-core Intel Xeon CPU system with a 16MB Cache, 32 GB RAM and 2 TB SSD primary storage with 10 TB HDD secondary storage. In this paper, the performance of the distributed LSTM model over multicore and many-core systems is presented in terms of execution speed and efficiency of prediction accuracy upon varying number of deep layers with corresponding hidden nodes. In this experiment, a 3 x 3 distributed LSTM model has been used, which furnishes higher prediction accuracy with faster computation time than the models that different number of layers provide.","PeriodicalId":104240,"journal":{"name":"2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131857909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep Learning Framework with Arbitrary Numerical Precision","authors":"M. Kiyama, M. Amagasaki, M. Iida","doi":"10.1109/MCSoC.2019.00019","DOIUrl":"https://doi.org/10.1109/MCSoC.2019.00019","url":null,"abstract":"Deep neural networks (DNNs) have recently shown outstanding performance in solving problems in many domains. However, it is difficult to run such applications on mobile devices due to limited hardware resources. Quantization is one method to reduce the hardware requirements. By default, 32-bit floating-point numbers are used in DNNs, while quantization uses fewer bits, such as 4-bit fixed points, at the cost of precision. Previous research has explored two problems related to this: (1) differences between software emulation and implementation that affect model accuracy and (2) lowered accuracy during normalization. In this paper, we developed a new DNNs framework, PyParch, that allows easy manipulation of quantization and propose a training method for fitting to a hardware-friendly model. We show that our developed tool can solve the two problems mentioned above. Quantized models described in previous methods need 18 bits in order to recover the original accuracy, whereas our method requires only 14 bits.","PeriodicalId":104240,"journal":{"name":"2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"111 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130435233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design of Asynchronous CNN Circuits on Commercial FPGA from Synchronous CNN Circuits","authors":"Hayato Kato, H. Saito","doi":"10.1109/MCSoC.2019.00016","DOIUrl":"https://doi.org/10.1109/MCSoC.2019.00016","url":null,"abstract":"To accelerate performance, Convolutional Neural Networks (CNNs) are frequently used in Field Programmable Gate Arrays (FPGAs). In this paper, to reduce the power consumption of CNN circuits, we propose a design method to design asynchronous CNN circuits on commercial FPGAs. First, the proposed method converts Register Transfer Level (RTL) models of synchronous CNN circuits to RTL models of asynchronous CNN circuits. Then, the proposed method designs asynchronous CNN circuits using a commercial FPGA design environment. In the experiment, we designed an asynchronous CNN circuit and evaluated the performance. Compared to the synchronous counterpart, the asynchronous CNN circuit consumed about 2.3% less energy.","PeriodicalId":104240,"journal":{"name":"2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121755394","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Energy and Performance Analysis of STTRAM Caches for Mobile Applications","authors":"Kyle Kuan, Tosiron Adegbija","doi":"10.1109/MCSoC.2019.00044","DOIUrl":"https://doi.org/10.1109/MCSoC.2019.00044","url":null,"abstract":"Spin-Transfer Torque RAMs (STTRAMs) have been shown to offer much promise for implementing emerging cache architectures. This paper studies the viability of STTRAM caches for mobile workloads from the perspective of energy and latency. Specifically, we explore the benefits of reduced retention STTRAM caches for mobile applications. We analyze the characteristics of mobile applications' cache blocks and how those characteristics dictate the appropriate retention time for mobile device caches. We show that due to their inherently interactive nature, mobile applications' execution characteristics—and hence, STTRAM cache design requirements—differ from other kinds of applications. We also explore various STTRAM cache designs in both single and multicore systems, and at different cache levels, that can efficiently satisfy mobile applications' execution requirements, in order to maximize energy savings without introducing substantial latency overhead.","PeriodicalId":104240,"journal":{"name":"2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"88 7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126307798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}