2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC): Latest Papers

Title Page i

2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC) Pub Date : 2018-09-01 DOI: 10.1109/mcsoc2018.2018.00001

Citations: 0

Area and Energy Optimization for Bit-Serial Log-Quantized DNN Accelerator with Shared Accumulators

2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC) Pub Date : 2018-09-01 DOI: 10.1109/MCSoC2018.2018.00048

Takumi Kudo, Kodai Ueyoshi, Kota Ando, Kazutoshi Hirose, Ryota Uematsu, Yuka Oba, M. Ikebe, T. Asai, M. Motomura, Shinya Takamaeda-Yamazaki

Abstract: In the remarkable evolution of deep neural networks (DNNs), development of a highly optimized DNN accelerator for edge computing with both less hardware resource and high computing performance is strongly required. As a well-known characteristic, DNN processing involves a large number of multiplication and accumulation operations. Thus, low-precision quantization, such as binary and logarithmic, is an essential technique in edge computing devices with strict restrictions on circuit resources and energy. The bit-width requirement in quantization depends on application characteristics. Variable bit-width architecture based on bit-serial processing has been proposed as a scalable alternative that allows different requirements of performance and accuracy balance by a unified hardware structure. In this paper, we propose a well-optimized DNN hardware architecture with support for binary and variable bit-width logarithmic quantization. The key idea is the distributed-and-shared accumulator that processes multiple bit-serial inputs by a single accumulator with an additional low-overhead circuit for the binary mode. The evaluation results show that the idea reduces hardware resources by 29.8% compared to the prior architecture without losing any functionality, computing speed, or recognition accuracy. Moreover, it achieves 19.6% energy reduction using a practical DNN model of VGG 16.
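To illustrate why logarithmic quantization suits hardware with tight resource limits, the following is a minimal sketch, assuming weights are rounded to signed powers of two so that each multiplication becomes a bit shift. It is not the architecture proposed in the paper; the function names `log_quantize` and `shift_dot`, the exponent range, and the fixed-point width are all hypothetical choices made for this example.

```python
import math


def log_quantize(w, bits=4):
    """Approximate w by sign * 2**exp (hypothetical scheme, not from the paper).

    The exponent is rounded in the log2 domain and clipped to
    [-(2**bits - 1), 0], so |w| is assumed to be at most 1.
    """
    if w == 0:
        return 0, 0
    sign = 1 if w > 0 else -1
    exp = round(math.log2(abs(w)))
    exp = max(-(2 ** bits - 1), min(0, exp))
    return sign, exp


def shift_dot(activations, qweights, frac_bits=16):
    """Dot product of integer activations with log-quantized weights.

    Each product a * 2**exp is computed as a left shift in fixed point
    with frac_bits fractional bits, then summed in a single accumulator.
    frac_bits must be at least the largest exponent magnitude.
    """
    acc = 0
    for a, (sign, exp) in zip(activations, qweights):
        acc += sign * (a << (frac_bits + exp))
    return acc / (1 << frac_bits)


weights = [0.5, -0.25, 1.0]
qweights = [log_quantize(w) for w in weights]
result = shift_dot([4, 8, 3], qweights)  # 4*0.5 - 8*0.25 + 3*1.0
```

Because the weights here are exact powers of two, the shift-based result matches the floating-point dot product; for other weights the result carries the rounding error of the quantizer.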

Citations: 1

Search Space Reduction for Parameter Tuning of a Tsunami Simulation on the Intel Knights Landing Processor

2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC) Pub Date : 2018-09-01 DOI: 10.1109/MCSoC2018.2018.00030

K. Komatsu, Takumi Kishitani, Masayuki Sato, A. Musa, Hiroaki Kobayashi

Abstract: The structures of recent computing systems have become complicated, with features such as heterogeneous memory systems with a deep hierarchy and many-core systems. To achieve high performance of HPC applications on such computing systems, performance tuning is mandatory. However, the number of tuning parameters has become large due to the complexities of the systems and applications. In addition, along with the improvement of computing systems, HPC applications are getting larger and more complicated, resulting in a long execution time for each application execution. Due to the large number of tuning parameters and the long time of each execution, the time to search for an appropriate tuning parameter combination becomes huge. This paper proposes a method to reduce the time to search for an appropriate tuning parameter combination. By considering the characteristics of a many-core processor and a simulation code, the search space of tuning parameters is reduced. Moreover, the time of each application execution for parameter search is reduced by limiting the simulation period of an application unless characteristics of the application are changed. Through the evaluation of performance tuning using the tsunami simulation code on the Intel Xeon Phi Knights Landing processor, it is clarified that a 3.67x performance improvement can be achieved by the parameter tuning. It is also clarified that the time for parameter tuning can be drastically reduced by reducing the number of tuning parameters to be searched and limiting the simulation period of each application execution.

Citations: 4

Publisher's Information

2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC) Pub Date : 2018-09-01 DOI: 10.1109/mcsoc2018.2018.00047

Citations: 0

Code Generation of Graph-Based Vision Processing for Multiple CUDA Cores SoC Jetson TX

2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC) Pub Date : 2018-09-01 DOI: 10.1109/MCSoC2018.2018.00013

Elishai Ezra Tsur, Elyassaf Madar, Natan Danan

Citations: 2

An Efficient Parallel Hardware Scheme for Solving the N-Queens Problem

2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC) Pub Date : 2018-09-01 DOI: 10.1109/MCSoC2018.2018.00015

Yuuma Azuma, H. Sakagami, Kenji Kise

Abstract: The N-Queens problem is a generalization of the 8-Queens puzzle. The computational complexity of this problem increases drastically with N. To calculate the unsolved N-Queens problem in realistic time, implementing a high-speed solver and system is important. Therefore, efficient search methods for solutions using backtracking, bit operations, etc. have been introduced. Also, parallelization schemes that search for solutions by arranging several queens in advance and generating a large number of subproblems have been introduced. In the state-of-the-art system, a large number of solver modules are implemented on several FPGAs to solve such subproblems. In this paper, we propose two methods to enable further large-scale parallelization with realistic hardware resources. One is a method to reduce the hardware usage of a solver module using an encoder and a decoder for the crucial data structure. The other is an efficient method for distributing the subproblems to each solver module and collecting the resulting counts from each solver module. Through these methods, it is possible to increase the number of solver modules to be implemented on an FPGA. The evaluation results show that the performance of the proposed system implementing 700 solver modules achieves 2.58x that of the previous work.
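The backtracking, bit-operation, and subproblem ideas mentioned in the abstract can be sketched in software as follows. This is a generic textbook-style illustration, not the hardware scheme of the paper: the function names `count_from`, `split_subproblems`, and `count_n_queens` are hypothetical, and the encoder/decoder and distribution logic described by the authors are not modeled.

```python
def count_from(n, cols, diag_l, diag_r):
    """Count completions of a partial placement using bitmask backtracking.

    cols, diag_l, diag_r are n-bit masks of squares in the next row that are
    attacked by column, left diagonal, and right diagonal respectively.
    """
    full = (1 << n) - 1
    if cols == full:          # every column holds a queen: one solution
        return 1
    total = 0
    free = full & ~(cols | diag_l | diag_r)
    while free:
        bit = free & -free    # lowest free square in this row
        free ^= bit
        total += count_from(
            n,
            cols | bit,
            ((diag_l | bit) << 1) & full,
            (diag_r | bit) >> 1,
        )
    return total


def split_subproblems(n, depth):
    """Place queens in the first `depth` rows and return the resulting states.

    Each state is an independent subproblem that a separate solver could count.
    """
    full = (1 << n) - 1
    states = []

    def expand(row, cols, diag_l, diag_r):
        if row == depth:
            states.append((cols, diag_l, diag_r))
            return
        free = full & ~(cols | diag_l | diag_r)
        while free:
            bit = free & -free
            free ^= bit
            expand(row + 1, cols | bit,
                   ((diag_l | bit) << 1) & full, (diag_r | bit) >> 1)

    expand(0, 0, 0, 0)
    return states


def count_n_queens(n):
    """Count all solutions of the n-Queens problem."""
    return count_from(n, 0, 0, 0)


subs = split_subproblems(8, 2)
total_from_subs = sum(count_from(8, *s) for s in subs)
```

Summing the counts of all subproblems gives the same total as solving the whole board directly, which is what makes the subproblems suitable for distribution across independent solvers.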

Citations: 2

Design and Evaluation of a Configurable Hardware Merge Sorter for Various Output Records

2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC) Pub Date : 2018-09-01 DOI: 10.1109/MCSoC2018.2018.00041

E. Elsayed, Kenji Kise

Citations: 2