2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC): Latest Papers

Area and Energy Optimization for Bit-Serial Log-Quantized DNN Accelerator with Shared Accumulators
Takumi Kudo, Kodai Ueyoshi, Kota Ando, Kazutoshi Hirose, Ryota Uematsu, Yuka Oba, M. Ikebe, T. Asai, M. Motomura, Shinya Takamaeda-Yamazaki
DOI: https://doi.org/10.1109/MCSoC2018.2018.00048
Published: 2018-09-01
Abstract: Given the remarkable evolution of deep neural networks (DNNs), a highly optimized DNN accelerator for edge computing that combines low hardware resource usage with high computing performance is strongly required. As is well known, DNN processing involves a large number of multiplication and accumulation operations. Thus, low-precision quantization, such as binary and logarithmic, is an essential technique in edge computing devices with strict restrictions on circuit resources and energy. The bit-width requirement in quantization depends on application characteristics. A variable bit-width architecture based on bit-serial processing has been proposed as a scalable alternative that accommodates different performance and accuracy trade-offs with a unified hardware structure. In this paper, we propose a well-optimized DNN hardware architecture that supports binary and variable bit-width logarithmic quantization. The key idea is the distributed-and-shared accumulator, which processes multiple bit-serial inputs with a single accumulator plus an additional low-overhead circuit for the binary mode. The evaluation results show that the idea reduces hardware resources by 29.8% compared to the prior architecture without losing any functionality, computing speed, or recognition accuracy. Moreover, it achieves a 19.6% energy reduction using a practical DNN model, VGG-16.
Citations: 1
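The abstract above relies on logarithmic quantization, where each weight is approximated by a signed power of two so that a multiplication becomes a bit shift. The following is a minimal software sketch of that general idea only; it does not model the paper's bit-serial datapath or its shared accumulator. The function names, the exponent range, and the fixed-point width (`frac_bits`) are assumptions chosen for illustration.

```python
import math


def log_quantize(w, bits=4):
    """Approximate w by sign * 2**exp.

    Assumed convention (not from the paper): exponents are clipped to the
    range [-(2**bits - 1), 0], and sign 0 encodes a zero weight.
    """
    if w == 0:
        return 0, 0
    sign = 1 if w > 0 else -1
    exp = round(math.log2(abs(w)))
    exp = max(-(2 ** bits - 1), min(0, exp))
    return sign, exp


def shift_mac(acts, qweights, frac_bits=16):
    """Multiply-accumulate integer activations with power-of-two weights.

    Each product a * 2**exp is computed as a left shift in fixed point with
    frac_bits fractional bits, so no multiplier is needed for the weights.
    frac_bits must be at least 2**bits - 1 so the shift amount stays >= 0.
    """
    acc = 0
    for a, (sign, exp) in zip(acts, qweights):
        acc += sign * (a << (frac_bits + exp))
    return acc / (1 << frac_bits)


weights = [0.5, -0.25, 1.0]
qweights = [log_quantize(w) for w in weights]
print(shift_mac([4, 8, 3], qweights))  # 4*0.5 - 8*0.25 + 3*1.0 = 3.0
```

Because every weight is a power of two, the accumulator only ever adds shifted activations, which is what makes low-area hardware implementations of this scheme attractive.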
Multikernel Design and Implementation for Improving Responsiveness of Aperiodic Tasks
Hidehito Yabuuchi, Shinichi Awamoto, Hiroyuki Chishiro, S. Kato
DOI: https://doi.org/10.1109/MCSoC2018.2018.00029
Published: 2018-09-01
Abstract: Modern real-time systems need to handle aperiodic tasks as efficiently as periodic ones. This paper presents a system design that applies the hybrid operating system approach to multi-core architectures. A core is allocated exclusively and dynamically to a newly booted kernel and an aperiodic task running on it, so that the task avoids overhead caused by the rest of the system, leading to reduced response time. We implemented and evaluated the presented design on a real multi-core architecture. The evaluation results indicate that the design improves the responsiveness of aperiodic tasks that access shared resources frequently.
Citations: 0
Search Space Reduction for Parameter Tuning of a Tsunami Simulation on the Intel Knights Landing Processor
K. Komatsu, Takumi Kishitani, Masayuki Sato, A. Musa, Hiroaki Kobayashi
DOI: https://doi.org/10.1109/MCSoC2018.2018.00030
Published: 2018-09-01
Abstract: The structures of recent computing systems have become complicated, with features such as heterogeneous memory systems with deep hierarchies and many-core processors. To achieve high performance for HPC applications on such systems, performance tuning is mandatory. However, the number of tuning parameters has become large due to the complexity of the systems and applications. In addition, as computing systems improve, HPC applications are becoming larger and more complicated, resulting in a long execution time for each application run. Because of the large number of tuning parameters and the long time of each execution, the time needed to search for an appropriate combination of tuning parameters becomes huge. This paper proposes a method to reduce that search time. By considering the characteristics of a many-core processor and a simulation code, the search space of tuning parameters is reduced. Moreover, the time of each application execution during the parameter search is reduced by limiting the simulated period, as long as the characteristics of the application do not change. Through an evaluation of performance tuning using a tsunami simulation code on the Intel Xeon Phi Knights Landing processor, it is shown that a 3.67x performance improvement can be achieved by parameter tuning. It is also shown that the time for parameter tuning can be drastically reduced by reducing the number of tuning parameters to be searched and limiting the simulation period of each application execution.
Citations: 4
Publisher's Information
DOI: https://doi.org/10.1109/mcsoc2018.2018.00047
Published: 2018-09-01
Citations: 0
Code Generation of Graph-Based Vision Processing for Multiple CUDA Cores SoC Jetson TX
Elishai Ezra Tsur, Elyassaf Madar, Natan Danan
DOI: https://doi.org/10.1109/MCSoC2018.2018.00013
Published: 2018-09-01
Abstract: Embedded vision processing is currently ingrained in many aspects of modern life, from computer-aided surgery to navigation of unmanned aerial vehicles. Vision processing can be described using coarse-grained data flow graphs, which were standardized by OpenVX to enable both system-level and kernel-level optimization via separation of concerns. Notably, a graph-based specification provides a gateway to a code generation engine, which can produce optimized, hardware-specific code for deployment. Here we provide an algorithm and a Java MVC-based implementation of an automated code generation engine for OpenVX-based vision applications, tailored to the NVIDIA multiple CUDA cores SoC Jetson TX. Our algorithm pre-processes the graph, translates it into an ordered, layer-oriented data model, and produces C code that is optimized for the Jetson TX1 and includes error checking and iterative execution for real-time vision processing.
Citations: 2
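The abstract above mentions translating a vision data flow graph into an ordered, layer-oriented data model before emitting code. One common way to obtain such an ordering is to group nodes into topological layers, so that every node appears after all of its producers. The sketch below illustrates that general technique only; it is not the authors' algorithm, and the function name and node names are hypothetical.

```python
def layer_graph(nodes, edges):
    """Group the nodes of a DAG into layers (Kahn-style topological layering).

    nodes: iterable of node names.
    edges: iterable of (producer, consumer) pairs.
    Returns a list of layers; every node's producers sit in earlier layers.
    """
    nodes = list(nodes)
    indegree = {n: 0 for n in nodes}
    consumers = {n: [] for n in nodes}
    for src, dst in edges:
        consumers[src].append(dst)
        indegree[dst] += 1

    layers = []
    placed = 0
    current = sorted(n for n in nodes if indegree[n] == 0)
    while current:
        layers.append(current)
        placed += len(current)
        ready = []
        for n in current:
            for m in consumers[n]:
                indegree[m] -= 1
                if indegree[m] == 0:
                    ready.append(m)
        current = sorted(ready)

    if placed != len(nodes):
        raise ValueError("graph contains a cycle")
    return layers


# Hypothetical pipeline: one input stage feeding two filters that are merged.
nodes = ["convert", "blur", "edges", "merge"]
edges = [("convert", "blur"), ("convert", "edges"),
         ("blur", "merge"), ("edges", "merge")]
print(layer_graph(nodes, edges))
```

A code generator could then walk the layers in order and emit one call per node, knowing that all inputs of a node were produced by an earlier layer.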
An Efficient Parallel Hardware Scheme for Solving the N-Queens Problem
Yuuma Azuma, H. Sakagami, Kenji Kise
DOI: https://doi.org/10.1109/MCSoC2018.2018.00015
Published: 2018-09-01
Abstract: The N-Queens problem is a generalization of the 8-Queens puzzle. The computational complexity of this problem increases drastically as N increases. To solve the still-unsolved N-Queens instances in realistic time, implementing a high-speed solver and system is important. Therefore, efficient solution search methods based on backtracking, bit operations, and similar techniques have been introduced. Parallelization schemes have also been introduced that arrange several queens in advance and generate a large number of subproblems. In the state-of-the-art system, many solver modules are implemented on several FPGAs to solve such subproblems. In this paper, we propose two methods to enable further large-scale parallelization with realistic hardware resources. One is a method to reduce the hardware usage of a solver module by using an encoder and a decoder for the crucial data structure. The other is an efficient method for distributing the subproblems to each solver module and collecting the resulting counts from each solver module. Through these methods, it is possible to increase the number of solver modules that can be implemented on an FPGA. The evaluation results show that the proposed system, implementing 700 solver modules, achieves 2.58x the performance of the previous work.
Citations: 2
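The abstract above refers to two established software techniques: backtracking with bit operations, and generating independent subproblems by placing the first few queens in advance. The sketch below shows both in plain Python as an illustration of the general approach; it says nothing about the paper's FPGA solver modules or its encoder and decoder. The function names and the `depth` parameter are assumptions made for this example.

```python
def _extend(n, row, cols, diag_l, diag_r, stop_row, on_stop):
    """Backtrack from `row` with bitmasks of attacked squares.

    cols, diag_l, diag_r mark the columns of the current row attacked by
    queens already placed (vertically and along both diagonals). When
    `row` reaches `stop_row`, the partial state is handed to `on_stop`.
    """
    if row == stop_row:
        on_stop(row, cols, diag_l, diag_r)
        return
    full = (1 << n) - 1
    free = full & ~(cols | diag_l | diag_r)
    while free:
        bit = free & -free          # lowest free column
        free ^= bit
        _extend(n, row + 1, cols | bit,
                ((diag_l | bit) << 1) & full, (diag_r | bit) >> 1,
                stop_row, on_stop)


def make_subproblems(n, depth):
    """Place queens on the first `depth` rows; each valid state is a subproblem."""
    subproblems = []
    _extend(n, 0, 0, 0, 0, depth, lambda *state: subproblems.append(state))
    return subproblems


def count_from(n, row, cols, diag_l, diag_r):
    """Count the complete solutions reachable from one partial placement."""
    found = []
    _extend(n, row, cols, diag_l, diag_r, n, lambda *state: found.append(1))
    return len(found)


def solve(n, depth=2):
    """Total solution count, summed over independently solvable subproblems."""
    depth = min(depth, n)
    return sum(count_from(n, *sub) for sub in make_subproblems(n, depth))


print(solve(8))  # 92
```

Each tuple returned by `make_subproblems` can be counted independently of the others, which is the property that lets a parallel system distribute subproblems to many solvers and then sum the returned counts.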
Design and Evaluation of a Configurable Hardware Merge Sorter for Various Output Records
E. Elsayed, Kenji Kise
DOI: https://doi.org/10.1109/MCSoC2018.2018.00041
Published: 2018-09-01
Abstract: Sorting is one of the fundamental operations that are important in many applications, such as image processing and databases. Much research has been done to improve the performance of sorting. One of the most promising techniques is the FPGA-based hardware merge sorter (HMS). While previous studies on HMS achieved very high throughput, most of them could output only a power-of-two number of records per clock cycle. Moreover, they could not evaluate the performance of HMS configurations that output more than 32 records per clock cycle due to hardware resource limitations. In this paper, we propose an HMS architecture that can be configured to output not only a power-of-two number of records but also other counts, e.g., 3, 7, and 12. In addition, our proposed HMS can be configured to output more than 32 records per clock cycle, such as 40, 48, and 56. Finally, we evaluate the performance of different configurations of key and data widths that may be required by different sorting applications.
Citations: 2