{"title":"Integrating Intra-and Intercellular Simulation of a 2D HL-1 Cardiac Model Based on Embedded GPUs","authors":"Baohua Liu, W. Shen, Xin Zhu, Xingyu Wangchen","doi":"10.1109/MCSoC.2019.00041","DOIUrl":"https://doi.org/10.1109/MCSoC.2019.00041","url":null,"abstract":"Simulation of electrophysiological cardiac models enables researchers to investigate the activity of the heart under various circumstances. Fortunately, recent developments in embedded parallel computing architectures have made it possible to efficiently simulate sophisticated electrophysiological models that match real conditions on embedded computing devices, a task that typically relied on large-scale CPU or GPU clusters in the past. In this paper, a simultaneous implementation of a 2D Takeuchi-HL-1 cardiac model combining unicellular and intercellular solvers is proposed and run on an NVIDIA Jetson Tegra X2 embedded computer. The experimental results demonstrate that our implementation yields a considerable efficiency improvement compared with non-simultaneous methods, without loss of simulation accuracy. Moreover, embedded devices are shown to be much more energy-efficient than conventional systems for this simulation.","PeriodicalId":104240,"journal":{"name":"2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"27 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131408237","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MITRACA: A Next-Gen Heterogeneous Architecture","authors":"Riadh Ben Abdelhamid, Y. Yamaguchi, T. Boku","doi":"10.1109/MCSoC.2019.00050","DOIUrl":"https://doi.org/10.1109/MCSoC.2019.00050","url":null,"abstract":"GPU (Graphics Processing Unit) and CPU (Central Processing Unit) possess a sufficient and appropriate performance to compute massively parallel applications like AI, Big data, and material sciences. However, their real performance is far lower than their theoretical performance. The primary reason for the performance degradation is that they suffer from limited memory bandwidth and an inefficient interconnection topology not optimized for these types of applications. Thus, from the viewpoint of real computational performance, called computational efficiency, the FPGA (Field Programmable Gate Array) is now becoming an attractive chip for these types of applications with massively parallel computation. FPGA can efficiently propose optimized communication and bridge different computing accelerators as customized hardware. In other words, FPGA-based hardware accelerators offer a convenient solution for both high performance and high memory bandwidth. However, one serious concern is usability. For example, FPGA design using a hardware description language is a meticulous task that requires specialized skill sets as well as a long time to market. An overlay architecture is an appropriate candidate to resolve this issue because it offers a software layer that simplifies FPGA programmability by abstracting the fabric resources. Thus, this article proposes an overlay architecture based on a tightly-connected many-core-based CGRA (Coarse-Grained Reconfigurable Architecture). It will help software engineers seamlessly implement their applications. Our final goal is not the current fine-grained FPGAs but new middle-to-coarse-grained programmable chips. 
If an ASIC (Application-Specific Integrated Circuit) implementation were adopted, the performance would be at least ten times higher than that of the current FPGA implementation because of the working frequency. In this article, the proposed overlay system provides a programmable interface that virtualizes FPGA resources and lets prospective users focus on high-level software programming.","PeriodicalId":104240,"journal":{"name":"2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132295681","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Novel SLM-Based Virtual FPGA Overlay Architecture","authors":"Theingi Myint, M. Amagasaki, Qian Zhao, M. Iida, M. Kiyama","doi":"10.1109/MCSoC.2019.00018","DOIUrl":"https://doi.org/10.1109/MCSoC.2019.00018","url":null,"abstract":"To implement virtual field-programmable gate array (vFPGA) layers on physical devices, FPGA overlay technologies have been introduced to provide inter-FPGA bitstream compatibility. Conventional LUT-based vFPGA overlay architectures have very large resource overheads because LUT resource requirements increase as O(2^k) with an increasing number of inputs, k. In this paper, we propose novel SLM-based vFPGA overlay architectures that employ our previously proposed scalable logic module (SLM) as a logic cell. SLMs can cover the most frequently used logic functions with far fewer hardware resources than LUTs. Evaluation results show that a 6-input SLM-based vFPGA can reduce LUT and flip-flop resource usage by up to 21% and 21% on an Artix-7 FPGA, on a Kintex-7 FPGA, and on a Kintex UltraScale+ FPGA respectively, as compared to a LUT-based vFPGA of the same input size. Similarly, a 7-input SLM-based vFPGA can reduce LUT and flip-flop resource usage by up to 32% and 35% on an Artix-7 FPGA, 30% and 35% on a Kintex-7 FPGA, and 30% and 35% on a Kintex UltraScale+ FPGA respectively, as compared to a LUT-based vFPGA of the same input size. 
Delay results of SLM-based vFPGA overlay architectures are almost the same as those of LUT-based vFPGA overlay architectures.","PeriodicalId":104240,"journal":{"name":"2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"95 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121175210","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Low-Latency and Flexible TDM NoC for Strong Isolation in Security-Critical Systems","authors":"M. Alonso, J. Flich, M. Turki, D. Bertozzi","doi":"10.1109/MCSoC.2019.00029","DOIUrl":"https://doi.org/10.1109/MCSoC.2019.00029","url":null,"abstract":"Shared security-critical systems are typically organized as a set of domains that must be kept separate. The network-on-chip (NoC) is key to delivering strong domain isolation, since many of its internal resources are shared between packets from different domains; therefore time-division multiplexing (TDM) is often implemented to avoid any form of interference. Prior approaches to TDM-based scheduling of NoCs lose relevance when they are challenged with conflicting requirements of latency optimization, area efficiency, architectural flexibility and fast reconfigurability. In many cases, aggressive latency optimizations are performed at the cost of timing channel protection. In this paper, we propose a new scheduling approach of time slots in 2D-mesh TDM NoCs that follows directly from the properties of the Channel Dependency Graph. As a result, the isolation-performance trade-off is consistently improved with respect to state-of-the-art solutions across the domain configuration space. 
When combined with a new token-based mechanism to dispatch scheduling directives, our approach enables the effective reconfiguration of the number of domains, unlike the static nature of most previous proposals.","PeriodicalId":104240,"journal":{"name":"2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127193142","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Traffic-Robust Routing Algorithm for Network-on-Chip Systems","authors":"Siying Xu, M. Meyer, Xin Jiang, Takahiro Watanabe","doi":"10.1109/MCSoC.2019.00037","DOIUrl":"https://doi.org/10.1109/MCSoC.2019.00037","url":null,"abstract":"Network-on-chip (NoC) has been proposed as a better interconnection method than the bus architecture. Recently, a large number of routing algorithms have been proposed to improve the network performance. They usually show their benefits under particular traffic patterns. However, traffic patterns are generally unknown in advance and vary according to the application due to the behavioral diversity between inter-core and memory access communications. In this paper, a local traffic pattern detecting mechanism is proposed to detect the current traffic patterns including uniform, transpose, hotspot and real workloads, and then the routing algorithm will be switched to the most suitable one according to the detection result. Experimental results show that the traffic pattern can be accurately detected. For the hotspot traffic pattern, the success rate of the detector can reach up to 100 percent when the hotspot percentage is larger than 8. 
With the help of the proposed traffic-robust routing algorithm, the network can always work with a more suitable routing algorithm and achieve better performance.","PeriodicalId":104240,"journal":{"name":"2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124064248","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhanced ID Authentication Scheme Using FPGA-Based Ring Oscillator PUF","authors":"Van-Toan Tran, Quang-Kien Trinh, Van‐Phuc Hoang","doi":"10.1109/MCSoC.2019.00052","DOIUrl":"https://doi.org/10.1109/MCSoC.2019.00052","url":null,"abstract":"FPGA-based ring oscillator (RO) PUF is very popular for its unique properties and easy implementation. However, the designs are normally expensive, and the RO frequency is highly sensitive to operating conditions and other types of global variations. In addition, the local variations are also highly correlated, which normally requires a complex identification (ID) extraction algorithm and/or a large number of ROs. In this work, by using statistical analysis, we have experimentally shown that the RO frequencies are very sensitive to global variation factors. Fortunately, their local process variations within a die are relatively consistent regardless of the operating condition, and this can be used for unique ID extraction. Furthermore, we have proposed an ID authentication scheme using FPGA-based RO PUF. Our proposed scheme allows full extraction of the local variation characteristics by using an almost technology- and vendor-agnostic PUF circuit. In addition, the ID extraction circuit is kept simple and compact, so that the overall design is area- and energy-efficient. 
The experimental results show a very good level of reliability (99.94 %) for a design of 32 ROs in different physical FPGAs.","PeriodicalId":104240,"journal":{"name":"2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127424977","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distributed O(N) Linear Solver for Dense Symmetric Hierarchical Semi-Separable Matrices","authors":"Chenhan D. Yu, Severin Reiz, G. Biros","doi":"10.1109/MCSoC.2019.00008","DOIUrl":"https://doi.org/10.1109/MCSoC.2019.00008","url":null,"abstract":"We present a distributed memory algorithm for the approximate hierarchical factorization of symmetric positive definite (SPD) matrices. Our method is based on the distributed memory GOFMM, an algorithm that appeared in SC18 (doi:10.1109/SC.2018.00018). GOFMM constructs a hierarchical matrix approximation of an arbitrary SPD matrix that compresses the matrix by creating low-rank approximations of the off-diagonal blocks. GOFMM method has no guarantees of success for arbitrary SPD matrices. (This is similar to the SVD; not every matrix admits a good low-rank approximation.) But for many SPD matrices, GOFMM does enable compression that results in fast matrix-vector multiplication that can reach N log N time, as opposed to the N^2 required for a dense matrix. GOFMM supports shared and distributed memory parallelism. In this paper, we build an approximate \"ULV\" factorization based on the Hierarchically Semi-Separable (HSS) compression of the GOFMM. This factorization requires O(N) work (given the compressed matrix) and O(N/p) + O(log p) time on p MPI processes (assuming a hypercube topology). The previous state-of-the-art required O(N log N) work. We present the factorization algorithm, discuss its complexity, and present weak and strong scaling results for the \"factorization\" and \"solve\" phases of our algorithm. We also discuss the performance of the inexact ULV factorization as a preconditioner for a few exemplary large dense linear systems. In our largest run, we were able to factorize a 67M-by-67M matrix in less than one second; and solve a system with 64 right-hand sides in less than one-tenth of a second. 
This run was on 6,144 Intel \"Skylake\" cores on the SKX partition of the Stampede2 system at the Texas Advanced Computing Center.","PeriodicalId":104240,"journal":{"name":"2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128723743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards an Efficient Hardware Architecture for Odd-Even Based Merge Sorter","authors":"Elsayed A. Elsayed, Kenji Kise","doi":"10.1109/MCSoC.2019.00043","DOIUrl":"https://doi.org/10.1109/MCSoC.2019.00043","url":null,"abstract":"Sorting is widely used in several practical applications such as searching and databases. This paper proposes two improved FPGA-based architectures for a merge sorter that use fewer hardware resources than the state-of-the-art. For instance, when 64 sorted records are output per cycle, implementation results of our first proposal show an improvement in the required number of Flip Flops (FFs) and Look-Up Tables (LUTs) of 84.4% and 77.7%, respectively, over the state-of-the-art. In addition, the throughput of our merge sorter is 1.065x higher than that of the state-of-the-art. As for the second proposal, a significant improvement of 66.3% and 84.6% is achieved for the needed FFs and LUTs, respectively. Moreover, while our second proposed merge sorter uses significantly fewer resources, it achieves about 95.9% of the performance of the state-of-the-art merge sorter.","PeriodicalId":104240,"journal":{"name":"2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123808935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tumour Detection using Convolutional Neural Network on a Lightweight Multi-Core Device","authors":"T. Teo, Weihao Tan, Y. Tan","doi":"10.1109/MCSoC.2019.00020","DOIUrl":"https://doi.org/10.1109/MCSoC.2019.00020","url":null,"abstract":"Convolutional neural networks (CNNs) have been the main driving force behind image classification and are widely used. Large amounts of processing power and computational complexity are required to mimic the human brain in image classification. Such complexity may result in large bulky systems. A lack of such, while possible, may result in a rather limited use case and as such constrained functional implementation. One solution is to explore the use of Multicore System on Chips (MCSoC). CNNs, however, are commonly built on Graphics Processing Unit (GPU) based machines. In this paper, we reduce the overall size of a CNN while retaining a satisfactory level of accuracy so that it is better suited to be deployed in an MCSoC environment. We trained a CNN model that was validated on detecting malignant tumor cells. The results show a significant boost in functionality in the form of faster inference times and smaller model parameter sizes, deploying neural networks in an environment that would have otherwise seemed less practical. 
Efficient inference networks on lightweight systems can serve as an inexpensive and physically small alternative to existing Artificial Intelligence (AI) systems that are generally costly, bulky and power hungry.","PeriodicalId":104240,"journal":{"name":"2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130831317","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Search-Space Encoding for System-Level Design Space Exploration of Embedded Systems","authors":"Valentina Richthammer, M. Glaß","doi":"10.1109/MCSoC.2019.00046","DOIUrl":"https://doi.org/10.1109/MCSoC.2019.00046","url":null,"abstract":"For Design Space Exploration (DSE) of embedded systems as a combinatorial Multi-Objective Optimization Problem (MOP), metaheuristic optimization approaches are typically employed to determine high-quality solutions within limited optimization time. This requires the encoding of implementations from the design space in a search space which represents the available degrees of freedom for the optimization approach. Determining an encoding that ensures all design constraints are met by construction is, however, impossible for multi-/many-core DSE problems, so that the search space contains infeasible solutions. While state-of-the-art DSE techniques repair infeasible solutions, little to no attention has been paid to the efficiency of the resulting encoding w.r.t. its suitability for the employed optimization approach. Therefore, we formally define requirements for an efficient search space and analyze the drawbacks of automatically generated inefficient encodings. We furthermore present efficient search-space encodings for a state-of-the-art hybrid optimization approach suitable for a wide range of MOPs. The proposed encodings significantly reduce the required degree of repair, allowing us to introduce a feedback loop from repaired solutions in the design space to the respective encoded solutions in the efficient search space to further improve the optimization. The positive effects of the proposed efficient encoding and design-space feedback are demonstrated for system-level DSE using benchmarks from the domains of embedded many-core as well as networked automotive systems. Compared to inefficient search spaces from literature, significant enhancements in both optimization quality and time are observed. 
Furthermore, we propose metrics to quantify search-space efficiency which provide novel insights into the interdependence of search space and design space for multi-/many-core DSE.","PeriodicalId":104240,"journal":{"name":"2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130085574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}