Title: Host Bypassing: Direct Data Piping from the Network to the Hardware Accelerator
Authors: Ralf Kundel, Kadir Eryigit, Jonas Markussen, C. Griwodz, Osama Abboud, Rhaban Hark, R. Steinmetz
DOI: https://doi.org/10.1109/MCSoC51149.2021.00012
Published in: 2021 IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), December 2021
Abstract: Computer networks have grown in importance in recent years, both for common services such as Internet connectivity and for time-sensitive applications such as videotelephony. Furthermore, approaches like in-network computing enable the offloading of latency-critical, high-performance network functions, e.g. 5G network functions, into the network to support such time-sensitive applications. In this work, we show how FPGAs in PCIe-based systems, which are typically used as hardware accelerators for latency-critical in-network functions, can be integrated into the data path. Our approach, named host bypassing, allows direct data transfer from the network interface to the accelerator and achieves substantial performance benefits over existing state-of-the-art approaches. Our detailed evaluation demonstrates that deterministic low latency can be achieved under heavy load without any packet loss, while requiring fewer CPU resources.

Title: Parallel Implementation of CNN on Multi-FPGA Cluster
Authors: Yasuyu Fukushima, Kensuke Iizuka, H. Amano
DOI: https://doi.org/10.1109/MCSoC51149.2021.00019
Published in: 2021 IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), December 2021
Abstract: We developed M-KUBOS, a PYNQ cluster consisting of economical Zynq boards interconnected by low-cost, high-performance GTH serial links. For the software environment, we employed the open-source PYNQ platform. The cluster is intended to serve as a multi-access edge computing (MEC) server for 5G mobile networks. We implemented a ResNet-50 inference accelerator on the cluster for image recognition in MEC applications. By estimating the execution time of each ResNet-50 layer, the layers were divided across four boards so that the execution time of each board would be as equal as possible for efficient pipeline processing. Because the FPGAs in the cluster are directly connected by high-speed serial links, stream processing without network bottlenecks and pipeline processing between boards were readily realized. The implementation achieved 292 GOPS performance, 75.1 FPS throughput, and 5.15 GOPS/W power efficiency: 17 times faster and 86 times more power-efficient than a CPU implementation, and 3.8 times more power-efficient than a GPU implementation.

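The per-board partitioning step described above (splitting the network's layers into contiguous groups so that the slowest board, i.e. the pipeline bottleneck, is as fast as possible) can be sketched as a small search. This is an illustrative sketch only; the layer times below are hypothetical, not measurements from the paper.

```python
from itertools import combinations

def balanced_split(layer_times, boards):
    """Split layer_times into `boards` contiguous groups, minimizing the
    slowest group's total time (the pipeline bottleneck)."""
    n = len(layer_times)
    best_bounds, best_bottleneck = None, float("inf")
    # Try every placement of (boards - 1) cut points between layers.
    for cuts in combinations(range(1, n), boards - 1):
        bounds = (0,) + cuts + (n,)
        bottleneck = max(sum(layer_times[a:b]) for a, b in zip(bounds, bounds[1:]))
        if bottleneck < best_bottleneck:
            best_bounds, best_bottleneck = bounds, bottleneck
    groups = [layer_times[a:b] for a, b in zip(best_bounds, best_bounds[1:])]
    return groups, best_bottleneck

# Hypothetical per-layer execution times (ms), not from the paper.
times = [4, 7, 3, 8, 2, 6, 5, 9, 1, 4]
groups, bottleneck = balanced_split(times, 4)
```

With pipelined boards, throughput is governed by `bottleneck` rather than the total, which is why the paper balances the four boards' execution times.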
Title: Sparse Matrix Ordering Method with a Quantum Annealing Approach and its Parameter Tuning
Authors: Tomoko Komiyama, Tomohiro Suzuki
DOI: https://doi.org/10.1109/MCSoC51149.2021.00045
Published in: 2021 IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), December 2021
Abstract: Quantum annealing realizes quantum computers specialized for combinatorial optimization problems (COPs). A COP is formulated as a Hamiltonian, and quantum annealing obtains a solution by finding the ground state of that Hamiltonian. The ease of finding a solution depends on the weights assigned to the cost and constraint functions when formulating the problem; in other words, parameter tuning is essential when solving problems with quantum annealing. In this paper, the problem of searching for an ordering that reduces the fill-in of a sparse direct solver is formulated as a Hamiltonian, and quantum annealing finds the solution. We discuss the necessity and effectiveness of parameter tuning for solving COPs with quantum annealing. The results after weight tuning show that the rate at which an optimal solution is obtained improves by up to 94% for $5 \times 5$ matrices, 68% for $6 \times 6$ matrices, and 27% for $7 \times 7$ matrices. Moreover, it is shown that assigning excessively high weights to the constraints we want to satisfy does not yield an optimal solution.

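The weight-tuning issue discussed above can be seen in a toy Hamiltonian: a linear cost plus a quadratic penalty enforcing a one-hot constraint. This sketch uses exhaustive search in place of an annealer, and the costs and weights are made-up values, not the paper's fill-in formulation.

```python
import itertools

def qubo_ground_state(n, energy):
    """Exhaustively find the lowest-energy bitstring (what annealing approximates)."""
    return min(itertools.product([0, 1], repeat=n), key=energy)

# Toy Hamiltonian: cost sum(c_i * x_i) plus penalty B * (sum(x_i) - 1)^2,
# enforcing "exactly one bit set". Costs are hypothetical.
costs = [3.0, 1.0, 2.0]

def make_energy(B):
    def energy(x):
        return sum(c * xi for c, xi in zip(costs, x)) + B * (sum(x) - 1) ** 2
    return energy

weak = qubo_ground_state(3, make_energy(0.5))    # penalty too weak
strong = qubo_ground_state(3, make_energy(10.0))  # constraint enforced
```

With the weak penalty, the ground state is the infeasible all-zero string (violating the constraint is cheaper than paying any cost); with a sufficiently large weight, the ground state is the feasible minimum-cost assignment. This is exactly why the constraint weight must be tuned relative to the cost function.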
Title: Ising-Based Combinatorial Clustering Using the Kernel Method
Authors: Masahito Kumagai, K. Komatsu, Masayuki Sato, Hiroaki Kobayashi
DOI: https://doi.org/10.1109/MCSoC51149.2021.00037
Published in: 2021 IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), December 2021
Abstract: Combinatorial clustering based on the Ising model is attracting attention as a method for obtaining high-quality clustering results. Furthermore, combinatorial clustering with the kernel method can handle irregular data of any type via the kernel trick, which extends the data to an arbitrary high-dimensional feature space by switching the kernel function. However, conventional kernel clustering based on the Ising model is limited to the case of exactly two clusters, because the Ising model is composed of binary decision variables. This paper proposes Ising-based combinatorial clustering using the kernel method that can handle two or more clusters. The key idea is to represent clustering results with one-hot encoding, which represents the cluster to which a single data point belongs using as many bits as there are clusters. However, the one-hot constraint introduced by this encoding degrades clustering quality; to address this problem, combinatorial clustering based on an externally defined one-hot constraint is used. Since the proposed kernel-based method works with more than two clusters, it is compared against conventional Euclidean distance-based combinatorial clustering, which also divides the data into two or more clusters. Experiments show that for irregular data, the clustering quality of the proposed method is significantly better than that of the conventional method.

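The one-hot encoding at the heart of the method above is simple to state concretely: each of the n data points gets k bits, exactly one of which is set, and the one-hot constraint is typically imposed as a quadratic penalty. A minimal sketch (function names are illustrative, not from the paper):

```python
def one_hot_encode(assignment, k):
    """Represent each point's cluster id (0..k-1) as k bits, exactly one set."""
    return [[1 if cluster == j else 0 for j in range(k)] for cluster in assignment]

def one_hot_penalty(bits):
    """Quadratic penalty sum_i (sum_j x_ij - 1)^2; zero iff every row is one-hot."""
    return sum((sum(row) - 1) ** 2 for row in bits)

# Four points assigned to three clusters.
enc = one_hot_encode([0, 2, 1, 2], k=3)
```

A valid encoding has zero penalty; any row with no bit set or more than one bit set contributes a positive term, which is how the constraint is folded into the Ising/QUBO objective.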
Title: Task-level Redundancy vs Instruction-level Redundancy against Single Event Upsets in Real-time DAG scheduling
Authors: L. Miedema, Benjamin Rouxel, C. Grelck
DOI: https://doi.org/10.1109/MCSoC51149.2021.00062
Published in: 2021 IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), December 2021
Abstract: Real-time cyber-physical systems have become ubiquitous. As such systems are often mission-critical, designers must include mitigations against various types of hardware faults, including Single Event Upsets (SEUs). SEUs can be mitigated using both software and hardware approaches; with software approaches, the application designer must select the appropriate redundancy level for the application. We propose the use of task-level redundancy for SEU detection, aimed at applications structured as a Directed Acyclic Graph (DAG) of tasks. This work compares existing instruction-level redundancy against task-level redundancy using the UPPAAL model-checking tool in SMC mode. Our comparison shows that task-level redundancy implemented with Dual Modular Spatial Redundancy and Checkpoint-Restart offers significantly lower deadline-miss ratios when slack is limited. While task-level redundancy usually performs better or equally well, we also show that rare cases exist where long-running DAG applications benefit more from instruction-level redundancy.

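The task-level scheme evaluated above, dual modular redundancy combined with checkpoint/restart, can be sketched as "run the task twice from the same checkpoint, accept if the outputs agree, otherwise roll back and retry". This sketch is only illustrative of the mechanism; the `task` interface and retry policy are assumptions, not the paper's UPPAAL models.

```python
def run_with_dmr(task, state, max_retries=3):
    """Dual modular redundancy with checkpoint/restart: execute the task
    twice from the same checkpointed state; a mismatch signals a possible
    SEU, so roll back to the checkpoint and retry."""
    for _ in range(max_retries):
        checkpoint = dict(state)       # save state before execution
        a = task(dict(checkpoint))     # first execution
        b = task(dict(checkpoint))     # redundant second execution
        if a == b:                     # outputs agree: accept the result
            return a
        state = checkpoint             # mismatch: restart from checkpoint
    raise RuntimeError("persistent divergence; fault may not be transient")

result = run_with_dmr(lambda s: s["x"] * 2, {"x": 3})
```

The deadline-miss trade-off in the paper follows directly from this structure: each detection costs a full duplicate execution, and each recovery costs a re-run, so the scheme needs schedule slack to absorb retries.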
Title: SIMD Parallel Execution on GPU from High-Level Dataflow Synthesis
Authors: Aurelien Bloch, S. Brunet, M. Mattavelli
DOI: https://doi.org/10.1109/MCSoC51149.2021.00017
Published in: 2021 IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), December 2021
Abstract: Writing and optimizing application software for heterogeneous platforms that include GPUs is a difficult task, requiring considerable designer effort and resources to obtain good performance. Dataflow programming has proven to be a good approach to this task thanks to its portability and the possibility of arbitrarily partitioning a dataflow network across the units of a heterogeneous platform. However, this design methodology is not by itself sufficient for good performance. The paper describes methodological steps for improving the performance of dataflow programs written in RVC-CAL and synthesized for execution on heterogeneous CPU/GPU co-processing platforms. These steps include optimizing the performance of communication between processing elements, a strategy for efficiently scheduling independent GPU partitions, and the introduction of dynamic programming to leverage the SIMD nature of GPU platforms. The approach is validated qualitatively and quantitatively on example dataflow application programs executed under several partitioning configurations.

Title: Performance Comparision of TPU, GPU, CPU on Google Colaboratory Over Distributed Deep Learning
Authors: H. Kimm, Incheon Paik, Hanke Kimm
DOI: https://doi.org/10.1109/MCSoC51149.2021.00053
Published in: 2021 IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), December 2021
Abstract: Deep learning models need massive amounts of compute power and tend to perform better on special-purpose accelerators designed to speed up compute-intensive applications. Accelerators such as Tensor Processing Units (TPUs) and Graphics Processing Units (GPUs) are widely used deep learning hardware platforms that can often outperform CPUs thanks to their massive parallel execution resources and high memory bandwidth. Google Colaboratory (Colab) is a cloud service based on Jupyter Notebook that lets users write and execute (mostly Python) code in a browser and grants free access to TPUs and GPUs without extra configuration, making these hardware platforms widely available in the cloud. In this paper, we present a thorough comparison of the hardware platforms on Google Colab, benchmarked with distributed bidirectional long short-term memory (dBLSTM) models while varying the number of layers, the number of units per layer, and the numbers of input and output units of the datasets. Human Activity Recognition (HAR) data from the UCI machine-learning repository are applied to the proposed distributed bidirectional LSTM model to assess the performance, strengths, and bottlenecks of the TPU, GPU, and CPU platforms with respect to hyperparameters, execution time, and the evaluation metrics accuracy, precision, recall, and F1 score.

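The four evaluation metrics used in the comparison above are standard functions of the confusion matrix. As a reference point, a minimal sketch for the binary case (the multi-class HAR setting would average these per class):

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from binary (0/1) labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = binary_metrics([1, 1, 0, 0], [1, 0, 0, 0])
```

Note that execution time is measured separately from these quality metrics: the model quality should be (near-)identical across TPU, GPU, and CPU, so the hardware comparison is primarily about training and inference speed.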
Title: Distributed Neural Network with TensorFlow on Human Activity Recognition Over Multicore TPU
Authors: H. Kimm, Incheon Paik
DOI: https://doi.org/10.1109/MCSoC51149.2021.00026
Published in: 2021 IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), December 2021
Abstract: There has been increasing interest in, and success with, applying deep neural networks to big data platforms and workflows, known as distributed deep learning. In this paper, we present a distributed long short-term memory (dLSTM) neural network model using TensorFlow over a multicore Tensor Processing Unit (TPU) on Google Cloud. LSTM is a variant of the recurrent neural network (RNN) that is more suitable for processing temporal sequences. This model extracts human activity features automatically and classifies them with few model parameters. In the proposed model, raw data collected by mobile sensors is fed into distributed multi-layer LSTM layers. Human activity recognition data from the UCI machine-learning repository are applied to the proposed dLSTM model to compare the efficiency of TensorFlow on CPU and TPU in terms of execution time and the evaluation metrics accuracy, precision, recall, and F1 score, using a Google Colab notebook.

Title: Multiport Register File Design for High-Performance Embedded Cores
Authors: J. Kadomoto, H. Irie, S. Sakai
DOI: https://doi.org/10.1109/MCSoC51149.2021.00048
Published in: 2021 IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), December 2021
Abstract: As the application areas of embedded SoCs continue to expand, there is a need to adopt general-purpose cores with higher performance. One method of achieving higher performance in general-purpose processors is superscalar execution, which exploits instruction-level parallelism by simultaneously executing multiple instructions. As the number of parallel execution lanes increases, more ports are required in the internal memory structures, including the register file, to enable reading or writing multiple data in parallel. As the number of ports increases, the power consumption and area of the register file grow and the design becomes exceedingly complex. Therefore, an elaborate design space exploration of such register files is crucial for developing higher-performance cores. In this paper, we discuss the design of multiport register files, especially for 32-bit out-of-order superscalar processors, and investigate the design space through SPICE simulations.

Title: The Role of Linear Discriminant Analysis for Accurate Prediction of Breast Cancer
Authors: Egwom Onyinyechi Jessica, Mohamed Hamada, S. Yusuf, Mohammed Hassan
DOI: https://doi.org/10.1109/MCSoC51149.2021.00057
Published in: 2021 IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), December 2021
Abstract: With recent advances in clinical technologies, a huge amount of data has been accumulated for breast cancer diagnosis. Extracting information from these data to support clinical diagnosis is a tedious and time-consuming task, and the use of machine learning and data mining techniques has significantly changed the whole diagnosis process. In this research, a model for breast cancer prediction is developed using features extracted from individual medical screenings and tests. To overcome overfitting and obtain good prediction accuracy, Linear Discriminant Analysis (LDA) is applied to extract useful features and reduce the number of features in the experimental dataset. The model creates new features from the existing ones and then discards the originals; the new features summarize the information initially contained in the original feature set. LDA was chosen for its usefulness in detecting whether a set of features is worthwhile for predicting breast cancer. In addition to LDA, the model uses a Support Vector Machine (SVM) for the final prediction, hence the name LDA-SVM. Under 5-fold cross-validation, the proposed model yields an accuracy of 99.2%, precision of 98.0%, and recall of 99.0% on the Wisconsin Diagnostic Breast Cancer (WDBC) dataset from the University of California, Irvine machine learning repository. SVM thus shows high efficiency in handling classification problems when combined with feature extraction techniques.

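The LDA-then-SVM pipeline described above maps naturally onto a scikit-learn pipeline with 5-fold cross-validation. This is a minimal sketch, assuming scikit-learn is available; a synthetic 30-feature dataset stands in for WDBC, so the scores are not the paper's results.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic stand-in for the 30-feature binary WDBC data.
X, y = make_classification(n_samples=300, n_features=30, n_informative=10,
                           random_state=0)

model = make_pipeline(
    # For a binary task, LDA can project to at most 1 discriminant component,
    # which is the dimensionality reduction the paper relies on.
    LinearDiscriminantAnalysis(n_components=1),
    SVC(kernel="linear"),
)
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")  # 5-fold CV
```

Fitting LDA inside the pipeline (rather than on the full dataset beforehand) matters: it keeps each cross-validation fold's test data out of the feature-extraction step, avoiding optimistic accuracy estimates.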