{"title":"Scalable FPGA Accelerator for Deep Convolutional Neural Networks with Stochastic Streaming","authors":"Mohammed Alawad;Mingjie Lin","doi":"10.1109/TMSCS.2018.2886266","DOIUrl":"https://doi.org/10.1109/TMSCS.2018.2886266","url":null,"abstract":"The FPGA-based heterogeneous computing platform, owing to its extreme logic reconfigurability, has emerged as a strong contender for the computing fabric of modern AI. As a result, various FPGA-based accelerators for deep CNNs—the key driver of modern AI—have been proposed, thanks to their advantages of high performance, reconfigurability, and fast development turnaround. In general, the consensus among researchers is that, although FPGA-based accelerators can achieve much higher energy efficiency, their raw computing performance lags behind that of GPUs with similar logic density. In this paper, we develop an alternative methodology to efficiently implement CNNs with FPGAs that outperform GPUs in terms of both power consumption and performance. Our key idea is a scalable hardware architecture and circuit design for large-scale CNNs that leverage a stochastic computing principle. Specifically, there are three major performance advantages. First, all key components of our deep learning CNN are designed and implemented to compute stochastically, thus achieving excellent computing performance and energy efficiency. Second, because our proposed CNN architecture enables stream-mode computing, all of its stages can process even partial results from preceding stages, thereby avoiding unnecessary latency due to data dependency. Finally, our FPGA-based deep CNN also provides superior hardware scalability compared with conventional FPGA implementations by reducing the bandwidth requirement between layers. The results show that our proposed CNN architecture significantly outperforms all previous FPGA-based deep CNN implementation approaches. 
It achieves 1.58x more GOPS, 6.42x more GOPS/Slice, and 10.92x more GOPS/W compared with a state-of-the-art CNN architecture. The top-5 accuracy of the stochastic VGG-16 CNN is 86.77 percent at a frame rate of 18.91 fps.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 4","pages":"888-899"},"PeriodicalIF":0.0,"publicationDate":"2018-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2018.2886266","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68025494","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Big Data Layered Architecture and Functional Units for the Multimedia Internet of Things","authors":"Kah Phooi Seng;Li-Minn Ang","doi":"10.1109/TMSCS.2018.2886843","DOIUrl":"https://doi.org/10.1109/TMSCS.2018.2886843","url":null,"abstract":"The escalating growth of multimedia content in Internet of Things (IoT) applications leads to huge volumes of unstructured data being generated. Unstructured Big Data has no particular format or structure and can take any form, such as text, audio, images, and video. Furthermore, current IoT systems cannot fully realize the notion of ubiquitous connectivity of everything if they are not capable of including ‘multimedia things’. In this paper, we address these two issues by proposing a new architecture for the Multimedia Internet of Things (MIoT) with a Big multimodal computation layer. We first introduce the MIoT as a novel paradigm in which smart heterogeneous multimedia things can interact and cooperate with one another, and with other things connected to the Internet, to facilitate multimedia-based services and applications that are globally available to users. The MIoT architecture consists of six layers. The computation layer is specially designed for Big multimodal analytics. This layer has four important functional units: the Data Centralized Unit, Multimodal Data Aggregation Unit, Multimodal Data Divide & Conquer Computation Unit, and Fusion & Decision Making Unit. A novel and highly scalable technique called Divide & Conquer Principal Component Analysis (DC-PCA), for feature extraction in the divide-and-conquer mechanism, is proposed to be used together with Divide & Conquer Linear Discriminant Analysis (DC-LDA) for multimodal Big Data analytics. Experiments confirm the good performance of these techniques in the functional units of the Divide & Conquer computational mechanism. 
The final section of the paper presents an application on a camera-sensing IoT platform and real-world data analytics on multicore architecture implementations.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 4","pages":"500-512"},"PeriodicalIF":0.0,"publicationDate":"2018-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2018.2886843","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67861366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Memory-Based Combination PUFs for Device Authentication in Embedded Systems","authors":"Soubhagya Sutar;Arnab Raha;Vijay Raghunathan","doi":"10.1109/TMSCS.2018.2885758","DOIUrl":"https://doi.org/10.1109/TMSCS.2018.2885758","url":null,"abstract":"Embedded systems play a crucial role in fueling the growth of the Internet-of-Things (IoT) in application domains such as health care, home automation, and transportation. However, their increasingly network-connected nature, coupled with their ability to access potentially sensitive/confidential information, has given rise to a plethora of security and privacy concerns. An additional challenge is the growing number of counterfeit components in these devices, with serious reliability and financial repercussions. Physically Unclonable Functions (PUFs) are a promising security primitive to help address these concerns. Memory-based PUFs are particularly attractive as they can be realized with minimal or no additional hardware beyond what is already present in all embedded systems, i.e., memory. However, current memory-based PUFs utilize only a single memory technology to construct the PUF, which has many disadvantages, including vulnerability to certain security attacks. Several of these PUFs also suffer from other shortcomings, such as low entropy and a limited number of challenge-response pairs. In this paper, we propose the design of a new memory-based combination PUF that tightly integrates (two) heterogeneous memory technologies to address these challenges/shortcomings. Our design enables us to authenticate an on-chip component and an off-chip component, thereby taking a step towards multi-component authentication in a device, without incorporating any additional hardware. We have implemented a prototype of the proposed combination PUF using a Terasic TR4-230 FPGA development board and several off-the-shelf SRAMs and DRAMs. 
Measured experimental results demonstrate substantial improvements over current memory-based PUFs, including the ability to resist various security attacks. We also propose a lightweight authentication scheme that ensures robust operation of the PUF across environmental and temporal variations. Extensive authentication tests performed on several PUF prototypes achieved a true-positive rate of greater than 97.5 percent across these variations. The absence of any false positives, even under an invasive attack, further highlights the effectiveness of the overall design.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 4","pages":"793-810"},"PeriodicalIF":0.0,"publicationDate":"2018-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2018.2885758","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68023995","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RAPID: Memory-Aware NoC for Latency Optimized GPGPU Architectures","authors":"Venkata Yaswanth Raparti;Sudeep Pasricha","doi":"10.1109/TMSCS.2018.2871094","DOIUrl":"https://doi.org/10.1109/TMSCS.2018.2871094","url":null,"abstract":"The growing parallelism in most of today's applications has led to an increased demand for parallel computing in processors. General Purpose Graphics Processing Units (GPGPUs) have been used extensively to support highly parallel applications in recent years. Such GPGPUs generate huge volumes of network traffic between memory controllers (MCs) and shader cores. As a result, the network-on-chip (NoC) fabric can become a performance bottleneck, especially for memory-intensive applications running on GPGPUs. Traditional mesh-based NoC topologies are not suitable for GPGPUs, as they incur high network latency, which leads to congestion at the MCs and an increase in application execution time. In this article, we propose a novel memory-aware NoC that has two (request and reply) planes tailored to exploit the traffic characteristics of GPGPUs. The request plane consists of low-power, low-latency routers optimized for the many-to-few traffic pattern. In the reply plane, flits are sent on fast overlay circuits to reach their destinations in just three cycles (at 1 GHz). In addition, since traditional memory controllers are unaware of application memory intensity, which leads to longer waiting times for applications on the shader cores, we propose an enhanced memory controller that prioritizes burst packets to improve application performance on GPGPUs. 
Experimental results indicate that our framework yields an improvement of <inline-formula><tex-math>$4-10\\times$</tex-math></inline-formula> in NoC latency, up to 63 percent in execution time, and up to 4× in total energy consumption compared to the state-of-the-art.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 4","pages":"874-887"},"PeriodicalIF":0.0,"publicationDate":"2018-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2018.2871094","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68024186","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Deep Structure of Person Re-Identification Using Multi-Level Gaussian Models","authors":"Dinesh Kumar Vishwakarma;Sakshi Upadhyay","doi":"10.1109/TMSCS.2018.2870592","DOIUrl":"https://doi.org/10.1109/TMSCS.2018.2870592","url":null,"abstract":"Person re-identification is widely used in forensics and in security and surveillance systems these days. However, it remains a challenging task in real-life scenarios. Hence, in this work, a new feature descriptor model is proposed using a multilayer framework of the Gaussian distribution model on pixel features, which include color moments, color space values, gradient information, and Schmid filter responses. An image of a person usually consists of distinct body regions, typically with distinguishable clothing along with local colors and texture patterns. Thus, the image is evaluated locally by dividing it into overlapping regions. Each region is further fragmented into a set of local Gaussians on small patches. A global Gaussian encodes these local Gaussians for each region, creating a multi-level structure. Hence, the global picture of a person is described by the local-level information present in it, which is often ignored. We also analyze the efficiency of some existing metric learning methods on this descriptor. The performance of the descriptor is evaluated on four publicly available challenging datasets, and the highest accuracies achieved on these datasets are compared with similar state-of-the-art works. 
The results clearly demonstrate the superior performance of the proposed descriptor.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 4","pages":"513-521"},"PeriodicalIF":0.0,"publicationDate":"2018-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2018.2870592","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68023997","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HPC Process and Optimal Network Device Affinitization","authors":"Ravindra Babu Ganapathi;Aravind Gopalakrishnan;Russell W. McGuire","doi":"10.1109/TMSCS.2018.2871444","DOIUrl":"https://doi.org/10.1109/TMSCS.2018.2871444","url":null,"abstract":"High Performance Computing (HPC) applications have demanding needs for hardware resources such as processors, memory, and storage. Applications in the areas of Artificial Intelligence and Machine Learning are taking center stage in HPC, driving demand for more compute resources per node, which in turn pushes up the bandwidth requirements between compute nodes. New system design paradigms exist in which deploying more than one high-performance IO device per node provides benefits. The number of IO devices connected to an HPC node can be increased with PCIe switches, and hence some HPC nodes are designed to include PCIe switches that provide a large number of PCIe slots. With multiple IO devices per node, application programmers are forced to consider HPC process affinity not only to compute resources but also to IO devices. Mapping processes to processor cores and the closest IO device(s) increases complexity due to the three-way mapping and varying HPC node architectures. While operating systems perform reasonable mapping of processes to processor cores, they lack the application developer's knowledge of process workflow and optimal IO resource allocation when more than one IO device is attached to the compute node. This paper is an extended version of our work published in <xref>[1]</xref>. Our previous work provided a solution for IO device affinity choices by abstracting the device selection algorithm from HPC applications. 
In this paper, we extend the affinity solution to OpenFabrics Interfaces (OFI), a generic HPC API designed as part of the OpenFabrics Alliance that supports a wide range of HPC programming models and applications across various HPC fabric vendors. MPI continues to be the dominant programming model for HPC, and hence we provide an evaluation with MPI-based micro-benchmarks. Our solution is then extended to OpenFabrics Interfaces, which supports other HPC programming models such as SHMEM, GASNet, and UPC. We propose a solution that addresses NUMA issues at the lower levels of the software stack that form the runtime for MPI and other programming models, independent of HPC applications. Our experiments are conducted on a two-node system where each node consists of a two-socket Intel Xeon server attached with up to four Intel Omni-Path fabric devices connected over PCIe. The performance benefits that applications see from affinitizing processes with the best possible network device are evident from the results, where we observe up to 40 percent improvement in uni-directional bandwidth, 48 percent in bi-directional bandwidth, 32 percent in latency measurements, and up to 40 percent in message rate with the OSU benchmark suite. We also extend our evaluation to include OFI operations and an MPI benchmark used for genome assembly. 
With OFI Remote Memory Access (RMA) op","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 4","pages":"749-757"},"PeriodicalIF":0.0,"publicationDate":"2018-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2018.2871444","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68024163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic Lifetime Reliability Management for Chip Multiprocessors","authors":"Milad Ghorbani Moghaddam;Cristinel Ababei","doi":"10.1109/TMSCS.2018.2870187","DOIUrl":"https://doi.org/10.1109/TMSCS.2018.2870187","url":null,"abstract":"We introduce an algorithm for dynamic lifetime reliability optimization of chip multiprocessors (CMPs). The proposed dynamic reliability management (DRM) algorithm combines thread migration and dynamic voltage and frequency scaling (DVFS) as the two primary techniques to change the CMP operation. The goal is to increase the lifetime reliability of the overall system to the desired target with minimal performance degradation. We test the proposed algorithm with a variety of benchmarks on 16 and 64 core network-on-chip (NoC) based CMP architectures. Full-system based simulations using a customized GEM5 simulator demonstrate that lifetime reliability can be improved by 100 percent for an average performance penalty of 7.7 and 8.7 percent for the two CMP architectures.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 4","pages":"952-958"},"PeriodicalIF":0.0,"publicationDate":"2018-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2018.2870187","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68024192","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic Energy Optimization in Chip Multiprocessors Using Deep Neural Networks","authors":"Milad Ghorbani Moghaddam;Wenkai Guan;Cristinel Ababei","doi":"10.1109/TMSCS.2018.2870438","DOIUrl":"https://doi.org/10.1109/TMSCS.2018.2870438","url":null,"abstract":"We investigate the use of deep neural network (DNN) models for energy optimization under performance constraints in chip multiprocessor systems. We introduce a dynamic energy management algorithm implemented in three phases. In the first phase, training data is collected by running several selected instrumented benchmarks. A training data point represents a pair of values: the cores’ workload characteristics and the corresponding optimal voltage/frequency (V/F) pair. This phase employs Kalman filtering for workload prediction and an efficient heuristic algorithm based on dynamic voltage and frequency scaling. The second phase represents the training process of the DNN model. In the last phase, the DNN model is used to directly identify V/F pairs that can achieve lower energy consumption without performance degradation beyond the acceptable threshold set by the user. Simulation results on 16 and 64 core network-on-chip based architectures demonstrate that the proposed approach can achieve up to 55 percent energy reduction under 10 percent performance degradation constraints. 
In addition, the proposed DNN approach is compared against existing approaches based on reinforcement learning and Kalman filtering, and is found to provide average improvements in energy-delay-product (EDP) of 6.3 and 6 percent for the 16 core architecture and of 7.4 and 5.5 percent for the 64 core architecture.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 4","pages":"649-661"},"PeriodicalIF":0.0,"publicationDate":"2018-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2018.2870438","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67861363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Guest Editorial: Advances in Parallel Graph Processing: Algorithms, Architectures, and Application Frameworks","authors":"Ananth Kalyanaraman;Mahantesh Halappanavar","doi":"10.1109/TMSCS.2018.2858297","DOIUrl":"https://doi.org/10.1109/TMSCS.2018.2858297","url":null,"abstract":"The papers in this special section explore recent advancements in parallel graph processing. In the sphere of modern data science and data-driven applications, graph algorithms have achieved a pivotal place in advancing the state of scientific discovery and knowledge. Nearly three centuries of ideas have made graph theory and its applications a mature area in computational sciences. Yet, today we find ourselves at a crossroads between theory and application. Spurred by the digital revolution, data from a diverse range of high-throughput channels and devices, from across internet-scale applications, are starting to mark a new era in data-driven computing and discovery. Building robust graph models and implementing scalable graph application frameworks in the context of this new era are proving to be significant challenges. Concomitant to the digital revolution, we have also experienced an explosion in computing architectures, with a broad range of multicores, manycores, heterogeneous platforms, and hardware accelerators (CPUs, GPUs) being actively developed and deployed within servers and multinode clusters. Recent advances have started to show that in more than one way, these two fields—graph theory and architectures—are capable of benefiting, and in fact spurring, new research directions in one another. 
This special section is aimed at introducing some of the new avenues of cutting-edge research happening at the intersection of graph algorithm design and their implementation on advanced parallel architectures.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 3","pages":"188-189"},"PeriodicalIF":0.0,"publicationDate":"2018-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2018.2858297","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67861116","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DeltaFrame-BP: An Algorithm Using Frame Difference for Deep Convolutional Neural Networks Training and Inference on Video Data","authors":"Bing Han;Kaushik Roy","doi":"10.1109/TMSCS.2018.2865303","DOIUrl":"https://doi.org/10.1109/TMSCS.2018.2865303","url":null,"abstract":"Inspired by the success of deep convolutional neural networks (CNNs) with back-propagation (BP) training on large-scale image recognition tasks, recent research efforts have concentrated on extending deep CNNs toward more challenging automated video analysis, such as video classification, object tracking, action recognition, and optical flow detection. Video comprises a sequence of images (frames) captured over time, in which image data is a function of space and time. Extracting three-dimensional spatial-temporal features from multiple frames becomes a key ingredient for capturing and incorporating appearance and dynamic representations using deep CNNs. Hence, training deep CNNs on video involves significant computational resources and energy consumption due to the extended number of frames across the timeline of the video. We propose DeltaFrame-BP, a deep learning algorithm that significantly reduces computational cost and energy consumption without accuracy degradation by streaming frame differences for deep CNN training and inference. The inherent similarity between video frames, due to the high fps (frames per second) of video recording, helps achieve high-sparsity and low-dynamic-range data streaming using frame differences in comparison with raw video frames. 
According to our simulations, nearly 25 percent energy reduction is achieved in training using the proposed accuracy-lossless DeltaFrame-BP algorithm in comparison with the standard back-propagation algorithm.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 4","pages":"624-634"},"PeriodicalIF":0.0,"publicationDate":"2018-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2018.2865303","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67861362","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}