{"title":"KV-FTL: A novel key-value based FTL scheme for large scale SSDs","authors":"Juan Li, Zhengguo Chen, Zhiguang Chen, Nong Xiao, Fang Liu, Wei Chen","doi":"10.1109/HPCC-SmartCity-DSS.2017.14","DOIUrl":"https://doi.org/10.1109/HPCC-SmartCity-DSS.2017.14","url":null,"abstract":"Both traditional coarse-grained and fine-grained Flash Translation Layer schemes are unsuitable for ultra-large SSDs. They produce too many mapping entries to be kept entirely in embedded DRAM and can suffer severely from low spatial and temporal locality. In this paper, we propose a novel KV-FTL for ultra-large SSDs, which maps most logical addresses to physical addresses via a simple hash function, while handling hash collisions and out-of-place data updates in the traditional manner, i.e., with a mapping table. Our KV-FTL accelerates address translation by avoiding loading the mapping table from flash memory into DRAM, thus improving performance; it also reduces the write traffic incurred by the mapping table, thus extending the lifespan of SSDs. 
Experimental results show that our KV-FTL extends SSD lifespan by up to 18.7%, with an average of 13.6%; with optimization, improves read performance by 18.4% to 50.7%, with an average of 39%; and, in the case of extremely intensive requests, improves access performance by an average of 47%.","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133300231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DoSGuard: Protecting pipelined MPSoCs against hardware Trojan based DoS attacks","authors":"Amin Malekpour, R. Ragel, A. Ignjatović, S. Parameswaran","doi":"10.1109/ASAP.2017.7995258","DOIUrl":"https://doi.org/10.1109/ASAP.2017.7995258","url":null,"abstract":"Billions of transistors on a chip and the power wall have led embedded systems to be designed with Multiprocessor System-on-Chip (MPSoC) architectures. One use of MPSoCs is in Pipelined MPSoCs (PMPSoCs). As many reliable and safety-critical systems are deployed with MPSoCs, denying their service would have adverse effects. One such possibility is the insertion of a hardware Trojan that performs Denial of Service (DoS) attacks. DoSGuard presents a novel PMPSoC architecture that continues its execution in the presence of DoS Trojans in Third Party Intellectual Property (3PIP) cores. DoSGuard deploys two methods: one can detect the presence of Trojans and recover, and the other can also identify the 3PIPs under attack using buffer delays. While the state of the art incurs 3× area and power overheads, DoSGuard consumes 1.5M+3 area and leakage power (M is the number of cores in the base system) and a small dynamic power overhead (the power consumption of the monitoring system). On a cycle-accurate commercial multiprocessor simulator, DoSGuard takes 531 clock cycles to detect a DoS attack. 
With DoSGuard, the throughput reduction due to a DoS attack varies with the application and the monitoring interval but is negligible (< 10^-3%) for real-world scenarios, where millions of iterations take place.","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129088744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hierarchical Dataflow Model for efficient programming of clustered manycore processors","authors":"J. Hascoet, K. Desnos, J. Nezan, B. Dinechin","doi":"10.1109/ASAP.2017.7995270","DOIUrl":"https://doi.org/10.1109/ASAP.2017.7995270","url":null,"abstract":"Programming Multiprocessor Systems-on-Chips (MPSoCs) with hundreds of heterogeneous Processing Elements (PEs), complex memory architectures, and Networks-on-Chips (NoCs) remains a challenge for embedded system designers. Dataflow Models of Computation (MoCs) are increasingly used for developing parallel applications as their high level of abstraction eases the automation of mapping, task scheduling and memory allocation onto MPSoCs. This paper introduces a technique for deploying hierarchical dataflow graphs efficiently onto MPSoCs. The proposed technique exploits different granularities of dataflow parallelism to generate both NoC-based communications and nested OpenMP loops. Deployment of an image processing application on a many-core MPSoC results in speedups of up to 58.7 compared to the sequential execution.","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125105737","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An efficient embedded multi-ported memory architecture for next-generation FPGAs","authors":"S. N. Shahrouzi, D. Perera","doi":"10.1109/ASAP.2017.7995263","DOIUrl":"https://doi.org/10.1109/ASAP.2017.7995263","url":null,"abstract":"In recent years, there has been a dramatic increase in the utilization of FPGAs to enhance the speed-performance of many real-time compute- and data-intensive applications on embedded platforms. FPGA-based designs leverage parallelism in computations to achieve high speed-performance. Parallel computations require multi-ported memories to provide any number of ports for simultaneous multiple read/write (R/W) operations. Although several multi-ported memories are proposed in the literature, these designs become complex due to the extra logic and routing used for techniques/architectures to provide an arbitrary number of R/W ports. In this research work, we introduce a novel and efficient multi-ported memory architecture utilizing simple dual-port BRAMs, to provide an arbitrary number of R/W ports. Apart from the BRAMs, our proposed multi-ported memory design only consists of the Decision Making Modules and a counter, thus simplifying the design process. The R/W operations within our architecture are also straightforward. Experiments are performed to evaluate the feasibility and efficiency of our multi-ported memory architecture. We also compare our architecture with the most recently proposed multi-ported memory designs, implemented using LVT and XOR techniques, from the existing literature. FPGA manufacturers could employ our multi-ported memory architecture to accelerate real-time compute/data-intensive applications with their next-generation FPGAs. 
Owing to its lower design complexity compared to existing designs, our simplified memory architecture would enable seamless integration into existing FPGA-based CAD tools with minimal design cost.","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122389523","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hardware support for embedded operating system security","authors":"Arman Pouraghily, T. Wolf, R. Tessier","doi":"10.1109/ASAP.2017.7995260","DOIUrl":"https://doi.org/10.1109/ASAP.2017.7995260","url":null,"abstract":"Internet-connected embedded systems have limited capabilities to defend themselves against remote hacking attacks. The potential effects of such attacks, however, can have a significant impact in the context of the Internet of Things, industrial control systems, smart health systems, etc. Embedded systems cannot effectively utilize existing software-based protection mechanisms due to limited processing capabilities and energy resources. We propose a novel hardware-based monitoring technique that can detect if the embedded operating system or any running application deviates from the originally programmed behavior due to an attack. We present an FPGA-based prototype implementation that shows the effectiveness of such a security approach.","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"159 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121686759","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast and efficient implementation of Convolutional Neural Networks on FPGA","authors":"Abhinav Podili, Chi Zhang, V. Prasanna","doi":"10.1109/ASAP.2017.7995253","DOIUrl":"https://doi.org/10.1109/ASAP.2017.7995253","url":null,"abstract":"State-of-the-art CNN models for image recognition use deep networks with small filters instead of shallow networks with large filters, because the former require fewer weights. In light of this trend, we present a fast and efficient FPGA-based convolution engine to accelerate CNN models with small filters. The convolution engine implements the Winograd minimal filtering algorithm to reduce the number of multiplications by 38% to 55% for state-of-the-art CNNs. We exploit the parallelism of the Winograd convolution engine to scale the overall performance. We show that our overall design sustains the peak throughput of the convolution engines. We propose a novel data layout to reduce the required memory bandwidth of our design by half. One noteworthy feature of our Winograd convolution engine is that it hides the computation latency of the pooling layer. As a case study, we implement the VGG16 CNN model and compare it with previous approaches. Compared with the state-of-the-art reduced-precision VGG16 implementation, our implementation achieves a 1.2× improvement in throughput while using 3× fewer multipliers and 2× less on-chip memory, without impacting the classification accuracy. 
The improvements in throughput per multiplier and throughput per unit on-chip memory are 3.7× and 2.47× respectively, compared with the state-of-the-art design.","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114994991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OpenCL-based design pattern for line rate packet processing","authors":"Jehandad Khan, P. Athanas, S. Booth, John Marshall","doi":"10.1109/ASAP.2017.7995278","DOIUrl":"https://doi.org/10.1109/ASAP.2017.7995278","url":null,"abstract":"The ever changing nature of network technology requires a flexible platform that can change as the technology evolves. In this work, a complete networking switch designed in OpenCL is presented, identifying several high-level constructs that form the building blocks of any network application targeting FPGAs. These include the notion of an on-chip global memory and kernels constantly processing data without the intervention of the host. The use of OpenCL is motivated by the ability to rapidly change designs and to be maintainable by a wider developer community. Pieces of the design that cannot be realized using current OpenCL technology are also identified and a solution to the problem is presented.","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114577043","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-performance FPGA implementation of equivariant adaptive separation via independence algorithm for Independent Component Analysis","authors":"M. Nazemi, Shahin Nazarian, Massoud Pedram","doi":"10.1109/ASAP.2017.7995255","DOIUrl":"https://doi.org/10.1109/ASAP.2017.7995255","url":null,"abstract":"Independent Component Analysis (ICA) is a dimensionality reduction technique that can boost the efficiency of machine learning models that deal with probability density functions, e.g. Bayesian neural networks. Algorithms that implement adaptive ICA converge more slowly than their nonadaptive counterparts; however, they are capable of tracking changes in the underlying distributions of input features. This intrinsically slow convergence of adaptive methods, combined with existing hardware implementations that operate at very low clock frequencies, necessitates fundamental improvements in both algorithm and hardware design. This paper presents an algorithm that allows efficient hardware implementation of ICA. Compared to previous work, our FPGA implementation of adaptive ICA improves clock frequency by at least one order of magnitude and throughput by at least two orders of magnitude. 
Our proposed algorithm is not limited to ICA and can be used in various machine learning problems that use stochastic gradient descent optimization.","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121924340","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design and implementation of adaptive signal processing systems using Markov decision processes","authors":"Lin Li, A. Sapio, Jiahao Wu, Yanzhou Liu, Kyunghun Lee, M. Wolf, S. Bhattacharyya","doi":"10.1109/ASAP.2017.7995275","DOIUrl":"https://doi.org/10.1109/ASAP.2017.7995275","url":null,"abstract":"In this paper, we propose a novel framework, called Hierarchical MDP framework for Compact System-level Modeling (HMCSM), for design and implementation of adaptive embedded signal processing systems. The HMCSM framework applies Markov decision processes (MDPs) to enable autonomous adaptation of embedded signal processing under multidimensional constraints and optimization objectives. The framework integrates automated, MDP-based generation of optimal reconfiguration policies, dataflow-based application modeling, and implementation of embedded control software that carries out the generated reconfiguration policies. HMCSM systematically decomposes a complex, monolithic MDP into a set of separate MDPs that are connected hierarchically, and that operate more efficiently through such a modularized structure. We demonstrate the effectiveness of our new MDP-based system design framework through experiments with an adaptive wireless communications receiver.","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"100 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125576791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"reMinMin: A novel static energy-centric list scheduling approach based on real measurements","authors":"Achim Lösch, M. Platzner","doi":"10.1109/ASAP.2017.7995272","DOIUrl":"https://doi.org/10.1109/ASAP.2017.7995272","url":null,"abstract":"Heterogeneous compute nodes in the form of CPUs with attached GPU and FPGA accelerators have gained strong interest in recent years. Applications differ in their execution characteristics and can therefore benefit from such heterogeneous resources in terms of performance or energy consumption. While performance optimization was long the only goal, research now increasingly focuses on techniques to minimize energy consumption due to rising electricity costs. This paper presents reMinMin, a novel static list scheduling approach for optimizing the total energy consumption for a set of tasks executed on a heterogeneous compute node. reMinMin is based on a new energy model that differentiates between static and dynamic energy components and covers the effects of accelerator tasks on the host CPU. The required energy values are obtained by measurements on the real computing system. To evaluate reMinMin, we compare it with two reference implementations on three task sets with different degrees of heterogeneity. 
In our experiments, reMinMin is consistently better than a scheduler optimizing for dynamic energy only, which requires up to 19.43% more energy, and is very close to optimal schedules.","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127587051","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}