{"title":"High-performance sparse fast Fourier transforms","authors":"J. Schumacher, Markus Püschel","doi":"10.1109/SiPS.2014.6986055","DOIUrl":"https://doi.org/10.1109/SiPS.2014.6986055","url":null,"abstract":"The sparse fast Fourier transform (SFFT) is a recent novel algorithm to compute discrete Fourier transforms on signals with a sparse frequency domain with an improved asymptotic runtime. Reference implementations exist for different variants of the algorithm and were already shown to be faster than state-of-the-art FFT implementations in cases of sufficient sparsity. However, to date the SFFT has not been carefully optimized for modern processors. In this paper, we first analyze the performance of the existing SFFT implementations and discuss possible improvements. Then we present an optimized implementation. We achieve a speedup of 2-5 compared to the existing code and an efficiency that is competitive with high-performance FFT libraries.","PeriodicalId":167156,"journal":{"name":"2014 IEEE Workshop on Signal Processing Systems (SiPS)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116524067","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Energy-efficient vision on the PULP platform for ultra-low power parallel computing","authors":"Francesco Conti, D. Rossi, A. Pullini, Igor Loi, L. Benini","doi":"10.1109/SiPS.2014.6986099","DOIUrl":"https://doi.org/10.1109/SiPS.2014.6986099","url":null,"abstract":"Many-core architectures structured as fabrics of tightly-coupled clusters have shown promising results on embedded computer vision benchmarks, providing state-of-the-art performance with a reduced power budget. We propose PULP (Parallel processing Ultra-Low Power platform), an architecture built on clusters of tightly-coupled OpenRISC ISA cores, with advanced techniques for fast performance and energy scalability that exploit the capabilities of the STMicroelectronics UTB FD-SOI 28nm technology. As a use case for PULP, we show that a computationally demanding vision kernel based on Convolutional Neural Networks can be quickly and efficiently switched from a low power, low frame-rate operating point to a high frame-rate one when a detection is performed. Our results show that PULP performance can be scaled over a 1x-354x range, with a peak performance/power efficiency of 211 GOPS/W.","PeriodicalId":167156,"journal":{"name":"2014 IEEE Workshop on Signal Processing Systems (SiPS)","volume":"114 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128484868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Acceleration of clustering-based superpixel algorithms with low memory costs","authors":"Tse-Wei Chen, Noriyasu Hashiguchi, M. Ariizumi, Kinya Osa, Daisuke Nakashima, Yasuo Fukuda, Shiori Wakino, Shinji Shiraga, Masami Kato","doi":"10.1109/SiPS.2014.6986095","DOIUrl":"https://doi.org/10.1109/SiPS.2014.6986095","url":null,"abstract":"As a pre-processing step of image segmentation, superpixel algorithms are used to produce small, uniform and compact regions, which can be used for region-based image coding, region-based image processing, and object recognition. In order to meet the requirements of real-time applications for embedded computing, it is necessary to reduce the computational costs of superpixel algorithms and increase the processing speed. In this paper, a series of acceleration schemes for superpixel algorithms is proposed. The features and contributions of this work are stated as follows. Firstly, the spatial distances and the color distances are calculated individually, so that the redundant distance computations can be saved. Secondly, by searching the nearest cluster centroids with centroid priority, the nearest clusters can be found at an early stage. Thirdly, the early-termination mechanism can be applied to the search process to speed up the algorithm without decreasing the quality of image segmentation. Fourthly, the storage for label images and distance images is not required since the operations of nearest centroids are processed in the inner loop of the algorithm. The experiments show that the proposed method achieves the same level of performance as the related work with only 75% of distance computations and 33% of memory costs.","PeriodicalId":167156,"journal":{"name":"2014 IEEE Workshop on Signal Processing Systems (SiPS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128595580","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Composite data fusion algorithm for miniature vehicles building navigation base in formation flying","authors":"Runle Du, Jiaqi Liu, Zhifeng Li, Zhenhong Niu, Zhiye Jiang, Yadong Yang","doi":"10.1109/SiPS.2014.6986068","DOIUrl":"https://doi.org/10.1109/SiPS.2014.6986068","url":null,"abstract":"When multiple miniature vehicles with individual position and inter-vehicle distance measurement ability collaborate in a formation, a navigation base can be established by data fusion in a decentralized and standalone scheme. A Composite Data Fusion (CDF) algorithm which combines Least Square Error and Kalman Filtering is proposed in this paper to build a navigation base with optimized computing stress. In CDF, Enhanced LSE is incorporated as the preprocessing stage to build a coarse estimation and handle temporary or permanent group number failure. A KF stage is then built to further alleviate noise in the pre-processed estimations. In CDF, the dynamic model can be much simpler than that of KF, so the computation load is reduced while the result still has the advantage of high precision. Simulation results show that, when the fault rate of measurement in each vehicle reaches five per thousand, the result is still acceptable. The computation time of the proposed method is less than three percent of that of KF, while its precision is almost the same as that of KF.","PeriodicalId":167156,"journal":{"name":"2014 IEEE Workshop on Signal Processing Systems (SiPS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115832823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating integer-based fully homomorphic encryption using Comba multiplication","authors":"C. Moore, Máire O’Neill, Neil Hanley, E. O'Sullivan","doi":"10.1109/SiPS.2014.6986063","DOIUrl":"https://doi.org/10.1109/SiPS.2014.6986063","url":null,"abstract":"Fully Homomorphic Encryption (FHE) is a recently developed cryptographic technique which allows computations on encrypted data. There are many interesting applications for this encryption method, especially within cloud computing. However, the computational complexity is such that it is not yet practical for real-time applications. This work proposes optimised hardware architectures of the encryption step of an integer-based FHE scheme with the aim of improving its practicality. A low-area design and a high-speed parallel design are proposed and implemented on a Xilinx Virtex-7 FPGA, targeting the available DSP slices, which offer high-speed multiplication and accumulation. Both use the Comba multiplication scheduling method to manage the large multiplications required with unevenly sized multiplicands and to minimise the number of read and write operations to RAM. Results show that speed-up factors of 3.6 and 10.4 can be achieved for the encryption step with medium-sized security parameters for the low-area and parallel designs respectively, compared to the benchmark software implementation on an Intel Core2 Duo E8400 platform running at 3 GHz.","PeriodicalId":167156,"journal":{"name":"2014 IEEE Workshop on Signal Processing Systems (SiPS)","volume":"172 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114953432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On modified EMD: Selective extrema analysis","authors":"Asma Qureshi, Maite Brandt-Pearce","doi":"10.1109/SiPS.2014.6986070","DOIUrl":"https://doi.org/10.1109/SiPS.2014.6986070","url":null,"abstract":"The Empirical Mode Decomposition (EMD) algorithm was introduced as the first step of the Hilbert-Huang Transform, proposed by Huang et al. (1998). EMD decomposes a signal into so-called Intrinsic Mode Functions (IMFs) in a systematic way. Since then, various versions of EMD have been developed, addressing weaknesses of the original EMD procedure and aiming to optimize the original algorithm in a number of ways. This paper proposes to use selective extrema analysis while generating IMFs with two goals. One is to reduce/control the number of IMFs a signal is decomposed into with a small decomposition error, and second is to make EMD insensitive to small variations in the analyzed signal. The proposed algorithm is applied to a gait signal and shown to consistently yield two IMFs, even in the presence of small disturbances.","PeriodicalId":167156,"journal":{"name":"2014 IEEE Workshop on Signal Processing Systems (SiPS)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132301727","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A search-less DEC BCH decoder for low-complexity fault-tolerant systems","authors":"Injae Yoo, I. Park","doi":"10.1109/SiPS.2014.6986060","DOIUrl":"https://doi.org/10.1109/SiPS.2014.6986060","url":null,"abstract":"This paper proposes a new decoding algorithm and its decoder architecture to completely remove the parallel Chien search in double error correcting (DEC) BCH decoders. The proposed algorithm, called search-less decoding, utilizes a quadratic formula to efficiently compute the roots of an error-location polynomial in the finite field. Since the parallel Chien search block dominates the overall complexity of a conventional DEC BCH decoder, the proposed algorithm is effective in mitigating the hardware complexity. Furthermore, a search-less (44, 32, 2) BCH decoder architecture is proposed for fault-tolerant embedded systems. Compared to the conventional decoder associated with 16-parallel Chien search, the proposed decoder decreases the hardware complexity by 51% without sacrificing the decoding throughput.","PeriodicalId":167156,"journal":{"name":"2014 IEEE Workshop on Signal Processing Systems (SiPS)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127777922","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Effect of computation offload on performance and energy consumption of mobile face recognition","authors":"Nanoka Sumi, A. Baba, V. Moshnyaga","doi":"10.1109/SiPS.2014.6986056","DOIUrl":"https://doi.org/10.1109/SiPS.2014.6986056","url":null,"abstract":"Computation offloading is a paramount technology to leverage network resources for mobile devices. This paper studies the effect of computation offloading on the efficiency of mobile face recognition. It compares offloading alternatives used in existing mobile face-recognition systems and reports on their efficiency in terms of energy consumption, processing time and recognition accuracy. The offloading method which leads to the best energy-performance tradeoff is outlined.","PeriodicalId":167156,"journal":{"name":"2014 IEEE Workshop on Signal Processing Systems (SiPS)","volume":"192 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132915040","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Programmable in-loop deblock filter processor for video decoders","authors":"Janne Janhunen, P. Jääskeläinen, J. Hannuksela, Tero Rintaluoma, Aki Kuusela","doi":"10.1109/SiPS.2014.6986071","DOIUrl":"https://doi.org/10.1109/SiPS.2014.6986071","url":null,"abstract":"The short time to market cycle and the target to reduce design and verification costs are driving forces to design programmable implementations of the video processing algorithms. We present two processor architectures: the first represents an application-specific instruction set processor (ASIP) design, whereas the second represents a domain-specific instruction-set processor (DSIP) architecture with a more general-purpose instruction set. In this work, we present results for H264 and VP8 in-loop deblocking algorithms. The processors are based on the transport triggered architecture, which provides scalable instruction-level parallelism and, thanks to its simple structure, lends itself to cost-effective designs. Both of the designs are programmed in C with minimal additional parallelism markup. The designs fulfill real-time requirements for filtering macroblocks in high-definition video. The first architecture, based on special function units, filters a high-definition stream (1920 × 1080) at 75 fps, whereas the second architecture, which provides better programmability, filters the stream at 53 fps. The processors run at a 200 MHz clock frequency and the areas vary from 146k to 373k gate equivalents depending on the processor architecture.","PeriodicalId":167156,"journal":{"name":"2014 IEEE Workshop on Signal Processing Systems (SiPS)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134403777","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Algorithm and architecture for a multiple-field context-driven search engine using fully-parallel clustered associative memories","authors":"Hooman Jarollahi, N. Onizawa, Vincent Gripon, T. Hanyu, W. Gross","doi":"10.1109/SiPS.2014.6986075","DOIUrl":"https://doi.org/10.1109/SiPS.2014.6986075","url":null,"abstract":"In this paper, a context-driven search engine is presented based on a new family of associative memories. It stores only the associations between items from multiple search fields in the form of binary links, and merges repeated field items to reduce the memory requirements. It achieves 13.6× reduction in memory bits and accesses, and 8.6× reduced number of clock cycles in search operation compared to a classical field-based search structure using content-addressable memory. Furthermore, using parallel computational nodes in the proposed search engine, it achieves five orders of magnitude reduced number of clock cycles compared to a CPU-based counterpart running a classical search algorithm in software.","PeriodicalId":167156,"journal":{"name":"2014 IEEE Workshop on Signal Processing Systems (SiPS)","volume":"195 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132619958","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}