{"title":"A VLIW-Vector co-processor design for accelerating Basic Linear Algebraic Operations in OpenCV","authors":"Venkata Ganapathi Puppala","doi":"10.1109/ISVDAT.2014.6881085","DOIUrl":"https://doi.org/10.1109/ISVDAT.2014.6881085","url":null,"abstract":"OpenCV is a widely used computer vision library written in C++. Basic Linear Algebraic Operations (BLAOPs) involving matrices are at the heart of OpenCV. Though OpenCV is ubiquitous in the computer vision field, it runs slowly when ported to embedded processors. Accelerating the BLAOPs using a co-processor helps improve the throughput. In this paper we present a floating point VLIW-Vector Co-processor Architecture with a Vector Floating Point Datapath (VFPDP) and a 4-slot VLIW processor core to accelerate BLAOPs, achieving a performance of two GFLOPS when run at a 500 MHz clock frequency. We also demonstrate a detailed mapping strategy of the One-sided Jacobi Singular Value Decomposition (OJSVD) algorithm onto the proposed architecture. The proposed architecture is designed using Verilog HDL and synthesized using Synopsys Design Compiler with 28nm TSMC target libraries. The clock period is set to 2 ns and the timing constraints are met. Using Altera's SOPC Builder, an experimental system is created with the co-processor interfaced to the NIOS II soft processor and implemented on a Cyclone IV FPGA. The OJSVD algorithm is ported onto both the standalone NIOS II processor based system and the system with the proposed co-processor. The results show that a 15X performance improvement is achieved with this co-processor.","PeriodicalId":217280,"journal":{"name":"18th International Symposium on VLSI Design and Test","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127577977","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FPGA-based implementation of M4RM for matrix multiplication over GF(2)","authors":"Vivek Kumar, Vinay B. Y. Kumar, S. Patkar","doi":"10.1109/ISVDAT.2014.6881072","DOIUrl":"https://doi.org/10.1109/ISVDAT.2014.6881072","url":null,"abstract":"The Method of Four Russians for Multiplication (M4RM) is one of the most efficient algorithms for dense matrix multiplication over the binary field, particularly targeting commodity general-purpose processors. We present an efficient tile-based hardware/software implementation of M4RM, with the hardware side handling the constituent block multiplications in a streaming fashion, and the software side doing the accumulations. With designs for 64 × 64 and 128 × 128 sized block matrix multiplications, sizes feasible for targeting FPGAs, we compare the performance with the fastest software implementations of M4RM on commodity processors. The designs were implemented in Bluespec SystemVerilog, and evaluated over the hardware/software co-emulation framework, SCE-MI. Using the 128 × 128 hardware modules, a 16,384 × 16,384 matrix multiplication running at 140 MHz could be done in ~3.0 s using the Strassen-Winograd scheme when targeting a Cyclone IV FPGA, at a sustained rate of ~8000 bit operations per cycle; in comparison, M4RM on an Intel Core2Duo running at 2.33 GHz takes ~8 s, at a sustained rate of ~500 bit operations per cycle.","PeriodicalId":217280,"journal":{"name":"18th International Symposium on VLSI Design and Test","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122541808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A spare link based reliable Network-on-Chip design","authors":"Navonil Chatterjee, N. Prasad, S. Chattopadhyay","doi":"10.1109/ISVDAT.2014.6881036","DOIUrl":"https://doi.org/10.1109/ISVDAT.2014.6881036","url":null,"abstract":"In this paper we have presented a reliable On-chip interconnection network design using spare links. It helps to mitigate the problem of fault chain formation due to failure of boundary links. The modified router design uses the redundant ports in boundary routers along with spare links for establishing connection with adjacent routers in case of link faults. This design modification on mesh based network along with proposed routing algorithm improves system reliability in case of single and multiple link failures. The performance evaluation in terms of network latency has also been improved compared to recent works with minimal area overhead.","PeriodicalId":217280,"journal":{"name":"18th International Symposium on VLSI Design and Test","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116620910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modelling and analysis of wireless communication over Networks-on-Chip","authors":"Apoorv Kumar, H. Kapoor","doi":"10.1109/ISVDAT.2014.6881044","DOIUrl":"https://doi.org/10.1109/ISVDAT.2014.6881044","url":null,"abstract":"Multi-cores and many-cores are becoming the next computing platform, with the interconnection bus becoming the new bottleneck. The bus is replaced by a Network-on-Chip (NoC) to address scalability issues. However, because NoC links are still RC-wire based, there are limitations in the transmission speed. As we reach far denser integration, the problem is likely to be aggravated. Wireless interconnects hold good promise to solve the speed and scalability issues. In this paper we analyse the improvements offered by wireless links as shortcut interconnects in wormhole based NoCs. We measure latency and throughput and observe their variations by altering the congestion level, represented by Packet Injection Rate (PIR), and the channel count. Using a simple Media Access Control (MAC) protocol we analyse the effect of the number of channels and traffic, demonstrating the advantage of using wireless NoC.","PeriodicalId":217280,"journal":{"name":"18th International Symposium on VLSI Design and Test","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129065086","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VLSI implementation of novel fast confluence ICA algorithm for signal processing applications","authors":"M. Ranjith, N. Muniraj","doi":"10.1109/ISVDAT.2014.6881086","DOIUrl":"https://doi.org/10.1109/ISVDAT.2014.6881086","url":null,"abstract":"Independent component analysis is an iterative procedure to extract sources from observed mixtures. Power, area, and convergence speed are important parameters to be improved in VLSI implementations of independent component analysis (ICA) techniques. This paper presents a VLSI implementation of a novel fast confluence adaptive independent component analysis (FCAICA) technique which has reduced power and area and improved convergence speed. The reduction in area and power is achieved by a hardware optimization scheme, and high convergence speed is achieved by a novel optimization scheme that adaptively changes the weight vector based on the kurtosis value. To increase the number precision and dynamic range of the signals, floating-point (FP) arithmetic units are used. Simulation, synthesis, floor planning, placement, and routing are carried out and the data stream is created with Cadence Tool 10.1. The FCAICA algorithm operates at 2.91MHz with 12.092 mW of power in 0.18um technology. It is more effective than the most popular FastICA algorithm.","PeriodicalId":217280,"journal":{"name":"18th International Symposium on VLSI Design and Test","volume":"136 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124387149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A locally reconfigurable Network-on-Chip architecture and application mapping onto it","authors":"J. Soumya, Ashish Sharma, S. Chattopadhyay","doi":"10.1109/ISVDAT.2014.6881041","DOIUrl":"https://doi.org/10.1109/ISVDAT.2014.6881041","url":null,"abstract":"This paper presents a reconfigurable Network-on-Chip (NoC) architecture built around a mesh topology. It provides the facility of changing the attachment of cores to local routers across applications. Applications share cores, but the communication pattern between them may vary. Compared to many other reconfigurable NoCs, our architecture incurs only about 0.2% area overhead over a simple mesh. Application mapping and reconfiguration policies have been developed using Integer Linear Programming (ILP) and a heuristic for the proposed topology. It has been shown that the reconfiguration strategy could improve communication costs of applications significantly, which often resulted in improved latency and energy values, keeping throughput unaffected.","PeriodicalId":217280,"journal":{"name":"18th International Symposium on VLSI Design and Test","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126316820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Pseudo-Deadline Based O(1) proportional share scheduler for embedded systems","authors":"Swarnendu Ray, A. Sarkar","doi":"10.1109/ISVDAT.2014.6881083","DOIUrl":"https://doi.org/10.1109/ISVDAT.2014.6881083","url":null,"abstract":"This paper presents Pseudo-Deadline Based Round-Robin (PDBRR), an O(1) proportional share scheduler for Embedded Systems that execute a mix of jobs with varying timeliness priorities. Simulation based experimental results reveal that PDBRR is able to achieve high proportional share scheduling accuracy.","PeriodicalId":217280,"journal":{"name":"18th International Symposium on VLSI Design and Test","volume":"374 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115948797","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Loop unrolling with fine grained power gating for runtime leakage power reduction","authors":"Sumanta Pyne, A. Pal","doi":"10.1109/ISVDAT.2014.6881084","DOIUrl":"https://doi.org/10.1109/ISVDAT.2014.6881084","url":null,"abstract":"The present work introduces a compilation technique to reduce runtime leakage power of functional units of a processor by combining loop unrolling with power gating. The instructions in the unrolled loop are scheduled to provide opportunities for power gating the functional units which are not in need for a considerable amount of time. The number of clock cycles taken by the power gating instructions is less than or equal to the number of clock cycles saved by loop unrolling. This results in 23-64% reduction of the total energy consumed by the benchmark programs without any degradation of performance.","PeriodicalId":217280,"journal":{"name":"18th International Symposium on VLSI Design and Test","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116247700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An LUT based RNS FIR filter implementation for reconfigurable applications","authors":"Srinivasa Reddy Kotha, Sumit Bajaj, S. K. Sahoo","doi":"10.1109/ISVDAT.2014.6881047","DOIUrl":"https://doi.org/10.1109/ISVDAT.2014.6881047","url":null,"abstract":"In this work, two approaches to realize a look up table (LUT) based finite impulse response (FIR) filter using the Residue Number System (RNS) are proposed. The proposed implementations take advantage of the shift and add approach offered by the chosen moduli set. The two proposed filter architectures are compared with an earlier proposed version of a reconfigurable RNS FIR filter. The filters are synthesized using Cadence RTL compiler in UMC 90 nm technology. The performance of the filters is compared in terms of Area (A), Power (P), and Delay (T). The results show that one of the proposed architectures offers significant improvement in terms of delay, while the second approach is well suited for applications that require minimal power and area. Both implementations offer an advantage in area-delay AT and power-delay-product PTP. The proposed approaches are also verified functionally using Altera DSP Builder.","PeriodicalId":217280,"journal":{"name":"18th International Symposium on VLSI Design and Test","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128038616","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Power analysis attack using neural networks with wavelet transform as pre-processor","authors":"P. Saravanan, P. Kalpana, V. Prcethisri, V. Sneha","doi":"10.1109/ISVDAT.2014.6881059","DOIUrl":"https://doi.org/10.1109/ISVDAT.2014.6881059","url":null,"abstract":"This work proposes a novel methodology to perform power analysis attack on secure system by using wavelet transform as a pre-processor followed by machine learning technique. The proposed methodology uses known plain text attack. The power supply current traces from the cryptographic device are obtained by varying the atmospheric temperature. Then the current traces are pre-processed by using wavelet transform, data normalization and principal component analysis (PCA). The featured data samples selected by the pre-processor are then used to train the neural network. Through supervised learning algorithm and wavelet pre-processing, we are able to achieve around 25% improvement in guessing the secret key when compared to existing method of machine learning alone.","PeriodicalId":217280,"journal":{"name":"18th International Symposium on VLSI Design and Test","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115235902","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}