18th International Symposium on VLSI Design and Test: Latest Publications

A VLIW-Vector co-processor design for accelerating Basic Linear Algebraic Operations in OpenCV

18th International Symposium on VLSI Design and Test · Pub Date: 2014-08-21 · DOI: 10.1109/ISVDAT.2014.6881085

Venkata Ganapathi Puppala

Abstract: OpenCV is a widely used computer vision library written in C++. Basic Linear Algebraic Operations (BLAOps) involving matrices are at the heart of OpenCV. Although OpenCV is ubiquitous in computer vision, it runs slowly when ported to embedded processors, and accelerating the BLAOps with a co-processor helps improve throughput. In this paper we present a floating-point VLIW-Vector co-processor architecture with a Vector Floating Point Datapath (VFPDP) and a 4-slot VLIW processor core that accelerates BLAOps, achieving 2 GFLOPS at a 500 MHz clock frequency. We also demonstrate a detailed strategy for mapping the One-Sided Jacobi Singular Value Decomposition (OJSVD) algorithm onto the proposed architecture. The architecture is designed in Verilog HDL and synthesized using Synopsys Design Compiler with TSMC 28 nm target libraries; the clock period is set to 2 ns and the timing constraints are met. Using Altera's SOPC Builder, an experimental system is created with the co-processor interfaced to the Nios II soft processor and implemented on a Cyclone IV FPGA. The OJSVD algorithm is ported both to a standalone Nios II system and to the system with the proposed co-processor. The results show a 15X performance improvement with the co-processor.

Citations: 3
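The abstract maps One-Sided Jacobi SVD onto the co-processor without showing the algorithm itself. As a rough, software-only illustration, the sketch below implements the textbook Hestenes one-sided Jacobi method in pure Python: it repeatedly applies plane rotations to pairs of columns of A until all columns are mutually orthogonal, at which point the column norms are the singular values. This is not the paper's mapping or its VLIW/vector code; the function name `one_sided_jacobi_svd`, the tolerance, and the sweep limit are illustrative assumptions.

```python
import math


def one_sided_jacobi_svd(a, tol=1e-12, max_sweeps=30):
    """Hestenes one-sided Jacobi SVD of an m x n matrix (m >= n), given as a list of rows.

    Illustrative sketch (hypothetical helper, not the paper's code).
    Returns (u, sigma, v) with a ~= u * diag(sigma) * v^T; sigma is not sorted.
    """
    m, n = len(a), len(a[0])
    # Store A column-wise so each rotation touches two contiguous lists.
    cols = [[a[r][k] for r in range(m)] for k in range(n)]
    # V accumulates the same rotations (also column-wise), starting from the identity.
    vcols = [[1.0 if r == k else 0.0 for r in range(n)] for k in range(n)]

    for _ in range(max_sweeps):
        rotated = False
        for i in range(n - 1):
            for j in range(i + 1, n):
                alpha = sum(x * x for x in cols[i])
                beta = sum(x * x for x in cols[j])
                gamma = sum(x * y for x, y in zip(cols[i], cols[j]))
                # Skip column pairs that are already orthogonal to within tolerance.
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue
                rotated = True
                # Rotation angle chosen so the rotated columns i and j become orthogonal.
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                # Apply the same plane rotation to columns i, j of A and of V.
                for mat in (cols, vcols):
                    ci, cj = mat[i], mat[j]
                    mat[i] = [c * x - s * y for x, y in zip(ci, cj)]
                    mat[j] = [s * x + c * y for x, y in zip(ci, cj)]
        if not rotated:
            break  # a full sweep made no rotation: columns are orthogonal

    # Singular values are the norms of the orthogonalized columns (A V = U Sigma).
    sigma = [math.sqrt(sum(x * x for x in col)) for col in cols]
    # Left singular vectors are the normalized columns; zero columns are left as zeros.
    ucols = [[x / sg for x in col] if sg > 0 else col[:] for col, sg in zip(cols, sigma)]
    u = [[ucols[k][r] for k in range(n)] for r in range(m)]
    v = [[vcols[k][r] for k in range(n)] for r in range(n)]
    return u, sigma, v
```

For example, `u, s, v = one_sided_jacobi_svd([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])` returns factors whose product `u · diag(s) · vᵀ` reproduces the input up to rounding error. The column-pair structure is what makes the method attractive for parallel hardware: rotations on disjoint column pairs are independent, and each one is a handful of dot products followed by scaled vector updates, which suit a vector datapath. How the paper actually schedules these operations is not reproduced here.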

FPGA-based implementation of M4RM for matrix multiplication over GF(2)

18th International Symposium on VLSI Design and Test · Pub Date: 2014-08-21 · DOI: 10.1109/ISVDAT.2014.6881072

Vivek Kumar, Vinay B. Y. Kumar, S. Patkar

Abstract: The Method of Four Russians for Multiplication (M4RM) is one of the most efficient algorithms for dense matrix multiplication over the binary field, particularly on commodity general-purpose processors. We present an efficient tile-based hardware/software implementation of M4RM in which the hardware handles the constituent block multiplications in a streaming fashion and the software performs the accumulations. With designs for 64 × 64 and 128 × 128 block multiplications, sizes feasible for FPGAs, we compare performance with the fastest software implementations of M4RM on commodity processors. The designs were implemented in Bluespec SystemVerilog and evaluated on the hardware/software co-emulation framework SCE-MI. Using the 128 × 128 hardware modules running at 140 MHz on a Cyclone IV FPGA, a 16,384 × 16,384 matrix multiplication with the Strassen-Winograd scheme completes in ~3.0 s at a sustained ~8000 bit operations per cycle; by comparison, M4RM on an Intel Core 2 Duo running at 2.33 GHz takes ~8 s at a sustained ~500 bit operations per cycle.

Citations: 1
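M4RM's central trick fits in a few lines. For C = A·B over GF(2), the columns of A are processed k at a time; for each group, all 2^k XOR-combinations of the corresponding k rows of B are precomputed into a table, and each row of A then needs one table lookup and one XOR per group instead of up to k separate row additions. The sketch below is a minimal software illustration, assuming matrices are stored as lists of Python ints used as row bitsets (bit j = column j); the names `m4rm` and `gf2_matmul_naive` and the default k = 4 are hypothetical choices. It does not model the paper's 64 × 64/128 × 128 hardware blocks, its streaming design, or Strassen-Winograd, and it builds each table with a lowest-set-bit recurrence rather than the Gray-code order used by the M4RI library.

```python
def gf2_matmul_naive(a_rows, b_rows, inner):
    """Reference GF(2) product: XOR together the rows of B selected by each row of A."""
    c = []
    for row in a_rows:
        acc = 0
        for t in range(inner):
            if (row >> t) & 1:
                acc ^= b_rows[t]
        c.append(acc)
    return c


def m4rm(a_rows, b_rows, inner, k=4):
    """Method of Four Russians sketch: C = A * B over GF(2).

    a_rows: rows of A as int bitsets with `inner` columns (bit j = column j).
    b_rows: the `inner` rows of B as int bitsets.
    k:      number of A-columns handled per lookup table (hypothetical default).
    """
    c = [0] * len(a_rows)
    for start in range(0, inner, k):
        width = min(k, inner - start)  # the last group may be narrower than k
        # table[idx] = XOR of B rows start+p for every bit p set in idx.
        table = [0] * (1 << width)
        for idx in range(1, 1 << width):
            low = idx & -idx              # lowest set bit of idx
            bit = low.bit_length() - 1    # its position inside the group
            table[idx] = table[idx ^ low] ^ b_rows[start + bit]
        mask = (1 << width) - 1
        for r, row in enumerate(a_rows):
            # The width-bit slice of row r selects one precomputed combination.
            c[r] ^= table[(row >> start) & mask]
    return c
```

Each table entry costs one XOR of an already-computed entry with a single row of B, so a group's table costs 2^k row XORs; k trades that construction cost against the lookup savings, and the classic analysis picks k around log₂ n.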

A spare link based reliable Network-on-Chip design

18th International Symposium on VLSI Design and Test · Pub Date: 2014-07-16 · DOI: 10.1109/ISVDAT.2014.6881036

Navonil Chatterjee, N. Prasad, S. Chattopadhyay

Citations: 6

Modelling and analysis of wireless communication over Networks-on-Chip

18th International Symposium on VLSI Design and Test · Pub Date: 2014-07-16 · DOI: 10.1109/ISVDAT.2014.6881044

Apoorv Kumar, H. Kapoor

Citations: 1

VLSI implementation of novel fast confluence ICA algorithm for signal processing applications

18th International Symposium on VLSI Design and Test · Pub Date: 2014-07-16 · DOI: 10.1109/ISVDAT.2014.6881086

M. Ranjith, N. Muniraj

Citations: 0

A locally reconfigurable Network-on-Chip architecture and application mapping onto it

18th International Symposium on VLSI Design and Test · Pub Date: 2014-07-16 · DOI: 10.1109/ISVDAT.2014.6881041

J. Soumya, Ashish Sharma, S. Chattopadhyay

Citations: 3

A Pseudo-Deadline Based O(1) proportional share scheduler for embedded systems

18th International Symposium on VLSI Design and Test · Pub Date: 2014-07-16 · DOI: 10.1109/ISVDAT.2014.6881083

Swarnendu Ray, A. Sarkar

Citations: 0

Loop unrolling with fine grained power gating for runtime leakage power reduction

18th International Symposium on VLSI Design and Test · Pub Date: 2014-07-16 · DOI: 10.1109/ISVDAT.2014.6881084

Sumanta Pyne, A. Pal

Citations: 0

An LUT based RNS FIR filter implementation for reconfigurable applications

18th International Symposium on VLSI Design and Test · Pub Date: 2014-07-16 · DOI: 10.1109/ISVDAT.2014.6881047

Srinivasa Reddy Kotha, Sumit Bajaj, S. K. Sahoo

Citations: 4

A thermal aware 3D IC partitioning technique

18th International Symposium on VLSI Design and Test · Pub Date: 2014-07-16 · DOI: 10.1109/ISVDAT.2014.6881069

Sabyasachee Banerjee, S. Majumder

Abstract: On-chip power density plays a major role in high-performance VLSI circuits. 3D chips have significantly higher power densities than their 2D counterparts because of continued technology scaling and the larger number of components operating at higher frequency and bandwidth. The consumed power is dissipated as heat, which affects the performance and reliability of a chip. Thermal problems and limits on inter-layer via (TSV) density are important design constraints for three-dimensional integrated circuits (3D ICs). In this paper we introduce an algorithm that places modules with relatively high power densities on the bottom layer and modules with progressively lower power densities on higher layers. Layer temperatures are non-increasing from the bottommost layer to the topmost layer to ensure efficient heat dissipation across the whole chip, which means fewer thermal TSVs may be needed. Along with this thermal-aware partitioning technique, we also try to minimize the number of inter-layer vias (signal TSVs) by swapping some modules across layers, at the cost of a small increase in the area of the largest layer. The experimental results are quite encouraging.

Citations: 2
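The layering rule in the thermal-aware partitioning abstract (highest power density at the bottom, decreasing upward) can be illustrated with a simple greedy pass. The sketch below is an assumption-laden illustration, not the authors' algorithm: it sorts modules by power density, fills layers from the bottom (layer 0) upward under a hypothetical per-layer area capacity, and uses average power density per layer as a stand-in for temperature. It does not model heat flow, thermal TSVs, or the module swapping the authors use to reduce signal TSVs. `Module`, `partition_by_power_density`, `layer_density`, and the units are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Module:
    """Hypothetical module record; units are illustrative only."""
    name: str
    power: float  # e.g. watts
    area: float   # e.g. mm^2, must be positive

    @property
    def density(self) -> float:
        return self.power / self.area


def partition_by_power_density(modules, num_layers, layer_capacity):
    """Greedy sketch: densest modules fill layer 0 (bottom), then higher layers.

    Power density is used as a proxy for temperature (an assumption).
    """
    layers = [[] for _ in range(num_layers)]
    used = [0.0] * num_layers
    layer = 0
    for m in sorted(modules, key=lambda mod: mod.density, reverse=True):
        # Move upward until a layer has room; never move back down, so each
        # layer holds a contiguous run of the density-sorted list.
        while layer < num_layers and used[layer] + m.area > layer_capacity:
            layer += 1
        if layer == num_layers:
            raise ValueError(f"module {m.name} does not fit in the remaining layers")
        layers[layer].append(m)
        used[layer] += m.area
    return layers


def layer_density(layer):
    """Average power density of one layer (0.0 for an empty layer)."""
    area = sum(m.area for m in layer)
    return sum(m.power for m in layer) / area if area else 0.0
```

Because each layer receives a contiguous run of the density-sorted list, every module on a layer is at least as dense as every module above it, so the per-layer average density is non-increasing from bottom to top, mirroring the ordering the abstract describes. A full flow would then trade this ordering against signal-TSV count, for instance by swapping modules across layers as the authors do.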