A VLIW-Vector co-processor design for accelerating Basic Linear Algebraic Operations in OpenCV

18th International Symposium on VLSI Design and Test | Pub Date: 2014-08-21 | DOI: 10.1109/ISVDAT.2014.6881085

Venkata Ganapathi Puppala


Citations: 3

Abstract

OpenCV is a widely used computer vision library written in C++. Basic Linear Algebraic Operations (BLAOPs) on matrices are at the heart of OpenCV. Although OpenCV is ubiquitous in the computer vision field, it runs slowly when ported to embedded processors. Accelerating the BLAOPs with a co-processor helps improve throughput. In this paper, we present a floating-point VLIW-Vector co-processor architecture with a Vector Floating Point Datapath (VFPDP) and a 4-slot VLIW processor core to accelerate BLAOPs, achieving a performance of two GFLOPS at a clock frequency of 500 MHz. We also demonstrate a detailed strategy for mapping the One-Sided Jacobi Singular Value Decomposition (OJSVD) algorithm onto the proposed architecture. The architecture is designed in Verilog HDL and synthesized using Synopsys Design Compiler with 28 nm TSMC target libraries. The clock period is set to 2 ns, and the timing constraints are met. Using Altera's SOPC Builder, an experimental system is created with the co-processor interfaced to the Nios II soft processor and implemented on a Cyclone IV FPGA. The OJSVD algorithm is ported to both the standalone Nios II processor-based system and the system with the proposed co-processor. The results show that a 15X performance improvement is achieved with the co-processor.
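
For readers unfamiliar with the one-sided Jacobi SVD named in the abstract, the following is a minimal sketch of the textbook (Hestenes) formulation in plain Python. It is not the paper's mapping onto the VLIW-Vector co-processor, which the abstract does not detail; the function name, tolerance, and sweep limit are assumptions chosen for illustration. The sketch does show why the algorithm suits a vector datapath: each step reduces to three column dot products followed by a plane rotation applied to two columns.

```python
import math


def one_sided_jacobi_singular_values(a, tol=1e-12, max_sweeps=50):
    """Return the singular values of `a` (a list of rows), largest first.

    Hypothetical helper for illustration: textbook one-sided Jacobi,
    which orthogonalizes pairs of columns with plane rotations.
    """
    u = [list(map(float, row)) for row in a]  # work on a float copy
    m, n = len(u), len(u[0])
    for _ in range(max_sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                # Three dot products over columns p and q.
                alpha = sum(u[i][p] * u[i][p] for i in range(m))
                beta = sum(u[i][q] * u[i][q] for i in range(m))
                gamma = sum(u[i][p] * u[i][q] for i in range(m))
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue  # columns are already orthogonal enough
                rotated = True
                # Rotation angle that zeroes the inner product of the pair.
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (
                    abs(zeta) + math.sqrt(1.0 + zeta * zeta)
                )
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                # Apply the rotation to both columns, row by row.
                for i in range(m):
                    up, uq = u[i][p], u[i][q]
                    u[i][p] = c * up - s * uq
                    u[i][q] = s * up + c * uq
        if not rotated:
            break  # a full sweep with no rotation means convergence
    # After convergence, the column norms are the singular values.
    norms = [math.sqrt(sum(u[i][j] ** 2 for i in range(m))) for j in range(n)]
    return sorted(norms, reverse=True)
```

As a usage example, `one_sided_jacobi_singular_values([[3, 0], [4, 5]])` converges after a single rotation to the values sqrt(45) and sqrt(5). A full SVD would also accumulate the same rotations into a matrix V, which is omitted here for brevity.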
