A Network-on-Chip Based H.264 Video Decoder Prototype Implemented on FPGAs

2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Pub Date: 2017-04-01, DOI: 10.1109/FCCM.2017.10

Ian J. Barge, Cristinel Ababei


Citations: 3

Abstract

We present a field-programmable gate array (FPGA) based implementation of the H.264 video decoder algorithm. The novelty of our design is that the decoder modules communicate through a network-on-chip (NoC). This makes the design scalable and easy to integrate into larger future NoC-based systems, where the same hardware platform can host other algorithms such as compression and filtering. Our primary objective is to study the performance achievable with a NoC-based H.264 decoder. The design process involves three main steps. First, the H.264 algorithm is split into eight partitions, which are implemented as individual processing elements (PEs). These PEs are attached to the routers of a regular mesh NoC and comprise: network abstraction layer (NAL) parser and entropy decoder, frame buffer and integer motion, inverse quantization and inverse transform, intra prediction, luma sub-pixel motion, chroma sub-pixel motion, deblocking filter, and display driver. The PEs are described in VHDL, with the first two executed on Nios II soft cores. The NoC was generated with the Connect tool from Carnegie Mellon University and integrated within the top-level design entity. Second, we specify the location of each PE inside the regular mesh NoC. Because we use eight PEs, the NoC needs to be a 3x3 regular mesh. Specifying the location of the PEs inside the mesh (i.e., the router to which each PE is attached) effectively solves what is called the NoC mapping problem. We map the PEs manually, guided by knowledge of the internal structure of the decoding algorithm, which helps to reduce the number of routers that packets must traverse. Finally, the entire project is synthesized, placed, and routed with the Quartus Prime Standard Edition 16.1 tool. The final design is tested and verified on the DE4 development board, which uses Altera's Stratix IV GX FPGA chip. At the time of submission, the implementation decodes 100 frames in 33 seconds at a frame size of 192x144 pixels and 100 frames in 56 seconds at a frame size of 320x240 pixels. Documentation and source code of the entire project will be released to the public domain. We hope that this will enable other researchers to easily replicate our results and compare against them, and that it will encourage and facilitate further research in image processing, computer vision, advanced VHDL design, and FPGAs.
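The mesh sizing and the NoC mapping step described in the abstract can be illustrated with a small Python sketch. This is not the authors' tool or their actual mapping: the short PE names, the row-major placement, and the unit-weight traffic edges below are hypothetical, chosen only to show how a candidate placement might be scored. The sketch also assumes minimal (for example, dimension-ordered) routing, so the hop count between two routers equals their Manhattan distance.

```python
import math

# Hypothetical short names for the eight PEs listed in the abstract.
PES = [
    "nal_entropy", "frame_buffer_int_motion", "iq_it", "intra_pred",
    "luma_subpel", "chroma_subpel", "deblock", "display",
]


def smallest_square_mesh(num_pes):
    """Return the side length n of the smallest n x n mesh with >= num_pes routers."""
    side = math.isqrt(num_pes)
    if side * side < num_pes:
        side += 1
    return side


def hop_count(a, b):
    """Manhattan distance between two (row, col) routers (assumes minimal routing)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def mapping_cost(mapping, traffic):
    """Sum of traffic weight times hop count over all communicating PE pairs."""
    return sum(
        weight * hop_count(mapping[src], mapping[dst])
        for (src, dst), weight in traffic.items()
    )


# Assumed placement: PEs laid out row by row on the 3x3 mesh (illustrative only).
SIDE = smallest_square_mesh(len(PES))
row_major = {name: divmod(i, SIDE) for i, name in enumerate(PES)}

# Assumed communication edges with unit weights (not taken from the paper).
traffic = {
    ("nal_entropy", "iq_it"): 1,
    ("nal_entropy", "intra_pred"): 1,
    ("frame_buffer_int_motion", "luma_subpel"): 1,
    ("frame_buffer_int_motion", "chroma_subpel"): 1,
    ("iq_it", "deblock"): 1,
    ("deblock", "display"): 1,
}

if __name__ == "__main__":
    print("mesh side:", SIDE)
    print("row-major cost:", mapping_cost(row_major, traffic))
```

A manual mapping of the kind the abstract describes would place heavily communicating PEs on adjacent routers; comparing `mapping_cost` for two candidate placements is one simple way to check whether a change shortens the paths packets take.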
