Real-time object detection in software with custom vector instructions and algorithm changes

Joe Edwards, G. Lemieux
{"title":"Real-time object detection in software with custom vector instructions and algorithm changes","authors":"Joe Edwards, G. Lemieux","doi":"10.1109/ASAP.2017.7995262","DOIUrl":null,"url":null,"abstract":"Real-time vision applications place stringent performance requirements on embedded systems. To meet performance requirements, embedded systems often require hardware implementations. This approach is unfavorable as hardware development can be difficult to debug, time-consuming, and require extensive skill. This paper presents a case study of accelerating face detection, often part of a complex image processing pipeline, using a software/hardware hybrid approach. As a baseline, the algorithm is initially run on a scalar ARM Cortex-A9 application processor found on a Xilinx Zynq device. Next, using a previously designed vector engine implemented in the FPGA fabric, the algorithm is vectorized, using only standard vector instructions, to achieve a 25× speedup. Then, we accelerate the critical inner loops by adding two hardware-assisted custom vector instructions for an additional 10× speedup, yielding 248× speedup over the initial Cortex-A9 baseline. Collectively, the custom instructions require fewer than 800 lines of VHDL code, including comments and blank lines. Compared to previous hardware-only face detection systems, our work is 1.5 to 6.8 times faster. This approach demonstrates that good performance can be obtained from software-only vectorization, and a small amount of custom hardware can provide a significant acceleration boost.","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"67 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ASAP.2017.7995262","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

Real-time vision applications place stringent performance requirements on embedded systems. To meet performance requirements, embedded systems often require hardware implementations. This approach is unfavorable as hardware development can be difficult to debug, time-consuming, and require extensive skill. This paper presents a case study of accelerating face detection, often part of a complex image processing pipeline, using a software/hardware hybrid approach. As a baseline, the algorithm is initially run on a scalar ARM Cortex-A9 application processor found on a Xilinx Zynq device. Next, using a previously designed vector engine implemented in the FPGA fabric, the algorithm is vectorized, using only standard vector instructions, to achieve a 25× speedup. Then, we accelerate the critical inner loops by adding two hardware-assisted custom vector instructions for an additional 10× speedup, yielding 248× speedup over the initial Cortex-A9 baseline. Collectively, the custom instructions require fewer than 800 lines of VHDL code, including comments and blank lines. Compared to previous hardware-only face detection systems, our work is 1.5 to 6.8 times faster. This approach demonstrates that good performance can be obtained from software-only vectorization, and a small amount of custom hardware can provide a significant acceleration boost.
实时目标检测软件与自定义矢量指令和算法的变化
实时视觉应用对嵌入式系统提出了严格的性能要求。为了满足性能需求,嵌入式系统通常需要硬件实现。这种方法是不利的,因为硬件开发可能难以调试,耗时,并且需要大量的技能。本文介绍了一个使用软件/硬件混合方法加速人脸检测的案例研究,人脸检测通常是复杂图像处理管道的一部分。作为基准,该算法最初在Xilinx Zynq设备上的标量ARM Cortex-A9应用处理器上运行。接下来,使用先前设计的在FPGA结构中实现的矢量引擎,仅使用标准矢量指令对算法进行矢量化,以实现25倍的加速。然后,我们通过添加两个硬件辅助的自定义矢量指令来加速关键的内环,以获得额外的10倍加速,从而在初始Cortex-A9基线上产生248倍的加速。总的来说,自定义指令需要少于800行VHDL代码,包括注释和空白行。与之前的纯硬件人脸检测系统相比,我们的工作速度快了1.5到6.8倍。这种方法表明,仅通过软件矢量化可以获得良好的性能,并且少量的定制硬件可以提供显着的加速提升。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信