Real-time object detection in software with custom vector instructions and algorithm changes
Joe Edwards, G. Lemieux
2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2017-07-01
DOI: 10.1109/ASAP.2017.7995262 (https://doi.org/10.1109/ASAP.2017.7995262)
Citations: 2
Abstract
Real-time vision applications place stringent performance requirements on embedded systems. To meet these requirements, embedded systems often rely on hardware implementations. This approach is unfavorable because hardware development is difficult to debug, time-consuming, and requires extensive skill. This paper presents a case study of accelerating face detection, often part of a complex image processing pipeline, using a software/hardware hybrid approach. As a baseline, the algorithm is initially run on a scalar ARM Cortex-A9 application processor found on a Xilinx Zynq device. Next, using a previously designed vector engine implemented in the FPGA fabric, the algorithm is vectorized, using only standard vector instructions, to achieve a 25× speedup. Then, we accelerate the critical inner loops by adding two hardware-assisted custom vector instructions for an additional 10× speedup, yielding a 248× speedup over the initial Cortex-A9 baseline. Collectively, the custom instructions require fewer than 800 lines of VHDL code, including comments and blank lines. Compared to previous hardware-only face detection systems, our work is 1.5 to 6.8 times faster. This approach demonstrates that good performance can be obtained from software-only vectorization, and that a small amount of custom hardware can provide a significant further acceleration.
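The three speedup figures in the abstract (25×, an additional 10×, and 248× overall) can be cross-checked with a short calculation. The sketch below assumes that the two stages compose multiplicatively and that the per-stage figures are rounded; the abstract does not state either point explicitly. The function names `combined_speedup` and `implied_stage_speedup` are hypothetical helpers introduced only for this illustration.

```python
def combined_speedup(stage_speedups):
    """Multiply per-stage speedups, assuming each stage scales the previous result."""
    total = 1.0
    for s in stage_speedups:
        total *= s
    return total


def implied_stage_speedup(total_speedup, known_stage_speedup):
    """Given an overall speedup and one stage's speedup, return the other stage's factor."""
    return total_speedup / known_stage_speedup


# Figures taken from the abstract; treating them as rounded is an assumption.
vectorized = 25        # standard vector instructions vs. scalar Cortex-A9
custom_rounded = 10    # additional gain from the two custom vector instructions
overall_reported = 248

print(combined_speedup([vectorized, custom_rounded]))                  # 250.0
print(round(implied_stage_speedup(overall_reported, vectorized), 2))   # 9.92
```

Under these assumptions, the reported 248× overall figure corresponds to a second-stage gain of roughly 9.9×, which is consistent with the "additional 10×" stated in the abstract once rounding is taken into account.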