A Low-Latency and Low-Complexity Hardware Architecture for CTC Beam Search Decoding

Siyuan Lu, Jinming Lu, Jun Lin, Zhongfeng Wang, L. Du

2019 IEEE International Workshop on Signal Processing Systems (SiPS), October 2019. DOI: 10.1109/SiPS47522.2019.9020324
Recurrent neural networks (RNNs) trained with connectionist temporal classification (CTC) are widely used in sequence-to-sequence tasks, including automatic speech recognition (ASR), lipreading, and scene text recognition (STR). In these systems, a CTC-trained RNN usually requires a dedicated CTC decoder after its output layer. Many existing CTC-trained RNN inference systems run the RNN computations on an FPGA and decode the outputs on a CPU. However, as FPGA-based RNN hardware accelerators have advanced, existing CPU-based CTC decoders can no longer meet their latency requirements. To resolve this issue, this paper proposes an efficient hardware architecture for the CTC beam search decoder based on the decoding method reported in our previous work. Experimental results show that the per-sample system latency of the CTC decoder is only 7.19 us on a Xilinx xc7vx1140tflg19301 FPGA platform, which is lower than that of state-of-the-art RNNs. We also implement the original algorithm on the same FPGA platform; compared with it, the improved design reduces the per-sample system latency by 63.67%, LUTRAM usage by 97.44%, flip-flop (FF) usage by 79.55%, and DSP usage by 50%. To the best of our knowledge, this is the first hardware implementation of a CTC beam search decoder.
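For readers unfamiliar with the decoding step being accelerated, the sketch below is a minimal, simplified software version of CTC prefix beam search in Python/NumPy. It illustrates the generic algorithm only; it is not the hardware-friendly variant from the authors' previous work nor the architecture proposed in this paper. The function name ctc_beam_search, the beam_width parameter, the blank index, and the toy probability matrix are illustrative assumptions.

```python
import numpy as np
from collections import defaultdict

def ctc_beam_search(probs, beam_width=8, blank=0):
    """Simplified CTC prefix beam search over a (T, V) matrix of per-frame
    label probabilities. Each prefix carries two scores: probability of
    paths ending in blank (p_b) and ending in a non-blank (p_nb)."""
    T, V = probs.shape
    beams = {(): (1.0, 0.0)}  # prefix (tuple of labels) -> (p_b, p_nb)
    for t in range(T):
        next_beams = defaultdict(lambda: (0.0, 0.0))
        for prefix, (p_b, p_nb) in beams.items():
            for s in range(V):
                p = probs[t, s]
                if s == blank:
                    # A blank keeps the prefix unchanged.
                    nb_b, nb_nb = next_beams[prefix]
                    next_beams[prefix] = (nb_b + (p_b + p_nb) * p, nb_nb)
                else:
                    new_prefix = prefix + (s,)
                    nb_b, nb_nb = next_beams[new_prefix]
                    if prefix and s == prefix[-1]:
                        # Repeated label: only a blank-terminated path extends
                        # the prefix; a non-blank-terminated one collapses.
                        next_beams[new_prefix] = (nb_b, nb_nb + p_b * p)
                        ob_b, ob_nb = next_beams[prefix]
                        next_beams[prefix] = (ob_b, ob_nb + p_nb * p)
                    else:
                        next_beams[new_prefix] = (nb_b, nb_nb + (p_b + p_nb) * p)
        # Prune to the beam_width most probable prefixes.
        beams = dict(sorted(next_beams.items(),
                            key=lambda kv: kv[1][0] + kv[1][1],
                            reverse=True)[:beam_width])
    best = max(beams.items(), key=lambda kv: kv[1][0] + kv[1][1])
    return list(best[0])

# Toy example: 4 frames, vocabulary {blank, 'a', 'b'}.
probs = np.array([[0.1, 0.7, 0.2],
                  [0.2, 0.6, 0.2],
                  [0.7, 0.1, 0.2],
                  [0.1, 0.2, 0.7]])
print(ctc_beam_search(probs, beam_width=4))  # e.g. [1, 2] -> "ab"
```

The per-frame loop over all live prefixes, with pruning and the blank/repeat bookkeeping, is the part that dominates decoding latency on a CPU, which is the bottleneck the proposed hardware architecture targets.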