NAM+: Towards Scalable End-to-End Contextual Biasing for Adaptive ASR
Tsendsuren Munkhdalai, Zelin Wu, G. Pundak, K. Sim, Jiayang Li, Pat Rondon, Tara N. Sainath
2022 IEEE Spoken Language Technology Workshop (SLT), published 2023-01-09. DOI: 10.1109/SLT54892.2023.10023323
Attention-based biasing techniques for end-to-end ASR systems can achieve large accuracy gains without requiring the inference-algorithm adjustments and parameter tuning common to fusion approaches. However, it is challenging to simultaneously scale attention-based biasing up to realistic numbers of biased phrases; maintain in-domain WER gains while minimizing out-of-domain losses; and run in real time. We present NAM+, an attention-based biasing approach that achieves a 16X inference speedup per acoustic frame over prior work when run with 3,000 biasing entities, as measured on a typical mobile CPU. NAM+ achieves these run-time gains through a combination of Two-Pass Hierarchical Attention and Dilated Context Update. Compared to the adapted baseline, NAM+ further decreases the in-domain WER by up to 12.6% relative, while incurring an out-of-domain WER regression of 20% relative. Compared to the non-adapted baseline, the out-of-domain WER regression is 7.1% relative.
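The abstract names the two speedup mechanisms but does not spell them out. The sketch below is one plausible reading of "two-pass hierarchical attention" (a cheap phrase-level pass that shortlists entities, followed by fine-grained token-level attention over only the shortlist) and "dilated context update" (recomputing the biasing context every k-th frame rather than per frame). All function names, tensor shapes, the top-k shortlist size, and the additive fusion are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of two-pass hierarchical attention with a dilated
# context update, as the abstract's terms could be realized.
# Shapes, names, and the top-k shortlist are assumptions for illustration.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def two_pass_biasing_context(query, phrase_summaries, phrase_tokens, top_k=8):
    """query:            (d,)       acoustic-frame query vector
    phrase_summaries: (N, d)     one pooled embedding per biasing phrase
    phrase_tokens:    (N, T, d)  token-level embeddings per phrase
    Pass 1 shortlists top_k phrases; pass 2 attends over their tokens only,
    so per-frame cost no longer scales with all N*T biasing tokens."""
    # Pass 1: cheap phrase-level scoring over all N entities.
    scores = phrase_summaries @ query                        # (N,)
    shortlist = np.argsort(scores)[-top_k:]                  # top_k phrase ids
    # Pass 2: fine-grained attention restricted to the shortlist.
    tokens = phrase_tokens[shortlist].reshape(-1, query.shape[0])  # (top_k*T, d)
    weights = softmax(tokens @ query)                        # (top_k*T,)
    return weights @ tokens                                  # (d,) biasing context

def stream_with_dilated_update(frames, summaries, tokens, dilation=4):
    """Recompute the biasing context only every `dilation` frames and
    reuse it in between; this is one way a per-frame speedup could arise."""
    context = np.zeros(frames.shape[1])
    outputs = []
    for t, frame in enumerate(frames):
        if t % dilation == 0:                                # dilated update
            context = two_pass_biasing_context(frame, summaries, tokens)
        outputs.append(frame + context)                      # assumed additive fusion
    return np.stack(outputs)
```

Under this reading, the speedup has two independent sources: the shortlist keeps the second attention pass constant-cost as the number of biasing entities grows, and the dilation amortizes even that cost across several acoustic frames.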