Antonio Bevilacqua, Paolo Saviano, Alessandro Amirante, Simon Pietro Romano
Whispy: Adapting STT Whisper Models to Real-Time Environments
arXiv - CS - Sound · Published 2024-05-06 · DOI: https://doi.org/arxiv-2405.03484
Citations: 0
Abstract
Large general-purpose transformer models have recently become the mainstay of speech analysis. In particular, Whisper achieves state-of-the-art results in tasks such as speech recognition, translation, language identification, and voice activity detection. However, Whisper models are not designed for real-time operation, and this limitation makes them unsuitable for a wide range of practical applications. In this paper, we introduce Whispy, a system intended to bring live capabilities to pretrained Whisper models. Thanks to a number of architectural optimisations, Whispy is able to consume live audio streams and generate high-level, coherent voice transcriptions while maintaining a low computational cost. We evaluate the performance of our system on a large repository of publicly available speech datasets, investigating how the transcription mechanism introduced by Whispy affects the Whisper output. Experimental results show that Whispy excels in robustness, promptness, and accuracy.