{"title":"SpeechTaxi: On Multilingual Semantic Speech Classification","authors":"Lennart Keller, Goran Glavaš","doi":"arxiv-2409.06372","DOIUrl":null,"url":null,"abstract":"Recent advancements in multilingual speech encoding as well as transcription\nraise the question of the most effective approach to semantic speech\nclassification. Concretely, can (1) end-to-end (E2E) classifiers obtained by\nfine-tuning state-of-the-art multilingual speech encoders (MSEs) match or\nsurpass the performance of (2) cascading (CA), where speech is first\ntranscribed into text and classification is delegated to a text-based\nclassifier. To answer this, we first construct SpeechTaxi, an 80-hour\nmultilingual dataset for semantic speech classification of Bible verses,\ncovering 28 diverse languages. We then leverage SpeechTaxi to conduct a wide\nrange of experiments comparing E2E and CA in monolingual semantic speech\nclassification as well as in cross-lingual transfer. We find that E2E based on\nMSEs outperforms CA in monolingual setups, i.e., when trained on in-language\ndata. However, MSEs seem to have poor cross-lingual transfer abilities, with\nE2E substantially lagging CA both in (1) zero-shot transfer to languages unseen\nin training and (2) multilingual training, i.e., joint training on multiple\nlanguages. Finally, we devise a novel CA approach based on transcription to\nRomanized text as a language-agnostic intermediate representation and show that\nit represents a robust solution for languages without native ASR support. 
Our\nSpeechTaxi dataset is publicly available at: https://huggingface.co/\ndatasets/LennartKeller/SpeechTaxi/.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"23 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Audio and Speech Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.06372","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Recent advancements in multilingual speech encoding and transcription raise the question of the most effective approach to semantic speech classification. Concretely, can (1) end-to-end (E2E) classifiers obtained by fine-tuning state-of-the-art multilingual speech encoders (MSEs) match or surpass the performance of (2) cascading (CA), where speech is first transcribed into text and classification is delegated to a text-based classifier? To answer this, we first construct SpeechTaxi, an 80-hour multilingual dataset for semantic speech classification of Bible verses, covering 28 diverse languages. We then leverage SpeechTaxi to conduct a wide range of experiments comparing E2E and CA in monolingual semantic speech classification as well as in cross-lingual transfer. We find that E2E based on MSEs outperforms CA in monolingual setups, i.e., when trained on in-language data. However, MSEs seem to have poor cross-lingual transfer abilities: E2E substantially lags behind CA both in (1) zero-shot transfer to languages unseen in training and (2) multilingual training, i.e., joint training on multiple languages. Finally, we devise a novel CA approach based on transcription to Romanized text as a language-agnostic intermediate representation and show that it is a robust solution for languages without native ASR support. Our SpeechTaxi dataset is publicly available at https://huggingface.co/datasets/LennartKeller/SpeechTaxi/.
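The cascaded (CA) setup the abstract contrasts with E2E can be sketched as a three-stage pipeline: transcribe speech, optionally Romanize the transcript into a language-agnostic Latin-script form, then hand classification to a text model. The sketch below is a minimal, hypothetical illustration with stand-in stubs; it is not the authors' implementation. The function names, the keyword classifier, and the lowercasing "romanizer" are all placeholders (a real system would use a multilingual ASR model and a tool such as uroman, plus a fine-tuned text encoder).

```python
# Hypothetical sketch of the cascaded (CA) pipeline described in the abstract:
# speech -> transcription -> Romanization -> text classifier.
# All components are illustrative stubs, not the paper's models.

from dataclasses import dataclass


@dataclass
class Verse:
    audio: bytes   # raw speech waveform (placeholder)
    language: str  # ISO language code


def transcribe(verse: Verse) -> str:
    """Stub ASR step; a real pipeline would call a multilingual ASR model."""
    return "For God so loved the world"  # placeholder transcription


def romanize(text: str) -> str:
    """Stub Romanization to a language-agnostic Latin-script representation;
    real systems use dedicated transliteration tools (e.g. uroman)."""
    return text.lower()


def classify(text: str) -> str:
    """Stub text classifier (toy keyword rule); a real system would
    fine-tune a multilingual text encoder on the Romanized transcripts."""
    return "love" if "loved" in text else "other"


def cascade_predict(verse: Verse) -> str:
    """CA approach: classification is delegated entirely to the text side,
    so any language the transcriber covers can be classified."""
    return classify(romanize(transcribe(verse)))


print(cascade_predict(Verse(audio=b"", language="eng")))  # -> love
```

The design point the sketch illustrates is that the speech-specific component is confined to `transcribe`; swapping in a different ASR model or Romanizer requires no change to the classifier, which is what makes the Romanized intermediate representation attractive for languages without native ASR support.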