{"title":"SpeechTaxi: On Multilingual Semantic Speech Classification","authors":"Lennart Keller, Goran Glavaš","doi":"arxiv-2409.06372","DOIUrl":null,"url":null,"abstract":"Recent advancements in multilingual speech encoding as well as transcription\nraise the question of the most effective approach to semantic speech\nclassification. Concretely, can (1) end-to-end (E2E) classifiers obtained by\nfine-tuning state-of-the-art multilingual speech encoders (MSEs) match or\nsurpass the performance of (2) cascading (CA), where speech is first\ntranscribed into text and classification is delegated to a text-based\nclassifier. To answer this, we first construct SpeechTaxi, an 80-hour\nmultilingual dataset for semantic speech classification of Bible verses,\ncovering 28 diverse languages. We then leverage SpeechTaxi to conduct a wide\nrange of experiments comparing E2E and CA in monolingual semantic speech\nclassification as well as in cross-lingual transfer. We find that E2E based on\nMSEs outperforms CA in monolingual setups, i.e., when trained on in-language\ndata. However, MSEs seem to have poor cross-lingual transfer abilities, with\nE2E substantially lagging CA both in (1) zero-shot transfer to languages unseen\nin training and (2) multilingual training, i.e., joint training on multiple\nlanguages. Finally, we devise a novel CA approach based on transcription to\nRomanized text as a language-agnostic intermediate representation and show that\nit represents a robust solution for languages without native ASR support. 
Our\nSpeechTaxi dataset is publicly available at: https://huggingface.co/\ndatasets/LennartKeller/SpeechTaxi/.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"23 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Audio and Speech Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.06372","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Recent advancements in multilingual speech encoding and transcription raise the question of the most effective approach to semantic speech classification. Concretely, can (1) end-to-end (E2E) classifiers obtained by fine-tuning state-of-the-art multilingual speech encoders (MSEs) match or surpass the performance of (2) cascading (CA), where speech is first transcribed into text and classification is delegated to a text-based classifier? To answer this, we first construct SpeechTaxi, an 80-hour multilingual dataset for semantic speech classification of Bible verses, covering 28 diverse languages. We then leverage SpeechTaxi to conduct a wide range of experiments comparing E2E and CA in monolingual semantic speech classification as well as in cross-lingual transfer. We find that E2E based on MSEs outperforms CA in monolingual setups, i.e., when trained on in-language data. However, MSEs seem to have poor cross-lingual transfer abilities: E2E substantially lags behind CA both in (1) zero-shot transfer to languages unseen in training and (2) multilingual training, i.e., joint training on multiple languages. Finally, we devise a novel CA approach based on transcription to Romanized text as a language-agnostic intermediate representation and show that it is a robust solution for languages without native ASR support. Our SpeechTaxi dataset is publicly available at https://huggingface.co/datasets/LennartKeller/SpeechTaxi/.
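The cascaded (CA) setup the abstract contrasts with E2E can be sketched as a three-stage pipeline: transcribe speech, optionally Romanize the transcript into a language-agnostic Latin-script form, then hand classification to a text model. The sketch below is a minimal, hypothetical illustration with stand-in stubs; it is not the authors' implementation. The function names, the keyword classifier, and the lowercasing "romanizer" are all placeholders (a real system would use a multilingual ASR model and a tool such as uroman, plus a fine-tuned text encoder).

```python
# Hypothetical sketch of the cascaded (CA) pipeline described in the abstract:
# speech -> transcription -> Romanization -> text classifier.
# All components are illustrative stubs, not the paper's models.

from dataclasses import dataclass


@dataclass
class Verse:
    audio: bytes   # raw speech waveform (placeholder)
    language: str  # ISO language code


def transcribe(verse: Verse) -> str:
    """Stub ASR step; a real pipeline would call a multilingual ASR model."""
    return "For God so loved the world"  # placeholder transcription


def romanize(text: str) -> str:
    """Stub Romanization to a language-agnostic Latin-script representation;
    real systems use dedicated transliteration tools (e.g. uroman)."""
    return text.lower()


def classify(text: str) -> str:
    """Stub text classifier (toy keyword rule); a real system would
    fine-tune a multilingual text encoder on the Romanized transcripts."""
    return "love" if "loved" in text else "other"


def cascade_predict(verse: Verse) -> str:
    """CA approach: classification is delegated entirely to the text side,
    so any language the transcriber covers can be classified."""
    return classify(romanize(transcribe(verse)))


print(cascade_predict(Verse(audio=b"", language="eng")))  # -> love
```

The design point the sketch illustrates is that the speech-specific component is confined to `transcribe`; swapping in a different ASR model or Romanizer requires no change to the classifier, which is what makes the Romanized intermediate representation attractive for languages without native ASR support.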