Language-agnostic Age and Gender Classification of Voice using Self-supervised Pre-training

Fredrik Lastow, Edwin Ekberg, P. Nugues
{"title":"Language-agnostic Age and Gender Classification of Voice using Self-supervised Pre-training","authors":"Fredrik Lastow, Edwin Ekberg, P. Nugues","doi":"10.1109/sais55783.2022.9833071","DOIUrl":null,"url":null,"abstract":"Extracting speaker-dependent paralinguistic information out of a person’s voice, provides an opportunity for adaptive behaviour related to speaker information in speech processing applications. For instance, in audio-based conversational applications, adapting responses to the attributes of the correspondent is an integral part in making the conversations effective. Two speaker attributes that humans can estimate quite well, based solely on hearing a person speak, is the gender and age of that person. However, in the field of speech processing, age and gender classification are relatively unexplored tasks, especially in a multilingual setting. In most cases, hand-crafted features, such as MFCCs, have been used with some success. However, recently large transformer networks, utilizing self-supervised pre-training, have shown promise in creating general speech embeddings for various speech processing tasks. We present a baseline for gender and age detection, in both monolingual and multilingual settings, for multiple state-of-the-art speech processing models, fine-tuned for age classification. We created four different datasets with data extracted from the Common Voice project to compare monolingual and multilingual performances. For gender classification, we could reach a macro average F1 score of ~96% in both a monolingual and multilingual setting. For age classification, using classes with a size of 10 years, we obtained a macro average mean absolute class error (MACE) of 0.68 and 0.86 on monolingual and multilingual datasets, respectively. For the English TIMIT dataset, we improve upon the previous state of the art for both age regression and gender classification. Our fine-tuned WavLM model reaches a mean absolute error (MAE) of 4.11 years for males and 4.44 for females in age estimation and our fine-tuned UniSpeech-SAT model reaches an accuracy of 99.8% for gender classification. All the models were deemed fast enough on a GPU to be used in real-time settings, and accurate enough, using only a small amount of speech, to be applicable in multilingual speech processing applications.","PeriodicalId":228143,"journal":{"name":"2022 Swedish Artificial Intelligence Society Workshop (SAIS)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 Swedish Artificial Intelligence Society Workshop (SAIS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/sais55783.2022.9833071","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

Extracting speaker-dependent paralinguistic information from a person’s voice provides an opportunity for adaptive behaviour related to speaker information in speech processing applications. For instance, in audio-based conversational applications, adapting responses to the attributes of the interlocutor is an integral part of making the conversation effective. Two speaker attributes that humans can estimate quite well, based solely on hearing a person speak, are the gender and age of that person. However, in the field of speech processing, age and gender classification are relatively unexplored tasks, especially in a multilingual setting. In most cases, hand-crafted features, such as MFCCs, have been used with some success. Recently, however, large transformer networks utilizing self-supervised pre-training have shown promise in creating general speech embeddings for various speech processing tasks. We present a baseline for gender and age detection, in both monolingual and multilingual settings, for multiple state-of-the-art speech processing models fine-tuned for age classification. We created four datasets with data extracted from the Common Voice project to compare monolingual and multilingual performance. For gender classification, we reach a macro average F1 score of ~96% in both monolingual and multilingual settings. For age classification, using classes spanning 10 years, we obtained a macro average mean absolute class error (MACE) of 0.68 and 0.86 on the monolingual and multilingual datasets, respectively. On the English TIMIT dataset, we improve upon the previous state of the art for both age regression and gender classification: our fine-tuned WavLM model reaches a mean absolute error (MAE) of 4.11 years for males and 4.44 years for females in age estimation, and our fine-tuned UniSpeech-SAT model reaches an accuracy of 99.8% for gender classification. All the models were deemed fast enough on a GPU to be used in real-time settings, and accurate enough, using only a small amount of speech, to be applicable in multilingual speech processing applications.
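
The abstract does not detail the fine-tuning setup, so the following is only a rough sketch of the general approach it describes: attaching a lightweight classification head to a self-supervised speech encoder such as WavLM and fine-tuning it for a speaker attribute such as gender. The Hugging Face checkpoint name, label set, and inference-only usage below are illustrative assumptions, not the authors' exact configuration.

```python
# A minimal sketch, assuming the Hugging Face Transformers library and the publicly
# available "microsoft/wavlm-base-plus" checkpoint; the label set and usage are
# illustrative and not taken from the paper.
import torch
from transformers import AutoFeatureExtractor, WavLMForSequenceClassification

model_name = "microsoft/wavlm-base-plus"  # assumed checkpoint name
labels = ["female", "male"]               # binary gender classification

feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
# The classification head on top of the pretrained encoder is randomly initialized
# here and would be fine-tuned with cross-entropy on labeled speech clips.
model = WavLMForSequenceClassification.from_pretrained(model_name, num_labels=len(labels))
model.eval()

# A 16 kHz mono waveform; one second of silence stands in for a real utterance.
waveform = torch.zeros(16000)
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, num_labels)

print(labels[int(logits.argmax(dim=-1))])
```

In the same spirit, an age classifier would replace the label set with 10-year age classes, and UniSpeech-SAT could presumably be substituted for WavLM by swapping the checkpoint and model class.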