{"title":"Exploring Fusion of the Face and Voice Modalities Using CNN Features for a Better Performance","authors":"A. Akinrinmade, E. Adetiba, J. Badejo, C.O. Lawal","doi":"10.1109/SEB-SDG57117.2023.10124540","DOIUrl":null,"url":null,"abstract":"Face recognition and speaker recognition have gained attention in recent times in the large-scale space. Large-scale face databases such as VGG-Face and VGGFace2 have been created using a semi-automated pipeline and used to develop methods for face recognition achieving state-of-the-art performance. This is also true for speaker recognition with large-scale databases such as VoxCeleb and VoxCeleb2. Howbeit, these two modalities have been treated individually. Although some works have explored the fusion of both modalities, they have played in small-scale space. This work aims at creating a large-scale face and corresponding voice database from YouTube under unconstrained conditions with a size comparable to the earlier mentioned and explores the fusion of both face and voice modalities for recognition in the large-scale space. To this end, a face and corresponding voice database of Nigerians available on YouTube was created for 2,656 Nigerians containing 2,055,169 face images and 195 hours of voice recording using a semi-automated curation pipeline. Convolutional Neural Networks (CNNs) were used to perform face recognition and speaker recognition individually. This was followed by the use of CNN for a combination of both modalities achieving an Equal Error Rate (EER) more than 5 times lower than the best result in the individual cases.","PeriodicalId":185729,"journal":{"name":"2023 International Conference on Science, Engineering and Business for Sustainable Development Goals (SEB-SDG)","volume":"10 2","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 International Conference on Science, Engineering and Business for Sustainable Development Goals (SEB-SDG)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SEB-SDG57117.2023.10124540","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Face recognition and speaker recognition have recently gained attention at large scale. Large-scale face databases such as VGG-Face and VGGFace2 were created using semi-automated pipelines and have been used to develop face recognition methods that achieve state-of-the-art performance. The same is true for speaker recognition, with large-scale databases such as VoxCeleb and VoxCeleb2. However, these two modalities have typically been treated individually, and although some works have explored fusing them, those efforts have been limited to small-scale settings. This work aims to create a large-scale database of faces and corresponding voices from YouTube under unconstrained conditions, comparable in size to the databases mentioned above, and to explore the fusion of the face and voice modalities for recognition at large scale. To this end, a database of faces and corresponding voices of Nigerians available on YouTube was created using a semi-automated curation pipeline, covering 2,656 individuals with 2,055,169 face images and 195 hours of voice recordings. Convolutional Neural Networks (CNNs) were used to perform face recognition and speaker recognition individually. A CNN was then applied to the combination of both modalities, achieving an Equal Error Rate (EER) more than five times lower than the best result obtained with either modality alone.
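The kind of feature-level fusion and EER evaluation described above can be pictured with the minimal sketch below. It assumes face and voice embeddings have already been extracted by pretrained CNNs; the embedding dimensions, the FusionEmbedder projection head, and the equal_error_rate helper are illustrative assumptions for this sketch, not the architecture or evaluation code reported in the paper.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical embedding sizes; the paper does not state exact dimensions here.
FACE_DIM, VOICE_DIM, JOINT_DIM = 512, 512, 256


class FusionEmbedder(nn.Module):
    """Feature-level fusion sketch: concatenate the face and voice CNN
    embeddings of a clip and project them into a joint identity embedding."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(FACE_DIM + VOICE_DIM, JOINT_DIM),
            nn.ReLU(),
            nn.Linear(JOINT_DIM, JOINT_DIM),
        )

    def forward(self, face_emb, voice_emb):
        fused = torch.cat([face_emb, voice_emb], dim=-1)  # concatenate modalities
        return F.normalize(self.proj(fused), dim=-1)      # unit-length joint embedding


def equal_error_rate(scores, labels):
    """EER from verification scores (higher score = more likely same identity)."""
    order = np.argsort(-scores)
    labels = labels[order].astype(float)
    # Sweep the decision threshold from the highest score downward.
    fnr = 1.0 - np.cumsum(labels) / labels.sum()            # false reject rate
    fpr = np.cumsum(1.0 - labels) / (1.0 - labels).sum()    # false accept rate
    idx = np.argmin(np.abs(fnr - fpr))                      # point where the two rates cross
    return (fnr[idx] + fpr[idx]) / 2


if __name__ == "__main__":
    model = FusionEmbedder()
    # Random stand-ins for CNN embeddings of 1,000 verification trial pairs.
    face_a, voice_a = torch.randn(1000, FACE_DIM), torch.randn(1000, VOICE_DIM)
    face_b, voice_b = torch.randn(1000, FACE_DIM), torch.randn(1000, VOICE_DIM)
    labels = np.random.randint(0, 2, size=1000)
    with torch.no_grad():
        # Cosine similarity between the fused embeddings of each pair.
        scores = (model(face_a, voice_a) * model(face_b, voice_b)).sum(-1).numpy()
    print(f"EER on random data: {equal_error_rate(scores, labels):.3f}")
```

On random embeddings the printed EER sits near 0.5. With discriminative face and voice embeddings, the fused scores separate genuine from impostor trials and pull the EER down, which is the kind of improvement over the single-modality systems that the paper reports.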