Exploring Fusion of the Face and Voice Modalities Using CNN Features for a Better Performance

A. Akinrinmade, E. Adetiba, J. Badejo, C.O. Lawal
Published in: 2023 International Conference on Science, Engineering and Business for Sustainable Development Goals (SEB-SDG)
Publication date: 2023-04-05
DOI: 10.1109/SEB-SDG57117.2023.10124540 (https://doi.org/10.1109/SEB-SDG57117.2023.10124540)
Citations: 0

Abstract

Face recognition and speaker recognition have both attracted attention at large scale in recent years. Large-scale face databases such as VGG-Face and VGGFace2 were built with semi-automated pipelines and have been used to develop face recognition methods that achieve state-of-the-art performance. The same holds for speaker recognition, with large-scale databases such as VoxCeleb and VoxCeleb2. However, the two modalities have been treated separately. Some works have explored fusing them, but only at small scale. This work aims to create a large-scale database of faces and corresponding voices from YouTube under unconstrained conditions, comparable in size to the databases mentioned above, and to explore fusion of the face and voice modalities for recognition at large scale. To this end, a database of faces and corresponding voices of Nigerians available on YouTube was created using a semi-automated curation pipeline; it covers 2,656 Nigerians and contains 2,055,169 face images and 195 hours of voice recordings. Convolutional Neural Networks (CNNs) were used to perform face recognition and speaker recognition individually. A CNN was then applied to the combination of both modalities, achieving an Equal Error Rate (EER) more than 5 times lower than the best single-modality result.
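
The abstract reports its results as an Equal Error Rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. As a rough illustration of how that metric can be computed from verification scores, here is a minimal Python sketch. It is not the authors' evaluation code: the function name `equal_error_rate`, the threshold sweep, and the toy scores in the usage note are assumptions made for illustration only.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER from similarity scores (higher = more likely same identity).

    labels: 1 for genuine (same-identity) trials, 0 for impostor trials.
    Returns (eer, threshold). Hypothetical helper, not taken from the paper.
    """
    genuine = [s for s, y in zip(scores, labels) if y == 1]
    impostor = [s for s, y in zip(scores, labels) if y == 0]
    if not genuine or not impostor:
        raise ValueError("need at least one genuine and one impostor trial")

    # Candidate thresholds: every observed score, plus one value above the
    # maximum so that the "reject everything" operating point is also tried.
    candidates = sorted(set(scores))
    candidates.append(candidates[-1] + 1.0)

    best = None
    for t in candidates:
        # False acceptance rate: impostor trials accepted at this threshold.
        far = sum(s >= t for s in impostor) / len(impostor)
        # False rejection rate: genuine trials rejected at this threshold.
        frr = sum(s < t for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # With finite data FAR and FRR rarely cross exactly, so report
            # their mean at the threshold where they are closest.
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]
```

With the made-up scores `[0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]` and labels `[1, 1, 1, 0, 1, 0, 0, 0]`, the function returns an EER of 0.25 at threshold 0.6: one of four impostors is accepted and one of four genuine trials is rejected. Comparing a fused system with single-modality systems then amounts to running the same computation on each system's scores.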