{"title":"Exploring Fusion of the Face and Voice Modalities Using CNN Features for a Better Performance","authors":"A. Akinrinmade, E. Adetiba, J. Badejo, C.O. Lawal","doi":"10.1109/SEB-SDG57117.2023.10124540","DOIUrl":null,"url":null,"abstract":"Face recognition and speaker recognition have gained attention in recent times in the large-scale space. Large-scale face databases such as VGG-Face and VGGFace2 have been created using a semi-automated pipeline and used to develop methods for face recognition achieving state-of-the-art performance. This is also true for speaker recognition with large-scale databases such as VoxCeleb and VoxCeleb2. Howbeit, these two modalities have been treated individually. Although some works have explored the fusion of both modalities, they have played in small-scale space. This work aims at creating a large-scale face and corresponding voice database from YouTube under unconstrained conditions with a size comparable to the earlier mentioned and explores the fusion of both face and voice modalities for recognition in the large-scale space. To this end, a face and corresponding voice database of Nigerians available on YouTube was created for 2,656 Nigerians containing 2,055,169 face images and 195 hours of voice recording using a semi-automated curation pipeline. Convolutional Neural Networks (CNNs) were used to perform face recognition and speaker recognition individually. This was followed by the use of CNN for a combination of both modalities achieving an Equal Error Rate (EER) more than 5 times lower than the best result in the individual cases.","PeriodicalId":185729,"journal":{"name":"2023 International Conference on Science, Engineering and Business for Sustainable Development Goals (SEB-SDG)","volume":"10 2","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 International Conference on Science, Engineering and Business for Sustainable Development Goals (SEB-SDG)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SEB-SDG57117.2023.10124540","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Face recognition and speaker recognition have recently gained attention at large scale. Large-scale face databases such as VGG-Face and VGGFace2 were created using semi-automated pipelines and have been used to develop face recognition methods that achieve state-of-the-art performance. The same is true for speaker recognition, with large-scale databases such as VoxCeleb and VoxCeleb2. However, these two modalities have typically been treated individually, and although some works have explored fusing them, those efforts have been limited to small-scale settings. This work aims to create a large-scale database of faces and corresponding voices from YouTube under unconstrained conditions, comparable in size to the databases mentioned above, and to explore the fusion of the face and voice modalities for recognition at large scale. To this end, a database of faces and corresponding voices of Nigerians available on YouTube was created using a semi-automated curation pipeline, covering 2,656 individuals with 2,055,169 face images and 195 hours of voice recordings. Convolutional Neural Networks (CNNs) were used to perform face recognition and speaker recognition individually. A CNN was then applied to the combination of both modalities, achieving an Equal Error Rate (EER) more than five times lower than the best result obtained with either modality alone.
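The kind of feature-level fusion and EER evaluation described above can be pictured with the minimal sketch below. It assumes face and voice embeddings have already been extracted by pretrained CNNs; the embedding dimensions, the FusionEmbedder projection head, and the equal_error_rate helper are illustrative assumptions for this sketch, not the architecture or evaluation code reported in the paper.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical embedding sizes; the paper does not state exact dimensions here.
FACE_DIM, VOICE_DIM, JOINT_DIM = 512, 512, 256


class FusionEmbedder(nn.Module):
    """Feature-level fusion sketch: concatenate the face and voice CNN
    embeddings of a clip and project them into a joint identity embedding."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(FACE_DIM + VOICE_DIM, JOINT_DIM),
            nn.ReLU(),
            nn.Linear(JOINT_DIM, JOINT_DIM),
        )

    def forward(self, face_emb, voice_emb):
        fused = torch.cat([face_emb, voice_emb], dim=-1)  # concatenate modalities
        return F.normalize(self.proj(fused), dim=-1)      # unit-length joint embedding


def equal_error_rate(scores, labels):
    """EER from verification scores (higher score = more likely same identity)."""
    order = np.argsort(-scores)
    labels = labels[order].astype(float)
    # Sweep the decision threshold from the highest score downward.
    fnr = 1.0 - np.cumsum(labels) / labels.sum()            # false reject rate
    fpr = np.cumsum(1.0 - labels) / (1.0 - labels).sum()    # false accept rate
    idx = np.argmin(np.abs(fnr - fpr))                      # point where the two rates cross
    return (fnr[idx] + fpr[idx]) / 2


if __name__ == "__main__":
    model = FusionEmbedder()
    # Random stand-ins for CNN embeddings of 1,000 verification trial pairs.
    face_a, voice_a = torch.randn(1000, FACE_DIM), torch.randn(1000, VOICE_DIM)
    face_b, voice_b = torch.randn(1000, FACE_DIM), torch.randn(1000, VOICE_DIM)
    labels = np.random.randint(0, 2, size=1000)
    with torch.no_grad():
        # Cosine similarity between the fused embeddings of each pair.
        scores = (model(face_a, voice_a) * model(face_b, voice_b)).sum(-1).numpy()
    print(f"EER on random data: {equal_error_rate(scores, labels):.3f}")
```

On random embeddings the printed EER sits near 0.5. With discriminative face and voice embeddings, the fused scores separate genuine from impostor trials and pull the EER down, which is the kind of improvement over the single-modality systems that the paper reports.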