Authors: A. Akinrinmade, E. Adetiba, J. Badejo, C.O. Lawal
Published in: 2023 International Conference on Science, Engineering and Business for Sustainable Development Goals (SEB-SDG), 2023-04-05
DOI: 10.1109/SEB-SDG57117.2023.10124540 (https://doi.org/10.1109/SEB-SDG57117.2023.10124540)
Exploring Fusion of the Face and Voice Modalities Using CNN Features for a Better Performance
Face recognition and speaker recognition have both attracted recent attention at large scale. Large-scale face databases such as VGG-Face and VGGFace2 were built with semi-automated pipelines and have been used to develop face recognition methods that achieve state-of-the-art performance. The same holds for speaker recognition, with large-scale databases such as VoxCeleb and VoxCeleb2. However, the two modalities have largely been treated separately, and the works that have explored fusing them were limited to small-scale data. This work creates a large-scale database of faces and corresponding voices from YouTube under unconstrained conditions, with a size comparable to the databases mentioned above, and explores the fusion of the face and voice modalities for recognition at large scale. To this end, a semi-automated curation pipeline was used to build a database of 2,656 Nigerians available on YouTube, containing 2,055,169 face images and 195 hours of voice recordings. Convolutional Neural Networks (CNNs) were used to perform face recognition and speaker recognition individually. A CNN was then applied to the combination of both modalities, achieving an Equal Error Rate (EER) more than 5 times lower than the best result obtained with either modality alone.
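For readers unfamiliar with the metric, the Equal Error Rate is the operating point at which the false acceptance rate (impostor trials accepted) equals the false rejection rate (genuine trials rejected). The abstract does not describe the evaluation protocol, so the following is only a minimal sketch of one common way to estimate EER from verification scores. The function name `equal_error_rate`, the convention that higher scores mean a more likely match, and the example scores are all assumptions made for illustration, not details from the paper.

```python
import numpy as np


def equal_error_rate(genuine_scores, impostor_scores):
    """Estimate the EER and its threshold from two sets of match scores.

    Assumption (not from the paper): a higher score means a more likely
    match, and a trial is accepted when score >= threshold.
    """
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)

    # Candidate thresholds: every distinct observed score, sorted ascending.
    thresholds = np.unique(np.concatenate([genuine, impostor]))

    # False rejection rate: fraction of genuine trials scoring below the threshold.
    frr = np.array([(genuine < t).mean() for t in thresholds])
    # False acceptance rate: fraction of impostor trials scoring at or above it.
    far = np.array([(impostor >= t).mean() for t in thresholds])

    # Take the threshold where the two error rates are closest and
    # report their average as the EER estimate.
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]


if __name__ == "__main__":
    # Made-up scores for illustration only.
    eer, threshold = equal_error_rate(
        [0.9, 0.8, 0.3, 0.7],   # genuine (same-identity) trials
        [0.1, 0.2, 0.75, 0.4],  # impostor (different-identity) trials
    )
    print(f"EER = {eer:.2f} at threshold {threshold:.2f}")
```

With these made-up scores, one genuine trial in four is rejected and one impostor trial in four is accepted at a threshold of 0.7, so the estimate is an EER of 0.25. Under this definition, a fused system with an EER "5 times lower" than a single-modality system makes one fifth as many errors at its own equal-error operating point.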