{"title":"Audio-visual occlusion-robust gender recognition and age estimation approach based on multi-task cross-modal attention","authors":"Maxim Markitantov , Elena Ryumina , Alexey Karpov","doi":"10.1016/j.eswa.2025.127473","DOIUrl":null,"url":null,"abstract":"<div><div>Gender recognition and age estimation are essential tasks within soft biometric systems, where identifying these characteristics supports a wide range of applications. In real-world scenarios, challenges such as partial facial occlusion complicate these tasks by obscuring crucial voice and facial characteristics. These challenges highlight the importance of development of robust and efficient approaches for gender recognition and age estimation. In this study, we develop a novel audio-visual Occlusion-Robust GENder recognition and AGE estimation (ORAGEN) approach. The proposed approach is based on intermediate features of unimodal transformer-based models and two Multi-Task Cross-Modal Attention (MTCMA) blocks, which predict gender, age, and protective mask type using voice and facial characteristics. We conduct detailed cross-corpus experiments on the TIMIT, aGender, CommonVoice, LAGENDA, IMDB-Clean, AFEW, VoxCeleb2, and BRAVE-MASKS corpora. The proposed unimodal models outperform State-of-the-Art approaches for gender recognition and age estimation. We investigate the impact of various protective mask types on the performance of audio-visual gender recognition and age estimation. The results show that the current large-scale data are still insufficient for a robust gender recognition and age estimation in partial facial occlusion conditions. On the Test subset of the VoxCeleb2 corpus, the proposed approach showed Unweighted Average Recall (UAR) of 99.51%, Mean Absolute Error (MAE) of 5.42, and UAR of 100% for gender recognition, age estimation, and protective mask type recognition, respectively, while on the Test subset of the BRAVE-MASKS corpus, it showed UAR=96.63%, MAE=7.52, and UAR=95.87%, for the same tasks. These results indicate that using data of people wearing protective masks, as well as including the protective mask type recognition task, yields performance gains on all tasks considered. ORAGEN can be integrated into the OCEAN-AI framework for optimizing Human Resources processes, as well as into expert systems with practical applications in various domains including forensics, healthcare, and industrial safety. We make the source code publicly available at <span><span>https://smil-spcras.github.io/ORAGEN/</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50461,"journal":{"name":"Expert Systems with Applications","volume":"296 ","pages":"Article 127473"},"PeriodicalIF":7.5000,"publicationDate":"2025-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Expert Systems with Applications","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0957417425010954","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
Gender recognition and age estimation are essential tasks within soft biometric systems, where identifying these characteristics supports a wide range of applications. In real-world scenarios, challenges such as partial facial occlusion complicate these tasks by obscuring crucial voice and facial characteristics. These challenges highlight the importance of development of robust and efficient approaches for gender recognition and age estimation. In this study, we develop a novel audio-visual Occlusion-Robust GENder recognition and AGE estimation (ORAGEN) approach. The proposed approach is based on intermediate features of unimodal transformer-based models and two Multi-Task Cross-Modal Attention (MTCMA) blocks, which predict gender, age, and protective mask type using voice and facial characteristics. We conduct detailed cross-corpus experiments on the TIMIT, aGender, CommonVoice, LAGENDA, IMDB-Clean, AFEW, VoxCeleb2, and BRAVE-MASKS corpora. The proposed unimodal models outperform State-of-the-Art approaches for gender recognition and age estimation. We investigate the impact of various protective mask types on the performance of audio-visual gender recognition and age estimation. The results show that the current large-scale data are still insufficient for a robust gender recognition and age estimation in partial facial occlusion conditions. On the Test subset of the VoxCeleb2 corpus, the proposed approach showed Unweighted Average Recall (UAR) of 99.51%, Mean Absolute Error (MAE) of 5.42, and UAR of 100% for gender recognition, age estimation, and protective mask type recognition, respectively, while on the Test subset of the BRAVE-MASKS corpus, it showed UAR=96.63%, MAE=7.52, and UAR=95.87%, for the same tasks. These results indicate that using data of people wearing protective masks, as well as including the protective mask type recognition task, yields performance gains on all tasks considered. ORAGEN can be integrated into the OCEAN-AI framework for optimizing Human Resources processes, as well as into expert systems with practical applications in various domains including forensics, healthcare, and industrial safety. We make the source code publicly available at https://smil-spcras.github.io/ORAGEN/.
期刊介绍:
Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.