Audio-visual occlusion-robust gender recognition and age estimation approach based on multi-task cross-modal attention

IF 7.5 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Expert Systems with Applications | Pub Date: 2025-07-11 | DOI: 10.1016/j.eswa.2025.127473

Maxim Markitantov, Elena Ryumina, Alexey Karpov

Expert Systems with Applications, volume 296, Article 127473. Journal Article. https://www.sciencedirect.com/science/article/pii/S0957417425010954

Citations: 0

Abstract

Gender recognition and age estimation are essential tasks in soft biometric systems, where identifying these characteristics supports a wide range of applications. In real-world scenarios, challenges such as partial facial occlusion complicate these tasks by obscuring crucial voice and facial characteristics. These challenges highlight the importance of developing robust and efficient approaches to gender recognition and age estimation. In this study, we develop a novel audio-visual Occlusion-Robust GENder recognition and AGE estimation (ORAGEN) approach. The proposed approach is based on intermediate features of unimodal transformer-based models and two Multi-Task Cross-Modal Attention (MTCMA) blocks, which predict gender, age, and protective mask type from voice and facial characteristics. We conduct detailed cross-corpus experiments on the TIMIT, aGender, CommonVoice, LAGENDA, IMDB-Clean, AFEW, VoxCeleb2, and BRAVE-MASKS corpora. The proposed unimodal models outperform state-of-the-art approaches to gender recognition and age estimation. We investigate the impact of various protective mask types on the performance of audio-visual gender recognition and age estimation. The results show that current large-scale data are still insufficient for robust gender recognition and age estimation under partial facial occlusion. On the Test subset of the VoxCeleb2 corpus, the proposed approach achieves an Unweighted Average Recall (UAR) of 99.51% for gender recognition, a Mean Absolute Error (MAE) of 5.42 for age estimation, and a UAR of 100% for protective mask type recognition; on the Test subset of the BRAVE-MASKS corpus, it achieves UAR = 96.63%, MAE = 7.52, and UAR = 95.87% for the same tasks. These results indicate that using data of people wearing protective masks, as well as including the protective mask type recognition task, yields performance gains on all tasks considered. ORAGEN can be integrated into the OCEAN-AI framework for optimizing Human Resources processes, as well as into expert systems with practical applications in various domains including forensics, healthcare, and industrial safety. We make the source code publicly available at https://smil-spcras.github.io/ORAGEN/.
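
The two metrics reported in the abstract can be illustrated with a minimal sketch. UAR is the mean of the per-class recalls, so each class counts equally regardless of how many samples it has; MAE is the average absolute difference between true and predicted values. The function names and the toy labels below are assumptions made for illustration only; they are not taken from the ORAGEN source code or from the corpora used in the paper.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; every class counts equally regardless of size."""
    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy, made-up labels (hypothetical; not data from the paper).
gender_true = ["f", "f", "f", "f", "m", "m"]
gender_pred = ["f", "f", "f", "m", "m", "m"]
age_true = [23, 35, 41, 58]
age_pred = [25, 30, 44, 58]

uar = unweighted_average_recall(gender_true, gender_pred)
mae = mean_absolute_error(age_true, age_pred)
print(f"UAR = {uar:.4f}, MAE = {mae:.2f}")  # prints "UAR = 0.8750, MAE = 2.50"
```

In the toy example, plain accuracy would be 5/6, while UAR averages the recall of class "f" (3/4) and class "m" (2/2), which is why UAR is often preferred when class sizes are imbalanced.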

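Source journal

Expert Systems with Applications (Engineering Technology - Engineering: Electrical and Electronic)

CiteScore: 13.80
Self-citation rate: 10.60%
Articles published: 2045
Review time: 8.7 months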

About the journal: Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.