Improving speaker-independent visual language identification using deep neural networks with training batch augmentation

Jacob L. Newman
{"title":"Improving speaker-independent visual language identification using deep neural networks with training batch augmentation","authors":"Jacob L. Newman","doi":"10.1016/j.iswa.2025.200517","DOIUrl":null,"url":null,"abstract":"<div><div>Visual Language Identification (VLID) is concerned with using the appearance and movement of the mouth to determine the identity of spoken language. VLID has applications where conventional audio based approaches are ineffective due to acoustic noise, or where an audio signal is unavailable, such as remote surveillance. The main challenge associated with VLID is the speaker-dependency of image based visual recognition features, which bear little meaningful correspondence between speakers.</div><div>In this work, we examine a novel VLID task using video of 53 individuals reciting the Universal Declaration of Human Rights in their native languages of Arabic, English or Mandarin. We describe a speaker-independent, five fold cross validation experiment, where the task is to discriminate the language spoken in 10 s videos of the mouth. We use the YOLO object detection algorithm to track the mouth through time, and we employ an ensemble of 3D Convolutional and Recurrent Neural Networks for this classification task. We describe a novel approach to the construction of training batches, in which samples are duplicated, then reversed in time to form a <em>distractor</em> class. This method encourages the neural networks to learn the discriminative temporal features of language rather than the identity of individual speakers.</div><div>The maximum accuracy obtained across all three language experiments was 84.64%, demonstrating that the system can distinguish languages to a good degree, from just 10 s of visual speech. A 7.77% improvement on classification accuracy was obtained using our distractor class approach compared to normal batch selection. The use of ensemble classification consistently outperformed the results of individual networks, increasing accuracies by up to 7.27%. In a two language experiment intended to provide a comparison with our previous work, we observed an absolute improvement in classification accuracy of 3.6% (90.01% compared to 83.57%).</div></div>","PeriodicalId":100684,"journal":{"name":"Intelligent Systems with Applications","volume":"26 ","pages":"Article 200517"},"PeriodicalIF":0.0000,"publicationDate":"2025-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Intelligent Systems with Applications","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2667305325000432","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

Visual Language Identification (VLID) is concerned with using the appearance and movement of the mouth to determine the identity of the spoken language. VLID has applications where conventional audio-based approaches are ineffective due to acoustic noise, or where an audio signal is unavailable, such as in remote surveillance. The main challenge associated with VLID is the speaker-dependency of image-based visual recognition features, which bear little meaningful correspondence between speakers.
In this work, we examine a novel VLID task using video of 53 individuals reciting the Universal Declaration of Human Rights in their native languages of Arabic, English or Mandarin. We describe a speaker-independent, five-fold cross-validation experiment, where the task is to discriminate the language spoken in 10 s videos of the mouth. We use the YOLO object detection algorithm to track the mouth through time, and we employ an ensemble of 3D Convolutional and Recurrent Neural Networks for this classification task. We describe a novel approach to the construction of training batches, in which samples are duplicated, then reversed in time to form a distractor class. This method encourages the neural networks to learn the discriminative temporal features of language rather than the identity of individual speakers.
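The batch-augmentation idea can be illustrated with a short sketch. This is a minimal illustration, not the authors' implementation: the use of PyTorch, the assumed tensor layout (batch, frames, channels, height, width), and the choice to assign every time-reversed copy to a single extra class index are assumptions; the abstract states only that training samples are duplicated and reversed in time to form a distractor class.

```python
import torch

def build_batch_with_distractors(clips, labels, num_languages=3):
    """Augment a training batch with time-reversed copies of each clip.

    clips:  float tensor of shape (batch, frames, channels, height, width)
            -- assumed layout; the paper does not specify tensor shapes.
    labels: long tensor of shape (batch,) with language indices
            0 .. num_languages - 1.
    """
    # Reverse each clip along the time (frame) axis.
    reversed_clips = torch.flip(clips, dims=[1])

    # Assumption: every reversed clip shares one extra "distractor" label.
    distractor_labels = torch.full_like(labels, fill_value=num_languages)

    # The augmented batch holds the originals followed by their reversals.
    batch_clips = torch.cat([clips, reversed_clips], dim=0)
    batch_labels = torch.cat([labels, distractor_labels], dim=0)
    return batch_clips, batch_labels
```

The intuition is that a time-reversed clip preserves the speaker's appearance but destroys the language-specific temporal dynamics, so separating it from the genuine language classes pushes the network to rely on those dynamics rather than on speaker identity.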
The maximum accuracy obtained across all three-language experiments was 84.64%, demonstrating that the system can distinguish languages to a good degree from just 10 s of visual speech. A 7.77% improvement in classification accuracy was obtained using our distractor class approach compared to normal batch selection. The use of ensemble classification consistently outperformed the results of individual networks, increasing accuracies by up to 7.27%. In a two-language experiment intended to provide a comparison with our previous work, we observed an absolute improvement in classification accuracy of 3.6% (90.01% compared to 83.57%).
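For the ensemble result, a common fusion rule is to average the per-network class scores and take the argmax; the abstract does not say which fusion rule the authors use, so the mean-softmax combination and the PyTorch interface below are assumptions made purely for illustration.

```python
import torch

def ensemble_predict(models, clip):
    """Fuse several trained networks by averaging their softmax scores.

    models: iterable of networks, each returning logits of shape (1, num_classes).
    clip:   a single pre-processed mouth video tensor with a batch dimension of 1.
    """
    with torch.no_grad():
        probs = [torch.softmax(model(clip), dim=-1) for model in models]
    # Mean of the per-network class probabilities (assumed fusion rule).
    mean_probs = torch.stack(probs).mean(dim=0)
    return mean_probs.argmax(dim=-1)
```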
