Spatio-temporal Weber Gradient Directional feature for visual and audio-visual phrase recognition systems

Salam Nandakishor, Debadatta Pati
International Journal of Information Technology, published 2024-08-12
DOI: 10.1007/s41870-024-02138-9 (https://doi.org/10.1007/s41870-024-02138-9)

Abstract

Visual phrase recognition relies on visual features that capture lip movement, while audio-visual phrase recognition requires both acoustic and visual features. In this work, we propose a novel visual feature, the Spatio-temporal Weber Gradient Directional (SWGD) feature, to effectively represent the micro-patterns of lip movements. The proposed feature is built from three kinds of micro-texture information: local differential excitation, gradient orientation, and gradient directional information. Experiments are conducted on the standard OuluVS database. A polynomial-kernel support vector machine (SVM) classifier is employed, as it provides relatively better performance. SWGD extracted with a video block size of \(2\times 5\times 3\) provides a higher performance of 73.9%. Additionally, we explore twelve distinct local descriptors commonly employed in face recognition and use them for the first time in a comparative study of phrase recognition. SWGD performs better than these twelve features but has a higher dimension of 4320. Reducing the dimension to 100 using the soft locality preserving map (SLPM) improves performance from 73.9% to 81.3%. The dimensionally reduced SWGD (SWGD\(_{\text {SLPM}}\)) outperforms the other state-of-the-art visual features considered in this paper. This shows the benefit of the salient micro-texture information that the proposed feature captures but state-of-the-art features neglect. We observe that the SWGD\(_{\text {SLPM}}\) feature has a high discriminative ability to represent the distinct lip movement patterns of different phrases. The performance of a Mel-frequency cepstral coefficient (MFCC) based audio phrase recognizer degrades as the signal-to-noise ratio decreases. Including the SWGD\(_{\text {SLPM}}\) visual feature and the glottal MFCC (GMFCC) excitation source feature improves performance by 3.6%, reflecting improved noise robustness.
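
The abstract names local differential excitation and gradient orientation as ingredients of SWGD but gives no formulas. The following is a minimal sketch of those two quantities only, assuming the commonly used Weber Local Descriptor definitions on a single grayscale frame with a 3×3 neighbourhood. It is not the authors' implementation: the spatio-temporal extension, the gradient directional component, and the block-wise histogramming are not shown, and the function names are hypothetical.

```python
import math


def differential_excitation(img, eps=1e-6):
    """Weber differential excitation for every interior pixel.

    Assumed definition: arctan(sum_i (x_i - x_c) / x_c) over the 8 neighbours
    x_i of the centre pixel x_c. `img` is a 2-D list of non-negative
    intensities; the one-pixel border is skipped.
    """
    rows, cols = len(img), len(img[0])
    out = []
    for y in range(1, rows - 1):
        row = []
        for x in range(1, cols - 1):
            centre = img[y][x]
            # Sum of intensity differences between the 8 neighbours and the centre.
            diff = sum(
                img[y + dy][x + dx] - centre
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
                if (dy, dx) != (0, 0)
            )
            # eps avoids division by zero for black pixels (an assumption of this sketch).
            row.append(math.atan(diff / (centre + eps)))
        out.append(row)
    return out


def gradient_orientation(img):
    """Gradient orientation in radians for every interior pixel.

    Assumed definition: atan2 of the vertical central difference over the
    horizontal central difference.
    """
    rows, cols = len(img), len(img[0])
    return [
        [
            math.atan2(img[y + 1][x] - img[y - 1][x], img[y][x + 1] - img[y][x - 1])
            for x in range(1, cols - 1)
        ]
        for y in range(1, rows - 1)
    ]


# Toy 3x3 patch with a single brighter centre pixel.
patch = [[1, 1, 1], [1, 2, 1], [1, 1, 1]]
xi = differential_excitation(patch)
theta = gradient_orientation(patch)
```

In a descriptor of this kind, the per-pixel values would typically be quantized and accumulated into histograms per block; how SWGD does this, and how its 4320 dimensions arise, is not stated in the abstract.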

