Spatio-temporal Weber Gradient Directional feature for visual and audio-visual phrase recognition systems
Salam Nandakishor, Debadatta Pati
International Journal of Information Technology, published 2024-08-12
https://doi.org/10.1007/s41870-024-02138-9
Visual phrase recognition relies on visual features that capture lip movement, while audio-visual phrase recognition requires both acoustic and visual features. In this work, we propose a novel visual feature, Spatio-temporal Weber Gradient Directional (SWGD), to effectively represent the micro-patterns of lip movements. The proposed feature is built from three kinds of micro-texture information: local differential excitation, gradient orientation, and gradient directional information. Experiments are conducted on the standard OuluVS database. A polynomial-kernel support vector machine (SVM) classifier is employed, as it provides relatively better performance. SWGD extracted with a \(2\times 5\times 3\) video block size yields a higher performance of 73.9%. Additionally, we explore twelve distinct local descriptors commonly employed in face recognition and use them for the first time in a comparative study of phrase recognition. SWGD performs better than these twelve features but has a higher dimension of 4320. Reducing the dimension to 100 with the soft locality preserving map (SLPM) improves performance from 73.9% to 81.3%. The dimensionally reduced SWGD (SWGD\(_{\text {SLPM}}\)) outperforms the other state-of-the-art visual features considered in this paper. This shows the benefit of the salient micro-texture information that the proposed feature captures but state-of-the-art features neglect. We observe that the SWGD\(_{\text {SLPM}}\) feature has high discriminative ability to represent distinct lip movement patterns for different phrases. The performance of a Mel-frequency cepstral coefficient (MFCC) based audio phrase recognizer degrades as the signal-to-noise ratio decreases. Including the SWGD\(_{\text {SLPM}}\) visual feature and the Glottal MFCC (GMFCC) excitation source feature improves performance by 3.6%, reflecting noise robustness.
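The abstract names the ingredients of SWGD but does not give its formulas. As a rough illustration of two of those ingredients, the sketch below computes a per-pixel differential excitation and gradient orientation on a single grayscale frame, in the style of the well-known Weber Local Descriptor. The function names, the 3×3 neighbourhood, the `eps` guard against division by zero, and the use of `arctan2` are all assumptions made for this example and are not the authors' definition. The spatio-temporal video blocks, the gradient directional coding, and the histogram step are not covered.

```python
import numpy as np


def differential_excitation(frame, eps=1e-6):
    """WLD-style differential excitation for the interior pixels of a 2-D frame.

    Hypothetical helper: for each pixel, sums the intensity differences to its
    8 neighbours, divides by the pixel's own intensity (Weber's law), and
    compresses the ratio with arctan.
    """
    f = np.asarray(frame, dtype=np.float64)
    h, w = f.shape
    center = f[1:-1, 1:-1]
    neigh_sum = np.zeros_like(center)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            # Shifted view of the frame aligned with the interior pixels.
            neigh_sum += f[1 + dy : h - 1 + dy, 1 + dx : w - 1 + dx]
    # Sum over neighbours of (x_i - x_c) equals neigh_sum - 8 * x_c.
    return np.arctan((neigh_sum - 8.0 * center) / (center + eps))


def gradient_orientation(frame):
    """Gradient orientation (radians) from central differences, interior pixels only."""
    f = np.asarray(frame, dtype=np.float64)
    gy = f[2:, 1:-1] - f[:-2, 1:-1]  # vertical intensity difference
    gx = f[1:-1, 2:] - f[1:-1, :-2]  # horizontal intensity difference
    return np.arctan2(gy, gx)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mouth_region = rng.integers(1, 256, size=(8, 10))  # stand-in for a lip image
    xi = differential_excitation(mouth_region)
    theta = gradient_orientation(mouth_region)
    print(xi.shape, theta.shape)
```

In a descriptor of this family, the two maps would typically be quantised and pooled into histograms over local blocks. How SWGD does this across space and time is specified in the paper, not here.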