Speech Recognition Supported by Lip Analysis

W. Butt

Electronic Letters on Computer Vision and Image Analysis, 15(2):30-32, published 2016-11-04. DOI: 10.5565/REV/ELCVIA.953

Abstract

Computers have become more pervasive than ever, with a wide range of devices and multiple ways of interacting with them. Traditional modes of human-computer interaction based on keyboards, mice and display monitors are being replaced by more natural ones such as speech, touch and gesture. The continuous progress of technology is driving an irreversible change in the paradigms of interaction between human and machine. New PCs, tablets and smartphones are moving toward interaction paradigms so advanced that they will soon be completely transparent to users. Among these modes of human-machine interaction, voice recognition is without doubt one of the most widely studied. A number of researchers have shown that a speech-reading system is a beneficial complement to an audio speech recognition system, because it exploits visual cues from the speaker, such as the face, in noisy environments. However, robust and precise extraction of visual features is a challenging problem in object recognition, owing to high variation in pose, lighting and facial makeup. Most existing approaches impose constraints such as reflective markers on the subject's lips, lip movements recorded from a fixed camera position (a head-mounted camera), or lip segmentation under controlled illumination conditions. Furthermore, there is no common consensus on which visual features to select or on their significance for a particular phoneme. Speech is the natural means of communication, and would therefore be an obvious choice for human-computer interaction.
In the past years, developments in technology, combined with a significant reduction in cost, have led to the pervasive use of automatic speech recognition in a variety of systems such as telephony, human-computer interaction and robotics. Visual speech cues are a promising source of speech information, and they are largely unaffected by noisy acoustic conditions and by cross-talk between speakers. Visual information about the speaker, such as the outer mouth area, mouth gestures and facial expressions, is a key component of a visual speech recognition system. The major problem in developing a robust system is finding a precise visual feature extraction method: a listener can misread a speaker when the visual cues are inconsistent with the audio, and these visual features play a major role in the lip-reading process. These observations motivated the development of the present computer speech recognition system.

I propose a speech recognition system that uses face detection, lip extraction and lip tracking, together with several preprocessing techniques, to overcome pose and lighting variation. The proposed approach detects and tracks the face and lips in image sequences and augments them with global facial features to improve recognition performance.

Figure 1. Face, eye and mouth detection from videos: (a) original image, (a1) face detection, (a2) eyes and face; (b) original image, (b1) face detection, (b2) eyes and face.

Figure 2. Results of illumination equalization, teeth filtering and shadow filtering: (a) original, (b) after illumination equalization, (c) teeth filtering, (d) original, (e) shadow filtering.
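The paper does not specify which detector produces the face and mouth boxes of Figure 1. As an illustrative sketch only (the `lower_frac` and `pad_frac` values are my assumptions, not figures from the paper), a mouth region of interest can be derived geometrically from any detector's face box:

```python
def mouth_roi(face, lower_frac=0.33, pad_frac=0.10):
    """Derive a mouth region of interest from a face bounding box.

    Heuristic (assumed, not from the paper): the mouth sits in roughly
    the lower third of the face box; pad_frac trims the sides so the
    ROI stays clear of the cheeks.
    face: (x, y, w, h) in pixels.  Returns (x, y, w, h) of the ROI.
    """
    x, y, w, h = face
    roi_h = int(h * lower_frac)   # height of the mouth band
    pad = int(w * pad_frac)       # horizontal trim on each side
    return (x + pad, y + h - roi_h, w - 2 * pad, roi_h)
```

The face box itself could come from any off-the-shelf detector (e.g. a Haar-cascade or CNN face detector); only the geometry is sketched here.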
The proposed approach consists of four major parts. First, detect and localize the human face and lips and define the lip region of interest in the first frame, as shown in Figure 1. Second, apply three preprocessing steps, namely illumination equalization, teeth detection and shadow removal; these were developed because edge information and global statistical characteristics are sensitive to uneven illumination and susceptible to the complex appearance introduced by teeth and shadow. In contrast, the proposed method, which relies on local region analysis, successfully copes with such complex appearance (e.g. low contrast, shadows, moustaches and teeth) and reaches the high average extraction performance shown in Figure 2. Third, create the contour line (Figure 3(a)) and draw 16 points by splitting the image into four parts, as shown in Figure 3(b), storing the coordinates of these points. This 16-point lip model is implemented in the lip tracking module, which builds a feature vector of 16 points from the boundary lines of the speaker's lips, stores the coordinates of these points, and tracks them during the utterance in every image of the sequence. Finally, track the lip contour and its coordinates in the following frames.

Figure 3. (a) Contour line; (b) the 16 points drawn over the four parts of the image; open and closed mouth states.

Extensive experiments show encouraging results and the effectiveness of the proposed method in comparison with existing methods. The proposed approach has also been evaluated by testing the system on noisy real-world facial image sequences. Experiments have shown that detecting outliers and better predicting regions of interest can further
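The 16-point lip model of Figure 3 is described only at a high level. One plausible reconstruction (my illustration, not the paper's exact algorithm) is to cast 16 evenly spaced rays from the centroid of a segmented lip mask and keep the outermost lip pixel on each ray:

```python
import numpy as np

def lip_points_16(mask):
    """Sample a 16-point boundary model from a binary lip mask.

    Illustrative reconstruction: walk 16 rays outward from the mask
    centroid and record the last pixel on each ray that still lies
    inside the mask.  Assumes the mask is non-empty and roughly convex.
    Returns a list of 16 (row, col) tuples.
    """
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()
    points = []
    for a in np.linspace(0, 2 * np.pi, 16, endpoint=False):
        r = 0.0
        last = (int(round(cy)), int(round(cx)))
        while True:
            ny = int(round(cy + r * np.sin(a)))
            nx = int(round(cx + r * np.cos(a)))
            inside = 0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
            if not inside or not mask[ny, nx]:
                break
            last = (ny, nx)   # outermost in-mask pixel seen so far
            r += 1.0
        points.append(last)
    return points
```

In the paper the points are obtained by splitting the ROI into four parts; the ray-casting above is merely one self-contained way to produce a comparable 16-point contour from a segmentation.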
reduce the number of frames with localization or tracking failures. Figure 4 shows the complete procedure of the speech recognition system applied to the first frame of the video.

Figure 4. Complete results for a single image produced by the proposed speech recognition system: original image, illumination equalization, grey scale, split 1, split 2, smoothing, 4 points, ellipse fitting, normalization, new ROI, graph, lip points.
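How the 16 coordinates are propagated from frame to frame is likewise not detailed in this summary. A minimal, purely illustrative scheme (`update_points` and its nearest-boundary rule are my assumptions, not the paper's tracker) is to re-segment the lips in each new frame and snap every previous point to the nearest boundary pixel of the new mask:

```python
import numpy as np

def update_points(prev_pts, mask):
    """Snap tracked lip points onto the lip mask of the current frame.

    Illustrative tracker, not the paper's method: each previous
    (row, col) point moves to the nearest boundary pixel of the new
    binary mask.  Assumes the mask is non-empty.
    """
    m = mask.astype(bool)
    # boundary = mask pixels with at least one non-mask 4-neighbour
    p = np.pad(m, 1)
    interior = p[2:, 1:-1] & p[:-2, 1:-1] & p[1:-1, 2:] & p[1:-1, :-2]
    by, bx = np.nonzero(m & ~interior)
    out = []
    for (y, x) in prev_pts:
        d2 = (by - y) ** 2 + (bx - x) ** 2   # squared distances
        j = int(np.argmin(d2))
        out.append((int(by[j]), int(bx[j])))
    return out
```

A real implementation would add the outlier detection and ROI prediction mentioned above to survive frames where the segmentation fails.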