A Lip Reading Model Using CNN with Batch Normalization

Saquib Nadeem Hashmi, Harsh Gupta, Dhruv Mittal, Kaushtubh Kumar, Aparajita Nanda, Sarishty Gupta

2018 Eleventh International Conference on Contemporary Computing (IC3), August 2018. DOI: 10.1109/IC3.2018.8530509
The goal of lip reading is to decode and analyze a speaker's lip movements for a spoken word or phrase. Variation in speaking speed and intensity, and identical lip sequences for distinct characters, have been the challenging aspects of lip reading. In this paper we present a lip reading model for audio-less video data with variable-length frame sequences. First, we extract the lip region from each face image in the video sequence and concatenate the regions to form a single image. Next, we design a twelve-layer Convolutional Neural Network with two layers of batch normalization to train the model and extract visual features end-to-end. Batch normalization helps reduce the internal and external variance in attributes such as the speaker's accent, lighting and image-frame quality, speaking pace, and speaking posture. We validate the performance of our model on the standard audio-less video MIRACL-VC1 dataset and compare it with an existing model that uses 16 or more CNN layers. A training accuracy of 96% and a validation accuracy of 52.9% have been attained with the proposed lip reading model.
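The abstract's key ingredient, batch normalization, standardizes each feature over a mini-batch before a learnable scale and shift are applied, which is what dampens nuisance variation such as lighting or speaker pace. The sketch below is a minimal NumPy illustration of the general technique (training-time forward pass only), not the authors' implementation; the scalar `gamma`/`beta` parameters are a simplification.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a mini-batch per feature, then scale and shift.

    x: array of shape (batch, features).
    gamma, beta: learnable scale and shift (scalars here for simplicity).
    eps: small constant to avoid division by zero.
    """
    mean = x.mean(axis=0)                       # per-feature batch mean
    var = x.var(axis=0)                         # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)     # standardized activations
    return gamma * x_hat + beta

# Toy mini-batch: two features on very different scales,
# e.g. raw pixel intensities vs. a normalized lip coordinate.
batch = np.array([[200.0, 0.1],
                  [150.0, 0.3],
                  [250.0, 0.2]])
out = batch_norm(batch)
print(out.mean(axis=0))  # per-feature mean, close to 0
print(out.std(axis=0))   # per-feature std, close to 1
```

After normalization both features live on a comparable scale, so downstream convolutional layers see inputs whose distribution does not drift with the recording conditions of a particular speaker or video.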