Vision based Lip Reading System using Deep Learning
Nikita Deshmukh, Anamika Ahire, S. Bhandari, Apurva Mali, Kalyani Warkari
2021 International Conference on Computing, Communication and Green Engineering (CCGE), published 2021-09-23
DOI: 10.1109/CCGE50943.2021.9776430
Lip reading is an approach to understanding speech by visually interpreting lip movements. A vision-based lip reading system takes as input a video (without audio) of a person speaking a word or phrase and outputs the predicted word or phrase. This paper presents a method for a vision-based lip reading system that uses a convolutional neural network (CNN) with an attention-based Long Short-Term Memory (LSTM) network. The dataset consists of video clips of speakers pronouncing single digits. A pretrained CNN extracts features from the pre-processed video frames, which are then passed to the LSTM to learn temporal characteristics. The softmax layer of the architecture produces the lip reading result. In the present work, experiments are performed with two pretrained models, VGG19 and ResNet50, and the results are compared. Ensemble learning is also used to further improve the performance of the system. The system achieves 85% accuracy using ResNet50 with ensemble learning.
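The abstract does not specify how the ensemble is formed; a common choice, sketched below under that assumption, is to average the softmax probabilities of the two backbone models (here, hypothetical per-clip logits standing in for the VGG19- and ResNet50-based classifiers) over the 10 digit classes and take the argmax.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical logits for one video clip over the 10 digit classes (0-9);
# in the paper these would come from the VGG19- and ResNet50-based models.
vgg19_logits = np.array([0.1, 0.2, 0.1, 2.5, 0.3, 0.1, 0.0, 0.2, 0.1, 0.4])
resnet50_logits = np.array([0.0, 0.3, 0.2, 2.1, 0.1, 0.2, 0.1, 0.0, 0.3, 0.2])

# Ensemble by averaging the two models' class probabilities.
probs = (softmax(vgg19_logits) + softmax(resnet50_logits)) / 2
pred = int(probs.argmax())  # predicted digit for this clip
```

Averaging probabilities (rather than hard votes) lets a confident model outweigh an uncertain one, which is one plausible reason an ensemble can exceed either backbone alone.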