Lip Reading Bengali Words
M. Rahman, Md Rashad Tanjim, S. Hasan, Sayeed Md. Shaiban, Mohammad Ashrafuzzaman Khan
Proceedings of the 2022 5th International Conference on Algorithms, Computing and Artificial Intelligence. Published 2022-12-23. DOI: 10.1145/3579654.3579677 (https://doi.org/10.1145/3579654.3579677)
This work aims to lip-read Bengali words from talking faces without using audio. Lip reading of English words and sentences is well explored in the literature. However, to our knowledge, we are the first to explore it for Bengali, a language spoken by about 272 million people in South Asia [7]. We used a CNN to extract features from the video frames in sequence and fed those features to a bidirectional LSTM network followed by a classifier, training the entire network end-to-end. We investigated how different types of convolution operations affect feature extraction: convolution with filters of multiple scales in a single stage (Inception [24]), depthwise and pointwise convolution (MobileNet [25]), conventional CNNs (VGG16 [26], ResNet [17], DenseNet [27], ResNeXt [28]), and a custom CNN. For Bengali word lip reading, MobileNet [25] as the CNN, followed by a bidirectional LSTM and a classifier, achieved the highest accuracy, 84.75%. Moreover, we found that longer words have better detection rates than shorter ones regardless of the type of convolution used.
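To illustrate why depthwise and pointwise convolution (the MobileNet-style operation named in the abstract) is lighter than a conventional convolution, the sketch below compares weight counts for a single layer. The function names and the layer sizes (3×3 kernel, 64 input channels, 128 output channels) are assumptions chosen for illustration and are not taken from the paper; biases are ignored.

```python
def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a standard k x k convolution (bias omitted)."""
    # Every output channel has its own k x k filter over all input channels.
    return k * k * c_in * c_out


def depthwise_separable_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, used only to make the comparison concrete.
k, c_in, c_out = 3, 64, 128
std = standard_conv_params(k, c_in, c_out)
sep = depthwise_separable_params(k, c_in, c_out)
print(f"standard: {std}, depthwise separable: {sep}, ratio: {std / sep:.2f}")
```

For these assumed sizes the separable layer needs roughly an eighth of the weights. This is only a parameter-count argument; it does not by itself explain the accuracy ranking reported in the abstract.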