{"title":"基于深度循环架构的视障场景描述生成器","authors":"Aviral Chharia, Rahul Upadhyay","doi":"10.1109/ICUMT51630.2020.9222441","DOIUrl":null,"url":null,"abstract":"Vision is the most essential sense for human beings. But today, more than 2.2 billion people worldwide suffer from some form of vision impairment. This paper presents an end-to-end human-centric model for aiding the visually impaired by employing the deep recurrent architecture of the start-of-the-art image captioning models. A VGG-16 net convolutional neural network (CNN) is used to extract feature vectors from real-time video (image frames) and an long short-term memory (LSTM) network is employed to generate captions from these feature vectors. The model is tested on the Flickr 8K Dataset, one of the most popularly used image captioning dataset which contains over 8000 images. On real-time videos, the model generates rich descriptive captions which are converted to audio for a visually impaired person to listen. Comprehensively the model generates promising results which has great potential to enhance the lives of the visually impaired people by assisting them to get a better understanding of their surroundings.","PeriodicalId":170847,"journal":{"name":"2020 12th International Congress on Ultra Modern Telecommunications and Control Systems and Workshops (ICUMT)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":"{\"title\":\"Deep Recurrent Architecture based Scene Description Generator for Visually Impaired\",\"authors\":\"Aviral Chharia, Rahul Upadhyay\",\"doi\":\"10.1109/ICUMT51630.2020.9222441\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Vision is the most essential sense for human beings. But today, more than 2.2 billion people worldwide suffer from some form of vision impairment. This paper presents an end-to-end human-centric model for aiding the visually impaired by employing the deep recurrent architecture of the start-of-the-art image captioning models. A VGG-16 net convolutional neural network (CNN) is used to extract feature vectors from real-time video (image frames) and an long short-term memory (LSTM) network is employed to generate captions from these feature vectors. The model is tested on the Flickr 8K Dataset, one of the most popularly used image captioning dataset which contains over 8000 images. On real-time videos, the model generates rich descriptive captions which are converted to audio for a visually impaired person to listen. 
Comprehensively the model generates promising results which has great potential to enhance the lives of the visually impaired people by assisting them to get a better understanding of their surroundings.\",\"PeriodicalId\":170847,\"journal\":{\"name\":\"2020 12th International Congress on Ultra Modern Telecommunications and Control Systems and Workshops (ICUMT)\",\"volume\":\"9 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"9\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 12th International Congress on Ultra Modern Telecommunications and Control Systems and Workshops (ICUMT)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICUMT51630.2020.9222441\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 12th International Congress on Ultra Modern Telecommunications and Control Systems and Workshops (ICUMT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICUMT51630.2020.9222441","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Vision is the most essential sense for human beings, yet today more than 2.2 billion people worldwide live with some form of vision impairment. This paper presents an end-to-end, human-centric model for aiding the visually impaired that employs the deep recurrent architecture of state-of-the-art image captioning models. A VGG-16 convolutional neural network (CNN) extracts feature vectors from real-time video (image frames), and a long short-term memory (LSTM) network generates captions from these feature vectors. The model is tested on the Flickr 8K Dataset, one of the most widely used image captioning datasets, which contains over 8,000 images. On real-time videos, the model generates rich descriptive captions that are converted to audio for a visually impaired person to listen to. Overall, the model produces promising results and has great potential to enhance the lives of visually impaired people by helping them better understand their surroundings.
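Below is a minimal sketch, in Keras/TensorFlow, of the CNN-encoder plus LSTM-decoder pipeline the abstract describes: VGG-16 with its classification head removed turns a frame into a 4096-dimensional feature vector, and an LSTM-based decoder conditioned on that vector predicts the caption one word at a time. The vocabulary size, maximum caption length, embedding width, and the merge-style decoder layout are illustrative assumptions, not values reported in the paper.

```python
# Sketch of a VGG-16 + LSTM captioning model (assumed hyperparameters,
# not the paper's exact configuration).
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, Dropout, add
from tensorflow.keras.models import Model

VOCAB_SIZE = 8000   # assumed vocabulary size after tokenizing Flickr 8K captions
MAX_LEN = 34        # assumed maximum caption length in tokens
EMBED_DIM = 256     # assumed embedding / hidden width

# --- Encoder: VGG-16 without the final classification layer -------------
# The 4096-d activations of the second fully connected layer ("fc2")
# serve as the feature vector for each video frame.
base = VGG16(weights="imagenet")
encoder = Model(inputs=base.input, outputs=base.get_layer("fc2").output)

def extract_features(frame_rgb):
    """Return a (1, 4096) feature vector for one RGB frame (any size)."""
    x = tf.image.resize(frame_rgb, (224, 224))
    x = preprocess_input(np.expand_dims(np.array(x, dtype=np.float32), axis=0))
    return encoder.predict(x, verbose=0)

# --- Decoder: LSTM language model conditioned on the image feature ------
img_in = Input(shape=(4096,))
img_vec = Dense(EMBED_DIM, activation="relu")(Dropout(0.5)(img_in))

seq_in = Input(shape=(MAX_LEN,))                      # token ids of the caption so far
seq_emb = Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(seq_in)
seq_vec = LSTM(EMBED_DIM)(Dropout(0.5)(seq_emb))

merged = Dense(EMBED_DIM, activation="relu")(add([img_vec, seq_vec]))
out = Dense(VOCAB_SIZE, activation="softmax")(merged)  # next-word distribution

captioner = Model(inputs=[img_in, seq_in], outputs=out)
captioner.compile(loss="categorical_crossentropy", optimizer="adam")
captioner.summary()
```

At inference time the decoder would be run token by token (greedy or beam search) from a start token until an end token is produced, and the resulting caption text would then be passed to a text-to-speech engine to produce the audio description mentioned in the abstract.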