{"title":"Vision 360: Image Caption Generation Using Encoder-Decoder Model","authors":"Ankita Kumari, A. Chauhan, Abhishek Singhal","doi":"10.1109/confluence52989.2022.9734167","DOIUrl":null,"url":null,"abstract":"Vision360 incorporates three features in itself Image elaboration, speech to text, text to speech. The Main feature is Image Caption Generation, i.e., not only it is responsible for Image Segmentation, Object classification but it also establishes a relation between the objects classified that too with a logical relation that somehow gives the human vibe. Encoder-decoder model has been used. CNN has been used for Image and LSTM has been used for text. The paper also demonstrates the integration InceptionV3 model. Vision360 is a way of providing aid to blind people or partially blind people. It’s a way to bring convenience in their proximity in a single touch. It tries to bridge the gap that they have been feeling all along while walking on the same path with different people. A task to describe an Image is not very hard but if we want to automate this task of depicting something from an image and make the machine do it, it’ll be nearly impossible, even if the new researches have been made and feature extraction is attainable. Logically establishing semantically and syntactically correct sentences is still a hard task to accomplish. We used encoder-decoder model for parallel training of Image and text data, and used InceptionV3 for extracting feature vector. We evaluated our result on BLEU score metric and the model achieved BLEU score in-range of 0.70 to 0.78 for various images in the validation set.","PeriodicalId":261941,"journal":{"name":"2022 12th International Conference on Cloud Computing, Data Science & Engineering (Confluence)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 12th International Conference on Cloud Computing, Data Science & Engineering (Confluence)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/confluence52989.2022.9734167","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Vision360 incorporates three features: image captioning (image elaboration), speech-to-text, and text-to-speech. The main feature is image caption generation: the system not only performs image segmentation and object classification, but also establishes logical relations between the classified objects so that the resulting description reads naturally, much as a human would describe the scene. An encoder-decoder model is used, with a CNN for the image and an LSTM for the text; the paper also demonstrates the integration of the InceptionV3 model. Vision360 is intended as an aid for blind or partially sighted people, bringing convenience within their reach at a single touch and helping bridge the gap they experience while walking the same path as everyone else. Describing an image is not a hard task for a human, but automating it and making a machine depict what an image shows remains very difficult, even though recent research has made feature extraction attainable; generating semantically and syntactically correct sentences is still a hard problem. We used an encoder-decoder model for parallel training on image and text data, and InceptionV3 to extract the feature vector. We evaluated the results with the BLEU metric, and the model achieved BLEU scores in the range of 0.70 to 0.78 for various images in the validation set.
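The architecture described in the abstract, an InceptionV3 image encoder whose feature vector feeds an LSTM text decoder, can be sketched as below. This is a minimal illustrative sketch, not the authors' implementation: the vocabulary size, caption length, embedding dimension, and the specific Keras layers used to merge image and text features are all assumptions.

```python
# Minimal sketch of an InceptionV3 (CNN) encoder + LSTM decoder captioning model.
# VOCAB_SIZE, MAX_LEN, and EMBED_DIM are illustrative assumptions, not values from the paper.
import tensorflow as tf
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.applications.inception_v3 import preprocess_input
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

VOCAB_SIZE = 5000   # assumed vocabulary size
MAX_LEN = 34        # assumed maximum caption length (in tokens)
EMBED_DIM = 256     # assumed embedding / hidden size

# Encoder: InceptionV3 without its classification head; global average pooling
# yields a 2048-d feature vector per image.
cnn = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

def extract_features(image_batch):
    """image_batch: float array of shape (N, 299, 299, 3) with pixel values in [0, 255]."""
    return cnn.predict(preprocess_input(image_batch))

# Decoder: the image feature vector and the partial caption are projected to a
# common dimension, merged, and used to predict the next word.
img_input = Input(shape=(2048,))
img_dense = Dense(EMBED_DIM, activation="relu")(Dropout(0.5)(img_input))

seq_input = Input(shape=(MAX_LEN,))
seq_embed = Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(seq_input)
seq_lstm = LSTM(EMBED_DIM)(Dropout(0.5)(seq_embed))

merged = add([img_dense, seq_lstm])
hidden = Dense(EMBED_DIM, activation="relu")(merged)
output = Dense(VOCAB_SIZE, activation="softmax")(hidden)

caption_model = Model(inputs=[img_input, seq_input], outputs=output)
caption_model.compile(loss="categorical_crossentropy", optimizer="adam")
```

At inference time the decoder is run one step at a time, feeding each predicted token back into the sequence input until an end token or MAX_LEN is reached; generated captions can then be scored against reference captions with a BLEU implementation such as nltk.translate.bleu_score.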