{"title":"Egocentric Scene Description for the Blind and Visually Impaired","authors":"Khadidja Delloul, S. Larabi","doi":"10.1109/ISIA55826.2022.9993531","DOIUrl":null,"url":null,"abstract":"Image captioning methods come short when being used to describe scenes for the blind and visually impaired, because not only do they focus exclusively on salient objects, eliminating background and surrounding information, but they also do not offer egocentric positional descriptions of objects regarding the users, failing by that to give them the opportunity to understand and rebuild the scenes they are in. Furthermore, the majority of solutions neglect depth information, and models are trained solely on 2D (RGB) images, leading to less accurate prepositions and words or phrases' order. In this paper, we will offer the blind and visually impaired more descriptive captions for almost every region present in the image by the use of DenseCap model. Our contribution lies within the use of depth information that will be estimated by AdaBins model in order to enrich captions with positional information regarding the users, helping them understand their surroundings.","PeriodicalId":169898,"journal":{"name":"2022 5th International Symposium on Informatics and its Applications (ISIA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 5th International Symposium on Informatics and its Applications (ISIA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISIA55826.2022.9993531","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3
Abstract
Image captioning methods fall short when used to describe scenes for the blind and visually impaired: not only do they focus exclusively on salient objects, discarding background and surrounding information, but they also do not offer egocentric positional descriptions of objects relative to the user, thereby failing to give users the opportunity to understand and mentally reconstruct the scenes they are in. Furthermore, most solutions neglect depth information, and models are trained solely on 2D (RGB) images, leading to less accurate prepositions and word and phrase order. In this paper, we offer the blind and visually impaired more descriptive captions for almost every region of the image by using the DenseCap model. Our contribution lies in the use of depth information, estimated with the AdaBins model, to enrich captions with positional information relative to the user, helping them understand their surroundings.
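To make the pipeline described above concrete, the following is a minimal, hypothetical sketch of the enrichment step: it takes region captions with bounding boxes (standing in for DenseCap outputs) and a per-pixel depth map (standing in for an AdaBins prediction) and appends egocentric position and distance phrases to each caption. The function names, phrase wording, and distance thresholds are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch: enrich dense region captions with egocentric
# positional phrases derived from a depth map. Inputs stand in for
# DenseCap region captions/boxes and an AdaBins depth prediction;
# thresholds and wording are assumptions for illustration only.
import numpy as np

def horizontal_phrase(box, image_width):
    """Map the box centre to 'on your left' / 'in front of you' / 'on your right'."""
    x_centre = (box[0] + box[2]) / 2.0
    if x_centre < image_width / 3:
        return "on your left"
    if x_centre > 2 * image_width / 3:
        return "on your right"
    return "in front of you"

def distance_phrase(box, depth_map):
    """Describe how far the region is, using the median depth inside the box."""
    x1, y1, x2, y2 = (int(v) for v in box)
    region_depth = float(np.median(depth_map[y1:y2, x1:x2]))
    if region_depth < 2.0:          # metres, assumed threshold
        return "very close to you"
    if region_depth < 5.0:          # metres, assumed threshold
        return "a few metres away"
    return "far from you"

def enrich_captions(captions, boxes, depth_map):
    """Append egocentric position and distance to each region caption."""
    _, width = depth_map.shape
    return [
        f"{caption}, {horizontal_phrase(box, width)}, {distance_phrase(box, depth_map)}"
        for caption, box in zip(captions, boxes)
    ]

if __name__ == "__main__":
    # Toy inputs: two region captions with boxes (x1, y1, x2, y2) and a synthetic depth map.
    depth = np.full((480, 640), 6.0)
    depth[200:400, 0:200] = 1.5      # a nearby region on the left of the frame
    captions = ["a wooden chair", "a white door"]
    boxes = [(20, 200, 180, 400), (450, 50, 620, 420)]
    for line in enrich_captions(captions, boxes, depth):
        print(line)
```

Run on the toy inputs, this prints captions such as "a wooden chair, on your left, very close to you", which is the general form of egocentric description the abstract aims for; the actual system's spatial vocabulary and depth handling may differ.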