{"title":"语言和视觉的自动多模态处理,以帮助有视觉障碍的人","authors":"","doi":"10.52591/lxai202207104","DOIUrl":null,"url":null,"abstract":"In recent years, the study of the intersection between vision and language modalities, specifically in visual question answering (VQA) models, has gained significant appeal due to its great potential in assistive applications for people with visual disabilities. Despite this, to date, many of the existing VQA models are nor applicable to this goal for at least three reasons. To begin with, they are designed to respond to a single question. That is, they are not able to give feedback to incomplete or incremental questions. Secondly, they only consider a single image which is neither blurred, nor poorly focused, nor poorly framed. All these problems are directly related to the loss of the visual capacity. People with visual disabilities may have trouble interacting with a visual user interface for asking questions and for taking adequate photographs. They also frequently need to read text captured by the images, and most current VQA systems fall short in this task. This work presents a PhD proposal with four lines of research that will be carried out until December 2025. It investigates techniques that increase the robustness of the VQA models. In particular we propose the integration of dialogue history, the analysis of more than one input image, and the incorporation of text recognition capabilities to the models. All of these contributions are motivated to assist people with vision problems with their day-to-day tasks.","PeriodicalId":350984,"journal":{"name":"LatinX in AI at North American Chapter of the Association for Computational Linguistics Conference 2022","volume":"19 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Automatic multi-modal processing of language and vision to assist people with visual impairments\",\"authors\":\"\",\"doi\":\"10.52591/lxai202207104\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In recent years, the study of the intersection between vision and language modalities, specifically in visual question answering (VQA) models, has gained significant appeal due to its great potential in assistive applications for people with visual disabilities. Despite this, to date, many of the existing VQA models are nor applicable to this goal for at least three reasons. To begin with, they are designed to respond to a single question. That is, they are not able to give feedback to incomplete or incremental questions. Secondly, they only consider a single image which is neither blurred, nor poorly focused, nor poorly framed. All these problems are directly related to the loss of the visual capacity. People with visual disabilities may have trouble interacting with a visual user interface for asking questions and for taking adequate photographs. They also frequently need to read text captured by the images, and most current VQA systems fall short in this task. This work presents a PhD proposal with four lines of research that will be carried out until December 2025. It investigates techniques that increase the robustness of the VQA models. In particular we propose the integration of dialogue history, the analysis of more than one input image, and the incorporation of text recognition capabilities to the models. 
All of these contributions are motivated to assist people with vision problems with their day-to-day tasks.\",\"PeriodicalId\":350984,\"journal\":{\"name\":\"LatinX in AI at North American Chapter of the Association for Computational Linguistics Conference 2022\",\"volume\":\"19 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-07-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"LatinX in AI at North American Chapter of the Association for Computational Linguistics Conference 2022\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.52591/lxai202207104\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"LatinX in AI at North American Chapter of the Association for Computational Linguistics Conference 2022","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.52591/lxai202207104","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Automatic multi-modal processing of language and vision to assist people with visual impairments
In recent years, the study of the intersection between the vision and language modalities, specifically in visual question answering (VQA) models, has gained significant appeal due to its great potential in assistive applications for people with visual disabilities. Despite this, to date, many existing VQA models are not applicable to this goal for at least three reasons. First, they are designed to respond to a single, self-contained question; that is, they cannot give feedback on incomplete or incremental questions. Second, they assume a single input image that is neither blurred, poorly focused, nor poorly framed. Both problems are directly related to the loss of visual capacity: people with visual disabilities may have trouble interacting with a visual user interface to ask questions and to take adequate photographs. Third, they frequently need to read text captured in the images, and most current VQA systems fall short on this task. This work presents a PhD proposal with four lines of research that will be carried out until December 2025. It investigates techniques that increase the robustness of VQA models. In particular, we propose the integration of dialogue history, the analysis of more than one input image, and the incorporation of text recognition capabilities into the models. All of these contributions are aimed at assisting people with vision problems in their day-to-day tasks.
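As a rough illustration of how the three proposed extensions could fit together, the sketch below serializes dialogue history, several input images, and recognized text into a single request for a VQA backend. This is only a minimal outline under our own assumptions, not the authors' implementation: the names DialogueTurn, VQARequest, and build_model_input are hypothetical, and visual feature fusion and answer generation are deliberately left out.

```python
# Minimal sketch (hypothetical names, not the proposal's actual system):
# bundling dialogue history, multiple images, and OCR text into one VQA request.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DialogueTurn:
    question: str          # possibly incomplete or incremental
    answer: Optional[str]  # None while the turn is still open


@dataclass
class VQARequest:
    images: List[bytes]   # several shots of the same scene, possibly blurry or badly framed
    ocr_text: List[str]   # text recognized in the images (labels, signs, expiry dates)
    history: List[DialogueTurn] = field(default_factory=list)
    current_question: str = ""


def build_model_input(req: VQARequest) -> str:
    """Flatten history, OCR text, and the current question into a single prompt.

    A real system would also fuse image features; here we only show how the
    extra context described in the abstract could be serialized for a model.
    """
    lines = []
    for turn in req.history:
        lines.append(f"Q: {turn.question}")
        if turn.answer is not None:
            lines.append(f"A: {turn.answer}")
    if req.ocr_text:
        lines.append("Text visible in the images: " + "; ".join(req.ocr_text))
    lines.append(f"Q: {req.current_question}")
    return "\n".join(lines)


if __name__ == "__main__":
    req = VQARequest(
        images=[b"<jpeg bytes 1>", b"<jpeg bytes 2>"],
        ocr_text=["Best before 2024-05-01"],
        history=[DialogueTurn("What is this?", "A can of soup.")],
        current_question="When does it expire?",
    )
    print(build_model_input(req))
```

In this toy example, the follow-up question "When does it expire?" is only answerable because the earlier turn and the OCR output are carried along with it, which is the kind of robustness the proposal targets.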