{"title":"MobiVQA:高效的设备上可视化问答","authors":"Qingqing Cao","doi":"10.1145/3534619","DOIUrl":null,"url":null,"abstract":"Visual Question Answering (VQA) is a relatively new task where a user can ask a natural question about an image and obtain an answer. VQA is useful for many applications and is widely popular for users with visual impairments. Our goal is to design a VQA application that works efficiently on mobile devices without requiring cloud support. Such a system will allow users to ask visual questions privately, without having to send their questions to the cloud, while also reduce cloud communication costs. However, existing VQA applications use deep learning models that significantly improve accuracy, but is computationally heavy. Unfortunately, existing techniques that optimize deep learning for mobile devices cannot be applied for VQA because the VQA task is multi-modal—it requires both processing vision and text data. Existing mobile optimizations that work for vision-only or text-only neural networks cannot be applied here because of the dependencies between the two modes. Instead, we design MobiVQA, a set of optimizations that leverage the multi-modal nature of VQA. We show using extensive evaluation on two VQA testbeds and two mobile platforms, that MobiVQA significantly improves latency and energy with minimal accuracy loss compared to state-of-the-art VQA models. For instance, MobiVQA can answer a visual question in 163 milliseconds on the phone, compared to over 20-second latency incurred by the most accurate state-of-the-art model, while incurring less than 1 point reduction in accuracy.","PeriodicalId":20463,"journal":{"name":"Proc. ACM Interact. Mob. Wearable Ubiquitous Technol.","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"MobiVQA: Efficient On-Device Visual Question Answering\",\"authors\":\"Qingqing Cao\",\"doi\":\"10.1145/3534619\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Visual Question Answering (VQA) is a relatively new task where a user can ask a natural question about an image and obtain an answer. VQA is useful for many applications and is widely popular for users with visual impairments. Our goal is to design a VQA application that works efficiently on mobile devices without requiring cloud support. Such a system will allow users to ask visual questions privately, without having to send their questions to the cloud, while also reduce cloud communication costs. However, existing VQA applications use deep learning models that significantly improve accuracy, but is computationally heavy. Unfortunately, existing techniques that optimize deep learning for mobile devices cannot be applied for VQA because the VQA task is multi-modal—it requires both processing vision and text data. Existing mobile optimizations that work for vision-only or text-only neural networks cannot be applied here because of the dependencies between the two modes. Instead, we design MobiVQA, a set of optimizations that leverage the multi-modal nature of VQA. We show using extensive evaluation on two VQA testbeds and two mobile platforms, that MobiVQA significantly improves latency and energy with minimal accuracy loss compared to state-of-the-art VQA models. 
For instance, MobiVQA can answer a visual question in 163 milliseconds on the phone, compared to over 20-second latency incurred by the most accurate state-of-the-art model, while incurring less than 1 point reduction in accuracy.\",\"PeriodicalId\":20463,\"journal\":{\"name\":\"Proc. ACM Interact. Mob. Wearable Ubiquitous Technol.\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proc. ACM Interact. Mob. Wearable Ubiquitous Technol.\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3534619\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proc. ACM Interact. Mob. Wearable Ubiquitous Technol.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3534619","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract: Visual Question Answering (VQA) is a relatively new task in which a user asks a natural-language question about an image and obtains an answer. VQA is useful for many applications and is especially valuable for users with visual impairments. Our goal is to design a VQA application that runs efficiently on mobile devices without requiring cloud support. Such a system lets users ask visual questions privately, without sending their questions to the cloud, while also reducing cloud communication costs. However, existing VQA applications use deep learning models that achieve high accuracy but are computationally heavy. Unfortunately, existing techniques that optimize deep learning for mobile devices cannot be applied to VQA because the VQA task is multi-modal: it requires processing both vision and text data. Existing mobile optimizations that work for vision-only or text-only neural networks cannot be applied here because of the dependencies between the two modalities. Instead, we design MobiVQA, a set of optimizations that leverage the multi-modal nature of VQA. Through extensive evaluation on two VQA testbeds and two mobile platforms, we show that MobiVQA significantly improves latency and energy with minimal accuracy loss compared to state-of-the-art VQA models. For instance, MobiVQA can answer a visual question in 163 milliseconds on the phone, compared to over 20 seconds of latency incurred by the most accurate state-of-the-art model, while losing less than 1 point of accuracy.
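To make the cross-modal dependency mentioned in the abstract concrete, the following is a minimal, generic sketch of a two-branch VQA forward pass. It is not MobiVQA's actual architecture; all module names, feature sizes, and the fusion step are illustrative assumptions. The point it shows is that the text branch's output feeds into the vision-side computation (question-guided attention), so the two branches cannot be pruned, quantized, or early-exited in isolation.

```python
# Generic two-branch VQA sketch (illustrative only; not MobiVQA's model).
import torch
import torch.nn as nn

class ToyVQA(nn.Module):
    def __init__(self, vocab_size=10000, hidden=512, num_answers=3000):
        super().__init__()
        # Vision branch: stands in for a CNN / region-feature extractor.
        self.vision_proj = nn.Linear(2048, hidden)
        # Text branch: embeds and encodes the question.
        self.embed = nn.Embedding(vocab_size, hidden)
        self.text_enc = nn.GRU(hidden, hidden, batch_first=True)
        # Classifier over the fused representation.
        self.classifier = nn.Linear(hidden, num_answers)

    def forward(self, image_feats, question_ids):
        # image_feats: (batch, regions, 2048); question_ids: (batch, tokens)
        v = self.vision_proj(image_feats)               # (B, R, H)
        _, q = self.text_enc(self.embed(question_ids))  # (1, B, H)
        q = q.squeeze(0)                                # (B, H)
        # Question-guided attention over image regions: the text encoding is
        # an input to the vision-side computation, which is the dependency
        # that breaks single-modality mobile optimizations.
        attn = torch.softmax((v * q.unsqueeze(1)).sum(-1), dim=-1)  # (B, R)
        fused = (attn.unsqueeze(-1) * v).sum(1) * q                 # (B, H)
        return self.classifier(fused)

# Usage with fake inputs, just to show the shapes involved.
model = ToyVQA()
logits = model(torch.randn(2, 36, 2048), torch.randint(0, 10000, (2, 14)))
print(logits.shape)  # torch.Size([2, 3000])
```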