Visual Question Answering Optimized Framework using Mixed Precision Training
Authors: Souvik Chowdhury, B. Soni
Published in: 2023 International Conference on Artificial Intelligence and Applications (ICAIA) Alliance Technology Conference (ATCON-1)
Publication date: 2023-04-21
DOI: https://doi.org/10.1109/ICAIA57370.2023.10169318
Citations: 0
Abstract
Thanks to the emergence and continued development of machine learning, particularly deep learning, research on visual question answering (VQA) has advanced dramatically and carries both theoretical significance and practical application value. The field draws on multimodal learning, computer vision, and natural language processing. Apart from a few researchers who proposed optimized bilinear fusion approaches that efficiently combine text and image features, there have been few efforts to optimize the VQA framework itself. In this work we propose a novel VQA framework aimed at that optimization. Because automatic mixed precision combines 16-bit and 32-bit floating-point arithmetic, deep learning architectures can be trained with less computation and execution time. The proposed framework was tested against two models on the VQA 2.0 and CLEVR datasets. The experimental results showed a significant improvement in overall accuracy and execution time.
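The abstract does not describe how the mixed precision training is implemented, so the following is only an illustrative sketch of the general idea, not the authors' framework. It uses Python's standard `struct` module to emulate 16-bit and 32-bit floats and shows two commonly cited reasons mixed precision pairs FP16 arithmetic with FP32 storage: very small gradients underflow in FP16 unless the loss is scaled, and small weight updates are rounded away unless a higher-precision copy of the weights is kept. All function names, the gradient value, and the loss scale of 2^16 are assumptions chosen for illustration.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (16-bit) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision (32-bit) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def scaled_fp16_gradient(grad: float, loss_scale: float) -> float:
    """Store grad * loss_scale in FP16, then unscale at higher precision."""
    return to_fp16(grad * loss_scale) / loss_scale


def apply_updates(weight: float, update: float, steps: int, store) -> float:
    """Add an FP16 update to a weight `steps` times.

    `store` is the rounding function for the stored weight
    (to_fp16 for pure FP16 weights, to_fp32 for an FP32 master copy).
    """
    for _ in range(steps):
        weight = store(weight + to_fp16(update))
    return weight


tiny_grad = 1e-8  # hypothetical gradient, below the smallest FP16 value

# Without loss scaling the gradient is lost in FP16.
print("FP16 gradient, unscaled:", to_fp16(tiny_grad))

# With loss scaling (assumed scale 2**16) the gradient survives.
print("FP16 gradient, scaled:  ", scaled_fp16_gradient(tiny_grad, 2.0 ** 16))

# Weights stored in FP16 cannot absorb small updates near 1.0 ...
print("FP16 weight after 100 updates:", apply_updates(1.0, 1e-4, 100, to_fp16))

# ... while an FP32 master copy accumulates them.
print("FP32 weight after 100 updates:", apply_updates(1.0, 1e-4, 100, to_fp32))
```

In a real training setup these steps are normally handled by a framework's automatic mixed precision facilities rather than written by hand; the sketch only makes the underlying numerics visible.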