Transformer-Based Nonautoregressive Image Captioning via Guided Keyword Generation and Learnable Positional Encoding for IoT Devices

Impact Factor: 8.9 · CAS Tier 1 (Computer Science) · JCR Q1, COMPUTER SCIENCE, INFORMATION SYSTEMS
Yuanqiu Liu;Hong Yu;Hui Li;Xin Han;Xiaotong Zhang;Han Liu
{"title":"Transformer-Based Nonautoregressive Image Captioning via Guided Keyword Generation and Learnable Positional Encoding for IoT Devices","authors":"Yuanqiu Liu;Hong Yu;Hui Li;Xin Han;Xiaotong Zhang;Han Liu","doi":"10.1109/JIOT.2025.3579587","DOIUrl":null,"url":null,"abstract":"The emergence of the Intelligent Internet of Things (IIoT) has brought data processing closer to data sources, especially for real-time processing of surveillance video and image analysis. Image captioning plays a crucial role in understanding images. However, the Transformer architecture, which has become prevalent in recent applications, has been observed to increase the computational resources required for image captioning models. Conventionally, most existing methods use the autoregressive paradigm, which reduces their computational efficiency on edge devices and results in significant inference delays. In this article, we use nonautoregressive paradigms to improve its inference speed and model efficiency. Nevertheless, the lack of effective inputs results in a performance gap between nonautoregressive and autoregressive models. To bridge this gap, we propose the learnable positional encoding and keyword guided nonautoregressive image captioning. First, a diffusion model guided by image features is employed to generate keywords that accurately reflect the image content, thereby infusing a substantial amount of semantic information into the nonautoregressive decoder. Second, positional encoding is utilized to guide the decoder in generating appropriate words at the correct positions within the caption. Extensive experiments on widely used benchmarks demonstrate that our model achieves state-of-the-art performance in nonautoregressive image captioning. Furthermore, our model maintains a competitive inference speed.","PeriodicalId":54347,"journal":{"name":"IEEE Internet of Things Journal","volume":"12 17","pages":"35320-35328"},"PeriodicalIF":8.9000,"publicationDate":"2025-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Internet of Things Journal","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/11036655/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0

Abstract

The emergence of the Intelligent Internet of Things (IIoT) has brought data processing closer to data sources, especially for real-time analysis of surveillance video and images. Image captioning plays a crucial role in understanding images. However, the Transformer architecture, which has become prevalent in recent applications, has been observed to increase the computational resources required by image captioning models. Conventionally, most existing methods use the autoregressive paradigm, which reduces their computational efficiency on edge devices and results in significant inference delays. In this article, we adopt the nonautoregressive paradigm to improve inference speed and model efficiency. Nevertheless, the lack of effective inputs results in a performance gap between nonautoregressive and autoregressive models. To bridge this gap, we propose a nonautoregressive image captioning model with learnable positional encoding and keyword guidance. First, a diffusion model guided by image features is employed to generate keywords that accurately reflect the image content, thereby infusing a substantial amount of semantic information into the nonautoregressive decoder. Second, learnable positional encoding is utilized to guide the decoder in generating appropriate words at the correct positions within the caption. Extensive experiments on widely used benchmarks demonstrate that our model achieves state-of-the-art performance in nonautoregressive image captioning. Furthermore, our model maintains a competitive inference speed.
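The abstract outlines the core design: keywords derived from image features are fed to a nonautoregressive Transformer decoder together with learnable positional encodings, so that all caption tokens are predicted in a single parallel pass. The PyTorch sketch below illustrates that idea only; it is not the authors' implementation. The class name, dimensions, the concatenation of keyword embeddings into the decoder memory, and the use of a plain TransformerDecoder are all assumptions, and the paper's diffusion-based keyword generator is replaced here by an arbitrary list of keyword token IDs.

```python
# Minimal sketch (assumed design, not the paper's code) of a nonautoregressive
# caption decoder that conditions on image features and pre-generated keywords,
# using a learnable positional embedding instead of fixed sinusoidal encoding.
import torch
import torch.nn as nn


class NonAutoregressiveCaptionDecoder(nn.Module):
    def __init__(self, vocab_size=10000, d_model=512, n_heads=8,
                 n_layers=6, max_len=20):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, d_model)
        # Learnable positional encoding: one trainable vector per caption slot.
        self.pos_embed = nn.Parameter(torch.zeros(1, max_len, d_model))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)
        layer = nn.TransformerDecoderLayer(d_model, n_heads,
                                           dim_feedforward=2048,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.out_proj = nn.Linear(d_model, vocab_size)

    def forward(self, image_feats, keyword_ids):
        """
        image_feats: (B, N, d_model) region/grid features from a visual encoder.
        keyword_ids: (B, K) keyword tokens from any keyword generator
                     (a diffusion model in the paper; assumed given here).
        Returns logits for all caption positions in parallel: (B, max_len, vocab).
        """
        batch_size = image_feats.size(0)
        # Embed keywords and concatenate with image features so the decoder
        # can cross-attend to both visual and semantic cues.
        keyword_feats = self.token_embed(keyword_ids)
        memory = torch.cat([image_feats, keyword_feats], dim=1)
        # The decoder input is just the learnable positional queries:
        # no causal mask, so every caption slot is predicted in one pass.
        queries = self.pos_embed.expand(batch_size, -1, -1)
        hidden = self.decoder(tgt=queries, memory=memory)
        return self.out_proj(hidden)


if __name__ == "__main__":
    model = NonAutoregressiveCaptionDecoder()
    img = torch.randn(2, 49, 512)          # e.g. 7x7 grid features per image
    kws = torch.randint(0, 10000, (2, 5))  # five keyword tokens per image
    logits = model(img, kws)               # (2, 20, 10000), all positions at once
    print(logits.argmax(-1).shape)         # parallel greedy decoding
```

Because decoding is a single forward pass over all positions, latency scales with one decoder call rather than with caption length, which is the efficiency argument the abstract makes for edge/IoT deployment.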
Source Journal: IEEE Internet of Things Journal (Computer Science - Information Systems)
CiteScore: 17.60
Self-citation rate: 13.20%
Annual article count: 1982
Journal Description: The IEEE Internet of Things (IoT) Journal publishes articles and review articles covering various aspects of IoT, including IoT system architecture, IoT enabling technologies, IoT communication and networking protocols such as network coding, and IoT services and applications. Topics encompass IoT's impacts on sensor technologies, big data management, and future Internet design for applications like smart cities and smart homes. Fields of interest include IoT architecture such as things-centric, data-centric, and service-oriented IoT architecture; IoT enabling technologies and systematic integration such as sensor technologies, big sensor data management, and future Internet design for IoT; IoT services, applications, and test-beds such as IoT service middleware, IoT application programming interfaces (APIs), IoT application design, and IoT trials/experiments; and IoT standardization activities and technology development in standards development organizations (SDOs) such as IEEE, IETF, ITU, 3GPP, and ETSI.