Yuanqiu Liu;Hong Yu;Hui Li;Xin Han;Xiaotong Zhang;Han Liu
IEEE Internet of Things Journal, vol. 12, no. 17, pp. 35320-35328. Published 2025-06-16. DOI: 10.1109/JIOT.2025.3579587. URL: https://ieeexplore.ieee.org/document/11036655/
Journal Article. Impact factor: 8.9. JCR: Q1 (Computer Science, Information Systems).
Transformer-Based Nonautoregressive Image Captioning via Guided Keyword Generation and Learnable Positional Encoding for IoT Devices
The emergence of the Intelligent Internet of Things (IIoT) has brought data processing closer to data sources, especially for real-time processing of surveillance video and image analysis. Image captioning plays a crucial role in understanding images. However, the Transformer architecture, now prevalent in captioning models, increases the computational resources those models require. Moreover, most existing methods use the autoregressive paradigm, which generates a caption one token at a time; this reduces computational efficiency on edge devices and causes significant inference delays. In this article, we adopt the nonautoregressive paradigm to improve the inference speed and efficiency of image captioning models. However, because a nonautoregressive decoder lacks effective inputs, its performance lags behind that of autoregressive models. To bridge this gap, we propose nonautoregressive image captioning guided by keywords and learnable positional encoding. First, a diffusion model guided by image features generates keywords that accurately reflect the image content, infusing a substantial amount of semantic information into the nonautoregressive decoder. Second, learnable positional encoding guides the decoder to generate appropriate words at the correct positions within the caption. Extensive experiments on widely used benchmarks demonstrate that our model achieves state-of-the-art performance in nonautoregressive image captioning while maintaining a competitive inference speed.
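The contrast between the two decoding paradigms can be illustrated with a toy sketch. This is not the authors' model: the vocabulary, the random embedding tables `word_emb` and `pos_emb`, and the functions `nar_decode` and `ar_decode` are all hypothetical, the keywords are passed in by hand rather than produced by a diffusion model, and nothing is trained. It only shows that a nonautoregressive decoder can fill every caption position independently from the keyword context plus a per-position embedding, whereas an autoregressive decoder must wait for the previous word.

```python
import random

# Hypothetical toy vocabulary and sizes, chosen only for illustration.
VOCAB = ["a", "dog", "runs", "on", "grass", "<pad>"]
DIM = 8
MAX_LEN = 5

rng = random.Random(0)


def rand_vec(dim):
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


# Stand-ins for trained parameters (assumption: random values, no training).
word_emb = {w: rand_vec(DIM) for w in VOCAB}
pos_emb = [rand_vec(DIM) for _ in range(MAX_LEN)]  # one learnable vector per position


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def keyword_context(keywords):
    """Average the embeddings of the guiding keywords."""
    vecs = [word_emb[k] for k in keywords]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def nar_decode(keywords, length):
    """Nonautoregressive: each position depends only on context and position."""
    ctx = keyword_context(keywords)
    caption = []
    for pos in range(length):  # iterations are independent, so they could run in parallel
        query = [c + p for c, p in zip(ctx, pos_emb[pos])]
        caption.append(max(VOCAB, key=lambda w: dot(query, word_emb[w])))
    return caption


def ar_decode(keywords, length):
    """Autoregressive baseline: each step also consumes the previous word."""
    ctx = keyword_context(keywords)
    caption, prev = [], [0.0] * DIM
    for pos in range(length):  # step t cannot start before step t-1 has finished
        query = [c + p + q for c, p, q in zip(ctx, pos_emb[pos], prev)]
        word = max(VOCAB, key=lambda w: dot(query, word_emb[w]))
        caption.append(word)
        prev = word_emb[word]
    return caption


if __name__ == "__main__":
    print(nar_decode(["dog", "grass"], MAX_LEN))
    print(ar_decode(["dog", "grass"], MAX_LEN))
```

Because the tables are random, the printed captions are meaningless; the point is the data dependency. In `nar_decode` no position reads another position's output, which is what allows a single parallel pass on an edge device, and the positional table is the only signal telling the decoder which word belongs where.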
About the journal:
The IEEE Internet of Things (IoT) Journal publishes articles and review articles covering various aspects of IoT, including IoT system architecture, IoT enabling technologies, IoT communication and networking protocols such as network coding, and IoT services and applications. Topics encompass IoT's impact on sensor technologies, big data management, and future Internet design for applications such as smart cities and smart homes. Fields of interest include IoT architecture, such as things-centric, data-centric, and service-oriented IoT architecture; IoT enabling technologies and systematic integration, such as sensor technologies, big sensor data management, and future Internet design for IoT; IoT services, applications, and test-beds, such as IoT service middleware, IoT application programming interfaces (APIs), IoT application design, and IoT trials and experiments; and IoT standardization activities and technology development in standards development organizations (SDOs) such as IEEE, IETF, ITU, 3GPP, and ETSI.