Research on image captioning based on LSTM and YOLOv5 fusion attention mechanism
Xiaoliang Zhang, Qingtao Zeng, Yeli Li, Likun Lu, Weichun Yang
{"title":"基于LSTM和YOLOv5融合注意机制的图像字幕研究","authors":"Xiaoliang Zhang, Qingtao Zeng, Yeli Li, Likun Lu, Weichun Yang","doi":"10.1117/12.2667667","DOIUrl":null,"url":null,"abstract":"Humans can easily learn to recognize every object in life, every landscape, and describe the things around them in detail from the process of growing up, but computers cannot. How to make computers learn to describe things in pictures has become the research direction of many scholars. If this technology is mature, it will bring great boon to people with visual impairments. They can understand the things around them and the beautiful earth through hearing. Robots recognize objects and understand their surroundings. With the development of artificial intelligence, the power of convolutional neural networks is more and more comparable to that of the human brain. In recent years, many scholars have proposed different methods to seek better solutions to this problem, including generative adversarial networks. Based on the classic structure of Encoder-Decoder, this paper first compares the code implementation and results of ResNet101 as an Encoder on the COCO dataset, and then proposes a new solution that integrates YOLOv5 and LSTM, aiming to improve the model inference speed and inference accuracy.","PeriodicalId":128051,"journal":{"name":"Third International Seminar on Artificial Intelligence, Networking, and Information Technology","volume":"3 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Research on image captioning based on LSTM and YOLOv5 fusion attention mechanism\",\"authors\":\"Xiaoliang Zhang, Qingtao Zeng, Yeli Li, Likun Lu, Weichun Yang\",\"doi\":\"10.1117/12.2667667\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Humans can easily learn to recognize every object in life, every landscape, and describe the things around them in detail from the process of growing up, but computers cannot. How to make computers learn to describe things in pictures has become the research direction of many scholars. If this technology is mature, it will bring great boon to people with visual impairments. They can understand the things around them and the beautiful earth through hearing. Robots recognize objects and understand their surroundings. With the development of artificial intelligence, the power of convolutional neural networks is more and more comparable to that of the human brain. In recent years, many scholars have proposed different methods to seek better solutions to this problem, including generative adversarial networks. 
Based on the classic structure of Encoder-Decoder, this paper first compares the code implementation and results of ResNet101 as an Encoder on the COCO dataset, and then proposes a new solution that integrates YOLOv5 and LSTM, aiming to improve the model inference speed and inference accuracy.\",\"PeriodicalId\":128051,\"journal\":{\"name\":\"Third International Seminar on Artificial Intelligence, Networking, and Information Technology\",\"volume\":\"3 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-02-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Third International Seminar on Artificial Intelligence, Networking, and Information Technology\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1117/12.2667667\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Third International Seminar on Artificial Intelligence, Networking, and Information Technology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1117/12.2667667","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
As they grow up, humans easily learn to recognize the objects and scenes they encounter and to describe their surroundings in detail, but computers cannot. Teaching computers to describe the content of images has therefore become a research direction for many scholars. Once this technology matures, it will be a great boon to people with visual impairments, who could perceive their surroundings and the world through hearing; it would also help robots recognize objects and understand their environments. With the development of artificial intelligence, convolutional neural networks have grown increasingly powerful at visual recognition. In recent years, many methods have been proposed to tackle this problem, including generative adversarial networks. Based on the classic Encoder-Decoder structure, this paper first compares the implementation and results of ResNet101 as the encoder on the COCO dataset, and then proposes a new model that integrates YOLOv5 and LSTM, aiming to improve inference speed and accuracy.
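The abstract describes an Encoder-Decoder captioner with a ResNet101 encoder, an LSTM decoder, and an attention mechanism. Below is a minimal PyTorch sketch of that baseline; every module name, dimension, and the dummy vocabulary are illustrative assumptions rather than the authors' code, and the paper's proposed YOLOv5 fusion component is not reproduced here.

import torch
import torch.nn as nn
import torchvision.models as models

class Encoder(nn.Module):
    """ResNet101 backbone; keeps the final 7x7 feature map as 49 region vectors."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet101(weights=None)  # load pretrained weights in practice
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool + fc

    def forward(self, images):                      # images: (B, 3, 224, 224)
        feats = self.backbone(images)               # (B, 2048, 7, 7)
        return feats.flatten(2).transpose(1, 2)     # (B, 49, 2048)

class Attention(nn.Module):
    """Additive attention over the encoder's image regions."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.enc_proj = nn.Linear(feat_dim, attn_dim)
        self.dec_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):               # feats: (B, 49, F), hidden: (B, H)
        e = self.score(torch.tanh(self.enc_proj(feats) + self.dec_proj(hidden).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)             # weights over the 49 regions
        return (alpha * feats).sum(dim=1)           # (B, F) weighted region summary

class Decoder(nn.Module):
    """LSTM decoder that attends to the image regions at every time step."""
    def __init__(self, vocab_size, feat_dim=2048, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attention = Attention(feat_dim, hidden_dim, attn_dim=256)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions):             # captions: (B, T) token ids
        B, T = captions.shape
        h = feats.new_zeros(B, self.hidden_dim)
        c = feats.new_zeros(B, self.hidden_dim)
        logits = []
        for t in range(T):                          # teacher forcing over the caption
            context = self.attention(feats, h)
            x = torch.cat([self.embed(captions[:, t]), context], dim=1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.fc(h))
        return torch.stack(logits, dim=1)           # (B, T, vocab_size)

# One teacher-forced forward pass on dummy data (vocabulary size is a placeholder).
encoder, decoder = Encoder(), Decoder(vocab_size=10000)
images = torch.randn(2, 3, 224, 224)
captions = torch.randint(0, 10000, (2, 12))
logits = decoder(encoder(images), captions)         # (2, 12, 10000)

The additive attention lets the decoder weight the 49 ResNet regions differently for each generated word, the standard formulation such attention-based captioners typically follow.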