Research on image captioning based on LSTM and YOLOv5 fusion attention mechanism
Xiaoliang Zhang, Qingtao Zeng, Yeli Li, Likun Lu, Weichun Yang
{"title":"基于LSTM和YOLOv5融合注意机制的图像字幕研究","authors":"Xiaoliang Zhang, Qingtao Zeng, Yeli Li, Likun Lu, Weichun Yang","doi":"10.1117/12.2667667","DOIUrl":null,"url":null,"abstract":"Humans can easily learn to recognize every object in life, every landscape, and describe the things around them in detail from the process of growing up, but computers cannot. How to make computers learn to describe things in pictures has become the research direction of many scholars. If this technology is mature, it will bring great boon to people with visual impairments. They can understand the things around them and the beautiful earth through hearing. Robots recognize objects and understand their surroundings. With the development of artificial intelligence, the power of convolutional neural networks is more and more comparable to that of the human brain. In recent years, many scholars have proposed different methods to seek better solutions to this problem, including generative adversarial networks. Based on the classic structure of Encoder-Decoder, this paper first compares the code implementation and results of ResNet101 as an Encoder on the COCO dataset, and then proposes a new solution that integrates YOLOv5 and LSTM, aiming to improve the model inference speed and inference accuracy.","PeriodicalId":128051,"journal":{"name":"Third International Seminar on Artificial Intelligence, Networking, and Information Technology","volume":"3 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Research on image captioning based on LSTM and YOLOv5 fusion attention mechanism\",\"authors\":\"Xiaoliang Zhang, Qingtao Zeng, Yeli Li, Likun Lu, Weichun Yang\",\"doi\":\"10.1117/12.2667667\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Humans can easily learn to recognize every object in life, every landscape, and describe the things around them in detail from the process of growing up, but computers cannot. How to make computers learn to describe things in pictures has become the research direction of many scholars. If this technology is mature, it will bring great boon to people with visual impairments. They can understand the things around them and the beautiful earth through hearing. Robots recognize objects and understand their surroundings. With the development of artificial intelligence, the power of convolutional neural networks is more and more comparable to that of the human brain. In recent years, many scholars have proposed different methods to seek better solutions to this problem, including generative adversarial networks. 
Based on the classic structure of Encoder-Decoder, this paper first compares the code implementation and results of ResNet101 as an Encoder on the COCO dataset, and then proposes a new solution that integrates YOLOv5 and LSTM, aiming to improve the model inference speed and inference accuracy.\",\"PeriodicalId\":128051,\"journal\":{\"name\":\"Third International Seminar on Artificial Intelligence, Networking, and Information Technology\",\"volume\":\"3 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-02-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Third International Seminar on Artificial Intelligence, Networking, and Information Technology\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1117/12.2667667\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Third International Seminar on Artificial Intelligence, Networking, and Information Technology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1117/12.2667667","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
As they grow up, humans easily learn to recognize the objects and scenes they encounter and to describe their surroundings in detail, but computers cannot. Teaching computers to describe the content of images has therefore become a research direction for many scholars. Once this technology matures, it will be a great boon to people with visual impairments, who could perceive their surroundings and the world through hearing; it would also help robots recognize objects and understand their environments. With the development of artificial intelligence, convolutional neural networks have grown increasingly powerful at visual recognition. In recent years, many methods have been proposed to tackle this problem, including generative adversarial networks. Based on the classic Encoder-Decoder structure, this paper first compares the implementation and results of ResNet101 as the encoder on the COCO dataset, and then proposes a new model that integrates YOLOv5 and LSTM, aiming to improve inference speed and accuracy.
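The abstract describes an Encoder-Decoder captioner with a ResNet101 encoder, an LSTM decoder, and an attention mechanism. Below is a minimal PyTorch sketch of that baseline; every module name, dimension, and the dummy vocabulary are illustrative assumptions rather than the authors' code, and the paper's proposed YOLOv5 fusion component is not reproduced here.

import torch
import torch.nn as nn
import torchvision.models as models

class Encoder(nn.Module):
    """ResNet101 backbone; keeps the final 7x7 feature map as 49 region vectors."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet101(weights=None)  # load pretrained weights in practice
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool + fc

    def forward(self, images):                      # images: (B, 3, 224, 224)
        feats = self.backbone(images)               # (B, 2048, 7, 7)
        return feats.flatten(2).transpose(1, 2)     # (B, 49, 2048)

class Attention(nn.Module):
    """Additive attention over the encoder's image regions."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.enc_proj = nn.Linear(feat_dim, attn_dim)
        self.dec_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):               # feats: (B, 49, F), hidden: (B, H)
        e = self.score(torch.tanh(self.enc_proj(feats) + self.dec_proj(hidden).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)             # weights over the 49 regions
        return (alpha * feats).sum(dim=1)           # (B, F) weighted region summary

class Decoder(nn.Module):
    """LSTM decoder that attends to the image regions at every time step."""
    def __init__(self, vocab_size, feat_dim=2048, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attention = Attention(feat_dim, hidden_dim, attn_dim=256)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions):             # captions: (B, T) token ids
        B, T = captions.shape
        h = feats.new_zeros(B, self.hidden_dim)
        c = feats.new_zeros(B, self.hidden_dim)
        logits = []
        for t in range(T):                          # teacher forcing over the caption
            context = self.attention(feats, h)
            x = torch.cat([self.embed(captions[:, t]), context], dim=1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.fc(h))
        return torch.stack(logits, dim=1)           # (B, T, vocab_size)

# One teacher-forced forward pass on dummy data (vocabulary size is a placeholder).
encoder, decoder = Encoder(), Decoder(vocab_size=10000)
images = torch.randn(2, 3, 224, 224)
captions = torch.randint(0, 10000, (2, 12))
logits = decoder(encoder(images), captions)         # (2, 12, 10000)

The additive attention lets the decoder weight the 49 ResNet regions differently for each generated word, the standard formulation such attention-based captioners typically follow.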