ResNet50 in remote sensing and agriculture: evaluating image captioning performance for high spectral data

IF 2.8 · CAS Tier 4 (Environmental Science & Ecology) · JCR Q3 (Environmental Sciences)
Chengping Zhang, Imran Iqbal, Uzair Aslam Bhatti, Jinru Liu, Emad Mahrous Awwad, Nadia Sarhan
{"title":"遥感和农业中的 ResNet50:评估高光谱数据的图像标题性能","authors":"Chengping Zhang,&nbsp;Imran Iqbal,&nbsp;Uzair Aslam Bhatti,&nbsp;Jinru Liu,&nbsp;Emad Mahrous Awwad,&nbsp;Nadia Sarhan","doi":"10.1007/s12665-024-11950-2","DOIUrl":null,"url":null,"abstract":"<div><p>Remote sensing image captioning is crucial as it enables the automatic interpretation and description of complex images captured from satellite or aerial sensors, facilitating the efficient analysis and understanding of vast amounts of geospatial data. This capability is essential for various applications, including environmental monitoring, disaster management, urban planning, and agricultural assessment, where accurate and timely information is vital for decision-making and response. This paper aims to evaluate deep learning models for image captioning in the context of remote sensing data and specifically compares Vision Transformer (ViT) and ResNet50 architectures. Utilizing the BLEU score to evaluate the quality of generated captions, the research explores the models' capabilities across varying sample sizes: The amount of samples included 25, 50, 75, and 100 samples. As it is shown in the tables above, the Vision Transformer outperforms the ResNet50 model in most cases, with the highest BLEU score of 0. 5507 at 50 samples, which indicates the superiority in learning global dependencies for image understanding and text generation. Nonetheless, the performance of ViT decreases slightly when the number of samples is greater than 50, which might be attributed to overfitting or scalability. On the other hand, ResNet50 shows a gradual increase in BLEU score with the increase in sample size and attains the maximum BLEU score of 0. 4783 at 100 samples, meaning that it is most effective with large data sets where it can fully take advantage of the learning algorithm. This work also discusses the advantages and disadvantages of the two models and makes suggestions on when it is suitable to use which model for image captioning tasks in remote sensing, thus helps to advance the discussion on model selection and improvement for image captioning tasks.</p></div>","PeriodicalId":542,"journal":{"name":"Environmental Earth Sciences","volume":"83 23","pages":""},"PeriodicalIF":2.8000,"publicationDate":"2024-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"ResNet50 in remote sensing and agriculture: evaluating image captioning performance for high spectral data\",\"authors\":\"Chengping Zhang,&nbsp;Imran Iqbal,&nbsp;Uzair Aslam Bhatti,&nbsp;Jinru Liu,&nbsp;Emad Mahrous Awwad,&nbsp;Nadia Sarhan\",\"doi\":\"10.1007/s12665-024-11950-2\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Remote sensing image captioning is crucial as it enables the automatic interpretation and description of complex images captured from satellite or aerial sensors, facilitating the efficient analysis and understanding of vast amounts of geospatial data. This capability is essential for various applications, including environmental monitoring, disaster management, urban planning, and agricultural assessment, where accurate and timely information is vital for decision-making and response. This paper aims to evaluate deep learning models for image captioning in the context of remote sensing data and specifically compares Vision Transformer (ViT) and ResNet50 architectures. 
Utilizing the BLEU score to evaluate the quality of generated captions, the research explores the models' capabilities across varying sample sizes: The amount of samples included 25, 50, 75, and 100 samples. As it is shown in the tables above, the Vision Transformer outperforms the ResNet50 model in most cases, with the highest BLEU score of 0. 5507 at 50 samples, which indicates the superiority in learning global dependencies for image understanding and text generation. Nonetheless, the performance of ViT decreases slightly when the number of samples is greater than 50, which might be attributed to overfitting or scalability. On the other hand, ResNet50 shows a gradual increase in BLEU score with the increase in sample size and attains the maximum BLEU score of 0. 4783 at 100 samples, meaning that it is most effective with large data sets where it can fully take advantage of the learning algorithm. This work also discusses the advantages and disadvantages of the two models and makes suggestions on when it is suitable to use which model for image captioning tasks in remote sensing, thus helps to advance the discussion on model selection and improvement for image captioning tasks.</p></div>\",\"PeriodicalId\":542,\"journal\":{\"name\":\"Environmental Earth Sciences\",\"volume\":\"83 23\",\"pages\":\"\"},\"PeriodicalIF\":2.8000,\"publicationDate\":\"2024-11-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Environmental Earth Sciences\",\"FirstCategoryId\":\"93\",\"ListUrlMain\":\"https://link.springer.com/article/10.1007/s12665-024-11950-2\",\"RegionNum\":4,\"RegionCategory\":\"环境科学与生态学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"ENVIRONMENTAL SCIENCES\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Environmental Earth Sciences","FirstCategoryId":"93","ListUrlMain":"https://link.springer.com/article/10.1007/s12665-024-11950-2","RegionNum":4,"RegionCategory":"环境科学与生态学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"ENVIRONMENTAL SCIENCES","Score":null,"Total":0}
Citations: 0

Abstract


Remote sensing image captioning is crucial as it enables the automatic interpretation and description of complex images captured from satellite or aerial sensors, facilitating the efficient analysis and understanding of vast amounts of geospatial data. This capability is essential for various applications, including environmental monitoring, disaster management, urban planning, and agricultural assessment, where accurate and timely information is vital for decision-making and response. This paper aims to evaluate deep learning models for image captioning in the context of remote sensing data, specifically comparing the Vision Transformer (ViT) and ResNet50 architectures. Using the BLEU score to evaluate the quality of generated captions, the research explores the models' capabilities across varying sample sizes of 25, 50, 75, and 100 samples. The results show that the Vision Transformer outperforms the ResNet50 model in most cases, reaching its highest BLEU score of 0.5507 at 50 samples, which indicates its superiority in learning the global dependencies needed for image understanding and text generation. Nonetheless, the performance of ViT decreases slightly when the number of samples exceeds 50, which might be attributed to overfitting or scalability limitations. ResNet50, on the other hand, shows a gradual increase in BLEU score as the sample size grows and attains its maximum BLEU score of 0.4783 at 100 samples, suggesting that it is most effective with larger data sets where it can fully exploit its learning capacity. This work also discusses the advantages and disadvantages of the two models and makes recommendations on when each model is suitable for image captioning tasks in remote sensing, thus helping to advance the discussion on model selection and improvement for such tasks.
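
The paper's evaluation code is not reproduced here; the following is a minimal sketch, assuming NLTK's corpus_bleu with tokenized captions, of how BLEU scoring of generated remote sensing captions against reference captions can be computed. The evaluate_captions helper, the smoothing choice, and the example captions are illustrative assumptions, not taken from the paper, and the abstract does not state whether sentence-level or corpus-level BLEU was used.

```python
# Minimal BLEU evaluation sketch (assumes NLTK is installed; captions are invented).
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction


def evaluate_captions(references, hypotheses):
    """Corpus-level BLEU for generated captions.

    references: one list per image, each containing one or more tokenized
                reference captions.
    hypotheses: one tokenized generated caption per image.
    """
    # Smoothing prevents zero scores when a higher-order n-gram has no match,
    # which is common for short captions.
    smoothing = SmoothingFunction().method1
    return corpus_bleu(references, hypotheses, smoothing_function=smoothing)


# Illustrative captions only; they do not come from the paper's dataset.
references = [
    [["a", "large", "green", "field", "next", "to", "a", "winding", "road"]],
    [["several", "buildings", "surrounded", "by", "dense", "trees"]],
]
hypotheses = [
    ["a", "green", "field", "beside", "a", "road"],
    ["buildings", "surrounded", "by", "trees"],
]

print(f"Corpus BLEU: {evaluate_captions(references, hypotheses):.4f}")
```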

Source journal
Environmental Earth Sciences (Environmental Sciences; Geosciences, Multidisciplinary)
CiteScore: 5.10
Self-citation rate: 3.60%
Articles published: 494
Review time: 8.3 months
Journal description: Environmental Earth Sciences is an international multidisciplinary journal concerned with all aspects of interaction between humans, natural resources, ecosystems, special climates or unique geographic zones, and the earth:
- Water and soil contamination caused by waste management and disposal practices
- Environmental problems associated with transportation by land, air, or water
- Geological processes that may impact biosystems or humans
- Man-made or naturally occurring geological or hydrological hazards
- Environmental problems associated with the recovery of materials from the earth
- Environmental problems caused by extraction of minerals, coal, and ores, as well as oil and gas, water and alternative energy sources
- Environmental impacts of exploration and recultivation
- Environmental impacts of hazardous materials
- Management of environmental data and information in data banks and information systems
- Dissemination of knowledge on techniques, methods, approaches and experiences to improve and remediate the environment
In pursuit of these topics, the geoscientific disciplines are invited to contribute their knowledge and experience. Major disciplines include: hydrogeology, hydrochemistry, geochemistry, geophysics, engineering geology, remediation science, natural resources management, environmental climatology and biota, environmental geography, soil science and geomicrobiology.