FishDetectLLM: Multimodal instruction tuning with large language models for fish detection

IF 7.2 1区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Knowledge-Based Systems Pub Date : 2025-04-14 DOI:10.1016/j.knosys.2025.113418

Jiaxin Zhu , Shibai Yin , Xin Liu , Xingyang Wang , Yee-Hong Yang

{"title":"FishDetectLLM: Multimodal instruction tuning with large language models for fish detection","authors":"Jiaxin Zhu , Shibai Yin , Xin Liu , Xingyang Wang , Yee-Hong Yang","doi":"10.1016/j.knosys.2025.113418","DOIUrl":null,"url":null,"abstract":"<div><div>Aquatic species play crucial roles in global ecosystems but are increasingly threatened by factors such as overfishing, coastal development and climate change. Existing deep learning methods address these challenges by employing powerful networks and large-scale, diverse datasets, separately tackling species recognition and trait identification during ongoing monitoring. However, they often exhibit limited generalization ability. Inspired by the human ability to quickly identify fish species and their locations with just a glance at an underwater image or scene, we introduce FishDetectLLM—a framework built on the lightweight TinyLLaVA architecture. FishDetectLLM utilizes the powerful reasoning capabilities and vast world knowledge of large language models (LLMs) to address the fish detection problem, providing both fish classification results and predicted bounding boxes for fish. Specifically, we create instruction dialogues for fish detection that connect fish taxonomy with classification descriptions and map location descriptions to the corresponding coordinates of bounding box in the input images from the recently released large-scale FishNet dataset. Then, we pretrain and fine-tune FishDetectLLM to achieve fish detection using the created dataset, leveraging the principle of augmenting human knowledge. Our results show that FishDetectLLM significantly outperforms existing multimodal LLMs and task-specific methods. Unlike conventional detection architectures that struggle to generalize beyond the training data, FishDetectLLM exhibits strong generalization capabilities, achieving robust performance on unseen data. This innovation paves the way for future applications of MLLMs in full research and offers valuable tools for the conservation of fish biodiversity.</div></div>","PeriodicalId":49939,"journal":{"name":"Knowledge-Based Systems","volume":"318 ","pages":"Article 113418"},"PeriodicalIF":7.2000,"publicationDate":"2025-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Knowledge-Based Systems","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0950705125004654","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

Aquatic species play crucial roles in global ecosystems but are increasingly threatened by factors such as overfishing, coastal development and climate change. Existing deep learning methods address these challenges by employing powerful networks and large-scale, diverse datasets, separately tackling species recognition and trait identification during ongoing monitoring. However, they often exhibit limited generalization ability. Inspired by the human ability to quickly identify fish species and their locations with just a glance at an underwater image or scene, we introduce FishDetectLLM—a framework built on the lightweight TinyLLaVA architecture. FishDetectLLM utilizes the powerful reasoning capabilities and vast world knowledge of large language models (LLMs) to address the fish detection problem, providing both fish classification results and predicted bounding boxes for fish. Specifically, we create instruction dialogues for fish detection that connect fish taxonomy with classification descriptions and map location descriptions to the corresponding coordinates of bounding box in the input images from the recently released large-scale FishNet dataset. Then, we pretrain and fine-tune FishDetectLLM to achieve fish detection using the created dataset, leveraging the principle of augmenting human knowledge. Our results show that FishDetectLLM significantly outperforms existing multimodal LLMs and task-specific methods. Unlike conventional detection architectures that struggle to generalize beyond the training data, FishDetectLLM exhibits strong generalization capabilities, achieving robust performance on unseen data. This innovation paves the way for future applications of MLLMs in full research and offers valuable tools for the conservation of fish biodiversity.

查看原文本刊更多论文

FishDetectLLM: 利用大型语言模型进行鱼类检测的多模式指令调整

水生物种在全球生态系统中发挥着至关重要的作用，但它们日益受到过度捕捞、沿海开发和气候变化等因素的威胁。现有的深度学习方法通过使用强大的网络和大规模、多样化的数据集来解决这些挑战，在持续监测期间分别处理物种识别和特征识别。然而，他们往往表现出有限的泛化能力。受人类快速识别鱼类物种及其位置的能力的启发，只需瞥一眼水下图像或场景，我们介绍了fishdetectllm -一个基于轻量级TinyLLaVA架构的框架。FishDetectLLM利用大语言模型（llm）强大的推理能力和广泛的世界知识来解决鱼类检测问题，既提供鱼类分类结果，又提供鱼类的预测边界盒。具体来说，我们创建了用于鱼类检测的指令对话，将鱼类分类与分类描述联系起来，并将位置描述映射到来自最近发布的大规模渔网数据集的输入图像中的相应边界框坐标。然后，我们利用增强人类知识的原理，利用创建的数据集对FishDetectLLM进行预训练和微调，以实现鱼类检测。我们的研究结果表明，FishDetectLLM显著优于现有的多模态llm和特定任务方法。与难以泛化训练数据的传统检测架构不同，FishDetectLLM展示了强大的泛化能力，在未见过的数据上实现了稳健的性能。这一创新为今后mlm在全面研究中的应用铺平了道路，并为鱼类生物多样性的保护提供了有价值的工具。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Knowledge-Based Systems 工程技术-计算机：人工智能

CiteScore

14.80

自引率

12.50%

发文量

1245

审稿时长

7.8 months

期刊介绍： Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on knowledge-based and other artificial intelligence techniques-based systems. The journal aims to support human prediction and decision-making through data science and computation techniques, provide a balanced coverage of theory and practical study, and encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.