Advancing capsicum detection in night-time greenhouse environments using deep learning models: Comparative analysis and improved zero-shot detection through fusion with a single-shot detector
{"title":"Advancing capsicum detection in night-time greenhouse environments using deep learning models: Comparative analysis and improved zero-shot detection through fusion with a single-shot detector","authors":"Ayan Paul, Rajendra Machavaram","doi":"10.1016/j.fraope.2025.100243","DOIUrl":null,"url":null,"abstract":"<div><div>This study addresses capsicum detection in night-time greenhouse settings using a robust approach. A dataset of 300 images was curated, capturing various shooting distances, heights, occlusions, and lighting intensities, and underwent extensive pre-processing and augmentation. The single-shot custom-trained You Only Look Once version 9 (YOLOv9) model was evaluated, achieving precision, recall, F1 score, and mean Average Precision (mAP) of 0.898, 0.864, 0.881, and 0.947, respectively, with a detection speed of 38.46 frames per second (FPS). Concurrently, the zero-shot Grounding self-DIstillation with NO labels (Grounding DINO) model required no training and was hypertuned for capsicum detection using Google Colaboratory. Utilizing its Open Vocabulary Object Detection (OVOD) capability, the model successfully performed capsicum detection, positional search, growth stage detection, and diseased capsicum detection with confidence scores of 74 %, 43 %, 74 %, and 43 %, respectively. Comparative testing of both models on 100 test images containing 175 capsicums showed that YOLOv9 outperformed Grounding DINO with precision, recall, and F1 scores of 0.88, 0.86, and 0.87, compared to Grounding DINO's 0.72, 0.69, and 0.70. YOLOv9 also demonstrated an inference speed of 26 milliseconds, approximately five times faster than Grounding DINO. The fusion of YOLOv9 and Grounding DINO into You Only Look Once version Open Vocabulary Object Detection (YOLOvOVOD) significantly improved performance, achieving the highest confidence of 88 % for growth stage detection and a 65.11 % increase in confidence for positional search. This integrated approach leverages the strengths of both models, presenting a robust solution for future automation in agricultural machine vision.</div></div>","PeriodicalId":100554,"journal":{"name":"Franklin Open","volume":"10 ","pages":"Article 100243"},"PeriodicalIF":0.0000,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Franklin Open","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2773186325000337","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
This study presents a robust approach to capsicum detection in night-time greenhouse settings. A dataset of 300 images was curated to capture varied shooting distances, heights, occlusions, and lighting intensities, and underwent extensive pre-processing and augmentation. A custom-trained single-shot detector, You Only Look Once version 9 (YOLOv9), achieved precision, recall, F1 score, and mean Average Precision (mAP) of 0.898, 0.864, 0.881, and 0.947, respectively, at a detection speed of 38.46 frames per second (FPS). Concurrently, the zero-shot Grounding DINO model (Grounding DETR with Improved deNoising anchOr boxes) required no training; its hyperparameters were tuned for capsicum detection in Google Colaboratory. Using its Open Vocabulary Object Detection (OVOD) capability, the model performed capsicum detection, positional search, growth-stage detection, and diseased-capsicum detection with confidence scores of 74%, 43%, 74%, and 43%, respectively. Comparative testing of both models on 100 test images containing 175 capsicums showed that YOLOv9 outperformed Grounding DINO, with precision, recall, and F1 scores of 0.88, 0.86, and 0.87 versus Grounding DINO's 0.72, 0.69, and 0.70. YOLOv9 also achieved an inference time of 26 milliseconds per image, approximately five times faster than Grounding DINO. Fusing YOLOv9 and Grounding DINO into You Only Look Once version Open Vocabulary Object Detection (YOLOvOVOD) significantly improved performance, reaching the highest confidence of 88% for growth-stage detection and a 65.11% increase in confidence for positional search. This integrated approach leverages the strengths of both models, offering a robust solution for future automation in agricultural machine vision.
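To make the zero-shot OVOD step concrete, the sketch below shows how Grounding DINO can be prompted with free-text phrases via the Hugging Face transformers port. The checkpoint name, thresholds, prompt phrases, and image filename are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal zero-shot capsicum detection sketch with Grounding DINO
# (Hugging Face transformers port). Checkpoint, thresholds, and prompt
# phrases are assumptions -- the paper's exact settings are not given here.
import torch
from PIL import Image
from transformers import AutoProcessor, GroundingDinoForObjectDetection

model_id = "IDEA-Research/grounding-dino-base"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = GroundingDinoForObjectDetection.from_pretrained(model_id).eval()

image = Image.open("night_greenhouse.jpg").convert("RGB")  # hypothetical file
# OVOD text prompt: lowercase phrases separated by periods.
text = "a ripe capsicum. an unripe green capsicum. a diseased capsicum."

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Keep boxes whose phrase-grounding score clears both thresholds.
results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.35,   # assumed value
    text_threshold=0.25,  # assumed value
    target_sizes=[image.size[::-1]],
)[0]
for box, score, label in zip(results["boxes"], results["scores"], results["labels"]):
    print(f"{label}: {score:.2f} at {box.tolist()}")
```

Changing only the prompt string is what enables the positional-search, growth-stage, and diseased-capsicum queries reported above, with no retraining.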
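The comparison metrics follow the standard detection definitions. The snippet below is a worked check, with hypothetical true/false positive counts chosen only to show how YOLOv9's reported 0.88 / 0.86 / 0.87 on the 175 test capsicums fit together; the actual counts are not stated in the abstract.

```python
# Detection metrics from true positive (tp), false positive (fp), and
# false negative (fn) counts. The counts below are assumptions used to
# reproduce the reported precision/recall/F1, not figures from the paper.
def detection_metrics(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp)            # fraction of predictions that are correct
    recall = tp / (tp + fn)               # fraction of ground truth that is found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Example: of 175 ground-truth capsicums, suppose 151 were detected
# with 21 false positives (151 + 24 missed = 175 ground truth).
p, r, f1 = detection_metrics(tp=151, fp=21, fn=24)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")  # 0.88 / 0.86 / 0.87
```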
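The abstract does not spell out the YOLOvOVOD fusion rule, so the following is one plausible scheme under stated assumptions: keep YOLOv9's fast, high-confidence boxes for localization, then attach Grounding DINO's open-vocabulary phrase (e.g., a growth stage) to each box it overlaps. The IoU matching rule, the 0.5 cutoff, and the confidence-averaging step are all assumptions; the authors' actual fusion logic may differ.

```python
# A plausible box-level fusion sketch for YOLOvOVOD (assumed scheme, not
# the paper's confirmed method): YOLOv9 supplies the boxes, Grounding DINO
# supplies open-vocabulary labels for boxes it also detects.
from dataclasses import dataclass

@dataclass
class Detection:
    box: tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
    score: float
    label: str

def iou(a, b) -> float:
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def fuse(yolo_dets: list[Detection], dino_dets: list[Detection],
         iou_thr: float = 0.5) -> list[Detection]:
    """Relabel each YOLOv9 box with the best-overlapping DINO phrase."""
    fused = []
    for y in yolo_dets:
        match = max(dino_dets, key=lambda d: iou(y.box, d.box), default=None)
        if match and iou(y.box, match.box) >= iou_thr:
            # Average the two confidences and adopt the richer phrase label.
            fused.append(Detection(y.box, (y.score + match.score) / 2, match.label))
        else:
            fused.append(y)  # no overlap: keep the plain YOLOv9 detection
    return fused
```

A design note on this sketch: routing localization through YOLOv9 preserves its roughly fivefold speed advantage, while the DINO pass only needs to run when an open-vocabulary query (growth stage, disease, position) is requested.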