Mrinal Kanti Dhar, Mou Deb, Poonguzhali Elangovan, Keerthy Gopalakrishnan, Divyanshi Sood, Avneet Kaur, Charmy Parikh, Swetha Rapolu, Gianeshwaree Alias Rachna Panjwani, Rabiah Aslam Ansari, Naghmeh Asadimanesh, Shiva Sankari Karuppiah, Scott A Helgeson, Venkata S Akshintala, Shivaram P Arunachalam
{"title":"一种新的基于三维卷积神经网络的视频分析时空特征映射深度学习模型:胃肠内镜视频分类的可行性研究","authors":"Mrinal Kanti Dhar, Mou Deb, Poonguzhali Elangovan, Keerthy Gopalakrishnan, Divyanshi Sood, Avneet Kaur, Charmy Parikh, Swetha Rapolu, Gianeshwaree Alias Rachna Panjwani, Rabiah Aslam Ansari, Naghmeh Asadimanesh, Shiva Sankari Karuppiah, Scott A Helgeson, Venkata S Akshintala, Shivaram P Arunachalam","doi":"10.3390/jimaging11070243","DOIUrl":null,"url":null,"abstract":"<p><p>Accurate analysis of medical videos remains a major challenge in deep learning (DL) due to the need for effective spatiotemporal feature mapping that captures both spatial detail and temporal dynamics. Despite advances in DL, most existing models in medical AI focus on static images, overlooking critical temporal cues present in video data. To bridge this gap, a novel DL-based framework is proposed for spatiotemporal feature extraction from medical video sequences. As a feasibility use case, this study focuses on gastrointestinal (GI) endoscopic video classification. A 3D convolutional neural network (CNN) is developed to classify upper and lower GI endoscopic videos using the hyperKvasir dataset, which contains 314 lower and 60 upper GI videos. To address data imbalance, 60 matched pairs of videos are randomly selected across 20 experimental runs. Videos are resized to 224 × 224, and the 3D CNN captures spatiotemporal information. A 3D version of the parallel spatial and channel squeeze-and-excitation (P-scSE) is implemented, and a new block called the residual with parallel attention (RPA) block is proposed by combining P-scSE3D with a residual block. To reduce computational complexity, a (2 + 1)D convolution is used in place of full 3D convolution. The model achieves an average accuracy of 0.933, precision of 0.932, recall of 0.944, F1-score of 0.935, and AUC of 0.933. It is also observed that the integration of P-scSE3D increased the F1-score by 7%. This preliminary work opens avenues for exploring various GI endoscopic video-based prospective studies.</p>","PeriodicalId":37035,"journal":{"name":"Journal of Imaging","volume":"11 7","pages":""},"PeriodicalIF":2.7000,"publicationDate":"2025-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12295846/pdf/","citationCount":"0","resultStr":"{\"title\":\"A Novel 3D Convolutional Neural Network-Based Deep Learning Model for Spatiotemporal Feature Mapping for Video Analysis: Feasibility Study for Gastrointestinal Endoscopic Video Classification.\",\"authors\":\"Mrinal Kanti Dhar, Mou Deb, Poonguzhali Elangovan, Keerthy Gopalakrishnan, Divyanshi Sood, Avneet Kaur, Charmy Parikh, Swetha Rapolu, Gianeshwaree Alias Rachna Panjwani, Rabiah Aslam Ansari, Naghmeh Asadimanesh, Shiva Sankari Karuppiah, Scott A Helgeson, Venkata S Akshintala, Shivaram P Arunachalam\",\"doi\":\"10.3390/jimaging11070243\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Accurate analysis of medical videos remains a major challenge in deep learning (DL) due to the need for effective spatiotemporal feature mapping that captures both spatial detail and temporal dynamics. Despite advances in DL, most existing models in medical AI focus on static images, overlooking critical temporal cues present in video data. To bridge this gap, a novel DL-based framework is proposed for spatiotemporal feature extraction from medical video sequences. As a feasibility use case, this study focuses on gastrointestinal (GI) endoscopic video classification. 
A 3D convolutional neural network (CNN) is developed to classify upper and lower GI endoscopic videos using the hyperKvasir dataset, which contains 314 lower and 60 upper GI videos. To address data imbalance, 60 matched pairs of videos are randomly selected across 20 experimental runs. Videos are resized to 224 × 224, and the 3D CNN captures spatiotemporal information. A 3D version of the parallel spatial and channel squeeze-and-excitation (P-scSE) is implemented, and a new block called the residual with parallel attention (RPA) block is proposed by combining P-scSE3D with a residual block. To reduce computational complexity, a (2 + 1)D convolution is used in place of full 3D convolution. The model achieves an average accuracy of 0.933, precision of 0.932, recall of 0.944, F1-score of 0.935, and AUC of 0.933. It is also observed that the integration of P-scSE3D increased the F1-score by 7%. This preliminary work opens avenues for exploring various GI endoscopic video-based prospective studies.</p>\",\"PeriodicalId\":37035,\"journal\":{\"name\":\"Journal of Imaging\",\"volume\":\"11 7\",\"pages\":\"\"},\"PeriodicalIF\":2.7000,\"publicationDate\":\"2025-07-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12295846/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Imaging\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.3390/jimaging11070243\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"IMAGING SCIENCE & PHOTOGRAPHIC TECHNOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Imaging","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3390/jimaging11070243","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"IMAGING SCIENCE & PHOTOGRAPHIC TECHNOLOGY","Score":null,"Total":0}
Citations: 0
A Novel 3D Convolutional Neural Network-Based Deep Learning Model for Spatiotemporal Feature Mapping for Video Analysis: Feasibility Study for Gastrointestinal Endoscopic Video Classification.
Accurate analysis of medical videos remains a major challenge in deep learning (DL) because it requires effective spatiotemporal feature mapping that captures both spatial detail and temporal dynamics. Despite advances in DL, most existing models in medical AI focus on static images, overlooking the critical temporal cues present in video data. To bridge this gap, a novel DL-based framework is proposed for spatiotemporal feature extraction from medical video sequences. As a feasibility use case, this study focuses on gastrointestinal (GI) endoscopic video classification. A 3D convolutional neural network (CNN) is developed to classify upper and lower GI endoscopic videos using the HyperKvasir dataset, which contains 314 lower and 60 upper GI videos. To address this class imbalance, 60 matched pairs of videos are randomly selected in each of 20 experimental runs. Video frames are resized to 224 × 224, and the 3D CNN captures spatiotemporal information. A 3D version of the parallel spatial and channel squeeze-and-excitation module (P-scSE3D) is implemented, and a new block, the residual with parallel attention (RPA) block, is proposed by combining P-scSE3D with a residual block. To reduce computational complexity, (2 + 1)D convolutions are used in place of full 3D convolutions. The model achieves an average accuracy of 0.933, precision of 0.932, recall of 0.944, F1-score of 0.935, and AUC of 0.933. It is also observed that integrating P-scSE3D increased the F1-score by 7%. This preliminary work opens avenues for a range of prospective studies based on GI endoscopic video.
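To make the attention mechanism concrete, below is a minimal PyTorch sketch of a 3D parallel spatial and channel squeeze-and-excitation (P-scSE3D) block, following the standard cSE/sSE formulation. The additive fusion of the two branches and the reduction ratio (r = 8) are assumptions; the paper's exact design may differ.

```python
# Minimal sketch of a P-scSE3D block: channel and spatial
# squeeze-and-excitation applied in parallel to a 5D feature map.
import torch
import torch.nn as nn

class PscSE3D(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        # Channel SE (cSE): squeeze space with global pooling, excite channels.
        self.cse = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),
            nn.Conv3d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial SE (sSE): squeeze channels with a 1x1x1 conv, excite voxels.
        self.sse = nn.Sequential(
            nn.Conv3d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # Recalibrate along both axes in parallel, then fuse the two branches.
        return x * self.cse(x) + x * self.sse(x)
```

Applied to a feature map of shape (N, C, D, H, W), the block re-weights channels and voxels in parallel at negligible parameter cost.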
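The (2 + 1)D factorization and the RPA block can be sketched along the same lines, continuing the code above (reusing `PscSE3D` and the `torch.nn` import). The factorization here follows the common R(2+1)D pattern, a 2D spatial convolution followed by a 1D temporal convolution; the layer count, batch normalization, and projection shortcut inside the RPA block are assumptions rather than the paper's exact configuration.

```python
class Conv2Plus1D(nn.Module):
    """A 3x3x3 convolution factorized into (2+1)D: spatial 1x3x3, then temporal 3x1x1."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3),
                                 padding=(0, 1, 1), bias=False)
        self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(3, 1, 1),
                                  padding=(1, 0, 0), bias=False)

    def forward(self, x):
        return self.temporal(self.spatial(x))

class RPABlock(nn.Module):
    """Residual block with P-scSE3D attention on the residual branch."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = Conv2Plus1D(in_ch, out_ch)
        self.bn1 = nn.BatchNorm3d(out_ch)
        self.conv2 = Conv2Plus1D(out_ch, out_ch)
        self.bn2 = nn.BatchNorm3d(out_ch)
        self.attn = PscSE3D(out_ch)
        # Project the identity path when the channel count changes.
        self.skip = (nn.Identity() if in_ch == out_ch
                     else nn.Conv3d(in_ch, out_ch, kernel_size=1, bias=False))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = self.attn(out)  # parallel attention before the skip connection
        return self.relu(out + self.skip(x))
```

For example, `RPABlock(3, 32)(torch.randn(1, 3, 16, 224, 224))` yields a feature map of shape (1, 32, 16, 224, 224), consistent with the 224 × 224 input resolution noted in the abstract.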
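Finally, the class-balancing protocol (60 matched pairs per run, 20 runs, metrics averaged) amounts to repeated random undersampling of the majority class. A minimal sketch follows, in which `train_and_evaluate` and the two video-list inputs are hypothetical placeholders:

```python
import random
import statistics

def balanced_evaluation(lower_videos, upper_videos, train_and_evaluate,
                        n_runs=20, seed=0):
    """Undersample the 314 lower-GI videos to match the 60 upper-GI videos
    in each run, then average the per-run score (e.g., F1) across runs."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_runs):
        # Draw a fresh balanced sample of the majority class for this run.
        lower_sample = rng.sample(lower_videos, k=len(upper_videos))
        scores.append(train_and_evaluate(lower_sample, upper_videos))
    return statistics.mean(scores)
```

Averaging over repeated balanced samples, rather than undersampling once, reduces the variance introduced by discarding most of the lower-GI videos in any single run.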