General surgery vision transformer: A video pre-trained foundation model for general surgery

arXiv - QuanBio - Tissues and Organs. Pub Date: 2024-03-09. DOI: arxiv-2403.05949

Samuel Schmidgall, Ji Woong Kim, Jeffery Jopling, Axel Krieger


Abstract

The absence of openly accessible data and specialized foundation models is a major barrier for computational research in surgery. To address this, (i) we open-source the largest dataset of general surgery videos to date, consisting of 680 hours of surgical video, including data from robotic and laparoscopic techniques across 28 procedures; (ii) we propose a technique for video pre-training a general surgery vision transformer (GSViT) on surgical videos, based on forward video prediction, that can run in real time for surgical applications, and we open-source the code and weights of GSViT; (iii) we release code and weights for procedure-specific fine-tuned versions of GSViT across 10 procedures; (iv) we demonstrate the performance of GSViT on the Cholec80 phase annotation task, where it improves on state-of-the-art single-frame predictors.
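As a rough illustration of what "forward video prediction" means as a self-supervised objective, the sketch below builds (input frame, future frame) training pairs from an unlabeled clip and scores a trivial copy-the-last-frame baseline. This is not the GSViT code: the function names, the `horizon` parameter, the flattened-list frame representation, and the mean-squared-error loss are all assumptions made for illustration, since the abstract does not specify the architecture or loss.

```python
from typing import List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def forward_prediction_pairs(
    frames: Sequence[Frame], horizon: int = 1
) -> List[Tuple[Frame, Frame]]:
    """Pair each frame with the frame `horizon` steps later.

    The later frame serves as the prediction target, so no manual
    labels are needed (hypothetical helper, not from the paper).
    """
    if horizon < 1:
        raise ValueError("horizon must be >= 1")
    return [(frames[t], frames[t + horizon]) for t in range(len(frames) - horizon)]


def mean_squared_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Average squared pixel difference (an assumed loss, chosen for simplicity)."""
    if len(pred) != len(target):
        raise ValueError("frames must have the same number of pixels")
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


# Toy "clip": four frames, each flattened to two pixel intensities.
clip = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.4], [0.3, 0.6]]
pairs = forward_prediction_pairs(clip, horizon=1)

# Baseline predictor: guess that the next frame equals the current one.
baseline_loss = sum(mean_squared_error(x, y) for x, y in pairs) / len(pairs)
print(f"{len(pairs)} training pairs, copy-last-frame loss = {baseline_loss:.4f}")
```

In an actual pre-training setup, a learned model would replace the copy-last-frame baseline, and its parameters would be updated to reduce the prediction loss on pairs like these.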

