Landmark-free automatic digital twin registration in robot-assisted partial nephrectomy using a generic end-to-end model
Kilian Chandelon, Alice Pitout, Mathieu Souchaud, Julie Desternes, Gaëlle Margue, Julien Peyras, Nicolas Bourdel, Jean-Christophe Bernhard, Adrien Bartoli
International Journal of Computer Assisted Radiology and Surgery. Published 2025-07-17. DOI: 10.1007/s11548-025-03473-3

Purpose: Augmented Reality in Minimally Invasive Surgery has made tremendous progress in organs including the liver and the uterus. The core problem of Augmented Reality is registration, in which a geometric digital twin of the patient, built preoperatively, must be aligned with the image from the surgical camera. The case of the kidney remains unresolved, owing to the absence of anatomical landmarks visible in both the patient's digital twin and the surgical images.

Methods: We propose a landmark-free approach to registration, particularly well suited to the kidney. The approach combines a generic kidney model with an end-to-end neural network, which we train on a proposed dataset to regress the registration directly from a surgical RGB image.

Results: Experimental evaluation across four clinical cases demonstrates strong concordance with expert-labelled registration, despite anatomical and motion variability. The proposed method achieved an average tumour contour alignment error of 7.3 ± 4.1 mm in 9.4 ± 0.2 ms.

Conclusion: This landmark-free registration approach meets the accuracy, speed, and resource constraints required in clinical practice, making it a promising tool for Augmented Reality-Assisted Partial Nephrectomy.
{"title":"Evaluating large language models on hospital health data for automated emergency triage.","authors":"Carlos Lafuente, Mehdi Rahim","doi":"10.1007/s11548-025-03475-1","DOIUrl":"https://doi.org/10.1007/s11548-025-03475-1","url":null,"abstract":"<p><strong>Purpose: </strong>Large language models (LLMs) have a significant potential in healthcare due to their ability to process unstructured text from electronic health records (EHRs) and to generate knowledge with few or no training. In this study, we investigate the effectiveness of LLMs for clinical decision support, specifically in the context of emergency department triage, where the volume of textual data is minimal compared to other scenarios such as making a clinical diagnosis.</p><p><strong>Methods: </strong>We benchmark LLMs with traditional machine learning (ML) approaches using the Emergency Severity Index (ESI) as the gold standard criteria of triage. The benchmark includes general purpose, specialised, and fine-tuned LLMs. All models are prompted to predict ESI score from a EHRs. We use a balanced subset (n = 1000) from MIMIC-IV-ED, a large database containing records of admissions to the emergency department of Beth Israel Deaconess Medical Center.</p><p><strong>Results: </strong>Our findings show that the best-performing models have an average F1-score below 0.60. Also, while zero-shot and fine-tuned LLMs can outperform standard ML models, their performance is surpassed by ML models augmented with features derived from LLMs or knowledge graphs.</p><p><strong>Conclusion: </strong>LLMs show value for clinical decision support in scenarios with limited textual data, such as emergency department triage. The study advocates for integrating LLM knowledge representation to improve existing ML models rather than using LLMs in isolation, suggesting this as a more promising approach to enhance the accuracy of automated triage systems.</p>","PeriodicalId":51251,"journal":{"name":"International Journal of Computer Assisted Radiology and Surgery","volume":" ","pages":""},"PeriodicalIF":2.3,"publicationDate":"2025-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144644112","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient needle guidance: multi-camera augmented reality navigation without patient-specific calibration.","authors":"Yizhi Wei, Bingyu Huang, Bolin Zhao, Zhengyu Lin, Steven Zhiying Zhou","doi":"10.1007/s11548-025-03477-z","DOIUrl":"https://doi.org/10.1007/s11548-025-03477-z","url":null,"abstract":"<p><strong>Purpose: </strong>Augmented reality (AR) technology holds significant promise for enhancing surgical navigation in needle-based procedures such as biopsies and ablations. However, most existing AR systems rely on patient-specific markers, which disrupt clinical workflows and require time-consuming preoperative calibrations, thereby hindering operational efficiency and precision.</p><p><strong>Methods: </strong>We developed a novel multi-camera AR navigation system that eliminates the need for patient-specific markers by utilizing ceiling-mounted markers mapped to fixed medical imaging devices. A hierarchical optimization framework integrates both marker mapping and multi-camera calibration. Deep learning techniques are employed to enhance marker detection and registration accuracy. Additionally, a vision-based pose compensation method is implemented to mitigate errors caused by patient movement, improving overall positional accuracy.</p><p><strong>Results: </strong>Validation through phantom experiments and simulated clinical scenarios demonstrated an average puncture accuracy of 3.72 ± 1.21 mm. The system reduced needle placement time by 20 s compared to traditional marker-based methods. It also effectively corrected errors induced by patient movement, with a mean positional error of 0.38 pixels and an angular deviation of 0.51 <math><mmultiscripts><mrow></mrow> <mrow></mrow> <mo>∘</mo></mmultiscripts> </math> . These results highlight the system's precision, adaptability, and reliability in realistic surgical conditions.</p><p><strong>Conclusion: </strong>This marker-free AR guidance system significantly streamlines surgical workflows while enhancing needle navigation accuracy. Its simplicity, cost-effectiveness, and adaptability make it an ideal solution for both high- and low-resource clinical environments, offering the potential for improved precision, reduced procedural time, and better patient outcomes.</p>","PeriodicalId":51251,"journal":{"name":"International Journal of Computer Assisted Radiology and Surgery","volume":" ","pages":""},"PeriodicalIF":2.3,"publicationDate":"2025-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144621053","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Assessment of cognitive load in the context of neurosurgery
Daniel A Di Giovanni, M Kersten-Oertel, S Drouin, D L Collins
International Journal of Computer Assisted Radiology and Surgery. Published 2025-07-12. DOI: 10.1007/s11548-025-03478-y

Purpose: Image-guided neurosurgery demands precise depth perception to minimize cognitive burden during intricate navigational tasks. Existing evaluation methods rely heavily on subjective user feedback, which can be biased and inconsistent. This study uses a physiological measure, electroencephalography (EEG), to quantify cognitive load when using novel dynamic depth-cue visualizations. By comparing dynamic and static rendering techniques, we aim to establish an objective framework for assessing and validating visualization strategies beyond traditional performance metrics.

Methods: Twenty participants (experts in brain imaging) navigated to specified targets within a computed tomography angiography (CTA) volume using a tracked 3D pointer. We implemented three visualization methods (shading, ChromaDepth, aerial perspective) in both static and dynamic modes, randomized across 80 trials per subject. Continuous EEG was recorded via a Muse headband; raw signals were preprocessed and theta-band (4-7 Hz) power was extracted for each trial. A two-way repeated-measures ANOVA assessed the effects of visualization type and dynamic interaction on theta power.

Results: Dynamic visualization conditions yielded lower mean theta-band power than static conditions (Δ = 0.057 V²/Hz; F(1,19) = 6.00, p = 0.024), indicating reduced neural markers of cognitive load. No significant main effect was observed across visualization methods, nor for their interaction with dynamic mode. These findings suggest that real-time feedback from pointer-driven interactions may alleviate mental effort regardless of the specific depth cue employed.

Conclusion: Our exploratory results demonstrate the feasibility of using consumer-grade EEG to provide objective insights into the cognitive load of surgical visualization techniques. Although limited by non-surgeon participants, the observed theta-power reductions under dynamic conditions support further investigation. Future work should correlate EEG-derived load measures with performance outcomes, involve practising neurosurgeons, and leverage high-density EEG or AI-driven adaptive visualization to refine and validate these preliminary findings.
Comparison of the accuracy of different slot properties of 3D-printed cutting guides for raising free fibular flaps using saw or piezoelectric instruments: an in vitro study
Britta Maria Lohn, Stefan Raith, Mark Ooms, Philipp Winnand, Frank Hölzle, Ali Modabber
International Journal of Computer Assisted Radiology and Surgery. Published 2025-07-12. DOI: 10.1007/s11548-025-03474-2

Purpose: The free fibular flap (FFF) is a standard procedure for the oral rehabilitation of segmental bone defects in the mandible caused by conditions such as malignant processes, osteonecrosis, or trauma. Digital guides and computer-assisted surgery (CAS) can improve precision and reduce the time and cost of surgery. This study evaluates how different cutting-guide slot designs, guiding heights, and cutting instruments affect surgical accuracy during mandibular reconstruction.

Methods: Ninety model operations on a three-part fibular transplant for mandibular reconstruction were conducted according to digital planning, with three guide designs (standard, flange, and anatomical slots), three guide heights (1 mm, 2 mm, 3 mm), and two osteotomy instruments (piezoelectric instrument and saw). The cut segments were digitized using computed tomography and evaluated digitally to assess surgical accuracy.

Results: For vestibular and lingual segment length, the anatomical slot and the flange appear to be the most accurate, with the flange slightly under-contoured vestibularly and the standard slot over-contoured lingually and vestibularly (p < 0.001). There were only minor differences between the saw and the piezoelectric instrument for lingual (p = 0.005) and vestibular (p < 0.001) length and proximal angle (p = 0.014). The U-distance after global reconstruction for flanges showed a median deviation of 0.0468 mm (IQR 8.15), which was not significant (p = 0.067).

Conclusion: Anatomical slots and flanges are recommended for osteotomy, with guiding effects relying on both haptic and visual control. Unilaterally guided flanges also work accurately at high guidance heights. The piezoelectric instrument (PI) and the saw performed comparably in the assessment of individual segments and U-reconstruction in this in vitro study without soft tissue, so the final choice is left to the expertise of the surgeon.
GPU-accelerated deformation mapping in hybrid organ models for real-time simulation
Rintaro Miyazaki, Yuichiro Hayashi, Masahiro Oda, Kensaku Mori
International Journal of Computer Assisted Radiology and Surgery. Published 2025-07-07. DOI: 10.1007/s11548-025-03377-2

Purpose: Surgical simulation is expected to be an effective way for physicians and medical students to learn surgical skills. To achieve real-time deformation of soft tissues with high visual quality, multi-resolution and adaptive mesh refinement models have been introduced. However, these models require additional processing time to map the deformation results of the deformed lattice onto a polygon model. In this study, we propose a method to accelerate this process using vertex shaders on the GPU and investigate its performance.

Methods: A hierarchical octree cube structure is generated from a high-resolution organ polygon model, and the organ model is divided into pieces according to the cube structure. During simulation, the vertex coordinates of the organ model pieces are obtained by trilinear interpolation of the eight vertex coordinates of the enclosing cube. This process is written as a shader program, so organ model vertices are processed in the rendering pipeline for acceleration.

Results: For a constant number of processing cubes, the CPU-based processing time increased linearly with the total number of organ model vertices, whereas the GPU-based time was nearly constant. Conversely, for a constant number of model vertices, the GPU-based time increased linearly with the number of surface cubes. These linear relationships determine the conditions under which the GPU-based implementation is faster within the same frame time.

Conclusion: We implemented octree cube deformation mapping using vertex shaders and confirmed its performance. The experimental results showed that the GPU can accelerate the mapping process for high-resolution organ models with a large number of vertices.
{"title":"eNCApsulate: neural cellular automata for precision diagnosis on capsule endoscopes.","authors":"Henry John Krumb, Anirban Mukhopadhyay","doi":"10.1007/s11548-025-03425-x","DOIUrl":"https://doi.org/10.1007/s11548-025-03425-x","url":null,"abstract":"<p><strong>Purpose: </strong>Wireless capsule endoscopy (WCE) is a noninvasive imaging method for the entire gastrointestinal tract and is a pain-free alternative to traditional endoscopy. It generates extensive video data that requires significant review time, and localizing the capsule after ingestion is a challenge. Techniques like bleeding detection and depth estimation can help with localization of pathologies, but deep learning models are typically too large to run directly on the capsule.</p><p><strong>Methods: </strong>Neural cellular automata (NCAs) for bleeding segmentation and depth estimation are trained on capsule endoscopic images. For monocular depth estimation, we distill a large foundation model into the lean NCA architecture, by treating the outputs of the foundation model as pseudo-ground truth. We then port the trained NCAs to the ESP32 microcontroller, enabling efficient image processing on hardware as small as a camera capsule.</p><p><strong>Results: </strong>NCAs are more accurate (Dice) than other portable segmentation models, while requiring more than 100x fewer parameters stored in memory than other small-scale models. The visual results of NCAs depth estimation look convincing and in some cases beat the realism and detail of the pseudo-ground truth. Runtime optimizations on the ESP32-S3 accelerate the average inference speed significantly, by more than factor 3.</p><p><strong>Conclusion: </strong>With several algorithmic adjustments and distillation, it is possible to eNCApsulate NCA models into microcontrollers that fit into wireless capsule endoscopes. This is the first work that enables reliable bleeding segmentation and depth estimation on a miniaturized device, paving the way for precise diagnosis combined with visual odometry as a means of precise localization of the capsule-on the capsule.</p>","PeriodicalId":51251,"journal":{"name":"International Journal of Computer Assisted Radiology and Surgery","volume":" ","pages":""},"PeriodicalIF":2.3,"publicationDate":"2025-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144565483","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bridging the gap between models and reality: development of a research environment for an object-oriented hospital information system to integrate artificial intelligence and robotics into clinical practice
Sidra Rashid, Lukas Bernhard, Sonja Stabenow, Emily Spicker, Charlotte Haid, Carl König, Hedi Louise Kramer, Sandro Pischinger, Daniel Schade, Johannes Fottner, Dirk Wilhelm, Maximilian Berlet
International Journal of Computer Assisted Radiology and Surgery. Published 2025-07-03. DOI: 10.1007/s11548-025-03470-6

Introduction: Hospital information systems (HISs) are the main point of access to computer-based patient administration for medical professionals. However, current HISs are designed primarily as office applications rather than as comprehensive management and support tools, and their inflexible architecture hinders the integration of modern technologies such as artificial intelligence (AI) models and medical robotics (MR). We therefore conceptualized an object-oriented HIS (oHIS) as a pragmatic digital twin (PDT) of the entire patient-care process in a hospital and developed functional research framework software for further investigation of how an oHIS can bridge the gap between models and reality.

Material and methods: In an interdisciplinary team of engineers and physicians, we conducted a requirements assessment on the surgical wards of the TUM University Hospital in Munich. We then designed the research framework, named OMNI-SYS, and developed it into a functional research platform capable of bridging the gap between a model management system and real-world agents. Finally, we evaluated the framework in a simulated clinical use case.

Results: Our analysis revealed that future-proof HIS design is an under-researched topic, and the integration of new technologies into clinical practice is insufficiently prepared for. Our approach could address this shortcoming by allowing human agents, devices, models, and robots to interact within a PDT. Models can be integrated as quasi-natural objects and interact with representations of tangible objects in real time; this approach even accommodates technologies that are still unimaginable today. Our oHIS research framework enabled functional object representation in a simulated use case.

Conclusion: An oHIS could significantly facilitate the integration of future technologies such as AI models and MR. The OMNI-SYS framework could serve as a cornerstone for further research into this new approach. Studies on its clinical application and formalization are already planned in preparation for a possible future standard.
Enhancing YOLO for laparoscopic tool detection: novel data augmentation and structural modifications addressing mis-detection of bifurcated targets
Yuzhang Liu, Yuichiro Hayashi, Masahiro Oda, Kensaku Mori
International Journal of Computer Assisted Radiology and Surgery. Published 2025-07-03. DOI: 10.1007/s11548-025-03352-x

Purpose: Laparoscopic tool detection is vital for assisting minimally invasive surgeries, aiding tasks like tool pose estimation and surgical navigation. This study enhances YOLO models for better detection of bifurcated targets (BT) in such procedures, addressing mis-detection of bifurcated targets (MDBT), where BT tips are misidentified as separate entities or overlooked.

Methods: We propose a data augmentation strategy, Random Target Masking (RTM), to prevent the model from identifying BT tips as separate laparoscopic tools, and Mixup Plus (MUP) to balance instance counts across categories with varying BT proportions. Additionally, we employ the Space-to-Depth Convolution (SPD-Conv) block for downsampling to curb the model's tendency to overlook small BT tips.

Results: A YOLOv8 model featuring our modifications, evaluated on our dataset derived from EndoVis17 and EndoVis18, showed improvement in both mAP50 and mAP50:95 on the test dataset. On the BT test dataset specifically, mAP50 and mAP50:95 improved by nearly 0.2 and 0.1, respectively. For the Clip Applier category, which has the fewest instances (fewer than 100 across the entire training and test datasets), the YOLOv8n model incorporating our modifications increased AP50 from 0.0251 to 0.457.

Conclusion: This study improved BT detection accuracy in laparoscopic tool detection using YOLO models, incorporating the RTM and MUP data augmentation techniques along with SPD-Conv block integration. Experimental evaluations on the EndoVis datasets validated the enhancements, and the ablation study confirmed the effectiveness of each proposed improvement, particularly highlighting the distinct advantages of the proposed data augmentation methods.
{"title":"Watch and learn: leveraging expert knowledge and language for surgical video understanding.","authors":"David Gastager, Ghazal Ghazaei, Constantin Patsch","doi":"10.1007/s11548-025-03472-4","DOIUrl":"https://doi.org/10.1007/s11548-025-03472-4","url":null,"abstract":"<p><strong>Purpose: </strong>Automated surgical workflow analysis is a common yet challenging task with diverse applications in surgical education, research, and clinical decision-making. Although videos are commonly collected during surgical interventions, the lack of annotated datasets hinders the development of accurate and comprehensive workflow analysis solutions. We introduce a novel approach for addressing the sparsity and heterogeneity of annotated training data inspired by the human learning procedure of watching experts and understanding their explanations.</p><p><strong>Methods: </strong>Our method leverages a video-language model trained on alignment, denoising, and generative tasks to learn short-term spatio-temporal and multimodal representations. A task-specific temporal model is then used to capture relationships across entire videos. To achieve comprehensive video-language understanding in the surgical domain, we introduce a data collection and filtering strategy to construct a large-scale pretraining dataset from educational YouTube videos. We then utilize parameter-efficient fine-tuning by projecting downstream task annotations from publicly available surgical datasets into the language domain.</p><p><strong>Results: </strong>Extensive experiments in two surgical domains demonstrate the effectiveness of our approach, with performance improvements of up to 7% in phase segmentation tasks, 5% in zero-shot phase segmentation, and comparable capabilities to fully supervised models in few-shot settings. Harnessing our model's capabilities for long-range temporal localization and text generation, we present the first comprehensive solution for dense video captioning (DVC) of surgical videos, addressing this task despite the absence of existing DVC datasets in the surgical domain.</p><p><strong>Conclusion: </strong>We introduce a novel approach to surgical workflow understanding that leverages video-language pretraining, large-scale video pretraining, and optimized fine-tuning. Our method improves performance over state-of-the-art techniques and enables new downstream tasks for surgical video understanding.</p>","PeriodicalId":51251,"journal":{"name":"International Journal of Computer Assisted Radiology and Surgery","volume":" ","pages":""},"PeriodicalIF":2.3,"publicationDate":"2025-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144546070","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}