Cephalometric landmark detection using vision transformers with direct coordinate prediction

IF 2.1; Zone 2 (Medicine); JCR Q2: DENTISTRY, ORAL SURGERY & MEDICINE

Journal of Cranio-Maxillofacial Surgery. Pub Date: 2025-07-01. DOI: 10.1016/j.jcms.2025.05.021

Filipe Laitenberger, Hannah T. Scheuer, Hanna A. Scheuer, Enno Lilienthal, Shaodi You, Reinhard E. Friedrich

Volume 53, Issue 9, Pages 1518-1529. Publisher page: https://www.sciencedirect.com/science/article/pii/S1010518225001866

Citations: 0

Abstract

Cephalometric Landmark Detection (CLD), i.e. annotating interest points in lateral X-ray images, is the crucial first step of every orthodontic therapy. While CLD has immense potential for automation using Deep Learning methods, carefully crafted contemporary approaches using convolutional neural networks and heatmap prediction do not qualify for large-scale clinical application due to insufficient performance. We propose a novel approach using Vision Transformers (ViTs) with direct coordinate prediction, avoiding the memory-intensive heatmap prediction common in previous work. Through extensive ablation studies comparing our method against contemporary CNN architectures (ConvNext V2) and heatmap-based approaches (Segformer), we demonstrate that ViTs with coordinate prediction achieve superior performance with more than 2 mm improvement in mean radial error compared to state-of-the-art CLD methods. Our results show that while non-adapted CNN architectures perform poorly on the given task, contemporary approaches may be too tailored to specific datasets, failing to generalize to different and especially sparse datasets. We conclude that using general-purpose Vision Transformers with direct coordinate prediction shows great promise for future research on CLD and medical computer vision.
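To make the two quantities in the abstract concrete, here is a minimal sketch, not the authors' code. It shows how mean radial error (the average Euclidean distance between predicted and annotated landmarks) is commonly computed, and how many values a heatmap head emits per image compared with a direct coordinate head. The function names, the `mm_per_pixel` scale, and the figures of 19 landmarks and a 512 x 512 output are illustrative assumptions and do not come from the paper.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]  # (x, y) in pixels


def mean_radial_error(pred: Sequence[Point],
                      truth: Sequence[Point],
                      mm_per_pixel: float = 1.0) -> float:
    """Average Euclidean distance between matching landmarks, in mm.

    mm_per_pixel is an assumed, dataset-specific pixel spacing.
    """
    if len(pred) != len(truth) or not pred:
        raise ValueError("pred and truth must be non-empty and equal in length")
    total = 0.0
    for (px, py), (tx, ty) in zip(pred, truth):
        total += math.hypot(px - tx, py - ty)  # radial error of one landmark
    return mm_per_pixel * total / len(pred)


def output_size(num_landmarks: int, height: int, width: int) -> Tuple[int, int]:
    """Values predicted per image: (heatmap head, direct coordinate head)."""
    heatmap_values = num_landmarks * height * width  # one map per landmark
    coordinate_values = num_landmarks * 2            # one (x, y) per landmark
    return heatmap_values, coordinate_values


if __name__ == "__main__":
    # Hypothetical example: two landmarks, 0.1 mm per pixel.
    pred = [(3.0, 4.0), (0.0, 0.0)]
    truth = [(0.0, 0.0), (0.0, 0.0)]
    print(mean_radial_error(pred, truth, mm_per_pixel=0.1))
    # Hypothetical example: 19 landmarks, 512 x 512 heatmaps.
    print(output_size(19, 512, 512))
```

Under these assumed sizes the heatmap head produces several million values per image while the coordinate head produces a few dozen, which is one way to read the abstract's remark that heatmap prediction is memory-intensive.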


Source journal

Journal of Cranio-Maxillofacial Surgery (Medicine: Surgery)

CiteScore: 5.20
Self-citation rate: 22.60%
Articles published: 117
Review time: 70 days

Journal description: The Journal of Cranio-Maxillofacial Surgery publishes articles covering all aspects of surgery of the head, face and jaw. Specific topics covered recently have included: • Distraction osteogenesis • Synthetic bone substitutes • Fibroblast growth factors • Fetal wound healing • Skull base surgery • Computer-assisted surgery • Vascularized bone grafts