用于口咽拭子点定位的自对准多模式变压器

IF 6.6 1区计算机科学 Q1 Multidisciplinary

Tsinghua Science and Technology Pub Date : 2024-02-09 DOI:10.26599/TST.2023.9010070

Tianyu Liu;Fuchun Sun

{"title":"用于口咽拭子点定位的自对准多模式变压器","authors":"Tianyu Liu;Fuchun Sun","doi":"10.26599/TST.2023.9010070","DOIUrl":null,"url":null,"abstract":"The oropharyngeal swabbing is a pre-diagnostic procedure used to test various respiratory diseases, including COVID and Influenza A (H1N1). To improve the testing efficiency of testing, a real-time, accurate, and robust sampling point localization algorithm is needed for robots. However, current solutions rely heavily on visual input, which is not reliable enough for large-scale deployment. The transformer has significantly improved the performance of image-related tasks and challenged the dominance of traditional convolutional neural networks (CNNs) in the image field. Inspired by its success, we propose a novel self-aligning multi-modal transformer (SAMMT) to dynamically attend to different parts of unaligned feature maps, preventing information loss caused by perspective disparity and simplifying overall implementation. Unlike preexisting multi-modal transformers, our attention mechanism works in image space instead of embedding space, rendering the need for the sensor registration process obsolete. To facilitate the multi-modal task, we collected and annotate an oropharynx localization/segmentation dataset by trained medical personnel. This dataset is open-sourced and can be used for future multi-modal research. Our experiments show that our model improves the performance of the localization task by 4.2% compared to the pure visual model, and reduces the pixel-wise error rate of the segmentation task by 16.7% compared to the CNN baseline.","PeriodicalId":48690,"journal":{"name":"Tsinghua Science and Technology","volume":"29 4","pages":"1082-1091"},"PeriodicalIF":6.6000,"publicationDate":"2024-02-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10431728","citationCount":"0","resultStr":"{\"title\":\"Self-Aligning Multi-Modal Transformer for Oropharyngeal Swab Point Localization\",\"authors\":\"Tianyu Liu;Fuchun Sun\",\"doi\":\"10.26599/TST.2023.9010070\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The oropharyngeal swabbing is a pre-diagnostic procedure used to test various respiratory diseases, including COVID and Influenza A (H1N1). To improve the testing efficiency of testing, a real-time, accurate, and robust sampling point localization algorithm is needed for robots. However, current solutions rely heavily on visual input, which is not reliable enough for large-scale deployment. The transformer has significantly improved the performance of image-related tasks and challenged the dominance of traditional convolutional neural networks (CNNs) in the image field. Inspired by its success, we propose a novel self-aligning multi-modal transformer (SAMMT) to dynamically attend to different parts of unaligned feature maps, preventing information loss caused by perspective disparity and simplifying overall implementation. Unlike preexisting multi-modal transformers, our attention mechanism works in image space instead of embedding space, rendering the need for the sensor registration process obsolete. To facilitate the multi-modal task, we collected and annotate an oropharynx localization/segmentation dataset by trained medical personnel. This dataset is open-sourced and can be used for future multi-modal research. Our experiments show that our model improves the performance of the localization task by 4.2% compared to the pure visual model, and reduces the pixel-wise error rate of the segmentation task by 16.7% compared to the CNN baseline.\",\"PeriodicalId\":48690,\"journal\":{\"name\":\"Tsinghua Science and Technology\",\"volume\":\"29 4\",\"pages\":\"1082-1091\"},\"PeriodicalIF\":6.6000,\"publicationDate\":\"2024-02-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10431728\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Tsinghua Science and Technology\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10431728/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"Multidisciplinary\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Tsinghua Science and Technology","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10431728/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"Multidisciplinary","Score":null,"Total":0}

引用次数: 0

摘要

口咽拭子是用于检测各种呼吸道疾病（包括 COVID 和甲型 H1N1 流感）的诊断前程序。为了提高检测效率，机器人需要一种实时、准确、稳健的采样点定位算法。然而，目前的解决方案严重依赖视觉输入，这对于大规模部署来说不够可靠。变压器大大提高了图像相关任务的性能，并对传统卷积神经网络（CNN）在图像领域的主导地位提出了挑战。受其成功经验的启发，我们提出了一种新颖的自对齐多模态变换器（SAMMT），可动态关注未对齐特征图的不同部分，防止因视角差异造成的信息丢失，并简化整体实现过程。与现有的多模态变换器不同，我们的关注机制在图像空间而非嵌入空间工作，因此无需传感器注册过程。为了促进多模态任务的完成，我们收集并注释了由训练有素的医务人员制作的口咽定位/分割数据集。该数据集已开源，可用于未来的多模态研究。实验表明，与纯视觉模型相比，我们的模型将定位任务的性能提高了 4.2%；与 CNN 基线相比，我们的模型将分割任务的像素误差率降低了 16.7%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Self-Aligning Multi-Modal Transformer for Oropharyngeal Swab Point Localization

The oropharyngeal swabbing is a pre-diagnostic procedure used to test various respiratory diseases, including COVID and Influenza A (H1N1). To improve the testing efficiency of testing, a real-time, accurate, and robust sampling point localization algorithm is needed for robots. However, current solutions rely heavily on visual input, which is not reliable enough for large-scale deployment. The transformer has significantly improved the performance of image-related tasks and challenged the dominance of traditional convolutional neural networks (CNNs) in the image field. Inspired by its success, we propose a novel self-aligning multi-modal transformer (SAMMT) to dynamically attend to different parts of unaligned feature maps, preventing information loss caused by perspective disparity and simplifying overall implementation. Unlike preexisting multi-modal transformers, our attention mechanism works in image space instead of embedding space, rendering the need for the sensor registration process obsolete. To facilitate the multi-modal task, we collected and annotate an oropharynx localization/segmentation dataset by trained medical personnel. This dataset is open-sourced and can be used for future multi-modal research. Our experiments show that our model improves the performance of the localization task by 4.2% compared to the pure visual model, and reduces the pixel-wise error rate of the segmentation task by 16.7% compared to the CNN baseline.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Tsinghua Science and Technology COMPUTER SCIENCE, INFORMATION SYSTEMSCOMPU-COMPUTER SCIENCE, SOFTWARE ENGINEERING

CiteScore

10.20

自引率

10.60%

发文量

2340

期刊介绍： Tsinghua Science and Technology (Tsinghua Sci Technol) started publication in 1996. It is an international academic journal sponsored by Tsinghua University and is published bimonthly. This journal aims at presenting the up-to-date scientific achievements in computer science, electronic engineering, and other IT fields. Contributions all over the world are welcome.