Token-Mixer: Bind Image and Text in One Embedding Space for Medical Image Reporting.

IEEE Transactions on Medical Imaging. Pub Date: 2024-06-11. DOI: 10.1109/TMI.2024.3412402

Yan Yang, Jun Yu, Zhenqi Fu, Ke Zhang, Ting Yu, Xianyun Wang, Hanliang Jiang, Junhui Lv, Qingming Huang, Weidong Han



Abstract

Medical image reporting, which aims to automatically generate diagnostic reports from medical images, has garnered growing research attention. In this task, learning cross-modal alignment between images and reports is crucial. However, the exposure bias problem in autoregressive text generation poses a notable challenge, as the model is optimized by a word-level loss function using the teacher-forcing strategy. To this end, we propose a novel Token-Mixer framework that learns to bind image and text in one embedding space for medical image reporting. Concretely, Token-Mixer enhances cross-modal alignment by matching image-to-text generation with text-to-text generation, which suffers less from exposure bias. The framework contains an image encoder, a text encoder, and a text decoder. In training, images and paired reports are first encoded into image tokens and text tokens, and these tokens are randomly mixed to form mixed tokens. Then, the text decoder accepts image tokens, text tokens, or mixed tokens as prompt tokens and conducts text generation for network optimization. Furthermore, we introduce a tailored text decoder and an alternative training strategy that integrate well with our Token-Mixer framework. Extensive experiments across three publicly available datasets demonstrate that Token-Mixer successfully enhances image-text alignment and thereby attains state-of-the-art performance. Related code is available at https://github.com/yangyan22/Token-Mixer.
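The abstract does not specify how image and text tokens are mixed, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the two token sequences have equal length and that each position is drawn independently from either sequence. The names `mix_tokens`, `choose_prompt`, and `mix_prob` are hypothetical and do not come from the paper or its repository.

```python
import random


def mix_tokens(image_tokens, text_tokens, mix_prob=0.5, rng=None):
    """Return a mixed sequence built position by position.

    Assumption (not stated in the abstract): both sequences have the same
    length, and each position takes the text token with probability
    `mix_prob`, otherwise the image token.
    """
    if len(image_tokens) != len(text_tokens):
        raise ValueError("this sketch assumes equal-length token sequences")
    rng = rng or random.Random()
    mixed = []
    for img_tok, txt_tok in zip(image_tokens, text_tokens):
        # rng.random() lies in [0, 1): mix_prob=0 keeps every image token,
        # mix_prob=1 keeps every text token.
        mixed.append(txt_tok if rng.random() < mix_prob else img_tok)
    return mixed


def choose_prompt(image_tokens, text_tokens, rng=None):
    """Pick image, text, or mixed tokens as the decoder prompt.

    Uniform sampling over the three prompt types is an assumption made
    for illustration only.
    """
    rng = rng or random.Random()
    mode = rng.choice(["image", "text", "mixed"])
    if mode == "image":
        return mode, list(image_tokens)
    if mode == "text":
        return mode, list(text_tokens)
    return mode, mix_tokens(image_tokens, text_tokens, rng=rng)
```

In a training loop, the prompt returned by `choose_prompt` would be passed to the text decoder, which generates the report and is optimized with the usual word-level loss. Because all three prompt types target the same report, the image and text tokens are pushed toward one shared embedding space.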
