Multimodal Integrated Knowledge Transfer to Large Language Models through Preference Optimization with Biomedical Applications.
Da Wu, Zhanliang Wang, Quan Nguyen, Zhuoran Xu, Kai Wang
ArXiv · Published 2025-05-09 · Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12083703/pdf/
The scarcity of high-quality multimodal biomedical data limits the ability to effectively fine-tune pretrained Large Language Models (LLMs) for specialized biomedical tasks. To address this challenge, we introduce MINT (Multimodal Integrated kNowledge Transfer), a framework that aligns unimodal large decoder models with domain-specific decision patterns from high-quality multimodal biomedical data through preference optimization. While MINT supports different optimization techniques, we primarily implement it with the Odds Ratio Preference Optimization (ORPO) framework as its backbone. This strategy enables the aligned LLMs to perform predictive tasks using text-only or image-only inputs while retaining knowledge learnt from multimodal data. MINT leverages an upstream multimodal machine learning (MML) model trained on high-quality multimodal data to transfer domain-specific insights to downstream text-only or image-only LLMs. We demonstrate MINT's effectiveness through two key applications: (1) Rare genetic disease prediction from text, where MINT uses a multimodal encoder model, trained on facial photos and clinical notes, to generate a preference dataset for aligning a lightweight decoder-based text-only LLM (Llama 3.2-3B-Instruct). Despite relying on text input only, the MINT-derived model outperforms models trained with Supervised Fine-Tuning (SFT), Retrieval-Augmented Generation (RAG), or Direct Preference Optimization (DPO), and even outperforms a much larger foundation model (Llama 3.1-405B-Instruct). (2) Tissue type classification using cell nucleus images, where MINT uses a vision-language foundation model, which contains knowledge learnt from both text and histopathological images, as the preference generator to align downstream image-only models. The resulting MINT-derived model significantly improves the performance of Llama 3.2-Vision-11B-Instruct on tissue type classification. In summary, MINT provides an effective strategy to align unimodal LLMs with high-quality multimodal expertise through preference optimization. Our study also highlights a hybrid strategy that grafts the classification strength of encoder models onto large decoder models to enhance reasoning, improve predictive performance, and reduce hallucination in biomedical applications.
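To make the alignment step concrete, the following is a minimal sketch of how the preference-optimization stage described in the abstract could be implemented with Hugging Face TRL's ORPOTrainer. Only ORPOConfig/ORPOTrainer and the Llama 3.2-3B-Instruct checkpoint name are real; the build_preference_dataset helper, the case records, the prompt wording, and all hyperparameters are illustrative assumptions, not the authors' released pipeline.

# Hypothetical sketch of a MINT-style alignment step: ranked predictions from an
# upstream multimodal model are converted into (prompt, chosen, rejected) pairs,
# which then align a text-only LLM via Odds Ratio Preference Optimization (ORPO).
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

MODEL_NAME = "meta-llama/Llama-3.2-3B-Instruct"

def build_preference_dataset(cases):
    """Turn upstream multimodal predictions into ORPO preference pairs.

    `cases` is an assumed format: each dict holds a clinical note and a list of
    diagnoses ranked by the upstream multimodal encoder (e.g., one trained on
    facial photos plus clinical notes).
    """
    rows = []
    for case in cases:
        prompt = (
            "Given the following clinical note, name the most likely "
            f"rare genetic disease.\n\nNote: {case['note']}\n\nAnswer:"
        )
        # Chosen = upstream model's top prediction; rejected = a lower-ranked one.
        rows.append({
            "prompt": prompt,
            "chosen": f" {case['ranked_diagnoses'][0]}",
            "rejected": f" {case['ranked_diagnoses'][-1]}",
        })
    return Dataset.from_list(rows)

def main():
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

    # Placeholder example; in practice the cases would come from the
    # high-quality multimodal dataset scored by the upstream encoder.
    cases = [
        {"note": "Infant with hypotonia, almond-shaped eyes, feeding difficulty.",
         "ranked_diagnoses": ["Prader-Willi syndrome", "Angelman syndrome"]},
    ]
    train_dataset = build_preference_dataset(cases)

    # ORPO adds an odds-ratio preference term on top of the SFT loss, so no
    # separate frozen reference model is needed (unlike DPO).
    config = ORPOConfig(
        output_dir="mint-orpo-llama3.2-3b",
        beta=0.1,                      # weight of the odds-ratio term (assumed value)
        per_device_train_batch_size=1,
        num_train_epochs=1,
        max_length=1024,
    )
    trainer = ORPOTrainer(
        model=model,
        args=config,
        train_dataset=train_dataset,
        processing_class=tokenizer,    # named `tokenizer` in older TRL releases
    )
    trainer.train()

if __name__ == "__main__":
    main()

The same recipe would, in principle, carry over to the second application by swapping the helper: prompts built around cell nucleus images, with chosen/rejected tissue-type labels drawn from the vision-language foundation model's rankings, used to align an image-capable model such as Llama 3.2-Vision-11B-Instruct.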