B-cell lymphoma classification using vision-language models and in-context learning

Mobina Shrestha, Bishwas Mandal, Vishal Mandal, Amir Babu Shrestha
*Clinical and Translational Discovery*, vol. 5, no. 4. DOI: 10.1002/ctd2.70081. Published 2025-08-15. Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1002/ctd2.70081

Dear Editor,

Accurate classification of B-cell lymphoma is essential for guiding treatment decisions and prognostic assessments. Subtypes such as chronic lymphocytic leukaemia (CLL), follicular lymphoma (FL), and mantle cell lymphoma (MCL) often show overlapping morphologic features, particularly in small biopsies or poorly preserved samples. Even with supportive ancillary testing, distinguishing between these subtypes can be difficult, especially outside large university centers where hematopathology subspecialists may not be available. Digital pathology has brought with it the possibility of augmenting diagnostic accuracy with artificial intelligence (AI), particularly through deep learning algorithms. Several studies have shown promising results when convolutional neural networks are trained on thousands of annotated images to identify lymphoid neoplasms and other malignancies.1, 2 But these approaches often require large-scale, curated datasets annotated by domain experts.

This is where in-context learning (ICL) offers a meaningful alternative. ICL allows models to generate predictions based on just a few labelled examples shown at inference time, without the need for annotated datasets or model retraining. This mirrors how clinicians reason through new cases by recalling similar prior examples and using them to guide interpretation. Large vision-language models (VLMs) have demonstrated this ability in domains like dermatopathology, radiology, and gastrointestinal histology. However, despite this progress, to date there have been no studies applying ICL to lymphoma subtyping. Given that B-cell lymphomas have well-described morphologic patterns and are amongst the most common lymphoid neoplasms encountered in practice, they are an ideal test case for this approach.

Therefore, in this study, we evaluated four state-of-the-art VLMs, namely GPT-4o, Paligemma, CLIP and ALIGN, in classifying CLL, FL, and MCL using digital histopathology images. We assessed model performance in zero-shot and few-shot settings, simulating real-world diagnostic constraints where only a handful of reference cases may be available. Our aim is not to replace pathologists but to explore whether this type of AI can be used as a low-barrier, annotation-efficient tool to support lymphoma diagnosis, especially in environments where expert pathology review is limited.

In this study, a total of 150 haematoxylin and eosin (H&E)-stained histopathology images were used, 50 each of CLL, FL and MCL. All images were obtained from the publicly available malignant lymphoma classification dataset on Kaggle.3 Testing for GPT-4o was performed via the OpenAI Python API. Paligemma was implemented using the pretrained checkpoint (google/paligemma-3b-mix-224) from the Hugging Face model hub, configured for image-text inference. CLIP was implemented using the ViT-B/32 backbone (openai/clip-vit-base-patch32). To approximate ALIGN, we used the open-source kakaobrain/align-base model, which follows the original ALIGN architecture. For clarity, we refer to this model as “ALIGN” throughout the study. This implementation has been previously used in similar work by others.4, 5 Models were tested using ICL at 0-, 3-, 5- and 10-shot settings. For each test case, support examples were randomly sampled from the remaining dataset and embedded into a structured prompt containing both image and diagnostic label. Prompts were framed using standardised clinical instructions, and label order was randomised to reduce positional bias. Model performance was evaluated using weighted F1 scores, with 95% confidence intervals (CI) calculated using bootstrap resampling (n = 10,000). Figure 1A shows the schematic workflow of the study.
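The percentile-bootstrap procedure for the weighted-F1 confidence intervals can be sketched as follows. This is a minimal numpy-only illustration of the general technique, not the authors' code; the weighted F1 is implemented by hand to keep the sketch self-contained.

```python
import numpy as np

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    score = 0.0
    for c in np.unique(y_true):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += np.mean(y_true == c) * f1  # weight by class support
    return score

def bootstrap_ci(y_true, y_pred, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the weighted F1: resample cases with
    replacement, rescore each resample, take the alpha/2 quantiles."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    scores = [
        weighted_f1(y_true[idx], y_pred[idx])
        for idx in (rng.integers(0, n, size=n) for _ in range(n_boot))
    ]
    return np.quantile(scores, [alpha / 2, 1 - alpha / 2])
```

With n = 10,000 resamples, as in the study, the quantile estimates are stable; the point estimate is simply `weighted_f1` on the full test set.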

Our experiments indicated that performance improved consistently across all models as the number of few-shot examples increased, as shown in Figure 1B. GPT-4o achieved the highest overall F1 scores at each shot level, increasing from 0.54 (95% CI: 0.49–0.58) in the zero-shot setting to 0.74 (CI: 0.65–0.81) with 10-shot prompting. Paligemma achieved comparable F1 scores, obtaining 0.50 (95% CI: 0.45–0.56) in the zero-shot setting, and showed improved performance with few-shot prompting, reaching an F1 score of 0.71 (CI: 0.64–0.79) at 10-shot. CLIP and ALIGN showed moderate gains but appeared to plateau earlier, with 10-shot F1 scores of 0.67 (CI: 0.61–0.74) and 0.70 (CI: 0.63–0.75), respectively. The largest F1 score improvements for all models occurred between 0-shot and 5-shot, with more modest improvements from 5- to 10-shot, indicating diminishing returns beyond a certain point. As more examples were shown to the VLMs, the performance gap between the models began to narrow, particularly between GPT-4o and Paligemma, implying that exposure to a few prior examples was enough to bring the models to comparable levels of performance.

Comparing model performance by lymphoma subtype, we observed mixed results, consistent with the morphologic differences between the subtypes. CLL, however, was consistently well-classified across all models, as shown in Figure 1C. Even in the absence of support examples (i.e., at zero-shot), models were able to recognise typical CLL features such as small, mature lymphocytes and proliferation centers. At 10-shot, GPT-4o and Paligemma both reached an F1 score of 0.79, whilst ALIGN and CLIP each achieved an F1 score of 0.74 for CLL prediction. FL, on the other hand, was more difficult to predict, especially in the zero-shot setting. The reason could be its variable nodular architecture and the overlap of some of its features with other small B-cell lymphomas. However, performance improved with the addition of support examples. GPT-4o showed the greatest improvement, increasing from an F1 score of 0.48 to 0.72, demonstrating that FL benefited from few-shot prompting. Paligemma attained the second-best result, with an F1 score of 0.69 at 10-shot. Finally, in predicting MCL, the models performed somewhat better than they did with FL, but their results were still not as strong as for CLL. Although zero-shot F1 scores were modest across models, all showed better performance with increasing shot numbers. At 10-shot, GPT-4o led with an F1 score of 0.71, followed closely by ALIGN (F1 = 0.68), Paligemma (F1 = 0.66) and CLIP (F1 = 0.64). The improvements here suggest that the models were able to learn and apply subtle features such as nuclear irregularity and cytologic monotony. Overall, the models performed best when morphologic patterns were distinct, and benefited from even a few well-chosen reference cases when features were more ambiguous.

One notable bottleneck during experimentation was the prompt length constraint, which posed a practical limitation for GPT-4o and Paligemma, as both models operate within fixed input token capacities. We were nonetheless able to include all 10 examples per class without truncation by optimising prompt formatting, reducing redundancy in the prompt phrasing, and keeping image resolution within the models' context limits. CLIP and ALIGN, by contrast, processed each support example independently, so prompt length was not a limiting factor for those models. Notably, without any model retraining, all four evaluated VLMs, namely GPT-4o, Paligemma, CLIP, and ALIGN, showed consistent improvements in performance as the number of few-shot examples increased. GPT-4o achieved the highest overall accuracy and the most stable gains across all settings, particularly in diagnostically challenging subtypes such as FL and MCL. These findings suggest that even with a limited number of reference cases, pretrained VLMs can be guided to perform complex morphologic classification tasks with reasonable F1 scores. Whilst the results are promising, several practical limitations remain, including variability in image quality and the controlled nature of the dataset. Therefore, further work is needed to validate this approach in larger, more diverse cohorts and to assess its reliability across a wider range of morphologic scenarios.
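Because encoder models such as CLIP and ALIGN embed each image independently, one common way to use support examples with them is nearest-centroid classification in embedding space: average the support embeddings per class and assign the query to the most cosine-similar centroid. A minimal sketch of that idea, with small synthetic vectors standing in for real image embeddings (illustrative only, not the authors' pipeline):

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def few_shot_classify(query_emb, support_embs, support_labels):
    """Assign the query image to the class whose mean support embedding
    is most cosine-similar. support_embs has shape (n_support, dim)."""
    support = l2_normalize(support_embs)
    query = l2_normalize(query_emb)
    labels = np.asarray(support_labels)
    classes = sorted(set(support_labels))
    # One centroid per class: re-normalized mean of that class's supports.
    centroids = np.stack([
        l2_normalize(support[labels == c].mean(axis=0)) for c in classes
    ])
    sims = centroids @ query  # cosine similarity to each class centroid
    return classes[int(np.argmax(sims))]
```

Because each support example is embedded once and reused, adding shots grows only the set of stored vectors, which is why prompt length never constrained CLIP and ALIGN in the experiments above.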

Conceptualisation: Mobina Shrestha and Vishal Mandal. Methods: Mobina Shrestha, Bishwas Mandal and Vishal Mandal. Formal Analysis: Mobina Shrestha, Bishwas Mandal and Vishal Mandal. Data Analysis: Mobina Shrestha. Figures and Visualisation: Mobina Shrestha. Original Paper Writing: Mobina Shrestha. Paper Revision and Edits: Bishwas Mandal, Vishal Mandal and Amir Babu Shrestha.

The authors declare no conflicts of interest.

This study was conducted using publicly available, de-identified datasets and did not involve identifiable patient data. As such, institutional review board approval and informed consent were not required.
