Livia Lilli, Stefano Patarnello, Carlotta Masciocchi, Antonio Marchetti, Giovanni Arcuri
Studies in health technology and informatics, vol. 332, pp. 12-16. Published 2025-10-02. DOI: 10.3233/SHTI251486
Benchmarking Large Language Models for Italian Medical Text Classification: Are Generative Models the Best Choice?
The extraction of meaningful information from clinical reports has been an area of growing interest, with a variety of studies leveraging natural language processing (NLP) techniques based on BERT architectures and generative large language models (LLMs). However, identifying the most effective approach remains challenging, especially for text classification, where model architecture, data availability, domain-specific nuances, and language all play a crucial role in performance. In this study, we present a benchmark analysis of generative LLMs and BERT-based models for the classification of metastasis in Italian clinical reports of breast cancer patients. Our methodology compares the performance of generative LLMs implemented within a structured-generation framework against BERT-based models fine-tuned on the metastasis classification task, as well as models applied in a zero-shot learning setting. In our experiments, fine-tuned BERT models achieved the most balanced results (F1 = 0.884, AUC = 0.720). Generative LLMs showed promising performance, with potential for improvement through further adaptation. Finally, our study suggests that both BERT-based models and generative LLMs are viable solutions even in low-resource computational settings, making them accessible for real-world clinical applications, particularly in medical text classification.
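The abstract reports F1 and AUC as the evaluation metrics for the binary metastasis-classification task. As a minimal sketch of how such metrics are typically computed (using scikit-learn; the labels and scores below are illustrative placeholders, not the paper's data):

```python
# Hedged sketch: computing F1 and AUC for a binary classifier,
# as commonly done for metastasis classification. All data here
# is a toy placeholder, not taken from the study.
from sklearn.metrics import f1_score, roc_auc_score


def evaluate(y_true, y_score, threshold=0.5):
    """Return (F1, AUC) given gold binary labels and predicted probabilities.

    F1 needs hard predictions, so scores are thresholded first;
    AUC is threshold-free and uses the raw scores directly.
    """
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    return f1_score(y_true, y_pred), roc_auc_score(y_true, y_score)


# Toy example: 1 = metastasis present, 0 = absent.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3]  # hypothetical model probabilities

f1, auc = evaluate(y_true, y_score)
```

Note that a fine-tuned BERT model and a structured-generation LLM pipeline can both be scored this way, as long as each yields a per-report probability or label; this is what makes the head-to-head comparison in the study possible.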