Impact of Detailed Versus Generic Instructions on Fine-Tuned Language Models for Patient Discharge Instructions Generation: Comparative Statistical Analysis

Muneerah Alqahtani, Abdullah Al-Barakati, Fahd Alotaibi, Mohammed Al Shibli, Saad Almousa

JMIR Formative Research. Published September 30, 2025. doi: 10.2196/80917
Background: Discharge instructions are essential to patients' post-hospital care but are time-consuming to write. With the rise of large language models (LLMs), there is strong potential to automate this process. This study explores the use of open-source LLMs to generate discharge instructions.
Objective: We investigated whether a Mistral model can reliably generate patient-oriented discharge instructions. Two distinct instruction-tuning paradigms were compared, each using a different mechanism for embedding guidance during fine-tuning.
Methods: In our experiment, we fine-tuned Mistral-NeMo-Instruct, a large language model, under two distinct instruction strategies. The first was a detailed instruction tailored to the task of discharge instruction generation; the second was a generic instruction with minimal guidance and no task-specific detail. The independent variable in this study is the instruction strategy (detailed vs generic), and the dependent variables are the evaluation scores of the generated discharge instructions. The generated outputs were evaluated against 3,621 ground-truth references. We used BLEU-1 to BLEU-4, ROUGE (ROUGE-1, ROUGE-2, ROUGE-L), METEOR, SentenceTransformer similarity, and BERTScore to assess the quality of the generated outputs in comparison to the corresponding ground-truth instructions for the same discharge summaries.
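To make the setup concrete, the following Python sketch illustrates one plausible way to encode the two instruction variants as fine-tuning examples and to score a generated output against its reference. The prompt wording, field names, and helper functions are hypothetical, not the authors' exact templates; the metric calls use the widely available nltk, rouge-score, and bert-score packages.

```python
# Hypothetical sketch: the prompt texts and example layout are illustrative,
# not the templates used in the study.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from bert_score import score as bert_score

DETAILED_INSTRUCTION = (
    "You are a clinician writing patient discharge instructions. Given the "
    "discharge summary below, write clear, patient-friendly instructions "
    "covering medications, activity, diet, warning signs, and follow-up care."
)
GENERIC_INSTRUCTION = "Write discharge instructions for the following discharge summary."

def build_example(summary: str, reference: str, detailed: bool) -> dict:
    """Format one fine-tuning example in a simple instruction/input/output layout."""
    return {
        "instruction": DETAILED_INSTRUCTION if detailed else GENERIC_INSTRUCTION,
        "input": summary,
        "output": reference,
    }

def evaluate(generated: str, reference: str) -> dict:
    """Score one generated instruction set against its ground-truth reference."""
    smooth = SmoothingFunction().method1
    ref_tokens, gen_tokens = reference.split(), generated.split()
    # BLEU-4: geometric mean of 1- to 4-gram precisions, with smoothing.
    bleu4 = sentence_bleu([ref_tokens], gen_tokens,
                          weights=(0.25, 0.25, 0.25, 0.25),
                          smoothing_function=smooth)
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    rouge = scorer.score(reference, generated)
    # BERTScore F1 from contextual embeddings of candidate and reference.
    _, _, f1 = bert_score([generated], [reference], lang="en")
    return {
        "bleu4": bleu4,
        "rouge1": rouge["rouge1"].fmeasure,
        "rouge2": rouge["rouge2"].fmeasure,
        "rougeL": rouge["rougeL"].fmeasure,
        "bertscore_f1": f1.item(),
    }
```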
Results: The detailed-instruction model outperformed the generic-instruction model on all automated evaluation metrics. BERTScore increased from 78.92% to 87.05%, while structural alignment measured by ROUGE-L improved from 8.59% to 26.52%. N-gram precision (BLEU-4) increased from 0.81% to 21.24%, and METEOR scores rose from 15.33% to 18.47%. Additional metrics showed consistent gains: ROUGE-1 improved from 16.59% to 42.72%, and ROUGE-2 increased from 1.97% to 45.84%. All improvements were statistically significant (P < .001), indicating that detailed, task-specific instruction design substantially enhances model performance.
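The abstract reports P < .001 without naming the test, but because both models are scored on the same 3,621 references, the comparison is naturally a paired one. A minimal sketch, assuming per-document metric scores are available as paired samples (the variable names and the choice of tests are assumptions, not the paper's stated procedure):

```python
# Hypothetical sketch: paired significance tests over per-document metric scores.
# `detailed_scores` and `generic_scores` are assumed to be equal-length sequences
# of, e.g., ROUGE-L F1 values for the same discharge summaries under each model.
from scipy import stats

def compare_models(detailed_scores, generic_scores, alpha=0.001):
    """Test whether the detailed model's paired per-document scores differ
    significantly from the generic model's."""
    # Paired t-test on the score differences (assumes roughly normal differences).
    t_stat, t_p = stats.ttest_rel(detailed_scores, generic_scores)
    # Wilcoxon signed-rank test as a nonparametric alternative.
    w_stat, w_p = stats.wilcoxon(detailed_scores, generic_scores)
    return {
        "t_statistic": t_stat, "t_p_value": t_p,
        "wilcoxon_statistic": w_stat, "wilcoxon_p_value": w_p,
        "significant_at_alpha": t_p < alpha and w_p < alpha,
    }
```

Running both tests is a common robustness check: the t-test gains power when score differences are approximately normal, while the Wilcoxon test guards against skewed metric distributions.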
Conclusions: The use of detailed, task-specific instruction strategies significantly enhances the effectiveness of open-source large language models in generating discharge instructions. These findings indicate that carefully designed instructions during fine-tuning substantially improve model performance.