{"title":"评估健康信息的质量:人类和人工智能的比较。","authors":"Dhruva Arcot, Neha Pondicherry, Subhankar Chakraborty","doi":"10.1111/nmo.70164","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Over half of all Americans seek health-related information online, yet the quality of this digital content remains largely unregulated and variable. The DISCERN score, a validated 15-item instrument, offers a structured method to assess the reliability of written health information. While expert-assigned DISCERN scores have been widely applied across various disease states, whether artificial intelligence (AI) can automate this evaluation remains unknown. Specifically, it is unclear whether AI-generated DISCERN scores align with those assigned by human experts. Our study seeks to investigate this gap in knowledge by examining the correlation between AI-generated and human-assigned DISCERN scores for TikTok videos on Irritable Bowel Syndrome (IBS).</p><p><strong>Methods: </strong>A set of 100 TikTok videos on IBS previously scored using DISCERN by two physicians was chosen. Sixty-nine videos contained transcribable spoken audio, which was processed using a free online transcription tool. The remaining videos either featured songs or music that were not suitable for transcription or were deleted or were not publicly available. The audio transcripts were prefixed with an identical prompt and submitted to two common AI models-ChatGPT 4.0 and Microsoft Copilot for-DISCERN score evaluation. The average DISCERN score for each transcript was compared between the AI models and with the mean of the DISCERN score given by the human reviewers using Pearson correlation (r) and Kruskal Wallis test.</p><p><strong>Results: </strong>There was a significant correlation between human and AI-generated DISCERN scores (r = 0.60-0.65). 
When categorized by the background of the content creators-medical (N = 26) versus non-medical (N = 43), the correlation was significant only for content made by non-medical content creators (r = 0.69-0.75, p < 0.001). Correlation between ChatGPT and Copilot DISCERN scores was stronger for videos by non-medical content creators (r = 0.66) than those by medical content creators (r = 0.43). On linear regression, ChatGPT's DISCERN scores explained 55.6% of the variation in human DISCERN scores for videos by non-medical creators, compared to 8.9% for videos by medical creators. For Copilot, the corresponding values were 47.2% and 9.3%.</p><p><strong>Conclusion: </strong>AI models demonstrated moderate alignment with human-assigned DISCERN scores for IBS-related TikTok videos, but only when content was produced by non-medical creators. The weaker correlation for content produced by those with a medical background suggests limitations in current AI models' ability to interpret nuanced or technical health information. These findings highlight the need for further validation across broader topics, languages, platforms, and reviewer pools. 
If refined, AI-generated DISCERN scoring could serve as a scalable tool to help users assess the reliability of health information on social media and curb misinformation.</p>","PeriodicalId":19123,"journal":{"name":"Neurogastroenterology and Motility","volume":" ","pages":"e70164"},"PeriodicalIF":2.9000,"publicationDate":"2025-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Evaluating the Quality of Health Information: Comparison of Human and Artificial Intelligence.\",\"authors\":\"Dhruva Arcot, Neha Pondicherry, Subhankar Chakraborty\",\"doi\":\"10.1111/nmo.70164\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>Over half of all Americans seek health-related information online, yet the quality of this digital content remains largely unregulated and variable. The DISCERN score, a validated 15-item instrument, offers a structured method to assess the reliability of written health information. While expert-assigned DISCERN scores have been widely applied across various disease states, whether artificial intelligence (AI) can automate this evaluation remains unknown. Specifically, it is unclear whether AI-generated DISCERN scores align with those assigned by human experts. Our study seeks to investigate this gap in knowledge by examining the correlation between AI-generated and human-assigned DISCERN scores for TikTok videos on Irritable Bowel Syndrome (IBS).</p><p><strong>Methods: </strong>A set of 100 TikTok videos on IBS previously scored using DISCERN by two physicians was chosen. Sixty-nine videos contained transcribable spoken audio, which was processed using a free online transcription tool. The remaining videos either featured songs or music that were not suitable for transcription or were deleted or were not publicly available. 
The audio transcripts were prefixed with an identical prompt and submitted to two common AI models-ChatGPT 4.0 and Microsoft Copilot for-DISCERN score evaluation. The average DISCERN score for each transcript was compared between the AI models and with the mean of the DISCERN score given by the human reviewers using Pearson correlation (r) and Kruskal Wallis test.</p><p><strong>Results: </strong>There was a significant correlation between human and AI-generated DISCERN scores (r = 0.60-0.65). When categorized by the background of the content creators-medical (N = 26) versus non-medical (N = 43), the correlation was significant only for content made by non-medical content creators (r = 0.69-0.75, p < 0.001). Correlation between ChatGPT and Copilot DISCERN scores was stronger for videos by non-medical content creators (r = 0.66) than those by medical content creators (r = 0.43). On linear regression, ChatGPT's DISCERN scores explained 55.6% of the variation in human DISCERN scores for videos by non-medical creators, compared to 8.9% for videos by medical creators. For Copilot, the corresponding values were 47.2% and 9.3%.</p><p><strong>Conclusion: </strong>AI models demonstrated moderate alignment with human-assigned DISCERN scores for IBS-related TikTok videos, but only when content was produced by non-medical creators. The weaker correlation for content produced by those with a medical background suggests limitations in current AI models' ability to interpret nuanced or technical health information. These findings highlight the need for further validation across broader topics, languages, platforms, and reviewer pools. 
If refined, AI-generated DISCERN scoring could serve as a scalable tool to help users assess the reliability of health information on social media and curb misinformation.</p>\",\"PeriodicalId\":19123,\"journal\":{\"name\":\"Neurogastroenterology and Motility\",\"volume\":\" \",\"pages\":\"e70164\"},\"PeriodicalIF\":2.9000,\"publicationDate\":\"2025-09-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Neurogastroenterology and Motility\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1111/nmo.70164\",\"RegionNum\":3,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"CLINICAL NEUROLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neurogastroenterology and Motility","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1111/nmo.70164","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"CLINICAL NEUROLOGY","Score":null,"Total":0}
Evaluating the Quality of Health Information: Comparison of Human and Artificial Intelligence.
Background: Over half of all Americans seek health-related information online, yet the quality of this digital content remains largely unregulated and variable. The DISCERN score, a validated 15-item instrument, offers a structured method to assess the reliability of written health information. While expert-assigned DISCERN scores have been widely applied across various disease states, whether artificial intelligence (AI) can automate this evaluation remains unknown. Specifically, it is unclear whether AI-generated DISCERN scores align with those assigned by human experts. Our study seeks to investigate this gap in knowledge by examining the correlation between AI-generated and human-assigned DISCERN scores for TikTok videos on Irritable Bowel Syndrome (IBS).
Methods: A set of 100 TikTok videos on IBS, previously scored with DISCERN by two physicians, was chosen. Sixty-nine videos contained transcribable spoken audio, which was processed using a free online transcription tool. The remaining videos featured songs or music unsuitable for transcription, had been deleted, or were no longer publicly available. Each audio transcript was prefixed with an identical prompt and submitted to two common AI models, ChatGPT 4.0 and Microsoft Copilot, for DISCERN score evaluation. The average DISCERN score for each transcript was compared between the AI models, and with the mean of the two human reviewers' scores, using the Pearson correlation (r) and the Kruskal-Wallis test.
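The core of the comparison above is a Pearson correlation between human- and AI-assigned DISCERN scores. A minimal sketch, using hypothetical per-transcript scores (the study's raw data are not public) and computing r from its definition so the snippet stays dependency-free:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical mean DISCERN scores per transcript (1-5 scale); illustrative only.
human   = [3.0, 2.5, 4.0, 1.5, 3.5, 2.0, 4.5, 3.0]
chatgpt = [2.8, 2.9, 3.8, 1.8, 3.2, 2.4, 4.1, 3.3]

r = pearson_r(human, chatgpt)
print(f"Pearson r = {r:.2f}")
```

In practice a stats library (e.g., `scipy.stats.pearsonr`, which also returns a p-value, and `scipy.stats.kruskal` for the Kruskal-Wallis test) would replace the hand-rolled function; the explicit form is shown only to make the computation concrete.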
Results: There was a significant correlation between human and AI-generated DISCERN scores (r = 0.60-0.65). When categorized by the background of the content creators (medical, N = 26, versus non-medical, N = 43), the correlation was significant only for content made by non-medical creators (r = 0.69-0.75, p < 0.001). The correlation between ChatGPT and Copilot DISCERN scores was stronger for videos by non-medical content creators (r = 0.66) than for those by medical content creators (r = 0.43). On linear regression, ChatGPT's DISCERN scores explained 55.6% of the variation in human DISCERN scores for videos by non-medical creators, compared with 8.9% for videos by medical creators. For Copilot, the corresponding values were 47.2% and 9.3%.
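The variance-explained figures are consistent with simple (one-predictor) linear regression, where R-squared is just the square of Pearson's r. Pairing the upper end of the reported r range with ChatGPT and the lower end with Copilot is an inference from the numbers, not stated in the abstract; the small gaps reflect r being reported to two decimals:

```python
# (r, reported R^2) for the non-medical-creator videos, taken from the abstract.
# Pairing 0.75 with ChatGPT and 0.69 with Copilot is inferred, not stated.
pairs = [(0.75, 0.556), (0.69, 0.472)]

for r, reported_r2 in pairs:
    # In simple linear regression, R^2 = r^2.
    print(f"r = {r}: r^2 = {r**2:.3f}, reported R^2 = {reported_r2}")
```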
Conclusion: AI models demonstrated moderate alignment with human-assigned DISCERN scores for IBS-related TikTok videos, but only when content was produced by non-medical creators. The weaker correlation for content produced by those with a medical background suggests limitations in current AI models' ability to interpret nuanced or technical health information. These findings highlight the need for further validation across broader topics, languages, platforms, and reviewer pools. If refined, AI-generated DISCERN scoring could serve as a scalable tool to help users assess the reliability of health information on social media and curb misinformation.
Journal introduction:
Neurogastroenterology & Motility (NMO) is the official journal of the European Society of Neurogastroenterology & Motility (ESNM) and the American Neurogastroenterology and Motility Society (ANMS). It is edited by James Galligan, Albert Bredenoord, and Stephen Vanner. The editorial and peer-review process is independent of the societies affiliated with the journal and of the publisher: neither the ANMS, the ESNM, nor the publisher has editorial decision-making power. Whenever these are relevant to the content being considered or published, the editors, journal management committee, and editorial board declare their interests and affiliations.