{"title":"评估大型语言模型,为研究选择统计测试:试点研究","authors":"Himel Mondal, Shaikat Mondal, Prabhat Mittal","doi":"10.4103/picr.picr_275_23","DOIUrl":null,"url":null,"abstract":"\n \n \n In contemporary research, selecting the appropriate statistical test is a critical and often challenging step. The emergence of large language models (LLMs) has offered a promising avenue for automating this process, potentially enhancing the efficiency and accuracy of statistical test selection.\n \n \n \n This study aimed to assess the capability of freely available LLMs – OpenAI’s ChatGPT3.5, Google Bard, Microsoft Bing Chat, and Perplexity in recommending suitable statistical tests for research, comparing their recommendations with those made by human experts.\n \n \n \n A total of 27 case vignettes were prepared for common research models with a question asking suitable statistical tests. The cases were formulated from previously published literature and reviewed by a human expert for their accuracy of information. The LLMs were asked the question with the case vignettes and the process was repeated with paraphrased cases. The concordance (if exactly matching the answer key) and acceptance (when not exactly matching with answer key, but can be considered suitable) were evaluated between LLM’s recommendations and those of human experts.\n \n \n \n Among the 27 case vignettes, ChatGPT3.5-suggested statistical test had 85.19% concordance and 100% acceptance; Bard experiment had 77.78% concordance and 96.3% acceptance; Microsoft Bing Chat had 96.3% concordance and 100% acceptance; and Perplexity had 85.19% concordance and 100% acceptance. The intra-class correction coefficient of average measure among the responses of LLMs was 0.728 (95% confidence interval [CI]: 0.51–0.86), P < 0.0001. The test–retest reliability of ChatGPT was r = 0.71 (95% CI: 0.44–0.86), P < 0.0001, Bard was r = −0.22 (95% CI: −0.56–0.18), P = 0.26, Bing was r = −0.06 (95% CI: −0.44–0.33), P = 0.73, and Perplexity was r = 0.52 (95% CI: 0.16–0.75), P = 0.0059.\n \n \n \n The LLMs, namely, ChatGPT, Google Bard, Microsoft Bing, and Perplexity all showed >75% concordance in suggesting statistical tests for research case vignettes with all having acceptance of >95%. The LLMs had a moderate level of agreement among them. While not a complete replacement for human expertise, these models can serve as effective decision support systems, especially in scenarios where rapid test selection is essential.\n","PeriodicalId":20015,"journal":{"name":"Perspectives in Clinical Research","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Evaluating large language models for selection of statistical test for research: A pilot study\",\"authors\":\"Himel Mondal, Shaikat Mondal, Prabhat Mittal\",\"doi\":\"10.4103/picr.picr_275_23\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"\\n \\n \\n In contemporary research, selecting the appropriate statistical test is a critical and often challenging step. 
The emergence of large language models (LLMs) has offered a promising avenue for automating this process, potentially enhancing the efficiency and accuracy of statistical test selection.\\n \\n \\n \\n This study aimed to assess the capability of freely available LLMs – OpenAI’s ChatGPT3.5, Google Bard, Microsoft Bing Chat, and Perplexity in recommending suitable statistical tests for research, comparing their recommendations with those made by human experts.\\n \\n \\n \\n A total of 27 case vignettes were prepared for common research models with a question asking suitable statistical tests. The cases were formulated from previously published literature and reviewed by a human expert for their accuracy of information. The LLMs were asked the question with the case vignettes and the process was repeated with paraphrased cases. The concordance (if exactly matching the answer key) and acceptance (when not exactly matching with answer key, but can be considered suitable) were evaluated between LLM’s recommendations and those of human experts.\\n \\n \\n \\n Among the 27 case vignettes, ChatGPT3.5-suggested statistical test had 85.19% concordance and 100% acceptance; Bard experiment had 77.78% concordance and 96.3% acceptance; Microsoft Bing Chat had 96.3% concordance and 100% acceptance; and Perplexity had 85.19% concordance and 100% acceptance. The intra-class correction coefficient of average measure among the responses of LLMs was 0.728 (95% confidence interval [CI]: 0.51–0.86), P < 0.0001. The test–retest reliability of ChatGPT was r = 0.71 (95% CI: 0.44–0.86), P < 0.0001, Bard was r = −0.22 (95% CI: −0.56–0.18), P = 0.26, Bing was r = −0.06 (95% CI: −0.44–0.33), P = 0.73, and Perplexity was r = 0.52 (95% CI: 0.16–0.75), P = 0.0059.\\n \\n \\n \\n The LLMs, namely, ChatGPT, Google Bard, Microsoft Bing, and Perplexity all showed >75% concordance in suggesting statistical tests for research case vignettes with all having acceptance of >95%. The LLMs had a moderate level of agreement among them. While not a complete replacement for human expertise, these models can serve as effective decision support systems, especially in scenarios where rapid test selection is essential.\\n\",\"PeriodicalId\":20015,\"journal\":{\"name\":\"Perspectives in Clinical Research\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-04-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Perspectives in Clinical Research\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.4103/picr.picr_275_23\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"Medicine\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Perspectives in Clinical Research","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.4103/picr.picr_275_23","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"Medicine","Score":null,"Total":0}
Evaluating large language models for selection of statistical test for research: A pilot study
In contemporary research, selecting the appropriate statistical test is a critical and often challenging step. The emergence of large language models (LLMs) has offered a promising avenue for automating this process, potentially enhancing the efficiency and accuracy of statistical test selection.
This study aimed to assess the capability of freely available LLMs – OpenAI’s ChatGPT 3.5, Google Bard, Microsoft Bing Chat, and Perplexity – in recommending suitable statistical tests for research, comparing their recommendations with those made by human experts.
A total of 27 case vignettes representing common research designs were prepared, each ending with a question asking for the suitable statistical test. The cases were formulated from previously published literature and reviewed by a human expert for accuracy of information. Each LLM was presented with the case vignettes and the question, and the process was repeated with paraphrased versions of the cases. Concordance (an exact match with the answer key) and acceptance (not an exact match, but still considered suitable) were evaluated between each LLM’s recommendations and those of human experts.
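For illustration, a minimal sketch of how concordance and acceptance rates could be computed from rated responses is shown below; the scoring scheme and example data are hypothetical and not taken from the study.

```python
# Minimal sketch (hypothetical data): 2 = exact match with the answer key
# (concordant), 1 = not an exact match but judged acceptable, 0 = unsuitable.
ratings = {
    "ChatGPT 3.5": [2, 2, 1, 2, 2],
    "Bard":        [2, 1, 2, 0, 2],
}

def concordance_and_acceptance(scores):
    n = len(scores)
    concordance = 100 * sum(s == 2 for s in scores) / n   # exact matches only
    acceptance = 100 * sum(s >= 1 for s in scores) / n    # exact or acceptable
    return concordance, acceptance

for model, scores in ratings.items():
    c, a = concordance_and_acceptance(scores)
    print(f"{model}: concordance {c:.2f}%, acceptance {a:.2f}%")
```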
Among the 27 case vignettes, the statistical tests suggested by ChatGPT 3.5 had 85.19% concordance and 100% acceptance; Google Bard had 77.78% concordance and 96.3% acceptance; Microsoft Bing Chat had 96.3% concordance and 100% acceptance; and Perplexity had 85.19% concordance and 100% acceptance. The intraclass correlation coefficient (average measures) among the responses of the LLMs was 0.728 (95% confidence interval [CI]: 0.51–0.86), P < 0.0001. The test–retest reliability was r = 0.71 (95% CI: 0.44–0.86), P < 0.0001 for ChatGPT; r = −0.22 (95% CI: −0.56–0.18), P = 0.26 for Bard; r = −0.06 (95% CI: −0.44–0.33), P = 0.73 for Bing; and r = 0.52 (95% CI: 0.16–0.75), P = 0.0059 for Perplexity.
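The abstract does not specify the software used; as a hedged sketch under that assumption, the average-measures intraclass correlation across the four LLMs and a test–retest Pearson correlation could be computed as follows (column names, scoring, and data are hypothetical).

```python
# Hedged sketch with hypothetical data: responses coded numerically per
# vignette ("target") and per LLM ("rater"), then an average-measures ICC
# and a test-retest Pearson correlation are computed.
import pandas as pd
import pingouin as pg
from scipy.stats import pearsonr

long_df = pd.DataFrame({
    "vignette": sorted([1, 2, 3, 4, 5] * 4),
    "llm": ["ChatGPT", "Bard", "Bing", "Perplexity"] * 5,
    "score": [2, 2, 2, 2,
              1, 2, 2, 1,
              2, 0, 2, 2,
              2, 2, 1, 2,
              1, 1, 2, 2],
})

icc = pg.intraclass_corr(data=long_df, targets="vignette",
                         raters="llm", ratings="score")
print(icc[icc["Type"] == "ICC2k"])  # average-measures, two-way random effects

# Test-retest reliability for one model: original vs. paraphrased vignettes.
run1 = [2, 1, 2, 2, 0, 2, 1, 2]
run2 = [2, 2, 2, 1, 0, 2, 1, 2]
r, p = pearsonr(run1, run2)
print(f"test-retest r = {r:.2f}, P = {p:.4f}")
```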
The LLMs – ChatGPT, Google Bard, Microsoft Bing Chat, and Perplexity – all showed >75% concordance in suggesting statistical tests for the research case vignettes, and all had acceptance of >95%. The LLMs showed a moderate level of agreement among themselves. While not a complete replacement for human expertise, these models can serve as effective decision support systems, especially in scenarios where rapid test selection is essential.
About the journal:
This peer-reviewed quarterly journal is positioned to build a learning clinical research community in India. The journal will offer broad coverage of topics across clinical research disciplines, including clinical research methodology, research ethics, clinical data management, training, biostatistics, and regulatory affairs, and will include original articles, reviews, news and views, perspectives, and other sections of interest. PICR will offer all clinical research stakeholders in India – academicians, ethics committees, regulators, and industry professionals – a forum for the exchange of ideas, information, and opinions.