A quasi-experimental analysis of capabilities and limitations of generative AI in academic content evaluation in social sciences

Yu Zhu, Yongrong Lu, Huan Xie, Jiyuan Ye, Ming Chen

Information Processing & Management, Volume 63, Issue 1, Article 104365. DOI: 10.1016/j.ipm.2025.104365. Published online: 23 August 2025. https://www.sciencedirect.com/science/article/pii/S0306457325003061
The complexity of social sciences research and the limitations of traditional evaluation methods highlight the need to explore the capabilities and application potential of generative AI in academic evaluation. Previous research in biomedical and other natural science fields has demonstrated the potential of generative AI to estimate the quality of research articles. This study adopts a quasi-experimental approach: 100 volunteers produced 600 social sciences academic texts across six types of topics, which were then evaluated by eight mainstream generative AI models. Statistical and sentiment analyses were conducted to compare the evaluation results obtained with zero-shot and few-shot prompting strategies. The results show that AI-generated total scores are unreliable (precision = 66.35%), and the actual total scores differ moderately from the human benchmark (average Cohen's d = 0.425). Few-shot prompting exhibited weaker differentiation across dimensions (average correlation = 5.25), while zero-shot prompting performed better (e.g., correlation between Clarity and Significance = 0.13), particularly in writing quality (average standard deviation = 5.38). Significant score differences were observed across the eight models (all p < 0.001), indicating inconsistency among models. Additionally, AI-generated comments across dimensions were generally positive, with different models exhibiting strengths on different dimensions and tasks. This study provides empirical evidence for scholars, peer reviewers, and research evaluation professionals interested in integrating generative AI into social sciences evaluation workflows. Overall, generative AI shows potential for enhancing evaluation efficiency and reducing favoritism in the peer review of social sciences, especially in large-scale or preliminary evaluations. However, when evaluating novelty and significance, its dependence on domain knowledge and the interpretability of its results still require prudent consideration and refinement.
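As a rough illustration of the kind of comparison the abstract describes (not the authors' actual pipeline), the sketch below computes Cohen's d between AI-generated and human benchmark total scores and the Pearson correlation between two evaluation dimensions. All variable names and the randomly generated data are hypothetical placeholders for the study's scores.

```python
import numpy as np
from scipy import stats


def cohens_d(ai_scores, human_scores):
    """Effect size between AI-generated and human benchmark total scores,
    using the pooled standard deviation of the two groups."""
    ai = np.asarray(ai_scores, dtype=float)
    human = np.asarray(human_scores, dtype=float)
    pooled_sd = np.sqrt((ai.var(ddof=1) + human.var(ddof=1)) / 2)
    return (ai.mean() - human.mean()) / pooled_sd


# Illustrative data: 600 texts, scores on a 0-100 scale (hypothetical values).
rng = np.random.default_rng(0)
ai_total = rng.normal(loc=78, scale=6, size=600)     # AI-generated total scores
human_total = rng.normal(loc=75, scale=6, size=600)  # human benchmark totals

print(f"Cohen's d (AI vs. human totals): {cohens_d(ai_total, human_total):.3f}")

# Correlation between two dimension scores (e.g., Clarity and Significance):
# a low correlation suggests the model differentiates the dimensions.
clarity = rng.normal(loc=7.5, scale=1.0, size=600)
significance = rng.normal(loc=7.0, scale=1.2, size=600)
r, p = stats.pearsonr(clarity, significance)
print(f"Clarity-Significance correlation: r = {r:.2f} (p = {p:.3f})")
```

Under this reading, a Cohen's d around 0.4 would indicate a moderate gap between AI and human totals, and a near-zero inter-dimension correlation would indicate stronger differentiation across dimensions, consistent with the zero-shot results reported above.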
Journal introduction:
Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology, marketing, and social computing.
We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.