Evaluation of a large language model (ChatGPT) versus human researchers in assessing risk-of-bias and community engagement levels: a systematic review use-case analysis.
IF 3.7 · CAS Tier 3 (Medicine) · JCR Q1 · PUBLIC, ENVIRONMENTAL & OCCUPATIONAL HEALTH
Marcello Di Pumpo, Maria Teresa Riccardi, Vittorio De Vita, Gianfranco Damiani
{"title":"Evaluation of a large language model (ChatGPT) versus human researchers in assessing risk-of-bias and community engagement levels: a systematic review use-case analysis.","authors":"Marcello Di Pumpo, Maria Teresa Riccardi, Vittorio De Vita, Gianfranco Damiani","doi":"10.1093/eurpub/ckaf072","DOIUrl":null,"url":null,"abstract":"<p><p>Large language models (LLMs) like OpenAI's ChatGPT (generative pretrained transformers) offer great benefits to systematic review production and quality assessment. A careful assessment and comparison with standard practice is highly needed. Two custom GPTs models were developed to compare a LLM's performance in \"Risk-of-bias (ROB)\" assessment and \"Levels of engagement reached (LOER)\" classification vs human judgments. Inter-rater agreement was calculated. ROB GPT classified a slightly higher \"low risk\" overall judgments (27.8% vs 22.2%) and \"some concern\" (58.3% vs 52.8%) than the research team, for whom \"high risk\" judgments were double (25.0% vs 13.9%). The research team classified slightly higher \"low risk\" total judgments (59.7% vs 55.1%) and almost double \"high risk\" (11.1% vs 5.6%) compared to \"ROB GPT\" (55.1%), which rated higher \"some concerns\" (39.4% vs 29.2%) (P = .366). With regards to LOER analysis, 91.7% vs 25.0% were classified \"Collaborate\" level, 5.6% vs 61.1% as \"Shared leadership\", and 2.8% as \"Involve\" vs 13.9% by researchers, while no studies classified in the first two engagement level vs 8.3% and 13.9%, respectively, by researchers (P = .169). A mixed-effect ordinal logistic regression showed an odds ratio (OR) = 0.97 [95% confidence interval (CI) 0.647-1.446, P = .874] for ROB and an OR = 1.00 (95% CI = 0.397-2.543, P = .992) for LOER compared to researchers. Partial agreement on some judgments was observed. Further evaluation of these promising tools is needed to enable their effective yet reliable introduction in scientific practice.</p>","PeriodicalId":12059,"journal":{"name":"European Journal of Public Health","volume":" ","pages":""},"PeriodicalIF":3.7000,"publicationDate":"2025-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"European Journal of Public Health","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1093/eurpub/ckaf072","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"PUBLIC, ENVIRONMENTAL & OCCUPATIONAL HEALTH","Score":null,"Total":0}
Citations: 0
Abstract
Large language models (LLMs) such as OpenAI's ChatGPT (generative pretrained transformers) promise substantial benefits for systematic review production and quality assessment, but careful assessment against standard practice is needed. Two custom GPT models were developed to compare an LLM's performance against human judgments in "Risk-of-bias (ROB)" assessment and "Levels of engagement reached (LOER)" classification. Inter-rater agreement was calculated. For overall judgments, the ROB GPT assigned slightly more "low risk" (27.8% vs 22.2%) and "some concerns" (58.3% vs 52.8%) ratings than the research team, whose "high risk" judgments were almost double those of the GPT (25.0% vs 13.9%). For total judgments, the research team assigned slightly more "low risk" (59.7% vs 55.1%) and almost twice as many "high risk" (11.1% vs 5.6%) ratings as the ROB GPT, which assigned more "some concerns" (39.4% vs 29.2%) (P = .366). In the LOER analysis, the GPT classified 91.7% of studies at the "Collaborate" level vs 25.0% by researchers, 5.6% vs 61.1% as "Shared leadership", and 2.8% vs 13.9% as "Involve", while it placed no studies in the first two engagement levels vs 8.3% and 13.9%, respectively, by researchers (P = .169). A mixed-effects ordinal logistic regression yielded an odds ratio (OR) of 0.97 (95% confidence interval [CI] 0.647-1.446, P = .874) for ROB and an OR of 1.00 (95% CI 0.397-2.543, P = .992) for LOER, comparing the GPT with the researchers. Partial agreement on some judgments was observed. Further evaluation of these promising tools is needed to enable their effective yet reliable introduction into scientific practice.
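The abstract names two analyses: inter-rater agreement between the GPT's and the researchers' categorical judgments, and an ordinal logistic regression comparing raters. The Python sketch below is a rough illustration of how such a comparison can be computed; it is not the authors' code, the rating data are hypothetical, and a plain ordinal regression stands in for the paper's mixed-effects model (statsmodels has no built-in mixed-effects ordinal model).

import numpy as np
import pandas as pd
from sklearn.metrics import cohen_kappa_score
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Hypothetical per-study ROB judgments, ordered "low" < "some concerns" < "high".
gpt = ["low", "some concerns", "some concerns", "high", "low"]
researchers = ["low", "some concerns", "high", "high", "some concerns"]

# Inter-rater agreement via Cohen's kappa (weighted variants also exist).
kappa = cohen_kappa_score(gpt, researchers)
print(f"Cohen's kappa: {kappa:.2f}")

# Ordinal logistic regression of rating on rater identity. The paper fits a
# mixed-effects version (e.g. a random effect per study); this plain model
# only illustrates how the rater-effect odds ratio is obtained.
levels = pd.CategoricalDtype(["low", "some concerns", "high"], ordered=True)
df = pd.DataFrame({
    "rating": pd.Series(gpt + researchers, dtype=levels),
    "is_gpt": [1] * len(gpt) + [0] * len(researchers),
})
model = OrderedModel(df["rating"], df[["is_gpt"]], distr="logit")
res = model.fit(method="bfgs", disp=False)

# Exponentiating the rater coefficient gives the OR for GPT vs researchers;
# an OR near 1 (as in the abstract) indicates no systematic rating shift.
print("OR (GPT vs researchers):", np.exp(res.params["is_gpt"]))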
Journal Introduction:
The European Journal of Public Health (EJPH) is a multidisciplinary journal aimed at attracting contributions from epidemiology, health services research, health economics, social sciences, management sciences, ethics and law, environmental health sciences, and other disciplines of relevance to public health. The journal provides a forum for discussion and debate of current international public health issues, with a focus on the European Region. Bi-monthly issues contain peer-reviewed original articles, editorials, commentaries, book reviews, news, letters to the editor, announcements of events, and various other features.