Ghayath Janoudi, Mara Uzun, Tim Disher, Mia Jurdana, Ena Fuzul, Josip Ivkovic, Brian Hutton
Value in Health. Published 2025-09-23. DOI: 10.1016/j.jval.2025.09.008 (https://doi.org/10.1016/j.jval.2025.09.008)
Validating Loon Lens 1.0 for Autonomous Abstract Screening and Confidence-Guided Human-in-the-Loop Workflows in Systematic Reviews.
Objectives: Title and Abstract (TiAb) screening is a labour-intensive step in systematic literature reviews (SLRs). We examine the performance of Loon Lens 1.0, an agentic AI platform for autonomous TiAb screening, and test whether its confidence scores can be used to target minimal human oversight.
Methods: Eight SLRs by Canada's Drug Agency were re-screened by dual human reviewers with adjudication (3,796 citations; 287 includes, 7.6%) and, separately, by Loon Lens, based on predefined eligibility criteria. Accuracy, sensitivity, precision, and specificity were measured and bootstrapped to generate 95% confidence intervals. Logistic regression models using (i) confidence alone and (ii) confidence plus the Include/Exclude decision were used to predict errors and to inform simulated human-in-the-loop (HITL) strategies.
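As an illustration of the metrics and bootstrap step described in the Methods, the following is a minimal sketch and not the authors' analysis code. The confusion-matrix counts (284 true positives, 167 false positives, 3 false negatives, 3,342 true negatives) are an assumption: they are back-calculated from the reported totals and rates, are not stated in the abstract, and reproduce the reported point estimates only to within about 0.1 percentage point. The function names and the choice of a simple percentile bootstrap are likewise hypothetical.

```python
import random
from collections import Counter


def screening_metrics(tp, fp, fn, tn):
    """Return accuracy, sensitivity, specificity and precision as fractions."""
    total = tp + fp + fn + tn
    nan = float("nan")
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn) if tp + fn else nan,
        "specificity": tn / (tn + fp) if tn + fp else nan,
        "precision": tp / (tp + fp) if tp + fp else nan,
    }


def bootstrap_ci(outcomes, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for one metric.

    `outcomes` is a list with one label per citation: "TP", "FP", "FN" or "TN".
    Citations are resampled with replacement and the metric is recomputed.
    """
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        counts = Counter(rng.choices(outcomes, k=len(outcomes)))
        m = screening_metrics(counts["TP"], counts["FP"], counts["FN"], counts["TN"])
        stats.append(m[metric])
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


if __name__ == "__main__":
    # ASSUMPTION: counts back-calculated from the abstract, not reported in it.
    tp, fp, fn, tn = 284, 167, 3, 3342
    point = screening_metrics(tp, fp, fn, tn)
    outcomes = ["TP"] * tp + ["FP"] * fp + ["FN"] * fn + ["TN"] * tn
    for name, value in point.items():
        lower, upper = bootstrap_ci(outcomes, name)
        print(f"{name}: {value:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```

With only about 7.6% of citations being true includes, precision is computed over far fewer records than specificity, which is one reason its interval is the widest of the four.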
Results: Loon Lens achieved 95.5% accuracy (95% CI 94.8-96.1), 98.9% sensitivity (97.6-100), 95.2% specificity (94.5-95.9) and 63.0% precision (58.4-67.3). Errors clustered in Low- and Medium-confidence Includes. The extended logistic regression model (confidence + decision; C-index 0.98) estimated a 75% error probability for Low-confidence Includes versus <0.1% for Very-High-confidence Excludes. Simulated HITL review of only the Low- and Medium-confidence Includes (145 citations, 3.8%) lifted precision to 81.4% and overall accuracy to 98.2% while preserving sensitivity (99.0%). Adding High-confidence Includes (221 citations, 5.8%) pushed precision to 89.9% and accuracy to 99.0%.
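The confidence-guided HITL strategy in the Results can be sketched as follows. This is a hedged illustration on a ten-record toy dataset, not the study data or the authors' simulation. It assumes that a human reviewer who re-checks a flagged citation always reproduces the adjudicated reference decision. The `Record` class, the `apply_hitl` and `summarize` functions, and the toy records are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Record:
    ai_decision: str  # "Include" or "Exclude", as returned by the screener
    confidence: str   # "Low", "Medium", "High" or "Very High"
    reference: str    # adjudicated human decision used as ground truth


def apply_hitl(records, review_levels):
    """Send AI Includes at the given confidence levels to human review.

    ASSUMPTION: the reviewer always reproduces the reference decision.
    Returns the final decisions and the number of citations reviewed.
    """
    final, reviewed = [], 0
    for r in records:
        if r.ai_decision == "Include" and r.confidence in review_levels:
            final.append(r.reference)
            reviewed += 1
        else:
            final.append(r.ai_decision)
    return final, reviewed


def summarize(records, final):
    """Precision, sensitivity and accuracy of final decisions vs. reference."""
    pairs = [(f, r.reference) for f, r in zip(final, records)]
    tp = sum(f == "Include" and ref == "Include" for f, ref in pairs)
    fp = sum(f == "Include" and ref == "Exclude" for f, ref in pairs)
    fn = sum(f == "Exclude" and ref == "Include" for f, ref in pairs)
    tn = sum(f == "Exclude" and ref == "Exclude" for f, ref in pairs)
    nan = float("nan")
    return {
        "precision": tp / (tp + fp) if tp + fp else nan,
        "sensitivity": tp / (tp + fn) if tp + fn else nan,
        "accuracy": (tp + tn) / len(records),
    }


# Hypothetical toy data: false positives sit mostly in lower-confidence Includes.
TOY = [
    Record("Include", "Low", "Exclude"),
    Record("Include", "Medium", "Exclude"),
    Record("Include", "Medium", "Include"),
    Record("Include", "High", "Exclude"),
    Record("Include", "Very High", "Include"),
    Record("Include", "Very High", "Include"),
    Record("Exclude", "Very High", "Exclude"),
    Record("Exclude", "Very High", "Exclude"),
    Record("Exclude", "High", "Include"),
    Record("Exclude", "Very High", "Exclude"),
]

if __name__ == "__main__":
    for levels in (set(), {"Low", "Medium"}, {"Low", "Medium", "High"}):
        final, reviewed = apply_hitl(TOY, levels)
        print(sorted(levels), reviewed, summarize(TOY, final))
```

Because only Include calls are routed to review, this strategy can remove false positives but cannot recover a missed include. That matches the pattern in the abstract, where precision and accuracy rise while sensitivity stays essentially unchanged.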
Conclusions: Across eight SLRs (3,796 citations), Loon Lens 1.0 reproduced adjudicated human screening with 98.9% sensitivity and 95.2% specificity. In simulation, restricting human-in-the-loop review to ≤5.8% of citations, by prioritising low- and medium-confidence Include calls, reduced false positives and increased precision to 89.9% while maintaining sensitivity and raising overall accuracy to 99.0%. These findings indicate that confidence-guided oversight can concentrate reviewer effort on a small subset of records.
About the journal:
Value in Health contains original research articles in pharmacoeconomics, health economics, and outcomes research (clinical, economic, and patient-reported outcomes/preference-based research), as well as conceptual and health policy articles that provide valuable information for health care decision-makers and the research community. As the official journal of ISPOR, Value in Health provides a forum for researchers and health care decision-makers to translate outcomes research into health care decisions.