{"title":"From chaos to clarity: schema-constrained AI for auditable biomedical evidence extraction from full-text PDFs.","authors":"Pouria Mortezaagha, Joseph Shaw, Bowen Sun, Arya Rahgozar","doi":"10.1186/s12874-026-02847-8","DOIUrl":"https://doi.org/10.1186/s12874-026-02847-8","url":null,"abstract":"<p><strong>Background: </strong>Biomedical evidence synthesis depends on accurate extraction of methodological, laboratory, and outcome variables from full-text research articles. These variables are predominantly embedded in complex scientific PDFs that interleave multi-column text, tables, figures, and captions, making manual abstraction time-intensive, error-prone, and increasingly impractical at the scale of contemporary systematic reviews. Despite advances in layout-aware and multimodal document models, end-to-end extraction systems suitable for evidence synthesis remain constrained by limited throughput, OCR error propagation, and insufficient auditability.</p><p><strong>Methods: </strong>We propose a schema-constrained AI extraction system that transforms full-text biomedical PDFs into structured, analysis-ready records by explicitly restricting model inference through typed schemas, controlled vocabularies, and evidence-gated decisions. Documents are ingested using resume-aware hashing, partitioned into page-level and caption-aware chunks, and processed asynchronously under explicit concurrency and rate-limiting controls. A high-accuracy OCR model is guided by multiple domain-specific schemas covering bibliographic metadata, study design, populations, laboratory assays, timing and thresholds, clinical outcomes, and diagnostic performance. Chunk-level outputs are deterministically merged into study-level records using controlled vocabularies, conflict-aware handling of scalar fields, set-based aggregation of list-valued fields, and sentence-level evidence capture to enable traceability and post-hoc audit.</p><p><strong>Results: </strong>Applied to a corpus of 734 biomedical articles on direct oral anticoagulant (DOAC) level measurement, the pipeline processed all documents without manual intervention while maintaining stable throughput. Schema-constrained extraction exhibited strong internal consistency, with sentence-level provenance populated for nearly all supported decisions. Iterative schema and prompt refinement yielded substantial improvements in extraction fidelity, particularly for outcome definitions, assay classification, and global coagulation testing. Outputs included reproducible CSV/Parquet datasets and caption-aware multimodal markdown reconstructions supporting efficient expert review.</p><p><strong>Conclusions: </strong>Schema-constrained AI extraction enables scalable and auditable extraction of structured evidence from heterogeneous scientific PDFs. By combining deterministic chunking, asynchronous orchestration, controlled vocabularies, sentence-level provenance, and aggregated analytical outputs, the proposed pipeline aligns modern document understanding capabilities with the transparency, reproducibility, and reliability demands of biomedical evidence synthesis.</p>","PeriodicalId":9114,"journal":{"name":"BMC Medical Research Methodology","volume":" ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2026-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147670457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Characteristics of studies falsely excluded during single-reviewer abstract screening: a meta-epidemiological analysis.","authors":"Lisa Affengruber, Robert Emprechtinger, Emma Persad, Jos Kleijnen, Gerald Gartlehner","doi":"10.1186/s12874-026-02838-9","DOIUrl":"https://doi.org/10.1186/s12874-026-02838-9","url":null,"abstract":"","PeriodicalId":9114,"journal":{"name":"BMC Medical Research Methodology","volume":" ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2026-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147626662","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient median estimation for stratified multi-population data: health services, medical workforce, and medical education.","authors":"Umer Daraz, Hassan M Aljohani, Huda M Alshanbari","doi":"10.1186/s12874-026-02829-w","DOIUrl":"https://doi.org/10.1186/s12874-026-02829-w","url":null,"abstract":"","PeriodicalId":9114,"journal":{"name":"BMC Medical Research Methodology","volume":" ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2026-04-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147626685","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Correction: VAS value set still has its own application value compare with TTO value set of EQ-5D-3L in Chinese population.","authors":"Lin Zhuo, Siyuan Gao, Rui Jin, Lang Zhuo, Xiuying Wang","doi":"10.1186/s12874-026-02832-1","DOIUrl":"10.1186/s12874-026-02832-1","url":null,"abstract":"","PeriodicalId":9114,"journal":{"name":"BMC Medical Research Methodology","volume":"26 1","pages":""},"PeriodicalIF":3.4,"publicationDate":"2026-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13049807/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147615656","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DSPONVNet: a multimodal deep learning model integrating intraoperative monitoring and clinical features for predicting postoperative nausea and vomiting risk.","authors":"Lixin Liu, Haifeng Wang, Yi Wei, Di Kong, Zhaoping Xue, Ying Liu","doi":"10.1186/s12874-026-02845-w","DOIUrl":"https://doi.org/10.1186/s12874-026-02845-w","url":null,"abstract":"","PeriodicalId":9114,"journal":{"name":"BMC Medical Research Methodology","volume":" ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2026-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147608020","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The performance of different propensity score methods for estimating the effects of multiple treatments or exposures: a neutral comparison study.","authors":"Peter C Austin, David E Austin","doi":"10.1186/s12874-026-02831-2","DOIUrl":"https://doi.org/10.1186/s12874-026-02831-2","url":null,"abstract":"","PeriodicalId":9114,"journal":{"name":"BMC Medical Research Methodology","volume":" ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147590091","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A computational framework and software tool for generating complex survival data with predefined censoring rates in simulation studies.","authors":"Haichang Chen, Qiang Yao, Yidie Lin, Meijing Hu, Jiao Pei, Cairong Zhu","doi":"10.1186/s12874-026-02843-y","DOIUrl":"https://doi.org/10.1186/s12874-026-02843-y","url":null,"abstract":"","PeriodicalId":9114,"journal":{"name":"BMC Medical Research Methodology","volume":" ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147590006","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Model-based algorithms to ascertain smoking in administrative health data: a registry-based validation study.","authors":"Md Ashiqul Haque, Nathan C Nickel, Maxime Turgeon, Lisa M Lix","doi":"10.1186/s12874-026-02839-8","DOIUrl":"10.1186/s12874-026-02839-8","url":null,"abstract":"","PeriodicalId":9114,"journal":{"name":"BMC Medical Research Methodology","volume":" ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2026-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13154672/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147572045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Rationale for incorporating short-term endpoints in interim futility analysis of phase 3 oncology trials: a simulation study.","authors":"Shuhan Liu, Zhihao Cheng, Ben Zhang, Fangrong Yan, Bosheng Li, Tao Zhang","doi":"10.1186/s12874-026-02815-2","DOIUrl":"https://doi.org/10.1186/s12874-026-02815-2","url":null,"abstract":"","PeriodicalId":9114,"journal":{"name":"BMC Medical Research Methodology","volume":" ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2026-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147572048","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An alternative method to validate surrogate endpoints in oncology.","authors":"Xingyue Zhu, Ting Yu, Die Xiao","doi":"10.1186/s12874-026-02846-9","DOIUrl":"10.1186/s12874-026-02846-9","url":null,"abstract":"","PeriodicalId":9114,"journal":{"name":"BMC Medical Research Methodology","volume":" ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2026-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13154564/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147572023","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}