TRACE: applying AI language models to extract ancestry information from curated biomedical literature
Alison M Veintimilla, Chintan K Acharya, Connie J Mulligan, Ruogu Fang, Erika Moore
Frontiers in Digital Health, volume 7, article 1608370. Published 2025-09-19.
DOI: https://doi.org/10.3389/fdgth.2025.1608370
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12491185/pdf/
Abstract
Introduction: Ancestry reporting is essential for transparency and proper representation in biomedical studies. However, manually extracting this information from study texts is time-consuming and inefficient. In this paper, we present TRACE (Tool for Researching Ancestry and Cell Extraction), powered by GPT-4 and web crawling, which automates ancestry identification by detecting cell lines or cultures in texts and tracing their ancestry.
Methods: TRACE extracts cell lines and primary cultures from research articles and follows web sources to determine their ancestry. We compared TRACE's outputs to a manually generated database to confirm its performance in identifying and verifying ancestry information.
Results: The results reveal an overrepresentation of European/White samples and significant underreporting of ancestry. TRACE enables large-scale, systematic ancestry analysis, providing a valuable resource for researchers and agencies assessing biases in sample selection.
Conclusions: As an open-source tool, TRACE can be used broadly to evaluate and improve ancestry representation in biomedical research.
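The comparison described under Methods, tool output checked against a manually generated database, can be sketched roughly as follows. This is a minimal illustration, not TRACE's actual evaluation code: the abstract does not say which metrics were used, so precision, recall, and an ancestry agreement rate are assumed here. The record layout, the `normalize` helper, and the `score` function are hypothetical names, and all sample values are made-up placeholders rather than results from the paper.

```python
from typing import Dict, Tuple

# Key: (article_id, cell_line_name); value: ancestry label.
# This layout is an assumption made for illustration only.
Record = Tuple[str, str]


def normalize(name: str) -> str:
    """Lowercase and drop non-alphanumerics so 'Line-A' and 'line a' match."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def score(predicted: Dict[Record, str], reference: Dict[Record, str]) -> Dict[str, float]:
    """Compare extracted records with a manually curated reference set."""
    pred = {(a, normalize(c)): anc.lower() for (a, c), anc in predicted.items()}
    ref = {(a, normalize(c)): anc.lower() for (a, c), anc in reference.items()}

    # Cell lines found by the tool that also appear in the manual database.
    found = set(pred) & set(ref)
    precision = len(found) / len(pred) if pred else 0.0
    recall = len(found) / len(ref) if ref else 0.0

    # Among matched cell lines, how often the ancestry label agrees.
    agree = sum(1 for key in found if pred[key] == ref[key])
    agreement = agree / len(found) if found else 0.0

    return {"precision": precision, "recall": recall, "ancestry_agreement": agreement}


if __name__ == "__main__":
    # Placeholder data: not real cell lines, ancestries, or study results.
    tool_output = {
        ("article-1", "Line-A"): "ancestry_x",
        ("article-1", "Line B"): "ancestry_y",
        ("article-2", "Line-C"): "ancestry_z",
    }
    manual_db = {
        ("article-1", "line a"): "ancestry_x",
        ("article-1", "LineB"): "ancestry_w",
        ("article-2", "Line-D"): "ancestry_z",
        ("article-3", "Line-E"): "ancestry_v",
    }
    print(score(tool_output, manual_db))
```

Separating "was the cell line found" from "does the ancestry label agree" keeps extraction errors and tracing errors apart, which matters when the two steps rely on different sources (article text versus web pages).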