Explosion of formulaic research articles, including inappropriate study designs and false discoveries, based on the NHANES US national health database.
IF 9.8 · Zone 1, Biology · Q1 Agricultural and Biological Sciences
Tulsi Suchak, Anietie E Aliu, Charlie Harrison, Reyer Zwiggelaar, Nophar Geifman, Matt Spick
PLoS Biology 23(5): e3003152. Published 2025-05-08 (eCollection). DOI: https://doi.org/10.1371/journal.pbio.3003152. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12061153/pdf/
Citations: 0
Abstract
The growth of artificial intelligence (AI)-ready datasets such as the National Health and Nutrition Examination Survey (NHANES) is creating new opportunities for data-driven research, but it is also generating risks of data exploitation by paper mills. In this work, we focus on two areas of potential concern for AI-supported research efforts. First, we describe the production of large numbers of formulaic single-factor analyses, relating single predictors to specific health conditions, where multifactorial approaches would be more appropriate. Employing AI-supported single-factor approaches removes context from research, fails to capture interactions, avoids false discovery correction, and is an approach that can easily be adopted by paper mills. Second, we identify risks of selective data usage, such as analyzing limited date ranges or cohort subsets without clear justification, suggestive of data dredging and post-hoc hypothesis formation. Using a systematic literature search for single-factor analyses, we identified 341 NHANES-derived research papers published over the past decade, each proposing an association between a predictor and a health condition from the wide range contained within NHANES. We found evidence that research failed to take account of multifactorial relationships, that manuscripts did not account for the risks of false discoveries, and that researchers selectively extracted data from NHANES rather than utilizing the full range of data available. Given the explosion of AI-assisted productivity in published manuscripts (the systematic search strategy used here identified an average of 4 papers per annum from 2014 to 2021, but 190 in 2024 up to 9 October alone), we highlight a set of best practices to address these concerns, aimed at researchers, data controllers, publishers, and peer reviewers, to encourage improved statistical practices and mitigate the risks of paper mills using AI-assisted workflows to introduce low-quality manuscripts to the scientific literature.
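To illustrate why the abstract stresses false discovery correction, the sketch below applies the Benjamini-Hochberg procedure to a set of p-values from many single-factor tests. The abstract does not say which correction method the authors have in mind, so the choice of Benjamini-Hochberg is an assumption, and the p-values and the function name `benjamini_hochberg` are hypothetical, made up purely for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where a test survives
    Benjamini-Hochberg correction at false discovery rate `alpha`."""
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank

    # Reject every hypothesis ranked at or below max_k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected


# Hypothetical p-values, one per single-factor association test
# (illustrative numbers only, not taken from the paper or from NHANES).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042,
            0.060, 0.074, 0.205, 0.212, 0.216]

naive = [p < 0.05 for p in p_values]       # uncorrected threshold
corrected = benjamini_hochberg(p_values)   # FDR-controlled

print("significant without correction:", sum(naive))      # 5
print("significant after BH correction:", sum(corrected)) # 2
```

With these made-up inputs, five associations pass an uncorrected 0.05 threshold but only two survive correction, which is the kind of gap that can arise when many predictors are each tested in isolation.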
About the journal:
PLOS Biology is the flagship journal of the Public Library of Science (PLOS) and focuses on publishing groundbreaking and relevant research in all areas of biological science. The journal features works at various scales, ranging from molecules to ecosystems, and also encourages interdisciplinary studies. PLOS Biology publishes articles that demonstrate exceptional significance, originality, and relevance, with a high standard of scientific rigor in methodology, reporting, and conclusions.
The journal aims to advance science and serve the research community by transforming research communication to align with the research process. It offers evolving article types and policies that empower authors to share the complete story behind their scientific findings with a diverse global audience of researchers, educators, policymakers, patient advocacy groups, and the general public.
PLOS Biology, along with other PLOS journals, is widely indexed by major services such as Crossref, Dimensions, DOAJ, Google Scholar, PubMed, PubMed Central, Scopus, and Web of Science. PLOS Biology is also indexed by other services including AGRICOLA, Biological Abstracts, BIOSIS Previews, CABI CAB Abstracts, CABI Global Health, CAPES, CAS, CNKI, Embase, Journal Guide, MEDLINE, and Zoological Record, making its content accessible and discoverable to a wide range of audiences.