Explosion of formulaic research articles, including inappropriate study designs and false discoveries, based on the NHANES US national health database.

IF 9.8; Zone 1 (Biology); Q1 (Agricultural and Biological Sciences)
PLoS Biology Pub Date : 2025-05-08 eCollection Date: 2025-05-01 DOI:10.1371/journal.pbio.3003152
Tulsi Suchak, Anietie E Aliu, Charlie Harrison, Reyer Zwiggelaar, Nophar Geifman, Matt Spick
PLoS Biology 23(5): e3003152. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12061153/pdf/
Citation count: 0

Abstract

With the growth of artificial intelligence (AI)-ready datasets such as the National Health and Nutrition Examination Survey (NHANES), new opportunities for data-driven research are being created, but so are risks of data exploitation by paper mills. In this work, we focus on two areas of potential concern for AI-supported research efforts. First, we describe the production of large numbers of formulaic single-factor analyses, relating single predictors to specific health conditions, where multifactorial approaches would be more appropriate. Employing AI-supported single-factor approaches removes context from research, fails to capture interactions, avoids false discovery correction, and is an approach that can easily be adopted by paper mills. Second, we identify risks of selective data usage, such as analyzing limited date ranges or cohort subsets without clear justification, suggestive of data dredging and post-hoc hypothesis formation. Using a systematic literature search for single-factor analyses, we identified 341 NHANES-derived research papers published over the past decade, each proposing an association between a predictor and a health condition from the wide range contained within NHANES. We found evidence that research failed to take account of multifactorial relationships, that manuscripts did not account for the risks of false discoveries, and that researchers selectively extracted data from NHANES rather than utilizing the full range of data available.
Given the explosion of AI-assisted productivity in published manuscripts (the systematic search strategy used here identified an average of 4 papers per annum from 2014 to 2021, but 190 in 2024 up to 9 October alone), we highlight a set of best practices to address these concerns, aimed at researchers, data controllers, publishers, and peer reviewers, to encourage improved statistical practices and mitigate the risks of paper mills using AI-assisted workflows to introduce low-quality manuscripts to the scientific literature.
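
The abstract's point about false discovery correction can be illustrated with a small sketch. The abstract does not name a specific correction procedure; the Benjamini-Hochberg step-up procedure is used here as one common choice, and the p-values are hypothetical numbers invented for illustration, not results from the paper or from NHANES.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return booleans marking which hypotheses are rejected at FDR level alpha.

    Benjamini-Hochberg is an assumed choice of correction; the abstract
    does not say which procedure the authors recommend.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * alpha.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            largest_k = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for idx in order[:largest_k]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values from ten separate single-predictor tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
naive = [p < 0.05 for p in p_values]
corrected = benjamini_hochberg(p_values, alpha=0.05)
print(sum(naive), "nominally significant;", sum(corrected), "after BH correction")
```

In this made-up example, five of the ten tests pass an uncorrected 0.05 threshold, but only two survive the correction, which is the kind of gap that arises when many single-factor associations are tested against the same dataset and each is reported in isolation.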

Source journal: PLoS Biology
Categories: BIOCHEMISTRY & MOLECULAR BIOLOGY-BIOLOGY
CiteScore: 15.40
Self-citation rate: 2.00%
Articles published: 359
Review time: 3-8 weeks
About the journal: PLOS Biology is the flagship journal of the Public Library of Science (PLOS) and focuses on publishing groundbreaking and relevant research in all areas of biological science. The journal features works at various scales, ranging from molecules to ecosystems, and also encourages interdisciplinary studies. PLOS Biology publishes articles that demonstrate exceptional significance, originality, and relevance, with a high standard of scientific rigor in methodology, reporting, and conclusions. The journal aims to advance science and serve the research community by transforming research communication to align with the research process. It offers evolving article types and policies that empower authors to share the complete story behind their scientific findings with a diverse global audience of researchers, educators, policymakers, patient advocacy groups, and the general public. PLOS Biology, along with other PLOS journals, is widely indexed by major services such as Crossref, Dimensions, DOAJ, Google Scholar, PubMed, PubMed Central, Scopus, and Web of Science. Additionally, PLOS Biology is indexed by various other services including AGRICOLA, Biological Abstracts, BIOSIS Previews, CABI CAB Abstracts, CABI Global Health, CAPES, CAS, CNKI, Embase, Journal Guide, MEDLINE, and Zoological Record, ensuring that the research content is easily accessible and discoverable by a wide range of audiences.