Mihai Pop, Teresa K Attwood, Judith A Blake, Philip E Bourne, Ana Conesa, Terry Gaasterland, Lawrence Hunter, Carl Kingsford, Oliver Kohlbacher, Thomas Lengauer, Scott Markel, Yves Moreau, William S Noble, Christine Orengo, B F Francis Ouellette, Laxmi Parida, Natasa Przulj, Teresa M Przytycka, Shoba Ranganathan, Russell Schwartz, Alfonso Valencia, Tandy Warnow
Bioinformatics Advances, volume 5, issue 1, article vbaf044. Published 2025-03-20. DOI: 10.1093/bioadv/vbaf044 (https://doi.org/10.1093/bioadv/vbaf044). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11964588/pdf/
Biological databases in the age of generative artificial intelligence.
Summary: Modern biological research critically depends on public databases. The introduction and propagation of errors within and across databases can lead to wasted resources as scientists are led astray by bad data or have to conduct expensive validation experiments. The emergence of generative artificial intelligence systems threatens to compound this problem owing to the ease with which massive volumes of synthetic data can be generated. We provide an overview of several key issues that occur within the biological data ecosystem and make several recommendations aimed at reducing data errors and their propagation. We specifically highlight the critical importance of improved educational programs aimed at biologists and life scientists that emphasize best practices in data engineering. We also argue for increased theoretical and empirical research on data provenance, error propagation, and on understanding the impact of errors on analytic pipelines. Furthermore, we recommend enhanced funding for the stewardship and maintenance of public biological databases.