The Australian Reference Genome Atlas (ARGA): Finding, sharing and reusing Australian genomics data in an occurrence-driven context

Biodiversity Information Science and Standards. Pub Date: 2023-09-06. DOI: 10.3897/biss.7.112129

Kathryn Hall, Matt Andrews, Keeva Connolly, Yasima Kankanamge, Christopher Mangion, Winnie Mok, Lars Nauheimer, Goran Sterjov, Nigel Ward, Peter Brenton



Abstract

Access to trusted genomics (and genetics) datasets is fundamental to the capacity of Australia’s 15,000 biosciences researchers to answer questions in taxonomy, phylogeny, evolution, conservation, and applied fields like crop improvement and biosecurity. Historically, researchers turned to single points of origin, like GenBank (part of the United States' National Center for Biotechnology Information), to find the reference or comparative data they needed. However, the rapidity of data generation using next-generation sequencing methods, and the enormous size and diversity of the resulting datasets, mean that no single database now contains all data of a specific class attributable to an individual taxon, nor the full breadth of data types relevant to that taxon. Comprehensively searching for taxonomically relevant data, and indeed for data of types germane to the research question, is a significant challenge for researchers. Data are openly available online, but they may be stored under synonyms or indexed via unconventional taxonomies. Data repositories are largely disconnected, so researchers must visit multiple sites to be confident that their searches have been exhaustive. Databases may focus on single data types and not store or reference other data assets, even when those assets are relevant to the taxon of interest. Additionally, our survey of the genomics community indicated that researchers are less likely to trust data with inadequately evidenced provenance metadata. As a result, genomics data are hard to find and are often untrusted. Moreover, even once found, the data are in formats that do not interoperate with occurrence and ecological datasets, such as those housed in the Atlas of Living Australia.
We built the Australian Reference Genome Atlas (ARGA) to overcome the barriers researchers face in finding and collating genomics data for Australia’s species. We built it so that researchers can search for data within taxonomically accepted contexts, and within defined intersections and conjunctions with verified and expert ecological datasets. Using a series of ingestion scripts, the ARGA data team has implemented new and customised data mappings that integrate genomics data, ecological traits, and occurrence data within an extended Darwin Core Event framework (GBIF 2018). Here, we demonstrate how the architecture we derived for the ARGA application works, and how it can be extended as new data sources emerge. We then demonstrate how our flexible model can be used to: locate genomics data for taxa of interest; explore data within an ecological context; and calculate metrics for data availability for provincial bioregions.
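The abstract does not describe the ingestion scripts or the mappings themselves, so the following is only a minimal Python sketch of the general idea, under stated assumptions. The source record layout, the synonym table, the species names, the accession, and the region names are all hypothetical. The target keys (`scientificName`, `associatedSequences`, `decimalLatitude`, `decimalLongitude`, `eventDate`) are standard Darwin Core terms, but which terms ARGA actually uses is not stated in the abstract. The sketch resolves a synonym to an accepted name, maps a repository record onto Darwin Core terms, and computes a simple data-availability metric per bioregion.

```python
# Hypothetical synonym table: name as stored in a repository -> accepted name.
SYNONYMS = {"Exampla vetus": "Exampla nova"}


def accepted_name(name):
    """Return the accepted name for a synonym, or the name unchanged."""
    return SYNONYMS.get(name, name)


def to_darwin_core(record):
    """Map a hypothetical repository record onto Darwin Core terms."""
    return {
        "scientificName": accepted_name(record["organism"]),
        "associatedSequences": record["accession"],
        "decimalLatitude": record["lat"],
        "decimalLongitude": record["lon"],
        "eventDate": record["collected"],
    }


def coverage_by_region(species_by_region, dwc_records):
    """Fraction of each region's species with at least one genomics record."""
    with_data = {r["scientificName"] for r in dwc_records}
    return {
        region: (len(species & with_data) / len(species) if species else 0.0)
        for region, species in species_by_region.items()
    }


# Illustrative inputs only; none of these values come from ARGA.
source_records = [
    {
        "organism": "Exampla vetus",  # stored under a synonym
        "accession": "ACC0001",
        "lat": -27.5,
        "lon": 153.0,
        "collected": "2020-01-15",
    }
]
species_by_region = {
    "Region A": {"Exampla nova", "Exampla altera"},
    "Region B": {"Exampla tertia"},
}

dwc_records = [to_darwin_core(r) for r in source_records]
coverage = coverage_by_region(species_by_region, dwc_records)
print(coverage)
```

Resolving names before computing the metric matters in this sketch: without the synonym step, the record filed under the older name would not count towards the accepted species in its region.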


