Scientific Data, Pub Date: 2025-04-05, DOI: 10.1038/s41597-025-04753-y
Qiwei Lin, Derek Ouyang, Cameron Guage, Isabel O Gallegos, Jacob Goldin, Daniel E Ho
{"title":"Enabling disaggregation of Asian American subgroups: a dataset of Wikidata names for disparity estimation.","authors":"Qiwei Lin, Derek Ouyang, Cameron Guage, Isabel O Gallegos, Jacob Goldin, Daniel E Ho","doi":"10.1038/s41597-025-04753-y","DOIUrl":"https://doi.org/10.1038/s41597-025-04753-y","url":null,"abstract":"<p><p>Decades of research and advocacy have underscored the imperative of surfacing - as the first step towards mitigating - racial disparities, including among subgroups historically bundled into aggregated categories. Recent U.S. federal regulations have required increasingly disaggregated race reporting, but major implementation barriers mean that, in practice, reported race data continues to remain inadequate. While imputation methods have enabled disparity assessments in many research and policy settings lacking reported race, the leading name algorithms cannot recover disaggregated categories, given the same lack of disaggregated data from administrative sources to inform algorithm design. Leveraging a Wikidata sample of over 300,000 individuals from six Asian countries, we extract frequencies of 25,876 first names and 18,703 surnames which can be used as proxies for U.S. name-race distributions among six major Asian subgroups: Asian Indian, Chinese, Filipino, Japanese, Korean, and Vietnamese. 
We show that these data, when combined with public geography-race distributions to predict subgroup membership, outperform existing deterministic name lists in key prediction settings, and enable critical Asian disparity assessments.</p>","PeriodicalId":21597,"journal":{"name":"Scientific Data","volume":"12 1","pages":"580"},"PeriodicalIF":5.8,"publicationDate":"2025-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143788684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"综合性期刊","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scientific Data, Pub Date: 2025-04-05, DOI: 10.1038/s41597-025-04914-z
Prasanna Kanti Ghoshal, A P Joshi, Kunal Chakraborty
{"title":"An improved long-term high-resolution surface pCO<sub>2</sub> data product for the Indian Ocean using machine learning.","authors":"Prasanna Kanti Ghoshal, A P Joshi, Kunal Chakraborty","doi":"10.1038/s41597-025-04914-z","DOIUrl":"https://doi.org/10.1038/s41597-025-04914-z","url":null,"abstract":"<p><p>Accurate estimation of surface ocean pCO<sub>2</sub> is crucial for understanding the ocean's role in the global carbon cycle and its response to climate change. In this study, we employ a machine learning algorithm to correct the deviations in high-resolution (1/12°) model simulations of surface pCO<sub>2</sub> from the INCOIS-BIO-ROMS model (pCO<sub>2</sub><sup>model</sup>) for the period 1980-2019, using available observations (pCO<sub>2</sub><sup>obs</sup>). We train the XGBoost model to generate spatio-temporal deviations (pCO<sub>2</sub><sup>obs</sup> - pCO<sub>2</sub><sup>model</sup>) of pCO<sub>2</sub><sup>model</sup>. The interannually and climatologically varying deviations are then added back to the original model separately, which results in an improved surface pCO<sub>2</sub> data product. A comparison of our surface pCO<sub>2</sub> data product with moored observations, gridded SOCAT, CMEMS-LSCE-FFNN, and OceanSODA demonstrates an improvement by approximately 40% ± 3.31% in RMSE. Further analysis reveals that adding climatological deviations to pCO<sub>2</sub><sup>model</sup> results in greater improvements than adding interannual deviations. 
This analysis underscores the ability of machine learning algorithms to enhance the accuracy of model-simulated surface pCO<sub>2</sub> outputs.</p>","PeriodicalId":21597,"journal":{"name":"Scientific Data","volume":"12 1","pages":"577"},"PeriodicalIF":5.8,"publicationDate":"2025-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143788491","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"综合性期刊","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scientific Data, Pub Date: 2025-04-05, DOI: 10.1038/s41597-025-04854-8
Anthony A Snead, Fang Meng, Nicolas Largotta, Kristin M Winchell, Brenna A Levine
{"title":"Diploid chromosome-level genome assembly and annotation for Lycorma delicatula.","authors":"Anthony A Snead, Fang Meng, Nicolas Largotta, Kristin M Winchell, Brenna A Levine","doi":"10.1038/s41597-025-04854-8","DOIUrl":"https://doi.org/10.1038/s41597-025-04854-8","url":null,"abstract":"<p><p>The spotted lanternfly (Lycorma delicatula) is a planthopper species (Hemiptera: Fulgoridae) native to China but invasive in South Korea, Japan, and the United States where it is a significant threat to agriculture. Genomic resources are critical to both management of this species and understanding the genomic characteristics of successful invaders. We report an annotated, haplotype-phased, chromosome-level genome assembly for the spotted lanternfly using PacBio long-read sequencing, Hi-C technology, and RNA-seq. The 2.2 Gbp genome comprises 13 chromosomes, and whole genome resequencing of eighty-two adults indicated chromosome four as the sex chromosome and a corresponding XO sex-determination system. We identified over 12,000 protein-coding genes and performed functional annotation, facilitating the identification of candidate genes that may hold importance for spotted lanternfly control. The assemblies and annotations were highly complete with over 96% of BUSCO genes complete regardless of the database (i.e., Eukaryota, Arthropoda, Insecta). 
This reference-quality genome will serve as an important resource for development and optimization of management practices for the spotted lanternfly and invasive species genomics as a whole.</p>","PeriodicalId":21597,"journal":{"name":"Scientific Data","volume":"12 1","pages":"579"},"PeriodicalIF":5.8,"publicationDate":"2025-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143788677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"综合性期刊","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scientific Data, Pub Date: 2025-04-05, DOI: 10.1038/s41597-025-04903-2
Jacob Levy Abitbol, Louis Arod
{"title":"Seven years of time-tracking data capturing collaboration and failure dynamics: the Gryzzly dataset.","authors":"Jacob Levy Abitbol, Louis Arod","doi":"10.1038/s41597-025-04903-2","DOIUrl":"https://doi.org/10.1038/s41597-025-04903-2","url":null,"abstract":"<p><p>We introduce the Gryzzly time-tracking dataset: a longitudinal, high-resolution collection of 4.4 million interactions recorded between 12,447 users and 173,323 tasks across 50,759 projects, spanning from 2017 to 2024. Compiled from real-world usage data of the Gryzzly software, the dataset encompasses projects from diverse industries such as marketing, finance, and banking. It provides a detailed view of daily activities contributing to project completion, including information about the users involved, the tasks they worked on, and the planned versus actual costs of each project. To validate the published data, we analyzed the underlying temporal collaboration network, revealing expected patterns such as circadian user activity, power-law characteristics in degree distributions, and heterogeneously distributed inter-declaration times. Additionally, we observed well-documented failure dynamics, including a heavy-tailed distribution of failure streak lengths and diverging performance improvement trends between successful and failed projects. 
These features make the Gryzzly dataset a key resource for studying productivity, team dynamics, and project failure.</p>","PeriodicalId":21597,"journal":{"name":"Scientific Data","volume":"12 1","pages":"578"},"PeriodicalIF":5.8,"publicationDate":"2025-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143788780","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"综合性期刊","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scientific Data, Pub Date: 2025-04-05, DOI: 10.1038/s41597-025-04868-2
Akshata Karnik, Jack B Kilbride, Tristan R H Goodbody, Rachael Ross, Elias Ayrey
{"title":"An open-access database of nature-based carbon offset project boundaries.","authors":"Akshata Karnik, Jack B Kilbride, Tristan R H Goodbody, Rachael Ross, Elias Ayrey","doi":"10.1038/s41597-025-04868-2","DOIUrl":"https://doi.org/10.1038/s41597-025-04868-2","url":null,"abstract":"<p><p>Nature-based climate solutions (NBS) have become an important component of strategies aiming to reduce atmospheric CO<sub>2</sub> and mitigate climate change impacts. Carbon offsets have emerged as one of the most widely implemented NBS strategies, however, these projects have also been criticized for exaggerating offsets. Verifying the efficacy of NBS-derived carbon offset is complicated by a lack of readily available geospatial boundary data. Herein, we detail methods and present a database of nature-based offset project boundaries. This database provides the locations of 575 NBS projects distributed across 55 countries. Geospatial boundaries were aggregated using a combination of scraping data from carbon project registries (n = 433, 75.3%) as well as manual georeferencing and digitization (n = 127, 22.1%). Database entries include three varieties of carbon projects: avoided deforestation, afforestation, reforestation and re-vegetation, and improved forest management. 
An accuracy assessment of the georeferencing and digitizing process indicated a high degree of accuracy (intersection over union score of 0.98 ± 0.015).</p>","PeriodicalId":21597,"journal":{"name":"Scientific Data","volume":"12 1","pages":"581"},"PeriodicalIF":5.8,"publicationDate":"2025-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143788652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"综合性期刊","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scientific Data, Pub Date: 2025-04-04, DOI: 10.1038/s41597-025-04904-1
Min Jiang, Chaolei Zheng, Li Jia, Jiu Chen
{"title":"A 20-year dataset (2001-2020) of global cropland water-use efficiency at 1-km grid resolution.","authors":"Min Jiang, Chaolei Zheng, Li Jia, Jiu Chen","doi":"10.1038/s41597-025-04904-1","DOIUrl":"https://doi.org/10.1038/s41597-025-04904-1","url":null,"abstract":"<p><p>Cropland water-use efficiency (WUE) is an essential indicator for the sustainable utilization of agricultural water resources. The lack of long-term global cropland WUE datasets with high spatial resolution limits our understanding of global and regional patterns of cropland WUE. This study developed a long-term global cropland WUE dataset at 1-km spatial resolution from 2001 to 2020. The cropland WUE was obtained as the ratio between net primary productivity (NPP) and evapotranspiration that was retrieved from ETMonitor global evapotranspiration datasets. The global cropland NPP was estimated by subtracting plant respiration from gross primary production (GPP), which was estimated using an improved light-use efficiency model after being optimized for different global climate zones using flux-tower observation data. The generated WUE product showed good accuracy with high correlation efficiency (0.76) and low root mean square error (0.5 g C/kg H<sub>2</sub>O/yr) compared with the ground measurements at flux towers. 
This dataset can be used as fundamental data to advance the efficient utilization of water use for sustainable development.</p>","PeriodicalId":21597,"journal":{"name":"Scientific Data","volume":"12 1","pages":"574"},"PeriodicalIF":5.8,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143788487","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"综合性期刊","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scientific Data, Pub Date: 2025-04-04, DOI: 10.1038/s41597-025-04431-z
Ignacio Cazcarro, Arkaitz Usubiaga-Liaño, María Victoria Román, Pablo Piñero, Erik Dietzenbacher, José Manuel Rueda-Cantuche, Iñaki Arto
{"title":"FIGARO-E3: a high-resolution extended multi-regional input-output database consistent with official statistics.","authors":"Ignacio Cazcarro, Arkaitz Usubiaga-Liaño, María Victoria Román, Pablo Piñero, Erik Dietzenbacher, José Manuel Rueda-Cantuche, Iñaki Arto","doi":"10.1038/s41597-025-04431-z","DOIUrl":"https://doi.org/10.1038/s41597-025-04431-z","url":null,"abstract":"<p><p>Existing 'official' multi-regional input-output (MRIO) databases lack sufficient sectoral detail and extensions for calculating accurate environmental and social footprints. FIGARO-E3 is a highly disaggregated MRIO database for 2015 with labour and environmental extensions largely consistent with official statistics. The database has been created by disaggregating the official FIGARO database (46 countries, 64 industries/products) to achieve a resolution of 175 industries and 213 products based on the monetary structures of EXIOBASE. Labour accounts (including total employment and employment by gender and by skill) are based on OECD data and EXIOBASE structures. Energy accounts (primary energy supply, net, final and non-energy uses, energy industry own use and energy losses) are based on the IEA's extended energy balances and FIGARO-E3 MRIO tables. GHG emission accounts (covering four types of GHGs: CO2, CH4, N2O and fluorinated gases, both for combustion and non-combustion processes) are based on IEA and EDGAR data. GHG emission accounts for European countries have been reconciled with data from Eurostat. 
The FIGARO-E3 database is largely consistent with official statistics.</p>","PeriodicalId":21597,"journal":{"name":"Scientific Data","volume":"12 1","pages":"575"},"PeriodicalIF":5.8,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143788687","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"综合性期刊","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scientific Data, Pub Date: 2025-04-04, DOI: 10.1038/s41597-025-04902-3
Cinta Pegueroles, Carles Galià-Camps, Marta Pascual, Marta Bassitta, Didac González, Carola Greve, Enrique Macpherson, Núria Raventós, Tilman Schell, Héctor Torrado, Carlos Carreras
{"title":"Chromosome-level genome assembly and annotation of the sharpsnout seabream (Diplodus puntazzo).","authors":"Cinta Pegueroles, Carles Galià-Camps, Marta Pascual, Marta Bassitta, Didac González, Carola Greve, Enrique Macpherson, Núria Raventós, Tilman Schell, Héctor Torrado, Carlos Carreras","doi":"10.1038/s41597-025-04902-3","DOIUrl":"https://doi.org/10.1038/s41597-025-04902-3","url":null,"abstract":"<p><p>Diplodus puntazzo is a demersal fish inhabiting the Mediterranean Sea and the eastern Atlantic and plays an important ecological role in coastal areas. Here, we present the first nuclear genome assembly and annotation of this species and genus. We used a combination of PacBio CLR long reads, Illumina short reads and chromatin capture reads (Omni-C) to generate a chromosome-level assembly. The nuclear genome assembly has a total span of 788 Mb, containing 24 chromosome-scale scaffolds (98.76% of the total length), coinciding with its known karyotype. By using RNA-Seq data from D. puntazzo and gene models from closely related species, we also generated a high-quality nuclear annotation. We predicted a total of 87,572 transcripts from the nuclear genome, 26,838 coding, and 60,734 non-coding that included lncRNA, snoRNA, and tRNAs. We also assembled and annotated the mitochondrial genome, circularized in 16,642 bp comprising 13 protein-coding genes, 2 rRNA, and 22 tRNA. 
This high-quality reference genome will enrich the current genomic resources available to the large fish scientific community.</p>","PeriodicalId":21597,"journal":{"name":"Scientific Data","volume":"12 1","pages":"576"},"PeriodicalIF":5.8,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143788661","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"综合性期刊","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scientific Data, Pub Date: 2025-04-04, DOI: 10.1038/s41597-025-04915-y
Shilong Luan, Huixiao Pan, Ruoque Shen, Xiaosheng Xia, Hongtao Duan, Wenping Yuan, Jing Wei
{"title":"High Resolution Water Quality Dataset of Chinese Lakes and Reservoirs from 2000 to 2023.","authors":"Shilong Luan, Huixiao Pan, Ruoque Shen, Xiaosheng Xia, Hongtao Duan, Wenping Yuan, Jing Wei","doi":"10.1038/s41597-025-04915-y","DOIUrl":"https://doi.org/10.1038/s41597-025-04915-y","url":null,"abstract":"<p><p>Water quality parameters (pH, dissolved oxygen (DO), total nitrogen (TN, includes both organic nitrogen and inorganic nitrogen), total phosphorus (TP), permanganate index (COD<sub>Mn</sub>), turbidity (Tur), electrical conductivity (EC), and dissolved organic carbon (DOC)) are important to evaluate the ecological health of lakes and reservoirs. In this research, we developed a monthly dataset of these key water quality parameters from 2000 to 2023 for nearly 180,000 lakes and reservoirs across China, using the random forest (RF) models. These RF models took into account the impacts of climate, soil properties, and anthropogenic activities within basins of studied lakes and reservoirs, and effectively captured the spatial and temporal variations of their water quality parameters with correlation coefficients (R<sup>2</sup>) ranging from 0.65 to 0.76. Interestingly, an increase in Tur and EC was observed during this period, while pH, DO, and other parameters showed minimal fluctuations. 
This dataset is of significant value for further evaluating the ecological, environmental, and climatic functions of aquatic ecosystems.</p>","PeriodicalId":21597,"journal":{"name":"Scientific Data","volume":"12 1","pages":"572"},"PeriodicalIF":5.8,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143788779","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"综合性期刊","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scientific Data, Pub Date: 2025-04-04, DOI: 10.1038/s41597-025-04916-x
Yidi Wu, Hang Sha, Hongwei Liang
{"title":"Chromosome-scale genome assembly and annotation of Xenocypris argentea.","authors":"Yidi Wu, Hang Sha, Hongwei Liang","doi":"10.1038/s41597-025-04916-x","DOIUrl":"https://doi.org/10.1038/s41597-025-04916-x","url":null,"abstract":"<p><p>Xenocypris argentea is a small to medium-sized freshwater cyprinid fish. It distributes widely in the rivers and lakes of China, and is often used as a tool fish for water quality improvement and optimizing aquaculture structures. In recent years, natural populations of X. argentea have decreased rapidly due to human activities, yet little is known about the genetics and genomics of this fish. In the present work, we reported a chromosome-level reference genome of X. argentea based on PacBio HiFi, Hi-C and Illumina paired-end sequencing technologies. The assembled genome was 984.96 Mb in length, with a contig N50 of 36.02 Mb. Using Hi-C interaction information, 99.47% of the contigs were anchored onto 24 chromosomes, and 18 of the chromosomes were gap-free. Further analysis identified 560.27 Mb of repeat sequences and 28,533 protein-coding genes in the genome, of which, 95.62% (27,284) genes were functionally annotated. This high-quality genome offers an invaluable resource for population genetics and phylogeny, comparative genomics, adaptive evolution and functional exploration of X. argentea.</p>","PeriodicalId":21597,"journal":{"name":"Scientific Data","volume":"12 1","pages":"573"},"PeriodicalIF":5.8,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143788619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"综合性期刊","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}