Scientific Data, Pub Date: 2024-10-21, DOI: 10.1038/s41597-024-03933-6
Miao-Jiong Tang, Tian-Cheng Zhu, Shuo-Qing Zhang, Xin Hong
{"title":"QM9star, two Million DFT-computed Equilibrium Structures for Ions and Radicals with Atomic Information.","authors":"Miao-Jiong Tang, Tian-Cheng Zhu, Shuo-Qing Zhang, Xin Hong","doi":"10.1038/s41597-024-03933-6","DOIUrl":"10.1038/s41597-024-03933-6","url":null,"abstract":"<p><p>Ions and radicals serve as key intermediates in molecular transformation, with their chemical properties being essential for understanding and predicting reaction reactivity and selectivity. In this data descriptor, we report a quantum chemical dataset named QM9star, comprising cations, anions, and radicals. This dataset is derived from the molecular structures of the QM9 dataset, created by removing terminal hydrogens followed by optimization at the B3LYP-D3(BJ)/6-311+G(d,p) level of density functional theory. The QM9star dataset includes approximately 1.9 million cations, anions, and radicals, along with 120 thousand neutral molecules prior to hydrogen removal. Each entry encompasses both molecular and atomic information: representative global properties include orbital energies, vibrational frequencies, etc., while local properties cover aspects such as charges and spin densities at each atomic site. The QM9star dataset not only serves as a comprehensive source of quantum chemical information for intermediates but also offers insights into the principle of atomic property distribution. We anticipate that these data will aid in machine learning studies related to chemical intermediates and contribute to molecular representation learning.</p>","PeriodicalId":21597,"journal":{"name":"Scientific Data","volume":"11 1","pages":"1158"},"PeriodicalIF":5.8,"publicationDate":"2024-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11494049/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142473839","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"综合性期刊","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scientific Data, Pub Date: 2024-10-21, DOI: 10.1038/s41597-024-03882-0
Esther Thea Inau, Angela Dedié, Ivona Anastasova, Renate Schick, Yaroslav Zdravomyslov, Brigitte Fröhlich, Andreas L Birkenfeld, Martin Hrabě de Angelis, Michael Roden, Atinkut Alamirrew Zeleke, Martin Preusse, Dagmar Waltemath
{"title":"The Journey to a FAIR CORE DATA SET for Diabetes Research in Germany.","authors":"Esther Thea Inau, Angela Dedié, Ivona Anastasova, Renate Schick, Yaroslav Zdravomyslov, Brigitte Fröhlich, Andreas L Birkenfeld, Martin Hrabě de Angelis, Michael Roden, Atinkut Alamirrew Zeleke, Martin Preusse, Dagmar Waltemath","doi":"10.1038/s41597-024-03882-0","DOIUrl":"10.1038/s41597-024-03882-0","url":null,"abstract":"<p><p>The German Center for Diabetes Research (DZD) established a core data set (CDS) of clinical parameters relevant for diabetes research in 2021. The CDS is central to the design of current and future DZD studies. Here, we describe the process and outcomes of FAIRifying the initial version of the CDS. We first did a baseline evaluation of the FAIRness using the FAIR Data Maturity Model. The FAIRification process and the results of this assessment led us to convert the CDS into the recommended format for spreadsheets, annotating the parameters with standardized medical codes, licensing the data set, enriching the data set with metadata, and indexing the metadata. The FAIRified version of the CDS is more suitable for data sharing in diabetes research across DZD sites and beyond. It contributes to the reusability of health research studies.</p>","PeriodicalId":21597,"journal":{"name":"Scientific Data","volume":"11 1","pages":"1159"},"PeriodicalIF":5.8,"publicationDate":"2024-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11494036/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142473844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"综合性期刊","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scientific Data, Pub Date: 2024-10-21, DOI: 10.1038/s41597-024-03582-9
A Lina Heinzke, Barbara Zdrazil, Paul D Leeson, Robert J Young, Axel Pahl, Herbert Waldmann, Andrew R Leach
{"title":"A compound-target pairs dataset: differences between drugs, clinical candidates and other bioactive compounds.","authors":"A Lina Heinzke, Barbara Zdrazil, Paul D Leeson, Robert J Young, Axel Pahl, Herbert Waldmann, Andrew R Leach","doi":"10.1038/s41597-024-03582-9","DOIUrl":"10.1038/s41597-024-03582-9","url":null,"abstract":"<p><p>Providing a better understanding of what makes a compound a successful drug candidate is crucial for reducing the high attrition rates in drug discovery. Analyses of the differences between active compounds, clinical candidates and drugs require high-quality datasets. However, most datasets of drug discovery programs are not openly available. This work introduces a dataset of compound-target pairs extracted from the open-source bioactivity database ChEMBL (release 32). Compound-target pairs in the dataset either have at least one measured activity or are part of the manually curated set of known interactions in ChEMBL. Known interactions between drugs or clinical candidates and targets are specifically annotated to facilitate analyses of differences between drugs, clinical candidates, and other active compounds. In total, the dataset comprises 614,594 compound-target pairs, 5,109 (3,932) of which are known interactions between drugs (clinical candidates) and targets. The extraction is performed in an automated manner and fully reproducible. 
We are providing not only the datasets but also the code to rerun the analyses with other ChEMBL releases.</p>","PeriodicalId":21597,"journal":{"name":"Scientific Data","volume":"11 1","pages":"1160"},"PeriodicalIF":5.8,"publicationDate":"2024-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11494047/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142473802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"综合性期刊","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scientific Data, Pub Date: 2024-10-19, DOI: 10.1038/s41597-024-03995-6
Andre Geraldo de Lima Moraes, Sajad Khoshnood Motlagh
{"title":"The Climate Data for Adaptation and Vulnerability Assessments and the Spatial Interactions Downscaling Method.","authors":"Andre Geraldo de Lima Moraes, Sajad Khoshnood Motlagh","doi":"10.1038/s41597-024-03995-6","DOIUrl":"10.1038/s41597-024-03995-6","url":null,"abstract":"<p><p>This study presents the spatial interactions downscaling (SPID) method and introduces the climate data for adaptation and vulnerability assessments (ClimAVA) dataset. SPID employs random forest models to capture the relationship between spatial patterns at global circulation model (GCM) resolution and fine-resolution pixel values. In summary, a random forest model is trained for each fine spatial resolution pixel of the reference data as the predictand, and nine pixels from the spatially resampled (coarser) version of the reference data at the GCM's resolutions as predictors. Models are then utilized to downscale the bias-corrected GCM data. The ClimAVA-SW dataset offers a high-resolution (4 km), bias-corrected, downscaled future climate projection derived from seventeen CMIP6 GCMs. It includes three variables (daily precipitation, minimum and maximum temperature) for three shared socioeconomic pathways (SSP245, SSP370, SSP585) across the U.S. Southwest region. 
The ClimAVA dataset sets itself apart with the SPID method's capacity to provide remarkable climate realism, high physical plausibility of change, and excellent representation of extreme events while maintaining user-friendliness and requiring relatively low computational resources.</p>","PeriodicalId":21597,"journal":{"name":"Scientific Data","volume":"11 1","pages":"1157"},"PeriodicalIF":5.8,"publicationDate":"2024-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11490614/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142473843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"综合性期刊","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scientific Data, Pub Date: 2024-10-18, DOI: 10.1038/s41597-024-03923-8
Girish Patidar, J Indu, Subhankar Karmakar
{"title":"ExtendinG SUb-DAily River Discharge data over INdia (GUARDIAN).","authors":"Girish Patidar, J Indu, Subhankar Karmakar","doi":"10.1038/s41597-024-03923-8","DOIUrl":"https://doi.org/10.1038/s41597-024-03923-8","url":null,"abstract":"<p><p>River discharge information is crucial for various applications, but the measurement process often remains impeded by factors that hinder near real-time (NRT) data availability in India. However, leveraging telemetry-based water surface elevation (WSE) data across the country provides an opportunity to convert it into river discharge. This conversion is made possible by utilizing rating curves (RCs) derived from historical collocated measurements of WSE and discharge. In this study, NRT WSE from the Central Water Commission (CWC) flood portal is obtained via a web-scraping tool. Through the application of RCs, discharge data is extended across 210 gauging stations in India from the year 2020 to the present, encompassing sub-daily discharge during the non-monsoon and hourly discharge series during the monsoon (June-September) across the Indian rivers. Annually, the study generated over 800,000 discharge data points for Indian rivers, accounting for more than 4000 discharge measurements per station annually. These comprehensive datasets provide valuable insights for water resource and flood management research, offering NRT access to WSE and discharge, along with the local RCs.</p>","PeriodicalId":21597,"journal":{"name":"Scientific Data","volume":"11 1","pages":"1155"},"PeriodicalIF":5.8,"publicationDate":"2024-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11489425/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142473830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"综合性期刊","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
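The GUARDIAN abstract above describes converting water surface elevation (WSE) into discharge through rating curves fitted to historical collocated measurements. The abstract does not state the functional form of its rating curves, so the minimal sketch below assumes the common power-law form Q = a (h - h0)^b with a known cease-to-flow stage h0, fitted by ordinary least squares in log space. The function names, the parameter h0, and all numbers are hypothetical and are not taken from the dataset.

```python
import math


def fit_rating_curve(stages, discharges, h0):
    """Fit Q = a * (h - h0)**b by linear least squares on log-transformed data.

    Assumption: the power-law form and a known cease-to-flow stage h0.
    Pairs with h <= h0 or Q <= 0 are skipped because their log is undefined.
    """
    xs, ys = [], []
    for h, q in zip(stages, discharges):
        if h > h0 and q > 0:
            xs.append(math.log(h - h0))
            ys.append(math.log(q))
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two valid (stage, discharge) pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("stages must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                       # slope in log space is the exponent
    a = math.exp(mean_y - b * mean_x)   # intercept in log space is log(a)
    return a, b


def stage_to_discharge(h, a, b, h0):
    """Apply the fitted curve to one WSE reading; stages below h0 give zero flow."""
    return a * max(h - h0, 0.0) ** b
```

In use, `fit_rating_curve` would be called once per gauging station on its historical record, and `stage_to_discharge` would then be applied to each newly scraped WSE value. Fitting in log space keeps the sketch dependency-free, at the cost of weighting low flows more heavily than a direct nonlinear fit would.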
Scientific Data, Pub Date: 2024-10-18, DOI: 10.1038/s41597-024-03971-0
Kangyang Cao, Yujian Zou, Chang Zhang, Weijing Zhang, Jie Zhang, Guojie Wang, Chu Zhang, Jiegeng Lyu, Yue Sun, Hongyuan Zhang, Bin Huang, Lei Deng, Shuiqing Yang, Jianpeng Li, Bingsheng Huang
{"title":"A multicenter bladder cancer MRI dataset and baseline evaluation of federated learning in clinical application.","authors":"Kangyang Cao, Yujian Zou, Chang Zhang, Weijing Zhang, Jie Zhang, Guojie Wang, Chu Zhang, Jiegeng Lyu, Yue Sun, Hongyuan Zhang, Bin Huang, Lei Deng, Shuiqing Yang, Jianpeng Li, Bingsheng Huang","doi":"10.1038/s41597-024-03971-0","DOIUrl":"10.1038/s41597-024-03971-0","url":null,"abstract":"<p><p>Bladder cancer (BCa), as the most common malignant tumor of the urinary system, has received significant attention in research on the clinical application of artificial intelligence algorithms. Nevertheless, it has been observed that certain investigations use data from various medical facilities to train models for BCa, which may pose a privacy risk. Given this concern, protecting patient privacy during machine learning algorithm training is a crucial aspect that requires substantial attention. One emerging machine learning paradigm that addresses this concern is federated learning (FL). FL enables multiple entities to collaboratively build machine learning models while preserving data privacy and security. In this study, we present a multicenter BCa magnetic resonance imaging (MRI) dataset. The dataset comprises 275 three-dimensional bladder T2-weighted MRI scans collected from four medical centers, and each scan includes diagnostic pathological labels for muscle invasion and pixel-level annotations of tumor contours. 
Four FL methods are used to assess the baseline of the dataset for both the task of diagnosing muscle-invasive bladder cancer and automatic bladder tumor lesion segmentation.</p>","PeriodicalId":21597,"journal":{"name":"Scientific Data","volume":"11 1","pages":"1147"},"PeriodicalIF":5.8,"publicationDate":"2024-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11489429/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142473805","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"综合性期刊","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
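The abstract above does not name the four FL methods it benchmarks. As a hedged illustration of the general idea (each center trains locally and shares only model parameters, never scans), the minimal sketch below shows the aggregation step of federated averaging (FedAvg), a widely used FL baseline that may or may not be among the four. The function name `federated_average`, the flat parameter lists, and the per-center sample counts are all hypothetical.

```python
def federated_average(client_params, client_sizes):
    """Aggregate client models as a sample-size-weighted mean of their parameters.

    client_params: one flat list of floats per client (same length for all).
    client_sizes:  number of local training samples per client.
    Illustrative only; the dataset paper's actual FL methods are not specified here.
    """
    if not client_params or len(client_params) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(client_params[0])
    averaged = [0.0] * dim
    for params, size in zip(client_params, client_sizes):
        if len(params) != dim:
            raise ValueError("all clients must share the same parameter shape")
        weight = size / total  # larger centers contribute proportionally more
        for i, value in enumerate(params):
            averaged[i] += weight * value
    return averaged
```

In a full training loop, a server would broadcast the averaged parameters back to the centers, each center would run a few local training steps, and the aggregation would repeat for several rounds.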
Scientific Data, Pub Date: 2024-10-18, DOI: 10.1038/s41597-024-03998-3
Binbin Xia, Jianghua Shen, Hao Zhang, Siqi Chen, Xuan Zhang, Moshi Song, Jun Wang
{"title":"The alternative splicing landscape of infarcted mouse heart identifies isoform level therapeutic targets.","authors":"Binbin Xia, Jianghua Shen, Hao Zhang, Siqi Chen, Xuan Zhang, Moshi Song, Jun Wang","doi":"10.1038/s41597-024-03998-3","DOIUrl":"10.1038/s41597-024-03998-3","url":null,"abstract":"<p><p>Alternative splicing is an important process that contributes to highly diverse transcripts and protein products, which can affect the development of disease in various organisms. Cardiovascular disease (CVD) represents one of the greatest global threats to humans, particularly acute myocardial infarction (MI) and subsequent ischemic reperfusion (IR) injury, which involve complex transcriptomic changes in heart tissues associated with metabolic reshaping and immunological response. In this study, we used a newly developed ONT full-length transcriptomic approach and performed transcript-resolved differential expression profiling in murine models of MI and IR. We built an analytical pipeline to reliably identify and quantify alternative splicing products (isoforms), expanding on the currently available catalog of isoforms described in mice. The updated alternative splicing landscape included transcripts, genes, and pathways that were differentially regulated during IR and MI. 
Our study establishes a pipeline to profile highly diverse isoforms using state-of-the-art long-read sequencing and builds a landscape of alternative splicing in the mouse heart during MI and IR.</p>","PeriodicalId":21597,"journal":{"name":"Scientific Data","volume":"11 1","pages":"1154"},"PeriodicalIF":5.8,"publicationDate":"2024-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11489681/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142473842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"综合性期刊","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scientific Data, Pub Date: 2024-10-18, DOI: 10.1038/s41597-024-03964-z
Kejing Fan, Zhixia Xiao, Liping Wang, Wai-Lun Cheung, Fuk-Ling Wong, Feng Zhang, Man-Wah Li, Hon-Ming Lam
{"title":"Transcriptomes of soybean roots and nodules inoculated with Sinorhizobium fredii with NopP and NopI variants.","authors":"Kejing Fan, Zhixia Xiao, Liping Wang, Wai-Lun Cheung, Fuk-Ling Wong, Feng Zhang, Man-Wah Li, Hon-Ming Lam","doi":"10.1038/s41597-024-03964-z","DOIUrl":"https://doi.org/10.1038/s41597-024-03964-z","url":null,"abstract":"<p><p>The major crop, soybean, forms root nodules with symbiotic rhizobia, providing energy and carbon to the bacteria in exchange for bioavailable nitrogen. The relationship is host-specific and highly host-regulated to maximize energy efficiency. Symbiotic nitrogen fixation (SNF) is greener than synthetic fertilizer for replenishing soil fertility, contributing to yield increase. Nodulation Outer Protein P (NopP) and NopI of the type 3 secretion system (T3SS) of the rhizobium determine host specificity. Sinorhizobium fredii CCBAU25509 (R2) and CCBAU45436 (R4) have different NopP and NopI variants, affecting their respective symbiotic compatibilities with the cultivated soybean C08 and the wild soybean W05. Swapping the NopP variants between R2 and R4 has been shown to switch their compatibility with C08 with the rj2/Rfg1 genotype. To understand the effects of Nops on host compatibility, analyses on the transcriptomic data of W05 roots and nodules inoculated with S. fredii strains containing Nop variants uncovered many differentially expressed genes related to nodulation and nodule functions, providing important information on the effects of Nops on hosts and nodules.</p>","PeriodicalId":21597,"journal":{"name":"Scientific Data","volume":"11 1","pages":"1146"},"PeriodicalIF":5.8,"publicationDate":"2024-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11489703/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142473845","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"综合性期刊","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scientific Data, Pub Date: 2024-10-18, DOI: 10.1038/s41597-024-03972-z
Sophie Huddart, Vijay Yadav, Solveig K Sieberts, Larson Omberg, Mihaja Raberahona, Rivo Rakotoarivelo, Issa N Lyimo, Omar Lweno, Devasahayam J Christopher, Nguyen Viet Nhung, Grant Theron, William Worodria, Charles Y Yu, Christine M Bachman, Stephen Burkot, Puneet Dewan, Sourabh Kulhare, Peter M Small, Adithya Cattamanchi, Devan Jaganath, Simon Grandjean Lapierre
{"title":"A dataset of Solicited Cough Sound for Tuberculosis Triage Testing.","authors":"Sophie Huddart, Vijay Yadav, Solveig K Sieberts, Larson Omberg, Mihaja Raberahona, Rivo Rakotoarivelo, Issa N Lyimo, Omar Lweno, Devasahayam J Christopher, Nguyen Viet Nhung, Grant Theron, William Worodria, Charles Y Yu, Christine M Bachman, Stephen Burkot, Puneet Dewan, Sourabh Kulhare, Peter M Small, Adithya Cattamanchi, Devan Jaganath, Simon Grandjean Lapierre","doi":"10.1038/s41597-024-03972-z","DOIUrl":"10.1038/s41597-024-03972-z","url":null,"abstract":"<p><p>Cough is a common and commonly ignored symptom of lung disease. Cough is often perceived as difficult to quantify, frequently self-limiting, and non-specific. However, cough has a central role in the clinical detection of many lung diseases including tuberculosis (TB), which remains the leading infectious disease killer worldwide. TB screening currently relies on self-reported cough which fails to meet the World Health Organization (WHO) accuracy targets for a TB triage test. Artificial intelligence (AI) models based on cough sound have been developed for several respiratory conditions, with limited work being done in TB. To support the development of an accurate, point-of-care cough-based triage tool for TB, we have compiled a large multi-country database of cough sounds from individuals being evaluated for TB. The dataset includes more than 700,000 cough sounds from 2,143 individuals with detailed demographic, clinical and microbiologic diagnostic information. 
We aim to empower researchers in the development of cough sound analysis models to improve TB diagnosis, where innovative approaches are critically needed to end this long-standing pandemic.</p>","PeriodicalId":21597,"journal":{"name":"Scientific Data","volume":"11 1","pages":"1149"},"PeriodicalIF":5.8,"publicationDate":"2024-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11489852/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142473803","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"综合性期刊","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}