Biodata Mining | Pub Date: 2025-01-17 | DOI: 10.1186/s13040-025-00423-2
Richa Gupta, Mansi Bhandari, Anhad Grover, Taher Al-Shehari, Mohammed Kadrie, Taha Alfakih, Hussain Alsalman
{"title":"Correction: Predictive modeling of ALS progression: an XGBoost approach using clinical features.","authors":"Richa Gupta, Mansi Bhandari, Anhad Grover, Taher Al-Shehari, Mohammed Kadrie, Taha Alfakih, Hussain Alsalman","doi":"10.1186/s13040-025-00423-2","DOIUrl":"https://doi.org/10.1186/s13040-025-00423-2","url":null,"abstract":"","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"5"},"PeriodicalIF":4.0,"publicationDate":"2025-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11740421/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143014567","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining | Pub Date: 2025-01-16 | DOI: 10.1186/s13040-024-00419-4
Heesang Moon, Mina Rho
{"title":"MultiChem: predicting chemical properties using multi-view graph attention network.","authors":"Heesang Moon, Mina Rho","doi":"10.1186/s13040-024-00419-4","DOIUrl":"10.1186/s13040-024-00419-4","url":null,"abstract":"<p><strong>Background: </strong>Understanding the molecular properties of chemical compounds is essential for identifying potential candidates or ensuring safety in drug discovery. However, exploring the vast chemical space is time-consuming and costly, necessitating the development of time-efficient and cost-effective computational methods. Recent advances in deep learning approaches have offered deeper insights into molecular structures. Leveraging this progress, we developed a novel multi-view learning model.</p><p><strong>Results: </strong>We introduce a graph-integrated model that captures both local and global structural features of chemical compounds. In our model, graph attention layers are employed to effectively capture essential local structures by jointly considering atom and bond features, while multi-head attention layers extract important global features. We evaluated our model on nine MoleculeNet datasets, encompassing both classification and regression tasks, and compared its performance with state-of-the-art methods. 
Our model achieved an average area under the receiver operating characteristic (AUROC) of 0.822 and a root mean squared error (RMSE) of 1.133, representing a 3% improvement in AUROC and a 7% improvement in RMSE over state-of-the-art models in extensive seed testing.</p><p><strong>Conclusion: </strong>MultiChem highlights the importance of integrating both local and global structural information in predicting molecular properties, while also assessing the stability of the models across multiple datasets using various random seed values.</p><p><strong>Implementation: </strong>The codes are available at https://github.com/DMnBI/MultiChem .</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"4"},"PeriodicalIF":4.0,"publicationDate":"2025-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11737097/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143014571","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining | Pub Date: 2025-01-15 | DOI: 10.1186/s13040-024-00421-w
Peter T Nguyen, Simon G Coetzee, Irina Silacheva, Dennis J Hazelett
{"title":"Genome-wide association studies are enriched for interacting genes.","authors":"Peter T Nguyen, Simon G Coetzee, Irina Silacheva, Dennis J Hazelett","doi":"10.1186/s13040-024-00421-w","DOIUrl":"10.1186/s13040-024-00421-w","url":null,"abstract":"<p><strong>Background: </strong>With recent advances in single cell technology, high-throughput methods provide unique insight into disease mechanisms and more importantly, cell type origin. Here, we used multi-omics data to understand how genetic variants from genome-wide association studies influence development of disease. We show in principle how to use genetic algorithms with normal, matching pairs of single-nucleus RNA- and ATAC-seq, genome annotations, and protein-protein interaction data to describe the genes and cell types collectively and their contribution to increased risk.</p><p><strong>Results: </strong>We used genetic algorithms to measure fitness of gene-cell set proposals against a series of objective functions that capture data and annotations. The highest information objective function captured protein-protein interactions. We observed significantly greater fitness scores and subgraph sizes in foreground vs. matching sets of control variants. Furthermore, our model reliably identified known targets and ligand-receptor pairs, consistent with prior studies.</p><p><strong>Conclusions: </strong>Our findings suggested that application of genetic algorithms to association studies can generate a coherent cellular model of risk from a set of susceptibility variants. 
Further, we showed, using breast cancer as an example, that such variants have a greater number of physical interactions than expected due to chance.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"3"},"PeriodicalIF":4.0,"publicationDate":"2025-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11734473/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143014570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining | Pub Date: 2025-01-09 | DOI: 10.1186/s13040-024-00412-x
Davide Chicco, Alessandro Fabris, Giuseppe Jurman
{"title":"The Venus score for the assessment of the quality and trustworthiness of biomedical datasets.","authors":"Davide Chicco, Alessandro Fabris, Giuseppe Jurman","doi":"10.1186/s13040-024-00412-x","DOIUrl":"10.1186/s13040-024-00412-x","url":null,"abstract":"<p><p>Biomedical datasets are the mainstays of computational biology and health informatics projects, and can be found on multiple data platforms online or obtained from wet-lab biologists and physicians. The quality and the trustworthiness of these datasets, however, can sometimes be poor, producing bad results in turn, which can harm patients and data subjects. To address this problem, policy-makers, researchers, and consortia have proposed diverse regulations, guidelines, and scores to assess the quality and increase the reliability of datasets. Although generally useful, however, they are often incomplete and impractical. The guidelines of Datasheets for Datasets, in particular, are too numerous; the requirements of the Kaggle Dataset Usability Score focus on non-scientific requisites (for example, including a cover image); and the European Union Artificial Intelligence Act (EU AI Act) sets forth sparse and general data governance requirements, which we tailored to datasets for biomedical AI. Against this backdrop, we introduce our new Venus score to assess the data quality and trustworthiness of biomedical datasets. Our score ranges from 0 to 10 and consists of ten questions that anyone developing a bioinformatics, medical informatics, or cheminformatics dataset should answer before the release. In this study, we first describe the EU AI Act, Datasheets for Datasets, and the Kaggle Dataset Usability Score, presenting their requirements and their drawbacks. To do so, we reverse-engineer the weights of the influential Kaggle Score for the first time and report them in this study. 
We distill the most important data governance requirements into ten questions tailored to the biomedical domain, comprising the Venus score. We apply the Venus score to twelve datasets from multiple subdomains, including electronic health records, medical imaging, microarray and bulk RNA-seq gene expression, cheminformatics, physiologic electrogram signals, and medical text. Analyzing the results, we surface fine-grained strengths and weaknesses of popular datasets, as well as aggregate trends. Most notably, we find a widespread tendency to gloss over sources of data inaccuracy and noise, which may hinder the reliable exploitation of data and, consequently, research results. Overall, our results confirm the applicability and utility of the Venus score to assess the trustworthiness of biomedical data.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"1"},"PeriodicalIF":4.0,"publicationDate":"2025-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11716409/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142957099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining | Pub Date: 2025-01-04 | DOI: 10.1186/s13040-024-00414-9
Xingyu Li, Lu Peng, Yu-Ping Wang, Weihua Zhang
{"title":"Open challenges and opportunities in federated foundation models towards biomedical healthcare.","authors":"Xingyu Li, Lu Peng, Yu-Ping Wang, Weihua Zhang","doi":"10.1186/s13040-024-00414-9","DOIUrl":"10.1186/s13040-024-00414-9","url":null,"abstract":"<p><p>This survey explores the transformative impact of foundation models (FMs) in artificial intelligence, focusing on their integration with federated learning (FL) in biomedical research. Foundation models such as ChatGPT, LLaMa, and CLIP, which are trained on vast datasets through methods including unsupervised pretraining, self-supervised learning, instructed fine-tuning, and reinforcement learning from human feedback, represent significant advancements in machine learning. These models, with their ability to generate coherent text and realistic images, are crucial for biomedical applications that require processing diverse data forms such as clinical reports, diagnostic images, and multimodal patient interactions. The incorporation of FL with these sophisticated models presents a promising strategy to harness their analytical power while safeguarding the privacy of sensitive medical data. This approach not only enhances the capabilities of FMs in medical diagnostics and personalized treatment but also addresses critical concerns about data privacy and security in healthcare. This survey reviews the current applications of FMs in federated settings, underscores the challenges, and identifies future research directions including scaling FMs, managing data diversity, and enhancing communication efficiency within FL frameworks. 
The objective is to encourage further research into the combined potential of FMs and FL, laying the groundwork for healthcare innovations.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"2"},"PeriodicalIF":4.0,"publicationDate":"2025-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142928515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining | Pub Date: 2024-12-30 | DOI: 10.1186/s13040-024-00417-6
Jakub Horvath, Pavel Jedlicka, Marie Kratka, Zdenek Kubat, Eduard Kejnovsky, Matej Lexa
{"title":"Correction: Detection and classification of long terminal repeat sequences in plant LTR-retrotransposons and their analysis using explainable machine learning.","authors":"Jakub Horvath, Pavel Jedlicka, Marie Kratka, Zdenek Kubat, Eduard Kejnovsky, Matej Lexa","doi":"10.1186/s13040-024-00417-6","DOIUrl":"10.1186/s13040-024-00417-6","url":null,"abstract":"","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"17 1","pages":"62"},"PeriodicalIF":4.0,"publicationDate":"2024-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11687018/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142907814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining | Pub Date: 2024-12-28 | DOI: 10.1186/s13040-024-00413-w
Zhendong Sha, Philip J Freda, Priyanka Bhandary, Attri Ghosh, Nicholas Matsumoto, Jason H Moore, Ting Hu
{"title":"Distinct network patterns emerge from Cartesian and XOR epistasis models: a comparative network science analysis.","authors":"Zhendong Sha, Philip J Freda, Priyanka Bhandary, Attri Ghosh, Nicholas Matsumoto, Jason H Moore, Ting Hu","doi":"10.1186/s13040-024-00413-w","DOIUrl":"10.1186/s13040-024-00413-w","url":null,"abstract":"<p><strong>Background: </strong>Epistasis, the phenomenon where the effect of one gene (or variant) is masked or modified by one or more other genes, significantly contributes to the phenotypic variance of complex traits. Traditionally, epistasis has been modeled using the Cartesian epistatic model, a multiplicative approach based on standard statistical regression. However, a recent study investigating epistasis in obesity-related traits has identified potential limitations of the Cartesian epistatic model, revealing that it likely only detects a fraction of the genetic interactions occurring in natural systems. In contrast, the exclusive-or (XOR) epistatic model has shown promise in detecting a broader range of epistatic interactions and revealing more biologically relevant functions associated with interacting variants. To investigate whether the XOR epistatic model also forms distinct network structures compared to the Cartesian model, we applied network science to examine genetic interactions underlying body mass index (BMI) in rats (Rattus norvegicus).</p><p><strong>Results: </strong>Our comparative analysis of XOR and Cartesian epistatic models in rats reveals distinct topological characteristics. The XOR model exhibits enhanced sensitivity to epistatic interactions between the network communities found in the Cartesian epistatic network, facilitating the identification of novel trait-related biological functions via community-based enrichment analysis. Additionally, the XOR network features triangle network motifs, indicative of higher-order epistatic interactions. 
This research also evaluates the impact of linkage disequilibrium (LD)-based edge pruning on network-based epistasis analysis, finding that LD-based edge pruning may lead to increased network fragmentation, which may hinder the effectiveness of network analysis for the investigation of epistasis. We confirmed through network permutation analysis that most XOR and Cartesian epistatic networks derived from the data display distinct structural properties compared to randomly shuffled networks.</p><p><strong>Conclusions: </strong>Collectively, these findings highlight the XOR model's ability to uncover meaningful biological associations and higher-order epistasis derived from lower-order network topologies. The introduction of community-based enrichment analysis and motif-based epistatic discovery emphasizes network science as a critical approach for advancing epistasis research and understanding complex genetic architectures.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"17 1","pages":"61"},"PeriodicalIF":4.0,"publicationDate":"2024-12-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11681696/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142899656","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining | Pub Date: 2024-12-24 | DOI: 10.1186/s13040-024-00416-7
Dani Livne, Sol Efroni
{"title":"Pathway metrics accurately stratify T cells to their cells states.","authors":"Dani Livne, Sol Efroni","doi":"10.1186/s13040-024-00416-7","DOIUrl":"10.1186/s13040-024-00416-7","url":null,"abstract":"<p><p>Pathway analysis is a powerful approach for elucidating insights from gene expression data and associating such changes with cellular phenotypes. The overarching objective of pathway research is to identify critical molecular drivers within a cellular context and uncover novel signaling networks from groups of relevant biomolecules. In this work, we present PathSingle, a Python-based pathway analysis tool tailored for single-cell data analysis. PathSingle employs a unique graph-based algorithm to enable the classification of diverse cellular states, such as T cell subtypes. Designed to be open-source, extensible, and computationally efficient, PathSingle is available at https://github.com/zurkin1/PathSingle under the MIT license. This tool provides researchers with a versatile framework for uncovering biologically meaningful insights from high-dimensional single-cell transcriptomics data, facilitating a deeper understanding of cellular regulation and function.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"17 1","pages":"60"},"PeriodicalIF":4.0,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11668091/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142883414","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining | Pub Date: 2024-12-18 | DOI: 10.1186/s13040-024-00411-y
Nina Kastendiek, Roberta Coletti, Thilo Gross, Marta B Lopes
{"title":"Exploring glioma heterogeneity through omics networks: from gene network discovery to causal insights and patient stratification.","authors":"Nina Kastendiek, Roberta Coletti, Thilo Gross, Marta B Lopes","doi":"10.1186/s13040-024-00411-y","DOIUrl":"10.1186/s13040-024-00411-y","url":null,"abstract":"<p><p>Gliomas are primary malignant brain tumors with a typically poor prognosis, exhibiting significant heterogeneity across different cancer types. Each glioma type possesses distinct molecular characteristics determining patient prognosis and therapeutic options. This study aims to explore the molecular complexity of gliomas at the transcriptome level, employing a comprehensive approach grounded in network discovery. The graphical lasso method was used to estimate a gene co-expression network for each glioma type from a transcriptomics dataset. Causality was subsequently inferred from correlation networks by estimating the Jacobian matrix. The networks were then analyzed for gene importance using centrality measures and modularity detection, leading to the selection of genes that might play an important role in the disease. To explore the pathways and biological functions these genes are involved in, KEGG and Gene Ontology (GO) enrichment analyses on the disclosed gene sets were performed, highlighting the significance of the genes selected across several relevant pathways and GO terms. Spectral clustering based on patient similarity networks was applied to stratify patients into groups with similar molecular characteristics and to assess whether the resulting clusters align with the diagnosed glioma type. The results presented highlight the ability of the proposed methodology to uncover relevant genes associated with glioma intertumoral heterogeneity. 
Further investigation might encompass biological validation of the putative biomarkers disclosed.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"17 1","pages":"56"},"PeriodicalIF":4.0,"publicationDate":"2024-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11657291/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142856223","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining | Pub Date: 2024-12-18 | DOI: 10.1186/s13040-024-00377-x
Jiang Zhao, Qian Zhang, Cunle Zhu, Wu Yuqi, Guohui Zhang, Qianliang Wang, Xingyou Dong, Benyi Li, Xiangwei Wang
{"title":"Prognostic feature based on androgen-responsive genes in bladder cancer and screening for potential targeted drugs.","authors":"Jiang Zhao, Qian Zhang, Cunle Zhu, Wu Yuqi, Guohui Zhang, Qianliang Wang, Xingyou Dong, Benyi Li, Xiangwei Wang","doi":"10.1186/s13040-024-00377-x","DOIUrl":"10.1186/s13040-024-00377-x","url":null,"abstract":"<p><strong>Objectives: </strong>Bladder cancer (BLCA) is a tumor that affects men more than women. The biological function and prognostic value of androgen-responsive genes (ARGs) in BLCA are currently unknown. To address this, we established an androgen signature to determine the prognosis of BLCA.</p><p><strong>Methods: </strong>Sequencing data for BLCA from the TCGA and GEO datasets were used for research. The tumor microenvironment (TME) was measured using Cibersort and ssGSEA. Prognosis-related genes were identified and a risk score model was constructed using univariate Cox regression, LASSO regression, and multivariate Cox regression. Drug sensitivity analysis was performed using Genomics of drug sensitivity in cancer (GDSC). Real-time quantitative PCR was performed to assess the expression of representative genes in clinical samples.</p><p><strong>Results: </strong>ARGs (especially the CDK6, FADS1, PGM3, SCD, PTK2B, and TPD52) might regulate the progression of BLCA. The different expression patterns of ARGs may lead to different immune cell infiltration. The risk model indicates that patients with higher risk scores have a poorer prognosis, more stromal infiltration, and an enrichment of biological functions. Single-cell RNA analysis, bulk RNA data, and PCR analysis support the reliability of this risk model, and a nomogram was also established for clinical use. Drug prediction analysis showed that high-risk patients had a better response to fludarabine, AZD8186, and carmustine.</p><p><strong>Conclusion: </strong>ARGs played an important role in the progression, immune infiltration, and prognosis of BLCA. 
The ARGs model has high accuracy in predicting the prognosis of BLCA patients and provides more effective medication guidelines.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"17 1","pages":"59"},"PeriodicalIF":4.0,"publicationDate":"2024-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11657289/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142856224","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}