Biodata Mining | Pub Date: 2025-06-19 | DOI: 10.1186/s13040-025-00459-4
Yasaman Fatapour, James P Brody
{"title":"A compact encoding of the genome suitable for machine learning prediction of traits and genetic risk scores.","authors":"Yasaman Fatapour, James P Brody","doi":"10.1186/s13040-025-00459-4","DOIUrl":"https://doi.org/10.1186/s13040-025-00459-4","url":null,"abstract":"<p><p>Genotype to phenotype prediction is a central problem in biology and medicine. Machine learning is a natural tool to address this problem. However, a person's genotype is usually represented by a few million single-nucleotide polymorphisms and most datasets only have a few thousand patients. Thus, this problem typically has many more predictors than the number of samples (patients), making it unsuitable for machine learning. The objective of this paper is to examine the efficacy of a compact genotype representation, which employs a limited number of predictors, in predicting a person's phenotype through the application of machine learning. We characterized a person's genotype using chromosome-scale length variation, a measure that is computed as the average value of reported log R ratios across a portion of a chromosome. We computed these numbers from data collected by the NIH All of Us program. We used the AutoML function (h2o.ai) in binary classification mode to identify the best models to differentiate between male/female, Black/white, white/Asian, and Black/Asian. We also used the AutoML function in regression mode to predict the height of people based on their age and genotype. Our results showed that we could effectively classify a person, using only information from chromosomes 1-22, as Male/Female (AUC = 0.9988 ± 0.0001), White/Black (AUC = 0.970 ± 0.002), Asian/White (AUC = 0.877 ± 0.002), and Black/Asian (AUC = 0.966 ± 0.002). This approach also effectively predicted height. 
In conclusion, we have shown that this compact representation of a person's genotype, along with machine learning, can effectively predict a person's phenotype.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"44"},"PeriodicalIF":4.0,"publicationDate":"2025-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144334213","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
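The abstract above describes chromosome-scale length variation as the average of reported log R ratios across a portion of a chromosome. The following is a minimal sketch of that averaging step only, not the authors' pipeline. The function name, the `(chromosome, position, log_r_ratio)` input layout, and the choice to split each chromosome into `n_parts` chunks of near-equal probe count are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import fmean


def chromosome_scale_length_variation(probes, n_parts=4):
    """Average log R ratios over contiguous portions of each chromosome.

    probes: iterable of (chromosome, position, log_r_ratio) tuples
            (hypothetical layout, chosen for this sketch).
    n_parts: number of portions per chromosome (assumed parameter).
    Returns a dict mapping (chromosome, part_index) to the mean log R ratio.
    """
    by_chrom = defaultdict(list)
    for chrom, pos, lrr in probes:
        by_chrom[chrom].append((pos, lrr))

    features = {}
    for chrom, items in by_chrom.items():
        items.sort()  # order probes by genomic position
        size = -(-len(items) // n_parts)  # ceiling division: probes per portion
        for part, start in enumerate(range(0, len(items), size)):
            chunk = items[start:start + size]
            features[(chrom, part)] = fmean([lrr for _, lrr in chunk])
    return features


# Toy input with made-up values, two portions for one chromosome.
example = chromosome_scale_length_variation(
    [("1", 100, 0.1), ("1", 200, 0.3), ("1", 300, -0.2), ("1", 400, 0.0)],
    n_parts=2,
)
```

Each returned value is one predictor, so a genome would be summarised by a few dozen to a few hundred numbers rather than millions of SNPs, which is the compactness the abstract refers to.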
{"title":"Recent advances in deep learning for protein-protein interaction: a review.","authors":"Jiafu Cui, Siqi Yang, Litai Yi, Qilemuge Xi, Dezhi Yang, Yongchun Zuo","doi":"10.1186/s13040-025-00457-6","DOIUrl":"10.1186/s13040-025-00457-6","url":null,"abstract":"<p><p>Deep learning, a cornerstone of artificial intelligence, is driving rapid advancements in computational biology. Protein-protein interactions (PPIs) are fundamental regulators of biological functions. With the inclusion of deep learning in PPI research, the field is undergoing transformative changes. Therefore, there is an urgent need for a comprehensive review and assessment of recent developments to improve analytical methods and open up a wider range of biomedical applications. This review meticulously assesses deep learning progress in PPI prediction from 2021 to 2025. We evaluate core architectures (GNNs, CNNs, RNNs) and pioneering approaches-attention-driven Transformers, multi-task frameworks, multimodal integration of sequence and structural data, transfer learning via BERT and ESM, and autoencoders for interaction characterization. Moreover, we examined enhanced algorithms for dealing with data imbalances, variations, and high-dimensional feature sparsity, as well as industry challenges (including shifting protein interactions, interactions with non-model organisms, and rare or unannotated protein interactions), and offered perspectives on the future of the field. 
In summary, this review systematically summarizes the latest advances and existing challenges in deep learning in the field of protein interaction analysis, providing a valuable reference for researchers in the fields of computational biology and deep learning.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"43"},"PeriodicalIF":4.0,"publicationDate":"2025-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12168265/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144310649","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining | Pub Date: 2025-06-14 | DOI: 10.1186/s13040-025-00458-5
Xiongbin Gui, Hanlin Lv, Xiao Wang, Longting Lv, Yi Xiao, Lei Wang
{"title":"Enhancing hepatopathy clinical trial efficiency: a secure, large language model-powered pre-screening pipeline.","authors":"Xiongbin Gui, Hanlin Lv, Xiao Wang, Longting Lv, Yi Xiao, Lei Wang","doi":"10.1186/s13040-025-00458-5","DOIUrl":"10.1186/s13040-025-00458-5","url":null,"abstract":"<p><strong>Background: </strong>Recruitment for cohorts involving complex liver diseases, such as hepatocellular carcinoma and liver cirrhosis, often requires interpreting semantically complex criteria. Traditional manual screening methods are time-consuming and prone to errors. While AI-powered pre-screening offers potential solutions, challenges remain regarding accuracy, efficiency, and data privacy.</p><p><strong>Methods: </strong>We developed a novel patient pre-screening pipeline that leverages clinical expertise to guide the precise, safe, and efficient application of large language models. The pipeline breaks down complex criteria into a series of composite questions and then employs two strategies to perform semantic question-answering through electronic health records: (1) Pathway A, Anthropomorphized Experts' Chain of Thought strategy; and (2) Pathway B, Preset Stances within an Agent Collaboration strategy, particularly in managing complex clinical reasoning scenarios. The pipeline is evaluated on key metrics including precision, recall, time consumption, and counterfactual inference-at both the question and criterion levels.</p><p><strong>Results: </strong>Our pipeline achieved a notable balance of high precision (e.g., 0.921, criteria level) and good overall recall (e.g., ~ 0.82, criteria level), alongside high efficiency (0.44s per task). Pathway B excelled in high-precision complex reasoning (while exhibiting a specific recall profile conducive to accuracy), whereas Pathway A was particularly effective for tasks requiring both robust precision and recall (e.g., direct data extraction), often with faster processing times. 
Both pathways achieved comparable overall precision while offering different strengths in the precision-recall trade-off. The pipeline showed promising precision-focused results in hepatocellular carcinoma (0.878) and cirrhosis trials (0.843).</p><p><strong>Conclusions: </strong>This data-secure and time-efficient pipeline shows high precision and achieves good recall in hepatopathy trials, providing promising solutions for streamlining clinical trial workflows. Its efficiency, adaptability, and balanced performance profile make it suitable for improving patient recruitment. Its capability to function in resource-constrained environments further enhances its utility in clinical settings.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"42"},"PeriodicalIF":4.0,"publicationDate":"2025-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12167571/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144295174","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining | Pub Date: 2025-06-13 | DOI: 10.1186/s13040-025-00456-7
Tue T Te, Alex A T Bui, Constance H Fung, Mary Regina Boland
{"title":"Geospatial analysis of short sleep duration and cognitive disability in US adults: a multi-state study using machine learning techniques.","authors":"Tue T Te, Alex A T Bui, Constance H Fung, Mary Regina Boland","doi":"10.1186/s13040-025-00456-7","DOIUrl":"10.1186/s13040-025-00456-7","url":null,"abstract":"<p><strong>Background: </strong>There is evidence of increased risk of cognitive disability due to short sleep duration and adverse Social Determinants of Health (SDoH). We aimed to determine whether spatial associations (correlation between spatially distributed variables within a given geographic area) exist between neighborhoods with short sleep duration and cognitive disability across the United States (US) after adjusting for other factors. We conducted a spatial analysis using a spatial lag model at the neighborhood-level with the census tract as unit-of-analysis within each state in the US. We aggregated our results nationally using a weighted analysis to adjust for the number of census tracts per state. This study used Centers for Disease Control and Prevention (CDC) data on short sleep duration, cognitive disability and other health factors. We used 2021-2022 neighborhood-level data from the CDC and US Census Bureau adjusting for social determinants of health (SDoH) and demographics, excluding Florida due to inconsistencies in data availability. Our exposure variable was self-reported short sleep defined by the CDC (\"sleep less than 7 hours per 24 hour period\"). Our outcome was self-reported cognitive disability defined by the CDC (\"difficulty concentrating, remembering, or making decision\"). We adjusted for other factors including 'health outcomes', 'preventive practices', and the CDC's Social Vulnerability Index.</p><p><strong>Results: </strong>The spatial analysis revealed a significant association between short sleep duration and an increased risk of cognitive disability across the US (estimate range [0.29; 1.27], p < 0.005) after adjustment. 
Notably, six Western states (New Mexico, Alaska, Arizona, Nevada, Idaho, and Oregon) were at increased risk of cognitive disability due to short sleep duration and this pattern was significant (p = 0.007).</p><p><strong>Conclusions: </strong>Our study highlights the importance of short sleep duration as a significant predictor of cognitive disability across the US after adjusting for other confounders. The association between short sleep and cognitive disability was especially strong in the Western region of the US providing a deeper understanding of how geographic context and local factors can shape health outcomes.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"41"},"PeriodicalIF":4.0,"publicationDate":"2025-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12166631/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144295129","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
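The abstract above states that state-level results were aggregated nationally "using a weighted analysis to adjust for the number of census tracts per state". One plausible reading, sketched below, is a mean of per-state estimates weighted by tract count. The exact weighting scheme is not given in the abstract, so the formula, the function name, and the state labels and numbers in the example are assumptions and placeholders, not values from the study.

```python
def tract_weighted_mean(state_estimates, tract_counts):
    """Weighted mean of per-state estimates, weights = census tracts per state.

    state_estimates: dict of state -> model estimate (placeholder values here).
    tract_counts: dict of state -> number of census tracts (placeholder values).
    Assumes weights are proportional to tract counts; the study's exact
    scheme is not stated in the abstract.
    """
    total_tracts = sum(tract_counts[state] for state in state_estimates)
    if total_tracts == 0:
        raise ValueError("tract counts must sum to a positive number")
    weighted_sum = sum(
        state_estimates[state] * tract_counts[state] for state in state_estimates
    )
    return weighted_sum / total_tracts


# Placeholder inputs for illustration only.
national = tract_weighted_mean(
    {"State A": 1.0, "State B": 0.5},
    {"State A": 100, "State B": 300},
)
```

Weighting this way keeps a state with few tracts from influencing the national figure as much as a state with many.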
Biodata Mining | Pub Date: 2025-06-12 | DOI: 10.1186/s13040-025-00455-8
Davide Chicco, Luca Oneto, Davide Cangelosi
{"title":"DBSCAN and DBCV application to open medical records heterogeneous data for identifying clinically significant clusters of patients with neuroblastoma.","authors":"Davide Chicco, Luca Oneto, Davide Cangelosi","doi":"10.1186/s13040-025-00455-8","DOIUrl":"10.1186/s13040-025-00455-8","url":null,"abstract":"<p><p>Neuroblastoma is a common pediatric cancer that affects thousands of infants worldwide, especially children under five years of age. Although recovery for patients with neuroblastoma is possible in 80% of cases, only 40% of those with high-risk stage four neuroblastoma survive. Electronic health records of patients with this disease contain valuable data on patients that can be analyzed using computational intelligence and statistical software by biomedical informatics researchers. Unsupervised machine learning methods, in particular, can identify clinically significant subgroups of patients, which can lead to new therapies or medical treatments for future patients belonging to the same subgroups. However, access to these datasets is often restricted, making it difficult to obtain them for independent research projects. In this study, we retrieved three open datasets containing data from patients diagnosed with neuroblastoma: the Genoa dataset and the Shanghai dataset from the Neuroblastoma Electronic Health Records Open Data Repository, and a dataset from the TARGET-NBL renowned program. We analyzed these datasets using several clustering techniques and measured the results with the DBCV (Density-Based Clustering Validation) index. Among these algorithms, DBSCAN (Density-Based Spatial Clustering of Applications with Noise) was the only one that produced meaningful results. We scrutinized the two clusters of patients' profiles identified by DBSCAN in the three datasets and recognized several relevant clinical variables that clearly partitioned the patients into the two clusters that have clinical meaning in the neuroblastoma literature. 
Our results can have a significant impact on health informatics, because any computational analyst wishing to cluster small data of patients of a rare disease can choose to use DBSCAN and DBCV rather than utilizing more common methods such as k-Means and Silhouette coefficient.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"40"},"PeriodicalIF":4.0,"publicationDate":"2025-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12164137/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144286933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
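For readers unfamiliar with the clustering method named in the abstract above, here is a minimal, dependency-free sketch of the DBSCAN idea: points with at least `min_samples` neighbours within distance `eps` are core points, clusters grow outward from core points, and anything unreachable is labelled noise. This is a teaching sketch with a brute-force neighbour search, not the implementation used in the study, and it does not compute the DBCV index. The function name, parameter names, and toy coordinates are assumptions.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_samples):
    """Label each point with a cluster index, or NOISE (-1).

    A point counts itself among its neighbours, so min_samples includes
    the point under consideration.
    """
    labels = [None] * len(points)

    def neighbors(i):
        # Brute-force O(n) search; fine for small datasets.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point: keep growing
    return labels


# Toy 2-D data: two tight groups and one isolated point.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
toy_labels = dbscan(toy_points, eps=1.5, min_samples=3)
```

Unlike k-Means, the number of clusters is not fixed in advance and isolated records are left unassigned, which is one reason density-based methods can suit small, heterogeneous patient datasets.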
Biodata Mining | Pub Date: 2025-06-11 | DOI: 10.1186/s13040-025-00454-9
David Vidmar, Jessica De Freitas, Will Thompson, John M Pfeifer, Brandon K Fornwalt, Noah Zimmerman, Riccardo Miotto, Ruijun Chen
{"title":"A probabilistic approach for building disease phenotypes across electronic health records.","authors":"David Vidmar, Jessica De Freitas, Will Thompson, John M Pfeifer, Brandon K Fornwalt, Noah Zimmerman, Riccardo Miotto, Ruijun Chen","doi":"10.1186/s13040-025-00454-9","DOIUrl":"10.1186/s13040-025-00454-9","url":null,"abstract":"","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"39"},"PeriodicalIF":4.0,"publicationDate":"2025-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12153169/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144276400","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining | Pub Date: 2025-06-05 | DOI: 10.1186/s13040-025-00453-w
Tom Tubbesing, Andreas Schlüter, Alexander Sczyrba
{"title":"subMG automates data submission for metagenomics studies.","authors":"Tom Tubbesing, Andreas Schlüter, Alexander Sczyrba","doi":"10.1186/s13040-025-00453-w","DOIUrl":"10.1186/s13040-025-00453-w","url":null,"abstract":"<p><strong>Background: </strong>Publicly available metagenomics datasets are crucial for ensuring the reproducibility of scientific findings and supporting contemporary large-scale studies. However, submitting a comprehensive metagenomics dataset is both cumbersome and time-consuming. It requires including sample information, sequencing reads, assemblies, binned contigs, metagenome-assembled genomes (MAGs), and appropriate metadata. As a result, metagenomics studies are often published with incomplete datasets or, in some cases, without any data at all. subMG addresses this challenge by simplifying and automating the data submission process, thereby encouraging broader and more consistent data sharing.</p><p><strong>Results: </strong>subMG streamlines the process of submitting metagenomics study results to the European Nucleotide Archive (ENA) by allowing researchers to input files and metadata from their studies in a single form and automating downstream tasks that otherwise require extensive manual effort and expertise. The tool comes with comprehensive documentation as well as example data tailored for different use cases and can be operated via the command-line or a graphical user interface (GUI), making it easily deployable to a wide range of potential users.</p><p><strong>Conclusions: </strong>By simplifying the submission of genome-resolved metagenomics study datasets, subMG significantly reduces the time, effort, and expertise required from researchers, thus paving the way for more numerous and comprehensive data submissions in the future. 
An increased availability of well-documented and FAIR data can benefit future research, particularly in meta-analyses and comparative studies.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"38"},"PeriodicalIF":4.0,"publicationDate":"2025-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12142852/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144235707","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining | Pub Date: 2025-05-27 | DOI: 10.1186/s13040-025-00452-x
Rachit Kumar, Joseph D Romano, Marylyn D Ritchie
{"title":"Network-based analyses of multiomics data in biomedicine.","authors":"Rachit Kumar, Joseph D Romano, Marylyn D Ritchie","doi":"10.1186/s13040-025-00452-x","DOIUrl":"10.1186/s13040-025-00452-x","url":null,"abstract":"<p><p>Network representations of data are designed to encode relationships between concepts as sets of edges between nodes. Human biology is inherently complex and is represented by data that often exists in a hierarchical nature. One canonical example is the relationship that exists within and between various -omics datasets, including genomics, transcriptomics, and proteomics, among others. Encoding such data in a network-based or graph-based representation allows the explicit incorporation of such relationships into various biomedical big data tasks, including (but not limited to) disease subtyping, interaction prediction, biomarker identification, and patient classification. This review will present various existing approaches in using network representations and analysis of data in multiomics in the framework of deep learning and machine learning approaches, subdivided into supervised and unsupervised approaches, to identify benefits and drawbacks of various approaches as well as the possible next steps for the field.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"37"},"PeriodicalIF":4.0,"publicationDate":"2025-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12117783/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144161878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining | Pub Date: 2025-05-22 | DOI: 10.1186/s13040-025-00451-y
Suruthy Sivanathan, Ting Hu
{"title":"Correction: Learning the therapeutic targets of acute myeloid leukemia through multiscale human interactome network and community analysis.","authors":"Suruthy Sivanathan, Ting Hu","doi":"10.1186/s13040-025-00451-y","DOIUrl":"10.1186/s13040-025-00451-y","url":null,"abstract":"","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"36"},"PeriodicalIF":4.0,"publicationDate":"2025-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12096567/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144127755","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining | Pub Date: 2025-05-13 | DOI: 10.1186/s13040-025-00450-z
Berit Hunsdieck, Christian Bender, Katja Ickstadt, Johanna Mielke
{"title":"Joint models in big data: simulation-based guidelines for required data quality in longitudinal electronic health records.","authors":"Berit Hunsdieck, Christian Bender, Katja Ickstadt, Johanna Mielke","doi":"10.1186/s13040-025-00450-z","DOIUrl":"10.1186/s13040-025-00450-z","url":null,"abstract":"<p><strong>Background: </strong>Over the past decade an increase in usage of electronic health record (EHR) data by office-based physicians and hospitals has been reported. However, these data types come with challenges regarding completeness and data quality and it is, especially for more complex models, unclear how these characteristics influence the performance.</p><p><strong>Methods: </strong>In this paper, we focus on joint models, which combine longitudinal modelling with survival modelling to incorporate all available information. The aim of this paper is to establish simulation-based guidelines for the necessary quality of longitudinal EHR data so that joint models perform better than Cox models. We conducted an extensive simulation study by systematically and transparently varying different characteristics of data quality, e.g., measurement frequency, noise, and heterogeneity between patients. We apply the joint models and evaluate their performance relative to traditional Cox survival modelling techniques.</p><p><strong>Results: </strong>Key findings suggest that biomarker changes before disease onset must be consistent within similar patient groups. With increasing noise and a higher measurement density, the joint model surpasses the traditional Cox regression model in terms of model performance. 
We illustrate the usefulness and limitations of the guidelines with two real-world examples, namely the influence of serum bilirubin on primary biliary liver cirrhosis and the influence of the estimated glomerular filtration rate on chronic kidney disease.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"35"},"PeriodicalIF":4.0,"publicationDate":"2025-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12070788/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143993927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}