Biodata Mining | Pub Date: 2024-11-13 | DOI: 10.1186/s13040-024-00400-1
Pradeep Varathan Pugalenthi, Bing He, Linhui Xie, Kwangsik Nho, Andrew J Saykin, Jingwen Yan
{"title":"Deciphering the tissue-specific functional effect of Alzheimer risk SNPs with deep genome annotation.","authors":"Pradeep Varathan Pugalenthi, Bing He, Linhui Xie, Kwangsik Nho, Andrew J Saykin, Jingwen Yan","doi":"10.1186/s13040-024-00400-1","DOIUrl":"10.1186/s13040-024-00400-1","url":null,"abstract":"<p><p>Alzheimer's disease (AD) is a highly heritable brain dementia, along with substantial failure of cognitive function. Large-scale genome-wide association studies (GWASs) have led to a set of SNPs significantly associated with AD and related traits. GWAS hits usually emerge as clusters where a lead SNP with the highest significance is surrounded by other less significant neighboring SNPs. Although functionality is not guaranteed even with the strongest associations in GWASs, lead SNPs have historically been the focus of the field, with the remaining associations inferred to be redundant. Recent deep genome annotation tools enable the prediction of function from a segment of a DNA sequence with significantly improved precision, which allows in-silico mutagenesis to interrogate the functional effect of SNP alleles. In this project, we explored the impact of top AD GWAS hits around APOE region on chromatin functions and whether it will be altered by the genetic context (i.e., alleles of neighboring SNPs). Our results showed that highly correlated SNPs in the same LD block could have distinct impacts on downstream functions. 
Although some GWAS lead SNPs showed dominant functional effects regardless of the neighborhood SNP alleles, several other SNPs did exhibit enhanced loss or gain of function under certain genetic contexts, suggesting potential additional information hidden in the LD blocks.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"17 1","pages":"50"},"PeriodicalIF":4.0,"publicationDate":"2024-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11558841/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142631056","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Investigating potential drug targets for IgA nephropathy and membranous nephropathy through multi-queue plasma protein analysis: a Mendelian randomization study based on SMR and co-localization analysis.","authors":"Xinyi Xu, Changhong Miao, Shirui Yang, Lu Xiao, Ying Gao, Fangying Wu, Jianbo Xu","doi":"10.1186/s13040-024-00405-w","DOIUrl":"10.1186/s13040-024-00405-w","url":null,"abstract":"<p><strong>Background: </strong>Membranous nephropathy (MN) and IgA nephropathy (IgAN) pose challenges in clinical treatment with existing therapies primarily focusing on symptom relief and often yielding unsatisfactory outcomes. The search for novel drug targets remains crucial to address the shortcomings in managing both kidney diseases.</p><p><strong>Methods: </strong>Utilizing GWAS data for MN (ncase = 2150, ncontrol = 5829) and IgAN (ncase = 15587, ncontrol = 462197), instrumental variables for plasma proteins were derived from recent GWAS. Sensitivity analysis involved bidirectional Mendelian randomization analysis, MR Steiger, Bayesian co-localization, and Phenotype scanning. The SMR analysis using eQTL data from the eQTLGen Consortium was conducted to assess the availability of selected protein targets. The PPI network was constructed to reveal potential associations with existing drug treatment targets.</p><p><strong>Results: </strong>The study, subjected to the stringent Bonferroni correction, revealed significant associations: four proteins with MN and three proteins with IgAN. In plasma protein cis-pQTL data from two cohorts, an increase in one standard deviation in PLA2R1 (OR = 2.01, 95%CI = 1.83-2.21), AIF1 (OR = 9.04, 95%CI = 4.69-17.41), MLN (OR = 3.79, 95%CI = 2.12-6.78), and NFKB1 (OR = 29.43, 95%CI = 7.73-112.0) was associated with an increased risk of MN. 
Additionally, in plasma protein cis-pQTL data, a standard deviation increase in FCGR3B (OR = 1.15, 95%CI = 1.09-1.22) and BTN3A1 (OR = 4.05, 95%CI = 2.65-6.19) correlated with elevated IgAN risk, while AIF1 (OR = 0.58, 95%CI = 0.46-0.73) exhibited IgAN protection. Bayesian co-localization indicated that PLA2R1 (coloc.abf-PPH4 = 0.695), NFKB1 (coloc.abf-PPH4 = 0.949), FCGR3B (coloc.abf-PPH4 = 0.909), and BTN3A1 (coloc.abf-PPH4 = 0.685) share the same variants associated with MN and IgAN. The SMR analysis indicated a causal link between NFKB1 and BTN3A1 plasma protein eQTL in both conditions, and BTN3A1 was validated externally.</p><p><strong>Conclusion: </strong>Genetically influenced plasma levels of PLA2R1 and NFKB1 impact MN risk, while FCGR3B and BTN3A1 levels are causally linked to IgAN risk, suggesting potential drug targets for further clinical exploration, notably BTN3A1 for IgAN.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"17 1","pages":"49"},"PeriodicalIF":4.0,"publicationDate":"2024-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11545554/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142631058","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep joint learning diagnosis of Alzheimer's disease based on multimodal feature fusion.","authors":"Jingru Wang, Shipeng Wen, Wenjie Liu, Xianglian Meng, Zhuqing Jiao","doi":"10.1186/s13040-024-00395-9","DOIUrl":"10.1186/s13040-024-00395-9","url":null,"abstract":"<p><p>Alzheimer's disease (AD) is an advanced and incurable neurodegenerative disease. Genetic variations are intrinsic etiological factors contributing to the abnormal expression of brain function and structure in AD patients. A new multimodal feature fusion called \"magnetic resonance imaging (MRI)-p value\" was proposed to construct 3D fusion images by introducing genes as a priori knowledge. Moreover, a new deep joint learning diagnostic model was constructed to fully learn image features. One branch trained a residual network (ResNet) to learn the features of local pathological regions. The other branch learned the position information of brain regions with different changes in the different categories of subjects' brains by introducing attention convolution, and then obtained the discriminative probability information from locations via convolution and global average pooling. The feature and position information of the two branches were linearly interacted to acquire the diagnostic basis for classifying the different categories of subjects. The diagnoses of AD and healthy control (HC), AD and mild cognitive impairment (MCI), and HC and MCI were performed with data from the Alzheimer's Disease Neuroimaging Initiative (ADNI). The results showed that the proposed method achieved optimal results in AD-related diagnosis. The classification accuracy (ACC) and area under the curve (AUC) of the three experimental groups were 93.44% and 96.67%, 89.06% and 92%, and 84% and 81.84%, respectively. 
Moreover, a total of six novel genes were found to be significantly associated with AD, namely NTM, MAML2, NAALADL2, FHIT, TMEM132D and PCSK5, which provided new targets for the potential treatment of neurodegenerative diseases.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"17 1","pages":"48"},"PeriodicalIF":4.0,"publicationDate":"2024-11-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11536794/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142584754","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining | Pub Date: 2024-11-01 | DOI: 10.1186/s13040-024-00403-y
Amani Almohaimeed, Ishag Adam
{"title":"Modeling heterogeneity of Sudanese hospital stay in neonatal and maternal unit: non-parametric random effect models with Gamma distribution.","authors":"Amani Almohaimeed, Ishag Adam","doi":"10.1186/s13040-024-00403-y","DOIUrl":"10.1186/s13040-024-00403-y","url":null,"abstract":"<p><strong>Objective: </strong>Studies looking into patient and institutional variables linked to extended hospital stays have arisen as a result of the increased focus on severe maternal morbidity and mortality. Understanding the length of hospitalization of patients after delivery is important to gain insights into when hospitals will reach capacity and to predict corresponding staffing or equipment requirements. In Sudan, the distribution of the length of stay during delivery hospitalizations is heavily skewed, with an average length of stay of 2 to 3 days. This study aimed to investigate the use of a non-parametric random effect model with a Gamma-distributed response for analyzing skewed hospital length of stay data from a neonatal and maternal unit in Sudan.</p><p><strong>Methods: </strong>We applied Gamma regression models with unknown random effects, estimated using the non-parametric maximum likelihood (NPML) technique [5]. The NPML reduces the heterogeneity in the distribution of the response and produces a robust estimate since it does not require any assumptions about the distribution. The same applies to the log-Gamma link, which does not require any transformation of the data distribution and can handle outliers in the data points. In this study, the models are fitted with and without covariates and compared using AIC and BIC values.</p><p><strong>Results: </strong>The findings imply that in the context of health care database investigations, Gamma regression models with non-parametric random effect consistently reduce heterogeneity and improve model accuracy. 
The generalized linear model with covariates and random effect (k = 4) had the best fit, indicating that Sudanese hospital length of stay data could be classified into four groups with varying average stays influenced by maternal, neonatal, and obstetrics data.</p><p><strong>Conclusion: </strong>Identifying factors contributing to longer stays allows hospitals to implement strategies for improvement. Non-parametric random effect model with Gamma distributed response effectively accounts for unobserved heterogeneity and individual-level variability, leading to more accurate inferences and improved patient care. Including random effects can significantly affect variable significance in statistical models, emphasizing the need to consider unobserved heterogeneity when analyzing data containing potential individual-level variability. The findings emphasise the importance of making robust methodological choices in healthcare research in order to inform accurate policy decisions.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"17 1","pages":"47"},"PeriodicalIF":4.0,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11529257/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142565124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining | Pub Date: 2024-10-30 | DOI: 10.1186/s13040-024-00397-7
Vanesa Gómez-Martínez, David Chushig-Muzo, Marit B Veierød, Conceição Granja, Cristina Soguero-Ruiz
{"title":"Ensemble feature selection and tabular data augmentation with generative adversarial networks to enhance cutaneous melanoma identification and interpretability.","authors":"Vanesa Gómez-Martínez, David Chushig-Muzo, Marit B Veierød, Conceição Granja, Cristina Soguero-Ruiz","doi":"10.1186/s13040-024-00397-7","DOIUrl":"10.1186/s13040-024-00397-7","url":null,"abstract":"<p><strong>Background: </strong>Cutaneous melanoma is the most aggressive form of skin cancer, responsible for most skin cancer-related deaths. Recent advances in artificial intelligence, jointly with the availability of public dermoscopy image datasets, have made it possible to assist dermatologists in melanoma identification. While image feature extraction holds potential for melanoma detection, it often leads to high-dimensional data. Furthermore, most image datasets present the class imbalance problem, where a few classes have numerous samples, whereas others are under-represented.</p><p><strong>Methods: </strong>In this paper, we propose to combine ensemble feature selection (FS) methods and data augmentation with the conditional tabular generative adversarial networks (CTGAN) to enhance melanoma identification in imbalanced datasets. We employed dermoscopy images from two public datasets, PH2 and Derm7pt, which contain melanoma and not-melanoma lesions. To capture intrinsic information from skin lesions, we conducted two feature extraction (FE) approaches, including handcrafted and embedding features. For the former, color, geometric and first-, second-, and higher-order texture features were extracted, whereas for the latter, embeddings were obtained using ResNet-based models. To alleviate the high dimensionality in the FE, ensemble FS with filter methods was used and evaluated. For data augmentation, we conducted a progressive analysis of the imbalance ratio (IR), related to the number of synthetic samples created, and evaluated the impact on the predictive results. 
To gain interpretability on predictive models, we used SHAP, bootstrap resampling statistical tests and UMAP visualizations.</p><p><strong>Results: </strong>The combination of ensemble FS, CTGAN, and linear models achieved the best predictive results, achieving AUCROC values of 87% (with support vector machine and IR=0.9) and 76% (with LASSO and IR=1.0) for the PH2 and Derm7pt, respectively. We also identified that melanoma lesions were mainly characterized by features related to color, while not-melanoma lesions were characterized by texture features.</p><p><strong>Conclusions: </strong>Our results demonstrate the effectiveness of ensemble FS and synthetic data in the development of models that accurately identify melanoma. This research advances skin lesion analysis, contributing to both melanoma detection and the interpretation of main features for its identification.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"17 1","pages":"46"},"PeriodicalIF":4.0,"publicationDate":"2024-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11526724/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142548479","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining | Pub Date: 2024-10-29 | DOI: 10.1186/s13040-024-00401-0
Laila Musib, Roberta Coletti, Marta B Lopes, Helena Mouriño, Eunice Carrasquinha
{"title":"Priority-Elastic net for binary disease outcome prediction based on multi-omics data.","authors":"Laila Musib, Roberta Coletti, Marta B Lopes, Helena Mouriño, Eunice Carrasquinha","doi":"10.1186/s13040-024-00401-0","DOIUrl":"10.1186/s13040-024-00401-0","url":null,"abstract":"<p><strong>Background: </strong>High-dimensional omics data integration has emerged as a prominent avenue within the healthcare industry, presenting substantial potential to improve predictive models. However, the data integration process faces several challenges, including data heterogeneity, priority sequence in which data blocks are prioritized for rendering predictive information contained in multiple blocks, assessing the flow of information from one omics level to the other and multicollinearity.</p><p><strong>Methods: </strong>We propose the Priority-Elastic net algorithm, a hierarchical regression method extending Priority-Lasso for the binary logistic regression model by incorporating a priority order for blocks of variables while fitting Elastic-net models sequentially for each block. The fitted values from each step are then used as an offset in the subsequent step. Additionally, we considered the adaptive elastic-net penalty within our priority framework to compare the results.</p><p><strong>Results: </strong>The Priority-Elastic net and Priority-Adaptive Elastic net algorithms were evaluated on a brain tumor dataset available from The Cancer Genome Atlas (TCGA), accounting for transcriptomics, proteomics, and clinical information measured over two glioma types: Lower-grade glioma (LGG) and glioblastoma (GBM).</p><p><strong>Conclusion: </strong>Our findings suggest that the Priority-Elastic net is a highly advantageous choice for a wide range of applications. 
It offers moderate computational complexity, flexibility in integrating prior knowledge while introducing a hierarchical modeling perspective, and, importantly, improved stability and accuracy in predictions, making it superior to the other methods discussed. This evolution marks a significant step forward in predictive modeling, offering a sophisticated tool for navigating the complexities of multi-omics datasets in pursuit of precision medicine's ultimate goal: personalized treatment optimization based on a comprehensive array of patient-specific data. This framework can be generalized to time-to-event, Cox proportional hazards regression and multicategorical outcomes. A practical implementation of this method is available upon request in R script, complete with an example to facilitate its application.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"17 1","pages":"45"},"PeriodicalIF":4.0,"publicationDate":"2024-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11523883/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142548496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining | Pub Date: 2024-10-24 | DOI: 10.1186/s13040-024-00398-6
Dixin Shen, Juan Pablo Lewinger, Eric Kawaguchi
{"title":"A regularized Cox hierarchical model for incorporating annotation information in predictive omic studies.","authors":"Dixin Shen, Juan Pablo Lewinger, Eric Kawaguchi","doi":"10.1186/s13040-024-00398-6","DOIUrl":"10.1186/s13040-024-00398-6","url":null,"abstract":"<p><strong>Background: </strong>High-dimensional omics data are often accompanied by \"meta-features\", such as biological pathways, functional annotations, and summary statistics from similar studies, that can be informative for predicting an outcome of interest. We introduce a regularized hierarchical framework for integrating meta-features, with the goal of improving prediction and feature selection performance with time-to-event outcomes.</p><p><strong>Methods: </strong>A hierarchical framework is deployed to incorporate meta-features. Regularization is applied to the omic features as well as the meta-features so that high-dimensional data can be handled at both levels. The proposed hierarchical Cox model can be efficiently fitted by a combination of iterative reweighted least squares and cyclic coordinate descent.</p><p><strong>Results: </strong>In a simulation study we show that when the external meta-features are informative, the regularized hierarchical model can substantially improve prediction performance over standard regularized Cox regression. We illustrate the proposed model with applications to breast cancer and melanoma survival based on gene expression profiles, which show the improvement in prediction performance by applying meta-features, as well as the discovery of important omic feature sets with sparse regularization at meta-feature level.</p><p><strong>Conclusions: </strong>The proposed hierarchical regularized regression model enables integration of external meta-feature information directly into the modeling process for time-to-event outcomes and improves prediction performance when the external meta-feature data is informative. 
Importantly, when the external meta-features are uninformative, the prediction performance based on the regularized hierarchical model is on par with standard regularized Cox regression, indicating robustness of the framework. In addition to developing predictive signatures, the model can also be deployed in discovery applications where the main goal is to identify important features associated with the outcome rather than developing a predictive model.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"17 1","pages":"44"},"PeriodicalIF":4.0,"publicationDate":"2024-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11515443/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142511162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining | Pub Date: 2024-10-23 | DOI: 10.1186/s13040-024-00402-z
Andrew Marra
{"title":"G4 & the balanced metric family - a novel approach to solving binary classification problems in medical device validation & verification studies.","authors":"Andrew Marra","doi":"10.1186/s13040-024-00402-z","DOIUrl":"10.1186/s13040-024-00402-z","url":null,"abstract":"<p><strong>Background: </strong>In medical device validation and verification studies, the area under the receiver operating characteristic curve (AUROC) is often used as a primary endpoint despite multiple reports showing its limitations. Hence, researchers are encouraged to consider alternative metrics as primary endpoints. A new metric called G4 is presented, which is the geometric mean of sensitivity, specificity, the positive predictive value, and the negative predictive value. G4 is part of a balanced metric family which includes the Unified Performance Measure (also known as P4) and the Matthews' Correlation Coefficient (MCC). The purpose of this manuscript is to unveil the benefits of using G4 together with the balanced metric family when analyzing the overall performance of binary classifiers.</p><p><strong>Results: </strong>Simulated datasets encompassing different prevalence rates of the minority class were analyzed under a multi-reader-multi-case study design. In addition, data from an independently published study that tested the performance of a unique ultrasound artificial intelligence algorithm in the context of breast cancer detection was also considered. Within each dataset, AUROC was reported alongside the balanced metric family for comparison. When the dataset prevalence and bias of the minority class approached 50%, all three balanced metrics provided equivalent interpretations of an AI's performance. As the prevalence rate increased / decreased and the data became more imbalanced, AUROC tended to overvalue / undervalue the true classifier performance, while the balanced metric family was resistant to such imbalance. 
Under certain circumstances where data imbalance was strong (minority-class prevalence < 10%), MCC was preferred for standalone assessments while P4 provided a stronger effect size when evaluating between-groups analyses. G4 acted as a middle ground for maximizing both standalone assessments and between-groups analyses.</p><p><strong>Conclusions: </strong>Use of AUROC as the primary endpoint in binary classification problems provides misleading results as the dataset becomes more imbalanced. This is explicitly noticed when incorporating AUROC in medical device validation and verification studies. G4, P4, and MCC do not share this limitation and paint a more complete picture of a medical device's performance in a clinical setting. Therefore, researchers are encouraged to explore the balanced metric family when evaluating binary classification problems.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"17 1","pages":"43"},"PeriodicalIF":4.0,"publicationDate":"2024-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11515465/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142511164","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
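The balanced metric family described in the record above can be illustrated with a short Python sketch. The G4 formula (geometric mean of sensitivity, specificity, positive predictive value, and negative predictive value) follows the abstract; the P4 formula (harmonic mean of the same four quantities) and the MCC formula are the standard definitions and are not spelled out in the abstract. The function name `balanced_metrics` and the choice to return 0.0 for degenerate inputs are assumptions made for illustration only.

```python
import math


def balanced_metrics(tp, fp, tn, fn):
    """Return G4, P4 and MCC for one binary confusion matrix (illustrative sketch)."""

    def ratio(num, den):
        # Assumption: an undefined rate (empty denominator) is treated as 0.0.
        return num / den if den else 0.0

    sens = ratio(tp, tp + fn)  # sensitivity (recall)
    spec = ratio(tn, tn + fp)  # specificity
    ppv = ratio(tp, tp + fp)   # positive predictive value (precision)
    npv = ratio(tn, tn + fn)   # negative predictive value
    parts = [sens, spec, ppv, npv]

    # G4: geometric mean of the four quantities, as stated in the abstract.
    g4 = math.prod(parts) ** 0.25

    # P4: harmonic mean of the same four quantities (standard definition).
    p4 = 4.0 / sum(1.0 / x for x in parts) if all(parts) else 0.0

    # MCC: standard Matthews correlation coefficient.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {"G4": g4, "P4": p4, "MCC": mcc}


if __name__ == "__main__":
    # Balanced toy example: 50 positives and 50 negatives, 80% correct in each class.
    print(balanced_metrics(tp=40, fp=10, tn=40, fn=10))
```

On a balanced toy matrix such as the one in the `__main__` block, G4 and P4 coincide because all four component rates are equal, while MCC sits on a different scale (it ranges from -1 to 1 rather than 0 to 1).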
{"title":"From COVID-19 to monkeypox: a novel predictive model for emerging infectious diseases.","authors":"Deren Xu, Weng Howe Chan, Habibollah Haron, Hui Wen Nies, Kohbalan Moorthy","doi":"10.1186/s13040-024-00396-8","DOIUrl":"https://doi.org/10.1186/s13040-024-00396-8","url":null,"abstract":"<p><p>The outbreak of emerging infectious diseases poses significant challenges to global public health. Accurate early forecasting is crucial for effective resource allocation and emergency response planning. This study aims to develop a comprehensive predictive model for emerging infectious diseases, integrating the blending framework, transfer learning, incremental learning, and the biological feature Rt to increase prediction accuracy and practicality. By transferring features from a COVID-19 dataset to a monkeypox dataset and introducing dynamically updated incremental learning techniques, the model's predictive capability in data-scarce scenarios was significantly improved. The research findings demonstrate that the blending framework performs exceptionally well in short-term (7-day) predictions. Furthermore, the combination of transfer learning and incremental learning techniques significantly enhanced the adaptability and precision, with a 91.41% improvement in the RMSE and an 89.13% improvement in the MAE. In particular, the inclusion of the Rt feature enabled the model to more accurately reflect the dynamics of disease spread, further improving the RMSE by 1.91% and the MAE by 2.17%. This study underscores the significant application potential of multimodel fusion and real-time data updates in infectious disease prediction, offering new theoretical perspectives and technical support. This research not only enriches the theoretical foundation of infectious disease prediction models but also provides reliable technical support for public health emergency responses. 
Future research should continue to explore integrating data from multiple sources and enhancing model generalization capabilities to further enhance the practicality and reliability of predictive tools.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"17 1","pages":"42"},"PeriodicalIF":4.0,"publicationDate":"2024-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11494870/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142511163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
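The RMSE and MAE improvements quoted in the record above are percentages. A minimal Python sketch of how such figures are commonly computed is shown here; it assumes "improvement" means the relative reduction in error against a baseline model, which the abstract does not state explicitly. The function names and the example numbers are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_improvement(baseline_error, new_error):
    """Percentage reduction in error relative to the baseline (assumed definition)."""
    return 100.0 * (baseline_error - new_error) / baseline_error


if __name__ == "__main__":
    # Hypothetical case counts and forecasts, not data from the study.
    observed = [10.0, 12.0, 15.0, 11.0]
    baseline_forecast = [14.0, 8.0, 20.0, 6.0]
    improved_forecast = [10.5, 11.5, 15.5, 11.0]
    print(relative_improvement(rmse(observed, baseline_forecast),
                               rmse(observed, improved_forecast)))
    print(relative_improvement(mae(observed, baseline_forecast),
                               mae(observed, improved_forecast)))
```

Under this definition, a 91.41% improvement in RMSE would mean the new model's RMSE is about 8.6% of the baseline's.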
Biodata Mining | Pub Date: 2024-10-11 | DOI: 10.1186/s13040-024-00393-x
Philip J Freda, Attri Ghosh, Priyanka Bhandary, Nicholas Matsumoto, Apurva S Chitre, Jiayan Zhou, Molly A Hall, Abraham A Palmer, Tayo Obafemi-Ajayi, Jason H Moore
{"title":"PAGER: A novel genotype encoding strategy for modeling deviations from additivity in complex trait association studies.","authors":"Philip J Freda, Attri Ghosh, Priyanka Bhandary, Nicholas Matsumoto, Apurva S Chitre, Jiayan Zhou, Molly A Hall, Abraham A Palmer, Tayo Obafemi-Ajayi, Jason H Moore","doi":"10.1186/s13040-024-00393-x","DOIUrl":"10.1186/s13040-024-00393-x","url":null,"abstract":"<p><strong>Background: </strong>The additive model of inheritance assumes that heterozygotes (Aa) are exactly intermediate in respect to homozygotes (AA and aa). While this model is commonly used in single-locus genetic association studies, significant deviations from additivity are well-documented and contribute to phenotypic variance across many traits and systems. This assumption can introduce type I and type II errors by overestimating or underestimating the effects of variants that deviate from additivity. Alternative genotype encoding strategies have been explored to account for different inheritance patterns, but they often incur significant computational or methodological costs. To address these challenges, we introduce PAGER (Phenotype Adjusted Genotype Encoding and Ranking), an efficient pre-processing method that encodes each genetic variant based on normalized mean phenotypic differences between diallelic genotype classes (AA, Aa, and aa). This approach more accurately reflects each variant's true inheritance model, improving model precision while minimizing the costs associated with alternative encoding strategies.</p><p><strong>Results: </strong>Through extensive benchmarking on SNPs simulated with both binary and continuous phenotypes, we demonstrate that PAGER accurately represents various inheritance patterns (including additive, dominant, recessive, and heterosis), achieves levels of statistical power that meet or exceed other encoding strategies, and attains computation speeds up to 55 times faster than a similar method, EDGE. 
We also apply PAGER to publicly available real-world data and identify a novel, relevant putative QTL associated with body mass index in rats (Rattus norvegicus) that is not detected with the additive model.</p><p><strong>Conclusions: </strong>Overall, we show that PAGER is an efficient genotype encoding approach that can uncover sources of missing heritability and reveal novel insights in the study of complex traits while incurring minimal costs.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"17 1","pages":"41"},"PeriodicalIF":4.0,"publicationDate":"2024-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11468469/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142407082","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
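The encoding idea in the record above (replacing fixed 0/1/2 genotype codes with values derived from normalized mean phenotypic differences between genotype classes) can be sketched in a few lines of Python. This is not the PAGER implementation: the abstract does not give the exact normalization, so the sketch assumes a simple min-max scaling of the per-class phenotype means, and the function name `phenotype_adjusted_encoding` is hypothetical.

```python
from statistics import mean


def phenotype_adjusted_encoding(genotypes, phenotypes):
    """Map each genotype class to a value in [0, 1] based on its mean phenotype.

    Assumption: class means are min-max scaled; the published method may
    normalize differently.
    """
    by_class = {}
    for genotype, phenotype in zip(genotypes, phenotypes):
        by_class.setdefault(genotype, []).append(phenotype)

    class_means = {g: mean(values) for g, values in by_class.items()}
    low, high = min(class_means.values()), max(class_means.values())
    if high == low:
        # No phenotypic difference between classes: encode everything as 0.
        return {g: 0.0 for g in class_means}
    return {g: (m - low) / (high - low) for g, m in class_means.items()}


if __name__ == "__main__":
    genotypes = ["AA", "AA", "Aa", "Aa", "aa", "aa"]
    # Hypothetical dominant pattern: heterozygotes match one homozygote class.
    phenotypes = [1.0, 1.0, 3.0, 3.0, 3.0, 3.0]
    print(phenotype_adjusted_encoding(genotypes, phenotypes))
```

With an exactly additive phenotype the heterozygote lands at 0.5, which matches the intermediate position the additive model assumes; with the dominant toy data in the `__main__` block it lands at 1.0 instead, which is the kind of deviation the abstract says a fixed additive code cannot represent.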