Biodata Mining: latest articles (page 5)

Inter-organ correlation based multi-task deep learning model for dynamically predicting functional deterioration in multiple organ systems of ICU patients.

IF 4, Zone 3 (Biology)

Biodata Mining Pub Date : 2025-04-16 DOI: 10.1186/s13040-025-00445-w

Zhixuan Zeng, Yang Liu, Shuo Yao, Minjie Lin, Xu Cai, Wenbin Nan, Yiyang Xie, Xun Gong

Abstract:

Background: Functional deterioration (FD) of various organ systems is the major cause of death in ICU patients, but few studies have proposed an effective multi-task (MT) model to predict FD of multiple organs simultaneously. This study proposes an MT deep learning model, named the inter-organ correlation based multi-task model (IOC-MT), to dynamically predict FD in six organ systems.

Methods: Three public ICU databases were used for model training and validation. The IOC-MT was designed based on the routine MT deep learning framework, but it used a Graph Attention Networks (GAT) module to capture inter-organ correlation and an adaptive adjustment mechanism (AAM) to adjust prediction. We compared the IOC-MT to five single-task (ST) baseline models, including three deep models (LSTM-ST, GRU-ST, Transformer-ST) and two machine learning models (GRU-ST, RF-ST), and performed an ablation study to assess the contribution of important components in IOC-MT. Model discrimination was evaluated by AUROC and AUPRC, and model calibration was assessed by the calibration curve. The attention weight and adjustment coefficient were analyzed at both the overall and individual level to show the AAM of IOC-MT.

Results: The IOC-MT had comparable discrimination and calibration to LSTM-ST, GRU-ST and Transformer-ST for most organs under different gap windows in the internal and external validation, and clearly outperformed GRU-ST and RF-ST. The ablation study showed that the GAT, AAM and missing indicator could improve the overall performance of the model. Furthermore, the inter-organ correlation and prediction adjustment of IOC-MT were intuitive and comprehensible, and also had biological plausibility.

Conclusions: The IOC-MT is a promising MT model for dynamically predicting FD in six organ systems. It can capture inter-organ correlation and adjust the prediction for one organ based on aggregated information from the other organs.

Biodata Mining 18(1): 31; Journal Article; PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12001458/pdf/

Citations: 0
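The entry above reports discrimination as AUROC. As a rough illustration only (this is not the authors' code), AUROC can be read as the fraction of positive/negative pairs in which the positive case receives the higher score. The function name `auroc` and the toy labels and scores are hypothetical.

```python
def auroc(labels, scores):
    """AUROC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = deterioration observed, 0 = not observed.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(toy_labels, toy_scores))
```

The pairwise loop is quadratic in the number of cases, which is fine for a sketch; a rank-based computation would be the usual choice on large ICU cohorts.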

Enhancing clinical outcome predictions through effective sample size evaluation in graph-based digital twin modeling.

IF 4, Zone 3 (Biology)

Biodata Mining Pub Date : 2025-04-15 DOI: 10.1186/s13040-025-00446-9

Xi Li, Jui-Hsuan Chang, Mythreye Venkatesan, Zhiping Paul Wang, Jason H Moore

Abstract:

Digital twins in healthcare offer an innovative approach to precision diagnosis, prognosis, and treatment. SynTwin, a novel computational methodology to generate digital twins using synthetic data and network science, has previously shown promise for improving prediction of breast cancer mortality. In this study, we validate SynTwin using population-level data for different cancer types from the Surveillance, Epidemiology, and End Results (SEER) program from the National Cancer Institute (USA). We assess its predictive accuracy across cancer types of varying sample sizes (n = 1,000 to 30,000 records), mortality rates (35% to 60%), and study designs, revealing insights into the strengths and limitations of digital twins derived from synthetic data in mortality prediction. We also evaluate the effect of sample size (n = 1,000 to 70,000 records) on predictive accuracy for selected cancers (non-Hodgkin lymphoma, bladder, and colorectal cancers).

Our results indicate that, for larger datasets (n > 10,000), including digital twins in the nearest network neighbor prediction model significantly improves performance compared to using real patients alone. Specifically, AUROCs ranged from 0.828 to 0.884 for cancers such as cervix uteri and ovarian cancer with digital twins, compared to 0.720 to 0.858 when using real patient data. Similarly, among the three selected cancers, AUROCs using digital twins exceeded AUROCs using real patients alone by at least 0.06, with narrowing variance in performance as the sample size increased. These results highlight the benefit of network-based digital twins, while emphasizing the importance of considering effective sample size when developing predictive models like SynTwin.

Biodata Mining 18(1): 30; Journal Article; PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11998210/pdf/

Citations: 0

Towards precision oncology: a multi-level cancer classification system integrating liquid biopsy and machine learning.

IF 4, Zone 3 (Biology)

Biodata Mining Pub Date : 2025-04-11 DOI: 10.1186/s13040-025-00439-8

Amr Eledkawy, Taher Hamza, Sara El-Metwally

Abstract:

Background: Millions of people die from cancer every year. Early cancer detection is crucial for ensuring higher survival rates, as it provides an opportunity for timely medical interventions. This paper proposes a multi-level cancer classification system that uses plasma cfDNA/ctDNA mutations and protein biomarkers to identify seven distinct cancer types: colorectal, breast, upper gastrointestinal, lung, pancreas, ovarian, and liver.

Results: The proposed system employs a multi-stage binary classification framework where each stage is customized for a specific cancer type. A majority vote feature selection process is employed by combining six feature selectors: Information Value, Chi-Square, Random Forest Feature Importance, Extra Tree Feature Importance, Recursive Feature Elimination, and L1 Regularization. Following the feature selection process, classifiers (including eXtreme Gradient Boosting, Random Forest, Extra Tree, and Quadratic Discriminant Analysis) are customized for each cancer type individually or in an ensemble soft voting setup to optimize predictive accuracy. The proposed system outperformed previously published results, achieving an AUC of 98.2% and an accuracy of 96.21%. To ensure reproducibility of the results, the trained models and the dataset used in this study are made publicly available via the GitHub repository (https://github.com/SaraEl-Metwally/Towards-Precision-Oncology).

Conclusion: The identified biomarkers enhance the interpretability of the diagnosis, facilitating more informed decision-making. The system's performance underscores its effectiveness in tissue localization, contributing to improved patient outcomes through timely medical interventions.

Biodata Mining 18(1): 29; Journal Article; PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11987386/pdf/

Citations: 0
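The entry above describes majority-vote feature selection across six selectors. The sketch below shows one plausible reading of that idea, not the authors' implementation: each selector nominates a list of features, and a feature is kept when more than half of the selectors nominate it. The threshold rule, the function name `majority_vote_features`, and the placeholder feature names are all assumptions.

```python
from collections import Counter


def majority_vote_features(selections, min_votes=None):
    """Keep features nominated by at least `min_votes` selectors.

    `selections` maps a selector name to the features it picked.
    By default a strict majority of selectors is required (an assumption;
    the abstract does not state the exact voting threshold).
    """
    if min_votes is None:
        min_votes = len(selections) // 2 + 1
    # set() ensures a selector cannot vote twice for the same feature.
    votes = Counter(f for chosen in selections.values() for f in set(chosen))
    return sorted(f for f, v in votes.items() if v >= min_votes)


# Hypothetical nominations from six selectors (placeholder feature names).
nominations = {
    "information_value": ["f1", "f2", "f3"],
    "chi_square": ["f1", "f2"],
    "random_forest_importance": ["f1", "f2", "f4"],
    "extra_tree_importance": ["f1", "f4"],
    "recursive_feature_elimination": ["f1", "f2", "f5"],
    "l1_regularization": ["f1", "f5"],
}
print(majority_vote_features(nominations))
```

With six selectors the default threshold is four votes, so only features with broad agreement survive; lowering `min_votes` trades stability for a larger feature set.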

Few-shot biomedical NER empowered by LLMs-assisted data augmentation and multi-scale feature extraction.

IF 4, Zone 3 (Biology)

Biodata Mining Pub Date : 2025-04-04 DOI: 10.1186/s13040-025-00443-y

Di Zhao, Wenxuan Mu, Xiangxing Jia, Shuang Liu, Yonghe Chu, Jiana Meng, Hongfei Lin

Abstract:

Named Entity Recognition (NER) is a fundamental task in processing biomedical text. Due to the limited availability of labeled data, researchers have investigated few-shot learning methods to tackle this challenge. However, replicating the performance of fully supervised methods remains difficult in few-shot scenarios. This paper addresses two main issues. In terms of data augmentation, existing methods primarily focus on replacing content in the original text, which can potentially distort the semantics. Furthermore, current approaches often neglect sentence features at multiple scales.

To overcome these challenges, we utilize ChatGPT to generate enriched data with distinct semantics for the same entities, thereby reducing noisy data. Simultaneously, we employ dynamic convolution to capture multi-scale semantic information in sentences and enhance feature representation based on PubMedBERT. We evaluated the method on four biomedical NER datasets (BC5CDR-Disease, NCBI, BioNLP11EPI, BioNLP13GE), and the results exceeded the current state-of-the-art models in most few-shot scenarios, including mainstream large language models like ChatGPT. The results confirm the effectiveness of the proposed method in data augmentation and model generalization.

Biodata Mining 18(1): 28; Journal Article; PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11969866/pdf/

Citations: 0

Multivariate longitudinal clustering reveals neuropsychological factors as dementia predictors in an Alzheimer's disease progression study.

IF 4, Zone 3 (Biology)

Biodata Mining Pub Date : 2025-03-28 DOI: 10.1186/s13040-025-00441-0

Patrizia Ribino, Claudia Di Napoli, Giovanni Paragliola, Davide Chicco, Francesca Gasparini

Abstract:

Dementia due to Alzheimer's disease (AD) is a multifaceted neurodegenerative disorder characterized by various cognitive and behavioral decline factors. In this work, we propose an extension of the traditional k-means clustering for multivariate time series data to cluster joint trajectories of different features describing progression over time. The algorithm we propose here enables the joint analysis of various longitudinal features to explore co-occurring trajectory factors among markers indicative of cognitive decline in individuals participating in an AD progression study. By examining how multiple variables co-vary and evolve together, we identify distinct subgroups within the cohort based on their longitudinal trajectories. Our clustering method enhances the understanding of individual development across multiple dimensions and provides deeper medical insights into the trajectories of cognitive decline.

In addition, the proposed algorithm is also able to make a selection of the most significant features in separating clusters by considering trajectories over time. This process, together with a preliminary pre-processing on the OASIS-3 dataset, reveals an important role of some neuropsychological factors. In particular, the proposed method has identified a significant profile compatible with a syndrome known as Mild Behavioral Impairment (MBI), displaying behavioral manifestations of individuals that may precede the cognitive symptoms typically observed in AD patients. The findings underscore the importance of considering multiple longitudinal features in clinical modeling, ultimately supporting more effective and individualized patient management strategies.

Biodata Mining 18(1): 26; Journal Article; PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11951806/pdf/

Citations: 0

Network-based multi-omics integrative analysis methods in drug discovery: a systematic review.

IF 4, Zone 3 (Biology)

Biodata Mining Pub Date : 2025-03-28 DOI: 10.1186/s13040-025-00442-z

Wei Jiang, Weicai Ye, Xiaoming Tan, Yun-Juan Bao

Abstract:

The integration of multi-omics data from diverse high-throughput technologies has revolutionized drug discovery. While various network-based methods have been developed to integrate multi-omics data, systematic evaluation and comparison of these methods remain challenging. This review aims to analyze network-based approaches for multi-omics integration and evaluate their applications in drug discovery. We conducted a comprehensive review of literature (2015-2024) on network-based multi-omics integration methods in drug discovery, and categorized methods into four primary types: network propagation/diffusion, similarity-based approaches, graph neural networks, and network inference models. We also discussed the applications of the methods in three scenarios of drug discovery (drug target identification, drug response prediction, and drug repurposing), and finally evaluated the performance of the methods by highlighting their advantages and limitations in specific applications.

While network-based multi-omics integration has shown promise in drug discovery, challenges remain in computational scalability, data integration, and biological interpretation. Future developments should focus on incorporating temporal and spatial dynamics, improving model interpretability, and establishing standardized evaluation frameworks.

Biodata Mining 18(1): 27; Journal Article; PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11954193/pdf/

Citations: 0

Advancing preeclampsia prediction: a tailored machine learning pipeline integrating resampling and ensemble models for handling imbalanced medical data.

IF 4, Zone 3 (Biology)

Biodata Mining Pub Date : 2025-03-24 DOI: 10.1186/s13040-025-00440-1

Yinyao Ma, Hanlin Lv, Yanhua Ma, Xiao Wang, Longting Lv, Xuxia Liang, Lei Wang

Abstract:

Background: Constructing a predictive model is challenging in imbalanced medical datasets (such as preeclampsia), particularly when employing ensemble machine learning algorithms.

Objective: This study aims to develop a robust pipeline that enhances the predictive performance of ensemble machine learning models for the early prediction of preeclampsia in an imbalanced dataset.

Methods: Our research establishes a comprehensive pipeline optimized for early preeclampsia prediction in imbalanced medical datasets. We gathered electronic health records from pregnant women at the People's Hospital of Guangxi from 2015 to 2020, with additional external validation using three public datasets. This extensive data collection facilitated the systematic assessment of various resampling techniques, varied minority-to-majority ratios, and ensemble machine learning algorithms through a structured evaluation process. We analyzed 4,608 combinations of model settings against performance metrics such as G-mean, MCC, AP, and AUC to determine the most effective configurations. Advanced statistical analyses including OLS regression, ANOVA, and Kruskal-Wallis tests were utilized to fine-tune these settings, enhancing model performance and robustness for clinical application.

Results: Our analysis confirmed the significant impact of systematic sequential optimization of variables on the predictive performance of our models. The most effective configuration utilized the Inverse Weighted Gaussian Mixture Model for resampling, combined with the Gradient Boosting Decision Trees algorithm and an optimized minority-to-majority ratio of 0.09, achieving a Geometric Mean of 0.6694 (95% confidence interval: 0.5855-0.7557). This configuration significantly outperformed the baseline across all evaluated metrics, demonstrating substantial improvements in model performance.

Conclusions: This study establishes a robust pipeline that significantly enhances the predictive performance of models for preeclampsia within imbalanced datasets. Our findings underscore the importance of a strategic approach to variable optimization in medical diagnostics, offering potential for broad application in various medical contexts where class imbalance is a concern.

Biodata Mining 18(1): 25; Journal Article; PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11934807/pdf/

Citations: 0
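The entry above scores configurations by G-mean and MCC, two metrics commonly used on imbalanced data. For readers unfamiliar with them, the following sketch computes both from a binary confusion matrix using their standard definitions (G-mean as the geometric mean of sensitivity and specificity, MCC as the Matthews correlation coefficient). The function name and the counts are hypothetical and are not taken from the study.

```python
import math


def gmean_and_mcc(tp, fp, tn, fn):
    """Return (G-mean, MCC) for a binary confusion matrix.

    G-mean = sqrt(sensitivity * specificity)
    MCC    = (tp*tn - fp*fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn))
    Undefined ratios (empty class or zero denominator) are reported as 0.0.
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    g_mean = math.sqrt(sensitivity * specificity)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return g_mean, mcc


# Hypothetical counts: 10 positive cases and 100 negative cases.
g, m = gmean_and_mcc(tp=8, fp=10, tn=90, fn=2)
print(round(g, 4), round(m, 4))
```

Unlike plain accuracy, both metrics fall sharply when the minority class is misclassified, which is why they suit rare outcomes such as preeclampsia.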

High-dimensional mediation analysis reveals the mediating role of physical activity patterns in genetic pathways leading to AD-like brain atrophy.

IF 4, Zone 3 (Biology)

Biodata Mining Pub Date : 2025-03-24 DOI: 10.1186/s13040-025-00432-1

Hanxiang Xu, Shizhuo Mu, Jingxuan Bao, Christos Davatzikos, Haochang Shou, Li Shen

Abstract:

Background: Alzheimer's disease (AD) is a complex disorder that affects multiple biological systems including cognition, behavior and physical health. Unfortunately, the pathogenic mechanisms behind AD are not yet clear and the treatment options are still limited. Despite the increasing number of studies examining the pairwise relationships between genetic factors, physical activity (PA), and AD, few have successfully integrated all three domains of data, which may help reveal mechanisms and impact of these genomic and phenomic factors on AD. We use high-dimensional mediation analysis as an integrative framework to study the relationships among genetic factors, PA and AD-like brain atrophy quantified by spatial patterns of brain atrophy.

Results: We integrate data from genetics, PA and neuroimaging measures collected from 13,425 UK Biobank samples to unveil the complex relationship among genetic risk factors, behavior and brain signatures in the contexts of aging and AD. Specifically, we used a composite imaging marker, Spatial Pattern of Abnormality for Recognition of Early AD (SPARE-AD), which characterizes AD-like brain atrophy, as an outcome variable to represent AD risk. Through GWAS, we identified single nucleotide polymorphisms (SNPs) that are significantly associated with SPARE-AD as exposure variables. We employed conventional summary statistics and functional principal component analysis to extract patterns of PA as mediators. After constructing these variables, we utilized a high-dimensional mediation analysis method, Bayesian Mediation Analysis (BAMA), to estimate potential mediating pathways between SNPs, multivariate PA signatures and SPARE-AD. BAMA incorporates a Bayesian continuous shrinkage prior to select the active mediators from a large pool of candidates. We identified a total of 22 mediation pathways, indicating how genetic variants can influence SPARE-AD by altering physical activity. By comparing the results with those obtained using univariate mediation analysis, we demonstrate the advantages of high-dimensional mediation analysis methods over univariate mediation analysis.

Conclusion: Through integrative analysis of multi-omics data, we identified several mediation pathways of physical activity between genetic factors and SPARE-AD. These findings contribute to a better understanding of the pathogenic mechanisms of AD. Moreover, our research demonstrates the potential of the high-dimensional mediation analysis method in revealing the mechanisms of disease.

Biodata Mining 18(1): 24; Journal Article; PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11931790/pdf/

Citations: 0

Automatic detection and extraction of key resources from tables in biomedical papers.

IF 4, Zone 3 (Biology)

Biodata Mining Pub Date : 2025-03-20 DOI: 10.1186/s13040-025-00438-9

Ibrahim Burak Ozyurt, Anita Bandrowski

Abstract:

Background: Tables are useful information artifacts that allow easy detection of missing data and have been deployed by several publishers to improve the amount of information present for key resources and reagents such as antibodies, cell lines, and other tools that constitute the inputs to a study. STAR*Methods key resource tables have increased the "findability" of these key resources, improving transparency of the paper by warning authors (before publication) about any problems, such as key resources that cannot be uniquely identified or those that are known to be problematic, but they have not been commonly available outside of the Cell Press journal family. We believe that processing preprints and adding these 'resource table candidates' automatically will improve the availability of structured and linked information about research resources in a broader swath of the scientific literature. However, if the authors have already added a key resource table, that table must be detected, and each entity must be correctly identified and faithfully restructured into a standard format.

Methods: We introduce four end-to-end table extraction pipelines to extract and faithfully reconstruct key resource tables from biomedical papers in PDF format. The pipelines employ machine learning approaches for key resource table page identification, "Table Transformer" models for table detection, and table structure recognition. We also introduce a character-level generative pre-trained transformer (GPT) language model for scientific tables pre-trained on over 11 million scientific tables. We fine-tuned our table-specific language model with synthetic training data generated with a novel approach to alleviate row over-segmentation, significantly improving key resource extraction performance.

Results: The extraction of key resource tables in PDF files by the popular GROBID tool resulted in a Grid Table Similarity (GriTS) score of 0.12. All of our pipelines have outperformed GROBID by a large margin. Our best pipeline with table-specific language model-based row merger achieved a GriTS score of 0.90.

Conclusions: Our pipelines allow the detection and extraction of key resources from tables with much higher accuracy, enabling the deployment of automated research resource extraction tools on BioRxiv to help authors correct unidentifiable key resources detected in their articles and improve the reproducibility of their findings. The code, table-specific language model, annotated training and evaluation data are publicly available.

Biodata Mining 18(1): 23; Journal Article; PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11924859/pdf/

Citations: 0

Leveraging mixed-effects regression trees for the analysis of high-dimensional longitudinal data to identify the low and high-risk subgroups: simulation study with application to genetic study.

IF 4, Zone 3 (Biology)

Biodata Mining Pub Date : 2025-03-19 DOI: 10.1186/s13040-025-00437-w

Mina Jahangiri, Anoshirvan Kazemnejad, Keith S Goldfeld, Maryam S Daneshpour, Mehdi Momen, Shayan Mostafaei, Davood Khalili, Mahdi Akbarzadeh

Abstract:

Background: The linear mixed-effects model (LME) is a conventional parametric method mainly used for analyzing longitudinal and clustered data in genetic studies. Previous studies have shown that this model can be sensitive to parametric assumptions and provides less predictive performance than non-parametric methods such as random effects-expectation maximization (RE-EM) and unbiased RE-EM regression tree algorithms. These longitudinal regression trees utilize classification and regression trees (CART) and conditional inference trees (Ctree) to estimate the fixed-effects components of the mixed-effects model. While CART is a well-known tree algorithm, it suffers from greediness. To mitigate this issue, we used the Evtree algorithm to estimate the fixed-effects part of the LME for handling longitudinal and clustered data in genome association studies.

Methods: In this study, we propose a new non-parametric longitudinal-based algorithm called "Ev-RE-EM" for modeling a continuous response variable using the Evtree algorithm to estimate the fixed-effects part of the LME. We compared its predictive performance with other tree algorithms, such as RE-EM and unbiased RE-EM, with and without considering the structure for autocorrelation between errors within subjects to analyze the longitudinal data in the genetic study. The autocorrelation structures include a first-order autoregressive process, a compound symmetric structure with a constant correlation, and a general correlation matrix. The real data were obtained from the longitudinal Tehran cardiometabolic genetic study (TCGS). The data modeling used body mass index (BMI) as the phenotype and included predictor variables such as age, sex, and 25,640 single nucleotide polymorphisms (SNPs).

Results: The results demonstrated that the predictive performance of Ev-RE-EM and unbiased RE-EM was similar. Additionally, the Ev-RE-EM algorithm generated smaller trees than the unbiased RE-EM algorithm, enhancing tree interpretability.

Conclusion: The results showed that the unbiased RE-EM and Ev-RE-EM algorithms outperformed the RE-EM algorithm. Since algorithm performance varies across datasets, researchers should test different algorithms on the dataset of interest and select the best-performing one. Accurately predicting and diagnosing an individual's genetic profile is crucial in medical studies. The model with the highest accuracy should be used to enhance understanding of the genetics of complex traits, improve disease prevention and diagnosis, and aid in treating complex human diseases.

Biodata Mining 18(1): 22; Journal Article; PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11924713/pdf/

Citations: 0
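The entry above names two within-subject error structures: a first-order autoregressive process and a compound symmetric structure with constant correlation. As a small illustration of what those structures look like (textbook definitions, not the authors' code), the sketch below builds both correlation matrices for `n` repeated measurements. The function names and the value of `rho` are hypothetical.

```python
def ar1_corr(n, rho):
    """AR(1) correlation: entry (i, j) is rho ** |i - j|, so it decays with lag."""
    return [[rho ** abs(i - j) for j in range(n)] for i in range(n)]


def compound_symmetric_corr(n, rho):
    """Compound symmetry: 1 on the diagonal, the same rho for every other pair."""
    return [[1.0 if i == j else rho for j in range(n)] for i in range(n)]


# Three visits per subject with an assumed correlation parameter of 0.5.
for row in ar1_corr(3, 0.5):
    print(row)
for row in compound_symmetric_corr(3, 0.5):
    print(row)
```

Under AR(1), measurements two visits apart are less correlated than adjacent ones, whereas compound symmetry treats every pair of visits alike; the general correlation matrix mentioned in the abstract leaves every off-diagonal entry free.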