Bioinformatics Advances: latest articles, page 6

Blackbird: structural variant detection using synthetic and low-coverage long-reads.

IF 2.4

Bioinformatics advances Pub Date : 2025-07-04 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf151

Dmitry Meleshko, Rui Yang, Salil Maharjan, David C Danko, Anton Korobeynikov, Iman Hajirasouliha

Motivation: Recent benchmarks show that most structural variations, especially within the 50-10,000 bp range, cannot be resolved with short-read sequencing, but long-read structural variant callers perform better on the same datasets. However, high-coverage long-read sequencing is costly and requires substantial input DNA. Reducing coverage lowers cost but significantly impacts the performance of existing structural variation (SV) callers. Synthetic long-read technologies offer long-range information at lower cost, but leveraging them for SVs under 50 kbp remains challenging.

Results: We propose a novel hybrid alignment- and local-assembly-based algorithm, Blackbird, that uses synthetic long reads and low-coverage long reads to improve structural variant detection. Instead of relying on whole-genome assembly, Blackbird uses a sliding window approach and synthetic long-read barcode information to assemble local segments, integrating long reads to improve structural variant detection accuracy. We evaluated Blackbird on real human genome datasets. On the HG002 Genome in a Bottle (GIAB) benchmark, Blackbird in hybrid mode demonstrated results comparable to state-of-the-art long-read tools, while using less long-read coverage. Blackbird requires only 5× coverage to achieve F1-scores (0.835 and 0.808 for deletions and insertions) similar to PBSV and Sniffles2 using 10× PacBio Hi-Fi long-read coverage.

Availability and implementation: Blackbird is available at https://github.com/1dayac/Blackbird.

Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12237510/pdf/
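The sliding window idea mentioned in the abstract can be illustrated with a minimal sketch. This is not Blackbird's implementation: the function name, window size, and step below are hypothetical values chosen only to show how overlapping windows tile a region before any local assembly would happen.

```python
from typing import Iterator, Tuple


def sliding_windows(length: int, size: int, step: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) windows that tile a region of `length` bp.

    Hypothetical helper for illustration; not part of Blackbird.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    for start in range(0, length, step):
        end = min(start + size, length)
        yield start, end
        if end == length:  # the last window already reaches the region end
            break


# Assumed values: a 10 kbp region, 4 kbp windows, 3 kbp step (1 kbp overlap).
windows = list(sliding_windows(10_000, 4_000, 3_000))
```

Choosing a step smaller than the window size makes neighbouring windows overlap, so a variant near a window boundary still falls well inside at least one window.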

Citations: 0

Volcano: a pipeline to characterize long terminal repeat-retrotransposons families in plants.

IF 2.8

Bioinformatics advances Pub Date : 2025-07-04 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf162

Hao He, Fei Shen, Yong Hou, Xiaozeng Yang

Motivation: Long Terminal Repeat Retrotransposons (LTR-RTs) comprise a significant portion of repetitive sequences in numerous plant species. LTR-RTs hold considerable functional significance, as they can impact gene family functionality and contribute to the formation of new genes. Investigating the quantities and activities of LTR-RTs is essential for understanding species' evolutionary dynamics and the foundational mechanisms driving genome evolution. While current software can predict and initially classify LTR-RTs, there is a strong need for more comprehensive and efficient software to fully characterize and quantify LTR-RTs during burst events and in subsequent detailed classification and quantification, especially given the surging demand for genome annotation.

Results: In this study, we have developed a pipeline called Volcano to accurately classify LTR-RTs and characterize burst families in plants. To distinguish different clades of LTR-RTs, we have implemented an improved depth-first search algorithm. Volcano can also quantify LTR-RT expression using RNA-seq data. By analyzing LTR-RTs in three genomes from the Asteraceae family, we observed that larger genomes tend to contain a greater number of LTR-RTs, and our software effectively categorizes them at the clade level.

Availability and implementation: The Volcano pipeline can be downloaded from https://github.com/Suosihe/volcano_LTR.

Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12349922/pdf/

Citations: 0

Optimal solution to the set cover problem with a vicinity constraint for estimating genotype tissue expression profiles.

IF 2.8

Bioinformatics advances Pub Date : 2025-07-04 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf163

Jiahong Dong, Stephen Brown, Kevin Truong

Motivation: Genes located in close genomic proximity tend to have more similar genotype tissue expression profiles. This suggests that expression profiles for the entire genome could be estimated using a smaller set of experimentally determined profiles from carefully selected reference genes, thereby reducing the need for extensive experimental measurements.

Results: We address this challenge by mapping it as a set cover problem, aiming to identify an optimal number of gene sets that can cover the entire genome. However, traditional set cover algorithms are either slow in runtime or yield non-optimal results for large datasets. To overcome this limitation, we developed a dynamic programming algorithm that leverages the consecutive ordering of genes within vicinity sets. Our algorithm solves this vicinity set cover problem with tractable runtime while minimizing the average distance between reference genes and non-reference genes within the vicinity, thereby maximizing estimation accuracy. This algorithm can be used to reduce the number of required experiments in organisms lacking genotype tissue expression data or in new human datasets with expanded tissue sets. Lastly, our algorithm also has broader applications for set cover optimization problems in other fields.

Availability and implementation: The source code along with all implementation details are available at: https://github.com/sensationTI/vicinity_set_cover.

Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12313015/pdf/
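To make the dynamic programming idea concrete, here is a minimal sketch of one possible one-dimensional formulation. It is an assumption for illustration, not the authors' algorithm: genes are reduced to sorted positions on a line, a reference gene covers every gene within a hypothetical `radius`, and the objective is the fewest references first and the smallest total distance second. Because each gene is served by its nearest reference, an optimal solution splits the sorted genes into consecutive segments, which is what the recurrence exploits.

```python
from typing import List, Optional, Tuple


def vicinity_cover(positions: List[float], radius: float) -> Tuple[int, float, List[int]]:
    """Return (number of references, total distance, reference indices).

    `positions` must be sorted ascending and `radius` must be non-negative.
    Illustrative O(n^3) sketch, not the published implementation.
    """
    n = len(positions)
    unreachable = (n + 1, float("inf"))
    # best[k] = (reference count, total distance) for covering the first k genes
    best = [unreachable] * (n + 1)
    best[0] = (0, 0.0)
    choice: List[Optional[Tuple[int, int]]] = [None] * (n + 1)
    for end in range(1, n + 1):
        for start in range(end):  # candidate segment covers genes start..end-1
            for ref in range(start, end):
                too_far_left = positions[ref] - positions[start] > radius
                too_far_right = positions[end - 1] - positions[ref] > radius
                if too_far_left or too_far_right:
                    continue
                dist = sum(abs(positions[k] - positions[ref]) for k in range(start, end))
                cand = (best[start][0] + 1, best[start][1] + dist)
                if cand < best[end]:  # lexicographic: count first, then distance
                    best[end] = cand
                    choice[end] = (start, ref)
    # Walk back through the stored choices to recover the reference genes.
    refs: List[int] = []
    k = n
    while k > 0:
        start, ref = choice[k]
        refs.append(ref)
        k = start
    return best[n][0], best[n][1], refs[::-1]


result = vicinity_cover([0, 1, 2, 10, 11, 12], radius=1)
```

With two well-separated clusters of three genes, the sketch picks the middle gene of each cluster as the reference. Minimizing total distance for a fixed number of references is equivalent to minimizing the average distance, since the number of genes is constant.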

Citations: 0

MEAanalysis: an open-source R package for downstream visualization of AxIS navigator multi-electrode array burst data at the single-electrode level.

IF 2.8

Bioinformatics advances Pub Date : 2025-07-03 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf160

Emily A Gordon, David L Bennett, Georgios Baskozos, Maddalena Comini

Citations: 0

Improved prediction of antibody and their complexes with clustered generative modelling ensembles.

IF 2.4

Bioinformatics advances Pub Date : 2025-07-03 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf161

Xiaotong Xu, Marco Giulini, Alexandre M J J Bonvin

Motivation: Gaining structural insights into antibody-antigen complexes is crucial for understanding antigen recognition mechanisms and advancing therapeutic antibody design. However, accurate prediction of the structure of highly variable complementarity-determining region 3 on the antibody heavy chain (CDR-H3 loop) remains a significant challenge due to its increased length and conformational variability. While AlphaFold2-multimer (AF2) has made substantial progress in protein structure prediction, its application to antibodies and antibody-antigen complexes is limited by the weak evolutionary signals in the CDR region and the lack of structural diversity in its output.

Results: To address these limitations, we propose a workflow that combines AlphaFlow to generate ensembles of potential loop conformations with integrative modelling of antibody-antigen complexes with HADDOCK. Improving the structural diversity of the H3 loop increases the success rate of subsequent docking tasks. Our analysis shows that while AF2 generally predicts accurate antibody structures, it struggles with the H3 loop. In cases where AF2 mispredicts the loop, we leverage AlphaFlow to generate ensembles of loop conformations via score-based flow matching, followed by clustering to produce a structurally diverse set of models. We demonstrate that these ensembles significantly improve antibody-antigen docking performance compared to the standard AF2 ensembles.

Availability and implementation: The input datasets and code involved in this research are available at https://github.com/haddocking/alphaflow-antibodies. All the resulting modelling data are available from Zenodo (https://zenodo.org/records/14906314).

Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12279294/pdf/

Citations: 0

Calculating genetic risk scores directly from summary statistics with an application to type 1 diabetes.

IF 2.4

Bioinformatics advances Pub Date : 2025-07-02 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf158

Steven Squires, Michael N Weedon, Richard A Oram

Motivation: Genetic risk scores (GRS) summarise genetic data into a single number and allow for discrimination between cases and controls. Many applications of GRSs would benefit from comparisons with multiple datasets to assess quality of the GRS across different groups. However, genetic data is often unavailable. If summary statistics of the genetic data could be used to calculate GRSs, more comparisons could be made, potentially leading to improved research.

Results: We present a methodology that utilises only summary statistics of genetic data to calculate GRSs with an example of a type 1 diabetes (T1D) GRS. An example on European populations of the mean T1D GRS for those calculated from genetic data and from summary statistics (our method) was 10.31 (10.12-10.48) and 10.38 (10.24-10.53), respectively. An example of a case-control set for T1D has an area under the receiver operating characteristic curve of 0.917 (0.903-0.93) for those calculated from genetic data and 0.914 (0.898-0.929) for those calculated from summary statistics.

Availability: The code is available at https://github.com/stevensquires/simulating_genetic_risk_scores.

Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12270265/pdf/
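As a rough illustration of how a score could be derived from summary statistics alone, the sketch below simulates genotypes from allele frequencies and computes an additive GRS. It is not the authors' method: it assumes independent variants in Hardy-Weinberg equilibrium and a purely additive score, and the frequencies and weights are made-up values. Real scores, including a T1D GRS, may involve interaction terms that this sketch ignores.

```python
import random
from typing import List, Sequence


def simulate_grs(freqs: Sequence[float], weights: Sequence[float],
                 n_individuals: int, seed: int = 0) -> List[float]:
    """Simulate additive genetic risk scores from allele frequencies.

    Each variant's dosage (0, 1 or 2 risk alleles) is drawn as two independent
    Bernoulli trials, which assumes Hardy-Weinberg equilibrium.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_individuals):
        score = 0.0
        for p, w in zip(freqs, weights):
            dosage = (rng.random() < p) + (rng.random() < p)
            score += w * dosage
        scores.append(score)
    return scores


def expected_grs(freqs: Sequence[float], weights: Sequence[float]) -> float:
    """Analytical mean of the additive score: 2 * sum(p_i * w_i)."""
    return 2.0 * sum(p * w for p, w in zip(freqs, weights))


# Made-up example values, not taken from any published score.
freqs = [0.1, 0.3, 0.5]
weights = [0.5, 1.0, 0.2]
scores = simulate_grs(freqs, weights, n_individuals=20_000)
mean_score = sum(scores) / len(scores)
```

The simulated mean should sit close to the analytical value returned by `expected_grs`, which gives a quick sanity check on the simulation.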

Citations: 0

PangenePro: an automated pipeline for rapid identification and classification of gene family members.

IF 2.4

Bioinformatics advances Pub Date : 2025-07-02 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf159

Kinza Fatima, Haifei Hu, Muhammad Tahir Ul Qamar

Motivation: The increasing availability of sequenced and assembled plant genomes in public databases has led to a surge in genome-wide identification (GWI) studies of gene families. However, previous studies are often single-reference genome-based, limiting their ability to capture intraspecific genetic diversity. Further, manual identification from multiple genomes is labor-intensive and time-consuming.

Results: Here, we present PangenePro, a fully automated pipeline using Python and R scripting, implemented in the Linux environment, designed to identify and classify gene family members across multiple genomes simultaneously. This pipeline integrates sequence alignment using BLAST, domain validation through InterProScan, and orthologous clustering to classify the identified genes into core, dispensable, and unique pangene sets. PangenePro was tested using five Arabidopsis thaliana, three Arachis and rice, and five Barley genomes, identifying a number of members comparable to those in previously reported studies. These results demonstrate the accuracy and efficiency of this method for gene family identification and classification in diverse and complex genomes. Moreover, its rapid nature enables comprehensive capture of intraspecific diversity and yields valuable candidate genes for further functional and plant breeding studies.

Availability and implementation: PangenePro is freely available on GitHub at https://github.com/kinza111/PangenePro.

Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12255874/pdf/
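The final classification step can be sketched as follows. This is an assumption about how core, dispensable, and unique sets are commonly defined in pangenome work (present in all genomes, in some, or in exactly one), not PangenePro's actual code; the cluster and genome names are hypothetical.

```python
from typing import Dict, Set


def classify_pangenes(clusters: Dict[str, Set[str]], n_genomes: int) -> Dict[str, str]:
    """Label each orthologous cluster by how many genomes contain it.

    core        -> present in every genome
    unique      -> present in exactly one genome
    dispensable -> present in more than one genome but not all
    """
    labels = {}
    for cluster_id, genomes in clusters.items():
        count = len(genomes)
        if count == n_genomes:
            labels[cluster_id] = "core"
        elif count == 1:
            labels[cluster_id] = "unique"
        else:
            labels[cluster_id] = "dispensable"
    return labels


# Hypothetical orthologous clusters across three genomes.
clusters = {
    "OG1": {"genomeA", "genomeB", "genomeC"},
    "OG2": {"genomeA", "genomeC"},
    "OG3": {"genomeB"},
}
labels = classify_pangenes(clusters, n_genomes=3)
```

Storing each cluster as a set of genome names means a gene family with several copies in one genome still counts that genome only once.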

Citations: 0

X-CRISP: domain-adaptable and interpretable CRISPR repair outcome prediction.

IF 2.8

Bioinformatics advances Pub Date : 2025-07-02 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf157

Colm Seale, Joana P Gonçalves

Motivation: Controlling the outcomes of CRISPR editing is crucial for the success of gene therapy. Since donor template-based editing is often inefficient, alternative strategies have emerged that leverage mutagenic end-joining repair instead. Existing machine learning models can accurately predict end-joining repair outcomes; however, generalisability beyond the specific cell line used for training remains a challenge, and interpretability is typically limited by suboptimal feature representation and model architecture.

Results: We propose X-CRISP, a flexible and interpretable neural network for predicting repair outcome frequencies based on a minimal set of outcome and sequence features, including microhomologies (MH). Outperforming prior models on detailed and aggregate outcome predictions, X-CRISP prioritised MH location over MH sequence properties such as GC content for deletion outcomes. Through transfer learning, we adapted X-CRISP pre-trained on wild-type mESC data to target human cell lines K562, HAP1, U2OS, and mESC lines with altered DNA repair function. Adapted X-CRISP models improved over direct training on target data from as few as 50 samples, suggesting that this strategy could be leveraged to build models for new domains using a fraction of the data required to train models from scratch.

Availability and implementation: X-CRISP is available at https://github.com/joanagoncalveslab/xcrisp.

Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12270252/pdf/

Citations: 0

Recent computational advances in the identification of cryptic binding sites for drug discovery.

IF 2.8

Bioinformatics advances Pub Date : 2025-07-01 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf156

Dorota Gašparíková, Rupesh Chikhale, Jason Cole, Ehmke Pohl

Motivation: Cryptic ligand binding sites, defined as binding pockets that exist in the ligand-bound state of a protein but not in its apo form, are gaining increasing interest due to the opportunities they provide for drug discovery.

Results: This review article looks at the current state of cryptic binding site research, highlighting advancements in both molecular dynamics (MD) methods and machine learning (ML) methods to predict and utilize these sites.

Availability and implementation: MD methods include the use of Markov State Models, Enhanced Sampling, and other methods such as Cosolvent MD, while ML methods utilize Support Vector Machine, Random Forest, and Neural Networks. Here, we discuss case studies for both methods and their overlaps, providing insight into the future and the limitations faced. Compared to MD methods, ML methods are often reported to be more cost- and time-effective. However, a limited number of datasets are available for training these ML methods. Integrating MD with ML methods promises to expand our ability to predict and validate new cryptic binding sites that can be evaluated for druggability.

Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12342141/pdf/

Citations: 0

StackGlyEmbed: prediction of N-linked glycosylation sites using protein language models.

IF 2.4

Bioinformatics advances Pub Date : 2025-06-28 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf146

Md Muhaiminul Islam Nafi, M Saifur Rahman

Motivation: N-linked glycosylation is one of the most basic post-translational modifications (PTMs) where oligosaccharides covalently bond with Asparagine (N). These are found in the conserved regions like N-X-S or N-X-T where X can be any residue except Proline (P). Prediction of N-linked glycosylation sites has great importance as these PTMs play a vital role in many biological processes and functionalities. Experimental methods, such as mass spectrometry, for detecting N-linked glycosylation sites are very expensive. Therefore, the prediction of N-linked glycosylation sites has become an important research field.

Results: In this work, we propose StackGlyEmbed, a stacking ensemble machine learning model, to computationally predict N-linked glycosylation sites. We have explored embeddings from several protein language models and built the stacking ensemble using Support Vector Machine (SVM), Extreme Gradient Boosting (XGB) and K-nearest Neighbor (KNN) learners in the base layer, with a second SVM model in the meta layer. StackGlyEmbed achieves 98.2% sensitivity, 92.5% balanced accuracy, 89.1% F1-score and 82.6% Matthews correlation coefficient in independent testing, outperforming the existing state-of-the-art methods.

Availability and implementation: StackGlyEmbed is freely available at: https://github.com/nafcoder/StackGlyEmbed.

Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12237515/pdf/
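The N-X-S/T motif described in the motivation is easy to scan for, which is typically how candidate sites are enumerated before any classifier is applied. The sketch below is only that candidate scan, written as an assumption about a reasonable preprocessing step; it is not part of StackGlyEmbed, and the example sequence is made up.

```python
import re
from typing import List

# N, then any residue except proline, then S or T. The lookahead makes the
# match zero-width so that overlapping motifs are all reported.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequons(sequence: str) -> List[int]:
    """Return 0-based positions of asparagines that start an N-X-S/T motif."""
    return [match.start() for match in SEQUON.finditer(sequence.upper())]


# Made-up peptide: NAS and NGT qualify, NPT does not (X is proline).
positions = find_sequons("MNASNPTNGT")
```

A motif match only marks a candidate; whether a given asparagine is actually glycosylated is the question the prediction model is meant to answer.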

Citations: 0