{"title":"HyenaCircle: a HyenaDNA-based pretrained large language model for long eccDNA prediction.","authors":"Fuyu Li, Wenxiang Lu, Yunfei Bai","doi":"10.3389/fgene.2025.1641162","DOIUrl":null,"url":null,"abstract":"<p><strong>Introduction: </strong>Extrachromosomal circular DNA (eccDNA) represents a class of circular DNA molecules derived from chromosomes with diverse roles in disease. Long eccDNAs (typically 1-5 kb) pose detection challenges due to their large size, hindering functional studies. We propose HyenaCircle, a novel deep learning model leveraging large language model and third-generation sequencing data to predict long eccDNA formation.</p><p><strong>Methods: </strong>Full-length eccDNAs within 1-5 kb were identified by FLED algorithm for Nanopore sequencing data, extended by 100-bp flanking sequences, and paired with 20,000 length-matched negative controls from eccDNA-depleted genomic regions. HyenaCircle was built by adapting the pretrained HyenaDNA model with a designed classifier head. The strategies of data augmentation, regularization and class imbalance weighting were applied to increase model robustness.</p><p><strong>Results: </strong>HyenaCircle achieved comparable performance with a validation AUROC of 0.715 and recall of 0.776. It surpassed DNABERT by 5.9% in AUROC and demonstrated stable convergence. Hyperparameter optimization confirmed batch size 16 and learning rate 5 × 10<sup>-5</sup> as optimal. The ablation studies revealed flanking sequences are important, as their removal reduced model stability. The model also showed superior stability over the baseline HyenaDNA architecture.</p><p><strong>Conclusion: </strong>HyenaCircle integrated third-generation sequencing data and large language model for long eccDNA prediction, which outperformed the existing model. Our work demonstrates that the HyenaDNA architecture enables effective long-sequence genomic modeling and provides a new insight for eccDNA prediction and identification.</p>","PeriodicalId":12750,"journal":{"name":"Frontiers in Genetics","volume":"16 ","pages":"1641162"},"PeriodicalIF":2.8000,"publicationDate":"2025-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12240936/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Frontiers in Genetics","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.3389/fgene.2025.1641162","RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/1/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"GENETICS & HEREDITY","Score":null,"Total":0}
引用次数: 0
Abstract
Introduction: Extrachromosomal circular DNA (eccDNA) represents a class of circular DNA molecules derived from chromosomes with diverse roles in disease. Long eccDNAs (typically 1-5 kb) pose detection challenges due to their large size, hindering functional studies. We propose HyenaCircle, a novel deep learning model leveraging large language model and third-generation sequencing data to predict long eccDNA formation.
Methods: Full-length eccDNAs within 1-5 kb were identified by FLED algorithm for Nanopore sequencing data, extended by 100-bp flanking sequences, and paired with 20,000 length-matched negative controls from eccDNA-depleted genomic regions. HyenaCircle was built by adapting the pretrained HyenaDNA model with a designed classifier head. The strategies of data augmentation, regularization and class imbalance weighting were applied to increase model robustness.
Results: HyenaCircle achieved comparable performance with a validation AUROC of 0.715 and recall of 0.776. It surpassed DNABERT by 5.9% in AUROC and demonstrated stable convergence. Hyperparameter optimization confirmed batch size 16 and learning rate 5 × 10-5 as optimal. The ablation studies revealed flanking sequences are important, as their removal reduced model stability. The model also showed superior stability over the baseline HyenaDNA architecture.
Conclusion: HyenaCircle integrated third-generation sequencing data and large language model for long eccDNA prediction, which outperformed the existing model. Our work demonstrates that the HyenaDNA architecture enables effective long-sequence genomic modeling and provides a new insight for eccDNA prediction and identification.
Frontiers in GeneticsBiochemistry, Genetics and Molecular Biology-Molecular Medicine
CiteScore
5.50
自引率
8.10%
发文量
3491
审稿时长
14 weeks
期刊介绍:
Frontiers in Genetics publishes rigorously peer-reviewed research on genes and genomes relating to all the domains of life, from humans to plants to livestock and other model organisms. Led by an outstanding Editorial Board of the world’s leading experts, this multidisciplinary, open-access journal is at the forefront of communicating cutting-edge research to researchers, academics, clinicians, policy makers and the public.
The study of inheritance and the impact of the genome on various biological processes is well documented. However, the majority of discoveries are still to come. A new era is seeing major developments in the function and variability of the genome, the use of genetic and genomic tools and the analysis of the genetic basis of various biological phenomena.