HKDE-LACM: a hybrid model for lactic acid bacteria classification via k-mer and DNABERT-2 embedding fusion with cyclic DE-BO optimization
Authors: Jie Zou, Weichi Liu, Jinhui Dai, Gaifang Dong
Journal: BMC Genomics 26(1):815 (Journal Article), published 2025-09-25
DOI: https://doi.org/10.1186/s12864-025-12009-7
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12465904/pdf/
Citations: 0
Abstract
Background: Lactic acid bacteria (LAB) play vital roles in food production and clinical applications. Accurate classification of LAB strains facilitates their functional development and targeted utilization. Although machine learning and deep learning methods have been widely applied to genome sequence classification, challenges remain in capturing comprehensive feature representations and enhancing model generalizability.
Results: We present HKDE-LACM, a hybrid classification model that integrates high-dimensional k-mer frequency features with contextual embeddings derived from DNABERT-2. To optimize model hyperparameters, we introduce a Cyclic Differential Evolution and Bayesian Optimization with Failure Avoidance (C-DBFA) framework. We evaluated the model with 10-fold cross-validation on three LAB datasets. Experimental results demonstrate that HKDE-LACM outperforms existing methods in terms of both classification accuracy and robustness.
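The feature fusion described in the Results can be illustrated with a minimal sketch. It assumes that fusion means plain concatenation of a normalized k-mer frequency vector with an embedding vector; the abstract does not state the actual fusion rule or the value of k. The names `kmer_frequencies`, `fuse_features`, and `toy_embedding` are hypothetical, and the embedding is a hard-coded stand-in rather than a real DNABERT-2 output.

```python
from itertools import product


def kmer_frequencies(sequence, k=3):
    """Return normalized k-mer frequencies over ACGT, in lexicographic k-mer order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = sequence.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing ambiguous bases such as N
            counts[window] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


def fuse_features(kmer_vector, embedding_vector):
    """Fuse two feature views by concatenation (an assumed fusion rule)."""
    return list(kmer_vector) + list(embedding_vector)


# Toy input; the embedding is a hypothetical placeholder for a DNABERT-2 vector.
toy_sequence = "ACGTACGTNACG"
toy_embedding = [0.12, -0.40, 0.88, 0.05]
kmer_vec = kmer_frequencies(toy_sequence, k=2)
fused = fuse_features(kmer_vec, toy_embedding)
print(len(kmer_vec), len(fused))
```

In practice the k-mer vector grows as 4^k while a transformer embedding has a fixed width, so some scaling or dimensionality reduction before concatenation may be needed; whether HKDE-LACM applies one is not stated in the abstract.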
Conclusions: HKDE-LACM overcomes the limitations of traditional k-mer features by incorporating semantic embeddings, thereby enriching the representation of genomic sequences. In addition, the model can automatically identify optimal combinations of feature extractors and classifiers through the C-DBFA optimization framework. These advantages effectively enhance the model's generalization ability, making it a promising tool for genome-based LAB classification and related tasks.
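For readers unfamiliar with the differential evolution half of C-DBFA, the following is a generic DE/rand/1/bin loop, not the authors' framework. The Bayesian optimization stage, the cyclic alternation between the two, and the failure-avoidance mechanism are omitted because the abstract does not specify them. The objective `toy_validation_error` and its two hyperparameters are hypothetical stand-ins for a cross-validated error.

```python
import random


def differential_evolution(objective, bounds, pop_size=12, generations=40,
                           f_weight=0.7, crossover=0.9, seed=0):
    """Minimize `objective` over the box `bounds` with a basic DE/rand/1/bin loop."""
    rng = random.Random(seed)
    dim = len(bounds)
    population = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [objective(ind) for ind in population]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals other than the current one.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # at least one coordinate comes from the mutant
            trial = []
            for d in range(dim):
                if d == forced or rng.random() < crossover:
                    value = population[a][d] + f_weight * (population[b][d] - population[c][d])
                    lo, hi = bounds[d]
                    value = min(max(value, lo), hi)  # clip to the search box
                else:
                    value = population[i][d]
                trial.append(value)
            trial_score = objective(trial)
            if trial_score <= scores[i]:  # greedy selection keeps the better point
                population[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=lambda j: scores[j])
    return population[best], scores[best]


# Hypothetical stand-in for a cross-validated error over two hyperparameters.
def toy_validation_error(params):
    learning_rate, regularization = params
    return (learning_rate - 0.1) ** 2 + (regularization - 1.0) ** 2


best_params, best_error = differential_evolution(
    toy_validation_error, bounds=[(0.0, 1.0), (0.0, 5.0)]
)
print(best_params, best_error)
```

A hybrid scheme would typically hand the best DE candidates to a surrogate-based search for local refinement and then feed the results back into the population; how C-DBFA schedules that exchange is described in the full paper, not in this abstract.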
About the journal:
BMC Genomics is an open access, peer-reviewed journal that considers articles on all aspects of genome-scale analysis, functional genomics, and proteomics.
BMC Genomics is part of the BMC series which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.