SV-MeCa: an XGBoost-based meta-caller approach for structural variant calling from short-read data.

IF 3.3 | Zone 3, Biology | Q2, BIOCHEMICAL RESEARCH METHODS

BMC Bioinformatics Pub Date : 2025-08-20 DOI:10.1186/s12859-025-06246-6

Rudel Christian Nkouamedjo Fankep, Arda Söylev, Anna-Lena Kobiela, Jochen Blom, Corinna Ernst, Susanne Motameny

BMC Bioinformatics 26(1): 218. Publication type: Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12366149/pdf/

Citations: 0

Abstract


Background: Calling structural variants (SVs), i.e., genomic alterations of ≥50 bp, from whole-genome short-read data remains challenging, as existing callers are known to lack accuracy and robustness. Therefore, meta-caller approaches, which combine the results of multiple standalone tools into a consensus set of reported SV calls, are widely used. Here, SV-MeCa (Structural Variant Meta-Caller) is presented, the first SV meta-caller incorporating variant-specific quality metrics from individual VCF outputs, rather than relying solely on the number and combination of tools supporting consensus SV calls. In addition, SV-MeCa offers a suitable score to rank the obtained consensus SV calls according to the evidence that they represent true positive calls, i.e., real-world variants.

Results: SV-MeCa applies seven standalone SV callers and merges the resulting deletion and insertion calls into a union VCF file using SURVIVOR. For each entry in the SURVIVOR-generated consensus, caller-specific quality measures are extracted from the corresponding standalone VCF files and serve as input for either a deletion-specific or an insertion-specific XGBoost decision tree classifier, which was previously trained on the HG002 SV benchmark data provided by the Genome in a Bottle consortium. The SV-MeCa XGBoost models assign each (consensus) SV call a probability of representing a true positive call, which can be used for ranking the final output according to evidence. The performance of SV-MeCa and four previously published meta-caller approaches was evaluated based on autosomal SV calls in samples curated by the Human Genome Structural Variation Consortium, Phase 2. With regard to F1 scores, which were 0.58 on average for deletions and 0.42 on average for insertions, SV-MeCa outperformed the other meta-callers. With regard to precision, only ConsensuSV achieved higher values (0.97 versus 0.64 on average for deletions, 0.75 versus 0.53 on average for insertions), and with regard to recall, SV-MeCa was outperformed exclusively by Meta-SV for deletions (0.55 versus 0.53).
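The F-scores reported above are presumably F1 scores, i.e., the harmonic mean of precision and recall. As a rough consistency check (a minimal sketch, not the authors' evaluation code), the snippet below plugs the average deletion precision and recall quoted for SV-MeCa into that formula. The function name `f1_score` is introduced here for illustration only. Because the abstract reports per-sample averages, the F1 of the averaged precision and recall need not equal the averaged F1 exactly.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Average deletion precision (0.64) and recall (0.53) quoted for SV-MeCa.
sv_meca_deletion_f1 = f1_score(0.64, 0.53)
print(round(sv_meca_deletion_f1, 2))  # prints 0.58
```

The result agrees with the average deletion F-score of 0.58 stated in the abstract.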

Conclusions: SV-MeCa, publicly available at https://github.com/ccfboc-bioinformatics/SV-MeCa, outperforms existing SV meta-caller approaches by taking variant-specific quality measures into account. Moreover, because the XGBoost prediction probabilities serve as scores, the output of SV-MeCa can be continuously adjusted to user needs in terms of sensitivity and precision.
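How a probability score lets a user trade sensitivity against precision can be sketched as follows. This is a hedged illustration only: the call IDs, probabilities, and truth labels in `SCORED_CALLS` are invented toy values, not SV-MeCa output, and `precision_recall_at` is a hypothetical helper rather than part of the tool. Recall is computed relative to the true calls in the toy list, whereas a real benchmark would also count variants that no caller reported.

```python
from typing import List, Tuple

# Hypothetical scored calls: (call ID, model probability, is a true positive).
# Invented toy values for illustration only; not SV-MeCa output.
SCORED_CALLS: List[Tuple[str, float, bool]] = [
    ("del_1", 0.95, True),
    ("del_2", 0.80, True),
    ("del_3", 0.60, False),
    ("del_4", 0.40, True),
    ("del_5", 0.10, False),
]


def precision_recall_at(
    calls: List[Tuple[str, float, bool]], threshold: float
) -> Tuple[float, float]:
    """Keep calls scoring at or above the threshold; return (precision, recall)."""
    kept = [call for call in calls if call[1] >= threshold]
    true_total = sum(1 for call in calls if call[2])
    true_kept = sum(1 for call in kept if call[2])
    # Convention chosen here: precision is 1.0 when no call passes the threshold.
    precision = true_kept / len(kept) if kept else 1.0
    recall = true_kept / true_total if true_total else 0.0
    return precision, recall


for threshold in (0.3, 0.5, 0.9):
    p, r = precision_recall_at(SCORED_CALLS, threshold)
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

In this toy data, raising the threshold removes low-scoring calls, so precision rises while recall falls; lowering it has the opposite effect.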


Source journal

BMC Bioinformatics (Biology: Biochemical Research Methods)

CiteScore: 5.70
Self-citation rate: 3.30%
Articles published: 506
Review time: 4.3 months

About the journal: BMC Bioinformatics is an open access, peer-reviewed journal that considers articles on all aspects of the development, testing and novel application of computational and statistical methods for the modeling and analysis of all kinds of biological data, as well as other areas of computational biology. BMC Bioinformatics is part of the BMC series which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.