Gene fusion calling from RNA panel sequencing data: An ensemble learning approach

Kenneth B. Thomas, Y. Mou, C. Magnan, T. Gyuris, E. Shinbrot, Fernando Díaz, Steven Lau-Rivera, Segun Jung, V. Funari, L. Weiss
DOI: 10.1158/1538-7445.AM2021-240 (https://doi.org/10.1158/1538-7445.AM2021-240)
Publication date: 2021-07-01
Abstract 240: Gene fusion calling from RNA panel sequencing data: An ensemble learning approach
Introduction: Our goal is to improve gene fusion detection via RNA sequencing by combining multiple fusion callers through machine learning techniques.

Background: Gene fusion events are important drivers of malignancy. RNA sequencing (RNAseq) methods for detection of fusions have the advantage that multiple markers can be targeted at one time. Unlike DNA methods, in which it is challenging to capture fusion breakpoints, in RNA methods fusions are readily identified through chimeric transcripts. While many fusion calling algorithms exist for use on RNAseq data, sensitive fusion callers, needed for samples of low tumor content, often present high false positive rates, a result of aligning chimeric transcripts. Further, there is currently no single feature in NGS data that can be used to filter out false positive fusion calls. To achieve higher accuracy in fusion calls than can be achieved using individual fusion callers, we have weighted and combined the results of multiple fusion callers by systematic and objective means: an ensemble learning approach based on random forest models. Our method selects from data generated by three independent fusion callers, supplemented by metrics obtained from in-house methods. It presents a metric that can be immediately interpreted as the probability that a candidate fusion call is a true fusion call.

Methods: Random forest models were generated with the randomForest package in R, with tuning by the R caret package. Training data sets consisted of a balanced set of 394 fusion calls from clinical samples of solid tumors. For training, fusion calls with at least 10 supporting reads were deemed true or false based on manual review via IGV and on orthogonal methods, including PCR with Sanger sequencing and the commercial Archer™ fusion CTL and Sarcoma panels. We present the results of training on data from the three well-known fusion callers Arriba, STAR-Fusion, and FusionCatcher, together with additional data from an in-house developed junction counting method and fusion membership in a list of known fusions (a “white list”). Models were validated by 10-fold cross-validation.

Results: In performance evaluations, false positive and false negative calls were presumed false based on orthogonal determinations. On that basis, our current best model has an accuracy of 94.9% (sensitivity 93.4%, specificity 96.7%). Currently, High Confidence fusion calls (calls with probability score greater than 70%) are the most common positive calls. These have been confirmed with 100% success.

Conclusion: We have successfully integrated multiple fusion callers by means of random forest models. Our current model is validated for use on our solid tumor fusion calling pipeline.

Citation Format: Kenneth B. Thomas, Yanglong Mou, Christophe Magnan, Tibor Gyuris, Eve Shinbrot, Fernando Lopez Diaz, Steven Lau-Rivera, Segun Jung, Vincent Funari, Lawrence M. Weiss. Gene fusion calling from RNA panel sequencing data: An ensemble learning approach [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2021; 2021 Apr 10-15 and May 17-21. Philadelphia (PA): AACR; Cancer Res 2021;81(13_Suppl):Abstract nr 240.
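The reported accuracy, sensitivity, and specificity, and the High Confidence tier (probability score greater than 70%), can be illustrated with a minimal Python sketch. This is not the authors' pipeline, which used R. The `ScoredCall` type, the function names, the 0.50 decision cutoff, and the example calls are all assumptions made for illustration; only the 70% High Confidence cutoff comes from the abstract.

```python
from dataclasses import dataclass

# From the abstract: High Confidence calls have a probability score above 70%.
HIGH_CONFIDENCE_THRESHOLD = 0.70
# Assumption for illustration only: the abstract does not state the cutoff
# that separates positive from negative calls.
DECISION_THRESHOLD = 0.50


@dataclass(frozen=True)
class ScoredCall:
    fusion: str         # hypothetical gene-pair label
    probability: float  # model's estimated probability that the call is true
    is_true: bool       # truth label from orthogonal confirmation


def is_high_confidence(probability: float) -> bool:
    """Return True when the score is strictly greater than 70%."""
    return probability > HIGH_CONFIDENCE_THRESHOLD


def confusion_counts(calls, threshold=DECISION_THRESHOLD):
    """Count (tp, fp, tn, fn) by comparing thresholded scores to truth labels."""
    tp = fp = tn = fn = 0
    for call in calls:
        predicted_true = call.probability > threshold
        if predicted_true and call.is_true:
            tp += 1
        elif predicted_true:
            fp += 1
        elif call.is_true:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def summarize(tp, fp, tn, fn):
    """Compute accuracy, sensitivity, and specificity from the four counts."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one true and one false call")
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Made-up calls with placeholder gene names, not data from the study.
example_calls = [
    ScoredCall("GENE_A--GENE_B", 0.95, True),
    ScoredCall("GENE_C--GENE_D", 0.72, True),
    ScoredCall("GENE_E--GENE_F", 0.55, False),  # scored positive, truly false
    ScoredCall("GENE_G--GENE_H", 0.40, True),   # scored negative, truly true
    ScoredCall("GENE_I--GENE_J", 0.10, False),
    ScoredCall("GENE_K--GENE_L", 0.05, False),
]

counts = confusion_counts(example_calls)
print(counts, summarize(*counts))
print([c.fusion for c in example_calls if is_high_confidence(c.probability)])
```

Sensitivity is the fraction of truly fused calls that the model scores as positive, and specificity is the fraction of false calls it scores as negative. Keeping the High Confidence cutoff separate from the decision cutoff reflects the abstract's wording, which describes High Confidence calls as a subset of positive calls.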