Getiria Onsongo, H. Lam, Matthew Bower, B. Thyagarajan
Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. Published 2020-09-21. DOI: 10.1145/3388440.3414861 (https://doi.org/10.1145/3388440.3414861)
Hadoop-CNV-RF: A Scalable Copy Number Variation Detection Tool for Next-Generation Sequencing Data
Detection of small copy number variations (CNVs) in clinically relevant genes is routinely used to aid diagnosis. We recently developed a tool, CNV-RF, capable of detecting clinically relevant CNVs with a high degree of sensitivity. The CNV-RF implementation was designed for small gene panels and did not scale to large ones. Analysis of large gene panels with several hundred genes routinely failed due to memory limitations on a single computer and, when successful, took on average over 24 hours, making it impractical for routine clinical use. A reliable tool is needed that can accurately identify clinically relevant CNVs on large gene panels within a more practical time frame. We have developed Hadoop-CNV-RF, a freely available, scalable, and more user-friendly implementation of CNV-RF capable of rapidly analyzing large datasets. Hadoop-CNV-RF takes advantage of Hadoop, a framework developed to analyze large amounts of data. In its implementation, we demonstrate the feasibility of developing scalable pipelines on Hadoop that integrate popular bioinformatics software written for traditional single-user computers, without the need for special-purpose routines developed for Hadoop. Results show that Hadoop-CNV-RF reduces analysis time on large gene panels from over 24 hours to about 4 hours on a 20-node Hadoop cluster. Additionally, we demonstrate its ability to scale by analyzing a whole-exome dataset with close to a billion reads. Hadoop-CNV-RF has been clinically validated for large gene panels (up to 4800 genes) and is currently being used in the clinic. It is publicly available at https://github.com/getiria-onsongo/hadoopcnvrf-public.
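
The reported runtimes can be turned into a rough speedup and parallel-efficiency estimate. The sketch below is not from the paper: the function names are hypothetical, and the inputs are the abstract's rounded figures ("over 24 hours" on a single computer, "about 4 hours" on 20 nodes). Because the baseline is a lower bound and the two setups may use different hardware, the results are approximate.

```python
def speedup(baseline_hours: float, parallel_hours: float) -> float:
    """Ratio of the single-machine runtime to the cluster runtime."""
    if baseline_hours <= 0 or parallel_hours <= 0:
        raise ValueError("runtimes must be positive")
    return baseline_hours / parallel_hours


def parallel_efficiency(baseline_hours: float, parallel_hours: float, nodes: int) -> float:
    """Speedup divided by node count; 1.0 would be perfect linear scaling."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return speedup(baseline_hours, parallel_hours) / nodes


# Rounded figures taken from the abstract (assumed exact only for illustration).
BASELINE_HOURS = 24.0  # "over 24 hours" on a single computer
CLUSTER_HOURS = 4.0    # "about 4 hours" on the Hadoop cluster
NODES = 20             # cluster size stated in the abstract

if __name__ == "__main__":
    s = speedup(BASELINE_HOURS, CLUSTER_HOURS)
    e = parallel_efficiency(BASELINE_HOURS, CLUSTER_HOURS, NODES)
    print(f"speedup ~ {s:.1f}x, efficiency ~ {e:.0%}")
```

With these inputs the estimate is a speedup of about 6x, or roughly 30% of ideal linear scaling on 20 nodes. Sub-linear scaling of this kind is common when a pipeline wraps existing single-machine tools, although the abstract does not break down where the time goes.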