Xueqi Li, Guangming Tan, Bingchen Wang, Ninghui Sun
Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2018-02-10. DOI: 10.1145/3178487.3178511 (https://doi.org/10.1145/3178487.3178511)
High-performance genomic analysis framework with in-memory computing
In this paper, we propose GPF, an in-memory computing framework that provides a set of genomic data formats, APIs, and a fast execution engine for large-scale genomic data processing. GPF comprises two main components: (1) scalable genomic data formats and APIs, and (2) an advanced execution engine that supports efficient compression of genomic data and eliminates redundancies during execution. We further present both system-specific and algorithm-specific implementations that let users build genomic analysis pipelines without any knowledge of Spark parallel programming. To evaluate the performance of GPF, we built a Whole-Genome-Sequencing (WGS) pipeline on top of it as a test case. Our experimental results indicate that GPF completes WGS analysis of the 146.9 Gbase Human Platinum Genome in 24 minutes, with over 50% parallel efficiency on 2048 CPU cores. Overall, GPF provides a fast and general in-memory engine for large-scale genomic data processing.
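To put the reported figures in context, the following is a minimal sketch of the arithmetic behind them. The throughput numbers follow directly from the abstract (146.9 G bases, 24 minutes, 2048 cores). The abstract does not state the baseline used for its parallel-efficiency figure, so the baseline core count and runtime below are hypothetical placeholders chosen only to illustrate the formula; the function names are likewise illustrative and not part of GPF.

```python
def throughput_bases_per_second(total_bases: float, minutes: float) -> float:
    """Aggregate processing rate in bases per second."""
    return total_bases / (minutes * 60.0)


def parallel_efficiency(t_base: float, cores_base: int,
                        t_scaled: float, cores_scaled: int) -> float:
    """Relative parallel efficiency: measured speedup divided by ideal speedup."""
    speedup = t_base / t_scaled          # how much faster the scaled run is
    ideal = cores_scaled / cores_base    # how much faster it would be with perfect scaling
    return speedup / ideal


# Figures taken from the abstract.
TOTAL_BASES = 146.9e9   # 146.9 G bases
RUNTIME_MIN = 24.0      # minutes on the full machine
CORES = 2048

rate = throughput_bases_per_second(TOTAL_BASES, RUNTIME_MIN)
per_core = rate / CORES

# HYPOTHETICAL baseline, not from the paper: 128 cores taking 200 minutes.
HYPOTHETICAL_BASE_CORES = 128
HYPOTHETICAL_BASE_MIN = 200.0

efficiency = parallel_efficiency(HYPOTHETICAL_BASE_MIN, HYPOTHETICAL_BASE_CORES,
                                 RUNTIME_MIN, CORES)

print(f"aggregate throughput: {rate / 1e6:.1f} Mbases/s")
print(f"per-core throughput:  {per_core:.0f} bases/s")
print(f"efficiency under the hypothetical baseline: {efficiency:.1%}")
```

The reported run works out to roughly 102 million bases per second in aggregate, or about 50 thousand bases per second per core. The efficiency value printed last depends entirely on the assumed baseline and should not be read as a result from the paper.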