Diksha Chaudhary, Bratati Kahali, Yogesh L. Simmhan
2019 26th International Conference on High Performance Computing, Data and Analytics Workshop (HiPCW), published 2019-12-01. DOI: 10.1109/HiPCW.2019.00030 (https://doi.org/10.1109/HiPCW.2019.00030)
An Empirical Study on Efficient Storage of Human Genome Data
Next-generation sequencing (NGS) has become affordable and fast, facilitating large-scale, population-level Whole Genome Sequencing (WGS) studies. NGS and its processing pipeline generate hundreds of gigabytes of data per human subject, which can grow to petabytes for large studies such as the upcoming GenomeIndia program. At these scales, affordable and reliable storage of data becomes a challenge. Here, we propose a preliminary data management architecture for storing and querying data from the GenomeIndia project. In this initial empirical study, we focus on existing generic and domain-specific compression techniques for reducing the storage space of genome sequence data, and we compare erasure coding and replication for providing reliability on commodity hardware. We report the time and space complexity of these approaches, and the results will inform the future design of our architecture.
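The trade-off described in the abstract can be illustrated with a small back-of-the-envelope sketch: the raw disk footprint is roughly the compressed size multiplied by the redundancy overhead of the reliability scheme. The Python below is not the authors' method or data. It assumes a synthetic random A/C/G/T string in place of real sequencing output, uses only generic compressors from the standard library (no domain-specific tools), and picks 3-way replication and a (6 data + 3 parity) erasure code purely as hypothetical example parameters.

```python
import bz2
import lzma
import random
import zlib


def replication_overhead(copies: int) -> float:
    """Raw bytes stored per byte of user data with n-way replication."""
    return float(copies)


def erasure_overhead(k: int, m: int) -> float:
    """Raw bytes stored per user byte for a (k data + m parity) erasure code."""
    return (k + m) / k


def tolerated_failures(copies: int = 0, m: int = 0) -> int:
    """Lost fragments survivable: copies - 1 for replication, m for erasure coding."""
    return copies - 1 if copies else m


def compressed_sizes(data: bytes) -> dict:
    """Compressed size in bytes under three generic stdlib compressors."""
    return {
        "zlib": len(zlib.compress(data, 9)),
        "bz2": len(bz2.compress(data, 9)),
        "lzma": len(lzma.compress(data)),
    }


def main() -> None:
    # Assumption: random bases stand in for sequence data; real FASTQ files
    # also carry read headers and quality scores and compress differently.
    rng = random.Random(0)
    bases = "".join(rng.choice("ACGT") for _ in range(200_000)).encode("ascii")

    rep = replication_overhead(3)   # hypothetical: 3 full copies
    ec = erasure_overhead(6, 3)     # hypothetical: 6 data + 3 parity fragments

    for name, size in compressed_sizes(bases).items():
        ratio = size / len(bases)
        print(
            f"{name:5s} ratio={ratio:.3f} "
            f"footprint x3-replication={size * rep:,.0f} B "
            f"footprint EC(6,3)={size * ec:,.0f} B"
        )


if __name__ == "__main__":
    main()
```

With these example parameters, the erasure code stores 1.5 bytes per user byte and survives the loss of any 3 fragments, while 3-way replication stores 3 bytes per user byte and survives the loss of 2 copies. The sketch ignores encoding, decoding, and repair time, which an empirical comparison would also need to measure.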