Lifeng Yan, Zekun Yin, Tong Zhang, Fangjin Zhu, Xiaohui Duan, Bertil Schmidt, Weiguo Liu
Future Generation Computer Systems-The International Journal of Escience, vol. 164, Article 107577. Published 2024-10-24. DOI: 10.1016/j.future.2024.107577. Impact factor: 6.2 (JCR Q1, Computer Science, Theory & Methods). Article page: https://www.sciencedirect.com/science/article/pii/S0167739X24005417
SWQC: Efficient sequencing data quality control on the next-generation Sunway platform
Sequencing data quality control keeps low-quality data from degrading downstream bioinformatics applications. The enormous growth of biological sequencing data in recent years puts new pressure on the efficiency of quality control and motivates fast implementations on modern compute systems. The next-generation heterogeneous Sunway platform holds significant potential for addressing this challenge, but there are currently no dedicated quality control applications that can fully utilize its computational power. To bridge this gap, we introduce SWQC, a novel quality control application specifically designed for the Sunway platform. We present an efficient distributed FASTQ I/O framework for Sunway-based workstations and supercomputers that takes advantage of fast SSDs and the parallel file system. To support both process-level and thread-level (CPE-level) parallelism, we refactor and optimize all standard quality control modules for the heterogeneous Sunway architecture. On a single node, SWQC achieves speedups of between 2x and 40x over highly optimized quality control applications executed on a high-end 48-core AMD server. On 16 nodes, SWQC achieves parallel efficiencies of 70% (reading and writing a single file) and 95% (reading one file and writing split files) relative to a single node. Overall, SWQC performs quality control on a 140 GB FASTQ file within only 70 s using a single Sunway node. It is publicly available at https://github.com/RabbitBio/SWQC.
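The scaling and throughput figures in the abstract can be unpacked with a little arithmetic. The sketch below assumes the conventional definition of parallel efficiency (speedup over one node divided by the number of nodes) and treats "140 GB in 70 s" as an upper bound on runtime; the abstract does not state which definition the authors use, and the function names are purely illustrative.

```python
def implied_speedup(efficiency: float, nodes: int) -> float:
    """Speedup over a single node implied by a parallel efficiency.

    Assumes efficiency = speedup / nodes (the conventional definition;
    the abstract does not spell out the one used).
    """
    return efficiency * nodes


def throughput_gb_per_s(size_gb: float, seconds: float) -> float:
    """Average end-to-end processing rate in GB/s."""
    return size_gb / seconds


if __name__ == "__main__":
    # 16-node figures quoted in the abstract
    for label, eff in [("single output file", 0.70), ("split output files", 0.95)]:
        print(f"{label}: about {implied_speedup(eff, 16):.1f}x over one node")

    # 140 GB within 70 s on one node implies at least this average rate
    print(f"single node: at least {throughput_gb_per_s(140, 70):.1f} GB/s")
```

Under these assumptions, 70% efficiency on 16 nodes corresponds to roughly an 11x speedup and 95% to roughly 15x, while the single-node result implies an average rate of at least 2 GB/s.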
About the journal:
Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications.
Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration.
Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.