An interleaved hardware-accelerated k-mer parser

2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). Pub Date: 2022-12-06. DOI: 10.1109/BIBM55620.2022.9995126

F. Milicchio, Marco Oliva, Mattia C. F. Prosperi



Abstract


Advances in next-generation sequencing (NGS) have not only increased the overall throughput of genomic content (e.g. Illumina NovaSeq, up to 6,000 GB), but also brought technology miniaturization (e.g. Oxford Nanopore MinION), enabling real-time, mobile experiments. Single Instruction/Multiple Data (SIMD) hardware acceleration is increasingly used to improve the performance of NGS data processing tools, while generic template programming libraries make it easier to adapt to fast changes in sequencing and computing platforms. We present a novel k-mer parser written in ISO C++ that exploits an interleaved, non-sequential, hardware-accelerated SIMD implementation within a generic programming framework called libseq. We benchmarked our k-mer parser on different NGS experimental datasets, comparing it with two other popular k-mer counting tools (DSK and KMC3). On an Intel machine with AVX2 (Quad-Core Intel Core i5 CPU, 32 GB RAM), using simulated in-memory reads, DSK and KMC3 were on average 3.6x and 1.03x slower than our parser for k values ranging from 35 to 63. On real sequencing experiments, DSK and KMC3 were on average 8.3x and 28.8x slower than ours in file/read parsing and k-mer building. Since our tool uses generic programming, other methods that rely on k-mers (e.g. de Bruijn graphs) can directly benefit from its SIMD acceleration. Our k-mer parser and libseq 2.0 are released under the BSD license and are available at https://zenodo.org/record/7015294.
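For readers unfamiliar with k-mer parsing, the basic operation that the abstract refers to can be sketched in plain ISO C++ as follows. This is a scalar, sequential illustration only, not the interleaved SIMD implementation from the paper and not the libseq API: the names `encode_base` and `parse_kmers` are hypothetical, the 2-bit encoding (A=0, C=1, G=2, T=3) is an assumed convention, and the sketch is limited to k <= 32 so that a k-mer fits in one 64-bit word (the benchmarked range of 35 to 63 would need a wider representation).

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>
#include <vector>

// Hypothetical helper: map a nucleotide to a 2-bit code (A=0, C=1, G=2, T=3).
// Returns -1 for any other character (e.g. the ambiguous base N).
inline int encode_base(char c) {
    switch (c) {
        case 'A': case 'a': return 0;
        case 'C': case 'c': return 1;
        case 'G': case 'g': return 2;
        case 'T': case 't': return 3;
        default: return -1;
    }
}

// Hypothetical helper: return every k-mer of a read as a 2-bit packed
// 64-bit integer, in order of appearance. Windows that contain a
// non-ACGT character are skipped. Requires 1 <= k <= 32.
inline std::vector<std::uint64_t> parse_kmers(std::string_view read, unsigned k) {
    assert(k >= 1 && k <= 32);
    // Mask keeping only the lowest 2*k bits of the rolling value.
    const std::uint64_t mask =
        (k == 32) ? ~std::uint64_t{0} : ((std::uint64_t{1} << (2 * k)) - 1);

    std::vector<std::uint64_t> kmers;
    std::uint64_t current = 0;  // rolling 2-bit packed window
    unsigned valid = 0;         // consecutive valid bases seen so far

    for (char c : read) {
        const int code = encode_base(c);
        if (code < 0) {  // ambiguous base: restart the window
            valid = 0;
            current = 0;
            continue;
        }
        // Shift in the new base and drop the oldest one via the mask.
        current = ((current << 2) | static_cast<std::uint64_t>(code)) & mask;
        if (++valid >= k) {
            kmers.push_back(current);
        }
    }
    return kmers;
}
```

Under these assumptions, `parse_kmers("ACGT", 2)` yields the packed values for AC, CG, and GT. Each base is handled one at a time here; a SIMD parser would instead process several bases or several reads per instruction.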
