{"title":"Direct-Coding DNA With Multilevel Parallelism","authors":"Caden Corontzos;Eitan Frachtenberg","doi":"10.1109/LCA.2024.3355109","DOIUrl":null,"url":null,"abstract":"The cost and time to sequence entire genomes have been on a steady and rapid decline since the early 2000s, leading to an explosion of genomic data. In contrast, the growth rates for digital storage device capacity, CPU clock speed, and networking bandwidth have been much more moderate. This gap means that the need for storing, transmitting, and processing sequenced genomic data is outpacing the capacities of the underlying technologies. Compounding the problem is the fact that traditional data compression techniques used for natural language or images are not optimal for genomic data. To address this challenge, many data-compression techniques have been developed, offering a range of tradeoffs between compression ratio, computation time, memory requirements, and complexity. This paper focuses on a specific technique on one extreme of this tradeoff, namely two-bit coding, wherein every base in a genomic sequence is compressed from its original 8-bit ASCII representation to a unique two-bit binary representation. Even for this simple direct-coding scheme, current implementations leave room for significant performance improvements. Here, we show that this encoding can exploit multiple levels of parallelism in modern computer architectures to maximize encoding and decoding efficiency. Our open-source implementation achieves encoding and decoding rates of billions of bases per second, which are much higher than previously reported results. In fact, our measured throughput is typically limited only by the speed of the underlying storage media.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"21-24"},"PeriodicalIF":1.4000,"publicationDate":"2024-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Computer Architecture Letters","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10401969/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
引用次数: 0
Abstract
The cost and time to sequence entire genomes have been on a steady and rapid decline since the early 2000s, leading to an explosion of genomic data. In contrast, the growth rates for digital storage device capacity, CPU clock speed, and networking bandwidth have been much more moderate. This gap means that the need for storing, transmitting, and processing sequenced genomic data is outpacing the capacities of the underlying technologies. Compounding the problem is the fact that traditional data compression techniques used for natural language or images are not optimal for genomic data. To address this challenge, many data-compression techniques have been developed, offering a range of tradeoffs between compression ratio, computation time, memory requirements, and complexity. This paper focuses on a specific technique on one extreme of this tradeoff, namely two-bit coding, wherein every base in a genomic sequence is compressed from its original 8-bit ASCII representation to a unique two-bit binary representation. Even for this simple direct-coding scheme, current implementations leave room for significant performance improvements. Here, we show that this encoding can exploit multiple levels of parallelism in modern computer architectures to maximize encoding and decoding efficiency. Our open-source implementation achieves encoding and decoding rates of billions of bases per second, which are much higher than previously reported results. In fact, our measured throughput is typically limited only by the speed of the underlying storage media.
期刊介绍:
IEEE Computer Architecture Letters is a rigorously peer-reviewed forum for publishing early, high-impact results in the areas of uni- and multiprocessor computer systems, computer architecture, microarchitecture, workload characterization, performance evaluation and simulation techniques, and power-aware computing. Submissions are welcomed on any topic in computer architecture, especially but not limited to: microprocessor and multiprocessor systems, microarchitecture and ILP processors, workload characterization, performance evaluation and simulation techniques, compiler-hardware and operating system-hardware interactions, interconnect architectures, memory and cache systems, power and thermal issues at the architecture level, I/O architectures and techniques, independent validation of previously published results, analysis of unsuccessful techniques, domain-specific processor architectures (e.g., embedded, graphics, network, etc.), real-time and high-availability architectures, reconfigurable systems.