MegIS: High-Performance, Energy-Efficient, and Low-Cost Metagenomic Analysis with In-Storage Processing
Nika Mansouri Ghiasi, Mohammad Sadrosadati, Harun Mustafa, Arvid Gollwitzer, Can Firtina, Julien Eudine, Haiyu Mao, Joël Lindegger, Meryem Banu Cavlak, Mohammed Alser, Jisung Park, Onur Mutlu
arXiv - QuanBio - Genomics, 2024-06-27, DOI: arxiv-2406.19113
Abstract
Metagenomics has led to significant advances in many fields. Metagenomic
analysis commonly involves the key tasks of determining the species present in
a sample and their relative abundances. These tasks require searching large
metagenomic databases. Metagenomic analysis incurs significant data
movement overhead because large amounts of low-reuse data must be transferred
out of the storage system. In-storage processing can be a fundamental solution for
reducing this overhead. However, designing an in-storage processing system for
metagenomics is challenging because existing approaches to metagenomic analysis
cannot be directly implemented in storage effectively due to the hardware
limitations of modern SSDs. We propose MegIS, the first in-storage processing
system designed to significantly reduce the data movement overhead of the
end-to-end metagenomic analysis pipeline. MegIS is enabled by our lightweight
design that effectively leverages and orchestrates processing inside and
outside the storage system. We address in-storage processing challenges for
metagenomics via specialized and efficient 1) task partitioning, 2)
data/computation flow coordination, 3) storage technology-aware algorithmic
optimizations, 4) data mapping, and 5) lightweight in-storage accelerators.
MegIS's design is flexible, capable of supporting different types of
metagenomic input datasets, and can be integrated into various metagenomic
analysis pipelines. Our evaluation shows that MegIS outperforms the
state-of-the-art performance- and accuracy-optimized software metagenomic tools
by 2.7$\times$-37.2$\times$ and 6.9$\times$-100.2$\times$, respectively, while
matching the accuracy of the accuracy-optimized tool. MegIS achieves
1.5$\times$-5.1$\times$ speedup compared to the state-of-the-art metagenomic
hardware-accelerated (using processing-in-memory) tool, while achieving
significantly higher accuracy.