针对 Docker 注册表和 Docker 容器存储系统的端到端高性能重复数据删除方案

IF 2.6 3区计算机科学 Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Storage Pub Date : 2024-01-30 DOI:10.1145/3643819

Nannan Zhao, Muhui Lin, Hadeel Albahar, Arnab K. Paul, Zhijie Huang, Subil Abraham, Keren Chen, Vasily Tarasov, Dimitrios Skourtis, Ali Anwar, Ali R. Butt

{"title":"针对 Docker 注册表和 Docker 容器存储系统的端到端高性能重复数据删除方案","authors":"Nannan Zhao, Muhui Lin, Hadeel Albahar, Arnab K. Paul, Zhijie Huang, Subil Abraham, Keren Chen, Vasily Tarasov, Dimitrios Skourtis, Ali Anwar, Ali R. Butt","doi":"10.1145/3643819","DOIUrl":null,"url":null,"abstract":"The wide adoption of Docker containers for supporting agile and elastic enterprise applications has led to a broad proliferation of container images. The associated storage performance and capacity requirements place a high pressure on the infrastructure of container registries that store and distribute images and container storage systems on the Docker client side that manage image layers and store ephemeral data generated at container runtime. The storage demand is worsened by the large amount of duplicate data in images. Moreover, container storage systems that use Copy-on-Write (CoW) file systems as storage drivers exacerbate the redundancy. Exploiting the high file redundancy in real-world images is a promising approach to drastically reduce the growing storage requirements of container registries and improve the space efficiency of container storage systems. However, existing deduplication techniques significantly degrade the performance of both registries and container storage systems because of data reconstruction overhead as well as the deduplication cost. We propose DupHunter, an end-to-end deduplication scheme that deduplicates layers for both Docker registries and container storage systems while maintaining a high image distribution speed and container I/O performance. DupHunter is divided into 3 tiers: registry tier, middle tier, and client tier. Specifically, we first build a high-performance deduplication engine at the registry tier that not only natively deduplicates layers for space savings but also reduces layer restore overhead. Then, we use deduplication offloading at the middle tier to eliminate the redundant files from the client tier and avoid bringing deduplication overhead to the clients. To further reduce the data duplicates caused by CoWs and improve the container I/O performance, we utilize a container-aware storage system at the client tier that reserves space for each container and arranges the placement of files and their modifications on the disk to preserve locality. Under real workloads, DupHunter reduces storage space by up to 6.9 × and reduces the <monospace>GET</monospace> layer latency up to 2.8 × compared to the state-of-the-art. Moreover, DupHunter can improve the container I/O performance by up to 93% for reads and 64% for writes.","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":"94 18 1","pages":""},"PeriodicalIF":2.6000,"publicationDate":"2024-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"An End-to-End High-Performance Deduplication Scheme for Docker Registries and Docker Container Storage Systems\",\"authors\":\"Nannan Zhao, Muhui Lin, Hadeel Albahar, Arnab K. Paul, Zhijie Huang, Subil Abraham, Keren Chen, Vasily Tarasov, Dimitrios Skourtis, Ali Anwar, Ali R. Butt\",\"doi\":\"10.1145/3643819\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The wide adoption of Docker containers for supporting agile and elastic enterprise applications has led to a broad proliferation of container images. The associated storage performance and capacity requirements place a high pressure on the infrastructure of container registries that store and distribute images and container storage systems on the Docker client side that manage image layers and store ephemeral data generated at container runtime. The storage demand is worsened by the large amount of duplicate data in images. Moreover, container storage systems that use Copy-on-Write (CoW) file systems as storage drivers exacerbate the redundancy. Exploiting the high file redundancy in real-world images is a promising approach to drastically reduce the growing storage requirements of container registries and improve the space efficiency of container storage systems. However, existing deduplication techniques significantly degrade the performance of both registries and container storage systems because of data reconstruction overhead as well as the deduplication cost. We propose DupHunter, an end-to-end deduplication scheme that deduplicates layers for both Docker registries and container storage systems while maintaining a high image distribution speed and container I/O performance. DupHunter is divided into 3 tiers: registry tier, middle tier, and client tier. Specifically, we first build a high-performance deduplication engine at the registry tier that not only natively deduplicates layers for space savings but also reduces layer restore overhead. Then, we use deduplication offloading at the middle tier to eliminate the redundant files from the client tier and avoid bringing deduplication overhead to the clients. To further reduce the data duplicates caused by CoWs and improve the container I/O performance, we utilize a container-aware storage system at the client tier that reserves space for each container and arranges the placement of files and their modifications on the disk to preserve locality. Under real workloads, DupHunter reduces storage space by up to 6.9 × and reduces the <monospace>GET</monospace> layer latency up to 2.8 × compared to the state-of-the-art. Moreover, DupHunter can improve the container I/O performance by up to 93% for reads and 64% for writes.\",\"PeriodicalId\":49113,\"journal\":{\"name\":\"ACM Transactions on Storage\",\"volume\":\"94 18 1\",\"pages\":\"\"},\"PeriodicalIF\":2.6000,\"publicationDate\":\"2024-01-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM Transactions on Storage\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1145/3643819\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Storage","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3643819","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}

引用次数: 0

摘要

由于广泛采用 Docker 容器来支持敏捷、灵活的企业应用，导致容器映像大量涌现。相关的存储性能和容量要求给存储和分发映像的容器注册中心以及管理映像层和存储容器运行时生成的短暂数据的 Docker 客户端容器存储系统等基础设施带来了巨大压力。镜像中的大量重复数据加剧了存储需求。此外，使用写入复制（CoW）文件系统作为存储驱动程序的容器存储系统会加剧冗余。利用现实世界图像中的高文件冗余度是一种很有前途的方法，可以大幅降低容器注册表不断增长的存储需求，提高容器存储系统的空间效率。然而，由于数据重建开销和重复数据删除成本，现有的重复数据删除技术大大降低了注册表和容器存储系统的性能。我们提出的 DupHunter 是一种端到端的重复数据删除方案，它可以为 Docker 注册表和容器存储系统重复数据删除层，同时保持较高的镜像分发速度和容器 I/O 性能。DupHunter 分为 3 层：注册表层、中间层和客户端层。具体来说，我们首先在注册层构建了一个高性能重复数据删除引擎，它不仅能原生重复数据删除层以节省空间，还能减少层还原开销。然后，我们在中间层使用重复数据删除卸载来消除客户端层的冗余文件，避免给客户端带来重复数据删除开销。为了进一步减少 CoW 造成的数据重复并提高容器的 I/O 性能，我们在客户端层使用了容器感知存储系统，为每个容器预留空间，并在磁盘上安排文件及其修改的位置，以保持本地性。在实际工作负载下，与最先进的技术相比，DupHunter 最多可减少 6.9 倍的存储空间，最多可减少 2.8 倍的 GET 层延迟。此外，DupHunter 还能将容器的 I/O 性能提高 93%（读取）和 64%（写入）。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

An End-to-End High-Performance Deduplication Scheme for Docker Registries and Docker Container Storage Systems

The wide adoption of Docker containers for supporting agile and elastic enterprise applications has led to a broad proliferation of container images. The associated storage performance and capacity requirements place a high pressure on the infrastructure of container registries that store and distribute images and container storage systems on the Docker client side that manage image layers and store ephemeral data generated at container runtime. The storage demand is worsened by the large amount of duplicate data in images. Moreover, container storage systems that use Copy-on-Write (CoW) file systems as storage drivers exacerbate the redundancy. Exploiting the high file redundancy in real-world images is a promising approach to drastically reduce the growing storage requirements of container registries and improve the space efficiency of container storage systems. However, existing deduplication techniques significantly degrade the performance of both registries and container storage systems because of data reconstruction overhead as well as the deduplication cost.

We propose DupHunter, an end-to-end deduplication scheme that deduplicates layers for both Docker registries and container storage systems while maintaining a high image distribution speed and container I/O performance. DupHunter is divided into 3 tiers: registry tier, middle tier, and client tier. Specifically, we first build a high-performance deduplication engine at the registry tier that not only natively deduplicates layers for space savings but also reduces layer restore overhead. Then, we use deduplication offloading at the middle tier to eliminate the redundant files from the client tier and avoid bringing deduplication overhead to the clients. To further reduce the data duplicates caused by CoWs and improve the container I/O performance, we utilize a container-aware storage system at the client tier that reserves space for each container and arranges the placement of files and their modifications on the disk to preserve locality. Under real workloads, DupHunter reduces storage space by up to 6.9 × and reduces the GET layer latency up to 2.8 × compared to the state-of-the-art. Moreover, DupHunter can improve the container I/O performance by up to 93% for reads and 64% for writes.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

ACM Transactions on Storage COMPUTER SCIENCE, HARDWARE & ARCHITECTURE-COMPUTER SCIENCE, SOFTWARE ENGINEERING

CiteScore

4.20

自引率

5.90%

发文量

审稿时长

>12 weeks

期刊介绍： The ACM Transactions on Storage (TOS) is a new journal with an intent to publish original archival papers in the area of storage and closely related disciplines. Articles that appear in TOS will tend either to present new techniques and concepts or to report novel experiences and experiments with practical systems. Storage is a broad and multidisciplinary area that comprises of network protocols, resource management, data backup, replication, recovery, devices, security, and theory of data coding, densities, and low-power. Potential synergies among these fields are expected to open up new research directions.