Reliable adaptable Network RAM

2008 IEEE International Conference on Cluster Computing. Pub Date: 2008-10-31. DOI: 10.1109/CLUSTR.2008.4663750

T. Newhall, D. Amato, A. Pshenichkin


Citations: 17

Abstract

We present reliability solutions for adaptable network RAM systems running on general-purpose clusters. Network RAM allows nodes with over-committed memory to swap pages over the network, storing them in the idle RAM of other nodes and avoiding swapping to slow, local disk. An adaptable network RAM system adjusts the amount of RAM currently available for storing remotely swapped pages in response to changes in nodes' local RAM usage. It is important that network RAM systems provide reliability for remotely swapped page data. Without reliability, a single node failure can result in failure of unrelated processes running on other nodes by losing their remotely swapped pages. Adaptable network RAM systems pose extra difficulties in providing reliability because each node's capacity for storing remotely swapped pages changes over time, and because pages may move from node to node in response to these changes. Our novel dynamic RAID-based reliability solutions use idle RAM for storing page and reliability data, avoiding using slow disk for reliability. They are designed to work with the adaptive nature of our network RAM system (Nswap), allowing page and reliability data to migrate from node to node and allowing pages to be added to or removed from different parity groups. Additionally, page recovery runs concurrently with cluster applications, so that cluster applications do not have to wait until all data from a failed node is recovered before resuming execution. We present results comparing Nswap to disk swapping for a set of benchmarks running on our gigabit cluster. Our results show that reliable Nswap is up to 32 times faster than swapping to disk, and that there is virtually no impact on the performance of applications as they run concurrently with page recovery.
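The abstract does not spell out how the parity groups work, so the following is only a minimal sketch of the general idea behind RAID-style parity with changing group membership, not the Nswap implementation. It assumes plain XOR parity over fixed-size pages; the `ParityGroup` class, its method names, and the page identifiers are all hypothetical and chosen for illustration.

```python
from functools import reduce

PAGE_SIZE = 4096  # assumed page size in bytes; not taken from the paper


def xor_pages(a: bytes, b: bytes) -> bytes:
    """Return the byte-wise XOR of two equal-length pages."""
    if len(a) != len(b):
        raise ValueError("pages must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


class ParityGroup:
    """Hypothetical parity group: a set of data pages plus one XOR parity page.

    In a real system each data page and the parity page would live in the
    idle RAM of different nodes; here they are kept in one object for clarity.
    """

    def __init__(self, page_size: int = PAGE_SIZE) -> None:
        self.page_size = page_size
        self.pages = {}                  # page id -> page contents
        self.parity = bytes(page_size)   # XOR of all members (zeros when empty)

    def add(self, page_id: str, data: bytes) -> None:
        """Add a page to the group by folding it into the parity page."""
        if page_id in self.pages:
            raise KeyError(f"page {page_id!r} is already in the group")
        self.parity = xor_pages(self.parity, data)
        self.pages[page_id] = data

    def remove(self, page_id: str) -> bytes:
        """Remove a page; XOR-ing it again cancels its contribution to parity."""
        data = self.pages.pop(page_id)
        self.parity = xor_pages(self.parity, data)
        return data

    def recover(self, lost_id: str) -> bytes:
        """Rebuild one lost page from the parity page and the surviving pages."""
        if lost_id not in self.pages:
            raise KeyError(f"page {lost_id!r} is not in the group")
        survivors = (d for pid, d in self.pages.items() if pid != lost_id)
        return reduce(xor_pages, survivors, self.parity)


if __name__ == "__main__":
    group = ParityGroup(page_size=4)
    group.add("a", b"\x01\x02\x03\x04")
    group.add("b", b"\x10\x20\x30\x40")
    group.add("c", b"\xff\x00\xff\x00")
    # Pretend the node holding page "b" failed and rebuild it from the rest.
    print(group.recover("b") == b"\x10\x20\x30\x40")
```

Because XOR is its own inverse, adding and removing a page are the same operation on the parity page. That property is what makes it cheap, in this sketch, for a page to leave one group and join another when it migrates between nodes.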



