Caches All the Way Down: Infrastructure for Data Intensive Science

Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing. Pub Date: 2017-06-26. DOI: 10.1145/3078597.3091525

D. Abramson


Citation count: 1

Abstract

The rise of big data science has created new demands for modern computer systems. While floating-point performance has driven computer architecture and system design for the past few decades, there is renewed interest in the speed at which data can be ingested and processed. Early exemplars such as Gordon, the NSF-funded system at the San Diego Supercomputer Center, shifted the focus from pure floating-point performance to memory and I/O rates. At the University of Queensland we have continued this trend with the design of FlashLite, a parallel cluster equipped with large amounts of main memory, flash disk, and a distributed shared memory system (ScaleMP's vSMP). This allows applications to place data "close" to the processor, which speeds up processing. Further, we have built a geographically distributed, multi-tier hierarchical data fabric called MeDiCI, which provides an abstraction of very large data stores across the metropolitan area. MeDiCI leverages industry solutions such as IBM's Spectrum Scale and SGI's DMF platforms. Caching underpins both FlashLite and MeDiCI. In this talk I will describe the design decisions and illustrate some early application studies that benefit from the approach. I will also highlight some of the challenges that need to be solved for this approach to become mainstream.
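
The abstract does not give implementation details, so the following is only a minimal sketch of the general idea of a multi-tier cache hierarchy, not a description of how FlashLite or MeDiCI actually work. It assumes a read-through design with least-recently-used eviction in each tier; the class names `Tier` and `TieredStore`, the tier names, and the capacities are all hypothetical and chosen for illustration.

```python
from collections import OrderedDict


class Tier:
    """One cache level with a fixed capacity and LRU eviction (an assumption)."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        """Return the cached value and mark it as most recently used."""
        self.items.move_to_end(key)
        return self.items[key]

    def put(self, key, value):
        """Store a value, evicting the least recently used entry if full."""
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)


class TieredStore:
    """Hypothetical read-through hierarchy: fast tiers in front of a slow store."""

    def __init__(self, tiers, backing):
        self.tiers = tiers      # ordered from fastest to slowest
        self.backing = backing  # a dict standing in for the slowest data store

    def read(self, key):
        """Return (value, name of the level that served the read)."""
        for depth, tier in enumerate(self.tiers):
            if key in tier.items:
                value = tier.get(key)
                # Promote the data into every faster tier, "closer" to the CPU.
                for faster in self.tiers[:depth]:
                    faster.put(key, value)
                return value, tier.name
        # Miss in every tier: fetch from the backing store and fill all tiers.
        value = self.backing[key]
        for tier in self.tiers:
            tier.put(key, value)
        return value, "backing"
```

For example, a store built as `TieredStore([Tier("memory", 1), Tier("flash", 2)], {"a": 1, "b": 2})` serves the first read of `"a"` from the backing store and a repeated read from the `memory` tier. Promotion on a hit is one possible policy among several; real systems may place or migrate data quite differently.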

