Challenges to Achieving High Availability at Scale

Proceedings of the 11th ACM International Conference on Distributed and Event-based Systems. Pub Date: 2017-06-08. DOI: 10.1145/3093742.3097270

W. Schulte



Abstract

Facebook is a social network that connects more than 1.8 billion people. Serving this many users requires an infrastructure composed of thousands of interdependent systems that span geographically distributed data centers. But what is the guiding principle for building and operating these systems? For Facebook’s infrastructure teams the answer is: systems must always be available and never lose data. This talk explores that quest, focusing on three aspects.

Availability and consistency. What form of consistency do Facebook’s systems guarantee? Strong consistency is easy to understand but incurs latency penalties; weak consistency is fast but difficult for developers and users to reason about. We describe our use of eventual consistency and delve into how Facebook constructs its caching and replicated storage systems to minimize the time needed to reach consistency. We share empirical data that measures the effectiveness of our design.

Availability and correctness. With network partitions, relaxed forms of consistency, and software bugs, how do we guarantee a consistent state? We present two systems, one batch and one real-time, that find and repair structural errors in Facebook’s social graph.

Availability and scale. Sharding is one of the standard answers to operating at scale. But how can we develop one system that can shard storage as well as compute? We will introduce a new Sharding-as-a-Service component. We will show and evaluate how its design and service policies control for latency, failure tolerance, and operational efficiency.

1998 ACM Subject Classification: Computer; C.1.4 Distributed Architectures; C.2.4 Distributed Systems; C.4 Fault Tolerance, Reliability, Availability and Serviceability; D.1.3 Distributed Programming; D.4.7 Distributed Systems; E.1 Distributed Data Structures


