Photon: fault-tolerant and scalable joining of continuous data streams

R. Ananthanarayanan, Venkatesh Basker, Sumit Das, A. Gupta, H. Jiang, Tianhao Qiu, Alexey Reznichenko, D.Yu. Ryabkov, Manpreet Singh, S. Venkataraman
DOI: 10.1145/2463676.2465272 (https://doi.org/10.1145/2463676.2465272)
Venue: Proceedings. ACM-SIGMOD International Conference on Management of Data
Published: 2013-06-22
Citations: 146

Abstract

Web-based enterprises process events generated by millions of users interacting with their websites. Rich statistical data distilled from combining such interactions in near real-time generates enormous business value. In this paper, we describe the architecture of Photon, a geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency, where the streams may be unordered or delayed. The system fully tolerates infrastructure degradation and datacenter-level outages without any manual intervention. Photon guarantees that there will be no duplicates in the joined output at any point in time (at-most-once semantics), that most joinable events will be present in the output in real-time (near-exact semantics), and that every joinable event will eventually appear in the output exactly once (exactly-once semantics eventually).

Photon is deployed within the Google Advertising System to join data streams such as web search queries and user clicks on advertisements. It produces joined logs that are used to derive key business metrics, including billing for advertisers. Our production deployment processes millions of events per minute at peak with an average end-to-end latency of less than 10 seconds. We also present challenges and solutions in maintaining large persistent state across geographically distant locations, and highlight the design principles that emerged from our experience.
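To make the guarantees in the abstract concrete, the following is a minimal single-process sketch of joining a click stream against a query stream when events may arrive out of order or be delivered more than once. It is an illustration only, not Photon's implementation: the class `Joiner`, its method names, and the in-memory set standing in for persistent, replicated state are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Joiner:
    """Toy in-memory join of clicks to queries (hypothetical, not Photon's API)."""
    queries: dict = field(default_factory=dict)    # query_id -> query payload
    pending: dict = field(default_factory=dict)    # click_id -> (query_id, click payload)
    joined_ids: set = field(default_factory=set)   # click_ids already in the output
    output: list = field(default_factory=list)     # joined records

    def on_query(self, query_id, payload):
        self.queries[query_id] = payload
        # Retry clicks that arrived before their query (delayed or unordered streams).
        for click_id, (qid, click) in list(self.pending.items()):
            if qid == query_id:
                del self.pending[click_id]
                self._emit(click_id, qid, click)

    def on_click(self, click_id, query_id, payload):
        if click_id in self.joined_ids:
            return  # redelivered click: already joined, so skip it
        if query_id not in self.queries:
            # Query not seen yet: park the click instead of dropping it.
            self.pending[click_id] = (query_id, payload)
            return
        self._emit(click_id, query_id, payload)

    def _emit(self, click_id, query_id, click):
        # Record the click ID before writing, so each click is output at most once.
        if click_id in self.joined_ids:
            return
        self.joined_ids.add(click_id)
        self.output.append((click_id, self.queries[query_id], click))


j = Joiner()
j.on_click("c1", "q1", "ad-7")   # click arrives before its query
j.on_query("q1", "shoes")        # query arrives; the parked click is joined
j.on_click("c1", "q1", "ad-7")   # redelivered click is ignored
print(j.output)                  # [('c1', 'shoes', 'ad-7')]
```

In this sketch the set of joined click IDs is what prevents duplicates, and the pending map is what lets a late query still produce its join. A distributed system would need that state to survive machine and datacenter failures, which is the hard part the paper addresses and which this example does not attempt.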