Big Data Processing at Microsoft: Hyper Scale, Massive Complexity, and Minimal Cost

Hiren Patel, Alekh Jindal, C. Szyperski
{"title":"微软的大数据处理:超大规模、巨大复杂性和最低成本","authors":"Hiren Patel, Alekh Jindal, C. Szyperski","doi":"10.1145/3357223.3366029","DOIUrl":null,"url":null,"abstract":"The past decade has seen a tremendous interest in large-scale data processing at Microsoft. Typical scenarios include building business-critical pipelines such as advertiser feedback loop, index builder, and relevance/ranking algorithms for Bing; analyzing user experience telemetry for Office, Windows or Xbox; and gathering recommendations for products like Windows and Xbox. To address these needs a first-party big data analytics platform, referred to as Cosmos, was developed in the early 2010s at Microsoft. Cosmos makes it possible to store data at exabyte scale and process in a serverless form factor, with SCOPE [4] being the query processing workhorse. Over time, however, several newer challenges have emerged, requiring major technical innovations in Cosmos to meet these newer demands. In this abstract, we describe three such challenges from the query processing viewpoint, and our approaches to handling them. Hyper Scale. Cosmos has witnessed a significant growth in usage from its early days, from the number of customers (starting from Bing to almost every single business unit at Microsoft today), to the volume of data processed (from petabytes to exabytes today), to the amount of processing done (from tens of thousands of SCOPE jobs to hundreds of thousands of jobs today, across hundreds of thousands of machines). Even a single job can consume tens of petabytes of data and produce similar volumes of data by running millions of tasks in parallel. Our approach to handle this unprecedented scale is two fold. First, we decoupled and disaggregated the query processor from storage and resource management components, thereby allowing different components in the Cosmos stack to scale independently. Second, we scaled the data movement in the SCOPE query processor with quasilinear complexity [2]. This is crucial since data movement is often the most expensive step, and hence the bottleneck, in massive-scale data processing. Massive Complexity. Cosmos workloads are also highly complex. Thanks to adoption across the whole of Microsoft, Cosmos needs to support workloads that are representative of multiple industry segments, including search engine (Bing), operating system (Windows), workplace productivity (Office), personal computing (Surface), gaming (XBox), etc. To handle such diverse workloads, our approach has been to provide a one-size-fits-all experience. First of all, to make it easy for the customers to express their computations, SCOPE supports different types of queries, from batch to interactive to streaming and machine learning. Second, SCOPE supports both structured and unstructured data processing. Likewise, multiple data formats, including both propriety and open source source such as Parquet, are supported. Third, users can write business logic using a mix of declarative and imperative languages, over even different imperative languages such as C# and Python, in the same job. Furthermore, users can express all of the above in simple data flow style computation for better readability and maintainability. Finally, considering the diverse workload mix inside Microsoft, we have come to realization that it is not possible to fits all scenarios using SCOPE. Therefore, we also support the popular Spark query processing engine. 
Overall, the one-size-fits-all query processing experience in Cosmos covers very diverse workloads, including data formats, programming languages, and the backend engines. Minimal Cost. While scale and complexity are hard by themselves, the biggest challenge is to achieve all of that at minimal cost. In fact, there is a pressing need to improve Cosmos efficiency and reduce operational costs. This is challenging due to several reasons. First, optimizing a SCOPE job is hard considering that the SCOPE DAGs are super large (up to 1000s of operators in single job!), and the optimization estimates (cardinality, cost, etc.) are often way off from the actuals. Second, SCOPE optimizes a given query, while the operational costs depend on the overall workload. Therefore workload optimization becomes very important. And finally, SCOPE jobs are typically interlinked in data pipelines, i.e., the output of one job is consumed by other jobs. This means that workload optimization needs to be aware of these dependencies. Our approach is to develop a feedback loop to learn from past workloads in order to optimize the future ones. Specifically, we leverage machine learning to learn models for optimizing individual jobs [3], apply multi-query optimizations to optimize the costs of overall workload [1], and build dependency graphs to identify and optimize for the data pipelines.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"4 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2019-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Big Data Processing at Microsoft: Hyper Scale, Massive Complexity, and Minimal Cost\",\"authors\":\"Hiren Patel, Alekh Jindal, C. Szyperski\",\"doi\":\"10.1145/3357223.3366029\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The past decade has seen a tremendous interest in large-scale data processing at Microsoft. Typical scenarios include building business-critical pipelines such as advertiser feedback loop, index builder, and relevance/ranking algorithms for Bing; analyzing user experience telemetry for Office, Windows or Xbox; and gathering recommendations for products like Windows and Xbox. To address these needs a first-party big data analytics platform, referred to as Cosmos, was developed in the early 2010s at Microsoft. Cosmos makes it possible to store data at exabyte scale and process in a serverless form factor, with SCOPE [4] being the query processing workhorse. Over time, however, several newer challenges have emerged, requiring major technical innovations in Cosmos to meet these newer demands. In this abstract, we describe three such challenges from the query processing viewpoint, and our approaches to handling them. Hyper Scale. Cosmos has witnessed a significant growth in usage from its early days, from the number of customers (starting from Bing to almost every single business unit at Microsoft today), to the volume of data processed (from petabytes to exabytes today), to the amount of processing done (from tens of thousands of SCOPE jobs to hundreds of thousands of jobs today, across hundreds of thousands of machines). Even a single job can consume tens of petabytes of data and produce similar volumes of data by running millions of tasks in parallel. Our approach to handle this unprecedented scale is two fold. 
First, we decoupled and disaggregated the query processor from storage and resource management components, thereby allowing different components in the Cosmos stack to scale independently. Second, we scaled the data movement in the SCOPE query processor with quasilinear complexity [2]. This is crucial since data movement is often the most expensive step, and hence the bottleneck, in massive-scale data processing. Massive Complexity. Cosmos workloads are also highly complex. Thanks to adoption across the whole of Microsoft, Cosmos needs to support workloads that are representative of multiple industry segments, including search engine (Bing), operating system (Windows), workplace productivity (Office), personal computing (Surface), gaming (XBox), etc. To handle such diverse workloads, our approach has been to provide a one-size-fits-all experience. First of all, to make it easy for the customers to express their computations, SCOPE supports different types of queries, from batch to interactive to streaming and machine learning. Second, SCOPE supports both structured and unstructured data processing. Likewise, multiple data formats, including both propriety and open source source such as Parquet, are supported. Third, users can write business logic using a mix of declarative and imperative languages, over even different imperative languages such as C# and Python, in the same job. Furthermore, users can express all of the above in simple data flow style computation for better readability and maintainability. Finally, considering the diverse workload mix inside Microsoft, we have come to realization that it is not possible to fits all scenarios using SCOPE. Therefore, we also support the popular Spark query processing engine. Overall, the one-size-fits-all query processing experience in Cosmos covers very diverse workloads, including data formats, programming languages, and the backend engines. Minimal Cost. While scale and complexity are hard by themselves, the biggest challenge is to achieve all of that at minimal cost. In fact, there is a pressing need to improve Cosmos efficiency and reduce operational costs. This is challenging due to several reasons. First, optimizing a SCOPE job is hard considering that the SCOPE DAGs are super large (up to 1000s of operators in single job!), and the optimization estimates (cardinality, cost, etc.) are often way off from the actuals. Second, SCOPE optimizes a given query, while the operational costs depend on the overall workload. Therefore workload optimization becomes very important. And finally, SCOPE jobs are typically interlinked in data pipelines, i.e., the output of one job is consumed by other jobs. This means that workload optimization needs to be aware of these dependencies. Our approach is to develop a feedback loop to learn from past workloads in order to optimize the future ones. Specifically, we leverage machine learning to learn models for optimizing individual jobs [3], apply multi-query optimizations to optimize the costs of overall workload [1], and build dependency graphs to identify and optimize for the data pipelines.\",\"PeriodicalId\":91949,\"journal\":{\"name\":\"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... 
SoCC (Conference)\",\"volume\":\"4 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-11-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3357223.3366029\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3357223.3366029","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3

Abstract

The past decade has seen tremendous interest in large-scale data processing at Microsoft. Typical scenarios include building business-critical pipelines, such as the advertiser feedback loop, the index builder, and the relevance/ranking algorithms for Bing; analyzing user-experience telemetry for Office, Windows, or Xbox; and gathering recommendations for products like Windows and Xbox. To address these needs, a first-party big data analytics platform, referred to as Cosmos, was developed at Microsoft in the early 2010s. Cosmos makes it possible to store data at exabyte scale and process it in a serverless form factor, with SCOPE [4] as the query processing workhorse. Over time, however, several newer challenges have emerged, requiring major technical innovations in Cosmos. In this abstract, we describe three such challenges from the query processing viewpoint, and our approaches to handling them.

Hyper Scale. Cosmos has witnessed significant growth in usage since its early days: in the number of customers (from Bing initially to almost every business unit at Microsoft today), in the volume of data processed (from petabytes then to exabytes today), and in the amount of processing done (from tens of thousands of SCOPE jobs to hundreds of thousands of jobs today, across hundreds of thousands of machines). Even a single job can consume tens of petabytes of data and produce similar volumes by running millions of tasks in parallel. Our approach to handling this unprecedented scale is twofold. First, we decoupled and disaggregated the query processor from the storage and resource management components, thereby allowing different components of the Cosmos stack to scale independently. Second, we scaled data movement in the SCOPE query processor to quasilinear complexity [2]. This is crucial since data movement is often the most expensive step, and hence the bottleneck, in massive-scale data processing.
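To make the quasilinear data-movement point concrete, here is a toy Python sketch (our own illustration, not the actual SCOPE shuffle algorithm of [2]): it compares the number of network streams a naive all-to-all shuffle needs against a hypothetical two-stage shuffle that routes data through a sqrt(n) x sqrt(n) grid, trading one extra pass over the data for asymptotically fewer streams.

```python
import math

def all_to_all_streams(n):
    # Naive shuffle: each of the n producers opens a stream to
    # each of the n consumers, so the stream count grows as n^2.
    return n * n

def two_stage_streams(n):
    # Hypothetical two-stage shuffle: arrange the n nodes in a
    # sqrt(n) x sqrt(n) grid, shuffle along rows first, then along
    # columns. Each stage needs about n * sqrt(n) streams, so the
    # total grows as O(n^1.5) instead of O(n^2).
    g = math.isqrt(n)
    return 2 * n * g

for n in (1_000, 100_000, 300_000):
    print(f"n={n:>7}: all-to-all={all_to_all_streams(n):.2e}, "
          f"two-stage={two_stage_streams(n):.2e}")
```

At n = 300,000 machines the naive plan needs roughly 9e10 streams while the two-stage plan needs about 3.3e8, which is why multi-stage shuffling pays off despite the extra materialization step.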
Massive Complexity. Cosmos workloads are also highly complex. Thanks to adoption across the whole of Microsoft, Cosmos needs to support workloads representative of multiple industry segments, including search (Bing), operating systems (Windows), workplace productivity (Office), personal computing (Surface), gaming (Xbox), etc. To handle such diverse workloads, our approach has been to provide a one-size-fits-all experience. First, to make it easy for customers to express their computations, SCOPE supports different types of queries, from batch to interactive to streaming and machine learning. Second, SCOPE supports both structured and unstructured data processing. Likewise, multiple data formats are supported, both proprietary and open-source ones such as Parquet. Third, users can write business logic using a mix of declarative and imperative languages, or even different imperative languages such as C# and Python, in the same job (see the sketch below). Furthermore, users can express all of the above in a simple dataflow-style computation for better readability and maintainability. Finally, considering the diverse workload mix inside Microsoft, we have come to the realization that SCOPE alone cannot fit every scenario; therefore, we also support the popular Spark query processing engine. Overall, the one-size-fits-all query processing experience in Cosmos covers very diverse workloads, spanning data formats, programming languages, and backend engines.
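As an illustration of mixing declarative and imperative code in one job, here is a small PySpark sketch (we use Spark since the platform supports it; the path, column names, and threshold are invented for the example). A declarative SQL aggregation calls imperative Python business logic registered as a UDF:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("mixed-style-job").getOrCreate()

# Declarative part: load telemetry and expose it to SQL.
events = spark.read.parquet("/data/telemetry")  # hypothetical path
events.createOrReplaceTempView("events")

# Imperative part: arbitrary business logic in plain Python,
# registered as a UDF so the SQL query below can call it.
def latency_tier(latency_ms):
    if latency_ms is None:
        return "unknown"
    return "slow" if latency_ms > 500 else "fast"

spark.udf.register("latency_tier", latency_tier, StringType())

# The declarative query invokes the imperative logic inline.
result = spark.sql("""
    SELECT latency_tier(latency_ms) AS tier, COUNT(*) AS n
    FROM events
    GROUP BY latency_tier(latency_ms)
""")
result.show()
```

SCOPE scripts follow the same pattern, interleaving SQL-like statements with user code written in C# or Python within a single job.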
Minimal Cost. While scale and complexity are hard by themselves, the biggest challenge is to achieve all of the above at minimal cost. In fact, there is a pressing need to improve Cosmos efficiency and reduce operational costs. This is challenging for several reasons. First, optimizing a SCOPE job is hard, considering that SCOPE DAGs are very large (up to thousands of operators in a single job!) and the optimizer's estimates (cardinality, cost, etc.) are often far off from the actuals. Second, SCOPE optimizes one query at a time, while operational costs depend on the overall workload; workload optimization therefore becomes very important. Finally, SCOPE jobs are typically interlinked into data pipelines, i.e., the output of one job is consumed by other jobs, which means that workload optimization needs to be aware of these dependencies. Our approach is to develop a feedback loop that learns from past workloads in order to optimize future ones. Specifically, we leverage machine learning to learn models for optimizing individual jobs [3], apply multi-query optimizations to reduce the cost of the overall workload [1], and build dependency graphs to identify and optimize data pipelines.
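The core of the feedback loop can be pictured as follows: mine operator-level statistics from past executions, train a model that corrects the optimizer's estimates, and apply it to similar operators in future jobs. The sketch below shows this idea in miniature (features, data, and the model choice are invented for illustration; the learned models in [3] are considerably richer):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical per-operator features mined from past job logs:
# (input rows, number of predicates, optimizer's selectivity estimate).
X = np.array([
    [1e6, 1, 0.50],
    [5e7, 2, 0.10],
    [2e8, 3, 0.05],
    [1e9, 2, 0.20],
    [4e9, 1, 0.30],
])
# Actual output cardinalities observed at execution time.
y = np.array([4.8e5, 6.2e6, 1.1e7, 1.9e8, 1.3e9])

# Train in log space, since cardinalities span many orders of magnitude.
model = GradientBoostingRegressor().fit(np.log1p(X), np.log1p(y))

def predict_cardinality(rows_in, num_predicates, est_selectivity):
    # Feedback-corrected estimate for a future, similar operator.
    x = np.log1p(np.array([[rows_in, num_predicates, est_selectivity]]))
    return float(np.expm1(model.predict(x))[0])

print(f"predicted output rows: {predict_cardinality(3e8, 2, 0.10):.3g}")
```

Feeding such corrected estimates back into the optimizer narrows the gap between estimated and actual costs, which is precisely the single-job optimization pain point described above.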