Cross-Language Optimizations in Big Data Systems: A Case Study of SCOPE

2018 IEEE/ACM 40th International Conference on Software Engineering: Software Engineering in Practice Track (ICSE-SEIP). Pub Date: 2017-05-01. DOI: 10.1145/3183519.3183528

Marija Selakovic, Mike Barnett, Madan Musuvathi, Todd Mytkowicz


Citations: 2

Abstract

Building scalable big data programs currently requires programmers to combine relational code (SQL) with non-relational code (Java, C#, Scala). Relational code is declarative: a program describes what the computation is, and the compiler decides how to distribute the program. SQL query optimization has enjoyed a rich and fruitful history; however, most research and commercial optimization engines treat non-relational code as a black box and are thus unable to optimize it. This paper empirically studies over 3 million SCOPE programs across five data centers within Microsoft and finds that programs with non-relational code take between 45% and 70% of data center CPU time. We further explore the potential for SCOPE optimization by generating more native code from the non-relational part. Finally, we present six case studies showing that triggering more generation of native code in these jobs yields significant performance improvement: optimizing just one portion resulted in as much as a 25% improvement for an entire program.

