{"title":"Datalog redux: experience and conjecture","authors":"J. Hellerstein","doi":"10.1145/1807085.1807087","DOIUrl":null,"url":null,"abstract":"There is growing urgency in computer science circles regarding an impending crisis in parallel programming. Emerging computing platforms, from multicore processors to cloud computing, predicate their performance growth on the development of software to harness parallelism. For the first time in the history of computing, the progress of Moore's Law depends on the productivity of software engineers. Unfortunately, parallel and distributed programming today is challenging even for the best programmers, and simply unworkable for the majority. There has never been a more urgent need for breakthroughs in programming models and languages.\n While parallel programming in general is considered very difficult, data parallelism has been very successful. The relational algebra parallelizes easily over large datasets, and SQL programmers have long reaped the benefits of parallelism without modifications to their code. This point has been rediscovered and amplified via recent enthusiasm for MapReduce programming and \"Big Data\", which have turned data parallelism into common culture across computing.\n As a result, it is increasingly attractive to tackle the challenge of parallel programming on the firm common ground of data parallelism: start with an easy-to-parallelize kernel-relational algebra-and extend it to general-purpose computation. This approach has clear precedents in database theory, where it has long been known that classical relational languages have natural Turing-complete extensions.\n At the same time that this crisis has been evolving, variants of Datalog have been seen cropping up in a wide range of practical settings, from security to robotics to compiler analysis. 
Over the past seven years, we have been exploring the use of Datalog-inspired languages in a variety of systems projects, with a focus on inherently parallel tasks in networking and distributed systems. The experience has been largely positive: we have demonstrated full-featured Datalog-based system implementations that are orders of magnitude more compact than equivalent imperatively-implemented systems, with competitive performance and significantly accelerated software evolution. Evidence is mounting that Datalog can serve as the basis of a much simpler family of languages for programming serious parallel and distributed software.\n This raises many questions that should warm the heart of a database theoretician. How does the complexity hierarchy of logic languages relate to parallel models of computation? Is there a suitable Coordination Complexity model that captures the realities of modern parallel hardware, where computation is cheap and coordination is expensive? Can the lens of logic provide better focus on what is \"hard\" to parallelize, what is \"embarrassingly parallel\", and points in between? Does our understanding of non-monotonic reasoning shed light on the ability of loosely-coupled distributed systems to guarantee eventual consistency? And finally, a question close to the heart of the PODS conference: if Datalog has been The Answer all these years, is parallel and distributed programming The Question it has been waiting for?\n In this talk and the paper that accompanies it, I present design patterns that arose in our experience building distributed and parallel software in the style of Datalog, and use them to motivate some initial conjectures relating to the questions above.\n The full paper was not available at the time these proceedings were printed, but can be found online by searching for the phrase \"Springtime for Datalog\".","PeriodicalId":92118,"journal":{"name":"Proceedings of the ... 
ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems","volume":"15 1","pages":"1-2"},"PeriodicalIF":0.0000,"publicationDate":"2010-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"26","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ... ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1807085.1807087","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 26
Abstract
There is growing urgency in computer science circles regarding an impending crisis in parallel programming. Emerging computing platforms, from multicore processors to cloud computing, predicate their performance growth on the development of software to harness parallelism. For the first time in the history of computing, the progress of Moore's Law depends on the productivity of software engineers. Unfortunately, parallel and distributed programming today is challenging even for the best programmers, and simply unworkable for the majority. There has never been a more urgent need for breakthroughs in programming models and languages.
While parallel programming in general is considered very difficult, data parallelism has been very successful. The relational algebra parallelizes easily over large datasets, and SQL programmers have long reaped the benefits of parallelism without modifications to their code. This point has been rediscovered and amplified via recent enthusiasm for MapReduce programming and "Big Data", which have turned data parallelism into common culture across computing.
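The point about data parallelism can be made concrete with a small sketch (my illustration, not from the paper): a relational aggregate such as SUM GROUP BY can be computed independently on each data partition and then merged, which is exactly why SQL queries parallelize without changes to the query text. The partitioning into two "nodes" here is simulated with a thread pool.

```python
# Illustrative sketch: SUM(amount) GROUP BY customer, computed per
# partition and merged. Real systems shard partitions across machines;
# a thread pool stands in for that here.
from multiprocessing.dummy import Pool

orders = [("alice", 30), ("bob", 20), ("alice", 15), ("carol", 5), ("bob", 25)]

def partial_sums(partition):
    """Aggregate one partition: SUM(amount) GROUP BY customer."""
    acc = {}
    for customer, amount in partition:
        acc[customer] = acc.get(customer, 0) + amount
    return acc

def merge(a, b):
    """Merging partial aggregates equals a single serial pass over all rows."""
    out = dict(a)
    for k, v in b.items():
        out[k] = out.get(k, 0) + v
    return out

partitions = [orders[:2], orders[2:]]   # pretend these live on two nodes
with Pool(2) as pool:
    parts = pool.map(partial_sums, partitions)
result = merge(parts[0], parts[1])
# result == {"alice": 45, "bob": 45, "carol": 5}
```

The key property is that the merge step is associative and commutative, so the same answer comes back regardless of how the data was partitioned.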
As a result, it is increasingly attractive to tackle the challenge of parallel programming on the firm common ground of data parallelism: start with an easy-to-parallelize kernel (the relational algebra) and extend it to general-purpose computation. This approach has clear precedents in database theory, where it has long been known that classical relational languages have natural Turing-complete extensions.
At the same time that this crisis has been evolving, variants of Datalog have been seen cropping up in a wide range of practical settings, from security to robotics to compiler analysis. Over the past seven years, we have been exploring the use of Datalog-inspired languages in a variety of systems projects, with a focus on inherently parallel tasks in networking and distributed systems. The experience has been largely positive: we have demonstrated full-featured Datalog-based system implementations that are orders of magnitude more compact than equivalent imperatively-implemented systems, with competitive performance and significantly accelerated software evolution. Evidence is mounting that Datalog can serve as the basis of a much simpler family of languages for programming serious parallel and distributed software.
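The flavor of those Datalog-style networking programs can be suggested with the classic two-rule reachability program and a naive bottom-up evaluator (a hedged sketch of standard Datalog semantics; the relation names are illustrative and the paper's own languages are not shown here):

```python
# Naive bottom-up fixpoint for the two-rule Datalog reachability program:
#   reach(X, Y) :- link(X, Y).
#   reach(X, Z) :- reach(X, Y), link(Y, Z).
link = {("a", "b"), ("b", "c"), ("c", "d")}

def reachable(link):
    reach = set(link)                     # rule 1: every link is reachable
    while True:
        # rule 2: join reach with link on the shared variable Y
        new = {(x, z) for (x, y) in reach for (y2, z) in link if y == y2}
        if new <= reach:                  # fixpoint: no new facts derivable
            return reach
        reach |= new

print(sorted(reachable(link)))
```

Two declarative rules replace the explicit graph traversal an imperative implementation would need, which is the compactness the paragraph above describes; each fixpoint round is itself a join, i.e., data-parallel work.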
This raises many questions that should warm the heart of a database theoretician. How does the complexity hierarchy of logic languages relate to parallel models of computation? Is there a suitable Coordination Complexity model that captures the realities of modern parallel hardware, where computation is cheap and coordination is expensive? Can the lens of logic provide better focus on what is "hard" to parallelize, what is "embarrassingly parallel", and points in between? Does our understanding of non-monotonic reasoning shed light on the ability of loosely-coupled distributed systems to guarantee eventual consistency? And finally, a question close to the heart of the PODS conference: if Datalog has been The Answer all these years, is parallel and distributed programming The Question it has been waiting for?
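The question about non-monotonic reasoning and eventual consistency has a simple intuition that can be sketched in code (my illustration, not the paper's formalism): monotone state, such as a grow-only set of facts, can be merged in any delivery order on any replica, and all replicas converge without coordination. It is non-monotone operations, such as deletion or aggregation over a "final" set, that force replicas to coordinate.

```python
# Illustrative sketch: set union is commutative, associative, and idempotent,
# so every delivery order of these updates yields the same replica state.
from itertools import permutations

updates = [{"p"}, {"q"}, {"p", "r"}]      # facts arriving at different replicas

def merge_all(order):
    state = set()
    for u in order:
        state |= u                        # monotone: facts only accumulate
    return state

states = {frozenset(merge_all(order)) for order in permutations(updates)}
assert len(states) == 1                   # all delivery orders agree
```

Monotone logic programs grow their conclusions the same way, which is why the lens of monotonicity is a natural candidate for separating coordination-free computation from computation that must synchronize.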
In this talk and the paper that accompanies it, I present design patterns that arose in our experience building distributed and parallel software in the style of Datalog, and use them to motivate some initial conjectures relating to the questions above.
The full paper was not available at the time these proceedings were printed, but can be found online by searching for the phrase "Springtime for Datalog".