Empowering big data analytics with polystore and strongly typed functional queries

Proceedings of the 24th Symposium on International Database Engineering & Applications, Pub Date: 2020-08-12, DOI: 10.1145/3410566.3410591

Annabelle Gillet, É. Leclercq, M. Savonnet, N. Cullot


Citations: 2

Abstract

Polystores are of primary importance in tackling the diversity and volume of Big Data, as they store data according to specific use cases. Nevertheless, analytics frameworks often lack a uniform interface that gives full access to the various models offered by the polystore and takes advantage of them. It should also be ensured that the typing of algebraic expressions built with data manipulation operators can be checked, and that their schema can be inferred, before the operators start to execute (type safety). Tensors are good candidates for supporting a pivot data model. They are powerful abstract mathematical objects that can embed complex relationships between entities and that are used in major analytics frameworks. However, they are far removed from data models and lack high-level operators to manipulate their content, which results in bad coding habits, reduced maintainability, and sometimes poor performance. With TDM (Tensor Data Model), we propose to join the best of both worlds by adding schema and data manipulation operators to tensors, thereby taking advantage of their modeling capabilities. We developed an implementation in Scala using Spark, providing users with a type-safety and schema inference mechanism that guarantees the technical and functional correctness of composed expressions on tensors at compile time. We show that this extension does not induce overhead and makes it possible to outperform the Spark query optimizer by using bind join.
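To make the idea of a tensor that carries a schema more concrete, the following is a minimal sketch in Python. It is not the TDM API and not the authors' Scala implementation: every class and function name here (`Schema`, `Tensor`, `Source`, `Select`, `Project`) is hypothetical, and the toy data is invented for illustration. The abstract describes checks performed at compile time; this sketch only approximates that by validating dimension names and types, and inferring the output schema, when an expression is built and before any operator runs.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Schema:
    """Ordered, named, typed dimensions of a tensor (hypothetical helper)."""
    dims: Tuple[Tuple[str, type], ...]

    def names(self):
        return tuple(name for name, _ in self.dims)


class Tensor:
    """Sparse tensor: each cell maps a tuple of dimension values to a number."""
    def __init__(self, schema, cells):
        self.schema = schema
        self.cells = dict(cells)


class Source:
    """Leaf of an expression tree; its schema is the tensor's schema."""
    def __init__(self, tensor):
        self.schema = tensor.schema
        self._tensor = tensor

    def run(self):
        return self._tensor


class Select:
    """Keep only the cells whose `dim` coordinate equals `wanted`."""
    def __init__(self, child, dim, wanted):
        dim_types = dict(child.schema.dims)
        # Checks happen here, when the expression is built, not in run().
        if dim not in dim_types:
            raise TypeError(f"unknown dimension {dim!r}")
        if not isinstance(wanted, dim_types[dim]):
            raise TypeError(f"{dim!r} expects {dim_types[dim].__name__}")
        self.schema = child.schema  # selection leaves the schema unchanged
        self._child, self._dim, self._wanted = child, dim, wanted

    def run(self):
        tensor = self._child.run()
        i = self.schema.names().index(self._dim)
        kept = {k: v for k, v in tensor.cells.items() if k[i] == self._wanted}
        return Tensor(self.schema, kept)


class Project:
    """Keep the dimensions in `keep` and sum the values over all the others."""
    def __init__(self, child, keep):
        known = child.schema.names()
        missing = [d for d in keep if d not in known]
        if missing:
            raise TypeError(f"unknown dimensions {missing}")
        # Inferred output schema: the kept dimensions, in their original order.
        self.schema = Schema(tuple(d for d in child.schema.dims if d[0] in keep))
        self._child = child

    def run(self):
        tensor = self._child.run()
        names = tensor.schema.names()
        idx = [names.index(n) for n in self.schema.names()]
        out = {}
        for key, value in tensor.cells.items():
            small = tuple(key[i] for i in idx)
            out[small] = out.get(small, 0) + value
        return Tensor(self.schema, out)


# Toy data (invented): number of posts per (user, hashtag, hour).
schema = Schema((("user", str), ("hashtag", str), ("hour", int)))
t = Tensor(schema, {
    ("alice", "#db", 9): 2,
    ("alice", "#ml", 9): 1,
    ("bob", "#db", 10): 4,
})

expr = Project(Select(Source(t), "hashtag", "#db"), ["user"])
print(expr.schema.names())  # the schema is known before anything executes
print(expr.run().cells)
```

In this sketch, an ill-typed expression such as `Select(Source(t), "hour", "nine")` is rejected with a `TypeError` as soon as it is constructed, so no data is touched. A statically typed language such as Scala can move the same kind of check to compile time, which is the guarantee the abstract claims for TDM.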
