QDrill:基于查询的大数据分布式可消费分析

Shadi Khalifa, Patrick Martin, Dan Rope, Mike McRoberts, Craig Statchuk
{"title":"QDrill:基于查询的大数据分布式可消费分析","authors":"Shadi Khalifa, Patrick Martin, Dan Rope, Mike McRoberts, Craig Statchuk","doi":"10.1109/BigDataCongress.2016.23","DOIUrl":null,"url":null,"abstract":"Consumable analytics attempt to address the shortage of skilled data analysts in many organizations by offering analytic functionality in a form more familiar to in-house expertise. Providing consumable analytics for Big Data faces three main challenges. The first challenge is making the analytics algorithms run in a distributed fashion in order to analyze Big Data in a timely manner. The second challenge is providing an easy interface to allow in-house expertise to run these algorithms in a distributed fashion while minimizing the learning cycle and existing code rewrites. The third challenge is running the analytics on data of different formats stored on heterogeneous data stores. In this paper, we address these challenges in the proposed QDrill. We introduce the Analytics Adaptor extension for Apache Drill, a schema-free SQL query engine for non-relational storage. The Analytics Adaptor introduces the Distributed Analytics Query Language for invoking data mining algorithms from within the Drill standard SQL query statements. The adaptor allows using any sequential single-node data mining library (e.g. WEKA) and makes its algorithms run in a distributed fashion without having to rewrite them. We evaluate QDrill against Apache Mahout. The evaluation shows that QDrill outperforms Mahout in Updatable model training and scoring phase while almost keeping the same performance for Non-Updatable model training. QDrill is more scalable and offers an easier interface, no storage overhead and the whole algorithms repository of WEKA, with the ability to extend to use algorithms from other data mining libraries.","PeriodicalId":407471,"journal":{"name":"2016 IEEE International Congress on Big Data (BigData Congress)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"QDrill: Query-Based Distributed Consumable Analytics for Big Data\",\"authors\":\"Shadi Khalifa, Patrick Martin, Dan Rope, Mike McRoberts, Craig Statchuk\",\"doi\":\"10.1109/BigDataCongress.2016.23\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Consumable analytics attempt to address the shortage of skilled data analysts in many organizations by offering analytic functionality in a form more familiar to in-house expertise. Providing consumable analytics for Big Data faces three main challenges. The first challenge is making the analytics algorithms run in a distributed fashion in order to analyze Big Data in a timely manner. The second challenge is providing an easy interface to allow in-house expertise to run these algorithms in a distributed fashion while minimizing the learning cycle and existing code rewrites. The third challenge is running the analytics on data of different formats stored on heterogeneous data stores. In this paper, we address these challenges in the proposed QDrill. We introduce the Analytics Adaptor extension for Apache Drill, a schema-free SQL query engine for non-relational storage. The Analytics Adaptor introduces the Distributed Analytics Query Language for invoking data mining algorithms from within the Drill standard SQL query statements. The adaptor allows using any sequential single-node data mining library (e.g. WEKA) and makes its algorithms run in a distributed fashion without having to rewrite them. We evaluate QDrill against Apache Mahout. The evaluation shows that QDrill outperforms Mahout in Updatable model training and scoring phase while almost keeping the same performance for Non-Updatable model training. QDrill is more scalable and offers an easier interface, no storage overhead and the whole algorithms repository of WEKA, with the ability to extend to use algorithms from other data mining libraries.\",\"PeriodicalId\":407471,\"journal\":{\"name\":\"2016 IEEE International Congress on Big Data (BigData Congress)\",\"volume\":\"21 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-06-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2016 IEEE International Congress on Big Data (BigData Congress)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/BigDataCongress.2016.23\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 IEEE International Congress on Big Data (BigData Congress)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/BigDataCongress.2016.23","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

摘要

消耗性分析试图通过以内部专家更熟悉的形式提供分析功能来解决许多组织中熟练数据分析师的短缺问题。为大数据提供可消费的分析面临三大挑战。第一个挑战是使分析算法以分布式方式运行,以便及时分析大数据。第二个挑战是提供一个简单的接口,允许内部专家以分布式方式运行这些算法,同时最大限度地减少学习周期和现有代码重写。第三个挑战是对存储在异构数据存储中的不同格式的数据运行分析。在本文中,我们在提出的QDrill中解决了这些挑战。我们介绍了Apache Drill的Analytics Adaptor扩展,这是一个用于非关系存储的无模式SQL查询引擎。分析适配器引入了分布式分析查询语言,用于从Drill标准SQL查询语句中调用数据挖掘算法。适配器允许使用任何顺序的单节点数据挖掘库(例如WEKA),并使其算法以分布式方式运行,而无需重写它们。我们针对Apache Mahout对QDrill进行了评估。评估表明,QDrill在可更新模型训练和评分阶段优于Mahout,而在非可更新模型训练阶段几乎保持相同的性能。QDrill具有更高的可扩展性,提供了一个更简单的接口,没有存储开销和WEKA的整个算法存储库,能够扩展到使用来自其他数据挖掘库的算法。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
QDrill: Query-Based Distributed Consumable Analytics for Big Data
Consumable analytics attempt to address the shortage of skilled data analysts in many organizations by offering analytic functionality in a form more familiar to in-house expertise. Providing consumable analytics for Big Data faces three main challenges. The first challenge is making the analytics algorithms run in a distributed fashion in order to analyze Big Data in a timely manner. The second challenge is providing an easy interface to allow in-house expertise to run these algorithms in a distributed fashion while minimizing the learning cycle and existing code rewrites. The third challenge is running the analytics on data of different formats stored on heterogeneous data stores. In this paper, we address these challenges in the proposed QDrill. We introduce the Analytics Adaptor extension for Apache Drill, a schema-free SQL query engine for non-relational storage. The Analytics Adaptor introduces the Distributed Analytics Query Language for invoking data mining algorithms from within the Drill standard SQL query statements. The adaptor allows using any sequential single-node data mining library (e.g. WEKA) and makes its algorithms run in a distributed fashion without having to rewrite them. We evaluate QDrill against Apache Mahout. The evaluation shows that QDrill outperforms Mahout in Updatable model training and scoring phase while almost keeping the same performance for Non-Updatable model training. QDrill is more scalable and offers an easier interface, no storage overhead and the whole algorithms repository of WEKA, with the ability to extend to use algorithms from other data mining libraries.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信