A Tool for Statistical Analysis on Network Big Data

2017 28th International Workshop on Database and Expert Systems Applications (DEXA). Pub Date: 2017-08-01. DOI: 10.1109/DEXA.2017.23

C. Ordonez, T. Johnson, D. Srivastava, Simon Urbanek


Citations: 2

Abstract

Due to advances in parallel file systems for big data (e.g., HDFS) and higher-capacity hardware (multicore CPUs, large RAM), it is now feasible to manage and query network data in a parallel DBMS supporting SQL, but performing statistical analysis remains a challenge. On the statistics side, the R language is popular, but it has important limitations: R is limited by main memory, R works in a different address space from query processing, R cannot analyze large disk-resident data sets efficiently, and R has no data management capabilities. Moreover, some R libraries allow R to work in parallel, but without data management capabilities. Considering the challenges and limitations described above, we present a system that combines SQL queries and R functions in a seamless manner. We argue that a parallel DBMS and the R runtime are two distinct systems that benefit from low-level integration. Our parallel DBMS is built on top of HDFS and programmed in Java and C++, with a flexible scale-out architecture, whereas R is programmed purely in C. The user or developer can make calls in both directions: (1) R calling SQL, to evaluate analytic queries or retrieve data from materialized views (transferring result tables in RAM in a streaming fashion and analyzing them in R), and (2) SQL calling R, allowing SQL to convert relational tables to matrices or vectors and perform complex computations on them. We summarize network monitoring tasks at AT&T and present specific programming examples showing language calls in both directions (i.e., R calls SQL and SQL calls R).
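The two call directions described in the abstract can be illustrated with a minimal sketch. It is not the authors' system: Python and the standard-library `sqlite3` module stand in for R and the parallel DBMS, and the table `link_stats`, its columns, and every function name are hypothetical. Direction 1 has the host language issue a query and consume the result in small batches; direction 2 registers a host-language aggregate that SQL then calls.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory table of hypothetical link measurements."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE link_stats (link_id INTEGER, latency_ms REAL)")
    rows = [(1, 10.0), (1, 14.0), (2, 20.0), (2, 30.0), (2, 40.0)]
    conn.executemany("INSERT INTO link_stats VALUES (?, ?)", rows)
    return conn


# Direction 1: the host language calls SQL and streams the result.
def streaming_mean(conn, query, batch_size=2):
    """Mean of the first result column, fetched batch by batch."""
    cursor = conn.execute(query)
    count, total = 0, 0.0
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        for (value,) in batch:
            count += 1
            total += value
    return total / count if count else None


# Direction 2: SQL calls a function written in the host language.
class Variance:
    """User-defined aggregate: population variance (Welford's update)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def step(self, value):
        if value is None:  # skip SQL NULLs
            return
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)

    def finalize(self):
        return self.m2 / self.n if self.n else None


def variance_per_link(conn):
    """Register the aggregate, then let a GROUP BY query invoke it."""
    conn.create_aggregate("pop_variance", 1, Variance)
    cursor = conn.execute(
        "SELECT link_id, pop_variance(latency_ms) "
        "FROM link_stats GROUP BY link_id ORDER BY link_id"
    )
    return cursor.fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    print(streaming_mean(db, "SELECT latency_ms FROM link_stats"))
    print(variance_per_link(db))
```

In this sketch the batch size bounds how many rows the host holds at once, which loosely mirrors the streaming transfer the abstract mentions. The user-defined aggregate keeps the computation inside query evaluation instead of exporting the table first.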

