Tasks and methods of Big Data analysis (a survey)

PROBLEMS IN PROGRAMMING. Pub Date: 2019-08-21. DOI: 10.15407/PP2019.03.058

O. Balabanov


Citations: 0

Abstract

We review the tasks and methods most relevant to Big Data analysis. Emphasis is placed on the conceptual and pragmatic aspects of the tasks and methods, avoiding unnecessary mathematical detail. We suggest that all work with Big Data falls into four conceptual modes of large-scale usage: 1) intelligent information retrieval; 2) massive (large-scale) conveyed data processing (mining); 3) model inference from data; 4) knowledge extraction from data (regularity detection and structure discovery). The essence of various tasks (clustering, regression, generative model inference, structure discovery, etc.) is elucidated. We compare key methods of clustering, regression, classification, deep learning, generative model inference, and causal discovery. Cluster analysis may be divided into methods based on mean distance, methods based on local distance, and methods based on a model. The targeted (predictive) methods fall into two categories: methods that infer a model, and "tied to data" methods that compute predictions directly from the data. Common tasks of temporal data analysis are briefly reviewed. Among the diverse methods of generative model inference, we focus on causal network learning because models of this class are very expressive and flexible and can predict the effects of interventions under varying conditions. The independence-based approach to causal network inference from data is characterized. We give a few comments on the specifics of dynamic causal network inference from time series. Challenges of Big Data analysis raised by data multidimensionality, heterogeneity, and huge volume are presented. Some statistical issues related to these challenges are summarized.

Problems in Programming 2019; 3: 58-85


Source journal

PROBLEMS IN PROGRAMMING

Self-citation rate

0.00%

Publication count