An Algorithm for Identifying the Learning Patterns in Big Data

2015 IEEE Trustcom/BigDataSE/ISPA. Pub Date: 2015-08-20. DOI: 10.1109/Trustcom.2015.561

Majed Farrash, Wenjia Wang

Citations: 1

Abstract

Divide-and-conquer is probably the most commonly used strategy for dealing with big data that is too large to be loaded into a computing system's memory as a whole for analysis. It partitions such a dataset into many smaller subsets that can be loaded into memory separately to induce models, which can then be combined by machine learning ensemble methods. However, it is not clear how the size of the subsets affects the learning performance of the individual models and their ensemble. This paper proposes an ensemble-based algorithm to quickly detect the relational patterns between ensemble accuracy and the size of the partitioned data subsets. An ensemble framework for the algorithm is implemented and tested on 12 relatively big benchmark datasets. The experimental results indicate that it is able to identify the relational patterns accurately and efficiently in fewer than 10 steps. The identified patterns show that in most cases it is not necessary to use the whole big dataset for analysis, as a few smaller subsets are already sufficiently representative of the underlying problem, which is useful knowledge in big data analysis.
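The partition-train-combine idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm, which the abstract does not specify. It only shows the general setting: split a training set into subsets of a given size, induce one model per subset, combine the models by majority vote, and record ensemble accuracy for each subset size. The synthetic two-class data, the nearest-centroid base learner, and all function names (`make_data`, `train_centroid`, `partition`, `accuracy_by_subset_size`) are assumptions made for illustration.

```python
import random
from collections import Counter


def make_data(n, seed=0):
    """Synthetic two-class 2-D Gaussian blobs (an illustrative assumption)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 2.0 if label else -2.0
        data.append(((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label))
    return data


def train_centroid(subset):
    """Fit a nearest-centroid model: the mean point of each class seen."""
    sums, counts = {}, Counter()
    for (x, y), label in subset:
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] += 1
    return {c: (sums[c][0] / counts[c], sums[c][1] / counts[c]) for c in sums}


def predict(model, point):
    """Return the class whose centroid is closest to the point."""
    return min(
        model,
        key=lambda c: (point[0] - model[c][0]) ** 2 + (point[1] - model[c][1]) ** 2,
    )


def ensemble_predict(models, point):
    """Combine the individual models by simple majority vote."""
    votes = Counter(predict(m, point) for m in models)
    return votes.most_common(1)[0][0]


def partition(data, subset_size):
    """Split data into consecutive chunks of subset_size, dropping a short tail."""
    last_start = len(data) - subset_size + 1
    return [data[i:i + subset_size] for i in range(0, last_start, subset_size)]


def accuracy_by_subset_size(train, test, sizes):
    """Map each subset size to the accuracy of the ensemble built from it."""
    results = {}
    for size in sizes:
        models = [train_centroid(s) for s in partition(train, size)]
        correct = sum(ensemble_predict(models, p) == label for p, label in test)
        results[size] = correct / len(test)
    return results


if __name__ == "__main__":
    train, test = make_data(2000, seed=0), make_data(500, seed=1)
    for size, acc in accuracy_by_subset_size(train, test, [10, 50, 200, 1000]).items():
        print(f"subset size {size:5d}: ensemble accuracy {acc:.3f}")
```

Plotting or tabulating the resulting size-to-accuracy mapping gives the kind of relation pattern the abstract refers to. On an easy problem like this toy one, small subsets are likely to perform about as well as large ones, which is the situation in which loading the whole dataset is unnecessary.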



