Spark-gram: Mining frequent N-grams using parallel processing in Spark

2015 International Conference on Advanced Computer Science and Information Systems (ICACSIS) Pub Date: 2015-10-01 DOI: 10.1109/ICACSIS.2015.7415169

Prasetya Ajie Utama, Bayu Distiawan

Citations: 2

Abstract

Mining sequence patterns in the form of n-grams (sequences of words that appear consecutively) from large text data is a fundamental part of several information retrieval and natural language processing applications. In this work, we present Spark-gram, a method for large-scale frequent sequence mining based on Spark, adapted from its MapReduce equivalent, Suffix-σ. Spark-gram's design allows the discovery of all n-grams with maximum length σ and minimum occurrence frequency τ, using an iterative algorithm with only a single shuffle phase. We show that Spark-gram can outperform Suffix-σ, mainly when τ is high, but may perform worse as the value of σ grows.
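To make the roles of σ and τ concrete, here is a minimal single-machine sketch of the mining problem the abstract defines. It is not Spark-gram or Suffix-σ, and it involves no Spark or shuffle phase. The function name `frequent_ngrams`, the whitespace tokenization, and the toy documents are assumptions made for illustration. It relies only on the observation that every occurrence of an n-gram of length at most σ is a prefix of some suffix truncated to σ tokens.

```python
from collections import Counter


def frequent_ngrams(documents, sigma, tau):
    """Return every n-gram of length <= sigma that occurs at least tau times.

    Hypothetical helper for illustration; tokenization by whitespace is an
    assumption, not something the paper specifies.
    """
    counts = Counter()
    for doc in documents:
        tokens = doc.split()
        for start in range(len(tokens)):
            # Suffix beginning at `start`, truncated to at most sigma tokens.
            suffix = tokens[start:start + sigma]
            # Each prefix of the truncated suffix is one n-gram occurrence.
            for length in range(1, len(suffix) + 1):
                counts[tuple(suffix[:length])] += 1
    # Keep only n-grams that meet the minimum occurrence frequency tau.
    return {gram: n for gram, n in counts.items() if n >= tau}


docs = ["a b a b", "a b c"]  # toy corpus, assumed for the example
result = frequent_ngrams(docs, sigma=2, tau=2)
```

With these toy inputs, raising `tau` shrinks the result set, while raising `sigma` increases the number of candidate n-grams that must be counted.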

