A Comparison of Approaches to Chinese Word Segmentation in Hadoop

Zhangang Wang, Bangjie Meng
{"title":"A Comparison of Approaches to Chinese Word Segmentation in Hadoop","authors":"Zhangang Wang, Bangjie Meng","doi":"10.1109/ICDMW.2014.43","DOIUrl":null,"url":null,"abstract":"Today, we're surrounded by data especially Chinese information. The exponential growth of data first presented challenges to cutting-edge businesses such as Alibaba, Jingdong, Amazon, and Microsoft. They need to go through terabytes and petabytes of data to figure out which websites were popular, what books were in demand, and what kinds of ads appealed to people. Chinese word segmentation is a computer problem in Chinese information processing, and the Chinese word segmentation algorithm is one of the core, but because of the different characteristics of the environment morpheme in English, making the Chinese must solve word problems. Chinese lexical analysis is the foundation and key Chinese information processing. IKAnalyzer (IK) and ICTCLAS (IC) is a very popular Chinese word segmentation algorithm. At present, these two algorithms in Chinese segmentation play an important role in solving the text data. If the two algorithms are well applied to Hadoop distributed environment, will have better performance. In this paper we compare IK and IC algorithm performance by the theory and experiments. This paper reports the experimental work on the mass Chinese text segmentation problem and its optimal solution using Hadoop cluster, Hadoop Distributed File System (HDFS) for storage and using parallel processing to process large data sets using Map Reduce programming framework. We have done prototype implementation of Hadoop cluster, HDFS storage and Map Reduce framework for processing large text data sets by considering prototype of big data application scenarios. The results obtained from various experiments indicate favorable results of above IC and IK algorithm to address mass Chinese text segmentation problem. (Addressing Big Data Problem Using Hadoop and Map Reduce). Furthermore, we evaluate both kinds of segmentation in terms of performance. Although the process to load data into and tune the execution of parallel distributed system took much longer than the centralized system, the observed performance of these word segmentation algorithms were strikingly better.","PeriodicalId":289269,"journal":{"name":"2014 IEEE International Conference on Data Mining Workshop","volume":"6 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 IEEE International Conference on Data Mining Workshop","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDMW.2014.43","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 7

Abstract

Today we are surrounded by data, and in particular by Chinese-language information. The exponential growth of data first presented challenges to leading businesses such as Alibaba, Jingdong, Amazon, and Microsoft: they must sift through terabytes and petabytes of data to determine which websites are popular, which books are in demand, and which kinds of advertisements appeal to people. Chinese word segmentation is a central computational problem in Chinese information processing, and segmentation algorithms are among its core components; because Chinese, unlike English, does not mark word boundaries with delimiters, word boundaries must be recovered algorithmically. Chinese lexical analysis is thus the foundation of, and key to, Chinese information processing. IKAnalyzer (IK) and ICTCLAS (IC) are two very popular Chinese word segmentation algorithms, and both play an important role in segmenting Chinese text data. If these two algorithms can be applied effectively in a Hadoop distributed environment, they should achieve better performance. In this paper we compare the performance of the IK and IC algorithms both theoretically and experimentally. We report experimental work on the massive Chinese text segmentation problem and its solution on a Hadoop cluster, using the Hadoop Distributed File System (HDFS) for storage and the MapReduce programming framework to process large data sets in parallel. We built a prototype implementation of a Hadoop cluster, HDFS storage, and a MapReduce workflow for processing large text data sets, modeled on typical big-data application scenarios. The results obtained from our experiments are favorable for both the IC and IK algorithms on the massive Chinese text segmentation problem. Furthermore, we evaluate both kinds of segmentation in terms of performance: although loading data into the parallel distributed system and tuning its execution took much longer than on a centralized system, the observed performance of the word segmentation algorithms was strikingly better.
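The abstract does not reproduce the authors' implementation, but the architecture it describes (HDFS for storage, MapReduce for parallel segmentation) is straightforward to sketch. The Java listing below is a minimal, hypothetical example of running IKAnalyzer inside a Hadoop MapReduce job that segments Chinese text and counts term frequencies. The class names IKSegmentJob, SegMapper, and SumReducer are illustrative assumptions; the IKSegmenter and Lexeme classes are IKAnalyzer's actual segmentation API, and the rest is the standard Hadoop MapReduce API. ICTCLAS could be substituted in the mapper in the same way.

import java.io.IOException;
import java.io.StringReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

// Hypothetical job class: segments Chinese text with IKAnalyzer and counts words.
public class IKSegmentJob {

    // Mapper: segment each input line with IKAnalyzer and emit (word, 1).
    public static class SegMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // "true" selects IK's smart mode (coarser-grained segmentation).
            IKSegmenter segmenter = new IKSegmenter(new StringReader(value.toString()), true);
            Lexeme lexeme;
            while ((lexeme = segmenter.next()) != null) {
                word.set(lexeme.getLexemeText());
                context.write(word, ONE);
            }
        }
    }

    // Reducer (also used as combiner): sum the counts for each segmented word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "chinese-word-segmentation");
        job.setJarByClass(IKSegmentJob.class);
        job.setMapperClass(SegMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged as a jar together with the IKAnalyzer dependency and launched with hadoop jar, each map task segments its own HDFS input split independently, which is how such a job parallelizes segmentation across the cluster.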