Tibetan Web Information Collection System

2012 Fifth International Conference on Intelligent Networks and Intelligent Systems. Pub Date: 2012-11-01. DOI: 10.1109/ICINIS.2012.46

Guixian Xu, D. Zhong, Xu Gao, Yuan Lin, Xiaobing Zhao, Guosheng Yang

Citations: 2

Abstract

Nutch is an open-source web search software project. This paper introduces the Tibetan web information collection system, which is built on Apache Nutch. It identifies shortcomings of the original program and proposes an improved method that uses Nutch to process Tibetan web pages and generate the required files. The paper also shows how to update the data regularly and delete duplicate data. The system is useful for research on Tibetan information processing.
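
The abstract does not say how Tibetan pages are recognized or how duplicates are detected, so the following is only a minimal sketch of one possible approach, not the authors' method. It assumes a page counts as Tibetan when at least half of its non-whitespace characters fall in the Unicode Tibetan block (U+0F00 to U+0FFF), and that duplicates are exact matches found by hashing the page text. The names `tibetan_ratio`, `deduplicate`, and `min_ratio` are hypothetical and are not part of Nutch.

```python
import hashlib

# Unicode Tibetan block boundaries.
TIBETAN_START, TIBETAN_END = 0x0F00, 0x0FFF


def tibetan_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters in the Tibetan block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    tibetan = sum(1 for c in chars if TIBETAN_START <= ord(c) <= TIBETAN_END)
    return tibetan / len(chars)


def deduplicate(pages, min_ratio=0.5):
    """Keep the first page for each distinct text; skip pages that are not mostly Tibetan.

    `pages` is an iterable of (url, text) pairs. The 0.5 threshold is an
    assumption chosen for illustration, not a value from the paper.
    """
    seen = set()
    kept = []
    for url, text in pages:
        if tibetan_ratio(text) < min_ratio:
            continue  # not a Tibetan page under the assumed threshold
        # Hash the stripped text so that exact duplicates share one digest.
        digest = hashlib.md5(text.strip().encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of a page already kept
        seen.add(digest)
        kept.append((url, text))
    return kept
```

Exact hashing only catches byte-identical text after stripping; near-duplicate pages would need a fuzzier signature, which this sketch does not attempt.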
