Model of Data Gathering and Processing on Tibetan and Uyghur Language

2012 Fifth International Conference on Intelligent Networks and Intelligent Systems
Pub Date: 2012-11-01
DOI: 10.1109/ICINIS.2012.81

Yunfeng Weng, Hanxin Jia, Qing Ma


Abstract

This paper introduces a model for gathering and processing web data in the Tibetan and Uyghur languages. The model comprises a page crawler, content extraction, word segmentation, and frequency statistics with display. First, it extracts each website's template and uses that template to extract the content and title of each web page; the software then transforms the HTML file into an XML file. Second, it segments the content of the XML file into words and counts them, storing the statistics in a database. Finally, a web page displays the results of the frequency statistics.
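The abstract does not give implementation details, so the following Python sketch is only one possible reading of the extraction, conversion, segmentation, and counting steps. Everything concrete in it is an assumption made for illustration: the "template" is reduced to the `id` of the element that holds the article text (`content_id`), the XML layout (`page`, `title`, `content`) and the `freq` table are hypothetical, and segmentation is simplified to splitting on whitespace (Uyghur) and on the Tibetan tsheg and shad marks, which yields Tibetan syllables rather than true words. The crawler and the display page are omitted.

```python
import re
import sqlite3
import xml.etree.ElementTree as ET
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # tags with no closing tag


class TemplateExtractor(HTMLParser):
    """Collects text from <title> and from the element whose id is content_id."""

    def __init__(self, content_id):
        super().__init__()
        self.content_id = content_id  # hypothetical stand-in for a site template
        self.title, self.content = [], []
        self._in_title = False
        self._depth = 0  # nesting depth inside the content element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag == "title":
            self._in_title = True
        elif self._depth:
            self._depth += 1
        elif dict(attrs).get("id") == self.content_id:
            self._depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag == "title":
            self._in_title = False
        elif self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title.append(data)
        elif self._depth:
            self.content.append(data)


def html_to_xml(html, content_id):
    """Extract title and content from HTML and return them as an XML string."""
    parser = TemplateExtractor(content_id)
    parser.feed(html)
    page = ET.Element("page")
    ET.SubElement(page, "title").text = "".join(parser.title).strip()
    ET.SubElement(page, "content").text = "".join(parser.content).strip()
    return ET.tostring(page, encoding="unicode")


def segment(text):
    """Simplified segmentation: split on whitespace, the Tibetan tsheg (U+0F0B)
    and shad (U+0F0D), and common Latin and Arabic-script punctuation."""
    pattern = r"[\s\u0f0b\u0f0d.,!?\u060c\u061b\u061f]+"
    return [token for token in re.split(pattern, text) if token]


def count_to_db(xml_text, conn):
    """Count tokens in the XML content and add the counts to the freq table."""
    content = ET.fromstring(xml_text).findtext("content") or ""
    counts = Counter(segment(content))
    conn.execute("CREATE TABLE IF NOT EXISTS freq (token TEXT PRIMARY KEY, n INTEGER)")
    for token, n in counts.items():
        conn.execute("INSERT OR IGNORE INTO freq VALUES (?, 0)", (token,))
        conn.execute("UPDATE freq SET n = n + ? WHERE token = ?", (n, token))
    conn.commit()
    return counts


if __name__ == "__main__":
    sample = '<html><head><title>Demo</title></head><body><div id="article">a b a</div></body></html>'
    db = sqlite3.connect(":memory:")
    count_to_db(html_to_xml(sample, "article"), db)
    # A display page could read the table ordered by frequency, as here.
    print(db.execute("SELECT token, n FROM freq ORDER BY n DESC").fetchall())
```

Splitting Tibetan text on the tsheg is only a rough approximation, because Tibetan words often span several syllables; a real system would need a dictionary-based or statistical segmenter for that step. Going through an intermediate XML string mirrors the HTML-to-XML step described in the abstract.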


