Automatic Web Content Extraction for Generating Tag Clouds from Thai Web Sites

2011 IEEE 8th International Conference on e-Business Engineering. Pub Date: 2011-10-19. DOI: 10.1109/ICEBE.2011.34

W. Thanadechteemapat, L. Fung

Citations: 4

Abstract

This paper proposes a novel Web content extraction approach based on heuristic rules and the XPath utility in XML. The main objective is to address the problem of Web visualization by generating tag clouds from Thai Web sites, providing an overview of the keywords in the Web pages. The paper also proposes a detailed method for assessing the Web content extraction technique on a single Web page by using the length of the extracted content. The proposed technique has three main steps: Web page element and feature extraction, block detection, and content extraction selection. Empirical results show that the technique achieves high accuracy.
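
The three steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the actual heuristic rules, so the link-ratio threshold, the choice of block tags, the sample page, and all function names here are assumptions made purely for illustration. The sketch uses the limited XPath subset in Python's standard `xml.etree.ElementTree`, which requires well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page (well-formed XHTML so the stdlib parser accepts it).
PAGE = """<html><body>
<div id="nav"><a href="/">Home</a><a href="/news">News</a></div>
<div id="main"><p>First paragraph of the article body, long enough to dominate.</p>
<p>Second paragraph with more article text.</p></div>
<div id="footer"><p>Copyright</p></div>
</body></html>"""

BLOCK_TAGS = ("div", "td")  # assumed block-level containers, not from the paper


def extract_features(root):
    """Step 1: collect per-element features (text length, link-text length)."""
    features = []
    for tag in BLOCK_TAGS:
        for el in root.findall(f".//{tag}"):  # XPath subset supported by ElementTree
            text_len = len("".join(el.itertext()).strip())
            link_len = sum(len("".join(a.itertext())) for a in el.findall(".//a"))
            features.append({"element": el, "text_len": text_len, "link_len": link_len})
    return features


def detect_blocks(features, max_link_ratio=0.5):
    """Step 2: keep blocks that contain text and are not dominated by link text.

    The 0.5 threshold is an illustrative assumption.
    """
    return [
        f for f in features
        if f["text_len"] > 0 and f["link_len"] / f["text_len"] <= max_link_ratio
    ]


def select_content(blocks):
    """Step 3: choose the candidate block with the most non-link text."""
    if not blocks:
        return ""
    best = max(blocks, key=lambda f: f["text_len"] - f["link_len"])
    # Collapse runs of whitespace so the result is a single clean string.
    return " ".join("".join(best["element"].itertext()).split())


root = ET.fromstring(PAGE)
content = select_content(detect_blocks(extract_features(root)))
print(content)
```

On this sample the navigation block is discarded because all of its text sits inside links, and the main block wins over the footer because it holds more text. Real pages are rarely well-formed XML, so a tolerant HTML parser would be needed in practice, and generating a tag cloud from Thai text would also require word segmentation, which this sketch does not attempt.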
