Classifying Web Pages Using Information Extraction Patterns: Preliminary Results and Findings

2010 Sixth International Conference on Signal-Image Technology and Internet Based Systems. Pub Date: 2010-12-15. DOI: 10.1109/SITIS.2010.42

Lay-Ki Soon, Sang Ho Lee

Citations: 1

Abstract

Web page classification plays an essential role in making information retrieval and information processing more efficient. Conventionally, web text documents are represented by a term-frequency matrix for classification purposes. However, given the limitations of representing documents by terms or keywords, we propose to represent each web page by the information extraction patterns identified within it. In this paper, we present the results and findings of our preliminary experiments. Our experimental results indicate that occurrences of a word in different contexts have different impacts on the classification task. Thus, the extraction patterns used to represent each document are more semantically meaningful and give better insight into web page classification than keywords do.
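The contrast between the two representations can be illustrated with a minimal sketch. The abstract does not specify how the extraction patterns are defined or learned, so the `context_patterns` function below is a hypothetical stand-in that simply pairs each token with its left and right neighbours; the function names and the two sample pages are likewise assumptions made for illustration only.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequency(text):
    """Bag-of-words representation: token -> count."""
    return Counter(tokenize(text))


def context_patterns(text):
    """Toy stand-in for extraction patterns (an assumption, not the
    paper's method): each token together with its two neighbours."""
    tokens = ["<s>"] + tokenize(text) + ["</s>"]
    return Counter(
        f"{left} [{word}] {right}"
        for left, word, right in zip(tokens, tokens[1:], tokens[2:])
    )


# Two hypothetical pages that share the word "apple" in different contexts.
page_a = "Apple released a new phone"
page_b = "She baked an apple pie"

tf_a, tf_b = term_frequency(page_a), term_frequency(page_b)
pat_a, pat_b = context_patterns(page_a), context_patterns(page_b)

# Features the two pages have in common under each representation.
shared_terms = set(tf_a) & set(tf_b)
shared_patterns = set(pat_a) & set(pat_b)

print("shared terms:", shared_terms)
print("shared patterns:", shared_patterns)
```

Under term frequency the two pages overlap on the keyword "apple", whereas the context-aware features keep the two uses of the word apart. This is only meant to show why a context-sensitive representation can separate documents that a keyword matrix treats as similar.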
