The Design and Implementation of a Topic-Driven Crawler

Workshop on Intelligent Information Technology Application (IITA 2007). Pub Date: 2007-12-02. DOI: 10.1109/IITA.2007.33

Qiong Li, Tao Jin, Yuchen Fu, Quan Liu, Zhiming Cui

Citations: 0

Abstract

Internet users need web pages to be classified into a given topic as accurately as possible. As a result, topic-driven crawlers are becoming important tools for supporting applications such as specialized web portals, online search, and competitive intelligence. This paper presents a topic-driven crawler that computes the degree of relevance of web pages and refines the preliminary set of related pages using term frequency/document frequency, entropy, and compiled rules. It also proposes a comparatively ideal system architecture for a topic-driven crawler, describes the relationships among its modules, and discusses several of the modules in detail.
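
The abstract names term frequency/document frequency and entropy as relevance signals but does not give the formulas. The following is a minimal sketch of how such signals might be computed, not the authors' method: it substitutes the standard smoothed TF-IDF weighting for the paper's unspecified weighting, and the function names (`tokenize`, `tf_idf_relevance`, `term_entropy`) and the simple tokenizer are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (a simplification)."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def tf_idf_relevance(page, topic_terms, corpus):
    """Sum the TF-IDF weights of the topic terms found in `page`.

    Assumption: standard smoothed TF-IDF stands in for the weighting
    used in the paper, which the abstract does not specify.
    """
    tokens = tokenize(page)
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    doc_token_sets = [set(tokenize(doc)) for doc in corpus]
    n_docs = len(doc_token_sets)
    score = 0.0
    for term in topic_terms:
        tf = counts[term] / len(tokens)                      # term frequency
        df = sum(1 for s in doc_token_sets if term in s)     # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0        # smoothed IDF
        score += tf * idf
    return score


def term_entropy(page):
    """Shannon entropy (in bits) of the page's term distribution."""
    tokens = tokenize(page)
    if not tokens:
        return 0.0
    total = len(tokens)
    return -sum(
        (c / total) * math.log2(c / total) for c in Counter(tokens).values()
    )
```

In a crawler along these lines, a page whose relevance score exceeds some threshold would be kept in the preliminary set, and the entropy value could serve as one input to a later refinement step; both the threshold and that use of entropy are hypothetical here.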

