An N-Gram Based Approach to Automatically Identifying Web Page Genre

2009 42nd Hawaii International Conference on System Sciences. Pub Date: 2009-01-20. DOI: 10.1109/HICSS.2009.581

Jane E. Mason, M. Shepherd, Jack Duffy


Citations: 33

Abstract


The research reported in this paper is the first phase of a larger project on the automatic classification of web pages by genre, using n-gram representations of the web pages. In this study, the textual content of each web page is used to create a feature set consisting of its most frequent n-grams and their associated frequencies. We present three methods, each of which uses a distance measure to determine the dissimilarity between two feature sets. Every method forms one feature set for each web page in the test set; the methods differ in how feature sets are formed from the training set. We experiment with one feature set per web page, one feature set per genre, and a combination of genre-based feature sets supplemented by subgenre feature sets. We present results for a balanced corpus of seven genres (blog, eshop, FAQs, front page, listing, home page, and search page). Initial results are encouraging.
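The pipeline outlined in the abstract (build a feature set of frequent n-grams, then compare feature sets with a distance measure) can be illustrated with a minimal Python sketch. The abstract does not specify the n-gram type, the n-gram length, the profile size, or the distance measure, so everything below is an assumption for illustration only: character n-grams, `n=3`, `top_k=500`, whitespace normalization, and a squared relative-difference dissimilarity. The function names (`ngram_profile`, `dissimilarity`, `per_page_profiles`, `per_genre_profiles`, `classify`) are hypothetical and are not taken from the paper.

```python
from collections import Counter


def ngram_profile(text, n=3, top_k=500):
    """Feature set: the top_k most frequent character n-grams, mapped to
    their relative frequencies. Character n-grams, n and top_k are assumed."""
    text = " ".join(text.lower().split())  # assumed preprocessing step
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: c / total for gram, c in counts.most_common(top_k)}


def dissimilarity(p, q):
    """Assumed distance measure: sum of squared relative frequency
    differences over every n-gram that occurs in either profile."""
    score = 0.0
    for gram in p.keys() | q.keys():
        fp, fq = p.get(gram, 0.0), q.get(gram, 0.0)
        # fp + fq > 0 because the n-gram occurs in at least one profile.
        score += ((fp - fq) / ((fp + fq) / 2.0)) ** 2
    return score


def per_page_profiles(training, n=3, top_k=500):
    """One feature set per training page; training is (genre, text) pairs."""
    return [(genre, ngram_profile(text, n, top_k)) for genre, text in training]


def per_genre_profiles(training, n=3, top_k=500):
    """One feature set per genre, built from all of that genre's pages."""
    by_genre = {}
    for genre, text in training:
        by_genre.setdefault(genre, []).append(text)
    return [(genre, ngram_profile(" ".join(texts), n, top_k))
            for genre, texts in by_genre.items()]


def classify(page_text, labelled_profiles, n=3, top_k=500):
    """Assign the genre of the least dissimilar training profile."""
    page = ngram_profile(page_text, n, top_k)
    best = min(labelled_profiles, key=lambda item: dissimilarity(page, item[1]))
    return best[0]


if __name__ == "__main__":
    # Toy stand-ins for web page text; not data from the paper.
    training = [
        ("blog", "posted by alice comments archive posted by bob comments"),
        ("eshop", "add to cart price checkout add to cart shipping price"),
    ]
    page = "add to cart and checkout price"
    print(classify(page, per_page_profiles(training)))
    print(classify(page, per_genre_profiles(training)))
```

Passing the output of `per_page_profiles` to `classify` corresponds to the one-feature-set-per-page setup, and `per_genre_profiles` to the one-feature-set-per-genre setup. The third setup could plausibly be approximated by appending extra profiles, built from subgenre groups and labelled with their parent genre, to the per-genre list, but the abstract does not say how the subgenre feature sets are actually formed or combined.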

