{"title":"利用信息抽取模式对网页进行分类的初步结果和发现","authors":"Lay-Ki Soon, Sang Ho Lee","doi":"10.1109/SITIS.2010.42","DOIUrl":null,"url":null,"abstract":"Web page classification plays an essential role in facilitating more efficient information retrieval and information processing. Conventionally, web text documents are represented by term frequency matrix for classification purpose. However, considering the limitations of representing documents using terms or keywords, we propose to represent web pages using information extraction patterns that are identified within the pages respectively. In this paper, we present the results as well as the findings obtained from our preliminary experiments. Our experimental results indicate that the existence of a word in different contexts has different impact to the classification task. Thus, the extraction patterns used to represent each document are more semantically meaningful and give better insight to web classification in comparison with keywords.","PeriodicalId":128396,"journal":{"name":"2010 Sixth International Conference on Signal-Image Technology and Internet Based Systems","volume":"18 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2010-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Classifying Web Pages Using Information Extraction Patterns Preliminary Results and Findings\",\"authors\":\"Lay-Ki Soon, Sang Ho Lee\",\"doi\":\"10.1109/SITIS.2010.42\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Web page classification plays an essential role in facilitating more efficient information retrieval and information processing. Conventionally, web text documents are represented by term frequency matrix for classification purpose. However, considering the limitations of representing documents using terms or keywords, we propose to represent web pages using information extraction patterns that are identified within the pages respectively. In this paper, we present the results as well as the findings obtained from our preliminary experiments. Our experimental results indicate that the existence of a word in different contexts has different impact to the classification task. Thus, the extraction patterns used to represent each document are more semantically meaningful and give better insight to web classification in comparison with keywords.\",\"PeriodicalId\":128396,\"journal\":{\"name\":\"2010 Sixth International Conference on Signal-Image Technology and Internet Based Systems\",\"volume\":\"18 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2010-12-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2010 Sixth International Conference on Signal-Image Technology and Internet Based Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SITIS.2010.42\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2010 Sixth International Conference on Signal-Image Technology and Internet Based Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SITIS.2010.42","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Classifying Web Pages Using Information Extraction Patterns Preliminary Results and Findings
Web page classification plays an essential role in facilitating more efficient information retrieval and information processing. Conventionally, web text documents are represented by term frequency matrix for classification purpose. However, considering the limitations of representing documents using terms or keywords, we propose to represent web pages using information extraction patterns that are identified within the pages respectively. In this paper, we present the results as well as the findings obtained from our preliminary experiments. Our experimental results indicate that the existence of a word in different contexts has different impact to the classification task. Thus, the extraction patterns used to represent each document are more semantically meaningful and give better insight to web classification in comparison with keywords.