Web Content Information Extraction Approach Based on Removing Noise and Content-Features
Authors: D. Yang, Jihua Song
Venue: 2010 International Conference on Web Information Systems and Mining
Publication date: 2010-10-23
DOI: 10.1109/WISM.2010.82 (https://doi.org/10.1109/WISM.2010.82)
Citations: 21
Abstract
This paper presents an improved approach to extracting the main content from web pages. Many financial news pages contain so many links that algorithms based mainly on link density perform poorly at extracting the main content. To solve this problem, we propose a main-content extraction method that first removes common noise and candidate nodes that carry no main-content information, and then uses the relationship among content text length, anchor text length, and the number of punctuation marks to extract the main content. The paper focuses on noise removal and on exploiting a range of content features. Experiments show that the approach improves the generality and accuracy of body-text extraction from web pages.
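The abstract names the three features (content text length, anchor text length, punctuation count) but not how they are combined, so the following is only a minimal sketch under stated assumptions. The `Candidate` class, the punctuation set, the `is_noise` pre-filter, and the scoring formula are all hypothetical illustrations, not the authors' method.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical punctuation set (ASCII plus common CJK marks); not from the paper.
PUNCTUATION = frozenset(",.;:!?，。；：！？、")


@dataclass
class Candidate:
    """A candidate DOM node, reduced to the two strings the features need."""
    text: str         # all visible text under the node
    anchor_text: str  # the part of that text that sits inside <a> elements


def punctuation_count(text: str) -> int:
    """Count characters of `text` that belong to PUNCTUATION."""
    return sum(ch in PUNCTUATION for ch in text)


def is_noise(node: Candidate) -> bool:
    """Assumed pre-filter: drop nodes that are empty or made up only of link text."""
    text_len = len(node.text.strip())
    return text_len == 0 or len(node.anchor_text) >= text_len


def score(node: Candidate) -> float:
    """Illustrative score: non-link share of the text, weighted by punctuation."""
    text_len = len(node.text)
    if text_len == 0:
        return 0.0
    non_link_share = max(text_len - len(node.anchor_text), 0) / text_len
    return non_link_share * punctuation_count(node.text)


def extract_main(nodes: List[Candidate]) -> Optional[Candidate]:
    """Return the highest-scoring node that survives the noise filter."""
    kept = [n for n in nodes if not is_noise(n)]
    return max(kept, key=score, default=None)


# Made-up example nodes resembling a link-heavy financial news page.
nav = Candidate("Home Markets Stocks Funds Bonds",
                "Home Markets Stocks Funds Bonds")
related = Candidate("Related: Bank profits jump, Oil slips, Gold steadies",
                    "Bank profits jump, Oil slips, Gold steadies")
body = Candidate("Shares rose on Monday, led by banks. "
                 "Analysts expect further gains, but risks remain.",
                 "banks")

best = extract_main([nav, related, body])
print(best.text if best else "no main content found")
```

In this sketch the navigation bar is discarded by the filter because it is entirely anchor text, and the list of related links scores low because most of its characters sit inside links. A node that embeds a few links in punctuated prose can still score well, which is the situation where a pure link-density threshold tends to struggle.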