Improving Webpage Content Extraction by extending a novel single page extraction approach: A case study with Thai websites

2012 International Conference on Machine Learning and Cybernetics | Pub Date: 2012-07-15 | DOI: 10.1109/ICMLC.2012.6359546

W. Thanadechteemapat, L. Fung


Citations: 1

Abstract



This paper proposes a Web Content Extraction technique. The technique uses heuristic rules and works with both single and multiple pages. For multiple page extraction, an Extracted Content Matching (ECM) technique is proposed to identify noise among the extracted results. Several features are also introduced to reduce processing time, such as the use of XPath, file compression, and parallel processing. Performance is assessed by precision, recall and F-measure, computed from the length of the extracted content. Initial results, obtained by comparing the output of the proposed approach with manually extracted content, are good.
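The abstract says the evaluation uses precision, recall and F-measure based on the length of the extracted content, but it does not give the exact formulas. The following is a minimal sketch of one plausible reading, not the authors' implementation: the overlap between the extracted text and a manually extracted reference is measured in characters, and `length_based_scores` is a hypothetical helper name. Using `difflib.SequenceMatcher` to find the overlap is an assumption made purely for illustration.

```python
from difflib import SequenceMatcher
from typing import Tuple


def length_based_scores(extracted: str, reference: str) -> Tuple[float, float, float]:
    """Return (precision, recall, f_measure) from character lengths.

    Assumption: "length of extracted content" is read as the number of
    characters shared between the automatic and the manual extraction.
    """
    matcher = SequenceMatcher(None, extracted, reference, autojunk=False)
    # Total number of characters that appear in matching blocks of both texts.
    overlap = sum(block.size for block in matcher.get_matching_blocks())

    # Precision: share of the extracted text that is correct content.
    precision = overlap / len(extracted) if extracted else 0.0
    # Recall: share of the reference content that was recovered.
    recall = overlap / len(reference) if reference else 0.0

    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure: harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    manual = "Main article text."
    automatic = "Main article text. Advertisement"  # trailing noise
    print(length_based_scores(automatic, manual))
```

In this example the automatic extraction contains the full reference plus trailing noise, so recall is 1.0 while precision drops below 1.0, which is the kind of noise the ECM step is described as targeting.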


Source journal

2012 International Conference on Machine Learning and Cybernetics

Self-citation rate

0.00%

Number of publications