TOPCRAWL: Community mining in web search engines with emphasize on topical crawling

International Conference on Pattern Recognition, Informatics and Medical Engineering (PRIME-2012) | Pub Date: 2012-03-21 | DOI: 10.1109/ICPRIME.2012.6208281

S. Balaji, S. Sarumathi


Citations: 3

Abstract


TOPCRAWL: Community mining in web search engines with emphasize on topical crawling

Web mining systems exploit the redundancy of data published on the Web to automatically extract information from existing web documents. The crawler is an important module of a web search engine, and its quality directly affects the search quality of the engine. Such a web crawler may interact with millions of hosts over a period of weeks or months, so robustness, flexibility, and manageability are of major importance. Given some URLs, the crawler should retrieve the web pages at those URLs, parse the HTML files, add new URLs to its queue, and return to the first phase of this cycle. While parsing the HTML files for new URLs, the crawler can also retrieve other information from them. This paper proposes a framework and algorithm, TOPCRAWL, for mining. The proposed TOPCRAWL algorithm is a new crawling method that emphasizes topic relevancy and outperforms state-of-the-art approaches with respect to the recall achievable within a given period of time. The method also tries to present results in community format, and it uses a new combination of ideas and techniques to identify and exploit navigational structures of websites, such as hierarchies, lists, or maps. The algorithm is simulated with the web mining tool DEiXTo, the basic idea has been implemented in Java, and results are given. Comparisons with existing focused crawling techniques reveal that the new crawling method leads to a significant increase in recall whilst maintaining precision.
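
The crawl cycle the abstract describes (retrieve pages, parse the HTML, queue new URLs, repeat) can be illustrated with a minimal sketch. This is not the TOPCRAWL algorithm, whose details the abstract does not give. It is a generic focused-crawl loop under stated assumptions: the keyword-overlap `relevance` score, the `focused_crawl` function, its `threshold` and `max_pages` parameters, and the in-memory `TOY_WEB` are all hypothetical names introduced for illustration, and pages are read from a dictionary instead of the network.

```python
import heapq
import re
from html.parser import HTMLParser


class LinkAndTextParser(HTMLParser):
    """Collects href targets and text content while parsing one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.text_parts.append(data)


def relevance(text, topic_terms):
    """Assumed scoring rule: fraction of topic terms that occur in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return len(words & topic_terms) / len(topic_terms)


def focused_crawl(seeds, fetch, topic_terms, max_pages=10, threshold=0.5):
    """Fetch, parse, enqueue, repeat; the most promising URLs are fetched first."""
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    relevant = []
    fetched = 0
    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)
        html = fetch(url)
        if html is None:
            continue  # unreachable page: skip without spending budget
        fetched += 1
        parser = LinkAndTextParser()
        parser.feed(html)
        score = relevance(" ".join(parser.text_parts), topic_terms)
        if score >= threshold:
            relevant.append(url)
        # Out-links inherit the parent page's score as their priority.
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return relevant


# Toy in-memory "web" so the sketch runs without network access.
TOY_WEB = {
    "a": '<p>web crawler topic</p><a href="b">b</a><a href="c">c</a>',
    "b": '<p>crawler topic search</p><a href="d">d</a>',
    "c": '<p>cooking recipes</p><a href="e">e</a>',
    "d": "<p>topic crawler</p>",
    "e": "<p>more recipes</p>",
}

result = focused_crawl(["a"], TOY_WEB.get, {"crawler", "topic"}, max_pages=4)
print(result)
```

With a budget of four pages, the sketch fetches `a`, `b`, `c`, and `d` and returns `['a', 'b', 'd']`. Page `e` is never fetched because it is linked only from the off-topic page `c` and therefore sits at the bottom of the priority queue. Spending a fixed fetch budget on links from relevant pages first is the general mechanism by which focused crawlers try to raise recall within a given time.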
