Sergej Sizov, Stefan Siersdorfer, M. Theobald, G. Weikum
The BINGO! focused crawler: from bookmarks to archetypes
Proceedings 18th International Conference on Data Engineering, 2002-08-07
DOI: 10.1109/ICDE.2002.994746 (https://doi.org/10.1109/ICDE.2002.994746)
Citations: 17
Abstract
The BINGO! system implements an approach to focused crawling that aims to overcome the limitations of the initial training data. To this end, BINGO! identifies, among the crawled and positively classified documents of a topic, characteristic "archetypes" and uses them for periodically re-training the classifier; this way the crawler is dynamically adapted based on the most significant documents seen so far. Two kinds of archetypes are considered: good authorities as determined by employing Kleinberg's link analysis algorithm, and documents that have been automatically classified with high confidence using a linear SVM classifier.
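The archetype selection described in the abstract can be illustrated with a minimal sketch. It is not the BINGO! implementation: the function names, the `top_auth` and `top_conf` cutoffs, the fixed iteration count, and the choice to take the union of the two candidate lists are all assumptions made for illustration, since the abstract does not specify them. The sketch computes authority scores in the style of Kleinberg's link analysis (HITS) with a plain power iteration, then picks, among the positively classified documents, those with the highest authority scores and those with the highest classifier confidence. The confidence values are assumed to be supplied by some linear SVM (for example, distances from the separating hyperplane).

```python
from math import sqrt


def hits_authorities(links, iterations=50):
    """Return authority scores for a link graph.

    links: dict mapping each page to the list of pages it links to.
    The iteration count is an assumed, fixed value (no convergence test).
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)

    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        new_auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        norm = sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {n: v / norm for n, v in new_auth.items()}

        # Hub score: sum of authority scores of the pages linked to.
        new_hub = {n: sum(auth[d] for d in links.get(n, ())) for n in nodes}
        norm = sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {n: v / norm for n, v in new_hub.items()}
    return auth


def select_archetypes(links, svm_confidence, top_auth=2, top_conf=2):
    """Pick archetype candidates among positively classified documents.

    svm_confidence: dict mapping each positively classified document to a
    confidence value from a linear SVM (hypothetical input format).
    top_auth / top_conf: assumed cutoffs, not taken from the paper.
    """
    auth = hits_authorities(links)
    candidates = list(svm_confidence)
    by_auth = sorted(candidates, key=lambda d: auth.get(d, 0.0), reverse=True)
    by_conf = sorted(candidates, key=lambda d: svm_confidence[d], reverse=True)
    # Union of both kinds of archetypes (an assumed combination rule).
    return set(by_auth[:top_auth]) | set(by_conf[:top_conf])
```

In a crawler built along these lines, the returned documents would be added to the training set before the classifier is periodically re-trained; how often that happens and how the archetypes are weighted against the initial bookmarks is left open here.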