{"title":"Malformed UTF-8 and spam","authors":"Matt Crane, A. Trotman, Richard A. O'Keefe","doi":"10.1145/2537734.2537746","DOIUrl":null,"url":null,"abstract":"In this paper we discuss some of the document encoding errors that were found when scaling our indexer and search engine up to large collections crawled from the web, such as ClueWeb09. In this paper we describe the encoding errors, what effect they could have on indexing and searching, how they are processed within our indexer and search engine and how they relate to the quality of the page measured by another method.","PeriodicalId":402985,"journal":{"name":"Australasian Document Computing Symposium","volume":"200 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Australasian Document Computing Symposium","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2537734.2537746","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1
Abstract
In this paper we discuss some of the document encoding errors that were found when scaling our indexer and search engine up to large collections crawled from the web, such as ClueWeb09. In this paper we describe the encoding errors, what effect they could have on indexing and searching, how they are processed within our indexer and search engine and how they relate to the quality of the page measured by another method.