畸形的UTF-8和垃圾邮件

Australasian Document Computing Symposium Pub Date : 2013-12-05 DOI:10.1145/2537734.2537746

Matt Crane, A. Trotman, Richard A. O'Keefe

{"title":"畸形的UTF-8和垃圾邮件","authors":"Matt Crane, A. Trotman, Richard A. O'Keefe","doi":"10.1145/2537734.2537746","DOIUrl":null,"url":null,"abstract":"In this paper we discuss some of the document encoding errors that were found when scaling our indexer and search engine up to large collections crawled from the web, such as ClueWeb09. In this paper we describe the encoding errors, what effect they could have on indexing and searching, how they are processed within our indexer and search engine and how they relate to the quality of the page measured by another method.","PeriodicalId":402985,"journal":{"name":"Australasian Document Computing Symposium","volume":"200 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Malformed UTF-8 and spam\",\"authors\":\"Matt Crane, A. Trotman, Richard A. O'Keefe\",\"doi\":\"10.1145/2537734.2537746\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this paper we discuss some of the document encoding errors that were found when scaling our indexer and search engine up to large collections crawled from the web, such as ClueWeb09. In this paper we describe the encoding errors, what effect they could have on indexing and searching, how they are processed within our indexer and search engine and how they relate to the quality of the page measured by another method.\",\"PeriodicalId\":402985,\"journal\":{\"name\":\"Australasian Document Computing Symposium\",\"volume\":\"200 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2013-12-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Australasian Document Computing Symposium\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2537734.2537746\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Australasian Document Computing Symposium","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2537734.2537746","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 1

摘要

在本文中，我们讨论了一些文档编码错误，这些错误是在扩展我们的索引器和搜索引擎到从web抓取的大型集合时发现的，比如ClueWeb09。在本文中，我们描述了编码错误，它们对索引和搜索的影响，它们在我们的索引器和搜索引擎中是如何处理的，以及它们与用另一种方法测量的页面质量的关系。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Malformed UTF-8 and spam

In this paper we discuss some of the document encoding errors that were found when scaling our indexer and search engine up to large collections crawled from the web, such as ClueWeb09. In this paper we describe the encoding errors, what effect they could have on indexing and searching, how they are processed within our indexer and search engine and how they relate to the quality of the page measured by another method.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Australasian Document Computing Symposium

自引率

0.00%

发文量