{"title":"Searching Corrupted Document Collections","authors":"Jason J. Soo, O. Frieder","doi":"10.1109/DAS.2016.28","DOIUrl":null,"url":null,"abstract":"Historical documents are typically digitized using optical Character Recognition. While effective, the results may not always be accurate and are highly dependent on the input. Consequently, degraded documents are often corrupted. Our focus is finding flexible, reliable methods to correct for such degradation, in the face of limited resources. We extend upon our substring and context fusion based retrieval system known as Segments, to consider metadata. By extracting topics from documents, and supplementing and weighting our lexicon with co-occurring terms found in documents with those topics, we achieve a statistically significant improvement over the state-of-the-art in all but one test configuration. Our mean reciprocal rank measured on two free, publicly available, independently judged datasets is 0.7657 and 0.5382.","PeriodicalId":197359,"journal":{"name":"2016 12th IAPR Workshop on Document Analysis Systems (DAS)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 12th IAPR Workshop on Document Analysis Systems (DAS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DAS.2016.28","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1
Abstract
Historical documents are typically digitized using optical Character Recognition. While effective, the results may not always be accurate and are highly dependent on the input. Consequently, degraded documents are often corrupted. Our focus is finding flexible, reliable methods to correct for such degradation, in the face of limited resources. We extend upon our substring and context fusion based retrieval system known as Segments, to consider metadata. By extracting topics from documents, and supplementing and weighting our lexicon with co-occurring terms found in documents with those topics, we achieve a statistically significant improvement over the state-of-the-art in all but one test configuration. Our mean reciprocal rank measured on two free, publicly available, independently judged datasets is 0.7657 and 0.5382.