{"title":"退化文献集合中已知项检索的重访","authors":"Jason J. Soo, O. Frieder","doi":"10.2352/ISSN.2470-1173.2016.17.DRR-065","DOIUrl":null,"url":null,"abstract":"Optical character recognition software converts an image of text to a text document but typically degrades the document’s contents. Correcting such degradation to enable the document set to be queried effectively is the focus of this work. The described approach uses a fusion of substring generation rules and context aware analysis to correct these errors. Evaluation was facilitated by two publicly available datasets from TREC-5’s Confusion Track containing estimated error rates of 5% and 20% . On the 5% dataset, we demonstrate a statistically significant improvement over the prior art and Solr’s mean reciprocal rank (MRR). On the 20% dataset, we demonstrate a statistically significant improvement over Solr, and have similar performance to the prior art. The described approach achieves an MRR of 0.6627 and 0.4924 on collections with error rates of approximately 5% and 20% respectively. Introduction Documents that are not electronically readable are increasingly difficult to manage, search, and maintain. Optical character recognition (OCR) is used to digitize these documents, but frequently produces a degraded copy. We develop a search system capable of searching such degraded documents. Our approach sustains a higher search accuracy rate than the prior art as evaluated using the TREC-5 Confusion Track datasets. Additionally, the approach developed is domain and language agnostic; increasing its applicability. In the United States there are two federal initiatives underway focused on the digitization of health records. First, the federal government is incentivizing local and private hospitals to switch from paper to electronic health records to improve the quality of care [3]. Second, the Veteran’s Affairs (VA) has an initiative to eliminate all paper health records by 2015 [2]. Both processes require converting paper records to digital images, and – hopefully – indexing of the digitized images to support searching. These efforts either are leveraging or can leverage OCR to query the newly created records to improve quality of service. These are but a few of the many examples demonstrating the importance of OCR. An OCR process is composed of two main parts. First is the conversion of an imagine to text by identifying characters and words from images [8, 17]. Second, the resulting text is post-processed to identify and correct errors during the first phase. Techniques in this process can range from simple dictionary checks to statistical methods. Our research focuses on the latter phase. Some work in the second phase has attempted to optimize the algorithm’s parameters by training algorithms on portions of the dataset [16]. However, such an approach does not generalize to other OCR collections. Other work focuses on specialized situations: handwritten documents [15]; signs, historical markers/documents [13, 9]. While other works hinge on assumptions: the OCR exposes a confidence level for each processed word [7]; online resources will allow the system to make hundredsof-thousands of queries in short bursts [6, 12]; or the ability to crawl many web sources to create lexicons [28]. We focus on the generalized case of post-processing of OCR degraded documents without training or consideration of document type. Historically, there was a flurry of research in this area, particularly around the time TREC released an OCR corrupted dataset [10]. 
Entries to the TREC competition fell into 2 categories: attempts to clarify or expand the query and attempts to clarify or correct the documents themselves. Results submitted from the latter category have higher mean reciprocal ranks (MRR). Therefore, we continue work in this direction. Taghva et al. published many results in this area [26]. They have designed specialized retrieval engines for OCR copies of severely degraded documents [25] and found their tested OCR error correction methods had little impact on precision/recall vs an unmodified search engine [24]. This result suggests that Solr is a good enough solution to searching OCR corrupted collections. Their most related work to this research was the creation of a correction system for OCR errors. This system uses statistical methods to make more accurate corrections, but requires user training and assistance [27]. More recent work from this lab has been focused on similar supervised approaches [18]. In contrast, our objective is the development of a solution requiring no user intervention or training data. Our contributions are: • Given a minimally corrupted dataset (∼5% error rate), we show that a fusion based method has a statistically significantly (p<0.05) higher MRR than prior art, and higher MRR than individual methods for correcting corrupt words. • Given a moderately corrupted dataset (∼20% error rate), we show the same method’s MRR is roughly equal to the prior art’s. • We evaluate the impact of context when correcting corrupted terms in a corrupted document. • We demonstrate the tradeoffs of occurrence frequency thresholds for corrupt words. Thresholds set too high and too low negatively impact MRR. • We evaluate filtering methods to increase the accuracy of identifying corrupt words. • We reinforce the assumption that use of domain keywords improve correction rates by showing their impact on MRR. Methods Dataset Document Set The experiments performed are based on the publicly available TREC-5 Confusion Track collection: 395 MB containing approximately 55,600 documents. The documents are part of the Federal Register printed by the United States Government Printing Office. A list of 49 queries and the best resulting document are provided for evaluation. Since each query seeks only a single document, MRR is reported. TREC created two corrupted datasets from the original collection with an estimated 5% error rate and 20% error rate. Real Words Dictionary We create an exhaustive English dictionary of real words using the following three datasets: 1) 99,044 words from the English dictionary1; 2) 94,293 sir names in the United States2; 3) 1,293,142 geographic locations within the United States3. Collectively, this dictionary is referred to as real words. To measure the impact of a domain specific dictionary, we supplement the real words dictionary with additional terms obtained from the 1996 Federal Register [1]. By selecting the publications from 1996 – 2 year after our test set – we ensure minimal possible overlap of temporal topics. 
To accurately attribute the impact of these domain terms, we report our results both with and without this dataset.","PeriodicalId":152377,"journal":{"name":"Document Recognition and Retrieval","volume":"93 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Revisiting Known-Item Retrieval in Degraded Document Collections\",\"authors\":\"Jason J. Soo, O. Frieder\",\"doi\":\"10.2352/ISSN.2470-1173.2016.17.DRR-065\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Optical character recognition software converts an image of text to a text document but typically degrades the document’s contents. Correcting such degradation to enable the document set to be queried effectively is the focus of this work. The described approach uses a fusion of substring generation rules and context aware analysis to correct these errors. Evaluation was facilitated by two publicly available datasets from TREC-5’s Confusion Track containing estimated error rates of 5% and 20% . On the 5% dataset, we demonstrate a statistically significant improvement over the prior art and Solr’s mean reciprocal rank (MRR). On the 20% dataset, we demonstrate a statistically significant improvement over Solr, and have similar performance to the prior art. The described approach achieves an MRR of 0.6627 and 0.4924 on collections with error rates of approximately 5% and 20% respectively. Introduction Documents that are not electronically readable are increasingly difficult to manage, search, and maintain. Optical character recognition (OCR) is used to digitize these documents, but frequently produces a degraded copy. We develop a search system capable of searching such degraded documents. Our approach sustains a higher search accuracy rate than the prior art as evaluated using the TREC-5 Confusion Track datasets. Additionally, the approach developed is domain and language agnostic; increasing its applicability. In the United States there are two federal initiatives underway focused on the digitization of health records. First, the federal government is incentivizing local and private hospitals to switch from paper to electronic health records to improve the quality of care [3]. Second, the Veteran’s Affairs (VA) has an initiative to eliminate all paper health records by 2015 [2]. Both processes require converting paper records to digital images, and – hopefully – indexing of the digitized images to support searching. These efforts either are leveraging or can leverage OCR to query the newly created records to improve quality of service. These are but a few of the many examples demonstrating the importance of OCR. An OCR process is composed of two main parts. First is the conversion of an imagine to text by identifying characters and words from images [8, 17]. Second, the resulting text is post-processed to identify and correct errors during the first phase. Techniques in this process can range from simple dictionary checks to statistical methods. Our research focuses on the latter phase. Some work in the second phase has attempted to optimize the algorithm’s parameters by training algorithms on portions of the dataset [16]. However, such an approach does not generalize to other OCR collections. Other work focuses on specialized situations: handwritten documents [15]; signs, historical markers/documents [13, 9]. 
While other works hinge on assumptions: the OCR exposes a confidence level for each processed word [7]; online resources will allow the system to make hundredsof-thousands of queries in short bursts [6, 12]; or the ability to crawl many web sources to create lexicons [28]. We focus on the generalized case of post-processing of OCR degraded documents without training or consideration of document type. Historically, there was a flurry of research in this area, particularly around the time TREC released an OCR corrupted dataset [10]. Entries to the TREC competition fell into 2 categories: attempts to clarify or expand the query and attempts to clarify or correct the documents themselves. Results submitted from the latter category have higher mean reciprocal ranks (MRR). Therefore, we continue work in this direction. Taghva et al. published many results in this area [26]. They have designed specialized retrieval engines for OCR copies of severely degraded documents [25] and found their tested OCR error correction methods had little impact on precision/recall vs an unmodified search engine [24]. This result suggests that Solr is a good enough solution to searching OCR corrupted collections. Their most related work to this research was the creation of a correction system for OCR errors. This system uses statistical methods to make more accurate corrections, but requires user training and assistance [27]. More recent work from this lab has been focused on similar supervised approaches [18]. In contrast, our objective is the development of a solution requiring no user intervention or training data. Our contributions are: • Given a minimally corrupted dataset (∼5% error rate), we show that a fusion based method has a statistically significantly (p<0.05) higher MRR than prior art, and higher MRR than individual methods for correcting corrupt words. • Given a moderately corrupted dataset (∼20% error rate), we show the same method’s MRR is roughly equal to the prior art’s. • We evaluate the impact of context when correcting corrupted terms in a corrupted document. • We demonstrate the tradeoffs of occurrence frequency thresholds for corrupt words. Thresholds set too high and too low negatively impact MRR. • We evaluate filtering methods to increase the accuracy of identifying corrupt words. • We reinforce the assumption that use of domain keywords improve correction rates by showing their impact on MRR. Methods Dataset Document Set The experiments performed are based on the publicly available TREC-5 Confusion Track collection: 395 MB containing approximately 55,600 documents. The documents are part of the Federal Register printed by the United States Government Printing Office. A list of 49 queries and the best resulting document are provided for evaluation. Since each query seeks only a single document, MRR is reported. TREC created two corrupted datasets from the original collection with an estimated 5% error rate and 20% error rate. Real Words Dictionary We create an exhaustive English dictionary of real words using the following three datasets: 1) 99,044 words from the English dictionary1; 2) 94,293 sir names in the United States2; 3) 1,293,142 geographic locations within the United States3. Collectively, this dictionary is referred to as real words. To measure the impact of a domain specific dictionary, we supplement the real words dictionary with additional terms obtained from the 1996 Federal Register [1]. 
By selecting the publications from 1996 – 2 year after our test set – we ensure minimal possible overlap of temporal topics. To accurately attribute the impact of these domain terms, we report our results both with and without this dataset.\",\"PeriodicalId\":152377,\"journal\":{\"name\":\"Document Recognition and Retrieval\",\"volume\":\"93 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-02-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Document Recognition and Retrieval\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.2352/ISSN.2470-1173.2016.17.DRR-065\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Document Recognition and Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2352/ISSN.2470-1173.2016.17.DRR-065","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Revisiting Known-Item Retrieval in Degraded Document Collections
Optical character recognition software converts an image of text to a text document but typically degrades the document's contents. Correcting such degradation so that the document set can be queried effectively is the focus of this work. The described approach uses a fusion of substring generation rules and context-aware analysis to correct these errors. Evaluation was facilitated by two publicly available datasets from TREC-5's Confusion Track, with estimated error rates of 5% and 20%. On the 5% dataset, we demonstrate a statistically significant improvement in mean reciprocal rank (MRR) over both the prior art and Solr. On the 20% dataset, we demonstrate a statistically significant improvement over Solr and performance similar to the prior art. The described approach achieves an MRR of 0.6627 and 0.4924 on collections with error rates of approximately 5% and 20%, respectively.

Introduction

Documents that are not electronically readable are increasingly difficult to manage, search, and maintain. Optical character recognition (OCR) is used to digitize these documents, but frequently produces a degraded copy. We develop a search system capable of searching such degraded documents. Our approach sustains a higher search accuracy than the prior art, as evaluated using the TREC-5 Confusion Track datasets. Additionally, the approach developed is domain and language agnostic, increasing its applicability.

In the United States there are two federal initiatives underway focused on the digitization of health records. First, the federal government is incentivizing local and private hospitals to switch from paper to electronic health records to improve the quality of care [3]. Second, the Department of Veterans Affairs (VA) has an initiative to eliminate all paper health records by 2015 [2]. Both processes require converting paper records to digital images and, hopefully, indexing the digitized images to support searching. These efforts either are leveraging or can leverage OCR to query the newly created records to improve quality of service. These are but a few of the many examples demonstrating the importance of OCR.

An OCR process is composed of two main parts. The first is the conversion of an image to text by identifying characters and words within the image [8, 17]. The second is post-processing of the resulting text to identify and correct errors introduced during the first phase. Techniques in this phase range from simple dictionary checks to statistical methods. Our research focuses on the latter phase.

Some work on the second phase has attempted to optimize an algorithm's parameters by training on portions of the dataset [16]. However, such an approach does not generalize to other OCR collections. Other work focuses on specialized situations: handwritten documents [15]; signs and historical markers/documents [13, 9]. Still other works hinge on assumptions: that the OCR engine exposes a confidence level for each processed word [7]; that online resources will allow the system to make hundreds of thousands of queries in short bursts [6, 12]; or that many web sources can be crawled to create lexicons [28]. We focus on the general case of post-processing OCR-degraded documents without training or consideration of document type.

Historically, there was a flurry of research in this area, particularly around the time TREC released an OCR-corrupted dataset [10]. Entries to the TREC competition fell into two categories: attempts to clarify or expand the query, and attempts to clarify or correct the documents themselves. Results submitted in the latter category have higher mean reciprocal ranks (MRR); therefore, we continue work in this direction.

Taghva et al. published many results in this area [26]. They designed specialized retrieval engines for OCR copies of severely degraded documents [25] and found that their tested OCR error correction methods had little impact on precision/recall versus an unmodified search engine [24]. This result suggests that Solr is a good-enough solution for searching OCR-corrupted collections. Their work most related to this research is a correction system for OCR errors; that system uses statistical methods to make more accurate corrections, but requires user training and assistance [27]. More recent work from this lab has focused on similar supervised approaches [18]. In contrast, our objective is the development of a solution requiring no user intervention or training data.

Our contributions are:
• Given a minimally corrupted dataset (∼5% error rate), we show that a fusion-based method has a statistically significantly (p < 0.05) higher MRR than the prior art, and a higher MRR than individual methods for correcting corrupt words.
• Given a moderately corrupted dataset (∼20% error rate), we show that the same method's MRR is roughly equal to the prior art's.
• We evaluate the impact of context when correcting corrupted terms in a corrupted document.
• We demonstrate the tradeoffs of occurrence frequency thresholds for corrupt words: thresholds set too high or too low negatively impact MRR.
• We evaluate filtering methods to increase the accuracy of identifying corrupt words.
• We reinforce the assumption that use of domain keywords improves correction rates by showing their impact on MRR.

Methods

Dataset

Document Set
The experiments performed are based on the publicly available TREC-5 Confusion Track collection: 395 MB containing approximately 55,600 documents. The documents are part of the Federal Register printed by the United States Government Printing Office. A list of 49 queries and, for each, the best resulting document are provided for evaluation. Since each query seeks only a single document, MRR is reported.
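For reference, mean reciprocal rank over the query set Q is the standard known-item measure: each query contributes the reciprocal of the rank at which its single target document is returned, with the usual convention that a query whose target is never retrieved contributes 0.

```latex
\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}
```

For example, a system that returned every known item at rank 2 would score an MRR of 0.5, which helps put the reported scores of 0.6627 and 0.4924 in perspective.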
TREC created two corrupted datasets from the original collection, with estimated error rates of 5% and 20%.

Real Words Dictionary
We create an exhaustive English dictionary of real words using the following three datasets: 1) 99,044 words from an English dictionary¹; 2) 94,293 surnames in the United States²; 3) 1,293,142 geographic locations within the United States³. Collectively, this dictionary is referred to as real words. To measure the impact of a domain-specific dictionary, we supplement the real words dictionary with additional terms obtained from the 1996 Federal Register [1]. By selecting publications from 1996, two years after our test set, we ensure minimal possible overlap of temporal topics. To accurately attribute the impact of these domain terms, we report our results both with and without this dataset.
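To make the pieces described above concrete, the following sketch shows one way the real words dictionary, an occurrence-frequency threshold for spotting corrupt terms, substring-based candidate generation, and context-aware ranking could fit together. It is a minimal illustration under stated assumptions, not the paper's implementation: the file paths, the min_freq and min_len values, the particular substring rule, and the scoring function are all illustrative choices.

```python
from collections import Counter
from pathlib import Path


def load_real_words(paths):
    """Union of word lists (e.g., dictionary words, surnames, place names)."""
    words = set()
    for path in paths:
        words.update(w.strip().lower()
                     for w in Path(path).read_text().splitlines() if w.strip())
    return words


def collection_frequencies(documents):
    """Term frequencies over the (possibly corrupted) collection."""
    freq = Counter()
    for doc in documents:
        freq.update(doc.lower().split())
    return freq


def likely_corrupt(term, real_words, freq, min_freq=3):
    """Flag a term as likely corrupt: not a real word and rarely seen.
    Too low a threshold misses corrupt terms; too high flags valid rare ones."""
    return term not in real_words and freq[term] < min_freq


def substring_candidates(term, real_words, min_len=4):
    """One simple substring rule: real words that appear as a sufficiently
    long contiguous substring of the corrupt term."""
    return {term[i:j]
            for i in range(len(term))
            for j in range(i + min_len, len(term) + 1)
            if term[i:j] in real_words}


def rank_candidates(candidates, doc_terms, freq):
    """Context-aware ranking: prefer candidates that already occur
    uncorrupted elsewhere in the same document, then by collection frequency."""
    doc_counts = Counter(doc_terms)
    return sorted(candidates,
                  key=lambda c: (doc_counts[c], freq[c]),
                  reverse=True)
```

In the paper's terms, min_freq plays the role of the occurrence-frequency threshold whose tradeoff the authors evaluate, and adding Federal Register terms to the real words set corresponds to the domain-keyword experiment; the actual system fuses several substring generation rules rather than the single one sketched here.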