{"title":"退化文献集合中已知项检索的重访","authors":"Jason J. Soo, O. Frieder","doi":"10.2352/ISSN.2470-1173.2016.17.DRR-065","DOIUrl":null,"url":null,"abstract":"Optical character recognition software converts an image of text to a text document but typically degrades the document’s contents. Correcting such degradation to enable the document set to be queried effectively is the focus of this work. The described approach uses a fusion of substring generation rules and context aware analysis to correct these errors. Evaluation was facilitated by two publicly available datasets from TREC-5’s Confusion Track containing estimated error rates of 5% and 20% . On the 5% dataset, we demonstrate a statistically significant improvement over the prior art and Solr’s mean reciprocal rank (MRR). On the 20% dataset, we demonstrate a statistically significant improvement over Solr, and have similar performance to the prior art. The described approach achieves an MRR of 0.6627 and 0.4924 on collections with error rates of approximately 5% and 20% respectively. Introduction Documents that are not electronically readable are increasingly difficult to manage, search, and maintain. Optical character recognition (OCR) is used to digitize these documents, but frequently produces a degraded copy. We develop a search system capable of searching such degraded documents. Our approach sustains a higher search accuracy rate than the prior art as evaluated using the TREC-5 Confusion Track datasets. Additionally, the approach developed is domain and language agnostic; increasing its applicability. In the United States there are two federal initiatives underway focused on the digitization of health records. First, the federal government is incentivizing local and private hospitals to switch from paper to electronic health records to improve the quality of care [3]. Second, the Veteran’s Affairs (VA) has an initiative to eliminate all paper health records by 2015 [2]. Both processes require converting paper records to digital images, and – hopefully – indexing of the digitized images to support searching. These efforts either are leveraging or can leverage OCR to query the newly created records to improve quality of service. These are but a few of the many examples demonstrating the importance of OCR. An OCR process is composed of two main parts. First is the conversion of an imagine to text by identifying characters and words from images [8, 17]. Second, the resulting text is post-processed to identify and correct errors during the first phase. Techniques in this process can range from simple dictionary checks to statistical methods. Our research focuses on the latter phase. Some work in the second phase has attempted to optimize the algorithm’s parameters by training algorithms on portions of the dataset [16]. However, such an approach does not generalize to other OCR collections. Other work focuses on specialized situations: handwritten documents [15]; signs, historical markers/documents [13, 9]. While other works hinge on assumptions: the OCR exposes a confidence level for each processed word [7]; online resources will allow the system to make hundredsof-thousands of queries in short bursts [6, 12]; or the ability to crawl many web sources to create lexicons [28]. We focus on the generalized case of post-processing of OCR degraded documents without training or consideration of document type. Historically, there was a flurry of research in this area, particularly around the time TREC released an OCR corrupted dataset [10]. 
Entries to the TREC competition fell into 2 categories: attempts to clarify or expand the query and attempts to clarify or correct the documents themselves. Results submitted from the latter category have higher mean reciprocal ranks (MRR). Therefore, we continue work in this direction. Taghva et al. published many results in this area [26]. They have designed specialized retrieval engines for OCR copies of severely degraded documents [25] and found their tested OCR error correction methods had little impact on precision/recall vs an unmodified search engine [24]. This result suggests that Solr is a good enough solution to searching OCR corrupted collections. Their most related work to this research was the creation of a correction system for OCR errors. This system uses statistical methods to make more accurate corrections, but requires user training and assistance [27]. More recent work from this lab has been focused on similar supervised approaches [18]. In contrast, our objective is the development of a solution requiring no user intervention or training data. Our contributions are: • Given a minimally corrupted dataset (∼5% error rate), we show that a fusion based method has a statistically significantly (p<0.05) higher MRR than prior art, and higher MRR than individual methods for correcting corrupt words. • Given a moderately corrupted dataset (∼20% error rate), we show the same method’s MRR is roughly equal to the prior art’s. • We evaluate the impact of context when correcting corrupted terms in a corrupted document. • We demonstrate the tradeoffs of occurrence frequency thresholds for corrupt words. Thresholds set too high and too low negatively impact MRR. • We evaluate filtering methods to increase the accuracy of identifying corrupt words. • We reinforce the assumption that use of domain keywords improve correction rates by showing their impact on MRR. Methods Dataset Document Set The experiments performed are based on the publicly available TREC-5 Confusion Track collection: 395 MB containing approximately 55,600 documents. The documents are part of the Federal Register printed by the United States Government Printing Office. A list of 49 queries and the best resulting document are provided for evaluation. Since each query seeks only a single document, MRR is reported. TREC created two corrupted datasets from the original collection with an estimated 5% error rate and 20% error rate. Real Words Dictionary We create an exhaustive English dictionary of real words using the following three datasets: 1) 99,044 words from the English dictionary1; 2) 94,293 sir names in the United States2; 3) 1,293,142 geographic locations within the United States3. Collectively, this dictionary is referred to as real words. To measure the impact of a domain specific dictionary, we supplement the real words dictionary with additional terms obtained from the 1996 Federal Register [1]. By selecting the publications from 1996 – 2 year after our test set – we ensure minimal possible overlap of temporal topics. 
To accurately attribute the impact of these domain terms, we report our results both with and without this dataset.","PeriodicalId":152377,"journal":{"name":"Document Recognition and Retrieval","volume":"93 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Revisiting Known-Item Retrieval in Degraded Document Collections\",\"authors\":\"Jason J. Soo, O. Frieder\",\"doi\":\"10.2352/ISSN.2470-1173.2016.17.DRR-065\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Optical character recognition software converts an image of text to a text document but typically degrades the document’s contents. Correcting such degradation to enable the document set to be queried effectively is the focus of this work. The described approach uses a fusion of substring generation rules and context aware analysis to correct these errors. Evaluation was facilitated by two publicly available datasets from TREC-5’s Confusion Track containing estimated error rates of 5% and 20% . On the 5% dataset, we demonstrate a statistically significant improvement over the prior art and Solr’s mean reciprocal rank (MRR). On the 20% dataset, we demonstrate a statistically significant improvement over Solr, and have similar performance to the prior art. The described approach achieves an MRR of 0.6627 and 0.4924 on collections with error rates of approximately 5% and 20% respectively. Introduction Documents that are not electronically readable are increasingly difficult to manage, search, and maintain. Optical character recognition (OCR) is used to digitize these documents, but frequently produces a degraded copy. We develop a search system capable of searching such degraded documents. Our approach sustains a higher search accuracy rate than the prior art as evaluated using the TREC-5 Confusion Track datasets. Additionally, the approach developed is domain and language agnostic; increasing its applicability. In the United States there are two federal initiatives underway focused on the digitization of health records. First, the federal government is incentivizing local and private hospitals to switch from paper to electronic health records to improve the quality of care [3]. Second, the Veteran’s Affairs (VA) has an initiative to eliminate all paper health records by 2015 [2]. Both processes require converting paper records to digital images, and – hopefully – indexing of the digitized images to support searching. These efforts either are leveraging or can leverage OCR to query the newly created records to improve quality of service. These are but a few of the many examples demonstrating the importance of OCR. An OCR process is composed of two main parts. First is the conversion of an imagine to text by identifying characters and words from images [8, 17]. Second, the resulting text is post-processed to identify and correct errors during the first phase. Techniques in this process can range from simple dictionary checks to statistical methods. Our research focuses on the latter phase. Some work in the second phase has attempted to optimize the algorithm’s parameters by training algorithms on portions of the dataset [16]. However, such an approach does not generalize to other OCR collections. Other work focuses on specialized situations: handwritten documents [15]; signs, historical markers/documents [13, 9]. 
While other works hinge on assumptions: the OCR exposes a confidence level for each processed word [7]; online resources will allow the system to make hundredsof-thousands of queries in short bursts [6, 12]; or the ability to crawl many web sources to create lexicons [28]. We focus on the generalized case of post-processing of OCR degraded documents without training or consideration of document type. Historically, there was a flurry of research in this area, particularly around the time TREC released an OCR corrupted dataset [10]. Entries to the TREC competition fell into 2 categories: attempts to clarify or expand the query and attempts to clarify or correct the documents themselves. Results submitted from the latter category have higher mean reciprocal ranks (MRR). Therefore, we continue work in this direction. Taghva et al. published many results in this area [26]. They have designed specialized retrieval engines for OCR copies of severely degraded documents [25] and found their tested OCR error correction methods had little impact on precision/recall vs an unmodified search engine [24]. This result suggests that Solr is a good enough solution to searching OCR corrupted collections. Their most related work to this research was the creation of a correction system for OCR errors. This system uses statistical methods to make more accurate corrections, but requires user training and assistance [27]. More recent work from this lab has been focused on similar supervised approaches [18]. In contrast, our objective is the development of a solution requiring no user intervention or training data. Our contributions are: • Given a minimally corrupted dataset (∼5% error rate), we show that a fusion based method has a statistically significantly (p<0.05) higher MRR than prior art, and higher MRR than individual methods for correcting corrupt words. • Given a moderately corrupted dataset (∼20% error rate), we show the same method’s MRR is roughly equal to the prior art’s. • We evaluate the impact of context when correcting corrupted terms in a corrupted document. • We demonstrate the tradeoffs of occurrence frequency thresholds for corrupt words. Thresholds set too high and too low negatively impact MRR. • We evaluate filtering methods to increase the accuracy of identifying corrupt words. • We reinforce the assumption that use of domain keywords improve correction rates by showing their impact on MRR. Methods Dataset Document Set The experiments performed are based on the publicly available TREC-5 Confusion Track collection: 395 MB containing approximately 55,600 documents. The documents are part of the Federal Register printed by the United States Government Printing Office. A list of 49 queries and the best resulting document are provided for evaluation. Since each query seeks only a single document, MRR is reported. TREC created two corrupted datasets from the original collection with an estimated 5% error rate and 20% error rate. Real Words Dictionary We create an exhaustive English dictionary of real words using the following three datasets: 1) 99,044 words from the English dictionary1; 2) 94,293 sir names in the United States2; 3) 1,293,142 geographic locations within the United States3. Collectively, this dictionary is referred to as real words. To measure the impact of a domain specific dictionary, we supplement the real words dictionary with additional terms obtained from the 1996 Federal Register [1]. 
By selecting the publications from 1996 – 2 year after our test set – we ensure minimal possible overlap of temporal topics. To accurately attribute the impact of these domain terms, we report our results both with and without this dataset.\",\"PeriodicalId\":152377,\"journal\":{\"name\":\"Document Recognition and Retrieval\",\"volume\":\"93 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-02-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Document Recognition and Retrieval\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.2352/ISSN.2470-1173.2016.17.DRR-065\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Document Recognition and Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2352/ISSN.2470-1173.2016.17.DRR-065","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Revisiting Known-Item Retrieval in Degraded Document Collections
Optical character recognition software converts an image of text to a text document but typically degrades the document's contents. Correcting such degradation so that the document set can be queried effectively is the focus of this work. The described approach uses a fusion of substring generation rules and context-aware analysis to correct these errors. Evaluation was facilitated by two publicly available datasets from TREC-5's Confusion Track, with estimated error rates of 5% and 20%. On the 5% dataset, we demonstrate a statistically significant improvement in mean reciprocal rank (MRR) over both the prior art and Solr. On the 20% dataset, we demonstrate a statistically significant improvement over Solr and performance similar to the prior art. The described approach achieves an MRR of 0.6627 and 0.4924 on collections with error rates of approximately 5% and 20%, respectively.

Introduction

Documents that are not electronically readable are increasingly difficult to manage, search, and maintain. Optical character recognition (OCR) is used to digitize these documents, but frequently produces a degraded copy. We develop a search system capable of searching such degraded documents. Our approach sustains a higher search accuracy than the prior art, as evaluated using the TREC-5 Confusion Track datasets. Additionally, the approach developed is domain and language agnostic, increasing its applicability.

In the United States there are two federal initiatives underway focused on the digitization of health records. First, the federal government is incentivizing local and private hospitals to switch from paper to electronic health records to improve the quality of care [3]. Second, the Department of Veterans Affairs (VA) has an initiative to eliminate all paper health records by 2015 [2]. Both processes require converting paper records to digital images and, hopefully, indexing the digitized images to support searching. These efforts either are leveraging or can leverage OCR to query the newly created records to improve quality of service. These are but a few of the many examples demonstrating the importance of OCR.

An OCR process is composed of two main parts. The first is the conversion of an image to text by identifying characters and words within the image [8, 17]. The second is post-processing of the resulting text to identify and correct errors introduced during the first phase. Techniques in this phase range from simple dictionary checks to statistical methods. Our research focuses on the latter phase.

Some work on the second phase has attempted to optimize an algorithm's parameters by training on portions of the dataset [16]. However, such an approach does not generalize to other OCR collections. Other work focuses on specialized situations: handwritten documents [15]; signs and historical markers/documents [13, 9]. Still other works hinge on assumptions: that the OCR engine exposes a confidence level for each processed word [7]; that online resources will allow the system to make hundreds of thousands of queries in short bursts [6, 12]; or that many web sources can be crawled to create lexicons [28]. We focus on the general case of post-processing OCR-degraded documents without training or consideration of document type.

Historically, there was a flurry of research in this area, particularly around the time TREC released an OCR-corrupted dataset [10]. Entries to the TREC competition fell into two categories: attempts to clarify or expand the query, and attempts to clarify or correct the documents themselves. Results submitted in the latter category have higher mean reciprocal ranks (MRR); therefore, we continue work in this direction.

Taghva et al. published many results in this area [26]. They designed specialized retrieval engines for OCR copies of severely degraded documents [25] and found that their tested OCR error correction methods had little impact on precision/recall versus an unmodified search engine [24]. This result suggests that Solr is a good-enough solution for searching OCR-corrupted collections. Their work most related to this research is a correction system for OCR errors; that system uses statistical methods to make more accurate corrections, but requires user training and assistance [27]. More recent work from this lab has focused on similar supervised approaches [18]. In contrast, our objective is the development of a solution requiring no user intervention or training data.

Our contributions are:
• Given a minimally corrupted dataset (∼5% error rate), we show that a fusion-based method has a statistically significantly (p < 0.05) higher MRR than the prior art, and a higher MRR than individual methods for correcting corrupt words.
• Given a moderately corrupted dataset (∼20% error rate), we show that the same method's MRR is roughly equal to the prior art's.
• We evaluate the impact of context when correcting corrupted terms in a corrupted document.
• We demonstrate the tradeoffs of occurrence frequency thresholds for corrupt words: thresholds set too high or too low negatively impact MRR.
• We evaluate filtering methods to increase the accuracy of identifying corrupt words.
• We reinforce the assumption that use of domain keywords improves correction rates by showing their impact on MRR.

Methods

Dataset

Document Set
The experiments performed are based on the publicly available TREC-5 Confusion Track collection: 395 MB containing approximately 55,600 documents. The documents are part of the Federal Register printed by the United States Government Printing Office. A list of 49 queries and, for each, the best resulting document are provided for evaluation. Since each query seeks only a single document, MRR is reported.
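For reference, mean reciprocal rank over the query set Q is the standard known-item measure: each query contributes the reciprocal of the rank at which its single target document is returned, with the usual convention that a query whose target is never retrieved contributes 0.

```latex
\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}
```

For example, a system that returned every known item at rank 2 would score an MRR of 0.5, which helps put the reported scores of 0.6627 and 0.4924 in perspective.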
TREC created two corrupted datasets from the original collection, with estimated error rates of 5% and 20%.

Real Words Dictionary
We create an exhaustive English dictionary of real words using the following three datasets: 1) 99,044 words from an English dictionary¹; 2) 94,293 surnames in the United States²; 3) 1,293,142 geographic locations within the United States³. Collectively, this dictionary is referred to as real words. To measure the impact of a domain-specific dictionary, we supplement the real words dictionary with additional terms obtained from the 1996 Federal Register [1]. By selecting publications from 1996, two years after our test set, we ensure minimal possible overlap of temporal topics. To accurately attribute the impact of these domain terms, we report our results both with and without this dataset.
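To make the pieces described above concrete, the following sketch shows one way the real words dictionary, an occurrence-frequency threshold for spotting corrupt terms, substring-based candidate generation, and context-aware ranking could fit together. It is a minimal illustration under stated assumptions, not the paper's implementation: the file paths, the min_freq and min_len values, the particular substring rule, and the scoring function are all illustrative choices.

```python
from collections import Counter
from pathlib import Path


def load_real_words(paths):
    """Union of word lists (e.g., dictionary words, surnames, place names)."""
    words = set()
    for path in paths:
        words.update(w.strip().lower()
                     for w in Path(path).read_text().splitlines() if w.strip())
    return words


def collection_frequencies(documents):
    """Term frequencies over the (possibly corrupted) collection."""
    freq = Counter()
    for doc in documents:
        freq.update(doc.lower().split())
    return freq


def likely_corrupt(term, real_words, freq, min_freq=3):
    """Flag a term as likely corrupt: not a real word and rarely seen.
    Too low a threshold misses corrupt terms; too high flags valid rare ones."""
    return term not in real_words and freq[term] < min_freq


def substring_candidates(term, real_words, min_len=4):
    """One simple substring rule: real words that appear as a sufficiently
    long contiguous substring of the corrupt term."""
    return {term[i:j]
            for i in range(len(term))
            for j in range(i + min_len, len(term) + 1)
            if term[i:j] in real_words}


def rank_candidates(candidates, doc_terms, freq):
    """Context-aware ranking: prefer candidates that already occur
    uncorrupted elsewhere in the same document, then by collection frequency."""
    doc_counts = Counter(doc_terms)
    return sorted(candidates,
                  key=lambda c: (doc_counts[c], freq[c]),
                  reverse=True)
```

In the paper's terms, min_freq plays the role of the occurrence-frequency threshold whose tradeoff the authors evaluate, and adding Federal Register terms to the real words set corresponds to the domain-keyword experiment; the actual system fuses several substring generation rules rather than the single one sketched here.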