构建用于复杂文档信息处理的测试集合

Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval Pub Date : 2006-08-06 DOI:10.1145/1148170.1148307

D. Lewis, G. Agam, S. Argamon, O. Frieder, D. Grossman, J. Heard

{"title":"构建用于复杂文档信息处理的测试集合","authors":"D. Lewis, G. Agam, S. Argamon, O. Frieder, D. Grossman, J. Heard","doi":"10.1145/1148170.1148307","DOIUrl":null,"url":null,"abstract":"Research and development of information access technology for scanned paper documents has been hampered by the lack of public test collections of realistic scope and complexity. As part of a project to create a prototype system for search and mining of masses of document images, we are assembling a 1.5 terabyte dataset to support evaluation of both end-to-end complex document information processing (CDIP) tasks (e.g., text retrieval and data mining) as well as component technologies such as optical character recognition (OCR), document structure analysis, signature matching, and authorship attribution.","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"5 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"230","resultStr":"{\"title\":\"Building a test collection for complex document information processing\",\"authors\":\"D. Lewis, G. Agam, S. Argamon, O. Frieder, D. Grossman, J. Heard\",\"doi\":\"10.1145/1148170.1148307\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Research and development of information access technology for scanned paper documents has been hampered by the lack of public test collections of realistic scope and complexity. As part of a project to create a prototype system for search and mining of masses of document images, we are assembling a 1.5 terabyte dataset to support evaluation of both end-to-end complex document information processing (CDIP) tasks (e.g., text retrieval and data mining) as well as component technologies such as optical character recognition (OCR), document structure analysis, signature matching, and authorship attribution.\",\"PeriodicalId\":433366,\"journal\":{\"name\":\"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval\",\"volume\":\"5 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2006-08-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"230\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/1148170.1148307\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1148170.1148307","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 230

摘要

由于缺乏具有现实范围和复杂性的公共测试集，纸质扫描文档信息访问技术的研究和发展一直受到阻碍。作为创建用于搜索和挖掘大量文档图像的原型系统项目的一部分，我们正在组装一个1.5 tb的数据集，以支持端到端的复杂文档信息处理(CDIP)任务(例如，文本检索和数据挖掘)以及光学字符识别(OCR)、文档结构分析、签名匹配和作者归属等组件技术的评估。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Building a test collection for complex document information processing

Research and development of information access technology for scanned paper documents has been hampered by the lack of public test collections of realistic scope and complexity. As part of a project to create a prototype system for search and mining of masses of document images, we are assembling a 1.5 terabyte dataset to support evaluation of both end-to-end complex document information processing (CDIP) tasks (e.g., text retrieval and data mining) as well as component technologies such as optical character recognition (OCR), document structure analysis, signature matching, and authorship attribution.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval

自引率

0.00%

发文量