旧文档中单词的索引和检索

Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings. Pub Date : 2003-08-03 DOI:10.1109/ICDAR.2003.1227663

S. Marinai, E. Marino, G. Soda

{"title":"旧文档中单词的索引和检索","authors":"S. Marinai, E. Marino, G. Soda","doi":"10.1109/ICDAR.2003.1227663","DOIUrl":null,"url":null,"abstract":"This paper describes a system for efficient indexing and retrieval of words in collections of document images. The proposed method is based on two main principles: unsupervised prototype clustering, and string encoding for efficient string matching. During indexing, a self organizing map (SOM) is trained so as to cluster together similar symbols (character-like objects) in a sub-set of the documents to be stored. By using the trained SOM the words in the whole collection can be stored and represented with a fixed-length description that can be easily compared in order to score most similar words in response to a user query. The system can be automatically adapted to different languages and font styles. The most appropriate applications are for the processing of old documents (18th and 19th Centuries) where current OCRs have more difficulties. Experimental results describe three application scenarios having various levels of difficulty for current OCR systems.","PeriodicalId":249193,"journal":{"name":"Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings.","volume":"8 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2003-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"28","resultStr":"{\"title\":\"Indexing and retrieval of words in old documents\",\"authors\":\"S. Marinai, E. Marino, G. Soda\",\"doi\":\"10.1109/ICDAR.2003.1227663\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper describes a system for efficient indexing and retrieval of words in collections of document images. The proposed method is based on two main principles: unsupervised prototype clustering, and string encoding for efficient string matching. During indexing, a self organizing map (SOM) is trained so as to cluster together similar symbols (character-like objects) in a sub-set of the documents to be stored. By using the trained SOM the words in the whole collection can be stored and represented with a fixed-length description that can be easily compared in order to score most similar words in response to a user query. The system can be automatically adapted to different languages and font styles. The most appropriate applications are for the processing of old documents (18th and 19th Centuries) where current OCRs have more difficulties. Experimental results describe three application scenarios having various levels of difficulty for current OCR systems.\",\"PeriodicalId\":249193,\"journal\":{\"name\":\"Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings.\",\"volume\":\"8 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2003-08-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"28\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings.\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICDAR.2003.1227663\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDAR.2003.1227663","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 28

摘要

本文描述了一个高效索引和检索文档图像集合中的词的系统。该方法基于两个主要原则:无监督原型聚类和高效字符串匹配的字符串编码。在索引期间，训练一个自组织映射(SOM)，以便在要存储的文档的子集中将相似的符号(类似字符的对象)聚在一起。通过使用经过训练的SOM，可以存储整个集合中的单词，并用固定长度的描述表示，可以很容易地进行比较，以便在响应用户查询时对最相似的单词进行评分。该系统可以自动适应不同的语言和字体样式。最合适的应用是处理旧文件(18和19世纪)，目前的ocr有更多的困难。实验结果描述了当前OCR系统具有不同难度的三种应用场景。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Indexing and retrieval of words in old documents

This paper describes a system for efficient indexing and retrieval of words in collections of document images. The proposed method is based on two main principles: unsupervised prototype clustering, and string encoding for efficient string matching. During indexing, a self organizing map (SOM) is trained so as to cluster together similar symbols (character-like objects) in a sub-set of the documents to be stored. By using the trained SOM the words in the whole collection can be stored and represented with a fixed-length description that can be easily compared in order to score most similar words in response to a user query. The system can be automatically adapted to different languages and font styles. The most appropriate applications are for the processing of old documents (18th and 19th Centuries) where current OCRs have more difficulties. Experimental results describe three application scenarios having various levels of difficulty for current OCR systems.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings.

自引率

0.00%

发文量