{"title":"多层次分类器和聚类在历史文献图像自动词识别中的应用","authors":"R. F. Moghaddam, M. Cheriet","doi":"10.1109/ICDAR.2009.104","DOIUrl":null,"url":null,"abstract":"A complete system for preprocessing and word spotting of very old historical document images is presented. Document images are processed for extraction of salient information using a word spotting technique which does not need line and word segmentation and is language independent.A multi-class library of connected components of document text is created based on six features. The spotting is performed using Euclidean distance measure enhanced by rotation and dynamic time wrapping transforms. The method is applied to a dataset from Juma Al Majid Center (Dubai)with promising results. A promising performance of the word spotting technique is obtained using an automatic preprocessing stage. In this stage, using content-level classifiers, accurate stroke pixels are extracted in a robust way. The preprocessed document images are also more legible to the end user and are less costly to archive and transfer.","PeriodicalId":433762,"journal":{"name":"2009 10th International Conference on Document Analysis and Recognition","volume":"4 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2009-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"60","resultStr":"{\"title\":\"Application of Multi-Level Classifiers and Clustering for Automatic Word Spotting in Historical Document Images\",\"authors\":\"R. F. Moghaddam, M. Cheriet\",\"doi\":\"10.1109/ICDAR.2009.104\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"A complete system for preprocessing and word spotting of very old historical document images is presented. Document images are processed for extraction of salient information using a word spotting technique which does not need line and word segmentation and is language independent.A multi-class library of connected components of document text is created based on six features. The spotting is performed using Euclidean distance measure enhanced by rotation and dynamic time wrapping transforms. The method is applied to a dataset from Juma Al Majid Center (Dubai)with promising results. A promising performance of the word spotting technique is obtained using an automatic preprocessing stage. In this stage, using content-level classifiers, accurate stroke pixels are extracted in a robust way. The preprocessed document images are also more legible to the end user and are less costly to archive and transfer.\",\"PeriodicalId\":433762,\"journal\":{\"name\":\"2009 10th International Conference on Document Analysis and Recognition\",\"volume\":\"4 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2009-07-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"60\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2009 10th International Conference on Document Analysis and Recognition\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICDAR.2009.104\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2009 10th International Conference on Document Analysis and Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDAR.2009.104","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 60
摘要
提出了一套完整的历史文献图像预处理和单词识别系统。采用一种与语言无关、不需要分线分词的词点识别技术对文档图像进行显著信息提取。基于六个特征,创建了一个文档文本连接组件的多类库。利用旋转和动态时间包裹变换增强的欧几里得距离测量进行定位。将该方法应用于Juma Al Majid Center (Dubai)的数据集,取得了令人满意的结果。采用自动预处理阶段,获得了较好的单词识别性能。在这个阶段,使用内容级分类器,以鲁棒的方式提取准确的笔画像素。预处理后的文档图像对最终用户来说也更清晰,存档和传输的成本也更低。
Application of Multi-Level Classifiers and Clustering for Automatic Word Spotting in Historical Document Images
A complete system for preprocessing and word spotting of very old historical document images is presented. Document images are processed for extraction of salient information using a word spotting technique which does not need line and word segmentation and is language independent.A multi-class library of connected components of document text is created based on six features. The spotting is performed using Euclidean distance measure enhanced by rotation and dynamic time wrapping transforms. The method is applied to a dataset from Juma Al Majid Center (Dubai)with promising results. A promising performance of the word spotting technique is obtained using an automatic preprocessing stage. In this stage, using content-level classifiers, accurate stroke pixels are extracted in a robust way. The preprocessed document images are also more legible to the end user and are less costly to archive and transfer.