在HIMANIS计划中大规模索引大量中世纪手稿集的预备KWS实验

2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) Pub Date : 2017-11-09 DOI:10.1109/ICDAR.2017.59

Théodore Bluche, Sébastien Hamel, Christopher Kermorvant, J. Puigcerver, D. Stutzmann, A. Toselli, E. Vidal

{"title":"在HIMANIS计划中大规模索引大量中世纪手稿集的预备KWS实验","authors":"Théodore Bluche, Sébastien Hamel, Christopher Kermorvant, J. Puigcerver, D. Stutzmann, A. Toselli, E. Vidal","doi":"10.1109/ICDAR.2017.59","DOIUrl":null,"url":null,"abstract":"Making large-scale collections of digitized historical documents searchable is being earnestly demanded by many archives and libraries. Probabilistically indexing the text images of these collections by means of keyword spotting techniques is currently seen as perhaps the only feasible approach to meet this demand. A vast medieval manuscript collection, written in both Latin and French, called \"Chancery\", is currently being considered for indexing at large. In addition to its bilingual nature, one of the major difficulties of this collection is the very high rate of abbreviated words which, on the other hand, are completely expanded in the ground truth transcripts available. In preparation to undertake full indexing of Chancery, experiments have been carried out on a relatively small but fully representative subset of this collection. To this end, a keyword spotting approach has been adopted which computes word relevance probabilities using character lattices produced by a recurrent neural network and a N-gram character language model. Results confirm the viability of the chosen approach for the large-scale indexing aimed at and show the ability of the proposed modeling and training approaches to properly deal with the abbreviation difficulties mentioned.","PeriodicalId":433676,"journal":{"name":"2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"38","resultStr":"{\"title\":\"Preparatory KWS Experiments for Large-Scale Indexing of a Vast Medieval Manuscript Collection in the HIMANIS Project\",\"authors\":\"Théodore Bluche, Sébastien Hamel, Christopher Kermorvant, J. Puigcerver, D. Stutzmann, A. Toselli, E. Vidal\",\"doi\":\"10.1109/ICDAR.2017.59\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Making large-scale collections of digitized historical documents searchable is being earnestly demanded by many archives and libraries. Probabilistically indexing the text images of these collections by means of keyword spotting techniques is currently seen as perhaps the only feasible approach to meet this demand. A vast medieval manuscript collection, written in both Latin and French, called \\\"Chancery\\\", is currently being considered for indexing at large. In addition to its bilingual nature, one of the major difficulties of this collection is the very high rate of abbreviated words which, on the other hand, are completely expanded in the ground truth transcripts available. In preparation to undertake full indexing of Chancery, experiments have been carried out on a relatively small but fully representative subset of this collection. To this end, a keyword spotting approach has been adopted which computes word relevance probabilities using character lattices produced by a recurrent neural network and a N-gram character language model. Results confirm the viability of the chosen approach for the large-scale indexing aimed at and show the ability of the proposed modeling and training approaches to properly deal with the abbreviation difficulties mentioned.\",\"PeriodicalId\":433676,\"journal\":{\"name\":\"2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR)\",\"volume\":\"6 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-11-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"38\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICDAR.2017.59\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDAR.2017.59","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 38

摘要

实现大规模的数字化历史文献检索是许多档案馆和图书馆的迫切要求。通过关键字定位技术对这些集合的文本图像进行概率索引，目前看来可能是满足这一需求的唯一可行方法。一个用拉丁语和法语写成的庞大的中世纪手稿集，被称为“Chancery”，目前正被考虑编入索引。除了它的双语性质之外，这个集合的主要困难之一是缩略词的比率非常高，另一方面，这些缩写词在现有的真实文本中完全扩展了。为了准备对Chancery进行全面索引，已对该集合中一个相对较小但完全具有代表性的子集进行了实验。为此，采用了一种关键字识别方法，该方法使用由循环神经网络和N-gram字符语言模型产生的字符格计算单词相关概率。结果证实了所选方法对于大规模标引的可行性，并表明所提出的建模和训练方法能够正确处理上述缩写困难。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Preparatory KWS Experiments for Large-Scale Indexing of a Vast Medieval Manuscript Collection in the HIMANIS Project

Making large-scale collections of digitized historical documents searchable is being earnestly demanded by many archives and libraries. Probabilistically indexing the text images of these collections by means of keyword spotting techniques is currently seen as perhaps the only feasible approach to meet this demand. A vast medieval manuscript collection, written in both Latin and French, called "Chancery", is currently being considered for indexing at large. In addition to its bilingual nature, one of the major difficulties of this collection is the very high rate of abbreviated words which, on the other hand, are completely expanded in the ground truth transcripts available. In preparation to undertake full indexing of Chancery, experiments have been carried out on a relatively small but fully representative subset of this collection. To this end, a keyword spotting approach has been adopted which computes word relevance probabilities using character lattices produced by a recurrent neural network and a N-gram character language model. Results confirm the viability of the chosen approach for the large-scale indexing aimed at and show the ability of the proposed modeling and training approaches to properly deal with the abbreviation difficulties mentioned.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR)

自引率

0.00%

发文量