在HIMANIS计划中大规模索引大量中世纪手稿集的预备KWS实验

Théodore Bluche, Sébastien Hamel, Christopher Kermorvant, J. Puigcerver, D. Stutzmann, A. Toselli, E. Vidal
{"title":"在HIMANIS计划中大规模索引大量中世纪手稿集的预备KWS实验","authors":"Théodore Bluche, Sébastien Hamel, Christopher Kermorvant, J. Puigcerver, D. Stutzmann, A. Toselli, E. Vidal","doi":"10.1109/ICDAR.2017.59","DOIUrl":null,"url":null,"abstract":"Making large-scale collections of digitized historical documents searchable is being earnestly demanded by many archives and libraries. Probabilistically indexing the text images of these collections by means of keyword spotting techniques is currently seen as perhaps the only feasible approach to meet this demand. A vast medieval manuscript collection, written in both Latin and French, called \"Chancery\", is currently being considered for indexing at large. In addition to its bilingual nature, one of the major difficulties of this collection is the very high rate of abbreviated words which, on the other hand, are completely expanded in the ground truth transcripts available. In preparation to undertake full indexing of Chancery, experiments have been carried out on a relatively small but fully representative subset of this collection. To this end, a keyword spotting approach has been adopted which computes word relevance probabilities using character lattices produced by a recurrent neural network and a N-gram character language model. Results confirm the viability of the chosen approach for the large-scale indexing aimed at and show the ability of the proposed modeling and training approaches to properly deal with the abbreviation difficulties mentioned.","PeriodicalId":433676,"journal":{"name":"2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"38","resultStr":"{\"title\":\"Preparatory KWS Experiments for Large-Scale Indexing of a Vast Medieval Manuscript Collection in the HIMANIS Project\",\"authors\":\"Théodore Bluche, Sébastien Hamel, Christopher Kermorvant, J. Puigcerver, D. Stutzmann, A. Toselli, E. Vidal\",\"doi\":\"10.1109/ICDAR.2017.59\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Making large-scale collections of digitized historical documents searchable is being earnestly demanded by many archives and libraries. Probabilistically indexing the text images of these collections by means of keyword spotting techniques is currently seen as perhaps the only feasible approach to meet this demand. A vast medieval manuscript collection, written in both Latin and French, called \\\"Chancery\\\", is currently being considered for indexing at large. In addition to its bilingual nature, one of the major difficulties of this collection is the very high rate of abbreviated words which, on the other hand, are completely expanded in the ground truth transcripts available. In preparation to undertake full indexing of Chancery, experiments have been carried out on a relatively small but fully representative subset of this collection. To this end, a keyword spotting approach has been adopted which computes word relevance probabilities using character lattices produced by a recurrent neural network and a N-gram character language model. Results confirm the viability of the chosen approach for the large-scale indexing aimed at and show the ability of the proposed modeling and training approaches to properly deal with the abbreviation difficulties mentioned.\",\"PeriodicalId\":433676,\"journal\":{\"name\":\"2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR)\",\"volume\":\"6 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-11-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"38\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICDAR.2017.59\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDAR.2017.59","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 38

摘要

实现大规模的数字化历史文献检索是许多档案馆和图书馆的迫切要求。通过关键字定位技术对这些集合的文本图像进行概率索引,目前看来可能是满足这一需求的唯一可行方法。一个用拉丁语和法语写成的庞大的中世纪手稿集,被称为“Chancery”,目前正被考虑编入索引。除了它的双语性质之外,这个集合的主要困难之一是缩略词的比率非常高,另一方面,这些缩写词在现有的真实文本中完全扩展了。为了准备对Chancery进行全面索引,已对该集合中一个相对较小但完全具有代表性的子集进行了实验。为此,采用了一种关键字识别方法,该方法使用由循环神经网络和N-gram字符语言模型产生的字符格计算单词相关概率。结果证实了所选方法对于大规模标引的可行性,并表明所提出的建模和训练方法能够正确处理上述缩写困难。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Preparatory KWS Experiments for Large-Scale Indexing of a Vast Medieval Manuscript Collection in the HIMANIS Project
Making large-scale collections of digitized historical documents searchable is being earnestly demanded by many archives and libraries. Probabilistically indexing the text images of these collections by means of keyword spotting techniques is currently seen as perhaps the only feasible approach to meet this demand. A vast medieval manuscript collection, written in both Latin and French, called "Chancery", is currently being considered for indexing at large. In addition to its bilingual nature, one of the major difficulties of this collection is the very high rate of abbreviated words which, on the other hand, are completely expanded in the ground truth transcripts available. In preparation to undertake full indexing of Chancery, experiments have been carried out on a relatively small but fully representative subset of this collection. To this end, a keyword spotting approach has been adopted which computes word relevance probabilities using character lattices produced by a recurrent neural network and a N-gram character language model. Results confirm the viability of the chosen approach for the large-scale indexing aimed at and show the ability of the proposed modeling and training approaches to properly deal with the abbreviation difficulties mentioned.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信