Word matching and retrieval from images

Seema Yadav, P. Bhanushali, Saurabhkumar Jain, Tejinder Kaur
2017 International conference of Electronics, Communication and Aerospace Technology (ICECA). Publication date: 2017-04-01.
DOI: 10.1109/ICECA.2017.8203695 (https://doi.org/10.1109/ICECA.2017.8203695)
Citations: 5

Abstract

Digital libraries store vast amounts of image data, so they need efficient query-word search methods that make this data accessible according to users' requirements. Accurate retrieval requires understanding the contents of the images. Current optical character recognition (OCR) and document image analysis technologies do not handle such documents adequately because of recognition errors: traditional OCR often fails to extract the textual characters correctly from scanned images. In this paper, we propose a word extraction and matching scheme for image documents that achieves high performance even in the presence of image noise, degradation, and font variants. First, each image in the image database is pre-processed. Next, a find-contours method detects blobs, which are then passed to the Tesseract engine. Tesseract segments the characters from the image and stores them in a character database. Each word in the database is used to index a given set of images. During retrieval, the query word presented to the system is matched against the characters in the database, and all images containing instances of the query word are retrieved and presented to the user. With this approach, our system properly handles images with different font styles and sizes and with heavily touching characters. Experimental results on various image formats show that text extraction from the images is mostly accurate and that position-based word indexing works correctly.
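The indexing and retrieval steps described in the abstract can be illustrated with a small position-based inverted index. This is only a minimal sketch, not the authors' implementation: the pre-processing, contour detection, and Tesseract stages are not shown, and the OCR output is assumed to be already available as a list of words per image. The names `build_index`, `search`, and `ocr_words`, the file names, and the punctuation and case normalization are all hypothetical choices made for the example.

```python
from collections import defaultdict


def build_index(ocr_words):
    """Map each normalized word to the images and positions where it occurs.

    ocr_words: dict of image name -> list of words in reading order
    (a hypothetical stand-in for the OCR engine's output).
    """
    index = defaultdict(lambda: defaultdict(list))
    for image, words in ocr_words.items():
        for position, word in enumerate(words):
            # Assumed normalization: strip surrounding punctuation, lowercase.
            key = word.strip(".,;:!?\"'()").lower()
            if key:
                index[key][image].append(position)
    return index


def search(index, query):
    """Return {image: [positions]} for every image containing the query word."""
    key = query.strip().lower()
    hits = index.get(key, {})
    return {image: list(positions) for image, positions in sorted(hits.items())}


# Made-up OCR output for two images, used only to exercise the sketch.
ocr_words = {
    "scan_01.png": ["Digital", "libraries", "store", "image", "data."],
    "scan_02.jpg": ["Image", "retrieval", "needs", "an", "image", "index"],
}
index = build_index(ocr_words)

print(search(index, "image"))      # {'scan_01.png': [3], 'scan_02.jpg': [0, 4]}
print(search(index, "tesseract"))  # {}
```

Storing word positions alongside image names mirrors the position-based indexing the abstract mentions. Exact matching after normalization is the simplest option here; the abstract does not say how the system tolerates OCR errors during matching, so none is modeled.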