Recovering Damaged Documents to Improve Information Retrieval Processes

Q4 Biochemistry, Genetics and Molecular Biology
Angel L. Garrido, Álvaro Peiró
Journal of Integrated OMICS. Published 2018-12-19. DOI: 10.5584/JIOMICS.V8I3.230 (https://doi.org/10.5584/JIOMICS.V8I3.230)
Citations: 3

Abstract

Although computer forensics is usually associated with the investigation of computer crimes, it can also be used in civil proceedings. One use case is information retrieval from damaged documents, in which words have been altered, either accidentally or intentionally. In this paper, we present a new tool that retrieves information from large volumes of documents whose contents have been damaged. We have designed a new approach to recovering the original words, composed of two stages: a text cleaning filter, which removes non-relevant information, and a text correction unit, which combines a general-purpose spell checker with an N-gram-based spell checker built specifically for the domain of the documents. The benefits of this combined approach are twofold: on the one hand, the general spell checker lets us leverage the general-purpose techniques commonly used to perform corrections; on the other hand, the N-gram-based model lets us adapt them to the particular domain we are tackling by exploiting text regularities detected in successfully processed domain documents. The corrected text improves automatic information retrieval from the documents. We tested the tool on a real data set, using an information extraction tool based on semantic technologies, in collaboration with the Spanish company InSynergy Consulting.
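The two-stage idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' tool: the cleaning rule, the edit-distance candidate generator standing in for the general-purpose spell checker, the bigram ranking, and every name used (clean, edits1, DomainCorrector, DOMAIN_DOCS) are assumptions made for illustration only.

```python
import re
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def clean(text):
    """Stage 1 (assumed rule): lowercase, drop tokens that contain no
    letters, and strip non-letter characters from the remaining tokens."""
    tokens = re.findall(r"\S+", text.lower())
    return [re.sub(r"[^a-z]", "", t) for t in tokens if re.search(r"[a-z]", t)]


def edits1(word):
    """All strings one edit away from `word` (delete, swap, replace, insert).
    Stands in for the candidate generation of a general spell checker."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + swaps + replaces + inserts)


class DomainCorrector:
    """Stage 2: generic candidates, ranked with domain N-gram (bigram) counts."""

    def __init__(self, domain_docs, general_vocab=()):
        self.unigrams = Counter()
        self.bigrams = Counter()
        # Learn regularities from successfully processed domain documents.
        for doc in domain_docs:
            toks = clean(doc)
            self.unigrams.update(toks)
            self.bigrams.update(zip(toks, toks[1:]))
        # Known words: optional general word list plus the domain vocabulary.
        self.vocab = set(general_vocab) | set(self.unigrams)

    def candidates(self, word):
        if word in self.vocab:
            return [word]
        found = [w for w in edits1(word) if w in self.vocab]
        return found or [word]  # leave unknown words untouched

    def correct(self, text):
        out = []
        for tok in clean(text):
            prev = out[-1] if out else None
            # Prefer the candidate that most often follows the previous word
            # in the domain corpus; break ties by domain frequency, then
            # alphabetically so the result is deterministic.
            best = max(
                self.candidates(tok),
                key=lambda w: (self.bigrams[(prev, w)], self.unigrams[w], w),
            )
            out.append(best)
        return " ".join(out)


# Hypothetical domain corpus and damaged input, for illustration only.
DOMAIN_DOCS = [
    "the tenant shall pay the monthly rent",
    "the landlord shall return the deposit",
    "the rest of the deposit",
]

corrector = DomainCorrector(DOMAIN_DOCS)
print(corrector.correct("the ten@nt sh4ll p4y ### the monthly r3nt"))
```

In this sketch the damaged token "re#t" cleans to "ret", which is one edit away from both "rent" and "rest"; the bigram counts decide between them according to the preceding word, which is the kind of domain adaptation the abstract attributes to the N-gram model.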
Source journal: Journal of Integrated OMICS (Biochemistry, Genetics and Molecular Biology: Biochemistry)
CiteScore: 1.10
Self-citation rate: 0.00%
Articles published: 3
About the journal: JIOMICS provides a forum for the publication of original research papers, letters to the editor, short communications, and critical reviews in all branches of pure and applied -omics subjects, such as proteomics, metabolomics, metallomics and genomics. Special interest is given to papers that cover more than one -omics subject. Papers are evaluated based on scientific novelty and demonstrated scientific applicability. Original research papers on fundamental studies, and on novel sensor and instrumentation development, are especially encouraged. Novel or improved findings in areas such as clinical, medicinal, biological, environmental and materials -omics are welcome.