Recovering Damaged Documents to Improve Information Retrieval Processes

Q4 Biochemistry, Genetics and Molecular Biology

Journal of Integrated OMICS, Pub Date: 2018-12-19, DOI: 10.5584/JIOMICS.V8I3.230

Angel L. Garrido, Álvaro Peiró


Citations: 3

Abstract


Although computer forensics is usually associated with the investigation of computer crimes, it can also be used in civil proceedings. One use case is information retrieval from damaged documents, in which words have been altered either accidentally or intentionally. In this paper, we present a new tool that retrieves information from large volumes of documents whose contents have been damaged. We have designed a new approach to recovering the original words, composed of two stages: a text cleaning filter, which removes irrelevant information, and a text correction unit, which combines a general-purpose spell checker with an N-gram-based spell checker built specifically for the domain of the documents. The benefits of this combined approach are twofold: on the one hand, the general spell checker lets us leverage the general-purpose techniques usually used to perform corrections; on the other hand, the N-gram-based model lets us adapt those corrections to the particular domain at hand by exploiting text regularities detected in successfully processed domain documents. The corrected output improves automatic information retrieval tasks on the texts. We have tested the tool on a real data set, using an information extraction tool based on semantic technologies, in collaboration with the Spanish company InSynergy Consulting.
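The two-stage idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' tool: the abstract gives no implementation details, so everything below is an assumption made for illustration. The cleaning rule (drop tokens with no letters), the edit-distance-1 candidate generator standing in for a general-purpose spell checker, the bigram counts standing in for the domain N-gram model, and the tiny `GENERAL_VOCAB` and `DOMAIN_SENTENCES` samples are all hypothetical.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

# Hypothetical general-purpose dictionary and clean domain text (assumptions).
GENERAL_VOCAB = {"the", "lessee", "shall", "shale", "shell", "lease",
                 "pay", "rent", "monthly"}
DOMAIN_SENTENCES = ["the lessee shall pay rent",
                    "the lessee shall pay monthly rent"]


def clean(text):
    """Stage 1 (assumed rule): keep only the letters of tokens that contain letters."""
    kept = []
    for token in text.lower().split():
        letters = "".join(ch for ch in token if ch in ALPHABET)
        if letters:
            kept.append(letters)
    return kept


def edits1(word):
    """All strings one edit (delete, swap, replace, insert) away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    swaps = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
    inserts = [l + c + r for l, r in splits for c in ALPHABET]
    return set(deletes + swaps + replaces + inserts)


def train_bigrams(sentences):
    """Count unigrams and bigrams in successfully processed domain text."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def correct(tokens, vocab, unigrams, bigrams):
    """Stage 2: general candidates, ranked by the domain bigram model."""
    corrected, prev = [], "<s>"
    for token in tokens:
        if token in vocab:
            best = token
        else:
            # General spell checker: dictionary words one edit away.
            candidates = [w for w in edits1(token) if w in vocab] or [token]
            # Domain model: prefer the candidate that most often follows prev;
            # the word itself is the last key so ties break deterministically.
            best = max(candidates,
                       key=lambda w: (bigrams[(prev, w)], unigrams[w], w))
        corrected.append(best)
        prev = best
    return corrected


def recover(text):
    """Run both stages on a damaged string and return the repaired text."""
    unigrams, bigrams = train_bigrams(DOMAIN_SENTENCES)
    return " ".join(correct(clean(text), GENERAL_VOCAB, unigrams, bigrams))
```

With these sample inputs, `recover("### the lesee shal pay %% 1234 rent")` discards the three noise tokens and repairs the two damaged words. For "shal", the dictionary alone offers both "shall" and "shale"; the domain bigram counts settle on "shall" because it follows "lessee" in the domain sentences, which is the kind of domain adaptation the abstract attributes to the N-gram component.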


Source journal

Journal of Integrated OMICS Biochemistry, Genetics and Molecular Biology-Biochemistry

CiteScore

1.10

Self-citation rate

0.00%

Articles published

Journal description: JIOMICS provides a forum for the publication of original research papers, letters to the editor, short communications, and critical reviews in all branches of pure and applied -omics subjects, such as proteomics, metabolomics, metallomics and genomics. Special interest is given to papers that cover more than one -omics subject. Papers are evaluated on scientific novelty and demonstrated scientific applicability. Original research papers on fundamental studies, and on novel sensor and instrumentation development, are especially encouraged. Novel or improved findings in areas such as clinical, medicinal, biological, environmental and materials -omics are welcome.