Cross Domain Assessment of Document to HTML Conversion Tools to Quantify Text and Structural Loss during Document Analysis

Kyle Goslin, M. Hofmann
{"title":"Cross Domain Assessment of Document to HTML Conversion Tools to Quantify Text and Structural Loss during Document Analysis","authors":"Kyle Goslin, M. Hofmann","doi":"10.1109/EISIC.2013.22","DOIUrl":null,"url":null,"abstract":"During forensic text analysis, the automation of the process is key when working with large quantities of documents. As documents often come in a wide variety of different file types, this creates the need for tailored tools to be developed to analyze each document type to correctly identify and extract text elements for analysis without loss. These text extraction tools often omit sections of text that are unreadable from documents leaving drastic inconsistencies during the forensic text analysis process. As a solution to this a single output format, HTML, was chosen as a unified analysis format. Document to HTML/CSS extraction tools each with varying techniques to convert common document formats to rich HTML/CSS counterparts were tested. This approach can reduce the amount of analysis tools needed during forensic text analysis by utilizing a single document format. Two tests were designed, a 10 point document overview test and a 48 point detailed document analysis test to assess and quantify the level of loss, rate of error and overall quality of outputted HTML structures. This study concluded that tools that utilize a number of different approaches and have an understanding of the document structure yield the best results with the least amount of loss.","PeriodicalId":229195,"journal":{"name":"2013 European Intelligence and Security Informatics Conference","volume":"136 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 European Intelligence and Security Informatics Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/EISIC.2013.22","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

During forensic text analysis, the automation of the process is key when working with large quantities of documents. As documents often come in a wide variety of different file types, this creates the need for tailored tools to be developed to analyze each document type to correctly identify and extract text elements for analysis without loss. These text extraction tools often omit sections of text that are unreadable from documents leaving drastic inconsistencies during the forensic text analysis process. As a solution to this a single output format, HTML, was chosen as a unified analysis format. Document to HTML/CSS extraction tools each with varying techniques to convert common document formats to rich HTML/CSS counterparts were tested. This approach can reduce the amount of analysis tools needed during forensic text analysis by utilizing a single document format. Two tests were designed, a 10 point document overview test and a 48 point detailed document analysis test to assess and quantify the level of loss, rate of error and overall quality of outputted HTML structures. This study concluded that tools that utilize a number of different approaches and have an understanding of the document structure yield the best results with the least amount of loss.
文档到HTML转换工具的跨域评估,以量化文档分析期间的文本和结构损失
在取证文本分析期间,处理大量文档时,过程的自动化是关键。由于文档通常有多种不同的文件类型,因此需要开发定制的工具来分析每种文档类型,以便正确识别和提取文本元素以进行分析,而不会造成损失。这些文本提取工具通常会忽略文档中不可读的文本部分,在取证文本分析过程中留下严重的不一致。为了解决这个问题,选择了单一的输出格式HTML作为统一的分析格式。对文档到HTML/CSS的提取工具进行了测试,每种工具都具有不同的技术,可以将常见的文档格式转换为丰富的HTML/CSS格式。通过使用单一的文档格式,这种方法可以减少在取证文本分析期间所需的分析工具的数量。设计了两个测试,一个是10分的文档概述测试,一个是48分的详细文档分析测试,以评估和量化输出HTML结构的丢失程度、错误率和整体质量。这项研究得出的结论是,利用多种不同方法并了解文档结构的工具可以以最小的损失产生最佳结果。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信