Syntactical method for reconstructing highly fragmented OOXML files

Q3 Computer Science
Maksym Boiko, Viacheslav Moskalenko
Journal: Radioelectronic and Computer Systems
Publication date: 2023-03-07
DOI: 10.32620/reks.2023.1.14 (https://doi.org/10.32620/reks.2023.1.14)
Citations: 0

Abstract

A common task in computer forensics is recovering files that lack file system metadata. When searching for file fragments in unallocated space, file carving is the most commonly used method, and it works well for unfragmented data. However, carving methods and the tools based on them are ineffective for recovering highly fragmented OOXML files because they cannot reliably determine the correct order of fragments. Techniques that reconstruct documents by analyzing words and phrases are also ineffective for fragmented OOXML documents, mainly because OOXML files are ZIP archives and therefore store their data on disk in compressed form. This paper proposes a syntactical method for reconstructing OOXML documents that relies on knowledge of the internal structure of this file type, regardless of document content. It describes the implementation of the reconstruction algorithm and the peculiarities of restoring certain types of local elements of the document. The efficiency of the algorithm was tested on the Govdocs1 and NapierOne datasets using 4096-byte data blocks, which correspond to the standard cluster size in different file systems. The experimental results confirmed the method's suitability for practical use: 82.97 % of files were recovered, including 34.38 % reconstructed completely, 0.43 % reconstructed except for at most the last 21 bytes, and another 48.16 % reconstructed except for embeddings, which require other approaches. In the last case, it is possible to obtain a fully working document that does not display graphic images or the contents of other embeddings. Because OOXML files store a CRC-32 hash of the uncompressed data stream of each local element, the correctness and integrity of the recovered information can be confirmed unambiguously. At the same time, the method's effectiveness depends mainly on the data verification methods used when reconstructing local elements that occupy at least three clusters in the file. The method can therefore be improved by developing new mechanisms for verifying XML elements.
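The CRC-32 check mentioned in the abstract can be illustrated with a short Python sketch. It is not the authors' algorithm: the function names (`parse_local_header`, `element_is_valid`, `reorder_clusters`) are hypothetical, the exhaustive search over cluster orderings is a naive stand-in for the paper's syntactical method, and the sketch assumes that the local file header carries the CRC-32 and sizes directly (general-purpose flag bit 3 not set) and that the element is either stored or DEFLATE-compressed.

```python
import itertools
import struct
import zlib

LOCAL_HEADER_SIG = b"PK\x03\x04"     # ZIP local file header signature
LOCAL_HEADER_FMT = "<4sHHHHHIIIHH"   # fixed 30-byte part of the local header
LOCAL_HEADER_LEN = struct.calcsize(LOCAL_HEADER_FMT)


def parse_local_header(buf):
    """Return (crc32, compressed_size, method, data_offset) for a header at buf[0]."""
    (sig, _version, _flags, method, _mtime, _mdate,
     crc, csize, _usize, name_len, extra_len) = struct.unpack_from(LOCAL_HEADER_FMT, buf, 0)
    if sig != LOCAL_HEADER_SIG:
        raise ValueError("not a ZIP local file header")
    # Compressed data starts after the fixed header, the file name and the extra field.
    return crc, csize, method, LOCAL_HEADER_LEN + name_len + extra_len


def element_is_valid(candidate):
    """Check candidate bytes (starting at a local header) against the stored CRC-32."""
    if len(candidate) < LOCAL_HEADER_LEN:
        return False
    crc, csize, method, offset = parse_local_header(candidate)
    payload = candidate[offset:offset + csize]
    if len(payload) < csize:
        return False                      # not enough data collected yet
    try:
        if method == 8:                   # DEFLATE without zlib wrapper
            data = zlib.decompress(payload, -15)
        elif method == 0:                 # stored, no compression
            data = payload
        else:
            return False                  # other methods are out of scope here
    except zlib.error:
        return False                      # the stream does not decode at all
    return zlib.crc32(data) == crc


def reorder_clusters(first, pool, n_needed):
    """Naively try every ordering of n_needed clusters from pool after `first`.

    Returns the first ordering whose concatenation passes the CRC-32 check,
    or None. The cost grows factorially, so this is only usable for tiny pools.
    """
    for combo in itertools.permutations(pool, n_needed):
        if element_is_valid(first + b"".join(combo)):
            return list(combo)
    return None
```

Calling `reorder_clusters(first_cluster, candidate_clusters, 2)` would search for the two 4096-byte clusters that complete an element spanning three clusters. The factorial cost of this brute-force search suggests why stronger verification of partially reconstructed elements matters for elements that occupy several clusters.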
Source journal
Radioelectronic and Computer Systems (Computer Science: Computer Graphics and Computer-Aided Design)
CiteScore: 3.60
Self-citation rate: 0.00%
Articles published: 50
Review time: 2 weeks