Syntactical method for reconstructing highly fragmented OOXML files

Q3 Computer Science

Radioelectronic and Computer Systems. Pub Date: 2023-03-07. DOI: 10.32620/reks.2023.1.14

Maksym Boiko, Viacheslav Moskalenko


Citations: 0

Abstract



A common task in computer forensics is recovering files that lack file system metadata. When searching unallocated space for file fragments, file carving is the most commonly used method, and it works well for unfragmented data. However, carving methods and the tools built on them are ineffective for recovering highly fragmented OOXML files because they cannot reliably determine the correct order of fragments. Techniques that reconstruct documents by analyzing words and phrases are also ineffective for fragmented OOXML documents, mainly because OOXML files are ZIP archives and therefore store their data on disk in compressed form. This paper proposes a syntactical method for reconstructing OOXML documents that relies on knowledge of the internal structure of this file type, regardless of document content. It describes the implementation of the reconstruction algorithm and the peculiarities of restoring certain types of local elements of the document. The algorithm's effectiveness was tested on the Govdocs1 and NapierOne datasets using 4096-byte data blocks, which correspond to the standard cluster size in several file systems. The experimental results confirmed the method's suitability for practical use: 82.97% of files were recovered, including 34.38% reconstructed completely, 0.43% reconstructed except for at most the last 21 bytes, and another 48.16% reconstructed except for embeddings that require other approaches. In the latter case, it is possible to obtain a fully working document that does not display graphic images or the contents of other embeddings. Because OOXML files store a CRC-32 checksum of the uncompressed data stream of each local element, the correctness and integrity of the recovered information can be confirmed unambiguously. At the same time, the method's effectiveness depends mainly on the data verification methods used while reconstructing local elements that occupy at least three clusters in the file. The method is therefore expected to be improved by developing new mechanisms for verifying XML elements.
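The CRC-32 check mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' reconstruction algorithm: it only shows how a candidate ordering of 4096-byte clusters for a single local element could be accepted or rejected by inflating the raw deflate stream and comparing its CRC-32 with the value stored in the ZIP local file header. The helper names (`parse_local_header`, `crc_matches`, `build_demo_archive`) and the demo archive are hypothetical, and the sketch assumes the CRC-32 and sizes are present in the local header rather than in a trailing data descriptor (general-purpose flag bit 3 not set).

```python
import io
import random
import struct
import zipfile
import zlib

CLUSTER_SIZE = 4096               # block size used in the paper's experiments
LOCAL_HEADER_SIG = b"PK\x03\x04"  # ZIP local file header signature


def parse_local_header(buf, offset=0):
    """Parse the fixed 30-byte ZIP local file header located at `offset`."""
    fields = struct.unpack_from("<4sHHHHHIIIHH", buf, offset)
    sig, _ver, flags, method, _time, _date, crc, csize, usize, nlen, xlen = fields
    if sig != LOCAL_HEADER_SIG:
        raise ValueError("no local file header at this offset")
    if flags & 0x08:
        # Assumption of this sketch: CRC-32 and sizes live in the header itself.
        raise ValueError("CRC-32 and sizes are stored in a data descriptor")
    name = buf[offset + 30:offset + 30 + nlen].decode("utf-8", "replace")
    return {
        "name": name,
        "method": method,
        "crc32": crc,
        "comp_size": csize,
        "uncomp_size": usize,
        "data_start": offset + 30 + nlen + xlen,
    }


def crc_matches(header, payload):
    """Return True if `payload` inflates to data matching the header's size and CRC-32."""
    if header["method"] == 0:        # stored, no compression
        data = payload
    elif header["method"] == 8:      # deflate, raw stream without zlib wrapper
        try:
            data = zlib.decompress(payload, wbits=-15)
        except zlib.error:
            return False             # wrong fragment order usually breaks the stream
    else:
        return False                 # other methods are out of scope here
    return len(data) == header["uncomp_size"] and zlib.crc32(data) == header["crc32"]


def build_demo_archive():
    """Build an in-memory ZIP whose single entry spans several clusters (demo only)."""
    rng = random.Random(0)
    body = bytes(rng.getrandbits(8) for _ in range(3 * CLUSTER_SIZE))
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("word/document.xml", body)
    return out.getvalue()


archive = build_demo_archive()
header = parse_local_header(archive)
start = header["data_start"]
payload = archive[start:start + header["comp_size"]]

# Split the compressed payload into cluster-sized chunks and test two orderings.
chunks = [payload[i:i + CLUSTER_SIZE] for i in range(0, len(payload), CLUSTER_SIZE)]
in_order = b"".join(chunks)
swapped = b"".join([chunks[1], chunks[0]] + chunks[2:])

print(header["name"], "in order:", crc_matches(header, in_order))
print(header["name"], "swapped: ", crc_matches(header, swapped))
```

In this demo the in-order candidate should pass the check and the swapped candidate should be rejected. A check of this kind only confirms a finished candidate; how candidate orderings are generated and pruned for elements spanning three or more clusters is the part the abstract identifies as needing better verification mechanisms.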


Source journal

Radioelectronic and Computer Systems Computer Science-Computer Graphics and Computer-Aided Design

CiteScore

3.60

Self-citation rate

0.00%

Articles published

Review time

2 weeks