Duplications and Misattributions of File Fragment Hashes in Image and Compressed Files

2018 9th IFIP International Conference on New Technologies, Mobility and Security (NTMS) Pub Date : 2018-02-01 DOI:10.1109/NTMS.2018.8328690

Johan Garcia


Citations: 4

Abstract

Hashing is used in a wide variety of security contexts. Hashes of parts of files, known as fragment hashes, can be used to detect remains of deleted files in cluster slack, to detect illicit files being sent over a network, to perform approximate file matching, or to quickly scan large storage devices using sector sampling. In this work we examine the fragment hash uniqueness and hash duplication characteristics of five different data sets, with a focus on JPEG images and compressed file archives. We consider both block and rolling hashes and evaluate hashed fragment sizes ranging from 16 to 4096 bytes. During an initial hash generation phase, hash metadata is created for each data set, amounting in total to several billion hashes. During the scan phase, each of the other data sets is scanned and its hashes are checked for potential matches in the hash metadata. Three aspects of fragment hashes are examined: 1) the rate of duplicate hashes within each data set; 2) the rate of hash misattribution, where a fragment hash from the scanned data set matches a fragment in the hash metadata although the actual file is not present in the scan set; and 3) the extent to which fragments from files in a hashed set can be detected when those files have been compressed and embedded in a zip archive. The results obtained are useful as input to dimensioning and evaluation procedures for several application areas of fragment hashing.
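The two phases described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: it covers only aligned block hashes (not rolling hashes), uses SHA-256 as an assumed hash function, and defines the duplicate rate as the share of fragment hashes that occur more than once, which is one plausible reading rather than the paper's stated metric. All function and file names are hypothetical.

```python
import hashlib
from collections import Counter


def block_hashes(data: bytes, size: int) -> list[bytes]:
    """Hash every aligned, full-length fragment of `size` bytes (assumed SHA-256)."""
    return [
        hashlib.sha256(data[i:i + size]).digest()
        for i in range(0, len(data) - size + 1, size)
    ]


def build_metadata(files: dict[str, bytes], size: int) -> dict[bytes, set[str]]:
    """Generation phase: map each fragment hash to the files that contain it."""
    meta: dict[bytes, set[str]] = {}
    for name, data in files.items():
        for h in block_hashes(data, size):
            meta.setdefault(h, set()).add(name)
    return meta


def duplicate_rate(files: dict[str, bytes], size: int) -> float:
    """Assumed metric: fraction of fragment hashes that occur more than once."""
    counts = Counter(
        h for data in files.values() for h in block_hashes(data, size)
    )
    total = sum(counts.values())
    duplicated = sum(c for c in counts.values() if c > 1)
    return duplicated / total if total else 0.0


def scan(data: bytes, meta: dict[bytes, set[str]], size: int) -> set[str]:
    """Scan phase: names of hashed files with at least one matching fragment."""
    hits: set[str] = set()
    for h in block_hashes(data, size):
        hits |= meta.get(h, set())
    return hits


# Toy data set: one file with four distinct fragments, one with four identical ones.
files = {"a.bin": bytes(range(64)), "b.bin": bytes(64)}
meta = build_metadata(files, 16)
rate = duplicate_rate(files, 16)

# The scanned buffer holds a run of zero bytes but not b.bin itself,
# so a hit on "b.bin" here is a misattribution in the abstract's sense.
hits = scan(bytes(16) + b"x" * 16, meta, 16)
print(rate, hits)
```

Low-entropy fragments such as runs of zero bytes are a simple way to provoke both effects in a toy setting: they repeat within a data set and they match unrelated data during a scan.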

