Lena L. Voigt , Felix Freiling , Christopher Hargreaves
{"title":"基于指标的磁盘映像:见解和应用程序","authors":"Lena L. Voigt , Felix Freiling , Christopher Hargreaves","doi":"10.1016/j.fsidi.2025.301874","DOIUrl":null,"url":null,"abstract":"<div><div>There is currently no systematic method for evaluating digital forensic datasets. This makes it difficult to judge their suitability for specific use cases in digital forensic education and training. Additionally, there is limited comparability in the quality of synthetic datasets or the strengths and weaknesses of different data synthesis approaches. In this paper, we propose the concept of a quantitative, metrics-based assessment of forensic datasets as a first step toward a systematic evaluation approach. As a concrete implementation of this approach, we introduce <em>Mass Disk Processor</em>, a tool that automates the collection of metrics from large sets of disk images. It enables a privacy-preserving retrieval of high-level disk image characteristics, facilitating the assessment of not only synthetic but also real-world disk images. We demonstrate two applications of our tool. First, we create a comprehensive datasheet for publicly available, scenario-based synthetic disk images. Second, we propose a formal definition of synthetic data realism that compares properties of synthetic data to properties of real-world data and present results from an examination of the realism of current scenario-based disk images.</div></div>","PeriodicalId":48481,"journal":{"name":"Forensic Science International-Digital Investigation","volume":"52 ","pages":"Article 301874"},"PeriodicalIF":2.0000,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A metrics-based look at disk images: Insights and applications\",\"authors\":\"Lena L. Voigt , Felix Freiling , Christopher Hargreaves\",\"doi\":\"10.1016/j.fsidi.2025.301874\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>There is currently no systematic method for evaluating digital forensic datasets. This makes it difficult to judge their suitability for specific use cases in digital forensic education and training. Additionally, there is limited comparability in the quality of synthetic datasets or the strengths and weaknesses of different data synthesis approaches. In this paper, we propose the concept of a quantitative, metrics-based assessment of forensic datasets as a first step toward a systematic evaluation approach. As a concrete implementation of this approach, we introduce <em>Mass Disk Processor</em>, a tool that automates the collection of metrics from large sets of disk images. It enables a privacy-preserving retrieval of high-level disk image characteristics, facilitating the assessment of not only synthetic but also real-world disk images. We demonstrate two applications of our tool. First, we create a comprehensive datasheet for publicly available, scenario-based synthetic disk images. Second, we propose a formal definition of synthetic data realism that compares properties of synthetic data to properties of real-world data and present results from an examination of the realism of current scenario-based disk images.</div></div>\",\"PeriodicalId\":48481,\"journal\":{\"name\":\"Forensic Science International-Digital Investigation\",\"volume\":\"52 \",\"pages\":\"Article 301874\"},\"PeriodicalIF\":2.0000,\"publicationDate\":\"2025-03-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Forensic Science International-Digital Investigation\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2666281725000137\",\"RegionNum\":4,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Forensic Science International-Digital Investigation","FirstCategoryId":"3","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2666281725000137","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0
摘要
目前还没有评估数字法医数据集的系统方法。这使得很难判断它们是否适合数字法医教育和培训中的特定用例。此外,合成数据集的质量或不同数据合成方法的优缺点具有有限的可比性。在本文中,我们提出了一个定量的,基于指标的法医数据集评估的概念,作为迈向系统评估方法的第一步。作为这种方法的具体实现,我们介绍了Mass Disk Processor,这是一种工具,可以自动收集来自大型磁盘映像集的指标。它支持高级磁盘映像特征的隐私保护检索,不仅便于对合成磁盘映像进行评估,还便于对真实磁盘映像进行评估。我们将演示该工具的两个应用程序。首先,我们为公开可用的、基于场景的合成磁盘映像创建一个全面的数据表。其次,我们提出了合成数据真实感的正式定义,将合成数据的属性与真实世界数据的属性进行比较,并给出了对当前基于场景的磁盘映像的真实感检查的结果。
A metrics-based look at disk images: Insights and applications
There is currently no systematic method for evaluating digital forensic datasets. This makes it difficult to judge their suitability for specific use cases in digital forensic education and training. Additionally, there is limited comparability in the quality of synthetic datasets or the strengths and weaknesses of different data synthesis approaches. In this paper, we propose the concept of a quantitative, metrics-based assessment of forensic datasets as a first step toward a systematic evaluation approach. As a concrete implementation of this approach, we introduce Mass Disk Processor, a tool that automates the collection of metrics from large sets of disk images. It enables a privacy-preserving retrieval of high-level disk image characteristics, facilitating the assessment of not only synthetic but also real-world disk images. We demonstrate two applications of our tool. First, we create a comprehensive datasheet for publicly available, scenario-based synthetic disk images. Second, we propose a formal definition of synthetic data realism that compares properties of synthetic data to properties of real-world data and present results from an examination of the realism of current scenario-based disk images.