Likelihood ratios for categorical count data with applications in digital forensics

Law, Probability and Risk. Pub Date: 2022-12-23. DOI: 10.1093/lpr/mgac016

Rachel Longjohn, Padhraic Smyth, Hal S Stern



Abstract

We consider the forensic context in which the goal is to assess whether two sets of observed data came from the same source or from different sources. In particular, we focus on the situation in which the evidence consists of two sets of categorical count data: a set of event counts from an unknown source tied to a crime and a set of event counts generated by a known source. Using a same-source versus different-source hypothesis framework, we develop an approach to calculating a likelihood ratio. Under our proposed model, the likelihood ratio can be calculated in closed form, and we use this to theoretically analyse how the likelihood ratio is affected by how much data is observed, the number of event types being considered, and the prior used in the Bayesian model. Our work is motivated in particular by user-generated event data in digital forensics, a context in which relatively few statistical methodologies have yet been developed to support quantitative analysis of event data after it is extracted from a device. We evaluate our proposed method through experiments using three real-world event datasets, representing a variety of event types that may arise in digital forensics. The results of the theoretical analyses and experiments with real-world datasets demonstrate that while this model is a useful starting point for the statistical forensic analysis of user-generated event data, more work is needed before it can be applied for practical use.
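The abstract does not name the model that yields the closed-form likelihood ratio. As an illustration only, the sketch below assumes a multinomial likelihood over K event types with a shared Dirichlet prior, one standard choice that gives a closed form. Under that assumption the multinomial coefficients cancel and the ratio of the same-source marginal to the product of the different-source marginals reduces to B(alpha + x + y) B(alpha) / (B(alpha + x) B(alpha + y)), where B is the multivariate beta function. The names `log_beta` and `log_likelihood_ratio` are hypothetical and are not taken from the paper.

```python
from math import exp, lgamma


def log_beta(alpha):
    """Log of the multivariate beta function B(alpha)."""
    return sum(lgamma(a) for a in alpha) - lgamma(sum(alpha))


def log_likelihood_ratio(x, y, alpha):
    """Log LR for same-source vs different-source count vectors.

    ASSUMPTION (not stated in the abstract): counts are multinomial with a
    Dirichlet(alpha) prior on the category probabilities. x holds the counts
    from the unknown source, y the counts from the known source.
    """
    if not (len(x) == len(y) == len(alpha)):
        raise ValueError("x, y and alpha must have one entry per event type")
    if any(a <= 0 for a in alpha):
        raise ValueError("Dirichlet parameters must be positive")

    post_x = [a + c for a, c in zip(alpha, x)]               # prior updated with x only
    post_y = [a + c for a, c in zip(alpha, y)]               # prior updated with y only
    post_xy = [a + cx + cy for a, cx, cy in zip(alpha, x, y)]  # prior updated with both

    # Numerator: both samples share one probability vector (same source).
    # Denominator: each sample has its own independent vector (different sources).
    return log_beta(post_xy) + log_beta(alpha) - log_beta(post_x) - log_beta(post_y)


if __name__ == "__main__":
    uniform_prior = [1.0, 1.0]
    print(exp(log_likelihood_ratio([1, 0], [1, 0], uniform_prior)))  # matching counts
    print(exp(log_likelihood_ratio([1, 0], [0, 1], uniform_prior)))  # conflicting counts
```

With two event types and a uniform prior, one matching event in each sample gives a ratio of 4/3, and one conflicting event in each gives 2/3. A single observation per source barely moves the ratio away from 1, consistent with the abstract's point that the ratio depends on how much data is observed and on the prior.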



