The Digital Detective's Discourse - A toolset for forensically sound collaborative dark web content annotation and collection

IF 0.6 Q4 COMPUTER SCIENCE, INFORMATION SYSTEMS

Journal of Digital Forensics Security and Law | Pub Date: 2022-01-01 | DOI: 10.15394/jdfsl.2022.1740

J. Bergman, O. Popov


Citations: 2

Abstract

In the last decade, the proliferation of machine learning (ML) algorithms and their application to big data sets have benefited many researchers and practitioners in different scientific areas. Consequently, research in cybercrime and digital forensics has relied on ML techniques and methods for analyzing large quantities of data such as text, graphics, images, videos, and network traffic scans to support criminal investigations. Complete and accurate training data sets are indispensable for efficient and effective machine learning models, and an essential part of creating such data sets is annotating, or labelling, the data. We present a method for law enforcement agency investigators to annotate and store specific dark web content. Using a design science strategy, we design and develop tools to enable and extend web content annotation. The annotation tool was implemented as a plugin for the Tor browser. It can store web content, thus automatically creating a dataset of dark web data pertinent to criminal investigations. Combined with a central storage management server, which enables annotation sharing and collaboration, and a web scraping program, the dataset becomes multifold, dynamic, and extensive while maintaining the forensic soundness of the data saved and transmitted. To demonstrate our toolset's fitness for purpose, we used our dataset as training data for ML-based classification models. Five-fold cross-validation was used to evaluate the classifiers, which reported accuracy scores of 85-96%. In the concluding sections, we discuss the possible use cases of the proposed method in real-life cybercrime investigations, along with ethical concerns and future extensions.
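
The five-fold cross-validation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the toy texts, the labels "listing" and "forum", the word-overlap classifier, and all function names below are hypothetical and chosen only to show how annotated pages are split into five folds, with each fold held out once for testing while the remaining four are used for training.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle indices 0..n_samples-1 and deal them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train(texts, labels):
    """Toy model (assumption): the set of words seen for each label."""
    vocab = {}
    for text, label in zip(texts, labels):
        vocab.setdefault(label, set()).update(text.lower().split())
    return vocab


def predict(vocab, text):
    """Pick the label whose vocabulary overlaps most with the text."""
    words = set(text.lower().split())
    return max(sorted(vocab), key=lambda label: len(words & vocab[label]))


def cross_validate(texts, labels, k=5, seed=0):
    """Return one accuracy score per fold."""
    scores = []
    for held_out in k_fold_indices(len(texts), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(texts)) if i not in held]
        model = train([texts[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        correct = sum(predict(model, texts[i]) == labels[i] for i in held_out)
        scores.append(correct / len(held_out))
    return scores


# Hypothetical annotated snippets standing in for stored page content.
TEXTS = [
    "market listing pills price",
    "market vendor escrow shipping",
    "market listing stealth shipping",
    "market vendor price bitcoin",
    "market escrow pills bitcoin",
    "thread reply question login",
    "thread post opinion user",
    "thread reply user opinion",
    "thread question post moderator",
    "thread login moderator reply",
]
LABELS = ["listing"] * 5 + ["forum"] * 5

if __name__ == "__main__":
    fold_scores = cross_validate(TEXTS, LABELS)
    print("per-fold accuracy:", fold_scores)
    print("mean accuracy:", mean(fold_scores))
```

Reporting the per-fold scores alongside their mean shows how much the accuracy varies with the choice of held-out data, which is one reason a range of scores rather than a single number is often reported. In practice a library routine such as a stratified k-fold splitter would normally replace the hand-written split.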

