Machine learning model for predicting cyber-criminal characteristics
Hesham A. Almansouri, Mohammad M. Khajah, Nasser B. Alsnayen
Kuwait Journal of Science, 53(1), Article 100487. Published 2025-08-14. DOI: 10.1016/j.kjs.2025.100487
https://www.sciencedirect.com/science/article/pii/S2307410825001312
Impact factor: 1.1. JCR: Q3 (Multidisciplinary Sciences).
Citations: 0
Abstract
This study aims to predict the investigation outcomes of individual cybercrime cases using the most relevant information provided by complainants. We curated a dataset of solved hacking cases from 2019 to 2022 from the Cyber-Crime Combating Department (3CD) in the State of Kuwait. Each case has a set of information provided by the complainant (input features) and a corresponding set of investigation results (outputs). For each output, several machine learning models, such as decision trees and feed-forward neural networks, were evaluated via nested 5-fold cross-validation to measure how well they could predict the output given the input features. Input feature sets were selected either from all possible input feature combinations (brute-force) or from a limited set of officer-provided combinations (officer-guided). Finally, a post-hoc analysis of the results was performed to identify a single set of features that can be used to build reasonably predictive models for all collected outputs. Depending on the output, the brute-force and officer-guided approaches have a median relative advantage of 92% and 53% over the baseline models and the worst-officer score, respectively. On almost all outputs, the brute-force approach is at least as good as the officer-guided approach. No relationship was observed between officer rank and the predictive power of the feature combination the officer selected. Different outputs require different sets of features, and there is significant overlap between brute-force and officer-guided features in five of the 10 outputs. Perturbing most of the selected features reliably degrades prediction performance, with some outputs relying on a few critical features and others on a spectrum of features. Finally, a single set of features can predict most outputs almost as well as output-specific features.
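The evaluation protocol described in the abstract (brute-force feature-subset selection inside nested 5-fold cross-validation) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the data are synthetic, a majority-vote lookup table stands in for the decision trees and neural networks the authors evaluated, accuracy is used as the score, and all function names (`k_folds`, `fit_predict`, `cv_score`, `nested_cv`) are hypothetical.

```python
import random
from collections import Counter, defaultdict
from itertools import combinations


def k_folds(n, k, seed):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_predict(X, y, train, test, feats):
    """Stand-in model: majority-vote lookup table over the columns in `feats`."""
    fallback = Counter(y[i] for i in train).most_common(1)[0][0]
    table = defaultdict(Counter)
    for i in train:
        table[tuple(X[i][f] for f in feats)][y[i]] += 1
    preds = []
    for i in test:
        key = tuple(X[i][f] for f in feats)
        # Unseen feature-value combinations fall back to the majority class.
        preds.append(table[key].most_common(1)[0][0] if key in table else fallback)
    return preds


def accuracy(X, y, train, test, feats):
    preds = fit_predict(X, y, train, test, feats)
    return sum(p == y[i] for p, i in zip(preds, test)) / len(test)


def cv_score(X, y, rows, feats, k, seed):
    """Mean k-fold accuracy of `feats`, computed only on the given rows."""
    folds = k_folds(len(rows), k, seed)
    scores = []
    for j in range(k):
        test = [rows[i] for i in folds[j]]
        train = [rows[i] for m in range(k) if m != j for i in folds[m]]
        scores.append(accuracy(X, y, train, test, feats))
    return sum(scores) / k


def nested_cv(X, y, n_feats, k=5, seed=0):
    """Outer folds estimate performance; inner folds choose the feature subset."""
    # Brute force: every non-empty combination of feature indices.
    subsets = [c for r in range(1, n_feats + 1)
               for c in combinations(range(n_feats), r)]
    outer = k_folds(len(X), k, seed)
    results = []
    for j in range(k):
        test = outer[j]
        train = [i for m in range(k) if m != j for i in outer[m]]
        # The subset is selected using training rows only, never the outer test fold.
        best = max(subsets, key=lambda s: cv_score(X, y, train, s, k, seed + 1))
        results.append((best, accuracy(X, y, train, test, best)))
    return results


# Synthetic data (assumption): 4 binary features; the label depends on
# features 0 and 2, with roughly 10% of labels replaced by random noise.
rng = random.Random(42)
X = [[rng.randint(0, 1) for _ in range(4)] for _ in range(200)]
y = [(row[0] ^ row[2]) if rng.random() > 0.1 else rng.randint(0, 1) for row in X]

results = nested_cv(X, y, n_feats=4)
mean_acc = sum(acc for _, acc in results) / len(results)
for best, acc in results:
    print(f"selected features {best}, outer-fold accuracy {acc:.2f}")
print(f"mean outer accuracy: {mean_acc:.2f}")
```

Under these assumptions, an officer-guided variant would only change how `subsets` is built: a short list of hand-picked combinations would replace the exhaustive enumeration. Because exhaustive enumeration grows as 2^n in the number of features, it is practical only for small feature sets.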
About the journal:
Kuwait Journal of Science (KJS) is indexed and abstracted by major services such as Chemical Abstracts, Science Citation Index, Current Contents, Mathematics Abstracts, and Microbiological Abstracts. KJS publishes peer-reviewed articles in various fields of science, including Mathematics, Computer Science, Physics, Statistics, Biology, Chemistry, and Earth & Environmental Sciences. It also aims to bring the results of scientific research carried out under a variety of intellectual traditions and organizations to the attention of a specialized scholarly readership. Accordingly, the publisher expects submissions of original manuscripts that contain analysis of and solutions to important theoretical, empirical, and normative issues.