ConceptUML: Multiphase unsupervised threat detection via latent concept learning, Hidden Markov Models and topic modelling
Khanh Luong, Arash Mahboubi, Geoff Jarrad, Seyit Camtepe, Michael Bewong, Mohammed Bahutair, Hamed Aboutorab, Hang Thanh Bui
Journal of Information Security and Applications, volume 93, Article 104160
Publication date: 2025-07-10
DOI: 10.1016/j.jisa.2025.104160
URL: https://www.sciencedirect.com/science/article/pii/S2214212625001978
Impact factor: 3.7 (JCR Q2, Computer Science, Information Systems)
Citations: 0
Abstract
Detecting lateral movement threats in large-scale system logs is a critical challenge due to the scarcity of labelled attack data, the presence of imbalanced datasets, and the sophisticated nature of modern adversaries. To address these issues, we propose ConceptUML, a semantic-driven, fully unsupervised threat detection framework designed to automatically identify anomalies related to lateral movement in heterogeneous log data. ConceptUML is structured around a three-phase architecture. In Phase 1 (Latent Semantic Learning), contextualized embeddings generated by Sentence-BERT are combined with Non-negative Matrix Factorization to extract abstract concepts from system logs and external threat intelligence sources such as MITRE ATT&CK and CAPEC. In Phase 2 (Unsupervised Threat Detection), a Hidden Markov Model is applied to cluster logs based on learned concepts, and each cluster is scored according to its semantic similarity to known adversarial techniques. Phase 3 (Decision Refinement) uses topic modelling to further isolate malicious event log subsets from within suspicious clusters, enabling high-precision triage. We evaluate ConceptUML using four real-world event log datasets, including Windows Event Logs and multiple subsets of the LMD-23 dataset, encompassing attacks such as exploitation of hashing techniques and remote services. The enhanced model with topic modelling achieves up to 92.54% detection quality and reduces detection error to as low as 8.14%, outperforming several baseline approaches including AutoEncoder, LogAnomaly, LOF, and DBScan. Our results confirm that ConceptUML delivers interpretable, scalable, and highly effective detection of lateral movement threats without requiring labelled training data or extensive manual feature engineering.
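As a rough illustration of the Phase 2 scoring idea described above, the following minimal sketch ranks log clusters by how closely their average concept vector matches a known technique. The abstract does not state the similarity measure or the aggregation used, so cosine similarity, the centroid averaging, the toy vectors, and every function name here (`cosine`, `centroid`, `score_clusters`) are assumptions for illustration only, not the authors' implementation. The cluster labels are given directly rather than produced by a Hidden Markov Model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def score_clusters(log_concepts, cluster_ids, technique_concepts):
    """Map each cluster id to the best cosine match between the cluster's
    centroid and any technique vector (an assumed scoring rule)."""
    clusters = {}
    for vec, cid in zip(log_concepts, cluster_ids):
        clusters.setdefault(cid, []).append(vec)
    return {
        cid: max(cosine(centroid(vecs), t) for t in technique_concepts)
        for cid, vecs in clusters.items()
    }


# Hypothetical concept weights per log line (e.g. rows of an NMF factor matrix).
log_concepts = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.8],
]
# Hypothetical cluster assignment for each log line.
cluster_ids = [0, 0, 1, 1]
# Hypothetical concept vector for one known adversarial technique.
technique_concepts = [[0.0, 0.0, 1.0]]

scores = score_clusters(log_concepts, cluster_ids, technique_concepts)
most_suspicious = max(scores, key=scores.get)
```

In this toy setup the second cluster is concentrated on the same concept as the technique vector, so it receives the higher score and would be the one passed on for refinement.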
About the journal:
Journal of Information Security and Applications (JISA) focuses on original research and practice-driven applications relevant to information security. JISA links the scientific and research community with industry professionals by offering a clear view of modern problems and challenges in information security and by identifying promising scientific and "best-practice" solutions. JISA issues balance original research work and innovative industrial approaches from internationally renowned information security experts and researchers.