Corpora-based Password Guessing: An Efficient Approach for Small Training Sets

2021 IEEE 4th International Conference on Electronics and Communication Engineering (ICECE) Pub Date : 2021-12-17 DOI:10.1109/ICECE54449.2021.9674437

Xiaochun Gan, Meng Chen, Dong Li, Zongyan Wu, Weili Han, Hu Chen


Citations: 1

Abstract

Password guessing plays an important role in studying the vulnerability of passwords and thereby improving security. Modern password guessing methods discover the password patterns of users in specific regions from large numbers of leaked passwords. Most existing methods, such as PCFG, Markov models, and deep learning methods, rely only on the training set. Unlike in other application areas of machine learning, the training set for password guessing comes from leaked real password sets, such as Rockyou, CSDN, and VK. Traditional password guessing approaches are effective with large-scale training sets. However, the password sets leaked from users of less widely spoken languages or from specific organizations are very small, which makes it difficult for methods that rely only on the training set to discover enough of the words used in passwords. To solve this problem, this paper proposes a corpus-based password guessing method. First, we analyze the common words and their categories in leaked password sets from users in three different countries. On this basis, we propose a method for organizing multi-language corpora and construct corpora of more than 3 million words. Second, we improve the traditional PCFG password segmentation method and describe password structure based on the corpora. Third, we estimate the probability of corpus words that do not appear in the training set using Laplace smoothing. Experiments show that our method produces a finer-grained structure than PCFG. As the size of the training set decreases, the cracking rate of PCFG drops significantly, whereas our method is only slightly affected and its cracking rate remains significantly higher than that of PCFG.
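The abstract does not give the exact smoothing formula, so the following is only a minimal sketch of textbook add-one (Laplace) smoothing, under the assumption that word probabilities are estimated over the union of the corpus vocabulary and the words seen in training. The function name `laplace_word_probs`, the `alpha` parameter, and the toy word lists are hypothetical and do not come from the paper.

```python
from collections import Counter
from fractions import Fraction


def laplace_word_probs(training_words, corpus_vocab, alpha=1):
    """Add-alpha (Laplace) estimate: P(w) = (count(w) + alpha) / (N + alpha * |V|).

    N is the number of word occurrences in the training set and V is the
    union of the corpus vocabulary and the training words. This is the
    standard estimator, not necessarily the paper's exact formulation.
    """
    counts = Counter(training_words)
    vocab = set(corpus_vocab) | set(counts)
    total = sum(counts.values())
    denom = total + alpha * len(vocab)
    # Counter returns 0 for unseen words, so corpus-only words get alpha / denom.
    return {w: Fraction(counts[w] + alpha, denom) for w in vocab}


# Hypothetical toy data: words segmented out of a very small training set,
# plus a corpus that also contains words never seen in training.
training_words = ["love", "love", "dragon", "love"]
corpus_vocab = ["love", "dragon", "monkey", "shadow"]

probs = laplace_word_probs(training_words, corpus_vocab)
for word in sorted(probs, key=probs.get, reverse=True):
    print(word, probs[word])
```

In this toy case, "monkey" and "shadow" never occur in the training set yet still receive a nonzero probability, which is what lets a guesser emit corpus words that a small training set alone would miss.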
