Phishing Kits Source Code Similarity Distribution: A Case Study

2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). Pub Date: 2022-03-01. DOI: 10.1109/saner53432.2022.00116

E. Merlo, Mathieu Margier, Guy-Vincent Jourdan, Iosif-Viorel Onut


Citations: 3

Abstract



Attackers (“phishers”) typically deploy source code on a host website to impersonate a brand or, more generally, to stage a situation in which a user is expected to provide personal information of interest to phishers (e.g. credentials, credit card numbers). Phishing kits are ready-to-deploy sets of files that can simply be copied onto a web server and used almost as they are. In this paper, we consider the static similarity analysis of the source code of 20871 phishing kits, totalling over 182 million lines of PHP, JavaScript and HTML code, that were collected during phishing attacks and recovered by forensics teams. Reported experimental results show that as many as 90% of the analyzed kits share 90% or more of their source code with at least one other kit. Differences are small: fewer than about 1000 programming words (identifiers, constants, strings and so on) in 40% of cases. A plausible lineage of phishing kits is presented by connecting together the kits with the highest similarity. The results show a very different reconstructed lineage for phishing kits than for a publicly available application such as WordPress. The observed kit similarity distribution is consistent with the hypothesis that kit propagation is often based on identical or near-identical copies with low-cost changes. The proposed approach may help classify new incoming phishing kits as “near-copies” of, or “intellectual leaps” from, known and already encountered kits. This could facilitate the identification and classification of new kits as derived from older known kits.
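The abstract does not specify the similarity measure or the tokenizer, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a crude regex tokenizer for "programming words" and a multiset Jaccard similarity, then links each kit to its most similar other kit. The names `tokenize`, `similarity`, `difference_size`, and `nearest_neighbours` are hypothetical.

```python
import re
from collections import Counter

# Assumption: identifiers, integer literals, and single punctuation
# characters stand in for "programming words".
TOKEN_RE = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*|\d+|\S")


def tokenize(source: str) -> Counter:
    """Turn source text into a multiset (bag) of tokens."""
    return Counter(TOKEN_RE.findall(source))


def similarity(a: Counter, b: Counter) -> float:
    """Multiset Jaccard similarity in [0, 1]; 1.0 means identical bags."""
    shared = sum((a & b).values())   # per-token minimum counts
    total = sum((a | b).values())    # per-token maximum counts
    return shared / total if total else 1.0


def difference_size(a: Counter, b: Counter) -> int:
    """Count the tokens that appear in one bag but not in the other."""
    return sum(((a - b) + (b - a)).values())


def nearest_neighbours(kits: dict[str, str]) -> dict[str, tuple[str, float]]:
    """Map each kit name to (most similar other kit, similarity score)."""
    bags = {name: tokenize(src) for name, src in kits.items()}
    links = {}
    for name, bag in bags.items():
        best = max(
            ((other, similarity(bag, bags[other]))
             for other in bags if other != name),
            key=lambda pair: pair[1],
            default=None,  # a single kit has no neighbour
        )
        if best is not None:
            links[name] = best
    return links


if __name__ == "__main__":
    # Toy stand-ins for kit contents, invented for illustration only.
    kits = {
        "a": "<?php $user = $_POST['u']; mail($to, $user); ?>",
        "b": "<?php $login = $_POST['u']; mail($to, $login); ?>",
        "c": "function add(x, y) { return x + y; }",
    }
    for name, (other, score) in nearest_neighbours(kits).items():
        print(f"{name} -> {other} ({score:.2f})")
```

Comparing every pair of kits is quadratic in the number of kits, so a corpus of the size reported in the abstract would presumably need a more scalable comparison than this sketch. The highest-similarity links it produces are only a rough analogue of the lineage the abstract describes.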
