Modeling High-Level Behavior Patterns for Precise Similarity Analysis of Software

2011 IEEE 11th International Conference on Data Mining. Pub Date: 2011-12-11. DOI: 10.1109/ICDM.2011.104

Taeho Kwon, Z. Su


Citations: 17

Abstract

Software similarity analysis has many applications, including the detection of code clones, software plagiarism, code theft, and polymorphic malware. Because source code is often unavailable and code obfuscation is used to evade detection, much research has focused on models that capture runtime behavior to aid detection. Existing models focus on low-level information, such as dependencies or the mere occurrence of function calls, and suffer from poor precision, poor scalability, or both. To overcome these limitations, this paper introduces a precise and succinct behavior representation that characterizes high-level object-accessing patterns as regular expressions. We first distill a set of high-level patterns (the alphabet Σ of the regular language) from two pieces of information: the function call patterns used to access objects and the type state information of those objects. We then abstract a runtime trace of a program P into a regular expression e over the pattern alphabet Σ to produce P's behavior signature. We show that software instances derived from the same code exhibit similar behavior signatures, and we develop effective algorithms to cluster and match behavior signatures. To evaluate the effectiveness of our behavior model, we have applied it to the similarity analysis of polymorphic malware. Our results on a large malware collection demonstrate that the model is both precise and succinct, enabling effective and scalable matching and detection of polymorphic malware.
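The abstract gives no implementation details, so the following is only a rough sketch of the general idea, not the authors' method. It assumes a trace is a list of hypothetical `(function, object_id)` events, that a small hand-written table `PATTERNS` stands in for the distilled pattern alphabet, and that repeated symbols are folded into `X+` to form a regular-expression signature. The `difflib` ratio used for comparison is a stand-in for the paper's clustering and matching algorithms, which are not described here.

```python
import re
from difflib import SequenceMatcher

# Hypothetical pattern alphabet: a per-object call sequence maps to one symbol.
# "U" marks a sequence that is not in the table.
PATTERNS = {
    ("open", "read", "close"): "R",     # read a file-like object
    ("open", "write", "close"): "W",    # write a file-like object
    ("connect", "send", "close"): "N",  # send over a network object
}

def collapse(seq):
    """Drop immediate repeats, e.g. read, read, read -> read."""
    out = []
    for item in seq:
        if not out or out[-1] != item:
            out.append(item)
    return out

def symbols(trace):
    """Map each accessed object to a high-level pattern symbol."""
    calls = {}
    for func, obj in trace:  # dicts keep first-seen object order
        calls.setdefault(obj, []).append(func)
    return [PATTERNS.get(tuple(collapse(f)), "U") for f in calls.values()]

def signature(trace):
    """Fold runs of the same symbol into 'X+' to get a regex string."""
    syms = symbols(trace)
    parts, i = [], 0
    while i < len(syms):
        j = i
        while j < len(syms) and syms[j] == syms[i]:
            j += 1
        parts.append(syms[i] + ("+" if j - i > 1 else ""))
        i = j
    return "".join(parts)

def similarity(trace_x, trace_y):
    """Stand-in similarity score between two signatures, in [0, 1]."""
    return SequenceMatcher(None, signature(trace_x), signature(trace_y)).ratio()

def file_read(obj, reads=1):
    """Build an illustrative open/read.../close event run for one object."""
    return [("open", obj)] + [("read", obj)] * reads + [("close", obj)]

net = [("connect", "s1"), ("send", "s1"), ("close", "s1")]
trace_a = file_read("f1", reads=2) + file_read("f2") + net
trace_b = file_read("f1") + file_read("f2") + file_read("f3") + net
trace_c = [("open", "g1"), ("write", "g1"), ("close", "g1")]

print(signature(trace_a), signature(trace_b), signature(trace_c))
```

In this toy setup, `trace_a` and `trace_b` differ in how many files they read and how many `read` calls they issue, yet both abstract to the same signature, while `trace_c` does not. A variant's symbol string can also be checked against a signature directly, for example with `re.fullmatch(signature(trace_a), "".join(symbols(trace_b)))`.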
