ISAdetect: Usable Automated Detection of CPU Architecture and Endianness for Executable Binary Files and Object Code

Proceedings of the Tenth ACM Conference on Data and Application Security and Privacy. Pub Date: 2019-08-15. DOI: 10.1145/3374664.3375742

Sami Kairajärvi, Andrei Costin, T. Hämäläinen


Citations: 3

Abstract

Static and dynamic binary analysis techniques are actively used to reverse engineer software's behavior and to detect its vulnerabilities, even when only the binary code is available for analysis. To avoid analysis errors caused by decoding opcodes for the wrong CPU architecture, these analysis tools must precisely identify the Instruction Set Architecture (ISA) of the object code under analysis. The variety of CPU architectures that modern security and reverse engineering tools must support keeps growing because of the massive proliferation of IoT devices and the diversity of firmware and malware targeting those devices. Recent studies concluded that misidentifying the binary code's ISA alone caused about 10% of the failures in IoT firmware analysis. The state-of-the-art approaches to detecting the ISA of executable object code look promising, and their results demonstrate effectiveness and high performance. However, they lack the support of publicly available datasets and toolsets, which makes the evaluation, comparison, and improvement of those techniques, datasets, and machine learning models quite challenging (if not impossible). This paper bridges multiple gaps in the field of automated and precise identification of the architecture and endianness of binary files and object code. We develop from scratch the toolset and datasets that are lacking in this research space. As such, we contribute a comprehensive collection of open data, open source, and open API web services. We also attempt experiment reconstruction and cross-validation of the effectiveness, efficiency, and results of the state-of-the-art methods. When training and testing classifiers using solely code sections from executable binary files, all of our classifiers performed equally well, achieving over 98% accuracy. The results are consistent and comparable with the current state of the art, and hence support the general validity of the algorithms, features, and approaches suggested in those works.
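The abstract does not list the features the classifiers consume, so the following is only a minimal sketch of one plausible kind of input, not the authors' implementation. It assumes a byte-frequency histogram computed over a code section, plus a simple byte-pair heuristic for endianness. The function names `byte_histogram` and `endianness_hint` and the sample bytes are hypothetical and chosen for illustration.

```python
from collections import Counter


def byte_histogram(code: bytes) -> list[float]:
    """Return the normalized frequency of each of the 256 byte values."""
    if not code:
        return [0.0] * 256
    counts = Counter(code)
    total = len(code)
    return [counts.get(value, 0) / total for value in range(256)]


def endianness_hint(code: bytes) -> str:
    """Guess byte order from how small integer constants are laid out.

    Heuristic (an assumption of this sketch): the constant 1 tends to
    appear as the byte pair 00 01 in big-endian code and as 01 00 in
    little-endian code, so the more frequent pair hints at the order.
    """
    big = code.count(b"\x00\x01")
    little = code.count(b"\x01\x00")
    if big == little:
        return "unknown"
    return "big" if big > little else "little"


if __name__ == "__main__":
    # Hypothetical sample: a 32-bit word ending in the constant 1,
    # repeated, stored first in big-endian and then in little-endian order.
    big_sample = bytes.fromhex("24020001") * 8
    little_sample = bytes.fromhex("01000224") * 8
    print(endianness_hint(big_sample), endianness_hint(little_sample))
    print(len(byte_histogram(big_sample)))
```

A vector like this (256 frequencies, optionally extended with the byte-pair counts) could be fed to any standard classifier. Whether it matches the feature set evaluated in the paper is not stated in the abstract.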


