{"title":"二进制程序输入的恢复结构","authors":"Seshagiri Prabhu Narasimha, Arun Lakhotia","doi":"10.1145/3508398.3511508","DOIUrl":null,"url":null,"abstract":"This paper presents an algorithm to automatically infer a recursive state machine (RSM) describing the space of acceptable input of an arbitrary binary program by executing that program with one or more valid inputs. The algorithm automatically identifies atomic fields of fixed and variable lengths and syntactic elements, such as separators and terminators, and generalizes them into regular expression tokens. It constructs an RSM of tokens to represent structures such as arrays and records. Further, it constructs nested states in the RSM to represent complex, nested structures. The RSM may serve as an independent parser for the program's acceptable inputs. A controlled experiment was performed using a prototype implementation of the algorithm and a set of synthetic programs with input formats that mimic characteristics of conventional data formats, such as CSV, PNG, PE file, etc. The experiment demonstrates that the inferred RSMs correctly identify the syntactic elements and their grammatical orderings. When used as generators, the RSMs also produced syntactically correct data for the formats that use terminators to end a sequence of elements, but not so when the format maintains a count of elements for variable length fields instead of a terminator. Experiments with real-world programs produced similar results.","PeriodicalId":102306,"journal":{"name":"Proceedings of the Twelfth ACM Conference on Data and Application Security and Privacy","volume":"76 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Recovering Structure of Input of a Binary Program\",\"authors\":\"Seshagiri Prabhu Narasimha, Arun Lakhotia\",\"doi\":\"10.1145/3508398.3511508\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper presents an algorithm to automatically infer a recursive state machine (RSM) describing the space of acceptable input of an arbitrary binary program by executing that program with one or more valid inputs. The algorithm automatically identifies atomic fields of fixed and variable lengths and syntactic elements, such as separators and terminators, and generalizes them into regular expression tokens. It constructs an RSM of tokens to represent structures such as arrays and records. Further, it constructs nested states in the RSM to represent complex, nested structures. The RSM may serve as an independent parser for the program's acceptable inputs. A controlled experiment was performed using a prototype implementation of the algorithm and a set of synthetic programs with input formats that mimic characteristics of conventional data formats, such as CSV, PNG, PE file, etc. The experiment demonstrates that the inferred RSMs correctly identify the syntactic elements and their grammatical orderings. When used as generators, the RSMs also produced syntactically correct data for the formats that use terminators to end a sequence of elements, but not so when the format maintains a count of elements for variable length fields instead of a terminator. Experiments with real-world programs produced similar results.\",\"PeriodicalId\":102306,\"journal\":{\"name\":\"Proceedings of the Twelfth ACM Conference on Data and Application Security and Privacy\",\"volume\":\"76 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-04-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the Twelfth ACM Conference on Data and Application Security and Privacy\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3508398.3511508\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Twelfth ACM Conference on Data and Application Security and Privacy","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3508398.3511508","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
This paper presents an algorithm to automatically infer a recursive state machine (RSM) describing the space of acceptable input of an arbitrary binary program by executing that program with one or more valid inputs. The algorithm automatically identifies atomic fields of fixed and variable lengths and syntactic elements, such as separators and terminators, and generalizes them into regular expression tokens. It constructs an RSM of tokens to represent structures such as arrays and records. Further, it constructs nested states in the RSM to represent complex, nested structures. The RSM may serve as an independent parser for the program's acceptable inputs. A controlled experiment was performed using a prototype implementation of the algorithm and a set of synthetic programs with input formats that mimic characteristics of conventional data formats, such as CSV, PNG, PE file, etc. The experiment demonstrates that the inferred RSMs correctly identify the syntactic elements and their grammatical orderings. When used as generators, the RSMs also produced syntactically correct data for the formats that use terminators to end a sequence of elements, but not so when the format maintains a count of elements for variable length fields instead of a terminator. Experiments with real-world programs produced similar results.