{"title":"Post-Hoc LangSec的语法推理和机器学习方法","authors":"Sheridan S. Curley, Richard E. Harang","doi":"10.1109/SPW.2016.26","DOIUrl":null,"url":null,"abstract":"Formal Language Theory for Security (LangSec) applies the tools of theoretical computer science to the problem of protocol design and analysis. In practice, most results have focused on protocol design, showing that by restricting the complexity of protocols it is possible to design parsers with desirable and formally verifiable properties, such as correctness and equivalence. When we consider existing protocols, however, many of these were not subjected to formal analysis during their design, and many are not implemented in a manner consistent with their formal documentation. Determining a grammar for such protocols is the first step in analyzing them, which places this problem in the domain of grammatical inference, for which a deep theoretical literature exists. In particular, although it has been shown that the higher level categories of the Chomsky hierarchy cannot be generically learned, it is also known that certain subcategories of that hierarchy can be effectively learned. In this paper, we summarize some theoretical results for inferring well-known Chomsky grammars, with special attention to context-free grammars (CFGs) and their generated languages (CFLs). We then demonstrate that, despite negative learnability results in the theoretical regime, we can use long short-term memory (LSTM) networks, a type of recurrent neural network (RNN) architecture, to learn a grammar for URIs that appear in Apache HTTP access logs for a particular server with high accuracy. We discuss these results in the context of grammatical inference, and suggest avenues for further research into learnability of a subgroup of the context-free grammars.","PeriodicalId":341207,"journal":{"name":"2016 IEEE Security and Privacy Workshops (SPW)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Grammatical Inference and Machine Learning Approaches to Post-Hoc LangSec\",\"authors\":\"Sheridan S. Curley, Richard E. Harang\",\"doi\":\"10.1109/SPW.2016.26\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Formal Language Theory for Security (LangSec) applies the tools of theoretical computer science to the problem of protocol design and analysis. In practice, most results have focused on protocol design, showing that by restricting the complexity of protocols it is possible to design parsers with desirable and formally verifiable properties, such as correctness and equivalence. When we consider existing protocols, however, many of these were not subjected to formal analysis during their design, and many are not implemented in a manner consistent with their formal documentation. Determining a grammar for such protocols is the first step in analyzing them, which places this problem in the domain of grammatical inference, for which a deep theoretical literature exists. In particular, although it has been shown that the higher level categories of the Chomsky hierarchy cannot be generically learned, it is also known that certain subcategories of that hierarchy can be effectively learned. In this paper, we summarize some theoretical results for inferring well-known Chomsky grammars, with special attention to context-free grammars (CFGs) and their generated languages (CFLs). 
We then demonstrate that, despite negative learnability results in the theoretical regime, we can use long short-term memory (LSTM) networks, a type of recurrent neural network (RNN) architecture, to learn a grammar for URIs that appear in Apache HTTP access logs for a particular server with high accuracy. We discuss these results in the context of grammatical inference, and suggest avenues for further research into learnability of a subgroup of the context-free grammars.\",\"PeriodicalId\":341207,\"journal\":{\"name\":\"2016 IEEE Security and Privacy Workshops (SPW)\",\"volume\":\"32 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-05-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2016 IEEE Security and Privacy Workshops (SPW)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SPW.2016.26\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 IEEE Security and Privacy Workshops (SPW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SPW.2016.26","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Grammatical Inference and Machine Learning Approaches to Post-Hoc LangSec

Sheridan S. Curley, Richard E. Harang

2016 IEEE Security and Privacy Workshops (SPW), May 22, 2016. DOI: 10.1109/SPW.2016.26
Formal Language Theory for Security (LangSec) applies the tools of theoretical computer science to the problem of protocol design and analysis. In practice, most results have focused on protocol design, showing that by restricting the complexity of protocols it is possible to design parsers with desirable and formally verifiable properties, such as correctness and equivalence. Many existing protocols, however, were not subjected to formal analysis during their design, and many are not implemented in a manner consistent with their formal documentation. Determining a grammar for such protocols is the first step in analyzing them, which places this problem in the domain of grammatical inference, for which a deep theoretical literature exists. In particular, although it has been shown that the higher levels of the Chomsky hierarchy cannot be learned in general, certain subclasses of that hierarchy are known to be effectively learnable. In this paper, we summarize some theoretical results on inferring well-known classes of Chomsky grammars, with special attention to context-free grammars (CFGs) and the context-free languages (CFLs) they generate. We then demonstrate that, despite negative learnability results in the theoretical regime, we can use long short-term memory (LSTM) networks, a type of recurrent neural network (RNN) architecture, to learn, with high accuracy, a grammar for URIs appearing in the Apache HTTP access logs of a particular server. We discuss these results in the context of grammatical inference, and suggest avenues for further research into the learnability of a subclass of the context-free grammars.
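The paper itself does not include an implementation. As a rough illustration of the kind of experiment the abstract describes, the following is a minimal sketch of a character-level LSTM membership classifier for URI strings, written in PyTorch. The model class, the byte-level encoding, the hyperparameters, and the example URIs are all assumptions chosen for illustration, not the authors' actual setup.

```python
# Minimal sketch (not the authors' code): a character-level LSTM that
# classifies URI strings as in-grammar vs. out-of-grammar, in the spirit
# of the LSTM grammar-learning experiment described in the abstract.
# All names and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn

class URIMembershipLSTM(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 32, hidden_dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classify = nn.Linear(hidden_dim, 2)  # in-grammar / out-of-grammar

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) of character indices
        embedded = self.embed(token_ids)
        _, (hidden, _) = self.lstm(embedded)
        # hidden[-1]: final hidden state summarizing the whole URI
        return self.classify(hidden[-1])

# Hypothetical encoding: URIs as byte sequences, index 0 reserved for padding.
def encode_uri(uri: str, max_len: int = 128) -> torch.Tensor:
    ids = [b + 1 for b in uri.encode("utf-8")[:max_len]]
    ids += [0] * (max_len - len(ids))
    return torch.tensor(ids, dtype=torch.long)

model = URIMembershipLSTM(vocab_size=257)  # 256 byte values + padding
batch = torch.stack([encode_uri("/index.html?q=langsec"),
                     encode_uri("/..%2f..%2fetc%2fpasswd")])
logits = model(batch)  # shape (2, 2); train with nn.CrossEntropyLoss on labeled logs
```

Reading the final hidden state as a summary of the entire string mirrors the intuition that the recurrent state tracks something akin to parser state as the URI is consumed. How negative (out-of-grammar) examples are obtained for training is a design choice the abstract does not settle; URIs drawn from a different server's logs would be one plausible source.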