Hongyi Yao, Gyan Ranjan, A. Tongaonkar, Yong Liao, Z. Morley Mao
Proceedings of the 21st Annual International Conference on Mobile Computing and Networking, published 2015-09-07. DOI: 10.1145/2789168.2790097 (https://doi.org/10.1145/2789168.2790097)
SAMPLES: Self Adaptive Mining of Persistent LExical Snippets for Classifying Mobile Application Traffic
We present SAMPLES (Self Adaptive Mining of Persistent LExical Snippets), a systematic framework for classifying network traffic generated by mobile applications. SAMPLES constructs conjunctive rules automatically, using a supervised methodology over a set of labeled flows (the training set). Each conjunctive rule captures the lexical context of an application identifier found in a snippet of the HTTP header, and is defined by: (a) the identifier type, (b) the HTTP header field it occurs in, and (c) the prefix/suffix surrounding its occurrence. These conjunctive rules then undergo an aggregate-and-validate step that improves accuracy and determines a priority order. The refined rule set is loaded into an application-identification engine, where it operates at per-flow granularity in an extract-and-lookup paradigm to identify the application responsible for a given flow. SAMPLES can thus facilitate important network measurement and management tasks, e.g. behavioral profiling [29] and application-level firewalls [21,22], which require a more detailed view of the underlying traffic than traditional protocol/port-based methods afford. We evaluate SAMPLES on a test set of approximately 15 million flows generated by over 700 K applications from the Android, iOS and Nokia marketplaces. SAMPLES successfully identifies over 90% of these applications with 99% accuracy on average, even though fewer than 2% of the applications are required during the training phase for each of the three marketplaces. This is a testament to the universality and scalability of our approach. We therefore expect SAMPLES to work with reasonable coverage and accuracy on other mobile platforms as well, e.g. BlackBerry and Windows Mobile.
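To make the rule structure and the extract-and-lookup step concrete, the following is a minimal Python sketch and not the authors' implementation. The `Rule` class, the `identify` function, the identifier-type labels, the sample header values, and the convention that a lower `priority` value is tried first are all assumptions introduced for illustration. Rule mining and the aggregate-and-validate step are not shown; the rules here are written by hand.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    """One conjunctive rule: identifier type + header field + prefix/suffix."""
    id_type: str   # (a) identifier type (hypothetical labels such as "app_name")
    field: str     # (b) HTTP header field the identifier occurs in
    prefix: str    # (c) text immediately before the identifier
    suffix: str    # (c) text immediately after it ("" means end of the value)
    priority: int  # assumed convention: lower value is tried first

    def extract(self, headers: dict) -> Optional[str]:
        """Return the token between prefix and suffix, or None if no match."""
        value = headers.get(self.field)
        if value is None:
            return None
        tail = re.escape(self.suffix) if self.suffix else r"$"
        match = re.search(re.escape(self.prefix) + r"(.+?)" + tail, value)
        return match.group(1) if match else None


def identify(headers: dict, rules: list, lookup: dict) -> Optional[str]:
    """Extract-and-lookup: try rules in priority order, return the first hit."""
    for rule in sorted(rules, key=lambda r: r.priority):
        token = rule.extract(headers)
        if token is None:
            continue
        app = lookup.get((rule.id_type, token))
        if app is not None:
            return app
    return None


# Hand-written example rules and lookup table (illustrative data only).
RULES = [
    Rule("app_name", "User-Agent", "", "/", priority=0),
    Rule("app_id", "Referer", "app_id=", "&", priority=1),
]
LOOKUP = {
    ("app_name", "ExampleApp"): "Example App",
    ("app_id", "com.example.game"): "Example Game",
}

flow = {"Referer": "http://ads.example.com/?app_id=com.example.game&v=2"}
print(identify(flow, RULES, LOOKUP))
```

Each flow is handled independently, which mirrors the per-flow granularity described above; a flow whose extracted tokens are absent from the lookup table is simply left unidentified.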