Memory-efficient regular expression matching for Chinese network content audit
Zezhi Zhu, Ping Lin, Luying Chen, Kun Zhang
2009 IEEE International Conference on Network Infrastructure and Digital Content, 2009-12-31
DOI: 10.1109/ICNIDC.2009.5360785 (https://doi.org/10.1109/ICNIDC.2009.5360785)
Citation count: 0
Abstract
When matching Chinese keywords for network content audit, one of the biggest problems is interference from “noise characters”, which makes the traditional approach of matching explicit string patterns infeasible. Regular expression matching can solve this problem, but DFA-based approaches to regular expression matching suffer from excessive memory usage. In this paper, we address the problems encountered when applying regular expressions to Chinese network content audit. We propose a regular expression rewriting technique and a grouping principle that solve the excessive memory usage problem of DFA-based approaches. Our solution makes it possible to apply regular expressions to Chinese network content audit.
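
To illustrate the noise-character problem described above, the following is a minimal Python sketch, not the rewriting technique proposed in the paper. It assumes that "noise" means a short run of non-CJK characters inserted between the characters of a keyword; the function name `noise_tolerant_pattern`, the `max_noise` bound, the sample keyword, and the sample text are all hypothetical. Note that Python's `re` module is a backtracking engine rather than a DFA, so this only shows why an explicit string match fails where a regular expression succeeds. Bounded gaps of this kind are a typical source of state growth when such patterns are compiled to a DFA.

```python
import re


def noise_tolerant_pattern(keyword: str, max_noise: int = 3) -> re.Pattern:
    """Build a regex that tolerates noise between keyword characters.

    Assumption for illustration: noise is up to `max_noise` characters
    outside the CJK Unified Ideographs block (U+4E00 to U+9FFF).
    """
    gap = r"[^\u4e00-\u9fff]{0,%d}" % max_noise
    # Escape each keyword character and join them with the bounded gap.
    return re.compile(gap.join(re.escape(ch) for ch in keyword))


keyword = "敏感词"                      # hypothetical audit keyword
text = "这是一个敏*感-_词的例子"         # keyword obscured by noise characters

plain_hit = keyword in text            # explicit string matching
regex_hit = noise_tolerant_pattern(keyword).search(text) is not None

print("explicit string match:", plain_hit)
print("noise-tolerant regex match:", regex_hit)
```

In this sketch the explicit substring test misses the obscured keyword, while the regular expression finds it. Each keyword character adds one bounded gap to the pattern, which hints at why a large keyword set could become expensive to hold in a single DFA.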