{"title":"Context coding of parse trees","authors":"J. Tarhio","doi":"10.1109/DCC.1995.515552","DOIUrl":null,"url":null,"abstract":"Summary form only given. General-purpose text compression works normally at the lexical level assuming that symbols to be encoded are independent or they depend on preceding symbols within a fixed distance. Traditionally such syntactical models have been focused on compression of source programs, but also other areas are feasible. The compression of a parse tree is an important and challenging part of syntactical modeling. A parse tree can be represented by a left parse which is a sequence of productions applied in preorder. A left parse can be encoded efficiently with arithmetic coding using counts of production alternatives of each nonterminal. We introduce a more refined method which reduces the size of a compressed tree. A blending scheme, PPM (prediction by partial matching) produces very good compression on text files. In PPM, adaptive models of several context lengths are maintained and they are blended during processing. The k preceding symbols of the symbol to be encoded form the context of order k. We apply the PPM technique to a left parse so that we use contexts of nodes instead of contexts consisting of preceding symbols in the sequence. We tested our approach with parse trees of Pascal programs. Our method gave on the average 20 percent better compression than the standard method based on counts of production alternatives of nonterminals. In our model, an item of the context is a pair (production, branch). The form of the item seems to be crucial. We tested three other variations for an item: production, nonterminal, and (nonterminal, branch), but all these three approaches produced clearly worse results.","PeriodicalId":107017,"journal":{"name":"Proceedings DCC '95 Data Compression Conference","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1995-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings DCC '95 Data Compression Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DCC.1995.515552","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 10
Abstract
Summary form only given. General-purpose text compression works normally at the lexical level assuming that symbols to be encoded are independent or they depend on preceding symbols within a fixed distance. Traditionally such syntactical models have been focused on compression of source programs, but also other areas are feasible. The compression of a parse tree is an important and challenging part of syntactical modeling. A parse tree can be represented by a left parse which is a sequence of productions applied in preorder. A left parse can be encoded efficiently with arithmetic coding using counts of production alternatives of each nonterminal. We introduce a more refined method which reduces the size of a compressed tree. A blending scheme, PPM (prediction by partial matching) produces very good compression on text files. In PPM, adaptive models of several context lengths are maintained and they are blended during processing. The k preceding symbols of the symbol to be encoded form the context of order k. We apply the PPM technique to a left parse so that we use contexts of nodes instead of contexts consisting of preceding symbols in the sequence. We tested our approach with parse trees of Pascal programs. Our method gave on the average 20 percent better compression than the standard method based on counts of production alternatives of nonterminals. In our model, an item of the context is a pair (production, branch). The form of the item seems to be crucial. We tested three other variations for an item: production, nonterminal, and (nonterminal, branch), but all these three approaches produced clearly worse results.