{"title":"快速收敛与贪婪标签短语字典","authors":"T. Smith, Ross Peeters","doi":"10.1109/DCC.1998.672128","DOIUrl":null,"url":null,"abstract":"Lexical categories have been shown to assist in giving good compression results when incorporated into context models. This paper describes a greedy dictionary-based model that maintains a dictionary of tag-phrases, along with separate lexicons for each unique tag. The text is tagged with part-of-speech (POS) labels and then given to the encoder, which uses the tags to construct the phrase dictionary in a manner similar to LZ78. The output is a sequence of arithmetically encoded phrase number coupled with the information needed to match the correct word with each tag in the phrase. Each unique word (defined as each novel word/tag pair) is transmitted once when it is first encountered, then retained in the appropriate dictionary and thereafter arithmetically encoded according to the empirical distribution for that dictionary whenever the word is encountered. We present results from some empirical tests showing that this \"tag-phrase dictionary\" technique achieves nearly identical compression as that obtainable using PPM, an explicit-context model. This goes against the widely held view that greedy dictionary schemes require much larger samples of text before they can compete with statistical context methods. Some interesting theoretical issues pertaining to text compression in general are implied, and these are also discussed.","PeriodicalId":191890,"journal":{"name":"Proceedings DCC '98 Data Compression Conference (Cat. No.98TB100225)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1998-03-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Fast convergence with a greedy tag-phrase dictionary\",\"authors\":\"T. Smith, Ross Peeters\",\"doi\":\"10.1109/DCC.1998.672128\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Lexical categories have been shown to assist in giving good compression results when incorporated into context models. This paper describes a greedy dictionary-based model that maintains a dictionary of tag-phrases, along with separate lexicons for each unique tag. The text is tagged with part-of-speech (POS) labels and then given to the encoder, which uses the tags to construct the phrase dictionary in a manner similar to LZ78. The output is a sequence of arithmetically encoded phrase number coupled with the information needed to match the correct word with each tag in the phrase. Each unique word (defined as each novel word/tag pair) is transmitted once when it is first encountered, then retained in the appropriate dictionary and thereafter arithmetically encoded according to the empirical distribution for that dictionary whenever the word is encountered. We present results from some empirical tests showing that this \\\"tag-phrase dictionary\\\" technique achieves nearly identical compression as that obtainable using PPM, an explicit-context model. This goes against the widely held view that greedy dictionary schemes require much larger samples of text before they can compete with statistical context methods. Some interesting theoretical issues pertaining to text compression in general are implied, and these are also discussed.\",\"PeriodicalId\":191890,\"journal\":{\"name\":\"Proceedings DCC '98 Data Compression Conference (Cat. 
No.98TB100225)\",\"volume\":\"3 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"1998-03-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings DCC '98 Data Compression Conference (Cat. No.98TB100225)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/DCC.1998.672128\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings DCC '98 Data Compression Conference (Cat. No.98TB100225)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DCC.1998.672128","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Fast convergence with a greedy tag-phrase dictionary
Lexical categories have been shown to improve compression when incorporated into context models. This paper describes a greedy dictionary-based model that maintains a dictionary of tag-phrases, along with a separate lexicon for each unique tag. The text is tagged with part-of-speech (POS) labels and then passed to the encoder, which uses the tags to construct the phrase dictionary in a manner similar to LZ78. The output is a sequence of arithmetically encoded phrase numbers, coupled with the information needed to match the correct word to each tag in the phrase. Each unique word (defined as each novel word/tag pair) is transmitted in full when first encountered, then retained in the appropriate lexicon and thereafter arithmetically encoded according to that lexicon's empirical distribution whenever it recurs. We present results from empirical tests showing that this "tag-phrase dictionary" technique achieves compression nearly identical to that obtained with PPM, an explicit-context model. This contradicts the widely held view that greedy dictionary schemes require much larger samples of text before they can compete with statistical context methods. Some interesting theoretical issues pertaining to text compression in general are implied, and these are also discussed.
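The abstract names three moving parts: an LZ78-style dictionary over tag sequences, a per-tag lexicon of words, and arithmetic coding over the resulting symbol streams. The sketch below illustrates that structure in Python. It is a minimal reconstruction from the abstract alone, not the authors' implementation: the arithmetic coder is elided (the encoder emits symbolic events that a real coder would encode against the maintained counts), and the exact token layout, such as whether the novel tag is emitted alongside the phrase number as in classic LZ78, is an assumption.

```python
from collections import defaultdict

def encode(tagged_text):
    """Sketch of a greedy tag-phrase encoder (reconstruction, not the paper's code).

    tagged_text: list of (word, tag) pairs, as produced by a POS tagger.
    Returns a list of symbolic events standing in for arithmetic-coded output.
    """
    phrase_dict = {(): 0}          # LZ78-style dictionary keyed by tag tuples
    lexicons = defaultdict(dict)   # one word lexicon per unique tag: word -> id
    events = []

    def emit_word(word, tag):
        lex = lexicons[tag]
        if word in lex:
            # Known word/tag pair: coded against the lexicon's empirical stats.
            events.append(("WORD", tag, lex[word]))
        else:
            # Novel word/tag pair: spelled out once, then added to the lexicon.
            lex[word] = len(lex)
            events.append(("NEW_WORD", tag, word))

    phrase = ()        # longest tag sequence matched so far (greedy match)
    buffered = []      # the (word, tag) pairs covered by the current match
    for word, tag in tagged_text:
        extended = phrase + (tag,)
        buffered.append((word, tag))
        if extended in phrase_dict:
            phrase = extended      # keep extending the greedy match
        else:
            # Match ends: emit the phrase number plus the novel tag, grow the
            # dictionary with the extended phrase (as LZ78 grows its trie),
            # then resolve each tag in the phrase to its word.
            events.append(("PHRASE", phrase_dict[phrase], tag))
            for w, t in buffered:
                emit_word(w, t)
            phrase_dict[extended] = len(phrase_dict)
            phrase, buffered = (), []

    if buffered:  # flush a trailing partial match
        events.append(("PHRASE", phrase_dict[phrase], None))
        for w, t in buffered:
            emit_word(w, t)
    return events

tagged = [("the", "DT"), ("cat", "NN"), ("sat", "VB"),
          ("the", "DT"), ("dog", "NN"), ("ran", "VB")]
for event in encode(tagged):
    print(event)
```

Running the toy example shows where the fast convergence comes from: on the second occurrence of "the" the word is coded by its lexicon index rather than spelled out, and the tag sequence "DT NN" is added to the phrase dictionary, so any later occurrence of that pattern costs a single phrase number.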