Fast convergence with a greedy tag-phrase dictionary

T. Smith, Ross Peeters
{"title":"快速收敛与贪婪标签短语字典","authors":"T. Smith, Ross Peeters","doi":"10.1109/DCC.1998.672128","DOIUrl":null,"url":null,"abstract":"Lexical categories have been shown to assist in giving good compression results when incorporated into context models. This paper describes a greedy dictionary-based model that maintains a dictionary of tag-phrases, along with separate lexicons for each unique tag. The text is tagged with part-of-speech (POS) labels and then given to the encoder, which uses the tags to construct the phrase dictionary in a manner similar to LZ78. The output is a sequence of arithmetically encoded phrase number coupled with the information needed to match the correct word with each tag in the phrase. Each unique word (defined as each novel word/tag pair) is transmitted once when it is first encountered, then retained in the appropriate dictionary and thereafter arithmetically encoded according to the empirical distribution for that dictionary whenever the word is encountered. We present results from some empirical tests showing that this \"tag-phrase dictionary\" technique achieves nearly identical compression as that obtainable using PPM, an explicit-context model. This goes against the widely held view that greedy dictionary schemes require much larger samples of text before they can compete with statistical context methods. Some interesting theoretical issues pertaining to text compression in general are implied, and these are also discussed.","PeriodicalId":191890,"journal":{"name":"Proceedings DCC '98 Data Compression Conference (Cat. No.98TB100225)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1998-03-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Fast convergence with a greedy tag-phrase dictionary\",\"authors\":\"T. Smith, Ross Peeters\",\"doi\":\"10.1109/DCC.1998.672128\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Lexical categories have been shown to assist in giving good compression results when incorporated into context models. This paper describes a greedy dictionary-based model that maintains a dictionary of tag-phrases, along with separate lexicons for each unique tag. The text is tagged with part-of-speech (POS) labels and then given to the encoder, which uses the tags to construct the phrase dictionary in a manner similar to LZ78. The output is a sequence of arithmetically encoded phrase number coupled with the information needed to match the correct word with each tag in the phrase. Each unique word (defined as each novel word/tag pair) is transmitted once when it is first encountered, then retained in the appropriate dictionary and thereafter arithmetically encoded according to the empirical distribution for that dictionary whenever the word is encountered. We present results from some empirical tests showing that this \\\"tag-phrase dictionary\\\" technique achieves nearly identical compression as that obtainable using PPM, an explicit-context model. This goes against the widely held view that greedy dictionary schemes require much larger samples of text before they can compete with statistical context methods. Some interesting theoretical issues pertaining to text compression in general are implied, and these are also discussed.\",\"PeriodicalId\":191890,\"journal\":{\"name\":\"Proceedings DCC '98 Data Compression Conference (Cat. 
No.98TB100225)\",\"volume\":\"3 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"1998-03-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings DCC '98 Data Compression Conference (Cat. No.98TB100225)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/DCC.1998.672128\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings DCC '98 Data Compression Conference (Cat. No.98TB100225)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DCC.1998.672128","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1

Abstract

Lexical categories have been shown to assist in giving good compression results when incorporated into context models. This paper describes a greedy dictionary-based model that maintains a dictionary of tag-phrases, along with separate lexicons for each unique tag. The text is tagged with part-of-speech (POS) labels and then given to the encoder, which uses the tags to construct the phrase dictionary in a manner similar to LZ78. The output is a sequence of arithmetically encoded phrase numbers, each coupled with the information needed to match the correct word with each tag in the phrase. Each unique word (defined as each novel word/tag pair) is transmitted once when it is first encountered, then retained in the appropriate lexicon and thereafter arithmetically encoded according to the empirical distribution for that lexicon whenever it recurs. We present results from empirical tests showing that this "tag-phrase dictionary" technique achieves compression nearly identical to that obtainable using PPM, an explicit-context model. This goes against the widely held view that greedy dictionary schemes require much larger samples of text before they can compete with statistical context methods. Some interesting theoretical issues pertaining to text compression in general are implied, and these are also discussed.
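To make the encoding loop concrete, below is a minimal Python sketch of the greedy, LZ78-style construction the abstract describes, assuming the input has already been POS-tagged into (word, tag) pairs. The `encode` function and its symbolic output tuples are illustrative assumptions, not the authors' implementation; in the real coder, the phrase numbers and lexicon indices would be arithmetically encoded against the empirical distributions noted in the comments.

```python
from collections import defaultdict

def encode(tagged_text):
    """Sketch of the greedy tag-phrase scheme. tagged_text: list of (word, tag) pairs."""
    phrases = {(): 0}             # LZ78-style dictionary of tag sequences; phrase 0 is empty
    lexicons = defaultdict(dict)  # one lexicon per unique tag: tag -> {word: index}
    output = []
    i = 0
    while i < len(tagged_text):
        # Greedily find the longest tag sequence starting at i that is already a phrase.
        phrase = ()
        j = i
        while j < len(tagged_text) and phrase + (tagged_text[j][1],) in phrases:
            phrase += (tagged_text[j][1],)
            j += 1
        # LZ78 step: emit (phrase number, next tag), then register the extended phrase.
        if j < len(tagged_text):
            next_tag = tagged_text[j][1]
            output.append(("phrase", phrases[phrase], next_tag))
            phrases[phrase + (next_tag,)] = len(phrases)
            j += 1
        else:
            output.append(("phrase", phrases[phrase], None))
        # Emit the word filling each tag slot: a literal on first sight, an index after.
        # In the paper's coder these indices are arithmetically encoded according to
        # the empirical distribution of the tag's lexicon; here they stay symbolic.
        for word, tag in tagged_text[i:j]:
            lex = lexicons[tag]
            if word in lex:
                output.append(("word", tag, lex[word]))
            else:
                output.append(("new_word", tag, word))  # novel word/tag pair, sent once
                lex[word] = len(lex)
        i = j
    return output

if __name__ == "__main__":
    tokens = [("the", "DET"), ("cat", "N"), ("sat", "V"),
              ("the", "DET"), ("dog", "N"), ("ran", "V")]
    print(encode(tokens))  # second "the DET" reuses phrase 1 and lexicon index 0
```

A decoder would mirror this walk, rebuilding the identical phrase dictionary and per-tag lexicons from the token stream; that shared, deterministically grown state is what allows each novel word/tag pair to be transmitted literally only once.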