Fast convergence with a greedy tag-phrase dictionary

T. Smith, Ross Peeters
{"title":"快速收敛与贪婪标签短语字典","authors":"T. Smith, Ross Peeters","doi":"10.1109/DCC.1998.672128","DOIUrl":null,"url":null,"abstract":"Lexical categories have been shown to assist in giving good compression results when incorporated into context models. This paper describes a greedy dictionary-based model that maintains a dictionary of tag-phrases, along with separate lexicons for each unique tag. The text is tagged with part-of-speech (POS) labels and then given to the encoder, which uses the tags to construct the phrase dictionary in a manner similar to LZ78. The output is a sequence of arithmetically encoded phrase number coupled with the information needed to match the correct word with each tag in the phrase. Each unique word (defined as each novel word/tag pair) is transmitted once when it is first encountered, then retained in the appropriate dictionary and thereafter arithmetically encoded according to the empirical distribution for that dictionary whenever the word is encountered. We present results from some empirical tests showing that this \"tag-phrase dictionary\" technique achieves nearly identical compression as that obtainable using PPM, an explicit-context model. This goes against the widely held view that greedy dictionary schemes require much larger samples of text before they can compete with statistical context methods. Some interesting theoretical issues pertaining to text compression in general are implied, and these are also discussed.","PeriodicalId":191890,"journal":{"name":"Proceedings DCC '98 Data Compression Conference (Cat. No.98TB100225)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1998-03-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Fast convergence with a greedy tag-phrase dictionary\",\"authors\":\"T. Smith, Ross Peeters\",\"doi\":\"10.1109/DCC.1998.672128\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Lexical categories have been shown to assist in giving good compression results when incorporated into context models. This paper describes a greedy dictionary-based model that maintains a dictionary of tag-phrases, along with separate lexicons for each unique tag. The text is tagged with part-of-speech (POS) labels and then given to the encoder, which uses the tags to construct the phrase dictionary in a manner similar to LZ78. The output is a sequence of arithmetically encoded phrase number coupled with the information needed to match the correct word with each tag in the phrase. Each unique word (defined as each novel word/tag pair) is transmitted once when it is first encountered, then retained in the appropriate dictionary and thereafter arithmetically encoded according to the empirical distribution for that dictionary whenever the word is encountered. We present results from some empirical tests showing that this \\\"tag-phrase dictionary\\\" technique achieves nearly identical compression as that obtainable using PPM, an explicit-context model. This goes against the widely held view that greedy dictionary schemes require much larger samples of text before they can compete with statistical context methods. Some interesting theoretical issues pertaining to text compression in general are implied, and these are also discussed.\",\"PeriodicalId\":191890,\"journal\":{\"name\":\"Proceedings DCC '98 Data Compression Conference (Cat. 
No.98TB100225)\",\"volume\":\"3 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"1998-03-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings DCC '98 Data Compression Conference (Cat. No.98TB100225)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/DCC.1998.672128\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings DCC '98 Data Compression Conference (Cat. No.98TB100225)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DCC.1998.672128","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1

Abstract

Lexical categories have been shown to assist in giving good compression results when incorporated into context models. This paper describes a greedy dictionary-based model that maintains a dictionary of tag-phrases, along with separate lexicons for each unique tag. The text is tagged with part-of-speech (POS) labels and then given to the encoder, which uses the tags to construct the phrase dictionary in a manner similar to LZ78. The output is a sequence of arithmetically encoded phrase numbers, each coupled with the information needed to match the correct word with each tag in the phrase. Each unique word (defined as each novel word/tag pair) is transmitted once when it is first encountered, then retained in the appropriate lexicon and thereafter arithmetically encoded according to the empirical distribution for that lexicon whenever it recurs. We present results from empirical tests showing that this "tag-phrase dictionary" technique achieves compression nearly identical to that obtainable using PPM, an explicit-context model. This goes against the widely held view that greedy dictionary schemes require much larger samples of text before they can compete with statistical context methods. Some interesting theoretical issues pertaining to text compression in general are implied, and these are also discussed.
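To make the encoding loop concrete, below is a minimal Python sketch of the greedy, LZ78-style construction the abstract describes, assuming the input has already been POS-tagged into (word, tag) pairs. The `encode` function and its symbolic output tuples are illustrative assumptions, not the authors' implementation; in the real coder, the phrase numbers and lexicon indices would be arithmetically encoded against the empirical distributions noted in the comments.

```python
from collections import defaultdict

def encode(tagged_text):
    """Sketch of the greedy tag-phrase scheme. tagged_text: list of (word, tag) pairs."""
    phrases = {(): 0}             # LZ78-style dictionary of tag sequences; phrase 0 is empty
    lexicons = defaultdict(dict)  # one lexicon per unique tag: tag -> {word: index}
    output = []
    i = 0
    while i < len(tagged_text):
        # Greedily find the longest tag sequence starting at i that is already a phrase.
        phrase = ()
        j = i
        while j < len(tagged_text) and phrase + (tagged_text[j][1],) in phrases:
            phrase += (tagged_text[j][1],)
            j += 1
        # LZ78 step: emit (phrase number, next tag), then register the extended phrase.
        if j < len(tagged_text):
            next_tag = tagged_text[j][1]
            output.append(("phrase", phrases[phrase], next_tag))
            phrases[phrase + (next_tag,)] = len(phrases)
            j += 1
        else:
            output.append(("phrase", phrases[phrase], None))
        # Emit the word filling each tag slot: a literal on first sight, an index after.
        # In the paper's coder these indices are arithmetically encoded according to
        # the empirical distribution of the tag's lexicon; here they stay symbolic.
        for word, tag in tagged_text[i:j]:
            lex = lexicons[tag]
            if word in lex:
                output.append(("word", tag, lex[word]))
            else:
                output.append(("new_word", tag, word))  # novel word/tag pair, sent once
                lex[word] = len(lex)
        i = j
    return output

if __name__ == "__main__":
    tokens = [("the", "DET"), ("cat", "N"), ("sat", "V"),
              ("the", "DET"), ("dog", "N"), ("ran", "V")]
    print(encode(tokens))  # second "the DET" reuses phrase 1 and lexicon index 0
```

A decoder would mirror this walk, rebuilding the identical phrase dictionary and per-tag lexicons from the token stream; that shared, deterministically grown state is what allows each novel word/tag pair to be transmitted literally only once.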