Modeling word occurrences for the compression of concordances

A. Bookstein, S. T. Klein, T. Raita

Proceedings DCC '95 Data Compression Conference, published 1995-03-28
DOI: 10.1109/DCC.1995.515572 (https://doi.org/10.1109/DCC.1995.515572)
Citations: 24

Abstract

Summary form only given. Effective compression of a text-based information retrieval system involves compressing not only the text itself but also the concordance through which the text is accessed, which occupies an amount of storage comparable to that of the text. The concordance can be a rather complicated data structure, especially if it permits hierarchical access to the database, but one or more components of the hierarchy can usually be conceptualized as a bit-map. We conceptualize the bit-map as being generated as follows. At any bit-map site we are in one of two states: a cluster state (C) or a between-cluster state (B). In a given state, we generate a bit-map value of zero or one and, governed by the transition probabilities of the model, enter a new state as we move to the next bit-map site. Such a model is referred to in the literature as a hidden Markov model. Unfortunately, this model is analytically difficult to use. To approximate it, we introduce several traditional Markov models with four states each: B and C as above, plus two transitional states. We present the models, show how they are connected, and state the formal compression algorithm based on them. We also include some experimental results.
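
The generative picture in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not give the four-state models or any parameter values, so the function names (`generate_bitmap`, `markov_code_length`), the probabilities, and the choice to start in state B are all assumptions made for illustration. The second function estimates an ideal code length under a plain first-order Markov chain over the observed bits, which is a simpler stand-in for the four-state approximation the abstract mentions.

```python
import math
import random


def generate_bitmap(n, p_one, p_stay, seed=0):
    """Generate n bits from a two-state hidden Markov model.

    p_one[s]  : probability of emitting a 1 while in state s ('C' or 'B')
    p_stay[s] : probability of staying in state s at the next bit-map site
    All parameter values are hypothetical; the abstract gives none.
    """
    rng = random.Random(seed)
    state = "B"  # assumption: start between clusters
    bits = []
    for _ in range(n):
        # Emit a bit according to the current (hidden) state.
        bits.append(1 if rng.random() < p_one[state] else 0)
        # Move to the next site, possibly switching state.
        if rng.random() >= p_stay[state]:
            state = "C" if state == "B" else "B"
    return bits


def markov_code_length(bits):
    """Ideal code length (in bits) under a first-order Markov chain on the
    observed bits, with transition probabilities estimated from the data.
    The cost of transmitting the estimated parameters is ignored.
    """
    if not bits:
        return 0.0
    counts = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    for prev, cur in zip(bits, bits[1:]):
        counts[(prev, cur)] += 1
    total = 1.0  # the first bit is sent uncoded
    for (prev, cur), c in counts.items():
        row = counts[(prev, 0)] + counts[(prev, 1)]
        if c:
            total += -c * math.log2(c / row)
    return total


if __name__ == "__main__":
    bitmap = generate_bitmap(
        10_000,
        p_one={"C": 0.6, "B": 0.01},   # 1-bits are dense inside clusters
        p_stay={"C": 0.9, "B": 0.99},  # clusters are short and rare
        seed=1,
    )
    print(len(bitmap), "raw bits ->", round(markov_code_length(bitmap)), "coded bits")
```

Because 1-bits are concentrated in clusters, conditioning each bit on its predecessor already captures part of the structure; the estimated code length can never exceed the raw bitmap length, and for sparse, clustered maps it is typically much smaller.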