Enterprise text processing: a sparse matrix approach

Nazli Goharian, D. Grossman, T. El-Ghazawi
Proceedings International Conference on Information Technology: Coding and Computing
Published: 2001-04-02
DOI: 10.1109/ITCC.2001.918768 (https://doi.org/10.1109/ITCC.2001.918768)
Citations: 21

Abstract

Documents, both internal and related publicly available ones, are now considered a corporate asset. The ability to search such documents efficiently and accurately is of great significance. We demonstrate the application of sparse matrix-vector multiplication algorithms to text storage and retrieval as a means of supporting efficient and accurate text processing. Because many parallel sparse matrix-vector multiplication algorithms exist, such an information retrieval approach lends itself to parallelism. This enables us to attack the problem of parallel information retrieval, which has resisted good scalability. We use sparse matrix compression algorithms and compare the storage of a subcollection of the commonly used NIST TREC corpus with that of a traditional inverted index. We demonstrate query processing using a sparse matrix-vector multiplication algorithm. Our results indicate that our approach requires approximately 35% less total storage than the inverted index. Additionally, to improve accuracy, we develop a novel matrix-based relevance feedback technique as well as a proximity search algorithm. The results of our experiment to incorporate proximity search capability into the system also indicate 35% less storage for the sparse matrix than for the inverted index.
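
To make the idea of query processing as sparse matrix-vector multiplication concrete, the following is a minimal sketch, not the authors' implementation. It assumes a document-by-term matrix stored in compressed sparse row (CSR) form and raw term counts as weights; the abstract does not state which compression scheme or weighting the paper uses, and the function names (`build_csr`, `spmv`, `score_query`) are hypothetical.

```python
from collections import Counter


def build_csr(docs):
    """Build a document-by-term count matrix in CSR form (assumed format).

    Returns (values, col_idx, row_ptr, vocab), where row r occupies
    positions row_ptr[r] .. row_ptr[r + 1] - 1 of values and col_idx.
    """
    vocab = {}                       # term -> column index
    values, col_idx, row_ptr = [], [], [0]
    for text in docs:
        counts = Counter(text.lower().split())
        for term in sorted(counts):
            col = vocab.setdefault(term, len(vocab))
            col_idx.append(col)
            values.append(counts[term])
        row_ptr.append(len(values))  # marks the end of this document's row
    return values, col_idx, row_ptr, vocab


def spmv(values, col_idx, row_ptr, x):
    """Compute y = A x for a CSR matrix A; y has one entry per row."""
    y = []
    for r in range(len(row_ptr) - 1):
        total = 0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            total += values[k] * x[col_idx[k]]
        y.append(total)
    return y


def score_query(query, csr):
    """Turn the query into a dense term vector and score every document."""
    values, col_idx, row_ptr, vocab = csr
    x = [0] * len(vocab)
    for term in query.lower().split():
        if term in vocab:            # terms outside the collection are ignored
            x[vocab[term]] += 1
    return spmv(values, col_idx, row_ptr, x)


# Toy collection, invented for illustration only.
docs = [
    "sparse matrix storage",
    "inverted index storage",
    "sparse matrix vector multiplication",
]
csr = build_csr(docs)
scores = score_query("sparse matrix", csr)
```

Each row of the product is computed independently of the others, which is why this formulation can draw on existing parallel sparse matrix-vector multiplication algorithms: rows can be partitioned across processors without changing the per-row computation.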