Enterprise text processing: a sparse matrix approach
Nazli Goharian, D. Grossman, T. El-Ghazawi
Proceedings International Conference on Information Technology: Coding and Computing, 2001-04-02
https://doi.org/10.1109/ITCC.2001.918768
Documents, both internal ones and related publicly available ones, are now considered a corporate asset. The ability to search such documents efficiently and accurately is of great significance. We demonstrate the application of sparse matrix-vector multiplication algorithms to text storage and retrieval as a means of supporting efficient and accurate text processing. Because many parallel sparse matrix-vector multiplication algorithms exist, such an information retrieval approach lends itself to parallelism. This enables us to attack the problem of parallel information retrieval, which has resisted good scalability. We use sparse matrix compression algorithms and compare the storage of a subcollection of the commonly used NIST TREC corpus with that of a traditional inverted index. We demonstrate query processing using a sparse matrix-vector multiplication algorithm. Our results indicate that our approach saves approximately 35% of the total storage required by the inverted index. Additionally, to improve accuracy, we develop a novel matrix-based relevance feedback technique as well as a proximity search algorithm. The results of our experiment incorporating proximity search capability into the system also indicate that the sparse matrix needs 35% less storage than the inverted index.
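The idea of scoring documents with a sparse matrix-vector product can be illustrated with a small sketch. The abstract does not name the compression format or the term weighting the authors use, so everything here is an assumption for illustration: the compressed sparse row (CSR) layout, the raw term counts as weights, the whitespace tokenizer, and the hypothetical helper names `build_csr` and `score`.

```python
from collections import Counter


def build_csr(docs):
    """Build a document-by-term count matrix in CSR form (assumed layout).

    values  : nonzero term counts, stored row by row
    col_idx : term id (column) of each stored count
    row_ptr : row_ptr[i]..row_ptr[i+1] delimits the entries of document i
    """
    vocab = {}
    values, col_idx, row_ptr = [], [], [0]
    for text in docs:
        counts = Counter(text.lower().split())
        for term in sorted(counts):
            # Assign a new column id the first time a term is seen.
            col = vocab.setdefault(term, len(vocab))
            col_idx.append(col)
            values.append(counts[term])
        row_ptr.append(len(values))
    return vocab, values, col_idx, row_ptr


def score(query, vocab, values, col_idx, row_ptr):
    """Multiply the CSR matrix by the query vector: one score per document."""
    # Sparse query vector as {term id: count}; unknown terms are ignored.
    q = {}
    for term in query.lower().split():
        if term in vocab:
            q[vocab[term]] = q.get(vocab[term], 0) + 1
    scores = []
    for row in range(len(row_ptr) - 1):
        total = 0
        # Only the stored nonzeros of this row contribute to the dot product.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            total += values[k] * q.get(col_idx[k], 0)
        scores.append(total)
    return scores


docs = [
    "sparse matrix storage",
    "inverted index storage for text retrieval",
    "sparse matrix vector multiplication for text retrieval",
]
vocab, values, col_idx, row_ptr = build_csr(docs)
scores = score("sparse matrix retrieval", vocab, values, col_idx, row_ptr)
print(scores)  # [2, 1, 3]
```

Each row's dot product is independent of the others, which is one way to see why such a formulation lends itself to parallelism: rows can be partitioned across processors and scored separately.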