Wen Pu, Ning Liu, Shuicheng Yan, Jun Yan, Kunqing Xie, Zheng Chen
{"title":"Local Word Bag Model for Text Categorization","authors":"Wen Pu, Ning Liu, Shuicheng Yan, Jun Yan, Kunqing Xie, Zheng Chen","doi":"10.1109/ICDM.2007.69","DOIUrl":null,"url":null,"abstract":"Many text processing applications adopted the bag of words (BOW) model representation of documents, in which each document is represented as a vector of weighted terms or n-grams, and then the cosine distance between two vectors is used as the similarity measurement. Although the great success in information retrieval and text categorization, the conventional BOW model ignores the detailed local text information, i.e. the co-occurrence pattern of words at sentence or paragraph level. In this paper, we propose a novel approach to represent a document as a set of local tf-idf vectors, or what we called local word bags (LWB). By encapsulating local information distributed around a document into multiple LWBs, we can measure the similarity of two documents via the partial match of their corresponding local bags. To perform the matching efficiently, we introduce the local word bag kernel (LWB kernel), a variant of VG-Pyramid match kernel. The new kernel enables the discriminative machine learning methods like SVM to compute the partial matching between two sets of LWBs in linear time after an one time hierarchical clustering procedure over all local bags at the initialization stage. Experiments on real world datasets demonstrate the effectiveness of our new approach.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"78 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"26","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDM.2007.69","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 26
Abstract
Many text processing applications adopted the bag of words (BOW) model representation of documents, in which each document is represented as a vector of weighted terms or n-grams, and then the cosine distance between two vectors is used as the similarity measurement. Although the great success in information retrieval and text categorization, the conventional BOW model ignores the detailed local text information, i.e. the co-occurrence pattern of words at sentence or paragraph level. In this paper, we propose a novel approach to represent a document as a set of local tf-idf vectors, or what we called local word bags (LWB). By encapsulating local information distributed around a document into multiple LWBs, we can measure the similarity of two documents via the partial match of their corresponding local bags. To perform the matching efficiently, we introduce the local word bag kernel (LWB kernel), a variant of VG-Pyramid match kernel. The new kernel enables the discriminative machine learning methods like SVM to compute the partial matching between two sets of LWBs in linear time after an one time hierarchical clustering procedure over all local bags at the initialization stage. Experiments on real world datasets demonstrate the effectiveness of our new approach.
许多文本处理应用采用BOW (bag of words)模型表示文档,将每个文档表示为加权项向量或n-gram,然后使用两个向量之间的余弦距离作为相似度度量。传统的BOW模型虽然在信息检索和文本分类方面取得了巨大的成功,但忽略了局部文本的详细信息,即句子或段落层面的词共现模式。在本文中,我们提出了一种新的方法,将文档表示为一组局部tf-idf向量,或者我们称之为局部词包(LWB)。通过将分布在文档周围的局部信息封装到多个lwb中,我们可以通过两个文档对应的局部包的部分匹配来度量两个文档的相似性。为了有效地进行匹配,我们引入了局部词包核(LWB核),它是vg -金字塔匹配核的一种变体。新内核使得判别机器学习方法(如SVM)在初始化阶段对所有局部袋进行一次分层聚类过程后,在线性时间内计算两组LWBs之间的部分匹配。在真实世界数据集上的实验证明了我们的新方法的有效性。