Building a topic hierarchy using the bag-of-related-words representation

R. G. Rossi, S. O. Rezende
Proceedings of the ACM Symposium on Document Engineering, pp. 195-204. Published 2011-09-19.
DOI: https://doi.org/10.1145/2034691.2034733
Citations: 18

Abstract

A simple and intuitive way to organize a huge document collection is a topic hierarchy. Building a topic hierarchy automatically generally takes two steps: 1) hierarchical document clustering and 2) cluster labeling. Both steps depend on a good textual document representation. The bag-of-words is the common way to represent text collections: each document is represented by a vector in which each word in the document collection is a dimension (feature). This approach has well-known problems, such as high dimensionality and data sparsity. Moreover, most concepts are composed of more than one word, such as "document engineering" or "text mining". This paper proposes an approach called bag-of-related-words, which generates features composed of sets of related words and has a lower dimensionality than the bag-of-words. The features are extracted from each textual document of a collection using association rules. We analyze different ways of mapping a document into transactions, so that association rules can be extracted, and different interest measures for pruning the number of features. To evaluate how much the proposed approach aids topic hierarchy building, we carried out an objective evaluation of the clustering structure and a subjective evaluation of the topic hierarchies. All results were compared with those of the bag-of-words. The results showed that the proposed representation is better than the bag-of-words for topic hierarchy building.
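
The contrast between the two representations can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which document-to-transaction mapping or which interest measure worked best, so the sketch assumes one transaction per sentence, considers only word pairs, and prunes with support and confidence. The function names and threshold values are hypothetical.

```python
from collections import Counter
from itertools import combinations
import re


def tokenize(text):
    """Lowercase and keep alphabetic runs (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(docs):
    """Bag-of-words: one dimension per distinct word in the whole collection."""
    vocab = sorted({w for d in docs for w in tokenize(d)})
    vectors = []
    for d in docs:
        counts = Counter(tokenize(d))
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors


def to_transactions(doc):
    """Assumed mapping: each sentence becomes one transaction (a set of words)."""
    sentences = re.split(r"[.!?]+", doc)
    return [set(tokenize(s)) for s in sentences if tokenize(s)]


def related_word_features(doc, min_support=0.5, min_confidence=0.8):
    """Word pairs of one document backed by at least one association rule.

    Support and confidence are used as the pruning measures here; this
    choice is an assumption made for illustration.
    """
    transactions = to_transactions(doc)
    n = len(transactions)
    item_count = Counter(w for t in transactions for w in t)
    pair_count = Counter(
        frozenset(p) for t in transactions for p in combinations(sorted(t), 2)
    )
    features = set()
    for pair, count in pair_count.items():
        if count / n < min_support:
            continue  # the pair co-occurs in too few transactions
        a, b = sorted(pair)
        # Confidence of a -> b is count / item_count[a]; b -> a is analogous.
        best = max(count / item_count[a], count / item_count[b])
        if best >= min_confidence:
            features.add(f"{a}_{b}")
    return features


doc = ("Text mining finds patterns. Text mining needs clean data. "
       "Clustering groups documents.")
vocab, _ = bag_of_words([doc])
print(len(vocab))                  # 10 single-word dimensions
print(related_word_features(doc))  # {'mining_text'}
```

In this toy document the bag-of-words yields ten dimensions, while the rule-based features keep only the pair that co-occurs often enough, which is the kind of dimensionality reduction the abstract describes.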