Text Classification Using a Graph of Terms
Paolo Napoletano, F. Colace, M. D. Santo, L. Greco
2012 Sixth International Conference on Complex, Intelligent, and Software Intensive Systems
Published: 2012-07-04 · DOI: 10.1109/CISIS.2012.183 · Citations: 13
Abstract
It is well known that supervised text classification methods need many labeled examples to achieve high accuracy. In real settings, however, sufficient labeled examples are not always available, which has motivated recent interest in methods that remain accurate even when the training set is small. The main purpose of text mining techniques is to identify common patterns by observing vectors of features and then to use those patterns to make predictions. Most existing methods rely on a feature vector of weighted words, which is unfortunately insufficiently discriminative when the number of features is much larger than the number of labeled examples. In this paper we show that, to analyze and reveal common patterns more accurately, we can employ richer features than simple weighted words. The proposed feature vector is based on a hierarchical structure, named a mixed Graph of Terms, composed of a directed and an undirected sub-graph of words, which can be constructed automatically from a set of documents through a probabilistic Topic Model. The method has been tested on the top 10 classes of the ModApte split of the Reuters-21578 dataset, trained on several subsets of the original training set, and it outperforms a method that uses a list of weighted words as the feature vector together with linear support vector machines.
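For context, the sketch below shows roughly how the weighted-words baseline mentioned in the abstract could be set up: TF-IDF features with one-vs-rest linear SVMs, evaluated on the top-10 ModApte categories of Reuters-21578 over shrinking subsets of the training split. This is not the proposed mixed Graph of Terms method; the use of the NLTK copy of the corpus, the hard-coded category list, the subset fractions, and the micro-averaged F1 metric are illustrative assumptions rather than details taken from the paper.

# Minimal sketch of a weighted-words + linear SVM baseline on Reuters-21578.
# Assumes the NLTK Reuters corpus is installed (run nltk.download('reuters') once).
import random

from nltk.corpus import reuters
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# The ten most frequent ModApte categories (standard "top-10" choice).
TOP10 = ["earn", "acq", "money-fx", "grain", "crude",
         "trade", "interest", "ship", "wheat", "corn"]

def load_split(prefix):
    """Collect (text, labels) pairs for one side of the ModApte split."""
    docs, labels = [], []
    for fid in reuters.fileids():
        cats = [c for c in reuters.categories(fid) if c in TOP10]
        if fid.startswith(prefix) and cats:
            docs.append(reuters.raw(fid))
            labels.append(cats)
    return docs, labels

train_docs, train_labels = load_split("training/")
test_docs, test_labels = load_split("test/")

mlb = MultiLabelBinarizer(classes=TOP10)
y_test = mlb.fit_transform(test_labels)

random.seed(0)
for fraction in (1.0, 0.5, 0.1):  # shrinking training subsets (illustrative sizes)
    n = max(1, int(fraction * len(train_docs)))
    idx = random.sample(range(len(train_docs)), n)
    sub_docs = [train_docs[i] for i in idx]
    y_sub = mlb.transform([train_labels[i] for i in idx])

    # Weighted-words representation: TF-IDF over the training subset's vocabulary.
    vec = TfidfVectorizer(sublinear_tf=True, stop_words="english")
    X_sub = vec.fit_transform(sub_docs)
    X_test = vec.transform(test_docs)

    # One binary linear SVM per category, then micro-averaged F1 on the test split.
    clf = OneVsRestClassifier(LinearSVC()).fit(X_sub, y_sub)
    micro_f1 = f1_score(y_test, clf.predict(X_test), average="micro")
    print(f"train fraction {fraction:.2f}: micro-F1 = {micro_f1:.3f}")

Shrinking the training fraction in the loop mimics the paper's setting of learning from progressively smaller labeled sets, which is where a plain weighted-words representation tends to lose discriminative power.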