A comprehensive analysis of using semantic information in text categorization

2013 IEEE INISTA Pub Date : 2013-06-19 DOI:10.1109/INISTA.2013.6577651

Kerem Çelik, T. Gungor

Citations: 16

Abstract

Traditional text categorization methods deal only with the content of the documents and use statistics-based metrics to represent them. A machine learning approach then uses this representation to determine the document class. In this picture, the meaning of the document is missing. To add meaning to the text categorization process, we start with part-of-speech (POS) tagging. As expected, the part-of-speech tags in a document do not all contribute the same amount of information to its meaning. In addition to the POS information, we use WordNet to add semantic features such as synonyms, hypernyms, hyponyms, meronyms and topics to the classification process. Using WordNet's semantic features introduces ambiguity, and not all semantic features are actually related to the document content. To overcome this problem, we introduce a new method to eliminate the ambiguity. Various combinations of POS, WordNet and word sense disambiguation are applied, and the results show that using semantic features performs better than the traditional, context-based methods.
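The pipeline the abstract outlines (POS-dependent term weighting, semantic expansion, then disambiguation) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the POS weights, the `SEMANTIC_WEIGHT` factor, and the tiny hand-written `TOY_LEXICON` that stands in for WordNet are hypothetical. The sense selection is a simplified Lesk-style gloss overlap, swapped in only because the abstract does not describe the authors' own disambiguation method.

```python
from collections import Counter

# Hypothetical weights: the abstract says POS tags contribute unequally,
# but these particular values are assumptions, not taken from the paper.
POS_WEIGHTS = {"NOUN": 1.0, "VERB": 0.7, "ADJ": 0.4}
SEMANTIC_WEIGHT = 0.5  # assumed discount for features added via the lexicon

# Toy stand-in for WordNet: each word maps to a list of candidate senses.
TOY_LEXICON = {
    "bank": [
        {"gloss": "institution that accepts money deposits and makes loans",
         "synonyms": ["depository"], "hypernyms": ["institution"]},
        {"gloss": "sloping land beside a river or lake",
         "synonyms": ["riverside"], "hypernyms": ["slope"]},
    ],
    "money": [
        {"gloss": "medium of exchange",
         "synonyms": ["currency"], "hypernyms": ["medium"]},
    ],
}

def pick_sense(word, context):
    """Pick the sense whose gloss shares the most words with the context.

    This is a simplified Lesk-style heuristic, not the paper's method.
    """
    senses = TOY_LEXICON.get(word)
    if not senses:
        return None
    return max(senses, key=lambda s: len(set(s["gloss"].split()) & context))

def build_features(tagged_tokens):
    """Build a weighted bag of words, expanded with disambiguated semantic features."""
    context = {word for word, _ in tagged_tokens}
    features = Counter()
    for word, pos in tagged_tokens:
        weight = POS_WEIGHTS.get(pos)
        if weight is None:
            continue  # skip POS tags with no assigned weight
        features[word] += weight
        sense = pick_sense(word, context - {word})
        if sense:
            for related in sense["synonyms"] + sense["hypernyms"]:
                features[related] += weight * SEMANTIC_WEIGHT
    return features

doc = [("the", "DET"), ("bank", "NOUN"), ("approved", "VERB"),
       ("money", "NOUN"), ("loans", "NOUN")]
features = build_features(doc)
```

In this toy document the words "money" and "loans" overlap with the financial gloss of "bank", so only that sense's synonyms and hypernyms are added, and the river-related features are left out. The resulting `features` mapping could then be passed to any standard classifier.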

