Title: A comprehensive analysis of using semantic information in text categorization
Authors: Kerem Çelik, T. Gungor
Venue: 2013 IEEE INISTA
Publication date: 2013-06-19
DOI: 10.1109/INISTA.2013.6577651
URL: https://doi.org/10.1109/INISTA.2013.6577651
Citations: 16
Abstract
Traditional text categorization methods deal only with the content of the documents and represent them using statistics-based metrics. A machine learning approach then uses this representation to determine the document class. This picture leaves out the meaning of the document. To bring meaning into the text categorization process, we start with part-of-speech (POS) tagging. As expected, the part-of-speech tags in a document do not all contribute the same amount of information to its meaning. In addition to the POS information, we use WordNet to add semantic features such as synonyms, hypernyms, hyponyms, meronyms and topics to the classification process. Using WordNet's semantic features introduces ambiguity, and not all semantic features are actually related to the document content. To overcome this problem, we introduce a new method to eliminate the ambiguity. We apply various combinations of POS, WordNet and word sense disambiguation, and the results show that using semantic features performs better than the traditional, context-based methods.
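The pipeline outlined in the abstract (POS-based weighting, WordNet-style feature expansion, then disambiguation) can be illustrated with a minimal sketch. It is not the authors' implementation: the POS weights, the tiny `TOY_LEXICON` standing in for WordNet, the `expansion_weight` parameter, and the function names are all hypothetical, and the disambiguation step is a simplified Lesk-style gloss overlap swapped in for the paper's own method, which the abstract does not describe.

```python
from collections import Counter

# Hypothetical POS weights; the abstract only says tags contribute unequally.
POS_WEIGHTS = {"NOUN": 1.0, "VERB": 0.6, "ADJ": 0.4}

# Hypothetical toy lexicon standing in for WordNet: word -> list of senses.
TOY_LEXICON = {
    "bank": [
        {"gloss": {"money", "deposit", "loan", "account"},
         "synonyms": ["depository"], "hypernyms": ["financial_institution"]},
        {"gloss": {"river", "water", "slope", "shore"},
         "synonyms": ["riverbank"], "hypernyms": ["slope"]},
    ],
    "loan": [
        {"gloss": {"money", "bank", "borrow", "interest"},
         "synonyms": [], "hypernyms": ["debt"]},
    ],
}


def pick_sense(word, context):
    """Simplified Lesk: pick the sense whose gloss overlaps the context most."""
    senses = TOY_LEXICON.get(word, [])
    if not senses:
        return None
    return max(senses, key=lambda sense: len(sense["gloss"] & context))


def semantic_features(tagged_tokens, expansion_weight=0.5):
    """Build a weighted bag of features from (word, POS tag) pairs."""
    context = {word for word, _ in tagged_tokens}
    features = Counter()
    for word, tag in tagged_tokens:
        weight = POS_WEIGHTS.get(tag, 0.0)
        if weight == 0.0:
            continue  # skip tags assumed to carry no useful meaning
        features[word] += weight
        sense = pick_sense(word, context - {word})
        if sense is None:
            continue  # word not in the lexicon: keep only the surface form
        # Add related terms from the chosen sense only, at a reduced weight.
        for related in sense["synonyms"] + sense["hypernyms"]:
            features[related] += weight * expansion_weight
    return features


doc = [("the", "DET"), ("bank", "NOUN"), ("approved", "VERB"),
       ("the", "DET"), ("loan", "NOUN")]
features = semantic_features(doc)
print(dict(features))
```

In this toy document the word "loan" in the context selects the financial sense of "bank", so "financial_institution" is added as a feature while "riverbank" is not. The resulting weighted feature bag could then be passed to any standard classifier.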