Dual lexical chaining for context based text classification

S. Chakraverty, Bhawna Juneja, U. Pandey, Ashima Arora
DOI: 10.1109/ICACEA.2015.7164744
Published in: 2015 International Conference on Advances in Computer Engineering and Applications (2015-03-19)
Citations: 2

Abstract

Text classification enhances the accessibility and systematic organization of the vast reserves of data populating the World Wide Web. Despite great strides in the field, context-driven text classification offers fresh opportunities to develop more efficient context-oriented techniques with refined metrics. In this paper, we propose a novel approach to categorizing text documents using a dual lexical chaining technique. The algorithm first prepares a cohesive category-keyword matrix by feeding category names into WordNet and the Wikipedia ontology, extracting lexically and semantically related keywords from them, and then expanding the keywords through a keyword-enrichment process. Next, WordNet is consulted again to find the degree of lexical cohesiveness between the tokens of a document. Terms that are strongly related are woven together into two separate lexical chains, one for their noun senses and another for their verb senses, which together represent the document's feature set. This segregation enables a better expression of word cohesiveness, as concept terms and action terms are treated distinctly. We propose a new metric to calculate the strength of a lexical chain. It includes a statistical part given by Term Frequency-Inverse Document Frequency-Relative Category Frequency (TF-IDF-RCF), which is itself an improvement upon the conventional TF-IDF measure. A chain's contextual strength is determined by the degree of its lexical matching with the category-keyword matrix as well as by the relative positions of its constituent terms. Results indicate the efficacy of our approach: we obtained an average accuracy of 90% on six categories derived from the 20 Newsgroups and Reuters corpora. Lexical chaining has been applied successfully to text summarization; our results indicate a positive direction towards its usefulness for text classification.
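The dual-chain construction described in the abstract can be sketched as follows. The toy `SENSES` lexicon below is a stand-in for WordNet's sense lookup and is invented purely for illustration; in the paper, WordNet also supplies the lexical relations used to link strongly related terms into each chain.

```python
# Hypothetical stand-in for a WordNet part-of-speech lookup: maps a word
# to the senses ("n" for noun, "v" for verb) it can take. In the actual
# algorithm these senses come from WordNet, not a hand-made table.
SENSES = {
    "classification": {"n"}, "document": {"n"}, "categorize": {"v"},
    "chain": {"n", "v"}, "extract": {"v"}, "keyword": {"n"},
}

def dual_chains(tokens):
    """Split document tokens into a noun-sense chain and a verb-sense chain.

    A word carrying both senses (e.g. "chain") contributes to both chains,
    mirroring the paper's separate treatment of concept terms and action terms.
    """
    noun_chain = [t for t in tokens if "n" in SENSES.get(t, ())]
    verb_chain = [t for t in tokens if "v" in SENSES.get(t, ())]
    return noun_chain, verb_chain
```

The two chains together then serve as the document's feature set, with each chain scored separately.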
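The TF-IDF-RCF weighting can be illustrated with a minimal sketch. The abstract does not define the RCF factor, so the formula below assumes one plausible reading: the share of a term's corpus-wide occurrences that fall inside the target category, so terms concentrated in one category are boosted. The paper's exact definition may differ.

```python
import math

def tf_idf_rcf(tf, docs_total, docs_with_term,
               term_count_in_category, term_count_overall):
    """Sketch of a TF-IDF-RCF weight for one term in one category.

    idf uses a standard smoothed log ratio; rcf (an assumption here) is the
    fraction of the term's overall occurrences that land in the category.
    """
    idf = math.log(docs_total / (1 + docs_with_term))
    rcf = term_count_in_category / term_count_overall
    return tf * idf * rcf
```

With this reading, a term that is frequent overall but spread evenly across categories gets a low RCF, while a category-specific term keeps its full TF-IDF weight.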
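The contextual-strength idea (lexical matching against the category-keyword matrix combined with the relative positions of chain terms) can be sketched as below. The specific combination, match ratio divided by average positional gap, is an illustrative assumption, not the paper's metric.

```python
def chain_strength(chain_positions, category_keywords):
    """Score one lexical chain against one category.

    chain_positions: dict mapping each chain term to its token position
    in the document. A chain scores higher when more of its terms match
    the category's keywords and when those terms sit close together.
    The exact combination here is a hypothetical simplification.
    """
    terms = list(chain_positions)
    matches = sum(1 for t in terms if t in category_keywords)
    match_ratio = matches / len(terms)
    positions = sorted(chain_positions.values())
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    avg_gap = sum(gaps) / len(gaps) if gaps else 1.0
    return match_ratio / max(avg_gap, 1.0)
```

Classification would then assign the document to the category whose keyword set yields the highest total strength over both the noun and verb chains.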