Zhen Zhang, Mengqiu Liu, Xiyuan Jia, Gongxun Miao, Xin Wang, Hao Ni, Guohua Wu
Title: Improving text classification via computing category correlation matrix from text graph
DOI: 10.1016/j.csl.2024.101688
Journal: Computer Speech and Language (Impact Factor 3.1; JCR Q2, Computer Science, Artificial Intelligence)
Publication date: 2024-07-09
Publication type: Journal Article
URL: https://www.sciencedirect.com/science/article/pii/S0885230824000718
Citations: 0
Abstract
In text classification tasks, models have shown remarkable accuracy across various datasets. However, confusion often arises when certain categories within a dataset are too similar, causing some samples to be misclassified. This paper proposes an improved method for this problem: a three-layer text graph is built for the corpus and used to calculate the Category Correlation Matrix (CCM). Additionally, this paper introduces category-adaptive contrastive learning for the text embeddings produced by the encoder, enhancing the model's ability to distinguish between samples in easily confused categories. Soft labels are generated from the CCM to guide the classifier, preventing the model from becoming overconfident as it can when trained on one-hot vectors. The efficacy of this approach was demonstrated through experimental evaluations on three text encoders and six different datasets.
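The abstract does not give the formula used to turn the CCM into soft labels, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes the CCM is a non-negative k × k matrix, that its rows are normalised into distributions, and that a hypothetical mixing weight `alpha` controls how much mass stays on the true class. The function names `soft_labels` and `soft_cross_entropy` are illustrative.

```python
import numpy as np


def soft_labels(labels, ccm, alpha=0.9):
    """Blend one-hot targets with rows of a category correlation matrix.

    labels: integer class indices, shape (n,)
    ccm:    non-negative category correlations, shape (k, k) (assumed form)
    alpha:  weight kept on the one-hot target (hypothetical hyperparameter)
    """
    labels = np.asarray(labels)
    ccm = np.asarray(ccm, dtype=float)
    k = ccm.shape[0]
    one_hot = np.eye(k)[labels]
    # Row-normalise so each category's correlations form a distribution.
    corr = ccm / ccm.sum(axis=1, keepdims=True)
    # Mass moved off the true class goes mostly to correlated categories.
    return alpha * one_hot + (1.0 - alpha) * corr[labels]


def soft_cross_entropy(logits, targets):
    """Mean cross-entropy between soft targets and softmax(logits)."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-(targets * log_probs).sum(axis=1).mean())
```

With a toy matrix in which categories 0 and 1 are correlated and category 2 is isolated, a sample of class 0 gets a small amount of target mass on class 1 and none on class 2, while a sample of class 2 keeps an effectively one-hot target. Unlike uniform label smoothing, the redistributed mass follows the category correlations.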
About the journal:
Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language.
The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.