Weakly supervised text classification method based on transformer

Ling Gan, Aijun Yi
International Conference on Mechatronics Engineering and Artificial Intelligence
Published: 2023-02-28
DOI: 10.1117/12.2672391 (https://doi.org/10.1117/12.2672391)

Abstract

Seed word-driven methods are the dominant approach to weakly supervised text classification (WTC). Existing seed word-driven methods update the seed words using metrics such as term frequency (TF), inverse document frequency (IDF), and their combinations. These methods assign the same weight to all metrics, which leads to common or poorly discriminative words being selected as seed words. In addition, most of the text classifiers used in these studies have difficulty capturing correlations and global information across the text. To address these problems, a Transformer is first used as the text classifier: its multi-head self-attention mechanism captures long-range dependencies while computing in parallel and fully learns the global semantic information of the input text. An improved TF-IDF method is then proposed that increases the weight of IDF so that common words that affect classification can be filtered out. Experimental results on the 20News and NYT datasets show improvements.
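The abstract does not give the exact form of the improved TF-IDF score, so the following is only a minimal sketch of one way to weight IDF more heavily when ranking seed-word candidates. The function name `rank_seed_candidates`, the exponent `alpha`, and the toy corpus are all assumptions made for illustration; they are not taken from the paper.

```python
import math
from collections import Counter

def rank_seed_candidates(class_docs, all_docs, alpha=2.0, top_k=3):
    """Rank words for one class by TF * IDF**alpha.

    alpha is a hypothetical knob: alpha=0 is plain TF, alpha=1 is plain
    TF-IDF, and alpha>1 gives IDF more influence than TF.
    class_docs must be a subset of all_docs (lists of tokens).
    """
    n_docs = len(all_docs)
    # Document frequency: in how many corpus documents each word appears.
    df = Counter()
    for doc in all_docs:
        df.update(set(doc))
    # Term frequency inside the documents currently assigned to the class.
    tf = Counter()
    for doc in class_docs:
        tf.update(doc)
    total = sum(tf.values())
    scores = {}
    for word, count in tf.items():
        idf = math.log(n_docs / df[word])  # 0 for words found in every document
        scores[word] = (count / total) * idf ** alpha
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]

# Toy corpus (invented): two "sports" and two "politics" documents.
sports = [["the", "team", "won", "the", "match"],
          ["the", "coach", "praised", "the", "team"]]
politics = [["the", "senate", "passed", "the", "bill"],
            ["the", "vote", "on", "the", "bill"]]
corpus = sports + politics

tf_only = rank_seed_candidates(sports, corpus, alpha=0, top_k=1)
idf_heavy = rank_seed_candidates(sports, corpus, alpha=2.0, top_k=3)
```

With `alpha=0` the ranking reduces to raw term frequency, so a word that occurs in every document can still rank first. Raising `alpha` drives such words toward a score of zero, which is the filtering effect the abstract describes.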