WC-SBERT: Zero-Shot Topic Classification Using SBERT and Light Self-Training on Wikipedia Categories

Impact factor: 7.2 | Zone 4, Computer Science | JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Te-Yu Chi, Jyh-Shing Roger Jang
DOI: 10.1145/3678183 (https://doi.org/10.1145/3678183)
Journal: ACM Transactions on Intelligent Systems and Technology
Publication date: 2024-07-18
Publication type: Journal Article
Citations: 0

Abstract

In natural language processing (NLP), zero-shot topic classification requires a model to understand the contextual meaning of texts in a downstream task without training on the corresponding labeled texts, a capability that is highly desirable for many applications [2]. In this paper, we propose a novel approach for constructing a zero-shot, task-specific model called WC-SBERT that achieves satisfactory performance. The approach is highly efficient because its light self-training requires only the target labels (the target class names of the downstream task), unlike other research that trains on both the target labels and the unlabeled texts. Specifically, in the pre-training stage, WC-SBERT uses contrastive learning with the multiple negative ranking loss [9] to build a pre-trained model from the similarity between Wiki categories. In the self-training stage, an online contrastive loss reduces the distance between a target label and the Wiki categories of Wiki pages similar to that label. Experimental results indicate that, compared with existing self-training models, WC-SBERT achieves rapid inference on approximately 6.45 million Wiki text entries by using pre-stored Wikipedia text embeddings, reducing inference time per sample by a factor of 2,746 to 16,746. During the fine-tuning step, the time required per sample is reduced by a factor of 23 to 67. Overall, total training time is reduced by up to 27.5 times across datasets. Most importantly, our model achieves state-of-the-art (SOTA) accuracy on two of the three datasets commonly used to evaluate zero-shot classification: AG News (0.84) and Yahoo! Answers (0.64). The code for WC-SBERT is publicly available on GitHub¹, and the dataset can be accessed on Hugging Face².
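
As a rough illustration of two ideas named in the abstract, the sketch below computes an in-batch-negatives ranking loss (the general form of a multiple negative ranking loss) and classifies pre-stored text embeddings by cosine similarity to label embeddings. It is not the authors' implementation: the function names, the toy 3-dimensional vectors, the label names, and the `scale=20.0` value are all assumptions made for illustration, and the online contrastive loss used in the self-training stage is not shown.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """In-batch-negatives loss (hypothetical helper, assumed scale).

    Row i of `positives` is the match for row i of `anchors`; every other
    row in the batch is treated as a negative for that anchor.
    """
    sims = scale * l2_normalize(anchors) @ l2_normalize(positives).T
    sims = sims - sims.max(axis=1, keepdims=True)  # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    # Cross-entropy where the correct "class" for row i is column i.
    return float(-np.mean(np.diag(log_probs)))


def classify(text_embeddings, label_embeddings):
    """Assign each cached text embedding to its most similar label index."""
    sims = l2_normalize(text_embeddings) @ l2_normalize(label_embeddings).T
    return sims.argmax(axis=1)


# Toy vectors standing in for real sentence embeddings (assumed values).
label_embeddings = np.array([[1.0, 0.0, 0.0],   # hypothetical label "sports"
                             [0.0, 1.0, 0.0]])  # hypothetical label "business"
cached_text_embeddings = np.array([[0.9, 0.1, 0.0],
                                   [0.2, 0.8, 0.1],
                                   [0.7, 0.3, 0.2]])

# Inference needs only a matrix product against the cached embeddings.
predictions = classify(cached_text_embeddings, label_embeddings)

# The loss is small when pairs line up and large when they are swapped.
aligned = multiple_negatives_ranking_loss(label_embeddings, label_embeddings)
swapped = multiple_negatives_ranking_loss(label_embeddings, label_embeddings[::-1])
```

Because the text embeddings are computed once and stored, changing the label set in this sketch only requires embedding the new labels and repeating the matrix product, which is the kind of saving the abstract attributes to pre-stored Wikipedia text embeddings.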
Source journal
ACM Transactions on Intelligent Systems and Technology
Subject categories: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE; COMPUTER SCIENCE, INFORMATION SYSTEMS
CiteScore: 9.30
Self-citation rate: 2.00%
Articles published: 131
About the journal: ACM Transactions on Intelligent Systems and Technology is a scholarly journal that publishes the highest quality papers on intelligent systems, applicable algorithms and technology with a multi-disciplinary perspective. An intelligent system is one that uses artificial intelligence (AI) techniques to offer important services (e.g., as a component of a larger system) to allow integrated systems to perceive, reason, learn, and act intelligently in the real world. ACM TIST is published quarterly (six issues a year). Each issue has 8-11 regular papers, with around 20 published journal pages or 10,000 words per paper. Additional references, proofs, graphs or detailed experiment results can be submitted as a separate appendix, while excessively lengthy papers will be rejected automatically. Authors can include online-only appendices for additional content of their published papers and are encouraged to share their code and/or data with other readers.