An Automatic Labeling Method for Subword-Phrase Recognition in Effective Text Classification

IF 3.3 · CAS Tier 4, Computer Science · JCR Q2, COMPUTER SCIENCE, INFORMATION SYSTEMS
Yusuke Kimura, Takahiro Komamizu, Kenji Hatano
DOI: 10.31449/inf.v47i3.4742 · Published: 2023-08-29 · Journal: Informatica
Citations: 0

Abstract

Text classification methods using deep learning, which are trained on tremendous amounts of text, have achieved superior performance to traditional methods. Building on this success, multi-task learning (MTL for short) has become a promising approach for text classification; for instance, one multi-task learning approach employs named entity recognition as an auxiliary task for text classification and shows that the auxiliary task helps the classification model achieve higher performance. Existing MTL-based text classification methods depend on auxiliary tasks with supervised labels, and obtaining such supervision signals incurs additional human and financial costs beyond those of the main text classification task. To reduce these costs, this paper proposes a multi-task learning-based text classification framework that lowers the cost of supervised label creation by automatically labeling phrases in texts for the auxiliary recognition task. The basic idea behind the proposed framework is to utilize phrasal expressions consisting of subwords (called subword-phrases), reflecting the fact that recent pre-trained neural language models such as BERT are built on subword-based tokenization to avoid missing out-of-vocabulary words. To the best of our knowledge, there has been no text classification approach built on subword-phrases, because subwords only sometimes express a coherent set of meanings. The proposed framework is novel in adding subword-phrase recognition as an auxiliary task and in utilizing subword-phrases for text classification. It extracts subword-phrases in an unsupervised manner, specifically with a statistical approach. To construct labels for an effective subword-phrase recognition task, the extracted subword-phrases are classified by document class so that subword-phrases dedicated to particular classes become distinguishable.
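The unsupervised, statistics-based extraction step described above can be sketched as follows. This is a minimal illustration only: it assumes a pointwise-mutual-information (PMI) score over adjacent subword tokens, which is one common statistical phrase-extraction criterion, not necessarily the exact procedure in the paper; the function name, thresholds, and toy data are all hypothetical.

```python
from collections import Counter
from math import log

def extract_subword_phrases(token_seqs, min_count=2, pmi_threshold=1.0):
    """Score adjacent subword pairs by pointwise mutual information (PMI)
    and keep frequent, high-scoring pairs as candidate subword-phrases."""
    unigrams, bigrams = Counter(), Counter()
    for seq in token_seqs:
        unigrams.update(seq)
        bigrams.update(zip(seq, seq[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    phrases = {}
    for (a, b), c in bigrams.items():
        if c < min_count:
            continue  # ignore rare pairs: their PMI estimates are unreliable
        pmi = log((c / n_bi) / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))
        if pmi >= pmi_threshold:
            phrases[(a, b)] = pmi
    return phrases

# Toy subword sequences ("##" marks a continuation subword, BERT-style).
docs = [
    ["deep", "##learn", "model", "text"],
    ["deep", "##learn", "text", "class"],
    ["text", "class", "deep", "##learn"],
]
print(sorted(extract_subword_phrases(docs)))
# → [('deep', '##learn'), ('text', 'class')]
```

Because the score relies only on co-occurrence counts, no supervised labels are needed, which is the point of the framework's cost reduction.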
An experimental evaluation on five popular text classification datasets demonstrates the effectiveness of involving subword-phrase recognition as an auxiliary task. It also shows results comparable with the state-of-the-art method, and a comparison of various labeling schemes offers insights into labeling subword-phrases that are common to several document classes.
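The class-aware labeling idea in the abstract, assigning extracted subword-phrases to document classes so that class-dedicated phrases are distinguishable, can be illustrated with a minimal sketch. The purity threshold, the shared "COMMON" label for phrases spread over several classes, and all names here are assumptions for illustration, not the paper's actual labeling scheme.

```python
from collections import Counter, defaultdict

def label_phrases(doc_phrases, doc_classes, purity=0.8):
    """Give each subword-phrase a pseudo-label: its dominant document class
    when occurrences are concentrated there, else a shared COMMON label."""
    stats = defaultdict(Counter)
    for phrases, cls in zip(doc_phrases, doc_classes):
        for p in phrases:
            stats[p][cls] += 1
    labels = {}
    for p, counts in stats.items():
        cls, top = counts.most_common(1)[0]
        # A phrase is "dedicated" to a class if that class dominates its counts.
        labels[p] = cls if top / sum(counts.values()) >= purity else "COMMON"
    return labels

# Phrases extracted per document, with each document's class label.
doc_phrases = [
    [("deep", "##learn")],
    [("deep", "##learn"), ("text", "class")],
    [("text", "class")],
]
doc_classes = ["AI", "AI", "NLP"]
print(label_phrases(doc_phrases, doc_classes))
# → {('deep', '##learn'): 'AI', ('text', 'class'): 'COMMON'}
```

The resulting pseudo-labels can then supervise the auxiliary recognition task, so no manual annotation is required; the comparison of labeling schemes in the paper concerns exactly how such shared phrases should be labeled.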
Source journal: Informatica (Engineering/Technology – Computer Science, Information Systems)

CiteScore: 5.90 · Self-citation rate: 6.90% · Articles per year: 19 · Review time: 12 months

Journal description: The quarterly journal Informatica provides an international forum for high-quality original research and publishes papers on mathematical simulation and optimization, recognition and control, programming theory and systems, automation systems and elements. Informatica provides a multidisciplinary forum for scientists and engineers involved in research and design, including experts who implement and manage information systems applications.