An Integrated Classification Model for Massive Short Texts with Few Words

Xuetao Tang, Yi Zhu, Xuegang Hu, Peipei Li
{"title":"An Integrated Classification Model for Massive Short Texts with Few Words","authors":"Xuetao Tang, Yi Zhu, Xuegang Hu, Peipei Li","doi":"10.1145/3366715.3366734","DOIUrl":null,"url":null,"abstract":"The excellent performance of short texts classification has emerged in the past few years. However, massive short texts with few words like invoice data are different with traditional short texts like tweets in its no contextual and less semantic information, which hinders the application of conventional classification algorithms. To address these problems, we propose an integrated classification model for massive short texts with few words. More specifically, the word embedding model is introduced to train the word vectors of massive short texts with few words to form the feature space, and then the vector representation of each instance in texts is trained based on sentence embedding. With this integrated model, higher level representations are learned from massive short texts with few words. It can boost the performance of the base subsequent classifiers such as K-Nearest Neighbor. 
Extensive experiments conducted on dataset including 16 million real data demonstrate the superior classification performance of our proposed model compared with all competing state-of-the-art models.","PeriodicalId":425980,"journal":{"name":"Proceedings of the 2019 International Conference on Robotics Systems and Vehicle Technology - RSVT '19","volume":"15 44","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2019 International Conference on Robotics Systems and Vehicle Technology - RSVT '19","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3366715.3366734","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

Short text classification has achieved excellent performance in recent years. However, massive short texts with few words, such as invoice data, differ from traditional short texts such as tweets: they lack context and carry less semantic information, which hinders the application of conventional classification algorithms. To address these problems, we propose an integrated classification model for massive short texts with few words. More specifically, a word embedding model is introduced to train word vectors over the massive short texts with few words, forming the feature space; the vector representation of each instance is then trained based on sentence embedding. With this integrated model, higher-level representations are learned from massive short texts with few words, which boosts the performance of subsequent base classifiers such as K-Nearest Neighbor. Extensive experiments on a dataset of 16 million real records demonstrate the superior classification performance of our proposed model compared with competing state-of-the-art models.
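The pipeline the abstract describes — train word embeddings over the corpus, derive a sentence-level vector for each instance, then classify with a base learner such as K-Nearest Neighbor — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy vocabulary, the random stand-in word vectors, the averaging-based sentence embedding, and the cosine-similarity KNN are all assumptions made for demonstration.

```python
import math
import random

random.seed(0)

# Hypothetical toy vocabulary standing in for invoice-like short texts;
# the paper trains real word vectors over a 16-million-record corpus.
VOCAB = ["office", "chair", "desk", "apple", "banana", "grape"]
DIM = 8

# Stand-in word embeddings: random vectors here. In the paper, a word
# embedding model learns these vectors to form the feature space.
word_vec = {w: [random.gauss(0.0, 1.0) for _ in range(DIM)] for w in VOCAB}

def sentence_vec(text):
    """Sentence-embedding sketch: average the word vectors of the text.
    (The paper trains sentence embeddings; averaging is a common stand-in.)"""
    vecs = [word_vec[w] for w in text.split() if w in word_vec]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def knn_predict(query_vec, train, k=3):
    """Classify by majority vote among the k most similar training vectors."""
    ranked = sorted(train, key=lambda item: cosine(query_vec, item[0]),
                    reverse=True)
    top_labels = [label for _, label in ranked[:k]]
    return max(set(top_labels), key=top_labels.count)

# Labeled training instances (text, label) mapped into the embedding space.
train = [(sentence_vec(t), y) for t, y in [
    ("office chair", "furniture"),
    ("desk", "furniture"),
    ("apple grape", "fruit"),
    ("banana", "fruit"),
]]

print(knn_predict(sentence_vec("office chair"), train, k=1))  # "furniture"
```

In the paper, learned word- and sentence-level representations replace these random stand-ins; KNN is just one example of the subsequent base classifiers the integrated model is said to boost.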