{"title":"An Integrated Classification Model for Massive Short Texts with Few Words","authors":"Xuetao Tang, Yi Zhu, Xuegang Hu, Peipei Li","doi":"10.1145/3366715.3366734","DOIUrl":null,"url":null,"abstract":"Short-text classification has achieved excellent performance in recent years. However, massive short texts with few words, such as invoice data, differ from traditional short texts such as tweets in that they carry no context and little semantic information, which hinders the application of conventional classification algorithms. To address this problem, we propose an integrated classification model for massive short texts with few words. More specifically, a word-embedding model is introduced to train word vectors for the massive short texts with few words and form the feature space, and the vector representation of each text instance is then trained based on sentence embedding. With this integrated model, higher-level representations are learned from massive short texts with few words, which boosts the performance of subsequent base classifiers such as K-Nearest Neighbor. Extensive experiments conducted on a dataset of 16 million real records demonstrate the superior classification performance of our proposed model compared with competing state-of-the-art models.","PeriodicalId":425980,"journal":{"name":"Proceedings of the 2019 International Conference on Robotics Systems and Vehicle Technology - RSVT '19","volume":"15 44","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2019 International Conference on Robotics Systems and Vehicle Technology - RSVT '19","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3366715.3366734","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1
Abstract
Short-text classification has achieved excellent performance in recent years. However, massive short texts with few words, such as invoice data, differ from traditional short texts such as tweets in that they carry no context and little semantic information, which hinders the application of conventional classification algorithms. To address this problem, we propose an integrated classification model for massive short texts with few words. More specifically, a word-embedding model is introduced to train word vectors for the massive short texts with few words and form the feature space, and the vector representation of each text instance is then trained based on sentence embedding. With this integrated model, higher-level representations are learned from massive short texts with few words, which boosts the performance of subsequent base classifiers such as K-Nearest Neighbor. Extensive experiments conducted on a dataset of 16 million real records demonstrate the superior classification performance of our proposed model compared with competing state-of-the-art models.
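The pipeline the abstract describes (word vectors → per-text vector → nearest-neighbor classifier) can be sketched as follows. This is a minimal illustration, not the paper's method: the hash-based word vectors stand in for a trained word-embedding model, simple averaging stands in for the paper's sentence-embedding step, and all texts and labels are hypothetical invoice-like examples.

```python
import hashlib
import math

DIM = 8  # toy embedding dimension (assumption; the paper does not specify one here)

def word_vector(word, dim=DIM):
    """Deterministic pseudo-embedding for a word.

    Stand-in for a learned word-embedding model (e.g. word2vec):
    real embeddings are trained on the corpus, not hashed.
    """
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return [digest[i % len(digest)] / 255.0 - 0.5 for i in range(dim)]

def sentence_vector(text, dim=DIM):
    """Average the word vectors -- a simple proxy for the paper's
    sentence-embedding step that turns a short text into one vector."""
    words = text.split()
    if not words:
        return [0.0] * dim
    vecs = [word_vector(w, dim) for w in words]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(query, labelled, k=3):
    """Classify `query` by majority vote among the k most similar
    labelled texts, compared in the sentence-vector space."""
    qv = sentence_vector(query)
    ranked = sorted(labelled,
                    key=lambda item: cosine(qv, sentence_vector(item[0])),
                    reverse=True)
    votes = [label for _, label in ranked[:k]]
    return max(set(votes), key=votes.count)
```

With a few labelled short texts, e.g. `[("office paper supplies", "stationery"), ("steel pipe fittings", "hardware")]`, `knn_classify("office paper supplies", labelled, k=1)` returns the label of the closest text in vector space; in the paper's setting the embeddings are trained on the 16-million-record corpus before the classifier runs.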