Strategies for Short Text Representation in the Word Vector Space
Marcelo Pita, G. Pappa
2018 7th Brazilian Conference on Intelligent Systems (BRACIS), 2018-10-01
DOI: 10.1109/BRACIS.2018.00053 (https://doi.org/10.1109/BRACIS.2018.00053)
Citations: 2
Abstract
Short texts are present in many computer systems. Examples include social media messages, advertisements, Q&A websites, and an increasing number of other applications. They are characterized by few context words and a large vocabulary. As a consequence, traditional short text representations, such as TF and TF-IDF, have high dimensionality and are very sparse. Research on word vectors has produced word representations that are semantically discriminative and can be algebraically composed to create vector representations for paragraphs and documents. The literature reports limitations of this approach, which motivated the alternative Paragraph Vector method. First, we investigate whether these limitations of word vector operations hold for short texts. Then, we propose a novel representation method based on the particle swarm optimization (PSO) meta-heuristic. Results in a document classification task are competitive with TF-IDF and show significant improvement over Paragraph Vector, with the advantage of a dense and compact document vector representation.
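To make the idea of algebraic composition concrete, the following is a minimal sketch of building one dense document vector from word vectors by summing or averaging them. It is not the PSO-based method proposed in the paper, whose details the abstract does not give. The toy vectors, the `WORD_VECTORS` table, and the `compose` function are all hypothetical names and values chosen for illustration.

```python
# Toy 3-dimensional word vectors. The values are made up for illustration;
# real word vectors would come from a trained embedding model.
WORD_VECTORS = {
    "cheap": [0.9, 0.1, 0.0],
    "flights": [0.1, 0.8, 0.1],
    "to": [0.0, 0.0, 0.1],
    "lisbon": [0.2, 0.7, 0.6],
}


def compose(tokens, vectors, how="mean"):
    """Compose the vectors of known tokens into one dense document vector.

    Tokens missing from `vectors` are skipped. If no token is known,
    a zero vector of the embedding dimension is returned.
    """
    if how not in ("sum", "mean"):
        raise ValueError("how must be 'sum' or 'mean'")

    dim = len(next(iter(vectors.values())))
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim

    # Element-wise sum across all known word vectors.
    total = [sum(component) for component in zip(*known)]
    if how == "sum":
        return total
    return [value / len(known) for value in total]


doc_vector = compose("cheap flights to lisbon".split(), WORD_VECTORS)
print(len(doc_vector), doc_vector)
```

The resulting vector has the same dimension as the word vectors regardless of vocabulary size, in contrast to the sparse, vocabulary-sized TF and TF-IDF vectors mentioned above.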