{"title":"用于文本分类的向量空间模型的统计评价","authors":"Yash Vijay, Anurag Sengupta, K. George","doi":"10.1109/SSCI.2018.8628920","DOIUrl":null,"url":null,"abstract":"In our paper, we statistically evaluate categorisation performance of a distributed embedding technique called word2vec, and popular sparse representations, on the labelled 20-newsgroups dataset and unlabelled United States political news dataset. We deploy extensive parametric variations of vector-space models for both supervised and unsupervised topic-categorisation, relatively gauge them, and report the best results. We introduce a methodology to deploy distributed embeddings for unsupervised learning using Principal Component Analysis, which performs exceedingly well on both datasets, both by topic coherence scores, and visual interpretation of token content of topic mixtures learnt. Our motivation is primarily driven by proving that dense word embeddings can perform as good as, if not better than, traditional frequency-based vector space models. In addition, this paper demonstrates that distributed embeddings based Support Vector Machines performs best for supervised publisher categorisation on the political news dataset, whereas Term-Frequency document Frequency based Support Vector Machines outperforms supervised topic categorisation in the 20-newsgroups dataset.","PeriodicalId":235735,"journal":{"name":"2018 IEEE Symposium Series on Computational Intelligence (SSCI)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A Statistical Evaluation of Vector-space Models for Text Categorisation\",\"authors\":\"Yash Vijay, Anurag Sengupta, K. 
George\",\"doi\":\"10.1109/SSCI.2018.8628920\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In our paper, we statistically evaluate categorisation performance of a distributed embedding technique called word2vec, and popular sparse representations, on the labelled 20-newsgroups dataset and unlabelled United States political news dataset. We deploy extensive parametric variations of vector-space models for both supervised and unsupervised topic-categorisation, relatively gauge them, and report the best results. We introduce a methodology to deploy distributed embeddings for unsupervised learning using Principal Component Analysis, which performs exceedingly well on both datasets, both by topic coherence scores, and visual interpretation of token content of topic mixtures learnt. Our motivation is primarily driven by proving that dense word embeddings can perform as good as, if not better than, traditional frequency-based vector space models. In addition, this paper demonstrates that distributed embeddings based Support Vector Machines performs best for supervised publisher categorisation on the political news dataset, whereas Term-Frequency document Frequency based Support Vector Machines outperforms supervised topic categorisation in the 20-newsgroups dataset.\",\"PeriodicalId\":235735,\"journal\":{\"name\":\"2018 IEEE Symposium Series on Computational Intelligence (SSCI)\",\"volume\":\"41 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 IEEE Symposium Series on Computational Intelligence 
(SSCI)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SSCI.2018.8628920\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 IEEE Symposium Series on Computational Intelligence (SSCI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SSCI.2018.8628920","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
A Statistical Evaluation of Vector-space Models for Text Categorisation
In our paper, we statistically evaluate the categorisation performance of word2vec, a distributed embedding technique, and of popular sparse representations on the labelled 20-newsgroups dataset and an unlabelled United States political news dataset. We deploy extensive parametric variations of vector-space models for both supervised and unsupervised topic categorisation, compare them against one another, and report the best results. We introduce a methodology for deploying distributed embeddings in unsupervised learning using Principal Component Analysis; it performs exceedingly well on both datasets, as measured both by topic coherence scores and by visual inspection of the token content of the learnt topic mixtures. Our primary motivation is to show that dense word embeddings can perform as well as, if not better than, traditional frequency-based vector-space models. In addition, this paper demonstrates that a Support Vector Machine based on distributed embeddings performs best for supervised publisher categorisation on the political news dataset, whereas a Support Vector Machine based on Term Frequency-Inverse Document Frequency (TF-IDF) features performs best for supervised topic categorisation on the 20-newsgroups dataset.
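The supervised pipeline described above, sparse frequency-based features fed to a Support Vector Machine, can be sketched as follows. This is a minimal illustration only: the toy corpus, the parameter defaults, and the choice of scikit-learn's `TfidfVectorizer` and `LinearSVC` are assumptions, not the paper's actual experimental setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for the 20-newsgroups documents:
# class 0 = space, class 1 = politics.
docs = [
    "the rocket launch was delayed by nasa engineers",
    "astronauts aboard the shuttle orbit the earth",
    "nasa plans a new moon mission with a heavy rocket",
    "the senate passed the budget bill after a long debate",
    "voters elected a new governor in the state election",
    "congress debated the foreign policy bill this week",
]
labels = [0, 0, 0, 1, 1, 1]

# TF-IDF features piped into a linear SVM, a common sparse baseline.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(docs, labels)

pred = model.predict(["nasa delayed the shuttle launch"])
```

For the dense variant evaluated in the paper, the TF-IDF step would be replaced by a document vector built from word2vec embeddings (for instance, an average of the per-token vectors) before the SVM is fit.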
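The unsupervised step, projecting document vectors with Principal Component Analysis and reading off each component's dominant tokens as a topic, might look like the following sketch. The corpus and component count are invented for illustration, and dense vectors are derived here from TF-IDF purely so the example is self-contained; the paper applies PCA to word2vec-based document vectors.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical unlabelled corpus for illustration.
docs = [
    "nasa launched a rocket carrying a satellite into orbit",
    "the shuttle crew completed a spacewalk outside the station",
    "parliament debated the new tax bill before the vote",
    "the election results gave the party a narrow majority",
]

# Dense document vectors; the paper uses word2vec embeddings instead.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs).toarray()

# Project onto the leading principal components; each component is then
# interpreted as a topic via its highest-loading vocabulary terms.
pca = PCA(n_components=2)
Z = pca.fit_transform(X)

terms = vectorizer.get_feature_names_out()
top_terms = [
    [terms[i] for i in np.argsort(np.abs(comp))[::-1][:5]]
    for comp in pca.components_
]
```

Inspecting `top_terms` per component is one way to perform the kind of visual interpretation of topic mixtures the abstract mentions.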