CT-VAE: An Unsupervised Noise Filtering Algorithm for Weibo Topic Datasets
Yingying He, Zheng Wang, Wei Zhao, Zhensu Sun
2021 IEEE 4th International Conference on Electronics and Communication Engineering (ICECE), 2021-12-17
DOI: 10.1109/ICECE54449.2021.9674266 (https://doi.org/10.1109/ICECE54449.2021.9674266)
Citation count: 0
Abstract
Weibo Topic has become a popular data source for text analysis, and its quality directly affects the results of research and applications built on it. However, as a crowdsourced dataset, Weibo Topic suffers from noise introduced by malicious bloggers: to attract more views, these bloggers tag their posts with unrelated topic tags, which adds significant noise to the dataset. To filter this noise, researchers have proposed automated filtering methods based on supervised or semi-supervised learning. These methods, however, require human-annotated data to train the models, which significantly raises the cost of building such filtering systems. In this work, we propose CT-VAE, an unsupervised filtering method based on the Variational Auto-Encoder (VAE). CT-VAE trains multiple VAE models on different topics to identify noise. Our experiments show that CT-VAE can achieve better results than supervised learning methods when more unlabeled data are collected.
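For readers unfamiliar with the model family, the standard VAE training objective (the evidence lower bound) is written out below with a topic subscript. The abstract does not say how CT-VAE turns its per-topic models into a noise decision, so the per-topic notation, the scoring rule, and the threshold $\tau_t$ are illustrative assumptions rather than the authors' formulation.

```latex
% Standard VAE evidence lower bound for one post x under the model of topic t.
% The subscript t on the parameters is assumed notation, not taken from the paper.
\mathcal{L}_t(x) = \mathbb{E}_{q_{\phi_t}(z \mid x)}\!\left[\log p_{\theta_t}(x \mid z)\right]
  - D_{\mathrm{KL}}\!\left(q_{\phi_t}(z \mid x) \,\|\, p(z)\right)

% Hypothetical filtering rule: flag x as noise for topic t when its bound is low.
\mathrm{noise}_t(x) = \mathbb{1}\!\left[\mathcal{L}_t(x) < \tau_t\right]
```

Under this reading, a post tagged with an unrelated topic would be poorly explained by that topic's model and would receive a low bound, though the paper itself may use a different score or decision rule.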