{"title":"Investigation of the Quality of Topic Models for Noisy Data Sources","authors":"Yue Xu, Yuefeng Li, D. K. Geeganage","doi":"10.1109/WI.2018.00-48","DOIUrl":null,"url":null,"abstract":"Latent Dirichlet Allocation (LDA) has become the most stable and widely used topic model to derive topics from collections of documents where it depicts different levels of success based on diversified domains of inputs. Nevertheless, it is a vital requirement to evaluate the LDA against the quality of the input. The noise and uncertainty of the content create a negative influence on the topic model. The major contribution of this investigation is to critically evaluate the LDA based on the quality of input sources and human perception. The empirical study shows the relationship between the quality of the input and the accuracy of the output generated by LDA. Perplexity and coherence have been evaluated with three data-sets (RCV1, conference data set, tweets) which contain different level of complexities and uncertainty in their contents. Human perception in generating topics has been compared with the LDA in terms of human defined topics. Results of the analysis demonstrate a strong relationship between the quality of the input and generated topics. Thus, highly relevant topics were generated from formally written contents while noisy and messy contents lead to generate meaningless topics. A considerable gap is noticed between human defined topics and LDA generated topics. Finally, a concept-based topic modeling technique is proposed to improve the quality of topics by capturing the meaning of the content and eliminating the irrelevant and meaningless topics.","PeriodicalId":405966,"journal":{"name":"2018 IEEE/WIC/ACM International Conference on Web Intelligence (WI)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 IEEE/WIC/ACM International Conference on Web Intelligence (WI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/WI.2018.00-48","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1
Abstract
Latent Dirichlet Allocation (LDA) has become the most stable and widely used topic model for deriving topics from collections of documents, though its success varies across input domains. It is therefore essential to evaluate LDA against the quality of its input, since noise and uncertainty in the content negatively affect the topic model. The major contribution of this investigation is a critical evaluation of LDA based on the quality of the input sources and on human perception. The empirical study shows the relationship between the quality of the input and the accuracy of the output generated by LDA. Perplexity and coherence have been evaluated on three data sets (RCV1, a conference data set, and tweets) that contain different levels of complexity and uncertainty in their content. Human perception in topic generation has been compared with LDA using human-defined topics. The results demonstrate a strong relationship between the quality of the input and the generated topics: highly relevant topics were generated from formally written content, whereas noisy and messy content led to meaningless topics. A considerable gap was also observed between human-defined topics and LDA-generated topics. Finally, a concept-based topic modeling technique is proposed to improve topic quality by capturing the meaning of the content and eliminating irrelevant and meaningless topics.
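To make the evaluation setup concrete, the sketch below shows how perplexity and coherence of an LDA model can be computed with the gensim library. This is a minimal illustration, not the authors' experimental pipeline: the toy corpus, the number of topics, and the c_v coherence measure are assumptions made for the example, and the paper's data sets (RCV1, conference papers, tweets) would replace the toy documents.

```python
# Minimal sketch: train LDA and report perplexity/coherence with gensim.
# Assumptions: toy corpus, 2 topics, c_v coherence; not the paper's setup.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

# Hypothetical pre-tokenized documents standing in for a real data set.
docs = [
    ["topic", "model", "document", "corpus", "word"],
    ["noise", "tweet", "messy", "content", "word"],
    ["topic", "coherence", "perplexity", "evaluation"],
    ["tweet", "noise", "uncertainty", "content"],
]

dictionary = Dictionary(docs)                       # vocabulary mapping
corpus = [dictionary.doc2bow(d) for d in docs]      # bag-of-words vectors

lda = LdaModel(corpus=corpus, id2word=dictionary,
               num_topics=2, passes=10, random_state=42)

# Per-word likelihood bound; perplexity = 2 ** (-bound).
log_perplexity = lda.log_perplexity(corpus)
perplexity = 2 ** (-log_perplexity)

# Topic coherence (c_v) over the same tokenized texts.
coherence = CoherenceModel(model=lda, texts=docs,
                           dictionary=dictionary,
                           coherence="c_v").get_coherence()

print(f"perplexity: {perplexity:.2f}, coherence (c_v): {coherence:.3f}")
```

In this kind of setup, lower perplexity and higher coherence are generally read as better-quality topics, which is how a drop in quality on noisy inputs such as tweets would show up relative to formally written text.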