Best Setting of Model Parameters in Applying Topic Modeling on Textual Documents.

Wen Zou, Weizhong Zhao, James J. Chen, R. Perkins
{"title":"Best Setting of Model Parameters in Applying Topic Modeling on Textual Documents.","authors":"Wen Zou, Weizhong Zhao, James J. Chen, R. Perkins","doi":"10.1145/3107411.3108195","DOIUrl":null,"url":null,"abstract":"Probabilistic topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. It offers a viable approach to structure huge textual document collections into latent topic themes to aid text mining. Latent Dirichlet Allocation (LDA) is the most commonly used topic modelling method across a wide number of technical fields. However, model development can be arduous and tedious, and requires burdensome and systematic sensitivity studies in order to find the best set of model parameters. In this study, we use a heuristic approach to estimate the most appropriate number of topics. Specifically, the rate of perplexity change (RPC) as a function of numbers of topics is proposed as a suitable selector. We test the stability and effectiveness of the proposed method for three markedly different types of grounded-truth datasets: Salmonella next generation sequencing, pharmacological side effects, and textual abstracts on computational biology and bioinformatics (TCBB) from PubMed. Then we describe extensive sensitivity studies to determine best practices for generating effective topic models. To test effectiveness and validity of topic models, we constructed a ground truth data set from PubMed that contained some 40 health related themes including negative controls, and mixed it with a data set of unstructured documents. We found that obtaining the most useful model, tuned to desired sensitivity versus specificity, requires an iterative process wherein preprocessing steps, the type of topic modeling algorithm, and the algorithm's model parameters are systematically varied. Models need to be compared with both qualitative, subjective assessments and quantitative, objective assessments, and care is required that Gibbs sampling in model estimation is sufficient to assure stable solutions. With a high quality model, documents can be rank-ordered in accordance with probability of being associated with complex regulatory query string, greatly lessoning text mining work. Importantly, topic models are agnostic about how words and documents are defined, and thus our findings are extensible to topic models where samples are defined as documents, and genes, proteins or their sequences are words.","PeriodicalId":246388,"journal":{"name":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology,and Health Informatics","volume":"18 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology,and Health Informatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3107411.3108195","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

Probabilistic topic modeling is an active research field in machine learning and is mainly used as an analytical tool to structure large textual corpora for data mining. It offers a viable approach to organizing huge document collections into latent topic themes to aid text mining. Latent Dirichlet Allocation (LDA) is the most commonly used topic modeling method across a wide range of technical fields. However, model development can be arduous and tedious, requiring burdensome and systematic sensitivity studies to find the best set of model parameters. In this study, we use a heuristic approach to estimate the most appropriate number of topics. Specifically, the rate of perplexity change (RPC) as a function of the number of topics is proposed as a suitable selector. We test the stability and effectiveness of the proposed method on three markedly different types of ground-truth datasets: Salmonella next-generation sequencing, pharmacological side effects, and textual abstracts on computational biology and bioinformatics (TCBB) from PubMed. We then describe extensive sensitivity studies to determine best practices for generating effective topic models. To test the effectiveness and validity of topic models, we constructed a ground-truth data set from PubMed containing some 40 health-related themes, including negative controls, and mixed it with a data set of unstructured documents. We found that obtaining the most useful model, tuned to the desired sensitivity versus specificity, requires an iterative process in which preprocessing steps, the type of topic modeling algorithm, and the algorithm's model parameters are systematically varied. Models need to be compared with both qualitative, subjective assessments and quantitative, objective assessments, and care is required to ensure that Gibbs sampling in model estimation is sufficient to yield stable solutions. With a high-quality model, documents can be rank-ordered by their probability of being associated with a complex regulatory query string, greatly lessening text mining work. Importantly, topic models are agnostic about how words and documents are defined, so our findings are extensible to topic models in which samples are defined as documents and genes, proteins, or their sequences are words.
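A minimal sketch of the RPC idea follows. This is not the authors' code: it fits gensim LdaModel instances over an assumed grid of candidate topic counts, converts gensim's per-word log bound into a perplexity estimate (gensim itself reports perplexity as 2 raised to the negative bound), computes RPC(i) = |P_i - P_{i-1}| / (t_i - t_{i-1}) between consecutive candidates, and applies one plausible reading of the change-point rule (take the first candidate at which RPC stops decreasing). The toy corpus, the candidate grid, and the selection rule are illustrative assumptions.

# Sketch (not the paper's original code): choosing the number of topics
# via the rate of perplexity change (RPC), using gensim's LDA.
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy corpus: each document is a list of preprocessed tokens (assumption).
docs = [
    ["salmonella", "sequencing", "genome", "serotype"],
    ["drug", "side", "effect", "adverse", "reaction"],
    ["topic", "model", "lda", "gibbs", "sampling"],
    ["protein", "sequence", "alignment", "bioinformatics"],
] * 25  # repeated so the example has enough data to fit

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

candidate_topics = [2, 4, 6, 8, 10]  # assumed candidate grid
perplexities = []
for k in candidate_topics:
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=k, passes=10, random_state=0)
    # log_perplexity returns a per-word likelihood bound; gensim's own
    # logging reports the perplexity estimate as 2 ** (-bound).
    perplexities.append(np.exp2(-lda.log_perplexity(corpus)))

# RPC between consecutive candidates:
# RPC_i = |P_i - P_{i-1}| / (t_i - t_{i-1})
rpc = [abs(perplexities[i] - perplexities[i - 1]) /
       (candidate_topics[i] - candidate_topics[i - 1])
       for i in range(1, len(candidate_topics))]

# Assumed change-point rule: first candidate where RPC stops decreasing.
best_k = candidate_topics[-1]
for i in range(1, len(rpc)):
    if rpc[i] > rpc[i - 1]:
        best_k = candidate_topics[i]
        break

print("perplexities:", perplexities)
print("RPC:", rpc)
print("suggested number of topics:", best_k)

In practice the perplexity would be evaluated on held-out documents rather than the training corpus, and the candidate grid would be chosen to span the number of themes expected in the collection; both choices are left to the modeler.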