Supervised probabilistic latent semantic analysis with applications to controversy analysis of legislative bills

IF 0.8; Zone 4, Computer Science; Q4, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Intelligent Data Analysis Pub Date : 2023-11-27 DOI:10.3233/ida-227202

Eyor Alemayehu, Yi Fang


Citations: 0

Abstract

Probabilistic Latent Semantic Analysis (PLSA) is a fundamental text analysis technique that models each word in a document as a sample from a mixture of topics. PLSA is the precursor of probabilistic topic models including Latent Dirichlet Allocation (LDA). PLSA, LDA, and their numerous extensions have been successfully applied to many text mining and retrieval tasks. One important extension of LDA is supervised LDA (sLDA), which distinguishes itself from most topic models in that it is supervised. However, to the best of our knowledge, no prior work extends PLSA in the way that sLDA extends LDA, namely by jointly modeling the contents and the responses of documents. In this paper, we propose supervised PLSA (sPLSA), which can efficiently infer latent topics and their factorized response values from the contents and the responses of documents. The major challenge lies in estimating a document’s topic distribution, which is a constrained probability dictated by both the content and the response of the document. To tackle this challenge, we introduce an auxiliary variable that transforms the constrained optimization problem into an unconstrained one. This allows us to derive an efficient Expectation-Maximization (EM) algorithm for parameter estimation. Compared to sLDA, sPLSA converges much faster and requires less hyperparameter tuning, while performing similarly on topic modeling and better in response factorization. This makes sPLSA an appealing choice for latent response analysis, such as ranking latent topics by their factorized response values. We apply the proposed sPLSA model to analyze the controversy of bills from the United States Congress and demonstrate its effectiveness by identifying contentious legislative issues.
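For readers unfamiliar with the baseline, the following is a minimal sketch of EM for plain, unsupervised PLSA, in which each word count is explained by a mixture of topics. It is not the sPLSA algorithm proposed in the paper: the abstract does not give the supervised update equations or the form of the auxiliary variable, so none are attempted here. The function names `normalize`, `plsa_em`, and `log_likelihood`, and the toy count matrix, are assumptions made for illustration only.

```python
import math
import random


def normalize(row):
    """Scale a non-negative list to sum to 1 (uniform if it sums to 0)."""
    total = sum(row)
    if total > 0:
        return [x / total for x in row]
    return [1.0 / len(row)] * len(row)


def plsa_em(counts, n_topics, n_iter=50, seed=0):
    """Fit unsupervised PLSA by EM on a document-by-word count matrix.

    Returns P(z|d) as a list of per-document rows and P(w|z) as a list
    of per-topic rows.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    # Random strictly positive initialization, normalized per row.
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(n_topics)]
    for _ in range(n_iter):
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                # E-step: posterior P(z | d, w) proportional to P(z|d) P(w|z).
                post = normalize([p_z_d[d][z] * p_w_z[z][w]
                                  for z in range(n_topics)])
                # Accumulate expected counts n(d, w) * P(z | d, w).
                for z in range(n_topics):
                    expected = counts[d][w] * post[z]
                    new_z_d[d][z] += expected
                    new_w_z[z][w] += expected
        # M-step: renormalize the expected counts into probabilities.
        p_z_d = [normalize(row) for row in new_z_d]
        p_w_z = [normalize(row) for row in new_w_z]
    return p_z_d, p_w_z


def log_likelihood(counts, p_z_d, p_w_z):
    """Sum of n(d, w) * log P(w|d), with P(w|d) = sum_z P(z|d) P(w|z)."""
    total = 0.0
    for d, row in enumerate(counts):
        for w, n in enumerate(row):
            if n > 0:
                p = sum(p_z_d[d][z] * p_w_z[z][w] for z in range(len(p_w_z)))
                total += n * math.log(p)
    return total


# Toy corpus (hypothetical): 4 documents over a 4-word vocabulary.
toy_counts = [[4, 3, 0, 0], [5, 2, 0, 0], [0, 0, 3, 4], [0, 1, 2, 5]]
doc_topics, topic_words = plsa_em(toy_counts, n_topics=2, n_iter=30)
print(log_likelihood(toy_counts, doc_topics, topic_words))
```

In this unsupervised form each document's topic distribution depends only on its words. The abstract indicates that sPLSA additionally ties that distribution to the document's response, which is what makes the estimation problem constrained.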




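Source journal

Intelligent Data Analysis (Engineering & Technology; Computer Science: Artificial Intelligence)

CiteScore

2.20

Self-citation rate

5.90%

Articles published

Review time

3.3 months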

About the journal: Intelligent Data Analysis provides a forum for the examination of issues related to the research and applications of Artificial Intelligence techniques in data analysis across a variety of disciplines. These techniques include (but are not limited to): all areas of data visualization, data pre-processing (fusion, editing, transformation, filtering, sampling), data engineering, database mining techniques, tools and applications, use of domain knowledge in data analysis, big data applications, evolutionary algorithms, machine learning, neural nets, fuzzy logic, statistical pattern recognition, knowledge filtering, and post-processing. In particular, papers are preferred that discuss the development of new AI-related data analysis architectures, methodologies, and techniques, and their applications to various domains.