A topical VAEGAN-IHMM approach for automatic story segmentation.

IF 2.6; Zone 4, Engineering & Technology; JCR Q1, Mathematics

Mathematical Biosciences and Engineering Pub Date : 2024-07-16 DOI:10.3934/mbe.2024289

Jia Yu, Huiling Peng, Guoqiang Wang, Nianfeng Shi

Mathematical Biosciences and Engineering, vol. 21, no. 7, pp. 6608-6630. https://doi.org/10.3934/mbe.2024289

Citations: 0

Abstract

Feature representations with rich topic information can greatly improve the performance of story segmentation. VAEGAN offers distinct advantages in feature learning by combining a variational autoencoder (VAE) with a generative adversarial network (GAN): the VAE's probabilistic encoding and decoding mechanism captures intricate data representations, while the GAN's adversarial training enhances feature diversity and quality. To better learn topical domain representations, we used a topical classifier to supervise the training of the VAEGAN. Based on the learned features, a segmenter splits a document into shorter segments with different topics. The hidden Markov model (HMM) is a popular approach to story segmentation, in which stories are viewed as instances of topics (hidden states). The number of states has to be set manually, but it is often unknown in real scenarios. To solve this problem, we proposed an infinite HMM (IHMM) approach that places a hierarchical Dirichlet process (HDP) prior on transition matrices over a countably infinite state space, so that the number of states is inferred automatically from the data. Given a running text, a blocked Gibbs sampler labels each position with a topic state, and every position where the topic changes is taken as a story boundary. Experimental results on the TDT2 corpus demonstrated that the proposed topical VAEGAN-IHMM approach was significantly better than the traditional HMM method in story segmentation and achieved state-of-the-art performance.
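As a rough illustration of two ideas in the abstract, the sketch below draws a transition matrix from a truncated ("weak-limit") approximation of an HDP prior and reads story boundaries off a sequence of topic-state labels. It is not the authors' implementation: the weak-limit truncation, the function names `sample_hdp_transitions` and `story_boundaries`, and the hyperparameters `gamma` and `alpha` are all assumptions made for this example, and the blocked Gibbs sampler itself is not shown.

```python
import numpy as np


def sample_hdp_transitions(n_states, gamma=1.0, alpha=1.0, seed=0):
    """Draw (beta, pi) from a weak-limit approximation of an HDP prior.

    Assumption: the infinite state space is truncated to n_states, which
    is a common approximation and not necessarily what the paper uses.
    """
    rng = np.random.default_rng(seed)
    # Global state weights shared by all rows; the symmetric Dirichlet
    # stands in for the stick-breaking construction at finite truncation.
    beta = rng.dirichlet(np.full(n_states, gamma / n_states))
    # Each transition row is a Dirichlet draw centred on beta, so states
    # that are globally popular are popular destinations from every state.
    # The small constant keeps every Dirichlet parameter strictly positive.
    pi = np.vstack(
        [rng.dirichlet(alpha * beta + 1e-6) for _ in range(n_states)]
    )
    return beta, pi


def story_boundaries(states):
    """Return every index i where the topic state differs from index i - 1."""
    return [i for i in range(1, len(states)) if states[i] != states[i - 1]]


if __name__ == "__main__":
    beta, pi = sample_hdp_transitions(n_states=8)
    print("row sums:", pi.sum(axis=1))
    # Hypothetical state labels for seven consecutive text positions.
    print("boundaries:", story_boundaries([0, 0, 1, 1, 1, 2, 0]))
```

In a full system the label sequence passed to `story_boundaries` would come from the sampler's state assignments rather than a hand-written list.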


Source journal

Mathematical Biosciences and Engineering (Engineering & Technology: Mathematics, Interdisciplinary Applications)

CiteScore

3.90

Self-citation rate

7.70%

Articles published

586

Review time

>12 weeks

About the journal: Mathematical Biosciences and Engineering (MBE) is an interdisciplinary Open Access journal promoting cutting-edge research, technology transfer, and knowledge translation about complex data and information processing. MBE publishes Research articles (long and original research); Communications (short and novel research); Expository papers; Technology Transfer and Knowledge Translation reports (descriptions of new technologies and products); and Announcements and Industrial Progress and News (announcements and advertisements, including major conferences).