Topic Modelling Using VSM-LDA For Document Summarization

Luthfi Atikah, Novrindah Alvi Hasanah, Agus Zainal Arivin
{"title":"Topic Modelling Using VSM-LDA For Document Summarization","authors":"Luthfi Atikah, Novrindah Alvi Hasanah, Agus Zainal Arivin","doi":"10.31937/ti.v14i2.2854","DOIUrl":null,"url":null,"abstract":"Summarization is a process to simplify the contents of a document by eliminating elements that are considered unimportant but do not reduce the core meaning the document wants to convey. However, as is known, a document will contain more than one topic. So it is necessary to identify the topic so that the summarization process is more effective. Latent Dirichlet Allocation (LDA) is a commonly used method of identifying topics. However, when running a program on a different dataset, LDA experiences \"order effects\", that is, the resulting topic will be different if the train data sequence is changed. In the same document input, LDA will provide inconsistent topics resulting in low coherence values. Therefore, this paper proposes a topic modelling method using a combination of LDA and VSM (Vector Space Model) for automatic summarization. The proposed method can overcome order effects and identify document topics that are calculated based on the TF-IDF weight on VSM generated by LDA. The results of the proposed topic modeling method on the 1300 Twitter data resulted in the highest coherence value reaching 0.72. 
The summary results obtained Rouge 1 is 0.78, Rouge 2 is 0.67 dan Rouge L is 0.80.","PeriodicalId":347196,"journal":{"name":"Ultimatics : Jurnal Teknik Informatika","volume":"20 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Ultimatics : Jurnal Teknik Informatika","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.31937/ti.v14i2.2854","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

Summarization simplifies the contents of a document by eliminating elements considered unimportant without reducing the core meaning the document intends to convey. However, a document typically contains more than one topic, so topics must be identified for the summarization process to be effective. Latent Dirichlet Allocation (LDA) is a commonly used topic identification method. However, LDA suffers from "order effects": the resulting topics differ when the order of the training data is changed. For the same input document, LDA can therefore produce inconsistent topics, resulting in low coherence values. This paper proposes a topic modelling method that combines LDA with a Vector Space Model (VSM) for automatic summarization. The proposed method overcomes order effects and identifies document topics calculated from the TF-IDF weights of the VSM generated by LDA. On 1,300 Twitter documents, the proposed topic modelling method reached a highest coherence value of 0.72. The resulting summaries achieved ROUGE-1 of 0.78, ROUGE-2 of 0.67, and ROUGE-L of 0.80.
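The abstract does not spell out the authors' exact pipeline, but the general idea of pairing a TF-IDF vector space model with LDA can be sketched as follows. This is a minimal illustration using scikit-learn; the toy documents, the number of topics, and the use of a fixed `random_state` (one common way to keep LDA topics reproducible across runs, which is the concern behind the "order effect") are all assumptions, not the paper's actual configuration.

```python
# Hypothetical sketch of a TF-IDF VSM feeding an LDA topic model.
# The paper's exact method is not reproduced here; parameters are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "traffic jam on the main road this morning",
    "heavy traffic reported near the city center",
    "new cafe opens downtown with live music",
    "live music festival downtown this weekend",
]

# Step 1: build the vector space model with TF-IDF weights.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# Step 2: fit LDA on the weighted matrix. A fixed random_state makes the
# fitted topics deterministic for a given input, one simple mitigation of
# the run-to-run inconsistency ("order effect") described in the abstract.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(tfidf)

# Each row is a document's topic distribution (rows sum to 1);
# the argmax gives the dominant topic used to group documents for summarization.
for doc, dist in zip(docs, doc_topics):
    print(dist.argmax(), doc)
```

A full system would then score and select sentences per topic and evaluate the summaries with ROUGE, as the paper reports.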