Topic Modelling Using VSM-LDA for Document Summarization

Ultimatics : Jurnal Teknik Informatika Pub Date : 2022-12-30 DOI:10.31937/ti.v14i2.2854

Luthfi Atikah, Novrindah Alvi Hasanah, Agus Zainal Arivin


Citations: 1

Abstract



Summarization simplifies the contents of a document by removing elements that are considered unimportant, without reducing the core meaning the document is meant to convey. A document, however, usually contains more than one topic, so the topics need to be identified for the summarization process to be more effective. Latent Dirichlet Allocation (LDA) is a commonly used method for identifying topics. However, when a program is run on a different dataset, LDA exhibits "order effects": the resulting topics differ if the order of the training data is changed. Even for the same input documents, LDA can produce inconsistent topics, which results in low coherence values. This paper therefore proposes a topic modelling method that combines LDA with a Vector Space Model (VSM) for automatic summarization. The proposed method can overcome order effects and identifies document topics computed from the TF-IDF weights of the VSM generated by LDA. On 1300 Twitter data items, the proposed topic modelling method reached a highest coherence value of 0.72. The resulting summaries obtained ROUGE-1 of 0.78, ROUGE-2 of 0.67, and ROUGE-L of 0.80.
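For readers unfamiliar with the ROUGE scores quoted in the abstract, the following is a minimal sketch of how ROUGE-1, ROUGE-2, and ROUGE-L are commonly computed between a candidate summary and a reference summary. It is not the authors' evaluation code. The whitespace tokenization, the choice of the F1 variant (the abstract does not say whether recall, precision, or F1 was reported), and all function names are assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, candidate_total, reference_total):
    """F1 from an overlap count and the two totals (0.0 when undefined)."""
    if overlap == 0 or candidate_total == 0 or reference_total == 0:
        return 0.0
    precision = overlap / candidate_total
    recall = overlap / reference_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n):
    """ROUGE-N F1: clipped n-gram overlap between candidate and reference."""
    # Assumption: plain whitespace tokenization, no stemming or stop-word removal.
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())  # Counter & keeps the minimum counts
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F1 based on the longest common subsequence of tokens."""
    cand, ref = candidate.split(), reference.split()
    return f1(lcs_length(cand, ref), len(cand), len(ref))


if __name__ == "__main__":
    # Hypothetical example sentences, not taken from the paper's data.
    candidate = "the cat sat on the mat"
    reference = "the cat is on the mat"
    print(rouge_n(candidate, reference, 1))
    print(rouge_n(candidate, reference, 2))
    print(rouge_l(candidate, reference))
```

ROUGE-1 and ROUGE-2 count shared unigrams and bigrams, while ROUGE-L rewards the longest in-order word sequence shared by the two summaries, so it is sensitive to word order without requiring the words to be adjacent.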
