BESKlus : BERT Extractive Summarization with K-Means Clustering in Scientific Paper

Feliks Victor Parningotan Samosir, Hapnes Toba, M. Ayub
{"title":"基于k -均值聚类的科学论文BERT提取摘要","authors":"Feliks Victor Parningotan Samosir, Hapnes Toba, M. Ayub","doi":"10.28932/jutisi.v8i1.4474","DOIUrl":null,"url":null,"abstract":"This study aims to propose methods and models for extractive text summarization with contextual embedding. To build this model, a combination of traditional machine learning algorithms such as K-Means Clustering and the latest BERT-based architectures such as Sentence-BERT (SBERT) is carried out. The contextual embedding process will be carried out at the sentence level by SBERT. Embedded sentences will be clustered and the distance calculated from the centroid. The top sentences from each cluster will be used as summary candidates. The dataset used in this study is a collection of scientific journals from NeurIPS. Performance evaluation carried out with ROUGE-L gave a result of 15.52% and a BERTScore of 85.55%. This result surpasses several previous models such as PyTextRank and BERT Extractive Summarizer. The results of these measurements prove that the use of contextual embedding is very good if applied to extractive text summarization which is generally done at the sentence level.","PeriodicalId":185279,"journal":{"name":"Jurnal Teknik Informatika dan Sistem Informasi","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"BESKlus : BERT Extractive Summarization with K-Means Clustering in Scientific Paper\",\"authors\":\"Feliks Victor Parningotan Samosir, Hapnes Toba, M. Ayub\",\"doi\":\"10.28932/jutisi.v8i1.4474\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This study aims to propose methods and models for extractive text summarization with contextual embedding. To build this model, a combination of traditional machine learning algorithms such as K-Means Clustering and the latest BERT-based architectures such as Sentence-BERT (SBERT) is carried out. The contextual embedding process will be carried out at the sentence level by SBERT. Embedded sentences will be clustered and the distance calculated from the centroid. The top sentences from each cluster will be used as summary candidates. The dataset used in this study is a collection of scientific journals from NeurIPS. Performance evaluation carried out with ROUGE-L gave a result of 15.52% and a BERTScore of 85.55%. This result surpasses several previous models such as PyTextRank and BERT Extractive Summarizer. 
The results of these measurements prove that the use of contextual embedding is very good if applied to extractive text summarization which is generally done at the sentence level.\",\"PeriodicalId\":185279,\"journal\":{\"name\":\"Jurnal Teknik Informatika dan Sistem Informasi\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-04-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Jurnal Teknik Informatika dan Sistem Informasi\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.28932/jutisi.v8i1.4474\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Jurnal Teknik Informatika dan Sistem Informasi","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.28932/jutisi.v8i1.4474","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

This study proposes methods and models for extractive text summarization with contextual embeddings. The model combines a traditional machine learning algorithm, K-Means Clustering, with a recent BERT-based architecture, Sentence-BERT (SBERT). Contextual embedding is performed at the sentence level by SBERT; the embedded sentences are then clustered, and each sentence's distance from its cluster centroid is computed. The sentences closest to each centroid are taken as summary candidates. The dataset used in this study is a collection of scientific papers from NeurIPS. Evaluation yields a ROUGE-L score of 15.52% and a BERTScore of 85.55%, surpassing earlier models such as PyTextRank and BERT Extractive Summarizer. These results show that contextual embeddings are well suited to extractive text summarization, which is generally performed at the sentence level.
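
The embed-cluster-select pipeline described in the abstract can be sketched in a few lines. The following is a minimal illustration, assuming the sentence-transformers and scikit-learn packages; the checkpoint name (all-MiniLM-L6-v2) and the default cluster count are illustrative assumptions, not the paper's reported configuration.

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def besklus_summarize(sentences, num_clusters=3):
    """Return one representative sentence per cluster, in document order."""
    # Step 1: contextual embedding at the sentence level with SBERT.
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed SBERT checkpoint
    embeddings = model.encode(sentences)

    # Step 2: cluster the sentence embeddings with K-Means.
    kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=0)
    labels = kmeans.fit_predict(embeddings)

    # Step 3: in each cluster, keep the sentence nearest its centroid
    # as a summary candidate.
    picked = []
    for k in range(num_clusters):
        members = np.where(labels == k)[0]
        dists = np.linalg.norm(
            embeddings[members] - kmeans.cluster_centers_[k], axis=1)
        picked.append(members[np.argmin(dists)])

    # Reassemble the candidates in their original document order.
    return [sentences[i] for i in sorted(picked)]

Selecting the sentence nearest each centroid keeps the summary extractive: every output sentence is copied verbatim from the source document rather than generated.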
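The reported metrics could be computed with off-the-shelf scorers. Below is a hedged sketch, assuming the rouge-score and bert-score packages and reusing the besklus_summarize helper above; the reference string is a placeholder for a gold summary, not data from the paper.

from rouge_score import rouge_scorer
from bert_score import score as bertscore

sentences = ["First sentence of the paper.", "Second sentence.", "Third."]
candidate = " ".join(besklus_summarize(sentences, num_clusters=2))
reference = "A gold-standard summary goes here."  # placeholder

# ROUGE-L compares the longest common subsequence of candidate and reference.
rouge = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.4f}")

# BERTScore compares the two texts in contextual embedding space.
P, R, F1 = bertscore([candidate], [reference], lang="en")
print(f"BERTScore F1: {F1.mean().item():.4f}")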