运用迁移学习和分而治之的方法提高抽象文本摘要的覆盖率和新颖性

IF 1.2 4区计算机科学 Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Malaysian Journal of Computer Science Pub Date : 2023-07-31 DOI:10.22452/mjcs.vol36no3.4

Ayham Alomari, N. Idris, Aznul Qalid, I. Alsmadi

{"title":"运用迁移学习和分而治之的方法提高抽象文本摘要的覆盖率和新颖性","authors":"Ayham Alomari, N. Idris, Aznul Qalid, I. Alsmadi","doi":"10.22452/mjcs.vol36no3.4","DOIUrl":null,"url":null,"abstract":"Automatic Text Summarization (ATS) models yield outcomes with insufficient coverage of crucial details and poor degrees of novelty. The first issue resulted from the lengthy input, while the second problem resulted from the characteristics of the training dataset itself. This research employs the divide-and-conquer approach to address the first issue by breaking the lengthy input into smaller pieces to be summarized, followed by the conquest of the results in order to cover more significant details. For the second challenge, these chunks are summarized by models trained on datasets with higher novelty levels in order to produce more human-like and concise summaries with more novel words that do not appear in the input article. The results demonstrate an improvement in both coverage and novelty levels. Moreover, we defined a new metric to measure the novelty of the summary. Finally, we investigated the findings to discover whether the novelty is influenced more by the dataset itself, as in CNN/DM, or by the training model and its training objective, as in Pegasus.","PeriodicalId":49894,"journal":{"name":"Malaysian Journal of Computer Science","volume":" ","pages":""},"PeriodicalIF":1.2000,"publicationDate":"2023-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"IMPROVING COVERAGE AND NOVELTY OF ABSTRACTIVE TEXT SUMMARIZATION USING TRANSFER LEARNING AND DIVIDE AND CONQUER APPROACHES\",\"authors\":\"Ayham Alomari, N. Idris, Aznul Qalid, I. Alsmadi\",\"doi\":\"10.22452/mjcs.vol36no3.4\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Automatic Text Summarization (ATS) models yield outcomes with insufficient coverage of crucial details and poor degrees of novelty. The first issue resulted from the lengthy input, while the second problem resulted from the characteristics of the training dataset itself. This research employs the divide-and-conquer approach to address the first issue by breaking the lengthy input into smaller pieces to be summarized, followed by the conquest of the results in order to cover more significant details. For the second challenge, these chunks are summarized by models trained on datasets with higher novelty levels in order to produce more human-like and concise summaries with more novel words that do not appear in the input article. The results demonstrate an improvement in both coverage and novelty levels. Moreover, we defined a new metric to measure the novelty of the summary. Finally, we investigated the findings to discover whether the novelty is influenced more by the dataset itself, as in CNN/DM, or by the training model and its training objective, as in Pegasus.\",\"PeriodicalId\":49894,\"journal\":{\"name\":\"Malaysian Journal of Computer Science\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":1.2000,\"publicationDate\":\"2023-07-31\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Malaysian Journal of Computer Science\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.22452/mjcs.vol36no3.4\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Malaysian Journal of Computer Science","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.22452/mjcs.vol36no3.4","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

摘要

自动文本摘要（ATS）模型产生的结果对关键细节的覆盖不足，新颖性较差。第一个问题是由于输入时间过长，而第二个问题是因为训练数据集本身的特性。这项研究采用了分而治之的方法来解决第一个问题，将冗长的输入分解成更小的部分进行总结，然后征服结果，以涵盖更重要的细节。对于第二个挑战，这些块由在具有更高新颖性水平的数据集上训练的模型进行总结，以便用输入文章中没有出现的更新颖的单词生成更人性化和简洁的摘要。结果表明，覆盖率和新颖性都有所提高。此外，我们定义了一个新的度量标准来衡量摘要的新颖性。最后，我们调查了这些发现，以发现新颖性是更多地受到数据集本身的影响，如在CNN/DM中，还是受到训练模型及其训练目标的影响，例如在飞马座中。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

IMPROVING COVERAGE AND NOVELTY OF ABSTRACTIVE TEXT SUMMARIZATION USING TRANSFER LEARNING AND DIVIDE AND CONQUER APPROACHES

Automatic Text Summarization (ATS) models yield outcomes with insufficient coverage of crucial details and poor degrees of novelty. The first issue resulted from the lengthy input, while the second problem resulted from the characteristics of the training dataset itself. This research employs the divide-and-conquer approach to address the first issue by breaking the lengthy input into smaller pieces to be summarized, followed by the conquest of the results in order to cover more significant details. For the second challenge, these chunks are summarized by models trained on datasets with higher novelty levels in order to produce more human-like and concise summaries with more novel words that do not appear in the input article. The results demonstrate an improvement in both coverage and novelty levels. Moreover, we defined a new metric to measure the novelty of the summary. Finally, we investigated the findings to discover whether the novelty is influenced more by the dataset itself, as in CNN/DM, or by the training model and its training objective, as in Pegasus.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Malaysian Journal of Computer Science COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE-COMPUTER SCIENCE, THEORY & METHODS

CiteScore

2.20

自引率

33.30%

发文量

审稿时长

7.5 months

期刊介绍： The Malaysian Journal of Computer Science (ISSN 0127-9084) is published four times a year in January, April, July and October by the Faculty of Computer Science and Information Technology, University of Malaya, since 1985. Over the years, the journal has gained popularity and the number of paper submissions has increased steadily. The rigorous reviews from the referees have helped in ensuring that the high standard of the journal is maintained. The objectives are to promote exchange of information and knowledge in research work, new inventions/developments of Computer Science and on the use of Information Technology towards the structuring of an information-rich society and to assist the academic staff from local and foreign universities, business and industrial sectors, government departments and academic institutions on publishing research results and studies in Computer Science and Information Technology through a scholarly publication. The journal is being indexed and abstracted by Clarivate Analytics'' Web of Science and Elsevier''s Scopus