Improved BIO-based Chinese Automatic Abstract-generation Model

IF 1.8 · Zone 4 (Computer Science) · Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

ACM Transactions on Asian and Low-Resource Language Information Processing · Pub Date: 2024-02-05 · DOI: 10.1145/3643695

Qing Li, Weibin Wan, Yuming Zhao, Xiaoyan Jiang

Citations: 0

Abstract

With its unique information-filtering function, text summarization has become an important component of search engines and question-answering systems. However, existing models that include a copy mechanism often lack the ability to extract important fragments, so the generated content suffers from thematic deviation and insufficient generalization. In particular, Chinese automatic summarization using traditional generation methods often loses semantics because of its reliance on word lists. To address these issues, we proposed the novel BioCopy mechanism for the summarization task. By training on the tags of predicted words and narrowing the range of the probability distribution over the vocabulary, we enhanced the model's ability to generate continuous segments, which effectively addresses the above problems. Additionally, we applied reinforced canonicality to the inputs to obtain better model results, making the model share the sub-network weight parameters and sparsifying the model output to reduce the search space for model prediction. To further improve the model's performance, we calculated the bilingual evaluation understudy (BLEU) score on the English dataset CNN/DailyMail to filter the thresholds and to reduce both the difficulty of word segmentation and the dependence of the output on the word list. We fully fine-tuned the model on the LCSTS dataset for the Chinese summarization task and conducted small-sample experiments on the CSL dataset. We also conducted ablation experiments on the Chinese dataset. The experimental results demonstrate that the optimized model learns the semantic representation of the original text better than other models and performs well with small sample sizes.
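
The abstract does not specify how the predicted tags narrow the probability distribution, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes a BIO scheme in which `B` means the next token is copied from the source, `I` means the next token continues a copied span (it must follow the previous token somewhere in the source), and `O` means free generation over the full vocabulary. The function names `allowed_tokens` and `masked_distribution`, and the toy probabilities, are hypothetical.

```python
def allowed_tokens(tag, source_tokens, prev_token, vocab):
    """Return the candidate tokens permitted under an (assumed) BIO copy tag."""
    vocab = set(vocab)
    if tag == "B":
        # Start of a copied span: any token that occurs in the source.
        return set(source_tokens) & vocab
    if tag == "I":
        # Continuation of a copied span: only tokens that directly follow
        # prev_token somewhere in the source.
        followers = {
            nxt
            for cur, nxt in zip(source_tokens, source_tokens[1:])
            if cur == prev_token
        }
        return followers & vocab
    # "O": no copy constraint, the whole vocabulary stays available.
    return vocab


def masked_distribution(probs, allowed):
    """Zero out disallowed tokens and renormalize the remaining mass."""
    kept = {tok: p for tok, p in probs.items() if tok in allowed}
    total = sum(kept.values())
    if total == 0:
        # Nothing survives the mask: fall back to the unconstrained distribution.
        return dict(probs)
    return {tok: p / total for tok, p in kept.items()}


# Toy example: character-level source text and a made-up model distribution.
source = list("今天天气很好")
probs = {"今": 0.1, "天": 0.2, "气": 0.3, "好": 0.2, "坏": 0.2}

# Tag "I" after the token "天": only "天" and "气" follow "天" in the source.
i_dist = masked_distribution(probs, allowed_tokens("I", source, "天", probs))

# Tag "B": any source token may start a span, so "坏" is excluded.
b_dist = masked_distribution(probs, allowed_tokens("B", source, None, probs))

print(i_dist)
print(b_dist)
```

Masking before renormalizing is one simple way to shrink the search space at each decoding step; a real model would apply the same idea to logits inside beam search rather than to a Python dictionary.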

Source journal

ACM Transactions on Asian and Low-Resource Language Information Processing (Computer Science: General Computer Science)

CiteScore: 3.60

Self-citation rate: 15.00%

Articles published: 241

About the journal: The ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) publishes high-quality original archival papers and technical notes in the areas of computation and processing of information in Asian languages, low-resource languages of Africa, Australasia, Oceania and the Americas, as well as related disciplines. The subject areas covered by TALLIP include, but are not limited to:

- Computational Linguistics: including computational phonology, computational morphology, computational syntax (e.g., parsing), computational semantics, computational pragmatics, etc.
- Linguistic Resources: including computational lexicography, terminology, electronic dictionaries, cross-lingual dictionaries, electronic thesauri, etc.
- Hardware and software algorithms and tools for Asian or low-resource language processing, e.g., handwritten character recognition.
- Information Understanding: including text understanding, speech understanding, character recognition, discourse processing, dialogue systems, etc.
- Machine Translation involving Asian or low-resource languages.
- Information Retrieval: including natural language processing (NLP) for concept-based indexing, natural language query interfaces, semantic relevance judgments, etc.
- Information Extraction and Filtering: including automatic abstraction, user profiling, etc.
- Speech Processing: including text-to-speech synthesis and automatic speech recognition.
- Multimedia Asian Information Processing: including speech, image, video, image/text translation, etc.
- Cross-lingual information processing involving Asian or low-resource languages.

Papers that deal with theory, systems design, evaluation, and applications in the aforesaid subjects are appropriate for TALLIP. Emphasis will be placed on the originality and the practical significance of the reported research.