Improved BIO-based Chinese Automatic Abstract-generation Model

IF 1.8 · Zone 4, Computer Science · JCR Q3: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Qing Li, Weibin Wan, Yuming Zhao, Xiaoyan Jiang
Journal: ACM Transactions on Asian and Low-Resource Language Information Processing
Published: 2024-02-05 (Journal Article)
DOI: 10.1145/3643695 (https://doi.org/10.1145/3643695)
Citations: 0

Abstract

With its unique information-filtering function, text summarization technology has become a significant component of search engines and question-and-answer systems. However, existing models that include the copy mechanism often lack the ability to extract important fragments, so the generated content suffers from thematic deviation and insufficient generalization. In particular, Chinese automatic summarization using traditional generation methods often loses semantics because of its reliance on word lists. To address these issues, we proposed the novel BioCopy mechanism for the summarization task. By training tags for the predicted words and narrowing the range of the probability distribution over the vocabulary, we enhanced the ability to generate continuous segments, which effectively addresses the above problems. Additionally, we applied reinforced canonicality to the inputs to obtain better model results, making the model share the sub-network weight parameters and sparsifying the model output to reduce the search space for model prediction. To further improve the model's performance, we calculated the bilingual evaluation understudy (BLEU) score on the English dataset CNN/DailyMail to filter the thresholds and to reduce both the difficulty of word segmentation and the dependence of the output on the word list. We fully fine-tuned the model on the LCSTS dataset for the Chinese summarization task and conducted small-sample experiments on the CSL dataset. We also conducted ablation experiments on the Chinese dataset. The experimental results demonstrate that the optimized model learns the semantic representation of the original text better than other models and performs well with small sample sizes.
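
The abstract does not spell out how the predicted tags narrow the probability distribution over the vocabulary. The following is only a minimal sketch of one common way a BIO-tagged copy step can work, not the authors' implementation. It assumes that "B" begins a copied segment (any source token is allowed), "I" continues the segment (only tokens that follow the previous token somewhere in the source are allowed), and "O" leaves the vocabulary unrestricted. The function names `allowed_tokens` and `constrained_softmax` are hypothetical.

```python
import math

def allowed_tokens(tag, source_ids, prev_token, vocab_size):
    """Return the set of token ids permitted at this decoding step.

    Assumed tag semantics (an illustration, not taken from the paper):
      "B": start copying, so any token present in the source is allowed
      "I": continue copying, so only tokens that directly follow
           prev_token somewhere in the source are allowed
      "O": free generation over the whole vocabulary
    """
    full_vocab = set(range(vocab_size))
    if tag == "B":
        return set(source_ids) or full_vocab
    if tag == "I":
        followers = {
            source_ids[i + 1]
            for i in range(len(source_ids) - 1)
            if source_ids[i] == prev_token
        }
        # Fall back to the full vocabulary if no continuation exists.
        return followers or full_vocab
    return full_vocab

def constrained_softmax(logits, allowed):
    """Softmax over logits with all disallowed token ids masked out."""
    masked = [x if i in allowed else -math.inf for i, x in enumerate(logits)]
    peak = max(masked)
    exps = [math.exp(x - peak) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: vocabulary of 6 ids, source sequence 2 3 4 2 5.
source = [2, 3, 4, 2, 5]
logits = [0.0] * 6
step_allowed = allowed_tokens("I", source, prev_token=2, vocab_size=6)
probs = constrained_softmax(logits, step_allowed)
```

In this toy case the previous token 2 is followed in the source by 3 and by 5, so under the "I" tag all probability mass is split between those two ids. Restricting the candidates this way is what keeps a copied segment contiguous with the source text.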

Source journal
CiteScore: 3.60
Self-citation rate: 15.00%
Articles published: 241
Journal description: The ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) publishes high quality original archival papers and technical notes in the areas of computation and processing of information in Asian languages, low-resource languages of Africa, Australasia, Oceania and the Americas, as well as related disciplines. The subject areas covered by TALLIP include, but are not limited to:
-Computational Linguistics: including computational phonology, computational morphology, computational syntax (e.g. parsing), computational semantics, computational pragmatics, etc.
-Linguistic Resources: including computational lexicography, terminology, electronic dictionaries, cross-lingual dictionaries, electronic thesauri, etc.
-Hardware and software algorithms and tools for Asian or low-resource language processing, e.g., handwritten character recognition.
-Information Understanding: including text understanding, speech understanding, character recognition, discourse processing, dialogue systems, etc.
-Machine Translation involving Asian or low-resource languages.
-Information Retrieval: including natural language processing (NLP) for concept-based indexing, natural language query interfaces, semantic relevance judgments, etc.
-Information Extraction and Filtering: including automatic abstraction, user profiling, etc.
-Speech processing: including text-to-speech synthesis and automatic speech recognition.
-Multimedia Asian Information Processing: including speech, image, video, image/text translation, etc.
-Cross-lingual information processing involving Asian or low-resource languages.
Papers that deal in theory, systems design, evaluation and applications in the aforesaid subjects are appropriate for TALLIP. Emphasis will be placed on the originality and the practical significance of the reported research.