Language Modeling for Journalistic Robot based on Generative Pretrained Transformer 2

Raihan Hamid Suraperwata, S. Suyanto
Published in: 2020 8th International Conference on Information and Communication Technology (ICoICT)
Publication date: 2020-06-01
DOI: 10.1109/ICoICT49345.2020.9166359
URL: https://doi.org/10.1109/ICoICT49345.2020.9166359
Citations: 1

Abstract

A language model is typically framed as unsupervised distribution estimation from a set of examples, each a sequence of symbols, which lets it predict over sequences of words. We show that a language model based on Generative Pretrained Transformer 2 (GPT-2) can generate readable articles for a journalistic robot. Press freedom in Indonesia currently allows any journalist to publish news in the media, including unprofessional news. As a result, the number of journalists who lack journalistic knowledge has risen, and so has the amount of inappropriate news content in Indonesia. To improve the quality of news produced by Indonesian mass media, a journalistic robot is therefore needed that produces news content in line with the guidelines and the journalistic code of ethics. This research uses GPT-2-based language modeling to generate articles. The program has four primary steps: building the dataset, fine-tuning GPT-2, modeling the trained data, and creating articles. Because the target output is Indonesian articles, the research also adds an Indonesian model for GPT-2. The paper applies GPT-2 to news content and evaluates the output with BLEU scores to check whether the results are readable. The findings show that the proposed model can generate a readable article after being trained on 110 Indonesian articles, with an excellent BLEU score.
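The abstract says generated articles are checked with BLEU scores but does not state which BLEU variant, tokenizer, or reference set was used. As a rough illustration of what such a score measures, the sketch below computes a simplified single-reference BLEU: clipped n-gram precision for n = 1 to 4, combined by a geometric mean and scaled by a brevity penalty. The function names `ngrams` and `bleu`, the whitespace tokenization, and the absence of smoothing are all assumptions made for this example, not details taken from the paper.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Simplified single-reference BLEU with uniform n-gram weights.

    Assumption: whitespace tokenization and no smoothing; the paper does
    not state which BLEU variant or tokenizer it used.
    """
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU 0
        log_precisions.append(math.log(matched / total))
    # Brevity penalty discourages candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)
```

Calling `bleu(generated_text, reference_text)` returns a value between 0 and 1, where a generated article identical to its reference scores 1. Because this version has no smoothing, short texts that miss every 4-gram score 0; established implementations offer smoothing options and standardized tokenization, and would be the better choice for reporting results.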