Keyword Extraction from Short Texts with a Text-To-Text Transfer Transformer

Asian Conference on Intelligent Information and Database Systems | Pub Date: 2022-09-28 | DOI: 10.48550/arXiv.2209.14008

Piotr Pęzik, Agnieszka Mikolajczyk-Barela, Adam Wawrzynski, Bartlomiej Niton, M. Ogrodniczuk


Citations: 2

Abstract


The paper explores the relevance of the Text-To-Text Transfer Transformer language model (T5) for Polish (plT5) to the task of intrinsic and extrinsic keyword extraction from short text passages. The evaluation is carried out on the new Polish Open Science Metadata Corpus (POSMAC), a collection of 216,214 abstracts of scientific publications compiled in the CURLICAT project, which is released with this paper. We compare the results obtained by four different methods, i.e., plT5kw, extremeText, TermoPL, and KeyBERT, and conclude that the plT5kw model yields particularly promising results for both frequent and sparsely represented keywords. Furthermore, a plT5kw keyword generation model trained on POSMAC also seems to produce highly useful results in cross-domain text labelling scenarios. We discuss the performance of the model on news stories and phone-based dialog transcripts, which represent text genres and domains extrinsic to the dataset of scientific abstracts. Finally, we also attempt to characterize the challenges of evaluating a text-to-text model on both intrinsic and extrinsic keyword extraction.
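The abstract does not say which metrics were used, so the following is only a minimal sketch of one common way to score keyword extraction: micro-averaged precision, recall, and F1 over exact matches between gold and predicted keyword sets. The function names `normalize` and `micro_prf` and the toy data are assumptions made for illustration and are not taken from the paper.

```python
from typing import Iterable, Tuple


def normalize(keyword: str) -> str:
    """Lowercase and collapse whitespace. No lemmatization is attempted."""
    return " ".join(keyword.lower().split())


def micro_prf(
    gold: Iterable[Iterable[str]],
    predicted: Iterable[Iterable[str]],
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over exact keyword matches.

    `gold` and `predicted` are parallel collections: one keyword list per text.
    """
    tp = fp = fn = 0
    for gold_kws, pred_kws in zip(gold, predicted):
        gold_set = {normalize(k) for k in gold_kws}
        pred_set = {normalize(k) for k in pred_kws}
        tp += len(gold_set & pred_set)   # predicted and correct
        fp += len(pred_set - gold_set)   # predicted but not in gold
        fn += len(gold_set - pred_set)   # in gold but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Hypothetical toy data: two texts, two keywords each.
gold = [["keyword extraction", "T5"], ["Polish", "corpus"]]
predicted = [["Keyword Extraction", "transformer"], ["polish", "corpora"]]
print(micro_prf(gold, predicted))
```

In the toy data, "corpora" is counted as wrong against the gold keyword "corpus". Exact matching penalizes inflected or paraphrased keywords in this way, which is plausibly one of the difficulties in scoring a generative model, especially for a highly inflected language such as Polish.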
