Keyphrase generation for Vietnamese administrative documents: a collaborative approach

2020 12th International Conference on Knowledge and Systems Engineering (KSE). Pub Date: 2020-11-12. DOI: 10.1109/KSE50997.2020.9287477

Thi-Thu-Trang Nguyen, Thi-Hai-Yen Vuong, Van-Lien Tran, Le-Minh Nguyen, X. Phan



Abstract

The keyphrases of a document can be regarded as its condensed summary. Unsupervised models extract keyphrases using only the information contained in that document, without reference to other documents. A well-performing supervised model for keyphrase generation, in contrast, requires a massive effort to build training data and still cannot generalize to new domains. Moreover, a reader tends to comprehend the topic of a document better after reading other documents on the same topic. Based on this idea, we propose a collaborative keyphrase generation system (CollabKG), a novel semi-supervised method that leverages limited labeled data. The amount of labeled data is enriched over time by users. We conduct our research on a large-scale dataset of 500,000 Vietnamese administrative documents. In CollabKG, each document is represented as a feature vector, and a cluster pruning algorithm is employed to accelerate the search for the most similar documents. The generated keyphrases were manually evaluated for relevance and accuracy, and the results received high approval. We therefore conclude that CollabKG performs well and is suitable for a real-time system.
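The abstract names two mechanisms without detail: feature-vector document representations and cluster pruning for fast similar-document lookup. The following is a minimal sketch of generic cluster pruning (roughly sqrt(N) randomly chosen leaders, each document attached to its nearest leader, and a query compared only against the followers of its nearest leader). It is not the authors' implementation. The function names, the use of cosine similarity, and the final voting step that borrows keyphrases from labeled neighbours are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_index(vectors, seed=0):
    """Pick about sqrt(N) random leaders and attach every document to its nearest leader."""
    rng = random.Random(seed)
    n = len(vectors)
    leaders = rng.sample(range(n), max(1, math.isqrt(n)))
    clusters = {leader: [] for leader in leaders}
    for i, vec in enumerate(vectors):
        nearest = max(leaders, key=lambda l: cosine(vec, vectors[l]))
        clusters[nearest].append(i)
    return clusters


def most_similar(query, vectors, clusters, top_k=3):
    """Search only the followers of the leader nearest to the query."""
    leader = max(clusters, key=lambda l: cosine(query, vectors[l]))
    ranked = sorted(clusters[leader],
                    key=lambda i: cosine(query, vectors[i]),
                    reverse=True)
    return ranked[:top_k]


def suggest_keyphrases(neighbours, labels, top_n=5):
    """Hypothetical collaborative step: vote over keyphrases of labeled neighbours.

    `labels` maps a document index to its known keyphrases; unlabeled
    neighbours contribute nothing. This voting rule is an assumption,
    not something stated in the abstract.
    """
    votes = Counter()
    for i in neighbours:
        votes.update(labels.get(i, []))
    return [phrase for phrase, _ in votes.most_common(top_n)]
```

The trade-off in this design is that a query is scored against about sqrt(N) leaders plus one cluster's followers, not all N documents, at the cost of occasionally missing a true nearest neighbour that was attached to a different leader.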
