A Support Vector Machines Approach to Vietnamese Key Phrase Extraction

Chau Q. Nguyen, Luan T. Hong, T. Phan
2009 IEEE-RIVF International Conference on Computing and Communication Technologies, published 2009-07-13
DOI: 10.1109/RIVF.2009.5174613 (https://doi.org/10.1109/RIVF.2009.5174613)
Citations: 1

Abstract

Automatic key phrase extraction is the task of selecting a set of phrases that describe the content of a simple sentence. A key phrase is "extracted" in the sense that it appears verbatim in the sentence to which it is assigned. Accurate key phrase extraction is fundamental to many recent digital library applications and to clustering and semantic information retrieval techniques. This paper describes a support vector machines (SVMs) approach to Vietnamese key phrase extraction and presents a series of experiments in which performance is incrementally improved. In general, Vietnamese key phrase extraction consists of three steps: word segmentation, which identifies the lexical units in an input sentence; part-of-speech tagging of those words; and key phrase extraction over the resulting phrases. The performance of Vietnamese key phrase extraction systems is generally measured by precision, which depends strongly on the nature and size of the training set of key phrases. Most results exceed 70.30% with a training set of 9,000 Vietnamese key phrases in 2,000 sentences selected from the corpus of the Vietnamese Lexicography Center (www.vietlex.com.vn).

I. INTRODUCTION

Key phrases, which can be single keywords or multiword key terms, are linguistic descriptors of documents. They are often informative enough to give human readers a feel for the essential topics and main contents of the source documents. Key phrases have also been used as features in many text-related applications such as text clustering, document similarity analysis, and document summarization. Manually extracting key phrases from a large number of documents is expensive. Automatic key phrase extraction is a maturing technology that can serve as an efficient and practical alternative.

Key phrase extraction may be viewed as a classification problem. A document can be seen as a bag of phrases in which each phrase belongs to one of two classes: it is either a key phrase or a non-key phrase. We approach this problem from the perspective of machine learning research and treat it as supervised learning from examples. We divide our documents into two sets: training documents and testing documents. The training documents are used to tune the key phrase extraction algorithms so as to maximize their performance; that is, they teach the supervised learning algorithms how to distinguish key phrases from non-key phrases. The testing documents are used to evaluate the tuned algorithms.

This work is motivated by the range of applications for key phrases. There are at least five general application areas: text summarization, human-readable indexes, interactive query refinement, machine-readable indexes, and feature extraction as preprocessing for further machine analysis.

SVMs have been remarkably successful in machine learning. Research applying this method has achieved good results and has proven more effective than research using other learning methods, especially on problems of natural language processing (3, 6, 7), pattern classification, or pattern recognition (8). In this paper, we apply SVMs to build a key phrase extraction system for Vietnamese text. This section introduces a Support Vector Machines model for Vietnamese key phrase extraction.

The rest of the paper is organized as follows: Section 2 introduces the Support Vector Machines approach; Section 3 proposes a methodology for the Vietnamese key phrase extraction model; Section 4 evaluates our approach on many Vietnamese query sentences drawn from texts of different styles; and Section 5 presents the conclusion.
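The classification view described above can be illustrated with a small sketch. It is not the authors' implementation: the text here does not state which features, kernel, or SVM toolkit the paper uses, so everything below is an assumption made for illustration. The sketch takes already segmented tokens as input (word segmentation and part-of-speech tagging are assumed to be done by external tools), generates candidate phrases that occur verbatim in the sentence, trains a linear classifier on the regularized hinge loss (the soft-margin SVM objective) with plain subgradient descent, and computes precision. The two binary features and the function names are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def candidate_phrases(tokens: Sequence[str], max_len: int = 3) -> List[str]:
    """Every contiguous run of 1..max_len tokens, so each candidate
    appears verbatim in the (already segmented) sentence."""
    out = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            out.append(" ".join(tokens[start:end]))
    return out


def train_linear_svm(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    lr: float = 0.1,
    lam: float = 0.01,
    epochs: int = 50,
) -> Tuple[List[float], float]:
    """Subgradient descent on lam/2 * ||w||^2 + hinge loss.
    Labels must be +1 (key phrase) or -1 (non-key phrase)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # The regularizer shrinks w slightly on every step.
            w = [wj * (1.0 - lr * lam) for wj in w]
            if margin < 1.0:
                # Example is misclassified or inside the margin:
                # move the hyperplane toward classifying it correctly.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return +1 for key phrase, -1 for non-key phrase."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


def precision(extracted: Set[str], gold: Set[str]) -> float:
    """Fraction of extracted phrases that are true key phrases."""
    if not extracted:
        return 0.0
    return len(extracted & gold) / len(extracted)


if __name__ == "__main__":
    # Illustrative sentence, segmented by hand ("Hanoi is the capital").
    print(candidate_phrases(["Hà_Nội", "là", "thủ_đô"], max_len=2))

    # Hypothetical features per candidate, encoded as +1/-1:
    # [ends in a noun, contains a verb]. Toy labels, not real data.
    X = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
    y = [1, 1, -1, -1]
    w, b = train_linear_svm(X, y)
    print([predict(w, b, x) for x in X])
```

In practice the training and testing documents would be kept separate, with `train_linear_svm` fitted on the former and `precision` computed on phrases extracted from the latter. A library SVM implementation would normally replace the hand-written trainer; it is written out here only to keep the example self-contained.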