Text normalization for named entity recognition in Vietnamese tweets.

Q1 Mathematics

Computational Social Networks. Pub Date: 2016-01-01. Epub Date: 2016-12-01. DOI: 10.1186/s40649-016-0032-0

Vu H Nguyen, Hien T Nguyen, Vaclav Snasel


Citations: 7

Abstract


Text normalization for named entity recognition in Vietnamese tweets.

Background: Named entity recognition (NER) is the task of detecting named entities in documents and categorizing them into predefined classes, such as person, location, and organization. This paper focuses on tweets posted on Twitter. Because tweets are noisy, irregular, and brief, and include acronyms and spelling errors, NER on tweets is a challenging task. Many approaches have been proposed to deal with this problem in tweets written in English, German, Chinese, etc., but none for Vietnamese tweets.

Methods: We propose a method that normalizes a tweet before passing it as input to a learning model for NER in Vietnamese tweets. The normalization step detects spelling errors in a tweet and corrects them using an improved Dice's coefficient or n-grams. A Support Vector Machine learning algorithm is employed to learn a classifier using six different types of features.
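To make the normalization idea more concrete, here is a minimal sketch of spelling correction by bigram similarity. It uses the standard Dice coefficient on character bigrams, not the improved variant the authors describe (the abstract does not specify it). The function names, the lexicon, and the sample tokens are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import List, Sequence


def char_bigrams(word: str) -> List[str]:
    """Return the overlapping character bigrams of a word."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def dice_coefficient(a: str, b: str) -> float:
    """Standard Dice coefficient on bigram multisets: 2|X ∩ Y| / (|X| + |Y|).

    This is the textbook formula, not the paper's improved version.
    """
    x, y = Counter(char_bigrams(a)), Counter(char_bigrams(b))
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        # Both strings are too short to have bigrams; fall back to equality.
        return 1.0 if a == b else 0.0
    overlap = sum((x & y).values())  # multiset intersection size
    return 2.0 * overlap / total


def best_correction(token: str, lexicon: Sequence[str]) -> str:
    """Pick the lexicon entry most similar to the (possibly misspelled) token."""
    return max(lexicon, key=lambda w: dice_coefficient(token, w))


# Hypothetical lexicon and misspelled token, for illustration only.
lexicon = ["vietnam", "vienna", "hanoi"]
suggestion = best_correction("vietnma", lexicon)
```

A real system would also need a step that decides whether a token is misspelled at all before looking for a replacement; this sketch covers only the candidate-ranking part.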

Results and conclusion: We train our method on a training set of more than 40,000 named entities and evaluate it on a test set of 3,186 named entities. The experimental results show that our system achieves state-of-the-art performance, with an F1 score of 82.13%.
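For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision and recall. The sketch below shows the usual computation from entity-level counts. The function name and the counts in the example are hypothetical and are not taken from the paper's experiments.

```python
from typing import Tuple


def precision_recall_f1(true_pos: int, false_pos: int, false_neg: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 from entity-level counts."""
    predicted = true_pos + false_pos   # entities the system output
    actual = true_pos + false_neg      # entities in the gold annotation
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative counts only (not the paper's figures).
p, r, f1 = precision_recall_f1(true_pos=8, false_pos=2, false_neg=2)
```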


Source journal

Computational Social Networks Mathematics-Modeling and Simulation

Self-citation rate

0.00%

Number of publications

Review time

13 weeks

About the journal: Computational Social Networks showcases refereed papers dealing with all mathematical, computational and applied aspects of social computing. The objective of this journal is to advance and promote the theoretical foundation, mathematical aspects, and applications of social computing. Submissions are welcome which focus on common principles, algorithms and tools that govern network structures/topologies, network functionalities, security and privacy, network behaviors, information diffusion and influence, and social recommendation systems that are applicable to all types of social networks and social media. Topics include (but are not limited to) the following:

- Social network design and architecture
- Mathematical modeling and analysis
- Real-world complex networks
- Information retrieval in social contexts, political analysts
- Network structure analysis
- Network dynamics optimization
- Complex network robustness and vulnerability
- Information diffusion models and analysis
- Security and privacy
- Searching in complex networks
- Efficient algorithms
- Network behaviors
- Trust and reputation
- Social Influence
- Social Recommendation
- Social media analysis
- Big data analysis on online social networks

The journal also includes reviews of appropriate books as special issues on hot topics.