Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval: Latest Articles

Catapa Resume Parser: End to End Indonesian Resume Extraction

Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval Pub Date : 2019-06-28 DOI: 10.1145/3342827.3342832

Berty Chrismartin Lumban Tobing, Immanuel Rhesa Suhendra, Christian Halim

Citations: 6

A Hybrid Method for Vietnamese Text Normalization

Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval Pub Date : 2019-06-28 DOI: 10.1145/3342827.3342851

Nguyen Thi Thu Trang, Dang Xuan Bach, N. X. Tung

Abstract: This paper presents a hybrid method for normalizing written text of the kind often found in newspapers to its spoken form. To normalize raw text with a number of non-standard words (NSWs), a two-step model is proposed. The first step classifies NSWs into different categories using Random Forest. The second step expands them, depending on their NSW types, into pronounceable syllables using a hybrid method. Most numeric types can be expanded by well-defined rules, while most alphabetic ones must be expanded by a deep learning (i.e. sequence-to-sequence) model and a post-adjustment. The experiment on a Vietnamese corpus with the proposed NSW categories shows that the most ambiguous cases for the classification model are the abbreviation and read-as-sequence types, which are hence combined into one category for the later expansion step with a more complex model and better context. The classification model gives an enhanced result of 99.20% with the category combination and the feature optimization. In the expansion, the sequence-to-sequence model shows a good result of 96.53% for abbreviations and 96.25% for loanwords, with a post-adjustment for some completely wrong cases. This model can effectively predict the expansions of abbreviations in context.

Citations: 7

Authorship Attribution of Russian Forum Posts with Different Types of N-gram Features

Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval Pub Date : 2019-06-28 DOI: 10.1145/3342827.3342834

T. Litvinova, O. Litvinova, Polina Panicheva

Abstract: Authorship attribution is an important field in online security. Recently there have been numerous successful works in authorship attribution in various European languages. Character n-grams are reported to be the best choice in authorship attribution, as they encode both style and content information. We evaluate different types of character n-gram features in an authorship attribution task on a real-world noisy dataset of Russian forum posts. We also supplement them with a number of new simple n-gram features capturing syntactic and discourse patterns. We perform authorship attribution in a single-topic and a cross-topic setting, as the research question is whether character n-grams capture both style and content information. Our results show that character n-grams are indeed very successful in Russian forum post authorship attribution. However, there is no clear distinction between style and content n-grams, as the same types of n-grams work well for both single-topic and cross-topic settings. In our experiments, the generalized simple n-gram features, which reveal syntactic and discourse patterns, also proved to be very important in authorship attribution of short informal Russian texts. They represent a different kind of authorship information and are a successful addition to the character n-grams in authorship attribution of forum texts in the Russian language.
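As a rough illustration of the character n-gram features this abstract refers to, the following minimal Python sketch counts overlapping character n-grams in a post and turns them into relative frequencies. The function names, the choice of n = 3, and the sample string are assumptions made for illustration only; the paper's actual feature types and classifier are not reproduced here.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams (spaces and punctuation included)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def relative_frequencies(counts: Counter) -> dict:
    """Normalize raw counts so that posts of different lengths are comparable."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


# Hypothetical forum post used only to show the call pattern.
post = "ну и что, ну и ладно"
profile = relative_frequencies(char_ngrams(post, n=3))
```

A profile like this could then be fed to any standard classifier; which n-gram subtypes and which classifier work best is exactly the question the paper studies.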

Citations: 4

Is It Possible to Use Chatbot for the Chinese Word Segmentation?

Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval Pub Date : 2019-06-28 DOI: 10.1145/3342827.3342836

K. Chang, Hsien-Tsung Chang

Citations: 2

Improving Vietnamese WordNet using word embedding

Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval Pub Date : 2019-06-28 DOI: 10.1145/3342827.3342854

Khang Nhut Lam, Tuan Huynh To, Thong Tri Tran, J. Kalita

Citations: 3

Guideline for Academic Support of Student Career Path Using Mining Algorithm

Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval Pub Date : 2019-06-28 DOI: 10.1145/3342827.3342841

M. Sodanil, Saranlita Chotirat, L. Poomhiran, Kanchana Viriyapant

Abstract: In general, higher education is an important step in preparing students for a future career. Graduates should have qualifications that are recognized by both entrepreneurs and society. Therefore, every higher educational institution should make an effort to consider how to assist students' performance. This research aims to analyze the relationships between courses that are likely to produce a future career for students using the Apriori algorithm. The data used for association rule mining were the students' grades from 25 main courses in the field of information technology, Department of Information Technology, Faculty of Science and Technology, Suan Sunandha Rajabhat University. These data were recorded between 2011 and 2019 and stored in the registration and graduate career system. Fourteen association rules were determined using the Weka 3.8.3 data mining software, indicating that there were a few courses in which students could have future careers. Most importantly, the results can contribute to guidelines for the academic support of students' future careers.
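To make the Apriori step mentioned in this abstract more concrete, here is a small self-contained Python sketch of level-wise frequent itemset mining followed by rule generation. It is not the Weka implementation the authors used. The course and career labels, the thresholds, and all function names are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def apriori(transactions, min_support):
    """Return {frequent itemset: support}, built level by level."""
    singletons = {frozenset([item]) for t in transactions for item in t}
    level = {s for s in singletons if support(s, transactions) >= min_support}
    frequent = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = support(itemset, transactions)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
        }
        level = {c for c in candidates if support(c, transactions) >= min_support}
    return frequent


def association_rules(frequent, min_confidence):
    """Yield (antecedent, consequent, support, confidence) for strong rules."""
    rules = []
    for itemset, sup in frequent.items():
        for size in range(1, len(itemset)):
            for antecedent in combinations(itemset, size):
                antecedent = frozenset(antecedent)
                confidence = sup / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, sup, confidence))
    return rules


# Hypothetical records: high-grade courses plus the graduate's career.
records = [
    frozenset({"Programming=A", "Database=A", "Career=Developer"}),
    frozenset({"Programming=A", "Database=A", "Career=Developer"}),
    frozenset({"Networking=A", "Career=Administrator"}),
    frozenset({"Programming=A", "Career=Developer"}),
]
strong_rules = association_rules(apriori(records, min_support=0.5), min_confidence=0.9)
```

Each rule carries its support and confidence, the two measures typically used to decide which course-to-career associations are worth reporting.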

Citations: 3

Analysis of Native and Non-native Speakers' English Compositions based on Word-frequency Distribution and Text Statistics

Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval Pub Date : 2019-06-28 DOI: 10.1145/3342827.3342856

H. Tsubaki

Abstract: In this paper, the word-frequency distribution of JACET 8000 basic words and text statistics were studied to compare and analyze differences between English compositions (essays) written by native speakers and non-native speakers. For the native speakers' essays, the Guiraud Index at each of Levels 2-8 had higher correlation coefficients with average sentence length and the Automated Readability Index. Meanwhile, for the non-native speakers' essays, the index values showed moderate correlation coefficients with sentence count. It was observed that the productivity and readability of the compositions seem to depend on the ranges of basic content words which native or non-native writers have acquired and can use in English. To verify the word-frequency distribution as a proficiency rating measurement for non-native speakers, an estimation experiment was carried out based on a multiple-regression model using the word-frequency distribution of 68 English compositions written by the non-native writers. The estimated scores of the learners showed a correlation of 0.475 with their actual TOEIC scores. These results confirmed the possibility of using word usage statistics for the objective evaluation of L2 (second language) learners' language proficiency.
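The two text statistics named in this abstract have standard closed forms: the Guiraud Index is the number of distinct word types divided by the square root of the number of tokens, and the Automated Readability Index is 4.71 × (characters per word) + 0.5 × (words per sentence) − 21.43. The Python sketch below computes both over a whole text. The tokenization rules and function names are simplifying assumptions, and the per-level breakdown by JACET 8000 word lists used in the paper is not reproduced because those lists are not part of this listing.

```python
import math
import re


def tokenize(text: str) -> list:
    """Lower-cased word tokens; a deliberately simple tokenizer (assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def guiraud_index(tokens: list) -> float:
    """Distinct word types divided by the square root of the token count."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / math.sqrt(len(tokens))


def automated_readability_index(text: str) -> float:
    """ARI = 4.71 * chars/words + 0.5 * words/sentences - 21.43."""
    words = tokenize(text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return 0.0
    # Count only letters and digits, not apostrophes, as characters.
    chars = sum(ch.isalnum() for word in words for ch in word)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43


# Hypothetical essay fragment used only to show the call pattern.
essay = "I like reading books. Reading helps me learn new words every day."
lexical_richness = guiraud_index(tokenize(essay))
readability = automated_readability_index(essay)
```

Values like these, computed per essay, are the kind of inputs a correlation or multiple-regression analysis such as the one described above would consume.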

Citations: 0

Evaluation of Morphological Embeddings for the Russian Language

Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval Pub Date : 2019-06-28 DOI: 10.1145/3342827.3342846

V. Romanov, A. Khusainova

Citations: 1

HWE: Hybrid Word Embeddings For Text Classification

Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval Pub Date : 2019-06-28 DOI: 10.1145/3342827.3342837

Xuebo Song, P. Srimani, James Ze Wang

Citations: 3

Natural Language Understanding in Smartdialog: A Platform for Vietnamese Intelligent Interactions

Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval Pub Date : 2019-06-28 DOI: 10.1145/3342827.3342857

Nguyen Thi Thu Trang, Nguyen Hoang Ky, H. Sơn, N. T. Hung, Nguyễn Danh Huân

Abstract: Nowadays, interactive smart dialogs with text or voice are gaining traction as the main digital interaction channel between humans and machines. However, most current platforms do not support Vietnamese or are not fully developed for it. In this paper, the authors propose a smart conversational platform through a text channel and/or voice channel in the Vietnamese language, including these main steps: (i) Input Conversion and Pre-Processing, (ii) Entity Recognition, (iii) Intent Classification, (iv) Action Prediction and Execution, and (v) Output Generation. This paper focuses on presenting problems related to natural language understanding. To recognize entities in a sentence, the authors studied and optimized the features for Vietnamese with the Conditional Random Field model. For the problem of predicting user intent, this work proposed, experimented with, and compared Random Forest and BiLSTM deep learning models optimized for the Vietnamese language. A platform was built and deployed for the Milo smart speaker application (LUMI smart home) and the VADI driver virtual assistant with an accuracy of around 98.7%.

Citations: 0