{"title":"Catapa Resume Parser: End to End Indonesian Resume Extraction","authors":"Berty Chrismartin Lumban Tobing, Immanuel Rhesa Suhendra, Christian Halim","doi":"10.1145/3342827.3342832","DOIUrl":"https://doi.org/10.1145/3342827.3342832","url":null,"abstract":"This paper proposes a method to solve the problem of extracting contents from a resume, especially for Indonesian resumes using segmentation method by header followed by models for each corresponding headers. An end to end resume extraction system is created using some heuristic rules and machine learning algorithms to solve the problem. On average, an accuracy of ~91.41% is achieved for personal information entities (name, email, phone, gender, date of birth, and religion), ~68.47% accuracy for job experiences entities (company, job title, start date, and end date), and ~80.85% accuracy for educations entities (institution, major, level, start date, end date, and GPA) out of 221 random resumes using the aforementioned method.","PeriodicalId":254461,"journal":{"name":"Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122577445","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Hybrid Method for Vietnamese Text Normalization","authors":"Nguyen Thi Thu Trang, Dang Xuan Bach, N. X. Tung","doi":"10.1145/3342827.3342851","DOIUrl":"https://doi.org/10.1145/3342827.3342851","url":null,"abstract":"This paper presents a hybrid method for normalizing written text often found on newspapers to its spoken form. To normalize raw text with a number of non-standard words (NSWs), a two-step model is proposed. The first step involves classifying NSWs into different categories using Random Forest. The latter one is to expand them, depending on their NSW types, into pronounceable syllables using a hybrid method. Most of numeric types can be expanded by well-defined rules while most of alphabetic ones must be expanded by a deep learning (i.e. sequence-to-sequence) model and a post adjustment. The experiment on a Vietnamese corpus with proposed NSW categories shows that the most ambiguous cases of the classification model are for abbreviation and read-as-sequence types, hence combined into one category for the latter expansion with more complex model and better context. The classification model gives an enhanced result of 99.20% with the category combination and the feature optimization. In the expansion, the sequence-to-sequence model shows a good result of 96.53% for abbreviations and 96.25% for loanwords with a post-adjustment for some completely wrong cases. This model can predict effectively the expansions of abbreviations in context.","PeriodicalId":254461,"journal":{"name":"Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129772251","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Authorship Attribution of Russian Forum Posts with Different Types of N-gram Features","authors":"T. Litvinova, O. Litvinova, Polina Panicheva","doi":"10.1145/3342827.3342834","DOIUrl":"https://doi.org/10.1145/3342827.3342834","url":null,"abstract":"Authorship attribution is an important field in online security. Recently there have been numerous successful works in authorship attribution in various European languages. Character n-grams are reported to be the best choice in authorship attribution, as they encode both style and content information. We evaluate different types of character n-gram features in an authorship attribution task in a real-world noisy dataset of Russian forum posts. We also supplement them with a number of new simple n-gram features capturing syntactic and discourse patterns. We perform authorship attribution in a single-topic and a cross-topic setting, as the research question is whether character n-grams capture both style and content information. Our results show that character n-grams are indeed very successful in Russian forum post authorship attribution. However, there is no clear distinction of style and content n-grams, as the same types of n-grams work well for both single-topic and cross-topic settings. In our experiments the generalized simple n-gram features which reveals syntactic and discourse patterns were proved to be also very important in authorship attribution of short informal Russian texts. They represent a different kind of authorship information and are a successful addition to the character n-grams in authorship attribution of forum texts in the Russian language.","PeriodicalId":254461,"journal":{"name":"Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121688805","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Is It Possible to Use Chatbot for the Chinese Word Segmentation?","authors":"K. Chang, Hsien-Tsung Chang","doi":"10.1145/3342827.3342836","DOIUrl":"https://doi.org/10.1145/3342827.3342836","url":null,"abstract":"A word is the smallest item in Natural Language Processing. However, there is no obvious boundary for Chinese words. How to segment Chinese words always obstructs Chinese researches and applications. Nowadays, a neural network model, Seq2Seq with LSTM, is well-known for translation or chatbot application. In this paper, we try to transform the Chinese word segmentation problem into a translation problem. And we utilized an open-source chatbot to simulate the translation task. In our experimental results, we can produce similar Chinese word segmentation results when we provide training data which is automatically generated from famous Chinese word segmentation services.","PeriodicalId":254461,"journal":{"name":"Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124444381","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Vietnamese WordNet using word embedding","authors":"Khang Nhut Lam, Tuan Huynh To, Thong Tri Tran, J. Kalita","doi":"10.1145/3342827.3342854","DOIUrl":"https://doi.org/10.1145/3342827.3342854","url":null,"abstract":"This paper presents a simple but effective method to improve the quality of WordNet synsets and extract glosses for synsets. We translate the Princeton WordNet and other intermediate WordNets to a target language using a machine translator, then the correct candidates are selected by applying different ranking methods: occurrence count, cosine similarity between words, cosine similarity between word embeddings and cosine similarity between Doc2Vec of sentences. Our approaches may be applicable to build WordNets in any language which has some bilingual dictionaries and at least a monolingual corpus in the target language.","PeriodicalId":254461,"journal":{"name":"Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114720256","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Guideline for Academic Support of Student Career Path Using Mining Algorithm","authors":"M. Sodanil, Saranlita Chotirat, L. Poomhiran, Kanchana Viriyapant","doi":"10.1145/3342827.3342841","DOIUrl":"https://doi.org/10.1145/3342827.3342841","url":null,"abstract":"In general, higher education is an important step in preparing a career for students in the future. Graduates should have qualifications that are recognized by both entrepreneurs and society. Therefore, every higher educational institution should make an effort to consider how to assist students' performance. This research aims to analyze the relationships between courses that are likely to produce a future career for students using the Apriori algorithm. The data used in the operation of the association rule was the student's grades from 25 main courses in the field of information technology, Department of Information Technology, Faculty of Science and Technology, Suan Sunandha Rajabhat University. This data was recorded between 2011 and 2019 and stored in the registration and graduate career system. The 14 association rules were determined from the operation by using the Weka 3.8.3 data mining software, this indicated that there were a few courses in which students could have future careers. Most importantly, the results can contribute to guidelines for the academic support of students' future career.","PeriodicalId":254461,"journal":{"name":"Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval","volume":"146-147 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124044282","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analysis of Native and Non-native Speakers' English Compositions based on Word-frequency Distribution and Text Statistics","authors":"H. Tsubaki","doi":"10.1145/3342827.3342856","DOIUrl":"https://doi.org/10.1145/3342827.3342856","url":null,"abstract":"In this paper, word-frequency distribution of JACET 8000 basic words and text statistics were researched to compare and analyze differentials of English compositions (essays) written by native speakers and non-native speakers. As for the native speakers' essays, the Guiraud Index in each Level 2-8 to Average sentence length and Automated Readability Index had higher correlation coefficients. Meanwhile, on the non-native speakers' essays, the index values to Sentence count showed moderate correlation coefficients. It was observed that the productivity and readability of the compositions seem to depend on ranges of basic content words which native or non-native writers have acquired and can use in English. To verify the word-frequency distribution as proficiency rating measurement for non-native speakers, the estimation experiment was carried out based on a multiple-regression model using word-frequency distribution of 68 English compositions written by the non-native writers. The estimated scores of the learners showed a correlation score 0.475 to their actual TOEIC scores. These results confirmed the possibility of the word usage statistics for the objective evaluation of L2 (second language) learners' language proficiency.","PeriodicalId":254461,"journal":{"name":"Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127114961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluation of Morphological Embeddings for the Russian Language","authors":"V. Romanov, A. Khusainova","doi":"10.1145/3342827.3342846","DOIUrl":"https://doi.org/10.1145/3342827.3342846","url":null,"abstract":"A number of morphology-based word embedding models were introduced in recent years. However, their evaluation was mostly limited to English, which is known to be a morphologically simple language. In this paper, we explore whether and to what extent incorporating morphology into word embeddings improves performance on downstream NLP tasks, in the case of morphologically rich Russian language. NLP tasks of our choice are POS tagging, Chunking, and NER -- for Russian language, all can be mostly solved using only morphology without understanding the semantics of words. Our experiments show that morphology-based embeddings trained with Skipgram objective do not outperform existing embedding model -- FastText. Moreover, a more complex, but morphology unaware model, BERT, allows to achieve significantly greater performance on the tasks that presumably require understanding of a word's morphology.","PeriodicalId":254461,"journal":{"name":"Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126070043","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HWE: Hybrid Word Embeddings For Text Classification","authors":"Xuebo Song, P. Srimani, James Ze Wang","doi":"10.1145/3342827.3342837","DOIUrl":"https://doi.org/10.1145/3342827.3342837","url":null,"abstract":"Text classification is one of the most important tasks in natural language processing and information retrieval due to the increasing availability of documents in digital form and the ensuing need to access them in flexible ways. By assigning documents to labeled classes, text classification can reduce the search space and expedite the process of retrieving relevant documents. In this paper, we propose a novel text representation method, Hybrid Word Embeddings (HWE), which combines semantic information obtained fromWord- Net and contextual information extracted from text documents to provide concise and accurate representations of text documents. The proposed HWE method can improve the efficiency of deriving word semantics from text by taking advantage of the semantic relationships extracted from WordNet with less training corpus. Experimental study on classification of documents shows that the proposed HWE outperforms existing methods, including Doc2Vec and Word2Vec, in terms of classification accuracy, recall, precision, etc.","PeriodicalId":254461,"journal":{"name":"Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116206202","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Natural Language Understanding in Smartdialog: A Platform for Vietnamese Intelligent Interactions","authors":"Nguyen Thi Thu Trang, Nguyen Hoang Ky, H. Sơn, N. T. Hung, Nguyễn Danh Huân","doi":"10.1145/3342827.3342857","DOIUrl":"https://doi.org/10.1145/3342827.3342857","url":null,"abstract":"Nowadays in the modern world, interactive smart dialogs with text or voice are gaining traction as the main digital interaction channel between human and machine. However, most of the current platforms do not support or have not fully developed for Vietnamese. In this paper, the authors propose a smart conversational platform through a text channel and/or voice channel in Vietnamese language, including these main steps: (i) Input Conversion and Pre-Processing, (ii) Entity Recognition, (iii) Intent Classification, (iv) Action Prediction and Execution, and (v) Output Generation. This paper focuses on presenting problems related to natural language understanding. To recognize entities in a sentence, the authors studied and optimized the features for Vietnamese with the Conditional Random Field model. With the problem of predicting user intent, this work proposed, experimented, and compared of Random Forest and BiLSTM deep learning model to optimize for the Vietnamese language. A platform was built and deployed for Milo smart speaker application (LUMI smart home) and VADI driver virtual assistant with the accuracy of around 98.7%.","PeriodicalId":254461,"journal":{"name":"Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115209329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}