A. Hidayatullah, Wisnu Kurniawan, Chanifah Indah Ratnasari
{"title":"Topic Modeling on Indonesian Online Shop Chat","authors":"A. Hidayatullah, Wisnu Kurniawan, Chanifah Indah Ratnasari","doi":"10.1145/3342827.3342831","DOIUrl":"https://doi.org/10.1145/3342827.3342831","url":null,"abstract":"This paper aims to discover topics from an Indonesian online shop chat. Moreover, we employed Latent Dirichlet Allocation to find out what kind of topics that are often discussed and conversation trends between buyers and customer service. Several tasks were performed, such as, collecting data, preprocessing, phrase aggregation, topic modeling, and topic analysis. We found several attracting findings during our experiments. In preprocessing task, product name extraction from URLs assisted to discover the intended product from the customer's conversation. On the other hand, the phrase aggregation task helped us to merge various terms which have same intended meaning, so that, we could obtain better topical model result and easier to determine the topic label.","PeriodicalId":254461,"journal":{"name":"Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126673506","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluation of Pseudo-Relevance Feedback using Wikipedia","authors":"Murtadha Aljubran","doi":"10.1145/3342827.3342845","DOIUrl":"https://doi.org/10.1145/3342827.3342845","url":null,"abstract":"Users have specific information needs which are expressed in short queries to information retrieval systems. The queries are unstructured, and they tend to be short and ambiguous in most cases. Using the shallow language statistics including probabilistic or language models such as BM25 or Indri respectively can enhance the retrieval system metrics like Mean Average Precision (MAP). However, such methods depend on query terms and their presence in the retrieved document to define relevance. Query expansion is a technique that can be used to overcome this problem by expanding the query with terms from an initial top few relevant documents. The question that we try to answer is whether the quality of the corpus used for expansion produce a significant improvement MAP and precision at top 30 retrieved documents. We show that the quality and the selection criteria of expansion documents are important factors in query expansion performance.","PeriodicalId":254461,"journal":{"name":"Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129201296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep Speaker Embedding for Speaker-Targeted Automatic Speech Recognition","authors":"Guan-Lin Chao, John Paul Shen, Ian Lane","doi":"10.1145/3342827.3342847","DOIUrl":"https://doi.org/10.1145/3342827.3342847","url":null,"abstract":"In this work, we investigate three types of deep speaker embedding as text-independent features for speaker-targeted speech recognition in cocktail party environments. The text-independent speaker embedding is extracted from the target speaker's existing speech segment (i-vector and x-vector) or face image (f-vector), which is concatenated with acoustic features of any new speech utterances as input features. Since the proposed model extracts the speaker embedding of the target speaker once and for all, it is computationally more efficient than many prior approaches which estimate the target speaker's characteristics on the fly. Empirical evaluation shows that using speaker embedding along with acoustic features improves Word Error Rate over the audio-only model, from 65.7% to 29.5%. Among the three types of speaker embedding, x-vector and f-vector show robustness against environment variations while i-vector tends to overfit to the specific speaker and environment condition.","PeriodicalId":254461,"journal":{"name":"Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131659915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Task-oriented Chatbot Based on LSTM and Reinforcement Learning","authors":"Tai-Liang Chou, Yu-Ling Hsueh","doi":"10.1145/3342827.3342844","DOIUrl":"https://doi.org/10.1145/3342827.3342844","url":null,"abstract":"Traditional conversational chatbots usually adopt a retrieved-based model. Developers have to provide a large amount of conversational data and classify those data to different intents. To avoid cumbersome development processes, we propose a method to build a chatbot by a sentence generation model which generates sequence sentences based on the generative adversarial network. The architecture of our model contains a generator that generates a diverse sentence, and a discriminator that judges the sentences between the generated and the raw data. In the generator, we combine the attention model that responses for tracking conversational states with the sequence-to-sequence model using hierarchical long-short term memory to extract sentence information. For the discriminator, we calculate twotypes of rewards to assign low rewards for repeated sentences and high rewards for diverse sentences. Extensive experiments are presented to demonstrate the utility of our model which generates more diverse and information-rich sentences than those of the existing approaches.","PeriodicalId":254461,"journal":{"name":"Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval","volume":"74 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124649232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Building the Language Resource for a Cebuano-Filipino Neural Machine Translation System","authors":"Kristine Mae M. Adlaon, N. Marcos","doi":"10.1145/3342827.3342833","DOIUrl":"https://doi.org/10.1145/3342827.3342833","url":null,"abstract":"Parallel corpus is a critical resource in machine learning based translation. The task of collecting, extracting, and aligning texts in order to build an acceptable corpus for doing translation is very tedious most especially for low-resource languages. In this paper, we present the efforts made to build a parallel corpus for Cebuano and Filipino from two different domains: biblical texts and the web. For the biblical resource, subword unit translation for verbs and copy-able approach for nouns were applied to correct inconsistencies in translation. This correction mechanism was applied as a preprocessing technique. On the other hand, for Wikipedia being the main web resource, commonly occurring topic segments were extracted from both the source and the target languages. These observed topic segments are unique in 4 different categories. The identification of these topic segments may be used for automatic extraction of sentences. A Recurrent Neural Network was used to implement the translation using OpenNMT sequence modeling tool in TensorFlow. The two different corpora were then evaluated by using them as two separate inputs in the neural network. Results have shown a difference in BLEU score in both corpora.","PeriodicalId":254461,"journal":{"name":"Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval","volume":"113 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126710273","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Text Compression for Myanmar Information Retrieval","authors":"N. Lin, A. KudinovVitaly, Y. Soe","doi":"10.1145/3342827.3342830","DOIUrl":"https://doi.org/10.1145/3342827.3342830","url":null,"abstract":"Myanmar word segmentation is an important task for construction of dictionary file for Myanmar information retrieval and Myanmar text compression. Although Myanmar word segmentation using dictionary and orthography has been existed for Myanmar language, the performance of word segmentation depends on the coverage of the dictionary and training dataset and can cause out of vocabulary (OOV) problem, leading to lower precision and recall in information retrieval. And to compress Myanmar text, words in text needs to be recognized first. In this paper, we propose a new method for Myanmar word segmentation by local statistical dataset without the use of any additional data (e.g., training corpus) and new compressed Myanmar Information Retrieval (MIR) model which used End Tagged Dense Code (ETDC) text compressed method. The experimental results showed that the method can improve evaluation of vocabulary file with precision 75%, recall 87%, F-measure 80% and average compression ratio is 32% of texts for Myanmar language.","PeriodicalId":254461,"journal":{"name":"Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125993827","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Applicability of Text-representing Centroids for Thai Language Documents","authors":"Sureeporn Nualnim, Nirach Romyen, M. Sodanil","doi":"10.1145/3342827.3342853","DOIUrl":"https://doi.org/10.1145/3342827.3342853","url":null,"abstract":"Text-representing centroids are investigated method recently used to categorize and compare documents written in European languages. As it will be shown, Asian languages and in particular Thai exhibit completely other language structures. Nevertheless, a strong justification will be given that the methodology of the text-representing centroids can be successfully applied to Thai documents, too. For the experiments, a corpus which contained 100 randomly selected articles from an offline Thai Wikipedia was used. The obtained centroids well reflect the topic of those documents as in the original publication. In addition, the centroids are quite suitable to compare any two files.","PeriodicalId":254461,"journal":{"name":"Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130076598","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Text Classification of Network Pyramid Scheme based on Topic Model","authors":"Pengyu Mu, Jingsha He, Nafei Zhu","doi":"10.1145/3342827.3342835","DOIUrl":"https://doi.org/10.1145/3342827.3342835","url":null,"abstract":"At present, the network pyramid scheme has become a major tumor that hinders social development. In order to curb the propagation of the network pyramid scheme and effectively identify the pyramid scheme text in the network, this study proposes a joint topic model, Paragraph Vector Latent Dirichlet Allocation (PV_LDA), based on the characteristics of high-yield, high rebate, hierarchical salary and text topic diversity described in the text. The model uses the paragraph as the minimum processing unit to generate the topic distribution matrix of \"high-interest rate\" and \"hierarchical salary\" from the network pyramid scheme text. The Gibbs sampling is used to derive the \"pyramid scheme\" topic distribution matrix represented by the two features, which is used for classification processing by the classifier. the classification accuracy rate for the network pyramid scheme text can reach 86.25%. The conclusions show that the topic model proposed in this paper can capture the characteristics of the pyramid scheme more reasonably.","PeriodicalId":254461,"journal":{"name":"Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131029558","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using Sentiment Analysis for Comparing Attitudes between Computer Professionals and Laypersons on the Topic of Artificial Intelligence","authors":"Xueying Wang","doi":"10.1145/3342827.3342829","DOIUrl":"https://doi.org/10.1145/3342827.3342829","url":null,"abstract":"Most research in investigating computer professionals and laypersons' attitudes toward artificial intelligence (AI) are limited to online or offline surveys. This paper analyzes computer professionals' and laypersons' attitudes toward AI by using a sentiment lexicon developed by Wilson et al. To explore whether there is a correlation between the occupation categories (computer-related versus non-computer-related occupations) and people's attitudes toward artificial intelligence, I conducted a polarity classification of over 0.6 million tweets containing references to \"AI\", \"artificial intelligence\", or both. The result did not provide evidence of a relationship between public attitudes toward AI and the occupation categories. In the end, several future directions in the data collection and the data analysis are discussed.","PeriodicalId":254461,"journal":{"name":"Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117242519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Novel Task-Oriented Text Corpus in Silent Speech Recognition and its Natural Language Generation Construction Method","authors":"Dong Cao, Dongdong Zhang, Haibo Chen","doi":"10.1145/3342827.3342838","DOIUrl":"https://doi.org/10.1145/3342827.3342838","url":null,"abstract":"Millions of people with severe speech disorders around the world may regain their communication capabilities through techniques of silent speech recognition (SSR). Using electroencephalography (EEG) as a biomarker for speech decoding has been popular for SSR. However, the lack of SSR text corpus has impeded the development of this technique. Here, we construct a novel task-oriented text corpus, which is utilized in the field of SSR. In the process of construction, we propose a task-oriented hybrid construction method based on natural language generation (NLG) algorithm. The algorithm focuses on the strategy of data-to-text generation, and has two advantages including linguistic quality and high diversity. These two advantages use template-based method and deep neural networks respectively. In an SSR experiment with the generated text corpus, analysis results show that the performance of our hybrid construction method outperforms the pure method such as template-based natural language generation or neural natural language generation models.","PeriodicalId":254461,"journal":{"name":"Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133856133","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}