NUT@EMNLP | Pub Date: 2017-09-01 | DOI: 10.18653/v1/W17-4422
Transfer Learning and Sentence Level Features for Named Entity Recognition on Tweets
Pius von Däniken, Mark Cieliebak
Abstract: We present our system for the WNUT 2017 Named Entity Recognition challenge on Twitter data. We describe two modifications of a basic neural network architecture for sequence tagging. First, we show how we exploit additional labeled data, where the Named Entity tags differ from the target task. Then, we propose a way to incorporate sentence level features. Our system uses both methods and ranked second for entity level annotations, achieving an F1-score of 40.78, and second for surface form annotations, achieving an F1-score of 39.33.
NUT@EMNLP | Pub Date: 2017-09-01 | DOI: 10.18653/v1/W17-4412
Noisy Uyghur Text Normalization
Osman Tursun, Ruken Cakici
Abstract: Uyghur is the second largest and most actively used social media language in China. However, a non-negligible and growing share of the Uyghur text appearing in social media is written unsystematically in the Latin alphabet. Uyghur text in this format is ambiguous and often incomprehensible even to native Uyghur speakers, and it cannot readily be used to advance NLP tasks for the Uyghur language. Restoring such noisy, unsystematically Latinized Uyghur text is therefore essential both for protecting the language and for improving the accuracy of Uyghur NLP tasks. To this end, we propose and compare two normalization methods: a noisy channel model and a neural encoder-decoder model.
NUT@EMNLP | Pub Date: 2017-09-01 | DOI: 10.18653/v1/W17-4408
A Dataset and Classifier for Recognizing Social Media English
Su Lin Blodgett, Johnny Wei, Brendan T. O'Connor
Abstract: While language identification works well on standard texts, it performs much worse on social media language, in particular dialectal language—even for English. First, to support work on English language identification, we contribute a new dataset of tweets annotated for English versus non-English, with attention to ambiguity, code-switching, and automatic generation issues. It is randomly sampled from all public messages, avoiding biases towards pre-existing language classifiers. Second, we find that a demographic language model—which identifies messages with language similar to that used by several U.S. ethnic populations on Twitter—can be used to improve English language identification performance when combined with a traditional supervised language identifier. It increases recall with almost no loss of precision, including, surprisingly, for English messages written by non-U.S. authors. Our dataset and identifier ensemble are available online.
NUT@EMNLP | Pub Date: 2017-09-01 | DOI: 10.18653/v1/W17-4424
A Feature-based Ensemble Approach to Recognition of Emerging and Rare Named Entities
Utpal Kumar Sikdar, Björn Gambäck
Abstract: Detecting previously unseen named entities in text is a challenging task. The paper describes how three initial classifier models were built using Conditional Random Fields (CRFs), Support Vector Machines (SVMs) and a Long Short-Term Memory (LSTM) recurrent neural network. The outputs of these three classifiers were then used as features to train another CRF classifier working as an ensemble. 5-fold cross-validation based on training and development data for the emerging and rare named entity recognition shared task showed precision, recall and F1-score of 66.87%, 46.75% and 54.97%, respectively. For surface form evaluation, the CRF ensemble-based system achieved precision, recall and F1-scores of 65.18%, 45.20% and 53.30%. When applied to unseen test data, the model reached 47.92% precision, 31.97% recall and 38.55% F1-score for entity level evaluation, with the corresponding surface form evaluation values of 44.91%, 30.47% and 36.31%.
NUT@EMNLP | Pub Date: 2017-09-01 | DOI: 10.18653/v1/W17-4406
Incorporating Metadata into Content-Based User Embeddings
Linzi Xing, Michael J. Paul
Abstract: Low-dimensional vector representations of social media users can benefit applications like recommendation systems and user attribute inference. Recent work has shown that user embeddings can be improved by combining different types of information, such as text and network data. We propose a data augmentation method that allows novel feature types to be used within off-the-shelf embedding models. Experimenting with the task of friend recommendation on a dataset of 5,019 Twitter users, we show that our approach can lead to substantial performance gains with the simple addition of network and geographic features.
NUT@EMNLP | Pub Date: 2017-09-01 | DOI: 10.18653/v1/W17-4410
The Effect of Error Rate in Artificially Generated Data for Automatic Preposition and Determiner Correction
Fraser Bowen, Jon Dehdari, Josef van Genabith
Abstract: In this research we investigate the impact of mismatches in the density and type of error between training and test data on a neural system correcting preposition and determiner errors. We use synthetically produced training data to control error density and type, and “real” error data for testing. Our results show it is possible to combine error types, although prepositions and determiners behave differently in terms of how much error should be artificially introduced into the training data in order to get the best results.
NUT@EMNLP | Pub Date: 2017-09-01 | DOI: 10.18653/v1/W17-4420
Distributed Representation, LDA Topic Modelling and Deep Learning for Emerging Named Entity Recognition from Social Media
Patrick Jansson, Shuhua Liu
Abstract: This paper reports our participation in the W-NUT 2017 shared task on emerging and rare entity recognition from user-generated noisy text such as tweets, online reviews and forum discussions. To accomplish this challenging task, we explore an approach that combines LDA topic modelling with deep learning on word-level and character-level embeddings. The LDA topic modelling generates a topic representation for each tweet, which is used as a feature for each word in the tweet. The deep learning component consists of a two-layer bidirectional LSTM and a CRF output layer. Our submitted system achieved an F1-score of 39.98 on entities and 37.77 on surface forms. New experiments after submission reached a best performance of 41.81 on entities and 40.57 on surface forms.
NUT@EMNLP | Pub Date: 2017-09-01 | DOI: 10.18653/v1/W17-4403
Churn Identification in Microblogs using Convolutional Neural Networks with Structured Logical Knowledge
Mourad Gridach, Hatem Haddad, Hala Mulki
Abstract: For brands, gaining a new customer is more expensive than keeping an existing one, which makes retaining customers increasingly challenging. Churn happens when a customer leaves a brand for a competitor. Most previous work addresses churn prediction using Call Detail Records (CDRs). In this paper, we instead use micro-posts to classify customers as churny or non-churny. We explore convolutional neural networks (CNNs), since they have achieved state-of-the-art results in various computer vision and NLP applications. However, end-to-end models have limitations, such as requiring large amounts of labeled data and lacking interpretability. We investigate CNNs augmented with structured logic rules to reduce these issues. Our system, Churn_teacher, uses an iterative distillation method that transfers the knowledge extracted from a combination of just three logic rules directly into the weights of the network. We also use weight normalization to speed up training of our convolutional neural networks. Experimental results show that with just these three rules we achieve state-of-the-art results on a publicly available Twitter dataset covering three telecom brands.
NUT@EMNLP | Pub Date: 2017-09-01 | DOI: 10.18653/v1/W17-4402
Towards the Understanding of Gaming Audiences by Modeling Twitch Emotes
Francesco Barbieri, Luis Espinosa Anke, Miguel Ballesteros, Juan Soler, Horacio Saggion
Abstract: Videogame streaming platforms have become a prime example of noisy user-generated text. These are websites where gameplay is broadcast and viewers interact via integrated chatrooms. Probably the best-known platform of this kind is Twitch, which has more than 100 million monthly viewers. Despite these numbers, and unlike other platforms featuring short messages (e.g. Twitter), Twitch has not received much attention from the Natural Language Processing community. In this paper we aim to bridge this gap by proposing two tasks specific to the Twitch platform: (1) emote prediction and (2) trolling detection. In our experiments, we evaluate three models: a bag-of-words baseline, a supervised logistic classifier based on word embeddings, and a bidirectional long short-term memory recurrent neural network (LSTM). Our results show that the LSTM model outperforms the other two models, in which explicit features with proven effectiveness for similar tasks were encoded.
NUT@EMNLP | Pub Date: 2017-07-13 | DOI: 10.18653/v1/W17-4417
Lithium NLP: A System for Rich Information Extraction from Noisy User Generated Text on Social Media
P. Bhargava, Nemanja Spasojevic, Guoning Hu
Abstract: In this paper, we describe the Lithium Natural Language Processing (NLP) system, a resource-constrained, high-throughput and language-agnostic system for information extraction from noisy user-generated text on social media. Lithium NLP extracts a rich set of information, including entities, topics, hashtags and sentiment, from text. We discuss several real-world applications of the system currently incorporated in Lithium products. We also compare our system with existing commercial and academic NLP systems in terms of performance, information extracted and languages supported. We show that Lithium NLP is on par with, and in some cases outperforms, state-of-the-art commercial NLP systems.