NUT@EMNLP, Pub Date: 2018-05-01, DOI: 10.18653/v1/W18-6112
Kemal Kurniawan, Samuel Louvan
{"title":"Empirical Evaluation of Character-Based Model on Neural Named-Entity Recognition in Indonesian Conversational Texts","authors":"Kemal Kurniawan, Samuel Louvan","doi":"10.18653/v1/W18-6112","DOIUrl":"https://doi.org/10.18653/v1/W18-6112","url":null,"abstract":"Despite the long history of the named-entity recognition (NER) task in the natural language processing community, previous work rarely studied the task on conversational texts. Such texts are challenging because they contain a lot of word variations which increase the number of out-of-vocabulary (OOV) words. The high number of OOV words poses a difficulty for word-based neural models. Meanwhile, there is plenty of evidence for the effectiveness of character-based neural models in mitigating this OOV problem. We report an empirical evaluation of neural sequence labeling models with character embedding to tackle the NER task in Indonesian conversational texts. Our experiments show that (1) character models outperform word embedding-only models by up to 4 F1 points, (2) character models perform better in OOV cases with an improvement of as high as 15 F1 points, and (3) character models are robust against a very high OOV rate.","PeriodicalId":207795,"journal":{"name":"NUT@EMNLP","volume":"286 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114953197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
NUT@EMNLP, Pub Date: 2017-09-01, DOI: 10.18653/v1/W17-4407
Chris Emmery, Grzegorz Chrupała, Walter Daelemans
{"title":"Simple Queries as Distant Labels for Predicting Gender on Twitter","authors":"Chris Emmery, Grzegorz Chrupała, Walter Daelemans","doi":"10.18653/v1/W17-4407","DOIUrl":"https://doi.org/10.18653/v1/W17-4407","url":null,"abstract":"The majority of research on extracting missing user attributes from social media profiles uses costly hand-annotated labels for supervised learning. Distantly supervised methods exist, although these generally rely on knowledge gathered using external sources. This paper demonstrates the effectiveness of gathering distant labels for self-reported gender on Twitter using simple queries. We confirm the reliability of this query heuristic by comparing with manual annotation. Moreover, using these labels for distant supervision, we demonstrate competitive model performance on the same data as models trained on manual annotations. As such, we offer a cheap, extensible, and fast alternative that can be employed beyond the task of gender classification.","PeriodicalId":207795,"journal":{"name":"NUT@EMNLP","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117210764","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
NUT@EMNLP, Pub Date: 2017-09-01, DOI: 10.18653/v1/W17-4405
Anietie U Andy, Mark Dredze, M. Rwebangira, Chris Callison-Burch
{"title":"Constructing an Alias List for Named Entities during an Event","authors":"Anietie U Andy, Mark Dredze, M. Rwebangira, Chris Callison-Burch","doi":"10.18653/v1/W17-4405","DOIUrl":"https://doi.org/10.18653/v1/W17-4405","url":null,"abstract":"In certain fields, real-time knowledge from events can help in making informed decisions. In order to extract pertinent real-time knowledge related to an event, it is important to identify the named entities and their corresponding aliases related to the event. The problem of identifying aliases of named entities that spike has remained unexplored. In this paper, we introduce an algorithm, EntitySpike, that identifies entities that spike in popularity in tweets from a given time period, and constructs an alias list for these spiked entities. EntitySpike uses a temporal heuristic to identify named entities with similar context that occur in the same time period (within minutes) during an event. Each entity is encoded as a vector using this temporal heuristic. We show how these entity-vectors can be used to create a named entity alias list. We evaluated our algorithm on a dataset of temporally ordered tweets from a single event, the 2013 Grammy Awards show. We carried out various experiments on tweets that were published in the same time period and show that our algorithm identifies most entity name aliases and outperforms a competitive baseline.","PeriodicalId":207795,"journal":{"name":"NUT@EMNLP","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125043364","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
NUT@EMNLP, Pub Date: 2017-09-01, DOI: 10.18653/v1/W17-4419
Gustavo Aguilar, Suraj Maharjan, Adrian Pastor Lopez-Monroy, T. Solorio
{"title":"A Multi-task Approach for Named Entity Recognition in Social Media Data","authors":"Gustavo Aguilar, Suraj Maharjan, Adrian Pastor Lopez-Monroy, T. Solorio","doi":"10.18653/v1/W17-4419","DOIUrl":"https://doi.org/10.18653/v1/W17-4419","url":null,"abstract":"Named Entity Recognition for social media data is challenging because of its inherent noisiness. In addition to improper grammatical structures, it contains spelling inconsistencies and numerous informal abbreviations. We propose a novel multi-task approach by employing a more general secondary task of Named Entity (NE) segmentation together with the primary task of fine-grained NE categorization. The multi-task neural network architecture learns higher order feature representations from word and character sequences along with basic Part-of-Speech tags and gazetteer information. This neural network acts as a feature extractor to feed a Conditional Random Fields classifier. We were able to obtain the first position in the 3rd Workshop on Noisy User-generated Text (WNUT-2017) with a 41.86% entity F1-score and a 40.24% surface F1-score.","PeriodicalId":207795,"journal":{"name":"NUT@EMNLP","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125999368","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
NUT@EMNLP, Pub Date: 2017-09-01, DOI: 10.18653/v1/W17-4409
Bahar Salehi, Anders Søgaard
{"title":"Evaluating hypotheses in geolocation on a very large sample of Twitter","authors":"Bahar Salehi, Anders Søgaard","doi":"10.18653/v1/W17-4409","DOIUrl":"https://doi.org/10.18653/v1/W17-4409","url":null,"abstract":"Recent work in geolocation has made several hypotheses about what linguistic markers are relevant to detect where people write from. In this paper, we examine six hypotheses against a corpus consisting of all geo-tagged tweets from the US, or whose geo-tags could be inferred, in a 19% sample of Twitter history. Our experiments lend support to all six hypotheses, including that spelling variants and hashtags are strong predictors of location. We also study what kinds of common nouns are predictive of location after controlling for named entities such as dolphins or sharks.","PeriodicalId":207795,"journal":{"name":"NUT@EMNLP","volume":"2128 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129974432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
NUT@EMNLP, Pub Date: 2017-09-01, DOI: 10.18653/v1/W17-4415
Bahar Salehi, Dirk Hovy, E. Hovy, Anders Søgaard
{"title":"Huntsville, hospitals, and hockey teams: Names can reveal your location","authors":"Bahar Salehi, Dirk Hovy, E. Hovy, Anders Søgaard","doi":"10.18653/v1/W17-4415","DOIUrl":"https://doi.org/10.18653/v1/W17-4415","url":null,"abstract":"Geolocation is the task of identifying a social media user’s primary location, and in natural language processing, there is a growing literature on to what extent automated analysis of social media posts can help. However, not all content features are equally revealing of a user’s location. In this paper, we evaluate nine named entity (NE) types. Using various metrics, we find that GEO-LOC, FACILITY and SPORT-TEAM are more informative for geolocation than other NE types. Using these types, we improve geolocation accuracy and reduce distance error over various famous text-based methods.","PeriodicalId":207795,"journal":{"name":"NUT@EMNLP","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130070053","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
NUT@EMNLP, Pub Date: 2017-09-01, DOI: 10.18653/v1/W17-4421
Bill Yuchen Lin, Frank F. Xu, Zhiyi Luo, Kenny Q. Zhu
{"title":"Multi-channel BiLSTM-CRF Model for Emerging Named Entity Recognition in Social Media","authors":"Bill Yuchen Lin, Frank F. Xu, Zhiyi Luo, Kenny Q. Zhu","doi":"10.18653/v1/W17-4421","DOIUrl":"https://doi.org/10.18653/v1/W17-4421","url":null,"abstract":"In this paper, we present our multi-channel neural architecture for recognizing emerging named entities in social media messages, which we applied in the Novel and Emerging Named Entity Recognition shared task at the EMNLP 2017 Workshop on Noisy User-generated Text (W-NUT). We propose a novel approach, which incorporates comprehensive word representations with multi-channel information and Conditional Random Fields (CRF) into a traditional Bidirectional Long Short-Term Memory (BiLSTM) neural network without using any additional hand-crafted features such as gazetteers. In comparison with other systems participating in the shared task, our system won 2nd place.","PeriodicalId":207795,"journal":{"name":"NUT@EMNLP","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130408228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
NUT@EMNLP, Pub Date: 2017-09-01, DOI: 10.18653/v1/W17-4423
J. Williams, Giovanni C. Santia
{"title":"Context-Sensitive Recognition for Emerging and Rare Entities","authors":"J. Williams, Giovanni C. Santia","doi":"10.18653/v1/W17-4423","DOIUrl":"https://doi.org/10.18653/v1/W17-4423","url":null,"abstract":"This paper is a shared task system description for the 2017 W-NUT shared task on Rare and Emerging Named Entities. Our paper describes the development and application of a novel algorithm for named entity recognition that relies only on the contexts of word forms. A comparison against the other submitted systems is provided.","PeriodicalId":207795,"journal":{"name":"NUT@EMNLP","volume":"170 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121040207","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
NUT@EMNLP, Pub Date: 2017-09-01, DOI: 10.18653/v1/W17-4418
Leon Derczynski, Eric Nichols, M. Erp, Nut Limsopatham
{"title":"Results of the WNUT2017 Shared Task on Novel and Emerging Entity Recognition","authors":"Leon Derczynski, Eric Nichols, M. Erp, Nut Limsopatham","doi":"10.18653/v1/W17-4418","DOIUrl":"https://doi.org/10.18653/v1/W17-4418","url":null,"abstract":"This shared task focuses on identifying unusual, previously-unseen entities in the context of emerging discussions. Named entities form the basis of many modern approaches to other tasks (like event clustering and summarization), but recall on them is a real problem in noisy text - even among annotators. This drop tends to be due to novel entities and surface forms. Take for example the tweet “so.. kktny in 30 mins?!” – even human experts find the entity ‘kktny’ hard to detect and resolve. The goal of this task is to provide a definition of emerging and of rare entities, and based on that, also datasets for detecting these entities. The task as described in this paper evaluated the ability of participating entries to detect and classify novel and emerging named entities in noisy text.","PeriodicalId":207795,"journal":{"name":"NUT@EMNLP","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122769511","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
NUT@EMNLP, Pub Date: 2017-09-01, DOI: 10.18653/v1/W17-4414
E. Flint, Elliot Ford, Olivia Thomas, Andrew Caines, P. Buttery
{"title":"A Text Normalisation System for Non-Standard English Words","authors":"E. Flint, Elliot Ford, Olivia Thomas, Andrew Caines, P. Buttery","doi":"10.18653/v1/W17-4414","DOIUrl":"https://doi.org/10.18653/v1/W17-4414","url":null,"abstract":"This paper investigates the problem of text normalisation; specifically, the normalisation of non-standard words (NSWs) in English. Non-standard words can be defined as those word tokens which do not have a dictionary entry, and cannot be pronounced using the usual letter-to-phoneme conversion rules; e.g. lbs, 99.3%, #EMNLP2017. NSWs pose a challenge to the proper functioning of text-to-speech technology, and the solution is to spell them out in such a way that they can be pronounced appropriately. We describe our four-stage normalisation system made up of components for detection, classification, division and expansion of NSWs. Performance is favourable compared to previous work in the field (Sproat et al. 2001, Normalization of non-standard words), as well as state-of-the-art text-to-speech software. Further, we update Sproat et al.’s NSW taxonomy, and create a more customisable system where users are able to input their own abbreviations and specify into which variety of English (currently available: British or American) they wish to normalise.","PeriodicalId":207795,"journal":{"name":"NUT@EMNLP","volume":"104 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124141243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}