NEWS@IJCNLP | Pub Date: 2009-08-07 | DOI: 10.3115/1699705.1699748
Jana Kravalova, Z. Žabokrtský
{"title":"Czech Named Entity Corpus and SVM-based Recognizer","authors":"Jana Kravalova, Z. Žabokrtský","doi":"10.3115/1699705.1699748","DOIUrl":"https://doi.org/10.3115/1699705.1699748","url":null,"abstract":"This paper deals with recognition of named entities in Czech texts. We present a recently released corpus of Czech sentences with manually annotated named entities, in which a rich two-level classification scheme was used. There are around 6000 sentences in the corpus with roughly 33000 marked named entity instances. We use the data for training and evaluating a named entity recognizer based on the Support Vector Machine classification technique. The presented recognizer outperforms the results previously reported for NE recognition in Czech.","PeriodicalId":262513,"journal":{"name":"NEWS@IJCNLP","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115915212","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
NEWS@IJCNLP | Pub Date: 2009-08-07 | DOI: 10.3115/1699705.1699713
Martin Jansche, R. Sproat
{"title":"Named Entity Transcription with Pair n-Gram Models","authors":"Martin Jansche, R. Sproat","doi":"10.3115/1699705.1699713","DOIUrl":"https://doi.org/10.3115/1699705.1699713","url":null,"abstract":"We submitted results for each of the eight shared tasks. Except for Japanese name kanji restoration, which uses a noisy channel model, our Standard Run submissions were produced by generative long-range pair n-gram models, which we mostly augmented with publicly available data (either from LDC datasets or mined from Wikipedia) for the Non-Standard Runs.","PeriodicalId":262513,"journal":{"name":"NEWS@IJCNLP","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116958796","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
NEWS@IJCNLP | Pub Date: 2009-08-07 | DOI: 10.3115/1699705.1699746
M. G. A. Malik, L. Besacier, C. Boitet, P. Bhattacharyya
{"title":"A Hybrid Model for Urdu Hindi Transliteration","authors":"M. G. A. Malik, L. Besacier, C. Boitet, P. Bhattacharyya","doi":"10.3115/1699705.1699746","DOIUrl":"https://doi.org/10.3115/1699705.1699746","url":null,"abstract":"We report in this paper a novel hybrid approach for Urdu to Hindi transliteration that combines finite-state machine (FSM) based techniques with a statistical word language model based approach. The output from the FSM is filtered with the word language model to produce the correct Hindi output. The main problem handled is the omission of diacritical marks from the input Urdu text. Our system produces the correct Hindi output even when the crucial information in the form of diacritical marks is absent. The approach improves the accuracy of the transducer-only approach from 50.7% to 79.1%. The results reported show that performance can be improved using a word language model to disambiguate the output produced by the transducer-only approach, especially when diacritical marks are not present in the Urdu input.","PeriodicalId":262513,"journal":{"name":"NEWS@IJCNLP","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122244580","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
NEWS@IJCNLP | Pub Date: 2009-08-07 | DOI: 10.3115/1699705.1699734
Sara Noeman
{"title":"Language Independent Transliteration System Using Phrase-based SMT Approach on Substrings","authors":"Sara Noeman","doi":"10.3115/1699705.1699734","DOIUrl":"https://doi.org/10.3115/1699705.1699734","url":null,"abstract":"Every day the newswire introduces events from all over the world, highlighting new names of persons, locations and organizations with different origins. These names appear as Out of Vocabulary (OOV) words for machine translation, cross-lingual information retrieval, and many other NLP applications. One way to deal with OOV words is to transliterate the unknown words, that is, to render them in the orthography of the second language. We introduce a statistical approach for transliteration using only the bilingual resources released in the shared task and without any previous knowledge of the target languages. Mapping the transliteration problem to the machine translation problem, we make use of the phrase-based SMT approach and apply it to substrings of names. In the English to Russian task, we report ACC (Accuracy in top-1) of 0.545, Mean F-score of 0.917, and MRR (Mean Reciprocal Rank) of 0.596. Due to time constraints, we ran a single experiment in the English to Chinese task, reporting ACC, Mean F-score, and MRR of 0.411, 0.737, and 0.464 respectively. Finally, it is worth mentioning that the system is language independent since the author is not familiar with either of the languages used in the experiments.","PeriodicalId":262513,"journal":{"name":"NEWS@IJCNLP","volume":"10 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120883249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
NEWS@IJCNLP | Pub Date: 2009-08-07 | DOI: 10.3115/1699705.1699743
Masatoshi Tsuchiya, Shoko Endo, S. Nakagawa
{"title":"Analysis and Robust Extraction of Changing Named Entities","authors":"Masatoshi Tsuchiya, Shoko Endo, S. Nakagawa","doi":"10.3115/1699705.1699743","DOIUrl":"https://doi.org/10.3115/1699705.1699743","url":null,"abstract":"This paper focuses on the change of named entities over time and its influence on the performance of the named entity tagger. First, we analyze Japanese named entities which appear in Mainichi Newspaper articles published in 1995, 1996, 1997, 1998 and 2005. This analysis reveals that the number of named entity types and the number of named entity tokens are almost steady over time and that 70-80% of named entity types in a certain year occur in the articles published either in its succeeding year or in its preceding year. These facts imply that 20-30% of named entity types are replaced with new ones every year. Experiments on these texts show that our proposed semi-supervised method, which combines a small annotated corpus and a large unannotated corpus for training, works robustly, whereas the traditional supervised method is fragile against changes in the named entity distribution.","PeriodicalId":262513,"journal":{"name":"NEWS@IJCNLP","volume":"187 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114960807","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
NEWS@IJCNLP | Pub Date: 2009-08-07 | DOI: 10.3115/1699705.1699719
A. Finch, E. Sumita
{"title":"Transliteration by Bidirectional Statistical Machine Translation","authors":"A. Finch, E. Sumita","doi":"10.3115/1699705.1699719","DOIUrl":"https://doi.org/10.3115/1699705.1699719","url":null,"abstract":"The system presented in this paper uses phrase-based statistical machine translation (SMT) techniques to directly transliterate between all language pairs in this shared task. The technique makes no language-specific assumptions and uses no dictionaries or explicit phonetic information. The translation process transforms sequences of tokens in the source language directly into sequences of tokens in the target. All language pairs were transliterated by applying this technique in a single unified manner. The machine translation system used was composed of two phrase-based SMT decoders. The first generated the target from the first token to the last. The second generated the target from last to first. Our results show that if only one of these decoding strategies is to be chosen, the optimal choice depends on the languages involved, and that in general a combination of the two approaches is able to outperform either approach.","PeriodicalId":262513,"journal":{"name":"NEWS@IJCNLP","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114836521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
NEWS@IJCNLP | Pub Date: 2009-08-07 | DOI: 10.3115/1699705.1699721
Dipankar Bose, S. Sarkar
{"title":"Learning Multi Character Alignment Rules and Classification of Training Data for Transliteration","authors":"Dipankar Bose, S. Sarkar","doi":"10.3115/1699705.1699721","DOIUrl":"https://doi.org/10.3115/1699705.1699721","url":null,"abstract":"We address the issues of transliteration between Indian languages and English, especially for named entities. We use an EM algorithm to learn the alignment between the languages. We find that there are a lot of ambiguities in the rules mapping the characters in the source language to the corresponding characters in the target language. Some of these ambiguities can be handled by capturing context through multi-character alignments and character n-gram models. We observed that a word in the source script may actually have originated from different languages. Instead of learning one model for the language pair, we propose that one may use multiple models and a classifier to decide which model to use. A contribution of this work is that the models and classifiers are learned in a completely unsupervised manner. Using our system we were able to get quite accurate transliteration models.","PeriodicalId":262513,"journal":{"name":"NEWS@IJCNLP","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132351831","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
NEWS@IJCNLP | Pub Date: 2009-08-07 | DOI: 10.3115/1699705.1699739
Dayne Freitag, Zhiqiang Wang
{"title":"Name Transliteration with Bidirectional Perceptron Edit Models","authors":"Dayne Freitag, Zhiqiang Wang","doi":"10.3115/1699705.1699739","DOIUrl":"https://doi.org/10.3115/1699705.1699739","url":null,"abstract":"We report on our efforts as part of the NEWS 2009 Machine Transliteration Shared Task. We applied an orthographic perceptron character edit model that we have used previously for name transliteration, enhancing it in two ways: by ranking possible transliterations by the sum of their scores under two models, one trained to generate left-to-right and one right-to-left; and by constraining generated strings to be consistent with character bigrams observed in the respective language's training data. Our poor showing in the official evaluation was due to a bug in the script used to produce competition-compliant output. Subsequent evaluation shows that our approach yielded comparatively strong performance on all alphabetic language pairs we attempted.","PeriodicalId":262513,"journal":{"name":"NEWS@IJCNLP","volume":"280 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123269452","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
NEWS@IJCNLP | Pub Date: 2009-08-07 | DOI: 10.3115/1699705.1699738
Yilu Zhou
{"title":"Maximum n-Gram HMM-based Name Transliteration: Experiment in NEWS 2009 on English-Chinese Corpus","authors":"Yilu Zhou","doi":"10.3115/1699705.1699738","DOIUrl":"https://doi.org/10.3115/1699705.1699738","url":null,"abstract":"We propose an English-Chinese name transliteration system using a maximum N-gram Hidden Markov Model. To handle the special challenges of a language pair that combines an alphabet-based and a character-based script, we apply a two-phase transliteration model by building two HMMs, one between English and Chinese Pinyin and another between Chinese Pinyin and Chinese characters. Our model improves on the traditional HMM by assigning the longest prior translation sequence of syllables the largest weight. In our non-standard runs, we use a Web-mining module to boost the performance by adding online popularity information of candidate translations. The entire model does not rely on any dictionaries, and the probability tables are derived solely from the training corpus. In the NEWS 2009 experiment, our model achieved 0.462 Top-1 accuracy and 0.764 Mean F-score.","PeriodicalId":262513,"journal":{"name":"NEWS@IJCNLP","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130172772","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
NEWS@IJCNLP | Pub Date: 2009-08-07 | DOI: 10.3115/1699705.1699747
O. Kwong
{"title":"Graphemic Approximation of Phonological Context for English-Chinese Transliteration","authors":"O. Kwong","doi":"10.3115/1699705.1699747","DOIUrl":"https://doi.org/10.3115/1699705.1699747","url":null,"abstract":"Although direct orthographic mapping has been shown to outperform phoneme-based methods in English-to-Chinese (E2C) transliteration, it is observed that phonological context plays an important role in resolving graphemic ambiguity. In this paper, we investigate the use of surface graphemic features to approximate local phonological context for E2C. In the absence of an explicit phonemic representation of the English source names, experiments show that the previous and next character of a given English segment could effectively capture the local context affecting its expected pronunciation, and thus its rendition in Chinese.","PeriodicalId":262513,"journal":{"name":"NEWS@IJCNLP","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130460985","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}