{"title":"Splitting Katakana Noun Compounds by Paraphrasing and Back-transliteration","authors":"Nobuhiro Kaji, M. Kitsuregawa","doi":"10.5715/JNLP.21.897","DOIUrl":null,"url":null,"abstract":"Word boundaries within noun compounds are not marked by white spaces in a number of languages including Japanese, and it is beneficial for various NLP applications to split such noun compounds. In the case of Japanese, noun compounds made up of katakana words are particularly difficult to split, because katakana words are highly productive and are often out-of-vocabulary. To overcome this difficulty, we propose using paraphrases and back-transliteration of katakana noun compounds for splitting them. Experiments demonstrated that splitting accuracy is improved with a statistical significance by extracting both paraphrases and back-transliterations from unlabeled textual data, and then using that information for constructing splitting models.","PeriodicalId":16243,"journal":{"name":"Journal of Information Processing","volume":"21 1","pages":"897-920"},"PeriodicalIF":0.0000,"publicationDate":"2012-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Information Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.5715/JNLP.21.897","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"Computer Science","Score":null,"Total":0}
引用次数: 0
Abstract
Word boundaries within noun compounds are not marked by white spaces in a number of languages including Japanese, and it is beneficial for various NLP applications to split such noun compounds. In the case of Japanese, noun compounds made up of katakana words are particularly difficult to split, because katakana words are highly productive and are often out-of-vocabulary. To overcome this difficulty, we propose using paraphrases and back-transliteration of katakana noun compounds for splitting them. Experiments demonstrated that splitting accuracy is improved with a statistical significance by extracting both paraphrases and back-transliterations from unlabeled textual data, and then using that information for constructing splitting models.