Transliteration of proper names in cross-language applications
Paola Virga, S. Khudanpur
Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, published 2003-07-28
DOI: 10.1145/860435.860503 (https://doi.org/10.1145/860435.860503)
Translation of proper names is widely recognized as a significant problem in many multilingual text and speech processing applications. Even when the large bilingual lexicons used for machine translation (MT) and cross-lingual information retrieval (CLIR) cover most of the words encountered in a text, a significant portion of the tokens they do not cover are proper names (cf., e.g., [3]). For CLIR applications in particular, proper names and technical terms are especially important, as they carry some of the more distinctive information in a query. In IR systems where users provide very short queries (e.g., 2-3 words), their importance grows even further.

Proper names are amenable to a speech-inspired translation approach. When writing a foreign name in one's native language, one tries to preserve the way it sounds; i.e., one uses an orthographic representation which, when “read aloud” by a native speaker of the language, sounds as it would when spoken by a speaker of the foreign language. This process is referred to as transliteration. If mechanisms were available (a) to render, say, an English name in its phonemic form, and (b) to convert this phonemic string into the orthography of, say, Mandarin Chinese, then one would have a mechanism for transliterating English names using Chinese characters. The first part has been addressed extensively in the automatic text-to-speech synthesis literature. This paper describes a statistical approach for the second part.

Several techniques have been proposed in the recent past for name transliteration. Finite state transducers that implement transformation rules for back-transliteration from Japanese to English are described in [2], and extended to Arabic in [5]. In both cases, the goal is to recognize words in Japanese or Arabic text which happen to be transliterations of English names.
The strongly phonetic orthography of Korean is exploited in [1] to obtain good transliteration using relatively simple HMM-based models. A set of handcrafted rules for locally editing the phonemic spelling of an English name to conform to Mandarin syllabification is provided to a transformation-based learning algorithm in [4], which then learns how to convert an English phoneme sequence to a Mandarin syllable sequence. We describe here a fully data-driven counterpart to the technique of [4] for English-to-Mandarin name transliteration. In addition to intrinsic evaluation, we test our transliteration system extrinsically for cross-lingual spoken document retrieval by using English text queries to retrieve Mandarin audio from the Topic Detection and Tracking (TDT) corpus.

This research was partially supported by DARPA via Grant No. N66001-00-2-8910 and ONR via Grant No. N00014-01-1-0685.
Copyright is held by the author/owner. SIGIR’03, July 28–August 1, 2003, Toronto, Canada. ACM 1-58113-646-3/03/0007.
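
The two-step pipeline outlined in the abstract, (a) English name to phoneme string and (b) phoneme string to Mandarin orthography, can be illustrated with a small Python sketch. It is not the authors' model: the pronunciation lexicon, the chunk-to-syllable scores, and the syllable-to-character table are all hypothetical values invented for this example, and the names `PRONUNCIATIONS`, `CHUNK_TO_SYLLABLE`, `SYLLABLE_TO_CHARACTER`, `best_syllables`, and `transliterate` are assumptions. A data-driven system would estimate such scores from a bilingual list of names rather than write them by hand.

```python
import math

# Step (a): hypothetical pronunciation lexicon (ARPAbet-like symbols).
# A real system would obtain this from a text-to-speech front end.
PRONUNCIATIONS = {
    "clinton": ["K", "L", "IH", "N", "T", "AH", "N"],
}

# Step (b): hypothetical scores for mapping a chunk of English phonemes
# to a Mandarin (pinyin) syllable. The numbers are made up for illustration.
CHUNK_TO_SYLLABLE = {
    ("K",): {"ke": 0.9, "ka": 0.1},
    ("L", "IH", "N"): {"lin": 0.8, "ling": 0.2},
    ("L", "IH"): {"li": 0.5},
    ("N",): {"en": 0.4},
    ("T", "AH", "N"): {"dun": 0.7, "tan": 0.3},
}

# Hypothetical choice of one character per syllable.
SYLLABLE_TO_CHARACTER = {
    "ke": "克", "lin": "林", "dun": "顿", "li": "利", "en": "恩",
}


def best_syllables(phonemes):
    """Return the highest-scoring syllable sequence covering all phonemes."""
    n = len(phonemes)
    # best[i] holds (log score, syllables) for the best analysis of phonemes[:i].
    best = [None] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            if best[start] is None:
                continue
            options = CHUNK_TO_SYLLABLE.get(tuple(phonemes[start:end]))
            if not options:
                continue
            # Keep only the top-scoring syllable for this chunk.
            syllable, score = max(options.items(), key=lambda kv: kv[1])
            total = best[start][0] + math.log(score)
            if best[end] is None or total > best[end][0]:
                best[end] = (total, best[start][1] + [syllable])
    if best[n] is None:
        raise ValueError("no segmentation covers the phoneme sequence")
    return best[n][1]


def transliterate(name):
    """English name -> phonemes -> Mandarin syllables -> Chinese characters."""
    phonemes = PRONUNCIATIONS[name.lower()]
    syllables = best_syllables(phonemes)
    return "".join(SYLLABLE_TO_CHARACTER[s] for s in syllables)


if __name__ == "__main__":
    print(transliterate("Clinton"))
```

With these toy tables, the search prefers the segmentation K | L IH N | T AH N over K | L IH | N | T AH N because the product of its scores is higher. The dynamic program only stands in for whatever statistical decoder a real system would use.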