Transliteration of proper names in cross-language applications

Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval. Pub Date: 2003-07-28. DOI: 10.1145/860435.860503

Paola Virga, S. Khudanpur


Citations: 46

Abstract



Transliteration of proper names in cross-language applications

Translation of proper names is generally recognized as a significant problem in many multilingual text and speech processing applications. Even when large bilingual lexicons used for machine translation (MT) and cross-lingual information retrieval (CLIR) provide significant coverage of the words encountered in the text, a significant portion of the tokens not covered by such lexicons are proper names (cf. e.g. [3]). For CLIR applications in particular, proper names and technical terms are especially important, as they carry some of the more distinctive information in a query. In IR systems where users provide very short queries (e.g. 2-3 words), their importance grows even further.

Proper names are amenable to a speech-inspired translation approach. When writing a foreign name in one's native language, one tries to preserve the way it sounds, i.e. one uses an orthographic representation which, when “read aloud” by a native speaker of the language, sounds as it would when spoken by a speaker of the foreign language. This process is referred to as transliteration. If mechanisms were available (a) to render, say, an English name in its phonemic form, and (b) to convert this phonemic string into the orthography of, say, Mandarin Chinese, then one would have a mechanism for transliterating English names using Chinese characters. The first part has been addressed extensively in the automatic text-to-speech synthesis literature. This paper describes a statistical approach for the second part.

Several techniques have been proposed in the recent past for name transliteration. Finite state transducers that implement transformation rules for back-transliteration from Japanese to English are described in [2], and extended to Arabic in [5]. In both cases, the goal is to recognize words in Japanese or Arabic text which happen to be transliterations of English names. The strongly phonetic orthography of Korean is exploited in [1] to obtain good transliteration using relatively simple HMM-based models. A set of handcrafted rules for locally editing the phonemic spelling of an English name to conform to Mandarin syllabification is provided to a transformation-based learning algorithm in [4], which then learns how to convert an English phoneme sequence to a Mandarin syllable sequence.

We describe here a fully data-driven counterpart to the technique of [4] for English-to-Mandarin name transliteration. In addition to intrinsic evaluation, we test our transliteration system extrinsically for cross-lingual spoken document retrieval by using English text queries to retrieve Mandarin audio from the Topic Detection and Tracking (TDT) corpus.

This research was partially supported by DARPA via Grant No N66001-00-2-8910 and ONR via Grant No N00014-01-1-0685. Copyright is held by the author/owner. SIGIR’03, July 28–August 1, 2003, Toronto, Canada. ACM 1-58113-646-3/03/0007.
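The two-step decomposition in the abstract, (a) English name to phonemic form and (b) phonemic string to Mandarin orthography, can be illustrated with a minimal Python sketch. This is not the authors' statistical model: the abstract does not specify it, so a greedy longest-match lookup with a per-chunk argmax is swapped in purely for illustration. All table entries, the ARPAbet-style phoneme symbols, the probabilities, and the function names (`to_phonemes`, `to_syllables`, `transliterate`) are hypothetical assumptions.

```python
# Toy illustration only: every table entry below is a made-up example,
# not data or a model taken from the paper.

# Step (a): English name -> phoneme sequence
# (a stand-in for a text-to-speech front end).
TOY_PRONUNCIATIONS = {
    "lee": ["L", "IY"],
    "anna": ["AE", "N", "AH"],
}

# Step (b): phoneme chunk -> candidate Mandarin syllables (pinyin)
# with invented probabilities.
TOY_SYLLABLE_MODEL = {
    ("L", "IY"): {"li": 0.9, "lei": 0.1},
    ("AE",): {"an": 0.7, "a": 0.3},
    ("N", "AH"): {"na": 0.8, "nuo": 0.2},
}

# Final lookup: pinyin syllable -> one Chinese character (arbitrary toy choice).
TOY_CHARACTERS = {"li": "李", "an": "安", "na": "娜"}


def to_phonemes(name):
    """Look up the phonemic form of an English name in the toy table."""
    return TOY_PRONUNCIATIONS[name.lower()]


def to_syllables(phonemes):
    """Greedily cover the phoneme sequence with the longest known chunks,
    choosing the most probable syllable for each chunk."""
    syllables = []
    i = 0
    while i < len(phonemes):
        # Try the longest remaining chunk first, then shorter ones.
        for j in range(len(phonemes), i, -1):
            chunk = tuple(phonemes[i:j])
            if chunk in TOY_SYLLABLE_MODEL:
                candidates = TOY_SYLLABLE_MODEL[chunk]
                syllables.append(max(candidates, key=candidates.get))
                i = j
                break
        else:
            # No chunk starting at position i is covered by the toy table.
            raise ValueError(f"no syllable for phonemes starting at {phonemes[i]}")
    return syllables


def transliterate(name):
    """Compose steps (a) and (b), then map syllables to characters."""
    return "".join(TOY_CHARACTERS[s] for s in to_syllables(to_phonemes(name)))


if __name__ == "__main__":
    for name in ("Lee", "Anna"):
        print(name, to_syllables(to_phonemes(name)), transliterate(name))
```

A data-driven system of the kind the abstract describes would learn the phoneme-to-syllable mapping from paired examples rather than from hand-written tables; the sketch only shows how the two stages compose.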
