Bilingual phrase induction with local hard negative sampling

IF 7.3 · Zone 2 (Computer Science) · JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

CAAI Transactions on Intelligence Technology Pub Date : 2024-10-01 DOI:10.1049/cit2.12383

Hailong Cao, Hualin Miao, Weixuan Wang, Liangyou Li, Wei Peng, Tiejun Zhao

Volume 10, Issue 1, pp. 147-159. Full text: https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/cit2.12383 (PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cit2.12383)

Citations: 0

Abstract

Bilingual lexicon induction learns word translation pairs, also known as bitexts, from monolingual corpora by establishing a mapping between the source and target embedding spaces. Despite recent advances, it remains limited to bitexts made up of individual words and cannot handle semantically rich phrases. To bridge this gap and support downstream cross-lingual tasks, we address bilingual phrase induction: extracting bilingual phrase pairs from monolingual corpora without relying on cross-lingual knowledge. We propose a novel phrase embedding training method based on the skip-gram structure. Specifically, we introduce a local hard negative sampling strategy that uses negative samples of the central tokens in sliding windows to enhance phrase embedding learning. The proposed method achieves competitive or superior performance compared with baseline approaches, with particularly strong results for distant language pairs. Additionally, we develop a phrase representation learning method that leverages multilingual pre-trained language models (mPLMs). These mPLM-based representations can be combined with the static phrase embeddings described above to further improve the accuracy of bilingual phrase induction. We also manually construct a dataset of bilingual phrase pairs and integrate it with MUSE to facilitate the bilingual phrase induction task.
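
The mapping idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' method: it assumes a small seed set of aligned pairs (whereas the paper works without cross-lingual knowledge), uses the standard orthogonal Procrustes solution as a stand-in for the mapping step, and retrieves translations by cosine nearest neighbour. The function names `fit_orthogonal_map` and `induce_pairs` and the toy random embeddings are hypothetical and chosen only for illustration.

```python
import numpy as np


def fit_orthogonal_map(src, tgt):
    """Return the orthogonal W minimising ||src @ W - tgt||_F (Procrustes)."""
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


def induce_pairs(src, tgt, w):
    """For each mapped source row, return the index of the closest target row."""
    mapped = src @ w
    # Normalise rows so the dot product below is cosine similarity.
    mapped = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    tgt_n = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    return np.argmax(mapped @ tgt_n.T, axis=1)


# Toy data (assumption): the target space is an exact rotation of the source.
rng = np.random.default_rng(0)
src_emb = rng.normal(size=(6, 4))          # 6 source phrases, 4 dimensions
true_rot, _ = np.linalg.qr(rng.normal(size=(4, 4)))
tgt_emb = src_emb @ true_rot               # row i translates source row i

# Fit the map on the first 4 pairs only, then induce pairs for all 6 rows.
w = fit_orthogonal_map(src_emb[:4], tgt_emb[:4])
pred = induce_pairs(src_emb, tgt_emb, w)
```

In this toy setting each source row should be matched to the target row with the same index, including the two rows held out from fitting. Real embedding spaces are only approximately related by a rotation, so retrieval there is noisy and depends heavily on the quality of the embeddings being mapped.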

Source journal

CAAI Transactions on Intelligence Technology (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

CiteScore: 11.00

Self-citation rate: 3.90%

Articles published: 134

Review time: 35 weeks

About the journal: CAAI Transactions on Intelligence Technology is a leading venue for original research on the theoretical and experimental aspects of artificial intelligence technology. We are a fully open access journal co-published by the Institution of Engineering and Technology (IET) and the Chinese Association for Artificial Intelligence (CAAI), providing research that is openly accessible to read and share worldwide.