{"title":"Bilingual phrase induction with local hard negative sampling","authors":"Hailong Cao, Hualin Miao, Weixuan Wang, Liangyou Li, Wei Peng, Tiejun Zhao","doi":"10.1049/cit2.12383","DOIUrl":null,"url":null,"abstract":"<p>Bilingual lexicon induction focuses on learning word translation pairs, also known as bitexts, from monolingual corpora by establishing a mapping between the source and target embedding spaces. Despite recent advancements, bilingual lexicon induction is limited to inducing bitexts consisting of individual words, lacking the ability to handle semantics-rich phrases. To bridge this gap and support downstream cross-lingual tasks, it is practical to develop a method for bilingual phrase induction that extracts bilingual phrase pairs from monolingual corpora without relying on cross-lingual knowledge. In this paper, the authors propose a novel phrase embedding training method based on the skip-gram structure. Specifically, a local hard negative sampling strategy that utilises negative samples of central tokens in sliding windows to enhance phrase embedding learning is introduced. The proposed method achieves competitive or superior performance compared to baseline approaches, with exceptional results recorded for distant languages. Additionally, we develop a phrase representation learning method that leverages multilingual pre-trained language models. These mPLMs-based representations can be combined with the above-mentioned static phrase embeddings to further improve the accuracy of the bilingual phrase induction task. We manually construct a dataset of bilingual phrase pairs and integrate it with MUSE to facilitate the bilingual phrase induction task.</p>","PeriodicalId":46211,"journal":{"name":"CAAI Transactions on Intelligence Technology","volume":"10 1","pages":"147-159"},"PeriodicalIF":8.4000,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cit2.12383","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"CAAI Transactions on Intelligence Technology","FirstCategoryId":"94","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1049/cit2.12383","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
Bilingual lexicon induction focuses on learning word translation pairs, also known as bitexts, from monolingual corpora by establishing a mapping between the source and target embedding spaces. Despite recent advancements, bilingual lexicon induction is limited to inducing bitexts consisting of individual words, lacking the ability to handle semantics-rich phrases. To bridge this gap and support downstream cross-lingual tasks, it is practical to develop a method for bilingual phrase induction that extracts bilingual phrase pairs from monolingual corpora without relying on cross-lingual knowledge. In this paper, the authors propose a novel phrase embedding training method based on the skip-gram structure. Specifically, a local hard negative sampling strategy that utilises negative samples of central tokens in sliding windows to enhance phrase embedding learning is introduced. The proposed method achieves competitive or superior performance compared to baseline approaches, with exceptional results recorded for distant languages. Additionally, we develop a phrase representation learning method that leverages multilingual pre-trained language models. These mPLMs-based representations can be combined with the above-mentioned static phrase embeddings to further improve the accuracy of the bilingual phrase induction task. We manually construct a dataset of bilingual phrase pairs and integrate it with MUSE to facilitate the bilingual phrase induction task.
期刊介绍:
CAAI Transactions on Intelligence Technology is a leading venue for original research on the theoretical and experimental aspects of artificial intelligence technology. We are a fully open access journal co-published by the Institution of Engineering and Technology (IET) and the Chinese Association for Artificial Intelligence (CAAI) providing research which is openly accessible to read and share worldwide.