TQDL: Integrated Models for Cross-Language Document Retrieval

Int. J. Comput. Linguistics Chin. Lang. Process. Pub Date: 2012-12-01. DOI: 10.30019/IJCLCLP.201212.0002

Longyue Wang, Derek F. Wong, Lidia S. Chao

Citations: 3

Abstract

This paper proposes an integrated approach to Cross-Language Information Retrieval (CLIR) that combines four statistical models: a translation model, a query generation model, a document retrieval model and a length filter model. Given a document in the source language, the statistical machine translation model first translates it into the target language. The query generation model then selects the most relevant words in the translated document as a query. Instead of searching all target documents with the query, the length-based model filters out a large number of irrelevant candidates according to their length. Finally, the remaining target-language documents are scored by the document retrieval model, which mainly computes the similarity between the query and each document.

Unlike the traditional parallel-corpus-based model, which relies on the IBM algorithm, our CLIR model is divided into four independent parts that work together to handle term disambiguation, query generation and document retrieval. The TQDL method can also efficiently address translation ambiguity and query expansion for disambiguation, which are major issues in CLIR. Another contribution is the length filter, which is trained on a parallel corpus according to the length ratio between the two languages. It not only improves recall by dynamically filtering out many useless documents, but also increases efficiency by reducing the search space. Precision can therefore be improved without sacrificing recall.

To evaluate the retrieval performance of the proposed model on cross-language document retrieval, a number of experiments were conducted under different settings. The Europarl corpus, a collection of parallel texts in 11 languages from the proceedings of the European Parliament, was used for evaluation. We also tested the models extensively on the case in which text lengths are uneven and some texts have similar content under the same topic, because such texts are hard to distinguish and it is hard to make full use of the resources.

After comparing different strategies, the experimental results show that the method performs well. With a larger query size, precision is normally above 90%. The length-based filter plays a very important role in improving the F-measure and optimizing efficiency.

This illustrates the discriminative power of the proposed method, which is significant both for cross-language search on the Internet and for producing parallel corpora for statistical machine translation systems. In future work, the TQDL system will be evaluated on Chinese, which is a bigger challenge and more meaningful for CLIR.
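The flow described in the abstract (query generation, length filtering, then similarity scoring) can be sketched roughly as below. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the formulas, so TF-IDF term selection, a mean ± k·std acceptance window on the target/source length ratio, and cosine similarity are stand-ins chosen here. All function names and the parameters `k` and `query_size` are hypothetical, and the translation step is assumed to have already produced `translated_doc`.

```python
import math
from collections import Counter


def tokenize(text):
    """Whitespace tokenizer; a stand-in for real language-specific tokenization."""
    return text.lower().split()


def fit_length_ratio(parallel_pairs):
    """Estimate mean and std of the target/source length ratio from (src, tgt) pairs."""
    ratios = [len(tokenize(tgt)) / len(tokenize(src)) for src, tgt in parallel_pairs]
    mean = sum(ratios) / len(ratios)
    std = math.sqrt(sum((r - mean) ** 2 for r in ratios) / len(ratios))
    return mean, std


def length_filter(source_len, candidates, mean, std, k=2.0):
    """Keep candidates whose token count lies in source_len * (mean +/- k*std).

    The window shape and k are assumptions; the abstract only says the filter
    is trained on the length ratio between the two languages.
    """
    low = source_len * (mean - k * std)
    high = source_len * (mean + k * std)
    return [doc for doc in candidates if low <= len(tokenize(doc)) <= high]


def idf_table(docs):
    """Smoothed inverse document frequency over the target collection."""
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {t: math.log((1 + len(docs)) / (1 + c)) + 1.0 for t, c in df.items()}


def generate_query(translated_doc, idf, size):
    """Pick the `size` highest TF-IDF terms of the translated document (assumed criterion)."""
    tf = Counter(tokenize(translated_doc))
    # Terms absent from the target collection cannot match anything, so skip them.
    weights = {t: c * idf[t] for t, c in tf.items() if t in idf}
    return sorted(weights, key=lambda t: (-weights[t], t))[:size]


def cosine(query_terms, doc, idf):
    """Cosine similarity between an IDF-weighted query and a TF-IDF document vector."""
    tf = Counter(tokenize(doc))
    doc_vec = {t: c * idf.get(t, 0.0) for t, c in tf.items()}
    q_vec = {t: idf.get(t, 0.0) for t in query_terms}
    dot = sum(w * doc_vec.get(t, 0.0) for t, w in q_vec.items())
    norm = math.sqrt(sum(w * w for w in q_vec.values())) * math.sqrt(
        sum(w * w for w in doc_vec.values())
    )
    return dot / norm if norm else 0.0


def retrieve(translated_doc, source_len, targets, mean, std, query_size=5):
    """Generate a query, filter targets by length, and rank the survivors by similarity."""
    idf = idf_table(targets)
    query = generate_query(translated_doc, idf, query_size)
    kept = length_filter(source_len, targets, mean, std)
    return sorted(kept, key=lambda d: cosine(query, d, idf), reverse=True)
```

In use, `fit_length_ratio` would be run once on a sentence- or document-aligned parallel corpus, and `retrieve` would then be called per source document. Filtering before scoring is what shrinks the search space; increasing `query_size` corresponds to the "larger query size" setting mentioned in the abstract.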
