A method of mining bilingual resources from Web Based on Maximum Frequent Sequential Pattern

Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering (NLPKE-2010). Pub Date: 2010-09-30. DOI: 10.1109/NLPKE.2010.5587831

Guiping Zhang, Yang Luo, D. Ji

Citations: 0

Abstract

Bilingual resources are indispensable to NLP fields such as machine translation. The Internet holds a large amount of electronic text that can serve as a source for large-scale multilingual corpora, so mining genuine bilingual resources from the Web in large quantities is a feasible approach. This paper proposes a method of mining bilingual resources from the Web based on Maximum Frequent Sequential Pattern. The method uses a heuristic approach to search for and filter candidate bilingual web pages, then mines patterns using maximum frequent sequential pattern mining, and uses a machine learning method to extend the pattern base and to verify bilingual resources according to the Japanese-to-Chinese word proportion. The experimental results indicate that the method extracts bilingual resources efficiently, with a precision rate over 90%.
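
The abstract does not give the algorithm's details, so the following is only a minimal Python sketch of the two ideas it names: finding maximal frequent sequential patterns, and checking a Japanese-to-Chinese word proportion. Everything here is an assumption made for illustration: the function names, the representation of each page fragment as a list of string tokens, the brute-force level-wise search, and the fixed thresholds `low` and `high` (the paper uses a machine learning method for verification, which this sketch replaces with a simple range check).

```python
from typing import List, Sequence, Tuple

Pattern = Tuple[str, ...]


def is_subsequence(pattern: Sequence[str], seq: Sequence[str]) -> bool:
    """True if the pattern's tokens occur in seq in order (gaps allowed)."""
    it = iter(seq)
    return all(token in it for token in pattern)


def support(pattern: Pattern, sequences: List[List[str]]) -> int:
    """Number of sequences that contain the pattern."""
    return sum(is_subsequence(pattern, s) for s in sequences)


def frequent_patterns(sequences: List[List[str]], min_support: int) -> List[Pattern]:
    """Level-wise search: grow each frequent pattern by one frequent token."""
    items = sorted({t for s in sequences for t in s})
    frontier = [(t,) for t in items if support((t,), sequences) >= min_support]
    frequent_items = [p[0] for p in frontier]
    frequent: List[Pattern] = []
    while frontier:
        frequent.extend(frontier)
        next_frontier = []
        for p in frontier:
            for t in frequent_items:
                candidate = p + (t,)
                if support(candidate, sequences) >= min_support:
                    next_frontier.append(candidate)
        frontier = next_frontier
    return frequent


def maximal_patterns(frequent: List[Pattern]) -> List[Pattern]:
    """Keep only patterns that are not contained in a longer frequent pattern."""
    return [
        p for p in frequent
        if not any(len(q) > len(p) and is_subsequence(p, q) for q in frequent)
    ]


def plausible_pair(ja_words: List[str], zh_words: List[str],
                   low: float = 0.5, high: float = 2.0) -> bool:
    """Hypothetical check: accept a pair whose word-count ratio is in [low, high]."""
    if not zh_words:
        return False
    ratio = len(ja_words) / len(zh_words)
    return low <= ratio <= high


if __name__ == "__main__":
    toy = [["a", "b", "c"], ["a", "c"], ["a", "b", "c", "d"]]
    print(maximal_patterns(frequent_patterns(toy, min_support=2)))
```

The search is exponential in the worst case and is meant only for small inputs; a practical miner would use a dedicated sequential pattern algorithm and real word segmentation for both languages.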
