Measuring language similarity using trigrams: Limitations of language identification

2013 International Conference on Recent Trends in Information Technology (ICRTIT). Pub Date: 2013-07-25. DOI: 10.1109/ICRTIT.2013.6844250

Nathaniel Oco, J. Ilao, R. Roxas, Leif Romeritch Syliongka

Citations: 8

Abstract

Computational approaches in language identification often result in a high number of false positives and low recall rates, especially if the languages involved come from the same subfamily. In this paper, we aim to determine the cause of this problem by measuring language similarity through trigrams. Religious and literary texts were used as training data. Our experiments involving language identification show that the number of common trigrams for a given language pair is inversely proportional to precision and recall rates, whereas the average word length is directly proportional to the number of true positives. Future directions include improving language modeling and providing an approach to increase precision and recall.
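
The quantities named in the abstract (common trigrams for a language pair, average word length, precision, and recall) can be illustrated with a minimal sketch. It assumes character-level trigrams, lowercasing, whitespace normalization, and a plain set intersection. The abstract does not state the paper's actual preprocessing or counting details, so these choices and all function names are hypothetical.

```python
from collections import Counter


def char_trigrams(text):
    """Count overlapping character trigrams (assumed unit; the abstract does not specify)."""
    # Assumption: lowercase and collapse runs of whitespace before counting.
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + 3] for i in range(len(normalized) - 2))


def common_trigram_count(text_a, text_b):
    """Number of distinct trigrams that occur in both texts."""
    return len(set(char_trigrams(text_a)) & set(char_trigrams(text_b)))


def average_word_length(text):
    """Mean length of whitespace-separated tokens; 0.0 for empty input."""
    words = text.split()
    return sum(len(word) for word in words) / len(words) if words else 0.0


def precision_recall(true_pos, false_pos, false_neg):
    """Standard precision and recall from raw counts."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall
```

In this sketch, `common_trigram_count` would be applied to the training text of each language in a pair, and the resulting count compared against the precision and recall observed when identifying those two languages.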


