A study in language identification

Australasian Document Computing Symposium, Pub Date: 2012-12-05, DOI: 10.1145/2407085.2407097

Rachel Mary Milne, Richard A. O'Keefe, A. Trotman


Citations: 14

Abstract

Language identification is the task of automatically determining the language in which a previously unseen document is written. We compared several existing methods on samples from the Wikipedia and EuroParl collections. Most of these methods work well. However, we find that these collections (and presumably other document collections) are heterogeneous in document length, and that short documents are systematically different from long ones: the techniques that work well on long documents differ from those that work well on short ones. We believe that algorithms will improve if document length is taken into account.
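
The abstract does not say which methods were compared, so the following is only an illustrative sketch, not the authors' approach. It assumes a simple character-trigram classifier with cosine similarity and shows how accuracy could be reported separately for short and long documents. All function names and the 50-character threshold are hypothetical choices made for this example.

```python
from collections import Counter, defaultdict
from math import sqrt


def ngram_profile(text, n=3):
    """Count overlapping character n-grams in lower-cased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def train(samples):
    """Build one n-gram profile per language from (language, text) pairs."""
    profiles = defaultdict(Counter)
    for lang, text in samples:
        profiles[lang].update(ngram_profile(text))
    return dict(profiles)


def identify(text, profiles):
    """Return the language whose profile is most similar to the text."""
    query = ngram_profile(text)
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))


def accuracy_by_length(test_set, profiles, threshold=50):
    """Report accuracy separately for short and long documents.

    The threshold (in characters) is an arbitrary assumption for this sketch.
    """
    hits, totals = Counter(), Counter()
    for lang, text in test_set:
        bucket = "short" if len(text) < threshold else "long"
        totals[bucket] += 1
        hits[bucket] += identify(text, profiles) == lang
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}


# Toy training data, invented purely for illustration.
TRAINING = [
    ("en", "the quick brown fox jumps over the lazy dog and the cat sleeps in the house"),
    ("de", "der schnelle braune fuchs springt über den faulen hund und die katze schläft im haus"),
]
PROFILES = train(TRAINING)
```

Calling `accuracy_by_length` on a labelled test set returns one accuracy figure per length bucket, which makes it possible to see whether a method that does well on long documents also holds up on short ones.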
