Language Identifications of Arabic Script Web Documents Using Independent Component Analysis

2008 Second Asia International Conference on Modelling & Simulation (AMS) Pub Date : 2008-05-13 DOI:10.1109/AMS.2008.46

A. Selamat, Z. Lee


Citations: 2

Abstract


Language Identifications of Arabic Script Web Documents Using Independent Component Analysis

We analyze language identification algorithms for Arabic-script Web documents written in languages such as Arabic, Jawi, Persian, and Urdu, using independent component analysis (ICA). For feature selection, we combine an entropy term-weighting scheme with class-profile-based feature (CPBF) vectors to select the best features of Arabic-script Web documents for Web page language identification. The selected features are then used as input, based on the identification of the latent semantics of user profiles using singular value decomposition (SVD). SVD removes noise from the retrieved documents before ICA is applied for topic extraction. We assume that the topics of the documents are independent of one another. We evaluate the effectiveness of the proposed algorithm with the information retrieval measures precision, recall, and F1. Our experiments show that the proposed method can lead to good Arabic-script language identification results, with good separation of the Arabic, Persian, and Urdu languages using ICA.
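As a rough illustration of two steps named in the abstract, the sketch below computes entropy-based term weights and the precision, recall, and F1 measures. It assumes the widely used log-entropy formulation (local weight log(1 + tf) multiplied by a global entropy weight), which may differ from the exact scheme the authors used. The function names `entropy_weights` and `precision_recall_f1`, the language labels, and the toy inputs are hypothetical. The CPBF, SVD, and ICA stages are not shown.

```python
import math
from collections import Counter


def entropy_weights(docs):
    """Log-entropy term weights for tokenized documents.

    Assumed formulation (not confirmed by the source):
      weight(term, doc) = log(1 + tf) * (1 + sum_j p_j * log(p_j) / log(n_docs))
    where p_j is the share of the term's total count that falls in document j.
    """
    n_docs = len(docs)
    term_freqs = [Counter(doc) for doc in docs]

    # Total count of each term over the whole collection.
    global_freq = Counter()
    for tf in term_freqs:
        global_freq.update(tf)

    # Global weight is 1 for a term confined to one document and
    # 0 for a term spread evenly over all documents.
    global_weight = {}
    for term, gf in global_freq.items():
        if n_docs < 2:
            global_weight[term] = 1.0
            continue
        entropy_sum = 0.0
        for tf in term_freqs:
            if term in tf:
                p = tf[term] / gf
                entropy_sum += p * math.log(p)
        global_weight[term] = 1.0 + entropy_sum / math.log(n_docs)

    return [
        {term: math.log(1 + count) * global_weight[term] for term, count in tf.items()}
        for tf in term_freqs
    ]


def precision_recall_f1(predicted, actual, label):
    """Precision, recall, and F1 for one language label."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == label and a == label)
    fp = sum(1 for p, a in pairs if p == label and a != label)
    fn = sum(1 for p, a in pairs if p != label and a == label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy usage with made-up tokens and labels (illustrative only).
weights = entropy_weights([["a", "b"], ["a", "c"]])
scores = precision_recall_f1(["ar", "ar", "fa", "ur"], ["ar", "fa", "fa", "ur"], "ar")
```

Under this weighting, a term that occurs equally often in every document receives weight zero, so only terms concentrated in a few documents contribute as features. This is the property that makes such a scheme usable as a feature-selection step.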

