Language Identifications of Arabic Script Web Documents Using Independent Component Analysis

A. Selamat, Z. Lee
DOI: 10.1109/AMS.2008.46
Published in: 2008 Second Asia International Conference on Modelling & Simulation (AMS)
Publication date: 2008-05-13
Link: https://doi.org/10.1109/AMS.2008.46
Citations: 2

Abstract

We analyze language identification algorithms for Arabic script Web documents, covering languages such as Arabic, Jawi, Persian, and Urdu, using independent component analysis (ICA). We combine an entropy term weighting scheme with class-based feature (CPBF) vectors as feature selection methods to select the best features of Arabic script Web documents for Web page language identification. The selected features are then input to singular value decomposition (SVD), which is used to identify the latent semantics of user profiles. The SVD removes noise from the retrieved documents before ICA is applied for topic extraction. We assume that the topics of the documents are independent of one another. We evaluate the effectiveness of the proposed algorithm using the information retrieval measures precision, recall, and F1. The experiments show that the proposed method can lead to good Arabic script language identification results, with good separation of the Arabic, Persian, and Urdu languages using ICA.
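
The abstract names an entropy term weighting scheme and the precision, recall, and F1 measures but gives no formulas. The sketch below shows one common reading of both, not the authors' implementation. It assumes the standard log-entropy weighting (local weight log(1 + tf), global weight 1 + sum(p log p) / log n) and per-language evaluation; the function names `entropy_weights` and `precision_recall_f1`, the toy counts, and the labels are all hypothetical.

```python
import math


def entropy_weights(tf):
    """Apply log-entropy weighting to a document-by-term count matrix.

    tf is a list of documents, each a list of raw term counts over the
    same vocabulary. Assumes at least two documents and that every term
    occurs at least once in the collection.
    """
    n_docs = len(tf)
    n_terms = len(tf[0])
    log_n = math.log(n_docs)

    # Global weight per term: 1 + sum_j p_jk * log(p_jk) / log(n_docs).
    # It is 0 for a term spread evenly over all documents and 1 for a
    # term concentrated in a single document.
    global_w = []
    for k in range(n_terms):
        total = sum(doc[k] for doc in tf)
        entropy_sum = 0.0
        for doc in tf:
            if doc[k] > 0:
                p = doc[k] / total
                entropy_sum += p * math.log(p)
        global_w.append(1.0 + entropy_sum / log_n)

    # Local weight log(1 + tf) scaled by the term's global weight.
    return [
        [math.log(1 + doc[k]) * global_w[k] for k in range(n_terms)]
        for doc in tf
    ]


def precision_recall_f1(true_labels, predicted_labels, target):
    """Compute precision, recall, and F1 for one target language label."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy counts: term 0 appears equally in both documents, term 1 in one.
    weighted = entropy_weights([[2, 1], [2, 0]])
    print(weighted)

    # Toy labels: "ar" = Arabic, "fa" = Persian, "ur" = Urdu.
    truth = ["ar", "ar", "fa", "ur"]
    guess = ["ar", "fa", "fa", "ur"]
    print(precision_recall_f1(truth, guess, "ar"))
```

Under this weighting, a term that occurs equally often in every document receives weight zero, which is why it suits feature selection: such terms carry no information for telling one language from another. The SVD and ICA stages described in the abstract are not covered by this sketch.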