Mohammad. M. Alyan Nezhadi, M. Forghani, H. Hassanpour
2017 3rd Iranian Conference on Intelligent Systems and Signal Processing (ICSPIS), 2017-12-01. DOI: https://doi.org/10.1109/ICSPIS.2017.8311606
Text language identification using signal processing techniques
Humans can often recognize a spoken language even when they cannot understand its meaning. Determining the language of a text is an important requirement in any text processing system. This paper presents a novel text language identification method based on signal processing techniques. In each language, there are dependencies between the components of a sentence as well as between the components that make up its words. When the text is treated as a time series, these dependencies can be observed using signal processing techniques. The proposed method identifies the language of a text in three stages, using wavelet packets and neural networks. First, the preprocessing stage prepares the text for signal processing by adding extra spaces between consecutive words and then represents the text using UTF-8 encoding. In the second stage, the wavelet packet transform is applied to the encoded text, i.e., the time series, and a feature vector is extracted from the wavelet packet coefficients of the sub-bands. Finally, the classification stage applies a neural network classifier to the extracted feature vector. The method was tested on a dataset gathered from Wikipedia covering seven languages (Arabic, English, French, German, Italian, Persian, and Russian) and achieved an accuracy above 97%. It is also fast enough to be suitable for real-time applications.
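The first two stages described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the wavelet, the decomposition depth, the number of added spaces, or the exact features, so the Haar wavelet, `levels=3`, `extra_spaces=2`, normalized sub-band energies, and the function names `text_to_signal`, `haar_step`, and `wavelet_packet_energies` are all assumptions made for illustration. The neural network classifier of the third stage is omitted.

```python
import math

def text_to_signal(text, extra_spaces=2):
    """Pad word boundaries with extra spaces and return the UTF-8 byte values."""
    # extra_spaces is an assumed value; the abstract only says "some additional spaces".
    padded = (" " * (1 + extra_spaces)).join(text.split())
    return [float(b) for b in padded.encode("utf-8")]

def haar_step(x):
    """One Haar analysis step: return (approximation, detail), each half the input length."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x) - 1, 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x) - 1, 2)]
    return approx, detail

def wavelet_packet_energies(signal, levels=3):
    """Build a full wavelet packet tree and return the normalized energy of each leaf sub-band."""
    # Zero-pad so the length is divisible by 2**levels (an assumed boundary treatment).
    block = 2 ** levels
    padded_len = -(-len(signal) // block) * block
    nodes = [signal + [0.0] * (padded_len - len(signal))]
    for _ in range(levels):
        next_nodes = []
        for node in nodes:
            # Unlike the plain wavelet transform, both halves are split again.
            next_nodes.extend(haar_step(node))
        nodes = next_nodes
    energies = [sum(c * c for c in node) for node in nodes]
    total = sum(energies) or 1.0  # avoid division by zero for empty input
    return [e / total for e in energies]

features = wavelet_packet_energies(text_to_signal("Text language identification"))
print(len(features))  # 8 sub-band energies for levels=3
```

A feature vector like `features` would then be passed to a classifier trained on labeled text from each language. Normalizing the energies is a design choice made here so that texts of different lengths yield comparable vectors; the paper may use different statistics of the coefficients.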