Nuzhat Khan, Muhammad Paend Bakht, Muhammad Junaid Khan, Abdul Samad
Complex Network of Urdu Language
2019 13th International Conference on Mathematics, Actuarial Science, Computer Science and Statistics (MACS), published 2019-12-01
DOI: 10.1109/MACS48846.2019.9024791 (https://doi.org/10.1109/MACS48846.2019.9024791)
Citations: 1
Abstract
This work proposes a state-of-the-art technique for examining the composition patterns and topological structure of the Urdu language. The improved method explores Urdu text as a co-occurrence network graph within the framework of complex network theory. For the first time, Urdu text is successfully transformed into a graph, despite the difficulties of handling the Nastalik script, the unavailability of resources, and limited support from language processing tools. We constructed an open, unannotated corpus of more than 3 million words using a random forest approach. An undirected, unweighted graph of the Urdu co-occurrence network was created in Python 3.4. The resulting network, built with a bag-of-bigrams model, consists of 5180 nodes and 101415 edges. A detailed statistical analysis of the graph was performed in the graph visualization tool Gephi 0.9.2. Furthermore, a null model of similar size, generated as an Erdős-Rényi random graph, is compared with the Urdu network. The comparison is based on the average path length, clustering coefficient, and hierarchy of both networks. Analysis of these key features shows that the Urdu network graph differs from the random network. A smaller average path length and a high clustering coefficient also confirm the small-world effect in the Urdu language. Additionally, 11 communities are detected in the Urdu network, unlike the random network, where only one community exists. The statistics reveal that the Urdu network is a scale-free network with a layered composition pattern. The small-world effect and scale-free behavior of Urdu mark it as a complex network with a paradigmatic hierarchy in terms of authority distribution among words.
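The pipeline described in the abstract (bigram co-occurrence graph, Erdős-Rényi null model of the same size, comparison on clustering coefficient and average path length) can be illustrated with a minimal sketch. This is not the authors' code: the abstract states only that Python 3.4 and Gephi 0.9.2 were used, so every function name below is hypothetical, only the standard library is used, and the token list is a placeholder rather than corpus data. Community detection and the degree-distribution check for scale-free behavior are left out.

```python
import random
from collections import deque
from itertools import combinations


def cooccurrence_graph(tokens):
    """Undirected, unweighted graph linking each pair of adjacent tokens (bigrams)."""
    graph = {tok: set() for tok in tokens}
    for left, right in zip(tokens, tokens[1:]):
        if left != right:  # ignore self-loops from repeated words
            graph[left].add(right)
            graph[right].add(left)
    return graph


def edge_count(graph):
    """Number of undirected edges (each edge is stored twice in the adjacency sets)."""
    return sum(len(nbrs) for nbrs in graph.values()) // 2


def average_clustering(graph):
    """Mean local clustering coefficient; nodes with fewer than 2 neighbours count as 0."""
    total = 0.0
    for nbrs in graph.values():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(graph)


def average_path_length(graph):
    """Mean shortest-path length over all reachable node pairs, using BFS from each node."""
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in graph[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


def random_null_model(n_nodes, n_edges, seed=0):
    """Erdős-Rényi G(n, m) graph: n_edges distinct edges drawn uniformly at random."""
    if n_edges > n_nodes * (n_nodes - 1) // 2:
        raise ValueError("more edges requested than node pairs available")
    rng = random.Random(seed)
    edges = set()
    while len(edges) < n_edges:
        a, b = rng.sample(range(n_nodes), 2)
        edges.add((min(a, b), max(a, b)))
    graph = {i: set() for i in range(n_nodes)}
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


tokens = ["a", "b", "c", "a", "d"]  # placeholder tokens, not real corpus data
text_graph = cooccurrence_graph(tokens)
null_graph = random_null_model(len(text_graph), edge_count(text_graph))
print(average_clustering(text_graph), average_path_length(text_graph))
print(average_clustering(null_graph), average_path_length(null_graph))
```

On a real corpus, the small-world signature reported in the abstract would appear as a clustering coefficient well above that of the null model together with a comparably short average path length. The all-pairs BFS here is quadratic in the number of nodes, so it suits a sketch; a dedicated graph tool is a better fit for a network of 5180 nodes and 101415 edges.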