Sayed Majid Ali Shah, I. A. Ismaili, Z. Bhatti, Ahmed Waqas
Designing XML tag based Sindhi language corpus
2018 International Conference on Computing, Mathematics and Engineering Technologies (iCoMET), published 2018-03-03
DOI: 10.1109/ICOMET.2018.8346381 (https://doi.org/10.1109/ICOMET.2018.8346381)
Citation count: 2
Abstract
Corpora play a vital role in building the foundation for processing any language resource. Sindhi is a very old language, yet its computational resources remain very limited. This paper discusses the construction of a Sindhi corpus using XML tagging, to facilitate Sindhi natural language processing and machine learning. A two-chain structure is defined for the Sindhi language, consisting of metadata tags and Sindhi source document tags. These tags are further subdivided to create a complex XML structure with custom tags that identify each Sindhi document extracted from numerous sources.
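
To make the two-branch layout concrete, the following is a minimal sketch of how one corpus entry might be assembled with Python's standard `xml.etree.ElementTree` module. The abstract does not list the actual tag names, so `document`, `metadata`, `title`, `source`, `sourceDocument`, and `paragraph`, as well as the helper `build_corpus_entry` and the sample identifiers, are all hypothetical placeholders rather than the authors' schema.

```python
import xml.etree.ElementTree as ET


def build_corpus_entry(doc_id, title, source, text):
    """Build one corpus entry with a metadata branch and a source-document branch.

    All tag and attribute names here are illustrative assumptions,
    not the schema defined in the paper.
    """
    entry = ET.Element("document", attrib={"id": doc_id})

    # Branch 1: metadata describing where the document came from.
    meta = ET.SubElement(entry, "metadata")
    ET.SubElement(meta, "title").text = title
    ET.SubElement(meta, "source").text = source

    # Branch 2: the Sindhi source text, split into numbered paragraphs.
    body = ET.SubElement(entry, "sourceDocument")
    for n, para in enumerate(text.split("\n"), start=1):
        ET.SubElement(body, "paragraph", attrib={"n": str(n)}).text = para

    return entry


# Placeholder values; the Sindhi strings are sample text only.
entry = build_corpus_entry(
    doc_id="sd-0001",
    title="سنڌي",
    source="example-source",
    text="سنڌي ٻولي\nسنڌي ٻولي",
)

# Serialize to a Unicode string so the Sindhi script stays readable.
xml_text = ET.tostring(entry, encoding="unicode")
print(xml_text)
```

Keeping metadata and text in separate branches means a downstream tool could filter documents by source without touching the text, or extract raw text without parsing the metadata.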