The Guidelines of Building a Treebank for Modern Standard Arabic
Amena Dheif, Ahmed Abd El Ghany, Sameh Al Ansary
2022 20th International Conference on Language Engineering (ESOLEC), 2022-10-12
DOI: 10.1109/ESOLEC54569.2022.10009330
Citations: 0
Abstract
Treebanks are among the most needed and widely used linguistic resources in natural language processing (NLP) and natural language understanding (NLU). Arabic has only two constituency-based treebanks and a number of dependency treebanks. This research presents guidelines for building a parsed treebank for Modern Standard Arabic (MSA). The guidelines cover, first, the choice of grammar formalism; second, the genre and size of the treebank; and finally, its annotation layers. The study also shows that using traditional Arabic grammar as the syntactic theory for describing Arabic syntax proved more suitable than using any of the modern syntactic theories. Working with traditional Arabic grammar also helps avoid the errors that the available treebank fell into as a result of using guidelines that do not suit Arabic grammar. The study adopts three layers of annotation: the morphological layer, the syntactic layer, and the grammatical function layer. The resulting tree is a detailed and rich syntactic tree, which the researcher prefers to a large amount of poorly and shallowly annotated data.
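To make the three annotation layers concrete, the sketch below shows one possible way to store them on the nodes of a constituency tree. It is only an illustration under stated assumptions: the class name `Node`, its field names, the bracketed output format, and every label used (`S`, `V`, `N`, `SUBJ`, `OBJ`, `PERF.3MS`, `DEF.NOM`, `DEF.ACC`) are hypothetical placeholders and are not taken from the guidelines themselves. The example sentence is the transliterated verbal sentence "kataba al-waladu al-darsa" ("the boy wrote the lesson").

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One tree node carrying the three annotation layers (hypothetical schema)."""
    category: str                     # syntactic layer: phrase or word category
    function: Optional[str] = None    # grammatical function layer
    token: Optional[str] = None       # surface word, set on leaves only
    morphology: Optional[str] = None  # morphological layer, set on leaves only
    children: List["Node"] = field(default_factory=list)

    def to_brackets(self) -> str:
        """Render the subtree in a bracketed, one-line notation."""
        label = self.category
        if self.function is not None:
            label = f"{label}-{self.function}"
        if self.token is not None:
            # Leaf: attach the morphological tag to the word when present.
            word = self.token
            if self.morphology is not None:
                word = f"{word}/{self.morphology}"
            return f"({label} {word})"
        inner = " ".join(child.to_brackets() for child in self.children)
        return f"({label} {inner})"


def leaves(node: Node) -> List[Node]:
    """Collect the leaf nodes of a tree in left-to-right order."""
    if node.token is not None:
        return [node]
    result: List[Node] = []
    for child in node.children:
        result.extend(leaves(child))
    return result


# Placeholder labels only; the actual tagset is defined by the guidelines.
tree = Node("S", children=[
    Node("V", token="kataba", morphology="PERF.3MS"),
    Node("N", function="SUBJ", token="al-waladu", morphology="DEF.NOM"),
    Node("N", function="OBJ", token="al-darsa", morphology="DEF.ACC"),
])

print(tree.to_brackets())
```

Keeping the three layers in separate fields, rather than fusing them into a single label string, lets each layer be queried or revised on its own.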