Co-occurrence technique and dictionary based method for Indonesian thesaurus construction
R. W. Sholikah, A. Arifin, D. Purwitasari, C. Fatichah
2017 5th International Conference on Information and Communication Technology (ICoICT), May 2017
DOI: 10.1109/ICOICT.2017.8074649
Citations: 1
Abstract
A thesaurus, as a controlled vocabulary, can be an important tool in Natural Language Processing (NLP). However, manual construction by experts is time-consuming, and each expert's subjectivity can affect the structure of the thesaurus. Many methods have been implemented to build thesauri automatically for resource-rich languages. For resource-poor languages such as Indonesian, research in this field is still limited. This paper proposes a framework for constructing an Indonesian thesaurus using a monolingual corpus. The method uses an Indonesian dictionary and a large monolingual corpus of news articles. Candidate related terms are extracted from each resource, and the two candidate sets are then combined to produce the final thesaurus. The thesaurus is evaluated by using it as a Query Expansion (QE) resource in an Information Retrieval (IR) system. The experimental results show that, with the automatic thesaurus, the system achieves a precision of 54.00% and a recall of 85.42%.
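The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how co-occurrence is scored, how dictionary candidates are extracted, or how the two candidate sets are combined. The sketch assumes document-level co-occurrence counts with a hypothetical `min_count` threshold, treats the words of a dictionary definition as candidates, and merges the two sets by simple union. All function names and the toy Indonesian data are illustrative.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_candidates(documents, min_count=2):
    """Relate two terms if they share a document at least min_count times.

    Assumption: document-level counting with a fixed threshold.
    """
    pair_counts = Counter()
    for doc in documents:
        terms = sorted(set(doc.lower().split()))
        pair_counts.update(combinations(terms, 2))
    related = {}
    for (a, b), n in pair_counts.items():
        if n >= min_count:
            related.setdefault(a, set()).add(b)
            related.setdefault(b, set()).add(a)
    return related


def dictionary_candidates(dictionary):
    """Assumption: every word in a headword's definition is a candidate."""
    return {
        head: set(definition.lower().split()) - {head}
        for head, definition in dictionary.items()
    }


def merge_candidates(corpus_cands, dict_cands):
    """Assumption: the final thesaurus is the union of both candidate sets."""
    merged = {}
    for source in (corpus_cands, dict_cands):
        for term, rel in source.items():
            merged.setdefault(term, set()).update(rel)
    return merged


def expand_query(query, thesaurus):
    """Query expansion: add the related terms of every query term."""
    terms = set(query.lower().split())
    for term in list(terms):
        terms |= thesaurus.get(term, set())
    return terms


def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy data, invented for illustration only.
docs = ["mobil kendaraan jalan", "mobil kendaraan roda", "jalan raya"]
kamus = {"mobil": "kendaraan bermotor"}
thesaurus = merge_candidates(
    cooccurrence_candidates(docs), dictionary_candidates(kamus)
)
expanded = expand_query("mobil", thesaurus)
```

In an evaluation like the one the abstract describes, the expanded term set would be passed to the retrieval system and `precision_recall` applied to the returned document IDs against relevance judgments. Expansion typically raises recall at some cost in precision, which is consistent with the reported 85.42% recall and 54.00% precision.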