Automatic extraction of multiword expression candidates for Indonesian language
D. Gunawan, A. Amalia, Indra Charisma
2016 6th IEEE International Conference on Control System, Computing and Engineering (ICCSCE), pp. 304-309, 2016
DOI: https://doi.org/10.1109/ICCSCE.2016.7893589
Dictionary-based handling of multiword expressions (MWEs) is limited by the word combinations available in the dictionary, because many more multiword expressions can be extracted from a text. This research is a preliminary study on extracting multiword expressions from Indonesian-language text. Its aim is to determine the best method for extracting multiword expression candidates for the Indonesian language. The research proposes a method for extracting multiword expression candidates from the texts in a corpus. Each text is tokenized and then filtered with stop words to remove unnecessary words. The result of these steps is a set of multiword expression candidates in which common and uncommon expressions are still mixed. To filter the uncommon multiword expressions, the candidates are ranked against the multiword expressions from the other texts in the same corpus using the TF-IDF algorithm. The research evaluates three options for extracting multiword expression candidates. The option that uses a combination of special characters and stop words to determine word combinations is promising: it excels in word-combining rate and yields more appropriate multiword expression candidates, while using almost the same amount of memory as the other options.
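The pipeline described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the stop-word list, the exact boundary rules, or the TF-IDF variant, so the tiny `STOP_WORDS` set, the function names `extract_candidates` and `rank_by_tfidf`, and the plain `tf * log(N / df)` weighting are all assumptions made for illustration. The sketch treats both special characters and stop words as boundaries and keeps runs of two or more remaining tokens as candidates.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the list used in the paper).
STOP_WORDS = {"yang", "dan", "di", "ke", "dari", "itu", "ini", "untuk", "dengan", "adalah"}


def extract_candidates(text, stop_words=STOP_WORDS):
    """Return runs of two or more consecutive non-stop-word tokens.

    Special characters (anything that is neither a word character nor
    whitespace) and stop words both act as candidate boundaries.
    """
    candidates = []
    for segment in re.split(r"[^\w\s]+", text.lower()):
        run = []
        for token in segment.split():
            if token in stop_words:
                # A stop word closes the current run of tokens.
                if len(run) >= 2:
                    candidates.append(" ".join(run))
                run = []
            else:
                run.append(token)
        # The end of a segment (a special character) also closes the run.
        if len(run) >= 2:
            candidates.append(" ".join(run))
    return candidates


def rank_by_tfidf(documents):
    """Score each document's candidates with a plain tf * log(N / df) weighting."""
    per_doc = [Counter(extract_candidates(doc)) for doc in documents]
    n_docs = len(documents)

    # Document frequency: in how many texts each candidate occurs.
    doc_freq = Counter()
    for counts in per_doc:
        doc_freq.update(counts.keys())

    rankings = []
    for counts in per_doc:
        total = sum(counts.values())
        scores = {
            cand: (n / total) * math.log(n_docs / doc_freq[cand])
            for cand, n in counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        rankings.append(sorted(scores.items(), key=lambda kv: (-kv[1], kv[0])))
    return rankings


if __name__ == "__main__":
    corpus = [
        "Rumah sakit itu besar.",
        "Rumah sakit dan kereta api",
    ]
    for ranking in rank_by_tfidf(corpus):
        print(ranking)
```

In this sketch a candidate that appears in every text of the corpus receives a score of zero, while one that is frequent in a single text scores higher. How the resulting ranking is thresholded to separate common from uncommon expressions is left open here, since the abstract does not specify it.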