Ari Aulia Hakim, Alva Erwin, Kho I Eng, M. Galinium, W. Muliady
{"title":"Automated document classification for news article in Bahasa Indonesia based on term frequency inverse document frequency (TF-IDF) approach","authors":"Ari Aulia Hakim, Alva Erwin, Kho I Eng, M. Galinium, W. Muliady","doi":"10.1109/ICITEED.2014.7007894","DOIUrl":null,"url":null,"abstract":"The exponential growth of the data may lead us to the information explosion era, an era where most of the data cannot be managed easily. Text mining study is believed to prevent the world from entering that era. One of the text mining studies that may prevent the explosion era is text classification. It is a way to classify articles into several predefined categories. In this research, the classifier implements TF-IDF algorithm. TF-IDF is an algorithm that counts the word weight by considering frequency of the word (TF) and in how many files the word can be found (IDF). Since the IDF could see the in how many files a term can be found, it can control the weight of each word. When a word can be found in so many files, it will be considered as an unimportant word. TF-IDF has been proven to create a classifier that could classify news articles in Bahasa Indonesia in a high accuracy; 98.3%.","PeriodicalId":148115,"journal":{"name":"2014 6th International Conference on Information Technology and Electrical Engineering (ICITEE)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"84","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 6th International Conference on Information Technology and Electrical Engineering (ICITEE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICITEED.2014.7007894","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 84
Abstract
The exponential growth of the data may lead us to the information explosion era, an era where most of the data cannot be managed easily. Text mining study is believed to prevent the world from entering that era. One of the text mining studies that may prevent the explosion era is text classification. It is a way to classify articles into several predefined categories. In this research, the classifier implements TF-IDF algorithm. TF-IDF is an algorithm that counts the word weight by considering frequency of the word (TF) and in how many files the word can be found (IDF). Since the IDF could see the in how many files a term can be found, it can control the weight of each word. When a word can be found in so many files, it will be considered as an unimportant word. TF-IDF has been proven to create a classifier that could classify news articles in Bahasa Indonesia in a high accuracy; 98.3%.