Feature Extraction TF-IDF to Perform Cyberbullying Text Classification: A Literature Review and Future Research Direction
Authors: Yudi Setiawan, Dani Gunawan, Rusdi Efendi
Published in: 2022 International Conference on Information Technology Systems and Innovation (ICITSI)
Publication date: 2022-11-08
DOI: 10.1109/ICITSI56531.2022.9970942 (https://doi.org/10.1109/ICITSI56531.2022.9970942)
Citations: 0
Abstract
Extracting features from text documents is a challenging step in natural language processing and machine learning classification. A document contains complex wording that carries varied meanings and expressions, and this complexity, together with differing perceptions of the same text, makes it difficult to assign labels and classify documents. Feature extraction captures the important text, phrases, and words in a document so that text classification can be carried out. Term Frequency-Inverse Document Frequency (TF-IDF) is a feature extraction method that scores words using the statistics of their occurrence in the data collection used. In this paper, the authors present feature extraction with the TF-IDF method under several variations of the model approach: weighting word occurrences, filtering the words in a document, creating rules on term documents, extracting units of two or more syllables, and combining TF-IDF with other extraction methods, all aimed at improving text classification for cyberbullying detection. The paper also identifies opportunities for future work on feature extraction using variations of statistical models of word occurrence in textual detection.
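To make the TF-IDF weighting mentioned in the abstract concrete, the following is a minimal sketch of one common textbook variant, not the specific formulation or any of the variations reviewed in the paper. It assumes whitespace tokenization, term frequency normalized by document length, and an unsmoothed inverse document frequency of ln(N / df). The function name `tf_idf` and the toy corpus are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    # Naive tokenization (an assumption): lowercase, split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus, not data from the paper.
corpus = [
    "you are so stupid",
    "you are so kind",
    "have a nice day",
]
scores = tf_idf(corpus)
```

Under these assumptions, a word that appears in only one document (such as "stupid") receives a higher weight than a word shared by several documents (such as "you"), and a word that appears in every document receives a weight of zero. The resulting per-document weight dictionaries could then be turned into feature vectors for a classifier.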