Lifia Zullani, Dodi Vionanda, Syafriandi, Dina Fitria
{"title":"Comparison of Error Rate Prediction in CART for Imbalanced Data","authors":"Lifia Zullani, Dodi Vionanda, Syafriandi, Dina Fitria","doi":"10.24036/ujsds/vol1-iss5/117","DOIUrl":"https://doi.org/10.24036/ujsds/vol1-iss5/117","url":null,"abstract":"CART is one of the tree based classification algorithms. CART is a tree consisting of root nodes, internal nodes, and terminal nodes. The accuracy of the model in CART can be calculated by measuring prediction errors in the model. One common method used to predict error rates is cross-validation. There are three cross-validation algorithms, namely leave one out, hold out, and k-fold cross-validation. These methods have different performance in dividing data into training data and testing data, so there are advantages and disadvantages to each method. Every algorithm has its shortcomings; hold out cannot guarantee that the training set represents the entire dataset, leave one out is very time-consuming and requires significant computation because it has to train the model as many times as there are data points, and k-fold provides longer computation time because the training algorithm must be run k times. In reality, the data often encountered is imbalanced. Imbalanced data refers to data with a different number of observations in each class. In CART, imbalanced data affects the prediction results. This research focuses on comparing error rate prediction methods in the CART model with imbalanced data. The study uses three types of data: univariate, bivariate, and multivariate, obtained from differences in population means and correlations between independent variables. 
The results obtained indicate that the k-fold algorithm is the most suitable error rate prediction algorithm applied to CART with imbalanced data.","PeriodicalId":220933,"journal":{"name":"UNP Journal of Statistics and Data Science","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139207247","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Arssita Nur Muharromah, Zamahsary Martha, Dony Permana, Tessy Octavia Mukhti
{"title":"Penerapan Metode Regresi Kuantil pada Data yang Mengandung Outlier untuk Tingkat Kejahatan di Jabodetabek","authors":"Arssita Nur Muharromah, Zamahsary Martha, Dony Permana, Tessy Octavia Mukhti","doi":"10.24036/ujsds/vol1-iss5/94","DOIUrl":"https://doi.org/10.24036/ujsds/vol1-iss5/94","url":null,"abstract":"Crime is becoming increasingly widespread in Indonesia. The crime rate in Jabodetabek is the second highest in Indonesia. Because the data in this study contain outliers, quantile regression is the appropriate method. Quantile regression is an extension of median regression, or the Least Absolute Deviation (LAD) method, which divides the data into two parts in order to minimize error. However, LAD is considered inadequate for modeling, which led to the development of quantile regression. Quantile regression addresses an unmet assumption of classical regression, namely heteroscedasticity, and can model data that contain outliers. The quantile regression approach separates or divides the data into several parts, or specific quantiles, at which the estimated values are suspected to differ. Goodness of fit is measured using the coefficient of determination (R2) at each quantile. This study uses five quantiles: 0.05, 0.25, 0.50, 0.75, and 0.95. The analysis shows that the best parameter estimation model is at quantile 0.95, where all independent variables have a significant effect on the dependent variable (crime rate), 
whereas at quantiles 0.25 and 0.50 no independent variable has a significant effect; this may be due to other factors not included in the study that affect each quantile.","PeriodicalId":220933,"journal":{"name":"UNP Journal of Statistics and Data Science","volume":"196 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139207824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Classification of Coronary Heart Disease at Semen Padang Hospital using Algorithm Classification And Regression Trees (CART)","authors":"Defal Aditya, Atus Amadi Defran, Putra, Dodi Vionanda, Tessy Octavia Mukhti","doi":"10.24036/ujsds/vol1-iss5/104","DOIUrl":"https://doi.org/10.24036/ujsds/vol1-iss5/104","url":null,"abstract":"Cardiovascular disease is a degenerative disease caused by declining function of the heart and blood vessels. One of the most common heart diseases today is coronary heart disease (CHD). The main risk factors for CHD include age, sex, hypertension, blood sugar, and cholesterol. One method that can be used to group CHD cases is classification. CART is a decision tree that describes the relationship between a response variable and one or more predictor variables. The goal of CART is to obtain accurate groups of data that characterize a classification. Based on the optimal tree, the attribute that is the main characteristic in classifying CHD patients at Semen Padang Hospital is age. The classification accuracy assessed with a confusion matrix yields an accuracy of 66.67%, a sensitivity of 56.52% for classifying CHD patients, and a specificity of 84.61% for classifying non-CHD patients.","PeriodicalId":220933,"journal":{"name":"UNP Journal of Statistics and Data Science","volume":"42 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139207781","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anggi Adrian, Yenni Danis, Kurniawati, N. Amalita, F. Fitri
{"title":"Forecasting the Exchange Rate of Yen to Rupiah Using the Long Short-Term Memory Method","authors":"Anggi Adrian, Yenni Danis, Kurniawati, N. Amalita, F. Fitri","doi":"10.24036/ujsds/vol1-iss5/114","DOIUrl":"https://doi.org/10.24036/ujsds/vol1-iss5/114","url":null,"abstract":"Long Short-Term Memory (LSTM) is a modification of the Recurrent Neural Network (RNN) designed to address exploding and vanishing gradients and to make it possible to retain long-term information. To tackle these problems, the RNN is modified with memory cells that can store information for long periods. The objective of this study was to forecast the exchange rate of Yen to Rupiah using the LSTM method. The data are daily purchasing rates from January 2020 to May 2023, comprising 848 observations. The data were divided into two sets: 80% for training and 20% for testing. For the forecasting process, experiments were conducted to identify the best model by adjusting several hyperparameters. The performance of each model was evaluated using the Mean Absolute Percentage Error (MAPE). Based on the experimental results, the best model was the LSTM model with a batch size of 20, 150 epochs, and 50 neurons per layer, which yielded a MAPE value of 1.5399.","PeriodicalId":220933,"journal":{"name":"UNP Journal of Statistics and Data Science","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139198841","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Findri Wara Putri, Dodi Vionanda, Atus Amadi Putra, Fadhilah Fitri
{"title":"Comparison of Error Prediction Methods in Claassification Modeling with CHAID Methods for Balanced Data","authors":"Findri Wara Putri, Dodi Vionanda, Atus Amadi Putra, Fadhilah Fitri","doi":"10.24036/ujsds/vol1-iss5/116","DOIUrl":"https://doi.org/10.24036/ujsds/vol1-iss5/116","url":null,"abstract":"Chi-Squared Automatic Interaction Detection (CHAID) is an exploratory method for classifying data by building classification trees. The classification results are displayed as a tree diagram. After the model is formed, its accuracy must be calculated in order to assess its performance. This accuracy can be determined by calculating the prediction error rate of the model. Error rate prediction methods work by dividing data into training data and testing data. Three such methods are considered: leave-one-out cross-validation (LOOCV), hold-out, and k-fold cross-validation. These methods differ in how they divide data into training data and testing data, so each has advantages and disadvantages. Therefore, the three error rate prediction methods were compared in order to determine the appropriate method for CHAID. This is an experimental study that uses simulated data generated in RStudio. The comparison considers several factors, namely the marginal probability matrix and different correlations. The results are examined using boxplots, looking at the median error rate and the lowest variance. 
This research found that k-fold cross validation is the most suitable error rate prediction method applied to the CHAID method for balanced data.","PeriodicalId":220933,"journal":{"name":"UNP Journal of Statistics and Data Science","volume":"38 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139198918","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
AL Rezki Ivansyah, Fadhilah Fitri, Yenni Kurniawati, Tessy Octavia Mukhti
{"title":"Implementation Self Organizing Maps Method In Cluster Analysis Based on Achievement Suistainable Development Goal/SDG’s West Sumatera Province","authors":"AL Rezki Ivansyah, Fadhilah Fitri, Yenni Kurniawati, Tessy Octavia Mukhti","doi":"10.24036/ujsds/vol1-iss5/118","DOIUrl":"https://doi.org/10.24036/ujsds/vol1-iss5/118","url":null,"abstract":"The Indonesian government is committed to implementing the Sustainable Development Goals (SDGs) agenda, particularly in West Sumatra. The government of West Sumatra supports the objectives and targets of the SDGs by optimizing the implementation of SDG indicators in the Rencana Aksi Daerah (RAD) for the SDGs of West Sumatra Province for 2022-2026. In its execution, however, the RAD requires annual monitoring and evaluation. Clustering is used as an input for evaluating the implementation of the RAD for the SDGs in West Sumatra Province for 2022-2026. The clustering method used is the Self Organizing Map (SOM), an effective tool for visualizing high-dimensional data that maps such data into one, two, or three dimensions, represented by connected units or neurons. The data consist of 14 SDG indicator variables across 19 regencies/cities in West Sumatra in 2022, sourced from the official website and publications of the Badan Pusat Statistika (BPS) of West Sumatra Province. 
The analysis yields 3 clusters with different characteristics, which can serve as references for policy decisions and effective strategies to improve the implementation of SDG programs in West Sumatra Province.","PeriodicalId":220933,"journal":{"name":"UNP Journal of Statistics and Data Science","volume":"76 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139200211","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sentiment Analysis of Prabowo Subianto as 2024 Presidential Candidate on Twitter Using K-Nearest Neighbor Algorithm","authors":"Aurumnisva Faturrahmi, Zamahsary Martha, Yenni Kurniawati, Fadhilah Fitri","doi":"10.24036/ujsds/vol1-iss5/101","DOIUrl":"https://doi.org/10.24036/ujsds/vol1-iss5/101","url":null,"abstract":"The presidential election is one of the most talked-about topics at the moment. Based on many surveys, Prabowo Subianto is one of the strongest candidates for the upcoming 2024 presidential election. This research aims to determine whether public sentiment towards Prabowo Subianto as a presidential candidate tends to be positive or negative. Sentiment classification was conducted using the K-Nearest Neighbor (KNN) algorithm, which classifies sentiment based on the k nearest neighbors. The analysis was conducted in several stages: data collection, text preprocessing, data labelling, data classification using the KNN algorithm, and evaluation of the model's accuracy in classifying sentiment. The sentiment classification produced 2731 positive sentiments and 76 negative sentiments. Using k = 3 and an 80:20 split of training and testing data, the model achieved an accuracy of 97.33%.","PeriodicalId":220933,"journal":{"name":"UNP Journal of Statistics and Data Science","volume":"45 18","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139203051","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Structural Equation Modeling Partial Least Square (SEM-PLS) Untuk Membandingkan Kondisi Public Speaking Anxiety Mahasiswa Soshum dan Saintek","authors":"Sabina Chairun Najwa, Natasya Dwi Ovalingga, Hanifah Nazhiroh, R. Akbar, Fadhilah Fitri","doi":"10.24036/ujsds/vol1-iss5/132","DOIUrl":"https://doi.org/10.24036/ujsds/vol1-iss5/132","url":null,"abstract":"Public speaking is the communication skill of delivering an opinion or message to an audience. Public speaking anxiety is caused by various factors. Social and science students differ in culture and learning systems. Therefore, students in each educational cluster have their own ways of overcoming communication barriers. This study aimed to identify factors that influence public speaking anxiety in social and science students at Padang State University. The method used is Structural Equation Modeling Partial Least Square (SEM-PLS), chosen to understand the influential factors in more detail and to minimize analysis errors caused by missing values and multicollinearity due to diverse samples. The results of the analysis are path diagrams for the structural models and outer loading tables. If an outer loading value is < 0.7, the calculation is repeated so that a new model is formed. The model feasibility obtained was 35% for the social sciences cluster and 36.5% for the science cluster. The effect of the latent or exogenous variables in this study is weak. Social students have higher levels of speech anxiety than science students. This is influenced by the humiliation, unfamiliar role, and negative result factors. 
In science students, the influencing factors are humiliation, preparation, and unfamiliar role.","PeriodicalId":220933,"journal":{"name":"UNP Journal of Statistics and Data Science","volume":"41 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139206640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Naive Bayes Classifier Method on Sentiment Analysis of Bibit Application Users in Play Store","authors":"Afifa Lufti Insani, Zamahsary Martha, Yenni Kurniawati, Zilrahmi","doi":"10.24036/ujsds/vol1-iss5/102","DOIUrl":"https://doi.org/10.24036/ujsds/vol1-iss5/102","url":null,"abstract":"Growing public interest in investment, supported by technological advances, has led to the emergence of investment applications that aim to make investing easier for the public. One of the most widely used investment applications today is the Bibit application. It is widely used by novice investors because of the ease of opening accounts, disbursing funds, and purchasing mutual funds, and because of its easy-to-understand design. Because investment applications are still new to the community, many people still doubt and worry about the quality of the Bibit application, as reflected in the number of reviews in the review column on the Play Store. Reviews serve as a forum for criticism and suggestions for the application and are one of the considerations for potential users. Because reviews of the Bibit application can be positive or negative, sentiment analysis is needed to determine whether the sentiment tends to be positive or negative. Classification is then carried out using the Naive Bayes Classifier method to obtain a model that can be used to predict user sentiment. 
The results show that Bibit application users tend to have positive sentiments, with an accuracy value of 79.45%.","PeriodicalId":220933,"journal":{"name":"UNP Journal of Statistics and Data Science","volume":" 2","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139197357","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nadhea Ovella Syaqhasdy, Zamahsary Martha, N. Amalita, D. Fitria
{"title":"Classification of Nutrition Problems for Indonesian Toddler With Decision Tree Algorithm C4.5","authors":"Nadhea Ovella Syaqhasdy, Zamahsary Martha, N. Amalita, D. Fitria","doi":"10.24036/ujsds/vol1-iss5/98","DOIUrl":"https://doi.org/10.24036/ujsds/vol1-iss5/98","url":null,"abstract":"Indonesia continues to encounter numerous challenges, particularly in the health and economic sectors. As the future of the nation, the quality of human resources is crucial for Indonesia's development. The development of Indonesia is key to improving the quality of life of its people, and a focus on this development can positively impact the health and economy of the community. A healthy and educated generation is fundamental for the country's expected progress, as nutritional status is one of the factors significantly affecting the quality of human resources. Nutritional problems can cause serious impacts, such as improper physical growth, decreased IQ quality, and even death. The goal is to analyze the factors affecting the nutritional status of toddlers by classifying each variable using a decision tree. A decision tree is a flow chart that resembles a branching tree structure. The C4.5 algorithm was utilized in this study. It can process both numeric and categorical data, handle missing attribute values, and generate easy-to-interpret rules. After conducting the analysis, it was found that there are 392 districts/cities in Indonesia where the prevalence of stunted toddler nutritional status is less than 20%. The model created using the C4.5 algorithm was evaluated and achieved an accuracy of 99.8% and a kappa value close to 1. 
This indicates that the model can accurately classify toddler nutrition problems in Indonesia.","PeriodicalId":220933,"journal":{"name":"UNP Journal of Statistics and Data Science","volume":"58 1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139198589","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}