DETECTION OF YOUTUBE COMMENT SPAM USING XGBOOST: OPTIMIZATION AND COMPARISON OF MACHINE LEARNING ALGORITHMS

Authors

  • Syahroni Wahyu Iriananda, Universitas Widya Gama Malang
  • Niken Paramita, Universitas Widya Gama Malang
  • Rizky Putra Kurniawan, Universitas Widya Gama Malang

DOI:

https://doi.org/10.31328/jsae.v8i2.7296

Keywords:

youtube, spam, detection, xgboost, machine learning

Abstract

YouTube, the largest video platform in Indonesia, faces a significant challenge from spam in its comment sections, which lowers the quality of discussion. This study develops a machine learning model for detecting spam in YouTube comments, with a primary focus on the Extreme Gradient Boosting (XGBoost) algorithm optimized through Randomized Search CV, and compares it with Random Forest, Naïve Bayes, and Support Vector Machine (SVM). The dataset is the YouTube Spam Collection, preprocessed with Natural Language Processing (NLP) techniques such as tokenization and lemmatization to extract important features from the comments. The models were tested at several test data ratios (20%, 25%, 30%, and 35%) and tuned with Randomized Search CV to determine the best parameter combinations. The results indicate that Random Forest achieved the highest accuracy, 92.34% at a 30% test data ratio, while XGBoost showed stable accuracy across the test data ratios, reaching 91.70%. The best accuracy of 92.34% remains below the 96.94% reported in previous studies. Nevertheless, the study demonstrates that hyperparameter optimization can significantly improve the performance of machine learning algorithms. Based on these findings, Random Forest and XGBoost are both effective models for detecting spam in YouTube comments, with Random Forest excelling in accuracy and XGBoost offering consistent, stable performance.
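
The evaluation protocol described in the abstract (tokenize the comments, hold out 20% to 35% of the data for testing, and pick hyperparameters by a cross-validated randomized search on the remaining data) can be sketched roughly as follows. This is a minimal standard-library illustration, not the authors' implementation: the toy vocabulary, every function name, the 3-fold split, and the search range are assumptions, and a small multinomial Naïve Bayes classifier stands in for XGBoost so that the example has no external dependencies. Lemmatization is omitted.

```python
import math
import random
from collections import Counter

# Hypothetical toy vocabulary; a stand-in for the YouTube Spam Collection.
SPAM_WORDS = ["subscribe", "channel", "free", "click", "link", "win"]
HAM_WORDS = ["song", "love", "video", "great", "voice", "lyrics"]
SHARED_WORDS = ["this", "my", "the", "please"]


def make_dataset(rng, n_per_class=40):
    """Build labelled toy comments (1 = spam, 0 = ham) with some class overlap."""
    data = []
    for label, own, other in ((1, SPAM_WORDS, HAM_WORDS), (0, HAM_WORDS, SPAM_WORDS)):
        for _ in range(n_per_class):
            words = rng.sample(own, 3) + rng.sample(SHARED_WORDS, 2) + rng.sample(other, 1)
            rng.shuffle(words)
            data.append((" ".join(words), label))
    return data


def tokenize(text):
    """Lowercase whitespace tokenizer; lemmatization is omitted in this sketch."""
    return text.lower().split()


def train_nb(docs, labels, alpha):
    """Collect per-class word counts; alpha is the additive smoothing strength."""
    counts = {0: Counter(), 1: Counter()}
    for tokens, y in zip(docs, labels):
        counts[y].update(tokens)
    vocab = set(counts[0]) | set(counts[1])
    return counts, Counter(labels), vocab, alpha


def predict_nb(model, tokens):
    """Return the class with the highest smoothed log-probability."""
    counts, class_totals, vocab, alpha = model
    n_docs = sum(class_totals.values())
    scores = {}
    for y in (0, 1):
        total = sum(counts[y].values())
        score = math.log((class_totals[y] + 1) / (n_docs + 2))  # smoothed prior
        for t in tokens:
            if t in vocab:  # ignore words never seen in training
                score += math.log((counts[y][t] + alpha) / (total + alpha * len(vocab)))
        scores[y] = score
    return max(scores, key=scores.get)


def accuracy(model, docs, labels):
    hits = sum(predict_nb(model, d) == y for d, y in zip(docs, labels))
    return hits / len(docs)


def cv_accuracy(docs, labels, params, k=3):
    """Mean accuracy over k interleaved folds of the training data."""
    scores = []
    for fold in range(k):
        held = set(range(fold, len(docs), k))
        train_idx = [i for i in range(len(docs)) if i not in held]
        model = train_nb([docs[i] for i in train_idx], [labels[i] for i in train_idx], **params)
        scores.append(accuracy(model, [docs[i] for i in held], [labels[i] for i in held]))
    return sum(scores) / k


def randomized_search(docs, labels, n_iter, rng):
    """Sample n_iter random settings and keep the one with the best CV score."""
    best_params, best_score = None, -1.0
    for _ in range(n_iter):
        params = {"alpha": rng.uniform(0.01, 2.0)}  # assumed search range
        score = cv_accuracy(docs, labels, params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def run_experiment(seed=0, ratios=(0.20, 0.25, 0.30, 0.35)):
    """Tune on the training part and report test accuracy for each test ratio."""
    rng = random.Random(seed)
    data = [(tokenize(text), y) for text, y in make_dataset(rng)]
    rng.shuffle(data)
    results = {}
    for ratio in ratios:
        n_test = round(len(data) * ratio)
        test, train = data[:n_test], data[n_test:]
        train_docs, train_labels = [d for d, _ in train], [y for _, y in train]
        best_params, _ = randomized_search(train_docs, train_labels, 10, rng)
        model = train_nb(train_docs, train_labels, **best_params)
        results[ratio] = accuracy(model, [d for d, _ in test], [y for _, y in test])
    return results


if __name__ == "__main__":
    for ratio, acc in run_experiment().items():
        print(f"test ratio {ratio:.2f}: accuracy {acc:.3f}")
```

The search only ever sees the training portion, so the accuracy reported for each ratio comes from comments the tuned model has not encountered. In practice the same loop would more likely be written with a library implementation of randomized search and a gradient-boosting classifier; the numbers this toy prints are not comparable to the accuracies reported in the study.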

Published

2025-09-20

Section

Articles