Hybrid TF-IDF and Embedding Model for Improving Similarity and Clustering Accuracy
DOI:
https://doi.org/10.31328/jointecs.v10i1.7344
Abstract
This study aims to improve the accuracy of a final-project document recommendation system for students by using a
hybrid approach that combines the lexical TF-IDF vector method with the semantic embedding models SBERT and E5. The
main problem addressed is that TF-IDF cannot capture semantic context, so its recommendations are often less relevant
in meaning. The goal is a recommendation system that considers both literal word similarity and semantic similarity
between documents. The research uses 500 metadata records of student theses harvested from an institutional
repository via the OAI-PMH protocol, focusing on the title, abstract, and keyword attributes. The data pass through a
text preprocessing phase comprising tokenization, normalization, stopword removal, and stemming. Three similarity
approaches are applied, TF-IDF, SBERT, and E5, each used to calculate document similarity. The evaluation compares
similarity scores and analyzes ranking shifts, score distributions, and the document clusters produced by the K-Means
algorithm. Experimental results show that the hybrid model significantly improves recommendation quality. The E5 model
produces the most stable and semantically relevant similarity scores, as well as the cleanest and most visually
distinct document clusters. Both Spearman and Pearson correlation analyses indicate that combining TF-IDF with
embedding models improves the ranking order of recommended documents. In conclusion, the hybrid approach using TF-IDF
and semantic embeddings, particularly E5, significantly enhances the accuracy of document recommendation systems and
the effectiveness of topic clustering.
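The abstract does not state how the TF-IDF and embedding scores are combined, so the following is only a minimal sketch under stated assumptions: a weighted average of the two cosine similarities, with a hypothetical weight `alpha`. The embeddings here are toy vectors standing in for SBERT or E5 output, the function names are illustrative, and preprocessing is reduced to lowercasing and whitespace splitting.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (term -> weight dicts) from whitespace tokens."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF: log((1 + N) / (1 + df)) + 1.
        vectors.append({
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine_sparse(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def cosine_dense(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_scores(docs, embeddings, query_idx, alpha=0.5):
    """Score every other document against docs[query_idx].

    ASSUMPTION: the hybrid score is alpha * lexical + (1 - alpha) * semantic;
    the source does not specify the combination rule.
    """
    tfidf = tfidf_vectors(docs)
    scores = {}
    for i in range(len(docs)):
        if i == query_idx:
            continue
        lexical = cosine_sparse(tfidf[query_idx], tfidf[i])
        semantic = cosine_dense(embeddings[query_idx], embeddings[i])
        scores[i] = alpha * lexical + (1 - alpha) * semantic
    return scores


# Toy data: in practice the embeddings would come from an SBERT or E5 encoder.
docs = [
    "document clustering with k-means",
    "k-means clustering of thesis documents",
    "image classification with cnn",
]
embeddings = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]

scores = hybrid_scores(docs, embeddings, query_idx=0)
# Document indices ordered from most to least similar to document 0.
ranking = sorted(scores, key=scores.get, reverse=True)
```

Setting `alpha=1.0` reduces the score to pure TF-IDF similarity and `alpha=0.0` to pure embedding similarity, which makes it straightforward to compare the three rankings with a rank correlation such as Spearman's.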
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Copyright and License Statement
Copyright:
Authors who publish their manuscripts in this journal agree to the following terms:
- Copyright of each article is held by the author.
- The author acknowledges that JOINTECS (JOURNAL OF INFORMATION TECHNOLOGY AND COMPUTER SCIENCE) holds the right of first publication under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.
- The author may enter into separate arrangements for the non-exclusive distribution of the version of the manuscript published in this journal (e.g., depositing it in the author's institutional repository or publishing it in a book), with an acknowledgment that the manuscript was first published in JOINTECS (JOURNAL OF INFORMATION TECHNOLOGY AND COMPUTER SCIENCE).
License:
JOINTECS is published under the terms of the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license. This license permits anyone to copy and redistribute the material in any medium or format, and to remix, transform, and build upon the material for any purpose, including commercial purposes, as long as they give credit to the Author for the original creation.