Hybrid TF-IDF and Embedding Model for Improving Similarity and Clustering Accuracy

Authors

  • Ero Wahyu Pratomo, Universitas Amikom Yogyakarta
  • Ema Utami, Universitas Amikom Yogyakarta

DOI:

https://doi.org/10.31328/jointecs.v10i1.7344

Abstract

This study aims to improve the accuracy of final project document recommendation systems for students by employing a
hybrid approach that combines the vector-based TF-IDF method with the semantic embedding models SBERT and E5. The
primary issue addressed is the limitation of TF-IDF in capturing semantic context, which often results in recommendations
that are less relevant in meaning. The goal is a recommendation system that considers both literal word similarity and
semantic similarity between documents. The research uses 500 metadata records of student theses harvested from an
institutional repository via the OAI-PMH protocol, focusing on key attributes such as title, abstract, and keywords. The data
undergo a text preprocessing phase that includes tokenization, normalization, stopword removal, and stemming. Three
similarity approaches are applied: TF-IDF, SBERT, and E5, each used to calculate document similarity. The evaluation
compares similarity scores and analyzes ranking shifts, score distributions, and document clustering using the K-Means
algorithm. Experimental results show that the hybrid model significantly improves recommendation quality. The E5 model
produces the most stable and semantically relevant similarity scores, as well as the cleanest and most visually distinct
document clusters. Both Spearman and Pearson correlation analyses indicate that combining TF-IDF with embedding models
improves the ranking order of recommended documents. In conclusion, the hybrid approach using TF-IDF and semantic
embeddings, particularly E5, significantly enhances the accuracy of document recommendation systems and the
effectiveness of topic clustering.
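
The hybrid scoring idea described in the abstract can be illustrated with a minimal sketch. The abstract does not state how the TF-IDF and embedding scores are combined, so the weighted sum and its `alpha` parameter below are assumptions, as are all function names. The dense vectors are toy placeholders standing in for SBERT or E5 embeddings, which would normally come from a pretrained model, and the TF-IDF weighting is one common smoothed variant, not necessarily the one the authors used.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight) per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # Relative term frequency times a smoothed inverse document frequency.
            t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t, c in tf.items()
        })
    return vectors


def cosine_sparse(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cosine_dense(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def hybrid_scores(query_idx, docs, embeddings, alpha=0.5):
    """Rank all other documents against docs[query_idx].

    Assumed combination rule: alpha * lexical + (1 - alpha) * semantic.
    """
    tfidf = tfidf_vectors(docs)
    scores = []
    for j in range(len(docs)):
        if j == query_idx:
            continue
        lexical = cosine_sparse(tfidf[query_idx], tfidf[j])
        semantic = cosine_dense(embeddings[query_idx], embeddings[j])
        scores.append((j, alpha * lexical + (1 - alpha) * semantic))
    # Highest combined score first.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy, already-preprocessed documents (tokenized, stopwords removed).
docs = [
    ["document", "recommendation", "system"],
    ["thesis", "recommendation", "system"],
    ["computer", "network", "security"],
]
# Placeholder 2-D vectors standing in for SBERT or E5 sentence embeddings.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

ranked = hybrid_scores(0, docs, embeddings, alpha=0.5)
print(ranked)
```

Setting `alpha=1.0` reduces the ranking to pure TF-IDF and `alpha=0.0` to pure embedding similarity, which makes it straightforward to compare how the ranking shifts between the lexical, semantic, and combined variants.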

Published

2025-09-30

Section

Articles