CNN-LSTM with Transformer-Based Fusion (CLIP: Contrastive Language-Image Pretraining) for Sentiment and Emotion

Authors

  • Irsyad Khalid Ilyas, Universitas Amikom Yogyakarta
  • Andi Sunyoto, Universitas Amikom Yogyakarta
  • M. Hanafi, Universitas Amikom Yogyakarta

DOI:

https://doi.org/10.31328/jointecs.v10i1.7346

Abstract

Sentiment analysis on multimodal data presents significant challenges due to the complex integration of textual and visual
information, especially in internet memes where meaning often emerges from the interplay between image and text. This study
proposes a hybrid deep learning model that combines Convolutional Neural Networks (CNN) for visual feature extraction,
Long Short-Term Memory (LSTM) networks for contextual text understanding, and Contrastive Language–Image Pretraining
(CLIP) as a transformer-based fusion mechanism to align both modalities in a shared semantic space. Evaluated on the
Memotion 7K dataset, the proposed model achieves an accuracy of 85.6%, outperforming baseline architectures such as CNN-LSTM (78.3%) and BERT-DenseNet (81.2%). Experimental results demonstrate that CLIP’s contrastive learning significantly
enhances the model’s ability to interpret ambiguous and sarcastic content by capturing nuanced text-image relationships. The
system shows balanced performance across sentiment classes, with strong precision and recall metrics, confirming its
robustness. This research contributes a methodologically sound framework for multimodal sentiment analysis and offers
practical implications for real-world applications such as social media monitoring, digital marketing, and cross-cultural
content understanding. Future work includes multilingual dataset expansion, computational optimization for deployment, and
integration of additional modalities like audio or video to further enrich emotional context.
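The abstract describes the architecture only at a high level. As a minimal sketch of one plausible reading, the PyTorch code below combines a CNN image encoder (ResNet-18 as a stand-in for the paper's CNN), a single-layer LSTM over caption tokens, and frozen CLIP image and text encoders, with simple feature concatenation standing in for the paper's transformer-based fusion. The class name, layer sizes, and three-way sentiment head are assumptions for illustration, not the authors' implementation.

```python
# A hedged sketch of a CNN + LSTM + CLIP sentiment classifier, assuming
# concatenation-based fusion; the paper's exact design is not given here.
import torch
import torch.nn as nn
from torchvision.models import resnet18
from transformers import CLIPModel


class CnnLstmClipClassifier(nn.Module):  # hypothetical name
    def __init__(self, vocab_size: int, num_classes: int = 3,
                 embed_dim: int = 128, lstm_hidden: int = 128):
        super().__init__()
        # CNN branch: visual features (ResNet-18 stands in for the paper's CNN).
        cnn = resnet18(weights=None)
        self.cnn = nn.Sequential(*list(cnn.children())[:-1])  # drop final fc
        # LSTM branch: contextual features over embedded caption tokens.
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, lstm_hidden, batch_first=True)
        # CLIP branch: contrastively aligned image/text embeddings (frozen).
        self.clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        for p in self.clip.parameters():
            p.requires_grad = False
        fused = 512 + lstm_hidden + 2 * self.clip.config.projection_dim
        self.head = nn.Sequential(
            nn.Linear(fused, 256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, num_classes),
        )

    def forward(self, image, token_ids, clip_pixel_values,
                clip_input_ids, clip_attention_mask):
        v = self.cnn(image).flatten(1)                      # (B, 512)
        _, (h, _) = self.lstm(self.embed(token_ids))
        t = h[-1]                                           # (B, lstm_hidden)
        ci = self.clip.get_image_features(pixel_values=clip_pixel_values)
        ct = self.clip.get_text_features(input_ids=clip_input_ids,
                                         attention_mask=clip_attention_mask)
        # Concatenate all four feature vectors and classify sentiment.
        return self.head(torch.cat([v, t, ci, ct], dim=1))
```

In this reading, the frozen CLIP branch supplies the shared text-image semantic space that the abstract credits with resolving ambiguous and sarcastic memes, while the trainable CNN and LSTM branches retain modality-specific detail.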

Published

2025-09-30

Section

Articles