CNN-LSTM with Transformer-Based Fusion (CLIP: Contrastive Language-Image Pretraining) for Sentiment and Emotion
DOI: https://doi.org/10.31328/jointecs.v10i1.7346

Abstract
Sentiment analysis on multimodal data presents significant challenges due to the complex integration of textual and visual
information, especially in internet memes where meaning often emerges from the interplay between image and text. This study
proposes a hybrid deep learning model that combines Convolutional Neural Networks (CNN) for visual feature extraction,
Long Short-Term Memory (LSTM) networks for contextual text understanding, and Contrastive Language–Image Pretraining
(CLIP) as a transformer-based fusion mechanism to align both modalities in a shared semantic space. Evaluated on the
Memotion 7K dataset, the proposed model achieves an accuracy of 85.6%, outperforming baseline architectures such as CNN-LSTM (78.3%) and BERT-DenseNet (81.2%). Experimental results demonstrate that CLIP’s contrastive learning significantly
enhances the model’s ability to interpret ambiguous and sarcastic content by capturing nuanced text-image relationships. The
system shows balanced performance across sentiment classes, with strong precision and recall metrics, confirming its
robustness. This research contributes a methodologically sound framework for multimodal sentiment analysis and offers
practical implications for real-world applications such as social media monitoring, digital marketing, and cross-cultural
content understanding. Future work includes multilingual dataset expansion, computational optimization for deployment, and
integration of additional modalities like audio or video to further enrich emotional context.
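The fusion mechanism described above can be illustrated with a minimal sketch: CNN image features and LSTM text features are projected into a shared semantic space and aligned by cosine similarity, in the spirit of CLIP's contrastive objective. All shapes, the projection weights, and the concatenation-based classifier input here are illustrative assumptions, not the authors' exact implementation.

```python
# Hypothetical sketch of CLIP-style fusion of CNN image features and
# LSTM text features in a shared semantic space. Feature dimensions,
# projections, and the fused representation are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x):
    """Normalize rows to unit length, as done before cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-ins for real encoder outputs (assumed shapes).
cnn_image_feat = rng.normal(size=(1, 512))   # e.g. pooled CNN features
lstm_text_feat = rng.normal(size=(1, 256))   # e.g. final LSTM hidden state

# Learned projections into a shared 128-d space (random here).
W_img = rng.normal(size=(512, 128))
W_txt = rng.normal(size=(256, 128))

img_emb = l2_normalize(cnn_image_feat @ W_img)
txt_emb = l2_normalize(lstm_text_feat @ W_txt)

# Contrastive-style alignment score: cosine similarity of the two
# unit-normalized embeddings, bounded in [-1, 1].
alignment = float(img_emb @ txt_emb.T)

# A simple fused representation that a sentiment classifier head
# could consume.
fused = np.concatenate([img_emb, txt_emb], axis=-1)

print(fused.shape)  # (1, 256)
```

In a trained model, the projections would be learned with a contrastive loss that pulls matching image-text pairs together in the shared space, which is what lets the system capture the text-image interplay in memes.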
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Copyright and License Statement
Copyright:
Authors who publish their manuscripts in this journal agree to the following terms:
Copyright on each article belongs to the author.
- The author acknowledges that JOINTECS (JOURNAL OF INFORMATION TECHNOLOGY AND COMPUTER SCIENCE) has the right of first publication under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.
- Authors may enter into separate, non-exclusive distribution arrangements for the version of the manuscript published in this journal (e.g., depositing it in the author's institutional repository or publishing it in a book), with acknowledgement of its initial publication in JOINTECS (JOURNAL OF INFORMATION TECHNOLOGY AND COMPUTER SCIENCE);
License:
JOINTECS is published under the terms of the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license. This license permits anyone to copy and redistribute the material in any medium or format, and to remix, transform, and build upon it for any purpose, including commercial purposes, provided they credit the Author for the original creation.