Original Article

Generative AI-Based Image Captioning System Using Vision Transformers and Large Language Models

Ameena Firdous¹, Dr. Waseema Masood²
¹ Assistant Professor, Nawab Shah Alam Khan College of Engineering and Technology, Hyderabad, Telangana, India.
² Associate Professor, Deccan College of Engineering and Technology, Hyderabad, Telangana, India.

Published Online: March-April 2026

Pages: 67-73

https://test.theijire.com/archives/10.59256/ijire.20260702008