Generative AI-Based Image Captioning System Using Vision Transformers and Large Language Models

Ameena Firdous; Dr. Waseema Masood

doi:https://www.doi.org/10.59256/ijire.20260702008

Original Article

Ameena Firdous¹, Dr. Waseema Masood²

¹ Assistant Professor, Nawab Shah Alam Khan College of Engineering and Technology, Hyderabad, Telangana, India.
² Associate Professor, Deccan College of Engineering and Technology, Hyderabad, Telangana, India.

Published Online: March-April 2026

Pages: 67-73

Abstract

Recent advances in generative artificial intelligence (AI) and large language models (LLMs) have greatly improved the ability of machines to understand and produce human-like language. Linking visual information with natural language understanding, however, remains a difficult problem in artificial intelligence. This project develops an intelligent image captioning system that bridges computer vision and natural language processing. The proposed system analyzes the visual content of an image and automatically generates a meaningful, context-aware text description, allowing machines to understand and describe visual scenes more effectively. It combines deep learning-based computer vision models with transformer-based language generation techniques. A pretrained vision encoder, either a Convolutional Neural Network (CNN) or a Vision Transformer (ViT), extracts visual features from the image, and a language decoder then processes these features to produce coherent captions. The model is trained and evaluated on the publicly available MS COCO and Flickr8k/Flickr30k datasets. The system is implemented in Python using a deep learning framework such as PyTorch or TensorFlow. Performance is measured with the standard metrics BLEU, METEOR, and ROUGE. The proposed system shows how generative AI, combined with computer vision, can improve image understanding, accessibility, and automated content description.
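
To make the evaluation step concrete, the following is a minimal sketch of how a sentence-level BLEU score could be computed for one generated caption against its reference captions. It is an illustration only and not the authors' evaluation code: the function names `ngrams` and `sentence_bleu`, the whitespace tokenization, the lowercasing, and the absence of smoothing are all assumptions made here. Reported results would normally come from an established toolkit.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list (hypothetical helper)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform n-gram weights.

    Assumptions for illustration: whitespace tokenization and lowercasing.
    Returns 0.0 as soon as any n-gram order has no match.
    """
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    if not cand or not refs:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0  # candidate is shorter than n tokens

        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        if clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty uses the reference length closest to the candidate length
    # (ties resolved toward the shorter reference).
    ref_len = min((len(r) for r in refs), key=lambda l: (abs(l - len(cand)), l))
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    # Made-up captions, not taken from MS COCO or Flickr.
    generated = "a dog runs on the grass"
    human = ["a dog is running on the grass", "a dog runs across the grass"]
    print(round(sentence_bleu(generated, human), 4))
```

Scores on MS COCO or Flickr captions are usually reported at corpus level with smoothing, so values from this sketch are not directly comparable to published numbers.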
