PyTorch

Context-Aware Image Captioning with Scene Graphs

End-to-end image captioning pipeline conditioned on structured scene graph triples.
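
As a rough illustration of what conditioning on scene graph triples can look like, the sketch below flattens (subject, predicate, object) triples into a text prefix that a captioning decoder could consume. The `Triple` type, the `linearize_triples` helper, and the `" ; "` separator are hypothetical names chosen for this example; the project's actual encoding is not described here.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    """One scene graph edge: subject --predicate--> object (hypothetical type)."""
    subject: str
    predicate: str
    obj: str


def linearize_triples(triples: List[Triple], sep: str = " ; ") -> str:
    """Flatten triples into one conditioning string, dropping exact duplicates."""
    seen = set()
    parts = []
    for t in triples:
        if t in seen:
            continue  # skip an edge that was already emitted
        seen.add(t)
        parts.append(f"{t.subject} {t.predicate} {t.obj}")
    return sep.join(parts)


graph = [
    Triple("man", "riding", "horse"),
    Triple("horse", "on", "beach"),
    Triple("man", "riding", "horse"),  # duplicate edge, dropped
]
prompt_prefix = linearize_triples(graph)
print(prompt_prefix)  # man riding horse ; horse on beach
```

A learned graph encoder is another common way to condition on a scene graph; plain-text linearization is shown only because it is the simplest to read.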

Fine-tuning a 3B instruct model with QLoRA on a single consumer GPU to produce strict JSON from free-form text. Exact-match accuracy improved from 0.258 to 0.624.
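
One plausible way to score exact-match accuracy for JSON outputs is sketched below. It assumes a prediction counts as correct only when it parses as valid JSON and the parsed value equals the parsed reference, so key order and whitespace do not matter. That definition is an assumption; the project may compare raw strings instead. The helper names `parse_or_invalid` and `exact_match` are hypothetical.

```python
import json
from typing import Any, List

_INVALID = object()  # sentinel for output that is not valid JSON


def parse_or_invalid(text: str) -> Any:
    """Parse text as JSON; return the sentinel when parsing fails."""
    try:
        return json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return _INVALID


def exact_match(predictions: List[str], references: List[str]) -> float:
    """Fraction of predictions whose parsed JSON equals the parsed reference."""
    if not predictions:
        return 0.0
    hits = 0
    for pred, ref in zip(predictions, references):
        parsed = parse_or_invalid(pred)
        if parsed is not _INVALID and parsed == parse_or_invalid(ref):
            hits += 1
    return hits / len(predictions)


preds = [
    '{"a": 1, "b": 2}',   # same object as the reference, different key order
    '{"a": 1',            # truncated, not valid JSON
    '{"a": 2}',           # valid JSON, wrong value
    'Sure! {"a": 1}',     # extra prose breaks strict JSON
]
refs = ['{"b": 2, "a": 1}', '{"a": 1}', '{"a": 1}', '{"a": 1}']
score = exact_match(preds, refs)
print(score)  # 0.25
```

Under this definition, any text surrounding the JSON object counts as a miss, which matches the "strict JSON" requirement.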