Course: CSCI E-25 (Computer Vision), Harvard Extension School, Spring 2026 Code: github.com/nthapaliya/scene-graph-captioning
The question
Do structured scene graphs — explicit subject → predicate → object triples extracted
from an image — help a caption decoder produce better descriptions than image features alone?
The pipeline
Three stages, trained end-to-end:
- YOLO object detector fine-tuned on the Visual Genome VG150 subset (150 object classes, 50 predicates)
- Visual predicate classifier — ResNet-50 ROI features over subject + object bounding boxes, concatenated with class embeddings and a spatial vector, fed to a 3-layer MLP
- ViT + T5 caption decoder — ViT-base/16 image encoder and a T5-small scene-graph encoder, concatenated and decoded with cross-attention
The ablation baseline removes the scene graph input (sg_input_ids = None), keeping everything else identical.
Results
Evaluated on a 9,180-image held-out validation set (beam-4 decoding):
| Metric | Image-only | Scene-graph | Δ |
|---|---|---|---|
| BLEU-4 | 0.0597 | 0.0665 | +11.4% |
| METEOR | 0.4319 | 0.4474 | +3.6% |
| ROUGE-L | 0.2857 | 0.2924 | +2.3% |
The predicate classifier reached top-5 accuracy of 89.8% on the VG150 validation split.
Scene-graph conditioning consistently improves all metrics. BLEU-4 sees the largest relative lift, suggesting the model commits to specific multi-word phrases (“riding a skateboard”) rather than generic templates. The dominant failure mode is upstream detector error.
Dataset
Visual Genome VG150 subset
(91,804 images). Captions sourced from MS-COCO for images with a coco_id link; Visual
Genome region descriptions as fallback.
Stack
PyTorch, YOLO (Ultralytics), from-scratch Transformer decoder, NLTK (BLEU), rouge-score, h5py, uv