Course: CSCI E-222 (Foundations of Large Language Models), Harvard Extension School, Spring 2026
Code: github.com/nthapaliya/qlora-food-extraction


The task

Given an unstructured English caption, which may or may not describe a food or drink scene, produce a valid JSON object with exactly four keys:

{
  "is_food_or_drink": true,
  "tags": ["fi", "di"],
  "food_items": ["cauliflower florets", "sweet potato wedges"],
  "drink_items": ["white wine"]
}

The contract is tight: a missing closing brace or wrong tag code counts as a miss.
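A minimal sketch of how that contract could be checked, assuming the types shown in the example above (one boolean and three lists of strings). The function name check_output and the optional allowed_tags parameter are hypothetical; the full set of tag codes is not listed here, so it is left for the caller to supply.

```python
import json

REQUIRED_KEYS = {"is_food_or_drink", "tags", "food_items", "drink_items"}


def _is_str_list(value):
    """True if value is a list whose elements are all strings."""
    return isinstance(value, list) and all(isinstance(x, str) for x in value)


def check_output(text, allowed_tags=None):
    """Return (json_valid, schema_ok) for one raw model output.

    allowed_tags is a hypothetical parameter: pass the dataset's tag codes
    to also reject unknown tags, or leave it as None to skip that check.
    """
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False, False  # e.g. a missing closing brace

    # Exactly the four expected keys, no more and no fewer.
    if not isinstance(obj, dict) or set(obj) != REQUIRED_KEYS:
        return True, False

    ok = isinstance(obj["is_food_or_drink"], bool) and all(
        _is_str_list(obj[key]) for key in ("tags", "food_items", "drink_items")
    )
    if ok and allowed_tags is not None:
        ok = set(obj["tags"]) <= set(allowed_tags)
    return True, ok
```

Truncating a valid output by one character (dropping the closing brace) makes both flags False, which is the strictness the contract describes.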

The approach

QLoRA freezes the base model in 4-bit NF4 quantization and trains small low-rank adapter matrices on top in higher precision. Only the adapters are updated — 29.9M of 3.12B parameters (0.96%).
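The adapter count can be reproduced with a little arithmetic. The sketch below assumes the layer dimensions from the published Qwen2.5-3B configuration (hidden size 2048, MLP size 11008, 36 layers, 2 key/value heads of dimension 128), none of which are stated in this write-up, and assumes the adapters cover the seven linear projections in each transformer block but not the output head.

```python
# Assumed Qwen2.5-3B dimensions (taken from the public model config,
# not from this write-up).
HIDDEN = 2048
INTERMEDIATE = 11008
NUM_LAYERS = 36
KV_DIM = 2 * 128  # 2 key/value heads x head dimension 128
RANK = 16

# (in_features, out_features) of each linear layer in one transformer block.
LINEAR_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    """A LoRA pair adds A (rank x in) and B (out x rank) per adapted layer."""
    return rank * (in_features + out_features)


per_block = sum(lora_params(i, o, RANK) for i, o in LINEAR_SHAPES.values())
total = per_block * NUM_LAYERS
print(f"{total:,} trainable adapter parameters")  # 29,933,568
```

Under these assumptions the total rounds to 29.9M, about 0.96% of 3.12B.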

  • Base model: Qwen/Qwen2.5-3B-Instruct
  • Quantization: 4-bit NF4, double quantization, bfloat16 compute
  • LoRA rank: 16, applied to all linear layers, dropout 0.05
  • Loss masking: completion-only (prompt tokens masked with -100)
  • Hardware: RTX 3060 Ti (8 GB VRAM); peak usage stayed well under that limit
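Completion-only loss masking from the list above can be illustrated without any framework. This is a sketch, not the repository's code: build_labels is a hypothetical helper and the token IDs are made up. It relies only on the fact that -100 is the default ignore index of PyTorch's cross-entropy loss, so masked positions contribute nothing to the gradient.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(prompt_ids, completion_ids):
    """Concatenate prompt and completion, masking the prompt in the labels.

    The model still sees the prompt tokens as input; only the completion
    tokens (the JSON answer) are scored by the loss.
    """
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels


# Hypothetical token IDs, for illustration only.
input_ids, labels = build_labels([11, 12, 13], [21, 22])
print(input_ids)  # [11, 12, 13, 21, 22]
print(labels)     # [-100, -100, -100, 21, 22]
```

Masking the prompt keeps the training signal focused on producing the JSON object rather than on reproducing the caption and instructions.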

Results

Evaluated on a 213-row held-out test split. A malformed output scores zero on every downstream metric.

Metric        Base model   Fine-tuned   Δ
json_valid    0.972        1.000        +0.028
schema_ok     0.925        1.000        +0.075
tags_f1       0.305        0.943        +0.638
exact_match   0.258        0.624        +0.366

The base model’s biggest failure — generating free-form English tags instead of the dataset’s short two-letter codes — is exactly what a fine-tuned adapter fixes, even from fewer than 1,000 training rows.
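To make the metrics concrete, here is one plausible per-row scoring scheme. It is a sketch under assumptions: tags_f1 is taken to be a set-based F1 per row, exact_match to be strict equality of the parsed object with the gold label, and the names tag_f1 and score_row are hypothetical. The project's actual definitions may differ.

```python
import json


def tag_f1(pred_tags, gold_tags):
    """Set-based F1 between predicted and gold tag codes for one row."""
    pred, gold = set(pred_tags), set(gold_tags)
    if not pred and not gold:
        return 1.0  # assumption: two empty tag lists count as a perfect match
    true_pos = len(pred & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


def score_row(raw_output, gold):
    """Score one raw model output against its gold label dict."""
    zero = {"json_valid": 0.0, "tags_f1": 0.0, "exact_match": 0.0}
    try:
        pred = json.loads(raw_output)
    except json.JSONDecodeError:
        return zero  # malformed output scores zero on everything
    if not isinstance(pred, dict):
        return {**zero, "json_valid": 1.0}

    tags = pred.get("tags")
    tags_ok = isinstance(tags, list) and all(isinstance(t, str) for t in tags)
    return {
        "json_valid": 1.0,
        "tags_f1": tag_f1(tags, gold["tags"]) if tags_ok else 0.0,
        "exact_match": float(pred == gold),  # order-sensitive for lists
    }
```

Under this scheme a free-form tag such as "food" (a made-up example) shares nothing with the gold codes and scores a tags_f1 of zero even when the JSON itself is valid, which matches the base model's failure pattern described above.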

Dataset

mrdbourke/FoodExtract-1k (1,420 rows, 50/50 food vs non-food). The label column stores a Python repr dict, not JSON, so it must be parsed with ast.literal_eval rather than json.loads.
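A short sketch of that conversion, using only the standard library. The sample label string is hypothetical (written to match the four-key schema, not copied from the dataset), and label_to_json is an illustrative name.

```python
import ast
import json


def label_to_json(label_repr):
    """Convert a Python-repr dict string into a JSON string.

    ast.literal_eval only evaluates literals (dicts, lists, strings,
    numbers, booleans, None), so it is safe on untrusted text, unlike eval.
    """
    obj = ast.literal_eval(label_repr)
    if not isinstance(obj, dict):
        raise ValueError("expected a dict literal")
    return json.dumps(obj)


# Hypothetical label in Python repr form: single quotes and a capitalised True.
raw = "{'is_food_or_drink': True, 'tags': ['fi'], 'food_items': ['sweet potato wedges'], 'drink_items': []}"

print(label_to_json(raw))
# {"is_food_or_drink": true, "tags": ["fi"], "food_items": ["sweet potato wedges"], "drink_items": []}
```

Passing the same string to json.loads fails, because JSON requires double-quoted strings and lowercase true.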

Stack

PyTorch, Hugging Face (transformers, PEFT, bitsandbytes, datasets), Qwen2.5-3B-Instruct, uv