[{"content":"Course: CSCI E-25 (Computer Vision), Harvard Extension School, Spring 2026 Code: github.com/nthapaliya/scene-graph-captioning\nThe question Do structured scene graphs — explicit subject → predicate → object triples extracted from an image — help a caption decoder produce better descriptions than image features alone?\nThe pipeline Three stages, trained end-to-end:\nYOLO object detector fine-tuned on the Visual Genome VG150 subset (150 object classes, 50 predicates) Visual predicate classifier — ResNet-50 ROI features over subject + object bounding boxes, concatenated with class embeddings and a spatial vector, fed to a 3-layer MLP ViT + T5 caption decoder — ViT-base/16 image encoder and a T5-small scene-graph encoder, concatenated and decoded with cross-attention The ablation baseline removes the scene graph input (sg_input_ids = None), keeping everything else identical.\nResults Evaluated on a 9,180-image held-out validation set (beam-4 decoding):\nMetric Image-only Scene-graph Δ BLEU-4 0.0597 0.0665 +11.4% METEOR 0.4319 0.4474 +3.6% ROUGE-L 0.2857 0.2924 +2.3% The predicate classifier reached top-5 accuracy of 89.8% on the VG150 validation split.\nScene-graph conditioning consistently improves all metrics. BLEU-4 sees the largest relative lift, suggesting the model commits to specific multi-word phrases (\u0026ldquo;riding a skateboard\u0026rdquo;) rather than generic templates. The dominant failure mode is upstream detector error.\nDataset Visual Genome VG150 subset (91,804 images). Captions sourced from MS-COCO for images with a coco_id link; Visual Genome region descriptions as fallback.\nStack PyTorch, YOLO (Ultralytics), from-scratch Transformer decoder, NLTK (BLEU), rouge-score, h5py, uv\n","permalink":"https://ndthp.com/projects/scene-graph-captioning/","summary":"End-to-end image captioning pipeline conditioned on structured scene graph triples. 
BLEU-4 +11.4% over an image-only baseline.","title":"Context-Aware Image Captioning with Scene Graphs"},{"content":"Course: CSCI E-222 (Foundations of Large Language Models), Harvard Extension School, Spring 2026 Code: github.com/nthapaliya/qlora-food-extraction\nThe task Given an unstructured English caption of a food or drink scene, produce a valid JSON object with exactly four keys:\n{ \u0026#34;is_food_or_drink\u0026#34;: true, \u0026#34;tags\u0026#34;: [\u0026#34;fi\u0026#34;, \u0026#34;di\u0026#34;], \u0026#34;food_items\u0026#34;: [\u0026#34;cauliflower florets\u0026#34;, \u0026#34;sweet potato wedges\u0026#34;], \u0026#34;drink_items\u0026#34;: [\u0026#34;white wine\u0026#34;] } The contract is tight: a missing closing brace or wrong tag code counts as a miss.\nThe approach QLoRA freezes the base model in 4-bit NF4 quantization and trains small low-rank adapter matrices on top in higher precision. Only the adapters are updated — 29.9M of 3.12B parameters (0.96%).\nBase model: Qwen/Qwen2.5-3B-Instruct Quantization: 4-bit NF4, double quantization, bfloat16 compute LoRA rank: 16, applied to all linear layers, dropout 0.05 Loss masking: completion-only (prompt tokens masked with -100) Hardware: RTX 3060 Ti, 8 GB VRAM — peak VRAM well under 8 GB Results Evaluated on a 213-row held-out test split. A malformed output scores zero on every downstream metric.\nMetric Base model Fine-tuned Δ json_valid 0.972 1.000 +0.028 schema_ok 0.925 1.000 +0.075 tags_f1 0.305 0.943 +0.638 exact_match 0.258 0.624 +0.366 The base model\u0026rsquo;s biggest failure — generating free-form English tags instead of the dataset\u0026rsquo;s short two-letter codes — is exactly what a fine-tuned adapter fixes, even from fewer than 1,000 training rows.\nDataset mrdbourke/FoodExtract-1k (1,420 rows, 50/50 food vs non-food). 
The label column stores a Python repr dict, not JSON — ast.literal_eval required.\nStack PyTorch, Hugging Face (transformers, PEFT, bitsandbytes, datasets), Qwen2.5-3B-Instruct, uv\n","permalink":"https://ndthp.com/projects/qlora-food-extraction/","summary":"Fine-tuning a 3B instruct model with QLoRA to produce strict JSON from free-form text — on a single consumer GPU. Exact-match accuracy 0.258 → 0.624.","title":"QLoRA Fine-Tuning for Structured JSON Extraction"},{"content":"Course: CS 109B (Advanced Data Science), Harvard, Spring 2025 Code: github.com/nthapaliya/llm-multiagent-benchmark\nThe hypothesis A 4-agent discussion pipeline — Planner → Answerer → Critic → Moderator — can improve LLM reasoning accuracy on challenging benchmarks without retraining, by giving the model structured opportunities to self-correct over up to 5 rounds.\nThe pipeline Round 1: Planner → decompose question into a 2-4 step reasoning plan Answerer → follow plan, produce provisional answer Critic → identify errors, or \u0026#34;No further objections\u0026#34; Moderator → \u0026#34;FINAL ANSWER: X\u0026#34; or \u0026#34;CONTINUE\u0026#34; Rounds 2-5 (if CONTINUE): Answerer → revise reasoning incorporating prior critique Critic → re-evaluate Moderator → decide Each stage is a separate Gemini 2.0 Flash API call. The evaluation harness runs all benchmarks asynchronously with a dual token-bucket rate limiter (2,000 RPM / 4M TPM).\nResults Evaluated on 7,661 questions across three benchmarks:\nDataset Baseline Multi-Agent Δ MMLU Pro (5,000 Qs) 60.0% 52.7% −7.3 pp GPQA Diamond (161 Qs) 53.4% 50.3% −3.1 pp HLE (2,500 Qs) 3.8% 2.8% −1.0 pp Cost and latency each rose 3–5× per question, while accuracy fell on all three benchmarks.\nWhy it backfired: 85% of final answers settled at round 1 — the Critic rarely improves a correct answer but does occasionally break one. Later rounds amplify errors rather than correcting them.\nWhat this tells us Homogeneous self-critique doesn\u0026rsquo;t work at this scale. 
More promising directions: heterogeneous agent pools (mixing models with different strengths), targeted fact-checking verifiers, and selective invocation only on low-confidence initial answers.\nStack Python, Google Gemini API, asyncio (async rate-limited harness), pandas, Matplotlib, uv\n","permalink":"https://ndthp.com/projects/llm-multiagent-benchmark/","summary":"Rigorous evaluation of a 4-agent discussion pipeline on 7,661 benchmark questions. The framework decreased accuracy on all three benchmarks — an instructive negative result.","title":"LLM Benchmark Evaluation: Multi-Agent Discussion Framework"},{"content":"Course: CSCI E-89 (Deep Learning), Harvard Extension School, Fall 2024 Code: github.com/nthapaliya/cnn-image-upscaling\nOverview Single-image super-resolution (SISR) is the task of recovering a plausible high-resolution image from a low-resolution input. This project trains and compares three CNN architectures at 4× upscaling on the FFHQ dataset (70,000 high-quality face images), measuring output quality with PSNR and SSIM.\nFaces provide a structured benchmark domain where quality degradation is perceptually obvious and metrics are well-calibrated.\nArchitectures Model Key idea SRCNN Pioneering 3-layer super-resolution CNN (Dong et al., 2014) ESPCN Sub-pixel convolution (pixel shuffle) for efficient upscaling (Shi et al., 2016) EDSR Removes batch norm for more stable training at depth (Lim et al., 2017) Evaluation Both metrics computed on the luminance channel (Y of YCbCr), matching standard practice:\nPSNR — Peak Signal-to-Noise Ratio (higher is better, measured in dB) SSIM — Structural Similarity Index (higher is better, 0–1) Dataset FFHQ — 70,000 high-quality PNG face images at 1024×1024. Downloaded via Kaggle. Low-resolution training inputs created by bicubic downsampling (4× reduction). 
65,000 train / 5,000 test split.\nStack TensorFlow 2.x, Keras, NumPy, Matplotlib, Kaggle API, uv\n","permalink":"https://ndthp.com/projects/cnn-image-upscaling/","summary":"Comparison of three CNN architectures for 4× single-image super-resolution on the FFHQ dataset. Evaluated with PSNR and SSIM.","title":"CNN Image Super-Resolution (4× Upscaling)"},{"content":"I\u0026rsquo;m a software engineer transitioning into data science, currently completing a Master\u0026rsquo;s in Data Science at Harvard Extension School (expected 2026).\nBefore the master\u0026rsquo;s I spent five years as a software developer at OfficeSpace Software, building and shipping features for a hybrid-workplace SaaS platform serving ~700 enterprise clients and 2 million users. I worked across the full stack — backend APIs (Ruby on Rails), frontend SPAs (React, Pixi.js), and cloud infrastructure (GCP, Azure).\nMy current work focuses on:\nComputer vision — scene graph generation, object detection (YOLO), image captioning LLMs \u0026amp; NLP — parameter-efficient fine-tuning (QLoRA/LoRA), RAG, benchmark evaluation Statistical modeling — regression, hypothesis testing, model diagnostics in R and Python I grew up in Nepal, studied physics at Amherst College, and have lived in Boston for the past several years. 
I\u0026rsquo;m fluent in English and Nepali, and conversational in Spanish.\nOutside of work I maintain a homelab (Proxmox, self-hosted services, GPS NTP server, ESP32 IoT), contribute to open-source tools I use daily (Fish shell, Neovim), and spend time cooking and hiking.\nContact: hello@ndthp.com GitHub: github.com/nthapaliya\n","permalink":"https://ndthp.com/about/","summary":"About Niraj Thapaliya","title":"About"},{"content":"Download PDF\nEducation Harvard Extension School — Master\u0026rsquo;s in Data Science (M.L.A.), expected May 2026 Coursework: Intro to Data Science (CS 109A/B), Deep Learning (E-89), Advanced Deep Learning (E-104), Computer Vision (E-25), Foundations of LLMs (E-222), Statistical Data Modeling (E-106), Data Structures (S-22)\nAmherst College — B.A., Physics, 2014\nData Science Projects See the Projects page for full write-ups.\nWork Experience OfficeSpace Software — Software Developer → Senior Software Developer (2016–2021) Built and scaled features for a hybrid-workplace SaaS platform (~700 enterprise clients, 2M users).\nReduced idle frontend CPU load by 90% through systematic profiling Designed and built customer-facing RESTful APIs; maintained GCP and Azure infrastructure Rewrote backend from Java to Ruby/Rails and frontend from Flash to React/HTML5 Humble Bones Granola — Co-founder (2021–2023)\nNortheast Regional Training Institute — Mentor \u0026amp; Volunteer (2021–present)\nSkills Languages: Python, R, SQL, JavaScript/ES6, Go, Rust, Java\nMachine Learning: scikit-learn, regression, decision trees, random forests, gradient boosting, feature engineering, PCA, clustering, A/B testing, statistical modeling in R\nDeep Learning: PyTorch, TensorFlow/Keras; CNNs, RNNs/LSTMs, Transformers, ViT, GNNs, GANs, Diffusion Models\nLLMs \u0026amp; NLP: Hugging Face Transformers, fine-tuning (LoRA/QLoRA), RAG, prompt engineering; OpenAI and Gemini APIs\nComputer Vision: YOLO, OpenCV, image segmentation, scene graph generation\nInfrastructure: 
pandas, NumPy, Docker, GCP, Azure, MySQL, Redis, Git, Linux\n","permalink":"https://ndthp.com/resume/","summary":"Resume — Niraj Thapaliya","title":"Resume"}]