Evaluation

LLM Benchmark Evaluation: Multi-Agent Discussion Framework

A rigorous evaluation of a 4-agent discussion pipeline on 7,661 benchmark questions. The framework decreased accuracy on all three benchmarks, which makes this an instructive negative result.