[Lab Note] Cost-Per-Correct as a Key LLM Metric

Many teams track accuracy, latency, and quality, but cost is often treated as a secondary concern. For production AI systems, I think Cost-Per-Correct (CPC) should be a primary optimization target.

CPC (total spend divided by the number of correct outputs) collapses multiple decisions into one measurable number:

  • prompting strategy
  • reasoning level
  • token caps
  • k-shot examples
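As a minimal sketch, CPC can be computed per grid cell from per-request (correct, cost) pairs; the sample values below are hypothetical:

```python
def cost_per_correct(results):
    """CPC: total spend divided by the number of correct answers.

    `results` is a list of (is_correct: bool, cost_usd: float) pairs.
    Returns infinity when nothing was correct, so a zero-accuracy
    configuration can never look cheap.
    """
    total_cost = sum(cost for _, cost in results)
    correct = sum(1 for ok, _ in results if ok)
    return float("inf") if correct == 0 else total_cost / correct


# Hypothetical results from one configuration: 3 of 4 requests correct.
results = [(True, 0.004), (True, 0.006), (False, 0.005), (True, 0.004)]
print(round(cost_per_correct(results), 4))  # → 0.0063
```

Note that CPC punishes both waste (high cost) and failure (few correct answers) in a single number, which is what makes it a usable optimization target.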

To test this, I built an eval harness that runs grid experiments across reasoning levels, token caps, and shot counts, then logs accuracy, p50/p95 latency, and spend.
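A harness like that can be sketched as a loop over the cross product of the grid axes. The model call below is a stub with made-up cost and latency figures (a real harness would hit an API), and the field names are assumptions, not the author's actual schema:

```python
import itertools

def run_model(prompt, reasoning, token_cap, shots):
    """Stub model call; returns fake correctness, cost, and latency."""
    return {"correct": True, "cost_usd": 0.001 * token_cap / 100, "latency_s": 0.2}

def grid_eval(prompts, reasoning_levels, token_caps, shot_counts):
    """Run every (reasoning, cap, shots) cell and log accuracy, latency, spend, CPC."""
    rows = []
    for reasoning, cap, shots in itertools.product(reasoning_levels, token_caps, shot_counts):
        results = [run_model(p, reasoning, cap, shots) for p in prompts]
        latencies = sorted(r["latency_s"] for r in results)
        spend = sum(r["cost_usd"] for r in results)
        correct = sum(r["correct"] for r in results)
        rows.append({
            "reasoning": reasoning, "cap": cap, "shots": shots,
            "accuracy": correct / len(results),
            "p50": latencies[len(latencies) // 2],
            "p95": latencies[min(int(len(latencies) * 0.95), len(latencies) - 1)],
            "spend_usd": spend,
            "cpc": spend / correct if correct else float("inf"),
        })
    return rows
```

Each row is one grid cell, so the cheapest configuration that clears an accuracy bar can be found with a simple sort on `cpc`.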

Key pattern: higher reasoning is not always better. It can increase token usage significantly and hurt CPC when applied indiscriminately. Lightweight reasoning plus selective escalation often gives better system economics.
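The escalation pattern above can be sketched as a two-tier dispatcher. The `cheap_model`/`strong_model` callables and the confidence signal are assumptions for illustration; in practice the escalation trigger might be a verifier, a self-consistency check, or a logprob threshold:

```python
def answer(prompt, cheap_model, strong_model, confidence_threshold=0.7):
    """Try a low-reasoning, low-cost call first; escalate only on the hard tail."""
    result = cheap_model(prompt)  # cheap: light reasoning, tight token cap
    if result["confidence"] >= confidence_threshold:
        return result             # most traffic stops here, keeping CPC low
    return strong_model(prompt)   # expensive reasoning reserved for hard cases
```

Because most requests resolve at the cheap tier, average cost stays close to the cheap model's while accuracy approaches the strong model's, which is exactly the CPC win the note describes.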

The broader takeaway: for enterprise AI, “more intelligence” is less useful than predictable, measurable efficiency.