LLM Tool-Selection Benchmark (Search-Agent Harness)

Status: in progress

Problem & stakes

LLM “search agents” live or die on one unglamorous skill: given a user’s question and a set of tools, pick the right retrieval call with the right arguments. Most benchmarks score the agent’s final answer, which conflates retrieval reasoning with everything downstream — generation quality, the corpus, the index. I wanted to isolate and measure just the reasoning: the part you actually swap models for.

Constraints

  • Measure the model, not the infrastructure. The corpus and tool responses are simulated (hardcoded JSON) so results don’t drift with an index or an embedding model.
  • Discriminate, don’t just invoke. Real tool menus include plausible wrong tools; a benchmark that only asks “can it call a tool” saturates immediately.
  • Comparable across providers. One canonical tool schema (OpenAI function-calling); the Anthropic adapter converts internally, so OpenAI, Anthropic, and local Ollama models all sit the same exam.
  • Cost-aware by construction. Quality comes first, but cost is real — the headline metric has to price it in.
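
To make the "one canonical schema" constraint concrete, here is a minimal sketch of the kind of conversion an Anthropic adapter has to do. It is not the harness's actual adapter: the function name openai_tool_to_anthropic and the SEMANTIC_SEARCH definition are hypothetical, invented for illustration. Only the two wire shapes are taken as given: OpenAI wraps a tool as {"type": "function", "function": {...}} with the argument schema under "parameters", and Anthropic expects a flat object with the schema under "input_schema".

```python
from typing import Any


def openai_tool_to_anthropic(tool: dict[str, Any]) -> dict[str, Any]:
    """Convert one OpenAI function-calling tool definition to Anthropic's tool shape.

    Hypothetical helper for illustration; the real adapter may differ.
    """
    if tool.get("type") != "function":
        raise ValueError(f"unsupported tool type: {tool.get('type')!r}")
    fn = tool["function"]
    return {
        "name": fn["name"],
        "description": fn.get("description", ""),
        # Anthropic names the JSON Schema for arguments `input_schema`;
        # fall back to an empty object schema if none was given.
        "input_schema": fn.get("parameters", {"type": "object", "properties": {}}),
    }


# Hypothetical canonical tool definition (names and fields are assumptions).
SEMANTIC_SEARCH = {
    "type": "function",
    "function": {
        "name": "semantic_search",
        "description": "Search the corpus by meaning, with optional metadata filters.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "filters": {"type": "object"},
            },
            "required": ["query"],
        },
    },
}

anthropic_tool = openai_tool_to_anthropic(SEMANTIC_SEARCH)
```

Keeping the OpenAI shape as the single source of truth means a task's tool menu is authored once, and any provider-specific difference lives in one small, testable function.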

Approach

A single-turn evaluation harness: each task is one prompt → one expected tool call, scored all-or-nothing on the tool name plus every argument. The decisions that carry the design:

  • CPC (Cost Per Correct) as the north star — total run cost / tasks solved with the correct retrieval strategy. It rewards a cheap model that’s right and punishes an expensive one that’s confidently wrong.
  • Four realistic tools (semantic search with filters, fetch-by-id, metadata listing, query decomposition) mirroring real search APIs — trimmed from nine after red-teaming.
  • Distractor tools (categorically wrong ones plus semantic near-misses) to test tool discrimination; how often a model picks one is tracked as a separate behavioral metric rather than folded into the score.
  • Multi-step chains with non-adjacent state dependencies built from real conversation history, plus parallel-invocation tasks — the cases where single-turn selection stops being trivial.
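
The CPC definition above (total run cost divided by tasks solved correctly) can be sketched in a few lines. This is an illustrative version under assumptions, not the harness's code: the TaskResult record, its field names, and the choice to return infinity when nothing is solved are all mine.

```python
import math
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str
    correct: bool     # all-or-nothing verdict for the task
    cost_usd: float   # what the model call(s) for this task cost


def cost_per_correct(results: list[TaskResult]) -> float:
    """Total run cost divided by the number of correctly solved tasks."""
    total_cost = sum(r.cost_usd for r in results)
    solved = sum(1 for r in results if r.correct)
    if solved == 0:
        # Assumption: a run that solves nothing is treated as infinitely expensive.
        return math.inf
    return total_cost / solved


# Made-up numbers, purely to show the mechanics.
cheap_run = [TaskResult(f"t{i}", correct=(i != 0), cost_usd=0.001) for i in range(4)]
pricey_run = [TaskResult(f"t{i}", correct=True, cost_usd=0.01) for i in range(4)]

cheap_cpc = cost_per_correct(cheap_run)    # 3 of 4 correct at a low price
pricey_cpc = cost_per_correct(pricey_run)  # 4 of 4 correct at 10x the price
```

Note that failed tasks still count toward the numerator: a wrong answer costs money and buys nothing, which is how the metric penalizes an expensive model that is confidently wrong.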

Tradeoffs

  • All-or-nothing scoring over partial credit — binary correctness keeps CPC unambiguous and interpretable; partial credit is deferred.
  • Simulated corpus over a real index — gives up end-to-end realism to buy isolation and reproducibility, which is the entire point.
  • Keyword-contains matching for free-text arguments over an LLM judge — no judge cost or variance, paid for with authoring discipline on both sides.
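
A rough sketch of how all-or-nothing scoring and keyword-contains matching might fit together. Everything named here is an assumption for illustration: the FREE_TEXT_ARGS set, the convention that a free-text expectation is a list of required keywords, case-insensitive matching, and the rule that unexpected extra arguments fail the task.

```python
from typing import Any, Optional

# Assumption: which argument names are free text and get keyword matching.
FREE_TEXT_ARGS = {"query"}


def arg_matches(name: str, expected: Any, actual: Any) -> bool:
    """Free-text args must contain every expected keyword; all others must be equal."""
    if name in FREE_TEXT_ARGS:
        if not isinstance(actual, str):
            return False
        text = actual.lower()
        return all(keyword.lower() in text for keyword in expected)
    return expected == actual


def score_call(expected: dict[str, Any], actual: Optional[dict[str, Any]]) -> bool:
    """Binary verdict: the tool name and every argument must match."""
    if actual is None or actual.get("name") != expected["name"]:
        return False
    expected_args = expected["arguments"]
    actual_args = actual.get("arguments", {})
    # Assumption: missing or unexpected argument names both fail the task.
    if set(actual_args) != set(expected_args):
        return False
    return all(arg_matches(k, v, actual_args[k]) for k, v in expected_args.items())


# Hypothetical task expectation and two model outputs.
EXPECTED = {
    "name": "semantic_search",
    "arguments": {"query": ["refund", "policy"], "filters": {"year": 2023}},
}
GOOD = {
    "name": "semantic_search",
    "arguments": {"query": "Refund policy for enterprise plans", "filters": {"year": 2023}},
}
WRONG_FILTER = {
    "name": "semantic_search",
    "arguments": {"query": "refund policy", "filters": {"year": 2022}},
}

good_verdict = score_call(EXPECTED, GOOD)
wrong_filter_verdict = score_call(EXPECTED, WRONG_FILTER)
```

The authoring cost shows up in the keyword lists: too few keywords and a vague query passes, too many and a perfectly reasonable paraphrase fails.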

Results

  • A 23-task hardened dataset run across 12 model configurations (OpenAI, Anthropic, and local Gemma), scored on CPC.
  • The first run surfaced the expected finding: single-turn tool selection from small tool sets is effectively saturated across the models tested, which is precisely why the dataset now leans on longer chains, constraint-dense filters, near-miss distractors, and parallel invocation.
  • A live dashboard with per-config CPC, a cost/quality scatter, chain pass-rates, and distractor-pick behavior.

View live results → cpc.sahilashar.com

Code: github.com/SahilAshar/llm-token-harness

Lessons

The benchmark is the product. Once “can it call a tool” saturates, the signal moves entirely into discrimination (near-miss tools) and state (multi-step, non-adjacent dependencies), so a good harness spends its complexity budget there, not on more easy tasks. Pricing the metric (CPC rather than raw accuracy) also changes the ranking: once several models clear the same accuracy bar, the cheapest of them, not the most capable, tends to come out on top.