LLM Tool-Selection Benchmark (Search-Agent Harness)
Problem & stakes
LLM “search agents” live or die on one unglamorous skill: given a user’s question and a set of tools, pick the right retrieval call with the right arguments. Most benchmarks score the agent’s final answer, which conflates retrieval reasoning with everything downstream — generation quality, the corpus, the index. I wanted to isolate and measure just the reasoning: the part you actually swap models for.
Constraints
- Measure the model, not the infrastructure. The corpus and tool responses are simulated (hardcoded JSON) so results don’t drift with an index or an embedding model.
- Discriminate, don’t just invoke. Real tool menus include plausible wrong tools; a benchmark that only asks “can it call a tool” saturates immediately.
- Comparable across providers. One canonical tool schema (OpenAI function-calling); the Anthropic adapter converts internally, so OpenAI, Anthropic, and local Ollama models all sit the same exam.
- Cost-aware by construction. Quality comes first, but cost is real — the headline metric has to price it in.
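The schema conversion behind the "comparable across providers" constraint can be sketched as below. This is a minimal illustration, not the harness's actual adapter: the function name openai_tool_to_anthropic and the semantic_search tool definition are hypothetical. It relies only on the two public tool formats: OpenAI nests the JSON Schema under function.parameters, while Anthropic expects a flat object with input_schema.

```python
from typing import Any


def openai_tool_to_anthropic(tool: dict[str, Any]) -> dict[str, Any]:
    """Convert one OpenAI function-calling tool definition to Anthropic's tool shape."""
    fn = tool["function"]
    return {
        "name": fn["name"],
        "description": fn.get("description", ""),
        # Anthropic calls the JSON Schema "input_schema" instead of "parameters".
        "input_schema": fn.get("parameters", {"type": "object", "properties": {}}),
    }


# Hypothetical tool definition, for illustration only (not taken from the harness).
SEMANTIC_SEARCH = {
    "type": "function",
    "function": {
        "name": "semantic_search",
        "description": "Search the corpus by meaning, with optional filters.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "top_k": {"type": "integer"},
            },
            "required": ["query"],
        },
    },
}

converted = openai_tool_to_anthropic(SEMANTIC_SEARCH)
```

Keeping one canonical definition and converting at the adapter boundary means every provider sees the same tool names, descriptions, and argument schemas, so differences in scores are less likely to come from differences in the menu.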
Approach
A single-turn evaluation harness: each task is one prompt → one expected tool call, scored all-or-nothing on the tool name plus every argument. The decisions that carry the design:
- CPC (Cost Per Correct) as the north star: total run cost divided by the number of tasks solved with the correct retrieval strategy. It rewards a cheap model that’s right and punishes an expensive one that’s confidently wrong.
- Four realistic tools (semantic search with filters, fetch-by-id, metadata listing, query decomposition) mirroring real search APIs — trimmed from nine after red-teaming.
- Distractor tools (categorically wrong plus semantic near-misses) to test tool discrimination, tracked as an unscored behavioral metric.
- Multi-step chains with non-adjacent state dependencies built from real conversation history, plus parallel-invocation tasks — the cases where single-turn selection stops being trivial.
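The CPC definition above (total run cost divided by tasks solved correctly) can be sketched in a few lines. This is an illustration under assumptions: the TaskResult type, the per-task costs, and the choice to return infinity when nothing is solved are all hypothetical and not taken from the harness.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    correct: bool    # all-or-nothing outcome for one task
    cost_usd: float  # what this task cost to run, in dollars


def cost_per_correct(results: list[TaskResult]) -> float:
    """Total run cost divided by the number of correctly solved tasks."""
    total_cost = sum(r.cost_usd for r in results)
    solved = sum(1 for r in results if r.correct)
    if solved == 0:
        # Assumption: a run that solves nothing is treated as infinitely expensive.
        return float("inf")
    return total_cost / solved


# Made-up numbers: a cheap model that misses one task vs. a pricey one that misses none.
cheap = [TaskResult(True, 0.001), TaskResult(True, 0.001),
         TaskResult(False, 0.001), TaskResult(True, 0.001)]
pricey = [TaskResult(True, 0.02) for _ in range(4)]

cheap_cpc = cost_per_correct(cheap)
pricey_cpc = cost_per_correct(pricey)
```

Note that failed tasks still count toward the numerator, which is what makes a confidently wrong expensive model score badly: it pays for every miss and gets no credit for it.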
Tradeoffs
- All-or-nothing scoring over partial credit — binary correctness keeps CPC unambiguous and interpretable; partial credit is deferred.
- Simulated corpus over a real index — gives up end-to-end realism to buy isolation and reproducibility, which is the entire point.
- Keyword-contains matching for free-text arguments over an LLM judge — no judge cost or variance, paid for with authoring discipline on both sides.
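The first and third tradeoffs combine into a scorer along the following lines. This is a sketch under stated assumptions rather than the harness's own code: the {"contains": [...]} convention for marking free-text arguments, the function names, and the rule that extra or missing arguments fail the task are all hypothetical choices made for the example.

```python
from typing import Any


def arg_matches(expected: Any, actual: Any) -> bool:
    """Compare one argument: keyword-contains for free text, exact match otherwise."""
    # Assumed convention: {"contains": [...]} marks a free-text argument.
    if isinstance(expected, dict) and "contains" in expected:
        if not isinstance(actual, str):
            return False
        text = actual.lower()
        return all(keyword.lower() in text for keyword in expected["contains"])
    return expected == actual


def score_call(expected: dict[str, Any], actual: dict[str, Any]) -> bool:
    """All-or-nothing: the tool name and every argument must match."""
    if expected["name"] != actual.get("name"):
        return False
    exp_args = expected["arguments"]
    act_args = actual.get("arguments", {})
    # Assumption: missing or extra arguments fail the task outright.
    if set(exp_args) != set(act_args):
        return False
    return all(arg_matches(exp_args[key], act_args[key]) for key in exp_args)


# Hypothetical task and model outputs, for illustration only.
expected = {
    "name": "semantic_search",
    "arguments": {"query": {"contains": ["refund", "policy"]}, "top_k": 5},
}
good = {
    "name": "semantic_search",
    "arguments": {"query": "What is the refund policy for EU orders?", "top_k": 5},
}
wrong_arg = {
    "name": "semantic_search",
    "arguments": {"query": "refund policy", "top_k": 10},
}
wrong_tool = {"name": "fetch_by_id", "arguments": {"id": "doc-42"}}
```

Because the scorer returns a plain boolean, each task contributes either one correct result or none to the CPC denominator, with no judge model in the loop.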
Results
- A 23-task hardened dataset run across 12 model configurations (OpenAI, Anthropic, and local Gemma), scored on CPC.
- The first run surfaced the expected finding: single-turn tool selection from small tool sets is saturated industry-wide — which is precisely why the dataset now leans on longer chains, constraint-dense filters, near-miss distractors, and parallel invocation.
- A live dashboard with per-config CPC, a cost/quality scatter, chain pass-rates, and distractor-pick behavior.
View live results → cpc.sahilashar.com
Code: github.com/SahilAshar/llm-token-harness
Lessons
The benchmark is the product. Once “can it call a tool” saturates, the signal moves entirely into discrimination (near-miss tools) and state (multi-step, non-adjacent dependencies), so a good harness spends its complexity budget there, not on more easy tasks. Pricing the metric (CPC rather than raw accuracy) also changes the ranking: once cost is in the denominator, the cheapest model that still picks the right calls tends to beat the most capable one.