LLM Tool-Selection Benchmark (Search-Agent Harness)

June 12, 2026 (in progress)

Problem & stakes

LLM “search agents” live or die on one unglamorous skill: given a user’s question and a set of tools, pick the right retrieval call with the right arguments. Most benchmarks score the agent’s final answer, which conflates retrieval reasoning with everything downstream — generation quality, the corpus, the index. I wanted to isolate and measure just the reasoning: the part you actually swap models for.

Constraints

Measure the model, not the infrastructure. The corpus and tool responses are simulated (hardcoded JSON) so results don’t drift with an index or an embedding model.
Discriminate, don’t just invoke. Real tool menus include plausible wrong tools; a benchmark that only asks “can it call a tool” saturates immediately.
Comparable across providers. One canonical tool schema (OpenAI function-calling); the Anthropic adapter converts internally, so OpenAI, Anthropic, and local Ollama models all sit the same exam.
Cost-aware by construction. Quality comes first, but cost is real — the headline metric has to price it in.
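
To make the "one canonical schema" constraint concrete, here is a minimal sketch of the kind of conversion an Anthropic adapter would need to do. It is not the harness's actual adapter: the function name openai_to_anthropic_tool and the semantic_search definition are hypothetical. The only thing it relies on is the public shape of the two formats (OpenAI nests the definition under "function" with "parameters"; Anthropic uses a flat object with "input_schema").

```python
from typing import Any


def openai_to_anthropic_tool(tool: dict[str, Any]) -> dict[str, Any]:
    """Convert one OpenAI function-calling tool definition to Anthropic's tool shape."""
    if tool.get("type") != "function":
        raise ValueError(f"unsupported tool type: {tool.get('type')!r}")
    fn = tool["function"]
    return {
        "name": fn["name"],
        "description": fn.get("description", ""),
        # Anthropic names the JSON Schema for arguments `input_schema`.
        "input_schema": fn.get("parameters", {"type": "object", "properties": {}}),
    }


# Hypothetical tool definition, for illustration only.
SEMANTIC_SEARCH = {
    "type": "function",
    "function": {
        "name": "semantic_search",
        "description": "Search the corpus by meaning, with optional filters.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "top_k": {"type": "integer"},
            },
            "required": ["query"],
        },
    },
}

anthropic_tool = openai_to_anthropic_tool(SEMANTIC_SEARCH)
```

Keeping the OpenAI form as the single source of truth means a task's expected call is written once, and each provider adapter only has to translate the envelope around it.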

Approach

A single-turn evaluation harness: each task is one prompt → one expected tool call, scored all-or-nothing on the tool name plus every argument. The decisions that carry the design:

CPC (Cost Per Correct) as the north star: total run cost divided by the number of tasks solved with the correct retrieval strategy. It rewards a cheap model that’s right and punishes an expensive one that’s confidently wrong.
Four realistic tools (semantic search with filters, fetch-by-id, metadata listing, query decomposition) mirroring real search APIs — trimmed from nine after red-teaming.
Distractor tools (categorically wrong tools plus semantic near-misses) to test tool discrimination; distractor picks are tracked as an unscored behavioral metric.
Multi-step chains with non-adjacent state dependencies built from real conversation history, plus parallel-invocation tasks — the cases where single-turn selection stops being trivial.
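
The scoring rule and the CPC definition above can be sketched in a few lines. This is an illustration under assumptions, not the harness's code: ToolCall, TaskResult, and the function names are hypothetical, and the sketch compares every argument by strict equality, which is a simplification.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ToolCall:
    name: str
    arguments: dict[str, Any]


@dataclass
class TaskResult:
    expected: ToolCall
    actual: Optional[ToolCall]  # None when the model made no tool call
    cost_usd: float


def is_correct(result: TaskResult) -> bool:
    """All-or-nothing: the tool name and every argument must match."""
    if result.actual is None:
        return False
    return (
        result.actual.name == result.expected.name
        and result.actual.arguments == result.expected.arguments
    )


def picked_distractor(result: TaskResult, distractors: set[str]) -> bool:
    """Behavioral flag only; it does not feed into the score."""
    return result.actual is not None and result.actual.name in distractors


def cost_per_correct(results: list[TaskResult]) -> float:
    """CPC = total run cost / number of correct tasks (inf if none are correct)."""
    total_cost = sum(r.cost_usd for r in results)
    solved = sum(is_correct(r) for r in results)
    return total_cost / solved if solved else float("inf")


# Made-up example: one correct call and one near-miss distractor pick.
results = [
    TaskResult(ToolCall("fetch_by_id", {"doc_id": "a1"}),
               ToolCall("fetch_by_id", {"doc_id": "a1"}), 0.02),
    TaskResult(ToolCall("fetch_by_id", {"doc_id": "b2"}),
               ToolCall("fetch_url", {"url": "b2"}), 0.02),
]
cpc = cost_per_correct(results)
```

Note that the failed task still contributes its cost to the numerator, which is what makes a confidently wrong expensive model look bad on this metric.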

Tradeoffs

All-or-nothing scoring over partial credit — binary correctness keeps CPC unambiguous and interpretable; partial credit is deferred.
Simulated corpus over a real index — gives up end-to-end realism to buy isolation and reproducibility, which is the entire point.
Keyword-contains matching for free-text arguments over an LLM judge — no judge cost or variance, paid for with authoring discipline on both sides.
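
As a rough sketch of what keyword-contains matching can look like for a free-text argument: the function name keywords_match is hypothetical, and the choices to ignore case and to require every keyword are assumptions rather than the harness's documented behavior.

```python
def keywords_match(actual_text: str, required_keywords: list[str]) -> bool:
    """True if every required keyword appears as a substring, ignoring case."""
    haystack = actual_text.casefold()
    return all(kw.casefold() in haystack for kw in required_keywords)


# Made-up query strings, for illustration only.
hit = keywords_match("Q3 revenue forecast for EMEA", ["revenue", "emea"])
miss = keywords_match("quarterly sales outlook", ["revenue"])
```

The second call shows where the authoring discipline comes in: a reasonable paraphrase fails unless the expected keywords are chosen to be terms any correct query would have to contain.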

Results

A 23-task hardened dataset run across 12 model configurations (OpenAI, Anthropic, and local Gemma), scored on CPC.
The first run surfaced the expected finding: single-turn tool selection from small tool sets is saturated industry-wide — which is precisely why the dataset now leans on longer chains, constraint-dense filters, near-miss distractors, and parallel invocation.
A live dashboard with per-config CPC, a cost/quality scatter, chain pass-rates, and distractor-pick behavior.

View live results → cpc.sahilashar.com

Code: github.com/SahilAshar/llm-token-harness

Lessons

The benchmark is the product. Once “can it call a tool” saturates, the signal moves entirely into discrimination (near-miss tools) and state (multi-step, non-adjacent dependencies) — so a good harness spends its complexity budget there, not on more easy tasks. And pricing the metric — CPC rather than raw accuracy — changes the ranking: the cheapest correct model, not the most capable one, tends to win on cost-per-correct.
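
A small worked example of how pricing the metric can flip a ranking. The two configurations and all of their figures are invented for illustration and are not results from this benchmark.

```python
# Hypothetical figures for illustration only; these are not benchmark results.
runs = [
    {"config": "large-model", "correct": 22, "tasks": 23, "cost_usd": 1.10},
    {"config": "small-model", "correct": 20, "tasks": 23, "cost_usd": 0.10},
]

for run in runs:
    run["accuracy"] = run["correct"] / run["tasks"]
    # Cost per correct: total run cost divided by the number of correct tasks.
    run["cpc"] = run["cost_usd"] / run["correct"] if run["correct"] else float("inf")

# Higher accuracy is better; lower CPC is better.
by_accuracy = [r["config"] for r in sorted(runs, key=lambda r: r["accuracy"], reverse=True)]
by_cpc = [r["config"] for r in sorted(runs, key=lambda r: r["cpc"])]
```

With these made-up numbers the larger model leads on accuracy (22 of 23 versus 20 of 23), while the smaller one leads on CPC (about $0.005 per correct task versus $0.05).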