Engineering7 min read

What Doug Turnbull's Judgment-First Search Method Taught Our Search Agent

Build the evaluation layer before touching the retrieval stack — the disciplined approach that separates search teams that improve from those that spin

By The Data Workers Team

Doug Turnbull has spent more than a decade making search engines more honest. He co-authored Relevant Search and AI Powered Search, built Quepid (a widely-used open-source relevance evaluation tool), created the Elasticsearch Learning to Rank plugin, and led search at Reddit and Shopify. His blog at softwaredoug.com is one of the most practically useful bodies of work in the field.

The method that runs through all of it is something Turnbull calls judgment-first: before you reach for a fancier retrieval component, establish a graded evaluation framework and measure what you have. Without that baseline, every tuning decision is a guess dressed up as engineering.

What Is Actually Worth Learning

Turnbull's writing distills to four coupled ideas. First: relevance is a verb, not a noun — 'relevance is about deciding whether or not to verb the item.' The same asset can be the right answer for one query and wrong for another depending on where the user is in their decision process.

Second: the safety net grants permission to experiment. 'A judgment list gives your team permission to iterate quickly on relevance improvements.' Without graded baselines, engineers hesitate — every change risks silently breaking something else.

Third: no model is correct, but some models are useful. Every evaluation system gives a different view. Use them as one lens each, not ground truth.

Fourth: evaluation before retrieval complexity. 'The fanciest solutions don't matter as much as getting a good evaluation framework setup to evaluate the quality of search results.'

How a Method Becomes a Skill

When a query starts returning wrong results, the agent does not immediately reach for a retrieval change. It first surfaces a representative sample of queries, builds graded relevance scores for the current results, and establishes an NDCG baseline. Every subsequent ranking change is measured against that baseline before it ships.

Two tools wire this into the agent's actual capabilities. The agentic_search tool handles candidate retrieval and re-scoring. The reconcile_definitions tool handles a problem specific to data warehouse search: you cannot grade relevance for a query like 'monthly revenue' if the top two candidate assets define revenue differently.

One of More Than 400

The Data Workers agent swarm has more than 400 method-named skills across 19 agents. The judgment-first-search skill earns its place because it is the one most commonly skipped. Teams add vector search and rerankers before they have judgment lists. The skill tries to enforce the order: baseline first, complexity second.

A note on this post: This is independent commentary and homage. It distills publicly available writing and talks by Doug Turnbull to illustrate a working method, and every quote is drawn from and verified against the primary sources linked above. The skill it describes is named for the method, not the person, and contains no marketing claims attributed to them. Data Workers is not affiliated with, sponsored by, or endorsed by Doug Turnbull. If you are Doug Turnbull and would like anything adjusted or removed, email hello@dataworkers.io and we will respond promptly.

Related Posts