September 14, 2025 · Skelf-Research

Most agentic search throws queries at Google and prays

Why the dominant pattern in LLM-mediated search is functionally a wrapper around a single search call — and what changes when you make the plan the first artifact, not the last.

agentic search · retrieval · design

If you read the system prompts behind the popular “AI search” products carefully, a pattern shows up. The model is told it can search. The user asks a question. The model rewrites the question slightly. It issues one search call. It gets back a list of URLs and snippets. It writes an answer that cites two or three of them. Done.

That is not agentic search. That is a chat completion with a hard-coded tool call. And in most cases the tool call is just google_search(user_query) — the same string the user typed, with maybe a polite “best” or “2024” prepended. The model is doing zero planning; the search engine is doing all the recall work; and the answer is grounded only in whatever Google decided to return for that exact phrasing.

It works often enough that we have collectively stopped noticing it doesn’t work. Until you ask a question that doesn’t map to a single canonical query.
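For concreteness, the wrapper pattern described above fits in a few lines. This is a sketch under stated assumptions, not any product's actual code: `wrapper_answer`, the `search` and `complete` callables, and the result shape are all hypothetical stand-ins, and the stubs exist only so the example runs offline.

```python
from typing import Callable

SearchResult = dict[str, str]  # assumed shape: {"url": ..., "snippet": ...}


def wrapper_answer(
    user_query: str,
    search: Callable[[str], list[SearchResult]],
    complete: Callable[[str], str],
) -> str:
    # One search call, issued with (almost) the user's own string.
    results = search(user_query)
    # Only the first few snippets ever reach the model.
    sources = "\n".join(f"{r['url']}: {r['snippet']}" for r in results[:3])
    prompt = f"Question: {user_query}\nSources:\n{sources}\nAnswer and cite."
    return complete(prompt)


# Stub backends (hypothetical) so the sketch runs without network access.
calls: list[str] = []


def fake_search(query: str) -> list[SearchResult]:
    calls.append(query)
    return [{"url": "https://example.org/steel", "snippet": "Steel emissions..."}]


answer = wrapper_answer(
    "How does the 2024 emissions trajectory of cement compare to steel?",
    fake_search,
    lambda prompt: f"(answer grounded in {prompt.count('https://')} source)",
)
print(len(calls), answer)
```

Nothing in this function can notice that the question had two arms: the only retrieval decision is the single string passed to `search`.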

Where the wrapper breaks

Try this with any of the popular hosted answer products: ask a comparative question that requires two distinct retrievals. “How does the 2024 emissions trajectory of cement compare to steel?” The model will issue one search. The search will surface a top result that mostly talks about steel, or mostly talks about cement, or talks about both at a level too shallow to compare. The answer will inherit whichever framing the top result had.

The fix is obvious once you state the problem: you needed two queries, not one. You needed a plan. You needed something that looked at the question and said “the entities here are cement and steel; the relationship is emissions trajectory comparison; therefore I need to retrieve for both arms before I draft.”

The hosted products almost never do this. They issue one query because issuing one query is fast, and faster perceived latency wins user studies, and so the product has been optimized in the direction of “throw it at Google and let the LLM smooth over the gaps.”
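The two-query plan for the cement and steel question can be sketched in Python as follows. The `Plan` type and the `queries_for` helper are hypothetical names chosen for illustration, and the entities are written out by hand here, where a real planner would have to extract them from the question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """Hypothetical plan: the entities to retrieve for and what relates them."""
    entities: tuple[str, ...]
    relationship: str


def queries_for(plan: Plan) -> list[str]:
    # One query per entity ("arm"), so neither side of the comparison
    # depends on what a single top result happens to cover.
    return [f"{entity} {plan.relationship}" for entity in plan.entities]


plan = Plan(entities=("cement", "steel"), relationship="emissions trajectory 2024")
print(queries_for(plan))
```

The point of the sketch is only that the number of queries falls out of the plan, not out of the user's phrasing.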

What it costs to plan first

slorg makes a different trade. It assumes the plan is the artifact worth spending a round-trip on. So before it hits any search engine, it asks the LLM to draft a full answer from training data alone, and then to extract a knowledge graph from that draft — entities, relationships, the rough shape of what an answer to this question even looks like.

That graph is the plan. The keywords that go to SearxNG are derived from the graph nodes. If the question is comparative, the graph will have two arms and the search will run accordingly. If the question is single-entity but multi-aspect (“what are the failure modes of lithium-ion batteries in cold weather”), the graph carries those aspects forward into the keyword extraction step.
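A minimal sketch of how keywords might fall out of such a graph, in Python. The post does not show slorg's actual graph schema or extraction logic, so the `KnowledgeGraph` shape and the `keywords_from_graph` rule below (one keyword string per edge, with a fallback for unconnected nodes) are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeGraph:
    """Assumed shape: named nodes plus (subject, relation, object) edges."""
    nodes: list[str] = field(default_factory=list)
    edges: list[tuple[str, str, str]] = field(default_factory=list)


def keywords_from_graph(graph: KnowledgeGraph) -> list[str]:
    """Derive one keyword string per edge, falling back to bare nodes."""
    keywords = [f"{subj} {rel} {obj}" for subj, rel, obj in graph.edges]
    # Nodes that appear in no edge still get searched on their own.
    connected = {n for subj, _, obj in graph.edges for n in (subj, obj)}
    keywords += [node for node in graph.nodes if node not in connected]
    return keywords


# A comparative question yields a graph with two arms, hence two keyword strings.
graph = KnowledgeGraph(
    nodes=["cement", "steel", "2024"],
    edges=[
        ("cement", "emissions trajectory", "2024"),
        ("steel", "emissions trajectory", "2024"),
    ],
)
print(keywords_from_graph(graph))
```

Because the keywords are a pure function of the graph, a graph with two arms cannot silently collapse into one query.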

The cost is real and we don’t hide it: three LLM calls per search where a bare search-engine query needs none, plus a content fetch per URL. If you are building a chat product where the user expects sub-second responses, slorg is the wrong tool. If you are building a research assistant where the user is willing to wait three seconds for a correct retrieval, slorg is more interesting.

The “and prays” part

The reason the dominant pattern feels like prayer is that there is no recovery path. If the single search call returns garbage, the model writes a confidently wrong answer over the garbage and moves on. There is no second opinion, no triangulation, no “this top result disagrees with my prior, let me look again.”

When the plan is the first artifact, there is at least one place where the system can fail loudly. If the draft answer in step one is internally contradictory, that gets baked into the graph and the graph looks weird, and you can see the graph in the response. If the keywords extracted in step three don’t cover the question, you can see the keywords in the response. The cost of the LLM round-trip is repaid in visibility.

We don’t think this makes slorg agentic in any meaningful sense. There is no loop. The pipeline runs once, top to bottom, six steps, then returns. It is not an agent; it is a planner that happens to use an LLM to draft its plan. The distinction matters because “agentic” has come to mean “the model decides what tools to use,” and slorg decides nothing — the tools are fixed, the steps are fixed, the only variable is the question.
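The fixed, loop-free shape can be sketched as a single function in Python. This is not slorg's code: the post names only the draft (step one), the keyword extraction (step three) and the relevance scorer (step six), so the placement of graph extraction, search and content fetch at steps two, four and five is inferred, and every name below is hypothetical. The stages are injected as callables so that stubs can stand in for the LLM and the search engine.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PipelineResult:
    """Every intermediate artifact is returned, not just the final ranking."""
    draft: str
    graph: dict
    keywords: list[str]
    ranked: list[tuple[str, float]]  # (url, score), best first


def run_pipeline(
    question: str,
    draft_answer: Callable[[str], str],
    extract_graph: Callable[[str], dict],
    extract_keywords: Callable[[dict], list[str]],
    search: Callable[[str], list[str]],
    fetch: Callable[[str], str],
    score: Callable[[str, str], float],
) -> PipelineResult:
    draft = draft_answer(question)                      # step 1: draft from training data
    graph = extract_graph(draft)                        # step 2: graph from the draft
    keywords = extract_keywords(graph)                  # step 3: keywords from the graph
    urls = [u for kw in keywords for u in search(kw)]   # step 4: one search per keyword
    urls = list(dict.fromkeys(urls))                    # drop duplicates, keep order
    pages = {u: fetch(u) for u in urls}                 # step 5: one fetch per URL
    ranked = sorted(                                    # step 6: score against the question
        ((u, score(question, text)) for u, text in pages.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return PipelineResult(draft, graph, keywords, ranked)


# Stub stages (hypothetical) so the sketch runs offline.
result = run_pipeline(
    "How do cement and steel emissions compare?",
    draft_answer=lambda q: "Cement and steel are both large emitters.",
    extract_graph=lambda d: {"nodes": ["cement", "steel"]},
    extract_keywords=lambda g: [f"{n} emissions" for n in g["nodes"]],
    search=lambda kw: [f"https://example.org/{kw.split()[0]}"],
    fetch=lambda url: f"page about {url.rsplit('/', 1)[-1]}",
    score=lambda q, text: float(len(text)),
)
print(result.keywords)
print(result.ranked)
```

There is no branch in `run_pipeline` that revisits an earlier step, which is what keeps the call count fixed for a given number of keywords and URLs.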

What it doesn’t fix

It does not fix the case where the LLM’s training data is wrong about the topic. If GPT writes a draft answer that confidently locates Mount Kilimanjaro in Kenya, the knowledge graph will include the wrong entity, the keywords will go to the wrong country, and SearxNG will dutifully return excellent Kenyan tourism pages. Plan-conditioned search amplifies a confident hallucination.

The mitigation is the same as for any retrieval system: ground the model’s confidence by comparing the draft against the retrieved results in the final scoring step. slorg does some of this: the relevance scorer at step six is given the original query and asked to rank the retrieved results against it. It does not, however, loop back to revise the plan if the retrieved results contradict the draft. That would be agentic behavior. We have stayed away from it deliberately, because once you have a loop you have a system that can spend unbounded tokens, and we wanted a predictable cost model.

A note on why this matters for evaluation

There is a second-order reason to prefer plan-first search that doesn’t show up in user studies: it is evaluable. When the plan is a structured artifact, you can run regression tests on it. You can ask “for this question, what entities did the planner extract?” and assert that the right entities appear in the graph. You can ask “what keywords did the system send to the search engine?” and check whether they make sense to a domain expert. You can ask “what was the score on the top result?” and look for distribution shifts over time.
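A regression check of this kind might look like the following Python sketch. The plan dictionary layout (`entities`, `keywords`, `top_score`), the 0 to 1 score scale and the `check_plan` helper are all assumptions for illustration; the recorded plans are made-up fixtures, not real slorg output.

```python
def check_plan(plan: dict, required_entities: set[str], min_top_score: float) -> list[str]:
    """Return human-readable failures; an empty list means the plan passed."""
    failures: list[str] = []

    # Did the planner extract the entities a domain expert expects?
    entities = {e.lower() for e in plan["entities"]}
    missing = sorted(required_entities - entities)
    if missing:
        failures.append(f"entities missing from graph: {missing}")

    # Did every expected entity make it into some search keyword?
    keyword_text = " ".join(plan["keywords"]).lower()
    uncovered = sorted(e for e in required_entities if e not in keyword_text)
    if uncovered:
        failures.append(f"entities missing from keywords: {uncovered}")

    # Has the top result's score drifted below the agreed floor?
    if plan["top_score"] < min_top_score:
        failures.append(f"top score {plan['top_score']} below {min_top_score}")
    return failures


# Hypothetical recorded plans for the same question, before and after a drift.
good = {
    "entities": ["Cement", "Steel"],
    "keywords": ["cement emissions 2024", "steel emissions 2024"],
    "top_score": 0.82,
}
drifted = {
    "entities": ["Steel"],
    "keywords": ["steel emissions 2024"],
    "top_score": 0.41,
}

print(check_plan(good, {"cement", "steel"}, min_top_score=0.5))
print(check_plan(drifted, {"cement", "steel"}, min_top_score=0.5))
```

Each failure message points at a specific intermediate artifact, so a drift shows up as "the graph lost an entity" or "the score dropped" rather than as a vaguely worse answer.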

You cannot do any of this with the wrapper pattern. The wrapper is end-to-end opaque: a question goes in, a string comes out, and the only thing you can evaluate is whether the string sounds plausible. If something starts going wrong — a model upgrade silently changes the keyword phrasing, a search engine quietly downranks a domain you cared about — the wrapper has no internal seam at which to detect it.

We mention this because evaluation, in our experience, is where the cost of opacity actually compounds. Day one the wrapper works fine. Day three hundred the model has drifted, the search engine has reshuffled, and nobody can tell which change broke the output, because there is no intermediate state to inspect.

The honest summary

If you want fast, plausible answers grounded in whatever Google returns first, the wrapper pattern is fine. It is what almost every “AI search” product is doing, and for casual lookup it does the job. The plan-first pipeline is not strictly better for that use case; it is slower and more expensive and not obviously more correct on any single query.

If you want the plan to be a visible artifact — a graph, a list of keywords, a per-result score — and you are willing to pay three LLM calls per query for it, plan-first search is a different shape. slorg is one implementation of that shape. It is not the only one. The point is that the shape exists, and “throw the query at Google and pray” is a choice, not a law of nature. A retrieval system can commit to a plan before it runs; it can score its own outputs; it can return the intermediate state. It is just engineering, and somebody chose not to ship it in the product you have been using.
