October 9, 2025 · Skelf-Research

Planning the search: where the latency budget really goes

A breakdown of the three LLM calls and N content fetches that make up a slorg query, what each one costs, and which knobs actually move the needle.

latency · performance · design

slorg’s pipeline is deliberately slow. A vanilla search call with gpt-4o-mini summarizing the top results costs maybe one round trip and finishes inside a second. A slorg query costs three LLM round trips (the last of them the scoring pass), plus a multi-engine search and a content fetch per candidate URL, and the wall-clock is uncomfortably north of that.

Users who try slorg for the first time ask, reasonably, why. This post breaks down where the time goes and which knobs actually help.

The six steps, timed

For a typical query (single-topic factual question, gpt-4o-mini, default config), the rough wall-clock budget looks like this:

Step 1  Draft answer + knowledge graph     1.2 – 2.5 s   LLM call
Step 2  Graph extraction (part of step 1)  —             same call
Step 3  Keyword extraction from graph      0.4 – 0.9 s   LLM call
Step 4  SearxNG multi-engine search        0.8 – 2.0 s   network
Step 5  Content fetch (20 URLs default)    1.0 – 6.0 s   parallel HTTP
Step 6  Relevance scoring (20 results)     1.5 – 4.0 s   LLM call
        ────────────────────────────────────────────
        Total                              5.0 – 15 s

Those numbers are estimates from observed behavior on a default LORG_SEARCH_LIMIT=20, not benchmarks. Your network and your OpenAI account tier will produce different numbers. The shape is what matters.
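To see the shape rather than the exact numbers, the table can be summed directly. This is a minimal sketch that uses only the ranges from the table; the `BUDGET` name and the helper functions are illustrative and are not part of slorg.

```python
# Per-step wall-clock ranges in seconds, copied from the table above.
# Step 2 is omitted because it shares the step 1 LLM call.
BUDGET = {
    "step 1 draft + graph": (1.2, 2.5),
    "step 3 keywords": (0.4, 0.9),
    "step 4 search": (0.8, 2.0),
    "step 5 content fetch": (1.0, 6.0),
    "step 6 scoring": (1.5, 4.0),
}

def totals(budget):
    """Sum the best-case and worst-case ends; the steps run one after another."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high

def spread(budget):
    """How much each step can add between its best and worst case."""
    return {step: hi - lo for step, (lo, hi) in budget.items()}

low, high = totals(BUDGET)
print(f"total: {low:.1f} - {high:.1f} s")
for step, extra in sorted(spread(BUDGET).items(), key=lambda kv: -kv[1]):
    print(f"{step:<22} up to +{extra:.1f} s")
```

The sums come out at about 4.9 s and 15.4 s, which the table rounds to 5.0 and 15. Step 5 alone accounts for roughly half of the gap between the best and worst case.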

Where the budget actually goes

The interesting observation is that step 5 — the content fetch — is the biggest source of variance. Steps 1, 3, and 6 are bounded by your model’s TPS; they are predictable. Step 4 is bounded by SearxNG and the four upstream search engines, also predictable. But step 5 is going out to twenty arbitrary websites and asking them to please return some HTML, and some of those websites are slow, some are behind Cloudflare challenges, some are paywalls that return 200 with a login wall, and some are just dead.

If you only optimize one thing in slorg, optimize the content-fetch concurrency and timeout. The defaults are conservative; in practice you almost always want to fetch all twenty URLs in parallel with an aggressive per-URL timeout (say 2 seconds) and just accept that some of them will not return content in time. The relevance scorer at step 6 handles a missing-content result fine — it just scores it low.
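The pattern described above (fetch everything at once, give up on any single URL after a short timeout) might look roughly like the sketch below. It is not slorg's own fetch code: `fetch_all`, `fetch_one`, and `fake_fetch` are hypothetical names, and the fake fetcher only sleeps so the example runs without a network.

```python
import asyncio

async def fetch_all(urls, fetch_one, per_url_timeout=2.0):
    """Fetch every URL concurrently; a slow or failing URL yields None."""
    async def guarded(url):
        try:
            return await asyncio.wait_for(fetch_one(url), timeout=per_url_timeout)
        except Exception:
            # Timeouts, connection errors, and so on all count as missing content.
            return None

    contents = await asyncio.gather(*(guarded(u) for u in urls))
    return dict(zip(urls, contents))

async def fake_fetch(url):
    # Hypothetical stand-in for a real HTTP client call.
    await asyncio.sleep(3.0 if "slow" in url else 0.01)
    return f"content of {url}"

urls = ["https://a.example/", "https://slow.example/", "https://b.example/"]
# A short timeout keeps the demo quick; the post suggests about 2 seconds in practice.
results = asyncio.run(fetch_all(urls, fake_fetch, per_url_timeout=0.2))
for url, content in results.items():
    print(url, "->", content)
```

With this shape the wall-clock for the whole sweep is capped near the per-URL timeout, and the slow site simply shows up as `None` for the scorer to rank low.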

Why the LLM calls are non-negotiable

People ask whether the three LLM calls in steps 1, 3, and 6 can be collapsed into one. The short answer is no, not without changing what the pipeline does.

Step 1 (draft + graph) and step 3 (keywords from graph) could be merged into a single prompt that produces both. We’ve tried this. The result is that the model hedges — it generates the keywords in step 3 conditioned on what it expects to find rather than on the structured graph from step 2, which defeats the entire point of having a graph. The graph is the commitment; merging removes the commitment.

Step 6 has to be a separate call because by definition it runs after retrieval, and retrieval results don’t exist until step 5 completes. You could in principle do per-URL scoring concurrently with the next URL’s content fetch, and a future version might. Today it’s a batch.
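For the curious, the overlap mentioned above could be sketched as follows. This is not how slorg works today, and it assumes a per-URL scoring function, which the current batch scorer is not. The names `pipeline`, `fetch_one`, and `score_one` are hypothetical, and the fake functions stand in for the HTTP fetch and the LLM call.

```python
import asyncio

async def fetch_then_score(url, fetch_one, score_one):
    """Score a URL as soon as its own content arrives."""
    content = await fetch_one(url)
    return url, await score_one(url, content)

async def pipeline(urls, fetch_one, score_one):
    """Run fetch+score per URL concurrently, then rank by score, highest first."""
    scored = await asyncio.gather(
        *(fetch_then_score(u, fetch_one, score_one) for u in urls)
    )
    return sorted(scored, key=lambda pair: -pair[1])

async def fake_fetch(url):
    # Hypothetical stand-in for the content fetch.
    await asyncio.sleep(0.01)
    return f"page about {url}"

async def fake_score(url, content):
    # Hypothetical stand-in for an LLM relevance call.
    await asyncio.sleep(0.01)
    return 0.9 if "slorg" in content else 0.1

urls = ["https://b.example/other", "https://a.example/slorg"]
ranked = asyncio.run(pipeline(urls, fake_fetch, fake_score))
print(ranked)
```

The trade is one scoring call per URL instead of one batch call, so any latency win would have to be weighed against the extra requests.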

The model knob

The model you set in LORG_DEFAULT_MODEL runs all three LLM steps. Bumping it from gpt-4o-mini to gpt-4o roughly triples cost and roughly doubles latency, in our experience. It also noticeably improves the keyword extraction in step 3 — the graph entities get translated into more searchable phrases — which can mean fewer dead-end retrievals at step 5.
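As a rough feel for what that doubling does to the whole query, the sketch below scales only the three LLM steps from the timing table. It rests on two assumptions: that the doubling applies to LLM latency alone, and that search and fetch times stay unchanged. It also ignores any savings from fewer dead-end retrievals. The names are illustrative.

```python
# Ranges in seconds from the timing table; step 2 shares the step 1 call.
LLM_STEPS = {"step 1": (1.2, 2.5), "step 3": (0.4, 0.9), "step 6": (1.5, 4.0)}
OTHER_STEPS = {"step 4": (0.8, 2.0), "step 5": (1.0, 6.0)}

def total(llm_factor=1.0):
    """Total wall-clock range when LLM latency is scaled by llm_factor."""
    llm_low = sum(lo for lo, _ in LLM_STEPS.values())
    llm_high = sum(hi for _, hi in LLM_STEPS.values())
    other_low = sum(lo for lo, _ in OTHER_STEPS.values())
    other_high = sum(hi for _, hi in OTHER_STEPS.values())
    return (
        round(llm_factor * llm_low + other_low, 1),
        round(llm_factor * llm_high + other_high, 1),
    )

baseline = total(1.0)  # table as given
doubled = total(2.0)   # assumed doubling of LLM latency only
print("baseline:", baseline)
print("doubled: ", doubled)
```

Under those assumptions the range moves from about 4.9–15.4 s to about 8.0–22.8 s, so the larger model has to save a fair amount of wasted retrieval to pay for itself on latency.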

There is no per-step model override yet. If you wanted a small model for the draft and a large model for the scoring, you would have to fork the package. That’s a real feature gap and an honest one.

The search engine knob

LORG_SEARCH_ENGINES defaults to all four. Dropping one engine doesn’t proportionally reduce latency: SearxNG queries them in parallel, so the slowest engine sets the floor for step 4. In our (informal) testing the slowest is usually Yahoo, and removing it from the list consistently shaved 200–400 ms.
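The "slowest engine sets the floor" point is just a `max` over parallel latencies. In the sketch below the latencies are made-up illustrative values, not measurements, and only "yahoo" is an engine named in this post; the other keys are placeholders.

```python
def parallel_wall_time(latencies):
    """Engines are queried in parallel, so the slowest one sets the wall time."""
    return max(latencies.values())

# Illustrative latencies in seconds (invented for the example).
engines = {"engine_a": 0.8, "engine_b": 0.9, "engine_c": 1.0, "yahoo": 1.3}

def without(name):
    """Copy of the engine table with one engine removed."""
    return {k: v for k, v in engines.items() if k != name}

before = parallel_wall_time(engines)
drop_fast = parallel_wall_time(without("engine_a"))
drop_slow = parallel_wall_time(without("yahoo"))

print(f"all four:          {before:.1f} s")
print(f"drop fastest:      {drop_fast:.1f} s")
print(f"drop slowest:      {drop_slow:.1f} s")
```

Dropping a fast engine changes nothing; dropping the slowest one moves the floor down to the next-slowest engine, which is the only saving available at step 4.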

LORG_SEARCH_LIMIT is the biggest lever for step 5 and step 6. Going from 20 results to 10 cuts step-5 wall-time roughly in half (the slowest fetches dominate, and a smaller sweep has fewer chances to include one) and step-6 in proportion (because the scorer processes fewer items). The downside is recall: you may miss a relevant result that would have ranked highly. For most queries, 10 is plenty. We’d recommend not going below 5.

What you can’t optimize

A few things are not knobs, though people regularly ask about them:

The OpenAI base URL changes where the model calls go, not how many. Pointing at a local llama.cpp server saves money but costs latency unless your local hardware generates tokens faster than the hosted API.
Caching the draft answer does not exist in the package today. If the same question comes in twice within the same session, slorg does the whole thing twice.
Streaming does not exist either. slorg returns the final object when step 6 finishes. There is no way to start showing the user the draft answer at step 1 while the search runs.

These are reasonable feature requests. They are not currently shipped.
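Until caching ships, repeated questions can be short-circuited on the caller's side. The sketch below is whole-result memoization in your own code, not the draft-answer caching described above and not a slorg feature. `cached` and `fake_query` are hypothetical; `fake_query` stands in for whatever function runs a full query in your setup.

```python
import functools

def cached(query_fn, maxsize=128):
    """Wrap a query function so identical questions reuse the first result."""
    @functools.lru_cache(maxsize=maxsize)
    def wrapper(question: str):
        return query_fn(question)
    return wrapper

calls = []

def fake_query(question):
    # Hypothetical stand-in for a full pipeline run; records each real call.
    calls.append(question)
    return {"question": question, "answer": "..."}

ask = cached(fake_query)
first = ask("what is searxng?")
second = ask("what is searxng?")  # served from the cache, no second run
print("pipeline runs:", len(calls))
```

Note that the cache hands back the same object on a hit, so callers should treat results as read-only, and exact string matching means a rephrased question still pays the full latency.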

The trade you’re actually making

The point of the breakdown is this: slorg is not the right tool if your user is staring at a spinner. The wall-clock budget is structurally three LLM calls plus a fetch sweep, and no amount of tuning brings that below a few seconds.

slorg is the right tool if your user is willing to trade interactive latency for inspectable output. The result object includes the draft answer, the knowledge graph, the keywords, and the per-result score — each of which is itself useful, separate from “the final answer.” If you are wiring slorg into a research workflow where the intermediate artifacts get displayed or logged, you are getting the value you paid the latency for.

If you are wiring it into a chat box where the user only sees the last thing it produces, you are paying for planning the user can’t see. Don’t do that.

tl;dr

The latency budget breakdown:

~25% draft + graph (step 1)
~10% keyword extraction (step 3)
~15% SearxNG search (step 4)
~25% content fetch (step 5, varies wildly)
~25% scoring (step 6)

The knobs that move the needle:

LORG_SEARCH_LIMIT — biggest lever, lower it.
Content-fetch timeout — cut your losses on slow URLs.
LORG_SEARCH_ENGINES — drop Yahoo for ~300ms.
LORG_DEFAULT_MODEL — trade money and latency for better keyword extraction.

The knobs that don’t exist but should: streaming, per-step model selection, draft-answer caching. Open issues on the repo if those matter to you.

A note on what a “fast” version would even look like

A few people have asked whether we plan to ship a low-latency mode. The honest answer is that the levers available without changing the pipeline shape are mostly exhausted. You can cap the search limit, you can shorten timeouts, you can use a smaller model, you can run the LLM and the search engines in parallel where the dependencies allow it (which is roughly nowhere, because each step needs the previous one’s output). After those, the floor is set by three serial LLM round-trips plus the search and the slowest content fetch.

To get below that floor you have to change what the pipeline does. Plausible options: drop step 1 entirely and use the user’s prompt as the search keyword (this is the wrapper pattern, which has its own tradeoffs we wrote about separately). Drop step 5 and let the scoring step work on snippets rather than fetched content (cheaper, but the model is grading on titles and summaries rather than the actual page). Drop step 6 and trust SearxNG’s ranking (this would make slorg a knowledge-graph-conditioned SearxNG, which is a defensible product but a different one).

We have not shipped any of these because each one removes a property we built slorg to have. The latency budget is the price of the properties. If you want the properties at a lower price, the answer is to make the LLM cheaper, not to make the pipeline shorter.

This sounds defensive. It is partly defensive. It is also a real engineering observation: when a system’s slowness is structural rather than incidental, optimization work tends to either restore the structure (good) or quietly remove it (bad). The cost breakdown above is here so you can do the math yourself before you start tuning.
