Source-aware retrieval: knowing what your engine knows

Most search engines hide their backends behind a single answer. slorg shows you which engines returned which URLs, what content was actually extracted, and how the LLM scored each one, because retrieval without source awareness is just confident guessing.

retrieval, transparency, design

There is a quiet trick at the heart of most AI search products. They show you an answer with a few citations underneath it, and the citations look authoritative because they have favicons and proper domain names. What they do not show you is which search backend surfaced each citation, what content was actually fetched from it, whether the fetch succeeded, and how confident the answer-writing model was that the citation supported the claim.

That information exists. The system has it. It is just not displayed, because displaying it complicates the UI and confuses users who came for an answer, not a postmortem.

slorg makes the opposite choice. The result object you get back from slorg.search(...) includes the knowledge graph, the keywords used, each result’s source URL, the extracted content, and the per-result relevance score. The CLI prints this. The REST API returns this. The library hands it to you as structured data. The pipeline is source-aware because we wired it that way on principle, and because once you have seen the per-result scores you cannot un-see how often the top citation is scored 0.6.

This post is about why that matters, and what it doesn’t fix.

What source-aware retrieval is

A retrieval pipeline is source-aware if, for any returned answer, you can answer four questions:

  1. Where did this result come from? Not just the URL, but which backend index surfaced it. Was it Google? Bing? A specific vertical?
  2. What did we actually read? Did we fetch the page or just the snippet? Did the fetch succeed? Was the content paywalled or rendered client-side?
  3. What did the LLM think of it? Not “did it cite it” — that is a downstream consequence — but: when scored against the original query, what number did it get?
  4. What did the LLM expect to find? Before the search ran, what did the planner think the answer looked like? Did the retrieved content match that prior or contradict it?

slorg answers all four. Question 1 by exposing the search-engine result rather than hiding it. Question 2 by returning the extracted content text in the response. Question 3 via the explicit score field on each result. Question 4 by surfacing both the draft answer (step 1) and the knowledge graph (step 2) as first-class fields in the response.
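The four questions can be turned into a mechanical checklist over a response. What follows is a minimal sketch, not slorg's own code: it assumes the response carries the fields named in this post (answer, knowledgeGraph, and results with source, content, and score), and the helper name checkSourceAwareness and its types are hypothetical. Note that the origin check only confirms a source URL is present; it cannot tell which backend surfaced it.

```typescript
// Assumed, loosely typed view of a response; not slorg's published typings.
interface ResultLike {
  source?: string;
  content?: string;
  score?: number;
}

interface ResponseLike {
  answer?: string;
  knowledgeGraph?: unknown;
  results?: ResultLike[];
}

// One boolean per question in the list above.
interface Checklist {
  origin: boolean;   // Q1: every result has a source URL
  read: boolean;     // Q2: every result has non-empty extracted content
  scored: boolean;   // Q3: every result has a score between 0 and 1
  expected: boolean; // Q4: a draft answer and a knowledge graph are present
}

// Hypothetical helper: reports which questions this response can answer.
function checkSourceAwareness(response: ResponseLike): Checklist {
  const results = response.results ?? [];
  const nonEmpty = results.length > 0;
  return {
    origin:
      nonEmpty &&
      results.every((r) => typeof r.source === "string" && r.source.length > 0),
    read:
      nonEmpty &&
      results.every((r) => typeof r.content === "string" && r.content.trim().length > 0),
    scored:
      nonEmpty &&
      results.every((r) => typeof r.score === "number" && r.score >= 0 && r.score <= 1),
    expected:
      typeof response.answer === "string" &&
      response.answer.length > 0 &&
      response.knowledgeGraph != null,
  };
}
```

A false entry is a prompt to look at the raw response, for example an empty content string that suggests a failed fetch or a client-rendered page.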

That is what we mean by “knowing what your engine knows.”

What it costs to be source-aware

The cost is the response surface. The slorg result is not a one-liner answer. It is a structured object that looks roughly like:

{
  answer: "...",             // the draft from step 1, not the final
  knowledgeGraph: { ... },   // entities + relationships
  keywords: ["...", "..."],  // what we sent to SearxNG
  results: [
    {
      title: "...",
      source: "https://...", // URL only; SearxNG merges results across engines
      score: 0.94,           // LLM's relevance estimate, 0–1
      content: "..."         // extracted body text
    },
    // ... more
  ],
  tokenCount: 1247
}

This is more data than most callers want. The CLI flattens it for a terminal display. The REST endpoint returns it whole. If you are using slorg as a library, you usually want to pick a few fields and ignore the rest.
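For library callers, picking a few fields might look like the sketch below. The field names come from the object above; the TypeScript types, the pickCitations helper, and the 0.8 default cutoff are assumptions made for illustration, not part of slorg.

```typescript
// Shape inferred from the sketch above; these typings are assumed, not published by slorg.
interface SlorgResult {
  title: string;
  source: string;  // URL
  score: number;   // relevance estimate, 0 to 1
  content: string; // extracted body text
}

interface SlorgResponse {
  answer: string;
  knowledgeGraph: unknown;
  keywords: string[];
  results: SlorgResult[];
  tokenCount: number;
}

interface Citation {
  title: string;
  source: string;
  score: number;
}

// Hypothetical helper: keep the draft answer plus the results scoring at or
// above minScore, highest first, and drop the bulky content field.
function pickCitations(
  response: SlorgResponse,
  minScore = 0.8
): { answer: string; citations: Citation[] } {
  const citations = response.results
    .filter((r) => r.score >= minScore)  // filter returns a copy, so the input is not mutated
    .sort((a, b) => b.score - a.score)
    .map(({ title, source, score }) => ({ title, source, score }));
  return { answer: response.answer, citations };
}
```

Keeping the score next to each citation preserves the point of the exercise: the caller can still see that a citation was a 0.6 rather than a 0.94.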

The cost is also conceptual: users who are used to “AI search returns an answer” are not used to “AI search returns a draft plus a scored result list plus the plan.” We have to explain the result shape in the docs.

What source-awareness doesn’t fix

Source-aware retrieval is necessary for honest answers but it is not sufficient. Two failure modes survive even with full visibility:

The planner is wrong, confidently

If the draft answer in step 1 is wrong, the knowledge graph in step 2 will be wrong in a correlated way, the keywords in step 3 will go to the wrong topic, and the retrieved set will contain pages about the wrong thing. The per-result scores will be high for those pages — they really are relevant to the wrong question slorg ended up asking.

Source awareness shows you the chain. It tells you the keywords were wrong. It does not automatically correct them. A future version could loop back from step 6 to step 3 if the score distribution looks off; that would make slorg agentic, and we’ve explicitly said no to that. The visibility is the feature; the correction is left to the human reading the response.
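A check on the score distribution does not have to loop back; it can simply flag a response for the human. A minimal sketch under stated assumptions: the summarizeScores helper, its output fields, and the 0.7 threshold are hypothetical, and it only reads the per-result score field. It will not catch the case described above, where the pages score high against the wrong question.

```typescript
// Only the score field is needed; the rest of the result is ignored.
interface ScoredResult {
  score: number;
}

interface ScoreSummary {
  count: number;
  top: number;            // highest score in the set
  median: number;
  belowThreshold: number; // how many results scored under the threshold
  needsReview: boolean;   // true when no result reaches the threshold
}

// Hypothetical helper: summarizes the distribution and flags weak retrieval.
// It reports only; it does not re-run the search.
function summarizeScores(results: ScoredResult[], threshold = 0.7): ScoreSummary {
  const scores = results.map((r) => r.score).sort((a, b) => a - b);
  const count = scores.length;
  if (count === 0) {
    return { count: 0, top: 0, median: 0, belowThreshold: 0, needsReview: true };
  }
  const mid = Math.floor(count / 2);
  const median =
    count % 2 === 1 ? scores[mid] : (scores[mid - 1] + scores[mid]) / 2;
  const top = scores[count - 1];
  const belowThreshold = scores.filter((s) => s < threshold).length;
  return { count, top, median, belowThreshold, needsReview: top < threshold };
}
```

A dashboard could show needsReview as a warning beside the answer and leave the decision about rewording the query to the reader.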

Engine bias is inherited

SearxNG is a meta-search engine. It does not have its own index; it aggregates results from upstream engines such as Google, Bing, Yahoo, and DuckDuckGo. Anything those engines decline to surface, slorg cannot surface. Anything they rank highly, slorg starts from. If the topic is one that Google has chosen to suppress, demote, or wrap in a “complex topic” disclaimer, slorg inherits that decision and there is nothing in the pipeline that compensates for it.

This is true of every system built on public search. We mention it because product copy elsewhere likes to imply that “AI-powered” somehow routes around editorial decisions by the underlying engines. It does not. slorg is more honest about this inherited bias because the SearxNG layer is visible on the architecture page; it is no better at avoiding it.

What honest source-awareness looks like in the UI

We do not ship a UI. slorg is a CLI, a library, and a REST endpoint. But for the teams who have wired slorg into research dashboards, here is what works in practice:

None of this is novel. Research engineers building eval dashboards have been doing it for years. The point is that source-awareness lets you do it; an opaque “AI search” product does not.

The recommendation

If you are choosing between an opaque hosted AI search product and slorg, the choice is roughly: do you want to trust a vendor’s editorial layer, or do you want the visibility to make your own editorial decisions?

There is no shame in choosing the vendor. Vendors are faster, they polish the UI, they hide the failure modes. For many use cases — quick fact lookups, casual research — that is the right trade.

slorg exists for the cases where it is not. Where you actually need to know what your engine knows, and where “the engine returned this confident answer” is not enough.

In those cases, the result object is the product. The answer is just one field.