1 · Context
The shorthand direct answers — each links to its full diagram below.
Q1 — How is the orchestration layered so Haiku handles routing and Sonnet only escalates when needed?
Haiku makes three cheap decisions per request (intent extraction, confidence gate, web-trust classification). Sonnet runs only at the final generation step. Opus is reserved for eval judging. The decision is logged in a per-request trace JSON so you can audit which model did what.
→ See 2b. Routing & Orchestration
Q2 — How are safety and accuracy enforced so the model leans on the certified corpus over generic web fallback?
A single trust tier (T1 authoritative / T2 vetted community / T3 open web) is enforced at four points: ingest payload, retrieval-time cap, generation prompt, and response-side profile. The lever max_trust_tier=1 cuts web off entirely. Trust signal is observable at every layer.
→ See 2c. Guardrails
Q3 — How are you approaching the transition to GraphRAG as the corpus grows?
Honest answer: not yet — defined trigger, not committed work. Vector retrieval handles ~95% of the questions techs actually ask today. The graph layer is forward-designed (hybrid entity-index over the existing vector path) with three measurable conditions that would fire the switch.
→ See 2d. Next iterations
2 · System Diagram
a · Stack components
- Vector store
- Qdrant 1.12 (HNSW, payload filter)
- Orchestrator
- FastAPI · single
POST /queryendpoint - Generation
- Anthropic Claude — Haiku (fast tier), Sonnet (standard tier), Opus (judge tier)
- Embeddings
- Voyage AI (primary), local sentence-transformers (fallback) — same
EmbeddingClientinterface - Rerank
- Cohere Rerank v3.5 (live) · noop branch retained as the fallback
- Web fallback
- Tavily, gated by a YAML trust-domain allowlist
- UI
- Next.js 16 + shadcn/ui · server-side proxy to FastAPI
- Edge / TLS
- Caddy 2 (auto Let's Encrypt)
- Runtime
- Docker Compose · Hetzner CPX node · Tailscale for admin SSH
- Tracing
- Per-request JSON trace written to disk; every model call, cost, and latency captured
Every external service sits behind a thin client interface (LLMClient, EmbeddingClient, SearchClient, RerankerClient) — swapping a provider is a config change, not a refactor.
b · Routing & Orchestration
Every request runs the same seven-step pipeline. Cheap models gate expensive ones; every decision is logged.
POST /query { query, mode }
│
▼
┌───────────────────────────────────────────────────────┐
│ 1. INTENT EXTRACTION Haiku · ~$0.0005 │
│ extract {manufacturer, system, component} as JSON │
└───────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────┐
│ 2. FILTERED RETRIEVAL Qdrant · vector │
│ embed query → top-20 with metadata filters │
└───────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────┐
│ 3. RERANK Cohere v3.5 · ~$.002│
│ top-20 → top-8 by cross-encoder relevance │
└───────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────┐
│ 4. CONFIDENCE GATE Haiku · ~$0.001 │
│ "does this context answer? yes | partial | no" │
└───────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────┐
│ 5. ROUTE DECISION pure function │
│ (mode, confidence) → use_corpus, use_web │
└───────────────────────────────────────────────────────┘
│
if use_web
▼
┌───────────────────────────────────────────────────────┐
│ 6. WEB FALLBACK Tavily │
│ tag each hit T2 (allowlist) or T3 (uncurated) │
└───────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────┐
│ 7. GENERATION Sonnet · ~$0.01 │
│ SAFETY_SYSTEM_PROMPT + corpus + web + question │
│ → answer with corpus vs web citations distinguished│
└───────────────────────────────────────────────────────┘
│
▼
trace JSON
Which model, at which step, and why:
| Step | Model | Tier | Why this tier | ~$ / call |
|---|---|---|---|---|
| Intent | Haiku 4.5 | fast | Structured JSON output, narrow vocabulary, schema-constrained — Sonnet over-spec'd. | $0.0005 |
| Retrieve | — | — | Vector + metadata filter, no LLM. | $0.0001 |
| Rerank | Cohere v3.5 | n/a | Cross-encoder is the right tool to narrow top-20 → top-8 by query relevance. | $0.002 |
| Confidence | Haiku 4.5 | fast | Near-binary classification on a short context. | $0.001 |
| Route | — | — | Pure 6-line function: (mode, confidence) → (use_corpus, use_web). | $0 |
| Web search | — | — | Tavily, not an LLM call. | ~$0 |
| Generation | Sonnet 4.6 | standard | This is where instruction-following, citation hygiene, and the safety contract have to hold. | $0.005–0.02 |
A typical answered query costs ~1.5¢ end-to-end. Sonnet accounts for ~90% of cost and ~70% of latency; the Haiku calls are essentially free noise. To optimize further the lever is Sonnet's input-token count — gated by retrieval k and chunking strategy. Make the retrieval better, not the model bigger.
The four routing modes — mode is a request parameter, not a global, so the UI can A/B test per request:
| Mode | Behavior | Used for |
|---|---|---|
corpus_only | Answer from corpus; never call web. | Highest-trust path. Safe default for tier-1-heavy queries (propane, electrical). |
hybrid_always | Retrieve corpus and call Tavily; return both side-by-side. | UX experiment for ambiguous queries. |
fallback | Corpus first; web only if confidence gate says no/partial. | Production default. Cost-respecting and confidence-aware. |
web_only | Skip corpus entirely. | Diagnostic / eval — measures how much the corpus contributes. |
The Tier enum (fast | standard | judge) is the contract, not the model name. When Haiku 4.5 becomes Haiku 5, the mapping changes in one config file and no orchestration code moves.
c · Guardrails
Three layers of trust signal — payload, prompt, response — each independently observable in the trace. A reviewer can read any answer and answer "what trust did this lean on?" in under a second.
Trust tier is a single integer (1, 2, or 3) shared across corpus chunks and web results — the same axis. Lower means more trusted:
| Tier | Members | How a thing gets tagged |
|---|---|---|
| T1 AUTHORITATIVE | Manufacturer service material; RVTI certified training; RV EMT internal SOPs. | Set at ingest via ChunkMetadata.trust_tier. |
| T2 TRUSTED_COMMUNITY | Vetted 3rd-party domains on the curated allowlist (NFPA, RVIA, IRVE, established RV references, etc.). | Computed at web-search time against trusted_domains.yaml. Editing the YAML + restart is the full curation workflow. |
| T3 UNCURATED | Forums, blogs, anything not in T1 or T2. | Default for any web result that doesn't match. Surfaced with a warning; generation prompt requires the model to acknowledge T3 status explicitly when citing one. |
The max_trust_tier request param applies symmetrically to corpus retrieval and web fallback. At max_trust_tier=1 no web result survives. At max_trust_tier=2 only T2 web survives. The lever is the contract, not theater.
Sample response-side trust profile — same Furrion AC recall question, different caps:
// (a) clean: corpus-only propane question at max_trust_tier=1
{ "tier_counts": { "1": 8 }, "best_tier": 1,
"warning": null }
// (b) soft warning: default fallback brings in T3 web
{ "tier_counts": { "1": 8, "3": 5 }, "best_tier": 1,
"warning": "answer cites tier-3 (uncurated) sources alongside
higher-tier material; treat tier-3 claims as unverified" }
// (c) same question, capped at max_trust_tier=2:
// all 5 T3 web hits dropped; Sonnet sees corpus only;
// answer reflects "I don't have recall data — check NHTSA"
{ "tier_counts": { "1": 8 }, "best_tier": 1,
"warning": null }
The safety prompt — 7 rules, verbatim from the validated POC contract:
- Cite your sources. Every factual claim names the module and page.
- If the retrieved context doesn't clearly answer, say so plainly. No inventing part numbers, torque specs, voltages, pressures, or wire colors.
- Ask for make and model first when not provided — Dometic ≠ Norcold, Suburban ≠ Atwood, Onan ≠ Generac.
- Lead with safety on propane, 120V AC, refrigerant, high-pressure water — safety step before diagnostic step.
- Plain, direct language, numbered steps — the audience is reading this on a phone in a driveway.
- Out-of-corpus signal — for service bulletins, recalls, state-specific code: "I don't have that in my training — check the manufacturer service manual", not a guess.
- Don't guess at model year. If it matters and isn't given, ask.
Rule 2 is the structural mitigation for the Haiku confidence-gate's known false-positive case (adjacent-but-not-answering context). Rule 6 is what makes the recall behavior in trust profile (c) land cleanly.
d · Next / Future iterations
The honest framing: GraphRAG is a forward-looking design with a defined adoption trigger, not something I'm building yet, and not something I should be building yet for this corpus. What I have done is design the current system so adding a graph layer is incremental — not rip-and-replace.
Vector retrieval is great for procedural and definitional questions (~95% of what techs ask). It misses on relationships between entities:
- Compatibility — "Which slide-out motors are interchangeable with Lippert 1620104?"
- Cross-model — "What other Dometic refrigerators share the eyebrow control board in the DM2682?"
- Platform-wide impact — "What service bulletins apply to my Ford E-450 chassis platform?"
Adoption trigger — three measurable conditions:
- A held-out gold set of ~15 entity-relationship questions, written with senior techs.
- A retrieval-eval run showing pure-vector recall@5 on that subset is below ~0.5 while recall@5 on procedural stays ≥ 0.75.
- OR: a structured content type lands that explicitly carries relationships (parts interchangeability table, recall-to-platform mapping) and vector retrieval doesn't surface them.
Same trigger pattern I used for the reranker (deferred until eval signal slipped → wired noop → flipped one env var when ready) and BM25 (still deferred). Don't add complexity ahead of the signal that demands it; flip the switch the moment the signal shows up.
The architecture when the triggers fire — hybrid retrieval over an entity index, not a wholesale move to a graph database. Current vector path stays; a parallel entity-lookup path is added; the router picks which one runs.
POST /query
│
▼
┌─────────────────────────────────┐
│ Intent extraction (Haiku) │
│ query_shape: procedural | │
│ compatibility | │
│ cross-model | │
│ platform-impact │
└──────────────┬──────────────────┘
│
┌─────────┴─────────────────────┐
▼ ▼
┌──────────────────┐ ┌──────────────────────┐
│ Procedural path │ │ Entity-graph path │
│ (today) │ │ (new) │
│ • filtered vector│ │ • entity-extract from│
│ • confidence gate│ │ query │
│ • route, gen │ │ • lookup in entity │
│ │ │ index │
│ │ │ • fan-out to linked │
│ │ │ chunks │
│ │ │ • vector rerank │
│ │ │ • confidence + gen │
└────────┬─────────┘ └──────────┬───────────┘
│ │
└──────────────┬───────────────┘
▼
Sonnet generation
with SAFETY_SYSTEM_PROMPT
and the same precedence rules
Concretely: a flat key/value entity index ((entity_type, entity_id) → chunk_ids) populated by one cheap Haiku pass per chunk at ingest. Query-side classifier extends the existing intent extractor with a query_shape axis. Compatibility / cross-model / platform-impact questions fan out via the index; procedural and definitional take the existing path. No graph database, at first — dictionaries and payload-index filters until usage demands more.
The metadata schema is already entity-aware: ChunkMetadata.manufacturer, .component, and .model_coverage: list[str] are the seed fields an entity extractor would populate. Nothing in the current code precludes the future move; nothing commits to it either.
The half-day spike that would actually shift my confidence: sample 100 LCI manual chunks, run one Haiku pass each ($~0.10), hand-craft 5 entity-relationship questions, compare recall@5 between pure-vector and entity-lookup paths. ~$1 in API calls. Defensible answer either way.