1 · Context

The shorthand direct answers — each links to its full diagram below.

Q1 — How is the orchestration layered so Haiku handles routing and Sonnet only escalates when needed?

Haiku makes three cheap decisions per request (intent extraction, confidence gate, web-trust classification). Sonnet runs only at the final generation step. Opus is reserved for eval judging. The decision is logged in a per-request trace JSON so you can audit which model did what.
See 2b. Routing & Orchestration

Q2 — How are safety and accuracy enforced so the model leans on the certified corpus over generic web fallback?

A single trust tier (T1 authoritative / T2 vetted community / T3 open web) is enforced at four points: ingest payload, retrieval-time cap, generation prompt, and response-side profile. The lever max_trust_tier=1 cuts web off entirely. Trust signal is observable at every layer.
See 2c. Guardrails

Q3 — How are you approaching the transition to GraphRAG as the corpus grows?

Honest answer: not yet — defined trigger, not committed work. Vector retrieval handles ~95% of the questions techs actually ask today. The graph layer is forward-designed (hybrid entity-index over the existing vector path) with three measurable conditions that would fire the switch.
See 2d. Next iterations

2 · System Diagram

a · Stack components

Vector store
Qdrant 1.12 (HNSW, payload filter)
Orchestrator
FastAPI · single POST /query endpoint
Generation
Anthropic Claude — Haiku (fast tier), Sonnet (standard tier), Opus (judge tier)
Embeddings
Voyage AI (primary), local sentence-transformers (fallback) — same EmbeddingClient interface
Rerank
Cohere Rerank v3.5 (live) · noop branch retained as the fallback
Web fallback
Tavily, gated by a YAML trust-domain allowlist
UI
Next.js 16 + shadcn/ui · server-side proxy to FastAPI
Edge / TLS
Caddy 2 (auto Let's Encrypt)
Runtime
Docker Compose · Hetzner CPX node · Tailscale for admin SSH
Tracing
Per-request JSON trace written to disk; every model call, cost, and latency captured

Every external service sits behind a thin client interface (LLMClient, EmbeddingClient, SearchClient, RerankerClient) — swapping a provider is a config change, not a refactor.

b · Routing & Orchestration

Every request runs the same seven-step pipeline. Cheap models gate expensive ones; every decision is logged.

  POST /query { query, mode }
            │
            ▼
  ┌───────────────────────────────────────────────────────┐
  │ 1. INTENT EXTRACTION               Haiku · ~$0.0005   │
  │    extract {manufacturer, system, component} as JSON  │
  └───────────────────────────────────────────────────────┘
            │
            ▼
  ┌───────────────────────────────────────────────────────┐
  │ 2. FILTERED RETRIEVAL              Qdrant · vector    │
  │    embed query → top-20 with metadata filters         │
  └───────────────────────────────────────────────────────┘
            │
            ▼
  ┌───────────────────────────────────────────────────────┐
  │ 3. RERANK                          Cohere v3.5 · ~$.002│
  │    top-20 → top-8 by cross-encoder relevance          │
  └───────────────────────────────────────────────────────┘
            │
            ▼
  ┌───────────────────────────────────────────────────────┐
  │ 4. CONFIDENCE GATE                 Haiku · ~$0.001    │
  │    "does this context answer? yes | partial | no"     │
  └───────────────────────────────────────────────────────┘
            │
            ▼
  ┌───────────────────────────────────────────────────────┐
  │ 5. ROUTE DECISION                  pure function      │
  │    (mode, confidence) → use_corpus, use_web           │
  └───────────────────────────────────────────────────────┘
            │
        if use_web
            ▼
  ┌───────────────────────────────────────────────────────┐
  │ 6. WEB FALLBACK                    Tavily             │
  │    tag each hit T2 (allowlist) or T3 (uncurated)      │
  └───────────────────────────────────────────────────────┘
            │
            ▼
  ┌───────────────────────────────────────────────────────┐
  │ 7. GENERATION                      Sonnet · ~$0.01    │
  │    SAFETY_SYSTEM_PROMPT + corpus + web + question     │
  │    → answer with corpus vs web citations distinguished│
  └───────────────────────────────────────────────────────┘
            │
            ▼
       trace JSON

Which model, at which step, and why:

StepModelTierWhy this tier~$ / call
IntentHaiku 4.5fastStructured JSON output, narrow vocabulary, schema-constrained — Sonnet over-spec'd.$0.0005
RetrieveVector + metadata filter, no LLM.$0.0001
RerankCohere v3.5n/aCross-encoder is the right tool to narrow top-20 → top-8 by query relevance.$0.002
ConfidenceHaiku 4.5fastNear-binary classification on a short context.$0.001
RoutePure 6-line function: (mode, confidence) → (use_corpus, use_web).$0
Web searchTavily, not an LLM call.~$0
GenerationSonnet 4.6standardThis is where instruction-following, citation hygiene, and the safety contract have to hold.$0.005–0.02

A typical answered query costs ~1.5¢ end-to-end. Sonnet accounts for ~90% of cost and ~70% of latency; the Haiku calls are essentially free noise. To optimize further the lever is Sonnet's input-token count — gated by retrieval k and chunking strategy. Make the retrieval better, not the model bigger.

The four routing modesmode is a request parameter, not a global, so the UI can A/B test per request:

ModeBehaviorUsed for
corpus_onlyAnswer from corpus; never call web.Highest-trust path. Safe default for tier-1-heavy queries (propane, electrical).
hybrid_alwaysRetrieve corpus and call Tavily; return both side-by-side.UX experiment for ambiguous queries.
fallbackCorpus first; web only if confidence gate says no/partial.Production default. Cost-respecting and confidence-aware.
web_onlySkip corpus entirely.Diagnostic / eval — measures how much the corpus contributes.

The Tier enum (fast | standard | judge) is the contract, not the model name. When Haiku 4.5 becomes Haiku 5, the mapping changes in one config file and no orchestration code moves.

c · Guardrails

Three layers of trust signal — payload, prompt, response — each independently observable in the trace. A reviewer can read any answer and answer "what trust did this lean on?" in under a second.

Trust tier is a single integer (1, 2, or 3) shared across corpus chunks and web results — the same axis. Lower means more trusted:

TierMembersHow a thing gets tagged
T1 AUTHORITATIVEManufacturer service material; RVTI certified training; RV EMT internal SOPs.Set at ingest via ChunkMetadata.trust_tier.
T2 TRUSTED_COMMUNITYVetted 3rd-party domains on the curated allowlist (NFPA, RVIA, IRVE, established RV references, etc.).Computed at web-search time against trusted_domains.yaml. Editing the YAML + restart is the full curation workflow.
T3 UNCURATEDForums, blogs, anything not in T1 or T2.Default for any web result that doesn't match. Surfaced with a warning; generation prompt requires the model to acknowledge T3 status explicitly when citing one.

The max_trust_tier request param applies symmetrically to corpus retrieval and web fallback. At max_trust_tier=1 no web result survives. At max_trust_tier=2 only T2 web survives. The lever is the contract, not theater.

Sample response-side trust profile — same Furrion AC recall question, different caps:

// (a) clean: corpus-only propane question at max_trust_tier=1
{ "tier_counts": { "1": 8 }, "best_tier": 1,
  "warning": null }

// (b) soft warning: default fallback brings in T3 web
{ "tier_counts": { "1": 8, "3": 5 }, "best_tier": 1,
  "warning": "answer cites tier-3 (uncurated) sources alongside
              higher-tier material; treat tier-3 claims as unverified" }

// (c) same question, capped at max_trust_tier=2:
//     all 5 T3 web hits dropped; Sonnet sees corpus only;
//     answer reflects "I don't have recall data — check NHTSA"
{ "tier_counts": { "1": 8 }, "best_tier": 1,
  "warning": null }

The safety prompt — 7 rules, verbatim from the validated POC contract:

  1. Cite your sources. Every factual claim names the module and page.
  2. If the retrieved context doesn't clearly answer, say so plainly. No inventing part numbers, torque specs, voltages, pressures, or wire colors.
  3. Ask for make and model first when not provided — Dometic ≠ Norcold, Suburban ≠ Atwood, Onan ≠ Generac.
  4. Lead with safety on propane, 120V AC, refrigerant, high-pressure water — safety step before diagnostic step.
  5. Plain, direct language, numbered steps — the audience is reading this on a phone in a driveway.
  6. Out-of-corpus signal — for service bulletins, recalls, state-specific code: "I don't have that in my training — check the manufacturer service manual", not a guess.
  7. Don't guess at model year. If it matters and isn't given, ask.

Rule 2 is the structural mitigation for the Haiku confidence-gate's known false-positive case (adjacent-but-not-answering context). Rule 6 is what makes the recall behavior in trust profile (c) land cleanly.

d · Next / Future iterations

The honest framing: GraphRAG is a forward-looking design with a defined adoption trigger, not something I'm building yet, and not something I should be building yet for this corpus. What I have done is design the current system so adding a graph layer is incremental — not rip-and-replace.

Vector retrieval is great for procedural and definitional questions (~95% of what techs ask). It misses on relationships between entities:

  • Compatibility"Which slide-out motors are interchangeable with Lippert 1620104?"
  • Cross-model"What other Dometic refrigerators share the eyebrow control board in the DM2682?"
  • Platform-wide impact"What service bulletins apply to my Ford E-450 chassis platform?"

Adoption trigger — three measurable conditions:

  1. A held-out gold set of ~15 entity-relationship questions, written with senior techs.
  2. A retrieval-eval run showing pure-vector recall@5 on that subset is below ~0.5 while recall@5 on procedural stays ≥ 0.75.
  3. OR: a structured content type lands that explicitly carries relationships (parts interchangeability table, recall-to-platform mapping) and vector retrieval doesn't surface them.

Same trigger pattern I used for the reranker (deferred until eval signal slipped → wired noop → flipped one env var when ready) and BM25 (still deferred). Don't add complexity ahead of the signal that demands it; flip the switch the moment the signal shows up.

The architecture when the triggers fire — hybrid retrieval over an entity index, not a wholesale move to a graph database. Current vector path stays; a parallel entity-lookup path is added; the router picks which one runs.

  POST /query
        │
        ▼
  ┌─────────────────────────────────┐
  │ Intent extraction (Haiku)        │
  │  query_shape: procedural |       │
  │    compatibility |               │
  │    cross-model |                 │
  │    platform-impact               │
  └──────────────┬──────────────────┘
                 │
       ┌─────────┴─────────────────────┐
       ▼                               ▼
  ┌──────────────────┐         ┌──────────────────────┐
  │ Procedural path  │         │ Entity-graph path    │
  │ (today)          │         │ (new)                │
  │ • filtered vector│         │ • entity-extract from│
  │ • confidence gate│         │   query              │
  │ • route, gen     │         │ • lookup in entity   │
  │                  │         │   index              │
  │                  │         │ • fan-out to linked  │
  │                  │         │   chunks             │
  │                  │         │ • vector rerank      │
  │                  │         │ • confidence + gen   │
  └────────┬─────────┘         └──────────┬───────────┘
           │                              │
           └──────────────┬───────────────┘
                          ▼
              Sonnet generation
              with SAFETY_SYSTEM_PROMPT
              and the same precedence rules

Concretely: a flat key/value entity index ((entity_type, entity_id) → chunk_ids) populated by one cheap Haiku pass per chunk at ingest. Query-side classifier extends the existing intent extractor with a query_shape axis. Compatibility / cross-model / platform-impact questions fan out via the index; procedural and definitional take the existing path. No graph database, at first — dictionaries and payload-index filters until usage demands more.

The metadata schema is already entity-aware: ChunkMetadata.manufacturer, .component, and .model_coverage: list[str] are the seed fields an entity extractor would populate. Nothing in the current code precludes the future move; nothing commits to it either.

The half-day spike that would actually shift my confidence: sample 100 LCI manual chunks, run one Haiku pass each ($~0.10), hand-craft 5 entity-relationship questions, compare recall@5 between pure-vector and entity-lookup paths. ~$1 in API calls. Defensible answer either way.