RV EMT Tech Assistant — demo for Zapier

1 · Context

The shorthand direct answers — each links to its full diagram below.

Q1 — How is the orchestration layered so Haiku handles routing and Sonnet only escalates when needed?

Haiku makes three cheap decisions per request (intent extraction, confidence gate, web-trust classification). Sonnet runs only at the final generation step. Opus is reserved for eval judging. The decision is logged in a per-request trace JSON so you can audit which model did what.
→ See 2b. Routing & Orchestration

Q2 — How are safety and accuracy enforced so the model leans on the certified corpus over generic web fallback?

A single trust tier (T1 authoritative / T2 vetted community / T3 open web) is enforced at four points: ingest payload, retrieval-time cap, generation prompt, and response-side profile. The lever max_trust_tier=1 cuts web off entirely. Trust signal is observable at every layer.
→ See 2c. Guardrails

Q3 — How are you approaching the transition to GraphRAG as the corpus grows?

Honest answer: not yet — defined trigger, not committed work. Vector retrieval handles ~95% of the questions techs actually ask today. The graph layer is forward-designed (hybrid entity-index over the existing vector path) with three measurable conditions that would fire the switch.
→ See 2d. Next iterations

2 · System Diagram

a · Stack components

Vector store: Qdrant 1.12 (HNSW, payload filter)
Orchestrator: FastAPI · single POST /query endpoint
Generation: Anthropic Claude — Haiku (fast tier), Sonnet (standard tier), Opus (judge tier)
Embeddings: Voyage AI (primary), local sentence-transformers (fallback) — same EmbeddingClient interface
Rerank: Cohere Rerank v3.5 (live) · noop branch retained as the fallback
Web fallback: Tavily, gated by a YAML trust-domain allowlist
UI: Next.js 16 + shadcn/ui · server-side proxy to FastAPI
Edge / TLS: Caddy 2 (auto Let's Encrypt)
Runtime: Docker Compose · Hetzner CPX node · Tailscale for admin SSH
Tracing: Per-request JSON trace written to disk; every model call, cost, and latency captured

Every external service sits behind a thin client interface (LLMClient, EmbeddingClient, SearchClient, RerankerClient) — swapping a provider is a config change, not a refactor.

b · Routing & Orchestration

Every request runs the same seven-step pipeline. Cheap models gate expensive ones; every decision is logged.

  POST /query { query, mode }
            │
            ▼
  ┌───────────────────────────────────────────────────────┐
  │ 1. INTENT EXTRACTION               Haiku · ~$0.0005   │
  │    extract {manufacturer, system, component} as JSON  │
  └───────────────────────────────────────────────────────┘
            │
            ▼
  ┌───────────────────────────────────────────────────────┐
  │ 2. FILTERED RETRIEVAL              Qdrant · vector    │
  │    embed query → top-20 with metadata filters         │
  └───────────────────────────────────────────────────────┘
            │
            ▼
  ┌───────────────────────────────────────────────────────┐
  │ 3. RERANK                          Cohere v3.5 · ~$.002│
  │    top-20 → top-8 by cross-encoder relevance          │
  └───────────────────────────────────────────────────────┘
            │
            ▼
  ┌───────────────────────────────────────────────────────┐
  │ 4. CONFIDENCE GATE                 Haiku · ~$0.001    │
  │    "does this context answer? yes | partial | no"     │
  └───────────────────────────────────────────────────────┘
            │
            ▼
  ┌───────────────────────────────────────────────────────┐
  │ 5. ROUTE DECISION                  pure function      │
  │    (mode, confidence) → use_corpus, use_web           │
  └───────────────────────────────────────────────────────┘
            │
        if use_web
            ▼
  ┌───────────────────────────────────────────────────────┐
  │ 6. WEB FALLBACK                    Tavily             │
  │    tag each hit T2 (allowlist) or T3 (uncurated)      │
  └───────────────────────────────────────────────────────┘
            │
            ▼
  ┌───────────────────────────────────────────────────────┐
  │ 7. GENERATION                      Sonnet · ~$0.01    │
  │    SAFETY_SYSTEM_PROMPT + corpus + web + question     │
  │    → answer with corpus vs web citations distinguished│
  └───────────────────────────────────────────────────────┘
            │
            ▼
       trace JSON

Which model, at which step, and why:

Step	Model	Tier	Why this tier	~$ / call
Intent	Haiku 4.5	`fast`	Structured JSON output, narrow vocabulary, schema-constrained — Sonnet over-spec'd.	$0.0005
Retrieve	—	—	Vector + metadata filter, no LLM.	$0.0001
Rerank	Cohere v3.5	n/a	Cross-encoder is the right tool to narrow top-20 → top-8 by query relevance.	$0.002
Confidence	Haiku 4.5	`fast`	Near-binary classification on a short context.	$0.001
Route	—	—	Pure 6-line function: `(mode, confidence) → (use_corpus, use_web)`.	$0
Web search	—	—	Tavily, not an LLM call.	~$0
Generation	Sonnet 4.6	`standard`	This is where instruction-following, citation hygiene, and the safety contract have to hold.	$0.005–0.02

A typical answered query costs ~1.5¢ end-to-end. Sonnet accounts for ~90% of cost and ~70% of latency; the Haiku calls are essentially free noise. To optimize further the lever is Sonnet's input-token count — gated by retrieval k and chunking strategy. Make the retrieval better, not the model bigger.

The four routing modes — mode is a request parameter, not a global, so the UI can A/B test per request:

Mode	Behavior	Used for
`corpus_only`	Answer from corpus; never call web.	Highest-trust path. Safe default for tier-1-heavy queries (propane, electrical).
`hybrid_always`	Retrieve corpus and call Tavily; return both side-by-side.	UX experiment for ambiguous queries.
`fallback`	Corpus first; web only if confidence gate says `no`/`partial`.	Production default. Cost-respecting and confidence-aware.
`web_only`	Skip corpus entirely.	Diagnostic / eval — measures how much the corpus contributes.

The Tier enum (fast | standard | judge) is the contract, not the model name. When Haiku 4.5 becomes Haiku 5, the mapping changes in one config file and no orchestration code moves.

c · Guardrails

Three layers of trust signal — payload, prompt, response — each independently observable in the trace. A reviewer can read any answer and answer "what trust did this lean on?" in under a second.

Trust tier is a single integer (1, 2, or 3) shared across corpus chunks and web results — the same axis. Lower means more trusted:

Tier	Members	How a thing gets tagged
T1 AUTHORITATIVE	Manufacturer service material; RVTI certified training; RV EMT internal SOPs.	Set at ingest via `ChunkMetadata.trust_tier`.
T2 TRUSTED_COMMUNITY	Vetted 3rd-party domains on the curated allowlist (NFPA, RVIA, IRVE, established RV references, etc.).	Computed at web-search time against `trusted_domains.yaml`. Editing the YAML + restart is the full curation workflow.
T3 UNCURATED	Forums, blogs, anything not in T1 or T2.	Default for any web result that doesn't match. Surfaced with a warning; generation prompt requires the model to acknowledge T3 status explicitly when citing one.

The max_trust_tier request param applies symmetrically to corpus retrieval and web fallback. At max_trust_tier=1 no web result survives. At max_trust_tier=2 only T2 web survives. The lever is the contract, not theater.

Sample response-side trust profile — same Furrion AC recall question, different caps:

// (a) clean: corpus-only propane question at max_trust_tier=1
{ "tier_counts": { "1": 8 }, "best_tier": 1,
  "warning": null }

// (b) soft warning: default fallback brings in T3 web
{ "tier_counts": { "1": 8, "3": 5 }, "best_tier": 1,
  "warning": "answer cites tier-3 (uncurated) sources alongside
              higher-tier material; treat tier-3 claims as unverified" }

// (c) same question, capped at max_trust_tier=2:
//     all 5 T3 web hits dropped; Sonnet sees corpus only;
//     answer reflects "I don't have recall data — check NHTSA"
{ "tier_counts": { "1": 8 }, "best_tier": 1,
  "warning": null }

The safety prompt — 7 rules, verbatim from the validated POC contract:

Cite your sources. Every factual claim names the module and page.
If the retrieved context doesn't clearly answer, say so plainly. No inventing part numbers, torque specs, voltages, pressures, or wire colors.
Ask for make and model first when not provided — Dometic ≠ Norcold, Suburban ≠ Atwood, Onan ≠ Generac.
Lead with safety on propane, 120V AC, refrigerant, high-pressure water — safety step before diagnostic step.
Plain, direct language, numbered steps — the audience is reading this on a phone in a driveway.
Out-of-corpus signal — for service bulletins, recalls, state-specific code: "I don't have that in my training — check the manufacturer service manual", not a guess.
Don't guess at model year. If it matters and isn't given, ask.

Rule 2 is the structural mitigation for the Haiku confidence-gate's known false-positive case (adjacent-but-not-answering context). Rule 6 is what makes the recall behavior in trust profile (c) land cleanly.

d · Next / Future iterations

The honest framing: GraphRAG is a forward-looking design with a defined adoption trigger, not something I'm building yet, and not something I should be building yet for this corpus. What I have done is design the current system so adding a graph layer is incremental — not rip-and-replace.

Vector retrieval is great for procedural and definitional questions (~95% of what techs ask). It misses on relationships between entities:

Compatibility — "Which slide-out motors are interchangeable with Lippert 1620104?"
Cross-model — "What other Dometic refrigerators share the eyebrow control board in the DM2682?"
Platform-wide impact — "What service bulletins apply to my Ford E-450 chassis platform?"

Adoption trigger — three measurable conditions:

A held-out gold set of ~15 entity-relationship questions, written with senior techs.
A retrieval-eval run showing pure-vector recall@5 on that subset is below ~0.5 while recall@5 on procedural stays ≥ 0.75.
OR: a structured content type lands that explicitly carries relationships (parts interchangeability table, recall-to-platform mapping) and vector retrieval doesn't surface them.

Same trigger pattern I used for the reranker (deferred until eval signal slipped → wired noop → flipped one env var when ready) and BM25 (still deferred). Don't add complexity ahead of the signal that demands it; flip the switch the moment the signal shows up.

The architecture when the triggers fire — hybrid retrieval over an entity index, not a wholesale move to a graph database. Current vector path stays; a parallel entity-lookup path is added; the router picks which one runs.

  POST /query
        │
        ▼
  ┌─────────────────────────────────┐
  │ Intent extraction (Haiku)        │
  │  query_shape: procedural |       │
  │    compatibility |               │
  │    cross-model |                 │
  │    platform-impact               │
  └──────────────┬──────────────────┘
                 │
       ┌─────────┴─────────────────────┐
       ▼                               ▼
  ┌──────────────────┐         ┌──────────────────────┐
  │ Procedural path  │         │ Entity-graph path    │
  │ (today)          │         │ (new)                │
  │ • filtered vector│         │ • entity-extract from│
  │ • confidence gate│         │   query              │
  │ • route, gen     │         │ • lookup in entity   │
  │                  │         │   index              │
  │                  │         │ • fan-out to linked  │
  │                  │         │   chunks             │
  │                  │         │ • vector rerank      │
  │                  │         │ • confidence + gen   │
  └────────┬─────────┘         └──────────┬───────────┘
           │                              │
           └──────────────┬───────────────┘
                          ▼
              Sonnet generation
              with SAFETY_SYSTEM_PROMPT
              and the same precedence rules

Concretely: a flat key/value entity index ((entity_type, entity_id) → chunk_ids) populated by one cheap Haiku pass per chunk at ingest. Query-side classifier extends the existing intent extractor with a query_shape axis. Compatibility / cross-model / platform-impact questions fan out via the index; procedural and definitional take the existing path. No graph database, at first — dictionaries and payload-index filters until usage demands more.

The metadata schema is already entity-aware: ChunkMetadata.manufacturer, .component, and .model_coverage: list[str] are the seed fields an entity extractor would populate. Nothing in the current code precludes the future move; nothing commits to it either.

The half-day spike that would actually shift my confidence: sample 100 LCI manual chunks, run one Haiku pass each ($~0.10), hand-craft 5 entity-relationship questions, compare recall@5 between pure-vector and entity-lookup paths. ~$1 in API calls. Defensible answer either way.