LLM Gateway Architecture for Reliability

llm • gateway • reliability • ai-assistant

LLM providers fail, throttle, or change behavior without notice. A gateway keeps your assistant stable by centralizing routing, retries, fallbacks, and logging behind a single interface.

Core responsibilities

  • Provider abstraction (Gemini, OpenAI, Anthropic, etc.).
  • Request shaping (prompt templates, metadata, temperature).
  • Timeout and retry management with jitter.
  • Circuit breaker state per provider and per tenant (a minimal sketch follows this list).
  • Metrics and fallback_reason logging.
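
As a concrete illustration of the circuit-breaker responsibility, here is a minimal sketch in Python. The names (CircuitBreaker, BreakerState) and the threshold/cooldown numbers are assumptions for illustration, not a specific library's API:

```python
import time
from dataclasses import dataclass

@dataclass
class BreakerState:
    failures: int = 0
    opened_at: float = 0.0

class CircuitBreaker:
    """Tracks failure counts per (provider, tenant) key; opens past a threshold."""

    def __init__(self, threshold: int = 5, cooldown_s: float = 30.0):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.states: dict[tuple[str, str], BreakerState] = {}

    def allow(self, provider: str, tenant: str) -> bool:
        state = self.states.setdefault((provider, tenant), BreakerState())
        if state.failures < self.threshold:
            return True  # closed: traffic flows normally
        if time.monotonic() - state.opened_at >= self.cooldown_s:
            state.failures = self.threshold - 1  # half-open: allow one probe
            return True
        return False  # open: fail fast and route to the fallback provider

    def record(self, provider: str, tenant: str, ok: bool) -> None:
        state = self.states.setdefault((provider, tenant), BreakerState())
        if ok:
            state.failures = 0
        else:
            state.failures += 1
            if state.failures >= self.threshold:
                state.opened_at = time.monotonic()
```

Keying state by (provider, tenant) means one tenant's failures trip only its own breaker, without cutting the provider off for everyone else.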

Flow overview

  1. A request arrives from the answerer service with tenant context.
  2. The gateway selects the primary provider (e.g., Gemini) based on routing policy.
  3. It applies a per-attempt timeout (e.g., 6 seconds) and retry rules (max 2 retries, exponential backoff with jitter).
  4. If all retries fail, it logs a fallback_reason and switches to the secondary provider (see the sketch after this list).
  5. It returns the response plus metadata (token counts, provider used) to the caller.
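
A sketch of steps 3 and 4 under stated assumptions: the provider client, its name attribute, and its complete_once(prompt, timeout_s=...) method are hypothetical stand-ins, and the backoff schedule mirrors the numbers above:

```python
import logging
import random
import time

log = logging.getLogger("gateway")

def call_with_retries(provider, prompt: str, timeout_s: float = 6.0,
                      max_retries: int = 2) -> str:
    """Per-attempt timeout, exponential backoff with jitter between attempts."""
    for attempt in range(max_retries + 1):
        try:
            return provider.complete_once(prompt, timeout_s=timeout_s)
        except Exception:
            if attempt == max_retries:
                raise
            # Backoff: 0.5 s, 1 s, ... plus up to 250 ms of random jitter.
            time.sleep(0.5 * 2 ** attempt + random.uniform(0, 0.25))

def complete(primary, secondary, prompt: str, tenant: str) -> str:
    try:
        return call_with_retries(primary, prompt)
    except Exception as exc:
        # Step 4: record why we fell back, then try the secondary provider.
        log.warning("fallback_reason=%s tenant=%s provider=%s",
                    type(exc).__name__, tenant, primary.name)
        return call_with_retries(secondary, prompt)
```

In production the except clause would distinguish retryable errors (timeouts, 429s, 5xx) from permanent ones (auth failures), which should skip straight to the fallback.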

Design tips

  • Keep prompts deterministic; version them alongside providers.
  • Support per-tenant overrides (enterprise clients may mandate a specific provider).
  • Expose admin controls to pause a provider globally (see the policy sketch after this list).
  • Emit metrics: latency percentiles, error rates, token usage by provider.
  • Integrate with alerting (Google Chat, PagerDuty) when circuit breakers trip.
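
A sketch of how per-tenant overrides and a global pause might combine into a routing order; DEFAULT_ORDER, TENANT_OVERRIDES, PAUSED, and the sample tenant are all illustrative assumptions:

```python
# Routing-policy sketch; structure names and the sample tenant are hypothetical.
DEFAULT_ORDER = ["gemini", "openai"]
TENANT_OVERRIDES = {"acme-corp": ["openai"]}  # enterprise-mandated provider
PAUSED: set[str] = set()                      # toggled via admin controls

def provider_order(tenant: str) -> list[str]:
    """Resolve routing order: tenant override first, then drop paused providers."""
    order = TENANT_OVERRIDES.get(tenant, DEFAULT_ORDER)
    return [p for p in order if p not in PAUSED]
```

With this shape, pausing a provider is a one-line admin action that reroutes default-policy tenants on their next request, with no deploy.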

Security and compliance

  • Store API keys in Secret Manager and rotate them regularly.
  • Mask keys in logs; only log request IDs, not prompts.
  • Apply rate limits per provider and per tenant to avoid surprise bills (a token-bucket sketch follows this list).
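
For the rate-limit bullet, a token bucket keyed by (provider, tenant) is one option; the sketch below is illustrative, not a production limiter (no locking, no persistence):

```python
import time

class TokenBucket:
    """Per-(provider, tenant) request budget to cap spend."""

    def __init__(self, rate_per_s: float, burst: float):
        self.rate = rate_per_s
        self.capacity = burst
        self.levels: dict[tuple[str, str], tuple[float, float]] = {}

    def allow(self, provider: str, tenant: str) -> bool:
        # Refill proportionally to elapsed time, then spend one token if possible.
        now = time.monotonic()
        level, last = self.levels.get((provider, tenant), (self.capacity, now))
        level = min(self.capacity, level + (now - last) * self.rate)
        allowed = level >= 1.0
        self.levels[(provider, tenant)] = (level - 1.0 if allowed else level, now)
        return allowed
```

Callers denied by the bucket should receive an explicit rate-limit error rather than being silently queued, so tenants see the limit instead of unexplained latency.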

CrawlBot example

CrawlBot’s gateway defaults to Gemini, falls back to OpenAI, records fallback_reason, and logs token usage per tenant. Replicate this pattern to keep LLM-dependent assistants resilient.
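
Put together, a caller might receive an envelope like the one below. The field names are an assumption sketched from the pattern above, not CrawlBot's actual schema:

```python
# Hypothetical response envelope; field names are illustrative assumptions.
response = {
    "text": "...",                       # model output
    "provider": "openai",                # secondary, after a Gemini failure
    "fallback_reason": "TimeoutError",   # empty when the primary succeeded
    "tenant": "acme-corp",
    "tokens": {"prompt": 512, "completion": 180},
}
```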