Adaptive Relevance Thresholds Explained
Retrieval scores determine whether an answer is safe to deliver. Set the threshold too low and the assistant generates from weak context and hallucinates; set it too high and it refuses legitimate questions. Adaptive thresholds resolve this trade-off by tuning the cutoff per tenant and per corpus.
What is the threshold?
It is the minimum retrieval score required before context is passed to the LLM. CrawlBot uses hybrid scoring (vector + lexical fusion), normalizes the fused scores, and compares the top result's score against the threshold.
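For illustration, here is a minimal sketch of such a gate. The class, function names, and fixed fusion weight are assumptions for the example, not CrawlBot's actual API, and both component scores are assumed to be pre-normalized to [0, 1].

```python
# Minimal sketch of a relevance gate; hypothetical names, not CrawlBot's API.
from dataclasses import dataclass

@dataclass
class RetrievedChunk:
    text: str
    vector_score: float   # cosine similarity from the vector index, in [0, 1]
    lexical_score: float  # e.g. BM25 score already scaled to [0, 1]

def fused_score(chunk: RetrievedChunk, alpha: float = 0.7) -> float:
    """Assumed fusion rule: weighted sum of vector and lexical scores."""
    return alpha * chunk.vector_score + (1 - alpha) * chunk.lexical_score

def passes_threshold(chunks: list[RetrievedChunk], threshold: float) -> bool:
    """Compare the best fused score against the tenant's threshold."""
    if not chunks:
        return False
    return max(fused_score(c) for c in chunks) >= threshold
```

If the gate returns False, the assistant serves a fallback response and logs fallback_reason=low_score instead of generating from weak context.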
Adaptive approach
- Seed: Initialize with the tenant's historical P95 score (e.g., 0.82) or a global default.
- Collect: Log scores for every answered query along with fallback reasons.
- Calculate: Maintain rolling windows (e.g., last 1,000 chats) and compute percentiles.
- Adjust: If fallback_reason=low_score spikes, lower the threshold slightly; if hallucination feedback rises, raise it (a sketch of the full loop follows this list).
- Audit: Record each adjustment in a policy log with timestamp, old value, new value, and reason.
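A minimal sketch of that loop, assuming an in-memory rolling window and percentile-based recalculation. The class name, record fields, window size, and audit-log shape are illustrative, not CrawlBot's schema, and which percentile to target is a policy choice.

```python
# Sketch of the seed -> collect -> calculate -> adjust -> audit loop.
# All names and defaults are assumptions for illustration only.
from collections import deque
from datetime import datetime, timezone
import statistics

WINDOW = 1_000          # rolling window of recent chats
SEED_THRESHOLD = 0.82   # e.g. historical P95 or a global default

class AdaptiveThreshold:
    def __init__(self, seed: float = SEED_THRESHOLD):
        self.value = seed
        self.scores = deque(maxlen=WINDOW)   # top retrieval score per answered query
        self.policy_log = []                 # audit trail of every adjustment

    def collect(self, top_score: float) -> None:
        self.scores.append(top_score)

    def recalculate(self, percentile: int = 95) -> float:
        """Candidate threshold from the rolling window; percentile is a policy choice."""
        if len(self.scores) < WINDOW:
            return self.value                # not enough data yet, keep current value
        cuts = statistics.quantiles(self.scores, n=100)
        return cuts[percentile - 1]

    def adjust(self, candidate: float, reason: str, band: float = 0.05) -> None:
        """Move toward the candidate, capped to +/- band, and record the change."""
        new_value = min(max(candidate, self.value - band), self.value + band)
        self.policy_log.append({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "old": self.value,
            "new": new_value,
            "reason": reason,               # e.g. "low_score spike" or "hallucination feedback"
        })
        self.value = new_value
```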
Signals to monitor
- Containment rate versus the share of fallbacks logged with fallback_reason=low_score.
- Negative feedback flagged as “incorrect” on answers whose scores were near the threshold (a sketch of both checks follows this list).
- Corpus changes (big crawl, new language) that shift score distributions.
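As a rough example of how the first two signals might be computed from the same query log, assuming each record carries a top_score, a fallback_reason, and a user feedback label (all illustrative field names):

```python
# Illustrative signal computation over a window of query-log records.
# Field names and the near-threshold margin are assumptions.
from typing import Optional, TypedDict

class QueryRecord(TypedDict):
    top_score: float
    fallback_reason: Optional[str]   # e.g. "low_score" or None
    feedback: Optional[str]          # e.g. "incorrect", "helpful", or None

def low_score_fallback_rate(records: list[QueryRecord]) -> float:
    """Share of queries that fell back because of a low retrieval score."""
    if not records:
        return 0.0
    hits = sum(1 for r in records if r["fallback_reason"] == "low_score")
    return hits / len(records)

def near_threshold_incorrect_rate(records: list[QueryRecord],
                                  threshold: float,
                                  margin: float = 0.03) -> float:
    """Share of near-threshold answers that users flagged as incorrect."""
    near = [r for r in records
            if r["fallback_reason"] is None
            and abs(r["top_score"] - threshold) <= margin]
    if not near:
        return 0.0
    bad = sum(1 for r in near if r["feedback"] == "incorrect")
    return bad / len(near)
```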
Implementation tips
- Use exponential moving averages so a single noisy window does not move the threshold abruptly (see the sketch after these tips).
- Cap adjustments within a safe band (e.g., ±0.05) unless manual overrides apply.
- Provide an admin override per tenant for regulated industries needing stricter refusals.
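A minimal sketch of the smoothing, capping, and override tips combined, assuming the EMA is applied to the candidate threshold each window; the smoothing factor and band are illustrative defaults.

```python
# Illustrative EMA smoothing plus a safety band around the current threshold.
def smooth_and_clamp(current: float,
                     candidate: float,
                     ema_alpha: float = 0.2,
                     band: float = 0.05,
                     override: float | None = None) -> float:
    """Blend the candidate into the current value, then cap the move to +/- band."""
    if override is not None:
        return override                      # per-tenant admin override wins outright
    smoothed = ema_alpha * candidate + (1 - ema_alpha) * current
    return min(max(smoothed, current - band), current + band)
```

For example, with current = 0.82 and candidate = 0.74, the EMA gives 0.804, which stays inside the ±0.05 band and becomes the new threshold.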
CrawlBot automation
CrawlBot’s config-profiles service stores adaptive policies per tenant, emits change events, and exposes them in the admin UI. Ops can simulate new thresholds using historic logs before applying them. Bring the same rigor to your stack to keep assistants confident and safe.
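CrawlBot's simulation tooling is not shown here, but the idea can be approximated by replaying logged top scores against a candidate threshold; this sketch and its names are assumptions, not the config-profiles API.

```python
# Hypothetical replay of historic logs against a candidate threshold.
def simulate_threshold(top_scores: list[float], candidate: float) -> dict:
    """Estimate how often answers would have been served vs. refused."""
    served = sum(1 for s in top_scores if s >= candidate)
    total = len(top_scores) or 1             # guard against an empty log
    return {
        "candidate": candidate,
        "served_rate": served / total,
        "refusal_rate": 1 - served / total,
    }

# Example: compare the current and proposed thresholds before rollout, e.g.
# simulate_threshold(logged_scores, 0.82) vs simulate_threshold(logged_scores, 0.78)
```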