Metrics for AI Assistant Quality

metrics • kpi • assistant • evaluation

Metrics translate raw interaction traces into actionable signals of quality and impact. Without a layered framework, teams chase vanity stats (total sessions) while missing silent regressions: faithfulness drift, containment collapse in a single segment, rising fallback loops. This article defines a balanced scorecard covering retrieval performance, answer quality, support outcomes, business impact, and diagnostics, plus instrumentation design and target setting.

Metric Framework Overview

Five layers:

  1. Retrieval Performance (Can we surface the right evidence?)
  2. Generation Quality (Do answers reflect evidence & user intent?)
  3. Support Outcomes (Are users self‑resolving?)
  4. Business Impact (Does this drive activation / retention?)
  5. Diagnostics (Why did failures occur?)

Each higher layer depends on the stability of the layers beneath it.

Retrieval Metrics

| Metric | Definition | Goal (Phase 1) | Notes |
|---|---|---|---|
| Recall@5 | % of queries with at least one gold evidence chunk in the top 5 | >70% | Derived from a gold evaluation set |
| Precision@5 | Relevant chunks in top 5 / 5 | >65% | Noise control |
| Coverage | Unique pages referenced / total prioritized pages | >85% | Gap detection |
| Redundancy Rate | Ratio of duplicate source chunks | <30% | Tune chunk overlap |
| Freshness Age P50 | Median days since chunk update | <14 days | Content ops |
| Retrieval Latency P95 | Retrieval-stage latency (ms) | <350 ms | UX budget |
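
A minimal sketch of how Recall@5 and Precision@5 can be computed against a gold evaluation set; the data shapes (query_id-keyed dictionaries of retrieved and relevant chunk IDs) are illustrative assumptions, not a prescribed schema:

```python
from typing import Dict, List, Set


def retrieval_metrics_at_k(
    retrieved: Dict[str, List[str]],   # query_id -> ranked chunk IDs from retrieval
    gold: Dict[str, Set[str]],         # query_id -> chunk IDs judged relevant (gold set)
    k: int = 5,
) -> Dict[str, float]:
    recall_hits = 0
    precision_sum = 0.0
    for query_id, relevant in gold.items():
        top_k = retrieved.get(query_id, [])[:k]
        # Recall@k here: share of queries with at least one gold chunk in the top k.
        if any(chunk_id in relevant for chunk_id in top_k):
            recall_hits += 1
        # Precision@k: fraction of the top k that is relevant.
        precision_sum += sum(chunk_id in relevant for chunk_id in top_k) / k
    n = len(gold) or 1
    return {"recall_at_k": recall_hits / n, "precision_at_k": precision_sum / n}
```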

Generation Metrics

| Metric | Definition | Collection Method |
|---|---|---|
| Faithfulness Error Rate | % of claims unsupported by retrieved evidence | Human + model critique |
| Completeness Score | Required facts present (checklist) | Human review |
| Helpfulness | 1–5 rating | User / internal rater |
| Citation Accuracy | Correct citations / total citations | Automated + sample |
| Refusal Appropriateness | Proper refusals / total refusals | Human sample |
| First Token Latency P95 | Time to first generated token | Telemetry |
| Full Answer Latency P95 | End-to-end answer latency | Telemetry |
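
The review-based metrics above can be rolled up from a sample of graded answers along these lines; the ReviewedAnswer fields are hypothetical names chosen for the sketch, not a fixed rubric:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReviewedAnswer:
    claims: int                 # total factual claims in the answer
    unsupported_claims: int     # claims not backed by retrieved evidence
    required_facts: int         # checklist items expected for this intent
    facts_present: int          # checklist items actually covered
    citations: int
    correct_citations: int


def generation_scorecard(reviews: List[ReviewedAnswer]) -> dict:
    total_claims = sum(r.claims for r in reviews) or 1
    total_required = sum(r.required_facts for r in reviews) or 1
    total_citations = sum(r.citations for r in reviews) or 1
    return {
        "faithfulness_error_rate": sum(r.unsupported_claims for r in reviews) / total_claims,
        "completeness_score": sum(r.facts_present for r in reviews) / total_required,
        "citation_accuracy": sum(r.correct_citations for r in reviews) / total_citations,
    }
```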

Support Outcomes

| Metric | Definition | Insight |
|---|---|---|
| Containment Rate | % of sessions resolved without escalation | Deflection strength |
| Assisted Resolution Time | Resolution time with an AI draft vs. manual handling | Efficiency delta |
| Escalation Rate | Escalated sessions / total sessions | Complexity mix |
| CSAT Delta | Post-resolution CSAT vs. baseline | Experience impact |
| Multi-Turn Depth | Average turns per resolved session | Engagement & complexity |
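
A sketch of the support-outcome rollup, assuming a per-session record with resolution, escalation, turn count, and optional CSAT fields (a hypothetical shape, not a required schema):

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Session:
    resolved: bool
    escalated: bool
    turns: int
    csat: Optional[float] = None  # post-resolution CSAT, if collected


def support_outcomes(sessions: List[Session], baseline_csat: float) -> dict:
    total = len(sessions) or 1
    resolved = [s for s in sessions if s.resolved]
    contained = [s for s in resolved if not s.escalated]
    rated = [s.csat for s in resolved if s.csat is not None]
    return {
        "containment_rate": len(contained) / total,
        "escalation_rate": sum(s.escalated for s in sessions) / total,
        "multi_turn_depth": (sum(s.turns for s in resolved) / len(resolved)) if resolved else 0.0,
        "csat_delta": (sum(rated) / len(rated) - baseline_csat) if rated else None,
    }
```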

Business Impact

| Metric | Definition | Example Use |
|---|---|---|
| Activation Assist | % of new users resolving onboarding blockers | Onboarding success |
| Conversion Influence | Sessions preceding a plan upgrade | Attribution indicator |
| Retention Correlation | Churn difference between assistant-using cohort and baseline | Renewal predictor |
| Support Cost per Resolved | Total support spend / resolved sessions | Efficiency trend |
| Net Savings | Modeled cost reduction (see CS automation article) | ROI justification |
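
Support Cost per Resolved and Net Savings reduce to simple arithmetic once the inputs are agreed; the inputs below (support spend, cost per human-handled ticket, assistant run cost) are assumptions that make the formulas concrete, not a prescribed cost model:

```python
def cost_per_resolved(support_spend: float, resolved_sessions: int) -> float:
    # Total support spend divided by sessions the assistant resolved.
    return support_spend / max(resolved_sessions, 1)


def modeled_net_savings(
    deflected_sessions: int,
    cost_per_human_ticket: float,
    assistant_run_cost: float,
) -> float:
    # Deflected sessions valued at what a human-handled ticket would have cost,
    # minus what it cost to operate the assistant over the same period.
    return deflected_sessions * cost_per_human_ticket - assistant_run_cost
```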

Diagnostic Metrics

| Metric | Definition | Failure Signal |
|---|---|---|
| Fallback Rate | % of responses using the generic fallback template | Retrieval gap |
| Low Citation Count Rate | Answers with <2 citations | Context insufficiency |
| Refusal Rate | % of queries refused | Over-strict guardrail (if high) |
| Guardrail Trigger Types | Distribution (PII, injection, policy) | Policy tuning |
| Prompt Version Drift | Sessions by prompt version | Rollout integrity |
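
These diagnostics can be derived directly from answer_generated-style events (see the instrumentation stack below); the fallback and guardrail_trigger fields used here are assumed additions to that event, not part of the suggested schema:

```python
from collections import Counter
from typing import List


def diagnostics(answers: List[dict]) -> dict:
    total = len(answers) or 1
    guardrail_types = Counter(
        a["guardrail_trigger"] for a in answers if a.get("guardrail_trigger")
    )
    return {
        "fallback_rate": sum(a.get("fallback", False) for a in answers) / total,
        "low_citation_rate": sum(a.get("citation_count", 0) < 2 for a in answers) / total,
        "refusal_rate": sum(a.get("refusal_flag", False) for a in answers) / total,
        "guardrail_trigger_types": dict(guardrail_types),
        "prompt_version_mix": dict(Counter(a.get("prompt_version") for a in answers)),
    }
```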

Instrumentation Stack

Event schema suggestions:

  • query_issued { query_id, session_id, user_tier, locale, tokens }
  • retrieval_completed { query_id, candidates:[{chunk_id, score, source}], latency_ms }
  • answer_generated { query_id, answer_id, model_version, prompt_version, token_count, latency_ms, citation_count, refusal_flag }
  • feedback_submitted { answer_id, rating, reason_codes[] }
  • escalation_created { session_id, reason, time_from_first_query_ms }

All events share a correlation_id so traces can be joined end to end.
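
A minimal sketch of emitting these events with a shared correlation_id; emit() is a stand-in for whatever event bus or analytics client you use, and the payload values are placeholders:

```python
import json
import time
import uuid


def emit(event_name: str, correlation_id: str, payload: dict) -> None:
    record = {
        "event": event_name,
        "correlation_id": correlation_id,
        "ts": time.time(),
        **payload,
    }
    print(json.dumps(record))  # replace with your analytics / event pipeline


correlation_id = str(uuid.uuid4())
emit("query_issued", correlation_id, {
    "query_id": "q_123", "session_id": "s_456", "user_tier": "pro",
    "locale": "en-US", "tokens": 42,
})
emit("retrieval_completed", correlation_id, {
    "query_id": "q_123",
    "candidates": [{"chunk_id": "c_1", "score": 0.83, "source": "docs/billing"}],
    "latency_ms": 120,
})
```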

Benchmarking & Targets

Set phase gates:

  • Launch Gate: Faithfulness Error <10%, Containment >35%.
  • Scale Gate: Faithfulness Error <7%, Containment >50%, P95 Full Latency <2.5s.
  • Optimization Gate: Faithfulness Error <5%, Containment >60%, Precision@5 >70%.

Track variance by segment (locale, tier, intent cluster) to surface hidden regressions.
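
One way to operationalize this is a per-segment breach check; the thresholds mirror the gates above, while the segment_metrics input (precomputed metric values per locale, tier, or intent cluster) is an assumed shape:

```python
GATES = {
    "launch": {"faithfulness_error": ("<", 0.10), "containment": (">", 0.35)},
    "scale": {"faithfulness_error": ("<", 0.07), "containment": (">", 0.50),
              "full_answer_latency_p95_s": ("<", 2.5)},
    "optimization": {"faithfulness_error": ("<", 0.05), "containment": (">", 0.60),
                     "precision_at_5": (">", 0.70)},
}


def gate_breaches(segment_metrics: dict, gate: str) -> dict:
    """Return {segment: [failed metric names]} for the chosen gate."""
    breaches = {}
    for segment, metrics in segment_metrics.items():
        failed = []
        for name, (op, threshold) in GATES[gate].items():
            value = metrics.get(name)
            if value is None:
                continue  # metric not computed for this segment
            ok = value < threshold if op == "<" else value > threshold
            if not ok:
                failed.append(name)
        if failed:
            breaches[segment] = failed
    return breaches
```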

Continuous Improvement Loop

Loop:

  1. Detect anomaly (metric breach or downward trend; see the detection sketch after this list)
  2. Root cause classify: retrieval, content gap, prompt, model, guardrail
  3. Form hypothesis & proposed change
  4. Run controlled experiment / offline benchmark
  5. Deploy behind flag; monitor leading indicators
  6. Promote or rollback; update changelog
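
A minimal sketch of the detection step, assuming a daily time series per metric where higher values are better; the window size and breach floor are arbitrary choices for illustration:

```python
from typing import List, Optional


def detect_anomaly(
    series: List[float],            # daily metric values, oldest first (higher = better)
    floor: Optional[float] = None,  # hard threshold, e.g. a phase-gate value
    window: int = 7,
) -> Optional[str]:
    if not series:
        return None
    current = series[-1]
    # Breach: the latest value has crossed the agreed floor.
    if floor is not None and current < floor:
        return f"breach: {current:.3f} below floor {floor:.3f}"
    # Trend: strictly declining for `window` consecutive days.
    recent = series[-window:]
    if len(recent) == window and all(b < a for a, b in zip(recent, recent[1:])):
        return f"trend: {window} consecutive daily declines"
    return None
```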

Key Takeaways

  • Layer metrics—don’t conflate retrieval and generation.
  • Containment without faithfulness is hollow; faithfulness without containment lacks ROI.
  • Diagnostic events enable targeted remediation over guesswork.
  • Segment analysis reveals regressions masked in aggregate.
  • Treat target gates as quality contracts, not aspirations.