
RB-03: Cache Failure Recovery

| Field | Value |
| --- | --- |
| Runbook ID | RB-03 |
| Scope | Redis cache failure: UMS auth graph and session cache |
| Owner | Platform / On-Call Team |
| Last Review | 2026-05-15 |

1. Failure Modes

| Mode | Symptom | Likely Cause |
| --- | --- | --- |
| Redis pod crash | Auth latency spikes, cache miss 100% | OOM kill, pod eviction |
| Redis network partition | Connection timeouts in app logs | Network policy change, k8s DNS |
| Redis data corruption | Unexpected auth errors, stale graph | RDB restore from bad snapshot |
| Sentinel failover in progress | Intermittent failures, 5-30 s gap | Primary failover election |

2. Diagnosis

# Check Redis pod status
kubectl get pods -n ums-prod -l app=redis

# Check Redis logs
kubectl logs -n ums-prod deploy/redis-master --tail=100

# Test connectivity from app pod
kubectl exec -n ums-prod deploy/ums-auth-api -- redis-cli -h redis-master -p 6379 PING

# Check Redis memory usage
kubectl exec -n ums-prod deploy/redis-master -- redis-cli INFO memory | grep used_memory_human

# Check connected clients
kubectl exec -n ums-prod deploy/redis-master -- redis-cli CLIENT LIST | wc -l

# Check cache hit ratio
kubectl exec -n ums-prod deploy/redis-master -- redis-cli INFO stats \
  | grep -E "keyspace_hits|keyspace_misses"

3. Recovery Steps

3a. Redis Pod Crashed — Restart

# Delete pod to force restart (Deployment will recreate)
kubectl delete pod -n ums-prod -l app=redis-master

# Wait for pod to be ready
kubectl wait pod -n ums-prod -l app=redis-master --for=condition=Ready --timeout=120s

# Verify replication (if Sentinel / cluster)
kubectl exec -n ums-prod deploy/redis-master -- redis-cli INFO replication
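
The replication check can be made pass/fail with a small parser. This is a minimal sketch, not part of the existing tooling: the function name `replication_summary` is hypothetical, and requiring at least one connected replica is an assumption about the topology. `role` and `connected_slaves` are standard fields in `INFO replication` output.

```shell
# replication_summary: read `redis-cli INFO replication` on stdin and print
# "<role> <connected_slaves>". Exits non-zero unless the node reports
# role master with at least one connected replica (assumed requirement).
replication_summary() {
  tr -d '\r' | awk -F: '
    $1 == "role"             { role = $2 }
    $1 == "connected_slaves" { n = $2 }
    END {
      print role, n + 0
      exit !(role == "master" && n >= 1)
    }'
}

# Usage:
#   kubectl exec -n ums-prod deploy/redis-master -- redis-cli INFO replication \
#     | replication_summary || echo "replication not healthy yet"
```

The `tr -d '\r'` step is there because `redis-cli INFO` emits CRLF line endings, which would otherwise end up in the parsed values.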

3b. Application Failover to DB (Cache-Aside Degraded Mode)

The UMS auth graph repository falls back to DB queries when Redis is unavailable (circuit breaker pattern).
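
To illustrate the fallback behaviour described above, here is a simplified sketch of a circuit breaker in front of a cache-aside lookup. It is not the UMS implementation: `cache_get`, `db_get`, `lookup`, and the threshold of 3 are hypothetical stand-ins, and a production breaker would also need a reset timeout (half-open state), which is omitted here.

```shell
# Hypothetical stand-ins for the real lookups.
cache_get() { return 1; }        # simulates Redis being unavailable
db_get()    { echo "db:$1"; }    # simulates a DB query

FAILURES=0     # consecutive cache failures seen so far
THRESHOLD=3    # breaker opens after this many failures (assumed value)

# lookup KEY: use the cache while the breaker is closed; otherwise, or on a
# cache failure, serve the request from the DB.
lookup() {
  if [ "$FAILURES" -lt "$THRESHOLD" ]; then
    if cache_get "$1"; then
      FAILURES=0
      return 0
    fi
    FAILURES=$((FAILURES + 1))
  fi
  db_get "$1"
}
```

Once the breaker is open, `lookup` stops calling `cache_get` at all, so requests no longer wait on Redis connection timeouts before reaching the DB.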

# Enable degraded-mode flag to skip cache reads
kubectl patch configmap ums-feature-flags -n ums-prod \
  --type merge \
  -p '{"data":{"REDIS_CACHE_ENABLED":"false"}}'

# Restart pods to pick up flag
kubectl rollout restart deployment/ums-auth-api -n ums-prod

# Monitor DB connection pool — expect higher utilization
kubectl exec -n ums-prod deploy/ums-db-pod -- psql -c \
  "SELECT count(*) FROM pg_stat_activity WHERE datname='ums_prod';"

Note: DB fallback is safe but increases DB load roughly 3x. Notify the DBA if the Redis outage exceeds 30 minutes.
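
The 30-minute rule is easy to lose track of during an incident, so it may help to script it. A minimal sketch, assuming the on-call engineer records the outage start as a Unix timestamp; the function name `should_notify_dba` is hypothetical.

```shell
# should_notify_dba START_EPOCH [NOW_EPOCH]
# Succeeds (exit 0) once more than 30 minutes (1800 s) have elapsed since
# START_EPOCH. NOW_EPOCH defaults to the current time.
should_notify_dba() {
  start=$1
  now=${2:-$(date +%s)}
  [ $((now - start)) -gt 1800 ]
}

# Usage:
#   OUTAGE_START=$(date +%s)      # record when the outage is declared
#   should_notify_dba "$OUTAGE_START" && echo "Redis outage > 30 min: notify DBA"
```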

3c. Flush Corrupted Cache (partial or full)

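# Flush only auth graph keys (pattern match).
# The whole pipeline runs inside the pod via sh -c; a bare pipe after
# kubectl exec would run "xargs redis-cli DEL" on the operator's machine.
kubectl exec -n ums-prod deploy/redis-master -- sh -c \
  'redis-cli --scan --pattern "auth:graph:*" | xargs -r -n 100 redis-cli DEL'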

# Full flush (last resort: the entire cache restarts cold).
# FLUSHDB clears only the currently selected database (DB 0 by default).
kubectl exec -n ums-prod deploy/redis-master -- redis-cli FLUSHDB ASYNC
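
Before deleting anything, it can be useful to preview how many keys fall under each prefix. The helper below is a sketch only: `count_by_prefix` is a hypothetical name, and it assumes keys use colon-separated segments like the `auth:graph:*` pattern above. The `session:abc` prefix in the example comment is made up for illustration.

```shell
# count_by_prefix: read key names on stdin (one per line) and print
# "<count> <prefix>" for each two-segment prefix, largest group first.
count_by_prefix() {
  awk -F: 'NF { n[$1 ":" $2]++ } END { for (p in n) print n[p], p }' \
    | sort -k1,1nr -k2,2
}

# Usage (read-only; --scan iterates with SCAN and does not block Redis):
#   kubectl exec -n ums-prod deploy/redis-master -- redis-cli --scan \
#     | count_by_prefix
# Illustrative output shape:
#   2 auth:graph
#   1 session:abc
```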

4. Re-enable Cache After Recovery

# Re-enable Redis cache
kubectl patch configmap ums-feature-flags -n ums-prod \
  --type merge \
  -p '{"data":{"REDIS_CACHE_ENABLED":"true"}}'

kubectl rollout restart deployment/ums-auth-api -n ums-prod

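# Verify the hit ratio, hits / (hits + misses), is recovering (should exceed 80% within 10 min).
# Both counters are cumulative since the last Redis restart or CONFIG RESETSTAT.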
watch -n 30 "kubectl exec -n ums-prod deploy/redis-master -- redis-cli INFO stats \
  | grep -E 'keyspace_hits|keyspace_misses'"
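
Because `keyspace_hits` and `keyspace_misses` are cumulative counters, the raw values can hide the recent trend. The sketch below computes the hit ratio for the interval between two samples instead. The function names `counters` and `window_ratio` are hypothetical, and the 60-second sampling interval in the usage comment is an arbitrary choice.

```shell
# counters: read `redis-cli INFO stats` on stdin, print "<hits> <misses>".
counters() {
  tr -d '\r' | awk -F: '
    $1 == "keyspace_hits"   { h = $2 }
    $1 == "keyspace_misses" { m = $2 }
    END { print h + 0, m + 0 }'
}

# window_ratio H1 M1 H2 M2: hit ratio (%) for traffic between two samples.
# Prints "n/a" if there were no lookups in the interval.
window_ratio() {
  awk -v h1="$1" -v m1="$2" -v h2="$3" -v m2="$4" 'BEGIN {
    dh = h2 - h1; dm = m2 - m1
    if (dh + dm <= 0) { print "n/a"; exit }
    printf "%.1f\n", 100 * dh / (dh + dm)
  }'
}

# Usage:
#   a=$(kubectl exec -n ums-prod deploy/redis-master -- redis-cli INFO stats | counters)
#   sleep 60
#   b=$(kubectl exec -n ums-prod deploy/redis-master -- redis-cli INFO stats | counters)
#   window_ratio $a $b   # unquoted on purpose: each sample expands to two arguments
```

A result above 80 over successive intervals indicates the cache has warmed back up to the level this runbook expects.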

5. Warm-up After Full Flush

For critical tenants, pre-warm the auth graph cache:

# Trigger projection warm-up via admin API
curl -X POST https://ums.internal/admin/cache/warm-up \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"tenants": ["tenant-1", "tenant-2"]}'
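
When the tenant list is longer than a couple of entries, building the request body by hand is error-prone. The following is a minimal sketch of a payload builder; `warmup_payload` is a hypothetical helper, and it assumes tenant IDs contain no double quotes or backslashes, since it performs no JSON escaping.

```shell
# warmup_payload TENANT...: print the JSON body for the warm-up request.
# Assumes tenant IDs need no JSON escaping (no quotes or backslashes).
warmup_payload() {
  list=""
  for t in "$@"; do
    list="${list:+$list, }\"$t\""
  done
  printf '{"tenants": [%s]}\n' "$list"
}

# Usage:
#   curl -X POST https://ums.internal/admin/cache/warm-up \
#     -H "Authorization: Bearer $ADMIN_TOKEN" \
#     -H "Content-Type: application/json" \
#     -d "$(warmup_payload tenant-1 tenant-2)"
```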

6. Verification Checklist

