| Field | Value |
|---|---|
| Runbook ID | RB-03 |
| Scope | Redis cache failure — UMS auth graph and session cache |
| Owner | Platform / On-Call Team |
| Last Review | 2026-05-15 |

| Mode | Symptom | Likely Cause |
|---|---|---|
| Redis pod crash | Auth latency spikes, cache miss 100% | OOM kill, pod eviction |
| Redis network partition | Connection timeouts in app logs | Network policy change, k8s DNS |
| Redis data corruption | Unexpected auth errors, stale graph | RDB restore from bad snapshot |
| Sentinel failover in progress | Intermittent failures, 5-30 s gap | Primary failover election |
# Check Redis pod status
kubectl get pods -n ums-prod -l app=redis
# Check Redis logs
kubectl logs -n ums-prod deploy/redis-master --tail=100
# Test connectivity from app pod
kubectl exec -n ums-prod deploy/ums-auth-api -- redis-cli -h redis-master -p 6379 PING
# Check Redis memory usage
kubectl exec -n ums-prod deploy/redis-master -- redis-cli INFO memory | grep used_memory_human
# Check connected clients
kubectl exec -n ums-prod deploy/redis-master -- redis-cli CLIENT LIST | wc -l
# Check cache hit ratio
kubectl exec -n ums-prod deploy/redis-master -- redis-cli INFO stats \
| grep -E "keyspace_hits|keyspace_misses"
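
The raw hit and miss counters are easier to judge as a single percentage. The following is a minimal sketch, assuming the `INFO stats` text is piped in on stdin; `hit_ratio` is a hypothetical helper name, not a redis-cli feature.

```bash
# hit_ratio (hypothetical helper): reads `redis-cli INFO stats` text on stdin
# and prints the cache hit ratio as a percentage, or "n/a" if there is no traffic yet.
hit_ratio() {
  # INFO output uses CRLF line endings, so strip the carriage returns first.
  tr -d '\r' | awk -F: '
    $1 == "keyspace_hits"   { hits = $2 }
    $1 == "keyspace_misses" { misses = $2 }
    END {
      total = hits + misses
      if (total == 0) { print "n/a"; exit }
      printf "%.1f\n", 100 * hits / total
    }'
}
```

Possible usage: `kubectl exec -n ums-prod deploy/redis-master -- redis-cli INFO stats | hit_ratio`. Both counters are cumulative since the last Redis restart or `CONFIG RESETSTAT`, so after a restart the ratio covers only the period since the pod came back.
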
# Delete pod to force restart (Deployment will recreate)
kubectl delete pod -n ums-prod -l app=redis-master
# Wait for pod to be ready
kubectl wait pod -n ums-prod -l app=redis-master --for=condition=Ready --timeout=120s
# Verify replication (if Sentinel / cluster)
kubectl exec -n ums-prod deploy/redis-master -- redis-cli INFO replication
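
The `INFO replication` output is long, and only a few fields matter for a quick health check. The sketch below condenses it to one line, assuming the text is piped in on stdin; `replication_summary` is a hypothetical helper name, and the fields it reads (`role`, `connected_slaves`, `master_link_status`) are the standard ones Redis reports.

```bash
# replication_summary (hypothetical helper): reads `redis-cli INFO replication`
# text on stdin and prints a one-line summary of the node's replication state.
replication_summary() {
  # INFO output uses CRLF line endings, so strip the carriage returns first.
  tr -d '\r' | awk -F: '
    $1 == "role"               { role = $2 }
    $1 == "connected_slaves"   { replicas = $2 }
    $1 == "master_link_status" { link = $2 }
    END {
      if (role == "master")     printf "role=master replicas=%d\n", replicas
      else if (role == "slave") printf "role=slave link=%s\n", link
      else                      print "role=unknown"
    }'
}
```

Possible usage: `kubectl exec -n ums-prod deploy/redis-master -- redis-cli INFO replication | replication_summary`. A primary reporting fewer replicas than expected, or a replica whose link is not `up`, suggests replication has not yet recovered.
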
The UMS auth graph repository falls back to DB queries when Redis is unavailable (circuit breaker pattern).
# Enable degraded-mode flag to skip cache reads
kubectl patch configmap ums-feature-flags -n ums-prod \
--type merge \
-p '{"data":{"REDIS_CACHE_ENABLED":"false"}}'
# Restart pods to pick up flag
kubectl rollout restart deployment/ums-auth-api -n ums-prod
# Monitor DB connection pool — expect higher utilization
kubectl exec -n ums-prod deploy/ums-db-pod -- psql -c \
"SELECT count(*) FROM pg_stat_activity WHERE datname='ums_prod';"
Note: DB fallback is safe but increases DB load ~3x. Notify DBA if Redis outage exceeds 30 minutes.
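
To make the 30-minute escalation rule harder to miss during an incident, the elapsed time can be checked mechanically. This is only a sketch: `should_notify_dba` is a hypothetical helper, and it assumes the on-call engineer recorded the outage start as Unix epoch seconds.

```bash
# should_notify_dba (hypothetical helper).
# Args: $1 = outage start, $2 = current time, both in Unix epoch seconds.
# Prints "notify-dba" once the outage has lasted more than 30 minutes, else "ok".
should_notify_dba() {
  start=$1
  now=$2
  elapsed_sec=$(( now - start ))
  elapsed_min=$(( elapsed_sec / 60 ))
  # 30 minutes = 1800 seconds; compare in seconds so 30m59s still counts as exceeded.
  if [ "$elapsed_sec" -gt 1800 ]; then
    echo "notify-dba elapsed=${elapsed_min}m"
  else
    echo "ok elapsed=${elapsed_min}m"
  fi
}
```

Possible usage: record `OUTAGE_START=$(date +%s)` when degraded mode is enabled, then run `should_notify_dba "$OUTAGE_START" "$(date +%s)"` alongside the connection-count query.
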
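# Flush only auth graph keys (pattern match).
# The pipeline is wrapped in sh -c so that xargs and DEL run inside the Redis pod,
# not on the operator's workstation (assumes the image ships sh and xargs).
kubectl exec -n ums-prod deploy/redis-master -- sh -c \
'redis-cli --scan --pattern "auth:graph:*" | xargs -r -n 500 redis-cli DEL'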
# Full flush (last resort — all cache cold start)
kubectl exec -n ums-prod deploy/redis-master -- redis-cli FLUSHDB ASYNC
# Re-enable Redis cache
kubectl patch configmap ums-feature-flags -n ums-prod \
--type merge \
-p '{"data":{"REDIS_CACHE_ENABLED":"true"}}'
kubectl rollout restart deployment/ums-auth-api -n ums-prod
# Verify hit ratio recovering (should reach > 80% within 10 min)
watch -n 30 "kubectl exec -n ums-prod deploy/redis-master -- redis-cli INFO stats \
| grep -E 'keyspace_hits|keyspace_misses'"
For critical tenants, pre-warm the auth graph cache:
# Trigger projection warm-up via admin API
curl -X POST https://ums.internal/admin/cache/warm-up \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"tenants": ["tenant-1", "tenant-2"]}'
- PING returns PONG from app pod
- REDIS_CACHE_ENABLED flag set to true
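
The two verification checks above could be scripted so the result is unambiguous. The sketch below is one way to do that; `verify_recovery` is a hypothetical helper that takes the PING reply and the flag value as arguments, so it makes no assumptions about how they were collected.

```bash
# verify_recovery (hypothetical helper).
# Args: $1 = reply from redis-cli PING, $2 = value of REDIS_CACHE_ENABLED.
# Prints PASS/FAIL per check and returns non-zero if any check fails.
verify_recovery() {
  ping_reply=$1
  flag_value=$2
  rc=0
  if [ "$ping_reply" = "PONG" ]; then
    echo "PASS ping"
  else
    echo "FAIL ping (got '$ping_reply')"
    rc=1
  fi
  if [ "$flag_value" = "true" ]; then
    echo "PASS flag"
  else
    echo "FAIL flag (got '$flag_value')"
    rc=1
  fi
  return "$rc"
}
```

Possible usage: pass in the output of the connectivity test from earlier in this runbook and the current flag value, for example `verify_recovery "$(kubectl exec -n ums-prod deploy/ums-auth-api -- redis-cli -h redis-master -p 6379 PING)" "$(kubectl get configmap ums-feature-flags -n ums-prod -o jsonpath='{.data.REDIS_CACHE_ENABLED}')"`.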