# Jan Server Monitoring Runbook
## Quick Reference

| Alert | Severity | MTTR Target | On-Call Action |
|---|---|---|---|
| HighLLMLatency | Warning | 15min | §1 |
| QueueBacklog | Critical | 5min | §2 |
| CollectorDown | Critical | 2min | §3 |
| StorageFailure | Critical | 10min | §4 |
| TraceExportFailure | Warning | 30min | §5 |
| ClassifierErrors | Warning | 20min | §6 |

## 1. High LLM Latency

- **Alert:** `HighLLMLatency`
- **Triggered when:** P95 LLM API latency >2s for 5min
- **Impact:** Degraded user experience, potential timeouts, increased abandonment rate
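
The trigger condition maps to a PromQL query along these lines. The histogram name `llm_api_request_duration_seconds_bucket` is an assumption, so confirm it against the service's `/metrics` output before relying on it:

```bash
# P95 latency over 5m windows; the alert fires when this exceeds 2 (seconds).
# NOTE: the metric name below is a placeholder, not confirmed from the codebase.
curl -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, sum(rate(llm_api_request_duration_seconds_bucket[5m])) by (le))'
```
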
### Investigation Steps

1. **Check LLM Provider Dashboard**

   ```bash
   # Open Grafana
   open https://grafana/d/llm-overview
   ```

   - Review latency by model (GPT-4 vs GPT-3.5)
   - Check error rates per provider

2. **Verify Upstream Provider Status**

   - OpenAI: https://status.openai.com
   - Anthropic: https://status.anthropic.com
   - Azure: https://status.azure.com

3. **Check Recent Deployments**

   ```bash
   git log --since="1 hour ago" --oneline
   kubectl rollout history deployment/llm-api
   ```

4. **Inspect Token Queue Depth**

   ```bash
   curl localhost:8080/metrics | grep queue_depth
   ```

5. **Review Jaeger Traces**

   - Find slow traces: http://jaeger:16686/search?service=llm-api&minDuration=2s
   - Look for database queries, external API calls taking >1s

### Remediation

**If Provider Issue:**

```bash
# Enable fallback provider
jan-cli config set llm.fallback_enabled=true
jan-cli config set llm.fallback_provider=anthropic
```

**If Jan Server Issue:**

```bash
# Scale replicas
kubectl scale deployment/llm-api --replicas=5

# If memory exhaustion
kubectl top pod -l app=llm-api
kubectl set resources deployment/llm-api --limits=memory=2Gi
```

**If Database Bottleneck:**

```bash
# Check connection pool
psql jan_server -c "SELECT COUNT(*), state FROM pg_stat_activity GROUP BY state;"

# Check slow queries
psql jan_server -c "SELECT query, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"
```

### Escalation

- After 30min: Page SRE team lead via PagerDuty
- After 1h: Engage vendor support (OpenAI/Anthropic)
- If P0: Notify customer success team for user communication

## 2. Queue Backlog

- **Alert:** `ResponseAPIQueueBacklog`
- **Triggered when:** Response API queue depth >100 for 10min
- **Impact:** Processing delays, webhook failures, incomplete conversations
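
A quick way to compare the queue against the >100 threshold is a direct PromQL query. The gauge name `response_api_queue_depth` is an assumption; grep the service's `/metrics` output, as in the Investigation section, to confirm it:

```bash
# Current queue depth; the alert fires when this stays above 100 for 10min.
# NOTE: the metric name is a placeholder; verify via the /metrics endpoint.
curl -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=response_api_queue_depth > 100'
```
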
### Root Causes

- Background worker pool exhausted
- Template API latency spike
- Media API unavailable
- Database connection pool exhausted

### Investigation

```bash
# Check worker status
curl http://response-api:8081/metrics | grep workers_active
curl http://response-api:8081/metrics | grep workers_idle

# View queue contents
psql jan_server -c "SELECT COUNT(*), status, error_message FROM background_jobs GROUP BY status, error_message ORDER BY COUNT(*) DESC;"

# Check dependent services
make health-check

# View recent job failures
psql jan_server -c "SELECT id, status, error_message, created_at FROM background_jobs WHERE status='failed' ORDER BY created_at DESC LIMIT 20;"
```

### Remediation

1. **Increase Worker Pool**

   ```bash
   kubectl set env deployment/response-api WORKER_POOL_SIZE=20
   kubectl rollout status deployment/response-api
   ```

2. **Purge Old Jobs**

   ```bash
   jan-cli jobs purge --older-than=1h --status=failed
   jan-cli jobs retry --status=failed --max-retries=3
   ```

3. **Restart Service (Last Resort)**

   ```bash
   kubectl rollout restart deployment/response-api
   kubectl rollout status deployment/response-api
   ```

## 3. Collector Outage

- **Alert:** `OTELCollectorDown`
- **Triggered when:** Collector unreachable for 2min
- **Impact:** Loss of observability (no new traces/metrics), blind operations
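
Prometheus's built-in `up` metric gives the fastest confirmation that the scrape target is gone. The `job` label value is an assumption; match it to the actual scrape config:

```bash
# 1 = target reachable, 0 = target down.
# NOTE: the job label value is assumed; check the Prometheus scrape config.
curl -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=up{job="otel-collector"}'
```
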

### Symptoms

- Grafana dashboards flatline
- Jaeger UI shows no recent traces
- Services log OTLP export errors

### Investigation

```bash
# Check collector health
curl http://otel-collector:13133/

# View collector logs
kubectl logs -l app=otel-collector --tail=100

# Check resource usage
kubectl top pod -l app=otel-collector

# Verify connectivity from services
kubectl run -it --rm debug --image=curlimages/curl --restart=Never \
  -- curl -v http://otel-collector:4318/v1/traces
```

### Remediation

1. **Restart Collector**

   ```bash
   kubectl rollout restart deployment/otel-collector
   kubectl rollout status deployment/otel-collector
   ```

2. **If Resource Exhaustion**

   ```bash
   # Increase memory
   kubectl set resources deployment/otel-collector --limits=memory=1Gi

   # Check Jaeger backend
   curl http://jaeger-query:16686/api/services
   ```

3. **If Configuration Error**

   ```bash
   # Validate config
   kubectl get configmap otel-collector-config -o yaml | yq '.data'

   # Revert to last known good config
   kubectl rollout undo deployment/otel-collector
   ```

### Fallback Mode

Services continue operating without telemetry until collector is restored. No user impact.
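
To confirm the outage is telemetry-only, spot-check the service health endpoints. The service/port pairs below follow the conventions used elsewhere in this runbook and may need adjusting:

```bash
# All of these should return 200 even with the collector down.
for svc in llm-api:8080 response-api:8081 media-api:8080; do
  curl -sf "http://${svc}/health" >/dev/null \
    && echo "${svc} OK" \
    || echo "${svc} FAILING (more than a telemetry problem)"
done
```
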

## 4. Media API Storage Failure

- **Alert:** `MediaAPIStorageFailure`
- **Triggered when:** S3 error rate >10% for 2min
- **Impact:** Upload/download failures, broken media references
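
The alert condition roughly corresponds to this PromQL ratio. Both metric names are assumptions; confirm them with the `grep s3_errors` command in the Investigation section:

```bash
# Fraction of S3 calls failing over 2m; the alert fires above 0.1 (10%).
# NOTE: both metric names are placeholders, not confirmed from the service.
curl -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(s3_errors_total[2m]) / rate(s3_requests_total[2m])'
```
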

### Investigation

```bash
# Check S3 metrics
curl http://media-api:8080/metrics | grep s3_errors

# View recent errors
kubectl logs -l app=media-api --tail=50 | grep -i s3

# Check AWS status
open https://health.aws.amazon.com/health/status

# Verify credentials
kubectl get secret media-api-s3-credentials -o yaml
```

### Remediation

1. **Verify S3 bucket exists and is accessible**

   ```bash
   aws s3 ls s3://jan-media-bucket/
   ```

2. **Check IAM permissions**

   ```bash
   aws iam simulate-principal-policy \
     --policy-source-arn arn:aws:iam::ACCOUNT:role/media-api-role \
     --action-names s3:PutObject s3:GetObject
   ```

3. **Enable fallback storage**

   ```bash
   kubectl set env deployment/media-api STORAGE_FALLBACK_ENABLED=true
   ```

## 5. Trace Export Failure

- **Alert:** `TraceExportFailure`
- **Triggered when:** Jaeger export failing >10 spans/sec for 5min
- **Impact:** Partial trace loss, incomplete observability
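
The failure rate can be read straight from Prometheus. `otelcol_exporter_send_failed_spans` matches the `exporter_send_failed` grep in the Investigation section, but the exact name can vary between collector versions:

```bash
# Spans/sec failing to export; the alert fires when this exceeds 10 for 5min.
curl -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(otelcol_exporter_send_failed_spans[5m])'
```
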

### Investigation

```bash
# Check collector export metrics
curl http://otel-collector:8889/metrics | grep exporter_send_failed

# Check Jaeger ingestion
curl http://jaeger-collector:14269/metrics | grep spans_received

# View collector logs
kubectl logs -l app=otel-collector | grep -i error
```

### Remediation

1. **Verify Jaeger collector is running**

   ```bash
   kubectl get pods -l app=jaeger
   kubectl logs -l app=jaeger --tail=50
   ```

2. **Check network connectivity**

   ```bash
   kubectl run -it --rm debug --image=curlimages/curl --restart=Never \
     -- curl -v http://jaeger-collector:14268/api/traces
   ```

3. **Increase collector retry settings** (see the config sketch after this list)

   - Edit `monitoring/otel-collector.yaml`
   - Increase `max_elapsed_time` from 5m to 10m
   - Increase `queue_size` from 5000 to 10000
   - Apply config: `kubectl apply -f monitoring/otel-collector.yaml`

4. **Temporary: Reduce sampling rate**

   ```bash
   kubectl set env deployment/llm-api OTEL_TRACES_SAMPLER_ARG=0.1
   kubectl set env deployment/response-api OTEL_TRACES_SAMPLER_ARG=0.1
   ```
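
For step 3, the relevant exporter section of `monitoring/otel-collector.yaml` would look roughly like the sketch below. Only the two increased values come from this runbook; the exporter name, endpoint, and surrounding structure are assumptions to adapt to the actual file:

```yaml
exporters:
  jaeger:                      # exporter name and endpoint are assumptions
    endpoint: jaeger-collector:14250
    retry_on_failure:
      enabled: true
      max_elapsed_time: 10m    # increased from 5m
    sending_queue:
      enabled: true
      queue_size: 10000        # increased from 5000
```
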

## 6. Conversation Classifier Errors

- **Alert:** `ConversationInsightFailure`
- **Triggered when:** Classifier error rate >5% for 5min
- **Impact:** Missing conversation metadata, incomplete analytics
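
Expressed as a PromQL ratio, the 5% threshold looks roughly like the query below. Both metric names are placeholders; the Investigation grep confirms only the `classifier_errors` prefix:

```bash
# Fraction of classification attempts failing over 5m; alert fires above 0.05.
# NOTE: metric names are placeholders; verify against the /metrics output.
curl -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(classifier_errors_total[5m]) / rate(classifier_requests_total[5m])'
```
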

### Investigation

```bash
# View classifier metrics
curl http://response-api:8081/metrics | grep classifier_errors

# Review error logs
kubectl logs -l app=response-api | grep classifier
```

### Remediation

1. **Check for malformed prompt data**

   ```bash
   # Review recent requests
   kubectl logs -l app=response-api --tail=100 | grep -A5 "classifier error"
   ```

2. **Review recent classifier configuration changes**

   ```bash
   git log --since="1 day ago" --grep="classifier" --oneline
   kubectl describe configmap response-api-config
   ```

3. **Disable classifier temporarily (if persistent)**

   ```bash
   kubectl set env deployment/response-api CLASSIFIER_ENABLED=false
   ```

## Appendix A: Common Commands

### Health Checks

```bash
# All services
make health-check

# Individual service
curl http://SERVICE:PORT/health

# Monitoring stack
make monitor-test
```

### Viewing Logs

```bash
# Recent logs
kubectl logs -l app=SERVICE --tail=100

# Follow logs
kubectl logs -l app=SERVICE -f

# Logs with timestamp
kubectl logs -l app=SERVICE --timestamps=true
```

### Metrics Queries

```bash
# Service metrics
curl http://SERVICE:8080/metrics

# Prometheus query
curl 'http://localhost:9090/api/v1/query?query=METRIC_NAME'

# Alert status
curl http://localhost:9090/api/v1/rules
```

### Trace Queries

```bash
# Recent traces for service
curl 'http://localhost:16686/api/traces?service=SERVICE&limit=10'

# Specific trace
curl 'http://localhost:16686/api/traces/TRACE_ID'

# Slow traces
curl 'http://localhost:16686/api/traces?service=SERVICE&minDuration=2s'
```

## Appendix B: Escalation Contacts

| Severity | Contact | Response Time | Channel |
|---|---|---|---|
| P0 (Critical) | SRE On-Call | <5min | PagerDuty |
| P1 (High) | Team Lead | <15min | Slack #incidents |
| P2 (Medium) | Dev Team | <1h | Slack #engineering |
| P3 (Low) | Ticket Queue | Next business day | Jira |

## Appendix C: Useful Links

- Grafana: http://localhost:3000
- Jaeger: http://localhost:16686
- Prometheus: http://localhost:9090
- Monitoring Guide: docs/guides/monitoring.md
- Architecture Overview: docs/architecture/services.md
- Security Policy: docs/architecture/security.md