# Monitoring & Troubleshooting Deep Dive
Status: v0.0.14 | Last Updated: December 23, 2025
Production monitoring is critical for maintaining Jan Server reliability. This guide covers health checks, metrics collection, distributed tracing, troubleshooting common issues, and performance optimization.
## Table of Contents
- Service Health Monitoring
- Metrics & Observability
- Distributed Tracing
- Common Issues & Solutions
- Performance Optimization
- Logging Strategies
- Capacity Planning
- Incident Response
## Service Health Monitoring
### Health Check Endpoints
All services expose health check endpoints:

```bash
# LLM API health
curl http://localhost:8080/health
# Response API health
curl http://localhost:8082/health
# Media API health
curl http://localhost:8285/health
# MCP Tools health
curl http://localhost:8091/health
# Template API health
curl http://localhost:8185/health
```

Response Format:

```json
{
"status": "ok",
"version": "v0.0.14",
"timestamp": "2025-12-23T12:00:00Z",
"uptime_seconds": 3600,
"database": {
"connected": true,
"latency_ms": 2
},
"dependencies": {
"redis": {
"connected": true,
"latency_ms": 1
},
"message_queue": {
"connected": true,
"queue_depth": 5
}
}
}
```

### Kubernetes Probe Configuration
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: jan-server-llm-api
spec:
  containers:
    - name: llm-api
      image: jan-server:v0.0.14
      ports:
        - containerPort: 8080
      # Readiness probe - Accept traffic?
      readinessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 10
        timeoutSeconds: 3
        failureThreshold: 3
      # Liveness probe - Container alive?
      livenessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 30
        periodSeconds: 30
        timeoutSeconds: 5
        failureThreshold: 3
      # Startup probe - App started?
      startupProbe:
        httpGet:
          path: /health
          port: 8080
        failureThreshold: 30
        periodSeconds: 10
```

### Custom Health Checks
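Beyond the generic `/health` endpoint, individual services may need deeper checks (database reachability, queue depth, model backend availability). Below is a minimal sketch of a custom health handler, assuming a Go service holding a `*sql.DB`; the handler name and response shape are illustrative, not the actual Jan Server implementation.

```go
// healthcheck.go - illustrative deep health check handler (names and shape are assumptions).
package health

import (
	"context"
	"database/sql"
	"encoding/json"
	"net/http"
	"time"
)

type status struct {
	Status   string `json:"status"`
	Database struct {
		Connected bool  `json:"connected"`
		LatencyMS int64 `json:"latency_ms"`
	} `json:"database"`
}

// NewHandler pings the database with a short timeout and reports the result
// in the same shape as the /health response shown above.
func NewHandler(db *sql.DB) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()

		var s status
		s.Status = "ok"

		start := time.Now()
		if err := db.PingContext(ctx); err != nil {
			s.Status = "degraded"
		} else {
			s.Database.Connected = true
		}
		s.Database.LatencyMS = time.Since(start).Milliseconds()

		code := http.StatusOK
		if s.Status != "ok" {
			code = http.StatusServiceUnavailable
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(code)
		_ = json.NewEncoder(w).Encode(s)
	})
}
```

Expose deep checks on a separate path (for example `/health/deep`) so the liveness probe stays cheap and a slow dependency cannot get otherwise healthy pods restarted.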
## Metrics & Observability
### Key Metrics by Service
LLM API Metrics:

```
llm_api_requests_total # Total API requests
llm_api_request_duration_seconds # Request latency histogram
llm_api_conversations_total # Total conversations created
llm_api_messages_total # Total messages sent
llm_api_tokens_processed_total # Tokens processed
llm_api_errors_total # Error count by type
llm_api_cache_hits_total # Cache hit rate
llm_api_active_conversations      # Concurrent conversations
```

Response API Metrics:

```
response_api_generations_total # Response generations
response_api_generation_duration_seconds # Generation latency
response_api_tokens_generated_total # Tokens in responses
response_api_errors_total         # Generation errors
```

Media API Metrics:

```
media_api_uploads_total # File uploads
media_api_upload_size_bytes # Upload size distribution
media_api_upload_duration_seconds # Upload latency
media_api_storage_bytes_used # Total storage used
media_api_errors_total            # Upload errors
```

### Prometheus Configuration
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - 'alert_rules.yml'

scrape_configs:
  - job_name: 'jan-llm-api'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['localhost:8080']

  - job_name: 'jan-response-api'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['localhost:8082']

  - job_name: 'jan-media-api'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['localhost:8285']

  - job_name: 'jan-mcp-tools'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['localhost:8091']

  - job_name: 'postgres'
    static_configs:
      - targets: ['localhost:9187']

  - job_name: 'redis'
    static_configs:
      - targets: ['localhost:9121']
```

### Alert Rules
```yaml
# alert_rules.yml
groups:
  - name: jan_server_alerts
    interval: 30s
    rules:
      # Service down
      - alert: ServiceDown
        expr: up{job=~"jan-.*"} == 0
        for: 2m
        annotations:
          summary: "{{ $labels.job }} is down"

      # High error rate
      - alert: HighErrorRate
        expr: |
          rate(llm_api_errors_total[5m]) > 0.05
        for: 5m
        annotations:
          summary: "High error rate in LLM API"
          description: "Error rate is {{ $value }}"

      # High latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99, rate(llm_api_request_duration_seconds_bucket[5m])) > 5
        for: 5m
        annotations:
          summary: "High request latency detected"
          description: "P99 latency is {{ $value }}s"

      # Database connection pool exhausted
      - alert: DatabasePoolExhausted
        expr: |
          pg_stat_activity_max_connections_remaining < 5
        for: 1m
        annotations:
          summary: "Database connection pool nearly full"

      # Disk space low
      - alert: DiskSpaceLow
        expr: |
          node_filesystem_avail_bytes{mountpoint="/"} /
            node_filesystem_size_bytes{mountpoint="/"} < 0.1
        for: 5m
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"

      # Queue backlog
      - alert: QueueBacklog
        expr: |
          pg_partman_queue_depth > 1000
        for: 5m
        annotations:
          summary: "Message queue backlog detected"
```

### Grafana Dashboards
Key dashboard panels to create:

```
Dashboard: Jan Server Overview
├── Requests/Second
├── Error Rate (%)
├── P50/P95/P99 Latency
├── Active Conversations
├── Database Connections
├── Cache Hit Rate
├── Message Queue Depth
├── Storage Usage
└── Cost Per 1M Tokens
```

## Distributed Tracing
### OpenTelemetry Setup
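Jan Server services can export traces to any OTLP-compatible collector. A minimal Go sketch of the wiring, assuming the OpenTelemetry Go SDK and a collector listening on `localhost:4317`; the service name, version, and endpoint are placeholders:

```go
package tracing

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

// Init configures a tracer provider that batches spans to an OTLP/gRPC collector.
// Call the returned provider's Shutdown on service exit to flush pending spans.
func Init(ctx context.Context) (*sdktrace.TracerProvider, error) {
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("localhost:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceName("jan-llm-api"),
			semconv.ServiceVersion("v0.0.14"),
		)),
	)
	otel.SetTracerProvider(tp)
	return tp, nil
}
```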
### Creating Custom Spans
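With the provider installed, wrap interesting units of work in spans and attach attributes that help during troubleshooting. A sketch only; the `saveMessage` call and attribute names are hypothetical:

```go
package llmapi

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
)

var tracer = otel.Tracer("jan-server/llm-api")

// CreateMessage records a custom span around message creation, tagging it with
// the conversation ID and recording any error on the span.
func CreateMessage(ctx context.Context, conversationID, body string) error {
	ctx, span := tracer.Start(ctx, "llm_api.create_message")
	defer span.End()

	span.SetAttributes(
		attribute.String("conversation.id", conversationID),
		attribute.Int("message.length", len(body)),
	)

	if err := saveMessage(ctx, conversationID, body); err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		return err
	}
	return nil
}

// saveMessage is a placeholder for the real persistence layer.
func saveMessage(ctx context.Context, conversationID, body string) error { return nil }
```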
### Trace Context Propagation
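For a single trace to span the LLM API, Response API, and MCP Tools, every service must use the same propagator and inject/extract it on HTTP calls. A sketch using the W3C `traceparent` headers (function names and wiring are illustrative):

```go
package tracing

import (
	"context"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

func init() {
	// W3C trace context plus baggage; all services must agree on this.
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{},
		propagation.Baggage{},
	))
}

// CallDownstream injects the current span context into an outbound request so
// the downstream service continues the same trace.
func CallDownstream(ctx context.Context, url string) (*http.Response, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
	return http.DefaultClient.Do(req)
}

// On the receiving side, extract the context before starting server spans:
//   ctx := otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))
```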
## Common Issues & Solutions
### Issue 1: Database Connection Pool Exhausted
Symptoms:

```
Error: connection pool exhausted
Active connections: 50/50
Queue depth: 100+
```

Root Causes:
- Queries taking too long (connections held)
- Connection leak (not closing properly)
- Sudden traffic spike
- N+1 query problem
Diagnosis:

```sql
-- Check active connections
SELECT datname, usename, count(*)
FROM pg_stat_activity
GROUP BY datname, usename;
-- Check long-running queries
SELECT query, query_start, now() - query_start as duration
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY duration DESC;
-- Check idle-in-transaction sessions (connections held open while doing no work)
SELECT query, query_start, now() - query_start as duration
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY duration DESC;
```

Solutions:
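The most common fix is eliminating leaked connections: bound every query with a context and always close result sets. A Go sketch of the pattern (table and column names are illustrative); pool sizing itself is covered under Connection Pooling below.

```go
package store

import (
	"context"
	"database/sql"
	"time"
)

// Leaked result sets hold pool connections until garbage collection.
// Bound query time with a context and always close rows.
func listConversationIDs(ctx context.Context, db *sql.DB, userID string) ([]string, error) {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()

	rows, err := db.QueryContext(ctx,
		`SELECT id FROM conversations WHERE user_id = $1`, userID)
	if err != nil {
		return nil, err
	}
	defer rows.Close() // returns the connection to the pool even on early return

	var ids []string
	for rows.Next() {
		var id string
		if err := rows.Scan(&id); err != nil {
			return nil, err
		}
		ids = append(ids, id)
	}
	return ids, rows.Err()
}
```

Other fixes follow from the root-cause list above: collapse N+1 query patterns into a single query, and size the pool explicitly so a traffic spike queues requests instead of exhausting Postgres connections.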
### Issue 2: Out of Memory
Symptoms:

```
RSS Memory: 2.5GB (out of 3GB limit)
Swap usage increasing
Process killed: OOMKiller
```

Root Causes:
- Memory leak in application
- Large result set loading entirely in memory
- Cache growing unbounded
- Goroutine leak (Go services)
Diagnosis:
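For the Go services, the built-in profiler shows where memory (and goroutines) are going. A sketch that exposes `net/http/pprof` on a private port; the port and wiring are illustrative:

```go
package debug

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

// Start exposes the Go profiler on localhost only. Inspect with:
//
//	go tool pprof http://localhost:6060/debug/pprof/heap
//	go tool pprof http://localhost:6060/debug/pprof/goroutine
func Start() {
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
}
```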
Solutions:
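If the root cause is an unbounded in-process cache, replace it with a size-bounded one so memory stops tracking key cardinality. A sketch with the hashicorp/golang-lru library, assuming it fits the use case; the capacity and key format are illustrative:

```go
package cache

import lru "github.com/hashicorp/golang-lru/v2"

// NewBounded returns an LRU cache capped at maxEntries; the oldest entries are
// evicted instead of letting the cache grow without limit.
func NewBounded(maxEntries int) (*lru.Cache[string, []byte], error) {
	return lru.New[string, []byte](maxEntries)
}

func example() error {
	c, err := NewBounded(10_000)
	if err != nil {
		return err
	}
	c.Add("conversation:123", []byte(`{"title":"..."}`))
	if v, ok := c.Get("conversation:123"); ok {
		_ = v // cache hit
	}
	return nil
}
```

For large result sets, stream rows or paginate instead of materializing everything in memory; goroutine leaks show up directly in the goroutine profile from the pprof endpoint above.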
### Issue 3: Message Queue Backlog
Symptoms:

```
Queue depth: 50000+ messages
Processing lag: 30+ minutes
Consumer lag not catching up
```

Root Causes:
- Consumer slower than producer
- Poison pill messages blocking queue
- Consumer crash/hang
- Message processing timeout
Solutions:
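Typical fixes are raising consumer parallelism and keeping poison messages from blocking everyone else. A generic Go sketch of a bounded worker pool with a dead-letter hand-off; the `Message` type and retry limit are illustrative, not Jan Server's actual queue client:

```go
package queue

import (
	"context"
	"sync"
)

type Message struct {
	ID       string
	Attempts int
	Body     []byte
}

// Consume processes messages with a fixed number of workers. Messages that keep
// failing are handed to deadLetter so a single poison message cannot stall the queue.
func Consume(ctx context.Context, msgs <-chan Message, workers int,
	handle func(context.Context, Message) error, deadLetter func(Message)) {

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				select {
				case <-ctx.Done():
					return
				case msg, ok := <-msgs:
					if !ok {
						return
					}
					if err := handle(ctx, msg); err != nil && msg.Attempts >= 3 {
						deadLetter(msg) // park it for manual inspection
					}
					// Below the retry limit, rely on the queue's redelivery with backoff.
				}
			}
		}()
	}
	wg.Wait()
}
```

If the consumer has crashed or hung, the goroutine profile from the pprof endpoint in Issue 2 usually shows where it is stuck.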
### Issue 4: High API Latency
Symptoms:

```
P99 latency: 10+ seconds
Some endpoints slow, others normal
Error rate increases under load
```

Root Causes:
- Slow database queries
- Cache miss storm (thundering herd)
- External API calls
- Upstream service degradation
Diagnosis:
Solutions:
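When the cause is a cache-miss storm, collapsing concurrent misses for the same key into one upstream call removes the thundering herd. A sketch using `golang.org/x/sync/singleflight`; the loader function and key format are placeholders:

```go
package cache

import (
	"context"

	"golang.org/x/sync/singleflight"
)

var group singleflight.Group

// GetConversation ensures that, on a cache miss, only one goroutine hits the
// database per key; concurrent callers wait for and share that result.
func GetConversation(ctx context.Context, id string,
	loadFromDB func(context.Context, string) ([]byte, error)) ([]byte, error) {

	v, err, _ := group.Do("conversation:"+id, func() (interface{}, error) {
		return loadFromDB(ctx, id)
	})
	if err != nil {
		return nil, err
	}
	return v.([]byte), nil
}
```

Slow queries, external API calls, and upstream degradation are best isolated with the tracing setup above: the span waterfall shows which hop contributes the latency.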
### Issue 5: Authentication Failures
Symptoms:

```
Keycloak connection errors
401 Unauthorized responses
Token validation timeouts
```

Solutions:
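For outright connection errors, first confirm the Keycloak address and network path with the same curl-style checks used for the service health endpoints. If token validation is timing out because every request calls Keycloak, one common mitigation is to verify JWTs locally against Keycloak's published keys, which the go-oidc library fetches and caches. This is a sketch only; the realm URL and client ID are placeholders, and it may differ from how Jan Server actually validates tokens:

```go
package auth

import (
	"context"

	"github.com/coreos/go-oidc/v3/oidc"
)

// NewVerifier discovers the realm's JWKS once and caches the signing keys,
// so per-request validation does not depend on a round trip to Keycloak.
func NewVerifier(ctx context.Context) (*oidc.IDTokenVerifier, error) {
	provider, err := oidc.NewProvider(ctx, "https://keycloak.example.com/realms/jan")
	if err != nil {
		return nil, err
	}
	return provider.Verifier(&oidc.Config{ClientID: "jan-api"}), nil
}

// Validate parses and verifies a bearer token locally.
func Validate(ctx context.Context, verifier *oidc.IDTokenVerifier, rawToken string) error {
	_, err := verifier.Verify(ctx, rawToken)
	return err
}
```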
## Performance Optimization
### Database Query Optimization
### Cache Strategy
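A cache-aside pattern with explicit TTLs keeps hot conversation data off the database while bounding staleness. A sketch with go-redis; the key naming, TTL, and Redis address are assumptions:

```go
package cache

import (
	"context"
	"errors"
	"time"

	"github.com/redis/go-redis/v9"
)

var rdb = redis.NewClient(&redis.Options{Addr: "localhost:6379"})

// GetOrLoad implements cache-aside: read Redis first, fall back to the loader
// on a miss, then populate the cache with a TTL so entries expire on their own.
func GetOrLoad(ctx context.Context, key string,
	load func(context.Context) ([]byte, error)) ([]byte, error) {

	val, err := rdb.Get(ctx, key).Bytes()
	if err == nil {
		return val, nil // cache hit
	}
	if !errors.Is(err, redis.Nil) {
		return nil, err // real Redis error, not just a miss
	}

	val, err = load(ctx)
	if err != nil {
		return nil, err
	}
	// Best-effort write; a failed SET should not fail the request.
	_ = rdb.Set(ctx, key, val, 5*time.Minute).Err()
	return val, nil
}
```

Combine this with the singleflight pattern from Issue 4 for keys hot enough to stampede on expiry.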
### Connection Pooling
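Pool limits should be set explicitly rather than left at driver defaults, and kept below Postgres `max_connections` across all replicas. A database/sql sketch; the numbers are illustrative and should be tuned against the DatabasePoolExhausted alert above:

```go
package store

import (
	"database/sql"
	"time"

	_ "github.com/lib/pq" // Postgres driver; substitute the driver the service actually uses
)

func Open(dsn string) (*sql.DB, error) {
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		return nil, err
	}
	// Open connections per instance; multiply by replica count when comparing
	// against the Postgres max_connections setting.
	db.SetMaxOpenConns(25)
	db.SetMaxIdleConns(10)
	db.SetConnMaxLifetime(30 * time.Minute) // recycle long-lived connections
	db.SetConnMaxIdleTime(5 * time.Minute)  // shed idle connections after bursts
	return db, nil
}
```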
## Logging Strategies
### Structured Logging
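Structured (JSON) logs make the ELK pipeline below useful: fields such as request ID and conversation ID become searchable instead of being buried in message strings. A sketch with Go's standard log/slog; the field names are illustrative:

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// JSON handler so Filebeat/Elasticsearch can index individual fields.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
		Level: slog.LevelInfo,
	})).With("service", "llm-api", "version", "v0.0.14")

	logger.Info("message created",
		"request_id", "req-123",
		"conversation_id", "conv-456",
		"duration_ms", 42,
	)

	logger.Error("upstream call failed",
		"request_id", "req-123",
		"error", "context deadline exceeded",
	)
}
```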
### Log Levels
### Log Aggregation with ELK Stack
```yaml
# filebeat.yml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/jan-server/*.log
    json.message_key: message
    json.keys_under_root: true

output.elasticsearch:
  hosts: ["localhost:9200"]
  index: "jan-server-%{+yyyy.MM.dd}"

processors:
  - add_kubernetes_metadata:
      in_cluster: true
```

## Capacity Planning
### Resource Monitoring
### Scaling Recommendations
| Traffic Level | CPU Utilization | Memory Utilization | Disk | Scaling |
|---|---|---|---|---|
| Low (< 100 req/s) | 20-30% | 30-40% | 50 GB | 1 instance |
| Medium (100-500 req/s) | 40-50% | 40-50% | 100 GB | 2-3 instances |
| High (500-2000 req/s) | 60-70% | 50-70% | 500 GB | 4-8 instances |
| Very High (2000+ req/s) | > 70% | > 70% | 1 TB+ | Horizontal + cache |

### Cost Optimization
## Incident Response
### Runbook Example: Database Down
Incident: Database Connection Lost

#### Detection
- Alert: `ServiceDown` for database
- Symptom: All APIs returning 500 errors

#### Immediate Actions (0-5 min)
1. Check database status:
   ```bash
   pg_isready -h localhost -p 5432
   ```
2. Check database logs: `docker logs jan-postgresql`
3. If the database is running, check connectivity from the services: `kubectl exec -it pod/jan-llm-api -- psql -c "SELECT 1"`

#### Diagnosis (5-15 min)
1. Is the database process running? `ps aux | grep postgres`
2. Is the disk full? `df -h`
3. Check system logs: `journalctl -n 50`
4. Check network connectivity: `ping database-host`

#### Recovery Steps
1. If the disk is full:
   - Clean old logs: `rm -rf /var/log/postgresql/*.log*`
   - Check large tables: `SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) FROM pg_tables ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC LIMIT 10;`
   - Archive old data if applicable
2. If the connection pool is exhausted:
   - Check active connections: `SELECT count(*) FROM pg_stat_activity;`
   - Terminate idle connections: `SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '1 hour';`
   - Restart services: `kubectl rollout restart deployment/jan-llm-api`
3. If the database is corrupted:
   - Check integrity: `REINDEX DATABASE postgres;`
   - If severe, restore from backup

#### Escalation
- If not resolved in 15 min: page the on-call DBA
- If there is customer impact: update the status page

#### Post-Incident
- Hold a root cause analysis meeting
- Add monitoring/alerting to prevent recurrence
- Update this runbook with findings
### Alert Severity Levels
---
## Summary Checklist
- [ ] Health checks configured for all services
- [ ] Prometheus scraping all metrics
- [ ] Grafana dashboards displaying key metrics
- [ ] Alert rules configured for critical issues
- [ ] Logging to centralized system
- [ ] Distributed tracing enabled
- [ ] Runbooks documented for common incidents
- [ ] On-call rotation established
- [ ] Regular chaos engineering exercises
- [ ] Quarterly capacity planning review
See [MCP Custom Tools Guide](./mcp-custom-tools) for tool-specific monitoring and [Webhooks Guide](./webhooks) for webhook health checks.