Monitoring & Troubleshooting Deep Dive

Status: v0.0.14 | Last Updated: December 23, 2025

Production monitoring is critical for maintaining Jan Server reliability. This guide covers health checks, metrics collection, distributed tracing, troubleshooting common issues, and performance optimization.

Table of Contents

  • Service Health Monitoring
  • Metrics & Observability
  • Distributed Tracing
  • Common Issues & Solutions
  • Performance Optimization
  • Logging Strategies
  • Capacity Planning
  • Incident Response
  • Summary Checklist

Service Health Monitoring

Health Check Endpoints

All services expose health check endpoints:

# LLM API health
curl http://localhost:8080/health

# Response API health
curl http://localhost:8082/health

# Media API health
curl http://localhost:8285/health

# MCP Tools health
curl http://localhost:8091/health

# Template API health
curl http://localhost:8185/health

Response Format:

{
  "status": "ok",
  "version": "v0.0.14",
  "timestamp": "2025-12-23T12:00:00Z",
  "uptime_seconds": 3600,
  "database": {
    "connected": true,
    "latency_ms": 2
  },
  "dependencies": {
    "redis": {
      "connected": true,
      "latency_ms": 1
    },
    "message_queue": {
      "connected": true,
      "queue_depth": 5
    }
  }
}

Kubernetes Probe Configuration

apiVersion: v1
kind: Pod
metadata:
  name: jan-server-llm-api
spec:
  containers:
  - name: llm-api
    image: jan-server:v0.0.14
    ports:
    - containerPort: 8080
    
    # Readiness probe - Accept traffic?
    readinessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
      timeoutSeconds: 3
      failureThreshold: 3
    
    # Liveness probe - Container alive?
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 30
      timeoutSeconds: 5
      failureThreshold: 3
    
    # Startup probe - App started?
    startupProbe:
      httpGet:
        path: /health
        port: 8080
      failureThreshold: 30
      periodSeconds: 10

Custom Health Checks
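
The built-in /health endpoint already reports on the database and other dependencies. If a service needs an additional, service-specific check, a handler along these lines works with the Kubernetes probes above. This is a minimal sketch in Go, assuming the service holds a *sql.DB; the package and type names are illustrative, not Jan Server's actual code:

package health

import (
	"context"
	"database/sql"
	"encoding/json"
	"net/http"
	"time"
)

// Checker aggregates the dependencies a service needs to report on.
// Wire in your real clients; only the database is shown here.
type Checker struct {
	DB *sql.DB
}

func (c *Checker) Handler(w http.ResponseWriter, r *http.Request) {
	ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
	defer cancel()

	status := map[string]any{"status": "ok"}

	// Measure database round-trip latency with a bounded timeout.
	start := time.Now()
	if err := c.DB.PingContext(ctx); err != nil {
		status["status"] = "degraded"
		status["database"] = map[string]any{"connected": false, "error": err.Error()}
	} else {
		status["database"] = map[string]any{
			"connected":  true,
			"latency_ms": time.Since(start).Milliseconds(),
		}
	}

	code := http.StatusOK
	if status["status"] != "ok" {
		code = http.StatusServiceUnavailable
	}
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(code)
	json.NewEncoder(w).Encode(status)
}

Returning 503 when a dependency is unhealthy lets the readiness probe take the pod out of rotation without the liveness probe restarting it.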


Metrics & Observability

Key Metrics by Service

LLM API Metrics:

llm_api_requests_total                    # Total API requests
llm_api_request_duration_seconds          # Request latency histogram
llm_api_conversations_total               # Total conversations created
llm_api_messages_total                    # Total messages sent
llm_api_tokens_processed_total            # Tokens processed
llm_api_errors_total                      # Error count by type
llm_api_cache_hits_total                  # Cache hit rate
llm_api_active_conversations              # Concurrent conversations

Response API Metrics:

response_api_generations_total            # Response generations
response_api_generation_duration_seconds  # Generation latency
response_api_tokens_generated_total       # Tokens in responses
response_api_errors_total                 # Generation errors

Media API Metrics:

media_api_uploads_total                   # File uploads
media_api_upload_size_bytes               # Upload size distribution
media_api_upload_duration_seconds         # Upload latency
media_api_storage_bytes_used              # Total storage used
media_api_errors_total                    # Upload errors
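
These metric names are what the services export on /metrics. If you add an instrument of your own, the usual pattern with the Prometheus Go client looks like the sketch below; the label names are assumptions, not the exact labels Jan Server uses:

package metrics

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// RequestDuration backs a *_request_duration_seconds histogram.
var RequestDuration = promauto.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "llm_api_request_duration_seconds",
		Help:    "Request latency by route and status code.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"route", "status"},
)

// RequestsTotal backs llm_api_requests_total.
var RequestsTotal = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "llm_api_requests_total",
		Help: "Total API requests by route and status code.",
	},
	[]string{"route", "status"},
)

// Handler exposes everything registered above on /metrics.
func Handler() http.Handler { return promhttp.Handler() }

Calling RequestDuration.WithLabelValues(route, status).Observe(elapsed.Seconds()) from the HTTP middleware is enough to feed the latency panels and alerts in the following sections.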

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
  - static_configs:
    - targets: ['localhost:9093']

rule_files:
  - 'alert_rules.yml'

scrape_configs:
  - job_name: 'jan-llm-api'
    metrics_path: '/metrics'
    static_configs:
    - targets: ['localhost:8080']
  
  - job_name: 'jan-response-api'
    metrics_path: '/metrics'
    static_configs:
    - targets: ['localhost:8082']
  
  - job_name: 'jan-media-api'
    metrics_path: '/metrics'
    static_configs:
    - targets: ['localhost:8285']
  
  - job_name: 'jan-mcp-tools'
    metrics_path: '/metrics'
    static_configs:
    - targets: ['localhost:8091']
  
  - job_name: 'postgres'
    static_configs:
    - targets: ['localhost:9187']
  
  - job_name: 'redis'
    static_configs:
    - targets: ['localhost:9121']

Alert Rules

# alert_rules.yml
groups:
- name: jan_server_alerts
  interval: 30s
  rules:
  
  # Service down
  - alert: ServiceDown
    expr: up{job=~"jan-.*"} == 0
    for: 2m
    annotations:
      summary: "{{ $labels.job }} is down"
  
  # High error rate
  - alert: HighErrorRate
    expr: |
      rate(llm_api_errors_total[5m]) > 0.05
    for: 5m
    annotations:
      summary: "High error rate in LLM API"
      description: "Error rate is {{ $value }}"
  
  # High latency
  - alert: HighLatency
    expr: |
      histogram_quantile(0.99, rate(llm_api_request_duration_seconds_bucket[5m])) > 5
    for: 5m
    annotations:
      summary: "High request latency detected"
      description: "P99 latency is {{ $value }}s"
  
  # Database connection pool exhausted
  - alert: DatabasePoolExhausted
    expr: |
      pg_stat_activity_max_connections_remaining < 5
    for: 1m
    annotations:
      summary: "Database connection pool nearly full"
  
  # Disk space low
  - alert: DiskSpaceLow
    expr: |
      node_filesystem_avail_bytes{mountpoint="/"} / 
      node_filesystem_size_bytes{mountpoint="/"} < 0.1
    for: 5m
    annotations:
      summary: "Disk space low on {{ $labels.instance }}"
  
  # Queue backlog
  - alert: QueueBacklog
    expr: |
      pg_partman_queue_depth > 1000
    for: 5m
    annotations:
      summary: "Message queue backlog detected"

Grafana Dashboards

Key dashboard panels to create:

Dashboard: Jan Server Overview
├── Requests/Second
├── Error Rate (%)
├── P50/P95/P99 Latency
├── Active Conversations
├── Database Connections
├── Cache Hit Rate
├── Message Queue Depth
├── Storage Usage
└── Cost Per 1M Tokens
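
These panels map onto fairly standard PromQL over the metrics listed earlier. Example queries (shapes are illustrative; adjust metric and label names to your actual series):

# Requests per second
sum(rate(llm_api_requests_total[5m]))

# Error rate (%)
100 * sum(rate(llm_api_errors_total[5m])) / sum(rate(llm_api_requests_total[5m]))

# P99 latency
histogram_quantile(0.99, sum(rate(llm_api_request_duration_seconds_bucket[5m])) by (le))

# Cache hits per request (a proxy for hit rate given the metrics above)
sum(rate(llm_api_cache_hits_total[5m])) / sum(rate(llm_api_requests_total[5m]))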

Distributed Tracing

OpenTelemetry Setup
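
A minimal Go setup, assuming the services use the upstream OpenTelemetry SDK and export OTLP over HTTP to a collector on localhost:4318; the collector address and service name are placeholders:

package tracing

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
)

// Init wires a tracer provider that batches spans to an OTLP/HTTP collector.
// It returns a shutdown function so buffered spans are flushed on exit.
func Init(ctx context.Context) (func(context.Context) error, error) {
	exporter, err := otlptracehttp.New(ctx,
		otlptracehttp.WithEndpoint("localhost:4318"),
		otlptracehttp.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String("jan-llm-api"),
		)),
	)
	otel.SetTracerProvider(tp)
	return tp.Shutdown, nil
}

Call Init once at startup and defer the returned shutdown function so spans are not lost when the process exits.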

Creating Custom Spans
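
With the tracer provider installed, wrapping an interesting unit of work in a span takes a few lines. A sketch; the tracer name, span name, attribute, and the fetchFromDB helper are all illustrative:

package tracing

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
)

func loadConversation(ctx context.Context, id string) error {
	// The child span is parented automatically from ctx.
	ctx, span := otel.Tracer("jan-llm-api").Start(ctx, "conversation.load")
	defer span.End()

	span.SetAttributes(attribute.String("conversation.id", id))

	if err := fetchFromDB(ctx, id); err != nil {
		// Record the error so it shows up on the trace.
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		return err
	}
	return nil
}

// fetchFromDB is a placeholder for the real data-access call.
func fetchFromDB(ctx context.Context, id string) error { return nil }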

Trace Context Propagation
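
For traces to cross service boundaries, the W3C trace context headers must be injected into outgoing requests and extracted from incoming ones. With the otelhttp contrib instrumentation this is mostly automatic; a sketch with placeholder names:

package tracing

import (
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

func setup(mux *http.ServeMux) (http.Handler, *http.Client) {
	// Register the W3C TraceContext propagator once at startup.
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{}, propagation.Baggage{},
	))

	// Incoming requests: extract the parent span from request headers.
	handler := otelhttp.NewHandler(mux, "jan-llm-api")

	// Outgoing requests: inject the current span into request headers.
	client := &http.Client{Transport: otelhttp.NewTransport(http.DefaultTransport)}
	return handler, client
}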


Common Issues & Solutions

Issue 1: Database Connection Pool Exhausted

Symptoms:

Error: connection pool exhausted
Active connections: 50/50
Queue depth: 100+

Root Causes:

  • Queries taking too long (connections held)
  • Connection leak (not closing properly)
  • Sudden traffic spike
  • N+1 query problem

Diagnosis:

-- Check active connections
SELECT datname, usename, count(*) 
FROM pg_stat_activity 
GROUP BY datname, usename;

-- Check long-running queries
SELECT query, query_start, now() - query_start as duration
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY duration DESC;

-- Check idle-in-transaction sessions (connections held open without running a query)
SELECT query, query_start, now() - query_start as duration
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY duration DESC;

Solutions:
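
The remedy depends on the root cause found above: fix the slow queries, close leaked connections (always defer rows.Close()), or cap the pool so a spike queues instead of exhausting PostgreSQL. With Go's database/sql the pool knobs look like this; the values and the pgx driver are illustrative assumptions, not tuned recommendations:

package db

import (
	"database/sql"
	"time"

	_ "github.com/jackc/pgx/v5/stdlib" // or whichever Postgres driver the service uses
)

func Open(dsn string) (*sql.DB, error) {
	pool, err := sql.Open("pgx", dsn)
	if err != nil {
		return nil, err
	}
	// Cap concurrent connections below the PostgreSQL max_connections limit.
	pool.SetMaxOpenConns(40)
	// Keep a small idle reserve so bursts do not pay the connect cost.
	pool.SetMaxIdleConns(10)
	// Recycle connections so a leaked one cannot be held forever.
	pool.SetConnMaxLifetime(30 * time.Minute)
	pool.SetConnMaxIdleTime(5 * time.Minute)
	return pool, nil
}

A server-side statement_timeout (for example ALTER DATABASE ... SET statement_timeout = '30s', with your database name) also stops a single runaway query from pinning connections.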

Issue 2: Out of Memory

Symptoms:

RSS Memory: 2.5GB (out of 3GB limit)
Swap usage increasing
Process killed: OOMKiller

Root Causes:

  • Memory leak in application
  • Large result set loading entirely in memory
  • Cache growing unbounded
  • Goroutine leak (Go services)

Diagnosis:
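
If the Go services expose net/http/pprof (an assumption; if not, enable it on an internal-only port), memory and goroutine growth can be inspected directly. The port below matches the LLM API:

# Container-level memory usage
kubectl top pod -l app=jan-llm-api

# Heap profile: what is holding memory right now
go tool pprof http://localhost:8080/debug/pprof/heap

# Goroutine count: a steadily growing number suggests a goroutine leak
curl -s http://localhost:8080/debug/pprof/goroutine?debug=1 | head -n 1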

Solutions:

Issue 3: Message Queue Backlog

Symptoms:

Queue depth: 50000+ messages
Processing lag: 30+ minutes
Consumer lag not catching up

Root Causes:

  • Consumer slower than producer
  • Poison pill messages blocking queue
  • Consumer crash/hang
  • Message processing timeout

Solutions:
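
The specifics depend on the queue implementation, but the usual playbook is to scale consumers, time-box message handling, and move repeatedly failing (poison pill) messages to a dead-letter location instead of retrying them forever. Scaling the consumer deployment is the quickest lever; the deployment name below is a placeholder:

# Add consumer capacity while the backlog drains
kubectl scale deployment/jan-response-api --replicas=6

# Watch queue depth recover via the health endpoint shown earlier
watch -n 10 'curl -s http://localhost:8082/health | jq .dependencies.message_queue'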

Issue 4: High API Latency

Symptoms:

P99 latency: 10+ seconds
Some endpoints slow, others normal
Error rate increases under load

Root Causes:

  • Slow database queries
  • Cache miss storm (thundering herd)
  • External API calls
  • Upstream service degradation

Diagnosis:
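
Start by separating slow endpoints from slow queries: per-route latency comes from the request duration histogram, and on the database side pg_stat_statements ranks queries by cost, assuming that extension is enabled (column names below are for PostgreSQL 13+):

-- Top queries by mean execution time
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

If the slow endpoints do not correlate with slow queries, look at external API calls and cache behavior instead.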

Solutions:
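
For the cache-miss storm specifically, collapse concurrent misses for the same key into a single backend call; golang.org/x/sync/singleflight is the standard tool. A sketch with an in-process placeholder cache standing in for the real one:

package cache

import (
	"sync"

	"golang.org/x/sync/singleflight"
)

var (
	group singleflight.Group
	local sync.Map // placeholder cache; swap for the real cache client
)

// Load returns the cached value for key; on a miss, exactly one concurrent
// caller runs fetch and everyone else waits for that result.
func Load(key string, fetch func() (any, error)) (any, error) {
	if v, ok := local.Load(key); ok {
		return v, nil
	}
	v, err, _ := group.Do(key, func() (any, error) {
		v, err := fetch()
		if err == nil {
			local.Store(key, v)
		}
		return v, err
	})
	return v, err
}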

Issue 5: Authentication Failures

Symptoms:

Keycloak connection errors
401 Unauthorized responses
Token validation timeouts

Solutions:
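
First confirm the services can reach Keycloak and that clocks agree, since token validation is time-sensitive. The commands below assume a recent Keycloak (no /auth path prefix), a realm named jan, and in-cluster deployment names; adjust all of these:

# Can the service reach Keycloak's OIDC discovery document?
kubectl exec -it deploy/jan-llm-api -- \
  curl -s http://keycloak:8080/realms/jan/.well-known/openid-configuration | jq .issuer

# Token validation is time-sensitive: compare clocks on the API node and Keycloak
date -u
kubectl exec -it deploy/keycloak -- date -u

If the discovery document loads but tokens are still rejected, check that the issuer URL configured in the services matches the discovery document's issuer exactly (scheme, host, and port) and that tokens are not expiring early due to clock skew.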


Performance Optimization

Database Query Optimization
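
The workflow is the same for any slow query: capture the real plan, check whether it scans or sorts more than it should, then add a matching index. Table and column names below are illustrative, not the actual Jan Server schema:

-- Capture the real plan with timing and buffer usage
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM messages
WHERE conversation_id = '123e4567-e89b-12d3-a456-426614174000'
ORDER BY created_at DESC
LIMIT 50;

-- A composite index that matches the filter + sort avoids a full scan and sort
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_messages_conversation_created
  ON messages (conversation_id, created_at DESC);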

Cache Strategy
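
Cache-aside against the Redis dependency already shown in the health check output is the usual starting point: read the cache, fall back to the loader on a miss, and write the result back with a TTL. A Go sketch assuming the go-redis client; key naming and TTL are illustrative:

package cache

import (
	"context"
	"encoding/json"
	"errors"
	"time"

	"github.com/redis/go-redis/v9"
)

var ttl = 5 * time.Minute

// GetOrLoad implements cache-aside: read Redis first, fall back to the
// loader on a miss, and write the result back with a TTL.
func GetOrLoad[T any](ctx context.Context, rdb *redis.Client, key string, load func(context.Context) (T, error)) (T, error) {
	var out T
	raw, err := rdb.Get(ctx, key).Result()
	if err == nil {
		if jsonErr := json.Unmarshal([]byte(raw), &out); jsonErr == nil {
			return out, nil
		}
		// Corrupt entry: treat as a miss and overwrite below.
	} else if !errors.Is(err, redis.Nil) {
		return out, err // Redis itself is unhealthy; surface the error
	}

	out, err = load(ctx)
	if err != nil {
		return out, err
	}
	if data, err := json.Marshal(out); err == nil {
		// Best-effort write-back; a failed SET should not fail the request.
		rdb.Set(ctx, key, data, ttl)
	}
	return out, nil
}

Combine this with the singleflight wrapper from the High API Latency section so a hot key expiring does not trigger a miss storm.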

Connection Pooling
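
The client-side limits shown under Issue 1 protect a single service; once many replicas share one PostgreSQL, a server-side pooler such as PgBouncer keeps the total connection count flat. A minimal pgbouncer.ini sketch; the database name, auth file, and sizes are placeholders:

[databases]
jan = host=postgres port=5432 dbname=jan

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
default_pool_size = 20
max_client_conn = 500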


Logging Strategies

Structured Logging
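
Structured (JSON) logs are what make the Filebeat/ELK pipeline below useful, since every field becomes an indexed, searchable key. With Go 1.21+ the standard library's log/slog is sufficient; the field names here are illustrative:

package main

import (
	"log/slog"
	"os"
)

func main() {
	// JSON handler so every field lands as its own key in Elasticsearch.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
		Level: slog.LevelInfo,
	}))
	slog.SetDefault(logger)

	// Attach stable request-scoped fields once, then log events.
	reqLog := logger.With(
		slog.String("service", "jan-llm-api"),
		slog.String("request_id", "req_abc123"),
	)
	reqLog.Info("conversation created",
		slog.String("conversation_id", "conv_42"),
		slog.Int("prompt_tokens", 256),
	)
	reqLog.Error("upstream call failed",
		slog.String("dependency", "response-api"),
		slog.Int("status", 502),
	)
}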

Log Levels

Log Aggregation with ELK Stack

# filebeat.yml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/jan-server/*.log
    json.message_key: message
    json.keys_under_root: true

output.elasticsearch:
  hosts: ["localhost:9200"]
  index: "jan-server-%{+yyyy.MM.dd}"

processors:
  - add_kubernetes_metadata:
      in_cluster: true

Capacity Planning

Resource Monitoring
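
The utilization numbers for the table below usually come straight from the cluster; the namespace, label, and data-directory path in these commands are placeholders:

# Current CPU/memory per pod and per node (requires metrics-server)
kubectl top pods -n jan-server
kubectl top nodes

# Docker Compose deployments
docker stats --no-stream

# Disk headroom on the database volume
df -h /var/lib/postgresql/data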

Scaling Recommendations

Traffic Level              CPU      Memory    Disk      Scaling
──────────────────────────────────────────────────────────────────
Low (< 100 req/s)          20-30%   30-40%    50GB      1 instance
Medium (100-500 req/s)     40-50%   40-50%    100GB     2-3 instances
High (500-2,000 req/s)     60-70%   50-70%    500GB     4-8 instances
Very High (2,000+ req/s)   >70%     >70%      1TB+      Horizontal + cache

Cost Optimization


Incident Response

Runbook Example: Database Down

## Incident: Database Connection Lost

### Detection
- Alert: `ServiceDown` for the database
- Symptom: all APIs returning 500 errors

### Immediate Actions (0-5 min)
1. Check database status: `pg_isready -h localhost -p 5432`
2. Check database logs: `docker logs jan-postgresql`
3. If the database is running, check connectivity from the services: `kubectl exec -it pod/jan-llm-api -- psql -c "SELECT 1"`

### Diagnosis (5-15 min)
- Is the database process running? `ps aux | grep postgres`
- Is the disk full? `df -h`
- Check system logs: `journalctl -n 50`
- Check network connectivity: `ping database-host`

### Recovery Steps
1. If the disk is full:
   - Clean old logs: `rm -rf /var/log/postgresql/*.log*`
   - Find the largest tables: `SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) FROM pg_tables ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC LIMIT 10;`
   - Archive old data if applicable
2. If the connection pool is exhausted:
   - Check active connections: `SELECT count(*) FROM pg_stat_activity;`
   - Terminate stale idle connections: `SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '1 hour';`
   - Restart services: `kubectl rollout restart deployment/jan-llm-api`
3. If the database is corrupted:
   - Rebuild indexes: `REINDEX DATABASE postgres;`
   - If the damage is severe, restore from backup

### Escalation
- If not resolved in 15 minutes: page the on-call DBA
- If there is customer impact: update the status page

### Post-Incident
- Hold a root cause analysis meeting
- Add monitoring/alerting to prevent recurrence
- Update this runbook with the findings

### Alert Severity Levels

---

## Summary Checklist

- [ ] Health checks configured for all services
- [ ] Prometheus scraping all metrics
- [ ] Grafana dashboards displaying key metrics
- [ ] Alert rules configured for critical issues
- [ ] Logging to centralized system
- [ ] Distributed tracing enabled
- [ ] Runbooks documented for common incidents
- [ ] On-call rotation established
- [ ] Regular chaos engineering exercises
- [ ] Quarterly capacity planning review

See [MCP Custom Tools Guide](./mcp-custom-tools) for tool-specific monitoring and [Webhooks Guide](./webhooks) for webhook health checks.