Deployment Guide
Comprehensive guide for deploying Jan Server to various environments.
Table of Contents
- Overview
- Prerequisites
- Deployment Options
- Kubernetes (Recommended)
- [Docker Compose](#docker-compose)
- Hybrid Mode
- Environment Configuration
- Security Considerations
- Monitoring and Observability
Overview
Jan Server supports multiple deployment strategies to accommodate different use cases:
| Environment | Use Case | Orchestrator | Recommended For |
|---|---|---|---|
| Kubernetes | Production, Staging | Kubernetes/Helm | Scalable production deployments |
| Docker Compose | Development, Testing | Docker Compose | Local development and testing |
| Hybrid Mode | Development | Native + Docker | Fast iteration and debugging |
Prerequisites
All Deployments
- Docker 24+ and Docker Compose V2
- PostgreSQL 18+ (managed or in-cluster)
- Redis 7+ (managed or in-cluster)
- S3-compatible storage (for media-api)
Kubernetes Deployments
- Kubernetes 1.27+
- Helm 3.12+
- kubectl configured
- Sufficient cluster resources (see Resource Requirements)
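To confirm the tooling above is in place before deploying, a quick check:

```bash
docker --version          # expect Docker 24+
docker compose version    # expect Compose V2
kubectl version --client  # expect client 1.27+
helm version --short      # expect v3.12+
```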
Deployment Options
Kubernetes (Recommended)
Kubernetes deployment uses Helm charts for full orchestration and scalability.
1. Development (Minikube)
For local development and testing:
# Prerequisites
minikube start --cpus=4 --memory=8192 --driver=docker
# Build and load images
cd services/llm-api && go mod tidy && cd ../..
cd services/media-api && go mod tidy && cd ../..
cd services/mcp-tools && go mod tidy && cd ../..
docker build -t jan/llm-api:latest -f services/llm-api/Dockerfile .
docker build -t jan/media-api:latest -f services/media-api/Dockerfile .
docker build -t jan/mcp-tools:latest -f services/mcp-tools/Dockerfile .
# Load images into minikube
minikube image load jan/llm-api:latest jan/media-api:latest jan/mcp-tools:latest
minikube image load quay.io/keycloak/keycloak:24.0.5
minikube image load bitnami/postgresql:latest bitnami/redis:latest
# Deploy
cd k8s
helm install jan-server ./jan-server \
--namespace jan-server \
--create-namespace
# Create databases
kubectl exec -n jan-server jan-server-postgresql-0 -- bash -c "PGPASSWORD=postgres psql -U postgres << 'EOF'
CREATE USER media WITH PASSWORD 'media';
CREATE DATABASE media_api OWNER media;
CREATE USER keycloak WITH PASSWORD 'keycloak';
CREATE DATABASE keycloak OWNER keycloak;
EOF"
# Verify deployment
kubectl get pods -n jan-server
# Access services
kubectl port-forward -n jan-server svc/jan-server-llm-api 8080:8080
curl http://localhost:8080/healthz

Complete guide: See k8s/SETUP.md
2. Cloud Kubernetes (AKS/EKS/GKE)
For production cloud deployments:
# Option A: With cloud-managed databases (recommended)
helm install jan-server ./jan-server \
--namespace jan-server \
--create-namespace \
--set postgresql.enabled=false \
--set redis.enabled=false \
--set global.postgresql.host=your-managed-postgres.cloud \
--set global.redis.host=your-managed-redis.cloud \
--set ingress.enabled=true \
--set ingress.className=nginx \
--set ingress.hosts[0].host=jan.yourdomain.com \
--set llmApi.autoscaling.enabled=true \
--set llmApi.replicaCount=3 \
--set llmApi.image.pullPolicy=Always \
--set mediaApi.image.pullPolicy=Always \
--set mcpTools.image.pullPolicy=Always
# Option B: With in-cluster databases
helm install jan-server ./jan-server \
--namespace jan-server \
--create-namespace \
--set postgresql.persistence.enabled=true \
--set postgresql.persistence.size=50Gi \
--set postgresql.persistence.storageClass=gp3 \
--set redis.master.persistence.enabled=true \
--set ingress.enabled=true \
--set llmApi.autoscaling.enabled=true

Configuration guide: See k8s/README.md
3. On-Premises Kubernetes
For on-premises production:
# Use production values with external databases
helm install jan-server ./jan-server \
--namespace jan-server \
--create-namespace \
--values ./jan-server/values-production.yaml \
--set postgresql.enabled=false \
--set redis.enabled=false \
--set global.postgresql.host=postgres.internal \
--set global.redis.host=redis.internal

Docker Compose
For local development and integration testing.
Development Mode
# Start infrastructure only (PostgreSQL, Keycloak, Kong)
make up-infra
# With API services (llm-api, media-api, response-api)
make up-api
# With MCP services (mcp-tools, vector-store)
make up-mcp
# Full stack with Kong + APIs + MCP
make up-full
# With GPU inference (local vLLM)
make up-vllm-gpu

Complete guide: See Development Guide
Testing Environment
cp .env.template .env # ensure a clean env file
# Edit .env and set: COMPOSE_PROFILES=infra,api,mcp
make up-full # start stack under test
make test-all # run jan-cli api-test suites

Hybrid Mode
For fast iteration during development:
make dev-full # start stack with host routing
# Replace a service with a host-native process
./jan-cli.sh dev run llm-api # macOS/Linux
.\jan-cli.ps1 dev run llm-api # Windows PowerShell
# Stop dev-full when done
make dev-full-stop # keep containers
make dev-full-down # remove containers

Complete guide: See Development Guide - Dev-Full Mode
Environment Configuration
Required Environment Variables
LLM API
# Database
DB_POSTGRESQL_WRITE_DSN=postgres://jan_user:jan_password@localhost:5432/jan_llm_api?sslmode=disable
# Keycloak/Auth
KEYCLOAK_BASE_URL=http://localhost:8085
BACKEND_CLIENT_ID=llm-api
BACKEND_CLIENT_SECRET=your-secret
CLIENT=jan-client
# Provider toggles
VLLM_ENABLED=true
VLLM_PROVIDER_URL=http://localhost:8101/v1
REMOTE_LLM_ENABLED=false
REMOTE_LLM_PROVIDER_URL=
REMOTE_API_KEY=
JAN_PROVIDER_CONFIGS=true
JAN_PROVIDER_CONFIG_SET=default
HTTP_PORT=8080
LOG_LEVEL=debug

Media API
# Database
DB_POSTGRESQL_WRITE_DSN=postgres://media:media@localhost:5432/media_api?sslmode=disable
# S3 Storage (Required - AWS Standard Naming)
MEDIA_S3_ENDPOINT=https://s3.amazonaws.com
MEDIA_S3_REGION=us-east-1
MEDIA_S3_BUCKET=your-bucket
MEDIA_S3_ACCESS_KEY_ID=your-access-key-id
MEDIA_S3_SECRET_ACCESS_KEY=your-secret-access-key
MEDIA_S3_USE_PATH_STYLE=false
# Server
MEDIA_API_PORT=8285
LOG_LEVEL=info

MCP Tools
# Server
HTTP_PORT=8091
LOG_LEVEL=info
# Optional providers
EXA_API_KEY=your-exa-key
BRAVE_API_KEY=your-brave-key

Configuration Files
Environment-specific configuration files in config/:
- `defaults.env` - Default values for all environments
- `development.env` - Local development settings
- `testing.env` - Test environment settings
- `production.env.example` - Production template (copy and customize)
- `secrets.env.example` - Secrets template (never commit actual secrets)
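For production, a typical workflow is to copy the templates and customize them. This is a sketch only; it assumes the resulting files are loaded the same way as the other files in config/:

```bash
# Copy the templates, fill in real values, and keep the results out of version control
cp config/production.env.example config/production.env
cp config/secrets.env.example config/secrets.env
```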
Multi-vLLM Instance Deployment (High Availability)
For production deployments requiring high availability and load-balanced inference across multiple vLLM instances.
Overview
Jan Server supports running multiple vLLM instances with automatic round-robin load balancing. This enables:
- High Availability: Continue operating if one vLLM instance fails
- Scalability: Distribute inference load across multiple GPUs/servers
- Flexible Resource Allocation: Deploy vLLM instances independently
Architecture
LLM API (load balancer)
├── vLLM Instance 1 (Port 8101)
├── vLLM Instance 2 (Port 8102)
└── vLLM Instance 3 (Port 8103)

The LLM API uses round-robin scheduling to distribute requests across instances.
Deployment Steps
1. Configure Multiple vLLM Instances
Add provider instances in config/defaults.yaml:
providers:
- vendor: vllm
enabled: true
endpoints:
- name: vllm-instance-1
base_url: http://localhost:8101/v1
api_key: ""
- name: vllm-instance-2
base_url: http://localhost:8102/v1
api_key: ""
- name: vllm-instance-3
base_url: http://localhost:8103/v1
api_key: ""2. Start Multiple vLLM Instances via Docker
Create a docker-compose.vllm-multi.yml:
version: '3.8'
services:
vllm-1:
image: vllm/vllm-openai:latest
container_name: vllm-instance-1
environment:
- VLLM_API_KEY=${VLLM_API_KEY:-}
- VLLM_SERVED_MODEL_NAME=meta-llama/Llama-2-7b-hf
ports:
- "8101:8000"
volumes:
- vllm-cache-1:/root/.cache
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
networks:
- jan-network
restart: unless-stopped
vllm-2:
image: vllm/vllm-openai:latest
container_name: vllm-instance-2
environment:
- VLLM_API_KEY=${VLLM_API_KEY:-}
- VLLM_SERVED_MODEL_NAME=meta-llama/Llama-2-7b-hf
ports:
- "8102:8000"
volumes:
- vllm-cache-2:/root/.cache
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
networks:
- jan-network
restart: unless-stopped
vllm-3:
image: vllm/vllm-openai:latest
container_name: vllm-instance-3
environment:
- VLLM_API_KEY=${VLLM_API_KEY:-}
- VLLM_SERVED_MODEL_NAME=meta-llama/Llama-2-7b-hf
ports:
- "8103:8000"
volumes:
- vllm-cache-3:/root/.cache
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
networks:
- jan-network
restart: unless-stopped
volumes:
vllm-cache-1:
vllm-cache-2:
vllm-cache-3:
networks:
jan-network:
external: true

Start the instances:
# Create shared network
docker network create jan-network
# Start multi-vLLM stack
docker compose -f docker-compose.vllm-multi.yml up -d
# Verify instances are healthy
curl http://localhost:8101/health
curl http://localhost:8102/health
curl http://localhost:8103/health

3. Start LLM API with Multi-vLLM Configuration
# Set environment to use multiple vLLM endpoints
export VLLM_ENABLED=true
export VLLM_PROVIDER_URL=http://localhost:8101/v1,http://localhost:8102/v1,http://localhost:8103/v1
# Start LLM API
docker compose -f docker-compose.yml -f infra/docker/services-api.yml up -d llm-api

Or set the environment variables directly in docker-compose.yml:
services:
llm-api:
environment:
VLLM_ENABLED: "true"
VLLM_PROVIDER_URL: "http://vllm-1:8000/v1,http://vllm-2:8000/v1,http://vllm-3:8000/v1"4. Kubernetes Deployment (Helm)
For Kubernetes, deploy vLLM instances as separate StatefulSets:
# Create a values override file: k8s/values-multi-vllm.yaml
cat > k8s/values-multi-vllm.yaml << 'EOF'
vllmInstances:
enabled: true
instances: 3
resources:
requests:
memory: "24Gi"
nvidia.com/gpu: "1"
limits:
memory: "32Gi"
nvidia.com/gpu: "1"
llmApi:
env:
VLLM_ENABLED: "true"
# Load balancer will handle round-robin
VLLM_PROVIDER_URL: "http://vllm-0:8000/v1,http://vllm-1:8000/v1,http://vllm-2:8000/v1"
replicaCount: 2
EOF
# Deploy with multi-vLLM configuration
helm install jan-server ./jan-server \
--namespace jan-server \
--create-namespace \
--values k8s/values-multi-vllm.yaml

Load Balancing Strategy
The LLM API implements round-robin load balancing across configured vLLM endpoints:
- Request Distribution: Each request is routed to the next vLLM instance in sequence
- Failover: If a vLLM instance is unavailable, requests retry on the next instance
- Health Checks: Periodic health checks verify instance availability
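A rough way to confirm the distribution is to send a burst of identical requests and compare per-instance activity. This sketch assumes the LLM API exposes an OpenAI-compatible /v1/chat/completions route on port 8080 and that no auth token is required in your setup; adjust the path, model name, and headers to match your deployment:

```bash
# Fire a small burst through the LLM API load balancer
for i in $(seq 1 9); do
  curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"meta-llama/Llama-2-7b-hf","messages":[{"role":"user","content":"ping"}],"max_tokens":1}' \
    > /dev/null
done
# With round-robin in effect, activity should be roughly even across the instances
docker stats --no-stream vllm-instance-1 vllm-instance-2 vllm-instance-3
```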
Monitoring Multi-vLLM Setup
Monitor the vLLM instances and load balancing:
# Check individual vLLM instance stats
curl http://localhost:8101/stats
curl http://localhost:8102/stats
curl http://localhost:8103/stats
# Monitor LLM API logs for load balancing
docker logs -f llm-api | grep "vllm\|provider\|load"
# Check request distribution across instances
docker stats vllm-instance-1 vllm-instance-2 vllm-instance-3

Troubleshooting Multi-vLLM
Uneven Load Distribution
If requests aren't evenly distributed:
- Check instance health: Verify all instances respond to health checks
- Review logs: Check for error-based failover patterns
- Restart instances: Clear any stuck states with rolling restarts
# Rolling restart (minimizes downtime)
for i in 1 2 3; do
echo "Restarting vllm-instance-$i..."
docker restart vllm-instance-$i
sleep 30 # Wait for recovery
done

Instance Connection Failures
# Test connectivity from LLM API container
docker exec llm-api curl http://vllm-instance-1:8000/health
docker exec llm-api curl http://vllm-instance-2:8000/health
docker exec llm-api curl http://vllm-instance-3:8000/health
# Check Docker network connectivity
docker network inspect jan-network

Memory/GPU Issues
If instances are running out of memory:
- Reduce model size: Use a smaller quantized model
- Reduce batch size: set `--max-model-len` in vLLM
- Add more instances: Distribute load across more nodes
- Scale vertically: Upgrade to GPUs with more VRAM
# Update vLLM docker-compose to use smaller model
docker compose -f docker-compose.vllm-multi.yml down
# Edit docker-compose.vllm-multi.yml, change VLLM_SERVED_MODEL_NAME
docker compose -f docker-compose.vllm-multi.yml up -d

Performance Tuning
For optimal multi-vLLM performance:
| Setting | Recommendation | Notes |
|---|---|---|
| Instances | 2-4 per operator | Balance cost vs redundancy |
| Batch Size | 1-4 per instance | Depends on VRAM available |
| Model Size | 7B or smaller | For multi-instance on typical GPUs |
| Tensor Parallelism | Enabled if multi-GPU per instance | Reduces latency |
| Quantization | 8-bit or GPTQ | Reduces VRAM usage |
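As an illustration of the table above, a single instance tuned with these options might be started like this. This is a sketch only; verify the flag names against the vLLM version you deploy, and note that `--tensor-parallel-size 2` requires two GPUs on the host:

```bash
docker run --rm --gpus all -p 8101:8000 vllm/vllm-openai:latest \
  --model meta-llama/Llama-2-7b-hf \
  --max-model-len 4096 \
  --tensor-parallel-size 2
# For quantization, add e.g. --quantization gptq (requires a GPTQ-quantized checkpoint);
# gated models additionally need a Hugging Face token.
```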
Cost Optimization
Multi-vLLM enables several cost optimizations:
- Spot Instances: Use cheaper spot GPUs with fast failover
- Mixed Hardware: Use different GPU types for different model sizes
- Autoscaling: Add/remove instances based on load
- Batch Processing: Queue requests to maximize GPU utilization
Security Considerations
Production Checklist
- Secrets Management
  - Use an external secrets operator (e.g., AWS Secrets Manager, Azure Key Vault)
  - Never commit secrets to version control
  - Rotate credentials regularly
- Network Security
  - Enable network policies to restrict pod-to-pod communication (see the sketch after this checklist)
  - Use TLS for all external endpoints
  - Configure ingress with proper SSL certificates
- Authentication
  - Change the default Keycloak admin password
  - Configure proper realm settings
  - Enable token exchange for client-to-client auth
- Database Security
  - Use managed database services when possible
  - Enable SSL/TLS connections
  - Implement backup and disaster recovery
- Pod Security
  - Apply pod security standards (restricted profile)
  - Use non-root containers
  - Enable security context constraints
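Example: Network Policies
A minimal sketch of the network-policy item above: deny all ingress in the namespace by default, then explicitly allow traffic to the LLM API. The app.kubernetes.io/name label value is an assumption; check the labels your Helm release actually applies.

```bash
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: jan-server
spec:
  podSelector: {}
  policyTypes:
    - Ingress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-llm-api-ingress
  namespace: jan-server
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: llm-api
  ingress:
    - ports:
        - port: 8080
          protocol: TCP
EOF
```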
Example: External Secrets
# Install external-secrets operator
helm repo add external-secrets https://charts.external-secrets.io
helm install external-secrets external-secrets/external-secrets \
--namespace external-secrets-system \
--create-namespace
# Create SecretStore for AWS Secrets Manager
kubectl apply -f - <<EOF
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
name: aws-secretsmanager
namespace: jan-server
spec:
provider:
aws:
service: SecretsManager
region: us-west-2
EOF

Resource Requirements
Minimum (Development)
| Component | CPU | Memory |
|---|---|---|
| LLM API | 250m | 256Mi |
| Media API | 250m | 256Mi |
| MCP Tools | 250m | 256Mi |
| PostgreSQL | 250m | 256Mi |
| Redis | 100m | 128Mi |
| Keycloak | 500m | 512Mi |
| Total | ~1.5 CPU | ~2Gi |
Recommended (Production)
| Component | CPU | Memory | Replicas |
|---|---|---|---|
| LLM API | 1000m | 1Gi | 3 |
| Media API | 500m | 512Mi | 2 |
| MCP Tools | 500m | 512Mi | 2 |
| PostgreSQL | 2000m | 4Gi | 1 (or managed) |
| Redis | 500m | 1Gi | 3 (cluster) |
| Keycloak | 1000m | 1Gi | 2 |
Storage Requirements
- PostgreSQL: 50Gi minimum (100Gi+ for production)
- Redis: 10Gi for persistence
- PVCs for media uploads (if not using S3)
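To apply this sizing through Helm, a sketch along these lines can be used. The postgresql.persistence.* and llmApi.replicaCount keys appear elsewhere in this guide; the llmApi.resources.* paths are assumed from common chart conventions and should be verified against the chart's values.yaml:

```bash
helm upgrade jan-server ./jan-server \
  --namespace jan-server \
  --reuse-values \
  --set postgresql.persistence.size=100Gi \
  --set redis.master.persistence.enabled=true \
  --set llmApi.replicaCount=3 \
  --set llmApi.resources.requests.cpu=1000m \
  --set llmApi.resources.requests.memory=1Gi
```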
Monitoring and Observability
Enable Monitoring Stack
# Start monitoring services
docker compose --profile monitoring up -d
# Access dashboards
# Prometheus: http://localhost:9090
# Grafana: http://localhost:3331
# Jaeger: http://localhost:16686

Key Metrics to Monitor
- Service Health: Endpoint availability, response times
- Database: Connection pool usage, query performance
- Resource Usage: CPU, memory, disk I/O
- Request Rates: Throughput, error rates
- Authentication: Token issuance, validation failures
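For quick endpoint-availability checks outside the dashboards, the health endpoints can be polled directly. /healthz is shown earlier for llm-api; the same path is assumed for the other services, so adjust if yours differ:

```bash
curl -f http://localhost:8080/healthz   # llm-api
curl -f http://localhost:8285/healthz   # media-api (path assumed)
curl -f http://localhost:8091/healthz   # mcp-tools (path assumed)
```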
Complete guide: See Monitoring Guide
Troubleshooting
Common Issues
Pods Not Starting
# Check pod status
kubectl get pods -n jan-server
# View pod logs
kubectl logs -n jan-server <pod-name>
# Describe pod for events
kubectl describe pod -n jan-server <pod-name>

Database Connection Failures
# Verify PostgreSQL is running
kubectl exec -n jan-server jan-server-postgresql-0 -- psql -U postgres -c '\l'
# Check database exists
kubectl exec -n jan-server jan-server-postgresql-0 -- psql -U postgres -c '\l' | grep media_api
# Test connection from service pod
kubectl exec -n jan-server <service-pod> -- nc -zv jan-server-postgresql 5432

Image Pull Failures
For minikube:
# Verify images are loaded
minikube image ls | grep jan/
# Reload if missing
minikube image load jan/llm-api:latest

For production:
# Check image pull policy
kubectl get deployment -n jan-server jan-server-llm-api -o yaml | grep pullPolicy
# Should be "Always" or "IfNotPresent" for registry images

Related Documentation
- Kubernetes Setup Guide - Complete k8s deployment steps
- Kubernetes Configuration - Helm chart configuration reference
- Development Guide - Local development setup
- Development Guide - Native service execution and dev-full mode
- Monitoring Guide - Observability setup
- Architecture Overview - System architecture
Support
For additional help:
- Review Getting Started
- Check Troubleshooting Guide
- See Architecture Documentation