# Production Deployment Guide

This document covers deploying the TipSharks ELO API in a production environment.
For local development and day-to-day operations, see [ops.md](ops.md).

---

## Architecture Overview

```
                     ┌─────────────┐
                     │   Client    │
                     └──────┬──────┘
                            │ :80/:443
                     ┌──────▼──────┐
                     │   nginx     │  ← Reverse proxy, TLS termination, static files
                     └──────┬──────┘
                            │
              ┌─────────────┼─────────────┐
              │             │             │
      ┌───────▼──────┐  ┌───▼────┐ ┌──────▼──────┐
      │   api:8000   │  │ worker │ │ redis:6379  │
      │  (FastAPI)   │  │  (CLI) │ │  (BullMQ)   │
      └───────┬──────┘  └────────┘ └──────┬──────┘
              │                           │
              └──────────┬────────────────┘
                         │
                  ┌──────▼──────┐
                  │  postgres   │
                  │  :5432      │
                  └─────────────┘
```

**Containers** (all on `tipsharks-network`):

| Service | Image | Purpose |
|---------|-------|---------|
| `db` | `postgres:16-alpine` | Primary data store |
| `redis` | `redis:7-alpine` | Job queue & caching |
| `api` | Built from `Dockerfile.api` | FastAPI REST API + Web UI |
| `worker` | Built from `Dockerfile.worker` | CLI ingestion, recompute, scraping (on-demand) |
| `nginx` | `nginx:alpine` | Reverse proxy, TLS, static file serving |

---

## Deployment Steps

### 1. Prerequisites

- Docker Engine 24+ and Docker Compose v2
- A `.env` file with production credentials (see `.env.prod.example`)
- PostgreSQL client tools (`pg_isready`, `psql`) for health checks
- (Optional) TLS certificate files for HTTPS termination

### 2. Configure Environment

```bash
cd tipsharks-elo-api

# Copy the production example and edit with your values
cp .env.prod.example .env
vi .env
```

**Critical values to set:**

```ini
POSTGRES_PASSWORD=<generate-strong-db-password>
API_ADMIN_TOKEN=<generate-32-char-random-string>
API_CORS_ALLOW_ORIGINS=https://yourdomain.com
REDIS_URL=redis://redis:6379/0
SCHEDULER_ENABLED=true
```

Generate a secure admin token:

```bash
python3 -c "import secrets; print(secrets.token_hex(16))"
# Example output: a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6
```

### 3. Build Images

```bash
docker compose -f docker-compose.prod.yml build
```

This builds the `api` and `worker` images from their respective Dockerfiles.
Images are tagged `tipsharks-elo-api:latest` and `tipsharks-elo-api-worker:latest`.

To tag for a specific release:

```bash
docker compose -f docker-compose.prod.yml build
docker tag tipsharks-elo-api:latest your-registry/tipsharks-elo-api:v1.0.0
docker tag tipsharks-elo-api-worker:latest your-registry/tipsharks-elo-api-worker:v1.0.0
```

### 4. Start Database and Redis

```bash
docker compose -f docker-compose.prod.yml up -d db redis
```

Verify they are healthy:

```bash
docker compose -f docker-compose.prod.yml ps
# Both should show "healthy" under STATUS
```

### 5. Run Database Migrations

```bash
# Migrations run via the worker container (tools profile)
docker compose -f docker-compose.prod.yml --profile tools run --rm worker alembic upgrade head
```

### 6. Start API

```bash
docker compose -f docker-compose.prod.yml up -d api
```

Wait for the health check to pass (may take up to 30 seconds on first start).
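
Rather than waiting a fixed 30 seconds, the health endpoint can be polled until it reports healthy. The following is a minimal sketch, not part of the project: `fetch_status` and `wait_for_health` are hypothetical helper names, and the URL you pass is an assumption, since the API may only be reachable through nginx or from inside the Docker network.

```python
import json
import time
import urllib.request


def fetch_status(url):
    """Return the "status" field of the JSON health response, or "" on any error."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return json.load(resp).get("status", "")
    except (OSError, ValueError, AttributeError):
        # Connection refused, HTTP error, invalid JSON, or a non-object body.
        return ""


def wait_for_health(url, timeout=60.0, interval=2.0, fetch=fetch_status, sleep=time.sleep):
    """Poll until fetch(url) returns "healthy" or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if fetch(url) == "healthy":
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

Calling `wait_for_health("http://localhost/health")` after nginx is up returns `True` once the endpoint answers with `"status": "healthy"`, and `False` if the timeout passes first.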

### 7. Start Nginx

```bash
docker compose -f docker-compose.prod.yml up -d nginx
```

### 8. Verify Deployment

```bash
# Health check via nginx
curl http://localhost/health

# Expected response:
# {"status": "healthy", "version": "0.1.0"}

# Check all services
docker compose -f docker-compose.prod.yml ps

# View logs
docker compose -f docker-compose.prod.yml logs --tail=50
```

### 9. Initial Data Ingestion

```bash
# Ingest the last 7 days of harness racing data
# (`date -d` is GNU date syntax; on macOS/BSD use `date -v-7d +%Y-%m-%d` instead)
docker compose -f docker-compose.prod.yml --profile tools run --rm worker \
  python -m apps.backend.worker.cli ingest \
  --from "$(date -d '7 days ago' +%Y-%m-%d)" \
  --to "$(date +%Y-%m-%d)"
```
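
The `date -d` form used above is specific to GNU date. Where that is not available, the same `--from`/`--to` values can be computed portably. This is a small sketch; `ingest_window` is a hypothetical helper, not part of the worker CLI.

```python
from datetime import date, timedelta


def ingest_window(days_back, today=None):
    """Return (from, to) ISO dates spanning `days_back` days ago through today."""
    end = today or date.today()
    start = end - timedelta(days=days_back)
    return start.isoformat(), end.isoformat()


# Print the flags to paste into the ingest command.
start, end = ingest_window(7)
print(f"--from {start} --to {end}")
```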

### 10. Initial Rating Recompute

```bash
# Full recompute from scratch
docker compose -f docker-compose.prod.yml --profile tools run --rm worker \
  python -m apps.backend.worker.cli recompute \
  --from 2020-01-01 --to $(date +%Y-%m-%d) --clear
```

---

## Managing Services

### Start All Services

```bash
docker compose -f docker-compose.prod.yml up -d
```

### Stop All Services

```bash
docker compose -f docker-compose.prod.yml down
```

To also remove volumes (destroys data!):

```bash
docker compose -f docker-compose.prod.yml down -v
```

### View Logs

```bash
# All services
docker compose -f docker-compose.prod.yml logs -f

# Single service
docker compose -f docker-compose.prod.yml logs -f api
docker compose -f docker-compose.prod.yml logs -f nginx
```

### Run Worker Commands

All worker CLI commands use the `--profile tools` flag:

```bash
# Help
docker compose -f docker-compose.prod.yml --profile tools run --rm worker \
  python -m apps.backend.worker.cli --help

# System info
docker compose -f docker-compose.prod.yml --profile tools run --rm worker \
  python -m apps.backend.worker.cli info

# Ingest a single date
docker compose -f docker-compose.prod.yml --profile tools run --rm worker \
  python -m apps.backend.worker.cli ingest --date 2024-06-01

# Recompute ratings
docker compose -f docker-compose.prod.yml --profile tools run --rm worker \
  python -m apps.backend.worker.cli recompute --clear
```

---

## SSL/TLS Termination

### Option A: nginx with Self-Managed Certificates

1. Place your certificate and key files on the host:

```bash
mkdir -p /etc/ssl/certs /etc/ssl/private
# Copy your .crt and .key files
```

2. Uncomment the HTTPS server block in `infrastructure/nginx/nginx.conf`

3. Map the certificate paths in `docker-compose.prod.yml`:

```yaml
services:
  nginx:
    volumes:
      - /etc/ssl/certs:/etc/ssl/certs:ro
      - /etc/ssl/private:/etc/ssl/private:ro
```

4. Restart nginx:

```bash
docker compose -f docker-compose.prod.yml up -d nginx
```

### Option B: Let's Encrypt / Certbot

Run certbot on the host and mount certificates into the nginx container:

```bash
# --standalone binds port 80, so stop nginx first (or use the webroot plugin instead)
certbot certonly --standalone -d api.tipsharks.com
```

Then mount:

```yaml
volumes:
  - /etc/letsencrypt:/etc/letsencrypt:ro
```

### Option C: External Load Balancer

Deploy behind an AWS ALB, GCP HTTPS Load Balancer, or Cloudflare. In this case,
nginx runs on HTTP only and the load balancer terminates TLS.

---

## Secrets Management

### Recommended Approach: Environment File

The `.env` file is the primary mechanism. **Never commit `.env` to version control.**

```bash
# Make sure .env is listed in .gitignore (appends only if it is missing)
grep -qxF ".env" .gitignore || echo ".env" >> .gitignore
```

### Alternative: Docker Secrets

For Docker Swarm or stricter environments:

```yaml
secrets:
  db_password:
    file: ./secrets/db_password.txt
  admin_token:
    file: ./secrets/admin_token.txt
```
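
Docker mounts each secret granted to a service as a file under `/run/secrets/<name>`, so the application has to read the value from that file rather than from an environment variable. How the TipSharks API loads its settings is not shown here; the following is a sketch of one possible approach, with `read_secret` as a hypothetical helper that falls back to the environment variable when no secret file is present.

```python
import os
from pathlib import Path


def read_secret(name, env_var, secrets_dir="/run/secrets"):
    """Prefer the mounted secret file; fall back to the environment variable."""
    secret_file = Path(secrets_dir) / name
    if secret_file.is_file():
        # Secret files often end with a trailing newline; strip it.
        return secret_file.read_text(encoding="utf-8").strip()
    return os.environ.get(env_var)
```

For example, `read_secret("db_password", "POSTGRES_PASSWORD")` would return the contents of `/run/secrets/db_password` under Swarm and the `.env` value otherwise.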

### Alternative: External Secrets Manager

For cloud deployments, inject secrets via the orchestrator:

- **AWS**: Use Parameter Store or Secrets Manager with ECS task definitions
- **GCP**: Use Secret Manager with Cloud Run
- **Kubernetes**: Use External Secrets Operator with Vault/AWS/GCP

---

## Monitoring

### Health Checks

Built-in health check endpoints:

| Endpoint | Type | Description |
|----------|------|-------------|
| `/health` | HTTP | Returns `{"status": "healthy"}` plus a `version` field |
| Container healthchecks | Docker | db: `pg_isready`, redis: `redis-cli ping`, api: `/health` probe |

### Log Aggregation

The API emits structured JSON logs. Configure aggregation in `.env`:

```ini
LOG_AGGREGATION_URL=https://logs.example.com/loki/api/v1/push
LOG_AGGREGATION_TOKEN=your-loki-token
```
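
For readers unfamiliar with structured logging, a JSON log line is one JSON object per event, which aggregators such as Loki can parse without custom patterns. The sketch below uses only the standard library to illustrate the idea. The field names are assumptions and may not match what the API actually emits.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)


# Attach the formatter to a stream handler so every record is emitted as JSON.
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
```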

Suggested integrations:

- **Loki + Grafana**: Lightweight; ships container logs via the Loki Docker log driver plugin
- **Datadog**: Set `DD_API_KEY` and use Datadog Agent sidecar
- **Papertrail**: Configure Docker log driver `syslog`
- **CloudWatch**: Use `awslogs` Docker log driver

### Metrics to Track

| Metric | Source | Alert Threshold |
|--------|--------|-----------------|
| API p95 latency | nginx logs / APM | > 500ms |
| API error rate (5xx) | nginx logs | > 1% |
| DB connection count | `pg_stat_activity` | > 80% of pool |
| Disk usage | Docker volume | > 80% |
| Worker failure rate | Application logs | Any failure |
| Rating snapshot count | `/ratings/stats` | Sudden drop |
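
The first two thresholds in the table can be evaluated from request latencies and status codes parsed out of the nginx access log. This is a sketch under stated assumptions: the log parsing is left out, the function names are hypothetical, and the percentile uses the nearest-rank method, which may differ slightly from what an APM tool reports. Both inputs are expected to be non-empty.

```python
def p95(latencies_ms):
    """95th percentile of a non-empty list, using the nearest-rank method."""
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math, 1-based
    return ordered[rank - 1]


def error_rate(status_codes):
    """Fraction of responses with a 5xx status in a non-empty list."""
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)


def alerts(latencies_ms, status_codes):
    """Names of metrics that breach the table's thresholds (500 ms, 1%)."""
    breached = []
    if p95(latencies_ms) > 500:
        breached.append("p95_latency")
    if error_rate(status_codes) > 0.01:
        breached.append("error_rate_5xx")
    return breached
```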

### External Monitoring

Configure a ping-based uptime monitor:

```ini
HEALTHCHECK_PING_URL=https://betteruptime.com/api/v1/heartbeat/your-id
```

---

## Backup and Restore

### Automated Backups

The backup script (`infrastructure/scripts/backup.sh`) creates daily compressed
PostgreSQL dumps with 7-day retention.

**Run manually:**

```bash
./infrastructure/scripts/backup.sh
```

**Schedule with cron:**

```bash
# Run daily at 1 AM (a crontab entry must stay on one line; cron has no backslash continuation)
0 1 * * * /path/to/tipsharks-elo-api/infrastructure/scripts/backup.sh >> /var/log/tipsharks-backup.log 2>&1
```

**Schedule with systemd timer** (see `infrastructure/systemd/` if available).
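
To make the 7-day retention rule concrete, the sketch below selects backups that fall outside the retention window based on the timestamp in the filename. It is not the logic of `backup.sh`, which may work differently (for example by file modification time). The filename pattern is taken from the restore example in this guide, and `expired_backups` is a hypothetical helper.

```python
import re
from datetime import datetime, timedelta

# Pattern assumed from the restore example: tipsharks_YYYYMMDD_HHMMSS.sql.gz
BACKUP_NAME = re.compile(r"^tipsharks_(\d{8}_\d{6})\.sql\.gz$")


def expired_backups(filenames, now, keep_days=7):
    """Return the backup filenames whose embedded timestamp is older than keep_days."""
    cutoff = now - timedelta(days=keep_days)
    expired = []
    for name in filenames:
        match = BACKUP_NAME.match(name)
        if not match:
            continue  # ignore files that are not backups
        taken_at = datetime.strptime(match.group(1), "%Y%m%d_%H%M%S")
        if taken_at < cutoff:
            expired.append(name)
    return sorted(expired)
```

Listing the candidates before deleting anything makes it easy to confirm that a recent backup always remains.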

### Restore from Backup

```bash
# 1. Stop the API
docker compose -f docker-compose.prod.yml stop api

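# 2. Drop and recreate the database
#    (connect to the maintenance database `postgres`; a database cannot be
#    dropped from a session that is connected to it)
docker compose -f docker-compose.prod.yml exec db \
  psql -U tipsharks -d postgres -c "DROP DATABASE IF EXISTS tipsharks;"
docker compose -f docker-compose.prod.yml exec db \
  psql -U tipsharks -d postgres -c "CREATE DATABASE tipsharks;"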

# 3. Restore the backup
gunzip -c backups/tipsharks_20240115_020000.sql.gz | \
  docker compose -f docker-compose.prod.yml exec -T db \
  psql -U tipsharks tipsharks

# 4. Run any pending migrations
docker compose -f docker-compose.prod.yml --profile tools run --rm worker \
  alembic upgrade head

# 5. Restart the API
docker compose -f docker-compose.prod.yml up -d api
```

---

## Scaling

### Vertical Scaling

Increase resources for individual services:

```yaml
services:
  api:
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 2G
  db:
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 8G
```

### Horizontal Scaling (API)

Run multiple API instances behind nginx:

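```bash
# Scale the api service to three replicas. This assumes the service sets no
# fixed container_name and publishes no host port, either of which would conflict.
docker compose -f docker-compose.prod.yml up -d --scale api=3 api
```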

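Every replica listens on port 8000 under the shared `api` service name, so the
nginx upstream block needs only one entry. Docker's embedded DNS returns the
address of each replica, and nginx resolves the name when it loads its
configuration, so restart nginx after scaling:

```nginx
upstream tipsharks_api {
    server api:8000;
}
```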

### Database Read Replicas

For read-heavy workloads:

1. Set up PostgreSQL read replicas
2. Configure SQLAlchemy to use the replica for read queries:

```python
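# Example: two-engine setup in database.py
# DATABASE_URL and REPLICA_URL are assumed to come from the app's settings.
from sqlalchemy.ext.asyncio import create_async_engine

engine_write = create_async_engine(DATABASE_URL)  # primary: writes
engine_read = create_async_engine(REPLICA_URL)    # replica: read-only queries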
```

---

## Security Checklist

- [ ] **Database password**: Generated strong random password, stored in `.env` only
- [ ] **Admin token**: 32+ character hex token via `secrets.token_hex(16)`
- [ ] **CORS origins**: Set to specific origins, not `*`
- [ ] **Firewall**: Restrict ports 5432 and 6379 to the Docker network only
- [ ] **SSL/TLS**: HTTPS enabled with valid certificate
- [ ] **HSTS**: Enabled once HTTPS is verified working
- [ ] **Rate limiting**: Default limits in place; adjust per usage patterns
- [ ] **Secrets**: `.env` in `.gitignore`; consider external secrets manager
- [ ] **Updates**: Regular Docker image rebuilds for security patches
- [ ] **Backups**: Automated daily backups verified weekly
- [ ] **Monitoring**: Uptime monitoring and error alerting configured
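
A few of these items can be checked automatically before deployment. The sketch below validates a dictionary of settings against the token length, CORS, and database password items. `check_settings` is a hypothetical helper, and it assumes `API_CORS_ALLOW_ORIGINS` is a comma-separated list, which this guide does not confirm.

```python
def check_settings(env):
    """Return a list of problems found in a dict of environment settings."""
    problems = []

    token = env.get("API_ADMIN_TOKEN", "")
    if len(token) < 32:
        problems.append("API_ADMIN_TOKEN is shorter than 32 characters")

    # Assumption: multiple origins are separated by commas.
    raw_origins = env.get("API_CORS_ALLOW_ORIGINS", "")
    origins = [o.strip() for o in raw_origins.split(",") if o.strip()]
    if not origins or "*" in origins:
        problems.append("API_CORS_ALLOW_ORIGINS must list specific origins")

    if not env.get("POSTGRES_PASSWORD"):
        problems.append("POSTGRES_PASSWORD is not set")

    return problems
```

Running `check_settings(dict(os.environ))` in a pre-deploy step and failing on a non-empty result would catch these misconfigurations early.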

---

## Troubleshooting

### Container exits immediately

```bash
# Check logs
docker compose -f docker-compose.prod.yml logs api

# Common causes:
# - Missing .env file or environment variables
# - Database not yet healthy (depends_on race)
# - Port conflict on host
```

### Database connection refused

```bash
# Verify db is running
docker compose -f docker-compose.prod.yml ps db

# Check db logs
docker compose -f docker-compose.prod.yml logs db

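# Test connection from api container
# (single quotes keep the host shell from expanding DATABASE_URL; it is read inside
#  the container. psycopg expects a plain postgresql:// URL without a SQLAlchemy driver suffix.)
docker compose -f docker-compose.prod.yml exec api \
  python -c 'import os, psycopg; psycopg.connect(os.environ["DATABASE_URL"])'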
```

### Health check failing

```bash
# Always use the nginx proxy URL for external monitoring
curl http://localhost/health

# If nginx returns 502, check the api container
docker compose -f docker-compose.prod.yml logs api --tail=50
```

### Out of disk space

```bash
# Check Docker disk usage
docker system df

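# Clean unused images, containers, and networks
# (adding --volumes also deletes unused volumes, including the database volume
#  if the stack is stopped; add it only when that data loss is intended)
docker system prune -a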

# Remove old backups
find backups/ -name "*.sql.gz" -mtime +30 -delete
```

### SSL certificate expired

```bash
# Renew with certbot
certbot renew

# Restart nginx to pick up new certificates
docker compose -f docker-compose.prod.yml restart nginx
```
