# TipSharks Monitoring Guide

This guide covers monitoring setup, alert configuration, key metrics, and runbooks for the TipSharks ELO API platform.

## Table of Contents

1. [Architecture Overview](#architecture-overview)
2. [Grafana Dashboard](#grafana-dashboard)
3. [Prometheus Alerts](#prometheus-alerts)
4. [Key Metrics](#key-metrics)
5. [Runbook: Common Alerts](#runbook-common-alerts)
6. [Instrumentation Guide](#instrumentation-guide)

---

## Architecture Overview

```
┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│  TipSharks  │────▶│  Prometheus  │────▶│   Grafana   │
│  API        │     │  (scrape)    │     │ (dashboards)│
└─────────────┘     └──────┬───────┘     └─────────────┘
                           │
                    ┌──────▼───────┐
                    │  Alertmanager│
                    │  (notify)    │
                    └──────┬───────┘
                           │
              ┌────────────┼────────────┐
              ▼            ▼            ▼
           Slack       PagerDuty      Email
```

The monitoring stack consists of:

- **Prometheus** — Time-series database that scrapes `/metrics` endpoints from all services
- **Grafana** — Visualization and alerting dashboard
- **Alertmanager** — Handles alert deduplication, silencing, and routing to notification channels
- **PostgreSQL Exporter** — Exposes database metrics (connections, disk, query performance)
- **Redis Exporter** — Exposes cache metrics (hit rate, memory, connections)

---

## Grafana Dashboard

### Import the Dashboard

1. Open Grafana at `http://grafana:3000` (default credentials: `admin`/`admin`)
2. Navigate to **Dashboards** → **Import**
3. Upload the file `infrastructure/grafana/dashboards/tipsharks-api.json`
4. Select the **Prometheus** data source
5. Click **Import**

### Dashboard Panels

| Panel | Type | Description |
|-------|------|-------------|
| API Request Rate | Time series | Requests per second (total and 5xx) |
| API Latency p95/p99 | Time series | Response latency percentiles |
| HTTP Status Code Distribution | Pie chart | Status code breakdown (2xx, 4xx, 5xx) |
| Database Connection Pool Usage | Time series | Active/idle/max connections |
| Ingestion Success Rate | Stat (single) | Percentage of successful ingestion runs |
| Ingestion Failure Count (24h) | Stat (single) | Number of failed ingestion runs |
| Rating Distribution | Histogram | Distribution of current rating values |
| Top 10 Entities by Rating | Table | Highest-rated horses, drivers, trainers |
| Winner Accuracy Trend | Time series | Prediction accuracy over evaluation windows |
| Brier Score Trend | Time series | Probability calibration quality |

### Dashboard Configuration

- **Refresh Interval**: 30 seconds
- **Time Range**: Last 6 hours (default)
- **Theme**: Dark
- **Datasource**: Prometheus (`uid: prometheus`)

---

## Prometheus Alerts

### Configure Alert Rules

Alert rules are defined in `infrastructure/grafana/alerts/api-alerts.yaml`.

#### Option A: Grafana Provisioning (Recommended)

1. Copy the alert rules file to Grafana's provisioning directory:
   ```bash
   cp infrastructure/grafana/alerts/api-alerts.yaml /etc/grafana/provisioning/alerting/
   ```
2. Restart Grafana:
   ```bash
   docker compose restart grafana
   ```

#### Option B: Manual Import via Grafana UI

1. Navigate to **Alerting** → **Alert rules**
2. Click **New alert rule**
3. Configure each rule manually using the expressions in the YAML file

### Alert Rules Summary

| Rule | Condition | For | Severity |
|------|-----------|-----|----------|
| API 5xx Error Rate > 1% | `5xx rate / total rate * 100 > 1` | 5m | **critical** |
| API p95 Latency > 500ms | `histogram_quantile(0.95, ...) * 1000 > 500` | 5m | **critical** |
| Ingestion Failure Rate > 5% | `failure rate / total rate * 100 > 5` | 10m | warning |
| Database Disk Usage > 80% | `used / total * 100 > 80` | 5m | warning |
| API Request Rate = 0 | `rate(requests[5m]) == 0` | 5m | **critical** |
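
The conditions in the table are abbreviated. As a rough sketch, they might expand to PromQL along the following lines, using the metric names from the Key Metrics section. The authoritative expressions are the ones in `api-alerts.yaml`, so treat these expansions as assumptions. The Prometheus address `http://prometheus:9090` and the `instant_query_url` helper are also hypothetical. The helper builds a URL for Prometheus's instant-query HTTP endpoint (`/api/v1/query`) so an expression can be checked by hand before it is put into a rule.

```python
from urllib.parse import urlencode

# Assumed expansions of the abbreviated conditions; the real expressions
# are defined in infrastructure/grafana/alerts/api-alerts.yaml.
ALERT_EXPRESSIONS = {
    "api_5xx_error_rate": (
        'sum(rate(tipsharks_api_requests_total{status=~"5.."}[5m]))'
        " / sum(rate(tipsharks_api_requests_total[5m])) * 100 > 1"
    ),
    "api_p95_latency": (
        "histogram_quantile(0.95, sum by (le) "
        "(rate(tipsharks_api_request_duration_seconds_bucket[5m]))) * 1000 > 500"
    ),
    "api_request_rate_zero": "sum(rate(tipsharks_api_requests_total[5m])) == 0",
}


def instant_query_url(expr: str, base_url: str = "http://prometheus:9090") -> str:
    """Build a URL for Prometheus's instant-query endpoint (/api/v1/query)."""
    query_string = urlencode({"query": expr})
    return f"{base_url}/api/v1/query?{query_string}"


if __name__ == "__main__":
    # Print one URL per rule; each can be opened with curl or a browser.
    for name, expr in ALERT_EXPRESSIONS.items():
        print(name, instant_query_url(expr))
```

The comparison operators (`> 1`, `> 500`, `== 0`) make each query return data only while its condition holds, which is convenient when checking a threshold manually.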

### Notification Channels (Placeholders)

Configure these in Grafana **Alerting** → **Contact points**:

| Channel | Address | Notes |
|---------|---------|-------|
| Slack | `#tipsharks-alerts` | Critical and warning alerts |
| Email | `ops@tipsharks.local` | Critical alerts only |
| PagerDuty | TipSharks API Integration | Critical alerts only (on-call rotation) |

---

## Key Metrics

### API Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `tipsharks_api_requests_total` | Counter | `status`, `method`, `endpoint` | Total API requests |
| `tipsharks_api_request_duration_seconds` | Histogram | `method`, `endpoint` | Request latency in seconds |
| `tipsharks_api_requests_in_flight` | Gauge | — | Currently active requests |

### Database Metrics (via postgres_exporter)

| Metric | Description |
|--------|-------------|
| `pg_stat_activity_count` | Active database connections |
| `pg_database_size_bytes` | Database disk usage |
| `pg_stat_database_tup_fetched` | Query throughput |
| `pg_stat_database_blk_read_time` | I/O wait time |

### Ingestion Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `tipsharks_ingestion_success_total` | Counter | `source` | Successful ingestion runs |
| `tipsharks_ingestion_failure_total` | Counter | `source` | Failed ingestion runs |
| `tipsharks_ingestion_duration_seconds` | Histogram | `source` | Ingestion run duration |

### Rating Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `tipsharks_entity_rating` | Gauge | `entity_type`, `entity_id`, `name` | Current rating per entity |
| `tipsharks_rating_distribution` | Histogram | `entity_type` | Distribution of all ratings |

### Evaluation Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `tipsharks_eval_winner_accuracy` | Gauge | Winner prediction accuracy (0-1) |
| `tipsharks_eval_brier_score` | Gauge | Brier score (0-1, lower is better) |
| `tipsharks_eval_sample_size` | Gauge | Number of races in evaluation window |
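
For reference, the two quality gauges could be computed roughly as follows. This is a minimal sketch that assumes a binary Brier score (the mean squared difference between each predicted win probability and its 0/1 outcome) and an accuracy defined as the share of races in which the predicted winner actually won. The evaluation job's real definitions may differ, and the function names are hypothetical.

```python
from typing import Sequence


def brier_score(probabilities: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(probabilities) != len(outcomes) or not probabilities:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = ((p - o) ** 2 for p, o in zip(probabilities, outcomes))
    return sum(squared_errors) / len(outcomes)


def winner_accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of races where the predicted winner matched the actual winner."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and the same length")
    hits = sum(1 for p, a in zip(predicted, actual) if p == a)
    return hits / len(actual)
```

Under this definition a fully confident, correct prediction contributes 0 to the Brier score, while a 50/50 guess contributes 0.25 regardless of the outcome.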

---

## Runbook: Common Alerts

### API 5xx Error Rate > 1%

**Severity**: Critical

**Symptoms**:
- Spike in HTTP 500 responses
- Users report errors in the mobile app
- `rate(tipsharks_api_requests_total{status=~"5.."}[5m])` shows elevated values

**Triage**:
1. **Check API logs:**
   ```bash
   docker compose logs api --tail=200 | grep "ERROR"
   ```
2. **Test the health endpoint:**
   ```bash
   curl -f http://localhost:8000/health
   ```
3. **Check recent deployments:** Was a code change just deployed?
4. **Check database connectivity:**
   ```bash
   docker compose exec db psql -U tipsharks -c "SELECT 1;"
   ```
5. **Check for external API failures:** Is the TAB API or HRNZ scraper returning errors?

**Resolution**:
- If database is down: `docker compose restart db`
- If code issue: Rollback to last known good version
- If external API: The error is likely transient; monitor for recovery and confirm the 5xx rate returns to normal

**Escalation**:
- If unresolved after 15 minutes: Page on-call engineer
- If widespread: Consider rolling back the latest deployment

---

### API p95 Latency > 500ms

**Severity**: Critical

**Symptoms**:
- API responses feel slow
- `histogram_quantile(0.95, ...)` exceeds 500ms

**Triage**:
1. **Identify slow endpoints:** In Grafana, filter the latency panel by the `endpoint` label.
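2. **Check database query performance** (requires the `pg_stat_statements` extension; on PostgreSQL 12 and earlier the column is named `mean_time`):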
   ```bash
   docker compose exec db psql -U tipsharks -c "
     SELECT query, calls, mean_exec_time
     FROM pg_stat_statements
     ORDER BY mean_exec_time DESC
     LIMIT 10;
   "
   ```
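3. **Check connection pool exhaustion** (compare open connections with `max_connections`):
   ```bash
   docker compose exec db psql -U tipsharks -c "
     SELECT count(*) AS connections,
            current_setting('max_connections') AS max_connections
     FROM pg_stat_activity;
   "
   ```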
4. **Check CPU/memory on API container:**
   ```bash
   docker compose stats api
   ```

**Resolution**:
- Add missing database indexes
- Increase database connection pool size (`DATABASE_POOL_SIZE`)
- Scale API horizontally (add more instances)
- Add caching for frequently queried endpoints

**Escalation**:
- If latency > 1s for 5 minutes: Page on-call engineer

---

### Ingestion Failure Rate > 5%

**Severity**: Warning

**Symptoms**:
- `tipsharks_ingestion_failure_total` counter is rising
- Recent races may not have ratings computed

**Triage**:
1. **Check ingestion logs:**
   ```bash
   docker compose logs worker --tail=100
   ```
2. **Check external API status:**
   ```bash
   curl -I https://api.beta.tab.com.au/v1/
   ```
3. **Check HRNZ scraper:**
   ```bash
   curl -I https://infohorse.hrnz.co.nz/datahrs/results/010131rs.htm
   ```
4. **Check database size:**
   ```bash
   docker compose exec db psql -U tipsharks -c "
     SELECT pg_size_pretty(pg_database_size('tipsharks'));
   "
   ```

**Resolution**:
- If external API is down: Retry later (ingestion has built-in retries)
- If database is full: Run cleanup or increase disk
- If schema error: Run pending migrations: `alembic upgrade head`

**Escalation**:
- If failure persists for > 1 hour: Page on-call engineer

---

### Database Disk Usage > 80%

**Severity**: Warning

**Symptoms**:
- `tipsharks_db_disk_used_bytes / tipsharks_db_disk_total_bytes * 100 > 80`
- Writes will start failing, and PostgreSQL may shut down, if the disk fills

**Triage**:
1. **Check current disk usage:**
   ```bash
   docker compose exec db psql -U tipsharks -c "
     SELECT pg_size_pretty(pg_database_size('tipsharks'));
   "
   ```
2. **Find largest tables:**
   ```bash
   docker compose exec db psql -U tipsharks -c "
     SELECT
       relname AS table,
       pg_size_pretty(pg_total_relation_size(relid)) AS total
     FROM pg_catalog.pg_statio_user_tables
     ORDER BY pg_total_relation_size(relid) DESC
     LIMIT 10;
   "
   ```

**Resolution**:
1. **Short term — Clean up old data:**
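   ```bash
   docker compose exec db psql -U tipsharks -c "
     DELETE FROM rating_snapshots
     WHERE as_of_race_id IN (
       SELECT r.id FROM races r
       JOIN meetings m ON r.meeting_id = m.id
       WHERE m.meeting_date < NOW() - INTERVAL '2 years'
     );
   "
   ```
   `DELETE` only marks rows as dead. Run `VACUUM` afterwards so the space can be reused; `VACUUM FULL` returns it to the operating system but takes an exclusive lock on the table.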
2. **Medium term — Increase disk:**
   - Adjust Docker volume size or cloud disk size
3. **Long term — Implement data retention policy:**
   - Keep rating snapshots for 2 years
   - Archive raw race data after 1 year
   - Set up automated cleanup job

**Escalation**:
- If > 90%: Page on-call engineer immediately

---

### API Request Rate Drops to Zero

**Severity**: Critical

**Symptoms**:
- `rate(tipsharks_api_requests_total[5m]) == 0`, sustained for 5 minutes
- All API endpoints are unresponsive
- Mobile app shows connection errors

**Triage**:
1. **Check if the container is running:**
   ```bash
   docker compose ps api
   ```
2. **Check container logs:**
   ```bash
   docker compose logs api --tail=50
   ```
3. **Check host resource usage:**
   ```bash
   docker compose stats --no-stream
   ```
4. **Test direct connectivity:**
   ```bash
   curl -f http://localhost:8000/health
   ```
5. **Check if port is listening:**
   ```bash
   ss -tlnp | grep 8000
   ```

**Resolution**:
1. **Restart the API service:**
   ```bash
   docker compose restart api
   ```
2. **If container won't start:**
   ```bash
   docker compose logs api | tail -50
   # Fix the underlying issue (config, port conflict, etc.)
   docker compose up -d api
   ```
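3. **If the host is out of disk space:**
   ```bash
   # Removes unused images, stopped containers, networks, and build cache.
   # Add --volumes only after confirming that no stopped container
   # (such as db) owns a volume with data you still need.
   docker system prune -a
   ```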

**Escalation**:
- Immediate: Page on-call engineer
- If infrastructure issue: Notify DevOps team

---

## Instrumentation Guide

### Adding Metrics to the API

The `/metrics` endpoint is a stub in `apps/backend/api/main.py`. To add real metrics:

1. Install `prometheus-client`:
   ```bash
   pip install prometheus-client
   ```

2. Create a metrics module at `packages/common/metrics.py`:
   ```python
   from prometheus_client import Counter, Histogram, Gauge

   api_requests_total = Counter(
       'tipsharks_api_requests_total',
       'Total API requests',
       ['status', 'method', 'endpoint']
   )

   api_request_duration_seconds = Histogram(
       'tipsharks_api_request_duration_seconds',
       'API request duration in seconds',
       ['method', 'endpoint'],
       buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
   )

   entity_rating = Gauge(
       'tipsharks_entity_rating',
       'Current rating per entity',
       ['entity_type', 'entity_id', 'name']
   )
   ```

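3. Update the `/metrics` endpoint to serve real metrics:
   ```python
   from fastapi import Response
   from prometheus_client import CONTENT_TYPE_LATEST, REGISTRY, generate_latest

   # Assumes `app` is the FastAPI instance defined in apps/backend/api/main.py
   @app.get("/metrics")
   def metrics() -> Response:
       # generate_latest() returns bytes in the Prometheus text exposition format
       return Response(
           content=generate_latest(REGISTRY),
           media_type=CONTENT_TYPE_LATEST,
       )
   ```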

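4. Add middleware to instrument requests:
   ```python
   import time

   from fastapi import Request

   # Import path is an assumption; adjust to how packages/common/metrics.py is exposed.
   from packages.common.metrics import api_request_duration_seconds, api_requests_total

   @app.middleware("http")
   async def metrics_middleware(request: Request, call_next):
       method = request.method
       # Raw paths create one time series per distinct URL, so paths that
       # contain IDs can inflate label cardinality.
       endpoint = request.url.path
       start_time = time.perf_counter()  # monotonic clock, suited to measuring durations
       response = await call_next(request)
       duration = time.perf_counter() - start_time
       api_requests_total.labels(
           status=str(response.status_code),
           method=method,
           endpoint=endpoint,
       ).inc()
       api_request_duration_seconds.labels(
           method=method,
           endpoint=endpoint,
       ).observe(duration)
       return response
   ```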
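
Using the raw URL path as the `endpoint` label can create a separate time series for every distinct ID that appears in a path. One possible way to limit this is to normalize the path before using it as a label value. The sketch below assumes IDs appear as purely numeric or UUID path segments. The helper name `normalize_endpoint` and the example paths are hypothetical and not part of the existing codebase.

```python
import re

# Matches a canonical 8-4-4-4-12 hexadecimal UUID.
_UUID = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")


def normalize_endpoint(path: str) -> str:
    """Replace numeric and UUID path segments with a fixed placeholder."""
    segments = []
    for segment in path.split("/"):
        if segment.isdigit() or _UUID.match(segment):
            segments.append("{id}")
        else:
            segments.append(segment)
    return "/".join(segments)


if __name__ == "__main__":
    print(normalize_endpoint("/horses/123/ratings"))
```

In the middleware, this would replace the raw path with `endpoint = normalize_endpoint(request.url.path)`. If the framework exposes the matched route template, using that instead avoids guessing which segments are IDs.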

### Adding Metrics to the Worker

```python
from prometheus_client import Counter, Histogram

ingestion_success_total = Counter(
    'tipsharks_ingestion_success_total',
    'Successful ingestion runs',
    ['source']
)

ingestion_failure_total = Counter(
    'tipsharks_ingestion_failure_total',
    'Failed ingestion runs',
    ['source']
)

ingestion_duration_seconds = Histogram(
    'tipsharks_ingestion_duration_seconds',
    'Ingestion run duration in seconds',
    ['source']
)
```
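
The definitions above only declare the metrics; the worker still has to update them around each run. A minimal sketch of one way to do that is a context manager that records the outcome and duration of a run. The name `track_ingestion` is hypothetical, and the metric objects are passed in as arguments so the helper only relies on the `labels(...).inc()` and `labels(...).observe()` calls that `prometheus_client` counters and histograms provide.

```python
import time
from contextlib import contextmanager


@contextmanager
def track_ingestion(source, success_counter, failure_counter, duration_histogram):
    """Record success or failure and the duration of one ingestion run."""
    start = time.perf_counter()
    try:
        yield
    except Exception:
        # Count the failure, then let the exception propagate to the caller.
        failure_counter.labels(source=source).inc()
        raise
    else:
        success_counter.labels(source=source).inc()
    finally:
        # Duration is observed for both successful and failed runs.
        elapsed = time.perf_counter() - start
        duration_histogram.labels(source=source).observe(elapsed)
```

A run would then be wrapped as `with track_ingestion("tab", ingestion_success_total, ingestion_failure_total, ingestion_duration_seconds): ...`, where `"tab"` is an assumed value for the `source` label. The worker also needs to expose these metrics for scraping, for example with `prometheus_client.start_http_server` on a port of your choosing.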

---