# Grafana Monitoring Metrics

This document describes all available Prometheus metrics for monitoring the racing data scraper application in Grafana.

## Overview

The application exposes metrics at `http://localhost:9090/metrics` in Prometheus format. The application-specific metrics are organized into four priority levels; existing service metrics and default system metrics are listed after them:

1. **Scheduler Observability** - Track scheduler execution and health
2. **Data Quality** - Monitor completeness and correctness of scraped data
3. **API & Resource Health** - Track external dependencies and resource usage
4. **Results & Dividends** - Monitor race results processing

---

## Priority 1: Scheduler Observability

### scheduler_runs_total
**Type:** Counter
**Labels:** `schedule_type`, `status`
**Description:** Total number of scheduler executions
**Status values:** `success`, `partial`, `failed`, `failure`

**Example PromQL queries:**
```promql
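# Success rate by scheduler (aggregate away `status` so both sides of the division match)
sum by (schedule_type) (rate(scheduler_runs_total{status="success"}[5m]))
  / sum by (schedule_type) (rate(scheduler_runs_total[5m]))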

# Failed scheduler runs in last hour
increase(scheduler_runs_total{status=~"failed|failure"}[1h])
```

### scheduler_duration_seconds
**Type:** Histogram
**Labels:** `schedule_type`
**Description:** Duration of scheduler executions
**Buckets:** 1s, 5s, 10s, 30s, 60s, 120s, 300s

**Example PromQL queries:**
```promql
# P95 scheduler duration
histogram_quantile(0.95, rate(scheduler_duration_seconds_bucket[5m]))

# Average duration by scheduler
rate(scheduler_duration_seconds_sum[5m]) / rate(scheduler_duration_seconds_count[5m])
```

### scheduler_last_success_timestamp_seconds
**Type:** Gauge
**Labels:** `schedule_type`
**Description:** Unix timestamp of last successful scheduler execution

**Example PromQL queries:**
```promql
# Time since last successful run (in minutes)
(time() - scheduler_last_success_timestamp_seconds) / 60

# Alert if scheduler hasn't run successfully in 2 hours
(time() - scheduler_last_success_timestamp_seconds{schedule_type="morning_scrape"}) > 7200
```

### scheduler_currently_running
**Type:** Gauge
**Labels:** `schedule_type`
**Description:** Whether scheduler is currently executing (1 = running, 0 = idle)
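
Because this gauge is either 0 or 1, it can be filtered directly or averaged over time. A minimal sketch; the 1h window is an illustrative choice, not a documented setting:
```promql
# Schedulers that are executing right now
scheduler_currently_running == 1

# Approximate fraction of the last hour each scheduler spent running
avg_over_time(scheduler_currently_running[1h])
```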

### scheduler_items_processed_total
**Type:** Counter
**Labels:** `schedule_type`
**Description:** Total items processed by schedulers

**Example PromQL queries:**
```promql
# Items processed per hour
rate(scheduler_items_processed_total[1h]) * 3600
```

### scheduler_items_failed_total
**Type:** Counter
**Labels:** `schedule_type`
**Description:** Total items that failed processing

**Example PromQL queries:**
```promql
# Failure rate
rate(scheduler_items_failed_total[5m]) / rate(scheduler_items_processed_total[5m])
```

### scheduler_errors_total
**Type:** Counter
**Labels:** `schedule_type`, `error_type`
**Description:** Total errors by scheduler and error type
**Error types:** `job_failure`, `exception`
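
The two labels allow errors to be broken down per scheduler and per error type. A sketch, with the 1h window as an illustrative assumption:
```promql
# Errors in the last hour, by scheduler and error type
sum by (schedule_type, error_type) (increase(scheduler_errors_total[1h]))

# Share of all scheduler errors with error_type="exception"
sum(rate(scheduler_errors_total{error_type="exception"}[1h])) / sum(rate(scheduler_errors_total[1h]))
```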

---

## Priority 2: Data Quality Metrics

### races_missing_runners_total
**Type:** Gauge
**Description:** Number of races today with no runner data

**Example alert:**
```promql
races_missing_runners_total > 5
```

### races_missing_results_total
**Type:** Gauge
**Labels:** `minutes_after_start`
**Description:** Number of races in `Final` status that still have no results, measured at fixed intervals (in minutes) after the race start
**Label values:** `15`, `30`, `60`

**Example PromQL queries:**
```promql
# Races still missing results 30 minutes after start
races_missing_results_total{minutes_after_start="30"}
```

### odds_snapshots_missing_total
**Type:** Gauge
**Labels:** `snapshot_type`
**Description:** Number of runners missing odds snapshots
**Snapshot types:** `morning`, `t60`, `t15`, `final`

**Example PromQL queries:**
```promql
# Total missing snapshots across all types
sum(odds_snapshots_missing_total)

# Morning odds missing
odds_snapshots_missing_total{snapshot_type="morning"}
```

### meetings_active_total
**Type:** Gauge
**Labels:** `country`, `category`
**Description:** Number of active meetings today
**Countries:** `AUS`, `NZL`
**Categories:** `T` (thoroughbred), `H` (harness)

**Example PromQL queries:**
```promql
# Total meetings today
sum(meetings_active_total)

# Australian thoroughbred meetings
meetings_active_total{country="AUS", category="T"}
```

### races_by_status_total
**Type:** Gauge
**Labels:** `status`
**Description:** Number of races by status
**Status values:** `Open`, `Closed`, `Interim`, `Final`, `Abandoned`

**Example PromQL queries:**
```promql
# Races currently open for betting
races_by_status_total{status="Open"}

# Final races (completed)
races_by_status_total{status="Final"}
```

### runners_scratched_total
**Type:** Gauge
**Labels:** `timing`
**Description:** Number of scratched runners by timing relative to race start
**Timing values:** `before_t60`, `t60_to_t15`, `after_t15`

**Example PromQL queries:**
```promql
# Late scratches (after T-15)
runners_scratched_total{timing="after_t15"}

# Total scratches today
sum(runners_scratched_total)
```

### races_today_total
**Type:** Gauge
**Labels:** `country`, `category`
**Description:** Total number of races scheduled for today

### runners_per_race_average
**Type:** Gauge
**Labels:** `country`, `category`
**Description:** Average number of runners per race
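
`races_today_total` and `runners_per_race_average` share the same labels, so they can be aggregated or combined series by series. A sketch; the runner estimate assumes both gauges are computed over the same set of races, which this document does not confirm:
```promql
# Races scheduled today, per country
sum by (country) (races_today_total)

# Rough estimate of total runners today (assumes both gauges cover the same races)
sum(races_today_total * runners_per_race_average)
```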

---

## Priority 3: API & Resource Health

### tab_api_request_duration_seconds
**Type:** Histogram
**Labels:** `endpoint`, `status`
**Description:** Duration of TAB API requests
**Buckets:** 0.1s, 0.5s, 1s, 2s, 5s, 10s

**Example PromQL queries:**
```promql
# P95 API latency
histogram_quantile(0.95, rate(tab_api_request_duration_seconds_bucket[5m]))

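# Slow requests (>2s): all requests minus those that completed within 2s
sum by (endpoint) (rate(tab_api_request_duration_seconds_count[5m]))
  - sum by (endpoint) (rate(tab_api_request_duration_seconds_bucket{le="2"}[5m]))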
```

### tab_api_requests_total
**Type:** Counter
**Labels:** `endpoint`, `status`
**Description:** Total number of TAB API requests
**Status values:** `success`, `error`

**Example PromQL queries:**
```promql
# Request rate by endpoint (summed across `status`)
sum by (endpoint) (rate(tab_api_requests_total[5m]))

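# Error rate by endpoint (aggregate away `status` so both sides of the division match)
sum by (endpoint) (rate(tab_api_requests_total{status="error"}[5m]))
  / sum by (endpoint) (rate(tab_api_requests_total[5m]))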
```

### tab_api_errors_total
**Type:** Counter
**Labels:** `endpoint`, `error_type`
**Description:** Total number of TAB API errors
**Error types:** `http_4xx`, `http_5xx`, `timeout`, `network_error`, `validation_error`
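
The `error_type` label makes it possible to separate client-side, server-side, and transport failures. A sketch; the windows and the `topk` limit are illustrative assumptions:
```promql
# Error rate by error type
sum by (error_type) (rate(tab_api_errors_total[5m]))

# Three endpoints with the most errors in the last hour
topk(3, sum by (endpoint) (increase(tab_api_errors_total[1h])))

# Server-side failures and timeouts only
sum(rate(tab_api_errors_total{error_type=~"http_5xx|timeout"}[5m]))
```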

### tab_api_rate_limiter_queue_depth
**Type:** Gauge
**Description:** Number of requests queued in rate limiter

**Example alert:**
```promql
tab_api_rate_limiter_queue_depth > 50
```

### tab_api_rate_limiter_running
**Type:** Gauge
**Description:** Number of requests currently executing
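
Read together with `tab_api_rate_limiter_queue_depth`, this gauge shows how much work the rate limiter is holding. A sketch, with the 15m window as an illustrative assumption:
```promql
# Peak concurrency over the last 15 minutes
max_over_time(tab_api_rate_limiter_running[15m])

# Total outstanding requests (executing plus queued)
tab_api_rate_limiter_running + tab_api_rate_limiter_queue_depth
```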

---

## Priority 4: Results & Dividends

### race_results_processed_total
**Type:** Counter
**Description:** Total number of race results processed

**Example PromQL queries:**
```promql
# Results processed per hour
rate(race_results_processed_total[1h]) * 3600
```

### race_results_capture_latency_seconds
**Type:** Histogram
**Description:** Time from race start to result capture
**Buckets:** 300s (5min), 600s (10min), 900s (15min), 1800s (30min), 3600s (1hr), 7200s (2hr)

**Example PromQL queries:**
```promql
# P95 result capture latency
histogram_quantile(0.95, rate(race_results_capture_latency_seconds_bucket[1h]))

# Average time to capture results
rate(race_results_capture_latency_seconds_sum[1h]) / rate(race_results_capture_latency_seconds_count[1h])
```
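
Histogram buckets can also answer "what share of results arrived within a deadline". A sketch using the 900s bucket listed above; treating 15 minutes as the target is an assumption, not a documented objective:
```promql
# Fraction of results captured within 15 minutes of race start
sum(rate(race_results_capture_latency_seconds_bucket{le="900"}[1h]))
  / sum(rate(race_results_capture_latency_seconds_count[1h]))
```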

### race_dividends_by_product_total
**Type:** Counter
**Labels:** `product_name`, `tote`
**Description:** Total dividends processed by product and tote
**Product names:** `Win`, `Place`, `Quinella`, `Exacta`, `Trifecta`, `First4`, etc.
**Tote values:** State-based totes (e.g., `VIC`, `NSW`, `QLD`)

**Example PromQL queries:**
```promql
# Win dividends processed
rate(race_dividends_by_product_total{product_name="Win"}[5m])

# Dividends by tote
sum(rate(race_dividends_by_product_total[5m])) by (tote)
```

### race_dividends_processed_total
**Type:** Counter
**Labels:** `product_name`
**Description:** Total number of dividends processed
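
A sketch of a per-product daily total, which could feed the dividends bar chart; the 24h window is an illustrative choice:
```promql
# Dividends processed in the last 24 hours, by product
sum by (product_name) (increase(race_dividends_processed_total[24h]))
```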

---

## Existing Service Metrics

### meetings_processed_total
**Type:** Counter
**Labels:** `operation`, `status`
**Description:** Total number of meetings processed

### races_processed_total
**Type:** Counter
**Labels:** `operation`, `status`
**Description:** Total number of races processed

### race_runners_processed_total
**Type:** Counter
**Labels:** `scrape_type`, `status`
**Description:** Total number of runners processed
**Scrape types:** `morning_scrape`, `pre_race_t60`, `pre_race_t15`, `post_race`

### race_odds_snapshots_total
**Type:** Counter
**Labels:** `snapshot_type`
**Description:** Total number of odds snapshots captured

### meeting_service_operation_duration_seconds
**Type:** Histogram
**Labels:** `operation`
**Description:** Duration of meeting service operations

### race_service_operation_duration_seconds
**Type:** Histogram
**Labels:** `operation`
**Description:** Duration of race service operations
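
The service metrics can be queried in the same way as the prioritized ones. A sketch using only the labels listed above; the windows are illustrative assumptions:
```promql
# P95 duration of race service operations, per operation
histogram_quantile(0.95,
  sum by (operation, le) (rate(race_service_operation_duration_seconds_bucket[5m]))
)

# Runners processed in the last hour, by scrape type
sum by (scrape_type) (increase(race_runners_processed_total[1h]))

# Odds snapshots captured per minute, by snapshot type
sum by (snapshot_type) (rate(race_odds_snapshots_total[5m])) * 60
```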

---

## System Metrics (Default)

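These are exposed by the Prometheus client library's default collectors rather than by application code (the `nodejs_*` names match the defaults of the Node.js `prom-client` library):

- `process_cpu_seconds_total` - Cumulative CPU time consumed, in seconds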
- `process_resident_memory_bytes` - Memory usage
- `nodejs_eventloop_lag_seconds` - Event loop lag
- `nodejs_heap_size_total_bytes` - Heap size
- `nodejs_heap_size_used_bytes` - Used heap
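
Raw counters and byte gauges are usually easier to read on a panel once converted to rates or ratios. A sketch; the 5m window is an illustrative assumption:
```promql
# CPU cores in use, averaged over 5 minutes
rate(process_cpu_seconds_total[5m])

# Heap utilisation as a percentage
nodejs_heap_size_used_bytes / nodejs_heap_size_total_bytes * 100

# Resident memory in MiB
process_resident_memory_bytes / 1024 / 1024
```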

---

## Grafana Dashboard Recommendations

### 1. Overview Dashboard
- Scheduler success rates (gauge)
- API request rate (time series)
- Active meetings by country (bar chart)
- Races by status (pie chart)
- System resources (CPU, memory)

### 2. Scheduler Health Dashboard
- Last successful run by scheduler (table)
- Scheduler duration P95 (time series)
- Items processed/failed (stacked area)
- Currently running schedulers (status list)

### 3. Data Quality Dashboard
- Missing runners (single stat)
- Missing results by time (heatmap)
- Missing odds snapshots (bar chart)
- Scratched runners by timing (pie chart)
- Races today vs processed (comparison gauge)

### 4. API Health Dashboard
- Request rate by endpoint (time series)
- Error rate percentage (gauge)
- P95 latency (time series)
- Rate limiter queue depth (time series)

### 5. Results Dashboard
- Results capture latency histogram
- Results processed rate (time series)
- Dividends by product (bar chart)
- Dividends by tote (pie chart)

---

## Alert Recommendations

### Critical Alerts

```promql
# Scheduler hasn't run successfully in 2+ hours
(time() - scheduler_last_success_timestamp_seconds{schedule_type="morning_scrape"}) > 7200

# High API error rate (>5%); sum() removes the `status` label so the division matches
sum(rate(tab_api_requests_total{status="error"}[5m])) / sum(rate(tab_api_requests_total[5m])) > 0.05

# Many races missing results (30+ min after start)
races_missing_results_total{minutes_after_start="30"} > 10
```
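
A comparison against `scheduler_last_success_timestamp_seconds` returns no series when the gauge has not been exported (for example, if it is only set after the first successful run following a restart), so it cannot fire in that state. A sketch of a companion expression using PromQL's `absent()`; whether the application leaves the gauge unset at startup is an assumption:
```promql
# Fires when no last-success timestamp is exported for the morning scrape
absent(scheduler_last_success_timestamp_seconds{schedule_type="morning_scrape"})
```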

### Warning Alerts

```promql
# Rate limiter queue backing up
tab_api_rate_limiter_queue_depth > 50

# High scheduler duration (>5 min)
rate(scheduler_duration_seconds_sum[5m]) / rate(scheduler_duration_seconds_count[5m]) > 300

# Missing morning odds snapshots
odds_snapshots_missing_total{snapshot_type="morning"} > 20
```

---

## Data Collection Frequency

- **Scheduler metrics**: Updated on each scheduler execution
- **Data quality metrics**: Collected every 60 seconds
- **API metrics**: Updated on each API request
- **Service metrics**: Updated on each service operation
- **System metrics**: Collected every 10 seconds (default)

---

## Example Grafana Queries

### Success Rate Over Time
```promql
sum(rate(scheduler_runs_total{status="success"}[5m])) /
sum(rate(scheduler_runs_total[5m]))
```

### Races Processed Per Day
```promql
sum(increase(races_processed_total[24h]))
```

### API Latency by Endpoint
```promql
histogram_quantile(0.95,
  sum(rate(tab_api_request_duration_seconds_bucket[5m])) by (endpoint, le)
)
```

### Data Completeness Percentage
```promql
(sum(races_by_status_total{status="Final"}) - sum(races_missing_results_total{minutes_after_start="30"})) /
sum(races_by_status_total{status="Final"}) * 100
```
