# Elo Tuning Guide

This guide explains how to fine-tune the HarnessElo rating engine via environment variables.
It covers what each knob does, typical ranges, and a safe tuning workflow.

## Quick Start

1. Edit `.env` (or set env vars in your runtime) using the variables listed below.
2. Restart the worker/API so settings reload.
3. Recompute ratings to apply new parameters (use your normal recompute pipeline).
4. Validate results with backtests and sanity checks.

## Core Elo Math

These knobs control the core expected outcome and update size.

- `ELO_SCALE_C` (default `400.0`)
  - Logistic scale factor for expected outcomes.
  - Higher values make outcomes less sensitive to rating differences.
  - Typical range: `200` to `600`.

- `ELO_K_BASE` (default `24.0`)
  - Base update rate per race.
  - Higher values make ratings more volatile.
  - Typical range: `12` to `40`.

- `ELO_K_MIN`, `ELO_K_MAX` (optional)
  - Clamp the effective K-factor after RD scaling.
  - Use these to prevent extreme updates when RD is very high or very low.

- `PAIRWISE_NORMALIZER` (default `n_minus_1`)
  - Controls how pairwise deltas are averaged.
  - `n_minus_1`: current behavior, average over opponents.
  - `n`: slightly smaller updates in larger fields.
  - `comparisons`: uses only comparisons actually counted (useful with `TIE_HANDLING=skip`).
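
As a rough illustration of how these knobs interact, the sketch below uses the textbook logistic Elo curve and a simple pairwise sum. The function names are hypothetical, and the engine's actual update may differ in detail (for example in how ties are scored or when the clamp is applied).

```python
from typing import Optional, Sequence


def expected_score(rating_a: float, rating_b: float, scale_c: float = 400.0) -> float:
    """Standard logistic Elo expectation that A finishes ahead of B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale_c))


def clamp_k(k: float, k_min: Optional[float] = None, k_max: Optional[float] = None) -> float:
    """Apply the optional ELO_K_MIN / ELO_K_MAX bounds to an effective K."""
    if k_min is not None:
        k = max(k, k_min)
    if k_max is not None:
        k = min(k, k_max)
    return k


def pairwise_delta(
    ratings: Sequence[float],
    ranks: Sequence[int],
    idx: int,
    k: float = 24.0,
    scale_c: float = 400.0,
    normalizer: str = "n_minus_1",
) -> float:
    """Rating change for runner `idx` from comparisons against every other runner.

    `ranks` are finishing positions (lower is better). Ties are not handled
    here; this is an assumption made to keep the sketch short.
    """
    total = 0.0
    comparisons = 0
    for j, (rating_j, rank_j) in enumerate(zip(ratings, ranks)):
        if j == idx:
            continue
        actual = 1.0 if ranks[idx] < rank_j else 0.0
        total += actual - expected_score(ratings[idx], rating_j, scale_c)
        comparisons += 1
    n = len(ratings)
    divisor = {"n_minus_1": n - 1, "n": n, "comparisons": comparisons}[normalizer]
    return k * total / divisor if divisor else 0.0
```

With three equally rated runners, the winner's summed surprise is `0.5 + 0.5`, so `n_minus_1` yields a delta of `k / 2` while `n` yields `k / 3`, which is why `n` produces slightly smaller updates.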

## Initial Ratings & Bounds

- `INITIAL_RATING` (default `1500.0`)
  - Starting point for new entities.

- `RATING_MIN`, `RATING_MAX` (optional)
  - Clamp ratings to a fixed band.
  - Useful for preventing drift or extreme outliers.

## Rating Deviation (RD)

RD tracks how uncertain a rating is and, when enabled, scales the K-factor.

- `ENABLE_RD` (default `false`)
  - When true, RD is tracked and used for K scaling.

- `INITIAL_RD` (default `350.0`)
  - Starting RD for new entities.

- `RD_MIN`, `RD_MAX` (defaults `50.0`, `350.0`)
  - Lower/upper bounds on RD.

- `RD_DECAY_PER_RACE` (default `15.0`)
  - Amount subtracted from RD after each race (more races mean more certainty).

- `RD_DECAY_FLOOR` (default `0.0`)
  - Minimum RD decay applied per race.
  - Use to guarantee convergence even if `RD_DECAY_PER_RACE` is set low.

- `RD_INFLATION_PER_DAY` (default `0.5`)
  - RD increase per inactive day.

- `RD_INFLATION_CAP_DAYS` (optional)
  - Cap the number of inactive days used for inflation.

- `RD_SCALING_MODE` (default `linear`)
  - `linear`: K scales directly with RD (current behavior).
  - `sqrt`: gentler scaling for high RD.
  - `none`: RD does not affect K (but RD is still tracked).
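
The RD knobs can be read as a small state machine: RD shrinks per race and grows with inactivity, within `RD_MIN`/`RD_MAX`. The sketch below is one plausible reading, not the engine's code. In particular, the `reference_rd` argument is a hypothetical normalizer: this guide does not state what RD is divided by before it scales K.

```python
import math
from typing import Optional


def rd_after_race(rd: float, decay_per_race: float = 15.0,
                  decay_floor: float = 0.0, rd_min: float = 50.0) -> float:
    """Shrink RD after a race; the floor guarantees a minimum decay (assumed max())."""
    decay = max(decay_per_race, decay_floor)
    return max(rd_min, rd - decay)


def rd_after_inactivity(rd: float, inactive_days: int, inflation_per_day: float = 0.5,
                        cap_days: Optional[int] = None, rd_max: float = 350.0) -> float:
    """Grow RD for inactive days, optionally capped by RD_INFLATION_CAP_DAYS."""
    days = inactive_days if cap_days is None else min(inactive_days, cap_days)
    return min(rd_max, rd + inflation_per_day * days)


def k_multiplier(rd: float, mode: str, reference_rd: float) -> float:
    """Multiplier applied to the base K. `reference_rd` is an assumption of this sketch."""
    ratio = rd / reference_rd
    if mode == "linear":
        return ratio
    if mode == "sqrt":
        return math.sqrt(ratio)
    if mode == "none":
        return 1.0
    raise ValueError(f"unknown RD_SCALING_MODE: {mode}")
```

Under this reading, an RD four times the reference quadruples K in `linear` mode but only doubles it in `sqrt` mode, which matches the description of `sqrt` as gentler for high RD.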

## Multi-Entity Contributions

These knobs change how drivers/trainers contribute to effective ratings and updates.

- `ENABLE_DRIVER`, `ENABLE_TRAINER` (default `true`)
  - Include driver/trainer in effective rating and updates.

- `DRIVER_WEIGHT_ALPHA`, `TRAINER_WEIGHT_BETA` (defaults `0.35`, `0.15`)
  - Contribution of driver/trainer rating to effective rating.

- `HORSE_K_SCALE`, `DRIVER_K_SCALE`, `TRAINER_K_SCALE` (default `1.0`)
  - Extra multipliers on the rating update applied to each entity type.
  - Use if you want updates to be more or less aggressive per entity.

## Condition Adjustments (Barrier/Handicap)

Adjustments model systematic advantages in barriers or handicaps.

- `ENABLE_ADJUSTMENTS` (default `true`)
  - Global toggle for adjustment learning and usage.

- `ADJ_BARRIER_ENABLED`, `ADJ_HANDICAP_ENABLED` (default `true`)
  - Toggle barrier/handicap learning independently.

- `ADJ_LEARNING_RATE` (default `0.5`)
  - Learning rate for incremental updates.

- `ADJ_UPDATE_SCALE` (default `1.0`)
  - Multiplies performance deltas before they are applied.
  - Use to speed up/slow down learning without changing the base learning rate.

- `ADJ_MIN_SAMPLES` (default `0`)
  - Minimum samples required before adjustments are applied.
  - Helps avoid noisy adjustments with small sample sizes.

- `ADJ_CLAMP_MIN`, `ADJ_CLAMP_MAX` (optional)
  - Clamp adjustment values to a fixed band.

- `ADJ_GLOBAL_ONLY` (default `false`)
  - If true, only global adjustments are learned/applied.
  - Useful for sparse venues or early training.
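
To make the roles of the learning rate, update scale, sample gate, and clamps concrete, here is a minimal sketch. It assumes a purely additive update (`value += rate * scale * delta`); the engine's real update rule is not specified in this guide, and the `Adjustment` type is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Adjustment:
    """Learned offset for one barrier/handicap bucket (hypothetical structure)."""
    value: float = 0.0
    samples: int = 0


def learn(adj: Adjustment, performance_delta: float, learning_rate: float = 0.5,
          update_scale: float = 1.0, clamp_min: Optional[float] = None,
          clamp_max: Optional[float] = None) -> Adjustment:
    """Apply one incremental update, then the optional ADJ_CLAMP_MIN/MAX band."""
    value = adj.value + learning_rate * update_scale * performance_delta
    if clamp_min is not None:
        value = max(value, clamp_min)
    if clamp_max is not None:
        value = min(value, clamp_max)
    return Adjustment(value=value, samples=adj.samples + 1)


def applied_value(adj: Adjustment, min_samples: int = 0) -> float:
    """Return the adjustment only once ADJ_MIN_SAMPLES observations exist."""
    return adj.value if adj.samples >= min_samples else 0.0
```

In this form `ADJ_LEARNING_RATE` and `ADJ_UPDATE_SCALE` multiply together, so sweeping them as a grid mostly explores their product; the sample gate and clamps are what limit noise.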

## Place Scoring Signals

These knobs tune the independent place-scoring signal used for top-3 probability.

- `PLACE_HISTORY_LIMIT` (default `8`)
  - Number of recent races used to compute place consistency.

- `PLACE_PRIOR_RATE` (default `0.33`)
  - Prior top-3 rate used for smoothing.

- `PLACE_PRIOR_WEIGHT` (default `3.0`)
  - Weight of the prior top-3 rate in the smoothing formula.

- `PLACE_TOP3_WEIGHT` (default `0.75`)
  - Weight applied to the smoothed top-3 rate signal.

- `PLACE_CONSISTENCY_WEIGHT` (default `0.5`)
  - Weight applied to finish consistency (lower variance in recent placings).
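
The prior rate and prior weight suggest a standard additive-smoothing form, sketched below. This is an assumption about the formula, not a copy of the engine's code, and it covers only the top-3 rate (the consistency signal is left out because its definition is not given here). Placings are assumed to be ordered oldest to newest.

```python
from typing import Sequence


def smoothed_top3_rate(placings: Sequence[int], history_limit: int = 8,
                       prior_rate: float = 0.33, prior_weight: float = 3.0) -> float:
    """Blend the observed top-3 rate over recent races with a prior.

    The prior acts like `prior_weight` pseudo-races finishing in the top 3
    at `prior_rate`, so sparse histories stay close to the prior.
    """
    recent = list(placings[-history_limit:]) if history_limit > 0 else []
    hits = sum(1 for place in recent if place <= 3)
    return (hits + prior_rate * prior_weight) / (len(recent) + prior_weight)
```

With no history the result equals `PLACE_PRIOR_RATE`; raising `PLACE_PRIOR_WEIGHT` makes the signal slower to move away from it.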

## Distance Bucketing

Used by adjustment learning.

- `DISTANCE_BUCKETS` (default `1700,2000,2400`)
  - Thresholds for bucketing distances.

- `DISTANCE_BUCKET_MODE` (default `thresholds`)
  - `thresholds`: use `DISTANCE_BUCKETS` as ranges.
  - `fixed`: use fixed-size buckets.

- `DISTANCE_BUCKET_SIZE` (required for `fixed`)
  - Fixed bucket size in meters.
  - Example: `400` yields buckets like `0-399`, `400-799`, etc.
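
The two modes can be sketched as follows. The function names are hypothetical, and the boundary convention in `thresholds` mode (a distance equal to a threshold falls into the higher bucket) is an assumption; check the engine before relying on it.

```python
from bisect import bisect_right
from typing import Sequence


def bucket_thresholds(distance_m: int, thresholds: Sequence[int] = (1700, 2000, 2400)) -> int:
    """Return a bucket index: 0 below the first threshold, len(thresholds) at or above the last."""
    return bisect_right(sorted(thresholds), distance_m)


def bucket_fixed(distance_m: int, bucket_size: int) -> str:
    """Return a label such as '400-799' for fixed-size buckets in meters."""
    if bucket_size <= 0:
        raise ValueError("DISTANCE_BUCKET_SIZE must be positive")
    low = (distance_m // bucket_size) * bucket_size
    return f"{low}-{low + bucket_size - 1}"
```

Three thresholds yield four buckets, so each extra threshold splits the adjustment samples further; that trade-off is the reason to start with a few broad thresholds.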

## Race Processing Rules

These knobs define which starters are included and how ties are handled.

- `MIN_FINISHERS` (default `2`)
  - Minimum number of starters required to process a race.

- `DNF_TREATED_AS_LAST` (default `false`)
  - When true, DNF/missing placings are treated as last place instead of excluded.

- `TIE_HANDLING` (default `ordered`)
  - `ordered`: ties treated as losses for the later comparison (current behavior).
  - `half`: ties score 0.5 in pairwise outcomes.
  - `skip`: tie comparisons are ignored; pairwise normalization can be set to `comparisons`.
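
A minimal sketch of these rules follows. Two points are assumptions of the sketch rather than documented behavior: `ordered` is read as "the later-listed of two tied runners is scored as the loser", and `MIN_FINISHERS` is checked after DNF handling.

```python
from typing import Dict, List, Optional


def resolve_placings(placings: List[Optional[int]], dnf_treated_as_last: bool = False,
                     min_finishers: int = 2) -> Optional[Dict[int, int]]:
    """Map starter index to rank, or return None when the race should be skipped.

    `None` in `placings` marks a DNF or missing placing.
    """
    ranks = {i: place for i, place in enumerate(placings) if place is not None}
    if dnf_treated_as_last:
        last = max(ranks.values(), default=0) + 1
        for i, place in enumerate(placings):
            if place is None:
                ranks[i] = last
    return ranks if len(ranks) >= min_finishers else None


def pairwise_outcome(rank_a: int, rank_b: int, a_listed_first: bool,
                     tie_handling: str = "ordered") -> Optional[float]:
    """Score for A against B; None means the comparison is not counted."""
    if rank_a != rank_b:
        return 1.0 if rank_a < rank_b else 0.0
    if tie_handling == "ordered":
        return 1.0 if a_listed_first else 0.0
    if tie_handling == "half":
        return 0.5
    if tie_handling == "skip":
        return None
    raise ValueError(f"unknown TIE_HANDLING: {tie_handling}")
```

Because `skip` drops comparisons, the number of counted comparisons can be smaller than `n - 1`, which is the case `PAIRWISE_NORMALIZER=comparisons` is meant for.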

## Recommended Tuning Workflow

1. **Baseline snapshot**: run recompute with current defaults and save metrics.
2. **Change one axis at a time**: start with `ELO_K_BASE` and `ELO_SCALE_C`.
3. **Stabilize**: if ratings are too volatile, lower `ELO_K_BASE` or use `RD_SCALING_MODE=sqrt`.
4. **Adjust RD**: set `ENABLE_RD=true` and calibrate `RD_DECAY_PER_RACE` / `RD_INFLATION_PER_DAY`.
5. **Adjust multi-entity weights**: tweak `DRIVER_WEIGHT_ALPHA`, `TRAINER_WEIGHT_BETA`.
6. **Enable adjustments**: start global-only, then enable venue-specific after enough samples.
7. **Apply clamps**: use `RATING_MIN/MAX` and `ADJ_CLAMP_MIN/MAX` to limit extremes.
8. **Validate**: compare predictive accuracy and calibration after each step.

## Efficient Tuning Methodology

Use a two-phase search: a broad scan to locate the right scale, then a local refinement.
Keep each run reproducible by logging config and metrics in a single file.

### Phase 0: Define a fixed evaluation protocol

- **Metrics**: log `log_loss` as the primary objective; also track `winner_acc`, `top3_overlap`, and place metrics (`place_log_loss`, `place_brier`, `place_top3_overlap`).
- **Scope**: evaluate all races and, if relevant, the `with_dt`/`no_dt` splits to detect regressions.
- **Seeded baseline**: run a single recompute with the current `.env` and record the metrics.
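
For orientation, the two headline metrics can be computed as sketched below. This is an illustrative definition (mean negative log of the probability assigned to the actual winner, and top-pick hit rate), not the code in `scripts/evaluate_accuracy.py`, whose exact definitions may differ.

```python
import math
from typing import List, Sequence, Tuple

# Each race is (win probabilities per runner, index of the actual winner).
Race = Tuple[Sequence[float], int]


def winner_log_loss(races: List[Race], eps: float = 1e-12) -> float:
    """Mean of -log(p) for the probability given to the actual winner."""
    losses = [-math.log(max(probs[winner], eps)) for probs, winner in races]
    return sum(losses) / len(losses)


def winner_accuracy(races: List[Race]) -> float:
    """Share of races where the highest-probability runner actually won."""
    hits = 0
    for probs, winner in races:
        top_pick = max(range(len(probs)), key=lambda i: probs[i])
        hits += int(top_pick == winner)
    return hits / len(races)
```

Log loss rewards well-calibrated probabilities rather than just correct top picks, which is why it makes a better primary objective than accuracy alone.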

### Phase 1: Global search (coarse)

Start with the highest-impact knobs and sweep them one axis at a time.

1. **Elo sensitivity**: `ELO_SCALE_C` (e.g., 200, 250, 300, 350, 400, 500).
2. **Update rate**: `ELO_K_BASE` (exponential search: 100, 200, 400, 800, ... until loss worsens).
3. **RD coupling**: `RD_SCALING_MODE` in `linear`, `sqrt`, `none`.
4. **Pairwise normalizer**: `PAIRWISE_NORMALIZER` in `n_minus_1`, `n`, `comparisons`.

Keep the best value from each sweep before moving to the next knob.
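
The exponential search for the update rate can be sketched as a small loop. Here `evaluate` is a hypothetical callback that recomputes ratings with the candidate value and returns `log_loss`; the defaults mirror the 100, 200, 400, 800 example above.

```python
from typing import Callable, Optional, Tuple


def exponential_search(evaluate: Callable[[float], float], start: float = 100.0,
                       factor: float = 2.0, maximum: float = 800.0
                       ) -> Tuple[Optional[float], float]:
    """Multiply the candidate by `factor` until the loss stops improving.

    Returns (best_value, best_loss); best_value is None if nothing was tried.
    """
    best_value: Optional[float] = None
    best_loss = float("inf")
    value = start
    while value <= maximum:
        loss = evaluate(value)
        if loss >= best_loss:
            break  # loss worsened (or stalled): keep the previous best
        best_value, best_loss = value, loss
        value *= factor
    return best_value, best_loss
```

The result only brackets the optimum to within a factor of two, so it is a starting point for the local refinement phase rather than a final value.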

### Phase 2: Local refinement (fine)

Once the rough region is found, tighten the grid around the current best.

- **Narrow ranges**: re-sweep with smaller steps (e.g., `ELO_SCALE_C` ±50).
- **Adjustment learning**: grid search `ADJ_LEARNING_RATE` and `ADJ_UPDATE_SCALE`.
- **RD dynamics**: sweep `RD_DECAY_PER_RACE` and `RD_INFLATION_PER_DAY` in small increments.
- **Place signal**: sweep `PLACE_TOP3_WEIGHT` and `PLACE_CONSISTENCY_WEIGHT`, then refine `PLACE_HISTORY_LIMIT` and `PLACE_PRIOR_WEIGHT`.

### Phase 3: Conditional features

Treat multi-entity ratings and adjustments as optional layers:

- Toggle `ENABLE_DRIVER`/`ENABLE_TRAINER` and verify accuracy before tuning weights.
- Only tune `DRIVER_WEIGHT_ALPHA`/`TRAINER_WEIGHT_BETA` if the toggles help.
- If adjustments help, tune `ADJ_MIN_SAMPLES` and clamps to reduce noise.

### Stop criteria

- **No improvement**: stop when 2-3 consecutive tweaks do not improve `log_loss`.
- **Regression check**: do not accept a gain in `all` if it significantly harms `with_dt`.
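
Both criteria are easy to encode so that sweeps apply them consistently. In the sketch below the function names, the `patience` default, and the `max_split_regression` tolerance are hypothetical; pick a tolerance that matches what "significantly" means for your data.

```python
from typing import Dict, Sequence


def should_stop(losses: Sequence[float], patience: int = 3) -> bool:
    """True when the last `patience` runs all failed to beat the best earlier loss."""
    if len(losses) <= patience:
        return False
    best_before = min(losses[:-patience])
    return all(loss >= best_before for loss in losses[-patience:])


def accept(candidate: Dict[str, float], best: Dict[str, float],
           max_split_regression: float = 0.005) -> bool:
    """Accept a lower `all` log_loss unless `with_dt` worsens by more than the tolerance."""
    improves_all = candidate["all"] < best["all"]
    harms_with_dt = candidate["with_dt"] - best["with_dt"] > max_split_regression
    return improves_all and not harms_with_dt
```

For example, `accept({"all": 0.95, "with_dt": 1.02}, {"all": 0.96, "with_dt": 1.00})` rejects the candidate because the `with_dt` split regresses by more than the tolerance.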

### Repeatable automation

For each sweep:

1. Update `.env` with the candidate values.
2. Run recompute.
3. Run the same evaluation script and append results to a log file.

This yields a single, ordered record of configurations and outcomes you can compare later.
Prefer batch scripts for sweeps to keep runs consistent and minimize manual errors.
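
If you script this loop yourself, the two pieces worth getting right are the `.env` rewrite and the log format. The helpers below are a hypothetical sketch (they are not part of `scripts/`): one replaces or appends a `KEY=value` line in `.env` text, the other formats a run as one JSON line suitable for appending to a log file.

```python
import json
import re
from typing import Dict


def set_env_var(env_text: str, key: str, value: object) -> str:
    """Return `env_text` with `key` set to `value`, replacing an existing line if present."""
    line = f"{key}={value}"
    pattern = re.compile(rf"^{re.escape(key)}=.*$", flags=re.MULTILINE)
    if pattern.search(env_text):
        # A function replacement avoids backslash interpretation in `line`.
        return pattern.sub(lambda _match: line, env_text)
    separator = "" if not env_text or env_text.endswith("\n") else "\n"
    return f"{env_text}{separator}{line}\n"


def format_result(config: Dict[str, object], metrics: Dict[str, float]) -> str:
    """Serialize one run as a single JSON line with stable key order."""
    return json.dumps({"config": config, "metrics": metrics}, sort_keys=True)
```

Writing one JSON object per line keeps the log append-only and easy to load later for comparison.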

## Scripts (Tuning Helpers)

These scripts live in `scripts/` and assume Docker Compose is available.
They update `.env`, run recompute, then evaluate accuracy with a consistent protocol.

- `scripts/evaluate_accuracy.py`: outputs `log_loss`, `winner_acc`, `top3_overlap`, plus place metrics (`place_log_loss`, `place_brier`, `place_top3_overlap`) for `all`, `with_dt`, `no_dt`.
- `scripts/sweep_param.sh`: sweep a single parameter over multiple values.
- `scripts/sweep_grid.sh`: sweep two parameters as a grid.
- `scripts/auto_sweep.sh`: runs an end-to-end tuning pass using the methodology above.

Examples:

```bash
# Evaluate current settings
docker compose run --rm \
  -e DATABASE_URL=postgresql+psycopg://tipsharks:tipsharks@db:5432/tipsharks \
  -e PYTHONPATH=/app \
  worker python scripts/evaluate_accuracy.py

# Sweep ELO_SCALE_C
scripts/sweep_param.sh ELO_SCALE_C 200 250 300 350 400 500

# Sweep adjustment learning (grid)
scripts/sweep_grid.sh ADJ_LEARNING_RATE "0.2 0.5 0.8" ADJ_UPDATE_SCALE "0.5 1.0 1.5"

# End-to-end auto sweep (writes results under /tmp/elo_auto_*)
scripts/auto_sweep.sh
```

Set `RESULTS=/tmp/my_run.txt` to choose the output file and `FROM_DATE`/`TO_DATE` to restrict the evaluated date range.
For auto sweeps, use `RESULTS_DIR=/tmp/elo_auto_custom` and optional knobs:

```bash
SCALE_VALUES="200 250 300 350" \
RD_MODES="linear sqrt" \
PAIRWISE_VALUES="n_minus_1 n" \
K_START=100 K_MAX=800 K_FACTOR=2 \
ADJ_SWEEP=1 \
scripts/auto_sweep.sh
```

## Variant Playbooks

Use these patterns when data availability or feature scope changes.

### Horse-only baseline (data quality check)

1. Set `ENABLE_DRIVER=false` and `ENABLE_TRAINER=false`.
2. Tune `ELO_SCALE_C`, `ELO_K_BASE`, then `RD_SCALING_MODE`.
3. Keep the best as the fallback when driver/trainer coverage is sparse.

### Add driver/trainer contributions

1. Enable one at a time: `ENABLE_DRIVER=true` (trainer off), then `ENABLE_TRAINER=true`.
2. If gains appear, tune weights (`DRIVER_WEIGHT_ALPHA`, `TRAINER_WEIGHT_BETA`).
3. If updates drift, adjust `DRIVER_K_SCALE`/`TRAINER_K_SCALE` rather than global K.

### RD on/off

1. Tune `ELO_SCALE_C` and `ELO_K_BASE` with `ENABLE_RD=false`.
2. Enable RD and sweep `RD_SCALING_MODE`.
3. Fine-tune `RD_DECAY_PER_RACE` and `RD_INFLATION_PER_DAY` around the best mode.

### Adjustments on/off

1. Start with `ENABLE_ADJUSTMENTS=false` to establish the Elo baseline.
2. Enable adjustments with `ADJ_GLOBAL_ONLY=true` and tune `ADJ_LEARNING_RATE`.
3. If stable, turn `ADJ_GLOBAL_ONLY=false` and set `ADJ_MIN_SAMPLES` to avoid noise.

### Tie and DNF handling

1. If ties are common, test `TIE_HANDLING=half` or `skip`.
2. When using `skip`, set `PAIRWISE_NORMALIZER=comparisons`.
3. If DNF impacts results, compare `DNF_TREATED_AS_LAST=true` vs `false`.

### Adjustment distance bucketing

1. Use `DISTANCE_BUCKET_MODE=thresholds` with a few broad thresholds first.
2. If you need finer granularity, switch to `fixed` and tune `DISTANCE_BUCKET_SIZE`.

### Coverage-aware evaluation

Always review both `with_dt` and `no_dt` splits. If `with_dt` improves while `no_dt`
degrades, prefer a mixed strategy or avoid heavy driver/trainer weights.

## Example Profiles

Conservative ratings (slow movement):

```
ELO_K_BASE=16
RD_SCALING_MODE=sqrt
RD_DECAY_PER_RACE=10
ADJ_LEARNING_RATE=0.3
```

Aggressive ratings (fast adaptation):

```
ELO_K_BASE=32
RD_SCALING_MODE=linear
RD_DECAY_PER_RACE=20
ADJ_LEARNING_RATE=0.7
ADJ_UPDATE_SCALE=1.2
```

## Common Pitfalls

- **Too high K** leads to noisy ratings and unstable leaderboards.
- **No RD** means ratings can under-react to new or inactive horses.
- **Low-sample adjustments** can introduce bias; use `ADJ_MIN_SAMPLES`.
- **Overweight driver/trainer** can swamp horse performance.

## Where to Change These Settings

- `.env`: default dev/test configuration.
- Production: set env vars in your deployment system and restart the worker/API.