# Racing Data Pipeline - System Overview

## Executive Summary

This system ingests horse racing data from the TAB Affiliates API (New Zealand and Australia) and stores it in a PostgreSQL database. The pipeline runs on a scheduled basis throughout race days, capturing odds changes, runner information, and results in near real-time.

**Key Capabilities:**
- Ingests thoroughbred and harness racing data from NZ and AUS
- Tracks horses across multiple races (horse master table)
- Captures odds at multiple time points (morning, T-60, T-15, final)
- Stores race results and dividends
- Provides observability via OpenTelemetry, Prometheus, and Grafana

## Architecture Diagram

```mermaid
graph TB
    subgraph "External Data Source"
        TAB[TAB Affiliates API<br/>api.tab.co.nz]
    end

    subgraph "Application Layer"
        Scheduler[Scheduler<br/>Node-Cron]
        RaceService[Race Service<br/>Data Processing]
        HorseMatching[Horse Matching<br/>6-Priority Algorithm]
    end

    subgraph "Data Storage"
        PostgreSQL[(PostgreSQL<br/>Racing Database)]
        Redis[(Redis<br/>Queue & Cache)]
    end

    subgraph "Observability"
        Prometheus[Prometheus<br/>Metrics]
        Jaeger[Jaeger<br/>Distributed Tracing]
        Grafana[Grafana<br/>Dashboards]
    end

    TAB -->|HTTP/JSON| RaceService
    Scheduler -->|Triggers| RaceService
    RaceService -->|Matches/Creates| HorseMatching
    HorseMatching -->|Horse Records| PostgreSQL
    RaceService -->|Races, Runners, Results| PostgreSQL
    RaceService -->|Job Queue| Redis
    RaceService -->|Metrics| Prometheus
    RaceService -->|Traces| Jaeger
    Prometheus -->|Data Source| Grafana
    PostgreSQL -->|Data Source| Grafana
```

## Data Flow Overview

```mermaid
sequenceDiagram
    participant Scheduler
    participant RaceService
    participant TAB API
    participant HorseMatching
    participant Database

    Note over Scheduler: 6:00 AM NZT<br/>Morning Scrape
    Scheduler->>RaceService: Trigger morning scrape
    RaceService->>TAB API: GET /meetings (today)
    TAB API-->>RaceService: Meeting list

    loop For each race
        RaceService->>TAB API: GET /events/{raceId}
        TAB API-->>RaceService: Race details + runners

        loop For each runner
            RaceService->>HorseMatching: Find/Create horse
            HorseMatching->>Database: Query existing horses
            HorseMatching->>Database: Create/Update horse
            HorseMatching-->>RaceService: Horse ID
        end

        RaceService->>Database: Upsert race
        RaceService->>Database: Upsert runners
        RaceService->>Database: Create odds snapshot
    end

    Note over Scheduler: T-60 minutes<br/>Pre-race odds
    Note over Scheduler: T-15 minutes<br/>Final odds
    Note over Scheduler: Post-race<br/>Results collection
```

## System Components

### 1. Scheduler
- **Technology:** Node-Cron
- **Purpose:** Triggers data collection at specific times
- **Schedule Types:**
  - Morning Scrape (6:00 AM): Captures opening odds and field information
  - T-60 Scrape: Captures odds 60 minutes before each race
  - T-15 Scrape: Captures final odds 15 minutes before each race
  - Results Scrape: Collects race results after completion

### 2. TAB API Client
- **Technology:** Axios with retry logic
- **Rate Limiting:** 100 requests per minute
- **Endpoints Used:**
  - `/meetings/list` - Get meetings for a date range
  - `/events/{id}` - Get race details including runners and odds

### 3. Race Service
- **Purpose:** Core business logic for data processing
- **Responsibilities:**
  - Parse API responses
  - Coordinate horse matching
  - Persist races, runners, and odds
  - Handle scratched runners
  - Create results records

### 4. Horse Matching Service
- **Purpose:** Deduplicate horses across multiple races
- **Strategy:** 6-priority matching algorithm (detailed in separate document)

### 5. Database
- **Technology:** PostgreSQL 16
- **Key Tables:**
  - `meetings` - Race meetings/venues
  - `races` - Individual races
  - `horses` - Master horse records (normalized)
  - `runners` - Horse entries in specific races
  - `results` - Race outcomes
  - `dividends` - Betting payouts
  - `odds_snapshots` - Time-series odds data

## Data Types Processed

### Meeting Data
A race meeting represents a race day at a specific venue.

**Key Attributes:**
- Date (YYYY-MM-DD)
- Venue name (e.g., "Ellerslie", "Flemington")
- Country (NZ or AUS)
- Category (T=Thoroughbred, H=Harness, G=Greyhound)
- Track condition
- Weather

### Race Data
An individual race within a meeting.

**Key Attributes:**
- Race number (1-12 typically)
- Start time (ISO 8601 timestamp)
- Distance (meters)
- Track condition
- Status (Open, Interim, Final, Abandoned)
- Prize money

### Horse Data (Master Table)
Normalized horse records that persist across races.

**Key Attributes:**
- Name (original from API, preserved)
- Normalized name (lowercase, no apostrophes - for matching)
- Breeding (sire, dam, dam's sire)
- Physical attributes (sex, colour, country)
- External IDs (TAB entrant ID, TAB horse ID, Harness NZ ID)
- Statistics (race count, first seen, last seen)

### Runner Data
A horse's entry in a specific race.

**Key Attributes:**
- Runner number
- Barrier position
- Jockey/driver
- Trainer
- Weight carried
- Current odds (fixed and pool)
- Scratched status

### Results Data
Outcome of a race.

**Key Attributes:**
- Finish position
- Margin (lengths behind winner)
- Race time
- Official/provisional status

### Odds Snapshot Data
Point-in-time odds capture.

**Key Attributes:**
- Snapshot type (morning, t60, t15, final)
- Fixed odds (win, place)
- Pool odds (tote win, tote place)
- Capture timestamp

## Deployment Architecture

```mermaid
graph TB
    subgraph "Docker Compose Stack"
        subgraph "Application Services"
            App[Racing App<br/>Node.js]
        end

        subgraph "Data Services"
            DB[(PostgreSQL<br/>:5432)]
            Cache[(Redis<br/>:6379)]
        end

        subgraph "Observability Stack"
            Prom[Prometheus<br/>:9091]
            Jaeg[Jaeger<br/>:16686]
            Graf[Grafana<br/>:3000]
        end
    end

    App -->|Read/Write| DB
    App -->|Queue Jobs| Cache
    App -->|Expose Metrics| Prom
    App -->|Send Traces| Jaeg
    Prom -->|Scrape| App
    Graf -->|Query| Prom
    Graf -->|Query| DB

    style App fill:#4CAF50
    style DB fill:#2196F3
    style Cache fill:#FF9800
    style Prom fill:#E91E63
    style Jaeg fill:#9C27B0
    style Graf fill:#FF5722
```

## Key Design Decisions

### 1. Horse Master Table (Normalization)
**Problem:** Without normalization, horses running in multiple races would create duplicate records with inconsistent data.

**Solution:** Centralized `horses` table with:
- Single source of truth for each horse
- Support for multiple external ID systems (TAB, Harness NZ)
- Intelligent matching algorithm to prevent duplicates

**Benefits:**
- Accurate career statistics (race count, win rate)
- Cross-race queries ("all races for Horse X")
- Reduced storage (no duplicated breeding data)
- Data quality (single record to correct/update)

### 2. Multi-Source ID Support
**Problem:** Different racing jurisdictions use different identifier systems.

**Solution:** Three separate ID fields:
- `tabEntrantId` (UUID) - TAB's primary identifier
- `tabHorseId` (Integer) - TAB's secondary identifier
- `harnessNzHorseId` (BigInt) - Harness Racing NZ identifier

**Benefits:**
- Can import historical data from multiple sources
- Future-proof for data consolidation
- Maintains referential integrity

### 3. Snapshot-Based Odds Tracking
**Problem:** Odds change constantly; need historical view.

**Solution:** Time-series snapshots at defined intervals:
- Morning (6:00 AM)
- T-60 (60 minutes before race)
- T-15 (15 minutes before race)
- Final (at jump time)

**Benefits:**
- Analyze odds movements over time
- Calculate market efficiency
- Identify "steamers" and "drifters"
- Historical odds for model training

### 4. Upsert Pattern for Race Data
**Problem:** Race data updates multiple times (scratching, odds changes).

**Solution:** Upsert (INSERT ... ON CONFLICT UPDATE) by race + runner number.

**Benefits:**
- Idempotent operations (safe to re-run)
- Handles late scratching
- Captures data changes over time

## Monitoring and Observability

### Metrics Collected
- API call latency and success rate
- Races processed per scrape
- Database query performance
- Horse matching hit/miss rate
- Scrape duration by type

### Dashboards Available
1. **Operations Health** - System status, scrape success rates
2. **Live Operations** - Real-time race-day monitoring
3. **Data Quality** - Missing fields, stale data detection
4. **Capacity Planning** - Weekend traffic projections

### Alerting
- Race-day specific alerts (T-60, T-15 windows)
- Data freshness alerts (>30 min old data)
- API error rate thresholds
- Database connection failures

## Data Retention

- **Meetings/Races:** Indefinite (complete historical archive)
- **Runners:** Indefinite (linked to horses table)
- **Horses:** Indefinite (master records)
- **Odds Snapshots:** Indefinite (time-series analysis)
- **Results:** Indefinite (historical outcomes)
- **Scrape Logs:** 90 days (operational audit trail)
- **Job Run Logs:** 90 days (debugging history)

## Performance Characteristics

### Typical Scrape Volume
- **Weekend Racing Day:**
  - 11 meetings (NZ + AUS)
  - ~85 races total
  - ~1,200 runners
  - Morning scrape: ~120 API calls, 8-12 minutes
  - T-60/T-15 scrapes: ~85 API calls, 5-8 minutes

### Database Size
- **Daily Growth:** ~15 MB
- **Monthly Growth:** ~450 MB
- **Yearly Projection:** ~5.4 GB

### System Requirements
- **Memory:** 2 GB minimum, 4 GB recommended
- **CPU:** 2 cores minimum
- **Storage:** 50 GB for first year
- **Network:** Stable connection to TAB API