# Horse Racing Data Scraper - Project Brief

## Project Overview

Build a near-realtime horse racing data collection system that scrapes race data from the TAB Affiliates API. The system will collect meeting, race, and runner information throughout the day, tracking changes and storing results for future Elo rating calculations.

**Primary Goal**: Establish a robust data collection pipeline for Australian and New Zealand thoroughbred and harness racing.

**Future Goal**: Use collected data to calculate Elo ratings considering horse history, form, jockey, trainer, track conditions, and pack dynamics to generate win/place probabilities.

---

## Technical Stack

- **Runtime**: Node.js (LTS version)
- **Language**: TypeScript
- **Database**: PostgreSQL (Docker for all environments)
- **Cache/Queue**: Redis + Bull/BullMQ
- **Scheduler**: node-cron
- **HTTP Client**: axios with retry logic
- **Validation**: zod
- **ORM**: Prisma (recommended) or Drizzle

### Testing & Observability Stack

- **Testing Framework**: Jest with ts-jest
- **Test Coverage**: NYC/Istanbul (target: 80%+ coverage)
- **E2E Testing**: Supertest for API integration tests
- **Mocking**: jest.mock() for dependencies
- **Logging**: Pino (structured JSON logging)
- **Monitoring**: OpenTelemetry with traces and metrics
- **Error Tracking**: Custom error aggregation with Postgres + structured logs
- **Metrics**: Prometheus-compatible metrics
- **APM**: Optional - Datadog or New Relic
- **Health Checks**: Custom /health endpoint

---

## Testing & Observability Philosophy

**Core Principle**: "If we can't observe it, we can't trust it. If we don't test it, it doesn't work."

### Testing Strategy

#### Test Pyramid

```
           /\
          /E2E\        (10%) - Full workflow tests
         /------\
        /Integr..\     (30%) - API + DB integration
       /----------\
      /   Unit     \   (60%) - Business logic, utilities
     /--------------\
```

#### Test-Driven Priorities

1. **Write tests BEFORE implementation** for critical paths:

   - API client (all endpoints)
   - Change detection algorithm
   - Data validation and transformation
   - Queue job processing

2. **Write tests ALONGSIDE implementation** for:

   - Service layer methods
   - Utility functions
   - Schedulers

3. **Write tests AFTER implementation** only for:
   - Configuration files
   - Simple getters/setters

#### Coverage Requirements

- **Minimum**: 70% overall coverage
- **Target**: 80%+ overall coverage
- **Critical paths**: 100% coverage required:
  - API client error handling
  - Data validation (Zod schemas)
  - Change detection logic
  - Database upsert operations

#### Test Types

**Unit Tests** (60% of test suite)

```typescript
// Test business logic in isolation
describe("ChangeDetectionService", () => {
  it("should detect scratched runners", () => {
    const previous = [
      { id: 1, scratched: false },
      { id: 2, scratched: false },
    ];
    const current = [
      { id: 1, scratched: true },
      { id: 2, scratched: false },
    ];

    const changes = detectChanges(previous, current);

    expect(changes).toEqual([
      {
        type: "scratch",
        runnerId: 1,
        field: "scratched",
        from: false,
        to: true,
      },
    ]);
  });
});
```
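
A minimal sketch of a `detectChanges` function that would satisfy the unit test above. Only `id` and `scratched` come from that test; the `barrier` and `jockeyName` fields, the `RunnerSnapshot` and `RunnerChange` types, and the `TRACKED_FIELDS` table are assumptions for illustration, not the project's actual implementation.

```typescript
// Hypothetical runner snapshot; fields beyond `id` and `scratched` are assumptions.
interface RunnerSnapshot {
  id: number;
  scratched: boolean;
  barrier?: number | null;
  jockeyName?: string | null;
}

type ChangeType = "scratch" | "barrier" | "jockey";

interface RunnerChange {
  type: ChangeType;
  runnerId: number;
  field: keyof RunnerSnapshot;
  from: unknown;
  to: unknown;
}

// Fields to compare, paired with the change type each one maps to.
const TRACKED_FIELDS: ReadonlyArray<{ field: keyof RunnerSnapshot; type: ChangeType }> = [
  { field: "scratched", type: "scratch" },
  { field: "barrier", type: "barrier" },
  { field: "jockeyName", type: "jockey" },
];

function detectChanges(previous: RunnerSnapshot[], current: RunnerSnapshot[]): RunnerChange[] {
  const previousById = new Map<number, RunnerSnapshot>();
  previous.forEach((runner) => previousById.set(runner.id, runner));

  const changes: RunnerChange[] = [];
  for (const runner of current) {
    const before = previousById.get(runner.id);
    if (!before) continue; // New runner: nothing to diff against.
    for (const { field, type } of TRACKED_FIELDS) {
      // Treat undefined and null as the same "missing" value.
      const from = before[field] ?? null;
      const to = runner[field] ?? null;
      if (from !== to) {
        changes.push({ type, runnerId: runner.id, field, from, to });
      }
    }
  }
  return changes;
}
```

Any flip of `scratched` is reported as a `scratch` change here, including a reinstated runner; the real service may want to distinguish the two.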

**Integration Tests** (30% of test suite)

```typescript
// Test API + Database interactions
describe("MeetingService Integration", () => {
  beforeEach(async () => {
    await clearTestDatabase();
  });

  it("should fetch and store meetings from API", async () => {
    // Use real Prisma client with test database
    const service = new MeetingService(prisma, apiClient);

    await service.fetchAndStore(new Date("2026-01-13"), "AUS");

    const meetings = await prisma.meeting.findMany();
    expect(meetings).toHaveLength(2);
    expect(meetings[0].country).toBe("AUS");
  });
});
```

**E2E Tests** (10% of test suite)

```typescript
// Test complete workflows
describe("Daily Scraping Workflow", () => {
  it("should complete full morning scrape", async () => {
    // Mock TAB API responses
    mockTabApi();

    // Trigger morning scrape
    await morningScrapeCron.execute();

    // Verify all data is collected
    const today = new Date("2026-01-13"); // Assumed: the date covered by the mocked TAB API responses
    const meetings = await prisma.meeting.findMany({ where: { date: today } });
    const races = await prisma.race.findMany({
      where: { meeting: { date: today } },
    });
    const runners = await prisma.runner.findMany();

    expect(meetings.length).toBeGreaterThan(0);
    expect(races.length).toBeGreaterThan(0);
    expect(runners.length).toBeGreaterThan(0);
  });
});
```

### Observability Strategy

#### Structured Logging

**Log Levels:**

- `trace`: Very detailed debug information (disabled in production)
- `debug`: Detailed information for debugging
- `info`: Standard operational messages
- `warn`: Warning conditions (degraded performance, retries)
- `error`: Error conditions that need attention
- `fatal`: Critical failures requiring immediate action

**What to Log:**

```typescript
// ✅ GOOD: Structured logging with context
// Pino takes the context object first, then the message
logger.info(
  {
    date: "2026-01-13",
    country: "AUS",
    category: "T",
    traceId: "abc-123",
  },
  "Starting meeting fetch"
);

logger.error(
  {
    url: "/meetings",
    statusCode: 500,
    retryCount: 2,
    errorMessage: error.message,
    traceId: "abc-123",
  },
  "API request failed"
);

// ❌ BAD: Unstructured logging
console.log("Fetching meetings");
console.error("Error:", error);
```

**Logging Standards:**

- Always include `traceId` for request correlation
- Log entry/exit of critical operations
- Log all API requests (URL, params, duration, status)
- Log all database operations (query type, duration, rows affected)
- Log all queue job events (enqueue, start, complete, fail)
- Never log sensitive data (API keys, passwords)
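
One way to enforce the "never log sensitive data" rule is to mask known key names before a context object reaches the logger. The sketch below is dependency-free and illustrative only: `redactContext` and the `SENSITIVE_KEYS` list are hypothetical names, and Pino's own `redact` option may be a better fit in practice.

```typescript
// Hypothetical list of key names to mask; extend it to match the real configuration.
const SENSITIVE_KEYS = new Set(["apikey", "api_key", "password", "authorization", "token"]);

type LogContext = { [key: string]: unknown };

// Returns a copy of the context with sensitive values masked, recursing into nested plain objects.
function redactContext(context: LogContext): LogContext {
  const safe: LogContext = {};
  for (const key of Object.keys(context)) {
    const value = context[key];
    if (SENSITIVE_KEYS.has(key.toLowerCase())) {
      safe[key] = "[REDACTED]";
    } else if (
      value !== null &&
      typeof value === "object" &&
      !Array.isArray(value) &&
      !(value instanceof Date)
    ) {
      safe[key] = redactContext(value as LogContext);
    } else {
      safe[key] = value;
    }
  }
  return safe;
}
```

The function copies rather than mutates, so the original request configuration stays usable after logging.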

#### Metrics Collection

**Key Metrics to Track:**

**API Metrics:**

```typescript
// Request duration histogram
metrics.histogram("api.request.duration", durationMs, {
  endpoint: "/meetings",
  status: 200,
  country: "AUS",
});

// Request count
metrics.counter("api.requests.total", 1, {
  endpoint: "/meetings",
  status: 200,
});

// Error rate
metrics.counter("api.errors.total", 1, {
  endpoint: "/meetings",
  errorType: "timeout",
});

// Rate limit tracking
metrics.gauge("api.rate_limit.remaining", 85, {
  endpoint: "/meetings",
});
```

**Scraping Metrics:**

```typescript
// Races scraped
metrics.counter("scraper.races.scraped", 1, {
  scrapeType: "initial",
  country: "AUS",
});

// Changes detected
metrics.counter("scraper.changes.detected", 3, {
  changeType: "scratch",
  raceId: "abc-123",
});

// Scrape duration
metrics.histogram("scraper.duration", durationMs, {
  scrapeType: "pre_race",
});
```

**Database Metrics:**

```typescript
// Query duration
metrics.histogram("db.query.duration", durationMs, {
  operation: "upsert",
  table: "meetings",
});

// Connection pool
metrics.gauge("db.connections.active", activeConnections);
metrics.gauge("db.connections.idle", idleConnections);
```

**Queue Metrics:**

```typescript
// Job queue length
metrics.gauge("queue.length", queueLength, {
  queueName: "race-scrape",
});

// Job processing time
metrics.histogram("queue.job.duration", durationMs, {
  queueName: "race-scrape",
  jobType: "fetch-race",
});

// Job failures
metrics.counter("queue.job.failed", 1, {
  queueName: "race-scrape",
  errorType: "timeout",
});
```

**Business Metrics:**

```typescript
// Data completeness
metrics.gauge("business.races.total", totalRaces, { date: "2026-01-13" });
metrics.gauge("business.races.scraped", scrapedRaces, { date: "2026-01-13" });
metrics.gauge("business.data.completeness", completenessPercent);

// Scratches detected
metrics.counter("business.scratches.detected", 1, {
  raceId: "abc-123",
  timeBeforeRace: "T-60",
});
```

#### Distributed Tracing

Use OpenTelemetry to trace requests across the system:

```typescript
// Create a span for each major operation
const span = tracer.startSpan("meeting.fetch_and_store");
span.setAttribute("meeting.country", "AUS");
span.setAttribute("meeting.date", "2026-01-13");

try {
  // Fetch from API (creates child span automatically)
  const meetings = await apiClient.getMeetings(date, country);
  span.addEvent("api_fetch_complete", { count: meetings.length });

  // Store in database (creates child span)
  await meetingService.store(meetings);
  span.addEvent("db_store_complete");

  span.setStatus({ code: SpanStatusCode.OK });
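} catch (error) {
  // `error` is `unknown` under strict TypeScript, so normalise it before use
  const err = error instanceof Error ? error : new Error(String(error));
  span.recordException(err);
  span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });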
  throw error;
} finally {
  span.end();
}
```

#### Health Checks

Implement comprehensive health checks:

```typescript
// GET /health
{
  status: 'healthy', // 'healthy' | 'degraded' | 'unhealthy'
  timestamp: '2026-01-13T10:00:00Z',
  uptime: 3600,
  checks: {
    database: {
      status: 'healthy',
      responseTime: 5,
      details: { connections: { active: 5, idle: 10 } }
    },
    redis: {
      status: 'healthy',
      responseTime: 2,
      details: { memory: { used: '100MB', max: '1GB' } }
    },
    tabApi: {
      status: 'healthy',
      responseTime: 150,
      lastSuccessfulRequest: '2026-01-13T09:59:45Z'
    },
    queue: {
      status: 'healthy',
      details: {
        meetingQueue: { active: 2, waiting: 5, failed: 0 },
        raceQueue: { active: 10, waiting: 20, failed: 1 }
      }
    }
  }
}
```
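
The top-level `status` has to be derived from the component checks somehow. A minimal sketch, assuming a "worst component wins" rule (the brief does not specify the aggregation rule, and `overallStatus` is a hypothetical helper name):

```typescript
type HealthStatus = "healthy" | "degraded" | "unhealthy";

interface ComponentCheck {
  status: HealthStatus;
  responseTime?: number;
}

// Severity order used to pick the worst component status.
const SEVERITY: Record<HealthStatus, number> = { healthy: 0, degraded: 1, unhealthy: 2 };

// Assumed rule: the overall status equals the most severe component status.
function overallStatus(checks: Record<string, ComponentCheck>): HealthStatus {
  let worst: HealthStatus = "healthy";
  for (const name of Object.keys(checks)) {
    const check = checks[name];
    if (check && SEVERITY[check.status] > SEVERITY[worst]) {
      worst = check.status;
    }
  }
  return worst;
}
```

A stricter variant could treat only the database and Redis as hard dependencies and report `degraded` when the TAB API alone is failing.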

#### Alerting Rules

**Critical Alerts** (PagerDuty/immediate notification):

- API failure rate >10% over 5 minutes
- Database connection pool exhausted
- Queue processing stopped for >10 minutes
- Zero races scraped for scheduled meeting
- Application crash/restart

**Warning Alerts** (Slack/email):

- API failure rate >5% over 10 minutes
- Scrape duration >2x normal baseline
- Queue depth growing continuously
- Missing data for >5% of expected races
- Rate limit threshold reached

**Info Alerts** (logging/metrics only):

- Individual API request failures
- Individual race scrape failures
- Retry attempts
- Scratches detected
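
The two API failure-rate thresholds above can be sketched as a pure function over recent request samples. This is an illustration only: in practice the rule would more likely live in the alerting system, and `RequestSample`, `failureRate`, and `apiAlertLevel` are hypothetical names.

```typescript
type AlertLevel = "critical" | "warning" | "none";

interface RequestSample {
  timestampMs: number;
  failed: boolean;
}

const MINUTE_IN_MS = 60000;

// Failure rate (0 to 1) over the trailing window ending at `nowMs`.
function failureRate(samples: RequestSample[], nowMs: number, windowMs: number): number {
  const recent = samples.filter(
    (s) => s.timestampMs > nowMs - windowMs && s.timestampMs <= nowMs,
  );
  if (recent.length === 0) return 0;
  return recent.filter((s) => s.failed).length / recent.length;
}

// Critical: >10% failures over 5 minutes. Warning: >5% failures over 10 minutes.
function apiAlertLevel(samples: RequestSample[], nowMs: number): AlertLevel {
  if (failureRate(samples, nowMs, 5 * MINUTE_IN_MS) > 0.1) return "critical";
  if (failureRate(samples, nowMs, 10 * MINUTE_IN_MS) > 0.05) return "warning";
  return "none";
}
```

With no traffic in the window the rate is treated as zero, so a silent service needs its own alert (for example "zero races scraped").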

#### Dashboard Requirements

Create dashboards showing:

**System Health:**

- API request rate and error rate
- Database query duration (p50, p95, p99)
- Queue depth and processing rate
- Memory and CPU usage

**Business Metrics:**

- Races scraped vs. expected (daily)
- Data completeness percentage
- Scratches detected (by time before race)
- Results captured within SLA (<5 mins after race)

**Error Tracking:**

- Error rate by type
- Top 10 errors by frequency
- Failed jobs by queue
- API endpoints with highest failure rate

---

## API Sources

### TAB Affiliates API

- **Base URL**: `/affiliates/v1/racing/`
- **OpenAPI Spec**: See `openapi.json`
- **Key Endpoints**:
  - `GET /meetings` - List all meetings for a date/category/country
  - `GET /meetings/{id}` - Get specific meeting details
  - `GET /races/{id}` - Get detailed race information including runners
- **Rate Limits**: Unknown - implement conservative rate limiting
- **Authentication**: TBD (likely API key)

**Note**: Harness Racing NZ API integration is planned for a future phase but is not included in the current scope.

---

## Data Collection Flow

### Phase 1: Morning Scrape (6:00 AM local)

1. **Fetch all meetings** for the day:

   - Categories: `T` (Thoroughbred), `H` (Harness)
   - Countries: `AUS`, `NZ`
   - Store: `meeting_id`, `name`, `date`, `country`, `state`, `category`

2. **For each meeting**, fetch meeting details:

   - Store: `track_condition`, `weather`, `video_channels`, etc.

3. **For each race** in each meeting, fetch race details:
   - Store: `race_id`, `race_number`, `name`, `start_time`, `distance`, `track_condition`, `status`
   - **Critical**: Store full `runners` array with:
     - Horse details (name, age, sex, weight)
     - Jockey details (name, weight allowance)
     - Trainer details
     - Form data
     - Barrier position
     - Current odds (if available)

### Phase 2: Pre-Race Updates

Run at **T-60 minutes** and **T-15 minutes** before each race:

**Check for changes:**

- Scratched runners (compare runners array)
- Track condition changes
- Weather updates
- Barrier changes
- Jockey changes
- Start time delays

**Action**:

- Update database records
- Log changes in `scrapes` table
- Flag races needing Elo recalculation

### Phase 3: Post-Race Scrape

Run **5 minutes after** the advertised start time:

**Fetch race results:**

- Finishing positions
- Margins
- Race times
- Dividends/payouts
- Official status

**Store in** `results` table linked to runners.

---

## Database Schema

### Core Tables

```sql
-- Meetings table
CREATE TABLE meetings (
  id UUID PRIMARY KEY,
  name VARCHAR(255) NOT NULL,
  date DATE NOT NULL,
  country VARCHAR(3) NOT NULL,
  state VARCHAR(3),
  category VARCHAR(1) NOT NULL, -- 'T' or 'H'
  category_name VARCHAR(100),
  track_condition VARCHAR(50),
  weather VARCHAR(50),
  video_channels JSONB,
  quaddie INTEGER[],
  early_quaddie INTEGER[],
  tote_meeting_number INTEGER,
  tote_status VARCHAR(50),
  metadata JSONB, -- For flexible storage
  created_at TIMESTAMPTZ DEFAULT NOW(),
  updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- Races table
CREATE TABLE races (
  id UUID PRIMARY KEY,
  meeting_id UUID NOT NULL REFERENCES meetings(id),
  race_number INTEGER NOT NULL,
  name VARCHAR(255) NOT NULL,
  start_time TIMESTAMPTZ NOT NULL,
  tote_start_time TIME,
  distance INTEGER NOT NULL,
  track_condition VARCHAR(50),
  weather VARCHAR(50),
  status VARCHAR(50) NOT NULL, -- 'Upcoming', 'Final', 'Abandoned', etc.
  country VARCHAR(3) NOT NULL,
  state VARCHAR(3),
  metadata JSONB,
  created_at TIMESTAMPTZ DEFAULT NOW(),
  updated_at TIMESTAMPTZ DEFAULT NOW(),
  UNIQUE(meeting_id, race_number)
);

-- Runners table (horses in a race)
CREATE TABLE runners (
  id UUID PRIMARY KEY,
  race_id UUID NOT NULL REFERENCES races(id),
  runner_number INTEGER NOT NULL,
  horse_name VARCHAR(255) NOT NULL,
  horse_id UUID, -- If available from API
  barrier INTEGER,
  weight DECIMAL(5,2),

  -- Jockey information
  jockey_name VARCHAR(255),
  jockey_id UUID,
  jockey_weight_allowance DECIMAL(4,2),

  -- Trainer information
  trainer_name VARCHAR(255),
  trainer_id UUID,

  -- Form and performance
  form VARCHAR(50), -- e.g., "1-2-3-4-5"
  last_starts JSONB, -- Detailed last starts data

  -- Status
  scratched BOOLEAN DEFAULT FALSE,
  scratched_at TIMESTAMPTZ,

  -- Odds (if available)
  opening_odds DECIMAL(10,2),
  current_odds DECIMAL(10,2),

  metadata JSONB, -- Flexible storage for additional API fields
  created_at TIMESTAMPTZ DEFAULT NOW(),
  updated_at TIMESTAMPTZ DEFAULT NOW(),
  UNIQUE(race_id, runner_number)
);

-- Results table
CREATE TABLE results (
  id UUID PRIMARY KEY,
  race_id UUID NOT NULL REFERENCES races(id),
  runner_id UUID NOT NULL REFERENCES runners(id),
  finish_position INTEGER,
  margin DECIMAL(10,2), -- Lengths behind winner
  race_time DECIMAL(10,3), -- Time in seconds

  -- Dividends
  win_dividend DECIMAL(10,2),
  place_dividend DECIMAL(10,2),

  -- Official result
  official BOOLEAN DEFAULT FALSE,
  disqualified BOOLEAN DEFAULT FALSE,

  metadata JSONB,
  created_at TIMESTAMPTZ DEFAULT NOW(),

  UNIQUE(race_id, runner_id)
);

-- Scrapes tracking table
CREATE TABLE scrapes (
  id UUID PRIMARY KEY,
  race_id UUID NOT NULL REFERENCES races(id),
  scrape_type VARCHAR(50) NOT NULL, -- 'initial', 'pre_race_60', 'pre_race_15', 'post_race'
  scraped_at TIMESTAMPTZ DEFAULT NOW(),
  changes_detected JSONB, -- Store what changed
  success BOOLEAN DEFAULT TRUE,
  error_message TEXT
  -- PostgreSQL does not support inline INDEX clauses in CREATE TABLE;
  -- the scrapes index is created with CREATE INDEX in the next section.
);
```

### Indexes for Performance

```sql
-- Query races by meeting
CREATE INDEX idx_races_meeting ON races(meeting_id);

-- Query races by start time (for scheduling)
CREATE INDEX idx_races_start_time ON races(start_time);

-- Query races by status
CREATE INDEX idx_races_status ON races(status);

-- Query runners by race
CREATE INDEX idx_runners_race ON runners(race_id);

-- Query meetings by date and country
CREATE INDEX idx_meetings_date_country ON meetings(date, country, category);

-- Query scrapes by race
CREATE INDEX idx_scrapes_race ON scrapes(race_id, scrape_type, scraped_at);
```

---

## Architecture

```
┌──────────────────────────────────────────────────┐
│              Scheduler (node-cron)               │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐ │
│  │  Morning   │  │  Pre-Race  │  │ Post-Race  │ │
│  │  06:00 AM  │  │ T-60/T-15  │  │  T+5 mins  │ │
│  └──────┬─────┘  └──────┬─────┘  └──────┬─────┘ │
└─────────┼────────────────┼────────────────┼───────┘
          │                │                │
          ▼                ▼                ▼
┌──────────────────────────────────────────────────┐
│           Job Queue (Bull/BullMQ + Redis)        │
│  ┌────────────────┐  ┌──────────────────────┐   │
│  │ Meeting Jobs   │  │    Race Jobs         │   │
│  │ - Fetch list   │  │ - Fetch details      │   │
│  │ - Fetch detail │  │ - Check for changes  │   │
│  └────────────────┘  │ - Fetch results      │   │
│                      └──────────────────────┘   │
└─────────────────────┬────────────────────────────┘
                      │
                      ▼
┌──────────────────────────────────────────────────┐
│              Worker Pool                         │
│  ┌────────────────────────────────────────┐     │
│  │  API Client (axios)                    │     │
│  │  - Rate limiting (bottleneck)          │     │
│  │  - Retry logic (exponential backoff)   │     │
│  │  - Request queuing                     │     │
│  │  - Error handling                      │     │
│  └────────────┬───────────────────────────┘     │
└───────────────┼──────────────────────────────────┘
                │
                ▼
┌──────────────────────────────────────────────────┐
│         Data Processing Layer                    │
│  ┌────────────────────────────────────────┐     │
│  │  - Validation (zod schemas)            │     │
│  │  - Change detection (diff algorithm)   │     │
│  │  - Data transformation                 │     │
│  └────────────┬───────────────────────────┘     │
└───────────────┼──────────────────────────────────┘
                │
                ▼
┌──────────────────────────────────────────────────┐
│            PostgreSQL Database                   │
│         (Prisma ORM for type safety)             │
└──────────────────────────────────────────────────┘
```

---

## Implementation Phases

### Phase 1: Foundation (Days 1-2)

**Goal**: Basic project structure and database setup

**Tasks**:

- [x] Initialize TypeScript project with proper tsconfig
- [x] Set up PostgreSQL with Docker
- [x] Create Prisma schema matching the database design
- [x] Run migrations to create tables
- [x] Create basic environment configuration (.env)
- [x] Set up Redis for queue management
- [x] Create project folder structure:
  ```
  src/
  ├── api/           # API client implementations
  ├── models/        # Prisma client, types
  ├── schedulers/    # Cron jobs
  ├── workers/       # Queue workers
  ├── services/      # Business logic
  ├── utils/         # Helpers, validation
  └── config/        # Configuration files
  ```

**Deliverables**:

- Working database with tables
- TypeScript build system
- Environment configuration
- Redis connection

---

### Phase 2: API Client (Days 3-4)

**Goal**: Robust API client for TAB Affiliates API with comprehensive testing and observability

**Tasks**:

- [x] Create axios instance with configuration
- [x] Implement rate limiting (100 req/min starting point)
- [x] Add retry logic with exponential backoff
- [x] Create typed interfaces for API responses (from OpenAPI spec)
- [x] Implement core methods:
  - `getMeetings(date, country, category)`
  - `getMeetingById(id)`
  - `getRaceById(id)`
- [x] Add structured logging with Pino:
  - Log all requests (URL, params, duration)
  - Log all responses (status, duration)
  - Log all errors with context
  - Include trace IDs for correlation
- [x] Add metrics instrumentation:
  - Request duration histogram
  - Request count by endpoint
  - Error count by type
  - Rate limit tracking
- [x] Implement distributed tracing with OpenTelemetry
- [x] Handle API errors gracefully
- [x] **Write comprehensive unit tests**:
  - Test successful requests
  - Test error handling (400, 404, 500, timeout)
  - Test retry logic
  - Test rate limiting behavior
  - Mock axios responses
  - **Target: 90%+ coverage for API client**
- [x] **Write integration tests**:
  - Test against real API (with test account)
  - Validate response schemas with Zod
  - Test rate limit handling

**Validation**:

- Zod schemas for each API response
- Type safety end-to-end
- All tests passing with >90% coverage
- Metrics visible in console/dashboard
- Trace spans visible in logs

**Deliverables**:

- Fully typed API client
- Error handling framework
- Comprehensive test suite (unit + integration)
- Structured logging throughout
- Metrics collection
- Distributed tracing

---

### Phase 3: Data Collection Services (Days 5-7)

**Goal**: Core scraping logic with full test coverage and observability

**Tasks**:

- [x] Create `MeetingService`:
  - Fetch and store meetings
  - Update meeting details
  - Handle duplicates (upsert logic)
  - Add structured logging
  - Add metrics (meetings scraped, duration)
  - Add distributed tracing
  - **Write unit tests** (mock API and DB)
  - **Write integration tests** (real DB)
- [x] Create `RaceService`:
  - Fetch and store race details
  - Store runners array
  - Update race status
  - Add structured logging
  - Add metrics (races scraped, runners stored)
  - Add distributed tracing
  - **Write unit tests**
  - **Write integration tests**
- [x] Create `ChangeDetectionService`:
  - Compare previous vs current data
  - Identify scratches, condition changes
  - Log changes to scrapes table
  - Add metrics (changes detected by type)
  - **Write unit tests** (100% coverage required)
  - **Test edge cases**: null values, missing data, type changes
- [x] Create `ResultsService`:
  - Fetch and store race results
  - Link results to runners
  - Handle official vs unofficial results
  - Add structured logging
  - Add metrics (results captured, time to capture)
  - **Write unit tests**
  - **Write integration tests**

**Data Flow**:

```
API Response → Validation → Transformation → Database Upsert
                    ↓
              Change Detection
                    ↓
            Log to scrapes table
                    ↓
              Emit Metrics
                    ↓
              Create Trace Spans
```
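
The first two steps of this flow (validation, then transformation into a database row) can be sketched without any libraries. The raw field names (`runner_number`, `name`, `barrier`, `is_scratched`) are hypothetical placeholders, since the real shape comes from `openapi.json`, and `parseRunner` is a hand-written stand-in for a Zod schema's throw-on-failure `parse`.

```typescript
// Hypothetical raw payload shape; real field names come from openapi.json.
interface RawRunner {
  runner_number: number;
  name: string;
  barrier: number | null;
  is_scratched: boolean;
}

// Row shape matching a subset of the `runners` table columns.
interface RunnerRow {
  runner_number: number;
  horse_name: string;
  barrier: number | null;
  scratched: boolean;
}

// Stand-in for a Zod schema: validates only the fields used below and throws on failure.
function parseRunner(input: unknown): RawRunner {
  if (typeof input !== "object" || input === null) {
    throw new Error("runner must be an object");
  }
  const candidate = input as { [key: string]: unknown };
  const runnerNumber = candidate["runner_number"];
  const name = candidate["name"];
  const barrier = candidate["barrier"];
  const scratched = candidate["is_scratched"];

  if (typeof runnerNumber !== "number" || !Number.isInteger(runnerNumber)) {
    throw new Error("runner_number must be an integer");
  }
  if (typeof name !== "string" || name.trim() === "") {
    throw new Error("name must be a non-empty string");
  }
  if (barrier !== undefined && barrier !== null && typeof barrier !== "number") {
    throw new Error("barrier must be a number when present");
  }
  if (scratched !== undefined && typeof scratched !== "boolean") {
    throw new Error("is_scratched must be a boolean when present");
  }
  return {
    runner_number: runnerNumber,
    name,
    barrier: typeof barrier === "number" ? barrier : null,
    is_scratched: scratched === true,
  };
}

// Transformation: map API naming onto database column naming.
function toRunnerRow(raw: RawRunner): RunnerRow {
  return {
    runner_number: raw.runner_number,
    horse_name: raw.name.trim(),
    barrier: raw.barrier,
    scratched: raw.is_scratched,
  };
}

function processRunner(input: unknown): RunnerRow {
  return toRunnerRow(parseRunner(input));
}
```

Keeping validation and transformation as separate pure functions makes both easy to unit test without the API or the database.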

**Testing Requirements**:

- Unit test coverage: >85% for all services
- Integration test coverage: All major workflows
- Test all error paths
- Test with invalid/malformed data
- Test concurrent operations
- Test transaction rollbacks

**Observability Requirements**:

- Every service method logs entry/exit
- All database operations emit duration metrics
- All errors logged with full context
- Trace spans for all async operations
- Business metrics for data quality

**Deliverables**:

- Service layer with clear interfaces
- Change detection algorithm
- Comprehensive error handling
- Full test suite (unit + integration)
- Structured logging throughout
- Metrics instrumentation
- Distributed tracing

---

### Phase 4: Job Scheduling & Queuing (Days 8-9)

**Goal**: Automated data collection pipeline

**Tasks**:

- [x] Set up Bull queues:
  - `meeting-queue`
  - `race-queue`
  - `results-queue`
- [x] Create queue processors (workers):
  - Meeting processor
  - Race processor (initial, pre-race, post-race)
  - Results processor
- [x] Implement schedulers:
  - Morning scrape (06:00 AM)
  - Pre-race scrapes (dynamic based on start times)
  - Post-race scrapes (dynamic, T+5 mins)
- [x] Add job priorities:
  - High: Pre-race updates (T-15)
  - Medium: Initial scrapes
  - Low: Post-race results (non-urgent)
- [x] Implement job retries with backoff
- [x] Add monitoring/logging for queue health

**Scheduling Logic**:

```
// Morning: Fetch all meetings and queue race jobs
06:00 → Queue: Fetch meetings for today
     → For each meeting: Queue race detail jobs
     → For each race: Schedule pre-race jobs based on start_time

// Dynamic pre-race scheduling
For race starting at 14:30:
  → Schedule job at 13:30 (T-60)
  → Schedule job at 14:15 (T-15)

// Dynamic post-race scheduling
For race starting at 14:30:
  → Schedule results job at 14:35 (T+5)
```
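
The offsets in the scheduling logic above can be computed with plain date arithmetic. A minimal sketch, assuming start times are held as UTC `Date` values; `scheduleFor` and `delayUntil` are hypothetical helper names. The delay could then be passed as the `delay` job option that Bull and BullMQ accept, in milliseconds.

```typescript
interface RaceSchedule {
  preRace60: Date;
  preRace15: Date;
  postRace: Date;
}

const MINUTE_MS = 60 * 1000;

// Derives the T-60, T-15 and T+5 run times from the advertised start time.
function scheduleFor(startTime: Date): RaceSchedule {
  const start = startTime.getTime();
  return {
    preRace60: new Date(start - 60 * MINUTE_MS),
    preRace15: new Date(start - 15 * MINUTE_MS),
    postRace: new Date(start + 5 * MINUTE_MS),
  };
}

// Milliseconds from `now` until `runAt`, clamped at zero for times already past.
function delayUntil(runAt: Date, now: Date): number {
  return Math.max(0, runAt.getTime() - now.getTime());
}
```

If a start time changes after scheduling, the pending jobs need to be removed and re-created from the new `start_time`.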

**Deliverables**:

- Automated end-to-end pipeline
- Queue dashboard (Bull Board)
- Configurable scheduling

---

### Phase 5: Testing, Validation & Observability (Days 10-11)

**Goal**: Ensure reliability, data integrity, and production-grade observability

**Comprehensive Testing**:

**1. Unit Test Suite Review**

- [x] Verify >85% overall coverage
- [x] Verify 100% coverage for critical paths:
  - Change detection algorithms
  - Data validation logic
  - Error handling paths
- [x] Run mutation testing to verify test quality
- [x] Review and improve test descriptions
- [x] Add missing edge case tests

**2. Integration Testing**

- [x] Test complete workflows end-to-end:
  - Morning scrape workflow
  - Pre-race update workflow
  - Post-race results workflow
- [x] Test with real API (sandbox/test account)
- [x] Test database transactions and rollbacks
- [x] Test queue job processing
- [x] Test scheduler execution

**3. Load & Performance Testing**

- [x] Simulate full day's load (100+ races):
  - Measure API request throughput
  - Measure database write performance
  - Measure queue processing rate
  - Identify bottlenecks
- [x] Test concurrent scraping
- [x] Test memory usage under load
- [x] Test database connection pool behavior
- [x] Establish performance baselines:
  - Meeting fetch: <2s target
  - Race fetch: <1s target
  - Database upsert: <100ms target
  - Queue job: <5s target
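
To compare measured durations against these baselines, a percentile is usually more informative than an average. The sketch below uses the nearest-rank method; checking at p95 is an assumption (the targets above do not name a statistic), and `percentile`, `BASELINES_MS`, and `meetsBaseline` are hypothetical names.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("percentile of empty sample set");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  // Multiply before dividing to avoid floating-point drift in the rank.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1] as number;
}

// Baseline targets in milliseconds, taken from the list above.
const BASELINES_MS = {
  meetingFetch: 2000,
  raceFetch: 1000,
  dbUpsert: 100,
  queueJob: 5000,
} as const;

type Operation = keyof typeof BASELINES_MS;

// Assumed check: the chosen percentile must be strictly below the target.
function meetsBaseline(operation: Operation, durationsMs: number[], p = 95): boolean {
  return percentile(durationsMs, p) < BASELINES_MS[operation];
}
```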

**4. Chaos Engineering**

- [x] Test API failure scenarios:
  - Complete API outage
  - Slow API responses (>5s)
  - Rate limiting (429 responses)
  - Malformed responses
  - Timeouts
- [x] Test database failure scenarios:
  - Connection loss
  - Deadlocks
  - Slow queries
- [x] Test Redis failure:
  - Connection loss
  - Queue job failures
  - Memory exhaustion
- [x] Verify graceful degradation
- [x] Verify recovery after failures

**5. Data Quality Validation**

- [x] Run full-day scrape with real data
- [x] Validate completeness:
  - Check for missing races (compare against known schedule)
  - Verify runner counts match expected
  - Confirm all results captured
- [x] Validate accuracy:
  - Cross-check sample data against official sources
  - Verify change detection caught all scratches
  - Confirm timestamps are correct
- [x] Validate consistency:
  - Check for duplicate records
  - Verify foreign key relationships
  - Confirm data types and formats

**Observability Implementation**:

**1. Logging Infrastructure**

- [x] Configure log levels per environment:
  - Development: debug
  - Staging: info
  - Production: info (warn for high-volume operations)
- [x] Set up log aggregation (e.g., CloudWatch, Datadog)
- [x] Create log queries for common debugging scenarios
- [x] Test log volume in production-like load

**2. Metrics & Monitoring**

- [x] Configure Prometheus or compatible metrics exporter
- [x] Set up metrics scraping
- [x] Create Grafana dashboards:
  - **System Health Dashboard**:
    - API request rate, error rate, latency
    - Database connections, query duration
    - Queue depth, job rate, failure rate
    - Memory and CPU usage
  - **Business Metrics Dashboard**:
    - Races scraped today (vs. expected)
    - Data completeness percentage
    - Scratches detected (timeline)
    - Results capture latency
  - **Error Dashboard**:
    - Error rate by type
    - Failed jobs by queue
    - Top errors by frequency
- [x] Test metric collection under load

**3. Distributed Tracing**

- [x] Configure OpenTelemetry exporter
- [x] Set up trace collection (e.g., Jaeger, Honeycomb)
- [x] Verify traces for complete workflows
- [x] Test trace sampling under high load
- [x] Create trace-based alerts for slow operations

**4. Error Tracking**

- [x] Create error aggregation system:
  - Store errors in dedicated Postgres table
  - Group by error type, message, stack trace hash
  - Track frequency, first seen, last seen
  - Link to trace IDs for debugging
- [x] Implement error dashboard/queries
- [x] Set up error alerts based on frequency/severity
- [x] Configure error sampling for high-volume errors
- [x] Add source map support for stack traces
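
The grouping described above (error type, message, stack trace hash, with first/last seen and trace IDs) can be sketched in memory before committing to a table design. Everything here is illustrative: `ErrorAggregator`, `ErrorReport`, and `ErrorGroup` are hypothetical names, and the FNV-1a-style hash over UTF-16 code units is used only to keep the sketch dependency-free.

```typescript
interface ErrorReport {
  errorType: string;
  message: string;
  stack: string;
  traceId: string;
  seenAt: Date;
}

interface ErrorGroup {
  fingerprint: string;
  errorType: string;
  message: string;
  count: number;
  firstSeen: Date;
  lastSeen: Date;
  traceIds: string[];
}

// Small non-cryptographic hash (FNV-1a style), returned as 8 hex characters.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

class ErrorAggregator {
  private readonly groups = new Map<string, ErrorGroup>();

  record(report: ErrorReport): ErrorGroup {
    const fingerprint = `${report.errorType}:${fnv1a(report.message)}:${fnv1a(report.stack)}`;
    const existing = this.groups.get(fingerprint);
    if (existing) {
      existing.count += 1;
      if (report.seenAt.getTime() < existing.firstSeen.getTime()) existing.firstSeen = report.seenAt;
      if (report.seenAt.getTime() > existing.lastSeen.getTime()) existing.lastSeen = report.seenAt;
      existing.traceIds.push(report.traceId);
      return existing;
    }
    const created: ErrorGroup = {
      fingerprint,
      errorType: report.errorType,
      message: report.message,
      count: 1,
      firstSeen: report.seenAt,
      lastSeen: report.seenAt,
      traceIds: [report.traceId],
    };
    this.groups.set(fingerprint, created);
    return created;
  }

  // Groups ordered by frequency, most frequent first.
  top(limit: number): ErrorGroup[] {
    return Array.from(this.groups.values())
      .sort((a, b) => b.count - a.count)
      .slice(0, limit);
  }
}
```

Messages that embed variable data (IDs, timestamps) would need normalising before hashing, otherwise every occurrence forms its own group.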

**5. Health Checks & Probes**

- [x] Implement `/health` endpoint
- [x] Implement `/ready` endpoint (for k8s readiness)
- [x] Test health checks under various failure modes
- [x] Configure health check monitoring

**6. Alerting**

- [x] Set up critical alerts:
  - API failure rate >10% (5min window) → PagerDuty
  - Zero races scraped for scheduled meeting → PagerDuty
  - Application crash/OOM → PagerDuty
  - Queue stopped processing >10min → PagerDuty
  - Database connection exhausted → PagerDuty
- [x] Set up warning alerts:
  - API failure rate >5% (10min window) → Slack
  - Scrape duration >2x baseline → Slack
  - Missing data >5% of expected → Slack
  - Queue depth growing >15min → Slack
- [x] Test alert delivery
- [x] Create runbook for each alert type

**7. Synthetic Monitoring**

- [x] Create smoke tests that run every 5 minutes:
  - Fetch today's meetings
  - Fetch a known race
  - Verify health endpoint
- [x] Alert if smoke tests fail
- [x] Track smoke test latency

**Production Readiness Checklist**:

- [x] All tests passing (unit, integration, e2e)
- [x] Test coverage >85% overall, 100% for critical paths
- [x] Performance baselines established and met
- [x] Chaos tests passing (graceful failure handling)
- [x] Full-day scrape validated with real data
- [x] Logging configured and tested
- [x] Metrics dashboards created
- [x] Distributed tracing working
- [x] Error tracking configured
- [x] Health checks implemented
- [x] Alerts configured and tested
- [x] Synthetic monitoring running
- [x] Runbook created for all alerts
- [x] No critical or high-severity bugs

**Deliverables**:

- Comprehensive test suite (>85% coverage)
- Performance benchmarks and baselines
- Complete observability stack:
  - Structured logging
  - Metrics dashboards
  - Distributed tracing
  - Error tracking
  - Health checks
  - Alerting rules
- Production readiness report
- Monitoring runbook

---

### Phase 6: Documentation & Handoff (Day 12)

**Goal**: Production-ready codebase

**Tasks**:

- [x] Write comprehensive README
- [x] Document API client usage
- [x] Create runbook for operations:
  - How to start/stop services
  - How to monitor queues
  - How to handle failures
  - How to backfill data
- [x] Environment setup guide:
  - Local development
  - Docker deployment
  - Production deployment
- [x] Add JSDoc comments to all public functions
- [x] Create data dictionary for database

**Deliverables**:

- Complete documentation
- Deployment guide
- Operational runbook

---

## Configuration

### Environment Variables

```bash
# Database
DATABASE_URL=postgresql://user:pass@localhost:5432/racing_db

# Redis
REDIS_URL=redis://localhost:6379

# TAB Affiliates API
TAB_API_BASE_URL=https://api.tab.com.au
TAB_API_KEY=xxx

# Scheduling
MORNING_SCRAPE_CRON="0 6 * * *"  # 6 AM daily
TIMEZONE=Pacific/Auckland

# Rate Limiting
API_RATE_LIMIT_PER_MINUTE=100
API_RETRY_ATTEMPTS=3
API_RETRY_DELAY_MS=1000

# Feature Flags
ENABLE_HARNESS_RACING=true
ENABLE_THOROUGHBRED_RACING=true
ENABLE_AUS_RACING=true
ENABLE_NZ_RACING=true

# Monitoring
LOG_LEVEL=info
```
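
A sketch of how these variables might be read into a typed config object. The variable names come from the block above; the defaults (taken from the sample values), the "flags default to enabled" behaviour, and the `loadConfig` and `meetingRequests` helpers are assumptions. The function takes the environment as a parameter so it can be tested without touching `process.env`.

```typescript
type Env = { [key: string]: string | undefined };
type Category = "T" | "H";
type Country = "AUS" | "NZ";

interface ScraperConfig {
  rateLimitPerMinute: number;
  retryAttempts: number;
  retryDelayMs: number;
  categories: Category[];
  countries: Country[];
}

function readInt(env: Env, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  if (!Number.isInteger(parsed) || parsed < 0) {
    throw new Error(`${key} must be a non-negative integer, got "${raw}"`);
  }
  return parsed;
}

function readFlag(env: Env, key: string, fallback: boolean): boolean {
  const raw = env[key];
  if (raw === undefined) return fallback;
  return raw.trim().toLowerCase() === "true";
}

function loadConfig(env: Env): ScraperConfig {
  const categories: Category[] = [];
  if (readFlag(env, "ENABLE_THOROUGHBRED_RACING", true)) categories.push("T");
  if (readFlag(env, "ENABLE_HARNESS_RACING", true)) categories.push("H");

  const countries: Country[] = [];
  if (readFlag(env, "ENABLE_AUS_RACING", true)) countries.push("AUS");
  if (readFlag(env, "ENABLE_NZ_RACING", true)) countries.push("NZ");

  return {
    rateLimitPerMinute: readInt(env, "API_RATE_LIMIT_PER_MINUTE", 100),
    retryAttempts: readInt(env, "API_RETRY_ATTEMPTS", 3),
    retryDelayMs: readInt(env, "API_RETRY_DELAY_MS", 1000),
    categories,
    countries,
  };
}

// Every (country, category) pair the morning scrape would request.
function meetingRequests(config: ScraperConfig): Array<{ country: Country; category: Category }> {
  const requests: Array<{ country: Country; category: Category }> = [];
  for (const country of config.countries) {
    for (const category of config.categories) {
      requests.push({ country, category });
    }
  }
  return requests;
}
```

Failing fast on a malformed number at startup is usually preferable to discovering it when the first scrape runs.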

---

## Critical Considerations

### 1. Rate Limiting

- **TAB API**: Unknown limits - start conservative (100/min)
- **Solution**: Use `bottleneck` package for rate limiting
- **Monitoring**: Track 429 responses, adjust limits accordingly
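
To make the 100 requests/minute starting point concrete, here is a dependency-free sliding-window limiter. It is a sketch of the idea only and does not reproduce the `bottleneck` API; `SlidingWindowLimiter` is a hypothetical class, and the clock is passed in so the behaviour is deterministic in tests.

```typescript
class SlidingWindowLimiter {
  private readonly timestamps: number[] = [];
  private readonly maxRequests: number;
  private readonly windowMs: number;

  constructor(maxRequests: number, windowMs: number) {
    this.maxRequests = maxRequests;
    this.windowMs = windowMs;
  }

  // Returns 0 if a request may go now (and records it), otherwise the wait in milliseconds.
  tryAcquire(nowMs: number): number {
    // Drop timestamps that have left the window.
    while (
      this.timestamps.length > 0 &&
      (this.timestamps[0] as number) <= nowMs - this.windowMs
    ) {
      this.timestamps.shift();
    }
    if (this.timestamps.length < this.maxRequests) {
      this.timestamps.push(nowMs);
      return 0;
    }
    // The oldest request leaves the window at oldest + windowMs.
    return (this.timestamps[0] as number) + this.windowMs - nowMs;
  }
}
```

For the configuration in this brief the limiter would be created as `new SlidingWindowLimiter(100, 60000)` and lowered if 429 responses appear.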

### 2. Data Consistency

- **Problem**: Race data can change between scrapes
- **Solution**:
  - Always store timestamps
  - Use `updated_at` to track changes
  - Maintain audit trail in scrapes table

### 3. Race Timing

- **Problem**: Races can be delayed or abandoned
- **Solution**:
  - Don't hardcode timing offsets
  - Check race status before scraping results
  - Implement retry logic for delayed races
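
A sketch of the "check status before scraping results" decision. The status strings `Final` and `Abandoned` are taken from the schema comment on `races.status`; the retry interval, the attempt cap, and the `nextResultsAction` helper are assumptions.

```typescript
type ResultsAction =
  | { kind: "fetch_results" }
  | { kind: "skip"; reason: string }
  | { kind: "retry"; delayMs: number }
  | { kind: "give_up" };

// Decides what the post-race job should do based on the race's current status.
// `attempt` counts how many times the job has already been retried (0 on the first run).
function nextResultsAction(
  status: string,
  attempt: number,
  maxAttempts = 12, // Assumed cap: roughly one hour at the default interval
  retryDelayMs = 5 * 60 * 1000, // Assumed interval: re-check every 5 minutes
): ResultsAction {
  if (status === "Final") return { kind: "fetch_results" };
  if (status === "Abandoned") return { kind: "skip", reason: "race abandoned" };
  // Any other status (for example a delayed race) is re-checked later.
  if (attempt < maxAttempts) return { kind: "retry", delayMs: retryDelayMs };
  return { kind: "give_up" };
}
```

A `give_up` outcome is a natural trigger for the "missing data" warning alert.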

### 4. API Failures

- **Problem**: APIs can go down, time out, or return errors
- **Solution**:
  - Exponential backoff (1s, 2s, 4s, 8s, 16s)
  - Circuit breaker pattern
  - Fallback to cached data where appropriate
  - Alert on repeated failures
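
The backoff schedule and circuit breaker above can be sketched as follows. The delay sequence matches the 1s to 16s list; the breaker's failure threshold and cooldown are left as constructor arguments because the brief does not specify them, and `backoffDelayMs` and `CircuitBreaker` are hypothetical names. Time is passed in explicitly to keep the logic testable.

```typescript
// Delay before retry number `attempt` (1-based): base, 2x base, 4x base, ... capped at maxDelayMs.
function backoffDelayMs(attempt: number, baseDelayMs = 1000, maxDelayMs = 16000): number {
  if (!Number.isInteger(attempt) || attempt < 1) {
    throw new RangeError("attempt must be a positive integer");
  }
  return Math.min(baseDelayMs * 2 ** (attempt - 1), maxDelayMs);
}

type BreakerState = "closed" | "open" | "half_open";

class CircuitBreaker {
  private failures = 0;
  private openedAtMs: number | null = null;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;

  constructor(failureThreshold: number, cooldownMs: number) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
  }

  state(nowMs: number): BreakerState {
    if (this.openedAtMs === null) return "closed";
    return nowMs - this.openedAtMs >= this.cooldownMs ? "half_open" : "open";
  }

  // Requests are allowed when closed; a trial request is allowed when half-open.
  canRequest(nowMs: number): boolean {
    return this.state(nowMs) !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAtMs = null;
  }

  // Opens (or re-opens) the breaker once consecutive failures reach the threshold.
  recordFailure(nowMs: number): void {
    this.failures += 1;
    if (this.failures >= this.failureThreshold) this.openedAtMs = nowMs;
  }
}
```

Adding random jitter to each delay is a common refinement when many jobs retry at the same moment.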

### 5. Duplicate Prevention

- **Problem**: Re-running scrapes can create duplicates
- **Solution**:
  - Use UUID primary keys from API
  - Implement upsert logic (INSERT ... ON CONFLICT)
  - Check scrapes table before re-scraping
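
For reference, this is roughly what an `INSERT ... ON CONFLICT` upsert looks like when built by hand. The sketch generates a parameterised statement for any table; `buildUpsert` is a hypothetical helper, and with Prisma the equivalent would normally be its `upsert` operation rather than raw SQL.

```typescript
interface UpsertStatement {
  text: string;
  values: unknown[];
}

// Builds a parameterised PostgreSQL upsert keyed on `conflictColumns`.
function buildUpsert(
  table: string,
  row: { [column: string]: unknown },
  conflictColumns: string[],
): UpsertStatement {
  const columns = Object.keys(row);
  if (columns.length === 0) throw new Error("row must have at least one column");
  if (conflictColumns.length === 0) throw new Error("at least one conflict column is required");

  // Identifiers cannot be bound as parameters, so restrict them to a safe pattern.
  const identifier = /^[a-z_][a-z0-9_]*$/;
  for (const name of [table, ...columns, ...conflictColumns]) {
    if (!identifier.test(name)) throw new Error(`unsafe identifier: ${name}`);
  }

  const placeholders = columns.map((_, i) => `$${i + 1}`);
  const updates = columns
    .filter((c) => !conflictColumns.includes(c))
    .map((c) => `${c} = EXCLUDED.${c}`);
  const action = updates.length > 0 ? `DO UPDATE SET ${updates.join(", ")}` : "DO NOTHING";

  return {
    text:
      `INSERT INTO ${table} (${columns.join(", ")}) VALUES (${placeholders.join(", ")}) ` +
      `ON CONFLICT (${conflictColumns.join(", ")}) ${action}`,
    values: columns.map((c) => row[c]),
  };
}
```

For runners, the conflict target would be `(race_id, runner_number)`, matching the `UNIQUE` constraint in the schema.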

### 6. Timezone Handling

- **Problem**: AUS and NZ races run in different time zones
- **Solution**:
  - Store all times in UTC
  - Convert to local for display only
  - Use `luxon` or `date-fns-tz` for timezone math
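
A small sketch of the "store UTC, convert for display" rule using only built-in `Date` and `Intl` support, as a stand-in for `luxon` or `date-fns-tz`. The helper names `toUtcIso` and `localTime` are hypothetical, and the formatting relies on the runtime shipping full time zone data (the default in current Node.js releases).

```typescript
// Normalises a timestamp that carries an explicit offset into a UTC ISO string for storage.
function toUtcIso(timestampWithOffset: string): string {
  const instant = new Date(timestampWithOffset);
  if (Number.isNaN(instant.getTime())) {
    throw new Error(`invalid timestamp: ${timestampWithOffset}`);
  }
  return instant.toISOString();
}

// Formats a stored instant as 24-hour HH:mm in the given IANA time zone, for display only.
function localTime(instant: Date, timeZone: string): string {
  return instant.toLocaleTimeString("en-GB", {
    timeZone,
    hour: "2-digit",
    minute: "2-digit",
    hour12: false,
  });
}
```

Scheduling arithmetic (T-60, T-15, T+5) should be done on the UTC instant; only the final display step needs a time zone.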

---

## Success Metrics

### Data Completeness

- ✅ 100% of scheduled races captured
- ✅ <1% missing runner data
- ✅ <5% missing results (accounting for abandoned races)

### Reliability

- ✅ 99% uptime for scraping service
- ✅ <1% failed API requests (after retries)
- ✅ <5min lag between race finish and result capture

### Performance

- ✅ <10s to fetch and store a full meeting
- ✅ <2s to fetch and store a single race
- ✅ Queue processing <1min behind schedule

---

## Future Enhancements (Post-MVP)

1. **Elo Rating System**

   - Calculate ratings from historical results
   - Factor in track conditions, distance, class
   - Update ratings after each race
   - Generate win/place probabilities

2. **Real-time Odds Tracking**

   - Poll odds changes throughout the day
   - Track market movements
   - Compare Elo predictions vs market

3. **Historical Data Backfill**

   - Scrape past results (if available)
   - Build initial Elo ratings database
   - Validate model against historical outcomes

4. **Web Dashboard**

   - Display upcoming races
   - Show Elo ratings and predictions
   - Real-time updates via WebSockets
   - Historical performance charts

5. **Notifications**
   - Slack/Discord alerts for high-value predictions
   - Email summaries of daily performance
   - SMS for critical system failures

---

## Getting Started (For Claude Code)

**Suggested first command**:

```
"Initialize the horse racing scraper project. Start with Phase 1:
Create the TypeScript project structure, set up Prisma with PostgreSQL,
and create the database schema as specified in PROJECT_BRIEF.md.
Use the openapi.json and example JSON files as reference for the API structure."
```

**Files to reference**:

- This document (PROJECT_BRIEF.md)
- openapi.json (TAB API spec)
- list-of-meetings.json (example response)
- specified-meeting.json (example response)
- specified-race.json (example response) - if available

**Priorities**:

1. Type safety (TypeScript, Zod validation)
2. Error handling (never crash, always log)
3. Testability (pure functions, dependency injection)
4. Observability (comprehensive logging)
5. Maintainability (clear code structure, documentation)

---

## Questions & Unknowns

- [ ] TAB API authentication method?
- [ ] TAB API rate limits?
- [ ] What time does TAB publish the full day's card?
- [ ] How late can scratches occur? (T-10 mins?)
- [ ] Are there futures races to consider?
- [ ] Historical data availability for backfill?

**Note**: Harness Racing NZ API integration is deferred to a future phase.

---

_Last updated: 2026-01-13_
_Version: 1.0_