# Scheduler System Redesign

## Current Problems

1. **Inefficient API Polling**: Pre/post-race schedulers each run every minute to check for work (1,440 checks per scheduler per day), whether or not any race is due
2. **Fixed 6 AM Timing**: Assumes all race data is available at 6 AM, but TAB posts data at varying times
3. **No Retry Logic**: If morning scrape fails or data isn't ready, we have no fallback
4. **Poor Observability**: No tracking of job outcomes, failures, or performance metrics in the database

## Proposed Solution

### 1. Smart Scheduling Architecture

Instead of blind per-minute polling, use **event-driven scheduling** based on the race times stored by the morning scrape:

```
Morning Scrape (6 AM)
    ↓
Stores race times in DB
    ↓
Dynamically schedules specific jobs based on actual race times
    ↓
Pre-race T-60, T-15, Post-race T+5 jobs run at calculated times
```
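
The "calculated times" step can be sketched as a pure function that turns a stored race start time into the three job run times. This is a minimal sketch, not the actual implementation: the `Race` shape, the `pre_race_t15` and `post_race_t5` job-type names, and the choice to measure T+5 from the scheduled start time are assumptions.

```typescript
// Hypothetical race shape; the real record likely has more fields.
interface Race {
  id: string;
  startTime: Date;
}

interface PlannedJob {
  raceId: string;
  jobType: 'pre_race_t60' | 'pre_race_t15' | 'post_race_t5';
  runAt: Date;
}

const MINUTE_MS = 60 * 1000;

// Offsets in minutes relative to race start (negative = before the race).
// Assumption: T+5 is measured from the scheduled start, not the finish.
const JOB_OFFSETS: Array<{ jobType: PlannedJob['jobType']; offsetMinutes: number }> = [
  { jobType: 'pre_race_t60', offsetMinutes: -60 },
  { jobType: 'pre_race_t15', offsetMinutes: -15 },
  { jobType: 'post_race_t5', offsetMinutes: 5 },
];

function planJobsForRace(race: Race, now: Date): PlannedJob[] {
  return JOB_OFFSETS
    .map(({ jobType, offsetMinutes }) => ({
      raceId: race.id,
      jobType,
      runAt: new Date(race.startTime.getTime() + offsetMinutes * MINUTE_MS),
    }))
    // Drop jobs whose run time has already passed (e.g. after a late scrape).
    .filter((job) => job.runAt.getTime() > now.getTime());
}
```

Filtering out past run times means a delayed morning scrape still schedules whatever jobs remain for the day instead of firing stale ones.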

### 2. Job Execution Flow

```
Job Triggered
    ↓
Create JobRun record (status: running)
    ↓
Execute work with retries
    ↓
Update JobRun with results (success/failure, metrics, errors)
    ↓
Create Scrape records for each race processed
    ↓
Record metrics (duration, items processed, error rates)
```
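
The flow above could be wrapped in a single helper so every scheduler records its run the same way. The sketch below uses an in-memory array as a stand-in for the `job_runs` table so it runs on its own; `runTrackedJob`, `WorkResult`, and the rule for deriving `partial` are assumptions, and Scrape record creation and retries are omitted.

```typescript
type JobStatus = 'running' | 'success' | 'failed' | 'partial';

interface JobRunRecord {
  id: string;
  jobType: string;
  status: JobStatus;
  startedAt: Date;
  completedAt?: Date;
  durationMs?: number;
  itemsProcessed: number;
  itemsFailed: number;
  errorMessage?: string;
}

// Hypothetical result shape returned by a scheduler's work function.
interface WorkResult {
  itemsProcessed: number; // assumption: counts successfully processed items
  itemsFailed: number;
}

// In-memory stand-in for the job_runs table, for illustration only.
const jobRuns: JobRunRecord[] = [];

async function runTrackedJob(
  jobType: string,
  work: () => Promise<WorkResult>,
): Promise<JobRunRecord> {
  const run: JobRunRecord = {
    id: `run-${jobRuns.length + 1}`,
    jobType,
    status: 'running',
    startedAt: new Date(),
    itemsProcessed: 0,
    itemsFailed: 0,
  };
  jobRuns.push(run);

  try {
    const result = await work();
    run.itemsProcessed = result.itemsProcessed;
    run.itemsFailed = result.itemsFailed;
    // Assumed rule: no failures = success, nothing succeeded = failed, else partial.
    if (result.itemsFailed === 0) run.status = 'success';
    else if (result.itemsProcessed === 0) run.status = 'failed';
    else run.status = 'partial';
  } catch (err) {
    run.status = 'failed';
    run.errorMessage = err instanceof Error ? err.message : String(err);
  } finally {
    // Always close out the record, even when the work throws.
    const completedAt = new Date();
    run.completedAt = completedAt;
    run.durationMs = completedAt.getTime() - run.startedAt.getTime();
  }
  return run;
}
```

Closing out the record in `finally` avoids leaving rows stuck in `running` when the work function throws.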

### 3. Database Schema Addition

Add `JobRun` model to track all job executions:

```prisma
model JobRun {
  id              String   @id @default(uuid()) @db.Uuid
  jobType         String   @db.VarChar(50)  // morning_scrape, pre_race_t60, etc.
  status          String   @db.VarChar(20)  // running, success, failed, partial
  startedAt       DateTime @default(now()) @db.Timestamptz(6)
  completedAt     DateTime? @db.Timestamptz(6)
  durationMs      Int?
  itemsProcessed  Int      @default(0)
  itemsFailed     Int      @default(0)
  errorMessage    String?  @db.Text
  metadata        Json?    // Detailed results, retry info, etc.

  @@index([jobType, startedAt])
  @@index([status])
  @@map("job_runs")
}
```

### 4. Retry Strategy

**Morning Scrape Retry Logic**:
- Initial attempt: 6:00 AM
- If data incomplete: Retry at 6:15, 6:30, 7:00, 8:00
- Track which meetings succeeded/failed in JobRun.metadata
- Only retry failed meetings
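
"Only retry failed meetings" could work roughly as follows: each attempt reads the previous attempt's metadata and keeps only the meetings that have not yet succeeded. The `MorningScrapeMetadata` shape and the `meetingsToRetry` helper are hypothetical; the design only states that per-meeting outcomes live in `JobRun.metadata`.

```typescript
// Hypothetical metadata shape stored in JobRun.metadata for a morning scrape.
interface MorningScrapeMetadata {
  attempt: number;
  meetings: Record<string, 'success' | 'failed'>;
}

function meetingsToRetry(
  previous: MorningScrapeMetadata | null,
  allMeetingIds: string[],
): string[] {
  // First attempt: nothing recorded yet, so scrape everything.
  if (previous === null) return allMeetingIds;
  // Later attempts: keep meetings that failed or were never recorded
  // (e.g. data that was not yet posted on the earlier attempt).
  return allMeetingIds.filter((id) => previous.meetings[id] !== 'success');
}
```

Treating unrecorded meetings the same as failed ones covers the "data incomplete" case, where a meeting only appears after the first attempt.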

**Pre/Post Race Retry Logic**:
- Initial attempt: Exact scheduled time (T-60, T-15, T+5)
- On failure: Retry with exponential backoff (30s, 1m, 2m)
- Max 3 retries per race
- Track retry attempts in JobRun.metadata
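
A minimal sketch of the backoff described above, with the delays (30s, 1m, 2m) taken from the list. `withRetries` and the injectable `sleep` parameter are assumptions made for illustration and testability, not part of the existing BaseScheduler.

```typescript
// Delays before retry 1, 2 and 3 (30s, 1m, 2m); length doubles as max retries.
const RETRY_DELAYS_MS = [30 * 1000, 60 * 1000, 120 * 1000];

type Sleep = (ms: number) => Promise<void>;

const realSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => setTimeout(() => resolve(), ms));

interface RetryOutcome<T> {
  result: T;
  retries: number; // 0 means the first attempt succeeded
}

async function withRetries<T>(
  task: () => Promise<T>,
  delaysMs: number[] = RETRY_DELAYS_MS,
  sleep: Sleep = realSleep, // injectable so tests need not wait
): Promise<RetryOutcome<T>> {
  let lastError: unknown;
  // One initial attempt plus one retry per configured delay.
  for (let attempt = 0; attempt <= delaysMs.length; attempt++) {
    if (attempt > 0) await sleep(delaysMs[attempt - 1]);
    try {
      return { result: await task(), retries: attempt };
    } catch (err) {
      lastError = err;
    }
  }
  // All attempts failed: surface the last error to the caller.
  throw lastError;
}
```

The returned `retries` count is the value a scheduler could copy into `JobRun.metadata` to track retry attempts.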

### 5. Efficient Race-Time Scheduling

**Option A: Dynamic Cron Jobs**
- After morning scrape, query upcoming races from DB
- Create one-time scheduled jobs for each race time
- Use node-schedule for dynamic job creation
- Pros: Precise timing, no polling
- Cons: Many scheduled jobs in memory

**Option B: Smart Polling with Time Windows**
- Run pre-race scheduler every 5 minutes (not every minute)
- Query DB for races starting 55-65 minutes from now (for T-60)
- Only make API calls for races actually in the window
- Pros: Simpler, handles late schedule changes
- Cons: Less precise timing (but probably acceptable)

**Recommendation**: Start with Option B (simpler, more robust), migrate to Option A if needed.
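
To make Option B concrete, here is a sketch of the window check, using the `minutesBefore` ± `windowSize` convention from the configuration in section 7. It filters an in-memory array for clarity; in practice the same bounds would presumably go into the DB query. `racesInWindow` and the `Race` shape are assumptions.

```typescript
// Hypothetical race shape; the real record likely has more fields.
interface Race {
  id: string;
  startTime: Date;
}

interface TimeWindow {
  name: string;
  minutesBefore: number; // centre of the window, in minutes before start
  windowSize: number;    // half-width of the window, in minutes
}

function racesInWindow(races: Race[], win: TimeWindow, now: Date): Race[] {
  const minMs = (win.minutesBefore - win.windowSize) * 60 * 1000;
  const maxMs = (win.minutesBefore + win.windowSize) * 60 * 1000;
  return races.filter((race) => {
    // Time remaining until the race starts, relative to this poll.
    const untilStartMs = race.startTime.getTime() - now.getTime();
    return untilStartMs >= minMs && untilStartMs <= maxMs;
  });
}
```

With `{ minutesBefore: 60, windowSize: 5 }`, a poll at 12:00 picks up races starting between 12:55 and 13:05, so only those races trigger API calls.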

### 6. Enhanced Observability

**Database Tracking**:
- Every scheduler run creates a JobRun record
- Individual race updates create Scrape records
- Link JobRun → Scrape via metadata

**Metrics Enhancements**:
```typescript
// New metrics
scheduler_job_runs_total{job_type, status}
scheduler_job_duration_seconds{job_type}
scheduler_items_processed_total{job_type}
scheduler_retry_attempts_total{job_type, retry_number}
scheduler_races_by_window{window} // Track how many races in each time window
```

**Dashboard Panels**:
- Job success/failure rates by type
- Average job duration trends
- Items processed per job
- Retry frequency and patterns
- Race coverage (% of races with successful scrapes)

### 7. Configuration Changes

```typescript
export const SCHEDULER_CONFIGS = {
  morning_scrape: {
    cronExpression: '0 6 * * *',
    timezone: 'Australia/Sydney',
    retrySchedule: ['15 6 * * *', '30 6 * * *', '0 7 * * *', '0 8 * * *'], // 6:15, 6:30, 7:00, 8:00
    maxRetries: 4,
  },
  pre_race_check: {
    cronExpression: '*/5 * * * *', // Every 5 minutes
    timezone: 'Australia/Sydney',
    windows: [
      { name: 't60', minutesBefore: 60, windowSize: 5 }, // Check 55-65 min before
      { name: 't15', minutesBefore: 15, windowSize: 5 }, // Check 10-20 min before
    ],
    maxRetries: 3,
    retryDelays: [30, 60, 120], // 30s, 1m, 2m
  },
  post_race_check: {
    cronExpression: '*/5 * * * *', // Every 5 minutes
    timezone: 'Australia/Sydney',
    minutesAfter: 5,
    windowSize: 5, // Check races finished 0-10 minutes ago
    maxRetries: 3,
    retryDelays: [30, 60, 120], // 30s, 1m, 2m
  },
};
```
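
One consequence of this configuration: each window spans 10 minutes (`minutesBefore` ± `windowSize`) while the cron runs every 5 minutes, so a race will typically match the same window on two consecutive polls. A sketch of one way to avoid duplicate scrapes follows. The `claimRaceWindow` helper and the in-memory `Set` are assumptions; a persistent check against existing Scrape records may be a better fit, since the set is lost on restart.

```typescript
// Keys of race/window pairs already handled by this process.
const processedKeys = new Set<string>();

// Returns true the first time a race is seen for a given window,
// and false on every later poll that matches the same pair.
function claimRaceWindow(
  raceId: string,
  windowName: string,
  seen: Set<string> = processedKeys,
): boolean {
  const key = `${raceId}:${windowName}`;
  if (seen.has(key)) return false;
  seen.add(key);
  return true;
}
```

A scheduler would call `claimRaceWindow(race.id, 't60')` before making the API call and skip the race when it returns `false`.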

## Implementation Plan

1. ✅ Add JobRun model to Prisma schema
2. ✅ Update BaseScheduler to create JobRun records
3. ✅ Implement retry logic in BaseScheduler
4. ✅ Update MorningScrapeScheduler with retry strategy
5. ✅ Consolidate pre-race schedulers into single PreRaceCheckScheduler with windows
6. ✅ Update PostRaceScheduler with efficient querying and retries
7. ✅ Add enhanced metrics
8. ✅ Update Grafana dashboard with job tracking panels
9. ✅ Write comprehensive tests

## Migration Path

1. Add JobRun table (non-breaking)
2. Update schedulers to log to JobRun while keeping current behavior
3. Verify observability is working
4. Switch to new scheduling strategy (5-minute intervals)
5. Monitor and tune window sizes based on actual data
6. Add retry logic incrementally
7. Remove old inefficient patterns

## Questions for Discussion

1. **Timing precision**: Is 5-minute polling acceptable, or do we need second-level precision for T-60/T-15?
2. **Retry aggressiveness**: Should we retry more/less frequently? Different strategies per job type?
3. **Race data availability**: Do we know TAB's typical posting schedule? It could inform retry timing.
4. **Job history retention**: How long should we keep JobRun records? 30 days? 90 days?
5. **Alert thresholds**: What failure rates should trigger alerts? >10% failed jobs? >3 consecutive failures?
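
To ground the alert-threshold question, the two candidate rules (>10% failed jobs, >3 consecutive failures) could be evaluated over recent JobRun statuses roughly like this. The function names and default thresholds are placeholders for discussion, not decided values.

```typescript
type Outcome = 'success' | 'failed' | 'partial';

// Fraction of runs with status 'failed' (partial runs are not counted as failures here).
function failureRate(outcomes: Outcome[]): number {
  if (outcomes.length === 0) return 0;
  return outcomes.filter((o) => o === 'failed').length / outcomes.length;
}

// Length of the trailing run of failures; outcomes are ordered oldest to newest.
function consecutiveFailures(outcomes: Outcome[]): number {
  let count = 0;
  for (let i = outcomes.length - 1; i >= 0 && outcomes[i] === 'failed'; i--) {
    count++;
  }
  return count;
}

function shouldAlert(
  outcomes: Outcome[],
  maxFailureRate = 0.1, // placeholder: alert above 10% failed jobs
  maxConsecutive = 3,   // placeholder: alert above 3 consecutive failures
): boolean {
  return (
    failureRate(outcomes) > maxFailureRate ||
    consecutiveFailures(outcomes) > maxConsecutive
  );
}
```

Whether `partial` runs should count toward either threshold is an open choice this sketch leaves out.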