# Scheduler Redesign - Progress Update

## ✅ Completed

### 1. Database Schema - JobRun Tracking
Added `JobRun` model to track all scheduler executions:
- **jobType**: Which scheduler ran (morning_scrape, pre_race_t60, etc.)
- **status**: running, success, partial, failed
- **startedAt/completedAt**: Execution timeline
- **durationMs**: How long the job took
- **itemsProcessed/itemsFailed**: Success metrics
- **errorMessage**: Failure details
- **metadata**: Additional context (JSON)
- **retryCount**: Number of retries for this execution
- **parentJobRunId**: Links retries to original job

Also updated `Scrape` model with `jobRunId` to link individual race scrapes to job runs.
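
As a rough illustration of the record described above, the fields could be typed as follows. This is a sketch only: the field names come from the list, but the types, nullability, the `id` key, and the `newJobRun` helper are assumptions rather than the actual schema.

```typescript
// Hypothetical TypeScript view of a JobRun row. Field names follow the list
// above; types and nullability are assumptions.
type JobStatus = "running" | "success" | "partial" | "failed";

interface JobRun {
  id: string;                         // assumed primary key
  jobType: string;                    // e.g. "morning_scrape", "pre_race_t60"
  status: JobStatus;
  startedAt: Date;
  completedAt: Date | null;           // null while the job is still running
  durationMs: number | null;
  itemsProcessed: number;
  itemsFailed: number;
  errorMessage: string | null;
  metadata: Record<string, unknown>;  // free-form JSON context
  retryCount: number;
  parentJobRunId: string | null;      // set only on retry runs
}

// Build the record a scheduler would insert when a job starts (hypothetical helper).
// A retry always points back at the original run, even when it retries a retry.
function newJobRun(id: string, jobType: string, startedAt: Date, parent?: JobRun): JobRun {
  return {
    id,
    jobType,
    status: "running",
    startedAt,
    completedAt: null,
    durationMs: null,
    itemsProcessed: 0,
    itemsFailed: 0,
    errorMessage: null,
    metadata: {},
    retryCount: parent ? parent.retryCount + 1 : 0,
    parentJobRunId: parent ? (parent.parentJobRunId ?? parent.id) : null,
  };
}
```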

### 2. BaseScheduler - JobRun Integration
Enhanced BaseScheduler to automatically create and track JobRun records:
- Creates JobRun record at job start (status: 'running')
- Updates with results on completion (success/partial/failed)
- Stores itemsProcessed, errors, duration, metadata
- Links JobContext to JobRun via jobRunId
- All existing metrics still recorded (Prometheus counters/histograms)
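
The lifecycle above can be sketched as a wrapper around the job body. This is not the real BaseScheduler: an in-memory array stands in for the database, the job is synchronous for brevity, and `runTracked`, `TrackedRun`, `JobResult`, and the rule for choosing `partial` are all assumptions.

```typescript
// Sketch only: jobRuns stands in for the job_runs table.
type RunStatus = "running" | "success" | "partial" | "failed";

interface JobResult {
  itemsProcessed: number; // items that succeeded (assumed meaning)
  itemsFailed: number;
}

interface TrackedRun {
  jobType: string;
  status: RunStatus;
  startedAtMs: number;
  durationMs?: number;
  itemsProcessed?: number;
  itemsFailed?: number;
  errorMessage?: string;
}

const jobRuns: TrackedRun[] = [];

function runTracked(jobType: string, job: (run: TrackedRun) => JobResult): TrackedRun {
  // 1. Record the run as 'running' before any work happens.
  const run: TrackedRun = { jobType, status: "running", startedAtMs: Date.now() };
  jobRuns.push(run);
  try {
    // 2. Execute the job; it receives the run so scrapes can be linked to it.
    const result = job(run);
    run.itemsProcessed = result.itemsProcessed;
    run.itemsFailed = result.itemsFailed;
    // 3. Assumed rule: no failures = success, mixed = partial, nothing succeeded = failed.
    run.status =
      result.itemsFailed === 0 ? "success" : result.itemsProcessed > 0 ? "partial" : "failed";
  } catch (err) {
    // A thrown error marks the whole run as failed and keeps the message.
    run.status = "failed";
    run.errorMessage = err instanceof Error ? err.message : String(err);
  } finally {
    // Duration is recorded on every path, including failures.
    run.durationMs = Date.now() - run.startedAtMs;
  }
  return run;
}
```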

### 3. API Efficiency - Reduced Polling
Changed scheduler intervals from **every minute** to **every 5 minutes**:
- Pre-race T-60: `*/5 * * * *` (288 checks/day vs 1,440 = **80% reduction**)
- Pre-race T-15: `*/5 * * * *` (288 checks/day vs 1,440 = **80% reduction**)
- Post-race T+5: `*/5 * * * *` (288 checks/day vs 1,440 = **80% reduction**)
- Morning scrape: Still `0 6 * * *` (daily at 6 AM)
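
The reduction figures follow directly from how often each cron expression fires. A small worked check, where `checksPerDay` is a hypothetical helper that only handles `*/N` minute steps:

```typescript
// Number of times "*/N * * * *" fires per day: the step restarts at minute 0
// of every hour, so it fires ceil(60 / N) times per hour.
function checksPerDay(stepMinutes: number): number {
  return Math.ceil(60 / stepMinutes) * 24;
}

const checksBefore = checksPerDay(1); // every minute
const checksAfter = checksPerDay(5);  // every 5 minutes
const reductionPercent = Math.round((1 - checksAfter / checksBefore) * 100);

console.log(checksBefore, checksAfter, reductionPercent); // 1440 288 80
```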

### 4. Configuration - Retry Strategy
Added retry configurations:

**Morning Scrape Retries**:
- 6:15 AM - First retry (if initial 6:00 AM failed)
- 6:30 AM - Second retry
- 7:00 AM - Final retry
- Tracks which meetings succeeded/failed individually

**Pre-Race Retries**:
- Max 2 retries per race
- Delays: 1 min, 2 min
- For races where initial T-60 or T-15 update fails

**Post-Race Retries**:
- Max 3 retries per race
- Delays: 5 min, 5 min, 10 min
- Handles provisional → confirmed result transition
- Accounts for variable confirmation timing (TAB confirms results 5-15+ min after the race)
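
For reference, the retry strategy above could be expressed as a typed config object. The shape and names (`SchedulerRetryConfig`, `RetryPolicy`, `totalRetrySpanMinutes`) are hypothetical; only the values are taken from the lists above.

```typescript
// Hypothetical config shape; values mirror the retry strategy described above.
interface RetryPolicy {
  maxRetries: number;
  delaysMinutes: number[]; // delay before each retry, in order
}

interface SchedulerRetryConfig {
  // Morning retries run at fixed wall-clock times rather than relative delays.
  morningScrape: { initialRun: string; retryTimes: string[] };
  preRace: RetryPolicy;
  postRace: RetryPolicy;
}

const retryConfig: SchedulerRetryConfig = {
  morningScrape: { initialRun: "06:00", retryTimes: ["06:15", "06:30", "07:00"] },
  preRace: { maxRetries: 2, delaysMinutes: [1, 2] },
  postRace: { maxRetries: 3, delaysMinutes: [5, 5, 10] },
};

// Minutes between the first failed attempt and the final retry.
function totalRetrySpanMinutes(policy: RetryPolicy): number {
  return policy.delaysMinutes.reduce((sum, delay) => sum + delay, 0);
}
```

With these values, pre-race retries finish within 3 minutes of the first failure, and post-race retries extend 20 minutes past it.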

### 5. Time Window Adjustments
Widened time windows to accommodate 5-minute intervals:
- **Pre-race T-60**: 55-70 minutes before (15-minute window)
- **Pre-race T-15**: 10-25 minutes before (15-minute window)
- **Post-race Initial**: 5-20 minutes after (for provisional results)
- **Post-race Confirmed**: 10-30 minutes after (for confirmed results)

Because each window is wider than the 5-minute check interval, every race falls inside its window on at least one check, so no races are missed between checks.
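
A minimal sketch of the window check, to show why a 5-minute cadence cannot skip a race. The names (`TimeWindow`, `isInPreRaceWindow`, `countChecksInWindow`) are hypothetical, and treating the window bounds as inclusive is an assumption.

```typescript
// Sketch under assumptions: bounds are inclusive, times are compared in minutes.
interface TimeWindow {
  minMinutesBefore: number;
  maxMinutesBefore: number;
}

const T60_WINDOW: TimeWindow = { minMinutesBefore: 55, maxMinutesBefore: 70 };
const T15_WINDOW: TimeWindow = { minMinutesBefore: 10, maxMinutesBefore: 25 };

// True when the race starts between minMinutesBefore and maxMinutesBefore from now.
function isInPreRaceWindow(raceStart: Date, now: Date, w: TimeWindow): boolean {
  const minutesUntilStart = (raceStart.getTime() - now.getTime()) / 60000;
  return minutesUntilStart >= w.minMinutesBefore && minutesUntilStart <= w.maxMinutesBefore;
}

// Count how many of `checks` evenly spaced checks land inside the window.
function countChecksInWindow(
  raceStart: Date,
  firstCheck: Date,
  w: TimeWindow,
  intervalMinutes: number,
  checks: number,
): number {
  let hits = 0;
  for (let i = 0; i < checks; i++) {
    const now = new Date(firstCheck.getTime() + i * intervalMinutes * 60000);
    if (isInPreRaceWindow(raceStart, now, w)) hits++;
  }
  return hits;
}
```

For a race starting at 14:02, the checks at 12:55, 13:00, and 13:05 all fall inside the T-60 window, so the same race is seen more than once and duplicate scrapes need to be filtered out.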

## 🚧 In Progress / Next Steps

### 1. Implement Retry Logic in BaseScheduler
Currently, JobRun records are created but retries are not yet automated. Need to:
- Add `scheduleRetry()` method to BaseScheduler
- Implement retry delay logic (exponential backoff, reconciled with the fixed per-job delays configured above)
- Link retry JobRuns to parent via `parentJobRunId`
- Add retry counter tracking
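
One possible shape for `scheduleRetry()`, written as a pure function so it is easy to test. Everything here is a sketch: the `BackoffPolicy`, `FailedRun`, and `PlannedRetry` types are hypothetical, and plain doubling reproduces the pre-race delays (1 min, 2 min) but not the post-race list (5, 5, 10), which would need an explicit delay list instead.

```typescript
// Hypothetical types; times are epoch milliseconds.
interface BackoffPolicy {
  maxRetries: number;
  baseDelayMs: number; // delay before the first retry
  maxDelayMs: number;  // cap so the delay cannot grow without bound
}

interface FailedRun {
  id: string;
  retryCount: number;            // 0 for the original run
  parentJobRunId: string | null; // null for the original run
}

interface PlannedRetry {
  runAtMs: number;
  retryCount: number;
  parentJobRunId: string;
}

// Returns the next retry to schedule, or null when retries are exhausted.
function scheduleRetry(failed: FailedRun, policy: BackoffPolicy, nowMs: number): PlannedRetry | null {
  if (failed.retryCount >= policy.maxRetries) return null;
  // Exponential backoff: base, 2x base, 4x base, ... capped at maxDelayMs.
  const delayMs = Math.min(policy.baseDelayMs * Math.pow(2, failed.retryCount), policy.maxDelayMs);
  return {
    runAtMs: nowMs + delayMs,
    retryCount: failed.retryCount + 1,
    // Always link back to the original run, not to the previous retry.
    parentJobRunId: failed.parentJobRunId ?? failed.id,
  };
}
```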

### 2. Update MorningScrapeScheduler
Enhance to handle individual meeting failures:
- Track which meetings succeeded/failed in metadata
- Only retry failed meetings (not entire job)
- Respect retry schedule (6:15, 6:30, 7:00)
- Store detailed meeting-level results in JobRun.metadata
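
A sketch of how meeting-level results might be stored in `JobRun.metadata` and used to retry only the failures. The metadata layout and the helper names (`failedMeetingIds`, `mergeRetryResults`) are assumptions for illustration.

```typescript
// Hypothetical layout for the morning scrape's JobRun.metadata.
interface MeetingResult {
  meetingId: string;
  succeeded: boolean;
  error?: string;
}

interface MorningScrapeMetadata {
  meetings: MeetingResult[];
}

// IDs of the meetings the next retry (6:15, 6:30, or 7:00) should attempt.
function failedMeetingIds(metadata: MorningScrapeMetadata): string[] {
  return metadata.meetings.filter((m) => !m.succeeded).map((m) => m.meetingId);
}

// Combine an earlier run with a retry: retried meetings take their latest
// result, untouched meetings keep their earlier one.
function mergeRetryResults(
  previous: MorningScrapeMetadata,
  retry: MorningScrapeMetadata,
): MorningScrapeMetadata {
  const retriedIds = retry.meetings.map((m) => m.meetingId);
  const untouched = previous.meetings.filter((m) => retriedIds.indexOf(m.meetingId) === -1);
  return { meetings: untouched.concat(retry.meetings) };
}
```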

### 3. Update PreRaceScheduler
Make it query-efficient:
- Query the DB for races inside the time window instead of polling when no race is due
- Only make API calls for races actually in the window
- Track last update time to avoid duplicate scrapes
- Link scrapes to JobRun via jobRunId
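
The selection step could look roughly like this. In the real scheduler this would be a database query; here an in-memory filter stands in for it, and `RaceRow`, `selectRacesToScrape`, and the `minGapMinutes` de-duplication threshold are all hypothetical.

```typescript
// Hypothetical row shape; times are epoch milliseconds.
interface RaceRow {
  id: string;
  startTimeMs: number;
  lastScrapedAtMs: number | null; // null if the race has never been scraped
}

// Keep only races that are inside the window and were not scraped recently,
// so API calls are made only for races that actually need an update.
function selectRacesToScrape(
  races: RaceRow[],
  nowMs: number,
  minMinutesBefore: number,
  maxMinutesBefore: number,
  minGapMinutes: number,
): RaceRow[] {
  return races.filter((race) => {
    const minutesUntilStart = (race.startTimeMs - nowMs) / 60000;
    const inWindow =
      minutesUntilStart >= minMinutesBefore && minutesUntilStart <= maxMinutesBefore;
    const recentlyScraped =
      race.lastScrapedAtMs !== null && (nowMs - race.lastScrapedAtMs) / 60000 < minGapMinutes;
    return inWindow && !recentlyScraped;
  });
}
```

Setting `minGapMinutes` to the window width (15 here) would mean each race is scraped once per window even though several checks see it.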

### 4. Update PostRaceScheduler
Handle provisional → confirmed transition:
- Initial check at T+5 (may return provisional results)
- Check Result.official flag in DB
- Schedule follow-up checks if not yet confirmed
- Stop retrying once Result.official = true
- Max 3 retries to avoid infinite loops
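
The decision at each post-race check can be sketched as a small pure function. It assumes the `Result.official` flag is available (which is still an open question) and reuses the 5, 5, 10 minute delays from the retry configuration; `nextPostRaceAction` and `PostRaceAction` are hypothetical names.

```typescript
// Hypothetical outcome of one post-race check.
type PostRaceAction =
  | { kind: "done" }                          // result is official, stop checking
  | { kind: "retry"; delayMinutes: number }   // still provisional, check again later
  | { kind: "give_up" };                      // retries exhausted without confirmation

// Delays before retry 1, 2, and 3, taken from the post-race retry config.
const POST_RACE_DELAYS_MINUTES: number[] = [5, 5, 10];

function nextPostRaceAction(official: boolean, retriesSoFar: number): PostRaceAction {
  if (official) return { kind: "done" };
  const delayMinutes = POST_RACE_DELAYS_MINUTES[retriesSoFar];
  // No delay configured for this attempt means the retry budget is spent.
  if (delayMinutes === undefined) return { kind: "give_up" };
  return { kind: "retry", delayMinutes };
}
```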

### 5. Enhanced Observability
Add new Prometheus metrics:
- `scheduler_job_runs_total{job_type, status}` (track success/failure rates)
- `scheduler_retry_attempts_total{job_type, retry_number}` (retry patterns)
- `scheduler_races_by_window{window}` (how many races in each window)

Update Grafana dashboard with:
- Job success/failure rates over time
- Average job duration trends
- Retry frequency analysis
- Race coverage percentage (% of races successfully scraped)

### 6. Write Tests
Comprehensive test suite:
- Unit tests for retry logic
- Unit tests for time window calculations
- Integration tests for JobRun tracking
- Mocks for API client to test error scenarios
- Test race edge cases (late scratchings, provisional results, etc.)

## 📊 Current Status

- **Database**: ✅ Schema migrated, JobRun table active
- **BaseScheduler**: ✅ Creating and tracking JobRun records
- **Config**: ✅ Updated to 5-minute intervals with retry configs
- **Schedulers**: ⏳ Running but need updates for new strategy
- **Tests**: ❌ Not yet written

## 🔍 Verification

To confirm that JobRun records are being created, run:
```bash
docker exec racing-postgres psql -U racing -d racing_db -c '
  SELECT "jobType", status, "itemsProcessed", "durationMs", "startedAt"
  FROM job_runs
  ORDER BY "startedAt" DESC
  LIMIT 10;
'
```

Current output shows schedulers running every 5 minutes and creating successful JobRun records.

## 💭 Decisions Needed

Before implementing the remaining pieces, I'd like your input on:

1. **Morning Scrape Retry Logic**: Should I:
   - Automatically retry all failed meetings at 6:15, 6:30, 7:00?
   - Or wait for manual trigger to retry specific meetings?
   - Or create separate retry jobs that check previous JobRun failures?

2. **Post-Race Confirmation Detection**: How should we determine if a result is "confirmed"?
   - Check Result.official flag (if TAB API provides it)?
   - Wait a fixed time (e.g., always wait 15 min)?
   - Retry until result doesn't change between checks?

3. **Race-Level Retry Tracking**: Should we:
   - Store retry metadata in Scrape table?
   - Create a separate RaceJobTracking table?
   - Just use JobRun.metadata JSON field?

4. **Job History Cleanup**: Should I add an automated cleanup job for old JobRuns?
   - Delete JobRuns older than 30 days?
   - Or keep them indefinitely for analysis?

## 📝 Notes

- The system is currently running and stable with the new configuration
- 5-minute intervals are active, cutting each pre-race and post-race scheduler from 1,440 to 288 checks per day (80% fewer)
- JobRun tracking is working - every job execution is recorded with its status, duration, and item counts
- No code has been removed - all old functionality is preserved
- Migration is backwards compatible - can revert config if needed
