# HRNZ Web Scraper - Implementation Complete ✅

**Date**: 2026-01-07
**Status**: Fully Implemented and Ready to Use

---

## Summary

A complete web scraping solution has been implemented for extracting historical harness racing data from the HRNZ InfoHorse results archive. This provides an alternative data source when official API access is not available.

## What Was Implemented

### 1. Core Scraper Module ✅
**Location**: `packages/hrnz_scraper/scraper.py`

**Features**:
- Async HTTP client with rate limiting (1 req/2 seconds)
- Beautiful Soup HTML parsing
- Polite scraping with proper User-Agent
- Error handling and retry logic
- Context manager support

**Key Methods**:
```python
async with HRNZScraper() as scraper:
    meeting = await scraper.get_meeting_results('102402rs.htm')
    # Returns: meeting data with races and starters
```
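
The retry logic listed under Features is not shown in this document. The sketch below illustrates one common approach, exponential backoff around an async fetch. The helper name `with_retries` and its parameters are hypothetical and are not taken from `scraper.py`.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def with_retries(
    fetch: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 1.0,
) -> T:
    """Call fetch(), retrying with exponential backoff on any exception.

    Hypothetical helper: the real scraper may retry differently.
    """
    for attempt in range(attempts):
        try:
            return await fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, then 4x, ... before the next try
            await asyncio.sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

A production version would likely catch only transient errors (timeouts, 5xx responses) rather than every exception.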

### 2. Data Mapper ✅
**Location**: `packages/hrnz_scraper/mapper.py`

**Features**:
- Converts scraped data to TipSharks format
- Generates deterministic IDs from UUIDs or names
- Maps horses, drivers, trainers, races, starters
- Handles missing data gracefully

**Key Methods**:
```python
mapper = HRNZDataMapper()
meeting = mapper.map_meeting(scraped_data)
entities = mapper.map_entities(scraped_data)
races = mapper.map_races(scraped_data, meeting_id)
starters = mapper.map_starters(scraped_data, race_id_map)
```

### 3. CLI Command ✅
**Location**: `apps/backend/worker/cli.py`

**Command**: `scrape-hrnz`

**Usage**:
```bash
docker compose run --rm worker python -m apps.backend.worker.cli scrape-hrnz \
  --urls hrnz_urls.txt \
  --from 2024-01-01 \
  --to 2024-12-31
```

**Features**:
- Reads URLs from file
- Optional date filtering
- Progress tracking with Rich output
- Detailed statistics reporting
- Error handling per URL
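
As a rough illustration of the first two features, the sketch below reads a URL file and applies an inclusive date window. The function names are hypothetical, and skipping `#` comment lines is an assumption rather than documented behaviour of `scrape-hrnz`.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def read_url_list(path: str) -> list[str]:
    """Return one URL per non-empty line of the file.

    Assumption: lines starting with '#' are treated as comments.
    """
    urls = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls


def in_date_range(
    meeting_date: date, start: Optional[date], end: Optional[date]
) -> bool:
    """True when meeting_date lies in the inclusive [start, end] window.

    A bound of None means that side of the window is open.
    """
    if start is not None and meeting_date < start:
        return False
    if end is not None and meeting_date > end:
        return False
    return True
```

Because the meeting date comes from the page itself, a filter like this can only run after the page has been fetched and parsed.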

### 4. Documentation ✅
**Files Created**:
- `HRNZ_SCRAPER_GUIDE.md` - Complete user guide
- `HRNZ_SCRAPER_IMPLEMENTATION.md` - This file
- `hrnz_sample_urls.txt` - Sample URL format

### 5. Dependencies Added ✅
**Updated**: `pyproject.toml`

Added `beautifulsoup4>=4.12.0` to dependencies.

---

## Technical Details

### Data Extraction

The scraper extracts from HRNZ result pages:

**Meeting Level**:
- Meeting date
- Venue name
- Location
- Meeting ID (generated from date + venue)

**Race Level**:
- Race number
- Race name
- Distance (meters)
- Start type (Mobile/Standing)
- Purse/stakes
- Gait (defaults to Pace)

**Starter Level**:
- Horse name and UUID
- Driver name and UUID
- Trainer name and UUID
- Barrier draw
- Finishing position
- Handicap
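
The three levels above nest naturally. The scraper itself returns plain dicts, so the dataclasses below are only a hypothetical typed view of the same fields. All field names and the `meeting_id` format are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Starter:
    horse_name: str
    horse_uuid: Optional[str] = None
    driver_name: Optional[str] = None
    driver_uuid: Optional[str] = None
    trainer_name: Optional[str] = None
    trainer_uuid: Optional[str] = None
    barrier: Optional[int] = None
    finish_position: Optional[int] = None
    handicap: Optional[str] = None


@dataclass
class Race:
    number: int
    name: str
    distance_m: Optional[int] = None
    start_type: Optional[str] = None  # "Mobile" or "Standing"
    stakes: Optional[float] = None
    gait: str = "Pace"  # default named in the list above
    starters: list[Starter] = field(default_factory=list)


@dataclass
class Meeting:
    meeting_date: date
    venue: str
    location: Optional[str] = None
    races: list[Race] = field(default_factory=list)

    @property
    def meeting_id(self) -> str:
        # Hypothetical format: ISO date plus a lowercase venue slug
        slug = "-".join(self.venue.lower().split())
        return f"{self.meeting_date.isoformat()}-{slug}"
```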

### HTML Parsing Strategy

The scraper uses multiple parsing strategies:

1. **Header Parsing**: Extracts meeting info from page header
2. **Date Extraction**: Uses regex to find date patterns
3. **Race Section Detection**: Finds race divs/tables
4. **Table Parsing**: Extracts starter data from result tables
5. **UUID Extraction**: Pulls UUIDs from href attributes
6. **Fallback Logic**: Generates IDs from names when UUIDs are missing
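
Steps 2 and 5 can be sketched with the standard library alone. The date format ("24 October 2002") and the href shape used in the usage note are assumptions, since the actual patterns in `scraper.py` are not reproduced in this document.

```python
import re
from datetime import date
from typing import Optional

# Canonical 8-4-4-4-12 hex UUID, matched case-insensitively
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)
# Assumed date format: day, English month name, four-digit year
DATE_RE = re.compile(r"\b(\d{1,2})\s+([A-Za-z]+)\s+(\d{4})\b")
MONTHS = [
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
]


def extract_uuid(href: str) -> Optional[str]:
    """Return the first UUID found in an href, upper-cased, or None."""
    match = UUID_RE.search(href)
    return match.group(0).upper() if match else None


def extract_date(text: str) -> Optional[date]:
    """Return the first 'D Month YYYY' date found in text, or None."""
    match = DATE_RE.search(text)
    if not match:
        return None
    day, month_name, year = match.groups()
    if month_name.lower() not in MONTHS:
        return None
    try:
        return date(int(year), MONTHS.index(month_name.lower()) + 1, int(day))
    except ValueError:  # e.g. "31 February 2024"
        return None
```

For example, `extract_uuid("horse?id=9c16d577-347d-436c-9562-be76ccb85eb1")` returns the upper-cased UUID, and an href without a UUID returns `None`, which is the cue for the name-based fallback in step 6.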

### ID Generation

**UUID Available** (preferred):
```python
horse_id = "9C16D577-347D-436C-9562-BE76CCB85EB1"  # From HRNZ
```

**UUID Missing** (fallback):
```python
import hashlib
horse_id = hashlib.md5("Horse Name".encode()).hexdigest()[:8]
```

This ensures deterministic IDs even when HRNZ doesn't provide UUIDs.
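
The two cases can be combined into one helper. The sketch below is hypothetical and goes one step beyond the snippet above: it normalizes whitespace and case before hashing, which is an assumption and not confirmed behaviour of `mapper.py`.

```python
import hashlib
from typing import Optional


def entity_id(name: str, uuid: Optional[str] = None) -> str:
    """Prefer the HRNZ UUID; otherwise derive a stable ID from the name."""
    if uuid:
        return uuid.upper()
    # Assumption: collapse whitespace and lowercase so that trivial
    # formatting differences map to the same ID
    normalized = " ".join(name.split()).lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()[:8]
```

An 8-character hex prefix carries only 32 bits, so two different names can collide once the number of entities grows large. A longer prefix lowers that risk at the cost of longer IDs.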

### Rate Limiting

Built-in rate limiting prevents overwhelming HRNZ servers:

```python
RATE_LIMIT_DELAY = 2.0  # seconds between requests

# Automatically enforced on every request
await self._rate_limited_get(url)
```
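
The body of `_rate_limited_get` is not reproduced here. One way such a limiter could work is sketched below: record when the last request went out and sleep for whatever remains of the delay. The `RateLimiter` class is hypothetical and only illustrates the technique.

```python
import asyncio
import time
from typing import Optional


class RateLimiter:
    """Enforce a minimum gap between consecutive calls to wait()."""

    def __init__(self, delay: float = 2.0) -> None:
        self.delay = delay
        self._last: Optional[float] = None
        self._lock = asyncio.Lock()  # serializes concurrent callers

    async def wait(self) -> None:
        async with self._lock:
            if self._last is not None:
                remaining = self.delay - (time.monotonic() - self._last)
                if remaining > 0:
                    await asyncio.sleep(remaining)
            # Record the release time so the next caller waits from here
            self._last = time.monotonic()


async def demo() -> float:
    limiter = RateLimiter(delay=0.05)  # short delay for demonstration only
    start = time.monotonic()
    for _ in range(3):
        await limiter.wait()  # a real client would issue its GET after this
    return time.monotonic() - start
```

The first call passes immediately. Each later call waits only for the part of the delay that has not already elapsed, so slow responses do not add extra idle time.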

**Impact** (enforced delay only, before fetch and parse time):
- 100 meetings = ~3.3 minutes minimum (about 200 seconds)
- 1000 meetings = ~33 minutes minimum (about 2000 seconds)
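
The floor imposed by the delay can be computed directly. This small sketch assumes one request per meeting and counts only the gaps between requests. Download and parse time come on top, and `minimum_scrape_seconds` is a hypothetical helper.

```python
RATE_LIMIT_DELAY = 2.0  # seconds between requests, as configured above


def minimum_scrape_seconds(meetings: int, delay: float = RATE_LIMIT_DELAY) -> float:
    """Lower bound on run time: one delay between each pair of requests."""
    return max(meetings - 1, 0) * delay


for n in (100, 1000):
    print(f"{n} meetings: at least {minimum_scrape_seconds(n) / 60:.1f} minutes")
```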

---

## Architecture

```
┌─────────────────┐
│  CLI Command    │
│  scrape-hrnz    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  HRNZScraper    │  ◄── Fetches HTML, parses with BeautifulSoup
│  (scraper.py)   │
└────────┬────────┘
         │ Returns dict
         ▼
┌─────────────────┐
│ HRNZDataMapper  │  ◄── Converts to TipSharks format
│  (mapper.py)    │
└────────┬────────┘
         │ Returns mapped dicts
         ▼
┌─────────────────┐
│  Repositories   │  ◄── Upsert to database
│  (storage)      │
└─────────────────┘
         │
         ▼
┌─────────────────┐
│  PostgreSQL DB  │
└─────────────────┘
```

---

## Usage Examples

### Basic Scraping

```bash
# 1. Create URL list
cat > my_urls.txt << EOF
102402rs.htm
102502rs.htm
EOF

# 2. Scrape
docker compose run --rm worker python -m apps.backend.worker.cli \
  scrape-hrnz --urls my_urls.txt

# 3. Compute ratings
docker compose run --rm worker python -m apps.backend.worker.cli \
  recompute --from 2024-01-01 --to 2024-12-31
```

### With Date Filtering

```bash
# Only import meetings from 2024
docker compose run --rm worker python -m apps.backend.worker.cli \
  scrape-hrnz --urls all_urls.txt \
  --from 2024-01-01 \
  --to 2024-12-31
```

### Programmatic Usage

```python
import asyncio

from packages.hrnz_scraper import HRNZScraper
from packages.hrnz_scraper.mapper import HRNZDataMapper

async def scrape_example():
    async with HRNZScraper() as scraper:
        # Scrape a meeting
        meeting_data = await scraper.get_meeting_results('102402rs.htm')

        # Map to TipSharks format
        mapper = HRNZDataMapper()
        meeting = mapper.map_meeting(meeting_data)
        races = mapper.map_races(meeting_data, meeting['id'])
        entities = mapper.map_entities(meeting_data)

        print(f"Found {len(races)} races")
        print(f"Found {len(entities['horses'])} unique horses")

# Run the coroutine from synchronous code
asyncio.run(scrape_example())

---

## Limitations & Considerations

### Known Limitations

1. **No Auto-Discovery**: Cannot automatically find meeting URLs
   - Must provide explicit URL list
   - Requires manual URL collection from HRNZ index

2. **HTML Structure Dependency**: Relies on current HTML structure
   - May break if HRNZ changes page layout
   - Requires maintenance if structure changes

3. **Limited Data Fields**: Some fields not consistently available:
   - Race times (sometimes missing)
   - Sectional times (rarely available)
   - Betting odds (sometimes missing)
   - Detailed track conditions

4. **UUID Availability**: Not all entities have UUIDs
   - Falls back to an 8-character MD5 prefix of the entity name
   - A renamed or differently spelled entity gets a new ID and appears as a duplicate

### Legal & Ethical Considerations

⚠️ **IMPORTANT**:

1. **Check HRNZ Terms of Service** before large-scale scraping
2. **Contact HRNZ first** for official data access
3. **Respect rate limits** (built-in: 1 req/2 sec)
4. **Use sparingly** - this is a fallback option

**Recommended Approach**:
1. Try official HRNZ API access first
2. Contact HRNZ for bulk data export
3. Only use scraper if official access unavailable

---

## Testing & Validation

### Manual Testing Steps

1. **Create test URL file**:
```bash
echo "102402rs.htm" > test_urls.txt
```

2. **Run scraper**:
```bash
docker compose run --rm worker python -m apps.backend.worker.cli \
  scrape-hrnz --urls test_urls.txt
```

3. **Verify database**:
```bash
# Check imported data
docker compose exec db psql -U tipsharks -d tipsharks -c "
  SELECT
    (SELECT COUNT(*) FROM meetings) AS meetings,
    (SELECT COUNT(*) FROM races)    AS races,
    (SELECT COUNT(*) FROM horses)   AS horses;
"
```

4. **Compute ratings**:
```bash
docker compose run --rm worker python -m apps.backend.worker.cli \
  recompute --from 2024-01-01 --to 2024-12-31
```

5. **Test API**:
```bash
curl "http://localhost:8000/ratings/horses?limit=5" | python3 -m json.tool
```

### Expected Output

```
Scraping 1 HRNZ meetings
Date filter: 2000-01-01 to 2030-12-31

⚠ Using web scraper - please ensure compliance with HRNZ ToS

[1/1] Scraping 102402rs.htm...
  ✓ Imported: 8 races, 96 starters

┏━━━━━━━━━━┳━━━━━━━┓
┃ Entity   ┃ Count ┃
┡━━━━━━━━━━╇━━━━━━━┩
│ Meetings │     1 │
│ Races    │     8 │
│ Starters │    96 │
│ Horses   │    60 │
│ Drivers  │    15 │
│ Trainers │    12 │
│ Errors   │     0 │
└──────────┴───────┘

✓ Scraping completed successfully
Tip: Run 'recompute' to compute ratings for imported data
```

---

## Maintenance & Updates

### If HRNZ Changes HTML Structure

The scraper may need updates if HRNZ modifies their page structure. Key areas to check:

1. **Meeting Header Parsing** (`_parse_meeting_header`):
   - Look for h1/h2 tags with meeting info
   - Date extraction regex patterns

2. **Race Section Detection** (`_parse_races`):
   - Div class names for race sections
   - Table structure for races

3. **Starter Table Parsing** (`_parse_starters_table`):
   - Table column order
   - Link patterns for UUIDs
   - Cell content extraction

### Update Process

1. Inspect new HTML structure
2. Update parsing logic in `scraper.py`
3. Test with sample URLs
4. Update tests if needed
5. Document changes

---

## Integration with Existing System

### Data Flow

```
HRNZ Website (HTML)
    ↓
HRNZScraper (extract)
    ↓
HRNZDataMapper (transform)
    ↓
Repositories (load)
    ↓
PostgreSQL Database
    ↓
Rating Engine (compute)
    ↓
API Endpoints (serve)
```

### Compatibility

✅ **Compatible with**:
- Existing TAB API ingestion
- Mock data generation
- Rating computation
- All API endpoints
- Database schema

The scraper writes to the same database tables as TAB API ingestion, so:
- Data can be mixed (TAB + HRNZ)
- Ratings computed across all sources
- API serves unified data

---

## Performance Considerations

### Memory Usage

- **Minimal**: Processes one meeting at a time
- **Streaming**: Does not load entire HTML archive into memory
- **Database**: Uses efficient upsert operations

### Speed

- **Rate Limited**: 2 seconds between requests
- **Scalability**: Can handle 1000s of meetings
- **Parallelization**: Not currently supported (to respect rate limits)

### Database Impact

- **Idempotent**: Safe to re-run same URLs
- **Upsert Logic**: Updates existing records
- **No Duplicates**: ON CONFLICT handling
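
The idempotency claim can be demonstrated with a small upsert. The sketch below uses the standard-library `sqlite3` module as a stand-in for PostgreSQL, since both accept this `ON CONFLICT ... DO UPDATE` form. The two-column `horses` table is a hypothetical simplification of the real schema.

```python
import sqlite3


def upsert_horse(conn: sqlite3.Connection, horse_id: str, name: str) -> None:
    """Insert a horse, or update its name if the ID already exists."""
    conn.execute(
        """
        INSERT INTO horses (id, name) VALUES (?, ?)
        ON CONFLICT (id) DO UPDATE SET name = excluded.name
        """,
        (horse_id, name),
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE horses (id TEXT PRIMARY KEY, name TEXT NOT NULL)")

# Importing the same record twice leaves a single row
for _ in range(2):
    upsert_horse(conn, "9C16D577-347D-436C-9562-BE76CCB85EB1", "Horse Name")

count = conn.execute("SELECT COUNT(*) FROM horses").fetchone()[0]
print(count)  # prints 1
```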

---

## Future Enhancements

Potential improvements (not currently implemented):

1. **Auto-Discovery**: Scrape HRNZ results index to find URLs
2. **Parallel Processing**: Multi-worker scraping (with rate limiting)
3. **Incremental Updates**: Only scrape new meetings
4. **Data Validation**: Enhanced checking of scraped data
5. **Progress Persistence**: Resume from interruption
6. **Proxy Support**: Rotate IPs for large-scale scraping

---

## Files Added/Modified

### New Files

1. `packages/hrnz_scraper/__init__.py` - Module init
2. `packages/hrnz_scraper/scraper.py` - Core scraper (450+ lines)
3. `packages/hrnz_scraper/mapper.py` - Data mapper (250+ lines)
4. `HRNZ_SCRAPER_GUIDE.md` - User documentation
5. `HRNZ_SCRAPER_IMPLEMENTATION.md` - This file
6. `hrnz_sample_urls.txt` - Sample URL file

### Modified Files

1. `apps/backend/worker/cli.py` - Added `scrape-hrnz` command
2. `pyproject.toml` - Added beautifulsoup4 dependency
3. `pyproject.toml` - Added hrnz_scraper to package includes

---

## Summary

✅ **Complete implementation** of HRNZ web scraper
✅ **Production-ready** with error handling and rate limiting
✅ **Fully documented** with user guide and implementation details
✅ **Tested and working** (structure validated against live HRNZ pages)
✅ **Integrated** with existing TipSharks architecture

**IMPORTANT UPDATE (January 2026)**: HRNZ is **not issuing API keys** until further notice, so the web scraper is now the **primary method** (no longer a fallback) for obtaining historical NZ harness racing data. The terms-of-service and rate-limit considerations above still apply.

---

## Current Status & Next Steps

### ✅ Phase 1 Complete: Web Scraper
- HRNZ scraper fully implemented
- CLI command ready (`scrape-hrnz`)
- Data mapping to TipSharks format working
- Documentation complete

### 🚀 Phase 2 Required: Reconciliation System
**Status**: Design complete, implementation pending

HRNZ API keys are unavailable, requiring a **hybrid approach**:
- **HRNZ Scraper**: Historical data backfill
- **TAB Live API**: Ongoing race capture
- **Reconciliation**: Merge without duplicates

**See**: `RECONCILIATION_IMPLEMENTATION_PLAN.md` for complete implementation plan
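
The reconciliation design lives in the plan referenced above and is not reproduced here. As a rough illustration of "merge without duplicates", the sketch below matches meetings from both sources on a date plus normalized venue key and records which sources contributed. The function names, the dict shape, and the matching rule are all assumptions.

```python
from datetime import date


def meeting_key(meeting_date: date, venue: str) -> tuple[date, str]:
    """Match key: the date plus the venue, lowercased and whitespace-collapsed."""
    return meeting_date, " ".join(venue.lower().split())


def merge_meetings(hrnz: list[dict], tab: list[dict]) -> list[dict]:
    """Combine meetings from both sources, one entry per match key."""
    merged: dict[tuple[date, str], dict] = {}
    for source, rows in (("hrnz", hrnz), ("tab", tab)):
        for row in rows:
            key = meeting_key(row["date"], row["venue"])
            entry = merged.setdefault(
                key, {"date": row["date"], "venue": row["venue"], "sources": []}
            )
            if source not in entry["sources"]:
                entry["sources"].append(source)  # track provenance
    return list(merged.values())
```

Real venue names may differ between sources by more than case and spacing, so an alias table or fuzzy matching may be needed in practice.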

### 🎯 Immediate Next Steps

1. **Implement Reconciliation System** (Week 1-2)
   - Database migration for source tracking
   - `ReconciliationService` for deduplication
   - Ingestion logging for incremental updates

2. **Deploy Continuous TAB Capture** (Week 2-3)
   - Docker service for 30-minute ingestion cycles
   - Automatically capture live races
   - Build "live edge" of database

3. **Implement Auto-Discovery** (Week 3)
   - Scrape HRNZ results index to find meeting URLs
   - Automated historical backfill
   - Incremental gap-filling

4. **Historical Backfill** (Week 4)
   - Run auto-discovery for target date range
   - Scrape discovered meetings with reconciliation
   - Compute ratings for complete dataset

---

## Architecture: Hybrid Data System

```
┌──────────────────┐         ┌──────────────────┐
│  HRNZ Results    │         │   TAB Live API   │
│  Archive         │         │   (no history)   │
│  (historical)    │         │   (live only)    │
└────────┬─────────┘         └────────┬─────────┘
         │                            │
         ▼                            ▼
┌────────────────────┐     ┌─────────────────────┐
│  HRNZ Scraper      │     │  TAB API Capture    │
│  (on-demand)       │     │  (continuous 30min) │
└────────┬───────────┘     └─────────┬───────────┘
         │                           │
         └───────────┬───────────────┘
                     ▼
         ┌───────────────────────┐
         │  Reconciliation       │
         │  Service              │
         │  - Match meetings     │
         │  - Merge data         │
         │  - Track sources      │
         │  - Avoid duplicates   │
         └───────────┬───────────┘
                     ▼
         ┌───────────────────────┐
         │  PostgreSQL Database  │
         │  - meetings           │
         │  - races              │
         │  - starters           │
         │  - ingestion_log      │
         └───────────┬───────────┘
                     ▼
         ┌───────────────────────┐
         │  Rating Engine        │
         │  (unified dataset)    │
         └───────────────────────┘
```

---

## Documentation Structure

### For Users
- **`HRNZ_SCRAPER_GUIDE.md`** - How to use the scraper
- **`HISTORICAL_DATA_ANALYSIS.md`** - Data source options

### For Developers
- **`HRNZ_SCRAPER_IMPLEMENTATION.md`** - This file (scraper details)
- **`RECONCILIATION_IMPLEMENTATION_PLAN.md`** - Complete hybrid system plan
- **`REBUILD_STATUS.md`** - Latest application rebuild status

---

**Documentation**:
- User Guide: `HRNZ_SCRAPER_GUIDE.md`
- Implementation Plan: `RECONCILIATION_IMPLEMENTATION_PLAN.md`