# HRNZ Web Scraper - User Guide

## Overview

The HRNZ scraper extracts historical harness racing data from the HRNZ InfoHorse results archive. This tool should **only be used if official API access is not available** from HRNZ.

## Important Legal Notice

⚠️ **Before using this scraper:**

1. Check HRNZ's Terms of Service
2. Consider contacting HRNZ for official data access
3. Use rate limiting (built-in: 1 request per 2 seconds)
4. Only scrape publicly accessible data
5. Respect robots.txt if present

**Recommended**: Contact HRNZ first for official data access before web scraping.

## Installation

The scraper is included in the TipSharks package, so no separate install is needed. To rebuild the Docker images:

```bash
docker compose down
docker compose build --no-cache
docker compose up -d
```

## Usage

### 1. Create a URL List File

Create a text file (e.g., `hrnz_urls.txt`) with one HRNZ result page per line, given as its filename:

```
102402rs.htm
102502rs.htm
102602rs.htm
010741rs.htm
```

**URL Format**: `[MMDD][CC]rs.htm` (six digits, then `rs.htm`)
- `MM` = Month
- `DD` = Day
- `CC` = Club code (2 digits)

**Example**: `102402rs.htm` = October 24, Club 02
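
Before a long run, it can help to check the list file for malformed entries. The following is a minimal sketch, not part of the scraper: it assumes every entry is a bare filename matching the `\d{6}rs\.htm` pattern given under "Advanced: Finding URLs", and `split_url_list` is a hypothetical helper name.

```python
import re

# Pattern from the "Finding URLs" section: six digits followed by rs.htm.
RESULT_FILENAME = re.compile(r"\d{6}rs\.htm")


def split_url_list(lines):
    """Split lines into (valid, invalid) entries; blank lines are ignored."""
    valid, invalid = [], []
    for raw in lines:
        entry = raw.strip()
        if not entry:
            continue
        # fullmatch rejects entries with extra characters before or after.
        if RESULT_FILENAME.fullmatch(entry):
            valid.append(entry)
        else:
            invalid.append(entry)
    return valid, invalid


sample = ["102402rs.htm", "010741rs.htm", "", "1024rs.htm", "results.htm"]
good, bad = split_url_list(sample)
print("valid:", good)
print("invalid:", bad)
```

To check a real file, pass `open("hrnz_urls.txt")` instead of `sample` and fix anything reported as invalid before scraping.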

### 2. Run the Scraper

```bash
# Basic usage - scrape all URLs in file
docker compose run --rm worker python -m apps.backend.worker.cli scrape-hrnz --urls hrnz_urls.txt

# With date filtering
docker compose run --rm worker python -m apps.backend.worker.cli scrape-hrnz \
  --urls hrnz_urls.txt \
  --from 2024-01-01 \
  --to 2024-12-31
```

### 2a. Proxy Rotation (Decodo)

To rotate IPs on each request, add Decodo proxy credentials to `.env`:

```bash
HRNZ_DECODO_PROXY_SERVER=http://<decodo-host>:<port>
HRNZ_DECODO_PROXY_USERNAME=<your_username>
HRNZ_DECODO_PROXY_PASSWORD=<your_password>
HRNZ_DECODO_ROTATE_EACH_REQUEST=true
```

If your Decodo account uses a specific session format, set a template:

```bash
HRNZ_DECODO_USERNAME_TEMPLATE=customer-your_user-zone-residential-session-{session}
```
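
To illustrate how these settings could combine into a proxy URL, here is a hedged sketch. It assumes the `{session}` placeholder is filled with a fresh random value for each request and that credentials are embedded in the URL. The scraper's actual handling is not documented here and may differ, and `build_proxy_url` is a hypothetical name.

```python
import os
import secrets
from urllib.parse import quote


def build_proxy_url(env=os.environ):
    """Build a proxy URL, using a fresh session id if a template is set.

    Assumption: a new random session value per call yields a new exit IP.
    """
    server = env["HRNZ_DECODO_PROXY_SERVER"]  # expected form: http://host:port
    password = env["HRNZ_DECODO_PROXY_PASSWORD"]
    template = env.get("HRNZ_DECODO_USERNAME_TEMPLATE")
    if template:
        username = template.format(session=secrets.token_hex(4))
    else:
        username = env["HRNZ_DECODO_PROXY_USERNAME"]
    scheme, _, host = server.partition("://")
    # Percent-encode credentials so special characters survive in the URL.
    return f"{scheme}://{quote(username, safe='')}:{quote(password, safe='')}@{host}"


# Placeholder values for illustration only.
demo_env = {
    "HRNZ_DECODO_PROXY_SERVER": "http://proxy.example:8000",
    "HRNZ_DECODO_PROXY_USERNAME": "user",
    "HRNZ_DECODO_PROXY_PASSWORD": "p@ss",
    "HRNZ_DECODO_USERNAME_TEMPLATE": "customer-user-session-{session}",
}
print(build_proxy_url(demo_env))
```

Calling the function twice with a template set produces two different usernames, which is the behaviour a per-request rotation setting implies.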

### 2b. Results Enquiry Scraper (Monthly/Date Range)

```bash
# Scrape via the Historical Results Enquiry page
docker compose run --rm worker python -m apps.backend.worker.cli scrape-hrnz-enquiry \
  --from 2024-01-01 \
  --to 2024-12-31

# Optional filters
docker compose run --rm worker python -m apps.backend.worker.cli scrape-hrnz-enquiry \
  --from 2024-01-01 \
  --to 2024-12-31 \
  --race-type OfficialRaces \
  --club ""
```
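
A long date range can be easier to manage as one run per month, for example so that a single failed month can be retried. The sketch below only generates the `--from`/`--to` pairs; `month_windows` is a hypothetical helper, and whether the enquiry scraper already splits ranges internally is not stated in this guide.

```python
from datetime import date, timedelta


def month_windows(start, end):
    """Yield (first_day, last_day) pairs covering start..end, one per calendar month."""
    current = start
    while current <= end:
        # First day of the following month.
        if current.month == 12:
            next_month = date(current.year + 1, 1, 1)
        else:
            next_month = date(current.year, current.month + 1, 1)
        # Clamp the window so it never runs past the requested end date.
        last = min(next_month - timedelta(days=1), end)
        yield current, last
        current = next_month


for first, last in month_windows(date(2024, 1, 1), date(2024, 3, 15)):
    print(f"scrape-hrnz-enquiry --from {first.isoformat()} --to {last.isoformat()}")
```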

### 3. Compute Ratings

After scraping, compute ratings for the imported data:

```bash
docker compose run --rm worker python -m apps.backend.worker.cli recompute \
  --from 2024-01-01 \
  --to 2024-12-31
```

## Output

The scraper prints per-meeting progress followed by a summary table:

```
Scraping 3 HRNZ meetings
Date filter: 2024-01-01 to 2024-12-31

⚠ Using web scraper - please ensure compliance with HRNZ ToS

[1/3] Scraping 102402rs.htm...
  ✓ Imported: 8 races, 96 starters
[2/3] Scraping 102502rs.htm...
  ✓ Imported: 9 races, 108 starters
[3/3] Scraping 102602rs.htm...
  ✓ Imported: 8 races, 88 starters

┏━━━━━━━━━━┳━━━━━━━┓
┃ Entity   ┃ Count ┃
┡━━━━━━━━━━╇━━━━━━━┩
│ Meetings │     3 │
│ Races    │    25 │
│ Starters │   292 │
│ Horses   │   180 │
│ Drivers  │    45 │
│ Trainers │    38 │
│ Errors   │     0 │
└──────────┴───────┘

✓ Scraping completed successfully
Tip: Run 'recompute' to compute ratings for imported data
```

## What Gets Scraped

For each meeting, the scraper extracts:

### Meeting Information
- Date
- Venue
- Location

### Race Information
- Race number
- Race name
- Distance (meters)
- Start type (Mobile/Standing)
- Purse/stakes

### Starter Information
- Horse name and ID (UUID if available)
- Driver name and ID (UUID if available)
- Trainer name and ID (UUID if available)
- Barrier draw
- Finishing position
- Handicap

## Data Mapping

The scraper maps HRNZ data to TipSharks format:

| HRNZ Field | TipSharks Field | Notes |
|------------|-----------------|-------|
| Horse UUID | `horses.id` | Falls back to MD5 hash of name |
| Driver UUID | `drivers.id` | Falls back to MD5 hash of name |
| Trainer UUID | `trainers.id` | Falls back to MD5 hash of name |
| Meeting Date | `meetings.meeting_date` | Parsed from page |
| Venue | `meetings.venue` | Extracted from header |
| Distance (m) | `races.distance_m` | Parsed from conditions |
| MOBILE/STANDING | `races.start_type` | Extracted from text |
| Position | `starters.placing` | Finishing position |

## Limitations

### Current Limitations

1. **No Auto-Discovery** (`scrape-hrnz`): The URL-list scraper cannot find meeting URLs on its own
   - You must provide an explicit URL list
   - To work from a date range instead, use `scrape-hrnz-enquiry`

2. **HTML Structure Dependent**: Scraper relies on current HTML structure
   - May break if HRNZ changes their page layout
   - Requires manual updates if structure changes

3. **Limited Data Fields**: Some fields may not be available:
   - Race times (may not be on all pages)
   - Detailed performance metrics
   - Sectional times
   - Betting odds (sometimes missing)

4. **UUID Availability**: Not all entities have UUIDs
   - Falls back to hashing horse/driver/trainer names
   - A renamed or differently spelled entity gets a new ID, so its history can split across records

### Handling Missing Data

The scraper includes fallback logic:

- **Missing UUIDs**: Generates deterministic MD5 hash from name
- **Missing dates**: Tries multiple parsing strategies
- **Missing distances**: Defaults to 2000m (common harness distance)
- **Missing start types**: Defaults to "Standing"
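
The deterministic-hash fallback can be sketched as follows. This is an illustration only: the normalization (trimming, collapsing whitespace, lower-casing) and the entity-kind prefix are assumptions, and the scraper's real hashing input may differ. `fallback_id` is a hypothetical name.

```python
import hashlib


def fallback_id(kind, name):
    """Derive a stable ID from an entity name when no UUID is available.

    Assumptions: the name is whitespace-normalized and lower-cased, and the
    entity kind is prefixed so a horse and a driver sharing a name get
    different IDs.
    """
    normalized = " ".join(name.split()).lower()
    return hashlib.md5(f"{kind}:{normalized}".encode("utf-8")).hexdigest()


# Same name, different spacing or case: same ID.
print(fallback_id("horse", "Example Horse") == fallback_id("horse", "  example  horse "))
# A renamed horse hashes to a different ID.
print(fallback_id("horse", "Example Horse") == fallback_id("horse", "Example Horse NZ"))
```

The second comparison shows why name changes are a limitation of this approach: the hash has no way to know two names refer to the same animal.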

## Troubleshooting

### Problem: SSL Certificate Error

```
unable to verify the first certificate
```

**Solution**: The HRNZ website may have certificate issues. This is a known problem. The scraper includes SSL error handling.

### Problem: No Data Extracted

```
✓ Imported: 0 races, 0 starters
```

**Possible Causes**:
1. URL format incorrect
2. HRNZ HTML structure changed
3. Page doesn't exist or is empty

**Solution**:
- Verify URL in browser first
- Check if page has race data
- May need to update scraper parsing logic

### Problem: Parse Errors

```
✗ Error: list index out of range
```

**Cause**: The HTML table structure differs from what the parser expects

**Solution**:
- Check which URL caused the error
- View that page in browser
- May need to adjust `_parse_starters_table()` logic
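
One way to make row parsing tolerant of short or unexpected rows is to guard every index access. The sketch below is not the scraper's code: the internals of `_parse_starters_table()` are not shown in this guide, the column order is hypothetical, and `cell_text` and `parse_starter_row` are illustrative names.

```python
def cell_text(cells, index, default=None):
    """Return the stripped text at `index`, or `default` if the row is too short."""
    if 0 <= index < len(cells):
        return cells[index].strip()
    return default


def parse_starter_row(cells):
    """Parse one table row given as a list of cell strings.

    Hypothetical column order: placing, horse, driver.
    Returns None for rows that lack the required columns.
    """
    placing = cell_text(cells, 0)
    horse = cell_text(cells, 1)
    driver = cell_text(cells, 2)
    if not placing or not horse:
        return None  # header, spacer, or malformed row: skip instead of raising
    return {"placing": placing, "horse": horse, "driver": driver}


print(parse_starter_row(["1", " Example Horse ", "A Driver"]))
print(parse_starter_row(["1"]))  # too short, skipped rather than IndexError
```

Skipping bad rows keeps one odd table from aborting a whole meeting, at the cost of silently dropping data, so logging the skipped rows is worth considering.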

## Performance

**Rate Limiting**: 1 request per 2 seconds (built-in)
- Scraping 100 meetings = ~3.3 minutes minimum (100 × 2 s = 200 s)
- Scraping 1000 meetings = ~33 minutes minimum (1000 × 2 s = 2000 s)
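
The lower bound follows directly from the rate limit. As a quick sketch, assuming one request per meeting and ignoring network latency and parsing time (`minimum_scrape_minutes` is a hypothetical helper):

```python
def minimum_scrape_minutes(meetings, seconds_per_request=2.0):
    """Lower bound on run time from the rate limit alone."""
    return meetings * seconds_per_request / 60


for count in (100, 1000):
    print(f"{count} meetings: at least {minimum_scrape_minutes(count):.1f} minutes")
```

Real runs take longer, since each request also spends time on the network and in the parser.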

**Memory Usage**: Minimal (processes one meeting at a time)

**Database Load**: Uses upsert logic (idempotent)
- Safe to re-run same URLs
- Will update existing records
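
The idea behind idempotent upserts can be shown with a small, self-contained sketch. It uses an in-memory SQLite database purely for illustration (requires SQLite 3.24 or newer); the TipSharks database is accessed with `psql` elsewhere in this guide, and the table layout here is hypothetical apart from the `distance_m` column name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE races (id TEXT PRIMARY KEY, name TEXT, distance_m INTEGER)")


def upsert_race(race_id, name, distance_m):
    """Insert a race, or update it in place if the ID already exists."""
    conn.execute(
        """
        INSERT INTO races (id, name, distance_m) VALUES (?, ?, ?)
        ON CONFLICT(id) DO UPDATE SET
            name = excluded.name,
            distance_m = excluded.distance_m
        """,
        (race_id, name, distance_m),
    )


upsert_race("r1", "Example Pace", 2000)
upsert_race("r1", "Example Pace", 2600)  # re-run with corrected data

rows = conn.execute("SELECT id, name, distance_m FROM races").fetchall()
print(rows)  # a single row holding the latest values
```

Because the second call updates rather than duplicates, re-running the same URLs leaves one row per race.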

## Best Practices

1. **Start Small**: Test with 5-10 URLs first
2. **Check Output**: Verify data looks correct before scaling up
3. **Use Date Filters**: Narrow down to specific time periods
4. **Monitor Progress**: Watch for consistent errors
5. **Compute Ratings After**: Always run `recompute` after scraping

## Example Workflow

```bash
# 1. Create URL list
cat > hrnz_urls.txt << EOF
010741rs.htm
011041rs.htm
011341rs.htm
EOF

# 2. Test scraping
docker compose run --rm worker python -m apps.backend.worker.cli \
  scrape-hrnz --urls hrnz_urls.txt

# 3. Verify database
docker compose exec db psql -U tipsharks -d tipsharks \
  -c "SELECT COUNT(*) FROM races;"

# 4. Compute ratings
docker compose run --rm worker python -m apps.backend.worker.cli \
  recompute --from 2024-01-01 --to 2024-12-31

# 5. Verify ratings
docker compose exec db psql -U tipsharks -d tipsharks \
  -c "SELECT COUNT(*) FROM rating_snapshots;"

# 6. Test API
# Quote the URL so the shell does not interpret the "?"
curl -s "http://localhost:8000/ratings/horses?limit=10" | python3 -m json.tool
```

## Advanced: Finding URLs

To find HRNZ result URLs, you can:

1. **Browse HRNZ Results Index**: https://infohorse.hrnz.co.nz/datahrs/results/results.htm
2. **Look at page source** for links like `010741rs.htm`
3. **Write your own script** to extract URLs from index pages (the scraper does not do this for you)

Example manual URL collection:

1. Go to https://infohorse.hrnz.co.nz/datahrs/results/results.htm
2. Select month (e.g., January)
3. View page source
4. Search for pattern: `\d{6}rs\.htm`
5. Copy all matching URLs to text file
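
Steps 4 and 5 can be automated once the page source is saved locally. The following is a minimal sketch using the same pattern; `extract_result_filenames` is a hypothetical helper, and the sample HTML is made up for illustration rather than copied from the HRNZ site.

```python
import re

RESULT_LINK = re.compile(r"\d{6}rs\.htm")


def extract_result_filenames(html):
    """Return unique result filenames in the order they first appear."""
    seen = {}
    for match in RESULT_LINK.findall(html):
        seen.setdefault(match, None)  # dict preserves insertion order
    return list(seen)


sample_html = (
    '<a href="010741rs.htm">Meeting A</a> '
    '<a href="011041rs.htm">Meeting B</a> '
    '<a href="010741rs.htm">Meeting A again</a>'
)
print("\n".join(extract_result_filenames(sample_html)))
```

For a real page, read the saved source with `open(path, encoding="utf-8", errors="replace").read()` and write the returned list to `hrnz_urls.txt`, one filename per line.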

## Alternative: Request Official Access

Instead of scraping, **contact HRNZ directly**:

**Email Template**:
```
Subject: Request for Historical Race Data Access

Dear HRNZ Team,

I am developing TipSharks, a harness racing ratings platform.
I would like to obtain historical race results data to build
a comprehensive ratings database.

Could you provide information about:
1. API access to historical race data
2. Bulk data exports (CSV/JSON)
3. Available date ranges
4. Licensing terms and costs

I am currently considering web scraping your public results
archive, but would prefer official data access if available.

Thank you for your consideration.

Best regards,
[Your Name]
```

---

**Remember**: Web scraping should be a last resort. Always try official channels first!
