# HRNZ Web Scraper - User Guide

## Overview

The HRNZ scraper extracts historical harness racing data from the HRNZ InfoHorse results archive. This tool should **only be used if official API access is not available** from HRNZ.

## Important Legal Notice

⚠️ **Before using this scraper:**

1. Check HRNZ's Terms of Service
2. Consider contacting HRNZ for official data access
3. Use rate limiting (built-in: 1 request per 2 seconds)
4. Only scrape publicly accessible data
5. Respect robots.txt if present

**Recommended**: Contact HRNZ first for official data access before web scraping.

## Installation

The scraper is already included in the TipSharks package. If rebuilding Docker images:

```bash
docker compose down
docker compose build --no-cache
docker compose up -d
```

## Usage

### 1. Create a URL List File

Create a text file (e.g., `hrnz_urls.txt`) with one HRNZ result URL per line:

```
102402rs.htm
102502rs.htm
102602rs.htm
010741rs.htm
```

**URL Format**: `[DDMMYY][CC]rs.htm`

- `DD` = Day
- `MM` = Month
- `YY` = Year (last 2 digits)
- `CC` = Club code (2 digits)

**Example**: `102402rs.htm` = October 24, 2002, Club 02

The sample file names above have only six digits before `rs.htm`, and the example reads month-first, so check the layout against a known meeting before building URLs by hand.

### 2. Run the Scraper

```bash
# Basic usage - scrape all URLs in file
docker compose run --rm worker python -m apps.backend.worker.cli scrape-hrnz --urls hrnz_urls.txt

# With date filtering
docker compose run --rm worker python -m apps.backend.worker.cli scrape-hrnz \
  --urls hrnz_urls.txt \
  --from 2024-01-01 \
  --to 2024-12-31
```

### 2a. Proxy Rotation (Decodo)

To rotate IPs after each page, add Decodo credentials to `.env`:

```bash
HRNZ_DECODO_PROXY_SERVER=http://:
HRNZ_DECODO_PROXY_USERNAME=
HRNZ_DECODO_PROXY_PASSWORD=
HRNZ_DECODO_ROTATE_EACH_REQUEST=true
```

If your Decodo account uses a specific session format, set a template:

```bash
HRNZ_DECODO_USERNAME_TEMPLATE=customer-your_user-zone-residential-session-{session}
```

### 2b. Results Enquiry Scraper (Monthly/Date Range)

```bash
# Scrape via the Historical Results Enquiry page
docker compose run --rm worker python -m apps.backend.worker.cli scrape-hrnz-enquiry \
  --from 2024-01-01 \
  --to 2024-12-31

# Optional filters
docker compose run --rm worker python -m apps.backend.worker.cli scrape-hrnz-enquiry \
  --from 2024-01-01 \
  --to 2024-12-31 \
  --race-type OfficialRaces \
  --club ""
```

### 3. Compute Ratings

After scraping, compute ratings for the imported data:

```bash
docker compose run --rm worker python -m apps.backend.worker.cli recompute \
  --from 2024-01-01 \
  --to 2024-12-31
```

## Output

The scraper reports progress for each meeting, followed by a summary table:

```
Scraping 3 HRNZ meetings
Date filter: 2024-01-01 to 2024-12-31
⚠ Using web scraper - please ensure compliance with HRNZ ToS

[1/3] Scraping 102402rs.htm...
✓ Imported: 8 races, 96 starters
[2/3] Scraping 102502rs.htm...
✓ Imported: 9 races, 108 starters
[3/3] Scraping 102602rs.htm...
✓ Imported: 8 races, 88 starters

┏━━━━━━━━━━┳━━━━━━━┓
┃ Entity   ┃ Count ┃
┡━━━━━━━━━━╇━━━━━━━┩
│ Meetings │ 3     │
│ Races    │ 25    │
│ Starters │ 292   │
│ Horses   │ 180   │
│ Drivers  │ 45    │
│ Trainers │ 38    │
│ Errors   │ 0     │
└──────────┴───────┘

✓ Scraping completed successfully
Tip: Run 'recompute' to compute ratings for imported data
```

## What Gets Scraped

For each meeting, the scraper extracts:

### Meeting Information

- Date
- Venue
- Location

### Race Information

- Race number
- Race name
- Distance (meters)
- Start type (Mobile/Standing)
- Purse/stakes

### Starter Information

- Horse name and ID (UUID if available)
- Driver name and ID (UUID if available)
- Trainer name and ID (UUID if available)
- Barrier draw
- Finishing position
- Handicap

## Data Mapping

The scraper maps HRNZ data to TipSharks format:

| HRNZ Field | TipSharks Field | Notes |
|------------|-----------------|-------|
| Horse UUID | `horses.id` | Falls back to MD5 hash of name |
| Driver UUID | `drivers.id` | Falls back to MD5 hash of name |
| Trainer UUID | `trainers.id` | Falls back to MD5 hash of name |
| Meeting Date | `meetings.meeting_date` | Parsed from page |
| Venue | `meetings.venue` | Extracted from header |
| Distance (m) | `races.distance_m` | Parsed from conditions |
| MOBILE/STANDING | `races.start_type` | Extracted from text |
| Position | `starters.placing` | Finishing position |

## Limitations

### Current Limitations

1. **No Auto-Discovery**: The `scrape-hrnz` command cannot find meeting URLs on its own
   - You must provide an explicit URL list
   - It cannot browse by date range; for that, use `scrape-hrnz-enquiry` (section 2b)

2. **HTML Structure Dependent**: The scraper relies on the current HTML structure
   - It may break if HRNZ changes their page layout
   - The parsing logic then needs a manual update

3. **Limited Data Fields**: Some fields may not be available:
   - Race times (may not be on all pages)
   - Detailed performance metrics
   - Sectional times
   - Betting odds (sometimes missing)

4. **UUID Availability**: Not all entities have UUIDs
   - The scraper falls back to hashing horse/driver/trainer names
   - A renamed horse, driver, or trainer hashes to a different ID and may appear as a separate entity

### Handling Missing Data

The scraper includes fallback logic:

- **Missing UUIDs**: Generates a deterministic MD5 hash from the name
- **Missing dates**: Tries multiple parsing strategies
- **Missing distances**: Defaults to 2000 m (a common harness distance)
- **Missing start types**: Defaults to "Standing"

## Troubleshooting

### Problem: SSL Certificate Error

```
unable to verify the first certificate
```

**Solution**: The HRNZ website may have certificate issues. This is a known problem, and the scraper includes SSL error handling for it.

### Problem: No Data Extracted

```
✓ Imported: 0 races, 0 starters
```

**Possible Causes**:

1. URL format incorrect
2. HRNZ HTML structure changed
3. Page doesn't exist or is empty

**Solution**:

- Verify the URL in a browser first
- Check whether the page has race data
- If the page looks fine, the scraper's parsing logic may need updating

### Problem: Parse Errors

```
✗ Error: list index out of range
```

**Cause**: The HTML table structure differs from what the parser expects

**Solution**:

- Check which URL caused the error
- View that page in a browser
- You may need to adjust the `_parse_starters_table()` logic

## Performance

**Rate Limiting**: 1 request per 2 seconds (built-in)

- Scraping 100 meetings = ~3.3 minutes minimum (100 × 2 s = 200 s)
- Scraping 1000 meetings = ~33 minutes minimum (1000 × 2 s = 2000 s)

**Memory Usage**: Minimal (processes one meeting at a time)

**Database Load**: Uses upsert logic (idempotent)

- Safe to re-run the same URLs
- Re-runs update existing records

## Best Practices

1. **Start Small**: Test with 5-10 URLs first
2. **Check Output**: Verify the data looks correct before scaling up
3. **Use Date Filters**: Narrow down to specific time periods
4. **Monitor Progress**: Watch for consistent errors
5. **Compute Ratings After**: Always run `recompute` after scraping

## Example Workflow

```bash
# 1. Create URL list
cat > hrnz_urls.txt << EOF
010741rs.htm
011041rs.htm
011341rs.htm
EOF

# 2. Test scraping
docker compose run --rm worker python -m apps.backend.worker.cli \
  scrape-hrnz --urls hrnz_urls.txt

# 3. Verify database
docker compose exec db psql -U tipsharks -d tipsharks \
  -c "SELECT COUNT(*) FROM races;"

# 4. Compute ratings
docker compose run --rm worker python -m apps.backend.worker.cli \
  recompute --from 2024-01-01 --to 2024-12-31

# 5. Verify ratings
docker compose exec db psql -U tipsharks -d tipsharks \
  -c "SELECT COUNT(*) FROM rating_snapshots;"

# 6. Test API (URL quoted so the shell does not treat "?" as a glob)
curl "http://localhost:8000/ratings/horses?limit=10" | python3 -m json.tool
```

## Advanced: Finding URLs

To find HRNZ result URLs, you can:

1. **Browse HRNZ Results Index**: https://infohorse.hrnz.co.nz/datahrs/results/results.htm
2. **Look at page source** for links like `010741rs.htm`
3. **Create script** to extract URLs from index pages (manual process)

Example manual URL collection:

1. Go to https://infohorse.hrnz.co.nz/datahrs/results/results.htm
2. Select month (e.g., January)
3. View page source
4. Search for pattern: `\d{6}rs\.htm`
5. Copy all matching URLs to text file

## Alternative: Request Official Access

Instead of scraping, **contact HRNZ directly**:

**Email Template**:

```
Subject: Request for Historical Race Data Access

Dear HRNZ Team,

I am developing TipSharks, a harness racing ratings platform. I would like to
obtain historical race results data to build a comprehensive ratings database.

Could you provide information about:
1. API access to historical race data
2. Bulk data exports (CSV/JSON)
3. Available date ranges
4. Licensing terms and costs

I am currently considering web scraping your public results archive, but would
prefer official data access if available.

Thank you for your consideration.

Best regards,
[Your Name]
```

---

**Remember**: Web scraping should be a last resort. Always try official channels first!
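
If you do end up collecting URLs by hand, the pattern search from "Advanced: Finding URLs" can be done with a few lines of Python run against page source you have already saved from the browser. This is a minimal sketch and not part of the TipSharks CLI: the function names and the `index.html` file name are hypothetical, and the only thing taken from this guide is the `\d{6}rs\.htm` pattern (with an added guard so that a longer run of digits is not matched part-way through).

```python
import re
from pathlib import Path

# Pattern from this guide: six digits followed by "rs.htm".
# The (?<!\d) lookbehind is an addition: it skips matches that begin
# in the middle of a longer run of digits.
RESULT_LINK = re.compile(r"(?<!\d)\d{6}rs\.htm")


def extract_result_urls(page_source: str) -> list[str]:
    """Return the unique result file names, in first-seen order."""
    # dict.fromkeys removes duplicates while preserving order.
    return list(dict.fromkeys(RESULT_LINK.findall(page_source)))


def write_url_list(source_path: str, out_path: str) -> int:
    """Read saved page source, write one URL per line, return the count."""
    source = Path(source_path).read_text(encoding="utf-8", errors="replace")
    urls = extract_result_urls(source)
    Path(out_path).write_text("".join(f"{u}\n" for u in urls), encoding="utf-8")
    return len(urls)
```

Calling `write_url_list("index.html", "hrnz_urls.txt")` would produce a file in the one-URL-per-line format that `scrape-hrnz --urls` expects. It reads a local file only, so it makes no requests to HRNZ.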