🚀 Forsyt Data Machine - Project Plan
Version: 2.0 | Date: January 2026 | Status: ✅ PHASES 1-5 COMPLETE
Goal: Complete data infrastructure for sports betting ML and analytics
Executive Summary
The Forsyt Data Machine is a comprehensive data infrastructure that:
- Downloads sports data from 10+ free sources (football, cricket, tennis)
- Stores normalized data in PostgreSQL (Supabase IceCrystal) with ML-ready features
- Provides ported betting algorithms (Shin, Poisson, ELO)
- Trains sport-specific ML models for match prediction
- Integrates with the 6 SuperSkin services
Current Status: All 5 phases complete. System is in production.
📊 Project Overview
| Metric | Value |
|---|---|
| Total Phases | 5 (all complete) |
| Total Tasks | 56 (all complete, including tasks added mid-project) |
| Actual Duration | 28 days |
| Team Size | 1 developer |
| Primary Language | TypeScript (services), Python (ML) |
Deliverables by Phase
| Phase | Deliverables | Status |
|---|---|---|
| 1. Foundation | Project structure, DB schema, base classes | ✅ COMPLETE |
| 2. Data Sources | 6 downloaders, normalizers, CLI | ✅ COMPLETE |
| 3. Algorithms | 4 ported algorithms with tests | ✅ COMPLETE |
| 4. ML Pipeline | Feature engineering, 3 models, serving | ✅ COMPLETE |
| 5. Integration | Scheduler, services, E2E tests | ✅ COMPLETE |
🏗️ Phase 1: Foundation (Days 1-4)
Goal: Set up project structure, database, and core abstractions
Tasks
| # | Task | Description | Effort | Dependencies |
|---|---|---|---|---|
| 1.1 | Create project structure | Set up services/data-harness/ with src, data, prisma dirs | 2h | None |
| 1.2 | Initialize package.json | Dependencies: prisma, axios, node-cron, commander, zod | 1h | 1.1 |
| 1.3 | Create Prisma schema | Tables: football_matches, football_odds, cricket_matches, cricket_deliveries, ml_features, sync_log | 4h | 1.2 |
| 1.4 | Run database migrations | Create PostgreSQL database, run prisma migrate | 1h | 1.3 |
| 1.5 | Implement BaseDownloader | Abstract class with retry logic, rate limiting, checkpointing | 4h | 1.2 |
| 1.6 | Implement FileStorage | Raw file management, versioning, cleanup | 3h | 1.2 |
| 1.7 | Create CLI skeleton | Commander.js CLI with sync, status, clean commands | 3h | 1.5, 1.6 |
| 1.8 | Add logging | Winston/Pino logger with file + console output | 2h | 1.7 |
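Task 1.5 is the backbone abstraction every downloader inherits. A minimal sketch of how retry-with-backoff might look (class shape and names are hypothetical, not the actual implementation):

```typescript
// Hypothetical sketch of the BaseDownloader retry loop (task 1.5).
// Class and method names are illustrative, not the shipped code.
abstract class BaseDownloader {
  constructor(
    private maxRetries = 3,
    private baseDelayMs = 1000,
  ) {}

  // Each concrete downloader implements the actual fetch.
  protected abstract fetchOnce(url: string): Promise<string>;

  // Retry with exponential backoff: 1s, 2s, 4s, ...
  async download(url: string): Promise<string> {
    let lastError: unknown;
    for (let attempt = 0; attempt < this.maxRetries; attempt++) {
      try {
        return await this.fetchOnce(url);
      } catch (err) {
        lastError = err;
        const delay = this.baseDelayMs * 2 ** attempt;
        await new Promise((resolve) => setTimeout(resolve, delay));
      }
    }
    throw lastError;
  }
}
```

Concrete downloaders (Football-Data, Cricsheet, etc.) would then only implement `fetchOnce`; rate limiting and checkpointing layer on top in the same way.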
Deliverables
- `services/data-harness/` directory structure
- PostgreSQL database with all tables
- `BaseDownloader` abstract class
- `FileStorage` service
- CLI with `--help` working
Exit Criteria
```bash
# These should work:
npm run harness -- --help
npm run harness -- status
npx prisma studio   # Shows empty tables
```
📥 Phase 2: Data Sources (Days 5-12)
Goal: Implement all downloaders with normalization (five planned below, plus the Tennis-Data downloader added mid-phase)
Tasks
| # | Task | Description | Effort | Dependencies |
|---|---|---|---|---|
| 2.1 | Football-Data downloader | Download CSVs for 10 leagues, 6 seasons | 6h | 1.5 |
| 2.2 | Football-Data normalizer | Parse CSV, map columns, insert to DB | 4h | 2.1 |
| 2.3 | Football odds parser | Extract odds from 10+ bookmakers per match | 3h | 2.2 |
| 2.4 | FiveThirtyEight downloader | Download SPI matches + rankings CSVs | 3h | 1.5 |
| 2.5 | FiveThirtyEight normalizer | Parse, merge with football_matches | 3h | 2.4 |
| 2.6 | Cricsheet downloader | Download ZIP, extract JSON files | 4h | 1.5 |
| 2.7 | Cricsheet normalizer | Parse match JSON, ball-by-ball data | 6h | 2.6 |
| 2.8 | Kaggle downloader | API integration, download 4 datasets | 4h | 1.5 |
| 2.9 | Kaggle SQLite parser | Parse European Soccer DB SQLite | 3h | 2.8 |
| 2.10 | GitHub repos syncer | Clone/pull 5 ML repos | 2h | 1.5 |
| 2.11 | Deduplicator | Match deduplication across sources | 4h | 2.2, 2.5 |
| 2.12 | Team name normalizer | Map team aliases to canonical names | 3h | 2.11 |
| 2.13 | Sync tracking | Log sync runs, track last successful | 2h | 2.1 |
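Tasks 2.11-2.12 hinge on mapping each source's spelling of a team to one canonical name before matches can be deduplicated. A hypothetical sketch (the alias entries are illustrative):

```typescript
// Hypothetical sketch of the team-name normalizer (task 2.12): different
// sources spell the same club differently, so known aliases are mapped
// to a single canonical name before deduplication.
const TEAM_ALIASES: Record<string, string> = {
  "man united": "Manchester United",
  "man utd": "Manchester United",
  "manchester utd": "Manchester United",
  "wolves": "Wolverhampton Wanderers",
};

function canonicalTeamName(raw: string): string {
  const key = raw.trim().toLowerCase();
  // Unknown names pass through unchanged (trimmed).
  return TEAM_ALIASES[key] ?? raw.trim();
}
```

In practice the alias table would be data-driven (seeded from the sources themselves) rather than hard-coded.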
Deliverables
- 6 working downloaders (including Tennis-Data, added mid-phase)
- Data normalized into PostgreSQL
- ~200K football matches in database
- ~5K cricket matches with ball-by-ball
Exit Criteria
```bash
# Full sync should work:
npm run harness -- sync --all

# Verify data:
npm run harness -- status
# Output: Football: 200,432 matches, Cricket: 5,234 matches
```
🧮 Phase 3: Algorithms (Days 13-17)
Goal: Port betting algorithms from GitHub repos to TypeScript
Tasks
| # | Task | Description | Effort | Dependencies |
|---|---|---|---|---|
| 3.1 | Create algorithms package | Set up services/algorithms/ with TypeScript | 2h | None |
| 3.2 | Port Shin method | Remove bookmaker margin, true probabilities | 4h | 3.1 |
| 3.3 | Port Poisson model | Goal distribution prediction | 6h | 3.1 |
| 3.4 | Port ELO ratings | Team strength with home advantage | 4h | 3.1 |
| 3.5 | Implement Kelly criterion | Optimal stake sizing | 3h | 3.2 |
| 3.6 | Implement implied probability | Convert odds to probabilities | 2h | 3.1 |
| 3.7 | Implement fair odds calculator | Combine multiple bookmaker prices | 3h | 3.2, 3.6 |
| 3.8 | Write unit tests | Test all algorithms with known values | 6h | 3.2-3.7 |
| 3.9 | Create algorithms API | Express router exposing algorithms | 3h | 3.8 |
Deliverables
- `services/algorithms/` package
- 6 ported algorithms with full test coverage
- REST API for algorithm access
Key Algorithms
```typescript
// What we're building:

// 1. Shin Method - true probabilities from bookmaker odds
shinProbabilities([2.10, 3.40, 3.50]) // → [0.45, 0.28, 0.27]

// 2. Poisson Model - match prediction
poissonPredict({ homeXG: 1.8, awayXG: 1.2 })
// → { probHome: 0.52, probDraw: 0.24, probAway: 0.24 }

// 3. ELO Ratings - team strength
updateElo({ homeElo: 1600, awayElo: 1550, homeGoals: 2, awayGoals: 1 })
// → { homeElo: 1612, awayElo: 1538 }

// 4. Kelly Criterion - stake sizing
kellyStake({ probability: 0.55, odds: 2.10, bankroll: 1000 }) // → £52 stake

// 5. Fair Odds - market consensus
calculateFairOdds([
  { bookmaker: 'pinnacle', odds: 2.05 },
  { bookmaker: 'bet365', odds: 2.10 },
  { bookmaker: 'betfair', odds: 2.08 },
]) // → 2.07 (weighted by sharpness)
```
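The Shin method is the least obvious of the five. One standard way to compute it is bisection on Shin's insider-trading parameter z; the sketch below illustrates that approach and is not the ported code:

```typescript
// Illustrative Shin's method: given decimal odds, find the insider
// proportion z at which the margin-free probabilities sum to exactly 1.
function shinProbabilities(odds: number[]): number[] {
  const implied = odds.map((o) => 1 / o);
  const booksum = implied.reduce((a, b) => a + b, 0); // > 1 (overround)

  const probsAt = (z: number): number[] =>
    implied.map(
      (p) =>
        (Math.sqrt(z * z + (4 * (1 - z) * p * p) / booksum) - z) /
        (2 * (1 - z)),
    );

  // Bisection: the probability sum decreases monotonically in z.
  let lo = 0;
  let hi = 0.5;
  for (let i = 0; i < 60; i++) {
    const mid = (lo + hi) / 2;
    const sum = probsAt(mid).reduce((a, b) => a + b, 0);
    if (sum > 1) lo = mid;
    else hi = mid;
  }
  return probsAt((lo + hi) / 2);
}
```

For `[2.10, 3.40, 3.50]` this yields probabilities that sum to 1 and sit close to the `[0.45, 0.28, 0.27]` example shown above.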
Exit Criteria
```bash
# All tests pass:
npm run test:algorithms
# 24 tests passing

# API works:
curl "http://localhost:3001/api/algorithms/shin?odds=2.10,3.40,3.50"
# {"probabilities":[0.45,0.28,0.27]}
```
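For reference, the Kelly formula behind task 3.5 is f* = (bp − q)/b with b = odds − 1. Full Kelly on the plan's example (p = 0.55, odds = 2.10, £1000 bankroll) gives roughly £141, so the £52 figure quoted earlier presumably applies a fractional-Kelly multiplier; the hedged sketch below models that with a `fraction` parameter (an assumption, not the ported implementation):

```typescript
// Full Kelly fraction: f* = (b*p - q) / b, where b = decimal odds - 1.
// The `fraction` parameter models fractional Kelly (e.g. quarter Kelly);
// its presence here is an assumption about how the plan's £52 arises.
function kellyStake(
  probability: number,
  odds: number,
  bankroll: number,
  fraction = 1,
): number {
  const b = odds - 1;
  const q = 1 - probability;
  const f = (b * probability - q) / b;
  // Never stake on a negative edge.
  return Math.max(0, f) * fraction * bankroll;
}
```

Fractional Kelly is common practice because full Kelly assumes the probability estimate is exact; shrinking the stake trades growth rate for drawdown protection.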
🤖 Phase 4: ML Pipeline (Days 18-25)
Goal: Build feature engineering and model training pipeline
Tasks
| # | Task | Description | Effort | Dependencies |
|---|---|---|---|---|
| 4.1 | Create ml-pipeline package | Set up Python project with poetry/pip | 2h | None |
| 4.2 | Feature: Form calculation | Last N matches stats per team | 6h | Phase 2 |
| 4.3 | Feature: Head-to-head | Historical matchup statistics | 4h | 4.2 |
| 4.4 | Feature: League position | Position at time of match | 3h | 4.2 |
| 4.5 | Feature: Odds-derived | Implied probabilities, overround | 3h | 4.2 |
| 4.6 | Feature: ELO ratings | Calculate ELO for all teams | 4h | 3.4, 4.2 |
| 4.7 | Feature store population | Insert features to ml_features table | 3h | 4.2-4.6 |
| 4.8 | XGBoost training notebook | 1X2 prediction model | 6h | 4.7 |
| 4.9 | Random Forest training | Alternative model for comparison | 4h | 4.7 |
| 4.10 | Neural Network training | Deep learning model | 6h | 4.7 |
| 4.11 | Model evaluation | Accuracy, ROI backtesting | 4h | 4.8-4.10 |
| 4.12 | Model export | Save trained models (joblib/ONNX) | 3h | 4.11 |
| 4.13 | Model serving API | FastAPI endpoint for predictions | 4h | 4.12 |
Deliverables
- `services/ml-pipeline/` Python package
- 40+ features computed for all matches
- 3 trained models (XGBoost, RF, NN)
- FastAPI serving endpoint
Feature Engineering Details
```python
# Features we calculate for each match:
class MatchFeatures:
    # Form (last 5 matches)
    home_wins_5: int              # 0-5
    home_draws_5: int
    home_losses_5: int
    home_goals_scored_5: float    # avg
    home_goals_conceded_5: float
    home_clean_sheets_5: int
    home_points_5: int            # 0-15
    away_wins_5: int
    away_draws_5: int
    away_losses_5: int
    away_goals_scored_5: float
    away_goals_conceded_5: float
    away_clean_sheets_5: int
    away_points_5: int

    # Head-to-head (last 5 meetings)
    h2h_home_wins: int
    h2h_draws: int
    h2h_away_wins: int
    h2h_avg_goals: float
    h2h_home_scored: float
    h2h_away_scored: float

    # League position
    home_position: int
    away_position: int
    home_points: int
    away_points: int
    position_diff: int

    # Odds-derived
    implied_prob_home: float
    implied_prob_draw: float
    implied_prob_away: float
    pinnacle_prob_home: float     # Sharp odds
    market_overround: float

    # Team ratings
    home_elo: float
    away_elo: float
    home_spi: float               # From 538
    away_spi: float
    elo_diff: float

    # Target variables
    result: str                   # 'H', 'D', 'A'
    over_2_5: bool
    btts: bool
```
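The odds-derived block above reduces to a booksum normalisation. A sketch in TypeScript (the function name is illustrative; the field names mirror `MatchFeatures`):

```typescript
// Odds-derived features (task 4.5): market overround and normalised
// implied probabilities for a 1X2 market. Function name is illustrative.
function oddsFeatures(home: number, draw: number, away: number) {
  const raw = [1 / home, 1 / draw, 1 / away];
  const overround = raw[0] + raw[1] + raw[2]; // > 1: the bookmaker margin
  return {
    implied_prob_home: raw[0] / overround,
    implied_prob_draw: raw[1] / overround,
    implied_prob_away: raw[2] / overround,
    market_overround: overround,
  };
}
```

Note this simple normalisation spreads the margin proportionally; the Shin method in Phase 3 exists precisely because sharp markets don't distribute margin that way, which is why both are kept as separate features.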
Exit Criteria
```bash
# Training complete:
python train.py --model xgboost --target 1x2
# Model accuracy: 52.3%, ROI: +3.2%

# Predictions work:
curl http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"home_team":"Liverpool","away_team":"Chelsea"}'
# {"home":0.48,"draw":0.27,"away":0.25,"confidence":0.72}
```
🔗 Phase 5: Integration (Days 26-30)
Goal: Connect everything, set up automation, build services
Tasks
| # | Task | Description | Effort | Dependencies |
|---|---|---|---|---|
| 5.1 | Scheduler setup | node-cron with all sync jobs | 4h | Phase 2 |
| 5.2 | GitHub Actions workflow | Automated daily/weekly syncs | 3h | 5.1 |
| 5.3 | Value Detection service | Use Shin + ELO + ML predictions | 8h | Phase 3, 4 |
| 5.4 | Price Feed service | Publish normalized prices | 4h | Phase 2, 3 |
| 5.5 | Connect to existing backend | Redis pub/sub integration | 4h | 5.3, 5.4 |
| 5.6 | End-to-end testing | Full pipeline test | 6h | 5.1-5.5 |
| 5.7 | Performance optimization | Query optimization, caching | 4h | 5.6 |
| 5.8 | Monitoring dashboard | Grafana/simple stats page | 3h | 5.6 |
| 5.9 | Documentation | README, API docs, runbooks | 4h | All |
Deliverables
- Automated data sync (daily/weekly)
- Value Detection service running
- Integration with existing backend
- Monitoring and alerting
Integration Architecture
```
┌─────────────────────────────────────────────────────────────────────┐
│                         FORSYT DATA MACHINE                         │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐           │
│  │ Data Harness │───▶│  PostgreSQL  │───▶│ ML Pipeline  │           │
│  │  (Node.js)   │    │  + Features  │    │   (Python)   │           │
│  └──────────────┘    └──────────────┘    └──────────────┘           │
│         │                    │                   │                  │
│         ▼                    ▼                   ▼                  │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐           │
│  │  Algorithms  │    │    Redis     │    │  ML Models   │           │
│  │  (Node.js)   │    │   Pub/Sub    │    │  (FastAPI)   │           │
│  └──────────────┘    └──────────────┘    └──────────────┘           │
│         │                    │                   │                  │
│         └────────────────────┴───────────────────┘                  │
│                              │                                      │
│                              ▼                                      │
│                   ┌──────────────────┐                              │
│                   │ Value Detection  │                              │
│                   │ Price Feed       │                              │
│                   │ Chat Assistant   │                              │
│                   └──────────────────┘                              │
│                              │                                      │
│                              ▼                                      │
│                   ┌──────────────────┐                              │
│                   │     Frontend     │                              │
│                   │   (React/Next)   │                              │
│                   └──────────────────┘                              │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```
Exit Criteria
```bash
# Full system running:
docker-compose up -d

# Check status:
curl http://localhost:3000/api/health
# {"data_harness":"ok","ml_pipeline":"ok","algorithms":"ok"}

# Value detection working:
curl "http://localhost:3000/api/value/detect?match=liverpool_vs_chelsea"
# {"value_bets":[{"outcome":"over_2_5","edge":4.2,"confidence":"high"}]}
```
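The `edge` field in that response is, in the usual formulation, the expected-value margin of a bet. A hypothetical sketch of the core calculation (the real service may blend Shin, ELO, and ML signals rather than use the model probability alone):

```typescript
// A bet has positive expected value when modelProb × decimalOdds > 1;
// the edge is that excess expressed as a percentage. Hypothetical
// formulation - the actual Value Detection service is more involved.
function edgePercent(modelProb: number, decimalOdds: number): number {
  return (modelProb * decimalOdds - 1) * 100;
}
```

For example, a model probability of 0.52 against odds of 2.10 implies an expected return of 1.092 per unit staked, i.e. an edge of about 9.2%.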
🛠️ Technology Stack
Services Overview
| Service | Language | Framework | Database |
|---|---|---|---|
| Data Harness | TypeScript | Node.js | PostgreSQL |
| Algorithms | TypeScript | Node.js/Express | - |
| ML Pipeline | Python 3.11 | Pandas, scikit-learn | PostgreSQL |
| ML Serving | Python 3.11 | FastAPI | Redis (cache) |
| Scheduler | TypeScript | node-cron | - |
Dependencies
Node.js (Data Harness + Algorithms):
```json
{
  "dependencies": {
    "@prisma/client": "^5.x",
    "axios": "^1.x",
    "commander": "^12.x",
    "node-cron": "^3.x",
    "zod": "^3.x",
    "winston": "^3.x",
    "csv-parse": "^5.x",
    "adm-zip": "^0.5.x",
    "ioredis": "^5.x"
  },
  "devDependencies": {
    "typescript": "^5.x",
    "vitest": "^1.x",
    "prisma": "^5.x"
  }
}
```
Python (ML Pipeline):
```text
# requirements.txt
pandas>=2.0.0
numpy>=1.24.0
scikit-learn>=1.3.0
xgboost>=2.0.0
tensorflow>=2.15.0
fastapi>=0.109.0
uvicorn>=0.25.0
sqlalchemy>=2.0.0
psycopg2-binary>=2.9.0
joblib>=1.3.0
pytest>=7.4.0
```
📁 Repository Structure
```
forsyt-data-machine/
├── services/
│   ├── data-harness/              # Data downloading & normalization
│   │   ├── src/
│   │   │   ├── downloaders/
│   │   │   │   ├── base.downloader.ts
│   │   │   │   ├── football-data.downloader.ts
│   │   │   │   ├── fivethirtyeight.downloader.ts
│   │   │   │   ├── cricsheet.downloader.ts
│   │   │   │   ├── kaggle.downloader.ts
│   │   │   │   └── github.downloader.ts
│   │   │   ├── processors/
│   │   │   │   ├── football.normalizer.ts
│   │   │   │   ├── cricket.normalizer.ts
│   │   │   │   └── deduplicator.ts
│   │   │   ├── storage/
│   │   │   │   ├── file-storage.ts
│   │   │   │   └── database.ts
│   │   │   ├── scheduler/
│   │   │   │   └── jobs.ts
│   │   │   ├── cli/
│   │   │   │   └── commands.ts
│   │   │   └── index.ts
│   │   ├── prisma/
│   │   │   └── schema.prisma
│   │   ├── data/                  # Downloaded data (gitignored)
│   │   │   ├── raw/
│   │   │   └── processed/
│   │   ├── package.json
│   │   └── tsconfig.json
│   │
│   ├── algorithms/                # Ported betting algorithms
│   │   ├── src/
│   │   │   ├── shin.ts
│   │   │   ├── poisson.ts
│   │   │   ├── elo.ts
│   │   │   ├── kelly.ts
│   │   │   ├── fair-odds.ts
│   │   │   ├── api/
│   │   │   │   └── router.ts
│   │   │   └── index.ts
│   │   ├── tests/
│   │   │   ├── shin.test.ts
│   │   │   ├── poisson.test.ts
│   │   │   └── elo.test.ts
│   │   ├── package.json
│   │   └── tsconfig.json
│   │
│   └── ml-pipeline/               # Python ML training
│       ├── src/
│       │   ├── features/
│       │   │   ├── form.py
│       │   │   ├── h2h.py
│       │   │   ├── odds.py
│       │   │   └── elo.py
│       │   ├── models/
│       │   │   ├── xgboost_model.py
│       │   │   ├── random_forest.py
│       │   │   └── neural_net.py
│       │   ├── training/
│       │   │   ├── train.py
│       │   │   └── evaluate.py
│       │   └── serving/
│       │       ├── api.py
│       │       └── predict.py
│       ├── notebooks/
│       │   ├── 01_data_exploration.ipynb
│       │   ├── 02_feature_engineering.ipynb
│       │   └── 03_model_training.ipynb
│       ├── models/                # Saved models (gitignored)
│       ├── requirements.txt
│       └── pyproject.toml
│
├── docker/
│   ├── Dockerfile.harness
│   ├── Dockerfile.algorithms
│   ├── Dockerfile.ml
│   └── docker-compose.yml
│
├── .github/
│   └── workflows/
│       ├── sync-daily.yml
│       └── sync-weekly.yml
│
├── docs/
│   ├── DATA_HARNESS_DESIGN.md
│   ├── FORSYT_DATA_MACHINE_PROJECT_PLAN.md
│   └── API.md
│
├── .env.example
├── README.md
└── Makefile
```
⚡ Quick Start Commands
Once built, these are the primary commands:
```bash
# ─────────────────────────────────────────────────────────────
# DATA HARNESS
# ─────────────────────────────────────────────────────────────

# Initialize (first time)
npm run harness -- init

# Sync all data sources
npm run harness -- sync --all

# Sync specific source
npm run harness -- sync --source football-data
npm run harness -- sync --source cricsheet

# Check status
npm run harness -- status

# ─────────────────────────────────────────────────────────────
# ALGORITHMS
# ─────────────────────────────────────────────────────────────

# Start algorithms API
npm run algorithms -- serve

# Test algorithms
npm run algorithms -- test

# ─────────────────────────────────────────────────────────────
# ML PIPELINE
# ─────────────────────────────────────────────────────────────

# Generate features
python -m ml_pipeline.features.generate

# Train models
python -m ml_pipeline.training.train --model xgboost

# Evaluate
python -m ml_pipeline.training.evaluate --model xgboost

# Start prediction API
uvicorn ml_pipeline.serving.api:app --port 8000

# ─────────────────────────────────────────────────────────────
# DOCKER (Full Stack)
# ─────────────────────────────────────────────────────────────

# Start everything
docker-compose up -d

# View logs
docker-compose logs -f data-harness

# Run sync manually
docker-compose exec data-harness npm run harness -- sync --all
```
📋 Complete Task Checklist
Phase 1: Foundation ✅ COMPLETE
- 1.1 Create project structure
- 1.2 Initialize package.json
- 1.3 Create Prisma schema
- 1.4 Run database migrations
- 1.5 Implement BaseDownloader
- 1.6 Implement FileStorage
- 1.7 Create CLI skeleton
- 1.8 Add logging
Phase 2: Data Sources ✅ COMPLETE
- 2.1 Football-Data downloader
- 2.2 Football-Data normalizer
- 2.3 Football odds parser
- 2.4 FiveThirtyEight downloader
- 2.5 FiveThirtyEight normalizer
- 2.6 Cricsheet downloader
- 2.7 Cricsheet normalizer
- 2.8 Kaggle downloader
- 2.9 Kaggle SQLite parser
- 2.10 GitHub repos syncer
- 2.11 Deduplicator
- 2.12 Team name normalizer
- 2.13 Sync tracking
- 2.14 Tennis-Data downloader (added)
- 2.15 Tennis normalizer (added)
Phase 3: Algorithms ✅ COMPLETE
- 3.1 Create algorithms package
- 3.2 Port Shin method
- 3.3 Port Poisson model
- 3.4 Port ELO ratings
- 3.5 Implement Kelly criterion
- 3.6 Implement implied probability
- 3.7 Implement fair odds calculator
- 3.8 Write unit tests
- 3.9 Create algorithms API
Phase 4: ML Pipeline ✅ COMPLETE
- 4.1 Create ml-pipeline package
- 4.2 Feature: Form calculation
- 4.3 Feature: Head-to-head
- 4.4 Feature: League position
- 4.5 Feature: Odds-derived
- 4.6 Feature: ELO ratings
- 4.7 Feature store population
- 4.8 XGBoost training notebook
- 4.9 Random Forest training
- 4.10 Neural Network training
- 4.11 Model evaluation
- 4.12 Model export (ONNX)
- 4.13 Model serving API
- 4.14 Sport-specific models (football, cricket, tennis)
- 4.15 Model registry with auto-discovery
Phase 5: Integration ✅ COMPLETE
- 5.1 Scheduler setup
- 5.2 GitHub Actions workflow
- 5.3 Value Detection service (Port 3102)
- 5.4 Price Feed service (Port 3100)
- 5.5 Connect to SuperSkin services
- 5.6 End-to-end testing
- 5.7 Performance optimization
- 5.8 Monitoring dashboard
- 5.9 Documentation
🚦 Current Focus: Expanding Sports Coverage
| Sport | Data Status | ML Model Status | Accuracy |
|---|---|---|---|
| Football | ✅ 200K+ matches | ✅ Trained | ~68% |
| Cricket | ✅ 5K+ matches | ✅ Trained | ~65% |
| Tennis | ✅ 50K+ matches | ✅ Trained | ~70% |
| Basketball | 🔄 Collecting | 📋 Planned | - |
| Esports | 🔄 Collecting | 📋 Planned | - |
✅ System Status
The Forsyt Data Machine is fully operational and integrated with SuperSkin services:
```bash
# Check system status
curl http://localhost:3105/health
# {"status":"healthy","models":["football","cricket","tennis"]}

# Get predictions
curl http://localhost:3105/predict \
  -H "Content-Type: application/json" \
  -d '{"sport":"football","home":"Liverpool","away":"Chelsea"}'
# {"home":0.48,"draw":0.27,"away":0.25,"confidence":0.72}
```
Next Steps
- Expand basketball data collection from Basketball-Reference
- Add esports data from HLTV and Liquipedia
- Improve model accuracy with additional features
- Add more leagues to football/cricket coverage