
🚀 Forsyt Data Machine - Project Plan

Version: 2.0
Date: January 2026
Status: ✅ PHASES 1-5 COMPLETE
Goal: Complete data infrastructure for sports betting ML and analytics


Executive Summary

The Forsyt Data Machine is a comprehensive data infrastructure that:

  1. Downloads sports data from 10+ free sources (football, cricket, tennis)
  2. Stores normalized data in PostgreSQL (Supabase IceCrystal) with ML-ready features
  3. Provides ported betting algorithms (Shin, Poisson, ELO)
  4. Trains sport-specific ML models for match prediction
  5. Integrates with the 6 SuperSkin services

Current Status: All 5 phases complete. System is in production.


📊 Project Overview

| Metric | Value |
|---|---|
| Total Phases | 5 (all complete) |
| Total Tasks | 56 (all complete; 52 planned, 4 added) |
| Actual Duration | 28 days |
| Team Size | 1 developer |
| Primary Language | TypeScript (services), Python (ML) |

Deliverables by Phase

| Phase | Deliverables | Status |
|---|---|---|
| 1. Foundation | Project structure, DB schema, base classes | ✅ COMPLETE |
| 2. Data Sources | 6 downloaders, normalizers, CLI | ✅ COMPLETE |
| 3. Algorithms | 4 ported algorithms with tests | ✅ COMPLETE |
| 4. ML Pipeline | Feature engineering, 3 models, serving | ✅ COMPLETE |
| 5. Integration | Scheduler, services, E2E tests | ✅ COMPLETE |

🏗️ Phase 1: Foundation (Days 1-4)

Goal: Set up project structure, database, and core abstractions

Tasks

| # | Task | Description | Effort | Dependencies |
|---|---|---|---|---|
| 1.1 | Create project structure | Set up services/data-harness/ with src, data, prisma dirs | 2h | None |
| 1.2 | Initialize package.json | Dependencies: prisma, axios, node-cron, commander, zod | 1h | 1.1 |
| 1.3 | Create Prisma schema | Tables: football_matches, football_odds, cricket_matches, cricket_deliveries, ml_features, sync_log | 4h | 1.2 |
| 1.4 | Run database migrations | Create PostgreSQL database, run prisma migrate | 1h | 1.3 |
| 1.5 | Implement BaseDownloader | Abstract class with retry logic, rate limiting, checkpointing | 4h | 1.2 |
| 1.6 | Implement FileStorage | Raw file management, versioning, cleanup | 3h | 1.2 |
| 1.7 | Create CLI skeleton | Commander.js CLI with sync, status, clean commands | 3h | 1.5, 1.6 |
| 1.8 | Add logging | Winston/Pino logger with file + console output | 2h | 1.7 |
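The heart of task 1.5 is retry logic with exponential backoff. The service itself is TypeScript; the Python sketch below (function names are ours, not from the codebase) illustrates the pattern under the assumption of doubling delays between attempts:

```python
import time

def with_retries(fetch, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, backing off exponentially.

    Illustrative sketch of the BaseDownloader retry pattern: delays of
    base_delay * 2**attempt between attempts (1s, 2s, 4s, ...).
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: surface the error to the caller
            sleep(base_delay * (2 ** attempt))

# Example: a source that fails twice, then succeeds on the third attempt.
calls = {"n": 0}

def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network error")
    return "csv-bytes"

# sleep is injected so tests (and this example) don't actually wait.
result = with_retries(flaky_fetch, max_attempts=5, sleep=lambda _: None)
```

Injecting the sleep function keeps the downloader testable without real delays; the production class would additionally persist a checkpoint between attempts.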

Deliverables

  • services/data-harness/ directory structure
  • PostgreSQL database with all tables
  • BaseDownloader abstract class
  • FileStorage service
  • CLI with --help working

Exit Criteria

# These should work:
npm run harness -- --help
npm run harness -- status
npx prisma studio # Shows empty tables

📥 Phase 2: Data Sources (Days 5-12)

Goal: Implement all 5 downloaders with normalization

Tasks

| # | Task | Description | Effort | Dependencies |
|---|---|---|---|---|
| 2.1 | Football-Data downloader | Download CSVs for 10 leagues, 6 seasons | 6h | 1.5 |
| 2.2 | Football-Data normalizer | Parse CSV, map columns, insert to DB | 4h | 2.1 |
| 2.3 | Football odds parser | Extract odds from 10+ bookmakers per match | 3h | 2.2 |
| 2.4 | FiveThirtyEight downloader | Download SPI matches + rankings CSVs | 3h | 1.5 |
| 2.5 | FiveThirtyEight normalizer | Parse, merge with football_matches | 3h | 2.4 |
| 2.6 | Cricsheet downloader | Download ZIP, extract JSON files | 4h | 1.5 |
| 2.7 | Cricsheet normalizer | Parse match JSON, ball-by-ball data | 6h | 2.6 |
| 2.8 | Kaggle downloader | API integration, download 4 datasets | 4h | 1.5 |
| 2.9 | Kaggle SQLite parser | Parse European Soccer DB SQLite | 3h | 2.8 |
| 2.10 | GitHub repos syncer | Clone/pull 5 ML repos | 2h | 1.5 |
| 2.11 | Deduplicator | Match deduplication across sources | 4h | 2.2, 2.5 |
| 2.12 | Team name normalizer | Map team aliases to canonical names | 3h | 2.11 |
| 2.13 | Sync tracking | Log sync runs, track last successful sync | 2h | 2.1 |
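Task 2.12 is what lets the deduplicator line up the same fixture across sources that spell teams differently. A minimal sketch of the idea (the alias entries here are made-up examples; the real mapping lives in the database):

```python
import re

# Hypothetical alias map -- the production table (task 2.12) is DB-backed.
ALIASES = {
    "man united": "Manchester United",
    "man utd": "Manchester United",
    "manchester utd": "Manchester United",
    "spurs": "Tottenham Hotspur",
}

def canonical_team(name: str) -> str:
    """Map a raw team name from any source to its canonical form."""
    key = re.sub(r"[^a-z0-9 ]", "", name.lower()).strip()  # drop case/punctuation
    key = re.sub(r"\s+", " ", key)                          # collapse whitespace
    return ALIASES.get(key, name.strip())                   # fall back to input
```

Normalizing the lookup key (lowercase, no punctuation, single spaces) before hitting the alias table catches most source-to-source spelling drift; genuinely unknown names pass through unchanged so they can be flagged for manual aliasing.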

Deliverables

  • 5 working downloaders
  • Data normalized into PostgreSQL
  • ~200K football matches in database
  • ~5K cricket matches with ball-by-ball

Exit Criteria

# Full sync should work:
npm run harness -- sync --all

# Verify data:
npm run harness -- status
# Output: Football: 200,432 matches, Cricket: 5,234 matches

🧮 Phase 3: Algorithms (Days 13-17)

Goal: Port betting algorithms from GitHub repos to TypeScript

Tasks

| # | Task | Description | Effort | Dependencies |
|---|---|---|---|---|
| 3.1 | Create algorithms package | Set up services/algorithms/ with TypeScript | 2h | None |
| 3.2 | Port Shin method | Remove bookmaker margin, recover true probabilities | 4h | 3.1 |
| 3.3 | Port Poisson model | Goal distribution prediction | 6h | 3.1 |
| 3.4 | Port ELO ratings | Team strength with home advantage | 4h | 3.1 |
| 3.5 | Implement Kelly criterion | Optimal stake sizing | 3h | 3.2 |
| 3.6 | Implement implied probability | Convert odds to probabilities | 2h | 3.1 |
| 3.7 | Implement fair odds calculator | Combine multiple bookmaker prices | 3h | 3.2, 3.6 |
| 3.8 | Write unit tests | Test all algorithms with known values | 6h | 3.2-3.7 |
| 3.9 | Create algorithms API | Express router exposing algorithms | 3h | 3.8 |
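Task 3.5 reduces to a single formula: the Kelly fraction f* = (bp − q) / b, with b the net decimal odds (odds − 1), p the win probability, and q = 1 − p. A Python sketch for reference (the ported service is TypeScript):

```python
def kelly_fraction(probability: float, odds: float) -> float:
    """Full-Kelly fraction of bankroll for a back bet at decimal odds.

    f* = (b*p - q) / b, where b = odds - 1 and q = 1 - p.
    Returns 0 when there is no positive edge (never bet a negative-EV price).
    """
    b = odds - 1.0
    q = 1.0 - probability
    f = (b * probability - q) / b
    return max(f, 0.0)

# p = 0.55 at decimal odds 2.10: full Kelly stakes ~14.1% of bankroll.
stake = kelly_fraction(0.55, 2.10) * 1000
```

Note that full Kelly on the plan's example inputs (~£141 on a £1,000 bankroll) is larger than the £52 shown later; the smaller figure is consistent with staking a fraction of Kelly to cut variance, which practitioners commonly do.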

Deliverables

  • services/algorithms/ package
  • 6 ported algorithms with full test coverage
  • REST API for algorithm access

Key Algorithms

// What we're building:

// 1. Shin Method - true probabilities
shinProbabilities([2.10, 3.40, 3.50]);
// → [0.45, 0.28, 0.27]

// 2. Poisson Model - match prediction
poissonPredict({ homeXG: 1.8, awayXG: 1.2 });
// → { probHome: 0.52, probDraw: 0.24, probAway: 0.24 }

// 3. ELO Ratings - team strength
updateElo({ homeElo: 1600, awayElo: 1550, homeGoals: 2, awayGoals: 1 });
// → { homeElo: 1612, awayElo: 1538 }

// 4. Kelly Criterion - stake sizing
kellyStake({ probability: 0.55, odds: 2.10, bankroll: 1000 });
// → £52 stake

// 5. Fair Odds - market consensus
calculateFairOdds([
  { bookmaker: 'pinnacle', odds: 2.05 },
  { bookmaker: 'bet365', odds: 2.10 },
  { bookmaker: 'betfair', odds: 2.08 },
]);
// → 2.07 (weighted by sharpness)
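The Shin method above models a fraction z of bettors as insiders and solves for the z that makes the de-margined probabilities sum to 1. A Python sketch of the standard Shin (1993) estimator, solved here by bisection (this is our own formulation, not the ported code):

```python
from math import sqrt

def shin_probabilities(odds, tol=1e-10):
    """De-margin quoted decimal odds with Shin's method.

    pi_i = 1/odds_i are the bookmaker's implied probabilities (sum B > 1).
    Shin's model gives
        p_i(z) = (sqrt(z^2 + 4*(1-z)*pi_i^2 / B) - z) / (2*(1-z)),
    and z is chosen so the p_i sum to exactly 1.
    """
    pi = [1.0 / o for o in odds]
    booksum = sum(pi)

    def probs(z):
        return [(sqrt(z * z + 4 * (1 - z) * p * p / booksum) - z) / (2 * (1 - z))
                for p in pi]

    # sum(probs(0)) = sqrt(B) > 1 and the sum shrinks as z grows, so bisect.
    lo, hi = 0.0, 0.999
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if sum(probs(mid)) > 1:
            lo = mid
        else:
            hi = mid
    return probs((lo + hi) / 2)
```

Unlike plain normalization (dividing each 1/odds by the booksum), Shin removes proportionally more margin from longshots, which is why it is preferred for recovering "true" probabilities from bookmaker prices.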

Exit Criteria

# All tests pass:
npm run test:algorithms
# 24 tests passing

# API works:
curl http://localhost:3001/api/algorithms/shin?odds=2.10,3.40,3.50
# {"probabilities":[0.45,0.28,0.27]}

🤖 Phase 4: ML Pipeline (Days 18-25)

Goal: Build feature engineering and model training pipeline

Tasks

| # | Task | Description | Effort | Dependencies |
|---|---|---|---|---|
| 4.1 | Create ml-pipeline package | Set up Python project with poetry/pip | 2h | None |
| 4.2 | Feature: Form calculation | Last N matches stats per team | 6h | Phase 2 |
| 4.3 | Feature: Head-to-head | Historical matchup statistics | 4h | 4.2 |
| 4.4 | Feature: League position | Position at time of match | 3h | 4.2 |
| 4.5 | Feature: Odds-derived | Implied probabilities, overround | 3h | 4.2 |
| 4.6 | Feature: ELO ratings | Calculate ELO for all teams | 4h | 3.4, 4.2 |
| 4.7 | Feature store population | Insert features to ml_features table | 3h | 4.2-4.6 |
| 4.8 | XGBoost training notebook | 1X2 prediction model | 6h | 4.7 |
| 4.9 | Random Forest training | Alternative model for comparison | 4h | 4.7 |
| 4.10 | Neural Network training | Deep learning model | 6h | 4.7 |
| 4.11 | Model evaluation | Accuracy, ROI backtesting | 4h | 4.8-4.10 |
| 4.12 | Model export | Save trained models (joblib/ONNX) | 3h | 4.11 |
| 4.13 | Model serving API | FastAPI endpoint for predictions | 4h | 4.12 |

Deliverables

  • services/ml-pipeline/ Python package
  • 40+ features computed for all matches
  • 3 trained models (XGBoost, RF, NN)
  • FastAPI serving endpoint

Feature Engineering Details

# Features we calculate for each match:

class MatchFeatures:
    # Form (last 5 matches)
    home_wins_5: int               # 0-5
    home_draws_5: int
    home_losses_5: int
    home_goals_scored_5: float     # avg
    home_goals_conceded_5: float
    home_clean_sheets_5: int
    home_points_5: int             # 0-15

    away_wins_5: int
    away_draws_5: int
    away_losses_5: int
    away_goals_scored_5: float
    away_goals_conceded_5: float
    away_clean_sheets_5: int
    away_points_5: int

    # Head-to-head (last 5 meetings)
    h2h_home_wins: int
    h2h_draws: int
    h2h_away_wins: int
    h2h_avg_goals: float
    h2h_home_scored: float
    h2h_away_scored: float

    # League position
    home_position: int
    away_position: int
    home_points: int
    away_points: int
    position_diff: int

    # Odds-derived
    implied_prob_home: float
    implied_prob_draw: float
    implied_prob_away: float
    pinnacle_prob_home: float      # Sharp odds
    market_overround: float

    # Team ratings
    home_elo: float
    away_elo: float
    home_spi: float                # From 538
    away_spi: float
    elo_diff: float

    # Target variables
    result: str                    # 'H', 'D', 'A'
    over_2_5: bool
    btts: bool
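The form block above is a rolling window over a team's five most recent matches. A sketch of how those fields are derived (the helper name and list layout are ours for illustration):

```python
def form_last_n(results, goals_for, goals_against, n=5):
    """Compute last-N form features from most-recent-first match lists.

    `results` holds 'W'/'D'/'L'; the three lists are aligned per match.
    Mirrors the *_5 fields above: wins/draws/losses, points (3/1/0),
    goal averages, and clean sheets.
    """
    r, gf, ga = results[:n], goals_for[:n], goals_against[:n]
    return {
        "wins_5": r.count("W"),
        "draws_5": r.count("D"),
        "losses_5": r.count("L"),
        "points_5": 3 * r.count("W") + r.count("D"),
        "goals_scored_5": sum(gf) / len(gf),
        "goals_conceded_5": sum(ga) / len(ga),
        "clean_sheets_5": sum(1 for g in ga if g == 0),
    }

# W-W-D-L-W with the listed scorelines -> 10 points, 2 clean sheets.
features = form_last_n(["W", "W", "D", "L", "W"],
                       [2, 1, 1, 0, 3],
                       [0, 0, 1, 2, 1])
```

The important production detail is that the window must only include matches played *before* the fixture being featurized, otherwise the features leak the target.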

Exit Criteria

# Training complete:
python train.py --model xgboost --target 1x2
# Model accuracy: 52.3%, ROI: +3.2%

# Predictions work:
curl http://localhost:8000/predict -d '{"home_team":"Liverpool","away_team":"Chelsea"}'
# {"home":0.48,"draw":0.27,"away":0.25,"confidence":0.72}

🔗 Phase 5: Integration (Days 26-30)

Goal: Connect everything, set up automation, build services

Tasks

| # | Task | Description | Effort | Dependencies |
|---|---|---|---|---|
| 5.1 | Scheduler setup | node-cron with all sync jobs | 4h | Phase 2 |
| 5.2 | GitHub Actions workflow | Automated daily/weekly syncs | 3h | 5.1 |
| 5.3 | Value Detection service | Use Shin + ELO + ML predictions | 8h | Phase 3, 4 |
| 5.4 | Price Feed service | Publish normalized prices | 4h | Phase 2, 3 |
| 5.5 | Connect to existing backend | Redis pub/sub integration | 4h | 5.3, 5.4 |
| 5.6 | End-to-end testing | Full pipeline test | 6h | 5.1-5.5 |
| 5.7 | Performance optimization | Query optimization, caching | 4h | 5.6 |
| 5.8 | Monitoring dashboard | Grafana/simple stats page | 3h | 5.6 |
| 5.9 | Documentation | README, API docs, runbooks | 4h | All |
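At its core, the Value Detection service (task 5.3) compares a model probability against the best available price: a bet has value when probability × odds exceeds 1. A sketch of that check (the threshold and field names here are illustrative defaults, not the service's actual configuration):

```python
def detect_value(model_prob: float, best_odds: float, min_edge_pct: float = 2.0):
    """Flag a value bet when the model probability beats the price.

    edge% = (model_prob * best_odds - 1) * 100. A positive edge means
    the model rates the outcome as more likely than the market price
    implies. The 2% minimum edge is an assumed example threshold.
    """
    edge = (model_prob * best_odds - 1.0) * 100.0
    return {"edge_pct": round(edge, 1), "is_value": edge >= min_edge_pct}

# Model says 52% at best odds 2.05 -> positive edge, flagged as value.
signal = detect_value(model_prob=0.52, best_odds=2.05)
```

In the full service the model probability would itself blend Shin-de-margined market probabilities, ELO, and the ML predictions before this final comparison.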

Deliverables

  • Automated data sync (daily/weekly)
  • Value Detection service running
  • Integration with existing backend
  • Monitoring and alerting

Integration Architecture

┌─────────────────────────────────────────────────────────────┐
│                     FORSYT DATA MACHINE                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐   │
│  │ Data Harness │───▶│  PostgreSQL  │───▶│ ML Pipeline  │   │
│  │  (Node.js)   │    │  + Features  │    │   (Python)   │   │
│  └──────────────┘    └──────────────┘    └──────────────┘   │
│         │                   │                   │           │
│         ▼                   ▼                   ▼           │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐   │
│  │  Algorithms  │    │    Redis     │    │  ML Models   │   │
│  │  (Node.js)   │    │   Pub/Sub    │    │  (FastAPI)   │   │
│  └──────────────┘    └──────────────┘    └──────────────┘   │
│         │                   │                   │           │
│         └───────────────────┴───────────────────┘           │
│                             │                               │
│                             ▼                               │
│                  ┌──────────────────┐                       │
│                  │  Value Detection │                       │
│                  │    Price Feed    │                       │
│                  │  Chat Assistant  │                       │
│                  └──────────────────┘                       │
│                             │                               │
│                             ▼                               │
│                  ┌──────────────────┐                       │
│                  │     Frontend     │                       │
│                  │   (React/Next)   │                       │
│                  └──────────────────┘                       │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Exit Criteria

# Full system running:
docker-compose up -d

# Check status:
curl http://localhost:3000/api/health
# {"data_harness":"ok","ml_pipeline":"ok","algorithms":"ok"}

# Value detection working:
curl http://localhost:3000/api/value/detect?match=liverpool_vs_chelsea
# {"value_bets":[{"outcome":"over_2_5","edge":4.2,"confidence":"high"}]}

🛠️ Technology Stack

Services Overview

| Service | Language | Framework | Database |
|---|---|---|---|
| Data Harness | TypeScript | Node.js | PostgreSQL |
| Algorithms | TypeScript | Node.js/Express | - |
| ML Pipeline | Python 3.11 | Pandas, scikit-learn | PostgreSQL |
| ML Serving | Python 3.11 | FastAPI | Redis (cache) |
| Scheduler | TypeScript | node-cron | - |

Dependencies

Node.js (Data Harness + Algorithms):

{
  "dependencies": {
    "@prisma/client": "^5.x",
    "axios": "^1.x",
    "commander": "^12.x",
    "node-cron": "^3.x",
    "zod": "^3.x",
    "winston": "^3.x",
    "csv-parse": "^5.x",
    "adm-zip": "^0.5.x",
    "ioredis": "^5.x"
  },
  "devDependencies": {
    "typescript": "^5.x",
    "vitest": "^1.x",
    "prisma": "^5.x"
  }
}

Python (ML Pipeline):

# requirements.txt
pandas>=2.0.0
numpy>=1.24.0
scikit-learn>=1.3.0
xgboost>=2.0.0
tensorflow>=2.15.0
fastapi>=0.109.0
uvicorn>=0.25.0
sqlalchemy>=2.0.0
psycopg2-binary>=2.9.0
joblib>=1.3.0
pytest>=7.4.0

📁 Repository Structure

forsyt-data-machine/
├── services/
│ ├── data-harness/ # Data downloading & normalization
│ │ ├── src/
│ │ │ ├── downloaders/
│ │ │ │ ├── base.downloader.ts
│ │ │ │ ├── football-data.downloader.ts
│ │ │ │ ├── fivethirtyeight.downloader.ts
│ │ │ │ ├── cricsheet.downloader.ts
│ │ │ │ ├── kaggle.downloader.ts
│ │ │ │ └── github.downloader.ts
│ │ │ ├── processors/
│ │ │ │ ├── football.normalizer.ts
│ │ │ │ ├── cricket.normalizer.ts
│ │ │ │ └── deduplicator.ts
│ │ │ ├── storage/
│ │ │ │ ├── file-storage.ts
│ │ │ │ └── database.ts
│ │ │ ├── scheduler/
│ │ │ │ └── jobs.ts
│ │ │ ├── cli/
│ │ │ │ └── commands.ts
│ │ │ └── index.ts
│ │ ├── prisma/
│ │ │ └── schema.prisma
│ │ ├── data/ # Downloaded data (gitignored)
│ │ │ ├── raw/
│ │ │ └── processed/
│ │ ├── package.json
│ │ └── tsconfig.json
│ │
│ ├── algorithms/ # Ported betting algorithms
│ │ ├── src/
│ │ │ ├── shin.ts
│ │ │ ├── poisson.ts
│ │ │ ├── elo.ts
│ │ │ ├── kelly.ts
│ │ │ ├── fair-odds.ts
│ │ │ ├── api/
│ │ │ │ └── router.ts
│ │ │ └── index.ts
│ │ ├── tests/
│ │ │ ├── shin.test.ts
│ │ │ ├── poisson.test.ts
│ │ │ └── elo.test.ts
│ │ ├── package.json
│ │ └── tsconfig.json
│ │
│ └── ml-pipeline/ # Python ML training
│ ├── src/
│ │ ├── features/
│ │ │ ├── form.py
│ │ │ ├── h2h.py
│ │ │ ├── odds.py
│ │ │ └── elo.py
│ │ ├── models/
│ │ │ ├── xgboost_model.py
│ │ │ ├── random_forest.py
│ │ │ └── neural_net.py
│ │ ├── training/
│ │ │ ├── train.py
│ │ │ └── evaluate.py
│ │ └── serving/
│ │ ├── api.py
│ │ └── predict.py
│ ├── notebooks/
│ │ ├── 01_data_exploration.ipynb
│ │ ├── 02_feature_engineering.ipynb
│ │ └── 03_model_training.ipynb
│ ├── models/ # Saved models (gitignored)
│ ├── requirements.txt
│ └── pyproject.toml

├── docker/
│ ├── Dockerfile.harness
│ ├── Dockerfile.algorithms
│ ├── Dockerfile.ml
│ └── docker-compose.yml

├── .github/
│ └── workflows/
│ ├── sync-daily.yml
│ └── sync-weekly.yml

├── docs/
│ ├── DATA_HARNESS_DESIGN.md
│ ├── FORSYT_DATA_MACHINE_PROJECT_PLAN.md
│ └── API.md

├── .env.example
├── README.md
└── Makefile

⚡ Quick Start Commands

Once built, these are the primary commands:

# ─────────────────────────────────────────────────────────────
# DATA HARNESS
# ─────────────────────────────────────────────────────────────

# Initialize (first time)
npm run harness -- init

# Sync all data sources
npm run harness -- sync --all

# Sync specific source
npm run harness -- sync --source football-data
npm run harness -- sync --source cricsheet

# Check status
npm run harness -- status

# ─────────────────────────────────────────────────────────────
# ALGORITHMS
# ─────────────────────────────────────────────────────────────

# Start algorithms API
npm run algorithms -- serve

# Test algorithms
npm run algorithms -- test

# ─────────────────────────────────────────────────────────────
# ML PIPELINE
# ─────────────────────────────────────────────────────────────

# Generate features
python -m ml_pipeline.features.generate

# Train models
python -m ml_pipeline.training.train --model xgboost

# Evaluate
python -m ml_pipeline.training.evaluate --model xgboost

# Start prediction API
uvicorn ml_pipeline.serving.api:app --port 8000

# ─────────────────────────────────────────────────────────────
# DOCKER (Full Stack)
# ─────────────────────────────────────────────────────────────

# Start everything
docker-compose up -d

# View logs
docker-compose logs -f data-harness

# Run sync manually
docker-compose exec data-harness npm run harness -- sync --all

📋 Complete Task Checklist

Phase 1: Foundation ✅ COMPLETE

  • 1.1 Create project structure
  • 1.2 Initialize package.json
  • 1.3 Create Prisma schema
  • 1.4 Run database migrations
  • 1.5 Implement BaseDownloader
  • 1.6 Implement FileStorage
  • 1.7 Create CLI skeleton
  • 1.8 Add logging

Phase 2: Data Sources ✅ COMPLETE

  • 2.1 Football-Data downloader
  • 2.2 Football-Data normalizer
  • 2.3 Football odds parser
  • 2.4 FiveThirtyEight downloader
  • 2.5 FiveThirtyEight normalizer
  • 2.6 Cricsheet downloader
  • 2.7 Cricsheet normalizer
  • 2.8 Kaggle downloader
  • 2.9 Kaggle SQLite parser
  • 2.10 GitHub repos syncer
  • 2.11 Deduplicator
  • 2.12 Team name normalizer
  • 2.13 Sync tracking
  • 2.14 Tennis-Data downloader (added)
  • 2.15 Tennis normalizer (added)

Phase 3: Algorithms ✅ COMPLETE

  • 3.1 Create algorithms package
  • 3.2 Port Shin method
  • 3.3 Port Poisson model
  • 3.4 Port ELO ratings
  • 3.5 Implement Kelly criterion
  • 3.6 Implement implied probability
  • 3.7 Implement fair odds calculator
  • 3.8 Write unit tests
  • 3.9 Create algorithms API

Phase 4: ML Pipeline ✅ COMPLETE

  • 4.1 Create ml-pipeline package
  • 4.2 Feature: Form calculation
  • 4.3 Feature: Head-to-head
  • 4.4 Feature: League position
  • 4.5 Feature: Odds-derived
  • 4.6 Feature: ELO ratings
  • 4.7 Feature store population
  • 4.8 XGBoost training notebook
  • 4.9 Random Forest training
  • 4.10 Neural Network training
  • 4.11 Model evaluation
  • 4.12 Model export (ONNX)
  • 4.13 Model serving API
  • 4.14 Sport-specific models (football, cricket, tennis)
  • 4.15 Model registry with auto-discovery

Phase 5: Integration ✅ COMPLETE

  • 5.1 Scheduler setup
  • 5.2 GitHub Actions workflow
  • 5.3 Value Detection service (Port 3102)
  • 5.4 Price Feed service (Port 3100)
  • 5.5 Connect to SuperSkin services
  • 5.6 End-to-end testing
  • 5.7 Performance optimization
  • 5.8 Monitoring dashboard
  • 5.9 Documentation

🚦 Current Focus: Expanding Sports Coverage

| Sport | Data Status | ML Model Status | Accuracy |
|---|---|---|---|
| Football | ✅ 200K+ matches | ✅ Trained | ~68% |
| Cricket | ✅ 5K+ matches | ✅ Trained | ~65% |
| Tennis | ✅ 50K+ matches | ✅ Trained | ~70% |
| Basketball | 🔄 Collecting | 📋 Planned | - |
| Esports | 🔄 Collecting | 📋 Planned | - |

✅ System Status

The Forsyt Data Machine is fully operational and integrated with SuperSkin services:

# Check system status
curl http://localhost:3105/health
# {"status":"healthy","models":["football","cricket","tennis"]}

# Get predictions
curl http://localhost:3105/predict -d '{"sport":"football","home":"Liverpool","away":"Chelsea"}'
# {"home":0.48,"draw":0.27,"away":0.25,"confidence":0.72}

Next Steps

  1. Expand basketball data collection from Basketball-Reference
  2. Add esports data from HLTV and Liquipedia
  3. Improve model accuracy with additional features
  4. Add more leagues to football/cricket coverage