
🚀 Forsyt Data Machine - Project Plan

Version: 2.0
Date: January 2026
Status: ✅ PHASES 1-5 COMPLETE
Goal: Complete data infrastructure for sports betting ML and analytics


Executive Summary

The Forsyt Data Machine is a comprehensive data infrastructure that:

  1. Downloads sports data from 10+ free sources (football, cricket, tennis)
  2. Stores normalized data in PostgreSQL (Supabase IceCrystal) with ML-ready features
  3. Provides ported betting algorithms (Shin, Poisson, ELO)
  4. Trains sport-specific ML models for match prediction
  5. Integrates with the 6 SuperSkin services

Current Status: All 5 phases complete. System is in production.


📊 Project Overview

| Metric | Value |
|---|---|
| Total Phases | 5 (all complete) |
| Total Tasks | 56 (all complete; 52 planned, 4 added) |
| Actual Duration | 28 days |
| Team Size | 1 developer |
| Primary Language | TypeScript (services), Python (ML) |

Deliverables by Phase

| Phase | Deliverables | Status |
|---|---|---|
| 1. Foundation | Project structure, DB schema, base classes | ✅ COMPLETE |
| 2. Data Sources | 6 downloaders, normalizers, CLI | ✅ COMPLETE |
| 3. Algorithms | 4 ported algorithms with tests | ✅ COMPLETE |
| 4. ML Pipeline | Feature engineering, 3 models, serving | ✅ COMPLETE |
| 5. Integration | Scheduler, services, E2E tests | ✅ COMPLETE |

🏗️ Phase 1: Foundation (Days 1-4)

Goal: Set up project structure, database, and core abstractions

Tasks

| # | Task | Description | Effort | Dependencies |
|---|---|---|---|---|
| 1.1 | Create project structure | Set up services/data-harness/ with src, data, prisma dirs | 2h | None |
| 1.2 | Initialize package.json | Dependencies: prisma, axios, node-cron, commander, zod | 1h | 1.1 |
| 1.3 | Create Prisma schema | Tables: football_matches, football_odds, cricket_matches, cricket_deliveries, ml_features, sync_log | 4h | 1.2 |
| 1.4 | Run database migrations | Create PostgreSQL database, run prisma migrate | 1h | 1.3 |
| 1.5 | Implement BaseDownloader | Abstract class with retry logic, rate limiting, checkpointing | 4h | 1.2 |
| 1.6 | Implement FileStorage | Raw file management, versioning, cleanup | 3h | 1.2 |
| 1.7 | Create CLI skeleton | Commander.js CLI with sync, status, clean commands | 3h | 1.5, 1.6 |
| 1.8 | Add logging | Winston/Pino logger with file + console output | 2h | 1.7 |
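The heart of task 1.5 is retry logic with exponential backoff. The service itself is TypeScript; the Python sketch below (function names are ours, not from the codebase) illustrates the pattern under the assumption of doubling delays between attempts:

```python
import time

def with_retries(fetch, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, backing off exponentially.

    Illustrative sketch of the BaseDownloader retry pattern: delays of
    base_delay * 2**attempt between attempts (1s, 2s, 4s, ...).
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: surface the error to the caller
            sleep(base_delay * (2 ** attempt))

# Example: a source that fails twice, then succeeds on the third attempt.
calls = {"n": 0}

def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network error")
    return "csv-bytes"

# sleep is injected so tests (and this example) don't actually wait.
result = with_retries(flaky_fetch, max_attempts=5, sleep=lambda _: None)
```

Injecting the sleep function keeps the downloader testable without real delays; the production class would additionally persist a checkpoint between attempts.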

Deliverables

  • services/data-harness/ directory structure
  • PostgreSQL database with all tables
  • BaseDownloader abstract class
  • FileStorage service
  • CLI with --help working

Exit Criteria

# These should work:
npm run harness -- --help
npm run harness -- status
npx prisma studio # Shows empty tables

📥 Phase 2: Data Sources (Days 5-12)

Goal: Implement all 5 downloaders with normalization

Tasks

| # | Task | Description | Effort | Dependencies |
|---|---|---|---|---|
| 2.1 | Football-Data downloader | Download CSVs for 10 leagues, 6 seasons | 6h | 1.5 |
| 2.2 | Football-Data normalizer | Parse CSV, map columns, insert to DB | 4h | 2.1 |
| 2.3 | Football odds parser | Extract odds from 10+ bookmakers per match | 3h | 2.2 |
| 2.4 | FiveThirtyEight downloader | Download SPI matches + rankings CSVs | 3h | 1.5 |
| 2.5 | FiveThirtyEight normalizer | Parse, merge with football_matches | 3h | 2.4 |
| 2.6 | Cricsheet downloader | Download ZIP, extract JSON files | 4h | 1.5 |
| 2.7 | Cricsheet normalizer | Parse match JSON, ball-by-ball data | 6h | 2.6 |
| 2.8 | Kaggle downloader | API integration, download 4 datasets | 4h | 1.5 |
| 2.9 | Kaggle SQLite parser | Parse European Soccer DB SQLite | 3h | 2.8 |
| 2.10 | GitHub repos syncer | Clone/pull 5 ML repos | 2h | 1.5 |
| 2.11 | Deduplicator | Match deduplication across sources | 4h | 2.2, 2.5 |
| 2.12 | Team name normalizer | Map team aliases to canonical names | 3h | 2.11 |
| 2.13 | Sync tracking | Log sync runs, track last successful sync | 2h | 2.1 |
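Task 2.12 is what lets the deduplicator line up the same fixture across sources that spell teams differently. A minimal sketch of the idea (the alias entries here are made-up examples; the real mapping lives in the database):

```python
import re

# Hypothetical alias map -- the production table (task 2.12) is DB-backed.
ALIASES = {
    "man united": "Manchester United",
    "man utd": "Manchester United",
    "manchester utd": "Manchester United",
    "spurs": "Tottenham Hotspur",
}

def canonical_team(name: str) -> str:
    """Map a raw team name from any source to its canonical form."""
    key = re.sub(r"[^a-z0-9 ]", "", name.lower()).strip()  # drop case/punctuation
    key = re.sub(r"\s+", " ", key)                          # collapse whitespace
    return ALIASES.get(key, name.strip())                   # fall back to input
```

Normalizing the lookup key (lowercase, no punctuation, single spaces) before hitting the alias table catches most source-to-source spelling drift; genuinely unknown names pass through unchanged so they can be flagged for manual aliasing.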

Deliverables

  • 5 working downloaders
  • Data normalized into PostgreSQL
  • ~200K football matches in database
  • ~5K cricket matches with ball-by-ball

Exit Criteria

# Full sync should work:
npm run harness -- sync --all

# Verify data:
npm run harness -- status
# Output: Football: 200,432 matches, Cricket: 5,234 matches

🧮 Phase 3: Algorithms (Days 13-17)

Goal: Port betting algorithms from GitHub repos to TypeScript

Tasks

| # | Task | Description | Effort | Dependencies |
|---|---|---|---|---|
| 3.1 | Create algorithms package | Set up services/algorithms/ with TypeScript | 2h | None |
| 3.2 | Port Shin method | Remove bookmaker margin, recover true probabilities | 4h | 3.1 |
| 3.3 | Port Poisson model | Goal distribution prediction | 6h | 3.1 |
| 3.4 | Port ELO ratings | Team strength with home advantage | 4h | 3.1 |
| 3.5 | Implement Kelly criterion | Optimal stake sizing | 3h | 3.2 |
| 3.6 | Implement implied probability | Convert odds to probabilities | 2h | 3.1 |
| 3.7 | Implement fair odds calculator | Combine multiple bookmaker prices | 3h | 3.2, 3.6 |
| 3.8 | Write unit tests | Test all algorithms with known values | 6h | 3.2-3.7 |
| 3.9 | Create algorithms API | Express router exposing algorithms | 3h | 3.8 |
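Task 3.5 reduces to a single formula: the Kelly fraction f* = (bp − q) / b, with b the net decimal odds (odds − 1), p the win probability, and q = 1 − p. A Python sketch for reference (the ported service is TypeScript):

```python
def kelly_fraction(probability: float, odds: float) -> float:
    """Full-Kelly fraction of bankroll for a back bet at decimal odds.

    f* = (b*p - q) / b, where b = odds - 1 and q = 1 - p.
    Returns 0 when there is no positive edge (never bet a negative-EV price).
    """
    b = odds - 1.0
    q = 1.0 - probability
    f = (b * probability - q) / b
    return max(f, 0.0)

# p = 0.55 at decimal odds 2.10: full Kelly stakes ~14.1% of bankroll.
stake = kelly_fraction(0.55, 2.10) * 1000
```

Note that full Kelly on the plan's example inputs (~£141 on a £1,000 bankroll) is larger than the £52 shown later; the smaller figure is consistent with staking a fraction of Kelly to cut variance, which practitioners commonly do.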

Deliverables

  • services/algorithms/ package
  • 6 ported algorithms with full test coverage
  • REST API for algorithm access

Key Algorithms

// What we're building:

// 1. Shin Method - true probabilities
shinProbabilities([2.10, 3.40, 3.50]);
// → [0.45, 0.28, 0.27]

// 2. Poisson Model - match prediction
poissonPredict({ homeXG: 1.8, awayXG: 1.2 });
// → { probHome: 0.52, probDraw: 0.24, probAway: 0.24 }

// 3. ELO Ratings - team strength
updateElo({ homeElo: 1600, awayElo: 1550, homeGoals: 2, awayGoals: 1 });
// → { homeElo: 1612, awayElo: 1538 }

// 4. Kelly Criterion - stake sizing
kellyStake({ probability: 0.55, odds: 2.10, bankroll: 1000 });
// → £52 stake

// 5. Fair Odds - market consensus
calculateFairOdds([
  { bookmaker: 'pinnacle', odds: 2.05 },
  { bookmaker: 'bet365', odds: 2.10 },
  { bookmaker: 'betfair', odds: 2.08 },
]);
// → 2.07 (weighted by sharpness)
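The Shin method above models a fraction z of bettors as insiders and solves for the z that makes the de-margined probabilities sum to 1. A Python sketch of the standard Shin (1993) estimator, solved here by bisection (this is our own formulation, not the ported code):

```python
from math import sqrt

def shin_probabilities(odds, tol=1e-10):
    """De-margin quoted decimal odds with Shin's method.

    pi_i = 1/odds_i are the bookmaker's implied probabilities (sum B > 1).
    Shin's model gives
        p_i(z) = (sqrt(z^2 + 4*(1-z)*pi_i^2 / B) - z) / (2*(1-z)),
    and z is chosen so the p_i sum to exactly 1.
    """
    pi = [1.0 / o for o in odds]
    booksum = sum(pi)

    def probs(z):
        return [(sqrt(z * z + 4 * (1 - z) * p * p / booksum) - z) / (2 * (1 - z))
                for p in pi]

    # sum(probs(0)) = sqrt(B) > 1 and the sum shrinks as z grows, so bisect.
    lo, hi = 0.0, 0.999
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if sum(probs(mid)) > 1:
            lo = mid
        else:
            hi = mid
    return probs((lo + hi) / 2)
```

Unlike plain normalization (dividing each 1/odds by the booksum), Shin removes proportionally more margin from longshots, which is why it is preferred for recovering "true" probabilities from bookmaker prices.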

Exit Criteria

# All tests pass:
npm run test:algorithms
# 24 tests passing

# API works:
curl http://localhost:3001/api/algorithms/shin?odds=2.10,3.40,3.50
# {"probabilities":[0.45,0.28,0.27]}

🤖 Phase 4: ML Pipeline (Days 18-25)

Goal: Build feature engineering and model training pipeline

Tasks

| # | Task | Description | Effort | Dependencies |
|---|---|---|---|---|
| 4.1 | Create ml-pipeline package | Set up Python project with poetry/pip | 2h | None |
| 4.2 | Feature: Form calculation | Last N matches stats per team | 6h | Phase 2 |
| 4.3 | Feature: Head-to-head | Historical matchup statistics | 4h | 4.2 |
| 4.4 | Feature: League position | Position at time of match | 3h | 4.2 |
| 4.5 | Feature: Odds-derived | Implied probabilities, overround | 3h | 4.2 |
| 4.6 | Feature: ELO ratings | Calculate ELO for all teams | 4h | 3.4, 4.2 |
| 4.7 | Feature store population | Insert features to ml_features table | 3h | 4.2-4.6 |
| 4.8 | XGBoost training notebook | 1X2 prediction model | 6h | 4.7 |
| 4.9 | Random Forest training | Alternative model for comparison | 4h | 4.7 |
| 4.10 | Neural Network training | Deep learning model | 6h | 4.7 |
| 4.11 | Model evaluation | Accuracy, ROI backtesting | 4h | 4.8-4.10 |
| 4.12 | Model export | Save trained models (joblib/ONNX) | 3h | 4.11 |
| 4.13 | Model serving API | FastAPI endpoint for predictions | 4h | 4.12 |

Deliverables

  • services/ml-pipeline/ Python package
  • 40+ features computed for all matches
  • 3 trained models (XGBoost, RF, NN)
  • FastAPI serving endpoint

Feature Engineering Details

# Features we calculate for each match:

class MatchFeatures:
    # Form (last 5 matches)
    home_wins_5: int               # 0-5
    home_draws_5: int
    home_losses_5: int
    home_goals_scored_5: float     # avg
    home_goals_conceded_5: float
    home_clean_sheets_5: int
    home_points_5: int             # 0-15

    away_wins_5: int
    away_draws_5: int
    away_losses_5: int
    away_goals_scored_5: float
    away_goals_conceded_5: float
    away_clean_sheets_5: int
    away_points_5: int

    # Head-to-head (last 5 meetings)
    h2h_home_wins: int
    h2h_draws: int
    h2h_away_wins: int
    h2h_avg_goals: float
    h2h_home_scored: float
    h2h_away_scored: float

    # League position
    home_position: int
    away_position: int
    home_points: int
    away_points: int
    position_diff: int

    # Odds-derived
    implied_prob_home: float
    implied_prob_draw: float
    implied_prob_away: float
    pinnacle_prob_home: float      # Sharp odds
    market_overround: float

    # Team ratings
    home_elo: float
    away_elo: float
    home_spi: float                # From 538
    away_spi: float
    elo_diff: float

    # Target variables
    result: str                    # 'H', 'D', 'A'
    over_2_5: bool
    btts: bool
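The form block above is a rolling window over a team's five most recent matches. A sketch of how those fields are derived (the helper name and list layout are ours for illustration):

```python
def form_last_n(results, goals_for, goals_against, n=5):
    """Compute last-N form features from most-recent-first match lists.

    `results` holds 'W'/'D'/'L'; the three lists are aligned per match.
    Mirrors the *_5 fields above: wins/draws/losses, points (3/1/0),
    goal averages, and clean sheets.
    """
    r, gf, ga = results[:n], goals_for[:n], goals_against[:n]
    return {
        "wins_5": r.count("W"),
        "draws_5": r.count("D"),
        "losses_5": r.count("L"),
        "points_5": 3 * r.count("W") + r.count("D"),
        "goals_scored_5": sum(gf) / len(gf),
        "goals_conceded_5": sum(ga) / len(ga),
        "clean_sheets_5": sum(1 for g in ga if g == 0),
    }

# W-W-D-L-W with the listed scorelines -> 10 points, 2 clean sheets.
features = form_last_n(["W", "W", "D", "L", "W"],
                       [2, 1, 1, 0, 3],
                       [0, 0, 1, 2, 1])
```

The important production detail is that the window must only include matches played *before* the fixture being featurized, otherwise the features leak the target.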

Exit Criteria

# Training complete:
python train.py --model xgboost --target 1x2
# Model accuracy: 52.3%, ROI: +3.2%

# Predictions work:
curl http://localhost:8000/predict -d '{"home_team":"Liverpool","away_team":"Chelsea"}'
# {"home":0.48,"draw":0.27,"away":0.25,"confidence":0.72}

🔗 Phase 5: Integration (Days 26-30)

Goal: Connect everything, set up automation, build services

Tasks

| # | Task | Description | Effort | Dependencies |
|---|---|---|---|---|
| 5.1 | Scheduler setup | node-cron with all sync jobs | 4h | Phase 2 |
| 5.2 | GitHub Actions workflow | Automated daily/weekly syncs | 3h | 5.1 |
| 5.3 | Value Detection service | Use Shin + ELO + ML predictions | 8h | Phase 3, 4 |
| 5.4 | Price Feed service | Publish normalized prices | 4h | Phase 2, 3 |
| 5.5 | Connect to existing backend | Redis pub/sub integration | 4h | 5.3, 5.4 |
| 5.6 | End-to-end testing | Full pipeline test | 6h | 5.1-5.5 |
| 5.7 | Performance optimization | Query optimization, caching | 4h | 5.6 |
| 5.8 | Monitoring dashboard | Grafana/simple stats page | 3h | 5.6 |
| 5.9 | Documentation | README, API docs, runbooks | 4h | All |
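At its core, the Value Detection service (task 5.3) compares a model probability against the best available price: a bet has value when probability × odds exceeds 1. A sketch of that check (the threshold and field names here are illustrative defaults, not the service's actual configuration):

```python
def detect_value(model_prob: float, best_odds: float, min_edge_pct: float = 2.0):
    """Flag a value bet when the model probability beats the price.

    edge% = (model_prob * best_odds - 1) * 100. A positive edge means
    the model rates the outcome as more likely than the market price
    implies. The 2% minimum edge is an assumed example threshold.
    """
    edge = (model_prob * best_odds - 1.0) * 100.0
    return {"edge_pct": round(edge, 1), "is_value": edge >= min_edge_pct}

# Model says 52% at best odds 2.05 -> positive edge, flagged as value.
signal = detect_value(model_prob=0.52, best_odds=2.05)
```

In the full service the model probability would itself blend Shin-de-margined market probabilities, ELO, and the ML predictions before this final comparison.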

Deliverables

  • Automated data sync (daily/weekly)
  • Value Detection service running
  • Integration with existing backend
  • Monitoring and alerting

Integration Architecture

┌─────────────────────────────────────────────────────────────┐
│                     FORSYT DATA MACHINE                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐   │
│  │ Data Harness │───▶│  PostgreSQL  │───▶│ ML Pipeline  │   │
│  │  (Node.js)   │    │  + Features  │    │   (Python)   │   │
│  └──────────────┘    └──────────────┘    └──────────────┘   │
│         │                   │                   │           │
│         ▼                   ▼                   ▼           │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐   │
│  │  Algorithms  │    │    Redis     │    │  ML Models   │   │
│  │  (Node.js)   │    │   Pub/Sub    │    │  (FastAPI)   │   │
│  └──────────────┘    └──────────────┘    └──────────────┘   │
│         │                   │                   │           │
│         └───────────────────┴───────────────────┘           │
│                             │                               │
│                             ▼                               │
│                  ┌──────────────────┐                       │
│                  │  Value Detection │                       │
│                  │    Price Feed    │                       │
│                  │  Chat Assistant  │                       │
│                  └──────────────────┘                       │
│                             │                               │
│                             ▼                               │
│                  ┌──────────────────┐                       │
│                  │     Frontend     │                       │
│                  │   (React/Next)   │                       │
│                  └──────────────────┘                       │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Exit Criteria

# Full system running:
docker-compose up -d

# Check status:
curl http://localhost:3000/api/health
# {"data_harness":"ok","ml_pipeline":"ok","algorithms":"ok"}

# Value detection working:
curl http://localhost:3000/api/value/detect?match=liverpool_vs_chelsea
# {"value_bets":[{"outcome":"over_2_5","edge":4.2,"confidence":"high"}]}

🛠️ Technology Stack

Services Overview

| Service | Language | Framework | Database |
|---|---|---|---|
| Data Harness | TypeScript | Node.js | PostgreSQL |
| Algorithms | TypeScript | Node.js/Express | - |
| ML Pipeline | Python 3.11 | Pandas, scikit-learn | PostgreSQL |
| ML Serving | Python 3.11 | FastAPI | Redis (cache) |
| Scheduler | TypeScript | node-cron | - |

Dependencies

Node.js (Data Harness + Algorithms):

{
  "dependencies": {
    "@prisma/client": "^5.x",
    "axios": "^1.x",
    "commander": "^12.x",
    "node-cron": "^3.x",
    "zod": "^3.x",
    "winston": "^3.x",
    "csv-parse": "^5.x",
    "adm-zip": "^0.5.x",
    "ioredis": "^5.x"
  },
  "devDependencies": {
    "typescript": "^5.x",
    "vitest": "^1.x",
    "prisma": "^5.x"
  }
}

Python (ML Pipeline):

# requirements.txt
pandas>=2.0.0
numpy>=1.24.0
scikit-learn>=1.3.0
xgboost>=2.0.0
tensorflow>=2.15.0
fastapi>=0.109.0
uvicorn>=0.25.0
sqlalchemy>=2.0.0
psycopg2-binary>=2.9.0
joblib>=1.3.0
pytest>=7.4.0

📁 Repository Structure

forsyt-data-machine/
├── services/
│ ├── data-harness/ # Data downloading & normalization
│ │ ├── src/
│ │ │ ├── downloaders/
│ │ │ │ ├── base.downloader.ts
│ │ │ │ ├── football-data.downloader.ts
│ │ │ │ ├── fivethirtyeight.downloader.ts
│ │ │ │ ├── cricsheet.downloader.ts
│ │ │ │ ├── kaggle.downloader.ts
│ │ │ │ └── github.downloader.ts
│ │ │ ├── processors/
│ │ │ │ ├── football.normalizer.ts
│ │ │ │ ├── cricket.normalizer.ts
│ │ │ │ └── deduplicator.ts
│ │ │ ├── storage/
│ │ │ │ ├── file-storage.ts
│ │ │ │ └── database.ts
│ │ │ ├── scheduler/
│ │ │ │ └── jobs.ts
│ │ │ ├── cli/
│ │ │ │ └── commands.ts
│ │ │ └── index.ts
│ │ ├── prisma/
│ │ │ └── schema.prisma
│ │ ├── data/ # Downloaded data (gitignored)
│ │ │ ├── raw/
│ │ │ └── processed/
│ │ ├── package.json
│ │ └── tsconfig.json
│ │
│ ├── algorithms/ # Ported betting algorithms
│ │ ├── src/
│ │ │ ├── shin.ts
│ │ │ ├── poisson.ts
│ │ │ ├── elo.ts
│ │ │ ├── kelly.ts
│ │ │ ├── fair-odds.ts
│ │ │ ├── api/
│ │ │ │ └── router.ts
│ │ │ └── index.ts
│ │ ├── tests/
│ │ │ ├── shin.test.ts
│ │ │ ├── poisson.test.ts
│ │ │ └── elo.test.ts
│ │ ├── package.json
│ │ └── tsconfig.json
│ │
│ └── ml-pipeline/ # Python ML training
│ ├── src/
│ │ ├── features/
│ │ │ ├── form.py
│ │ │ ├── h2h.py
│ │ │ ├── odds.py
│ │ │ └── elo.py
│ │ ├── models/
│ │ │ ├── xgboost_model.py
│ │ │ ├── random_forest.py
│ │ │ └── neural_net.py
│ │ ├── training/
│ │ │ ├── train.py
│ │ │ └── evaluate.py
│ │ └── serving/
│ │ ├── api.py
│ │ └── predict.py
│ ├── notebooks/
│ │ ├── 01_data_exploration.ipynb
│ │ ├── 02_feature_engineering.ipynb
│ │ └── 03_model_training.ipynb
│ ├── models/ # Saved models (gitignored)
│ ├── requirements.txt
│ └── pyproject.toml

├── docker/
│ ├── Dockerfile.harness
│ ├── Dockerfile.algorithms
│ ├── Dockerfile.ml
│ └── docker-compose.yml

├── .github/
│ └── workflows/
│ ├── sync-daily.yml
│ └── sync-weekly.yml

├── docs/
│ ├── DATA_HARNESS_DESIGN.md
│ ├── FORSYT_DATA_MACHINE_PROJECT_PLAN.md
│ └── API.md

├── .env.example
├── README.md
└── Makefile

⚡ Quick Start Commands

Once built, these are the primary commands:

# ─────────────────────────────────────────────────────────────
# DATA HARNESS
# ─────────────────────────────────────────────────────────────

# Initialize (first time)
npm run harness -- init

# Sync all data sources
npm run harness -- sync --all

# Sync specific source
npm run harness -- sync --source football-data
npm run harness -- sync --source cricsheet

# Check status
npm run harness -- status

# ─────────────────────────────────────────────────────────────
# ALGORITHMS
# ─────────────────────────────────────────────────────────────

# Start algorithms API
npm run algorithms -- serve

# Test algorithms
npm run algorithms -- test

# ─────────────────────────────────────────────────────────────
# ML PIPELINE
# ─────────────────────────────────────────────────────────────

# Generate features
python -m ml_pipeline.features.generate

# Train models
python -m ml_pipeline.training.train --model xgboost

# Evaluate
python -m ml_pipeline.training.evaluate --model xgboost

# Start prediction API
uvicorn ml_pipeline.serving.api:app --port 8000

# ─────────────────────────────────────────────────────────────
# DOCKER (Full Stack)
# ─────────────────────────────────────────────────────────────

# Start everything
docker-compose up -d

# View logs
docker-compose logs -f data-harness

# Run sync manually
docker-compose exec data-harness npm run harness -- sync --all

📋 Complete Task Checklist

Phase 1: Foundation ✅ COMPLETE

  • 1.1 Create project structure
  • 1.2 Initialize package.json
  • 1.3 Create Prisma schema
  • 1.4 Run database migrations
  • 1.5 Implement BaseDownloader
  • 1.6 Implement FileStorage
  • 1.7 Create CLI skeleton
  • 1.8 Add logging

Phase 2: Data Sources ✅ COMPLETE

  • 2.1 Football-Data downloader
  • 2.2 Football-Data normalizer
  • 2.3 Football odds parser
  • 2.4 FiveThirtyEight downloader
  • 2.5 FiveThirtyEight normalizer
  • 2.6 Cricsheet downloader
  • 2.7 Cricsheet normalizer
  • 2.8 Kaggle downloader
  • 2.9 Kaggle SQLite parser
  • 2.10 GitHub repos syncer
  • 2.11 Deduplicator
  • 2.12 Team name normalizer
  • 2.13 Sync tracking
  • 2.14 Tennis-Data downloader (added)
  • 2.15 Tennis normalizer (added)

Phase 3: Algorithms ✅ COMPLETE

  • 3.1 Create algorithms package
  • 3.2 Port Shin method
  • 3.3 Port Poisson model
  • 3.4 Port ELO ratings
  • 3.5 Implement Kelly criterion
  • 3.6 Implement implied probability
  • 3.7 Implement fair odds calculator
  • 3.8 Write unit tests
  • 3.9 Create algorithms API

Phase 4: ML Pipeline ✅ COMPLETE

  • 4.1 Create ml-pipeline package
  • 4.2 Feature: Form calculation
  • 4.3 Feature: Head-to-head
  • 4.4 Feature: League position
  • 4.5 Feature: Odds-derived
  • 4.6 Feature: ELO ratings
  • 4.7 Feature store population
  • 4.8 XGBoost training notebook
  • 4.9 Random Forest training
  • 4.10 Neural Network training
  • 4.11 Model evaluation
  • 4.12 Model export (ONNX)
  • 4.13 Model serving API
  • 4.14 Sport-specific models (football, cricket, tennis)
  • 4.15 Model registry with auto-discovery

Phase 5: Integration ✅ COMPLETE

  • 5.1 Scheduler setup
  • 5.2 GitHub Actions workflow
  • 5.3 Value Detection service (Port 3102)
  • 5.4 Price Feed service (Port 3100)
  • 5.5 Connect to SuperSkin services
  • 5.6 End-to-end testing
  • 5.7 Performance optimization
  • 5.8 Monitoring dashboard
  • 5.9 Documentation

🚦 Current Focus: Expanding Sports Coverage

| Sport | Data Status | ML Model Status | Accuracy |
|---|---|---|---|
| Football | ✅ 200K+ matches | ✅ Trained | ~68% |
| Cricket | ✅ 5K+ matches | ✅ Trained | ~65% |
| Tennis | ✅ 50K+ matches | ✅ Trained | ~70% |
| Basketball | 🔄 Collecting | 📋 Planned | - |
| Esports | 🔄 Collecting | 📋 Planned | - |

✅ System Status

The Forsyt Data Machine is fully operational and integrated with SuperSkin services:

# Check system status
curl http://localhost:3105/health
# {"status":"healthy","models":["football","cricket","tennis"]}

# Get predictions
curl http://localhost:3105/predict -d '{"sport":"football","home":"Liverpool","away":"Chelsea"}'
# {"home":0.48,"draw":0.27,"away":0.25,"confidence":0.72}

Next Steps

  1. Expand basketball data collection from Basketball-Reference
  2. Add esports data from HLTV and Liquipedia
  3. Improve model accuracy with additional features
  4. Add more leagues to football/cricket coverage