🔄 Data Harness: Design & Architecture

Version: 2.0 Date: January 2026 Status: ✅ IMPLEMENTED Purpose: Automated system to download, normalize, and store sports data for ML training

Executive Summary

The Data Harness is a standalone service that powers the ML Training Pipeline for SuperSkin. It:

Downloads data from 10+ free sources (Cricsheet, Football-Data, FiveThirtyEight, Kaggle, GitHub, Tennis-Data)
Normalizes data into a common schema for each sport (football, cricket, tennis)
Stores raw files + processed data in PostgreSQL (Supabase IceCrystal)
Generates ML-ready features for the 6 SuperSkin services

Current Status:

✅ Football data pipeline complete (200K+ matches)
✅ Cricket data pipeline complete (5K+ matches)
✅ Tennis data pipeline complete (50K+ matches)
✅ ML feature generation for all 3 sports
🔄 Basketball/Esports data collection in progress

Run frequency: Daily for live data, weekly for historical archives

📊 Data Sources Summary

Primary Sources (Implemented)

Source	Sport	Data Type	Size	Format	Status
Football-Data.co.uk	Football	Results + Odds	200K+ matches	CSV	✅ Complete
FiveThirtyEight	Football	SPI Ratings	100K+ matches	CSV	✅ Complete
Cricsheet	Cricket	Ball-by-ball	5K+ matches	JSON/YAML	✅ Complete
Tennis-Data.co.uk	Tennis	Results + Odds	50K+ matches	CSV	✅ Complete
Kaggle: IPL Complete	Cricket	Match + Delivery	1.9 MB	CSV	✅ Complete
Kaggle: European Soccer	Football	Results + Odds + Players	34 MB	SQLite	✅ Complete
Kaggle: Transfermarkt	Football	Player Values	172 MB	CSV	✅ Complete

Secondary Sources (In Progress)

Source	Sport	Data Type	Status
Basketball-Reference	Basketball	NBA/WNBA stats	🔄 In Progress
HLTV	Esports	CS2 match data	🔄 In Progress
Liquipedia	Esports	Multi-game data	📋 Planned
RSSSF	Football	Historical results	📋 Planned

ML Tools (GitHub Repos to Clone)

Repo	Purpose	Language	Stars
ProphitBet	Complete ML prediction app	Python	466 ⭐
octopy	Poisson, Shin, ELO algorithms	Python	70 ⭐
cricketdata	R package for cricket data	R	91 ⭐
transfermarkt-datasets	Auto-updating pipeline	Python	329 ⭐

🏗️ Architecture

High-Level Data Flow

┌─────────────────────────────────────────────────────────────────────────────┐
│                           DATA HARNESS PIPELINE                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌────────────┐ │
│  │  Scheduler   │───▶│  Downloaders │───▶│  Processors  │───▶│  Storage   │ │
│  │  (node-cron) │    │  (per source)│    │  (normalize) │    │  (PG+Files)│ │
│  └──────────────┘    └──────────────┘    └──────────────┘    └────────────┘ │
│        │                    │                   │                   │        │
│        │                    ▼                   ▼                   ▼        │
│        │              /data/raw/         /data/processed/      PostgreSQL    │
│        │              {source}/          {sport}/              + Features    │
│        │                                                                     │
│        └─────────────────── Retry on failure ───────────────────────────────│
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Component Details

1. Scheduler

Technology: node-cron or Bull queue
Jobs:
- sync-cricsheet - Weekly (Sundays 2am)
- sync-football-data - Daily (6am)
- sync-538 - Daily (6am)
- sync-kaggle - Weekly (Sundays 3am)
- sync-github-repos - Monthly (1st of month)

2. Downloaders (One per Source)

Each downloader is a standalone module that:

Handles authentication if needed (Kaggle API key)
Implements retry logic with exponential backoff
Tracks last successful sync timestamp
Downloads incrementally when possible
Validates downloaded files (checksums, size)

3. Processors

Normalizer: Converts source-specific formats to common schema
Deduplicator: Removes duplicate matches across sources
Validator: Ensures data matches expected schema
Feature Generator: Creates ML-ready features

4. Storage

Raw Files: /data/raw/{source}/{date}/ - Original downloads
Processed: /data/processed/{sport}/ - Normalized CSVs
PostgreSQL: Structured tables for querying
Feature Store: Pre-computed ML features

📁 Directory Structure

services/data-harness/
├── src/
│   ├── downloaders/
│   │   ├── base.downloader.ts       # Abstract base class
│   │   ├── cricsheet.downloader.ts
│   │   ├── football-data.downloader.ts
│   │   ├── fivethirtyeight.downloader.ts
│   │   ├── kaggle.downloader.ts
│   │   └── github.downloader.ts
│   │
│   ├── processors/
│   │   ├── normalizer/
│   │   │   ├── cricket.normalizer.ts
│   │   │   └── football.normalizer.ts
│   │   ├── deduplicator.ts
│   │   ├── validator.ts
│   │   └── feature-generator.ts
│   │
│   ├── storage/
│   │   ├── file-storage.ts          # Raw file management
│   │   ├── database.ts              # PostgreSQL operations
│   │   └── feature-store.ts         # ML features
│   │
│   ├── scheduler/
│   │   ├── enrichment-scheduler.ts  # Cron-based job runner with Redis locks
│   │   └── jobs/
│   │       ├── index.ts             # Job registry
│   │       ├── team-ratings.ts      # Daily ELO updates
│   │       ├── player-sync.ts       # Sportmonks player data
│   │       ├── odds-archival.ts     # Oracle → DataMachine archival
│   │       ├── ml-features.ts       # Form, H2H, ELO features
│   │       └── cricket-stats.ts     # Ball-by-ball aggregation
│   │
│   ├── cli/
│   │   └── commands.ts              # Manual sync commands
│   │
│   └── index.ts                     # Service entry point
│
├── data/                            # Data storage (gitignored)
│   ├── raw/
│   │   ├── cricsheet/
│   │   ├── football-data/
│   │   ├── fivethirtyeight/
│   │   ├── kaggle/
│   │   └── github/
│   └── processed/
│       ├── cricket/
│       └── football/
│
├── prisma/
│   └── schema.prisma                # Database schema
│
├── package.json
├── tsconfig.json
└── README.md

📥 Downloader Specifications

1. Cricsheet Downloader

Source: https://cricsheet.org/downloads/

Files to Download:

File	Size	Contents
`all_json.zip`	~50 MB	All matches in JSON format
`ipl_json.zip`	~5 MB	IPL only
`t20s_json.zip`	~15 MB	All T20 internationals

Download Logic:

interface CricsheetDownloader {
  // Check if new data available (compare Last-Modified header)
  async checkForUpdates(): Promise<boolean>;

  // Download and extract ZIP files
  async download(): Promise<DownloadResult>;

  // Parse JSON/YAML match files
  async parseMatches(dir: string): Promise<CricketMatch[]>;
}

Incremental Strategy:

Store Last-Modified header from previous download
Only re-download if changed
Keep last 3 versions for rollback

2. Football-Data.co.uk Downloader

Source: https://www.football-data.co.uk/data.php

Files to Download (per league per season):

League	URL Pattern	Example
Premier League	`/mmz4281/{season}/E0.csv`	`/mmz4281/2425/E0.csv`
Championship	`/mmz4281/{season}/E1.csv`	`/mmz4281/2425/E1.csv`
La Liga	`/mmz4281/{season}/SP1.csv`	`/mmz4281/2425/SP1.csv`
Bundesliga	`/mmz4281/{season}/D1.csv`	`/mmz4281/2425/D1.csv`
Serie A	`/mmz4281/{season}/I1.csv`	`/mmz4281/2425/I1.csv`
Ligue 1	`/mmz4281/{season}/F1.csv`	`/mmz4281/2425/F1.csv`

Season Format: 2425 = 2024-25 season

CSV Columns (Betting Odds):

Date, HomeTeam, AwayTeam, FTHG, FTAG, FTR, HTHG, HTAG, HTR,
B365H, B365D, B365A,      # Bet365
BWH, BWD, BWA,            # Betway
IWH, IWD, IWA,            # Interwetten
PSH, PSD, PSA,            # Pinnacle
WHH, WHD, WHA,            # William Hill
VCH, VCD, VCA,            # VC Bet
MaxH, MaxD, MaxA,         # Maximum odds
AvgH, AvgD, AvgA          # Average odds

Download Logic:

const LEAGUES = ['E0', 'E1', 'SP1', 'D1', 'I1', 'F1', 'N1', 'P1', 'T1', 'G1'];
const SEASONS = ['2425', '2324', '2223', '2122', '2021', '1920']; // Last 6 seasons

async downloadAllLeagues(): Promise<void> {
  for (const league of LEAGUES) {
    for (const season of SEASONS) {
      await this.downloadLeagueSeason(league, season);
      await sleep(1000); // Rate limit: 1 req/sec
    }
  }
}

3. FiveThirtyEight Downloader

Source: https://projects.fivethirtyeight.com/soccer-api/

Files to Download:

File	URL	Update Freq
SPI Matches	`club/spi_matches.csv`	Historical
SPI Latest	`club/spi_matches_latest.csv`	Was daily
Global Rankings	`club/spi_global_rankings.csv`	Current ratings
International	`international/spi_matches_intl.csv`	Intl matches

CSV Columns (SPI Matches):

date, league_id, league, team1, team2, spi1, spi2,
prob1, prob2, probtie, proj_score1, proj_score2,
importance1, importance2, score1, score2, xg1, xg2,
nsxg1, nsxg2, adj_score1, adj_score2

Key Fields:

spi1, spi2 - Team power ratings (0-100 scale)
prob1, prob2, probtie - Win/draw probabilities
xg1, xg2 - Expected goals

4. Kaggle Downloader

Prerequisites:

Kaggle API key in ~/.kaggle/kaggle.json
Install: pip install kaggle

Datasets to Download:

const KAGGLE_DATASETS = [
  {
    id: 'patrickb1912/ipl-complete-dataset-20082020',
    name: 'ipl-complete',
    files: ['matches.csv', 'deliveries.csv'],
    frequency: 'yearly'
  },
  {
    id: 'hugomathien/soccer',
    name: 'european-soccer',
    files: ['database.sqlite'],
    frequency: 'static'  // One-time download
  },
  {
    id: 'davidcariboo/player-scores',
    name: 'transfermarkt',
    files: ['games.csv', 'players.csv', 'appearances.csv', 'player_valuations.csv'],
    frequency: 'weekly'
  },
  {
    id: 'saife245/english-premier-league',
    name: 'epl-20-years',
    files: ['EPL_Set.csv'],
    frequency: 'static'
  }
];

Download Command:

kaggle datasets download -d {dataset_id} -p ./data/raw/kaggle/{name}/

5. GitHub Downloader

Repos to Clone/Update:

const GITHUB_REPOS = [
  {
    url: 'https://github.com/kochlisGit/ProphitBet-Soccer-Bets-Predictor',
    name: 'prophitbet',
    purpose: 'ML models, feature engineering'
  },
  {
    url: 'https://github.com/octosport/octopy',
    name: 'octopy',
    purpose: 'Poisson, Shin, ELO algorithms'
  },
  {
    url: 'https://github.com/robjhyndman/cricketdata',
    name: 'cricketdata',
    purpose: 'Cricket data fetching (R)'
  },
  {
    url: 'https://github.com/dcaribou/transfermarkt-datasets',
    name: 'transfermarkt-datasets',
    purpose: 'Data pipeline reference'
  },
  {
    url: 'https://github.com/jrbadiabo/Bet-on-Sibyl',
    name: 'bet-on-sibyl',
    purpose: 'Multi-sport prediction models'
  }
];

Clone/Update Logic:

async syncRepo(repo: GitHubRepo): Promise<void> {
  const localPath = `./data/raw/github/${repo.name}`;

  if (await exists(localPath)) {
    // Update existing repo
    await exec(`git -C ${localPath} pull`);
  } else {
    // Clone new repo
    await exec(`git clone ${repo.url} ${localPath}`);
  }
}

🗄️ Database Schema

Normalized Tables (PostgreSQL)

-- =============================================
-- FOOTBALL TABLES
-- =============================================

CREATE TABLE football_matches (
  id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  external_id     TEXT UNIQUE,           -- Source-specific ID for deduplication
  source          TEXT NOT NULL,         -- 'football-data', '538', 'kaggle-euro', etc.

  -- Match Info
  date            DATE NOT NULL,
  kickoff_time    TIMESTAMPTZ,
  league          TEXT NOT NULL,
  season          TEXT NOT NULL,         -- '2024-25'
  home_team       TEXT NOT NULL,
  away_team       TEXT NOT NULL,

  -- Results
  home_goals      INT,
  away_goals      INT,
  home_goals_ht   INT,                   -- Half-time
  away_goals_ht   INT,
  result          TEXT,                  -- 'H', 'D', 'A'

  -- Statistics
  home_shots      INT,
  away_shots      INT,
  home_corners    INT,
  away_corners    INT,
  home_fouls      INT,
  away_fouls      INT,
  home_yellow     INT,
  away_yellow     INT,
  home_red        INT,
  away_red        INT,

  -- FiveThirtyEight SPI (if available)
  home_spi        DECIMAL(5,2),          -- 0-100 power rating
  away_spi        DECIMAL(5,2),
  home_xg         DECIMAL(4,2),          -- Expected goals
  away_xg         DECIMAL(4,2),
  prob_home       DECIMAL(5,4),          -- Win probability
  prob_draw       DECIMAL(5,4),
  prob_away       DECIMAL(5,4),

  -- Timestamps
  created_at      TIMESTAMPTZ DEFAULT NOW(),
  updated_at      TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_football_matches_date ON football_matches(date);
CREATE INDEX idx_football_matches_league ON football_matches(league, season);
CREATE INDEX idx_football_matches_teams ON football_matches(home_team, away_team);


-- Historical betting odds
CREATE TABLE football_odds (
  id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  match_id        UUID REFERENCES football_matches(id) ON DELETE CASCADE,

  bookmaker       TEXT NOT NULL,         -- 'bet365', 'pinnacle', 'betway', etc.
  market          TEXT NOT NULL,         -- '1x2', 'over_under_2.5', 'btts'

  -- 1X2 odds
  odds_home       DECIMAL(6,3),
  odds_draw       DECIMAL(6,3),
  odds_away       DECIMAL(6,3),

  -- Over/Under 2.5
  odds_over       DECIMAL(6,3),
  odds_under      DECIMAL(6,3),

  -- BTTS (Both Teams To Score)
  odds_btts_yes   DECIMAL(6,3),
  odds_btts_no    DECIMAL(6,3),

  timestamp       TIMESTAMPTZ,           -- When odds were recorded

  UNIQUE(match_id, bookmaker, market)
);

CREATE INDEX idx_football_odds_match ON football_odds(match_id);


-- Team metadata and current ratings
CREATE TABLE football_teams (
  id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  name            TEXT NOT NULL,
  normalized_name TEXT NOT NULL UNIQUE,  -- Standardized name
  country         TEXT,
  league          TEXT,

  -- Current ratings (updated from 538/ELO)
  spi_rating      DECIMAL(5,2),
  elo_rating      DECIMAL(7,2),
  offensive_rating DECIMAL(5,2),
  defensive_rating DECIMAL(5,2),

  -- Aliases for name matching
  aliases         TEXT[],                -- ['Man Utd', 'Manchester United', 'Man United']

  updated_at      TIMESTAMPTZ DEFAULT NOW()
);


-- =============================================
-- CRICKET TABLES
-- =============================================

CREATE TABLE cricket_matches (
  id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  external_id     TEXT UNIQUE,
  source          TEXT NOT NULL,         -- 'cricsheet', 'kaggle-ipl'

  -- Match Info
  date            DATE NOT NULL,
  venue           TEXT,
  city            TEXT,
  match_type      TEXT NOT NULL,         -- 'T20', 'ODI', 'Test', 'IPL'
  competition     TEXT,                  -- 'IPL 2024', 'T20 World Cup 2024'
  season          TEXT,

  -- Teams
  team1           TEXT NOT NULL,
  team2           TEXT NOT NULL,
  toss_winner     TEXT,
  toss_decision   TEXT,                  -- 'bat', 'field'

  -- Result
  winner          TEXT,
  win_type        TEXT,                  -- 'runs', 'wickets', 'super_over'
  win_margin      INT,
  result          TEXT,                  -- 'normal', 'tie', 'no_result', 'DLS'
  player_of_match TEXT,

  -- Scores
  team1_runs      INT,
  team1_wickets   INT,
  team1_overs     DECIMAL(4,1),
  team2_runs      INT,
  team2_wickets   INT,
  team2_overs     DECIMAL(4,1),

  created_at      TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_cricket_matches_date ON cricket_matches(date);
CREATE INDEX idx_cricket_matches_type ON cricket_matches(match_type, competition);


-- Ball-by-ball data (from Cricsheet)
CREATE TABLE cricket_deliveries (
  id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  match_id        UUID REFERENCES cricket_matches(id) ON DELETE CASCADE,

  innings         INT NOT NULL,          -- 1 or 2
  over_number     INT NOT NULL,
  ball_number     INT NOT NULL,

  batsman         TEXT NOT NULL,
  non_striker     TEXT NOT NULL,
  bowler          TEXT NOT NULL,

  runs_batsman    INT DEFAULT 0,
  runs_extras     INT DEFAULT 0,
  runs_total      INT DEFAULT 0,

  extras_type     TEXT,                  -- 'wide', 'noball', 'bye', 'legbye'

  wicket          BOOLEAN DEFAULT FALSE,
  wicket_type     TEXT,                  -- 'bowled', 'caught', 'lbw', 'run_out'
  wicket_player   TEXT,
  wicket_fielder  TEXT,

  -- Running totals at this point
  total_runs      INT,
  total_wickets   INT,

  UNIQUE(match_id, innings, over_number, ball_number)
);

CREATE INDEX idx_cricket_deliveries_match ON cricket_deliveries(match_id);
CREATE INDEX idx_cricket_deliveries_player ON cricket_deliveries(batsman);


-- =============================================
-- ML FEATURE STORE
-- =============================================

CREATE TABLE ml_features_football (
  id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  match_id        UUID REFERENCES football_matches(id) ON DELETE CASCADE,

  -- Pre-computed features for ML models
  -- Form (last 5 matches)
  home_form_points    INT,               -- Points from last 5 games
  away_form_points    INT,
  home_form_goals_for DECIMAL(4,2),      -- Avg goals scored last 5
  home_form_goals_ag  DECIMAL(4,2),      -- Avg goals conceded last 5
  away_form_goals_for DECIMAL(4,2),
  away_form_goals_ag  DECIMAL(4,2),

  -- Head to head (last 5 meetings)
  h2h_home_wins       INT,
  h2h_draws           INT,
  h2h_away_wins       INT,
  h2h_avg_goals       DECIMAL(4,2),

  -- League position at time of match
  home_league_pos     INT,
  away_league_pos     INT,
  home_league_points  INT,
  away_league_points  INT,

  -- Odds-derived
  implied_prob_home   DECIMAL(5,4),
  implied_prob_draw   DECIMAL(5,4),
  implied_prob_away   DECIMAL(5,4),
  odds_value_home     DECIMAL(5,4),      -- Edge vs fair odds
  odds_value_away     DECIMAL(5,4),

  -- Target variables
  target_result       TEXT,              -- 'H', 'D', 'A'
  target_over_2_5     BOOLEAN,
  target_btts         BOOLEAN,

  computed_at         TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_ml_features_match ON ml_features_football(match_id);


-- =============================================
-- SYNC TRACKING
-- =============================================

CREATE TABLE data_sync_log (
  id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  source          TEXT NOT NULL,
  started_at      TIMESTAMPTZ NOT NULL,
  completed_at    TIMESTAMPTZ,
  status          TEXT NOT NULL,         -- 'running', 'success', 'failed'
  records_added   INT DEFAULT 0,
  records_updated INT DEFAULT 0,
  error_message   TEXT,
  metadata        JSONB                  -- Source-specific info
);

CREATE INDEX idx_sync_log_source ON data_sync_log(source, started_at DESC);


-- =============================================
-- PLAYER DATA TABLES (NEW)
-- =============================================

CREATE TABLE football_players (
  id                UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  external_id       TEXT UNIQUE,           -- Sportmonks/Transfermarkt ID
  source            TEXT NOT NULL,         -- 'sportmonks', 'transfermarkt'

  name              TEXT NOT NULL,
  normalized_name   TEXT NOT NULL,
  first_name        TEXT,
  last_name         TEXT,
  date_of_birth     DATE,
  nationality       TEXT,
  height            INT,                   -- cm
  weight            INT,                   -- kg
  position          TEXT,                  -- 'GK', 'DF', 'MF', 'FW'
  position_detail   TEXT,                  -- 'Centre-Back', 'Left Winger'
  foot              TEXT,                  -- 'Left', 'Right', 'Both'

  team_id           UUID REFERENCES football_teams(id),
  shirt_number      INT,

  market_value      DECIMAL(12,2),         -- EUR
  market_value_date DATE,

  total_apps        INT DEFAULT 0,
  total_goals       INT DEFAULT 0,
  total_assists     INT DEFAULT 0,

  created_at        TIMESTAMPTZ DEFAULT NOW(),
  updated_at        TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_players_team ON football_players(team_id);
CREATE INDEX idx_players_name ON football_players(normalized_name);


CREATE TABLE player_appearances (
  id                UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  player_id         UUID REFERENCES football_players(id) ON DELETE CASCADE,

  match_date        DATE NOT NULL,
  competition       TEXT NOT NULL,
  season            TEXT NOT NULL,
  home_team         TEXT NOT NULL,
  away_team         TEXT NOT NULL,
  match_result      TEXT,                  -- '2-1', '0-0'

  started           BOOLEAN DEFAULT FALSE,
  minutes_played    INT,
  substitute_in     INT,
  substitute_out    INT,

  goals             INT DEFAULT 0,
  assists           INT DEFAULT 0,
  yellow_cards      INT DEFAULT 0,
  red_cards         INT DEFAULT 0,

  rating            DECIMAL(3,1),          -- Match rating 0-10

  created_at        TIMESTAMPTZ DEFAULT NOW(),

  UNIQUE(player_id, match_date, home_team, away_team)
);

CREATE INDEX idx_appearances_player ON player_appearances(player_id);
CREATE INDEX idx_appearances_date ON player_appearances(match_date);


-- =============================================
-- CRICKET PLAYER TABLES (NEW)
-- =============================================

CREATE TABLE cricket_players (
  id                UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  external_id       TEXT UNIQUE,           -- Roanuz/Cricsheet ID
  source            TEXT NOT NULL,

  name              TEXT NOT NULL,
  normalized_name   TEXT NOT NULL,
  date_of_birth     DATE,
  nationality       TEXT,
  batting_style     TEXT,                  -- 'Right-hand bat'
  bowling_style     TEXT,                  -- 'Right-arm fast'
  role              TEXT,                  -- 'Batsman', 'Bowler', 'All-rounder'

  current_team      TEXT,
  ipl_team          TEXT,

  t20_matches       INT DEFAULT 0,
  t20_runs          INT DEFAULT 0,
  t20_wickets       INT DEFAULT 0,
  t20_average       DECIMAL(6,2),
  t20_strike_rate   DECIMAL(6,2),
  t20_economy       DECIMAL(4,2),

  created_at        TIMESTAMPTZ DEFAULT NOW(),
  updated_at        TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_cricket_players_name ON cricket_players(normalized_name);


CREATE TABLE cricket_player_season_stats (
  id                UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  player_id         UUID REFERENCES cricket_players(id) ON DELETE CASCADE,

  season            TEXT NOT NULL,         -- '2024', 'IPL 2024'
  competition       TEXT NOT NULL,
  team_name         TEXT NOT NULL,

  -- Batting
  matches           INT DEFAULT 0,
  innings           INT DEFAULT 0,
  runs              INT DEFAULT 0,
  balls_faced       INT DEFAULT 0,
  not_outs          INT DEFAULT 0,
  high_score        INT,
  average           DECIMAL(6,2),
  strike_rate       DECIMAL(6,2),
  fours             INT DEFAULT 0,
  sixes             INT DEFAULT 0,
  fifties           INT DEFAULT 0,
  hundreds          INT DEFAULT 0,

  -- Bowling
  overs_bowled      DECIMAL(5,1),
  wickets           INT DEFAULT 0,
  runs_conceded     INT DEFAULT 0,
  economy           DECIMAL(4,2),
  bowling_average   DECIMAL(6,2),
  best_bowling      TEXT,                  -- '4/25'

  updated_at        TIMESTAMPTZ DEFAULT NOW(),

  UNIQUE(player_id, season, competition)
);

CREATE INDEX idx_cricket_stats_player ON cricket_player_season_stats(player_id);

⏰ Data Enrichment Scheduler (NEW)

The Data Harness includes a cron-based enrichment scheduler that runs continuous data updates:

Enrichment Jobs

Job ID	Name	Schedule	Description
`team-ratings`	Team Ratings Update	Daily 4:00 AM	Updates ELO ratings from recent match results
`player-sync`	Player Stats Sync	Every 6 hours	Syncs player data from Sportmonks API
`odds-archival`	Odds History Archival	Every hour	Archives odds from Oracle → Data Machine
`ml-features`	ML Feature Computation	Every 2 hours	Computes form, H2H, ELO features
`cricket-stats`	Cricket Stats Aggregation	Daily 5:00 AM	Aggregates ball-by-ball into player stats
`ipl-live-stats`	IPL Live Stats	Every 15 min	Real-time IPL stats (during season)

Scheduler Architecture

// src/scheduler/enrichment-scheduler.ts

class EnrichmentScheduler {
  private redis: Redis;           // For distributed locking
  private jobs: Map<string, cron.ScheduledTask>;

  registerJob(job: EnrichmentJob): void;
  async start(): Promise<void>;
  async executeJob(jobId: string): Promise<void>;
  async stop(): Promise<void>;
  getStatus(): Record<string, JobStatus>;
}

Job Configuration

const enrichmentJobs = [
  {
    id: 'team-ratings',
    name: 'Team Ratings Update',
    schedule: '0 4 * * *',        // Daily at 4 AM
    handler: updateTeamRatings,
    enabled: true
  },
  {
    id: 'player-sync',
    name: 'Player Stats Sync',
    schedule: '0 */6 * * *',      // Every 6 hours
    handler: syncPlayerStats,
    enabled: true
  },
  {
    id: 'cricket-stats',
    name: 'Cricket Stats Aggregation',
    schedule: '0 5 * * *',        // Daily at 5 AM
    handler: aggregateCricketStats,
    enabled: true
  }
];

Distributed Locking

Jobs use Redis locks to prevent concurrent execution across multiple instances:

// Before executing job
const lockKey = `enrichment:lock:${jobId}`;
const acquired = await redis.set(lockKey, '1', 'EX', 3600, 'NX');

if (!acquired) {
  logger.warn(`Job ${jobId} already running, skipping`);
  return;
}

// Execute job...

// Release lock
await redis.del(lockKey);

💻 CLI Commands

The Data Harness provides CLI commands for manual operations:

# Initialize database and directory structure
npm run harness -- init

# Sync all sources
npm run harness -- sync --all

# Sync specific source
npm run harness -- sync --source cricsheet
npm run harness -- sync --source football-data
npm run harness -- sync --source fivethirtyeight
npm run harness -- sync --source kaggle
npm run harness -- sync --source github

# Sync with date range (for Football-Data)
npm run harness -- sync --source football-data --from 2024-01-01 --to 2024-12-31

# Download specific seasons
npm run harness -- sync --source football-data --seasons 2425,2324

# View sync status
npm run harness -- status

# Generate ML features
npm run harness -- features --sport football --from 2020-01-01

# Clean old raw files (keep last N days)
npm run harness -- clean --keep-days 30

# Validate data integrity
npm run harness -- validate

# Export data for analysis
npm run harness -- export --sport football --format csv --output ./exports/

🔧 TypeScript Interfaces

Core Types

// src/types/index.ts

export type Sport = 'football' | 'cricket';
export type DataSource = 'cricsheet' | 'football-data' | 'fivethirtyeight' | 'kaggle' | 'github';

export interface SyncResult {
  source: DataSource;
  status: 'success' | 'failed' | 'partial';
  startedAt: Date;
  completedAt: Date;
  recordsAdded: number;
  recordsUpdated: number;
  errors: string[];
}

export interface DownloaderConfig {
  source: DataSource;
  baseUrl: string;
  rateLimit: number;        // Requests per second
  retryAttempts: number;
  retryDelayMs: number;
}

// Base downloader interface
export interface IDownloader {
  readonly source: DataSource;
  checkForUpdates(): Promise<boolean>;
  download(): Promise<DownloadResult>;
  getLastSync(): Promise<Date | null>;
}

export interface DownloadResult {
  success: boolean;
  filesDownloaded: string[];
  totalBytes: number;
  errors: string[];
}

Football Data Types

// src/types/football.ts

export interface FootballMatch {
  id?: string;
  externalId: string;
  source: DataSource;

  date: Date;
  kickoffTime?: Date;
  league: string;
  season: string;
  homeTeam: string;
  awayTeam: string;

  homeGoals?: number;
  awayGoals?: number;
  homeGoalsHT?: number;
  awayGoalsHT?: number;
  result?: 'H' | 'D' | 'A';

  // Statistics
  stats?: {
    homeShots?: number;
    awayShots?: number;
    homeCorners?: number;
    awayCorners?: number;
    homeFouls?: number;
    awayFouls?: number;
  };

  // FiveThirtyEight SPI data
  spi?: {
    homeSPI: number;
    awaySPI: number;
    homeXG: number;
    awayXG: number;
    probHome: number;
    probDraw: number;
    probAway: number;
  };
}

export interface FootballOdds {
  matchId: string;
  bookmaker: string;
  market: '1x2' | 'over_under_2.5' | 'btts';

  home?: number;
  draw?: number;
  away?: number;
  over?: number;
  under?: number;

  timestamp?: Date;
}

export interface TeamRating {
  team: string;
  normalizedName: string;
  spiRating: number;
  eloRating: number;
  offensiveRating: number;
  defensiveRating: number;
}

Cricket Data Types

// src/types/cricket.ts

export interface CricketMatch {
  id?: string;
  externalId: string;
  source: DataSource;

  date: Date;
  venue?: string;
  city?: string;
  matchType: 'T20' | 'ODI' | 'Test' | 'IPL' | 'BBL' | 'PSL';
  competition?: string;
  season?: string;

  team1: string;
  team2: string;
  tossWinner?: string;
  tossDecision?: 'bat' | 'field';

  winner?: string;
  winType?: 'runs' | 'wickets' | 'super_over' | 'tie' | 'no_result';
  winMargin?: number;
  playerOfMatch?: string;

  team1Score?: InningsScore;
  team2Score?: InningsScore;
}

export interface InningsScore {
  runs: number;
  wickets: number;
  overs: number;
}

export interface CricketDelivery {
  matchId: string;
  innings: 1 | 2;
  overNumber: number;
  ballNumber: number;

  batsman: string;
  nonStriker: string;
  bowler: string;

  runsBatsman: number;
  runsExtras: number;
  runsTotal: number;

  extrasType?: 'wide' | 'noball' | 'bye' | 'legbye';

  wicket: boolean;
  wicketType?: string;
  wicketPlayer?: string;
  wicketFielder?: string;
}

📅 Implementation Status (All Complete)

Phase 1: Foundation ✅ COMPLETE

✅ Created services/data-harness/ directory structure
✅ Set up package.json with dependencies
✅ Created Prisma schema for PostgreSQL (Supabase)
✅ Implemented base downloader class
✅ Implemented file storage service
✅ Implemented Football-Data.co.uk downloader
✅ Implemented football normalizer
✅ Created CLI skeleton

Phase 2: All Downloaders ✅ COMPLETE

✅ Implemented Cricsheet downloader
✅ Implemented FiveThirtyEight downloader
✅ Implemented Tennis-Data.co.uk downloader
✅ Implemented cricket normalizer
✅ Implemented tennis normalizer
✅ Added sync tracking to database
✅ Implemented Kaggle downloader
✅ Implemented GitHub repo syncer
✅ Added retry logic and error handling
✅ Completed CLI commands

Phase 3: Features & Scheduler ✅ COMPLETE

✅ Implemented ML feature generator (football)
✅ Implemented ML feature generator (cricket)
✅ Implemented ML feature generator (tennis)
✅ Added data validation
✅ Created deduplication logic
✅ Set up node-cron scheduler
✅ Configured job schedules
✅ Added monitoring/logging

Phase 4: ML Integration ✅ COMPLETE

✅ Connected to SuperSkin ML Prediction Service
✅ Implemented sport-specific model training
✅ Created ONNX export pipeline
✅ Set up model registry with auto-discovery

Current Focus: Expanding Sports Coverage

Sport	Data Status	ML Model Status
Football	✅ 200K+ matches	✅ Trained (~68% accuracy)
Cricket	✅ 5K+ matches	✅ Trained (~65% accuracy)
Tennis	✅ 50K+ matches	✅ Trained (~70% accuracy)
Basketball	🔄 Collecting	📋 Planned
Esports	🔄 Collecting	📋 Planned

🔐 Configuration

Environment Variables

# Database
DATABASE_URL=postgresql://user:pass@localhost:5432/data_harness

# Kaggle API (required for Kaggle downloads)
KAGGLE_USERNAME=your_username
KAGGLE_KEY=your_api_key

# GitHub Token (optional, for higher rate limits)
GITHUB_TOKEN=ghp_xxxxxxxxxxxx

# Storage paths
DATA_RAW_PATH=./data/raw
DATA_PROCESSED_PATH=./data/processed

# Rate limits
FOOTBALL_DATA_RATE_LIMIT=1    # requests per second
KAGGLE_RATE_LIMIT=5           # requests per second

# Retention
RAW_DATA_RETENTION_DAYS=30    # Keep raw files for 30 days

Schedule Configuration

// src/scheduler/jobs.ts

export const SYNC_SCHEDULES = {
  'football-data': {
    cron: '0 6 * * *',        // Daily at 6am
    timeout: 30 * 60 * 1000,  // 30 min timeout
    retries: 3
  },
  'fivethirtyeight': {
    cron: '0 6 * * *',        // Daily at 6am
    timeout: 10 * 60 * 1000,  // 10 min timeout
    retries: 3
  },
  'cricsheet': {
    cron: '0 2 * * 0',        // Sundays at 2am
    timeout: 60 * 60 * 1000,  // 60 min timeout
    retries: 3
  },
  'kaggle': {
    cron: '0 3 * * 0',        // Sundays at 3am
    timeout: 120 * 60 * 1000, // 2 hour timeout
    retries: 2
  },
  'github': {
    cron: '0 4 1 * *',        // 1st of month at 4am
    timeout: 30 * 60 * 1000,  // 30 min timeout
    retries: 2
  }
};

📊 Expected Data Volumes

Source	Records	Storage	Initial Download
Football-Data.co.uk	200K matches	~50 MB	~5 min
FiveThirtyEight	100K matches	~30 MB	~2 min
Cricsheet	5K matches, 500K deliveries	~100 MB	~10 min
Kaggle: IPL	1K matches	~5 MB	~1 min
Kaggle: European Soccer	25K matches	~35 MB	~2 min
Kaggle: Transfermarkt	1.2M appearances	~175 MB	~5 min
GitHub repos	5 repos	~500 MB	~10 min
Total		~900 MB	~35 min

✅ Next Steps

Confirm directory location: Should this be in services/data-harness/ or a different location?
Database choice: Use existing PostgreSQL or create dedicated instance?
Start building: Begin with Phase 1 (Foundation + Football-Data downloader)

Would you like me to start implementing the Data Harness service?

🧠 ML GitHub Repos: Integration Strategy

Overview

These GitHub repos are NOT meant to be run as-is. Instead, we:

Extract algorithms → Port key mathematical functions to TypeScript/Python
Learn patterns → Understand feature engineering approaches
Reference implementations → Use as documentation for complex calculations
Clone for training → Run ML training notebooks offline

Usage Matrix

Repo	Strategy	What to Extract	Target Service
ProphitBet	Extract + Adapt	Feature engineering, model training	ML Pipeline
octopy	Port to TypeScript	Shin, Poisson, ELO algorithms	Value Detection
transfermarkt-datasets	Reference	dbt pipeline pattern	Data Harness
cricketdata	Run as-is (R)	Cricket data fetching	Data Harness
Bet-on-Sibyl	Reference	Multi-sport scraping	Data Harness

1️⃣ ProphitBet (HIGHEST VALUE) ⭐⭐⭐⭐⭐

What it is: Complete betting prediction app with ML models

What to extract:

# From: ProphitBet/database/repositories/league.py
# This shows how they calculate features from Football-Data.co.uk

class LeagueRepository:
    def get_features(self, match):
        return {
            'home_wins': self._count_wins(home_team, last_n=5),
            'home_draws': self._count_draws(home_team, last_n=5),
            'home_losses': self._count_losses(home_team, last_n=5),
            'home_goals_scored': self._avg_goals_scored(home_team, last_n=5),
            'home_goals_conceded': self._avg_goals_conceded(home_team, last_n=5),
            'away_wins': ...,
            # ... 40+ features
        }

Feature Engineering (to port):

// Port to: services/ml-pipeline/src/features/football.ts

interface FootballFeatures {
  // Form-based (last N matches)
  homeWins: number;
  homeDraws: number;
  homeLosses: number;
  homeGoalsScored: number;      // Average
  homeGoalsConceded: number;
  homeCleanSheets: number;

  awayWins: number;
  awayDraws: number;
  awayLosses: number;
  awayGoalsScored: number;
  awayGoalsConceded: number;

  // Head-to-head
  h2hHomeWins: number;
  h2hDraws: number;
  h2hAwayWins: number;
  h2hAvgGoals: number;

  // Odds-derived
  impliedProbHome: number;
  impliedProbDraw: number;
  impliedProbAway: number;
  marketOverround: number;
}

ML Models to study:

XGBoost with hyperparameter tuning
Feature importance analysis (Boruta)
Model evaluation with filtering

2️⃣ octopy (CORE ALGORITHMS) ⭐⭐⭐⭐⭐

What it is: Python implementations of essential betting math

Key Algorithms to Port:

A. Shin Method (Remove Bookmaker Margin)

# From: octopy/notebooks/shin.ipynb
# Converts bookmaker odds to true probabilities

def shin_probabilities(odds: list) -> list:
    """
    Shin (1993) method for extracting true probabilities
    from bookmaker odds by estimating the overround.

    Example:
    odds = [2.10, 3.40, 3.50]  # Home/Draw/Away
    probs = shin_probabilities(odds)
    # Returns: [0.45, 0.28, 0.27] (sums to 1.0)
    """
    implied = [1/o for o in odds]
    overround = sum(implied)

    # Shin's iterative method to find true probs
    z = (overround - 1) / (len(odds) - 1)
    true_probs = []
    for p in implied:
        true_prob = (sqrt(z**2 + 4*(1-z)*p**2 / overround) - z) / (2*(1-z))
        true_probs.append(true_prob)

    return true_probs

Port to TypeScript:

// services/value-detection/src/algorithms/shin.ts

export function shinProbabilities(odds: number[]): number[] {
  const implied = odds.map(o => 1 / o);
  const overround = implied.reduce((a, b) => a + b, 0);

  const z = (overround - 1) / (odds.length - 1);

  return implied.map(p => {
    const discriminant = z ** 2 + (4 * (1 - z) * p ** 2) / overround;
    return (Math.sqrt(discriminant) - z) / (2 * (1 - z));
  });
}

// Usage in Value Detection Service:
const bookmakerOdds = [2.10, 3.40, 3.50];
const trueProbabilities = shinProbabilities(bookmakerOdds);
// [0.45, 0.28, 0.27] - fair probabilities without margin

B. Poisson Model (Goal Distribution)

# From: octopy/notebooks/poisson.ipynb
# Predicts goal distributions for matches

from scipy.stats import poisson

def predict_match(home_attack, home_defense, away_attack, away_defense,
                  league_avg_goals=1.35):
    """
    Predicts match outcome using Poisson distribution.

    Parameters derived from historical data:
    - attack_strength = team_goals_scored / league_avg
    - defense_strength = team_goals_conceded / league_avg
    """
    # Expected goals
    home_xg = home_attack * away_defense * league_avg_goals
    away_xg = away_attack * home_defense * league_avg_goals

    # Goal distribution (0-6 goals each)
    max_goals = 7
    home_probs = [poisson.pmf(i, home_xg) for i in range(max_goals)]
    away_probs = [poisson.pmf(i, away_xg) for i in range(max_goals)]

    # Match outcome matrix
    prob_home_win = 0
    prob_draw = 0
    prob_away_win = 0
    prob_over_25 = 0
    prob_btts = 0

    for h in range(max_goals):
        for a in range(max_goals):
            p = home_probs[h] * away_probs[a]
            if h > a:
                prob_home_win += p
            elif h == a:
                prob_draw += p
            else:
                prob_away_win += p
            if h + a > 2.5:
                prob_over_25 += p
            if h > 0 and a > 0:
                prob_btts += p

    return {
        'home_xg': home_xg,
        'away_xg': away_xg,
        'prob_home': prob_home_win,
        'prob_draw': prob_draw,
        'prob_away': prob_away_win,
        'prob_over_25': prob_over_25,
        'prob_btts': prob_btts
    }

Port to TypeScript:

// services/value-detection/src/algorithms/poisson.ts

function poissonPmf(k: number, lambda: number): number {
  return (Math.exp(-lambda) * Math.pow(lambda, k)) / factorial(k);
}

export function predictMatch(
  homeAttack: number,
  homeDefense: number,
  awayAttack: number,
  awayDefense: number,
  leagueAvg = 1.35
): MatchPrediction {
  const homeXG = homeAttack * awayDefense * leagueAvg;
  const awayXG = awayAttack * homeDefense * leagueAvg;

  const maxGoals = 7;
  const homeProbs = Array.from({ length: maxGoals }, (_, i) => poissonPmf(i, homeXG));
  const awayProbs = Array.from({ length: maxGoals }, (_, i) => poissonPmf(i, awayXG));

  let probHome = 0, probDraw = 0, probAway = 0, probOver25 = 0, probBTTS = 0;

  for (let h = 0; h < maxGoals; h++) {
    for (let a = 0; a < maxGoals; a++) {
      const p = homeProbs[h] * awayProbs[a];
      if (h > a) probHome += p;
      else if (h === a) probDraw += p;
      else probAway += p;
      if (h + a > 2) probOver25 += p;
      if (h > 0 && a > 0) probBTTS += p;
    }
  }

  return { homeXG, awayXG, probHome, probDraw, probAway, probOver25, probBTTS };
}

C. ELO Rating System

# From: octopy/notebooks/elo.ipynb

def update_elo(home_elo, away_elo, home_goals, away_goals, k=20, home_advantage=100):
    """
    Update ELO ratings after a match.

    k = sensitivity (higher = more volatile)
    home_advantage = ELO bonus for home team
    """
    # Expected score
    home_expected = 1 / (1 + 10 ** ((away_elo - home_elo - home_advantage) / 400))
    away_expected = 1 - home_expected

    # Actual score (1 = win, 0.5 = draw, 0 = loss)
    if home_goals > away_goals:
        home_actual, away_actual = 1, 0
    elif home_goals < away_goals:
        home_actual, away_actual = 0, 1
    else:
        home_actual, away_actual = 0.5, 0.5

    # Goal difference multiplier
    goal_diff = abs(home_goals - away_goals)
    multiplier = 1 + (goal_diff - 1) * 0.5 if goal_diff > 1 else 1

    # Update ratings
    new_home_elo = home_elo + k * multiplier * (home_actual - home_expected)
    new_away_elo = away_elo + k * multiplier * (away_actual - away_expected)

    return new_home_elo, new_away_elo

Port to TypeScript:

// services/value-detection/src/algorithms/elo.ts

export function updateElo(
  homeElo: number,
  awayElo: number,
  homeGoals: number,
  awayGoals: number,
  k = 20,
  homeAdvantage = 100
): { homeElo: number; awayElo: number } {
  // Expected score
  const homeExpected = 1 / (1 + Math.pow(10, (awayElo - homeElo - homeAdvantage) / 400));

  // Actual score
  let homeActual: number;
  if (homeGoals > awayGoals) homeActual = 1;
  else if (homeGoals < awayGoals) homeActual = 0;
  else homeActual = 0.5;

  // Goal difference multiplier
  const goalDiff = Math.abs(homeGoals - awayGoals);
  const multiplier = goalDiff > 1 ? 1 + (goalDiff - 1) * 0.5 : 1;

  return {
    homeElo: homeElo + k * multiplier * (homeActual - homeExpected),
    awayElo: awayElo + k * multiplier * (1 - homeActual - (1 - homeExpected))
  };
}

3️⃣ transfermarkt-datasets (PIPELINE PATTERN)

What to learn:

How they use GitHub Actions for weekly automation
dbt for data transformation
DuckDB for efficient processing

Pattern to adopt:

# .github/workflows/sync-data.yml
name: Sync Data Sources

on:
  schedule:
    - cron: '0 6 * * *'  # Daily at 6am
  workflow_dispatch:     # Manual trigger

jobs:
  sync:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Download Football-Data
        run: npm run harness -- sync --source football-data

      - name: Process and normalize
        run: npm run harness -- process

      - name: Generate features
        run: npm run harness -- features --sport football

      - name: Upload to storage
        run: npm run harness -- upload

4️⃣ cricketdata (R PACKAGE)

How to use:

# Run this R script as part of Data Harness

library(cricketdata)

# Fetch all T20 international data
t20_data <- fetch_cricsheet("t20")

# Fetch IPL data
ipl_data <- fetch_cricsheet("ipl")

# Get player statistics
players <- fetch_player_data()

# Save to CSV for our pipeline to ingest
write.csv(t20_data, "data/raw/cricketdata/t20_matches.csv")
write.csv(ipl_data, "data/raw/cricketdata/ipl_matches.csv")

Integration:

// Call R script from Node.js
import { exec } from 'child_process';

async function syncCricketDataR(): Promise<void> {
  return new Promise((resolve, reject) => {
    exec('Rscript scripts/fetch_cricket.R', (error, stdout) => {
      if (error) reject(error);
      else resolve();
    });
  });
}

5️⃣ Where Each Algorithm Gets Used

Algorithm	Service	Use Case
Shin Method	Value Detection	Remove bookmaker margin to find true probabilities
Poisson Model	Value Detection, Chat	Predict goal distributions, suggest over/under bets
ELO Ratings	Price Feed, Value Detection	Team strength ratings, improve odds accuracy
Feature Engineering	ML Pipeline	Train predictive models
XGBoost/RF	ML Pipeline	Match outcome prediction

📁 Ported Algorithms Directory Structure

services/
├── value-detection/
│   └── src/
│       └── algorithms/           # Ported from GitHub repos
│           ├── shin.ts           # From octopy
│           ├── poisson.ts        # From octopy
│           ├── elo.ts            # From octopy
│           ├── kelly.ts          # Kelly criterion for stake sizing
│           └── index.ts
│
├── ml-pipeline/
│   └── src/
│       ├── features/
│       │   ├── football.ts       # From ProphitBet
│       │   └── cricket.ts        # Custom
│       ├── models/
│       │   ├── xgboost.py        # From ProphitBet (keep Python)
│       │   ├── random-forest.py
│       │   └── neural-net.py
│       └── training/
│           └── train.py          # Jupyter notebooks converted
│
└── data-harness/
    └── scripts/
        └── fetch_cricket.R       # From cricketdata package

🔄 Integration Workflow

1. CLONE REPOS (monthly)
   └── Data Harness downloads to /data/raw/github/

2. EXTRACT ALGORITHMS (one-time, then maintain)
   └── Port Python → TypeScript for real-time services
   └── Keep Python for batch ML training

3. USE IN SERVICES
   └── Value Detection: shin.ts, poisson.ts, elo.ts
   └── ML Pipeline: features/football.ts, models/*.py
   └── Chat Assistant: Calls value detection API

4. TRAIN MODELS (weekly)
   └── Run Python notebooks with latest data
   └── Export trained models to services

Executive Summary​

📊 Data Sources Summary​

Primary Sources (Implemented)​

Secondary Sources (In Progress)​

ML Tools (GitHub Repos to Clone)​

🏗️ Architecture​

High-Level Data Flow​

Component Details​

1. Scheduler​

2. Downloaders (One per Source)​

3. Processors​

4. Storage​

📁 Directory Structure​

📥 Downloader Specifications​

1. Cricsheet Downloader​

2. Football-Data.co.uk Downloader​

3. FiveThirtyEight Downloader​

4. Kaggle Downloader​

5. GitHub Downloader​

🗄️ Database Schema​

Normalized Tables (PostgreSQL)​

⏰ Data Enrichment Scheduler (NEW)​

Enrichment Jobs​

Scheduler Architecture​

Job Configuration​

Distributed Locking​

💻 CLI Commands​

🔧 TypeScript Interfaces​

Core Types​

Football Data Types​

Cricket Data Types​

📅 Implementation Status (All Complete)​

Phase 1: Foundation ✅ COMPLETE​

Phase 2: All Downloaders ✅ COMPLETE​

Phase 3: Features & Scheduler ✅ COMPLETE​

Phase 4: ML Integration ✅ COMPLETE​

Current Focus: Expanding Sports Coverage​

🔐 Configuration​

Environment Variables​

Schedule Configuration​

📊 Expected Data Volumes​

✅ Next Steps​

🧠 ML GitHub Repos: Integration Strategy​

Overview​

Usage Matrix​

1️⃣ ProphitBet (HIGHEST VALUE) ⭐⭐⭐⭐⭐​

2️⃣ octopy (CORE ALGORITHMS) ⭐⭐⭐⭐⭐​

A. Shin Method (Remove Bookmaker Margin)​

B. Poisson Model (Goal Distribution)​

C. ELO Rating System​

3️⃣ transfermarkt-datasets (PIPELINE PATTERN)​

4️⃣ cricketdata (R PACKAGE)​

5️⃣ Where Each Algorithm Gets Used​

📁 Ported Algorithms Directory Structure​

🔄 Integration Workflow​

Executive Summary

📊 Data Sources Summary

Primary Sources (Implemented)

Secondary Sources (In Progress)

ML Tools (GitHub Repos to Clone)

🏗️ Architecture

High-Level Data Flow

Component Details

1. Scheduler

2. Downloaders (One per Source)

3. Processors

4. Storage

📁 Directory Structure

📥 Downloader Specifications

1. Cricsheet Downloader

2. Football-Data.co.uk Downloader

3. FiveThirtyEight Downloader

4. Kaggle Downloader

5. GitHub Downloader

🗄️ Database Schema

Normalized Tables (PostgreSQL)

⏰ Data Enrichment Scheduler (NEW)

Enrichment Jobs

Scheduler Architecture

Job Configuration

Distributed Locking

💻 CLI Commands

🔧 TypeScript Interfaces

Core Types

Football Data Types

Cricket Data Types

📅 Implementation Status (All Complete)

Phase 1: Foundation ✅ COMPLETE

Phase 2: All Downloaders ✅ COMPLETE

Phase 3: Features & Scheduler ✅ COMPLETE

Phase 4: ML Integration ✅ COMPLETE

Current Focus: Expanding Sports Coverage

🔐 Configuration

Environment Variables

Schedule Configuration

📊 Expected Data Volumes

✅ Next Steps

🧠 ML GitHub Repos: Integration Strategy

Overview

Usage Matrix

1️⃣ ProphitBet (HIGHEST VALUE) ⭐⭐⭐⭐⭐

2️⃣ octopy (CORE ALGORITHMS) ⭐⭐⭐⭐⭐

A. Shin Method (Remove Bookmaker Margin)

B. Poisson Model (Goal Distribution)

C. ELO Rating System

3️⃣ transfermarkt-datasets (PIPELINE PATTERN)

4️⃣ cricketdata (R PACKAGE)

5️⃣ Where Each Algorithm Gets Used

📁 Ported Algorithms Directory Structure

🔄 Integration Workflow