π Data Harness: Design & Architecture
Version: 2.0 Date: January 2026 Status: β IMPLEMENTED Purpose: Automated system to download, normalize, and store sports data for ML training
Executive Summaryβ
The Data Harness is a standalone service that powers the ML Training Pipeline for SuperSkin. It:
- Downloads data from 10+ free sources (Cricsheet, Football-Data, FiveThirtyEight, Kaggle, GitHub, Tennis-Data)
- Normalizes data into a common schema for each sport (football, cricket, tennis)
- Stores raw files + processed data in PostgreSQL (Supabase IceCrystal)
- Generates ML-ready features for the 6 SuperSkin services
Current Status:
- β Football data pipeline complete (200K+ matches)
- β Cricket data pipeline complete (5K+ matches)
- β Tennis data pipeline complete (50K+ matches)
- β ML feature generation for all 3 sports
- π Basketball/Esports data collection in progress
Run frequency: Daily for live data, weekly for historical archives
π Data Sources Summaryβ
Primary Sources (Implemented)β
| Source | Sport | Data Type | Size | Format | Status |
|---|---|---|---|---|---|
| Football-Data.co.uk | Football | Results + Odds | 200K+ matches | CSV | β Complete |
| FiveThirtyEight | Football | SPI Ratings | 100K+ matches | CSV | β Complete |
| Cricsheet | Cricket | Ball-by-ball | 5K+ matches | JSON/YAML | β Complete |
| Tennis-Data.co.uk | Tennis | Results + Odds | 50K+ matches | CSV | β Complete |
| Kaggle: IPL Complete | Cricket | Match + Delivery | 1.9 MB | CSV | β Complete |
| Kaggle: European Soccer | Football | Results + Odds + Players | 34 MB | SQLite | β Complete |
| Kaggle: Transfermarkt | Football | Player Values | 172 MB | CSV | β Complete |
Secondary Sources (In Progress)β
| Source | Sport | Data Type | Status |
|---|---|---|---|
| Basketball-Reference | Basketball | NBA/WNBA stats | π In Progress |
| HLTV | Esports | CS2 match data | π In Progress |
| Liquipedia | Esports | Multi-game data | π Planned |
| RSSSF | Football | Historical results | π Planned |
ML Tools (GitHub Repos to Clone)β
| Repo | Purpose | Language | Stars |
|---|---|---|---|
| ProphitBet | Complete ML prediction app | Python | 466 β |
| octopy | Poisson, Shin, ELO algorithms | Python | 70 β |
| cricketdata | R package for cricket data | R | 91 β |
| transfermarkt-datasets | Auto-updating pipeline | Python | 329 β |
ποΈ Architectureβ
High-Level Data Flowβ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β DATA HARNESS PIPELINE β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ ββββββββββββββ β
β β Scheduler βββββΆβ Downloaders βββββΆβ Processors βββββΆβ Storage β β
β β (node-cron) β β (per source)β β (normalize) β β (PG+Files)β β
β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ ββββββββββββββ β
β β β β β β
β β βΌ βΌ βΌ β
β β /data/raw/ /data/processed/ PostgreSQL β
β β {source}/ {sport}/ + Features β
β β β
β ββββββββββββββββββββ Retry on failure ββββββββββββββββββββββββββββββββ
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Component Detailsβ
1. Schedulerβ
- Technology: node-cron or Bull queue
- Jobs:
sync-cricsheet- Weekly (Sundays 2am)sync-football-data- Daily (6am)sync-538- Daily (6am)sync-kaggle- Weekly (Sundays 3am)sync-github-repos- Monthly (1st of month)
2. Downloaders (One per Source)β
Each downloader is a standalone module that:
- Handles authentication if needed (Kaggle API key)
- Implements retry logic with exponential backoff
- Tracks last successful sync timestamp
- Downloads incrementally when possible
- Validates downloaded files (checksums, size)
3. Processorsβ
- Normalizer: Converts source-specific formats to common schema
- Deduplicator: Removes duplicate matches across sources
- Validator: Ensures data matches expected schema
- Feature Generator: Creates ML-ready features
4. Storageβ
- Raw Files:
/data/raw/{source}/{date}/- Original downloads - Processed:
/data/processed/{sport}/- Normalized CSVs - PostgreSQL: Structured tables for querying
- Feature Store: Pre-computed ML features
π Directory Structureβ
services/data-harness/
βββ src/
β βββ downloaders/
β β βββ base.downloader.ts # Abstract base class
β β βββ cricsheet.downloader.ts
β β βββ football-data.downloader.ts
β β βββ fivethirtyeight.downloader.ts
β β βββ kaggle.downloader.ts
β β βββ github.downloader.ts
β β
β βββ processors/
β β βββ normalizer/
β β β βββ cricket.normalizer.ts
β β β βββ football.normalizer.ts
β β βββ deduplicator.ts
β β βββ validator.ts
β β βββ feature-generator.ts
β β
β βββ storage/
β β βββ file-storage.ts # Raw file management
β β βββ database.ts # PostgreSQL operations
β β βββ feature-store.ts # ML features
β β
β βββ scheduler/
β β βββ enrichment-scheduler.ts # Cron-based job runner with Redis locks
β β βββ jobs/
β β βββ index.ts # Job registry
β β βββ team-ratings.ts # Daily ELO updates
β β βββ player-sync.ts # Sportmonks player data
β β βββ odds-archival.ts # Oracle β DataMachine archival
β β βββ ml-features.ts # Form, H2H, ELO features
β β βββ cricket-stats.ts # Ball-by-ball aggregation
β β
β βββ cli/
β β βββ commands.ts # Manual sync commands
β β
β βββ index.ts # Service entry point
β
βββ data/ # Data storage (gitignored)
β βββ raw/
β β βββ cricsheet/
β β βββ football-data/
β β βββ fivethirtyeight/
β β βββ kaggle/
β β βββ github/
β βββ processed/
β βββ cricket/
β βββ football/
β
βββ prisma/
β βββ schema.prisma # Database schema
β
βββ package.json
βββ tsconfig.json
βββ README.md
π₯ Downloader Specificationsβ
1. Cricsheet Downloaderβ
Source: https://cricsheet.org/downloads/
Files to Download:
| File | Size | Contents |
|---|---|---|
all_json.zip | ~50 MB | All matches in JSON format |
ipl_json.zip | ~5 MB | IPL only |
t20s_json.zip | ~15 MB | All T20 internationals |
Download Logic:
interface CricsheetDownloader {
// Check if new data available (compare Last-Modified header)
async checkForUpdates(): Promise<boolean>;
// Download and extract ZIP files
async download(): Promise<DownloadResult>;
// Parse JSON/YAML match files
async parseMatches(dir: string): Promise<CricketMatch[]>;
}
Incremental Strategy:
- Store Last-Modified header from previous download
- Only re-download if changed
- Keep last 3 versions for rollback
2. Football-Data.co.uk Downloaderβ
Source: https://www.football-data.co.uk/data.php
Files to Download (per league per season):
| League | URL Pattern | Example |
|---|---|---|
| Premier League | /mmz4281/{season}/E0.csv | /mmz4281/2425/E0.csv |
| Championship | /mmz4281/{season}/E1.csv | /mmz4281/2425/E1.csv |
| La Liga | /mmz4281/{season}/SP1.csv | /mmz4281/2425/SP1.csv |
| Bundesliga | /mmz4281/{season}/D1.csv | /mmz4281/2425/D1.csv |
| Serie A | /mmz4281/{season}/I1.csv | /mmz4281/2425/I1.csv |
| Ligue 1 | /mmz4281/{season}/F1.csv | /mmz4281/2425/F1.csv |
Season Format: 2425 = 2024-25 season
CSV Columns (Betting Odds):
Date, HomeTeam, AwayTeam, FTHG, FTAG, FTR, HTHG, HTAG, HTR,
B365H, B365D, B365A, # Bet365
BWH, BWD, BWA, # Betway
IWH, IWD, IWA, # Interwetten
PSH, PSD, PSA, # Pinnacle
WHH, WHD, WHA, # William Hill
VCH, VCD, VCA, # VC Bet
MaxH, MaxD, MaxA, # Maximum odds
AvgH, AvgD, AvgA # Average odds
Download Logic:
const LEAGUES = ['E0', 'E1', 'SP1', 'D1', 'I1', 'F1', 'N1', 'P1', 'T1', 'G1'];
const SEASONS = ['2425', '2324', '2223', '2122', '2021', '1920']; // Last 6 seasons
async downloadAllLeagues(): Promise<void> {
for (const league of LEAGUES) {
for (const season of SEASONS) {
await this.downloadLeagueSeason(league, season);
await sleep(1000); // Rate limit: 1 req/sec
}
}
}
3. FiveThirtyEight Downloaderβ
Source: https://projects.fivethirtyeight.com/soccer-api/
Files to Download:
| File | URL | Update Freq |
|---|---|---|
| SPI Matches | club/spi_matches.csv | Historical |
| SPI Latest | club/spi_matches_latest.csv | Was daily |
| Global Rankings | club/spi_global_rankings.csv | Current ratings |
| International | international/spi_matches_intl.csv | Intl matches |
CSV Columns (SPI Matches):
date, league_id, league, team1, team2, spi1, spi2,
prob1, prob2, probtie, proj_score1, proj_score2,
importance1, importance2, score1, score2, xg1, xg2,
nsxg1, nsxg2, adj_score1, adj_score2
Key Fields:
spi1,spi2- Team power ratings (0-100 scale)prob1,prob2,probtie- Win/draw probabilitiesxg1,xg2- Expected goals
4. Kaggle Downloaderβ
Prerequisites:
- Kaggle API key in
~/.kaggle/kaggle.json - Install:
pip install kaggle
Datasets to Download:
const KAGGLE_DATASETS = [
{
id: 'patrickb1912/ipl-complete-dataset-20082020',
name: 'ipl-complete',
files: ['matches.csv', 'deliveries.csv'],
frequency: 'yearly'
},
{
id: 'hugomathien/soccer',
name: 'european-soccer',
files: ['database.sqlite'],
frequency: 'static' // One-time download
},
{
id: 'davidcariboo/player-scores',
name: 'transfermarkt',
files: ['games.csv', 'players.csv', 'appearances.csv', 'player_valuations.csv'],
frequency: 'weekly'
},
{
id: 'saife245/english-premier-league',
name: 'epl-20-years',
files: ['EPL_Set.csv'],
frequency: 'static'
}
];
Download Command:
kaggle datasets download -d {dataset_id} -p ./data/raw/kaggle/{name}/
5. GitHub Downloaderβ
Repos to Clone/Update:
const GITHUB_REPOS = [
{
url: 'https://github.com/kochlisGit/ProphitBet-Soccer-Bets-Predictor',
name: 'prophitbet',
purpose: 'ML models, feature engineering'
},
{
url: 'https://github.com/octosport/octopy',
name: 'octopy',
purpose: 'Poisson, Shin, ELO algorithms'
},
{
url: 'https://github.com/robjhyndman/cricketdata',
name: 'cricketdata',
purpose: 'Cricket data fetching (R)'
},
{
url: 'https://github.com/dcaribou/transfermarkt-datasets',
name: 'transfermarkt-datasets',
purpose: 'Data pipeline reference'
},
{
url: 'https://github.com/jrbadiabo/Bet-on-Sibyl',
name: 'bet-on-sibyl',
purpose: 'Multi-sport prediction models'
}
];
Clone/Update Logic:
async syncRepo(repo: GitHubRepo): Promise<void> {
const localPath = `./data/raw/github/${repo.name}`;
if (await exists(localPath)) {
// Update existing repo
await exec(`git -C ${localPath} pull`);
} else {
// Clone new repo
await exec(`git clone ${repo.url} ${localPath}`);
}
}
ποΈ Database Schemaβ
Normalized Tables (PostgreSQL)β
-- =============================================
-- FOOTBALL TABLES
-- =============================================
CREATE TABLE football_matches (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
external_id TEXT UNIQUE, -- Source-specific ID for deduplication
source TEXT NOT NULL, -- 'football-data', '538', 'kaggle-euro', etc.
-- Match Info
date DATE NOT NULL,
kickoff_time TIMESTAMPTZ,
league TEXT NOT NULL,
season TEXT NOT NULL, -- '2024-25'
home_team TEXT NOT NULL,
away_team TEXT NOT NULL,
-- Results
home_goals INT,
away_goals INT,
home_goals_ht INT, -- Half-time
away_goals_ht INT,
result TEXT, -- 'H', 'D', 'A'
-- Statistics
home_shots INT,
away_shots INT,
home_corners INT,
away_corners INT,
home_fouls INT,
away_fouls INT,
home_yellow INT,
away_yellow INT,
home_red INT,
away_red INT,
-- FiveThirtyEight SPI (if available)
home_spi DECIMAL(5,2), -- 0-100 power rating
away_spi DECIMAL(5,2),
home_xg DECIMAL(4,2), -- Expected goals
away_xg DECIMAL(4,2),
prob_home DECIMAL(5,4), -- Win probability
prob_draw DECIMAL(5,4),
prob_away DECIMAL(5,4),
-- Timestamps
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_football_matches_date ON football_matches(date);
CREATE INDEX idx_football_matches_league ON football_matches(league, season);
CREATE INDEX idx_football_matches_teams ON football_matches(home_team, away_team);
-- Historical betting odds
CREATE TABLE football_odds (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
match_id UUID REFERENCES football_matches(id) ON DELETE CASCADE,
bookmaker TEXT NOT NULL, -- 'bet365', 'pinnacle', 'betway', etc.
market TEXT NOT NULL, -- '1x2', 'over_under_2.5', 'btts'
-- 1X2 odds
odds_home DECIMAL(6,3),
odds_draw DECIMAL(6,3),
odds_away DECIMAL(6,3),
-- Over/Under 2.5
odds_over DECIMAL(6,3),
odds_under DECIMAL(6,3),
-- BTTS (Both Teams To Score)
odds_btts_yes DECIMAL(6,3),
odds_btts_no DECIMAL(6,3),
timestamp TIMESTAMPTZ, -- When odds were recorded
UNIQUE(match_id, bookmaker, market)
);
CREATE INDEX idx_football_odds_match ON football_odds(match_id);
-- Team metadata and current ratings
CREATE TABLE football_teams (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
name TEXT NOT NULL,
normalized_name TEXT NOT NULL UNIQUE, -- Standardized name
country TEXT,
league TEXT,
-- Current ratings (updated from 538/ELO)
spi_rating DECIMAL(5,2),
elo_rating DECIMAL(7,2),
offensive_rating DECIMAL(5,2),
defensive_rating DECIMAL(5,2),
-- Aliases for name matching
aliases TEXT[], -- ['Man Utd', 'Manchester United', 'Man United']
updated_at TIMESTAMPTZ DEFAULT NOW()
);
-- =============================================
-- CRICKET TABLES
-- =============================================
CREATE TABLE cricket_matches (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
external_id TEXT UNIQUE,
source TEXT NOT NULL, -- 'cricsheet', 'kaggle-ipl'
-- Match Info
date DATE NOT NULL,
venue TEXT,
city TEXT,
match_type TEXT NOT NULL, -- 'T20', 'ODI', 'Test', 'IPL'
competition TEXT, -- 'IPL 2024', 'T20 World Cup 2024'
season TEXT,
-- Teams
team1 TEXT NOT NULL,
team2 TEXT NOT NULL,
toss_winner TEXT,
toss_decision TEXT, -- 'bat', 'field'
-- Result
winner TEXT,
win_type TEXT, -- 'runs', 'wickets', 'super_over'
win_margin INT,
result TEXT, -- 'normal', 'tie', 'no_result', 'DLS'
player_of_match TEXT,
-- Scores
team1_runs INT,
team1_wickets INT,
team1_overs DECIMAL(4,1),
team2_runs INT,
team2_wickets INT,
team2_overs DECIMAL(4,1),
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_cricket_matches_date ON cricket_matches(date);
CREATE INDEX idx_cricket_matches_type ON cricket_matches(match_type, competition);
-- Ball-by-ball data (from Cricsheet)
CREATE TABLE cricket_deliveries (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
match_id UUID REFERENCES cricket_matches(id) ON DELETE CASCADE,
innings INT NOT NULL, -- 1 or 2
over_number INT NOT NULL,
ball_number INT NOT NULL,
batsman TEXT NOT NULL,
non_striker TEXT NOT NULL,
bowler TEXT NOT NULL,
runs_batsman INT DEFAULT 0,
runs_extras INT DEFAULT 0,
runs_total INT DEFAULT 0,
extras_type TEXT, -- 'wide', 'noball', 'bye', 'legbye'
wicket BOOLEAN DEFAULT FALSE,
wicket_type TEXT, -- 'bowled', 'caught', 'lbw', 'run_out'
wicket_player TEXT,
wicket_fielder TEXT,
-- Running totals at this point
total_runs INT,
total_wickets INT,
UNIQUE(match_id, innings, over_number, ball_number)
);
CREATE INDEX idx_cricket_deliveries_match ON cricket_deliveries(match_id);
CREATE INDEX idx_cricket_deliveries_player ON cricket_deliveries(batsman);
-- =============================================
-- ML FEATURE STORE
-- =============================================
CREATE TABLE ml_features_football (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
match_id UUID REFERENCES football_matches(id) ON DELETE CASCADE,
-- Pre-computed features for ML models
-- Form (last 5 matches)
home_form_points INT, -- Points from last 5 games
away_form_points INT,
home_form_goals_for DECIMAL(4,2), -- Avg goals scored last 5
home_form_goals_ag DECIMAL(4,2), -- Avg goals conceded last 5
away_form_goals_for DECIMAL(4,2),
away_form_goals_ag DECIMAL(4,2),
-- Head to head (last 5 meetings)
h2h_home_wins INT,
h2h_draws INT,
h2h_away_wins INT,
h2h_avg_goals DECIMAL(4,2),
-- League position at time of match
home_league_pos INT,
away_league_pos INT,
home_league_points INT,
away_league_points INT,
-- Odds-derived
implied_prob_home DECIMAL(5,4),
implied_prob_draw DECIMAL(5,4),
implied_prob_away DECIMAL(5,4),
odds_value_home DECIMAL(5,4), -- Edge vs fair odds
odds_value_away DECIMAL(5,4),
-- Target variables
target_result TEXT, -- 'H', 'D', 'A'
target_over_2_5 BOOLEAN,
target_btts BOOLEAN,
computed_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_ml_features_match ON ml_features_football(match_id);
-- =============================================
-- SYNC TRACKING
-- =============================================
CREATE TABLE data_sync_log (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
source TEXT NOT NULL,
started_at TIMESTAMPTZ NOT NULL,
completed_at TIMESTAMPTZ,
status TEXT NOT NULL, -- 'running', 'success', 'failed'
records_added INT DEFAULT 0,
records_updated INT DEFAULT 0,
error_message TEXT,
metadata JSONB -- Source-specific info
);
CREATE INDEX idx_sync_log_source ON data_sync_log(source, started_at DESC);
-- =============================================
-- PLAYER DATA TABLES (NEW)
-- =============================================
CREATE TABLE football_players (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
external_id TEXT UNIQUE, -- Sportmonks/Transfermarkt ID
source TEXT NOT NULL, -- 'sportmonks', 'transfermarkt'
name TEXT NOT NULL,
normalized_name TEXT NOT NULL,
first_name TEXT,
last_name TEXT,
date_of_birth DATE,
nationality TEXT,
height INT, -- cm
weight INT, -- kg
position TEXT, -- 'GK', 'DF', 'MF', 'FW'
position_detail TEXT, -- 'Centre-Back', 'Left Winger'
foot TEXT, -- 'Left', 'Right', 'Both'
team_id UUID REFERENCES football_teams(id),
shirt_number INT,
market_value DECIMAL(12,2), -- EUR
market_value_date DATE,
total_apps INT DEFAULT 0,
total_goals INT DEFAULT 0,
total_assists INT DEFAULT 0,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_players_team ON football_players(team_id);
CREATE INDEX idx_players_name ON football_players(normalized_name);
CREATE TABLE player_appearances (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
player_id UUID REFERENCES football_players(id) ON DELETE CASCADE,
match_date DATE NOT NULL,
competition TEXT NOT NULL,
season TEXT NOT NULL,
home_team TEXT NOT NULL,
away_team TEXT NOT NULL,
match_result TEXT, -- '2-1', '0-0'
started BOOLEAN DEFAULT FALSE,
minutes_played INT,
substitute_in INT,
substitute_out INT,
goals INT DEFAULT 0,
assists INT DEFAULT 0,
yellow_cards INT DEFAULT 0,
red_cards INT DEFAULT 0,
rating DECIMAL(3,1), -- Match rating 0-10
created_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE(player_id, match_date, home_team, away_team)
);
CREATE INDEX idx_appearances_player ON player_appearances(player_id);
CREATE INDEX idx_appearances_date ON player_appearances(match_date);
-- =============================================
-- CRICKET PLAYER TABLES (NEW)
-- =============================================
CREATE TABLE cricket_players (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
external_id TEXT UNIQUE, -- Roanuz/Cricsheet ID
source TEXT NOT NULL,
name TEXT NOT NULL,
normalized_name TEXT NOT NULL,
date_of_birth DATE,
nationality TEXT,
batting_style TEXT, -- 'Right-hand bat'
bowling_style TEXT, -- 'Right-arm fast'
role TEXT, -- 'Batsman', 'Bowler', 'All-rounder'
current_team TEXT,
ipl_team TEXT,
t20_matches INT DEFAULT 0,
t20_runs INT DEFAULT 0,
t20_wickets INT DEFAULT 0,
t20_average DECIMAL(6,2),
t20_strike_rate DECIMAL(6,2),
t20_economy DECIMAL(4,2),
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_cricket_players_name ON cricket_players(normalized_name);
CREATE TABLE cricket_player_season_stats (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
player_id UUID REFERENCES cricket_players(id) ON DELETE CASCADE,
season TEXT NOT NULL, -- '2024', 'IPL 2024'
competition TEXT NOT NULL,
team_name TEXT NOT NULL,
-- Batting
matches INT DEFAULT 0,
innings INT DEFAULT 0,
runs INT DEFAULT 0,
balls_faced INT DEFAULT 0,
not_outs INT DEFAULT 0,
high_score INT,
average DECIMAL(6,2),
strike_rate DECIMAL(6,2),
fours INT DEFAULT 0,
sixes INT DEFAULT 0,
fifties INT DEFAULT 0,
hundreds INT DEFAULT 0,
-- Bowling
overs_bowled DECIMAL(5,1),
wickets INT DEFAULT 0,
runs_conceded INT DEFAULT 0,
economy DECIMAL(4,2),
bowling_average DECIMAL(6,2),
best_bowling TEXT, -- '4/25'
updated_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE(player_id, season, competition)
);
CREATE INDEX idx_cricket_stats_player ON cricket_player_season_stats(player_id);
β° Data Enrichment Scheduler (NEW)β
The Data Harness includes a cron-based enrichment scheduler that runs continuous data updates:
Enrichment Jobsβ
| Job ID | Name | Schedule | Description |
|---|---|---|---|
team-ratings | Team Ratings Update | Daily 4:00 AM | Updates ELO ratings from recent match results |
player-sync | Player Stats Sync | Every 6 hours | Syncs player data from Sportmonks API |
odds-archival | Odds History Archival | Every hour | Archives odds from Oracle β Data Machine |
ml-features | ML Feature Computation | Every 2 hours | Computes form, H2H, ELO features |
cricket-stats | Cricket Stats Aggregation | Daily 5:00 AM | Aggregates ball-by-ball into player stats |
ipl-live-stats | IPL Live Stats | Every 15 min | Real-time IPL stats (during season) |
Scheduler Architectureβ
// src/scheduler/enrichment-scheduler.ts
class EnrichmentScheduler {
private redis: Redis; // For distributed locking
private jobs: Map<string, cron.ScheduledTask>;
registerJob(job: EnrichmentJob): void;
async start(): Promise<void>;
async executeJob(jobId: string): Promise<void>;
async stop(): Promise<void>;
getStatus(): Record<string, JobStatus>;
}
Job Configurationβ
const enrichmentJobs = [
{
id: 'team-ratings',
name: 'Team Ratings Update',
schedule: '0 4 * * *', // Daily at 4 AM
handler: updateTeamRatings,
enabled: true
},
{
id: 'player-sync',
name: 'Player Stats Sync',
schedule: '0 */6 * * *', // Every 6 hours
handler: syncPlayerStats,
enabled: true
},
{
id: 'cricket-stats',
name: 'Cricket Stats Aggregation',
schedule: '0 5 * * *', // Daily at 5 AM
handler: aggregateCricketStats,
enabled: true
}
];
Distributed Lockingβ
Jobs use Redis locks to prevent concurrent execution across multiple instances:
// Before executing job
const lockKey = `enrichment:lock:${jobId}`;
const acquired = await redis.set(lockKey, '1', 'EX', 3600, 'NX');
if (!acquired) {
logger.warn(`Job ${jobId} already running, skipping`);
return;
}
// Execute job...
// Release lock
await redis.del(lockKey);
π» CLI Commandsβ
The Data Harness provides CLI commands for manual operations:
# Initialize database and directory structure
npm run harness -- init
# Sync all sources
npm run harness -- sync --all
# Sync specific source
npm run harness -- sync --source cricsheet
npm run harness -- sync --source football-data
npm run harness -- sync --source fivethirtyeight
npm run harness -- sync --source kaggle
npm run harness -- sync --source github
# Sync with date range (for Football-Data)
npm run harness -- sync --source football-data --from 2024-01-01 --to 2024-12-31
# Download specific seasons
npm run harness -- sync --source football-data --seasons 2425,2324
# View sync status
npm run harness -- status
# Generate ML features
npm run harness -- features --sport football --from 2020-01-01
# Clean old raw files (keep last N days)
npm run harness -- clean --keep-days 30
# Validate data integrity
npm run harness -- validate
# Export data for analysis
npm run harness -- export --sport football --format csv --output ./exports/
π§ TypeScript Interfacesβ
Core Typesβ
// src/types/index.ts
export type Sport = 'football' | 'cricket';
export type DataSource = 'cricsheet' | 'football-data' | 'fivethirtyeight' | 'kaggle' | 'github';
export interface SyncResult {
source: DataSource;
status: 'success' | 'failed' | 'partial';
startedAt: Date;
completedAt: Date;
recordsAdded: number;
recordsUpdated: number;
errors: string[];
}
export interface DownloaderConfig {
source: DataSource;
baseUrl: string;
rateLimit: number; // Requests per second
retryAttempts: number;
retryDelayMs: number;
}
// Base downloader interface
export interface IDownloader {
readonly source: DataSource;
checkForUpdates(): Promise<boolean>;
download(): Promise<DownloadResult>;
getLastSync(): Promise<Date | null>;
}
export interface DownloadResult {
success: boolean;
filesDownloaded: string[];
totalBytes: number;
errors: string[];
}
Football Data Typesβ
// src/types/football.ts
export interface FootballMatch {
id?: string;
externalId: string;
source: DataSource;
date: Date;
kickoffTime?: Date;
league: string;
season: string;
homeTeam: string;
awayTeam: string;
homeGoals?: number;
awayGoals?: number;
homeGoalsHT?: number;
awayGoalsHT?: number;
result?: 'H' | 'D' | 'A';
// Statistics
stats?: {
homeShots?: number;
awayShots?: number;
homeCorners?: number;
awayCorners?: number;
homeFouls?: number;
awayFouls?: number;
};
// FiveThirtyEight SPI data
spi?: {
homeSPI: number;
awaySPI: number;
homeXG: number;
awayXG: number;
probHome: number;
probDraw: number;
probAway: number;
};
}
export interface FootballOdds {
matchId: string;
bookmaker: string;
market: '1x2' | 'over_under_2.5' | 'btts';
home?: number;
draw?: number;
away?: number;
over?: number;
under?: number;
timestamp?: Date;
}
export interface TeamRating {
team: string;
normalizedName: string;
spiRating: number;
eloRating: number;
offensiveRating: number;
defensiveRating: number;
}
Cricket Data Typesβ
// src/types/cricket.ts
export interface CricketMatch {
id?: string;
externalId: string;
source: DataSource;
date: Date;
venue?: string;
city?: string;
matchType: 'T20' | 'ODI' | 'Test' | 'IPL' | 'BBL' | 'PSL';
competition?: string;
season?: string;
team1: string;
team2: string;
tossWinner?: string;
tossDecision?: 'bat' | 'field';
winner?: string;
winType?: 'runs' | 'wickets' | 'super_over' | 'tie' | 'no_result';
winMargin?: number;
playerOfMatch?: string;
team1Score?: InningsScore;
team2Score?: InningsScore;
}
export interface InningsScore {
runs: number;
wickets: number;
overs: number;
}
export interface CricketDelivery {
matchId: string;
innings: 1 | 2;
overNumber: number;
ballNumber: number;
batsman: string;
nonStriker: string;
bowler: string;
runsBatsman: number;
runsExtras: number;
runsTotal: number;
extrasType?: 'wide' | 'noball' | 'bye' | 'legbye';
wicket: boolean;
wicketType?: string;
wicketPlayer?: string;
wicketFielder?: string;
}
π Implementation Status (All Complete)β
Phase 1: Foundation β COMPLETEβ
β
Created services/data-harness/ directory structure
β
Set up package.json with dependencies
β
Created Prisma schema for PostgreSQL (Supabase)
β
Implemented base downloader class
β
Implemented file storage service
β
Implemented Football-Data.co.uk downloader
β
Implemented football normalizer
β
Created CLI skeleton
Phase 2: All Downloaders β COMPLETEβ
β
Implemented Cricsheet downloader
β
Implemented FiveThirtyEight downloader
β
Implemented Tennis-Data.co.uk downloader
β
Implemented cricket normalizer
β
Implemented tennis normalizer
β
Added sync tracking to database
β
Implemented Kaggle downloader
β
Implemented GitHub repo syncer
β
Added retry logic and error handling
β
Completed CLI commands
Phase 3: Features & Scheduler β COMPLETEβ
β
Implemented ML feature generator (football)
β
Implemented ML feature generator (cricket)
β
Implemented ML feature generator (tennis)
β
Added data validation
β
Created deduplication logic
β
Set up node-cron scheduler
β
Configured job schedules
β
Added monitoring/logging
Phase 4: ML Integration β COMPLETEβ
β
Connected to SuperSkin ML Prediction Service
β
Implemented sport-specific model training
β
Created ONNX export pipeline
β
Set up model registry with auto-discovery
Current Focus: Expanding Sports Coverageβ
| Sport | Data Status | ML Model Status |
|---|---|---|
| Football | β 200K+ matches | β Trained (~68% accuracy) |
| Cricket | β 5K+ matches | β Trained (~65% accuracy) |
| Tennis | β 50K+ matches | β Trained (~70% accuracy) |
| Basketball | π Collecting | π Planned |
| Esports | π Collecting | π Planned |
π Configurationβ
Environment Variablesβ
# Database
DATABASE_URL=postgresql://user:pass@localhost:5432/data_harness
# Kaggle API (required for Kaggle downloads)
KAGGLE_USERNAME=your_username
KAGGLE_KEY=your_api_key
# GitHub Token (optional, for higher rate limits)
GITHUB_TOKEN=ghp_xxxxxxxxxxxx
# Storage paths
DATA_RAW_PATH=./data/raw
DATA_PROCESSED_PATH=./data/processed
# Rate limits
FOOTBALL_DATA_RATE_LIMIT=1 # requests per second
KAGGLE_RATE_LIMIT=5 # requests per second
# Retention
RAW_DATA_RETENTION_DAYS=30 # Keep raw files for 30 days
Schedule Configurationβ
// src/scheduler/jobs.ts
export const SYNC_SCHEDULES = {
'football-data': {
cron: '0 6 * * *', // Daily at 6am
timeout: 30 * 60 * 1000, // 30 min timeout
retries: 3
},
'fivethirtyeight': {
cron: '0 6 * * *', // Daily at 6am
timeout: 10 * 60 * 1000, // 10 min timeout
retries: 3
},
'cricsheet': {
cron: '0 2 * * 0', // Sundays at 2am
timeout: 60 * 60 * 1000, // 60 min timeout
retries: 3
},
'kaggle': {
cron: '0 3 * * 0', // Sundays at 3am
timeout: 120 * 60 * 1000, // 2 hour timeout
retries: 2
},
'github': {
cron: '0 4 1 * *', // 1st of month at 4am
timeout: 30 * 60 * 1000, // 30 min timeout
retries: 2
}
};
π Expected Data Volumesβ
| Source | Records | Storage | Initial Download |
|---|---|---|---|
| Football-Data.co.uk | 200K matches | ~50 MB | ~5 min |
| FiveThirtyEight | 100K matches | ~30 MB | ~2 min |
| Cricsheet | 5K matches, 500K deliveries | ~100 MB | ~10 min |
| Kaggle: IPL | 1K matches | ~5 MB | ~1 min |
| Kaggle: European Soccer | 25K matches | ~35 MB | ~2 min |
| Kaggle: Transfermarkt | 1.2M appearances | ~175 MB | ~5 min |
| GitHub repos | 5 repos | ~500 MB | ~10 min |
| Total | ~900 MB | ~35 min |
β Next Stepsβ
- Confirm directory location: Should this be in
services/data-harness/or a different location? - Database choice: Use existing PostgreSQL or create dedicated instance?
- Start building: Begin with Phase 1 (Foundation + Football-Data downloader)
Would you like me to start implementing the Data Harness service?
π§ ML GitHub Repos: Integration Strategyβ
Overviewβ
These GitHub repos are NOT meant to be run as-is. Instead, we:
- Extract algorithms β Port key mathematical functions to TypeScript/Python
- Learn patterns β Understand feature engineering approaches
- Reference implementations β Use as documentation for complex calculations
- Clone for training β Run ML training notebooks offline
Usage Matrixβ
| Repo | Strategy | What to Extract | Target Service |
|---|---|---|---|
| ProphitBet | Extract + Adapt | Feature engineering, model training | ML Pipeline |
| octopy | Port to TypeScript | Shin, Poisson, ELO algorithms | Value Detection |
| transfermarkt-datasets | Reference | dbt pipeline pattern | Data Harness |
| cricketdata | Run as-is (R) | Cricket data fetching | Data Harness |
| Bet-on-Sibyl | Reference | Multi-sport scraping | Data Harness |
1οΈβ£ ProphitBet (HIGHEST VALUE) ββββββ
What it is: Complete betting prediction app with ML models
What to extract:
# From: ProphitBet/database/repositories/league.py
# This shows how they calculate features from Football-Data.co.uk
class LeagueRepository:
def get_features(self, match):
return {
'home_wins': self._count_wins(home_team, last_n=5),
'home_draws': self._count_draws(home_team, last_n=5),
'home_losses': self._count_losses(home_team, last_n=5),
'home_goals_scored': self._avg_goals_scored(home_team, last_n=5),
'home_goals_conceded': self._avg_goals_conceded(home_team, last_n=5),
'away_wins': ...,
# ... 40+ features
}
Feature Engineering (to port):
// Port to: services/ml-pipeline/src/features/football.ts
interface FootballFeatures {
// Form-based (last N matches)
homeWins: number;
homeDraws: number;
homeLosses: number;
homeGoalsScored: number; // Average
homeGoalsConceded: number;
homeCleanSheets: number;
awayWins: number;
awayDraws: number;
awayLosses: number;
awayGoalsScored: number;
awayGoalsConceded: number;
// Head-to-head
h2hHomeWins: number;
h2hDraws: number;
h2hAwayWins: number;
h2hAvgGoals: number;
// Odds-derived
impliedProbHome: number;
impliedProbDraw: number;
impliedProbAway: number;
marketOverround: number;
}
ML Models to study:
- XGBoost with hyperparameter tuning
- Feature importance analysis (Boruta)
- Model evaluation with filtering
2οΈβ£ octopy (CORE ALGORITHMS) ββββββ
What it is: Python implementations of essential betting math
Key Algorithms to Port:
A. Shin Method (Remove Bookmaker Margin)β
# From: octopy/notebooks/shin.ipynb
# Converts bookmaker odds to true probabilities
def shin_probabilities(odds: list) -> list:
"""
Shin (1993) method for extracting true probabilities
from bookmaker odds by estimating the overround.
Example:
odds = [2.10, 3.40, 3.50] # Home/Draw/Away
probs = shin_probabilities(odds)
# Returns: [0.45, 0.28, 0.27] (sums to 1.0)
"""
implied = [1/o for o in odds]
overround = sum(implied)
# Shin's iterative method to find true probs
z = (overround - 1) / (len(odds) - 1)
true_probs = []
for p in implied:
true_prob = (sqrt(z**2 + 4*(1-z)*p**2 / overround) - z) / (2*(1-z))
true_probs.append(true_prob)
return true_probs
Port to TypeScript:
// services/value-detection/src/algorithms/shin.ts
export function shinProbabilities(odds: number[]): number[] {
const implied = odds.map(o => 1 / o);
const overround = implied.reduce((a, b) => a + b, 0);
const z = (overround - 1) / (odds.length - 1);
return implied.map(p => {
const discriminant = z ** 2 + (4 * (1 - z) * p ** 2) / overround;
return (Math.sqrt(discriminant) - z) / (2 * (1 - z));
});
}
// Usage in Value Detection Service:
const bookmakerOdds = [2.10, 3.40, 3.50];
const trueProbabilities = shinProbabilities(bookmakerOdds);
// [0.45, 0.28, 0.27] - fair probabilities without margin
B. Poisson Model (Goal Distribution)β
# From: octopy/notebooks/poisson.ipynb
# Predicts goal distributions for matches
from scipy.stats import poisson
def predict_match(home_attack, home_defense, away_attack, away_defense,
league_avg_goals=1.35):
"""
Predicts match outcome using Poisson distribution.
Parameters derived from historical data:
- attack_strength = team_goals_scored / league_avg
- defense_strength = team_goals_conceded / league_avg
"""
# Expected goals
home_xg = home_attack * away_defense * league_avg_goals
away_xg = away_attack * home_defense * league_avg_goals
# Goal distribution (0-6 goals each)
max_goals = 7
home_probs = [poisson.pmf(i, home_xg) for i in range(max_goals)]
away_probs = [poisson.pmf(i, away_xg) for i in range(max_goals)]
# Match outcome matrix
prob_home_win = 0
prob_draw = 0
prob_away_win = 0
prob_over_25 = 0
prob_btts = 0
for h in range(max_goals):
for a in range(max_goals):
p = home_probs[h] * away_probs[a]
if h > a:
prob_home_win += p
elif h == a:
prob_draw += p
else:
prob_away_win += p
if h + a > 2.5:
prob_over_25 += p
if h > 0 and a > 0:
prob_btts += p
return {
'home_xg': home_xg,
'away_xg': away_xg,
'prob_home': prob_home_win,
'prob_draw': prob_draw,
'prob_away': prob_away_win,
'prob_over_25': prob_over_25,
'prob_btts': prob_btts
}
Port to TypeScript:
// services/value-detection/src/algorithms/poisson.ts
function poissonPmf(k: number, lambda: number): number {
return (Math.exp(-lambda) * Math.pow(lambda, k)) / factorial(k);
}
export function predictMatch(
homeAttack: number,
homeDefense: number,
awayAttack: number,
awayDefense: number,
leagueAvg = 1.35
): MatchPrediction {
const homeXG = homeAttack * awayDefense * leagueAvg;
const awayXG = awayAttack * homeDefense * leagueAvg;
const maxGoals = 7;
const homeProbs = Array.from({ length: maxGoals }, (_, i) => poissonPmf(i, homeXG));
const awayProbs = Array.from({ length: maxGoals }, (_, i) => poissonPmf(i, awayXG));
let probHome = 0, probDraw = 0, probAway = 0, probOver25 = 0, probBTTS = 0;
for (let h = 0; h < maxGoals; h++) {
for (let a = 0; a < maxGoals; a++) {
const p = homeProbs[h] * awayProbs[a];
if (h > a) probHome += p;
else if (h === a) probDraw += p;
else probAway += p;
if (h + a > 2) probOver25 += p;
if (h > 0 && a > 0) probBTTS += p;
}
}
return { homeXG, awayXG, probHome, probDraw, probAway, probOver25, probBTTS };
}
C. ELO Rating Systemβ
# From: octopy/notebooks/elo.ipynb
def update_elo(home_elo, away_elo, home_goals, away_goals, k=20, home_advantage=100):
"""
Update ELO ratings after a match.
k = sensitivity (higher = more volatile)
home_advantage = ELO bonus for home team
"""
# Expected score
home_expected = 1 / (1 + 10 ** ((away_elo - home_elo - home_advantage) / 400))
away_expected = 1 - home_expected
# Actual score (1 = win, 0.5 = draw, 0 = loss)
if home_goals > away_goals:
home_actual, away_actual = 1, 0
elif home_goals < away_goals:
home_actual, away_actual = 0, 1
else:
home_actual, away_actual = 0.5, 0.5
# Goal difference multiplier
goal_diff = abs(home_goals - away_goals)
multiplier = 1 + (goal_diff - 1) * 0.5 if goal_diff > 1 else 1
# Update ratings
new_home_elo = home_elo + k * multiplier * (home_actual - home_expected)
new_away_elo = away_elo + k * multiplier * (away_actual - away_expected)
return new_home_elo, new_away_elo
Port to TypeScript:
// services/value-detection/src/algorithms/elo.ts
export function updateElo(
homeElo: number,
awayElo: number,
homeGoals: number,
awayGoals: number,
k = 20,
homeAdvantage = 100
): { homeElo: number; awayElo: number } {
// Expected score
const homeExpected = 1 / (1 + Math.pow(10, (awayElo - homeElo - homeAdvantage) / 400));
// Actual score
let homeActual: number;
if (homeGoals > awayGoals) homeActual = 1;
else if (homeGoals < awayGoals) homeActual = 0;
else homeActual = 0.5;
// Goal difference multiplier
const goalDiff = Math.abs(homeGoals - awayGoals);
const multiplier = goalDiff > 1 ? 1 + (goalDiff - 1) * 0.5 : 1;
return {
homeElo: homeElo + k * multiplier * (homeActual - homeExpected),
awayElo: awayElo + k * multiplier * (1 - homeActual - (1 - homeExpected))
};
}
3οΈβ£ transfermarkt-datasets (PIPELINE PATTERN)β
What to learn:
- How they use GitHub Actions for weekly automation
- dbt for data transformation
- DuckDB for efficient processing
Pattern to adopt:
# .github/workflows/sync-data.yml
name: Sync Data Sources
on:
schedule:
- cron: '0 6 * * *' # Daily at 6am
workflow_dispatch: # Manual trigger
jobs:
sync:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Download Football-Data
run: npm run harness -- sync --source football-data
- name: Process and normalize
run: npm run harness -- process
- name: Generate features
run: npm run harness -- features --sport football
- name: Upload to storage
run: npm run harness -- upload
4οΈβ£ cricketdata (R PACKAGE)β
How to use:
# Run this R script as part of Data Harness
library(cricketdata)
# Fetch all T20 international data
t20_data <- fetch_cricsheet("t20")
# Fetch IPL data
ipl_data <- fetch_cricsheet("ipl")
# Get player statistics
players <- fetch_player_data()
# Save to CSV for our pipeline to ingest
write.csv(t20_data, "data/raw/cricketdata/t20_matches.csv")
write.csv(ipl_data, "data/raw/cricketdata/ipl_matches.csv")
Integration:
// Call R script from Node.js
import { exec } from 'child_process';
async function syncCricketDataR(): Promise<void> {
return new Promise((resolve, reject) => {
exec('Rscript scripts/fetch_cricket.R', (error, stdout) => {
if (error) reject(error);
else resolve();
});
});
}
5οΈβ£ Where Each Algorithm Gets Usedβ
| Algorithm | Service | Use Case |
|---|---|---|
| Shin Method | Value Detection | Remove bookmaker margin to find true probabilities |
| Poisson Model | Value Detection, Chat | Predict goal distributions, suggest over/under bets |
| ELO Ratings | Price Feed, Value Detection | Team strength ratings, improve odds accuracy |
| Feature Engineering | ML Pipeline | Train predictive models |
| XGBoost/RF | ML Pipeline | Match outcome prediction |
π Ported Algorithms Directory Structureβ
services/
βββ value-detection/
β βββ src/
β βββ algorithms/ # Ported from GitHub repos
β βββ shin.ts # From octopy
β βββ poisson.ts # From octopy
β βββ elo.ts # From octopy
β βββ kelly.ts # Kelly criterion for stake sizing
β βββ index.ts
β
βββ ml-pipeline/
β βββ src/
β βββ features/
β β βββ football.ts # From ProphitBet
β β βββ cricket.ts # Custom
β βββ models/
β β βββ xgboost.py # From ProphitBet (keep Python)
β β βββ random-forest.py
β β βββ neural-net.py
β βββ training/
β βββ train.py # Jupyter notebooks converted
β
βββ data-harness/
βββ scripts/
βββ fetch_cricket.R # From cricketdata package
π Integration Workflowβ
1. CLONE REPOS (monthly)
βββ Data Harness downloads to /data/raw/github/
2. EXTRACT ALGORITHMS (one-time, then maintain)
βββ Port Python β TypeScript for real-time services
βββ Keep Python for batch ML training
3. USE IN SERVICES
βββ Value Detection: shin.ts, poisson.ts, elo.ts
βββ ML Pipeline: features/football.ts, models/*.py
βββ Chat Assistant: Calls value detection API
4. TRAIN MODELS (weekly)
βββ Run Python notebooks with latest data
βββ Export trained models to services