
B-Book Architecture & Design Document v3.0 (Simplified)

System: Hannibal
Date: February 2026
Status: Living Document
Audience: Product, Engineering, Operations, Stakeholders
Predecessor: v2.0 (preserved as scale-up reference)


Table of Contents

  1. Executive Summary
  2. The Core Problem -- In Plain English
  3. Simplification Philosophy
  4. The Forwarding Matrix -- The Brain of the System
  5. The Bet Flow -- Step by Step
  6. Cascading Upline Routing
  7. Agent Liability Limits & NO_NEW_RISK
  8. User Win Limits & Stake Reduction
  9. Period Definitions -- Night & Weekly
  10. Exposure Accounting (2-Tier: Redis + PostgreSQL)
  11. Audit Trail & Determinism
  12. Performance Architecture
  13. The Agent Experience
  14. Nightmare Scenarios & How We Handle Them
  15. Bet Cancellation / Void / Partial Settlement State Machine
  16. Matrix Version Control
  17. Dead Letter Queue and Poison Bet Handling
  18. Settlement Cascade & Failure Isolation
  19. Cash-Out / Early Settlement
  20. Lay Bet Support
  21. Collusion Detection
  22. Agent Hierarchy Migration
  23. Minimum Forwarding / Skin-in-the-Game Requirements
  24. Panic Button & Abuse Prevention
  25. Timestamp Security & Period Boundaries
  26. Sharp Detection via Multiple Accounts
  27. Rate Limiting on Configuration Changes
  28. Hedge Execution
  29. Reconciliation
  30. Monitoring & Alerting

1. Executive Summary

What is the B-Book?

The B-Book is Hannibal's hierarchical, deterministic risk-management and routing engine. It sits between a punter placing a bet and the final destination of that bet's risk.

In traditional bookmaking, a "B-Book" means the bookie keeps the bet on their own books. An "A-Book" means the bookie hedges on an exchange like Betfair. Hannibal's B-Book is a routing engine that decides, for every bet, what percentage stays at each level of an agent hierarchy and what percentage gets forwarded up the chain. It enforces limits automatically, cascades overflow intelligently, and maintains a complete audit trail.

What problem does it solve?

In sports betting agent networks across India, agents operate at different levels of a hierarchy. Risk allocation today happens through manual spreadsheets, WhatsApp groups, and phone calls. There is no enforcement, no automatic cap management, and no audit trail. Disputes have no source of truth.

The B-Book automates all of this with a configurable, enforceable, auditable system.

v3 Simplification

This document is a deliberate simplification of the v2.0 architecture for Phase 1 launch. We removed premature optimizations, eliminated unnecessary infrastructure layers, and focused on correctness over throughput. The v2.0 document is preserved as a scale-up reference when we outgrow these simpler designs.

What changed: 2-tier caching instead of 3. Simple atomic counters instead of sharded exposure. A version integer instead of full MVCC. Sequential writes instead of distributed locking. Market orders for hedging instead of limit orders. No multi-currency. No responsible gambling module. Nightly reconciliation instead of every 15 minutes. Structured logging instead of Prometheus/Grafana.

What did NOT change: The forwarding matrix, cascading routing, exposure ledger design, NO_NEW_RISK, void/cancellation state machine, dead letter queue, settlement idempotency, cash-out, lay bets, collusion detection design, panic button, agent migration, minimum forwarding requirements, timestamp security, sharp detection design, and rate limiting.


2. The Core Problem -- In Plain English

The Agent Hierarchy: A Real Example

                      +-----------------------+
                      |   Betfair Exchange    |
                      |   (External Hedge)    |
                      +-----------+-----------+
                                  |
                      +-----------+-----------+
                      |       HANNIBAL        |
                      |    (The Platform)     |
                      +-----------+-----------+
                                  |
                      +-----------+-----------+
                      |        VIKRAM         |
                      |  Master Agent, Delhi  |
                      | Manages 12 sub-agents |
                      +-----------+-----------+
                                  |
                +-----------------+-----------------+
                |                                   |
    +-----------+-----------+           +-----------+-----------+
    |        RAJESH         |           |         PRIYA         |
    |   Sub-Agent, Mumbai   |           | Sub-Agent, Bangalore  |
    |  200 cricket punters  |           | 150 football punters  |
    +-----------+-----------+           +-----------------------+
                |
      +---------+---------+
      |  AMIT   |  SONIA  |   ... 198 more punters
      |  Punter |  Punter |
      +---------+---------+

Vikram is a master agent in Delhi with 15 years of experience and a strong bankroll. Rajesh is one of his sub-agents in Mumbai, managing 200 cricket punters with a moderate bankroll. Amit is one of Rajesh's punters placing bets of 500 to 50,000 on IPL matches.

When Amit Places a Bet

Amit places 10,000 on Mumbai Indians to win at odds 1.85. Here is what must happen within roughly 90 milliseconds:

  1. Can Amit place this bet? Check his per-click win limit.
  2. How much does Rajesh keep? The forwarding matrix says forward 40%, so Rajesh keeps 60%: 6,000.
  3. Can Rajesh afford that? Check his cricket limits, per-match limits, night limit.
  4. The other 4,000 goes to Vikram. Vikram's matrix keeps 60%, forwards 40%.
  5. The remaining 1,600 reaches the platform. Hannibal retains some, hedges the rest on Betfair.
  6. Record everything.

The Fundamental Questions

| Question | Who Answers It |
| --- | --- |
| How much risk does Rajesh keep? | Rajesh's forwarding matrix + his limits |
| How much risk does Vikram keep? | Vikram's forwarding matrix + his limits |
| How much risk does the platform keep? | Platform's risk configuration |
| How much gets hedged on Betfair? | Whatever remains after all agents have taken their share |
| What if someone's limits are breached? | Overflow cascades up to the next level |
| What if Betfair is unavailable? | Platform absorbs as retained risk, retries asynchronously |

3. Simplification Philosophy

The Guiding Principle

Naive, then optimize. Build the simplest correct thing. Measure. Optimize only what measurements prove is too slow. The v2.0 document was designed by experienced architects who anticipated scale problems. Those problems are real but they are not Phase 1 problems. Solving them prematurely introduces complexity that slows development, increases bugs, and makes the system harder to reason about.

What Was Simplified and Why

| Simplification | v2.0 Design | v3.0 Design | Why |
| --- | --- | --- | --- |
| Caching | 3-tier: Application LRU (5s TTL) -> Redis -> PostgreSQL | 2-tier: Redis -> PostgreSQL | LRU adds cache coherency problems across instances. Redis is sub-millisecond and shared. Removing LRU eliminates pub/sub invalidation, multi-instance divergence, and an entire class of bugs. Adds ~1-2ms latency, still well within the 90ms budget. |
| Exposure Counters | Sharded counters (8-16 shards per agent per scope) with random shard selection and summation | One Redis key per agent per scope. INCRBY for atomic updates. Write-behind to PostgreSQL. | Redis handles 100K+ ops/sec on a single key. We expect ~60 bets/sec for the busiest agent. Sharding is unnecessary until we 10x our traffic. No shard summation, no random selection, no complexity. |
| Matrix Versioning | Full MVCC with version history, garbage collection, version-aware caching, compressed retention tiers | Simple integer version on the matrix. Bet captures version at start. If version changed by position write time, re-process with new matrix. | Matrix changes are rare (5-10 per day per agent). The probability of a collision between a matrix change and a bet in flight is astronomically low. When it happens, re-processing one bet is trivial. No version storage, no GC, no version-aware cache logic. |
| Write Atomicity | Cross-agent distributed locking OR per-level atomicity with partial routing states | Sequential writes: Level 1, then Level 2, then Platform. Each level is its own transaction. Mid-cascade failure goes to DLQ. | The cascade is 2-3 levels deep and each write takes 5-15ms. Sequential execution adds 10-30ms total but eliminates deadlocks, cross-agent locking, and partial routing states. If Level 2 fails after Level 1 succeeds, the bet enters the DLQ for manual resolution. This is acceptable because it is extremely rare. |
| Hedge Execution | Limit orders, configurable max slippage, partial fill management, re-pricing (3 attempts), execution quality analytics | Market orders on Betfair. Accept the fill price. Retry 3 times on failure, then mark as unhedged and alert. Log theoretical vs actual spread for future optimization. | For Phase 1, we are hedging small amounts. The slippage difference between a market order and an optimized limit order strategy on these amounts is negligible. Building a sophisticated execution engine before we have meaningful hedge volume is premature. |
| Multi-Currency | Full FX conversion at currency zone boundaries, FX rate capture, FX risk management, FX reserve buffers | Everything in INR. No FX. | India-only launch. When we expand internationally, we add currency support then. |
| Responsible Gambling | Self-exclusion, deposit limits, session time limits, reality check notifications | Not included in v3. | B2B agent model, not B2C. Add when entering regulated B2C markets. |
| Reconciliation | Every 15 minutes (scheduled), post-settlement, full, targeted. Auto-correction for minor drift. Drift tracking over time. | One nightly job: compare ledger values to SUM of open positions. If mismatch, alert ops team. Manual fix. | Our bet volume in Phase 1 does not warrant 15-minute checks. Nightly is sufficient to catch any drift before it compounds. If a ledger diverges by a meaningful amount, a human should investigate the root cause rather than auto-correcting. |
| Monitoring | Prometheus + Grafana + AlertManager. Per-pipeline-stage histograms. Ops dashboards. | Structured logging (Pino) with key metrics. Simple health endpoint. Slack/email alerts for critical failures. | Prometheus/Grafana is infrastructure that must be deployed, configured, and maintained. For Phase 1 with 3 app instances, structured logs with grep/jq are sufficient. Add Prometheus when operational complexity warrants it. |
| Audit Trail | 3-tier hot/warm/cold storage. Checksum chains. Separate database. Nightly tier migration. | JSONB rows in PostgreSQL, one table, partitioned by month. Delete partitions older than 6 months. | Our first-season audit volume is ~18 GB. This fits comfortably in a single PostgreSQL instance. Cold storage, compression, and migration jobs are unnecessary overhead for a year. |

What This Means for the Latency Budget

| Step | v2.0 Budget | v3.0 Budget | Change |
| --- | --- | --- | --- |
| Request parsing & validation | 5ms | 5ms | Same |
| Metrics computation | 3ms | 3ms | Same |
| User win cap check | 5ms | 5ms | Same |
| Stake reduction | 2ms | 2ms | Same |
| Matrix resolution (from Redis, not LRU) | 10ms | 12ms | +2ms (Redis vs LRU) |
| Agent cap evaluation (per level, sequential) | 10ms x N | 12ms x N | +2ms per level (Redis vs LRU) |
| Position creation (sequential per level) | 15ms | 20ms | +5ms (sequential vs parallel) |
| Exposure ledger update (Redis INCRBY + async PG write) | 10ms | 8ms | -2ms (simpler, no sharding) |
| Audit record creation | 10ms | 10ms | Same |
| Response | 5ms | 5ms | Same |
| Total (3-level cascade) | ~85ms | ~92ms | +7ms, still within the 110ms budget |

The simplified architecture adds roughly 7ms to the critical path. We have 20ms of headroom before user experience degrades. This is acceptable.


4. The Forwarding Matrix -- The Brain of the System

What It Is

The forwarding matrix is a multi-dimensional lookup table that determines what percentage of each bet an agent retains versus forwards to their upline. It is the single most important configuration in the entire B-Book system.

The 5 Dimensions

| Dimension | What It Means | Example Values |
| --- | --- | --- |
| market_type | The type of bet | MATCH_ODDS, FANCY, BOOKMAKER, OVER_UNDER, LINE |
| sport_type | Which sport | CRICKET, FOOTBALL, TENNIS, KABADDI |
| event_phase | When in the event lifecycle | PRE_MATCH, IN_PLAY, APPROACHING_START |
| source_type | What kind of punter | NORMAL, SHARP, VIP, NEW_ACCOUNT |
| liquidity_band | Exchange liquidity available to hedge | HIGH, MEDIUM, LOW, NONE |

Wildcard Matching

An agent does not need a rule for every combination. Wildcards (*) mean "match anything."

Example of Rajesh's forwarding matrix:

| Rule | market_type | sport_type | event_phase | source_type | liquidity_band | Forward % |
| --- | --- | --- | --- | --- | --- | --- |
| R1 | FANCY | CRICKET | IN_PLAY | SHARP | * | 95% |
| R2 | FANCY | CRICKET | IN_PLAY | * | * | 70% |
| R3 | MATCH_ODDS | CRICKET | PRE_MATCH | * | HIGH | 40% |
| R4 | MATCH_ODDS | CRICKET | PRE_MATCH | * | LOW | 70% |
| R5 | MATCH_ODDS | CRICKET | IN_PLAY | * | * | 60% |
| R6 | * | CRICKET | * | SHARP | * | 90% |
| R7 | * | FOOTBALL | * | * | * | 80% |
| R8 | * | * | * | * | * | 50% |

  • R1: Sharp user, in-play cricket fancy -- forward 95%. Rajesh keeps only 5% because this is the most dangerous bet type.
  • R3: Pre-match cricket match odds with high liquidity -- forward only 40%. Rajesh keeps 60% because these are safe, well-priced bets.
  • R7: Any football bet -- forward 80%. Rajesh is not a football expert.
  • R8: Catch-all. Forward 50%.

Tie-Breaking Rules (Deterministic)

When a bet matches multiple rules:

  1. Most specific rule wins. Specificity = count of non-wildcard dimensions. R1 (4 specific dimensions) beats R2 (3 specific dimensions).
  2. If specificity is equal, higher forward percentage wins. This is the risk-safe default: when in doubt, forward more.
  3. If forward percentage is also equal, oldest rule wins. Deterministic ordering by creation timestamp.
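The tie-breaking rules above can be sketched as a single deterministic sort. This is a minimal illustration with assumed type and field names (`Rule`, `forwardPct`, `createdAt` are not from the production schema):

```typescript
// Deterministic matrix rule selection, per the three tie-breaking rules:
// 1. most specific (count of non-wildcard dimensions), 2. higher forward %,
// 3. oldest creation timestamp. Names here are illustrative.
type Dims = { market: string; sport: string; phase: string; source: string; liquidity: string };
type Rule = { id: string; dims: Dims; forwardPct: number; createdAt: number };
type Bet = Dims;

const WILDCARD = "*";

function matches(rule: Rule, bet: Bet): boolean {
  const d = rule.dims;
  return (Object.keys(d) as (keyof Dims)[]).every(
    (k) => d[k] === WILDCARD || d[k] === bet[k]
  );
}

function specificity(rule: Rule): number {
  return Object.values(rule.dims).filter((v) => v !== WILDCARD).length;
}

function resolveRule(rules: Rule[], bet: Bet): Rule | undefined {
  return rules
    .filter((r) => matches(r, bet))
    .sort(
      (a, b) =>
        specificity(b) - specificity(a) || // 1. most specific rule wins
        b.forwardPct - a.forwardPct ||     // 2. risk-safe: forward more
        a.createdAt - b.createdAt          // 3. oldest rule wins
    )[0];
}
```

With Rajesh's R1/R2 rules, a SHARP in-play cricket fancy bet resolves to R1 (4 specific dimensions), while a NORMAL one falls through to R2.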

Resolution Precedence Chain

The matrix is not the only routing input. There is a four-level precedence chain:

| Level | What It Is | When to Use It |
| --- | --- | --- |
| User Override | A specific forward % for a specific punter | "This user is a known sharp -- forward 100% of their bets" |
| Market Override | A specific forward % for a specific event/market | "The CSK vs MI final is too big -- forward 90% of everything" |
| Matrix Rule | The multi-dimensional lookup | Normal day-to-day operations |
| Agent Default | A single fallback percentage | "If nothing else matches, forward 50%" |

Sensible Defaults by Sport

| Scenario | Recommended Retention | Why |
| --- | --- | --- |
| Cricket in-play fancy | 10-20% | Highest variance, hardest to price |
| Cricket pre-match match odds | 40-60% | Well-priced, ample liquidity |
| Cricket in-play match odds | 20-40% | More volatile than pre-match |
| Football Premier League pre-match | 50-70% | Deep liquidity, well-understood markets |
| Football lower leagues | 20-40% | Less information, integrity risk |
| Tennis | 10-25% | Volatile, retirement risk, low liquidity |
| Kabaddi | 5-15% | Thin markets, poor external pricing |

Common Mistakes and Prevention

| Mistake | How We Prevent It |
| --- | --- |
| Retention too high on in-play fancies | Warn when retention exceeds recommended range; require confirmation |
| No catch-all rule | System requires a * / * / * / * / * fallback at all times |
| Conflicting rules they do not understand | Dashboard shows which rule matched for every bet; "test my matrix" dry-run tool |

5. The Bet Flow -- Step by Step

The Complete Flow

Walking Through the Numbers

The Bet: Amit places 10,000 on MI to beat CSK at odds 1.85, pre-match IPL 2026.

Step 1: Compute Metrics

| Metric | Calculation | Value |
| --- | --- | --- |
| Stake | As submitted | 10,000 |
| Potential Win | 10,000 x 0.85 | 8,500 |
| Liability | Same as potential win for a back bet | 8,500 |

Step 2: User Win Cap Check. Amit's per-click limit is 50,000. Potential win 8,500 is under. Aggregate daily limit is 2,00,000. Current accumulated: 45,000 + 8,500 = 53,500, still under. Pass.

Step 3: Resolve Forwarding Percentage. Bet characteristics: MATCH_ODDS, CRICKET, PRE_MATCH, NORMAL, HIGH liquidity. Matches Rule R3: forward 40%. Rajesh retains 60%.

Step 4: Agent Cap Evaluation (Rajesh). Rajesh retains 60% of 10,000 = 6,000 stake (5,100 liability). Check limits:

  • Cricket sport limit: 12,05,100 / 50,00,000. Pass.
  • MI vs CSK match limit: 1,25,100 / 5,00,000. Pass.
  • Night period limit: 3,05,100 / 10,00,000. Pass.

All pass. Rajesh retains full 6,000.

Step 5: Cascade to Vikram. 4,000 flows to Vikram. His matrix says retain 60% of cricket pre-match match odds. Retains 2,400, forwards 1,600.

Step 6: Platform. Receives 1,600. Retains 800, hedges 800 on Betfair.

Step 7: Execute. Positions created sequentially -- Rajesh first, then Vikram, then Platform.

| Entity | Retained Stake | Retained Liability | Forwarded |
| --- | --- | --- | --- |
| Rajesh | 6,000 | 5,100 | 4,000 to Vikram |
| Vikram | 2,400 | 2,040 | 1,600 to Platform |
| Platform | 800 | 680 | 800 to Betfair |
| Total | 10,000 | | |

The stake always sums to the original 10,000. Risk is distributed, never created or destroyed.
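The split arithmetic behind this invariant can be sketched in a few lines (an illustrative helper, not the production routing code; the platform's final retain/hedge split is folded into the last entry):

```typescript
// Each level retains (1 - forward%) of the incoming stake and forwards the
// rest; whatever remains after the last agent reaches the platform.
type Split = { entity: string; retainedStake: number };

function cascade(stake: number, levels: { entity: string; forwardPct: number }[]): Split[] {
  const out: Split[] = [];
  let incoming = stake;
  for (const { entity, forwardPct } of levels) {
    const retained = incoming * (1 - forwardPct / 100);
    out.push({ entity, retainedStake: retained });
    incoming -= retained; // forwarded upward
  }
  // platform receives the remainder (which it then splits into retain + hedge)
  out.push({ entity: "PLATFORM", retainedStake: incoming });
  return out;
}
```

For the bet above, `cascade(10_000, [{ entity: "Rajesh", forwardPct: 40 }, { entity: "Vikram", forwardPct: 40 }])` yields 6,000 / 2,400 / 1,600, which sums back to 10,000.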

Step 8: Audit. Complete audit record persisted as a JSONB row in the audit table: original bet, matrix rule matched at each level, every limit check, final routing breakdown, timestamps.


6. Cascading Upline Routing

How Bets Flow Through the Hierarchy

A bet enters at the bottom and flows upward. At each level, the agent retains what they can and forwards the rest.

User (Amit) places 10,000 bet
            |
            v
+---[RAJESH: Level 1]---+
|  Forward 40%          |
|  Retains:  6,000      |
|  Forwards: 4,000      |
+-----------|-----------+
            |
            v
+---[VIKRAM: Level 2]---+
|  Forward 40%          |
|  Retains:  2,400      |
|  Forwards: 1,600      |
+-----------|-----------+
            |
            v
+---[PLATFORM]----------+
|  Retains: 800         |
|  Hedges:  800         |
+-----------|-----------+
            |
            v
+---[BETFAIR]-----------+
|  Receives: 800        |
+-----------------------+

What Happens at Each Level

At every level the system performs:

  1. Resolve source_type for this agent (own classification > downstream trust > default NORMAL)
  2. Resolve forwarding percentage (precedence chain)
  3. Calculate retained amount = incoming stake x (1 - forward %)
  4. Check agent's limits
  5. If limits allow: retain, forward the rest
  6. If limits breached: retain only up to the limit, forward everything else (overflow)

Overflow Example

Amit bets 50,000 on CSK at 2.10. Potential win: 55,000.

RAJESH (L1) -- Forward 40%
  Wants to retain: 30,000 (60%)
  Cricket match limit remaining: 25,000   <-- LIMIT HIT
  Actually retains: 25,000
  Forwards: 25,000 (intended 20,000 + 5,000 overflow)
        |
        v
VIKRAM (L2) -- Forward 40%
  Receives: 25,000
  Retains: 15,000 (60%), all limits OK
  Forwards: 10,000
        |
        v
PLATFORM: Retains 5,000, Hedges 5,000

Vikram did not know 5,000 was overflow. He simply received 25,000 and processed normally. The cascade never drops a bet.
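The per-level overflow behavior reduces to clamping the intended retention to remaining limit capacity (a minimal sketch; `splitWithLimit` is a hypothetical helper, not the production API):

```typescript
// Intended retention is clamped to the remaining limit capacity; the excess
// simply joins the forwarded amount, so the next level sees one number and
// the bet is never dropped.
function splitWithLimit(incoming: number, forwardPct: number, limitRemaining: number) {
  const intended = incoming * (1 - forwardPct / 100);
  const retained = Math.min(intended, limitRemaining);
  return { retained, forwarded: incoming - retained, overflow: intended - retained };
}
```

Rajesh's example above: `splitWithLimit(50_000, 40, 25_000)` retains 25,000 and forwards 25,000, of which 5,000 is overflow.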

Suspended Agents

If Vikram is suspended, bets skip him. Rajesh's forwarded amount goes directly to the platform. Punters can still bet; risk routes around the suspended agent.

The Betfair Backstop

After all agents take their share, any remaining exposure reaches the platform, which can hedge on Betfair. If Betfair is down: the platform absorbs the risk temporarily, the bet is still accepted, and the hedge order enters a retry queue. Punter experience is never degraded by hedge-side issues.

Does Sharp Classification Travel Upline?

The information travels, but each agent decides independently.

Each forwarded bet carries metadata: originating_user_id, downstream_classification, forwarding_reason. Each upline agent resolves source_type independently:

Each agent configures trust per sub-agent:

| Sub-Agent | trust_downstream_flags | Reason |
| --- | --- | --- |
| Rajesh | true | Experienced, 8 years track record |
| Arun | false | New, only 3 months, unproven detection |
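The resolution order (own classification > downstream trust > default NORMAL) can be sketched as follows; the function and parameter names are illustrative, not the production signature:

```typescript
// Per-agent source_type resolution: an agent's own classification always wins;
// the downstream flag is honored only when the agent trusts that sub-agent;
// otherwise the bet is treated as NORMAL.
function resolveSourceType(
  ownClassification: string | null,
  downstreamClassification: string | null,
  trustDownstream: boolean
): string {
  if (ownClassification) return ownClassification;
  if (trustDownstream && downstreamClassification) return downstreamClassification;
  return "NORMAL";
}
```

So a SHARP flag from Rajesh (trusted) travels upline, while the same flag from Arun (untrusted) is ignored.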

7. Agent Liability Limits & NO_NEW_RISK

The Limit Structure

| Limit Type | Scope | Example |
| --- | --- | --- |
| Sport Limit | Total liability across all events in a sport | "50 lakh total cricket exposure" |
| Market Limit | Total liability on a specific event/market | "No more than 5 lakh on any single IPL match" |
| Night Period Limit | Total liability during the night window | "Cap my night session at 10 lakh" |
| Weekly Period Limit | Total liability during the weekly cycle | "Cap my weekly exposure at 1 crore" |

How Limits Interact: Most Restrictive Wins

All applicable limits are checked simultaneously. The most restrictive determines retention.

Example: Thursday night during IPL. Rajesh's state:

| Limit | Capacity | Used | Remaining |
| --- | --- | --- | --- |
| Cricket Sport | 50,00,000 | 38,00,000 | 12,00,000 |
| MI vs CSK Match | 5,00,000 | 4,50,000 | 50,000 |
| Night Period | 10,00,000 | 9,20,000 | 80,000 |
| Weekly Period | 40,00,000 | 35,00,000 | 5,00,000 |

A bet wants to add 1,00,000 of retained liability. The match limit (50,000) is most restrictive. Rajesh retains only 50,000; the other 50,000 overflows to Vikram.

NO_NEW_RISK Mode

NO_NEW_RISK activates when retained liability reaches the limit for a given scope. The agent cannot take on new risk-increasing exposure, but hedge bets are still accepted.

Think of it like a credit card limit. You cannot make new purchases, but you can make payments.

Scope is granular:

| Scenario | Cricket | Tennis | Football |
| --- | --- | --- | --- |
| Rajesh hits cricket limit | NO_NEW_RISK | Normal | Normal |
| Rajesh hits MI vs CSK match limit | NO_NEW_RISK (this match only) | Normal | Normal |
| Rajesh hits night period limit | NO_NEW_RISK (all sports) | NO_NEW_RISK | NO_NEW_RISK |

Hedge detection rule:

If WorstCaseLiability AFTER the bet < WorstCaseLiability BEFORE the bet, it is a hedge. Allow it.

A bet on the opposite outcome of an existing position is almost always a hedge. A bet on the same outcome is never a hedge. A bet on a third outcome may or may not be depending on amounts and odds.
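The hedge test is a straightforward worst-case comparison. A minimal sketch for back bets on a discrete-outcome market (illustrative types; the real engine works on the position ledger):

```typescript
// Worst-case liability: for each possible outcome, sum what the agent would
// pay winners (stake * (odds - 1)) minus losing stakes kept, and take the max.
// A candidate bet that lowers that max is a hedge and passes NO_NEW_RISK.
type BackBet = { outcome: string; stake: number; odds: number };

function worstCaseLiability(outcomes: string[], bets: BackBet[]): number {
  return Math.max(
    ...outcomes.map((o) => {
      let net = 0;
      for (const b of bets) {
        net += b.outcome === o ? b.stake * (b.odds - 1) : -b.stake;
      }
      return net;
    })
  );
}

function isHedge(outcomes: string[], open: BackBet[], candidate: BackBet): boolean {
  return (
    worstCaseLiability(outcomes, [...open, candidate]) <
    worstCaseLiability(outcomes, open)
  );
}
```

Against an open 10,000 @ 1.85 position on MI, a CSK bet reduces worst-case liability (hedge); another MI bet increases it (not a hedge).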

Exiting NO_NEW_RISK: Settlements reduce exposure, hedge bets reduce exposure, or an admin raises the limit.


8. User Win Limits & Stake Reduction

Per-Click Win Limit

Caps the maximum a punter can win on a single bet. If Amit has a per-click limit of 50,000 and bets at odds 50.00 with a 5,000 stake (potential win 2,45,000), the stake is reduced:

max_stake = 50,000 / (50.00 - 1) = 50,000 / 49 = 1,020 (rounded down)

The punter sees: "Maximum stake at these odds: 1,020." If the reduced stake falls below the minimum bet size (100), the bet is rejected: "This market is currently unavailable at these odds."
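The reduction formula above, with the minimum-bet rejection rule, as a small sketch (`maxStakeForWinCap` is an illustrative name; the 100 minimum comes from the text):

```typescript
// max_stake = win_cap / (odds - 1), rounded down. If that falls below the
// minimum bet size, the bet is rejected (returned as null here).
function maxStakeForWinCap(winCap: number, odds: number, minBet = 100): number | null {
  const maxStake = Math.floor(winCap / (odds - 1));
  return maxStake < minBet ? null : maxStake;
}
```

Amit's case: `maxStakeForWinCap(50_000, 50.0)` gives 1,020.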

Aggregate Win Limit

Caps total cumulative potential wins over a period (daily). Prevents a punter from placing many individually-acceptable bets that collectively create enormous exposure.

Sharp Detection Signals

| Signal | What It Means |
| --- | --- |
| Closing Line Value (CLV) | Consistently bets at better prices than the closing line. Strongest predictor of long-term profitability. |
| Consistent staking | Same stake regardless of odds. Professionals use flat staking. |
| Early betting | Bets within the first hour of market opening when prices are softest. |
| Unpopular markets | Bets on obscure leagues with weakest pricing. |
| No mean reversion | Profits do not revert over time. Lucky punters revert; skilled punters do not. |
| Post-bet price movement | Price moves in their direction after they bet. |

9. Period Definitions -- Night & Weekly

Why Bookies Use Periods

Bookies think in operational windows, not lifetime exposure. The night session (when live betting peaks) has different risk characteristics than a quiet afternoon. The weekly cycle aligns with settlement cycles.

Night Period

Configurable per agent. Examples:

| Agent | Timezone | Night Period | Why |
| --- | --- | --- | --- |
| Rajesh (India, cricket) | IST | 7:00 PM - 2:00 AM | IPL matches start at 7:30 PM |
| Priya (India, football) | IST | 10:00 PM - 4:00 AM | PL matches at 12:30 AM IST |

Period Rollover

When a live match spans a period boundary, open exposure carries forward. Rajesh's 8,00,000 at 1:55 AM is NOT zeroed at 2:00 AM. It carries forward as a starting balance. New bets after 2:00 AM count against day period limits.

DST

Period boundaries are defined in the agent's local time and converted to UTC fresh each day. "7 PM to 2 AM" always means that in the agent's local clock. On spring-forward day, the period is 1 hour shorter. On fall-back, the system uses the first occurrence of the repeated hour.


10. Exposure Accounting (2-Tier: Redis + PostgreSQL)

Three Ledgers Per Agent Per Scope

| Ledger | What It Tracks |
| --- | --- |
| retained_open_liability | Worst-case payout on retained bets. Checked against limits. |
| forwarded_open_liability | Liability forwarded upward. Agent has no financial exposure. |
| open_potential_win | Total punters could win against this agent. |

The 2-Tier Architecture

+------------------+       +------------------+
|      REDIS       |       |    POSTGRESQL    |
|    (Fast R/W)    |       | (Source of Truth)|
|                  |       |                  |
| - Sub-ms reads   |       | - Atomic writes  |
| - INCRBY atomic  |       | - FOR UPDATE     |
| - Single shared  |       |   locking at     |
|   cache for all  |       |   boundaries     |
|   instances      |       |                  |
+--------+---------+       +--------+---------+
         |                          |
         v                          v
 "Rajesh has 12L            "Rajesh has exactly
  used of 50L --             12,05,100 used --
  clearly within             UPDATE with lock"
  limit, fast pass"

How it works:

  1. Redis (primary read/write): Every bet reads exposure from Redis first. Redis is shared across all application instances -- no cache coherency problem. Exposure counters are stored as simple keys: exposure:{agent_id}:{scope}:retained, exposure:{agent_id}:{scope}:forwarded, exposure:{agent_id}:{scope}:potential_win. Updated with atomic INCRBY on every bet.

  2. PostgreSQL (source of truth): After the Redis update, a write-behind job flushes the current Redis value to PostgreSQL every 5 seconds. For bets where the agent is near their limit (>85% utilized), we bypass Redis and go directly to PostgreSQL with SELECT ... FOR UPDATE locking. This ensures two simultaneous bets cannot both claim the last 50,000 of capacity.

Why this works: During an IPL match, Rajesh might receive 50 bets per minute. For 45 of those, Redis immediately confirms he is within limits. Only the 5 bets near his limit need the PostgreSQL lock. Median latency stays low while correctness is guaranteed at the boundary.

When Redis is down: Fall back to PostgreSQL for all reads. Latency increases from <1ms to 5-15ms. No data loss, no incorrect decisions.
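The near-limit routing decision can be sketched as follows. The store interfaces are hypothetical stand-ins for Redis and PostgreSQL; the 85% threshold comes from the text:

```typescript
// Fast path: Redis read when clearly within limits. Near the boundary (>85%
// utilized after this bet), defer to the source of truth under a row lock.
interface FastStore { get(key: string): number }                                  // Redis-like
interface TruthStore { checkAndReserve(key: string, amt: number, cap: number): boolean } // PG, FOR UPDATE

const NEAR_LIMIT = 0.85;

function admitBet(key: string, liability: number, cap: number, redis: FastStore, pg: TruthStore): boolean {
  const used = redis.get(key);
  if ((used + liability) / cap < NEAR_LIMIT) {
    return true; // fast path: clearly within limits, no lock needed
  }
  return pg.checkAndReserve(key, liability, cap); // boundary: locked check
}
```

This is how 45 of Rajesh's 50 bets per minute skip the PostgreSQL lock entirely.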

Settlement Impact

When a match settles, exposure associated with that match is removed from all agents: retained_open_liability decrements for retained positions, forwarded_open_liability decrements for forwarded positions, open_potential_win decrements for all.


11. Audit Trail & Determinism

The Audit Record

Every bet produces a structured, queryable JSONB record in PostgreSQL. The audit table is partitioned by month. Partitions older than 6 months are dropped.

An audit record contains:

| Field | Example |
| --- | --- |
| bet_id | bet_a1b2c3d4 |
| original_stake | 10,000 |
| adjusted_stake | 10,000 |
| per_click_win_cap_check | PASS: 8,500 < 50,000 |
| forwarding_chain | Full routing at each level with matrix rule, limits, overflow |
| limit_checks | Every limit checked at every level with result |
| period_context | NIGHT (19:00-02:00 IST), Week 7 of 2026 |
| timestamps | matrix_resolve: 2ms, cap_check: 5ms, total: 23ms |
| matrix_version | 7 (integer version captured at start of processing) |

Why Determinism Matters

Agents must trust the system. If Rajesh sees a routing he does not understand, he will revert to manual processes. The audit trail shows exactly why every decision was made.

Disputes are trivially resolvable: load the audit record, see the exact split and the rules that produced it.

Configuration Change Log

All configuration changes are recorded with who made the change, when, and why. You can answer: "What was Rajesh's matrix at 9:47 PM on March 15?" by querying the config changelog.


12. Performance Architecture

Latency Budget (Simplified)

| Step | Budget |
| --- | --- |
| Request parsing & validation | 5ms |
| Metrics computation | 3ms |
| User win cap check (Redis) | 5ms |
| Matrix resolution (Redis) | 12ms |
| Agent cap evaluation (per level, Redis or PG) | 12ms x N levels |
| Position creation (sequential, per level) | 8ms x N levels |
| Exposure ledger update (Redis INCRBY) | 3ms |
| Audit record (async buffer) | 0ms on critical path |
| Response | 5ms |
| Total (3-level cascade) | ~92ms |

Exposure Counter Design (Simple Redis Atomic Counters)

One Redis key per agent per scope. No sharding.

KEY: exposure:rajesh:cricket:retained
VALUE: 1205100 (in paisa)

KEY: exposure:rajesh:mi_vs_csk:retained
VALUE: 125100

KEY: exposure:rajesh:night_2026_02_11:retained
VALUE: 305100

Operations:

  • Read: GET exposure:rajesh:cricket:retained -- sub-millisecond
  • Increment: INCRBY exposure:rajesh:cricket:retained 5100 -- atomic, sub-millisecond
  • Decrement (settlement): DECRBY exposure:rajesh:cricket:retained 5100 -- atomic

Write-behind to PostgreSQL: a background job reads all exposure keys and writes to exposure_ledger table every 5 seconds. If Redis restarts, the recovery job rebuilds keys from PostgreSQL.
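The key shape and counter operations above, sketched against an in-memory map standing in for Redis (on the real thing, INCRBY/DECRBY are single atomic commands; the helper names are illustrative):

```typescript
// Key naming convention: exposure:{agent_id}:{scope}:{ledger}
const exposureKey = (
  agentId: string,
  scope: string,
  ledger: "retained" | "forwarded" | "potential_win"
) => `exposure:${agentId}:${scope}:${ledger}`;

// In-memory stand-in for the Redis counter; values are in paisa.
const counters = new Map<string, number>();

function incrBy(key: string, delta: number): number {
  const next = (counters.get(key) ?? 0) + delta;
  counters.set(key, next);
  return next; // settlement uses a negative delta (DECRBY)
}
```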

What Is Cached Where

| Data | Location | Invalidation |
| --- | --- | --- |
| Agent forwarding matrix | Redis hash matrix:{agent_id} | On matrix update, delete key |
| Agent limits | Redis hash limits:{agent_id} | On limit update, delete key |
| Current exposure | Redis keys exposure:{agent_id}:{scope}:* | Updated on every bet via INCRBY |
| User win cap state | Redis key user_agg_win:{user_id}:{date} | Updated on every bet, reset on period boundary |
| NO_NEW_RISK flags | Redis key nonewrisk:{agent_id}:{scope} | Set on limit breach, cleared on settlement or limit change |

13. The Agent Experience

Three Tiers of Experience

The system runs on the same engine for everyone. What changes is how much complexity the agent sees.

Tier 1: "Set and Forget" (80% of agents). Three questions at setup: What sports? What is your nightly budget? How aggressive? The system generates the full matrix, limits, and win caps from these three answers. Traffic light dashboard. WhatsApp alerts.

Tier 2: "Dashboard Driver" (15% of agents). Real-time risk dashboard. Per-sport limits. Per-user management. Sees exposure bars, match-by-match breakdown, recent bets.

Tier 3: "Matrix Master" (5% of agents). Full 5D matrix editor. Test bet simulator. Historical P&L analysis per rule.

The "Sleep Well" Number

Every agent sees one number prominently:

YOUR MAXIMUM LOSS TONIGHT
3,42,000
out of your 10,00,000 night budget
████████░░░░░░░░░░░░ 34%

This is the sum of worst-case liability across all retained open positions. It is a mathematical guarantee: the cascade routing, limits, and NO_NEW_RISK ensure this number never exceeds the budget.

The Panic Button

One button sets forwarding to 100% for all sports and markets, places hedge orders on Betfair for all current retained positions, and notifies the upline. This is the emergency exit.


14. Nightmare Scenarios & How We Handle Them

Syndicate Attack (Correlated Positions Across Agents)

The platform maintains aggregate exposure per event. Cross-agent correlation detects users who consistently bet the same outcome at the same time through different agents. Platform-level event limits cap total exposure regardless of individual agent limits.

Data Feed Failure During Live Play

If odds have not updated for more than 5 seconds during in-play, the market is automatically suspended. No new bets until the feed resumes. Price movement circuit breaker suspends on abnormal jumps.

Rogue Agent Dumping Toxic Flow

Track P&L of retained vs forwarded bets per agent. If forwarded bets consistently lose money while retained bets win, flag the agent. The platform can force minimum retention.

Double Settlement / Result Correction

Re-settlement reverses previous settlements and re-applies with corrected results. All agents in the cascade are affected. Communication chain notifies everyone with full audit trails.

System Outage During Peak

Circuit breaker pattern: when response times exceed 500ms, switch to degraded mode where all bets are forwarded 100% to Betfair. No agent retains risk during the outage. When the system recovers, normal routing resumes.


15. Bet Cancellation / Void / Partial Settlement State Machine

The Complete State Machine

State Definitions

| State | Exposure Impact | Reversible? |
| --- | --- | --- |
| BET_PLACED | Ledgers not yet updated | Yes (rolls back on system error) |
| ACTIVE | Fully reflected in all agent ledgers | No -- can only move forward |
| SETTLED | Exposure removed, P&L applied | Can move to RE_SETTLED |
| VOIDED | All exposure atomically removed as if bet never happened | No -- final |
| CANCELLED | All exposure removed. Functionally identical to void. | No |
| CASH_OUT_SETTLED | Original position closed via counter-position | No |
| RE_SETTLED | Previous P&L reversed, new P&L applied | Can be re-settled again |

Who Can Initiate Transitions

| Transition | By | Time Window |
| --- | --- | --- |
| ACTIVE -> SETTLED | System (automatic on event result) | After event concludes |
| ACTIVE -> VOIDED | Platform admin only | Any time before settlement |
| BET_PLACED -> CANCELLED | Punter (own bet) | Within 3-5 seconds pre-match, 0 for in-play |
| BET_PLACED -> CANCELLED | Agent (their punter's bet) | Within 60 seconds |
| SETTLED -> RE_SETTLED | SUPER_ADMIN only | Within 72 hours |

Agents cannot void bets. Only the platform can void. This prevents fraud against the upline.

How Voids Cascade

A void reads the original audit record to determine exactly what to reverse. It does not recalculate. Even if the agent changed their matrix since placement, the void reverses exactly what was originally done.

Every void has an idempotency key. Pressing "Void" twice does not double-decrement exposure. All ledger updates happen in a single PostgreSQL transaction with FOR UPDATE locks in agent_id ascending order (preventing deadlocks).

After a void, NO_NEW_RISK flags are re-evaluated: if exposure dropped below the limit, the flag is cleared.
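The idempotency and lock-ordering behavior can be sketched with in-memory stand-ins (the real implementation is one PostgreSQL transaction taking FOR UPDATE locks in agent_id order; the names below are illustrative):

```typescript
// Idempotent void application: a processed idempotency key makes repeats a
// no-op, and reversals are applied in ascending agent_id order -- the same
// order the row locks would be taken, which prevents deadlocks.
type Reversal = { agentId: string; liability: number };

const applied = new Set<string>();          // processed idempotency keys
const exposure = new Map<string, number>(); // agent_id -> retained liability

function applyVoid(idempotencyKey: string, reversals: Reversal[]): boolean {
  if (applied.has(idempotencyKey)) return false; // "Void" pressed twice: no-op
  for (const r of [...reversals].sort((a, b) => a.agentId.localeCompare(b.agentId))) {
    exposure.set(r.agentId, (exposure.get(r.agentId) ?? 0) - r.liability);
  }
  applied.add(idempotencyKey);
  return true;
}
```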

Partial Void (Multi-Leg Bets)

When one leg of an accumulator is voided, that leg is treated as a winner at odds 1.00. Remaining legs settle normally with the voided leg's odds removed from the calculation.
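Since an accumulator's combined odds are the product of its legs, re-pricing a voided leg at 1.00 removes it from the product. A minimal sketch:

```typescript
// Combined accumulator odds with voided legs re-priced at 1.00.
// voidedLegs holds the zero-based indexes of the voided legs.
function accumulatorOdds(legOdds: number[], voidedLegs: Set<number>): number {
  return legOdds.reduce(
    (product, odds, leg) => product * (voidedLegs.has(leg) ? 1.0 : odds),
    1.0
  );
}
```

A three-leg accumulator at 2.0 / 1.5 / 3.0 pays 9.0; voiding the middle leg reduces it to 6.0.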


16. Matrix Version Control

The Problem

Rajesh changes his matrix at 9:47 PM during MI vs CSK. Fifteen bets are in various processing stages. If half use the old matrix and half use the new one, the audit trail is inconsistent.

The Simple Solution: Version Integer

The matrix has a version integer, incremented on any change. When a bet enters processing, it captures the current version.

BET PROCESSING:
1. Bet arrives at matrix resolution step
2. Read current matrix version from Redis: version = 7
3. Evaluate rules
4. Record version 7 in the bet's audit trail
5. Continue to cap evaluation and position creation

At position write time:
6. Re-check: is matrix version still 7?
7a. If yes: write positions normally
7b. If no (version is now 8): re-resolve the matrix with the new rules, re-calculate the split, then write
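The check-and-retry loop above can be sketched as follows (a simplified synchronous model; the real pipeline's reads and writes are async):

```typescript
type Split = { forwardPct: number };

// Resolve the matrix, then re-check the version at write time; if it
// changed mid-flight, re-resolve under the new version and check again.
// Returns only once the split was produced under a stable version.
function processWithVersionCheck(
  readVersion: () => number,
  resolve: (version: number) => Split
): { split: Split; version: number } {
  let version = readVersion();
  let split = resolve(version);
  for (;;) {
    const current = readVersion();
    if (current === version) return { split, version }; // safe to write
    version = current;
    split = resolve(version); // rare: matrix changed during processing
  }
}
```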

Why This Is Sufficient

Matrix changes are rare: 5-10 per day per agent. The window between a bet reading the matrix version and writing positions is 20-40 milliseconds. The probability of a matrix change landing in that window is roughly:

10 changes/day spread over 86,400,000 ms = 1 change per 8,640,000 ms on average
40ms window: probability ~ 40 / 8,640,000 = 0.0000046 (about 0.0005%)

When the astronomically unlikely collision does occur, we simply re-process the bet with the new matrix. No version history needed. No garbage collection. No version-aware caching.

What the Agent Sees

Matrix changes take effect on the next bet that starts processing after the Redis key is updated (which happens within milliseconds of the save). The dashboard shows: "Matrix updated. New bets will use the updated rules."


17. Dead Letter Queue and Poison Bet Handling

The Retry Pipeline

What Constitutes a Poison Bet

| Condition | Example |
|---|---|
| Event already settled | Bet queued during outage, event settles before replay |
| Agent suspended mid-processing | Agent suspended while bets were in retry queue |
| Market no longer exists | Data feed error caused market deletion |
| Duplicate bet_id | Network timeout caused client to retry; first attempt succeeded |
| Invalid state transition | Punter cancelled during retry window |

What the Punter Experiences

If the punter never saw "Bet Confirmed": They saw "Bet is being processed." If retries succeed, push notification: "Your bet has been confirmed." If retries fail, push notification: "Your bet could not be placed. No funds were deducted."

If the punter saw "Bet Confirmed" (partial cascade failure): The punter continues to see "Bet Active." The system retries in the background. If retries fail, an admin resolves via the DLQ -- typically void and refund.

Manual Resolution

The DLQ dashboard shows each entry with full context: the original bet, the failure reason, the punter experience (did they see a confirmation?), and the event status. Admin options: Void & Refund, Force Process, or Settle at Result.


18. Settlement Cascade & Failure Isolation

Per-Position Settlement State

Each position is settled independently. Position 2,847 failing does not block position 2,848. The worker records the failure and moves on.

Settlement Worker Design

Workers claim positions using SELECT ... FOR UPDATE SKIP LOCKED. Each position is processed: calculate P&L, update exposure ledger, set status to SETTLED. Failed positions remain for retry. Settlement is partitioned by agent for fault isolation.
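The claim step can be sketched as a raw query (issued via Prisma's raw-SQL escape hatch; column names like `claimed_at` are illustrative, not the production schema):

```typescript
// SKIP LOCKED makes parallel settlement workers pass over rows that
// another worker has already locked, instead of blocking on them.
const CLAIM_BATCH_SQL = `
  UPDATE positions
  SET status = 'PROCESSING', claimed_at = NOW()
  WHERE id IN (
    SELECT id FROM positions
    WHERE status = 'PENDING' AND event_id = $1
    ORDER BY id
    FOR UPDATE SKIP LOCKED
    LIMIT $2
  )
  RETURNING id;
`;
```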

Settlement Idempotency

Every settlement has an idempotency key: {position_id}_{hash(event_result)}. Before executing, the system checks if this key already exists. If it does, the settlement is skipped. This makes settlement safe to retry at any time.
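A minimal sketch of the key derivation (assuming the event result is serialized canonically, i.e., with a stable key order, before hashing):

```typescript
import { createHash } from "node:crypto";

// {position_id}_{hash(event_result)}: the same position settled against
// the same result always produces the same key, so retries are no-ops.
function settlementIdempotencyKey(positionId: string, eventResult: object): string {
  const hash = createHash("sha256")
    .update(JSON.stringify(eventResult)) // assumes canonical serialization
    .digest("hex")
    .slice(0, 16);
  return `${positionId}_${hash}`;
}
```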

Partial Failure Recovery

A background monitor runs every 30 seconds:

  • Positions in PROCESSING for >60 seconds: assume the worker crashed, reset to PENDING
  • Positions in FAILED: retry up to 5 times, then escalate to DLQ
  • Events with mixed CONFIRMED/PENDING: generate a report and alert if >15 minutes old

19. Cash-Out / Early Settlement

How Cash-Out Works

Cash-out is a counter-bet at current odds that closes the original position. The punter locks in a guaranteed profit or limits a loss.

Formula (back bet):

cash_out_return = stake * (original_odds / current_odds) * (1 - margin)
cash_out_profit = cash_out_return - stake

Example: Amit bet 10,000 on MI at 1.85. MI now at 1.20 (dominating).

cash_out_return = 10,000 * (1.85 / 1.20) * (1 - 0.05) = 14,646
Amit's profit if he cashes out: 4,646
(vs 8,500 if MI wins, vs -10,000 if MI loses)
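The formula above, as a small helper (amounts in whole rupees here for readability; production code works in paisa):

```typescript
// cash_out_return = stake * (original_odds / current_odds) * (1 - margin),
// rounded to the nearest whole unit.
function cashOutReturn(
  stake: number,
  originalOdds: number,
  currentOdds: number,
  margin: number
): number {
  return Math.round(stake * (originalOdds / currentOdds) * (1 - margin));
}
```

Plugging in Amit's numbers: `cashOutReturn(10000, 1.85, 1.20, 0.05)` gives 14,646, for a locked-in profit of 4,646.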

Routing Through the Original Proportions

The cash-out counter-bet routes through the SAME proportions as the original bet, not the current matrix. If Rajesh retained 60%, he closes 60% of the position.

Partial Cash-Out

Amit can cash out any percentage. 50% cash-out closes half the position at all levels. The remaining 50% continues normally.

Interaction with NO_NEW_RISK

Cash-out creates a counter-position that reduces exposure. It is treated as a hedge and allowed even when NO_NEW_RISK is active.


20. Lay Bet Support

What Is a Lay Bet?

A lay bet is the opposite of a back bet. Sonia lays MI to win means she bets MI will NOT win. She wins if MI draws or loses.

Liability difference:

  • Back: punter risks stake, wins stake * (odds - 1)
  • Lay: punter risks stake * (odds - 1), wins the stake

Exposure Impact

A lay bet on outcome X DECREASES the agent's exposure on outcome X and INCREASES exposure on other outcomes. It is a natural hedge.

Before Sonia's lay:
MI Win liability: 5,00,000
MI Not Win liability: 0

After Sonia lays MI at 1.85 for 10,000 (Rajesh retains 60%):
MI Win liability: 4,94,900 (reduced by 5,100)
MI Not Win liability: 6,000 (increased by 6,000)

Worst case went from 5,00,000 to 4,94,900 = HEDGE
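The ledger arithmetic behind the example can be sketched as follows (the `ExposureDelta` shape is illustrative; amounts here are whole rupees):

```typescript
interface ExposureDelta {
  outcomeWin: number;    // change to liability if the outcome wins
  outcomeNotWin: number; // change to liability if it does not
}

// How the retained share of a bet moves the agent's liability ledger.
// Back: agent pays stake*(odds-1) if the outcome wins.
// Lay: agent collects stake*(odds-1) if it wins, pays stake if it loses.
function exposureDelta(
  side: "BACK" | "LAY",
  stake: number,
  odds: number,
  retainPct: number
): ExposureDelta {
  const share = retainPct / 100;
  const winAmount = stake * (odds - 1);
  return side === "BACK"
    ? { outcomeWin: Math.round(winAmount * share), outcomeNotWin: 0 }
    : {
        outcomeWin: -Math.round(winAmount * share),
        outcomeNotWin: Math.round(stake * share),
      };
}
```

Sonia's lay (10,000 at 1.85, Rajesh retaining 60%) yields -5,100 on MI Win and +6,000 on MI Not Win, matching the ledger movement above.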

How the Matrix Handles Lay Bets

Same 5-dimensional matrix, same rules, same precedence chain. The only difference is how the exposure ledger is updated and how NO_NEW_RISK evaluates the bet. If the lay reduces worst-case liability, it is allowed even under NO_NEW_RISK.


21. Collusion Detection

The Problem

An agent conspires with a sharp punter: marks them as NORMAL to retain winning bets instead of forwarding them. Or marks them as SHARP only when they are about to lose, forwarding toxic flow upline.

Detection Signals

| Signal | Severity |
|---|---|
| Classification flip before winning streak | HIGH |
| Classification flip-flop (>3 times/week) | HIGH |
| Override % changes correlate with bet outcomes | CRITICAL |
| Forwarded flow consistently loses vs retained flow consistently wins | HIGH |
| Single user dominates forwarded volume (>30%) | MEDIUM |

Cooling-Off Period for Classification Changes

Changes are queued, not instant:

  • NORMAL to SHARP: 24-hour cooling-off
  • SHARP to NORMAL: 72-hour cooling-off (riskier direction, longer wait)
  • During cooling-off: old classification remains active. Agent cannot change again.

Upline Audit Rights

Vikram can see Rajesh's user classifications, pending changes, and classification-vs-outcome correlation reports. Vikram can flag changes for review and request the platform to freeze Rajesh's override capability. Vikram cannot directly change Rajesh's classifications.

Correlation Engine (Phase 2 Implementation)

A nightly analysis builds timelines of classification changes, bet placements, and outcomes. Calculates collusion scores based on statistical correlation. Thresholds trigger escalating alerts from informational (score 25-75) to automatic override freeze (score 150+).

For Phase 1, the cooling-off periods and upline audit rights provide baseline protection. The automated correlation engine is Phase 2.


22. Agent Hierarchy Migration

The Problem

Rajesh leaves Vikram's network and joins Suresh's. Rajesh has open positions forwarded through Vikram. The system must handle this cleanly.

Effective-Dated Changes

Hierarchy changes are never instantaneous. They take effect at a scheduled date.

Dual-Path Settlement

After cutover:

  • New bets: Rajesh -> Suresh -> Platform
  • Old positions: Still settled through Vikram (he holds the positions)

| Bet Timing | Routing | Settlement |
|---|---|---|
| Before cutover, settles before cutover | Through Vikram | Through Vikram |
| Before cutover, settles after cutover | Through Vikram | Through Vikram (dual-path) |
| After cutover | Through Suresh | Through Suresh |

When all old positions settle, final reconciliation between Rajesh and Vikram, then migration status becomes COMPLETE.


23. Minimum Forwarding / Skin-in-the-Game Requirements

How It Works

The upline sets a minimum retention percentage per downstream agent. This floor cannot be overridden by the downstream agent's matrix.

Vikram requires Rajesh to retain at least 20% of every bet. If Rajesh's matrix says forward 95% for SHARP users, the system caps it at 80% forward (20% retention minimum).

Limits always win over minimum retention. If Rajesh's limits would be breached by retaining 20%, he retains only what his limits allow.

When Rajesh tries to save a matrix rule that violates the minimum, the system auto-adjusts and shows a clear message explaining the constraint.
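The capping logic can be sketched in a few lines (function name is illustrative):

```typescript
// The matrix's forward % is capped so the agent keeps at least the
// upline-mandated retention floor. Hard limits are evaluated afterwards
// and may still shrink actual retention below the floor.
function capForwardPercentage(
  matrixForwardPct: number,
  minRetentionPct: number
): number {
  return Math.min(matrixForwardPct, 100 - minRetentionPct);
}
```

With Rajesh's 20% floor, a matrix rule of 95% forward is saved as 80%, while a 50% rule is unaffected.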


24. Panic Button & Abuse Prevention

What the Panic Button Does

  1. Immediately sets forwarding to 100% for all sports and markets
  2. Places hedge orders on Betfair for all current retained positions
  3. Notifies the upline
  4. Logs the action

Abuse Prevention

The abuse pattern: panic when losing (lock in small loss), hold when winning (collect full profit).

Cost escalation:

  • First panic per period: platform absorbs hedge spread cost
  • Second panic: 50% spread cost charged to the agent
  • Third+: 100% charged to the agent

Controls:

| Control | Value |
|---|---|
| Panics per night (free) | 2 |
| Cooling-off after panic | 30 minutes before matrix can be restored |
| Minimum hedge duration | 15 minutes (cannot un-hedge immediately) |

Detection: Track panic frequency, correlation with book direction (panics when losing, not when winning), and P&L improvement from panic usage. If panic consistently improves P&L by >20% over 3 weeks with asymmetric usage, flag as GAMING. Escalation: full spread cost, then suspension of panic feature replaced with automatic NO_NEW_RISK.


25. Timestamp Security & Period Boundaries

Server-Side Authority

The authoritative timestamp is assigned at the API gateway when the request is received. This is immutable. Client timestamps are stored for debugging and fraud detection but never trusted for business logic.

BET PROCESSING:
1. Client sends request with client_timestamp
2. API gateway assigns server_timestamp = NOW()
3. All period boundary evaluation uses server_timestamp
4. All exposure updates reference server_timestamp
5. Audit stores BOTH timestamps

Period Boundary Tie-Breaking

Bets at exactly the boundary belong to the ENDING period. The interval is closed-open: [19:00, 02:00). A bet at 01:59:59.999 is NIGHT. A bet at 02:00:00.000 is DAY.
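A sketch of the closed-open check, at minute granularity for clarity (production code compares full server timestamps in the agent's timezone):

```typescript
// Closed-open interval [startHour:00, endHour:00): the start instant is
// inside the period, the end instant is not. Handles windows that wrap
// midnight, such as [19:00, 02:00).
function isInNightPeriod(
  localHour: number,
  localMinute: number,
  startHour = 19,
  endHour = 2
): boolean {
  const mins = localHour * 60 + localMinute;
  const start = startHour * 60;
  const end = endHour * 60;
  return start < end
    ? mins >= start && mins < end
    : mins >= start || mins < end; // window wraps midnight
}
```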

Clock Skew

All server instances synchronized via NTP to within 50ms. If drift exceeds 200ms, the instance is removed from the load balancer. For period boundaries (at hour granularity), 50ms skew is irrelevant.


26. Sharp Detection via Multiple Accounts

The Detection Pillars (Design for Phase 1, Heavy Implementation Phase 2)

Four independent signals:

Pillar 1: Device Fingerprinting. Same device used by multiple accounts. Similarity scoring based on device_id, canvas fingerprint, IP + user agent, screen resolution + timezone.

Pillar 2: IP/Network Correlation. Multiple accounts from same IP or subnet. Sequential IP usage patterns. VPN/proxy detection.

Pillar 3: Betting Pattern Similarity. Outcome correlation (>80% over 100+ bets is suspicious). Timing correlation (<5 minutes apart). Correlated obscure market selection. Similar stake distributions. Similar CLV profiles.

Pillar 4: Payment Method Overlap. Same bank account, UPI ID, phone number, or card on multiple accounts.

Phase 1 Implementation

For Phase 1: implement device fingerprint collection and basic IP logging. Store the data. Manual review when agents report suspected syndicates.

For Phase 2: build the automated correlation engine that runs nightly, scores clusters, and generates syndicate alerts with recommended actions (classify all as SHARP, block, review individually).


27. Rate Limiting on Configuration Changes

Per-Agent Rate Limits

| Configuration Type | Rate Limit |
|---|---|
| Matrix rule changes | 1 per 5 minutes |
| User override changes | 1 per user per 10 minutes |
| Market override changes | 1 per market per 5 minutes |
| Agent default changes | 1 per 15 minutes |
| Limit changes | 1 per limit per 10 minutes |
| Panic button | No rate limit on activation; 30-minute cooling-off |

Queue Behavior

Rapid changes that exceed the rate limit are queued. Only the most recent value is applied when the rate limit window expires. Intermediate values are never applied.

The agent sees: "Your change is pending. It will apply in ~4 minutes. Only your most recent value will be used."
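The latest-value-wins behavior can be sketched as a small coalescing wrapper (class and method names are illustrative; the real implementation would drive `tick` from a timer):

```typescript
// Applies the first change immediately, then coalesces everything that
// arrives inside the rate-limit window: only the newest queued value is
// applied when the window expires; intermediate values are dropped.
class CoalescingLimiter<T> {
  private pending: T | undefined;
  private windowOpen = false;

  constructor(private readonly apply: (value: T) => void) {}

  submit(value: T): void {
    if (!this.windowOpen) {
      this.windowOpen = true;
      this.apply(value);
    } else {
      this.pending = value; // overwrite any earlier queued value
    }
  }

  // Called when the rate-limit window expires.
  tick(): void {
    this.windowOpen = false;
    if (this.pending !== undefined) {
      const value = this.pending;
      this.pending = undefined;
      this.submit(value);
    }
  }
}
```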

Panic Button Interaction

Panic activation bypasses all rate limits (safety trumps stability). Configuration changes while panic is active are queued and applied only after panic is deactivated.


28. Hedge Execution

The Simple Design

For Phase 1, hedge execution is deliberately simple:

What this means:

  • Market orders, not limit orders. Accept whatever price Betfair gives.
  • No re-pricing, no partial fill management, no slippage optimization.
  • If the order fails 3 times (Betfair down, no liquidity, API error), mark the amount as unhedged and alert.
  • Log the theoretical price (what the punter got) vs the actual fill price. This data feeds Phase 2 optimization.

Why this is fine for Phase 1: We are hedging small amounts. The spread between a market order and an optimized limit order on an 800-rupee bet is a few rupees. Building a sophisticated execution engine before we have meaningful hedge volume is premature.

Betfair Health Check

Ping Betfair every 10 seconds. Three consecutive failures means Betfair is down. Queue all orders for retry. Platform absorbs risk. When Betfair returns, drain the queue.

If an event settles while a hedge order is queued, cancel the order. The platform bears the outcome as retained risk.
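The consecutive-failure rule above can be sketched as a tiny state tracker (names are illustrative):

```typescript
// Three consecutive ping failures => DOWN; any success resets the count.
class HealthTracker {
  private failures = 0;

  constructor(private readonly threshold = 3) {}

  record(pingSucceeded: boolean): "UP" | "DOWN" {
    this.failures = pingSucceeded ? 0 : this.failures + 1;
    return this.failures >= this.threshold ? "DOWN" : "UP";
  }
}
```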


29. Reconciliation

The Simple Design

One nightly job. Runs at 4:00 AM IST when traffic is minimal.

What it checks:

For every active agent, for every scope with open positions:

Step 1: Read exposure_ledger.retained_open_liability for this agent+scope
Step 2: SUM(liability) FROM positions WHERE agent_id AND scope AND status='OPEN' AND type='RETAINED'
Step 3: Compare.
Step 4: If they match: log success.
Step 5: If mismatch: alert ops team via Slack/email with:
- Agent name
- Scope
- Ledger value vs computed value
- Drift amount and direction
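The comparison step reduces to a small pure function (shape and names are illustrative):

```typescript
interface DriftReport {
  drift: number; // absolute mismatch in paisa
  direction: "LEDGER_HIGH" | "LEDGER_LOW" | "MATCH";
}

// Compare the exposure_ledger value against the recomputed SUM of open
// retained positions for the same agent+scope.
function compareLedger(ledgerValue: number, computedSum: number): DriftReport {
  const delta = ledgerValue - computedSum;
  return {
    drift: Math.abs(delta),
    direction: delta === 0 ? "MATCH" : delta > 0 ? "LEDGER_HIGH" : "LEDGER_LOW",
  };
}
```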

No auto-correction. If the ledger is wrong, a human investigates the root cause. Auto-correction hides bugs. In Phase 1, every mismatch should be understood and fixed at the source.

The manual recompute tool:

reconciliation recompute --agent=rajesh --scope=cricket

1. Lock the agent+scope (advisory lock, briefly delays new bets)
2. SUM all open positions
3. Update exposure_ledger to match
4. Update Redis
5. Release lock
6. Log the correction

This is the fix tool. Use it after investigating and understanding why the drift occurred.


30. Monitoring & Alerting

The Simple Design

No Prometheus. No Grafana. No AlertManager. Instead:

1. Structured Logging (Pino)

Every bet, settlement, hedge, and error is logged as structured JSON. Key fields: timestamp, event, agent_id, bet_id, latency_ms, success, error.

This gives us everything we need for debugging and performance analysis. Use jq or a log aggregator to query.
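The log line shape can be sketched as plain JSON (production uses Pino, which emits the same kind of single-line JSON; field names follow the conventions above):

```typescript
interface BetLogLine {
  timestamp: string;
  event: string;
  agent_id: string;
  bet_id: string;
  latency_ms: number;
  success: boolean;
  error?: string;
}

// One structured line per processed bet, queryable with jq or any
// log aggregator.
function betLogLine(
  agentId: string,
  betId: string,
  latencyMs: number,
  success: boolean,
  error?: string
): string {
  const line: BetLogLine = {
    timestamp: new Date().toISOString(),
    event: success ? "bet_processed" : "bet_failed",
    agent_id: agentId,
    bet_id: betId,
    latency_ms: latencyMs,
    success,
    ...(error ? { error } : {}),
  };
  return JSON.stringify(line);
}
```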

2. Health Endpoint

GET /api/v1/monitoring/health returns:

{
"status": "healthy",
"redis": "connected",
"postgresql": "connected",
"betfair": "healthy",
"uptime_seconds": 86400,
"bets_last_minute": 42,
"avg_latency_ms": 65,
"errors_last_minute": 0
}

3. Critical Failure Alerts (Slack/Email)

Only six alert types for Phase 1:

| Alert | Trigger | Channel |
|---|---|---|
| NO_NEW_RISK activated | Any agent enters NO_NEW_RISK | Slack |
| Settlement failure | Any position fails to settle after 3 retries | Slack + Email |
| Hedge failure | Hedge marked as unhedged after 3 retries | Slack |
| DLQ entry | Any bet enters the dead letter queue | Slack |
| Reconciliation mismatch | Nightly job finds any drift | Email |
| Betfair down | 3 consecutive health check failures | Slack |

That is it. Six alert types. When we have operational complexity that warrants dashboards, we add Prometheus/Grafana.


This concludes Parts I and II of the simplified B-Book Architecture v3.0. The document covers 30 sections (down from 46 in v2.0), with every simplification explicitly documented and justified. The v2.0 document is preserved as the scale-up reference for when Phase 1 traffic and operational needs outgrow these simpler designs.



Part III: Implementation Architecture

The following sections provide the complete implementation specification. Every database table, every API endpoint, every pipeline step, and every deployment detail. An LLM reading this can build the entire system without asking a single question.


31. Technology Stack

Core Technologies

| Technology | Version | Purpose | Why This Choice |
|---|---|---|---|
| Node.js | 20 LTS | Application runtime | Event-loop model handles high concurrency cheaply. Non-blocking I/O suits the many Redis and DB calls in the bet pipeline. Team has existing expertise. |
| TypeScript | 5.x | Language | Type safety prevents financial bugs. Domain types like Stake, Liability, and ForwardPercentage catch errors at compile time. |
| PostgreSQL | 16 | Primary database | ACID transactions for financial data. FOR UPDATE locking for limit enforcement at boundaries. JSONB for flexible audit payloads. Partitioning for table growth. |
| Prisma | 5.x | ORM | Type-safe database access from TypeScript. Schema-as-code for migrations. Use raw SQL only for FOR UPDATE locks and bulk operations. |
| Redis | 7.x | Cache and counters | Sub-millisecond reads for exposure checks. Atomic INCRBY for exposure counters. Used as the backing store for BullMQ. |
| BullMQ | Latest | Job queue | Settlement processing, hedge retry, reconciliation, audit writes. Built on Redis. Supports delayed jobs, retries, priority queues, dead letter. |
| Docker | Latest | Containerization | Consistent environments. Docker Compose for local development. Single-container production deployment. |
| Pino | Latest | Structured logging | Fast JSON logger. Structured logs are queryable. Low overhead at high throughput. Replaces the need for Prometheus in v1. |
| Zod | Latest | Runtime validation | Validates all API inputs and configuration at runtime. Complements TypeScript compile-time types. |
| Socket.IO | Latest | WebSocket | Real-time dashboard updates to agents. Push notifications for alerts. |

What We Are NOT Using (and Why)

| Technology | Why Not |
|---|---|
| Prometheus + Grafana | No dedicated ops person yet. Pino structured logs and a simple /health endpoint are sufficient for v1. Add Prometheus when you have someone to look at dashboards. |
| Separate Audit DB | Overkill for v1 volume. Audit trail goes into the same PostgreSQL instance as JSONB in a partitioned table. Separate it when table size exceeds 100 GB. |
| Kafka or Redis Streams | BullMQ covers all async job needs (settlement, hedging, reconciliation). Kafka is for 1000+ bets/sec. We are targeting 167 bets/sec. |
| Application LRU Cache | Adds a consistency headache (3-tier cache invalidation). Redis is fast enough at sub-millisecond latency. Add an LRU layer only if Redis P99 exceeds 5ms under load. |
| Microservices | One Node.js process is simpler to deploy, debug, and operate. Modules are separated in code but deployed as one application. Extract into microservices only if a specific module needs independent scaling. |
| Multi-currency / FX | INR only for v1. All amounts in paisa (integer arithmetic, no floating-point). Add multi-currency when expanding beyond India. |
| Responsible Gambling Module | Not required for the Indian agent market in v1. Add when entering regulated markets (UK, EU). |

32. System Architecture Overview

System Diagram

How Services Communicate

| From → To | Method | Mechanism | Notes |
|---|---|---|---|
| Client → API | REST | HTTP/JSON, auth via JWT | Standard request-response |
| Client → Dashboard | WebSocket | Socket.IO | Real-time exposure updates, alerts |
| API → Bet Processing | Function call | In-process | Everything is one Node.js process |
| Bet Processing → Cascade Engine | Function call | In-process | Synchronous -- the bet pipeline is a single call chain |
| Bet Processing → Hedge Queue | BullMQ job | Async | Fire-and-forget from the bet pipeline |
| Settlement trigger → Settlement Worker | BullMQ job | Async | Settlement jobs queued when events settle |
| Config Change → Cache | Redis DEL | Direct | Delete the Redis key; next read repopulates from PostgreSQL |

Deployment Topology

Local Development: Docker Compose with three services: app (Node.js), postgres, redis.

Production (v1): One Node.js process running the full application and all BullMQ workers. One PostgreSQL instance. One Redis instance. Put behind nginx or a cloud load balancer if you need SSL termination.

Scale horizontally by adding a second Node.js instance behind the load balancer when traffic demands it. No code changes required -- BullMQ handles job distribution across workers automatically, and Redis ensures shared state.


33. Database Schema Design

Entity-Relationship Diagram

Table: agents

Stores every entity in the hierarchy: sub-agents, master agents, and the platform itself.

| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID | PRIMARY KEY, DEFAULT gen_random_uuid() | Unique agent identifier |
| external_id | VARCHAR(100) | UNIQUE, NOT NULL | Human-readable ID (e.g., rajesh_mumbai) |
| name | VARCHAR(255) | NOT NULL | Display name |
| parent_agent_id | UUID | REFERENCES agents(id), NULLABLE | Upline agent. NULL for the platform. |
| level | INTEGER | NOT NULL | Hierarchy depth. 0 = platform, 1 = master, 2 = sub, etc. |
| status | VARCHAR(20) | NOT NULL, DEFAULT 'ACTIVE' | ACTIVE, SUSPENDED, DEACTIVATED |
| timezone | VARCHAR(50) | NOT NULL, DEFAULT 'Asia/Kolkata' | IANA timezone for period boundaries |
| default_forward_percentage | DECIMAL(5,2) | NOT NULL, DEFAULT 50.00 | Fallback when no matrix rule matches |
| matrix_version | INTEGER | NOT NULL, DEFAULT 1 | Incremented on every matrix change. Simple integer -- no version history table. |
| night_period_start | TIME | NULLABLE | Night period start in local time |
| night_period_end | TIME | NULLABLE | Night period end in local time |
| weekly_period_start_day | INTEGER | NOT NULL, DEFAULT 1 | 1=Monday, 7=Sunday |
| is_platform | BOOLEAN | NOT NULL, DEFAULT false | True for the single platform agent |
| platform_retain_percentage | DECIMAL(5,2) | NULLABLE | Only for platform: percent to retain vs hedge |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | |
| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | |

Indexes:

  • idx_agents_parent on (parent_agent_id)
  • idx_agents_status on (status)
  • UNIQUE on (external_id)

Simplification vs v2: No separate tier column (derive from config complexity). The matrix_version integer lives directly on the agent row instead of a separate matrix_versions table. When a matrix changes, increment this integer and record the change in audit_trail.

Table: agent_limits

| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID | PRIMARY KEY | |
| agent_id | UUID | REFERENCES agents(id), NOT NULL | |
| limit_type | VARCHAR(30) | NOT NULL | SPORT, MARKET, NIGHT_PERIOD, WEEKLY_PERIOD |
| sport_type | VARCHAR(30) | NULLABLE | CRICKET, FOOTBALL, etc. NULL for period limits across all sports |
| event_id | VARCHAR(100) | NULLABLE | Only for MARKET limits -- the specific event |
| limit_amount | BIGINT | NOT NULL | In paisa. Example: 50 lakh INR = 500,000,000 paisa |
| is_active | BOOLEAN | NOT NULL, DEFAULT true | |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | |
| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | |

Indexes:

  • idx_agent_limits_lookup on (agent_id, limit_type, sport_type)
  • UNIQUE on (agent_id, limit_type, sport_type, event_id) -- prevents duplicate limits

Table: forwarding_matrix_rules

| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID | PRIMARY KEY | |
| agent_id | UUID | REFERENCES agents(id), NOT NULL | |
| market_type | VARCHAR(30) | NOT NULL, DEFAULT '*' | MATCH_ODDS, FANCY, BOOKMAKER, OVER_UNDER, LINE, or * |
| sport_type | VARCHAR(30) | NOT NULL, DEFAULT '*' | CRICKET, FOOTBALL, TENNIS, KABADDI, or * |
| event_phase | VARCHAR(30) | NOT NULL, DEFAULT '*' | PRE_MATCH, IN_PLAY, APPROACHING_START, or * |
| source_type | VARCHAR(30) | NOT NULL, DEFAULT '*' | NORMAL, SHARP, VIP, NEW_ACCOUNT, or * |
| liquidity_band | VARCHAR(30) | NOT NULL, DEFAULT '*' | HIGH, MEDIUM, LOW, NONE, or * |
| forward_percentage | DECIMAL(5,2) | NOT NULL, CHECK (0 <= val <= 100) | Percentage to forward to upline |
| specificity | INTEGER | NOT NULL | Count of non-wildcard dimensions (0-5). Computed at insert/update in application code. |
| is_active | BOOLEAN | NOT NULL, DEFAULT true | |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | Oldest rule wins tie-breaks |
| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | |

Indexes:

  • idx_fmr_agent_active on (agent_id) WHERE is_active = true
  • idx_fmr_lookup on (agent_id, market_type, sport_type, event_phase, source_type, liquidity_band) WHERE is_active = true

Simplification vs v2: No version column on each rule. Instead, the agent's matrix_version integer is incremented whenever any rule changes. Bets record the matrix_version from the agent row at processing time, which is sufficient for audit replay.
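The specificity column described above can be computed with a one-liner in the application layer (the parameter shape mirrors the table's five dimension columns):

```typescript
// specificity = number of non-wildcard dimensions in a matrix rule (0-5).
// Computed in application code before insert/update, never in SQL.
function computeSpecificity(rule: {
  market_type: string;
  sport_type: string;
  event_phase: string;
  source_type: string;
  liquidity_band: string;
}): number {
  return [
    rule.market_type,
    rule.sport_type,
    rule.event_phase,
    rule.source_type,
    rule.liquidity_band,
  ].filter((dimension) => dimension !== "*").length;
}
```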

Table: user_overrides

| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID | PRIMARY KEY | |
| user_id | UUID | REFERENCES users(id), NOT NULL | |
| agent_id | UUID | REFERENCES agents(id), NOT NULL | The agent applying this override |
| forward_percentage | DECIMAL(5,2) | NOT NULL | Override forward percent |
| reason | TEXT | NOT NULL | Why the override was set |
| created_by | UUID | NOT NULL | Who set the override |
| is_active | BOOLEAN | NOT NULL, DEFAULT true | |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | |
| expires_at | TIMESTAMPTZ | NULLABLE | Optional expiry |

Indexes:

  • UNIQUE on (user_id, agent_id) WHERE is_active = true

Table: market_overrides

| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID | PRIMARY KEY | |
| agent_id | UUID | REFERENCES agents(id), NOT NULL | |
| event_id | VARCHAR(100) | NOT NULL | The specific event/market |
| forward_percentage | DECIMAL(5,2) | NOT NULL | Override forward percent |
| reason | TEXT | NOT NULL | |
| created_by | UUID | NOT NULL | |
| is_active | BOOLEAN | NOT NULL, DEFAULT true | |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | |
| expires_at | TIMESTAMPTZ | NULLABLE | |

Indexes:

  • UNIQUE on (agent_id, event_id) WHERE is_active = true

Table: users

| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID | PRIMARY KEY | |
| external_id | VARCHAR(100) | UNIQUE, NOT NULL | User ID from the main platform |
| agent_id | UUID | REFERENCES agents(id), NOT NULL | The agent this user belongs to |
| name | VARCHAR(255) | NOT NULL | |
| per_click_win_limit | BIGINT | NOT NULL, DEFAULT 5000000 | In paisa. Default 50,000 INR |
| aggregate_win_limit_daily | BIGINT | NOT NULL, DEFAULT 20000000 | In paisa. Default 2,00,000 INR |
| min_stake | BIGINT | NOT NULL, DEFAULT 10000 | In paisa. Default 100 INR |
| status | VARCHAR(20) | NOT NULL, DEFAULT 'ACTIVE' | ACTIVE, SUSPENDED |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | |
| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | |

Indexes:

  • idx_users_agent on (agent_id)
  • UNIQUE on (external_id)

Simplification vs v2: No self-exclusion fields, no session time limits, no deposit limits. Those belong to the responsible gambling module (v2).

Table: user_classifications

| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID | PRIMARY KEY | |
| user_id | UUID | REFERENCES users(id), NOT NULL | |
| agent_id | UUID | REFERENCES agents(id), NOT NULL | Agent making this classification |
| classification | VARCHAR(30) | NOT NULL | NORMAL, SHARP, VIP, NEW_ACCOUNT |
| reason | TEXT | NULLABLE | Why classified |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | |
| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | |

Indexes:

  • UNIQUE on (user_id, agent_id)

Table: agent_trust_config

| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID | PRIMARY KEY | |
| agent_id | UUID | REFERENCES agents(id), NOT NULL | The upline agent |
| sub_agent_id | UUID | REFERENCES agents(id), NOT NULL | The downstream agent |
| trust_downstream_flags | BOOLEAN | NOT NULL, DEFAULT false | Whether to trust sub-agent's user classifications |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | |

Indexes:

  • UNIQUE on (agent_id, sub_agent_id)

Table: events

| Column | Type | Constraints | Description |
|---|---|---|---|
| id | VARCHAR(100) | PRIMARY KEY | External event ID from odds provider |
| sport_type | VARCHAR(30) | NOT NULL | |
| name | VARCHAR(500) | NOT NULL | "MI vs CSK, IPL 2026" |
| start_time | TIMESTAMPTZ | NOT NULL | |
| status | VARCHAR(20) | NOT NULL, DEFAULT 'UPCOMING' | UPCOMING, LIVE, SUSPENDED, SETTLED, VOID |
| result | JSONB | NULLABLE | Settlement result data |
| settled_at | TIMESTAMPTZ | NULLABLE | |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | |

Indexes:

  • idx_events_sport_status on (sport_type, status)
  • idx_events_start_time on (start_time)

Table: markets

| Column | Type | Constraints | Description |
|---|---|---|---|
| id | VARCHAR(100) | PRIMARY KEY | External market ID |
| event_id | VARCHAR(100) | REFERENCES events(id), NOT NULL | |
| market_type | VARCHAR(30) | NOT NULL | MATCH_ODDS, FANCY, BOOKMAKER, etc. |
| name | VARCHAR(255) | NOT NULL | |
| status | VARCHAR(20) | NOT NULL, DEFAULT 'OPEN' | OPEN, SUSPENDED, CLOSED, SETTLED, VOID |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | |

Indexes:

  • idx_markets_event on (event_id)
  • idx_markets_status on (status)

Table: bets (partitioned by month on created_at)

| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID | PRIMARY KEY | |
| user_id | UUID | NOT NULL | The punter |
| agent_id | UUID | NOT NULL | The originating agent (Level 1) |
| event_id | VARCHAR(100) | NOT NULL | |
| market_id | VARCHAR(100) | NOT NULL | |
| selection | VARCHAR(255) | NOT NULL | "MI to win" |
| side | VARCHAR(10) | NOT NULL | BACK or LAY |
| requested_stake | BIGINT | NOT NULL | Original stake in paisa |
| accepted_stake | BIGINT | NOT NULL | After stake reduction |
| odds | DECIMAL(10,4) | NOT NULL | Decimal odds |
| potential_win | BIGINT | NOT NULL | In paisa |
| liability | BIGINT | NOT NULL | In paisa |
| stake_reduction_reason | VARCHAR(50) | NULLABLE | PER_CLICK_LIMIT, AGGREGATE_LIMIT, or null |
| market_type | VARCHAR(30) | NOT NULL | |
| sport_type | VARCHAR(30) | NOT NULL | |
| event_phase | VARCHAR(30) | NOT NULL | PRE_MATCH, IN_PLAY |
| source_type | VARCHAR(30) | NOT NULL | As resolved at originating agent |
| liquidity_band | VARCHAR(30) | NOT NULL | HIGH, MEDIUM, LOW, NONE |
| matrix_version_snapshot | INTEGER | NOT NULL | The agent's matrix_version at processing time |
| status | VARCHAR(20) | NOT NULL, DEFAULT 'OPEN' | OPEN, SETTLED, VOIDED, PARTIALLY_VOIDED, CANCELLED, CASH_OUT_SETTLED, RE_SETTLED |
| routing_status | VARCHAR(20) | NOT NULL, DEFAULT 'COMPLETE' | COMPLETE, PARTIAL, FAILED |
| period_context | VARCHAR(20) | NOT NULL | NIGHT, DAY |
| total_processing_time_ms | INTEGER | NOT NULL | End-to-end latency |
| settled_at | TIMESTAMPTZ | NULLABLE | |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | Partition key |

Indexes:

  • idx_bets_user_time on (user_id, created_at DESC)
  • idx_bets_agent_time on (agent_id, created_at DESC)
  • idx_bets_event on (event_id)
  • idx_bets_market on (market_id)
  • idx_bets_status_open on (status) WHERE status = 'OPEN'
  • idx_bets_routing_incomplete on (routing_status) WHERE routing_status != 'COMPLETE'

Table: positions (partitioned by month on created_at)

One row per agent per bet. If Amit's bet flows through Rajesh, Vikram, and the Platform, that is 3 RETAINED position rows (and optionally 3 FORWARDED rows, though forwarded amounts can be derived).

| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID | PRIMARY KEY | |
| bet_id | UUID | NOT NULL | |
| agent_id | UUID | NOT NULL | Agent holding this position |
| cascade_level | INTEGER | NOT NULL | 1 = first agent, 2 = upline, etc. |
| position_type | VARCHAR(20) | NOT NULL | RETAINED or FORWARDED |
| stake | BIGINT | NOT NULL | Stake portion in paisa |
| liability | BIGINT | NOT NULL | Liability portion in paisa |
| potential_win | BIGINT | NOT NULL | Potential win portion in paisa |
| forward_percentage_used | DECIMAL(5,2) | NOT NULL | The forward percent that produced this split |
| forward_source | VARCHAR(30) | NOT NULL | USER_OVERRIDE, MARKET_OVERRIDE, MATRIX_RULE, AGENT_DEFAULT |
| matrix_rule_id | UUID | NULLABLE | The specific matrix rule that matched |
| overflow_amount | BIGINT | NOT NULL, DEFAULT 0 | How much was overflow from limit breach |
| event_id | VARCHAR(100) | NOT NULL | Denormalized for fast queries |
| market_id | VARCHAR(100) | NOT NULL | Denormalized |
| sport_type | VARCHAR(30) | NOT NULL | Denormalized |
| selection | VARCHAR(255) | NOT NULL | Denormalized |
| side | VARCHAR(10) | NOT NULL | Denormalized BACK/LAY |
| odds | DECIMAL(10,4) | NOT NULL | Denormalized |
| status | VARCHAR(20) | NOT NULL, DEFAULT 'OPEN' | OPEN, SETTLED, VOIDED |
| settlement_id | UUID | NULLABLE | |
| settled_amount | BIGINT | NULLABLE | Actual P&L in paisa (positive = agent profit) |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | |

Indexes:

  • idx_positions_bet on (bet_id)
  • idx_positions_agent_open on (agent_id, status) WHERE status = 'OPEN'
  • idx_positions_agent_event on (agent_id, event_id) WHERE status = 'OPEN'
  • idx_positions_event_status on (event_id, status)

Table: exposure_ledger

One row per agent per scope. No sharding. Simple.

| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID | PRIMARY KEY | |
| agent_id | UUID | NOT NULL | |
| scope_type | VARCHAR(30) | NOT NULL | SPORT, MARKET, NIGHT_PERIOD, WEEKLY_PERIOD |
| scope_key | VARCHAR(200) | NOT NULL | e.g., "cricket", "mi_vs_csk_2026_03_15", "night_2026_03_15" |
| retained_open_liability | BIGINT | NOT NULL, DEFAULT 0 | In paisa |
| forwarded_open_liability | BIGINT | NOT NULL, DEFAULT 0 | In paisa |
| open_potential_win | BIGINT | NOT NULL, DEFAULT 0 | In paisa |
| no_new_risk_active | BOOLEAN | NOT NULL, DEFAULT false | |
| no_new_risk_triggered_at | TIMESTAMPTZ | NULLABLE | |
| last_updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | |

Indexes:

  • UNIQUE on (agent_id, scope_type, scope_key)
  • idx_exposure_agent_scope on (agent_id, scope_type)
  • idx_exposure_no_new_risk on (no_new_risk_active) WHERE no_new_risk_active = true

Simplification vs v2: No shard_index column. One row per agent per scope. If this becomes a write bottleneck (measurable as PostgreSQL write contention P99 exceeding 50ms), add sharding later.

Table: settlements (partitioned by month on created_at)

| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID | PRIMARY KEY | |
| event_id | VARCHAR(100) | NOT NULL | |
| market_id | VARCHAR(100) | NOT NULL | |
| position_id | UUID | NOT NULL | |
| agent_id | UUID | NOT NULL | |
| bet_id | UUID | NOT NULL | |
| settlement_type | VARCHAR(20) | NOT NULL | WIN, LOSS, VOID, PUSH |
| stake | BIGINT | NOT NULL | The position's stake |
| payout | BIGINT | NOT NULL | Amount paid to/from agent |
| profit_loss | BIGINT | NOT NULL | Agent P&L (positive = profit) |
| idempotency_key | VARCHAR(200) | UNIQUE, NOT NULL | {position_id}_{event_result_hash} |
| status | VARCHAR(20) | NOT NULL, DEFAULT 'COMPLETED' | COMPLETED, REVERSED, RE_SETTLED |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | |

Indexes:

  • idx_settlements_event on (event_id)
  • idx_settlements_agent_time on (agent_id, created_at DESC)
  • idx_settlements_bet on (bet_id)
  • UNIQUE on (idempotency_key)

Table: audit_trail (partitioned by month on created_at)

| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID | PRIMARY KEY | |
| bet_id | UUID | NULLABLE | NULL for config change audits |
| record_type | VARCHAR(30) | NOT NULL | BET_PLACED, BET_SETTLED, BET_VOIDED, CONFIG_CHANGED |
| agent_id | UUID | NOT NULL | Primary agent for this record |
| user_id | UUID | NULLABLE | |
| event_id | VARCHAR(100) | NULLABLE | |
| payload | JSONB | NOT NULL | Full structured audit data |
| processing_time_ms | INTEGER | NULLABLE | |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | |

Indexes:

  • idx_audit_bet on (bet_id) WHERE bet_id IS NOT NULL
  • idx_audit_agent_time on (agent_id, created_at DESC)
  • idx_audit_type_time on (record_type, created_at DESC)

Simplification vs v2: No checksum or previous_checksum columns. No checksum chain. The JSONB payload is the truth. If you need tamper detection later, add checksums in a migration.

Table: dead_letter_queue

| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID | PRIMARY KEY | |
| source | VARCHAR(50) | NOT NULL | BET_PROCESSING, SETTLEMENT, HEDGE, RECONCILIATION |
| reference_id | UUID | NOT NULL | bet_id, settlement_id, etc. |
| error_message | TEXT | NOT NULL | |
| error_stack | TEXT | NULLABLE | |
| payload | JSONB | NOT NULL | Full context for retry |
| retry_count | INTEGER | NOT NULL, DEFAULT 0 | |
| max_retries | INTEGER | NOT NULL, DEFAULT 3 | |
| status | VARCHAR(20) | NOT NULL, DEFAULT 'PENDING' | PENDING, RETRYING, RESOLVED, ESCALATED |
| resolved_by | UUID | NULLABLE | |
| resolved_at | TIMESTAMPTZ | NULLABLE | |
| resolution_notes | TEXT | NULLABLE | |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | |
| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | |

Indexes:

  • idx_dlq_status on (status) WHERE status IN ('PENDING', 'RETRYING')
  • idx_dlq_reference on (reference_id)

Table: hedge_orders

| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID | PRIMARY KEY | |
| bet_id | UUID | NOT NULL | |
| event_id | VARCHAR(100) | NOT NULL | |
| market_id | VARCHAR(100) | NOT NULL | |
| selection | VARCHAR(255) | NOT NULL | |
| side | VARCHAR(10) | NOT NULL | BACK or LAY |
| target_price | DECIMAL(10,4) | NOT NULL | Price the punter received |
| requested_amount | BIGINT | NOT NULL | In paisa |
| filled_amount | BIGINT | NOT NULL, DEFAULT 0 | In paisa |
| betfair_bet_id | VARCHAR(100) | NULLABLE | Betfair's order reference |
| status | VARCHAR(20) | NOT NULL, DEFAULT 'QUEUED' | QUEUED, SENT, FILLED, FAILED, UNHEDGED |
| retry_count | INTEGER | NOT NULL, DEFAULT 0 | |
| error_message | TEXT | NULLABLE | |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | |
| sent_at | TIMESTAMPTZ | NULLABLE | When sent to Betfair |
| filled_at | TIMESTAMPTZ | NULLABLE | |

Indexes:

  • idx_hedge_bet on (bet_id)
  • idx_hedge_status on (status) WHERE status IN ('QUEUED', 'SENT')

Simplification vs v2: No reprice_count, max_reprice_attempts, time_in_force_seconds, limit_price, slippage, or average_fill_price. We send market orders, not limit orders. Retry 3 times. Mark as UNHEDGED on failure.

Table: alerts

| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID | PRIMARY KEY | |
| alert_type | VARCHAR(50) | NOT NULL | NO_NEW_RISK_ACTIVATED, LIMIT_APPROACHING, RECONCILIATION_MISMATCH, HEDGE_FAILED, DLQ_ENTRY, etc. |
| severity | VARCHAR(5) | NOT NULL | P1, P2, P3 |
| agent_id | UUID | NULLABLE | |
| title | VARCHAR(255) | NOT NULL | |
| description | TEXT | NOT NULL | |
| metadata | JSONB | NULLABLE | |
| status | VARCHAR(20) | NOT NULL, DEFAULT 'ACTIVE' | ACTIVE, ACKNOWLEDGED, RESOLVED |
| acknowledged_by | UUID | NULLABLE | |
| resolved_at | TIMESTAMPTZ | NULLABLE | |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | |

Indexes:

  • idx_alerts_active on (status) WHERE status = 'ACTIVE'
  • idx_alerts_agent on (agent_id, created_at DESC)

Note on removed tables vs v2:

  • No config_changelog table: config change audits go into audit_trail with record_type = 'CONFIG_CHANGED' and full old/new values in the JSONB payload.
  • No feature_flags table: use environment variables (e.g., FEATURE_CASCADING_ROUTING=true) or a simple JSON config file. Reload on deploy.
  • No reconciliation_results / reconciliation_discrepancies tables: reconciliation mismatches are written as alerts with alert_type = 'RECONCILIATION_MISMATCH' and details in the metadata JSONB.
  • No agent_hierarchy_history table: hierarchy changes go into audit_trail with record_type = 'CONFIG_CHANGED'.

34. API Design

Bet Placement APIs

| Method | Path | Description | Auth |
|---|---|---|---|
| POST | /api/v1/bets | Place a new bet | User JWT |
| GET | /api/v1/bets/:betId | Get bet details with routing info | User/Agent JWT |
| GET | /api/v1/bets | List bets with filters | Agent JWT |
| POST | /api/v1/bets/:betId/void | Void an open bet | Admin JWT |
| POST | /api/v1/bets/:betId/cancel | Cancel within window | User/Agent JWT |
| POST | /api/v1/bets/:betId/cash-out | Request cash-out offer | User JWT |
| POST | /api/v1/bets/:betId/cash-out/accept | Accept cash-out | User JWT |
| POST | /api/v1/bets/simulate | Dry-run (no money, full routing) | Agent JWT |

POST /api/v1/bets -- Request:

```json
{
  "user_id": "uuid",
  "event_id": "string",
  "market_id": "string",
  "selection": "MI to win",
  "side": "BACK",
  "stake": 1000000,
  "odds": 1.85,
  "market_type": "MATCH_ODDS",
  "sport_type": "CRICKET",
  "event_phase": "PRE_MATCH",
  "liquidity_band": "HIGH"
}
```

All monetary values in paisa. 1000000 paisa = 10,000 INR.

Response (success):

```json
{
  "bet_id": "uuid",
  "status": "ACCEPTED",
  "accepted_stake": 1000000,
  "stake_reduced": false,
  "potential_win": 850000,
  "message": null
}
```

Response (stake reduced):

```json
{
  "bet_id": "uuid",
  "status": "ACCEPTED_REDUCED",
  "accepted_stake": 588200,
  "original_stake": 1000000,
  "stake_reduced": true,
  "potential_win": 500000,
  "message": "Maximum stake at these odds: 5,882 INR"
}
```

Response (rejected):

```json
{
  "bet_id": null,
  "status": "REJECTED",
  "reason": "MARKET_SUSPENDED",
  "message": "This market is currently unavailable."
}
```

Agent Configuration APIs

| Method | Path | Description | Auth |
|---|---|---|---|
| GET | /api/v1/agents/:agentId | Get agent profile and config | Agent JWT |
| PATCH | /api/v1/agents/:agentId | Update agent settings | Agent JWT |
| GET | /api/v1/agents/:agentId/limits | Get all limits | Agent JWT |
| PUT | /api/v1/agents/:agentId/limits | Set/update limits | Agent JWT |
| GET | /api/v1/agents/:agentId/matrix | Get forwarding matrix rules | Agent JWT |
| POST | /api/v1/agents/:agentId/matrix/rules | Add a matrix rule | Agent JWT |
| PUT | /api/v1/agents/:agentId/matrix/rules/:ruleId | Update a rule | Agent JWT |
| DELETE | /api/v1/agents/:agentId/matrix/rules/:ruleId | Delete a rule | Agent JWT |
| POST | /api/v1/agents/:agentId/matrix/test | Test a bet against the matrix | Agent JWT |
| GET | /api/v1/agents/:agentId/exposure | Current exposure summary | Agent JWT |
| GET | /api/v1/agents/:agentId/exposure/:scope | Exposure for a specific scope | Agent JWT |
| POST | /api/v1/agents/:agentId/panic | Trigger panic mode | Agent JWT |
| GET | /api/v1/agents/:agentId/sub-agents | List sub-agents | Agent JWT |
| GET | /api/v1/agents/:agentId/trust-config | Get trust settings | Agent JWT |
| PUT | /api/v1/agents/:agentId/trust-config/:subAgentId | Update trust | Agent JWT |

POST /api/v1/agents/:agentId/matrix/rules -- Request:

```json
{
  "market_type": "FANCY",
  "sport_type": "CRICKET",
  "event_phase": "IN_PLAY",
  "source_type": "*",
  "liquidity_band": "*",
  "forward_percentage": 70.00
}
```

Response:

```json
{
  "rule_id": "uuid",
  "new_matrix_version": 48,
  "specificity": 3,
  "effective_immediately": true
}
```

User Management APIs

| Method | Path | Description | Auth |
|---|---|---|---|
| GET | /api/v1/users/:userId | Get user profile | Agent JWT |
| PATCH | /api/v1/users/:userId | Update user settings | Agent JWT |
| POST | /api/v1/users/:userId/override | Set user forward override | Agent JWT |
| DELETE | /api/v1/users/:userId/override | Remove override | Agent JWT |
| POST | /api/v1/users/:userId/classify | Set classification | Agent JWT |
| GET | /api/v1/users/:userId/bets | Bet history | Agent JWT |

Settlement APIs

| Method | Path | Description | Auth |
|---|---|---|---|
| POST | /api/v1/settlements/events/:eventId | Trigger settlement | System/Admin JWT |
| GET | /api/v1/settlements/events/:eventId | Settlement status | Agent JWT |
| POST | /api/v1/settlements/events/:eventId/reverse | Reverse settlement | Admin JWT |
| POST | /api/v1/settlements/events/:eventId/resettle | Re-settle with corrected results | Admin JWT |
| GET | /api/v1/settlements/agents/:agentId | Agent settlement history | Agent JWT |

POST /api/v1/settlements/events/:eventId -- Request:

```json
{
  "result": {
    "winner": "MI",
    "market_results": {
      "match_odds": { "winning_selection": "MI to win" },
      "fancy_180_runs": { "actual_value": 187, "line": 180 }
    }
  },
  "result_source": "OFFICIAL",
  "confirmed_by": "admin_uuid"
}
```

Admin APIs

| Method | Path | Description | Auth |
|---|---|---|---|
| POST | /api/v1/admin/agents | Create agent | Admin JWT |
| POST | /api/v1/admin/agents/:agentId/suspend | Suspend agent | Admin JWT |
| POST | /api/v1/admin/agents/:agentId/reactivate | Reactivate | Admin JWT |
| GET | /api/v1/admin/dead-letter-queue | View DLQ | Admin JWT |
| POST | /api/v1/admin/dead-letter-queue/:id/retry | Retry DLQ entry | Admin JWT |
| POST | /api/v1/admin/dead-letter-queue/:id/resolve | Resolve DLQ entry | Admin JWT |
| POST | /api/v1/admin/reconciliation/run | Manual reconciliation | Admin JWT |

Monitoring APIs

| Method | Path | Description | Auth |
|---|---|---|---|
| GET | /api/v1/monitoring/health | Health check | Public |
| GET | /api/v1/monitoring/alerts | Active alerts | Admin JWT |
| POST | /api/v1/monitoring/alerts/:id/acknowledge | Ack alert | Admin JWT |

35. The Bet Processing Pipeline (Step by Step)

This is the most important section: it traces every step from HTTP request to response.

Latency Budget (2-tier cache: Redis and PostgreSQL)

| Step | Budget | Description |
|---|---|---|
| 1. Request parsing & validation | 5ms | Parse JSON, Zod validate |
| 2. Timestamp & metrics | 1ms | Assign server timestamp, compute win/liability |
| 3. User win cap check | 8ms | Redis for aggregate, compute max stake |
| 4. Matrix version capture | 1ms | Read agent.matrix_version from Redis |
| 5. Forwarding resolution | 10ms | Precedence chain lookup |
| 6. Cascade routing (per level) | 12ms x N | Limit checks + retention calc at each level |
| 7. Position creation + ledger update | 15ms | DB transaction |
| 8. Hedge queue placement | 2ms | BullMQ add |
| 9. Audit trail write | 5ms | Async BullMQ job (non-blocking) |
| 10. Response | 2ms | HTTP response |
| Total (3-level cascade) | ~85ms | Within 90ms budget |

In the 2-tier design there is no in-process LRU cache: steps that previously hit the LRU now hit Redis instead (sub-millisecond reads). Total latency increases by roughly 3-5ms compared to the 3-tier approach, which is well within budget.

Step 1: Request Received (5ms)

What happens: HTTP POST arrives at /api/v1/bets. Express middleware parses JSON. Zod validates the schema: user_id is UUID, stake is positive integer, odds is positive decimal, side is BACK or LAY, all required fields present.

What can go wrong:

  • Malformed JSON: Return 400.
  • Missing or invalid fields: Return 400 with specific validation errors.

Error handling: Immediate rejection. No audit record. No further processing.
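
The Step 1 checks can be sketched as a dependency-free validator. Production uses Zod; this mirrors the same rules, and the function name and error strings are illustrative:

```typescript
const UUID_RE =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Returns the list of validation errors; an empty list means the request passes.
function validateBet(body: Record<string, unknown>): string[] {
  const errors: string[] = [];
  if (typeof body.user_id !== "string" || !UUID_RE.test(body.user_id))
    errors.push("user_id must be a UUID");
  if (!Number.isInteger(body.stake) || (body.stake as number) <= 0)
    errors.push("stake must be a positive integer (paisa)");
  if (typeof body.odds !== "number" || body.odds <= 0)
    errors.push("odds must be a positive decimal");
  if (body.side !== "BACK" && body.side !== "LAY")
    errors.push("side must be BACK or LAY");
  return errors;
}
```

On failure, the handler returns 400 with the collected errors, matching the "specific validation errors" behavior above.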

Step 2: Timestamp Assignment and Metrics Computation (1ms)

What happens:

2a. Assign a server-side timestamp: const processedAt = Date.now(). This is the authoritative time for all downstream decisions (period boundaries, ordering). Never trust the client timestamp.

2b. Compute financial metrics using integer arithmetic in paisa:

potential_win = Math.floor(stake * (odds - 1))
liability = potential_win (for a BACK bet from the bookie's perspective)

For a LAY bet, the liability is different:

// LAY bet: punter risks stake * (odds - 1), bookie's liability = stake
liability = stake (for a LAY bet)
potential_win = Math.floor(stake * (odds - 1))

2c. Determine period context. Convert processedAt to the originating agent's timezone. Check if the time falls within the agent's night period (closed-open interval: [night_start, night_end)). Set period_context to NIGHT or DAY.

What can go wrong: Nothing. Pure computation.
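
The Step 2 computations can be sketched as below. The minute-of-day representation and the wrap-around handling for night periods crossing midnight are assumptions; the document only specifies the closed-open interval:

```typescript
// All money in paisa (integers).
function betMetrics(stake: number, odds: number, side: "BACK" | "LAY") {
  const potentialWin = Math.floor(stake * (odds - 1));
  // BACK: the bookie's liability is the punter's potential win.
  // LAY: the bookie's liability is the stake.
  const liability = side === "BACK" ? potentialWin : stake;
  return { potentialWin, liability };
}

// Closed-open interval [nightStart, nightEnd), as minutes since local midnight.
function isNight(minuteOfDay: number, nightStart: number, nightEnd: number): boolean {
  if (nightStart < nightEnd) {
    return minuteOfDay >= nightStart && minuteOfDay < nightEnd;
  }
  // Period wraps past midnight, e.g. 22:00 -> 06:00.
  return minuteOfDay >= nightStart || minuteOfDay < nightEnd;
}
```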

Step 3: User Win Cap Check (8ms)

What happens:

3a. Per-click win cap: Compare potential_win against user.per_click_win_limit. If exceeded, compute the max allowable stake:

max_stake = Math.floor(per_click_win_limit / (odds - 1))

3b. Aggregate win cap: Read the user's accumulated potential wins for today from Redis key agg_win:{user_id}:{date}. If adding this bet's potential_win exceeds user.aggregate_win_limit_daily, compute:

remaining_win = aggregate_limit - current_accumulated
max_stake_from_aggregate = Math.floor(remaining_win / (odds - 1))

3c. Take the minimum of max_stake, max_stake_from_aggregate, and the original stake. If the result is less than user.min_stake, reject the bet entirely with "This market is currently unavailable at these odds."

3d. If the stake was reduced, set stake_reduction_reason to PER_CLICK_LIMIT or AGGREGATE_LIMIT.

3e. After accepting the bet, atomically increment the Redis aggregate counter: INCRBY agg_win:{user_id}:{date} {potential_win}. Set TTL to end of day if not already set.

What can go wrong:

  • Redis unavailable for aggregate check: Fall back to a PostgreSQL query -- SELECT SUM(potential_win) FROM bets WHERE user_id = $1 AND created_at >= $today. Adds ~15ms but is correct.
  • Two simultaneous bets both read the same aggregate before either increments: The 10% buffer on the aggregate limit absorbs this. The INCRBY is atomic, so the second bet will see the updated value for subsequent bets.

Error handling: If reduced stake is below minimum, return status: REJECTED, reason: BELOW_MINIMUM.
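
The reduction logic of 3a-3d can be sketched as one pure function; the field names are illustrative, not the production schema:

```typescript
interface WinCaps {
  perClickWinLimit: number;       // user.per_click_win_limit (paisa)
  aggregateWinLimitDaily: number; // user.aggregate_win_limit_daily (paisa)
  accumulatedWinToday: number;    // from agg_win:{user_id}:{date}
  minStake: number;               // user.min_stake (paisa)
}

type CapResult =
  | { status: "ACCEPTED"; stake: number }
  | { status: "ACCEPTED_REDUCED"; stake: number; reason: "PER_CLICK_LIMIT" | "AGGREGATE_LIMIT" }
  | { status: "REJECTED"; reason: "BELOW_MINIMUM" };

function applyWinCaps(stake: number, odds: number, caps: WinCaps): CapResult {
  const perClickMax = Math.floor(caps.perClickWinLimit / (odds - 1));
  const remainingWin = caps.aggregateWinLimitDaily - caps.accumulatedWinToday;
  const aggregateMax = Math.floor(remainingWin / (odds - 1));

  // Take the minimum of both caps and the requested stake.
  const allowed = Math.min(stake, perClickMax, aggregateMax);
  if (allowed < caps.minStake) return { status: "REJECTED", reason: "BELOW_MINIMUM" };
  if (allowed === stake) return { status: "ACCEPTED", stake };
  return {
    status: "ACCEPTED_REDUCED",
    stake: allowed,
    reason: perClickMax <= aggregateMax ? "PER_CLICK_LIMIT" : "AGGREGATE_LIMIT",
  };
}
```

At odds 50.00 with a 50,000 INR per-click limit, this floors 5,000,000 / 49 to 102,040 paisa -- the "stake reduced to ~1,020 INR" case used elsewhere in this document.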

Step 4: Matrix Version Capture (1ms)

What happens: Read the originating agent's matrix_version from Redis key agent:{agent_id}:matrix_version. Store it in the processing context as matrix_version_snapshot. This value will be saved on the bet record to enable deterministic replay.

What can go wrong:

  • Redis miss: Read from PostgreSQL (SELECT matrix_version FROM agents WHERE id = $1). Adds ~3ms.

Why this matters: If Rajesh changes his matrix after this bet is captured but before it finishes processing, the bet still uses the version it captured. During disputes, the audit record shows which version was used.

Step 5: Forwarding Percentage Resolution (10ms)

What happens: For each agent in the cascade (starting with the punter's direct agent), resolve the forwarding percentage using the 4-level precedence chain:

5a. User override: Check Redis key user_override:{user_id}:{agent_id}. If active and not expired, use its forward_percentage. Done.

5b. Market override: Check Redis key market_override:{agent_id}:{event_id}. If active and not expired, use its forward_percentage. Done.

5c. Matrix rule lookup: Load the agent's active rules from Redis key matrix:{agent_id}. This is a JSON array of all active rules. In application code, filter to rules where every non-wildcard dimension matches the bet's characteristics. From matching rules, pick the one with the highest specificity. If tied on specificity, pick the highest forward_percentage (risk-safe default). If still tied, pick the oldest rule (lowest created_at).

5d. Agent default: If no rule matches (should never happen if a catch-all wildcard rule exists), use agent.default_forward_percentage.

What can go wrong:

  • Agent has no matrix rules AND no default: Log a P2 alert. Use 100% forward (safest possible -- the agent retains nothing).
  • Redis cache miss for matrix rules: Load from PostgreSQL. On load, populate Redis key with TTL of 1 hour.
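
The 5c selection and tie-breaking can be sketched as below; the rule shape is illustrative:

```typescript
interface MatrixRule {
  id: string;
  // Dimensions such as market_type, sport_type, event_phase, source_type,
  // liquidity_band; "*" is a wildcard.
  dims: Record<string, string>;
  forwardPercentage: number;
  createdAt: number; // epoch ms
}

function pickRule(rules: MatrixRule[], bet: Record<string, string>): MatrixRule | null {
  const matching = rules.filter((r) =>
    Object.entries(r.dims).every(([k, v]) => v === "*" || v === bet[k])
  );
  if (matching.length === 0) return null;
  const specificity = (r: MatrixRule) =>
    Object.values(r.dims).filter((v) => v !== "*").length;
  // Most specific wins; tie -> highest forward % (risk-safe); tie -> oldest rule.
  return matching.sort(
    (a, b) =>
      specificity(b) - specificity(a) ||
      b.forwardPercentage - a.forwardPercentage ||
      a.createdAt - b.createdAt
  )[0];
}
```

A `null` result corresponds to the 5d fallback: agent default, or 100% forward with a P2 alert.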

Step 6: Cascade Routing -- Level by Level (12ms per level)

This is the heart of the system. The cascade engine walks up the agent hierarchy from the punter's agent to the platform.

For each level in the hierarchy:

6a. Determine source_type. Does this agent have their own classification for the user? Check user_classifications (cached in Redis). If yes, use it. If no, check agent_trust_config -- does this agent trust the downstream agent's flags? If yes, use the downstream classification. If no, default to NORMAL.

6b. Resolve forward percentage using the same precedence chain as Step 5, but for THIS agent's matrix with the resolved source_type.

6c. Calculate retention:

retained_stake = Math.floor(incoming_stake * (100 - forward_percentage) / 100)
forwarded_stake = incoming_stake - retained_stake
retained_liability = Math.floor(retained_stake * (odds - 1))

6d. Check all applicable limits. For this agent, read exposure from Redis for each relevant scope:

  • Sport limit: exposure:{agent_id}:SPORT:{sport_type}
  • Market limit: exposure:{agent_id}:MARKET:{event_id}
  • Night period limit: exposure:{agent_id}:NIGHT_PERIOD:{date}
  • Weekly period limit: exposure:{agent_id}:WEEKLY_PERIOD:{week}

For each limit, compare current_exposure + retained_liability against limit_amount. The most restrictive limit determines max retention.

The exposure check decision:

Read the exposure value from Redis. If the agent is far from their limit (say, below 80% utilization), trust the Redis value and proceed. If the agent is near their limit (above 80%), use PostgreSQL with SELECT ... FOR UPDATE to get an authoritative locked value. The 80% threshold is configurable per agent.

Why 80%? At 80% utilization, even if Redis is stale by a few bets, the remaining 20% headroom absorbs the error. At 95% utilization, staleness matters because a few extra bets could breach the limit.

6e. If all limits pass: Agent retains the calculated amount.

6f. If any limit would be breached: Agent retains only up to the most restrictive remaining capacity. The difference becomes overflow added to forwarded_stake.

Real example: Rajesh's matrix says retain 60% of Amit's 10,000 INR bet = 6,000 INR stake. But Rajesh's MI vs CSK match limit has only 3,000 INR remaining. Rajesh retains 3,000 INR. The extra 3,000 INR (overflow) is added to forwarded, making it 7,000 INR total forwarded to Vikram.
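
The retention and overflow arithmetic of 6c-6f can be sketched for one cascade level as follows, assuming remaining capacity is expressed as liability headroom from the most restrictive limit in 6d:

```typescript
// All amounts in paisa (integers).
function routeLevel(
  incomingStake: number,
  odds: number,
  forwardPercentage: number, // resolved via the precedence chain
  remainingCapacity: number  // liability headroom, most restrictive limit
) {
  let retainedStake = Math.floor((incomingStake * (100 - forwardPercentage)) / 100);
  let retainedLiability = Math.floor(retainedStake * (odds - 1));

  let overflow = 0;
  if (retainedLiability > remainingCapacity) {
    // Retain only up to remaining capacity; the rest cascades upward (6f).
    const cappedStake = Math.floor(remainingCapacity / (odds - 1));
    overflow = retainedStake - cappedStake;
    retainedStake = cappedStake;
    retainedLiability = Math.floor(retainedStake * (odds - 1));
  }
  return {
    retainedStake,
    retainedLiability,
    overflow,
    forwardedStake: incomingStake - retainedStake,
  };
}
```

At odds 2.0 (where liability equals stake) this reproduces the example above: retain 60% of 1,000,000 paisa = 600,000, cap at 300,000 of capacity, forward 700,000 total.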

6g. NO_NEW_RISK check: If the agent is in NO_NEW_RISK for any applicable scope (checked via Redis flag nnr:{agent_id}:{scope}), determine if this bet is a hedge:

worst_case_before = current worst-case liability across all outcomes
worst_case_after = worst-case liability if this bet's retention is added

If worst_case_after < worst_case_before: it is a hedge. ALLOW retention.
If worst_case_after >= worst_case_before: NOT a hedge. Forward 100%.

For a simple back bet on the same outcome as existing liability: never a hedge. For a back bet on the opposite outcome (or a lay bet): usually a hedge.
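
The 6g hedge test can be sketched as a worst-case comparison; the per-outcome liability representation is an assumption made for illustration:

```typescript
// Positive numbers = liability owed if that outcome wins.
// liabilityByOutcome: the agent's current net liability per outcome.
// betOutcomeDeltas: the liability delta this bet's retention would add per outcome.
function isHedge(
  liabilityByOutcome: Record<string, number>,
  betOutcomeDeltas: Record<string, number>
): boolean {
  const outcomes = Object.keys(liabilityByOutcome);
  const worstBefore = Math.max(...outcomes.map((o) => liabilityByOutcome[o]));
  const worstAfter = Math.max(
    ...outcomes.map((o) => liabilityByOutcome[o] + (betOutcomeDeltas[o] ?? 0))
  );
  // Strictly reduces worst-case liability -> hedge -> allow retention.
  return worstAfter < worstBefore;
}
```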

6h. Suspended agent: If agent.status = SUSPENDED, skip entirely. Forward 100% to the parent.

6i. Record the decision for this level in a local array (will become the audit payload).

6j. Forward remaining to next level. If the current agent is the platform, the remaining amount is the hedge amount.

What can go wrong:

  • Parent agent not found (broken hierarchy): P1 alert. Forward directly to the platform.
  • DB lock timeout on near-limit check: Retry once (most waits resolve in <20ms). If retry fails, assume limit breached, forward 100% from this level. Safe default.

Step 7: Position Creation and Ledger Update (15ms)

What happens: Write positions and update exposure ledgers. This is done sequentially, level by level, each level in its own database transaction.

For each level (e.g., Rajesh, then Vikram, then Platform):

7a. Begin a PostgreSQL transaction.

7b. Insert the RETAINED position row for this agent.

7c. Update the exposure_ledger rows for this agent:

```sql
UPDATE exposure_ledger
SET retained_open_liability  = retained_open_liability + $retained_liability,
    forwarded_open_liability = forwarded_open_liability + $forwarded_liability,
    open_potential_win       = open_potential_win + $potential_win,
    last_updated_at          = NOW()
WHERE agent_id = $agent_id
  AND scope_type = $scope_type
  AND scope_key = $scope_key;
```

If no row exists yet for this scope, insert one (upsert pattern).

7d. If the ledger update pushes the agent into NO_NEW_RISK territory (retained_open_liability >= limit), set no_new_risk_active = true and update Redis flag.

7e. Commit the transaction.

7f. After commit, update Redis exposure counter: SET exposure:{agent_id}:{scope_type}:{scope_key} {new_value}.

Why sequential per-level instead of one big transaction? No cross-agent locking. Rajesh's transaction does not block Vikram's. If Rajesh's write succeeds but Vikram's fails, Rajesh's position is committed. The bet is marked as routing_status = PARTIAL and the remaining levels are retried via the DLQ.

What can go wrong:

  • DB write fails at level 2 of 3: Level 1 positions are already committed. Mark bet as routing_status = PARTIAL. Queue a retry job for levels 2-3 in BullMQ. The retry reads the bet record to see which levels are done and resumes from the incomplete level.
  • Redis update fails after DB commit: Redis is stale until the next write or until reconciliation corrects it. Acceptable because the DB is the source of truth.
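
The PARTIAL-routing retry described above needs to know where to resume. A sketch of that resume logic, assuming the committed cascade levels can be read back from the positions written so far:

```typescript
// Given the total number of cascade levels for a bet and the levels whose
// positions already committed, return the levels the retry job must process.
// Level numbering follows positions.cascade_level (1 = first agent).
function levelsToRetry(totalLevels: number, committedLevels: number[]): number[] {
  const done = new Set(committedLevels);
  const remaining: number[] = [];
  for (let level = 1; level <= totalLevels; level++) {
    if (!done.has(level)) remaining.push(level);
  }
  return remaining;
}
```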

Step 8: Hedge Queue Placement (2ms)

What happens: If the platform's forwarded amount (the amount to be hedged on Betfair) is greater than zero, add a BullMQ job to the hedge-orders queue:

```json
{
  "bet_id": "uuid",
  "event_id": "string",
  "market_id": "string",
  "selection": "MI to win",
  "side": "BACK",
  "target_price": 1.85,
  "amount_paisa": 80000
}
```

This is async. The hedge does not block the bet response.

What can go wrong:

  • BullMQ/Redis unavailable: Write the hedge order directly to the hedge_orders table in PostgreSQL with status = QUEUED. A polling job picks up queued orders every 5 seconds.

Step 9: Audit Trail Write (async, non-blocking)

What happens: Add a BullMQ job to the audit-trail queue with the complete audit payload:

The payload includes:

  • Original bet details (all fields)
  • The forwarding chain: for each level, the incoming stake, the forward percentage source (USER_OVERRIDE, MATRIX_RULE, etc.), the rule that matched, the retention amount, every limit checked and its result, any overflow
  • The matrix_version_snapshot
  • The period context
  • Processing timestamps for each step

This write is intentionally async. The bet is confirmed to the punter before the audit record is persisted. If the audit write fails, the bet is still valid -- a reconciliation job detects orphaned bets without audit records and regenerates them from position data.

What can go wrong:

  • Audit job fails 3 times: Goes to the DLQ. P3 alert. Manual investigation. The bet itself is unaffected.

Step 10: Response to Punter (2ms)

What happens: Return the HTTP response. Emit a Socket.IO event to the agent's dashboard with the new bet.

What can go wrong:

  • Client already disconnected: The bet is processed. Client can query GET /api/v1/bets/:betId.
  • Socket.IO delivery failure: Non-critical. Dashboard polls every 5 seconds as a fallback.

36. The Settlement Pipeline (Step by Step)

Step 1: Event Result Received

An admin or automated feed posts the result to POST /api/v1/settlements/events/:eventId. The system validates the result, confirms the event is in a settleable state (LIVE or SUSPENDED), and enqueues a BullMQ settlement job.

Step 2: Market Resolution

For each market in the event, determine winners and losers:

| Market Type | Resolution Logic |
|---|---|
| MATCH_ODDS | Selection matching winning_selection = WIN. Others = LOSS. |
| FANCY (over/under) | actual_value >= line: OVER wins. actual_value < line: UNDER wins. |
| BOOKMAKER | Same as MATCH_ODDS. |

Step 3: Position Identification

Query all open positions for this event:

```sql
SELECT * FROM positions
WHERE event_id = :eventId AND status = 'OPEN'
ORDER BY bet_id, cascade_level;
```

Group by bet_id, then by agent_id.

Step 4: Per-Position Settlement (Idempotent)

For each position:

4a. Generate idempotency key: {position_id}_{sha256(event_result_json)}.

4b. Check if a settlement with this key exists. If yes, skip. Already settled.

4c. Determine WIN or LOSS from the market resolution.

4d. Calculate settlement:

  • WIN (punter wins, agent loses): payout = position.liability, profit_loss = -position.liability
  • LOSS (punter loses, agent wins): payout = 0, profit_loss = position.stake
  • VOID: payout = position.stake (return stake), profit_loss = 0

4e. In a transaction:

  • Insert settlement record.
  • Update position status to SETTLED.
  • Decrement exposure ledger: retained_open_liability -= position.liability (for RETAINED). forwarded_open_liability -= position.liability (for FORWARDED). open_potential_win -= position.potential_win.

4f. After commit, update Redis exposure values.

What can go wrong:

  • Idempotency key already exists: Skip. Safe by design.
  • DB failure mid-settlement: The failed position stays OPEN. A settlement monitor job picks it up and retries every 30 seconds. After 5 failures, it goes to the DLQ.
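
Steps 4a and 4d can be sketched as below. The key format follows the settlements.idempotency_key column; amounts are in paisa:

```typescript
import { createHash } from "node:crypto";

// {position_id}_{sha256(event_result_json)} -- same result JSON always yields
// the same key, which is what makes re-settlement attempts no-ops.
function idempotencyKey(positionId: string, eventResultJson: string): string {
  const hash = createHash("sha256").update(eventResultJson).digest("hex");
  return `${positionId}_${hash}`;
}

function settle(type: "WIN" | "LOSS" | "VOID", stake: number, liability: number) {
  switch (type) {
    case "WIN":  return { payout: liability, profitLoss: -liability }; // punter wins, agent pays
    case "LOSS": return { payout: 0, profitLoss: stake };              // punter loses, agent keeps stake
    case "VOID": return { payout: stake, profitLoss: 0 };              // stake returned
  }
}
```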

Step 5: NO_NEW_RISK Re-evaluation

After settling positions, check all agents who were in NO_NEW_RISK. If retained_open_liability < limit, clear the NO_NEW_RISK flag in both PostgreSQL and Redis. Fire a P3 alert: "Rajesh exited NO_NEW_RISK for cricket."

Step 6: Reconciliation Check

Run a targeted check for each affected agent: compare the ledger value to the sum of open positions. If they match, all good. If they differ, insert an alert with alert_type = RECONCILIATION_MISMATCH. Details in the metadata JSONB.

Step 7: Notifications

Mark the event as SETTLED. Emit Socket.IO events. Queue settlement summary notifications.


37. Configuration Management

How Matrix Rules Are Cached in Redis

When the system needs an agent's matrix rules (during bet processing), it follows this path:

  1. Check Redis key matrix:{agent_id}. If present, use it.
  2. If missing, load from PostgreSQL: SELECT * FROM forwarding_matrix_rules WHERE agent_id = $1 AND is_active = true.
  3. Store in Redis as matrix:{agent_id} with a 1-hour TTL.
  4. Return the rules.
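
The read-through path above, sketched with a Map standing in for Redis so the pattern is self-contained; in production the cache is the Redis key with a 1-hour TTL:

```typescript
interface Rule { id: string; forwardPercentage: number }

async function getMatrixRules(
  agentId: string,
  cache: Map<string, Rule[]>, // stands in for Redis
  loadFromDb: (agentId: string) => Promise<Rule[]>
): Promise<Rule[]> {
  const key = `matrix:${agentId}`;
  const cached = cache.get(key);
  if (cached) return cached;               // 1. cache hit
  const rules = await loadFromDb(agentId); // 2. miss -> PostgreSQL
  cache.set(key, rules);                   // 3. repopulate (TTL omitted here)
  return rules;                            // 4. return the rules
}
```

Invalidation is then just deleting the key: the next call falls through to PostgreSQL and repopulates.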

Cache Invalidation: Simple and Direct

When an agent's matrix changes (rule added, updated, or deleted):

  1. The API handler writes the change to PostgreSQL.
  2. Increment agents.matrix_version in the same transaction.
  3. After commit, delete Redis key matrix:{agent_id}.
  4. Delete Redis key agent:{agent_id}:matrix_version.

That is it. No pub/sub broadcast needed for v1 (single process). The next bet that needs this agent's matrix will find a cache miss, load from PostgreSQL, and repopulate Redis.

If you scale to multiple Node.js instances: Add a Redis pub/sub channel config.invalidate. After deleting the Redis key, publish a message {agent_id, type: "MATRIX"}. Each instance subscribes and can clear any in-process caches. But for v1 with one process, this is unnecessary.

What Gets Cached Where

| Data | Redis Key Pattern | TTL | Invalidation |
|---|---|---|---|
| Agent matrix rules | matrix:{agent_id} | 1 hour | Deleted on any rule change |
| Agent matrix version | agent:{agent_id}:matrix_version | 1 hour | Deleted on any rule change |
| Agent limits | limits:{agent_id} | 1 hour | Deleted on limit change |
| User override | user_override:{user_id}:{agent_id} | 1 hour | Deleted on override change |
| Market override | market_override:{agent_id}:{event_id} | 1 hour | Deleted on override change |
| Exposure counter | exposure:{agent_id}:{scope_type}:{scope_key} | No TTL | Updated on every write |
| NO_NEW_RISK flag | nnr:{agent_id}:{scope} | No TTL | Set/cleared on state change |
| User aggregate wins | agg_win:{user_id}:{date} | End of day | Incremented atomically |
| User classification | user_class:{user_id}:{agent_id} | 1 hour | Deleted on change |

38. Error Handling Patterns

Error Categories

| Category | Examples | Handling |
|---|---|---|
| Validation | Bad input, missing fields | Return 400. No processing. No audit. |
| Business rule | Suspended market, below-minimum stake | Return 200 with status: REJECTED. Audit record created. |
| Transient infrastructure | Redis timeout, DB pool exhausted | Retry up to 3 times with backoff (100ms, 200ms, 400ms). Fall back or 503. |
| Permanent infrastructure | DB down | Circuit breaker opens. All bets fall back to 100% forward (safe default). |
| Data integrity | Negative ledger, broken hierarchy | P1 alert. DLQ. Manual investigation. |
| External service | Betfair API error | Hedge queue absorbs it. Mark as UNHEDGED after 3 retries. |

Retry Policies

| Operation | Max Retries | Backoff | On Exhaustion |
|---|---|---|---|
| Redis read | 2 | 50ms, 100ms | Fall back to PostgreSQL |
| Redis write | 2 | 50ms, 100ms | Log warning; DB is source of truth |
| PostgreSQL read | 2 | 100ms, 200ms | Return 503 |
| PostgreSQL write | 1 | 100ms | DLQ the operation |
| Betfair API | 3 | 1s, 2s, 4s | Mark hedge order as UNHEDGED |
| Settlement position | 5 | 1s, 5s, 30s, 60s, 120s | DLQ |
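
Most of the retry policies double the delay from a base value; that pattern can be sketched as below, with the schedule split out as a pure, testable function (the helper names are illustrative):

```typescript
// backoffScheduleMs(100, 3) -> delays of 100ms, 200ms, 400ms.
function backoffScheduleMs(baseMs: number, maxRetries: number): number[] {
  return Array.from({ length: maxRetries }, (_, i) => baseMs * 2 ** i);
}

async function withRetries<T>(
  op: () => Promise<T>,
  baseMs: number,
  maxRetries: number
): Promise<T> {
  const delays = backoffScheduleMs(baseMs, maxRetries);
  for (let attempt = 0; ; attempt++) {
    try {
      return await op();
    } catch (err) {
      // Exhausted: rethrow so the caller can fall back, 503, or DLQ.
      if (attempt >= maxRetries) throw err;
      await new Promise((resolve) => setTimeout(resolve, delays[attempt]));
    }
  }
}
```

Note that the settlement-position schedule (1s, 5s, 30s, 60s, 120s) is hand-tuned rather than exponential, so it would be passed as an explicit delay list instead.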

Circuit Breakers

| Service | Opens After | Behavior When Open | Closes After |
|---|---|---|---|
| Redis | 5 consecutive failures in 10 seconds | All reads go to PostgreSQL. Writes go to PostgreSQL only. | 30 seconds of successful PostgreSQL reads |
| Betfair | 5 consecutive failures in 60 seconds | All hedge orders queue up. No new Betfair calls. | Probe every 30 seconds; close after 3 successes |

When Redis circuit breaker is open, the system still works -- just slower. Every Redis read becomes a PostgreSQL read (add ~5ms per read). This is the graceful degradation path.
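
A minimal circuit-breaker sketch consistent with the table: it opens after N consecutive failures inside a rolling window and re-closes after a cool-down. The injected clock and the cool-down mechanics are illustrative assumptions, not the production implementation:

```typescript
class CircuitBreaker {
  private failures: number[] = [];       // timestamps of consecutive failures
  private openedAt: number | null = null;

  constructor(
    private threshold: number,  // e.g. 5 consecutive failures
    private windowMs: number,   // e.g. 10_000 for Redis
    private cooldownMs: number, // how long to stay open
    private now: () => number = Date.now // injected so the logic is testable
  ) {}

  recordSuccess(): void {
    this.failures = []; // any success breaks the consecutive-failure streak
  }

  recordFailure(): void {
    const t = this.now();
    // Only failures inside the rolling window count toward the threshold.
    this.failures = this.failures.filter((f) => t - f <= this.windowMs);
    this.failures.push(t);
    if (this.failures.length >= this.threshold) this.openedAt = t;
  }

  isOpen(): boolean {
    if (this.openedAt === null) return false;
    if (this.now() - this.openedAt >= this.cooldownMs) {
      this.openedAt = null; // cool-down elapsed: allow traffic again
      this.failures = [];
      return false;
    }
    return true;
  }
}
```

While `isOpen()` returns true for Redis, every read takes the PostgreSQL fallback path described above.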

DLQ Integration

Any operation that exhausts retries creates a DLQ entry:

  • Source: BET_PROCESSING / SETTLEMENT / HEDGE
  • Reference ID: the bet_id or position_id
  • Error: message + stack trace
  • Payload: full context to retry manually
  • Status: PENDING

A P2 alert fires on every DLQ entry. Admins resolve via API: retry, void and refund, or manual override.


39. Testing Strategy

Unit Test Targets

| Module | Coverage | Key Scenarios |
|---|---|---|
| MatrixResolutionModule | 95% | Wildcard matching, specificity tie-breaking (most specific wins, then highest forward wins, then oldest wins), precedence chain order (user override > market > matrix > default), missing rules fallback to 100% forward |
| CascadeEngineModule | 95% | 2-level cascade, 4-level cascade, limit overflow cascading extra to upline, suspended agent skip, NO_NEW_RISK with hedge detection (back bet blocked, lay bet allowed) |
| LimitEnforcementModule | 95% | All 4 limit types, most restrictive wins, exact boundary (limit minus 1 paisa passes, limit exactly fails), period-aware checking |
| ExposureLedgerModule | 90% | Increment/decrement, upsert on first bet, Redis update after commit, Redis fallback to PostgreSQL |
| SettlementModule | 95% | WIN/LOSS/VOID settlement, idempotency (same key returns without re-processing), ledger decrement to zero, re-settlement |
| StakeReductionModule | 95% | Per-click reduction, aggregate reduction, below-minimum rejection, edge case odds (1.01 produces huge stake, 1000.00 produces tiny stake) |
| HedgeExecutionModule | 90% | Successful fill, Betfair error retry, 3x failure marks UNHEDGED |

Integration Test Scenarios

| Scenario | What It Tests | Expected Outcome |
|---|---|---|
| Full 3-level cascade bet | Amit bets 10,000 INR through Rajesh to Vikram to Platform | 3 retained positions created, all exposure ledgers incremented, audit record complete, hedge order queued |
| Limit overflow | Rajesh has 3,000 INR remaining on match limit, bet wants 6,000 INR retention | Rajesh retains 3,000, overflow of 3,000 forwarded to Vikram, Vikram processes normally |
| NO_NEW_RISK hedge | Rajesh in NO_NEW_RISK, lay bet arrives on the overexposed outcome | Lay bet accepted as hedge, Rajesh's worst-case liability decreases |
| NO_NEW_RISK non-hedge | Rajesh in NO_NEW_RISK, back bet arrives on the same outcome | Back bet forwarded 100% to Vikram |
| Stake reduction | Bet at odds 50.00 with 50,000 INR per-click limit | Stake reduced from 5,000 to 1,020 INR |
| Settlement cascade | Event settles, positions across 3 agents | All positions settled, ledgers decremented, NO_NEW_RISK cleared |
| Matrix version capture | Agent changes matrix between two bets | First bet uses old version, second uses new version, audit records differ |
| Concurrent bets near limit | 10 simultaneous bets at 95% utilization | All resolve, total does not exceed limit (PostgreSQL FOR UPDATE path) |
| Betfair timeout | Hedge order sent, Betfair returns 503 | Retried 3 times, marked UNHEDGED, bet still accepted |
| Redis outage | Redis killed mid-test | System falls back to PostgreSQL for all reads, latency increases, correctness maintained |
| Void cascade | Admin voids a bet with positions across the cascade | Every position reversed, every ledger decremented, NO_NEW_RISK re-evaluated |
| Cash-out | Amit requests cash-out at improved odds | Counter-positions created using original proportions (not current matrix), exposure reduced |
| Suspended agent | Vikram suspended, bet arrives for Rajesh | Rajesh's overflow skips Vikram, goes directly to Platform |
| Idempotent settlement | Settlement triggered twice for same event | Second invocation is a no-op (idempotency keys match) |
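The idempotent-settlement scenario above hinges on a settlement key that is recorded exactly once. A minimal sketch of the idea (names here are illustrative, not the real SettlementModule; the in-memory set stands in for a UNIQUE constraint in PostgreSQL):

```typescript
// Sketch of idempotent settlement: the key is derived from the event and
// outcome, so a second invocation for the same settlement finds the key
// already recorded and returns without touching ledgers.
type SettlementResult = { applied: boolean; key: string };

const processedKeys = new Set<string>(); // stand-in for a UNIQUE column in PostgreSQL

function settleEvent(eventId: string, outcome: "WIN" | "LOSS" | "VOID"): SettlementResult {
  const key = `${eventId}:${outcome}`;
  if (processedKeys.has(key)) {
    return { applied: false, key }; // duplicate trigger: no-op, ledgers untouched
  }
  processedKeys.add(key);
  // ... settle positions, decrement exposure ledgers, update bet statuses ...
  return { applied: true, key };
}
```

The first call applies the settlement; any retry with the same event and outcome short-circuits, which is what makes the "settlement triggered twice" test a safe no-op.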

Load Test Scenarios

| Scenario | Traffic | Success Criteria |
|---|---|---|
| Sustained peak | 167 bets/sec for 30 minutes | P99 < 90ms, zero errors |
| Burst spike | 500 bets/sec for 60 seconds | P99 < 200ms, error rate < 0.1% |
| Settlement storm | 3 events settle simultaneously (5,000 positions) | Completes within 5 minutes |

Chaos Test Scenarios

| Scenario | Simulation | Expected Behavior |
|---|---|---|
| Redis down | Kill Redis | Circuit breaker opens in 3 seconds. All reads fall back to PostgreSQL. No data loss. Bets accepted at higher latency. |
| PostgreSQL down | Kill PostgreSQL | All writes fail. 503 returned. Alert fires. Bets queued on client side or rejected. |
| Betfair down | Block outbound to Betfair | Hedge queue grows. Bets accepted. Platform absorbs risk. UNHEDGED alert fires. |
| Slow PostgreSQL | Add artificial 50ms latency | P99 increases. Some bets exceed budget. No data loss. |
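The Redis-down row can be sketched as a small circuit breaker. This is an illustrative, synchronous simplification (the production breaker wraps async Redis calls); class and parameter names are assumptions, not the real module:

```typescript
// Minimal circuit-breaker sketch: after `threshold` consecutive failures the
// breaker opens, and every call goes straight to the PostgreSQL fallback
// until `cooldownMs` elapses.
class CircuitBreaker<T> {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(private threshold = 3, private cooldownMs = 3000) {}

  call(primary: () => T, fallback: () => T, now = Date.now()): T {
    if (this.isOpen(now)) return fallback(); // open: skip Redis entirely
    try {
      const value = primary();
      this.failures = 0; // a success resets the failure streak
      return value;
    } catch {
      this.failures += 1;
      if (this.failures >= this.threshold) this.openedAt = now;
      return fallback(); // per-call fallback even while still closed
    }
  }

  isOpen(now = Date.now()): boolean {
    return this.openedAt !== null && now - this.openedAt < this.cooldownMs;
  }
}
```

Once `cooldownMs` passes, `isOpen` returns false and the next call probes Redis again, which gives the half-open recovery behavior implied by the chaos test.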

40. Deployment Strategy

Docker Compose for Local Dev

```yaml
services:
  app:
    build: .
    ports:
      - "3000:3000"
    environment:
      DATABASE_URL: postgresql://hannibal:secret@postgres:5432/hannibal
      REDIS_URL: redis://redis:6379
      BETFAIR_API_KEY: test_key
      NODE_ENV: development
    depends_on:
      - postgres
      - redis

  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: hannibal
      POSTGRES_USER: hannibal
      POSTGRES_PASSWORD: secret
    volumes:
      - postgres_data:/var/lib/postgresql/data
    ports:
      - "5432:5432"

  redis:
    image: redis:7
    ports:
      - "6379:6379"

volumes:
  postgres_data:
```

Three services. One command: `docker compose up`. Database migrations via `npx prisma migrate dev`.

Production Deployment

One Docker container running the Node.js application with all modules and BullMQ workers. Deploy behind nginx or a cloud load balancer for SSL termination.

| Component | Spec | Notes |
|---|---|---|
| App container | 2 vCPU, 4 GB RAM | Single instance to start. Add a second when P99 exceeds 90ms under load. |
| PostgreSQL | 4 vCPU, 16 GB RAM, SSD | Managed service (RDS, Cloud SQL, etc.). Automated backups. |
| Redis | 2 vCPU, 4 GB RAM | Managed service (ElastiCache, Redis Cloud). Persistence enabled. |

Feature Flag Rollout Process

Feature flags are environment variables:

```
FEATURE_CASCADING_ROUTING=false
FEATURE_USER_WIN_LIMITS=false
FEATURE_NO_NEW_RISK=false
FEATURE_HEDGE_EXECUTION=false
FEATURE_PERIOD_MANAGEMENT=false
```

Rollout steps:

  1. Deploy code with feature off (env var = false).
  2. Enable for a single test agent by adding agent-specific override in config.
  3. Monitor for 24-48 hours: check audit trails, ledger accuracy.
  4. Enable globally (set env var = true). Redeploy.
  5. After 2 weeks stable, remove the feature flag check from code.
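The flag check for steps 1, 2, and 4 can be sketched as a single helper. This is an assumed shape, not the real config module; in particular the `_AGENTS` override suffix is a hypothetical convention for the "single test agent" step:

```typescript
// Sketch of feature-flag resolution: a flag is on globally when its env var
// is "true", or for specific agents via a comma-separated per-agent override
// (e.g. FEATURE_CASCADING_ROUTING_AGENTS=agent_rajesh). In the app you would
// pass process.env as `env`.
function isFeatureEnabled(
  flag: string,
  agentId: string,
  env: Record<string, string | undefined>,
): boolean {
  if (env[flag] === "true") return true; // step 4: global enable
  const overrides = env[`${flag}_AGENTS`]; // step 2: per-agent rollout
  return overrides !== undefined && overrides.split(",").includes(agentId);
}
```

Because the override is read at call time, enabling a test agent is a config change with no redeploy, while the global switch still requires one (step 4).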

41. Implementation Phases

Phase 1: Foundation (Weeks 1-4)

Goal: All building blocks exist but are not connected into a pipeline.

| Week | Deliverables |
|---|---|
| 1 | Prisma schema migration: all tables. Seed data with test agents (Vikram, Rajesh, Priya) and test users (Amit, Sonia). Docker Compose setup. Basic Express app with health endpoint. |
| 2 | ExposureLedgerModule: PostgreSQL reads/writes, single counter per agent per scope (no sharding). Redis read/write for counters. LimitEnforcementModule: check all 4 limit types, return max retainable amount. |
| 3 | MatrixResolutionModule: full 5D wildcard matching with specificity tie-breaking. Precedence chain. ConfigModule: load matrix rules from DB, cache in Redis with delete-on-change invalidation. |
| 4 | UserManagementModule: per-click win cap check, aggregate win cap (Redis INCRBY). StakeReductionModule. AuditModule: BullMQ job that writes JSONB audit record. |

Phase 2: Core Pipeline (Weeks 5-8)

Goal: End-to-end bet placement and settlement through the full cascade.

| Week | Deliverables |
|---|---|
| 5 | CascadeEngineModule: N-level cascade with matrix resolution and limit checking per level. Overflow handling. Suspended agent skip. BetProcessingModule: orchestrates the full pipeline from HTTP request to response. |
| 6 | Position creation with per-level transactions. Exposure ledger updates atomic with positions. End-to-end bet placement through 3 levels. Full integration tests. |
| 7 | SettlementModule: event result processing, position settlement (idempotent), ledger decrement, bet status updates. |
| 8 | Void/cancellation state machine. Cash-out (counter-position using original proportions). Lay bet support. Feature flag: enable cascade per agent. |

End of Phase 2: The system can accept, route, settle, void, and cash out bets through the full cascade. This is the MVP.

Phase 3: Production Hardening (Weeks 9-12)

Goal: Ready for real traffic with safety mechanisms.

| Week | Deliverables |
|---|---|
| 9 | NO_NEW_RISK: automatic trigger, hedge detection, scoped activation. Period management: night/weekly periods, timezone handling, carry-forward logic. |
| 10 | HedgeExecutionModule: Betfair API integration, market orders, 3x retry, mark UNHEDGED on failure. BullMQ hedge queue. |
| 11 | Dead letter queue with admin API (list, retry, resolve). Nightly reconciliation: compare ledgers to position sums, alert on mismatch. |
| 12 | Circuit breakers for Redis and Betfair. Redis fallback to PostgreSQL. End-to-end chaos testing. Load testing at 167 bets/sec. |

End of Phase 3: Production-ready for controlled launch.

Phase 4: Intelligence (Weeks 13-16)

| Week | Deliverables |
|---|---|
| 13 | Sharp detection: CLV calculation over rolling 500-bet window, behavioral scoring, automatic classification. |
| 14 | Collusion detection: classification flip timing correlation, forwarded-flow P&L asymmetry detection. |
| 15 | Agent dashboard: real-time exposure, traffic light view, bet feed, settlement summary. Socket.IO push updates. |
| 16 | WhatsApp integration: scheduled status messages, interactive commands (status, stop, resume, panic). |

Phase 5: Scale-Up (When Data Demands It)

This phase has no fixed timeline. Each item is triggered by a measurable threshold, not a calendar date. See the Scale-Up Reference section below.


42. Scale-Up Reference

This section is the bridge between v1 (simple) and v2 (full architecture). Add complexity only when data proves you need it.

When to Add Each Optimization

| Optimization | Trigger Threshold | What to Do | Expected Impact |
|---|---|---|---|
| Application LRU cache | Redis P99 latency exceeds 5ms under sustained load | Add an in-memory LRU cache (e.g., `lru-cache` npm) with 5-second TTL in front of Redis for exposure reads and matrix lookups | Reduces Redis calls by ~60%. Drops P99 back below 2ms. Adds cache invalidation complexity. |
| Sharded exposure counters | PostgreSQL write contention on exposure_ledger exceeds 50ms P99 for hot agents | Add `shard_index` column to exposure_ledger. Use 8 shards per hot agent. Each write picks a random shard. Reads sum all shards. | Reduces lock contention by 8x. |
| Prometheus + Grafana | You have a dedicated ops person (or the team exceeds 5 engineers) | Add prom-client for metrics. Deploy Prometheus and Grafana. Build dashboards for bet latency, settlement time, hedge fill rate, DLQ depth. | Real-time operational visibility. Replaces log-grep debugging with dashboards. |
| Read replicas | Dashboard queries slow down bet processing (same PostgreSQL instance) | Add a PostgreSQL read replica. Route all dashboard/reporting queries to the replica. | Isolates read load from write path. |
| Multiple Node.js instances | Single process CPU exceeds 70% sustained during peak | Add a second instance behind a load balancer. BullMQ handles job distribution automatically. | Doubles throughput capacity. |
| Separate audit database | Audit trail table exceeds 100 GB or audit writes cause write contention | Move audit_trail to a separate PostgreSQL instance. Change audit write module to target the new connection. | Isolates audit I/O from transactional I/O. |
| Multi-currency | Expanding beyond India (Southeast Asia, Africa) | Add `base_currency` to agents, FX conversion at platform boundary, FX rate capture in audit trail. See v2 Section 34 for full design. | Enables Ghanaian Cedis, Thai Baht, etc. |
| Responsible gambling module | Entering regulated markets (UK, EU, Australia) | Add self-exclusion, session limits, reality checks, deposit limits. See v2 Section 46 for full design. | Regulatory compliance. |
| Message broker (Kafka) | Bet volume exceeds 1000 bets/sec sustained | Replace BullMQ for bet event streaming with Kafka. Keep BullMQ for background jobs. | Higher throughput, better backpressure, multi-consumer. |
| Microservice extraction | A specific module (e.g., hedge execution) needs independent scaling or a separate deployment cycle | Extract the module into its own service with its own repository, deployment, and scaling. Communicate via message queue. | Independent scaling and deployment. |
| Limit order hedging | Betfair slippage on market orders exceeds 2% average | Replace market orders with limit orders. Add reprice logic, time-in-force, partial fill tracking. See v2 Section 52 hedge_orders schema. | Better execution quality, less slippage. |
| Matrix version history table | Agents create more than 100 matrix versions and audit replay needs fast version lookup | Add a `matrix_versions` table storing immutable snapshots. Reference by `version_id` instead of version integer. | Faster historical replay, cleaner audit. |
| Reconciliation frequency | Nightly reconciliation is too slow to catch drift during IPL matches | Add 15-minute scheduled reconciliation. Add post-settlement targeted reconciliation. See v2 Section 36 for full design. | Faster drift detection. |

How to Measure Each Threshold

| Metric | How to Measure in v1 (without Prometheus) |
|---|---|
| Redis P99 latency | Pino log with `redis_latency_ms` on each call. Query logs: sort, take 99th percentile. |
| PostgreSQL write contention | Pino log with `db_write_ms` on exposure_ledger updates. Watch for spikes above 50ms. |
| CPU utilization | `process.cpuUsage()` logged every 60 seconds. Or `top` / container metrics from your hosting platform. |
| Bet volume | Count from Pino logs or a simple in-memory counter logged every minute. |
| Hedge slippage | `target_price - fill_price` logged on each hedge fill. Average over a rolling window. |
| Audit table size | `SELECT pg_total_relation_size('audit_trail')` run nightly. |
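The "sort, take 99th percentile" step from the first row is a one-liner once the `redis_latency_ms` values have been extracted from the logs. A minimal nearest-rank sketch (function name is illustrative):

```typescript
// Nearest-rank percentile over latency samples pulled from Pino logs:
// sort ascending, then index into the sorted array at ceil(p/100 * n) - 1.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const index = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[index];
}
```

Run over a window of, say, the last hour of samples, this gives the number to compare against the 5ms trigger threshold in the scale-up table.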

The Principle

Start simple. Measure everything. Add complexity only when a specific, measurable threshold is crossed. Every optimization in the v2 document is available as a well-documented upgrade path. You never need to redesign -- you add a layer, a shard, or a service.

The simplest architecture that works is the best architecture until it stops working. Then you have data to tell you exactly where to invest.



This document is maintained by the Hannibal engineering and product teams. For the full-scale architecture with all optimizations, refer to B-Book-Architecture-v2.md.