B-Book Architecture & Design Document v3.0 (Simplified)
System: Hannibal | Date: February 2026 | Status: Living Document | Audience: Product, Engineering, Operations, Stakeholders | Predecessor: v2.0 (preserved as scale-up reference)
Table of Contents
- Executive Summary
- The Core Problem -- In Plain English
- Simplification Philosophy
- The Forwarding Matrix -- The Brain of the System
- The Bet Flow -- Step by Step
- Cascading Upline Routing
- Agent Liability Limits & NO_NEW_RISK
- User Win Limits & Stake Reduction
- Period Definitions -- Night & Weekly
- Exposure Accounting (2-Tier: Redis + PostgreSQL)
- Audit Trail & Determinism
- Performance Architecture
- The Agent Experience
- Nightmare Scenarios & How We Handle Them
- Bet Cancellation / Void / Partial Settlement State Machine
- Matrix Version Control
- Dead Letter Queue and Poison Bet Handling
- Settlement Cascade & Failure Isolation
- Cash-Out / Early Settlement
- Lay Bet Support
- Collusion Detection
- Agent Hierarchy Migration
- Minimum Forwarding / Skin-in-the-Game Requirements
- Panic Button & Abuse Prevention
- Timestamp Security & Period Boundaries
- Sharp Detection via Multiple Accounts
- Rate Limiting on Configuration Changes
- Hedge Execution
- Reconciliation
- Monitoring & Alerting
1. Executive Summary
What is the B-Book?
The B-Book is Hannibal's hierarchical, deterministic risk-management and routing engine. It sits between a punter placing a bet and the final destination of that bet's risk.
In traditional bookmaking, a "B-Book" means the bookie keeps the bet on their own books. An "A-Book" means the bookie hedges on an exchange like Betfair. Hannibal's B-Book is a routing engine that decides, for every bet, what percentage stays at each level of an agent hierarchy and what percentage gets forwarded up the chain. It enforces limits automatically, cascades overflow intelligently, and maintains a complete audit trail.
What problem does it solve?
In sports betting agent networks across India, agents operate at different levels of a hierarchy. Risk allocation today happens through manual spreadsheets, WhatsApp groups, and phone calls. There is no enforcement, no automatic cap management, and no audit trail. Disputes have no source of truth.
The B-Book automates all of this with a configurable, enforceable, auditable system.
v3 Simplification
This document is a deliberate simplification of the v2.0 architecture for Phase 1 launch. We removed premature optimizations, eliminated unnecessary infrastructure layers, and focused on correctness over throughput. The v2.0 document is preserved as a scale-up reference when we outgrow these simpler designs.
What changed: 2-tier caching instead of 3. Simple atomic counters instead of sharded exposure. A version integer instead of full MVCC. Sequential writes instead of distributed locking. Market orders for hedging instead of limit orders. No multi-currency. No responsible gambling module. Nightly reconciliation instead of every 15 minutes. Structured logging instead of Prometheus/Grafana.
What did NOT change: The forwarding matrix, cascading routing, exposure ledger design, NO_NEW_RISK, void/cancellation state machine, dead letter queue, settlement idempotency, cash-out, lay bets, collusion detection design, panic button, agent migration, minimum forwarding requirements, timestamp security, sharp detection design, and rate limiting.
2. The Core Problem -- In Plain English
The Agent Hierarchy: A Real Example
+-----------------------+
| Betfair Exchange |
| (External Hedge) |
+-----------+-----------+
|
+-----------+-----------+
| HANNIBAL |
| (The Platform) |
+-----------+-----------+
|
+-----------+-----------+
| VIKRAM |
| Master Agent, Delhi |
| Manages 12 sub-agents|
+-----------+-----------+
|
+-----------------+-----------------+
| |
+-----------+-----------+ +-----------+-----------+
| RAJESH | | PRIYA |
| Sub-Agent, Mumbai | | Sub-Agent, Bangalore |
| 200 cricket punters | | 150 football punters |
+-----------+-----------+ +-----------------------+
|
+---------+---------+
| AMIT | SONIA | ... 198 more punters
| Punter | Punter |
+---------+---------+
Vikram is a master agent in Delhi with 15 years of experience and a strong bankroll. Rajesh is one of his sub-agents in Mumbai, managing 200 cricket punters with a moderate bankroll. Amit is one of Rajesh's punters placing bets of 500 to 50,000 on IPL matches.
When Amit Places a Bet
Amit places 10,000 on Mumbai Indians to win at odds 1.85. Here is what must happen in roughly 90 milliseconds:
- Can Amit place this bet? Check his per-click win limit.
- How much does Rajesh keep? The forwarding matrix says 60%. Rajesh keeps 6,000.
- Can Rajesh afford that? Check his cricket limits, per-match limits, night limit.
- The other 4,000 goes to Vikram. Vikram's matrix keeps 60%, forwards 40%.
- The remaining 1,600 reaches the platform. Hannibal retains some, hedges the rest on Betfair.
- Record everything.
The Fundamental Questions
| Question | Who Answers It |
|---|---|
| How much risk does Rajesh keep? | Rajesh's forwarding matrix + his limits |
| How much risk does Vikram keep? | Vikram's forwarding matrix + his limits |
| How much risk does the platform keep? | Platform's risk configuration |
| How much gets hedged on Betfair? | Whatever remains after all agents have taken their share |
| What if someone's limits are breached? | Overflow cascades up to the next level |
| What if Betfair is unavailable? | Platform absorbs as retained risk, retries asynchronously |
3. Simplification Philosophy
The Guiding Principle
Naive, then optimize. Build the simplest correct thing. Measure. Optimize only what measurements prove is too slow. The v2.0 document was designed by experienced architects who anticipated scale problems. Those problems are real but they are not Phase 1 problems. Solving them prematurely introduces complexity that slows development, increases bugs, and makes the system harder to reason about.
What Was Simplified and Why
| Simplification | v2.0 Design | v3.0 Design | Why |
|---|---|---|---|
| Caching | 3-tier: Application LRU (5s TTL) -> Redis -> PostgreSQL | 2-tier: Redis -> PostgreSQL | LRU adds cache coherency problems across instances. Redis is sub-millisecond and shared. Removing LRU eliminates pub/sub invalidation, multi-instance divergence, and an entire class of bugs. Adds ~1-2ms latency, still well within 90ms budget. |
| Exposure Counters | Sharded counters (8-16 shards per agent per scope) with random shard selection and summation | One Redis key per agent per scope. INCRBY for atomic updates. Write-behind to PostgreSQL. | Redis handles 100K+ ops/sec on a single key. We expect ~60 bets/sec for the busiest agent. Sharding is unnecessary until we 10x our traffic. No shard summation, no random selection, no complexity. |
| Matrix Versioning | Full MVCC with version history, garbage collection, version-aware caching, compressed retention tiers | Simple integer version on the matrix. Bet captures version at start. If version changed by position write time, re-process with new matrix. | Matrix changes are rare (5-10 per day per agent). The probability of a collision between a matrix change and a bet in flight is astronomically low. When it happens, re-processing one bet is trivial. No version storage, no GC, no version-aware cache logic. |
| Write Atomicity | Cross-agent distributed locking OR per-level atomicity with partial routing states | Sequential writes: Level 1, then Level 2, then Platform. Each level is its own transaction. Mid-cascade failure goes to DLQ. | The cascade is 2-3 levels deep and each write takes 5-15ms. Sequential execution adds 10-30ms total but eliminates deadlocks, cross-agent locking, and partial routing states. If Level 2 fails after Level 1 succeeds, the bet enters the DLQ for manual resolution. This is acceptable because it is extremely rare. |
| Hedge Execution | Limit orders, configurable max slippage, partial fill management, re-pricing (3 attempts), execution quality analytics | Market orders on Betfair. Accept the fill price. Retry 3 times on failure, then mark as unhedged and alert. Log theoretical vs actual spread for future optimization. | For Phase 1, we are hedging small amounts. The slippage difference between a market order and an optimized limit order strategy on these amounts is negligible. Building a sophisticated execution engine before we have meaningful hedge volume is premature. |
| Multi-Currency | Full FX conversion at currency zone boundaries, FX rate capture, FX risk management, FX reserve buffers | Everything in INR. No FX. | India-only launch. When we expand internationally, we add currency support then. |
| Responsible Gambling | Self-exclusion, deposit limits, session time limits, reality check notifications | Not included in v3. | B2B agent model, not B2C. Add when entering regulated B2C markets. |
| Reconciliation | Every 15 minutes (scheduled), post-settlement, full, targeted. Auto-correction for minor drift. Drift tracking over time. | One nightly job: compare ledger values to SUM of open positions. If mismatch, alert ops team. Manual fix. | Our bet volume in Phase 1 does not warrant 15-minute checks. Nightly is sufficient to catch any drift before it compounds. If a ledger diverges by a meaningful amount, a human should investigate the root cause rather than auto-correcting. |
| Monitoring | Prometheus + Grafana + AlertManager. Per-pipeline-stage histograms. Ops dashboards. | Structured logging (Pino) with key metrics. Simple health endpoint. Slack/email alerts for critical failures. | Prometheus/Grafana is infrastructure that must be deployed, configured, and maintained. For Phase 1 with 3 app instances, structured logs with grep/jq are sufficient. Add Prometheus when operational complexity warrants it. |
| Audit Trail | 3-tier hot/warm/cold storage. Checksum chains. Separate database. Nightly tier migration. | JSONB rows in PostgreSQL, one table, partitioned by month. Delete partitions older than 6 months. | Our first-season audit volume is ~18 GB. This fits comfortably in a single PostgreSQL instance. Cold storage, compression, and migration jobs are unnecessary overhead for a year. |
What This Means for the Latency Budget
| Step | v2.0 Budget | v3.0 Budget | Change |
|---|---|---|---|
| Request parsing & validation | 5ms | 5ms | Same |
| Metrics computation | 3ms | 3ms | Same |
| User win cap check | 5ms | 5ms | Same |
| Stake reduction | 2ms | 2ms | Same |
| Matrix resolution (from Redis, not LRU) | 10ms | 12ms | +2ms (Redis vs LRU) |
| Agent cap evaluation (per level, sequential) | 10ms x N | 12ms x N | +2ms per level (Redis vs LRU) |
| Position creation (sequential per level) | 15ms | 20ms | +5ms (sequential vs parallel) |
| Exposure ledger update (Redis INCRBY + async PG write) | 10ms | 8ms | -2ms (simpler, no sharding) |
| Audit record creation | 10ms | 10ms | Same |
| Response | 5ms | 5ms | Same |
| Total (3-level cascade) | ~85ms | ~92ms | +7ms, still inside the ~110ms total budget |
The simplified architecture adds roughly 7ms to the critical path. We have 20ms of headroom before user experience degrades. This is acceptable.
4. The Forwarding Matrix -- The Brain of the System
What It Is
The forwarding matrix is a multi-dimensional lookup table that determines what percentage of each bet an agent retains versus forwards to their upline. It is the single most important configuration in the entire B-Book system.
The 5 Dimensions
| Dimension | What It Means | Example Values |
|---|---|---|
| market_type | The type of bet | MATCH_ODDS, FANCY, BOOKMAKER, OVER_UNDER, LINE |
| sport_type | Which sport | CRICKET, FOOTBALL, TENNIS, KABADDI |
| event_phase | When in the event lifecycle | PRE_MATCH, IN_PLAY, APPROACHING_START |
| source_type | What kind of punter | NORMAL, SHARP, VIP, NEW_ACCOUNT |
| liquidity_band | Exchange liquidity available to hedge | HIGH, MEDIUM, LOW, NONE |
Wildcard Matching
An agent does not need a rule for every combination. Wildcards (*) mean "match anything."
Example of Rajesh's forwarding matrix:
| Rule | market_type | sport_type | event_phase | source_type | liquidity_band | Forward % |
|---|---|---|---|---|---|---|
| R1 | FANCY | CRICKET | IN_PLAY | SHARP | * | 95% |
| R2 | FANCY | CRICKET | IN_PLAY | * | * | 70% |
| R3 | MATCH_ODDS | CRICKET | PRE_MATCH | * | HIGH | 40% |
| R4 | MATCH_ODDS | CRICKET | PRE_MATCH | * | LOW | 70% |
| R5 | MATCH_ODDS | CRICKET | IN_PLAY | * | * | 60% |
| R6 | * | CRICKET | * | SHARP | * | 90% |
| R7 | * | FOOTBALL | * | * | * | 80% |
| R8 | * | * | * | * | * | 50% |
R1: Sharp user, in-play cricket fancy -- forward 95%. Rajesh keeps only 5% because this is the most dangerous bet type. R3: Pre-match cricket match odds with high liquidity -- forward only 40%. Rajesh keeps 60% because these are safe, well-priced bets. R7: Any football bet -- forward 80%. Rajesh is not a football expert. R8: Catch-all. Forward 50%.
Tie-Breaking Rules (Deterministic)
When a bet matches multiple rules:
- Most specific rule wins. Specificity = count of non-wildcard dimensions. R1 (4 specific dimensions) beats R2 (3 specific dimensions).
- If specificity is equal, higher forward percentage wins. This is the risk-safe default: when in doubt, forward more.
- If forward percentage is also equal, oldest rule wins. Deterministic ordering by creation timestamp.
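The three tie-break steps collapse into a single comparator. A minimal sketch, assuming a hypothetical `MatrixRule` shape with the five dimensions stored as an array (not the production schema):

```typescript
// Deterministic rule selection -- illustrative only; MatrixRule is a
// hypothetical shape, not the production schema.
type MatrixRule = {
  id: string;
  dims: string[];       // the 5 dimensions in order; "*" means wildcard
  forwardPct: number;   // forward percentage, 0-100
  createdAt: number;    // creation timestamp (epoch millis)
};

// Specificity = count of non-wildcard dimensions.
const specificity = (r: MatrixRule): number =>
  r.dims.filter((d) => d !== "*").length;

// Among the rules that matched a bet:
// 1) most specific wins; 2) then higher forward % (risk-safe default);
// 3) then the oldest rule (deterministic ordering by creation time).
function pickRule(matched: MatrixRule[]): MatrixRule {
  return [...matched].sort(
    (a, b) =>
      specificity(b) - specificity(a) ||
      b.forwardPct - a.forwardPct ||
      a.createdAt - b.createdAt
  )[0];
}
```

With Rajesh's matrix, `pickRule` given R1 and R2 returns R1 (four specific dimensions beats three), matching the worked example above.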
Resolution Precedence Chain
The matrix is not the only routing input. There is a four-level precedence chain:
| Level | What It Is | When to Use It |
|---|---|---|
| User Override | A specific forward % for a specific punter | "This user is a known sharp -- forward 100% of their bets" |
| Market Override | A specific forward % for a specific event/market | "The CSK vs MI final is too big -- forward 90% of everything" |
| Matrix Rule | The multi-dimensional lookup | Normal day-to-day operations |
| Agent Default | A single fallback percentage | "If nothing else matches, forward 50%" |
Sensible Defaults by Sport
| Scenario | Recommended Retention | Why |
|---|---|---|
| Cricket in-play fancy | 10-20% | Highest variance, hardest to price |
| Cricket pre-match match odds | 40-60% | Well-priced, ample liquidity |
| Cricket in-play match odds | 20-40% | More volatile than pre-match |
| Football Premier League pre-match | 50-70% | Deep liquidity, well-understood markets |
| Football lower leagues | 20-40% | Less information, integrity risk |
| Tennis | 10-25% | Volatile, retirement risk, low liquidity |
| Kabaddi | 5-15% | Thin markets, poor external pricing |
Common Mistakes and Prevention
| Mistake | How We Prevent It |
|---|---|
| Retention too high on in-play fancies | Warn when retention exceeds recommended range; require confirmation |
| No catch-all rule | System requires a * / * / * / * / * fallback at all times |
| Conflicting rules they do not understand | Dashboard shows which rule matched for every bet; "test my matrix" dry-run tool |
5. The Bet Flow -- Step by Step
The Complete Flow
Walking Through the Numbers
The Bet: Amit places 10,000 on MI to beat CSK at odds 1.85, pre-match IPL 2026.
Step 1: Compute Metrics
| Metric | Calculation | Value |
|---|---|---|
| Stake | As submitted | 10,000 |
| Potential Win | 10,000 x 0.85 | 8,500 |
| Liability | Same as potential win for a back bet | 8,500 |
Step 2: User Win Cap Check. Amit's per-click limit is 50,000. Potential win 8,500 is under. Aggregate daily limit is 2,00,000. Current accumulated: 45,000 + 8,500 = 53,500, still under. Pass.
Step 3: Resolve Forwarding Percentage. Bet characteristics: MATCH_ODDS, CRICKET, PRE_MATCH, NORMAL, HIGH liquidity. Matches Rule R3: forward 40%. Rajesh retains 60%.
Step 4: Agent Cap Evaluation (Rajesh). Rajesh retains 60% of 10,000 = 6,000 stake (5,100 liability). Check limits:
- Cricket sport limit: 12,05,100 / 50,00,000. Pass.
- MI vs CSK match limit: 1,25,100 / 5,00,000. Pass.
- Night period limit: 3,05,100 / 10,00,000. Pass.
All pass. Rajesh retains full 6,000.
Step 5: Cascade to Vikram. 4,000 flows to Vikram. His matrix says retain 60% of cricket pre-match match odds. Retains 2,400, forwards 1,600.
Step 6: Platform. Receives 1,600. Retains 800, hedges 800 on Betfair.
Step 7: Execute. Positions created sequentially -- Rajesh first, then Vikram, then Platform.
| Entity | Retained Stake | Retained Liability | Forwarded |
|---|---|---|---|
| Rajesh | 6,000 | 5,100 | 4,000 to Vikram |
| Vikram | 2,400 | 2,040 | 1,600 to Platform |
| Platform | 800 | 680 | 800 to Betfair |
| Total | 9,200 (+ 800 hedged = 10,000) | 7,820 | |
The stake always sums to the original 10,000. Risk is distributed, never created or destroyed.
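The split can be reproduced mechanically. A minimal sketch, assuming a hypothetical `cascade` helper that takes per-level retention fractions (1 minus forward %); whatever remains after the last level is what gets hedged:

```typescript
// Worked cascade: each level retains its share of the flowing stake and
// forwards the rest. The retained amounts plus the hedged remainder must
// always sum back to the original stake -- risk is distributed, never
// created or destroyed. Amounts in rupees; `cascade` is a hypothetical name.
function cascade(
  stake: number,
  retentionByLevel: number[] // e.g. [0.6, 0.6, 0.5] for Rajesh/Vikram/Platform
): { retained: number[]; hedged: number } {
  const retained: number[] = [];
  let flowing = stake;
  for (const r of retentionByLevel) {
    const keep = Math.round(flowing * r);
    retained.push(keep);
    flowing -= keep;
  }
  return { retained, hedged: flowing }; // remainder goes to the exchange
}
```

`cascade(10000, [0.6, 0.6, 0.5])` reproduces the table above: retained 6,000 / 2,400 / 800 with 800 hedged, summing back to the original 10,000.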
Step 8: Audit. Complete audit record persisted as a JSONB row in the audit table: original bet, matrix rule matched at each level, every limit check, final routing breakdown, timestamps.
6. Cascading Upline Routing
How Bets Flow Through the Hierarchy
A bet enters at the bottom and flows upward. At each level, the agent retains what they can and forwards the rest.
User (Amit) places 10,000 bet
|
v
+---[RAJESH: Level 1]---+
| Forward 40% |
| Retains: 6,000 |
| Forwards: 4,000 |
+---------|---------------+
|
v
+---[VIKRAM: Level 2]---+
| Forward 40% |
| Retains: 2,400 |
| Forwards: 1,600 |
+---------|---------------+
|
v
+---[PLATFORM]----------+
| Retains: 800 |
| Hedges: 800 |
+---------|---------------+
|
v
+---[BETFAIR]------------+
| Receives: 800 |
+-------------------------+
What Happens at Each Level
At every level the system performs:
- Resolve source_type for this agent (own classification > downstream trust > default NORMAL)
- Resolve forwarding percentage (precedence chain)
- Calculate retained amount = incoming stake x (1 - forward %)
- Check agent's limits
- If limits allow: retain, forward the rest
- If limits breached: retain only up to the limit, forward everything else (overflow)
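The retain-up-to-limit behaviour can be sketched per level. This is illustrative, assuming a hypothetical `routeLevel` helper and that the most restrictive limit's remaining capacity has already been computed:

```typescript
// Per-level routing with overflow. The agent retains
// min(intended retention, remaining limit capacity); anything above the
// cap overflows into the forwarded amount. The cascade never drops a bet.
// `routeLevel` is a hypothetical name; amounts in rupees.
function routeLevel(
  incoming: number,       // stake arriving at this level
  forwardPct: number,     // resolved from the precedence chain, 0-100
  limitRemaining: number  // most restrictive limit's remaining capacity
): { retained: number; forwarded: number } {
  const intended = Math.round(incoming * (1 - forwardPct / 100));
  const retained = Math.min(intended, Math.max(0, limitRemaining));
  return { retained, forwarded: incoming - retained };
}
```

With the overflow example below, `routeLevel(50000, 40, 25000)` intends to retain 30,000 but caps at 25,000, forwarding 25,000 (the intended 20,000 plus 5,000 overflow).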
Overflow Example
Amit bets 50,000 on CSK at 2.10. Potential win: 55,000.
RAJESH (L1) - Forward 40%
Wants to retain: 30,000 (60%)
Cricket match limit remaining: 25,000 <-- LIMIT HIT
Actually retains: 25,000
Forwards: 25,000 (intended 20,000 + 5,000 overflow)
|
v
VIKRAM (L2) - Forward 40%
Receives: 25,000
Retains: 15,000 (60%), all limits OK
Forwards: 10,000
|
v
PLATFORM: Retains 5,000, Hedges 5,000
Vikram did not know 5,000 was overflow. He simply received 25,000 and processed normally. The cascade never drops a bet.
Suspended Agents
If Vikram is suspended, bets skip him. Rajesh's forwarded amount goes directly to the platform. Punters can still bet; risk routes around the suspended agent.
The Betfair Backstop
After all agents take their share, any remaining exposure reaches the platform, which can hedge on Betfair. If Betfair is down: the platform absorbs the risk temporarily, the bet is still accepted, and the hedge order enters a retry queue. Punter experience is never degraded by hedge-side issues.
Does Sharp Classification Travel Upline?
The information travels, but each agent decides independently.
Each forwarded bet carries metadata: originating_user_id, downstream_classification, forwarding_reason. Each upline agent resolves source_type independently:
Each agent configures trust per sub-agent:
| Sub-Agent | trust_downstream_flags | Reason |
|---|---|---|
| Rajesh | true | Experienced, 8 years track record |
| Arun | false | New, only 3 months, unproven detection |
7. Agent Liability Limits & NO_NEW_RISK
The Limit Structure
| Limit Type | Scope | Example |
|---|---|---|
| Sport Limit | Total liability across all events in a sport | "50 lakh total cricket exposure" |
| Market Limit | Total liability on a specific event/market | "No more than 5 lakh on any single IPL match" |
| Night Period Limit | Total liability during the night window | "Cap my night session at 10 lakh" |
| Weekly Period Limit | Total liability during the weekly cycle | "Cap my weekly exposure at 1 crore" |
How Limits Interact: Most Restrictive Wins
All applicable limits are checked simultaneously. The most restrictive determines retention.
Example: Thursday night during IPL. Rajesh's state:
| Limit | Capacity | Used | Remaining |
|---|---|---|---|
| Cricket Sport | 50,00,000 | 38,00,000 | 12,00,000 |
| MI vs CSK Match | 5,00,000 | 4,50,000 | 50,000 |
| Night Period | 10,00,000 | 9,20,000 | 80,000 |
| Weekly Period | 40,00,000 | 35,00,000 | 5,00,000 |
A bet wants to add 1,00,000 of retained liability. The match limit (50,000) is most restrictive. Rajesh retains only 50,000; the other 50,000 overflows to Vikram.
NO_NEW_RISK Mode
NO_NEW_RISK activates when retained liability reaches the limit for a given scope. The agent cannot take on new risk-increasing exposure, but hedge bets are still accepted.
Think of it like a credit card limit. You cannot make new purchases, but you can make payments.
Scope is granular:
| Scenario | Cricket | Tennis | Football |
|---|---|---|---|
| Rajesh hits cricket limit | NO_NEW_RISK | Normal | Normal |
| Rajesh hits MI vs CSK match limit | NO_NEW_RISK (this match only) | Normal | Normal |
| Rajesh hits night period limit | NO_NEW_RISK (all sports) | NO_NEW_RISK | NO_NEW_RISK |
Hedge detection rule:
If WorstCaseLiability AFTER the bet < WorstCaseLiability BEFORE the bet, it is a hedge. Allow it.
A bet on the opposite outcome of an existing position is almost always a hedge. A bet on the same outcome is never a hedge. A bet on a third outcome may or may not be depending on amounts and odds.
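The rule can be made concrete on a two-outcome market. A sketch under simplifying assumptions -- back bets only, and the `Position` shape and helper names are hypothetical:

```typescript
// Hedge detection via worst-case liability. For each outcome, the net
// liability is the payout owed if that outcome wins (stake * (odds - 1))
// minus the stakes collected on bets backing the other outcomes.
type Position = { outcome: string; stake: number; odds: number };

function worstCaseLiability(positions: Position[], outcomes: string[]): number {
  return Math.max(
    ...outcomes.map((o) => {
      let net = 0;
      for (const p of positions) {
        net += p.outcome === o ? p.stake * (p.odds - 1) : -p.stake;
      }
      return net;
    })
  );
}

// The rule from the text: a bet is a hedge if worst-case liability AFTER
// the bet is strictly lower than worst-case liability BEFORE it.
function isHedge(existing: Position[], bet: Position, outcomes: string[]): boolean {
  return worstCaseLiability([...existing, bet], outcomes) <
         worstCaseLiability(existing, outcomes);
}
```

Given an existing 10,000 back of MI at 1.85 (worst case 8,500), a 5,000 back of CSK at 2.10 drops the worst case to 3,500 and is allowed under NO_NEW_RISK; another MI bet raises it and is rejected.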
Exiting NO_NEW_RISK: Settlements reduce exposure, hedge bets reduce exposure, or an admin raises the limit.
8. User Win Limits & Stake Reduction
Per-Click Win Limit
Caps the maximum a punter can win on a single bet. If Amit has a per-click limit of 50,000 and bets at odds 50.00 with a 5,000 stake (potential win 2,45,000), the stake is reduced:
max_stake = 50,000 / (50.00 - 1) = 50,000 / 49 = 1,020 (rounded down)
The punter sees: "Maximum stake at these odds: 1,020." If the reduced stake falls below the minimum bet size (100), the bet is rejected: "This market is currently unavailable at these odds."
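The reduction is a one-line formula plus the minimum-stake check. A minimal sketch, assuming a hypothetical `adjustStake` helper and the 100-rupee floor from the text:

```typescript
// Per-click stake reduction. If the potential win exceeds the cap, the
// stake is reduced to floor(cap / (odds - 1)); if that falls below the
// minimum bet size, the bet is rejected. Amounts in rupees.
const MIN_BET = 100; // minimum bet size from the text

function adjustStake(
  stake: number,
  odds: number,            // decimal odds, e.g. 50.00
  perClickWinCap: number   // e.g. 50,000
): { stake: number; rejected: boolean } {
  const potentialWin = stake * (odds - 1);
  if (potentialWin <= perClickWinCap) return { stake, rejected: false };
  const maxStake = Math.floor(perClickWinCap / (odds - 1));
  return maxStake < MIN_BET
    ? { stake: 0, rejected: true }
    : { stake: maxStake, rejected: false };
}
```

`adjustStake(5000, 50.0, 50000)` reproduces the worked example: the stake is cut to 1,020.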
Aggregate Win Limit
Caps total cumulative potential wins over a period (daily). Prevents a punter from placing many individually-acceptable bets that collectively create enormous exposure.
Sharp Detection Signals
| Signal | What It Means |
|---|---|
| Closing Line Value (CLV) | Consistently bets at better prices than the closing line. Strongest predictor of long-term profitability. |
| Consistent staking | Same stake regardless of odds. Professionals use flat staking. |
| Early betting | Bets within the first hour of market opening when prices are softest. |
| Unpopular markets | Bets on obscure leagues with weakest pricing. |
| No mean reversion | Profits do not revert over time. Lucky punters revert; skilled punters do not. |
| Post-bet price movement | Price moves in their direction after they bet. |
9. Period Definitions -- Night & Weekly
Why Bookies Use Periods
Bookies think in operational windows, not lifetime exposure. The night session (when live betting peaks) has different risk characteristics than a quiet afternoon. The weekly cycle aligns with settlement cycles.
Night Period
Configurable per agent. Examples:
| Agent | Timezone | Night Period | Why |
|---|---|---|---|
| Rajesh (India, cricket) | IST | 7:00 PM - 2:00 AM | IPL matches start at 7:30 PM |
| Priya (India, football) | IST | 10:00 PM - 4:00 AM | PL matches at 12:30 AM IST |
Period Rollover
When a live match spans a period boundary, open exposure carries forward. Rajesh's 8,00,000 of open exposure at 1:55 AM is NOT zeroed at 2:00 AM; it carries forward as the starting balance of the new period. New bets placed after 2:00 AM count against the new period's limits.
DST
Period boundaries are defined in the agent's local time and converted to UTC fresh each day. "7 PM to 2 AM" always means that in the agent's local clock. On spring-forward day, the period is 1 hour shorter. On fall-back, the system uses the first occurrence of the repeated hour.
10. Exposure Accounting (2-Tier: Redis + PostgreSQL)
Three Ledgers Per Agent Per Scope
| Ledger | What It Tracks |
|---|---|
| retained_open_liability | Worst-case payout on retained bets. Checked against limits. |
| forwarded_open_liability | Liability forwarded upward. Agent has no financial exposure. |
| open_potential_win | Total punters could win against this agent. |
The 2-Tier Architecture
+------------------+ +------------------+
| REDIS | | POSTGRESQL |
| (Fast R/W) | | (Source of Truth)|
| | | |
| - Sub-ms reads | | - Atomic writes |
| - INCRBY atomic | | - FOR UPDATE |
| - Single shared | | locking at |
| cache for all | | boundaries |
| instances | | |
+--------+---------+ +--------+---------+
| |
v v
"Rajesh has 12L "Rajesh has exactly
used of 50L -- 12,05,100 used --
clearly within UPDATE with lock"
limit, fast pass"
How it works:
- Redis (primary read/write): Every bet reads exposure from Redis first. Redis is shared across all application instances -- no cache coherency problem. Exposure counters are stored as simple keys: `exposure:{agent_id}:{scope}:retained`, `exposure:{agent_id}:{scope}:forwarded`, `exposure:{agent_id}:{scope}:potential_win`. Updated with atomic `INCRBY` on every bet.
- PostgreSQL (source of truth): After the Redis update, a write-behind job flushes the current Redis value to PostgreSQL every 5 seconds. For bets where the agent is near their limit (>85% utilized), we bypass Redis and go directly to PostgreSQL with `SELECT ... FOR UPDATE` locking. This ensures two simultaneous bets cannot both claim the last 50,000 of capacity.
Why this works: During an IPL match, Rajesh might receive 50 bets per minute. For 45 of those, Redis immediately confirms he is within limits. Only the 5 bets near his limit need the PostgreSQL lock. Median latency stays low while correctness is guaranteed at the boundary.
When Redis is down: Fall back to PostgreSQL for all reads. Latency increases from <1ms to 5-15ms. No data loss, no incorrect decisions.
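The fast-path/locked-path split can be sketched as follows, with hypothetical store interfaces standing in for the Redis client and the PostgreSQL `SELECT ... FOR UPDATE` path:

```typescript
// Two-tier exposure reservation -- illustrative sketch. FastStore and
// LockedStore are hypothetical interfaces, not the production clients.
interface FastStore {
  get(key: string): number;
  incrBy(key: string, delta: number): number;
}
interface LockedStore {
  // Atomically checks and reserves capacity under a row lock.
  reserveWithLock(agent: string, scope: string, delta: number, limit: number): boolean;
}

const NEAR_LIMIT = 0.85; // utilization threshold from the text

function tryReserve(
  redis: FastStore, pg: LockedStore,
  agent: string, scope: string, delta: number, limit: number
): boolean {
  const key = `exposure:${agent}:${scope}:retained`;
  const current = redis.get(key);
  if ((current + delta) / limit > NEAR_LIMIT) {
    // Near the boundary: take the locked path so two simultaneous bets
    // cannot both claim the last slice of capacity.
    const ok = pg.reserveWithLock(agent, scope, delta, limit);
    if (ok) redis.incrBy(key, delta); // keep the cache in step
    return ok;
  }
  redis.incrBy(key, delta); // fast path: clearly within limits
  return true;
}
```

During an IPL match most bets take the top branch's fast path; only the handful near the boundary pay the PostgreSQL locking cost, which is what keeps median latency low.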
Settlement Impact
When a match settles, exposure associated with that match is removed from all agents: retained_open_liability decrements for retained positions, forwarded_open_liability decrements for forwarded positions, open_potential_win decrements for all.
11. Audit Trail & Determinism
The Audit Record
Every bet produces a structured, queryable JSONB record in PostgreSQL. The audit table is partitioned by month. Partitions older than 6 months are dropped.
An audit record contains:
| Field | Example |
|---|---|
| bet_id | bet_a1b2c3d4 |
| original_stake | 10,000 |
| adjusted_stake | 10,000 |
| per_click_win_cap_check | PASS: 8,500 < 50,000 |
| forwarding_chain | Full routing at each level with matrix rule, limits, overflow |
| limit_checks | Every limit checked at every level with result |
| period_context | NIGHT (19:00-02:00 IST), Week 7 of 2026 |
| timestamps | matrix_resolve: 2ms, cap_check: 5ms, total: 23ms |
| matrix_version | 7 (integer version captured at start of processing) |
Why Determinism Matters
Agents must trust the system. If Rajesh sees a routing he does not understand, he will revert to manual processes. The audit trail shows exactly why every decision was made.
Disputes are trivially resolvable: load the audit record, see the exact split and the rules that produced it.
Configuration Change Log
All configuration changes are recorded with who made the change, when, and why. You can answer: "What was Rajesh's matrix at 9:47 PM on March 15?" by querying the config changelog.
12. Performance Architecture
Latency Budget (Simplified)
| Step | Budget |
|---|---|
| Request parsing & validation | 5ms |
| Metrics computation | 3ms |
| User win cap check (Redis) | 5ms |
| Matrix resolution (Redis) | 12ms |
| Agent cap evaluation (per level, Redis or PG) | 12ms x N levels |
| Position creation (sequential, per level) | 8ms x N levels |
| Exposure ledger update (Redis INCRBY) | 3ms |
| Audit record (async buffer) | 0ms on critical path |
| Response | 5ms |
| Total (3-level cascade) | ~92ms |
Exposure Counter Design (Simple Redis Atomic Counters)
One Redis key per agent per scope. No sharding.
KEY: exposure:rajesh:cricket:retained
VALUE: 1205100 (in paisa)
KEY: exposure:rajesh:mi_vs_csk:retained
VALUE: 125100
KEY: exposure:rajesh:night_2026_02_11:retained
VALUE: 305100
Operations:
- Read: `GET exposure:rajesh:cricket:retained` -- sub-millisecond
- Increment: `INCRBY exposure:rajesh:cricket:retained 5100` -- atomic, sub-millisecond
- Decrement (settlement): `DECRBY exposure:rajesh:cricket:retained 5100` -- atomic
Write-behind to PostgreSQL: a background job reads all exposure keys and writes to exposure_ledger table every 5 seconds. If Redis restarts, the recovery job rebuilds keys from PostgreSQL.
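Flush and recovery are symmetric directions of the same mapping. A minimal sketch, with hypothetical callbacks standing in for the Redis key scan and the exposure_ledger upsert:

```typescript
// Write-behind flush and restart recovery -- illustrative sketch; the
// callback names are hypothetical. A background timer calls flush() every
// 5 seconds; after a Redis restart, restore() replays the ledger rows
// back into the cache.
type Counters = Map<string, number>; // exposure key -> value in paisa

function flush(
  readCacheCounters: () => Counters,
  upsertLedgerRow: (key: string, value: number) => void
): number {
  let rows = 0;
  for (const [key, value] of readCacheCounters()) {
    upsertLedgerRow(key, value);
    rows++;
  }
  return rows; // ledger rows written this cycle
}

function restore(
  readLedgerRows: () => Counters,
  setCacheCounter: (key: string, value: number) => void
): void {
  for (const [key, value] of readLedgerRows()) setCacheCounter(key, value);
}
```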
What Is Cached Where
| Data | Location | Invalidation |
|---|---|---|
| Agent forwarding matrix | Redis hash matrix:{agent_id} | On matrix update, delete key |
| Agent limits | Redis hash limits:{agent_id} | On limit update, delete key |
| Current exposure | Redis keys exposure:{agent_id}:{scope}:* | Updated on every bet via INCRBY |
| User win cap state | Redis key user_agg_win:{user_id}:{date} | Updated on every bet, reset on period boundary |
| NO_NEW_RISK flags | Redis key nonewrisk:{agent_id}:{scope} | Set on limit breach, cleared on settlement or limit change |
13. The Agent Experience
Three Tiers of Experience
The system runs on the same engine for everyone. What changes is how much complexity the agent sees.
Tier 1: "Set and Forget" (80% of agents). Three questions at setup: What sports? What is your nightly budget? How aggressive? The system generates the full matrix, limits, and win caps from these three answers. Traffic light dashboard. WhatsApp alerts.
Tier 2: "Dashboard Driver" (15% of agents). Real-time risk dashboard. Per-sport limits. Per-user management. Sees exposure bars, match-by-match breakdown, recent bets.
Tier 3: "Matrix Master" (5% of agents). Full 5D matrix editor. Test bet simulator. Historical P&L analysis per rule.
The "Sleep Well" Number
Every agent sees one number prominently:
YOUR MAXIMUM LOSS TONIGHT
3,42,000
out of your 10,00,000 night budget
████████░░░░░░░░░░░░ 34%
This is the sum of worst-case liability across all retained open positions. It is a mathematical guarantee: the cascade routing, limits, and NO_NEW_RISK ensure this number never exceeds the budget.
The Panic Button
One button sets forwarding to 100% for all sports and markets, places hedge orders on Betfair for all current retained positions, and notifies the upline. This is the emergency exit.
14. Nightmare Scenarios & How We Handle Them
Syndicate Attack (Correlated Positions Across Agents)
The platform maintains aggregate exposure per event. Cross-agent correlation detects users who consistently bet the same outcome at the same time through different agents. Platform-level event limits cap total exposure regardless of individual agent limits.
Data Feed Failure During Live Play
If odds have not updated for more than 5 seconds during in-play, the market is automatically suspended. No new bets until the feed resumes. Price movement circuit breaker suspends on abnormal jumps.
Rogue Agent Dumping Toxic Flow
Track P&L of retained vs forwarded bets per agent. If forwarded bets consistently lose money while retained bets win, flag the agent. The platform can force minimum retention.
Double Settlement / Result Correction
Re-settlement reverses previous settlements and re-applies with corrected results. All agents in the cascade are affected. Communication chain notifies everyone with full audit trails.
System Outage During Peak
Circuit breaker pattern: when response times exceed 500ms, switch to degraded mode where all bets are forwarded 100% to Betfair. No agent retains risk during the outage. When the system recovers, normal routing resumes.
15. Bet Cancellation / Void / Partial Settlement State Machine
The Complete State Machine
State Definitions
| State | Exposure Impact | Reversible? |
|---|---|---|
| BET_PLACED | Ledgers not yet updated | Yes (rolls back on system error) |
| ACTIVE | Fully reflected in all agent ledgers | No -- can only move forward |
| SETTLED | Exposure removed, P&L applied | Can move to RE_SETTLED |
| VOIDED | All exposure atomically removed as if bet never happened | No -- final |
| CANCELLED | All exposure removed. Functionally identical to void. | No |
| CASH_OUT_SETTLED | Original position closed via counter-position | No |
| RE_SETTLED | Previous P&L reversed, new P&L applied | Can be re-settled again |
Who Can Initiate Transitions
| Transition | By | Time Window |
|---|---|---|
| ACTIVE -> SETTLED | System (automatic on event result) | After event concludes |
| ACTIVE -> VOIDED | Platform admin only | Any time before settlement |
| BET_PLACED -> CANCELLED | Punter (own bet) | Within 3-5 seconds pre-match, 0 for in-play |
| BET_PLACED -> CANCELLED | Agent (their punter's bet) | Within 60 seconds |
| SETTLED -> RE_SETTLED | SUPER_ADMIN only | Within 72 hours |
Agents cannot void bets. Only the platform can void. This prevents fraud against the upline.
How Voids Cascade
A void reads the original audit record to determine exactly what to reverse. It does not recalculate. Even if the agent changed their matrix since placement, the void reverses exactly what was originally done.
Every void has an idempotency key. Pressing "Void" twice does not double-decrement exposure. All ledger updates happen in a single PostgreSQL transaction with FOR UPDATE locks in agent_id ascending order (preventing deadlocks).
After a void, NO_NEW_RISK flags are re-evaluated: if exposure dropped below the limit, the flag is cleared.
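The deadlock-avoidance rule can be sketched as follows. The query text and `lockOrder` helper are illustrative, not the production schema; the point is only that every multi-agent transaction acquires its FOR UPDATE locks in the same ascending order:

```typescript
// Deadlock prevention sketch: every void transaction locks agent ledger rows
// in ascending agent_id order. Query text is illustrative, not the real schema.
const lockSql =
  "SELECT id FROM exposure_ledger WHERE agent_id = ANY($1) ORDER BY agent_id ASC FOR UPDATE";

function lockOrder(agentIds: string[]): string[] {
  // Two concurrent voids touching overlapping agents acquire locks in the
  // same order, so neither can wait on a lock the other holds cyclically.
  return [...agentIds].sort();
}
```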
Partial Void (Multi-Leg Bets)
When one leg of an accumulator is voided, that leg is treated as a winner at odds 1.00. Remaining legs settle normally with the voided leg's odds removed from the calculation.
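The leg-removal rule can be sketched as a pure function. Types and names here are illustrative, not the production schema; a voided leg contributes odds 1.00, i.e. it drops out of the combined-odds product:

```typescript
// Sketch: settle an accumulator when some legs are voided (names hypothetical).
interface Leg {
  odds: number;
  result: "WON" | "LOST" | "VOID";
}

function settleAccumulator(stakePaisa: number, legs: Leg[]): number {
  let combinedOdds = 1;
  for (const leg of legs) {
    if (leg.result === "LOST") return 0; // any lost leg loses the whole bet
    if (leg.result === "VOID") continue; // treated as odds 1.00 -- no effect
    combinedOdds *= leg.odds;
  }
  return Math.round(stakePaisa * combinedOdds); // gross return in paisa
}
```

If every leg is voided, the punter simply gets the stake back.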
16. Matrix Version Control
The Problem
Rajesh changes his matrix at 9:47 PM during MI vs CSK. Fifteen bets are in various processing stages. If half use the old matrix and half use the new one, the audit trail is inconsistent.
The Simple Solution: Version Integer
The matrix has a version integer, incremented on any change. When a bet enters processing, it captures the current version.
BET PROCESSING:
1. Bet arrives at matrix resolution step
2. Read current matrix version from Redis: version = 7
3. Evaluate rules
4. Record version 7 in the bet's audit trail
5. Continue to cap evaluation and position creation
At position write time:
6. Re-check: is matrix version still 7?
7a. If yes: write positions normally
7b. If no (version is now 8): re-resolve the matrix with the new rules, re-calculate the split, then write
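The check-and-re-resolve loop above can be sketched like this. The `Matrix` shape and `readMatrix` callback are illustrative; in production the version comes from Redis and the split from full rule resolution:

```typescript
// Optimistic matrix-version check (names hypothetical).
interface Matrix {
  version: number;
  forwardPercentage: number;
}

function resolveWithVersionCheck(
  readMatrix: () => Matrix,
  maxRetries = 3,
): { forwardPercentage: number; matrixVersion: number } {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const snapshot = readMatrix();            // steps 2-3: capture version, evaluate rules
    const split = snapshot.forwardPercentage;
    const current = readMatrix();             // step 6: re-check at position write time
    if (current.version === snapshot.version) {
      return { forwardPercentage: split, matrixVersion: snapshot.version }; // step 7a
    }
    // step 7b: version moved during processing -- loop and re-resolve
  }
  throw new Error("matrix changed on every attempt; escalate");
}
```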
Why This Is Sufficient
Matrix changes are rare: 5-10 per day per agent. The window between a bet reading the matrix version and writing positions is 20-40 milliseconds. The probability of a matrix change landing in that window is roughly:
10 changes/day = 10 changes / 86,400,000 ms = 1 change per 8,640,000 ms
40ms window: probability = 40 / 8,640,000 = 0.000005 (0.0005%)
When the astronomically unlikely collision does occur, we simply re-process the bet with the new matrix. No version history needed. No garbage collection. No version-aware caching.
What the Agent Sees
Matrix changes take effect on the next bet that starts processing after the Redis key is updated (which happens within milliseconds of the save). The dashboard shows: "Matrix updated. New bets will use the updated rules."
17. Dead Letter Queue and Poison Bet Handling
The Retry Pipeline
What Constitutes a Poison Bet
| Condition | Example |
|---|---|
| Event already settled | Bet queued during outage, event settles before replay |
| Agent suspended mid-processing | Agent suspended while bets were in retry queue |
| Market no longer exists | Data feed error caused market deletion |
| Duplicate bet_id | Network timeout caused client to retry; first attempt succeeded |
| Invalid state transition | Punter cancelled during retry window |
What the Punter Experiences
If the punter never saw "Bet Confirmed": They saw "Bet is being processed." If retries succeed, push notification: "Your bet has been confirmed." If retries fail, push notification: "Your bet could not be placed. No funds were deducted."
If the punter saw "Bet Confirmed" (partial cascade failure): The punter continues to see "Bet Active." The system retries in the background. If retries fail, an admin resolves via the DLQ -- typically void and refund.
Manual Resolution
The DLQ dashboard shows each entry with full context: the original bet, the failure reason, the punter experience (did they see a confirmation?), and the event status. Admin options: Void & Refund, Force Process, or Settle at Result.
18. Settlement Cascade & Failure Isolation
Per-Position Settlement State
Each position is settled independently. Position 2,847 failing does not block position 2,848. The worker records the failure and moves on.
Settlement Worker Design
Workers claim positions using SELECT ... FOR UPDATE SKIP LOCKED. Each position is processed: calculate P&L, update exposure ledger, set status to SETTLED. Failed positions remain for retry. Settlement is partitioned by agent for fault isolation.
Settlement Idempotency
Every settlement has an idempotency key: {position_id}_{hash(event_result)}. Before executing, the system checks if this key already exists. If it does, the settlement is skipped. This makes settlement safe to retry at any time.
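A minimal sketch of the key and the guard follows. The `{position_id}_{hash(event_result)}` format is from the design above; the hash choice (SHA-256 over the JSON result) is an assumption, and the in-memory Set stands in for what would be a row check in PostgreSQL:

```typescript
import { createHash } from "node:crypto";

// Build the settlement idempotency key (digest function is an assumption).
function settlementIdempotencyKey(positionId: string, eventResult: object): string {
  const digest = createHash("sha256")
    .update(JSON.stringify(eventResult))
    .digest("hex")
    .slice(0, 16);
  return `${positionId}_${digest}`;
}

// In production this check is a unique row in PostgreSQL, not a Set.
const applied = new Set<string>();

function settleOnce(key: string, apply: () => void): boolean {
  if (applied.has(key)) return false; // same result already settled -- skip
  applied.add(key);
  apply();
  return true;
}
```

A corrected result produces a different hash, hence a different key, so re-settlement is not blocked by the original settlement's key.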
Partial Failure Recovery
A background monitor runs every 30 seconds:
- Positions in PROCESSING for >60 seconds: assume the worker crashed, reset to PENDING
- Positions in FAILED: retry up to 5 times, then escalate to DLQ
- Events with mixed CONFIRMED/PENDING: generate a report and alert if >15 minutes old
19. Cash-Out / Early Settlement
How Cash-Out Works
Cash-out is a counter-bet at current odds that closes the original position. The punter locks in a guaranteed profit or limits a loss.
Formula (back bet):
cash_out_return = stake * (original_odds / current_odds) * (1 - margin)
cash_out_profit = cash_out_return - stake
Example: Amit bet 10,000 on MI at 1.85. MI now at 1.20 (dominating).
cash_out_return = 10,000 * (1.85 / 1.20) * (1 - 0.05) = 14,646
Amit's profit if he cashes out: 4,646
(vs 8,500 if MI wins, vs -10,000 if MI loses)
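The formula and worked example can be captured in a small function. Amounts are in rupees for readability; rounding to the nearest rupee is an assumption:

```typescript
// Back-bet cash-out, per the formula above.
function cashOutReturn(
  stake: number,
  originalOdds: number,
  currentOdds: number,
  margin: number, // e.g. 0.05 for a 5% margin
): number {
  return Math.round(stake * (originalOdds / currentOdds) * (1 - margin));
}

function cashOutProfit(
  stake: number,
  originalOdds: number,
  currentOdds: number,
  margin: number,
): number {
  return cashOutReturn(stake, originalOdds, currentOdds, margin) - stake;
}
```

`cashOutReturn(10000, 1.85, 1.20, 0.05)` reproduces Amit's 14,646.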
Routing Through the Original Proportions
The cash-out counter-bet routes through the SAME proportions as the original bet, not the current matrix. If Rajesh retained 60%, he closes 60% of the position.
Partial Cash-Out
Amit can cash out any percentage. 50% cash-out closes half the position at all levels. The remaining 50% continues normally.
Interaction with NO_NEW_RISK
Cash-out creates a counter-position that reduces exposure. It is treated as a hedge and allowed even when NO_NEW_RISK is active.
20. Lay Bet Support
What Is a Lay Bet?
A lay bet is the opposite of a back bet. Sonia lays MI to win means she bets MI will NOT win. She wins if MI draws or loses.
Liability difference:
- Back: punter risks the stake, wins stake * (odds - 1)
- Lay: punter risks stake * (odds - 1), wins the stake
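The two liability rules, as functions over paisa amounts (function names are illustrative):

```typescript
// Back vs lay risk and win amounts, in paisa.
type Side = "BACK" | "LAY";

function liability(side: Side, stakePaisa: number, odds: number): number {
  // Back: risk the stake. Lay: risk stake * (odds - 1).
  return side === "BACK" ? stakePaisa : Math.round(stakePaisa * (odds - 1));
}

function potentialWin(side: Side, stakePaisa: number, odds: number): number {
  // Back: win stake * (odds - 1). Lay: win the stake.
  return side === "BACK" ? Math.round(stakePaisa * (odds - 1)) : stakePaisa;
}
```

Rajesh's 60% retained share of Sonia's 10,000 lay at 1.85 is a 6,000 stake (600,000 paisa), so `liability("LAY", 600000, 1.85)` gives 510,000 paisa -- the 5,100 ledger movement in the example that follows.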
Exposure Impact
A lay bet on outcome X DECREASES the agent's exposure on outcome X and INCREASES exposure on other outcomes. It is a natural hedge.
Before Sonia's lay:
MI Win liability: 5,00,000
MI Not Win liability: 0
After Sonia lays MI at 1.85 for 10,000 (Rajesh retains 60%):
MI Win liability: 4,94,900 (reduced by 5,100)
MI Not Win liability: 6,000 (increased by 6,000)
Worst case went from 5,00,000 to 4,94,900 = HEDGE
How the Matrix Handles Lay Bets
Same 5-dimensional matrix, same rules, same precedence chain. The only difference is how the exposure ledger is updated and how NO_NEW_RISK evaluates the bet. If the lay reduces worst-case liability, it is allowed even under NO_NEW_RISK.
21. Collusion Detection
The Problem
An agent conspires with a sharp punter: marks them as NORMAL to retain winning bets instead of forwarding them. Or marks them as SHARP only when they are about to lose, forwarding toxic flow upline.
Detection Signals
| Signal | Severity |
|---|---|
| Classification flip before winning streak | HIGH |
| Classification flip-flop (>3 times/week) | HIGH |
| Override % changes correlate with bet outcomes | CRITICAL |
| Forwarded flow consistently loses vs retained flow consistently wins | HIGH |
| Single user dominates forwarded volume (>30%) | MEDIUM |
Cooling-Off Period for Classification Changes
Changes are queued, not instant:
- NORMAL to SHARP: 24-hour cooling-off
- SHARP to NORMAL: 72-hour cooling-off (riskier direction, longer wait)
- During cooling-off: old classification remains active. Agent cannot change again.
Upline Audit Rights
Vikram can see Rajesh's user classifications, pending changes, and classification-vs-outcome correlation reports. Vikram can flag changes for review and request the platform to freeze Rajesh's override capability. Vikram cannot directly change Rajesh's classifications.
Correlation Engine (Phase 2 Implementation)
A nightly analysis builds timelines of classification changes, bet placements, and outcomes. Calculates collusion scores based on statistical correlation. Thresholds trigger escalating alerts from informational (score 25-75) to automatic override freeze (score 150+).
For Phase 1, the cooling-off periods and upline audit rights provide baseline protection. The automated correlation engine is Phase 2.
22. Agent Hierarchy Migration
The Problem
Rajesh leaves Vikram's network and joins Suresh's. Rajesh has open positions forwarded through Vikram. The system must handle this cleanly.
Effective-Dated Changes
Hierarchy changes are never instantaneous. They take effect at a scheduled date.
Dual-Path Settlement
After cutover:
- New bets: Rajesh -> Suresh -> Platform
- Old positions: Still settled through Vikram (he holds the positions)
| Bet Timing | Routing | Settlement |
|---|---|---|
| Before cutover, settles before cutover | Through Vikram | Through Vikram |
| Before cutover, settles after cutover | Through Vikram | Through Vikram (dual-path) |
| After cutover | Through Suresh | Through Suresh |
When all old positions settle, final reconciliation between Rajesh and Vikram, then migration status becomes COMPLETE.
23. Minimum Forwarding / Skin-in-the-Game Requirements
How It Works
The upline sets a minimum retention percentage per downstream agent. This floor cannot be overridden by the downstream agent's matrix.
Vikram requires Rajesh to retain at least 20% of every bet. If Rajesh's matrix says forward 95% for SHARP users, the system caps it at 80% forward (20% retention minimum).
Limits always win over minimum retention. If Rajesh's limits would be breached by retaining 20%, he retains only what his limits allow.
When Rajesh tries to save a matrix rule that violates the minimum, the system auto-adjusts and shows a clear message explaining the constraint.
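The floor itself is a one-line cap. This sketch covers only the floor; the "limits always win" interaction is applied later in the pipeline and omitted here (names are illustrative):

```typescript
// Cap the matrix forward% so the downstream agent retains at least the
// upline's minimum retention percentage.
function effectiveForwardPct(matrixForwardPct: number, minRetentionPct: number): number {
  return Math.min(matrixForwardPct, 100 - minRetentionPct);
}
```

With Vikram's 20% floor, Rajesh's 95% SHARP rule becomes `effectiveForwardPct(95, 20)` = 80.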
24. Panic Button & Abuse Prevention
What the Panic Button Does
- Immediately sets forwarding to 100% for all sports and markets
- Places hedge orders on Betfair for all current retained positions
- Notifies the upline
- Logs the action
Abuse Prevention
The abuse pattern: panic when losing (lock in small loss), hold when winning (collect full profit).
Cost escalation:
- First panic per period: platform absorbs hedge spread cost
- Second panic: 50% spread cost charged to the agent
- Third+: 100% charged to the agent
Controls:
| Control | Value |
|---|---|
| Panics per night (free) | 2 |
| Cooling-off after panic | 30 minutes before matrix can be restored |
| Minimum hedge duration | 15 minutes (cannot un-hedge immediately) |
Detection: Track panic frequency, correlation with book direction (panics when losing, not when winning), and P&L improvement from panic usage. If panic usage consistently improves P&L by >20% over 3 weeks with asymmetric timing, flag as GAMING. Escalation: full spread cost, then suspension of the panic feature, with automatic NO_NEW_RISK taking its place.
25. Timestamp Security & Period Boundaries
Server-Side Authority
The authoritative timestamp is assigned at the API gateway when the request is received. This is immutable. Client timestamps are stored for debugging and fraud detection but never trusted for business logic.
BET PROCESSING:
1. Client sends request with client_timestamp
2. API gateway assigns server_timestamp = NOW()
3. All period boundary evaluation uses server_timestamp
4. All exposure updates reference server_timestamp
5. Audit stores BOTH timestamps
Period Boundary Tie-Breaking
Bets at exactly the boundary belong to the ENDING period. The interval is closed-open: [19:00, 02:00). A bet at 01:59:59.999 is NIGHT. A bet at 02:00:00.000 is DAY.
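The closed-open check can be sketched as follows. The [19:00, 02:00) window is hard-coded from the example above; production reads each agent's configured night period, and hour/minute are assumed already converted to the agent's timezone from server_timestamp:

```typescript
// Closed-open night window [19:00, 02:00), wrapping past midnight.
function periodFor(hour: number, minute: number): "NIGHT" | "DAY" {
  const mins = hour * 60 + minute;
  const start = 19 * 60; // 19:00 -- inclusive
  const end = 2 * 60;    // 02:00 -- exclusive
  return mins >= start || mins < end ? "NIGHT" : "DAY";
}
```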
Clock Skew
All server instances synchronized via NTP to within 50ms. If drift exceeds 200ms, the instance is removed from the load balancer. For period boundaries (at hour granularity), 50ms skew is irrelevant.
26. Sharp Detection via Multiple Accounts
The Detection Pillars (Design for Phase 1, Heavy Implementation Phase 2)
Four independent signals:
Pillar 1: Device Fingerprinting. Same device used by multiple accounts. Similarity scoring based on device_id, canvas fingerprint, IP + user agent, screen resolution + timezone.
Pillar 2: IP/Network Correlation. Multiple accounts from same IP or subnet. Sequential IP usage patterns. VPN/proxy detection.
Pillar 3: Betting Pattern Similarity. Outcome correlation (>80% over 100+ bets is suspicious). Timing correlation (<5 minutes apart). Correlated obscure market selection. Similar stake distributions. Similar CLV profiles.
Pillar 4: Payment Method Overlap. Same bank account, UPI ID, phone number, or card on multiple accounts.
Phase 1 Implementation
For Phase 1: implement device fingerprint collection and basic IP logging. Store the data. Manual review when agents report suspected syndicates.
For Phase 2: build the automated correlation engine that runs nightly, scores clusters, and generates syndicate alerts with recommended actions (classify all as SHARP, block, review individually).
27. Rate Limiting on Configuration Changes
Per-Agent Rate Limits
| Configuration Type | Rate Limit |
|---|---|
| Matrix rule changes | 1 per 5 minutes |
| User override changes | 1 per user per 10 minutes |
| Market override changes | 1 per market per 5 minutes |
| Agent default changes | 1 per 15 minutes |
| Limit changes | 1 per limit per 10 minutes |
| Panic button | No rate limit on activation; 30-minute cooling-off |
Queue Behavior
Rapid changes that exceed the rate limit are queued. Only the most recent value is applied when the rate limit window expires. Intermediate values are never applied.
The agent sees: "Your change is pending. It will apply in ~4 minutes. Only your most recent value will be used."
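The "only the most recent value is applied" behavior can be sketched like this. Scheduling the window-expiry callback (a BullMQ delayed job in production) is omitted; the class name is hypothetical:

```typescript
// Pending-change holder: intermediate values are overwritten, never applied.
class PendingConfigChange<T> {
  private latest: T | undefined;

  submit(value: T): void {
    this.latest = value; // each submit replaces the previous pending value
  }

  applyOnWindowExpiry(apply: (value: T) => void): void {
    if (this.latest !== undefined) {
      apply(this.latest); // exactly one apply per window, with the last value
      this.latest = undefined;
    }
  }
}
```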
Panic Button Interaction
Panic activation bypasses all rate limits (safety trumps stability). Configuration changes while panic is active are queued and applied only after panic is deactivated.
28. Hedge Execution
The Simple Design
For Phase 1, hedge execution is deliberately simple:
What this means:
- Market orders, not limit orders. Accept whatever price Betfair gives.
- No re-pricing, no partial fill management, no slippage optimization.
- If the order fails 3 times (Betfair down, no liquidity, API error), mark the amount as unhedged and alert.
- Log the theoretical price (what the punter got) vs the actual fill price. This data feeds Phase 2 optimization.
Why this is fine for Phase 1: We are hedging small amounts. The spread between a market order and an optimized limit order on an 800 bet is a few rupees. Building a sophisticated execution engine before we have meaningful hedge volume is premature.
Betfair Health Check
Ping Betfair every 10 seconds. Three consecutive failures means Betfair is down. Queue all orders for retry. Platform absorbs risk. When Betfair returns, drain the queue.
If an event settles while a hedge order is queued, cancel the order. The platform bears the outcome as retained risk.
29. Reconciliation
The Simple Design
One nightly job. Runs at 4:00 AM IST when traffic is minimal.
What it checks:
For every active agent, for every scope with open positions:
Step 1: Read exposure_ledger.retained_open_liability for this agent+scope
Step 2: SUM(liability) FROM positions WHERE agent_id AND scope AND status='OPEN' AND type='RETAINED'
Step 3: Compare.
Step 4: If they match: log success.
Step 5: If mismatch: alert ops team via Slack/email with:
- Agent name
- Scope
- Ledger value vs computed value
- Drift amount and direction
No auto-correction. If the ledger is wrong, a human investigates the root cause. Auto-correction hides bugs. In Phase 1, every mismatch should be understood and fixed at the source.
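The compare step (steps 3-5) reduces to a pure function over the two reads; the ledger read and the positions SUM are SQL and omitted here. Values are in paisa, as bigint to avoid floating-point drift on large books:

```typescript
// Compare ledger value against the computed SUM of open retained positions.
function reconcile(ledgerPaisa: bigint, computedPaisa: bigint) {
  const drift = computedPaisa - ledgerPaisa;
  return {
    ok: drift === 0n,
    drift, // signed: positive means the ledger understates true exposure
    direction:
      drift === 0n ? "NONE" : drift > 0n ? "LEDGER_UNDERSTATES" : "LEDGER_OVERSTATES",
  };
}
```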
The manual recompute tool:
reconciliation recompute --agent=rajesh --scope=cricket
1. Lock the agent+scope (advisory lock, briefly delays new bets)
2. SUM all open positions
3. Update exposure_ledger to match
4. Update Redis
5. Release lock
6. Log the correction
This is the fix tool. Use it after investigating and understanding why the drift occurred.
30. Monitoring & Alerting
The Simple Design
No Prometheus. No Grafana. No AlertManager. Instead:
1. Structured Logging (Pino)
Every bet, settlement, hedge, and error is logged as structured JSON. Key fields: timestamp, event, agent_id, bet_id, latency_ms, success, error.
This gives us everything we need for debugging and performance analysis. Use jq or a log aggregator to query.
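A sketch of the log line shape follows. In production Pino emits these fields; a plain formatter stands in here so the sketch has no dependencies, and the field names mirror the list above:

```typescript
// One structured log line per bet event (shape per the key fields above).
function betLogLine(fields: {
  event: string;
  agent_id: string;
  bet_id: string;
  latency_ms: number;
  success: boolean;
  error?: string;
}): string {
  return JSON.stringify({ timestamp: new Date().toISOString(), ...fields });
}
```

Lines in this shape are directly queryable, e.g. `jq 'select(.event == "bet_placed") | .latency_ms'`.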
2. Health Endpoint
GET /api/v1/monitoring/health returns:
{
"status": "healthy",
"redis": "connected",
"postgresql": "connected",
"betfair": "healthy",
"uptime_seconds": 86400,
"bets_last_minute": 42,
"avg_latency_ms": 65,
"errors_last_minute": 0
}
3. Critical Failure Alerts (Slack/Email)
Only six alert types for Phase 1:
| Alert | Trigger | Channel |
|---|---|---|
| NO_NEW_RISK activated | Any agent enters NO_NEW_RISK | Slack |
| Settlement failure | Any position fails to settle after 3 retries | Slack + Email |
| Hedge failure | Hedge marked as unhedged after 3 retries | Slack |
| DLQ entry | Any bet enters the dead letter queue | Slack |
| Reconciliation mismatch | Nightly job finds any drift | |
| Betfair down | 3 consecutive health check failures | Slack |
That is it. Six alert types. When we have operational complexity that warrants dashboards, we add Prometheus/Grafana.
This concludes Parts I and II of the simplified B-Book Architecture v3.0. The document covers 30 sections (down from 46 in v2.0), with every simplification explicitly documented and justified. The v2.0 document is preserved as the scale-up reference for when Phase 1 traffic and operational needs outgrow these simpler designs.
Part III: Implementation Architecture
The following sections provide the complete implementation specification. Every database table, every API endpoint, every pipeline step, and every deployment detail. An LLM reading this can build the entire system without asking a single question.
31. Technology Stack
Core Technologies
| Technology | Version | Purpose | Why This Choice |
|---|---|---|---|
| Node.js | 20 LTS | Application runtime | Event-loop model handles high concurrency cheaply. Non-blocking I/O suits the many Redis and DB calls in the bet pipeline. Team has existing expertise. |
| TypeScript | 5.x | Language | Type safety prevents financial bugs. Domain types like Stake, Liability, and ForwardPercentage catch errors at compile time. |
| PostgreSQL | 16 | Primary database | ACID transactions for financial data. FOR UPDATE locking for limit enforcement at boundaries. JSONB for flexible audit payloads. Partitioning for table growth. |
| Prisma | 5.x | ORM | Type-safe database access from TypeScript. Schema-as-code for migrations. Use raw SQL only for FOR UPDATE locks and bulk operations. |
| Redis | 7.x | Cache and counters | Sub-millisecond reads for exposure checks. Atomic INCRBY for exposure counters. Used as the backing store for BullMQ. |
| BullMQ | Latest | Job queue | Settlement processing, hedge retry, reconciliation, audit writes. Built on Redis. Supports delayed jobs, retries, priority queues, dead letter. |
| Docker | Latest | Containerization | Consistent environments. Docker Compose for local development. Single-container production deployment. |
| Pino | Latest | Structured logging | Fast JSON logger. Structured logs are queryable. Low overhead at high throughput. Replaces the need for Prometheus in v1. |
| Zod | Latest | Runtime validation | Validates all API inputs and configuration at runtime. Complements TypeScript compile-time types. |
| Socket.IO | Latest | WebSocket | Real-time dashboard updates to agents. Push notifications for alerts. |
What We Are NOT Using (and Why)
| Technology | Why Not |
|---|---|
| Prometheus + Grafana | No dedicated ops person yet. Pino structured logs and a simple /health endpoint are sufficient for v1. Add Prometheus when you have someone to look at dashboards. |
| Separate Audit DB | Overkill for v1 volume. Audit trail goes into the same PostgreSQL instance as JSONB in a partitioned table. Separate it when table size exceeds 100 GB. |
| Kafka or Redis Streams | BullMQ covers all async job needs (settlement, hedging, reconciliation). Kafka is for 1000+ bets/sec. We are targeting 167 bets/sec. |
| Application LRU Cache | Adds a consistency headache (3-tier cache invalidation). Redis is fast enough at sub-millisecond latency. Add an LRU layer only if Redis P99 exceeds 5ms under load. |
| Microservices | One Node.js process is simpler to deploy, debug, and operate. Modules are separated in code but deployed as one application. Extract into microservices only if a specific module needs independent scaling. |
| Multi-currency / FX | INR only for v1. All amounts in paisa (integer arithmetic, no floating-point). Add multi-currency when expanding beyond India. |
| Responsible Gambling Module | Not required for the Indian agent market in v1. Add when entering regulated markets (UK, EU). |
32. System Architecture Overview
System Diagram
How Services Communicate
| From | To | Method | Notes |
|---|---|---|---|
| Client to API | REST | HTTP/JSON, auth via JWT | Standard request-response |
| Client to Dashboard | WebSocket | Socket.IO | Real-time exposure updates, alerts |
| API to Bet Processing | Function call | In-process | Everything is one Node.js process |
| Bet Processing to Cascade Engine | Function call | In-process | Synchronous -- the bet pipeline is a single call chain |
| Bet Processing to Hedge Queue | BullMQ job | Async | Fire-and-forget from the bet pipeline |
| Settlement trigger to Settlement Worker | BullMQ job | Async | Settlement jobs queued when events settle |
| Config Change to Cache | Redis DEL | Direct | Delete the Redis key; next read repopulates from PostgreSQL |
Deployment Topology
Local Development: Docker Compose with three services: app (Node.js), postgres, redis.
Production (v1): One Node.js process running the full application and all BullMQ workers. One PostgreSQL instance. One Redis instance. Put behind nginx or a cloud load balancer if you need SSL termination.
Scale horizontally by adding a second Node.js instance behind the load balancer when traffic demands it. No code changes required -- BullMQ handles job distribution across workers automatically, and Redis ensures shared state.
33. Database Schema Design
Entity-Relationship Diagram
Table: agents
Stores every entity in the hierarchy: sub-agents, master agents, and the platform itself.
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID | PRIMARY KEY, DEFAULT gen_random_uuid() | Unique agent identifier |
| external_id | VARCHAR(100) | UNIQUE, NOT NULL | Human-readable ID (e.g., rajesh_mumbai) |
| name | VARCHAR(255) | NOT NULL | Display name |
| parent_agent_id | UUID | REFERENCES agents(id), NULLABLE | Upline agent. NULL for the platform. |
| level | INTEGER | NOT NULL | Hierarchy depth. 0 = platform, 1 = master, 2 = sub, etc. |
| status | VARCHAR(20) | NOT NULL, DEFAULT 'ACTIVE' | ACTIVE, SUSPENDED, DEACTIVATED |
| timezone | VARCHAR(50) | NOT NULL, DEFAULT 'Asia/Kolkata' | IANA timezone for period boundaries |
| default_forward_percentage | DECIMAL(5,2) | NOT NULL, DEFAULT 50.00 | Fallback when no matrix rule matches |
| matrix_version | INTEGER | NOT NULL, DEFAULT 1 | Incremented on every matrix change. Simple integer -- no version history table. |
| night_period_start | TIME | NULLABLE | Night period start in local time |
| night_period_end | TIME | NULLABLE | Night period end in local time |
| weekly_period_start_day | INTEGER | NOT NULL, DEFAULT 1 | 1=Monday, 7=Sunday |
| is_platform | BOOLEAN | NOT NULL, DEFAULT false | True for the single platform agent |
| platform_retain_percentage | DECIMAL(5,2) | NULLABLE | Only for platform: percent to retain vs hedge |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | |
| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() |
Indexes:
- idx_agents_parent on (parent_agent_id)
- idx_agents_status on (status)
- UNIQUE on (external_id)
Simplification vs v2: No separate tier column (derive from config complexity). The matrix_version integer lives directly on the agent row instead of a separate matrix_versions table. When a matrix changes, increment this integer and record the change in audit_trail.
Table: agent_limits
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID | PRIMARY KEY | |
| agent_id | UUID | REFERENCES agents(id), NOT NULL | |
| limit_type | VARCHAR(30) | NOT NULL | SPORT, MARKET, NIGHT_PERIOD, WEEKLY_PERIOD |
| sport_type | VARCHAR(30) | NULLABLE | CRICKET, FOOTBALL, etc. NULL for period limits across all sports |
| event_id | VARCHAR(100) | NULLABLE | Only for MARKET limits -- the specific event |
| limit_amount | BIGINT | NOT NULL | In paisa. Example: 50 lakh INR = 500,000,000 paisa |
| is_active | BOOLEAN | NOT NULL, DEFAULT true | |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | |
| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() |
Indexes:
- idx_agent_limits_lookup on (agent_id, limit_type, sport_type)
- UNIQUE on (agent_id, limit_type, sport_type, event_id) -- prevents duplicate limits
Table: forwarding_matrix_rules
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID | PRIMARY KEY | |
| agent_id | UUID | REFERENCES agents(id), NOT NULL | |
| market_type | VARCHAR(30) | NOT NULL, DEFAULT '*' | MATCH_ODDS, FANCY, BOOKMAKER, OVER_UNDER, LINE, or * |
| sport_type | VARCHAR(30) | NOT NULL, DEFAULT '*' | CRICKET, FOOTBALL, TENNIS, KABADDI, or * |
| event_phase | VARCHAR(30) | NOT NULL, DEFAULT '*' | PRE_MATCH, IN_PLAY, APPROACHING_START, or * |
| source_type | VARCHAR(30) | NOT NULL, DEFAULT '*' | NORMAL, SHARP, VIP, NEW_ACCOUNT, or * |
| liquidity_band | VARCHAR(30) | NOT NULL, DEFAULT '*' | HIGH, MEDIUM, LOW, NONE, or * |
| forward_percentage | DECIMAL(5,2) | NOT NULL, CHECK (forward_percentage BETWEEN 0 AND 100) | Percentage to forward to upline |
| specificity | INTEGER | NOT NULL | Count of non-wildcard dimensions (0-5). Computed at insert/update in application code. |
| is_active | BOOLEAN | NOT NULL, DEFAULT true | |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | Oldest rule wins tie-breaks |
| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() |
Indexes:
- idx_fmr_agent_active on (agent_id) WHERE is_active = true
- idx_fmr_lookup on (agent_id, market_type, sport_type, event_phase, source_type, liquidity_band) WHERE is_active = true
Simplification vs v2: No version column on each rule. Instead, the agent's matrix_version integer is incremented whenever any rule changes. Bets record the matrix_version from the agent row at processing time, which is sufficient for audit replay.
Table: user_overrides
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID | PRIMARY KEY | |
| user_id | UUID | REFERENCES users(id), NOT NULL | |
| agent_id | UUID | REFERENCES agents(id), NOT NULL | The agent applying this override |
| forward_percentage | DECIMAL(5,2) | NOT NULL | Override forward percent |
| reason | TEXT | NOT NULL | Why the override was set |
| created_by | UUID | NOT NULL | Who set the override |
| is_active | BOOLEAN | NOT NULL, DEFAULT true | |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | |
| expires_at | TIMESTAMPTZ | NULLABLE | Optional expiry |
Indexes:
- UNIQUE on (user_id, agent_id) WHERE is_active = true
Table: market_overrides
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID | PRIMARY KEY | |
| agent_id | UUID | REFERENCES agents(id), NOT NULL | |
| event_id | VARCHAR(100) | NOT NULL | The specific event/market |
| forward_percentage | DECIMAL(5,2) | NOT NULL | Override forward percent |
| reason | TEXT | NOT NULL | |
| created_by | UUID | NOT NULL | |
| is_active | BOOLEAN | NOT NULL, DEFAULT true | |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | |
| expires_at | TIMESTAMPTZ | NULLABLE |
Indexes:
- UNIQUE on (agent_id, event_id) WHERE is_active = true
Table: users
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID | PRIMARY KEY | |
| external_id | VARCHAR(100) | UNIQUE, NOT NULL | User ID from the main platform |
| agent_id | UUID | REFERENCES agents(id), NOT NULL | The agent this user belongs to |
| name | VARCHAR(255) | NOT NULL | |
| per_click_win_limit | BIGINT | NOT NULL, DEFAULT 5000000 | In paisa. Default 50,000 INR |
| aggregate_win_limit_daily | BIGINT | NOT NULL, DEFAULT 20000000 | In paisa. Default 2,00,000 INR |
| min_stake | BIGINT | NOT NULL, DEFAULT 10000 | In paisa. Default 100 INR |
| status | VARCHAR(20) | NOT NULL, DEFAULT 'ACTIVE' | ACTIVE, SUSPENDED |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | |
| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() |
Indexes:
- idx_users_agent on (agent_id)
- UNIQUE on (external_id)
Simplification vs v2: No self-exclusion fields, no session time limits, no deposit limits. Those belong to the responsible gambling module (v2).
Table: user_classifications
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID | PRIMARY KEY | |
| user_id | UUID | REFERENCES users(id), NOT NULL | |
| agent_id | UUID | REFERENCES agents(id), NOT NULL | Agent making this classification |
| classification | VARCHAR(30) | NOT NULL | NORMAL, SHARP, VIP, NEW_ACCOUNT |
| reason | TEXT | NULLABLE | Why classified |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | |
| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() |
Indexes:
- UNIQUE on (user_id, agent_id)
Table: agent_trust_config
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID | PRIMARY KEY | |
| agent_id | UUID | REFERENCES agents(id), NOT NULL | The upline agent |
| sub_agent_id | UUID | REFERENCES agents(id), NOT NULL | The downstream agent |
| trust_downstream_flags | BOOLEAN | NOT NULL, DEFAULT false | Whether to trust sub-agent's user classifications |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() |
Indexes:
- UNIQUE on (agent_id, sub_agent_id)
Table: events
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | VARCHAR(100) | PRIMARY KEY | External event ID from odds provider |
| sport_type | VARCHAR(30) | NOT NULL | |
| name | VARCHAR(500) | NOT NULL | "MI vs CSK, IPL 2026" |
| start_time | TIMESTAMPTZ | NOT NULL | |
| status | VARCHAR(20) | NOT NULL, DEFAULT 'UPCOMING' | UPCOMING, LIVE, SUSPENDED, SETTLED, VOID |
| result | JSONB | NULLABLE | Settlement result data |
| settled_at | TIMESTAMPTZ | NULLABLE | |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() |
Indexes:
- idx_events_sport_status on (sport_type, status)
- idx_events_start_time on (start_time)
Table: markets
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | VARCHAR(100) | PRIMARY KEY | External market ID |
| event_id | VARCHAR(100) | REFERENCES events(id), NOT NULL | |
| market_type | VARCHAR(30) | NOT NULL | MATCH_ODDS, FANCY, BOOKMAKER, etc. |
| name | VARCHAR(255) | NOT NULL | |
| status | VARCHAR(20) | NOT NULL, DEFAULT 'OPEN' | OPEN, SUSPENDED, CLOSED, SETTLED, VOID |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() |
Indexes:
- idx_markets_event on (event_id)
- idx_markets_status on (status)
Table: bets (partitioned by month on created_at)
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID | PRIMARY KEY | |
| user_id | UUID | NOT NULL | The punter |
| agent_id | UUID | NOT NULL | The originating agent (Level 1) |
| event_id | VARCHAR(100) | NOT NULL | |
| market_id | VARCHAR(100) | NOT NULL | |
| selection | VARCHAR(255) | NOT NULL | "MI to win" |
| side | VARCHAR(10) | NOT NULL | BACK or LAY |
| requested_stake | BIGINT | NOT NULL | Original stake in paisa |
| accepted_stake | BIGINT | NOT NULL | After stake reduction |
| odds | DECIMAL(10,4) | NOT NULL | Decimal odds |
| potential_win | BIGINT | NOT NULL | In paisa |
| liability | BIGINT | NOT NULL | In paisa |
| stake_reduction_reason | VARCHAR(50) | NULLABLE | PER_CLICK_LIMIT, AGGREGATE_LIMIT, or null |
| market_type | VARCHAR(30) | NOT NULL | |
| sport_type | VARCHAR(30) | NOT NULL | |
| event_phase | VARCHAR(30) | NOT NULL | PRE_MATCH, IN_PLAY |
| source_type | VARCHAR(30) | NOT NULL | As resolved at originating agent |
| liquidity_band | VARCHAR(30) | NOT NULL | HIGH, MEDIUM, LOW, NONE |
| matrix_version_snapshot | INTEGER | NOT NULL | The agent's matrix_version at processing time |
| status | VARCHAR(20) | NOT NULL, DEFAULT 'OPEN' | OPEN, SETTLED, VOIDED, PARTIALLY_VOIDED, CANCELLED, CASH_OUT_SETTLED, RE_SETTLED |
| routing_status | VARCHAR(20) | NOT NULL, DEFAULT 'COMPLETE' | COMPLETE, PARTIAL, FAILED |
| period_context | VARCHAR(20) | NOT NULL | NIGHT, DAY |
| total_processing_time_ms | INTEGER | NOT NULL | End-to-end latency |
| settled_at | TIMESTAMPTZ | NULLABLE | |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | Partition key |
Indexes:
- idx_bets_user_time on (user_id, created_at DESC)
- idx_bets_agent_time on (agent_id, created_at DESC)
- idx_bets_event on (event_id)
- idx_bets_market on (market_id)
- idx_bets_status_open on (status) WHERE status = 'OPEN'
- idx_bets_routing_incomplete on (routing_status) WHERE routing_status != 'COMPLETE'
Table: positions (partitioned by month on created_at)
One row per agent per bet. If Amit's bet flows through Rajesh, Vikram, and the Platform, that is 3 RETAINED position rows (and optionally 3 FORWARDED rows, though forwarded amounts can be derived).
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID | PRIMARY KEY | |
| bet_id | UUID | NOT NULL | |
| agent_id | UUID | NOT NULL | Agent holding this position |
| cascade_level | INTEGER | NOT NULL | 1 = first agent, 2 = upline, etc. |
| position_type | VARCHAR(20) | NOT NULL | RETAINED or FORWARDED |
| stake | BIGINT | NOT NULL | Stake portion in paisa |
| liability | BIGINT | NOT NULL | Liability portion in paisa |
| potential_win | BIGINT | NOT NULL | Potential win portion in paisa |
| forward_percentage_used | DECIMAL(5,2) | NOT NULL | The forward percent that produced this split |
| forward_source | VARCHAR(30) | NOT NULL | USER_OVERRIDE, MARKET_OVERRIDE, MATRIX_RULE, AGENT_DEFAULT |
| matrix_rule_id | UUID | NULLABLE | The specific matrix rule that matched |
| overflow_amount | BIGINT | NOT NULL, DEFAULT 0 | How much was overflow from limit breach |
| event_id | VARCHAR(100) | NOT NULL | Denormalized for fast queries |
| market_id | VARCHAR(100) | NOT NULL | Denormalized |
| sport_type | VARCHAR(30) | NOT NULL | Denormalized |
| selection | VARCHAR(255) | NOT NULL | Denormalized |
| side | VARCHAR(10) | NOT NULL | Denormalized BACK/LAY |
| odds | DECIMAL(10,4) | NOT NULL | Denormalized |
| status | VARCHAR(20) | NOT NULL, DEFAULT 'OPEN' | OPEN, SETTLED, VOIDED |
| settlement_id | UUID | NULLABLE | |
| settled_amount | BIGINT | NULLABLE | Actual P&L in paisa (positive = agent profit) |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | |
Indexes:
- idx_positions_bet on (bet_id)
- idx_positions_agent_open on (agent_id, status) WHERE status = 'OPEN'
- idx_positions_agent_event on (agent_id, event_id) WHERE status = 'OPEN'
- idx_positions_event_status on (event_id, status)
Table: exposure_ledger
One row per agent per scope. No sharding. Simple.
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID | PRIMARY KEY | |
| agent_id | UUID | NOT NULL | |
| scope_type | VARCHAR(30) | NOT NULL | SPORT, MARKET, NIGHT_PERIOD, WEEKLY_PERIOD |
| scope_key | VARCHAR(200) | NOT NULL | e.g., "cricket", "mi_vs_csk_2026_03_15", "night_2026_03_15" |
| retained_open_liability | BIGINT | NOT NULL, DEFAULT 0 | In paisa |
| forwarded_open_liability | BIGINT | NOT NULL, DEFAULT 0 | In paisa |
| open_potential_win | BIGINT | NOT NULL, DEFAULT 0 | In paisa |
| no_new_risk_active | BOOLEAN | NOT NULL, DEFAULT false | |
| no_new_risk_triggered_at | TIMESTAMPTZ | NULLABLE | |
| last_updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | |
Indexes:
- UNIQUE on (agent_id, scope_type, scope_key)
- idx_exposure_agent_scope on (agent_id, scope_type)
- idx_exposure_no_new_risk on (no_new_risk_active) WHERE no_new_risk_active = true
Simplification vs v2: No shard_index column. One row per agent per scope. If this becomes a write bottleneck (measurable as PostgreSQL write contention P99 exceeding 50ms), add sharding later.
Table: settlements (partitioned by month on created_at)
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID | PRIMARY KEY | |
| event_id | VARCHAR(100) | NOT NULL | |
| market_id | VARCHAR(100) | NOT NULL | |
| position_id | UUID | NOT NULL | |
| agent_id | UUID | NOT NULL | |
| bet_id | UUID | NOT NULL | |
| settlement_type | VARCHAR(20) | NOT NULL | WIN, LOSS, VOID, PUSH |
| stake | BIGINT | NOT NULL | The position's stake |
| payout | BIGINT | NOT NULL | Amount paid to/from agent |
| profit_loss | BIGINT | NOT NULL | Agent P&L (positive = profit) |
| idempotency_key | VARCHAR(200) | UNIQUE, NOT NULL | {position_id}_{event_result_hash} |
| status | VARCHAR(20) | NOT NULL, DEFAULT 'COMPLETED' | COMPLETED, REVERSED, RE_SETTLED |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | |
Indexes:
- idx_settlements_event on (event_id)
- idx_settlements_agent_time on (agent_id, created_at DESC)
- idx_settlements_bet on (bet_id)
- UNIQUE on (idempotency_key)
Table: audit_trail (partitioned by month on created_at)
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID | PRIMARY KEY | |
| bet_id | UUID | NULLABLE | NULL for config change audits |
| record_type | VARCHAR(30) | NOT NULL | BET_PLACED, BET_SETTLED, BET_VOIDED, CONFIG_CHANGED |
| agent_id | UUID | NOT NULL | Primary agent for this record |
| user_id | UUID | NULLABLE | |
| event_id | VARCHAR(100) | NULLABLE | |
| payload | JSONB | NOT NULL | Full structured audit data |
| processing_time_ms | INTEGER | NULLABLE | |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | |
Indexes:
- idx_audit_bet on (bet_id) WHERE bet_id IS NOT NULL
- idx_audit_agent_time on (agent_id, created_at DESC)
- idx_audit_type_time on (record_type, created_at DESC)
Simplification vs v2: No checksum or previous_checksum columns. No checksum chain. The JSONB payload is the truth. If you need tamper detection later, add checksums in a migration.
Table: dead_letter_queue
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID | PRIMARY KEY | |
| source | VARCHAR(50) | NOT NULL | BET_PROCESSING, SETTLEMENT, HEDGE, RECONCILIATION |
| reference_id | UUID | NOT NULL | bet_id, settlement_id, etc. |
| error_message | TEXT | NOT NULL | |
| error_stack | TEXT | NULLABLE | |
| payload | JSONB | NOT NULL | Full context for retry |
| retry_count | INTEGER | NOT NULL, DEFAULT 0 | |
| max_retries | INTEGER | NOT NULL, DEFAULT 3 | |
| status | VARCHAR(20) | NOT NULL, DEFAULT 'PENDING' | PENDING, RETRYING, RESOLVED, ESCALATED |
| resolved_by | UUID | NULLABLE | |
| resolved_at | TIMESTAMPTZ | NULLABLE | |
| resolution_notes | TEXT | NULLABLE | |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | |
| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | |
Indexes:
- idx_dlq_status on (status) WHERE status IN ('PENDING', 'RETRYING')
- idx_dlq_reference on (reference_id)
Table: hedge_orders
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID | PRIMARY KEY | |
| bet_id | UUID | NOT NULL | |
| event_id | VARCHAR(100) | NOT NULL | |
| market_id | VARCHAR(100) | NOT NULL | |
| selection | VARCHAR(255) | NOT NULL | |
| side | VARCHAR(10) | NOT NULL | BACK or LAY |
| target_price | DECIMAL(10,4) | NOT NULL | Price the punter received |
| requested_amount | BIGINT | NOT NULL | In paisa |
| filled_amount | BIGINT | NOT NULL, DEFAULT 0 | In paisa |
| betfair_bet_id | VARCHAR(100) | NULLABLE | Betfair's order reference |
| status | VARCHAR(20) | NOT NULL, DEFAULT 'QUEUED' | QUEUED, SENT, FILLED, FAILED, UNHEDGED |
| retry_count | INTEGER | NOT NULL, DEFAULT 0 | |
| error_message | TEXT | NULLABLE | |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | |
| sent_at | TIMESTAMPTZ | NULLABLE | When sent to Betfair |
| filled_at | TIMESTAMPTZ | NULLABLE | |
Indexes:
- idx_hedge_bet on (bet_id)
- idx_hedge_status on (status) WHERE status IN ('QUEUED', 'SENT')
Simplification vs v2: No reprice_count, max_reprice_attempts, time_in_force_seconds, limit_price, slippage, or average_fill_price. We send market orders, not limit orders. Retry 3 times. Mark as UNHEDGED on failure.
Table: alerts
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID | PRIMARY KEY | |
| alert_type | VARCHAR(50) | NOT NULL | NO_NEW_RISK_ACTIVATED, LIMIT_APPROACHING, RECONCILIATION_MISMATCH, HEDGE_FAILED, DLQ_ENTRY, etc. |
| severity | VARCHAR(5) | NOT NULL | P1, P2, P3 |
| agent_id | UUID | NULLABLE | |
| title | VARCHAR(255) | NOT NULL | |
| description | TEXT | NOT NULL | |
| metadata | JSONB | NULLABLE | |
| status | VARCHAR(20) | NOT NULL, DEFAULT 'ACTIVE' | ACTIVE, ACKNOWLEDGED, RESOLVED |
| acknowledged_by | UUID | NULLABLE | |
| resolved_at | TIMESTAMPTZ | NULLABLE | |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | |
Indexes:
- idx_alerts_active on (status) WHERE status = 'ACTIVE'
- idx_alerts_agent on (agent_id, created_at DESC)
Note on removed tables vs v2:
- No config_changelog table: config change audits go into audit_trail with record_type = 'CONFIG_CHANGED' and full old/new values in the JSONB payload.
- No feature_flags table: use environment variables (e.g., FEATURE_CASCADING_ROUTING=true) or a simple JSON config file. Reload on deploy.
- No reconciliation_results / reconciliation_discrepancies tables: reconciliation mismatches are written as alerts with alert_type = 'RECONCILIATION_MISMATCH' and details in the metadata JSONB.
- No agent_hierarchy_history table: hierarchy changes go into audit_trail with record_type = 'CONFIG_CHANGED'.
34. API Design
Bet Placement APIs
| Method | Path | Description | Auth |
|---|---|---|---|
| POST | /api/v1/bets | Place a new bet | User JWT |
| GET | /api/v1/bets/:betId | Get bet details with routing info | User/Agent JWT |
| GET | /api/v1/bets | List bets with filters | Agent JWT |
| POST | /api/v1/bets/:betId/void | Void an open bet | Admin JWT |
| POST | /api/v1/bets/:betId/cancel | Cancel within window | User/Agent JWT |
| POST | /api/v1/bets/:betId/cash-out | Request cash-out offer | User JWT |
| POST | /api/v1/bets/:betId/cash-out/accept | Accept cash-out | User JWT |
| POST | /api/v1/bets/simulate | Dry-run (no money, full routing) | Agent JWT |
POST /api/v1/bets -- Request:
{
"user_id": "uuid",
"event_id": "string",
"market_id": "string",
"selection": "MI to win",
"side": "BACK",
"stake": 1000000,
"odds": 1.85,
"market_type": "MATCH_ODDS",
"sport_type": "CRICKET",
"event_phase": "PRE_MATCH",
"liquidity_band": "HIGH"
}
All monetary values in paisa. 1000000 paisa = 10,000 INR.
Response (success):
{
"bet_id": "uuid",
"status": "ACCEPTED",
"accepted_stake": 1000000,
"stake_reduced": false,
"potential_win": 850000,
"message": null
}
Response (stake reduced):
{
"bet_id": "uuid",
"status": "ACCEPTED_REDUCED",
"accepted_stake": 588200,
"original_stake": 1000000,
"stake_reduced": true,
"potential_win": 500000,
"message": "Maximum stake at these odds: 5,882 INR"
}
Response (rejected):
{
"bet_id": null,
"status": "REJECTED",
"reason": "MARKET_SUSPENDED",
"message": "This market is currently unavailable."
}
Agent Configuration APIs
| Method | Path | Description | Auth |
|---|---|---|---|
| GET | /api/v1/agents/:agentId | Get agent profile and config | Agent JWT |
| PATCH | /api/v1/agents/:agentId | Update agent settings | Agent JWT |
| GET | /api/v1/agents/:agentId/limits | Get all limits | Agent JWT |
| PUT | /api/v1/agents/:agentId/limits | Set/update limits | Agent JWT |
| GET | /api/v1/agents/:agentId/matrix | Get forwarding matrix rules | Agent JWT |
| POST | /api/v1/agents/:agentId/matrix/rules | Add a matrix rule | Agent JWT |
| PUT | /api/v1/agents/:agentId/matrix/rules/:ruleId | Update a rule | Agent JWT |
| DELETE | /api/v1/agents/:agentId/matrix/rules/:ruleId | Delete a rule | Agent JWT |
| POST | /api/v1/agents/:agentId/matrix/test | Test a bet against the matrix | Agent JWT |
| GET | /api/v1/agents/:agentId/exposure | Current exposure summary | Agent JWT |
| GET | /api/v1/agents/:agentId/exposure/:scope | Exposure for a specific scope | Agent JWT |
| POST | /api/v1/agents/:agentId/panic | Trigger panic mode | Agent JWT |
| GET | /api/v1/agents/:agentId/sub-agents | List sub-agents | Agent JWT |
| GET | /api/v1/agents/:agentId/trust-config | Get trust settings | Agent JWT |
| PUT | /api/v1/agents/:agentId/trust-config/:subAgentId | Update trust | Agent JWT |
POST /api/v1/agents/:agentId/matrix/rules -- Request:
{
"market_type": "FANCY",
"sport_type": "CRICKET",
"event_phase": "IN_PLAY",
"source_type": "*",
"liquidity_band": "*",
"forward_percentage": 70.00
}
Response:
{
"rule_id": "uuid",
"new_matrix_version": 48,
"specificity": 3,
"effective_immediately": true
}
User Management APIs
| Method | Path | Description | Auth |
|---|---|---|---|
| GET | /api/v1/users/:userId | Get user profile | Agent JWT |
| PATCH | /api/v1/users/:userId | Update user settings | Agent JWT |
| POST | /api/v1/users/:userId/override | Set user forward override | Agent JWT |
| DELETE | /api/v1/users/:userId/override | Remove override | Agent JWT |
| POST | /api/v1/users/:userId/classify | Set classification | Agent JWT |
| GET | /api/v1/users/:userId/bets | Bet history | Agent JWT |
Settlement APIs
| Method | Path | Description | Auth |
|---|---|---|---|
| POST | /api/v1/settlements/events/:eventId | Trigger settlement | System/Admin JWT |
| GET | /api/v1/settlements/events/:eventId | Settlement status | Agent JWT |
| POST | /api/v1/settlements/events/:eventId/reverse | Reverse settlement | Admin JWT |
| POST | /api/v1/settlements/events/:eventId/resettle | Re-settle with corrected results | Admin JWT |
| GET | /api/v1/settlements/agents/:agentId | Agent settlement history | Agent JWT |
POST /api/v1/settlements/events/:eventId -- Request:
{
"result": {
"winner": "MI",
"market_results": {
"match_odds": { "winning_selection": "MI to win" },
"fancy_180_runs": { "actual_value": 187, "line": 180 }
}
},
"result_source": "OFFICIAL",
"confirmed_by": "admin_uuid"
}
Admin APIs
| Method | Path | Description | Auth |
|---|---|---|---|
| POST | /api/v1/admin/agents | Create agent | Admin JWT |
| POST | /api/v1/admin/agents/:agentId/suspend | Suspend agent | Admin JWT |
| POST | /api/v1/admin/agents/:agentId/reactivate | Reactivate | Admin JWT |
| GET | /api/v1/admin/dead-letter-queue | View DLQ | Admin JWT |
| POST | /api/v1/admin/dead-letter-queue/:id/retry | Retry DLQ entry | Admin JWT |
| POST | /api/v1/admin/dead-letter-queue/:id/resolve | Resolve DLQ entry | Admin JWT |
| POST | /api/v1/admin/reconciliation/run | Manual reconciliation | Admin JWT |
Monitoring APIs
| Method | Path | Description | Auth |
|---|---|---|---|
| GET | /api/v1/monitoring/health | Health check | Public |
| GET | /api/v1/monitoring/alerts | Active alerts | Admin JWT |
| POST | /api/v1/monitoring/alerts/:id/acknowledge | Ack alert | Admin JWT |
35. The Bet Processing Pipeline (Step by Step)
This is the most important section. Every step from HTTP request to response.
Latency Budget (2-tier cache: Redis and PostgreSQL)
| Step | Budget | Description |
|---|---|---|
| 1. Request parsing & validation | 5ms | Parse JSON, Zod validate |
| 2. Timestamp & metrics | 1ms | Assign server timestamp, compute win/liability |
| 3. User win cap check | 8ms | Redis for aggregate, compute max stake |
| 4. Matrix version capture | 1ms | Read agent.matrix_version from Redis |
| 5. Forwarding resolution | 10ms | Precedence chain lookup |
| 6. Cascade routing (per level) | 12ms x N | Limit checks + retention calc at each level |
| 7. Position creation + ledger update | 15ms | DB transaction |
| 8. Hedge queue placement | 2ms | BullMQ add |
| 9. Audit trail write | 5ms | Async BullMQ job (non-blocking) |
| 10. Response | 2ms | HTTP response |
| Total (3-level cascade) | ~85ms | Within 90ms budget |
Without an in-process LRU cache, lookups that a 3-tier design would serve from the LRU now hit Redis instead (still sub-millisecond). Total latency rises by roughly 3-5ms compared to the 3-tier approach, which is well within budget.
Step 1: Request Received (5ms)
What happens: HTTP POST arrives at /api/v1/bets. Express middleware parses JSON. Zod validates the schema: user_id is UUID, stake is positive integer, odds is positive decimal, side is BACK or LAY, all required fields present.
What can go wrong:
- Malformed JSON: Return 400.
- Missing or invalid fields: Return 400 with specific validation errors.
Error handling: Immediate rejection. No audit record. No further processing.
Step 2: Timestamp Assignment and Metrics Computation (1ms)
What happens:
2a. Assign a server-side timestamp: const processedAt = Date.now(). This is the authoritative time for all downstream decisions (period boundaries, ordering). Never trust the client timestamp.
2b. Compute financial metrics using integer arithmetic in paisa:
potential_win = Math.floor(stake * (odds - 1))
liability = potential_win (for a BACK bet from the bookie's perspective)
For a LAY bet, the liability is different:
// LAY bet: punter risks stake * (odds - 1), bookie's liability = stake
liability = stake (for a LAY bet)
potential_win = Math.floor(stake * (odds - 1))
2c. Determine period context. Convert processedAt to the originating agent's timezone. Check if the time falls within the agent's night period (closed-open interval: [night_start, night_end)). Set period_context to NIGHT or DAY.
What can go wrong: Nothing. Pure computation.
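The Step 2b arithmetic can be sketched as a small helper. This is illustrative (the function and type names are assumed, not the actual module); note that odds arrive as a floating-point number, which Math.floor guards against fractional paisa.

```typescript
// Sketch of Step 2b: financial metrics in integer paisa, per the rules above.
type Side = "BACK" | "LAY";

interface BetMetrics {
  potentialWin: number; // paisa the punter can win
  liability: number;    // paisa the bookie can lose
}

function computeBetMetrics(stakePaisa: number, odds: number, side: Side): BetMetrics {
  const potentialWin = Math.floor(stakePaisa * (odds - 1));
  // BACK: the bookie's liability is the punter's potential win.
  // LAY: the punter is the layer, so the bookie's liability is the stake.
  const liability = side === "BACK" ? potentialWin : stakePaisa;
  return { potentialWin, liability };
}
```

For the request example earlier (stake 1,000,000 paisa at odds 1.85, BACK), this yields a potential win of 850,000 paisa, matching the sample response.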
Step 3: User Win Cap Check (8ms)
What happens:
3a. Per-click win cap: Compare potential_win against user.per_click_win_limit. If exceeded, compute the max allowable stake:
max_stake = Math.floor(per_click_win_limit / (odds - 1))
3b. Aggregate win cap: Read the user's accumulated potential wins for today from Redis key agg_win:{user_id}:{date}. If adding this bet's potential_win exceeds user.aggregate_win_limit_daily, compute:
remaining_win = aggregate_limit - current_accumulated
max_stake_from_aggregate = Math.floor(remaining_win / (odds - 1))
3c. Take the minimum of max_stake, max_stake_from_aggregate, and the original stake. If the result is less than user.min_stake, reject the bet entirely with "This market is currently unavailable at these odds."
3d. If the stake was reduced, set stake_reduction_reason to PER_CLICK_LIMIT or AGGREGATE_LIMIT.
3e. After accepting the bet, atomically increment the Redis aggregate counter: INCRBY agg_win:{user_id}:{date} {potential_win}. Set TTL to end of day if not already set.
What can go wrong:
- Redis unavailable for aggregate check: Fall back to a PostgreSQL query -- SELECT SUM(potential_win) FROM bets WHERE user_id = $1 AND created_at >= $today. Adds ~15ms but is correct.
- Two simultaneous bets both read the same aggregate before either increments: The 10% buffer on the aggregate limit absorbs this. The INCRBY is atomic, so the second bet will see the updated value for subsequent bets.
Error handling: If reduced stake is below minimum, return status: REJECTED, reason: BELOW_MINIMUM.
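Steps 3a-3d can be condensed into one pure decision function. A hedged sketch with assumed names (the real implementation also has to read the Redis aggregate; here that value is passed in):

```typescript
// Sketch of the Step 3 stake-capping decision. All monetary values in paisa.
interface WinCapInput {
  requestedStake: number;
  odds: number;
  perClickWinLimit: number;
  aggregateWinLimitDaily: number;
  accumulatedWinToday: number; // from agg_win:{user_id}:{date}
  minStake: number;
}

type WinCapResult =
  | { status: "ACCEPTED"; acceptedStake: number; reason: null }
  | { status: "ACCEPTED_REDUCED"; acceptedStake: number; reason: "PER_CLICK_LIMIT" | "AGGREGATE_LIMIT" }
  | { status: "REJECTED"; reason: "BELOW_MINIMUM" };

function applyWinCaps(input: WinCapInput): WinCapResult {
  const unit = input.odds - 1;
  const maxFromPerClick = Math.floor(input.perClickWinLimit / unit);      // 3a
  const remainingWin = input.aggregateWinLimitDaily - input.accumulatedWinToday;
  const maxFromAggregate = Math.floor(remainingWin / unit);               // 3b
  const accepted = Math.min(input.requestedStake, maxFromPerClick, maxFromAggregate); // 3c
  if (accepted < input.minStake) return { status: "REJECTED", reason: "BELOW_MINIMUM" };
  if (accepted === input.requestedStake) return { status: "ACCEPTED", acceptedStake: accepted, reason: null };
  const reason = maxFromPerClick <= maxFromAggregate ? "PER_CLICK_LIMIT" : "AGGREGATE_LIMIT"; // 3d
  return { status: "ACCEPTED_REDUCED", acceptedStake: accepted, reason };
}
```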
Step 4: Matrix Version Capture (1ms)
What happens: Read the originating agent's matrix_version from Redis key agent:{agent_id}:matrix_version. Store it in the processing context as matrix_version_snapshot. This value will be saved on the bet record to enable deterministic replay.
What can go wrong:
- Redis miss: Read from PostgreSQL (SELECT matrix_version FROM agents WHERE id = $1). Adds ~3ms.
Why this matters: If Rajesh changes his matrix after this bet is captured but before it finishes processing, the bet still uses the version it captured. During disputes, the audit record shows which version was used.
Step 5: Forwarding Percentage Resolution (10ms)
What happens: For each agent in the cascade (starting with the punter's direct agent), resolve the forwarding percentage using the 4-level precedence chain:
5a. User override: Check Redis key user_override:{user_id}:{agent_id}. If active and not expired, use its forward_percentage. Done.
5b. Market override: Check Redis key market_override:{agent_id}:{event_id}. If active and not expired, use its forward_percentage. Done.
5c. Matrix rule lookup: Load the agent's active rules from Redis key matrix:{agent_id}. This is a JSON array of all active rules. In application code, filter to rules where every non-wildcard dimension matches the bet's characteristics. From matching rules, pick the one with the highest specificity. If tied on specificity, pick the highest forward_percentage (risk-safe default). If still tied, pick the oldest rule (lowest created_at).
5d. Agent default: If no rule matches (should never happen if a catch-all wildcard rule exists), use agent.default_forward_percentage.
What can go wrong:
- Agent has no matrix rules AND no default: Log a P2 alert. Use 100% forward (safest possible -- the agent retains nothing).
- Redis cache miss for matrix rules: Load from PostgreSQL. On load, populate Redis key with TTL of 1 hour.
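The 5c matching and tie-break rules can be sketched as follows. Field names are illustrative (the actual rule schema lives in forwarding_matrix_rules); the tie-break order is exactly the one stated above: specificity, then highest forward percentage, then oldest rule.

```typescript
// Sketch of Step 5c: filter matching rules, then apply the tie-break chain.
interface MatrixRule {
  id: string;
  marketType: string;    // "*" = wildcard
  sportType: string;
  eventPhase: string;
  sourceType: string;
  liquidityBand: string;
  forwardPercentage: number;
  createdAt: number;     // epoch ms
}

interface BetDims {
  marketType: string;
  sportType: string;
  eventPhase: string;
  sourceType: string;
  liquidityBand: string;
}

const DIMS = ["marketType", "sportType", "eventPhase", "sourceType", "liquidityBand"] as const;

function matches(rule: MatrixRule, bet: BetDims): boolean {
  return DIMS.every((d) => rule[d] === "*" || rule[d] === bet[d]);
}

function specificity(rule: MatrixRule): number {
  return DIMS.filter((d) => rule[d] !== "*").length;
}

function pickRule(rules: MatrixRule[], bet: BetDims): MatrixRule | null {
  const candidates = rules.filter((r) => matches(r, bet));
  if (candidates.length === 0) return null; // fall through to agent default (5d)
  candidates.sort(
    (a, b) =>
      specificity(b) - specificity(a) ||           // most specific first
      b.forwardPercentage - a.forwardPercentage || // then risk-safe: higher forward %
      a.createdAt - b.createdAt                    // then oldest rule
  );
  return candidates[0];
}
```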
Step 6: Cascade Routing -- Level by Level (12ms per level)
This is the heart of the system. The cascade engine walks up the agent hierarchy from the punter's agent to the platform.
For each level in the hierarchy:
6a. Determine source_type. Does this agent have their own classification for the user? Check user_classifications (cached in Redis). If yes, use it. If no, check agent_trust_config -- does this agent trust the downstream agent's flags? If yes, use the downstream classification. If no, default to NORMAL.
6b. Resolve forward percentage using the same precedence chain as Step 5, but for THIS agent's matrix with the resolved source_type.
6c. Calculate retention:
retained_stake = Math.floor(incoming_stake * (100 - forward_percentage) / 100)
forwarded_stake = incoming_stake - retained_stake
retained_liability = Math.floor(retained_stake * (odds - 1))
6d. Check all applicable limits. For this agent, read exposure from Redis for each relevant scope:
- Sport limit: exposure:{agent_id}:SPORT:{sport_type}
- Market limit: exposure:{agent_id}:MARKET:{event_id}
- Night period limit: exposure:{agent_id}:NIGHT_PERIOD:{date}
- Weekly period limit: exposure:{agent_id}:WEEKLY_PERIOD:{week}
For each limit, compare current_exposure + retained_liability against limit_amount. The most restrictive limit determines max retention.
The exposure check decision:
Read the exposure value from Redis. If the agent is far from their limit (say, below 80% utilization), trust the Redis value and proceed. If the agent is near their limit (above 80%), use PostgreSQL with SELECT ... FOR UPDATE to get an authoritative locked value. The 80% threshold is configurable per agent.
Why 80%? At 80% utilization, even if Redis is stale by a few bets, the remaining 20% headroom absorbs the error. At 95% utilization, staleness matters because a few extra bets could breach the limit.
6e. If all limits pass: Agent retains the calculated amount.
6f. If any limit would be breached: Agent retains only up to the most restrictive remaining capacity. The difference becomes overflow added to forwarded_stake.
Real example: Rajesh's matrix says retain 60% of Amit's 10,000 INR bet = 6,000 INR stake. But Rajesh's MI vs CSK match limit has only 3,000 INR remaining. Rajesh retains 3,000 INR. The extra 3,000 INR (overflow) is added to forwarded, making it 7,000 INR total forwarded to Vikram.
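Steps 6c-6f reduce to a small split function. A sketch with assumed names; for simplicity the most restrictive limit is expressed here as remaining stake headroom, whereas the real 6d check compares liabilities per scope:

```typescript
// Sketch of Steps 6c-6f: matrix retention capped by remaining capacity,
// with any overflow rolled into the forwarded amount. Values in paisa.
interface LevelSplit {
  retainedStake: number;
  forwardedStake: number;
  overflow: number; // portion of the desired retention that the limit refused
}

function splitAtLevel(
  incomingStake: number,
  forwardPercentage: number,
  remainingCapacityStake: number
): LevelSplit {
  const wanted = Math.floor((incomingStake * (100 - forwardPercentage)) / 100); // 6c
  const retainedStake = Math.min(wanted, remainingCapacityStake);               // 6e/6f
  const overflow = wanted - retainedStake;
  return { retainedStake, forwardedStake: incomingStake - retainedStake, overflow };
}
```

Running the Rajesh example (1,000,000 paisa incoming, 40% forward, 300,000 paisa of capacity left) reproduces the numbers above: 300,000 retained, 700,000 forwarded, 300,000 overflow.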
6g. NO_NEW_RISK check: If the agent is in NO_NEW_RISK for any applicable scope (checked via Redis flag nnr:{agent_id}:{scope}), determine if this bet is a hedge:
worst_case_before = current worst-case liability across all outcomes
worst_case_after = worst-case liability if this bet's retention is added
If worst_case_after < worst_case_before: it is a hedge. ALLOW retention.
If worst_case_after >= worst_case_before: NOT a hedge. Forward 100%.
For a simple back bet on the same outcome as existing liability: never a hedge. For a back bet on the opposite outcome (or a lay bet): usually a hedge.
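The 6g hedge test can be sketched with a per-outcome exposure map. This is an illustrative model (the real system derives worst-case liability from open positions): each outcome maps to the agent's current net liability, and the bet's retention contributes a delta per outcome.

```typescript
// Sketch of Step 6g: retention is a hedge only if it strictly reduces
// the worst-case liability across all outcomes.
function worstCase(exposure: Record<string, number>): number {
  return Math.max(0, ...Object.values(exposure));
}

function isHedge(
  before: Record<string, number>,       // current net liability per outcome
  deltas: Record<string, number>        // this retention's liability change per outcome
): boolean {
  const after: Record<string, number> = { ...before };
  for (const [outcome, d] of Object.entries(deltas)) {
    after[outcome] = (after[outcome] ?? 0) + d;
  }
  return worstCase(after) < worstCase(before); // strict reduction => hedge
}
```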
6h. Suspended agent: If agent.status = SUSPENDED, skip entirely. Forward 100% to the parent.
6i. Record the decision for this level in a local array (will become the audit payload).
6j. Forward remaining to next level. If the current agent is the platform, the remaining amount is the hedge amount.
What can go wrong:
- Parent agent not found (broken hierarchy): P1 alert. Forward directly to the platform.
- DB lock timeout on near-limit check: Retry once (most waits resolve in <20ms). If retry fails, assume limit breached, forward 100% from this level. Safe default.
Step 7: Position Creation and Ledger Update (15ms)
What happens: Write positions and update exposure ledgers. This is done sequentially, level by level, each level in its own database transaction.
For each level (e.g., Rajesh, then Vikram, then Platform):
7a. Begin a PostgreSQL transaction.
7b. Insert the RETAINED position row for this agent.
7c. Update the exposure_ledger rows for this agent:
UPDATE exposure_ledger
SET retained_open_liability = retained_open_liability + $retained_liability,
forwarded_open_liability = forwarded_open_liability + $forwarded_liability,
open_potential_win = open_potential_win + $potential_win,
last_updated_at = NOW()
WHERE agent_id = $agent_id
AND scope_type = $scope_type
AND scope_key = $scope_key;
If no row exists yet for this scope, insert one (upsert pattern).
7d. If the ledger update pushes the agent into NO_NEW_RISK territory (retained_open_liability >= limit), set no_new_risk_active = true and update Redis flag.
7e. Commit the transaction.
7f. After commit, update Redis exposure counter: SET exposure:{agent_id}:{scope_type}:{scope_key} {new_value}.
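The 7c update plus its "insert if missing" fallback can be collapsed into one atomic statement, relying on the UNIQUE (agent_id, scope_type, scope_key) constraint from the exposure_ledger schema. A sketch only: placeholder style assumes a node-postgres-like driver, and gen_random_uuid() assumes PostgreSQL 13+ (or the pgcrypto extension).

```typescript
// One possible single-statement upsert for the 7c ledger update.
// $1..$6 = agent_id, scope_type, scope_key, retained delta, forwarded delta, win delta.
const UPSERT_EXPOSURE_SQL = `
  INSERT INTO exposure_ledger
    (id, agent_id, scope_type, scope_key,
     retained_open_liability, forwarded_open_liability, open_potential_win, last_updated_at)
  VALUES (gen_random_uuid(), $1, $2, $3, $4, $5, $6, NOW())
  ON CONFLICT (agent_id, scope_type, scope_key) DO UPDATE SET
    retained_open_liability  = exposure_ledger.retained_open_liability  + EXCLUDED.retained_open_liability,
    forwarded_open_liability = exposure_ledger.forwarded_open_liability + EXCLUDED.forwarded_open_liability,
    open_potential_win       = exposure_ledger.open_potential_win       + EXCLUDED.open_potential_win,
    last_updated_at          = NOW()`;
```

This removes the read-then-insert race for first-time scopes: two concurrent bets on a brand-new scope both succeed, and the second one adds its deltas instead of failing the unique constraint.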
Why sequential per-level instead of one big transaction? No cross-agent locking. Rajesh's transaction does not block Vikram's. If Rajesh's write succeeds but Vikram's fails, Rajesh's position is committed. The bet is marked as routing_status = PARTIAL and the remaining levels are retried via the DLQ.
What can go wrong:
- DB write fails at level 2 of 3: Level 1 positions are already committed. Mark bet as routing_status = PARTIAL. Queue a retry job for levels 2-3 in BullMQ. The retry reads the bet record to see which levels are done and resumes from the incomplete level.
- Redis update fails after DB commit: Redis is stale until the next write or until reconciliation corrects it. Acceptable because the DB is the source of truth.
Step 8: Hedge Queue Placement (2ms)
What happens: If the platform's forwarded amount (the amount to be hedged on Betfair) is greater than zero, add a BullMQ job to the hedge-orders queue:
{
"bet_id": "uuid",
"event_id": "string",
"market_id": "string",
"selection": "MI to win",
"side": "BACK",
"target_price": 1.85,
"amount_paisa": 80000
}
This is async. The hedge does not block the bet response.
What can go wrong:
- BullMQ/Redis unavailable: Write the hedge order directly to the hedge_orders table in PostgreSQL with status = QUEUED. A polling job picks up queued orders every 5 seconds.
Step 9: Audit Trail Write (async, non-blocking)
What happens: Add a BullMQ job to the audit-trail queue with the complete audit payload:
The payload includes:
- Original bet details (all fields)
- The forwarding chain: for each level, the incoming stake, the forward percentage source (USER_OVERRIDE, MATRIX_RULE, etc.), the rule that matched, the retention amount, every limit checked and its result, any overflow
- The matrix_version_snapshot
- The period context
- Processing timestamps for each step
This write is intentionally async. The bet is confirmed to the punter before the audit record is persisted. If the audit write fails, the bet is still valid -- a reconciliation job detects orphaned bets without audit records and regenerates them from position data.
What can go wrong:
- Audit job fails 3 times: Goes to the DLQ. P3 alert. Manual investigation. The bet itself is unaffected.
Step 10: Response to Punter (2ms)
What happens: Return the HTTP response. Emit a Socket.IO event to the agent's dashboard with the new bet.
What can go wrong:
- Client already disconnected: The bet is processed. Client can query GET /api/v1/bets/:betId.
- Socket.IO delivery failure: Non-critical. Dashboard polls every 5 seconds as a fallback.
36. The Settlement Pipeline (Step by Step)
Step 1: Event Result Received
An admin or automated feed posts the result to POST /api/v1/settlements/events/:eventId. The system validates the result, confirms the event is in a settleable state (LIVE or SUSPENDED), and enqueues a BullMQ settlement job.
Step 2: Market Resolution
For each market in the event, determine winners and losers:
| Market Type | Resolution Logic |
|---|---|
| MATCH_ODDS | Selection matching winning_selection = WIN. Others = LOSS. |
| FANCY (over/under) | actual_value >= line: OVER wins. actual_value < line: UNDER wins. |
| BOOKMAKER | Same as MATCH_ODDS. |
Step 3: Position Identification
Query all open positions for this event:
SELECT * FROM positions
WHERE event_id = :eventId AND status = 'OPEN'
ORDER BY bet_id, cascade_level
Group by bet_id, then by agent_id.
Step 4: Per-Position Settlement (Idempotent)
For each position:
4a. Generate idempotency key: {position_id}_{sha256(event_result_json)}.
4b. Check if a settlement with this key exists. If yes, skip. Already settled.
4c. Determine WIN or LOSS from the market resolution.
4d. Calculate settlement:
- WIN (punter wins, agent loses): payout = position.liability, profit_loss = -position.liability
- LOSS (punter loses, agent wins): payout = 0, profit_loss = position.stake
- VOID: payout = position.stake (return stake), profit_loss = 0
4e. In a transaction:
- Insert settlement record.
- Update position status to SETTLED.
- Decrement exposure ledger:
  - retained_open_liability -= position.liability (for RETAINED)
  - forwarded_open_liability -= position.liability (for FORWARDED)
  - open_potential_win -= position.potential_win
4f. After commit, update Redis exposure values.
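The 4a key can be sketched with Node's built-in crypto module. One caveat, hedged here rather than in the spec: JSON.stringify key order must be stable across settlement runs for the hash to be deterministic, so a production version would canonicalize the result JSON first.

```typescript
// Sketch of the Step 4a idempotency key: {position_id}_{sha256(event_result_json)}.
// Re-running settlement with the same result produces the same key (skip);
// a corrected result produces a fresh key (re-settle).
import { createHash } from "node:crypto";

function settlementIdempotencyKey(positionId: string, eventResult: unknown): string {
  const hash = createHash("sha256").update(JSON.stringify(eventResult)).digest("hex");
  return `${positionId}_${hash}`;
}
```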
What can go wrong:
- Idempotency key already exists: Skip. Safe by design.
- DB failure mid-settlement: The failed position stays OPEN. A settlement monitor job picks it up and retries every 30 seconds. After 5 failures, it goes to the DLQ.
Step 5: NO_NEW_RISK Re-evaluation
After settling positions, check all agents who were in NO_NEW_RISK. If retained_open_liability < limit, clear the NO_NEW_RISK flag in both PostgreSQL and Redis. Fire a P3 alert: "Rajesh exited NO_NEW_RISK for cricket."
Step 6: Reconciliation Check
Run a targeted check for each affected agent: compare the ledger value to the sum of open positions. If they match, all good. If they differ, insert an alert with alert_type = RECONCILIATION_MISMATCH. Details in the metadata JSONB.
Step 7: Notifications
Mark the event as SETTLED. Emit Socket.IO events. Queue settlement summary notifications.
37. Configuration Management
How Matrix Rules Are Cached in Redis
When the system needs an agent's matrix rules (during bet processing), it follows this path:
- Check Redis key matrix:{agent_id}. If present, use it.
- If missing, load from PostgreSQL: SELECT * FROM forwarding_matrix_rules WHERE agent_id = $1 AND is_active = true.
- Store in Redis as matrix:{agent_id} with a 1-hour TTL.
- Return the rules.
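The read-through path above can be sketched as a cache-aside helper. The minimal key-value interface here is an assumption so the logic is testable without a live Redis; the real system would back it with an actual Redis client.

```typescript
// Sketch of the matrix read path: Redis hit, else PostgreSQL, then repopulate.
interface KV {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

async function loadMatrixRules(
  agentId: string,
  redis: KV,
  loadFromDb: (agentId: string) => Promise<object[]>
): Promise<object[]> {
  const key = `matrix:${agentId}`;
  const cached = await redis.get(key);
  if (cached !== null) return JSON.parse(cached);     // cache hit
  const rules = await loadFromDb(agentId);            // miss -> PostgreSQL
  await redis.set(key, JSON.stringify(rules), 3600);  // repopulate, 1-hour TTL
  return rules;
}
```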
Cache Invalidation: Simple and Direct
When an agent's matrix changes (rule added, updated, or deleted):
- The API handler writes the change to PostgreSQL.
- Increment agents.matrix_version in the same transaction.
- After commit, delete Redis key matrix:{agent_id}.
- Delete Redis key agent:{agent_id}:matrix_version.
That is it. No pub/sub broadcast needed for v1 (single process). The next bet that needs this agent's matrix will find a cache miss, load from PostgreSQL, and repopulate Redis.
If you scale to multiple Node.js instances: Add a Redis pub/sub channel config.invalidate. After deleting the Redis key, publish a message {agent_id, type: "MATRIX"}. Each instance subscribes and can clear any in-process caches. But for v1 with one process, this is unnecessary.
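The delete-on-change step, with the optional multi-instance pub/sub hook, might look like this. The `KeyDeleter` and `Publish` types are stand-ins for illustration:

```typescript
// Sketch of delete-on-change invalidation; interfaces are stand-ins.
interface KeyDeleter {
  del(key: string): Promise<void>;
}
type Publish = (channel: string, message: string) => Promise<void>;

async function invalidateAgentMatrix(agentId: string, cache: KeyDeleter, publish?: Publish): Promise<void> {
  // Runs only after the PostgreSQL transaction (rule change + version bump) commits.
  await cache.del(`matrix:${agentId}`);
  await cache.del(`agent:${agentId}:matrix_version`);
  // Only needed once there is more than one Node.js process.
  if (publish) {
    await publish("config.invalidate", JSON.stringify({ agent_id: agentId, type: "MATRIX" }));
  }
}
```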
What Gets Cached Where
| Data | Redis Key Pattern | TTL | Invalidation |
|---|---|---|---|
| Agent matrix rules | matrix:{agent_id} | 1 hour | Deleted on any rule change |
| Agent matrix version | agent:{agent_id}:matrix_version | 1 hour | Deleted on any rule change |
| Agent limits | limits:{agent_id} | 1 hour | Deleted on limit change |
| User override | user_override:{user_id}:{agent_id} | 1 hour | Deleted on override change |
| Market override | market_override:{agent_id}:{event_id} | 1 hour | Deleted on override change |
| Exposure counter | exposure:{agent_id}:{scope_type}:{scope_key} | No TTL | Updated on every write |
| NO_NEW_RISK flag | nnr:{agent_id}:{scope} | No TTL | Set/cleared on state change |
| User aggregate wins | agg_win:{user_id}:{date} | End of day | Incremented atomically |
| User classification | user_class:{user_id}:{agent_id} | 1 hour | Deleted on change |
38. Error Handling Patterns
Error Categories
| Category | Examples | Handling |
|---|---|---|
| Validation | Bad input, missing fields | Return 400. No processing. No audit. |
| Business rule | Suspended market, below-minimum stake | Return 200 with status: REJECTED. Audit record created. |
| Transient infrastructure | Redis timeout, DB pool exhausted | Retry up to 3 times with backoff (100ms, 200ms, 400ms). Fall back or 503. |
| Permanent infrastructure | DB down | Circuit breaker opens. All bets fall back to 100% forward (safe default). |
| Data integrity | Negative ledger, broken hierarchy | P1 alert. DLQ. Manual investigation. |
| External service | Betfair API error | Hedge queue absorbs it. Mark as UNHEDGED after 3 retries. |
Retry Policies
| Operation | Max Retries | Backoff | On Exhaustion |
|---|---|---|---|
| Redis read | 2 | 50ms, 100ms | Fall back to PostgreSQL |
| Redis write | 2 | 50ms, 100ms | Log warning; DB is source of truth |
| PostgreSQL read | 2 | 100ms, 200ms | Return 503 |
| PostgreSQL write | 1 | 100ms | DLQ the operation |
| Betfair API | 3 | 1s, 2s, 4s | Mark hedge order as UNHEDGED |
| Settlement position | 5 | 1s, 5s, 30s, 60s, 120s | DLQ |
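All of the policies in this table share one shape: a fixed retry count with a doubling backoff schedule. A generic helper sketch (names are illustrative; the exhaustion action is left to the caller, per the "On Exhaustion" column):

```typescript
// Generic retry-with-backoff sketch; schedule doubles from baseDelayMs,
// e.g. 100ms, 200ms, 400ms for baseDelayMs = 100 and maxRetries = 3.
async function retryWithBackoff<T>(
  op: () => Promise<T>,
  maxRetries: number,
  baseDelayMs: number,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      if (attempt === maxRetries) break;
      const delay = baseDelayMs * 2 ** attempt; // exponential backoff
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError; // caller decides: fall back, return 503, or DLQ
}
```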
Circuit Breakers
| Service | Opens After | Behavior When Open | Closes After |
|---|---|---|---|
| Redis | 5 consecutive failures in 10 seconds | All reads go to PostgreSQL. Writes go to PostgreSQL only. | 30 seconds of successful PostgreSQL reads |
| Betfair | 5 consecutive failures in 60 seconds | All hedge orders queue up. No new Betfair calls. | Probe every 30 seconds; close after 3 successes |
When Redis circuit breaker is open, the system still works -- just slower. Every Redis read becomes a PostgreSQL read (add ~5ms per read). This is the graceful degradation path.
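A minimal consecutive-failure breaker covering the open/half-open behavior in the table could look like this. It is a sketch, not the production implementation; threshold and cooldown values are passed in:

```typescript
// Minimal circuit breaker sketch: opens after N consecutive failures,
// half-opens (allows a probe) after the cooldown elapses.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(
    private readonly failureThreshold: number, // e.g. 5
    private readonly cooldownMs: number,       // e.g. 30_000
  ) {}

  isOpen(now: number = Date.now()): boolean {
    if (this.openedAt === null) return false;
    if (now - this.openedAt >= this.cooldownMs) {
      this.openedAt = null; // half-open: let a probe request through
      this.failures = 0;
      return false;
    }
    return true;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure(now: number = Date.now()): void {
    this.failures++;
    if (this.failures >= this.failureThreshold) this.openedAt = now;
  }
}
```

The real breakers also need the time-window condition ("5 failures in 10 seconds") and, for Betfair, a close-after-3-successes rule on top of this skeleton.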
DLQ Integration
Any operation that exhausts retries creates a DLQ entry:
Source: BET_PROCESSING / SETTLEMENT / HEDGE
Reference ID: the bet_id or position_id
Error: message + stack trace
Payload: full context to retry manually
Status: PENDING
A P2 alert fires on every DLQ entry. Admins resolve via API: retry, void and refund, or manual override.
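The entry fields listed above map to a shape like the following. The type and builder are illustrative assumptions, not the actual schema:

```typescript
// Illustrative DLQ entry shape matching the fields listed above.
type DlqSource = "BET_PROCESSING" | "SETTLEMENT" | "HEDGE";

interface DlqEntry {
  source: DlqSource;
  referenceId: string;               // bet_id or position_id
  errorMessage: string;
  errorStack?: string;
  payload: Record<string, unknown>;  // full context to retry manually
  status: "PENDING" | "RETRIED" | "RESOLVED";
  createdAt: Date;
}

function buildDlqEntry(
  source: DlqSource,
  referenceId: string,
  err: Error,
  payload: Record<string, unknown>,
): DlqEntry {
  return {
    source,
    referenceId,
    errorMessage: err.message,
    errorStack: err.stack,
    payload,
    status: "PENDING", // every new entry starts PENDING and fires a P2 alert
    createdAt: new Date(),
  };
}
```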
39. Testing Strategy
Unit Test Targets
| Module | Coverage | Key Scenarios |
|---|---|---|
| MatrixResolutionModule | 95% | Wildcard matching, specificity tie-breaking (most specific wins, then highest forward wins, then oldest wins), precedence chain order (user override > market > matrix > default), missing rules fallback to 100% forward |
| CascadeEngineModule | 95% | 2-level cascade, 4-level cascade, limit overflow cascading extra to upline, suspended agent skip, NO_NEW_RISK with hedge detection (back bet blocked, lay bet allowed) |
| LimitEnforcementModule | 95% | All 4 limit types, most restrictive wins, exact boundary (limit minus 1 paisa passes, limit exactly fails), period-aware checking |
| ExposureLedgerModule | 90% | Increment/decrement, upsert on first bet, Redis update after commit, Redis fallback to PostgreSQL |
| SettlementModule | 95% | WIN/LOSS/VOID settlement, idempotency (same key returns without re-processing), ledger decrement to zero, re-settlement |
| StakeReductionModule | 95% | Per-click reduction, aggregate reduction, below-minimum rejection, edge case odds (1.01 produces huge stake, 1000.00 produces tiny stake) |
| HedgeExecutionModule | 90% | Successful fill, Betfair error retry, 3x failure marks UNHEDGED |
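The exact-boundary scenario called out for LimitEnforcementModule ("limit minus 1 paisa passes, limit exactly fails") pins down the comparison as strict less-than. A sketch of that check, with assumed names and amounts in paise:

```typescript
// Boundary semantics sketch: a bet passes only if it leaves utilization
// strictly below the limit. All amounts in paise; names are illustrative.
function passesLimit(
  currentExposurePaise: number,
  additionalLiabilityPaise: number,
  limitPaise: number,
): boolean {
  return currentExposurePaise + additionalLiabilityPaise < limitPaise;
}
```

Encoding this as `<` rather than `<=` is exactly what the boundary unit test locks in.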
Integration Test Scenarios
| Scenario | What It Tests | Expected Outcome |
|---|---|---|
| Full 3-level cascade bet | Amit bets 10,000 INR through Rajesh to Vikram to Platform | 3 retained positions created, all exposure ledgers incremented, audit record complete, hedge order queued |
| Limit overflow | Rajesh has 3,000 INR remaining on match limit, bet wants 6,000 INR retention | Rajesh retains 3,000, overflow of 3,000 forwarded to Vikram, Vikram processes normally |
| NO_NEW_RISK hedge | Rajesh in NO_NEW_RISK, lay bet arrives on the overexposed outcome | Lay bet accepted as hedge, Rajesh's worst-case liability decreases |
| NO_NEW_RISK non-hedge | Rajesh in NO_NEW_RISK, back bet arrives on the same outcome | Back bet forwarded 100% to Vikram |
| Stake reduction | Bet at odds 50.00 with 50,000 INR per-click limit | Stake reduced from 5,000 to 1,020 INR |
| Settlement cascade | Event settles, positions across 3 agents | All positions settled, ledgers decremented, NO_NEW_RISK cleared |
| Matrix version capture | Agent changes matrix between two bets | First bet uses old version, second uses new version, audit records differ |
| Concurrent bets near limit | 10 simultaneous bets at 95% utilization | All resolve, total does not exceed limit (PostgreSQL FOR UPDATE path) |
| Betfair timeout | Hedge order sent, Betfair returns 503 | Retried 3 times, marked UNHEDGED, bet still accepted |
| Redis outage | Redis killed mid-test | System falls back to PostgreSQL for all reads, latency increases, correctness maintained |
| Void cascade | Admin voids a bet, all positions reversed | Every position reversed, every ledger decremented, NO_NEW_RISK re-evaluated |
| Cash-out | Amit requests cash-out at improved odds | Counter-positions created using original proportions (not current matrix), exposure reduced |
| Suspended agent | Vikram suspended, bet arrives for Rajesh | Rajesh's overflow skips Vikram, goes directly to Platform |
| Idempotent settlement | Settlement triggered twice for same event | Second invocation is a no-op (idempotency keys match) |
Load Test Scenarios
| Scenario | Traffic | Success Criteria |
|---|---|---|
| Sustained peak | 167 bets/sec for 30 minutes | P99 < 90ms, zero errors |
| Burst spike | 500 bets/sec for 60 seconds | P99 < 200ms, error rate < 0.1% |
| Settlement storm | 3 events settle simultaneously (5,000 positions) | Completes within 5 minutes |
Chaos Test Scenarios
| Scenario | Simulation | Expected Behavior |
|---|---|---|
| Redis down | Kill Redis | Circuit breaker opens in 3 seconds. All reads fall back to PostgreSQL. No data loss. Bets accepted at higher latency. |
| PostgreSQL down | Kill PostgreSQL | All writes fail. 503 returned. Alert fires. Bets queued on client side or rejected. |
| Betfair down | Block outbound to Betfair | Hedge queue grows. Bets accepted. Platform absorbs risk. UNHEDGED alert fires. |
| Slow PostgreSQL | Add artificial 50ms latency | P99 increases. Some bets exceed budget. No data loss. |
40. Deployment Strategy
Docker Compose for Local Dev
services:
app:
build: .
ports:
- "3000:3000"
environment:
DATABASE_URL: postgresql://hannibal:secret@postgres:5432/hannibal
REDIS_URL: redis://redis:6379
BETFAIR_API_KEY: test_key
NODE_ENV: development
depends_on:
- postgres
- redis
postgres:
image: postgres:16
environment:
POSTGRES_DB: hannibal
POSTGRES_USER: hannibal
POSTGRES_PASSWORD: secret
volumes:
- postgres_data:/var/lib/postgresql/data
ports:
- "5432:5432"
redis:
image: redis:7
ports:
- "6379:6379"
volumes:
postgres_data:
Three services. One command: docker compose up. Database migrations via npx prisma migrate dev.
Production Deployment
One Docker container running the Node.js application with all modules and BullMQ workers. Deploy behind nginx or a cloud load balancer for SSL termination.
| Component | Spec | Notes |
|---|---|---|
| App container | 2 vCPU, 4 GB RAM | Single instance to start. Add a second when P99 exceeds 90ms under load. |
| PostgreSQL | 4 vCPU, 16 GB RAM, SSD | Managed service (RDS, Cloud SQL, etc.). Automated backups. |
| Redis | 2 vCPU, 4 GB RAM | Managed service (ElastiCache, Redis Cloud). Persistence enabled. |
Feature Flag Rollout Process
Feature flags are environment variables:
FEATURE_CASCADING_ROUTING=false
FEATURE_USER_WIN_LIMITS=false
FEATURE_NO_NEW_RISK=false
FEATURE_HEDGE_EXECUTION=false
FEATURE_PERIOD_MANAGEMENT=false
Rollout steps:
- Deploy code with feature off (env var = false).
- Enable for a single test agent by adding agent-specific override in config.
- Monitor for 24-48 hours: check audit trails, ledger accuracy.
- Enable globally (set env var = true). Redeploy.
- After 2 weeks stable, remove the feature flag check from code.
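The flag check plus the per-agent override from step 2 can be sketched as below. The `_AGENT_OVERRIDES` env-var convention is an assumption for illustration, not a documented mechanism:

```typescript
// Feature flag sketch: globally on via env var, or on for specific agents
// via a hypothetical comma-separated override variable.
function isFeatureEnabled(
  flag: string,
  agentId: string,
  env: Record<string, string | undefined>,
): boolean {
  if (env[flag] === "true") return true; // globally enabled
  const overrides = env[`${flag}_AGENT_OVERRIDES`]; // e.g. "agent_42,agent_99" (assumed convention)
  return overrides !== undefined && overrides.split(",").includes(agentId);
}
```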
41. Implementation Phases
Phase 1: Foundation (Weeks 1-4)
Goal: All building blocks exist but are not connected into a pipeline.
| Week | Deliverables |
|---|---|
| 1 | Prisma schema migration: all tables. Seed data with test agents (Vikram, Rajesh, Priya) and test users (Amit, Sonia). Docker Compose setup. Basic Express app with health endpoint. |
| 2 | ExposureLedgerModule: PostgreSQL reads/writes, single counter per agent per scope (no sharding). Redis read/write for counters. LimitEnforcementModule: check all 4 limit types, return max retainable amount. |
| 3 | MatrixResolutionModule: full 5D wildcard matching with specificity tie-breaking. Precedence chain. ConfigModule: load matrix rules from DB, cache in Redis with delete-on-change invalidation. |
| 4 | UserManagementModule: per-click win cap check, aggregate win cap (Redis INCRBY). StakeReductionModule. AuditModule: BullMQ job that writes JSONB audit record. |
Phase 2: Core Pipeline (Weeks 5-8)
Goal: End-to-end bet placement and settlement through the full cascade.
| Week | Deliverables |
|---|---|
| 5 | CascadeEngineModule: N-level cascade with matrix resolution and limit checking per level. Overflow handling. Suspended agent skip. BetProcessingModule: orchestrates the full pipeline from HTTP request to response. |
| 6 | Position creation with per-level transactions. Exposure ledger updates atomic with positions. End-to-end bet placement through 3 levels. Full integration tests. |
| 7 | SettlementModule: event result processing, position settlement (idempotent), ledger decrement, bet status updates. |
| 8 | Void/cancellation state machine. Cash-out (counter-position using original proportions). Lay bet support. Feature flag: enable cascade per agent. |
End of Phase 2: The system can accept, route, settle, void, and cash out bets through the full cascade. This is the MVP.
Phase 3: Production Hardening (Weeks 9-12)
Goal: Ready for real traffic with safety mechanisms.
| Week | Deliverables |
|---|---|
| 9 | NO_NEW_RISK: automatic trigger, hedge detection, scoped activation. Period management: night/weekly periods, timezone handling, carry-forward logic. |
| 10 | HedgeExecutionModule: Betfair API integration, market orders, 3x retry, mark UNHEDGED on failure. BullMQ hedge queue. |
| 11 | Dead letter queue with admin API (list, retry, resolve). Nightly reconciliation: compare ledgers to position sums, alert on mismatch. |
| 12 | Circuit breakers for Redis and Betfair. Redis fallback to PostgreSQL. End-to-end chaos testing. Load testing at 167 bets/sec. |
End of Phase 3: Production-ready for controlled launch.
Phase 4: Intelligence (Weeks 13-16)
| Week | Deliverables |
|---|---|
| 13 | Sharp detection: CLV calculation over rolling 500-bet window, behavioral scoring, automatic classification. |
| 14 | Collusion detection: classification flip timing correlation, forwarded-flow P&L asymmetry detection. |
| 15 | Agent dashboard: real-time exposure, traffic light view, bet feed, settlement summary. Socket.IO push updates. |
| 16 | WhatsApp integration: scheduled status messages, interactive commands (status, stop, resume, panic). |
Phase 5: Scale-Up (When Data Demands It)
This phase has no fixed timeline. Each item is triggered by a measurable threshold, not a calendar date. See the Scale-Up Reference section below.
42. Scale-Up Reference
This section is the bridge between v1 (simple) and v2 (full architecture). Add complexity only when data proves you need it.
When to Add Each Optimization
| Optimization | Trigger Threshold | What to Do | Expected Impact |
|---|---|---|---|
| Application LRU cache | Redis P99 latency exceeds 5ms under sustained load | Add an in-memory LRU cache (e.g., lru-cache npm) with 5-second TTL in front of Redis for exposure reads and matrix lookups | Reduces Redis calls by ~60%. Drops P99 back below 2ms. Adds cache invalidation complexity. |
| Sharded exposure counters | PostgreSQL write contention on exposure_ledger exceeds 50ms P99 for hot agents | Add shard_index column to exposure_ledger. Use 8 shards per hot agent. Each write picks a random shard. Reads sum all shards. | Reduces lock contention by 8x. |
| Prometheus + Grafana | You have a dedicated ops person (or the team exceeds 5 engineers) | Add prom-client for metrics. Deploy Prometheus and Grafana. Build dashboards for bet latency, settlement time, hedge fill rate, DLQ depth. | Real-time operational visibility. Replaces log-grep debugging with dashboards. |
| Read replicas | Dashboard queries slow down bet processing (same PostgreSQL instance) | Add a PostgreSQL read replica. Route all dashboard/reporting queries to the replica. | Isolates read load from write path. |
| Multiple Node.js instances | Single process CPU exceeds 70% sustained during peak | Add a second instance behind a load balancer. BullMQ handles job distribution automatically. | Doubles throughput capacity. |
| Separate audit database | Audit trail table exceeds 100 GB or audit writes cause write contention | Move audit_trail to a separate PostgreSQL instance. Change audit write module to target the new connection. | Isolates audit I/O from transactional I/O. |
| Multi-currency | Expanding beyond India (Southeast Asia, Africa) | Add base_currency to agents, FX conversion at platform boundary, FX rate capture in audit trail. See v2 Section 34 for full design. | Enables Ghanaian Cedis, Thai Baht, etc. |
| Responsible gambling module | Entering regulated markets (UK, EU, Australia) | Add self-exclusion, session limits, reality checks, deposit limits. See v2 Section 46 for full design. | Regulatory compliance. |
| Message broker (Kafka) | Bet volume exceeds 1000 bets/sec sustained | Replace BullMQ for bet event streaming with Kafka. Keep BullMQ for background jobs. | Higher throughput, better backpressure, multi-consumer. |
| Microservice extraction | A specific module (e.g., hedge execution) needs independent scaling or a separate deployment cycle | Extract the module into its own service with its own repository, deployment, and scaling. Communicate via message queue. | Independent scaling and deployment. |
| Limit order hedging | Betfair slippage on market orders exceeds 2% average | Replace market orders with limit orders. Add reprice logic, time-in-force, partial fill tracking. See v2 Section 52 hedge_orders schema. | Better execution quality, less slippage. |
| Matrix version history table | Agents create more than 100 matrix versions and audit replay needs fast version lookup | Add a matrix_versions table storing immutable snapshots. Reference by version_id instead of version integer. | Faster historical replay, cleaner audit. |
| Reconciliation frequency | Nightly reconciliation is too slow to catch drift during IPL matches | Add 15-minute scheduled reconciliation. Add post-settlement targeted reconciliation. See v2 Section 36 for full design. | Faster drift detection. |
How to Measure Each Threshold
| Metric | How to Measure in v1 (without Prometheus) |
|---|---|
| Redis P99 latency | Pino log with redis_latency_ms on each call. Query logs: sort, take 99th percentile. |
| PostgreSQL write contention | Pino log with db_write_ms on exposure_ledger updates. Watch for spikes above 50ms. |
| CPU utilization | process.cpuUsage() logged every 60 seconds. Or top / container metrics from your hosting platform. |
| Bet volume | Count from Pino logs or a simple in-memory counter logged every minute. |
| Hedge slippage | target_price - fill_price logged on each hedge fill. Average over a rolling window. |
| Audit table size | SELECT pg_total_relation_size('audit_trail') run nightly. |
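The log-derived P99 in the first row reduces to "sort the sampled latencies, take the value at the 99th-percentile rank". A sketch (one common rank convention; others differ by at most one sample):

```typescript
// P99 from a batch of logged latency samples: sort ascending and take
// the value at the ceil(0.99 * n) rank.
function p99(latenciesMs: number[]): number {
  if (latenciesMs.length === 0) return 0;
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  const rank = Math.ceil(0.99 * sorted.length) - 1; // zero-based index
  return sorted[Math.max(0, rank)];
}
```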
The Principle
Start simple. Measure everything. Add complexity only when a specific, measurable threshold is crossed. Every optimization in the v2 document is available as a well-documented upgrade path. You never need to redesign -- you add a layer, a shard, or a service.
The simplest architecture that works is the best architecture until it stops working. Then you have data to tell you exactly where to invest.
This document is maintained by the Hannibal engineering and product teams. For the full-scale architecture with all optimizations, refer to B-Book-Architecture-v2.md.