Phase 1: Comprehensive Logging to AWS S3

Context

Hannibal has Winston logging in 106+ files, but it writes to the console only (plain text, printf format). Promtail, Loki, and Grafana already run in docker-compose, but Loki uses local filesystem storage (lost on container restart) and is on the outdated version 2.9.2. Large logging gaps remain across critical systems: roughly 75 backend files have little or no logging. This plan addresses both problems: fill every logging gap and ship all logs to S3 for durable, queryable storage.


Current State vs Target

| Area | Current Coverage | Target | Files Affected |
| --- | --- | --- | --- |
| Order placement | ~70% | 100% | 1 file |
| Settlement | ~65% | 100% | 2 files |
| Provider API calls | ~20% | 100% | 4 adapter clients |
| Exchange coordination & failover | ~10% | 100% | 4 files (coordinator, failover, routing, registry) |
| Bifrost queue/recovery pipeline | ~5% | 100% | 3 files (queue, recovery, bet consumer) |
| B-Book system | ~15% | 100% | 5 files (config, fill, settlement, state, filter) |
| Background jobs | ~20% | 100% | 4 job files |
| Financial transactions & accounting | ~15% | 100% | 10 files (ledger, commission, take, carryover, margin, etc.) |
| Monitoring agents & evaluation | ~10% | 100% | 9 files (base, player, cricket, soccer, compiler, evaluator, resolver, crud, presets) |
| AI interactions (backend) | ~5% | 100% | 2 files (agentService, fixtureAnalysisService) |
| AI chat service (Python) | ~10% | 100% | 5+ files |
| Authentication & permissions | ~30% | 100% | 3 files (auth route, auth middleware, agentPermissions) |
| WebSocket events | ~60% | 100% | 3 files (clientWs, oddsPapiWs, RoanuzWebSocketManager) |
| Redis operations | ~25% | 90% | 1 file |
| Cricket services | ~20% | 100% | 6 files (ballByBall, analysis, idResolver, milestones, sessions, toss) |
| Notification & push | ~10% | 100% | 2 files (notificationService, pushService) |
| Telegram bot | ~0% | 100% | 5 files (bot, linkService, rateLimit, uiRenderer, types) |
| Routes (API endpoints) | ~30% | 90% | 6 uncovered routes |
| Odds transformation & margins | ~5% | 100% | 3 files (oddsTransformer, marginService, limitsService) |
| Agent hierarchy & user agents | ~10% | 100% | 3 files (agentHierarchy, agentMonitoring, userAgent) |
| Database operations | ~50% | 90% | 1 file |
| Error handling middleware | ~40% | 90% | 1 file |

Total files requiring logging changes: ~80+


Architecture: Logging Pipeline to S3

Application (Winston JSON → stdout)
│
▼
Promtail (already deployed, reads Docker container logs)
│
▼
Loki 3.x (upgrade from 2.9.2, reconfigure: filesystem → S3 backend)
│
├──► S3 bucket (chunks + indexes)
│ └── Lifecycle: Standard 30d → IA 90d → Glacier 365d → Deep Archive
│
▼
Grafana (real-time queries via Loki, historical via Athena)

Why this approach (not direct Winston→S3):

  • Promtail+Loki already deployed — config change only, zero new infra
  • Decoupled from app — if S3 is slow, app is unaffected
  • Structured JSON stdout → Promtail parses automatically
  • Grafana already connected to Loki for real-time querying
  • Direct Winston→S3 transport is fragile (blocks event loop on failure, loses logs on crash)

CRITICAL: Loki version upgrade required. The target configuration (schema v13 with the TSDB index store) is the recommended setup for Loki 3.x, so upgrade from 2.9.2 to Loki 3.0.0+ before enabling the S3 backend. TSDB has been available since Loki 2.8, so 2.9.x is not strictly limited to boltdb-shipper, but boltdb-shipper is the legacy index store and is deprecated in 3.x. Plan targets Loki 3.x directly.


Detailed Logging Gaps Audit

AREA 1: ORDER PLACEMENT FLOW (Currently ~70%)

File: backend/src/services/orderService.ts

Currently logged:

  • Point value loading
  • B-Book routing decision & rerouting logic
  • Agent booking capture
  • B-Book fill execution and failures
  • PAL readiness check
  • Betslip fetch & slippage calculation
  • PAL order submission
  • Partial refund scenarios
  • Batch order completion

MISSING — must add:

  • Validation step entry/exit with specific failure reason
  • Pre/post balance deduction values (critical for race condition debugging)
  • Exposure cache invalidation timing and actual exposure values
  • Slippage magnitude (requested odds vs live odds, delta value)
  • Tick snapping in batch orders (OLD → NEW values)
  • Exchange portion under minimum — threshold check failure reason
  • Request UUID lifecycle tracking
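As an illustration of the slippage item above, the following is a minimal sketch of a helper that turns requested and live odds into loggable fields. The function name `buildSlippageLog`, the field names, and the 4-decimal rounding are assumptions for illustration, not existing code in orderService.ts.

```typescript
interface SlippageLog {
  orderId: string;
  requestedOdds: number;
  liveOdds: number;
  delta: number; // liveOdds - requestedOdds
  deltaPct: number; // delta as a percentage of requestedOdds
}

// Round to 4 decimal places so floating-point noise stays out of the logs.
const round4 = (n: number): number => Math.round(n * 10000) / 10000;

// Hypothetical helper: builds the structured fields for one slippage log line.
function buildSlippageLog(orderId: string, requestedOdds: number, liveOdds: number): SlippageLog {
  const delta = round4(liveOdds - requestedOdds);
  const deltaPct = round4((delta / requestedOdds) * 100);
  return { orderId, requestedOdds, liveOdds, delta, deltaPct };
}

// Example: the user asked for 2.5 but the market moved to 2.0.
console.log(JSON.stringify(buildSlippageLog("ord_789", 2.5, 2.0)));
```

The same pattern (compute the values once, log them as named fields) would apply to the pre/post balance and tick-snapping items.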

AREA 2: SETTLEMENT FLOW (Currently ~65%)

Files: backend/src/services/settlement.ts, backend/src/jobs/orderSyncJob.ts

Currently logged:

  • Settlement polling start, fixture status verification
  • Settlement data retrieval, market-level processing
  • Commission processing, B-Book settlement
  • Order status outcomes, PnL divergence detection

MISSING — must add:

  • Settlement data structure validation (unexpected format detection)
  • Fixture status change reasons (why skipping: "LIVE but we only process FINISHED")
  • Winning outcome extraction logic (filtering/mapping decisions)
  • Market-level outcome determination (what data influenced win/void decision)
  • Void order handling reason (refund amount = deducted amount)
  • Commission rate lookup result (which config row matched)
  • Margin profit edge case detection steps
  • Ledger entry constraint error context

AREA 3: PROVIDER API CALLS (Currently ~20%) ⚠️ CRITICAL

Files: backend/src/exchanges/adapters/betfair/BetfairClient.ts, pinnacle/PinnacleClient.ts, bifrost/BifrostClient.ts, roanuz/RoanuzDataAdapter.ts

MISSING — must add for ALL providers:

  • Every API request/response with latency timing
  • Session refresh timing and success/failure
  • Authentication errors with reason
  • Rate limit header values and throttle state
  • Retry logic (which attempt succeeded, why retries exhausted)
  • Request/response payload sizes
  • Error code extraction and categorization
  • Failover decisions (which provider failed, which chosen as fallback)

AREA 4: EXCHANGE COORDINATION & FAILOVER (Currently ~10%) ⚠️ CRITICAL — NOT IN ORIGINAL PLAN

Files:

  • backend/src/exchanges/core/coordinator/ExchangeCoordinator.ts — main routing orchestrator
  • backend/src/exchanges/core/coordinator/FailoverManager.ts — circuit breaker (closed/open/half-open)
  • backend/src/exchanges/core/coordinator/RoutingStrategy.ts — provider selection logic
  • backend/src/exchanges/core/coordinator/IDRegistry.ts — cross-provider ID mapping with Redis cache

MISSING — must add:

  • Every routing decision: which provider selected, why, fallback chain considered
  • Circuit breaker state transitions: closed→open (with failure count), open→half-open (timeout), half-open→closed (success)
  • Provider health check results: latency, error rate, availability score
  • ID resolution: canonical→provider mapping lookups, cache hits/misses, fuzzy match scores
  • Failover triggers: original provider failure reason, fallback provider selected, degraded mode entered
  • Request routing latency: time added by coordination layer
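The circuit breaker transitions listed above could be logged from one place by routing every state change through a single method. This is a minimal sketch under assumptions: the class name `LoggedCircuitBreaker`, its constructor parameters, and the log entry shape are hypothetical and do not describe the real FailoverManager.ts.

```typescript
type BreakerState = "closed" | "open" | "half-open";

interface TransitionLog {
  provider: string;
  from: BreakerState;
  to: BreakerState;
  reason: string;
  failureCount: number;
}

// Hypothetical breaker: every state change goes through transition(), so none is unlogged.
class LoggedCircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly provider: string,
    private readonly failureThreshold: number,
    private readonly resetTimeoutMs: number,
    private readonly log: (entry: TransitionLog) => void,
    private readonly now: () => number = Date.now,
  ) {}

  private transition(to: BreakerState, reason: string): void {
    this.log({ provider: this.provider, from: this.state, to, reason, failureCount: this.failures });
    this.state = to;
  }

  // Call before each request; returns false while the breaker is open.
  allowRequest(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.transition("half-open", "reset timeout elapsed");
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    if (this.state === "half-open") this.transition("closed", "probe request succeeded");
    this.failures = 0;
  }

  recordFailure(): void {
    this.failures += 1;
    if (this.state === "half-open") {
      this.openedAt = this.now();
      this.transition("open", "probe request failed");
    } else if (this.state === "closed" && this.failures >= this.failureThreshold) {
      this.openedAt = this.now();
      this.transition("open", "failure threshold reached");
    }
  }
}

// Demo with a fake clock: two failures open the breaker, the timeout half-opens it, a success closes it.
const transitions: TransitionLog[] = [];
let clock = 0;
const breaker = new LoggedCircuitBreaker("betfair", 2, 1000, (e) => transitions.push(e), () => clock);
breaker.recordFailure();
breaker.recordFailure();
clock = 1500;
breaker.allowRequest();
breaker.recordSuccess();
console.log(transitions.map((t) => `${t.from}->${t.to}`).join(", "));
```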

AREA 5: BIFROST QUEUE PIPELINE (Currently ~5%) ⚠️ CRITICAL — NOT IN ORIGINAL PLAN

Files:

  • backend/src/exchanges/adapters/bifrost/BifrostQueueManager.ts — RabbitMQ consumer (protobuf)
  • backend/src/exchanges/adapters/bifrost/BifrostRecoveryManager.ts — startup + periodic recovery
  • backend/src/exchanges/adapters/bifrost/BifrostBetConsumer.ts — live bet status/settlement

MISSING — must add:

  • Queue message consumption: queue name, message type, protobuf decode success/failure, processing latency
  • Recovery cycles: startup recovery (events + markets + phantoms), periodic 10-min recovery, items recovered count
  • Bet status updates: orderId, oldStatus→newStatus, settlement outcome from queue
  • Queue connection health: connected/disconnected, prefetch count, consumer tag
  • Message acknowledgment: ack/nack/reject decisions with reason
  • Protobuf decode errors: which field failed, raw payload size

AREA 6: B-BOOK SYSTEM (Currently ~15%) ⚠️ CRITICAL — NOT IN ORIGINAL PLAN

Files:

  • backend/src/services/bbook/bbookConfigService.ts — config, enable/disable, limits
  • backend/src/services/bbook/bbookFillService.ts — fill execution, position creation
  • backend/src/services/bbook/bbookSettlementService.ts — position settlement, P&L
  • backend/src/services/bbook/bbookStateService.ts — Redis + PG position tracking
  • backend/src/services/bbook/filterEngine.ts — routing decisions

MISSING — must add:

  • Config changes: who changed what setting, old→new values
  • Fill decisions: orderId, fillAmount, remainingCapacity, limits checked (global/market/selection/user)
  • Position creation: positionId, fixtureId, marketId, side (back/lay), stake, odds
  • Settlement: positionId, outcome (win/loss/void), grossPnL, netPnL
  • State tracking: exposure at global/market/selection/user levels, Redis vs PG sync status
  • Routing decisions: why B-Book (within limits) vs exchange (exceeded limits), specific limit that triggered routing
  • Limit breaches: which limit hit (globalPoolLimit, defaultMarketLimit, selectionLimit, userLimit, maxOdds, minStake)
  • Balancing alerts: when pool becomes unbalanced (e.g., >80% on one side)
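To make the "specific limit that triggered routing" item concrete, here is a minimal sketch of a routing check that reports which limit was breached. The limit names come from the list above; the function `decideRoute`, the `Exposure` shape, the check order, and the simplification that exposure grows by the stake are all assumptions, not the actual filterEngine.ts logic.

```typescript
interface BBookLimits {
  globalPoolLimit: number;
  defaultMarketLimit: number;
  selectionLimit: number;
  userLimit: number;
  maxOdds: number;
  minStake: number;
}

// Current exposure at each level, before the new order is added.
interface Exposure {
  global: number;
  market: number;
  selection: number;
  user: number;
}

interface RoutingDecisionLog {
  orderId: string;
  route: "bbook" | "exchange";
  breachedLimit: keyof BBookLimits | null;
  observed: number | null;
  limit: number | null;
}

// Hypothetical check: the first breached limit sends the order to the exchange.
function decideRoute(
  orderId: string,
  stake: number,
  odds: number,
  exposure: Exposure,
  limits: BBookLimits,
): RoutingDecisionLog {
  // [limit name, observed value, breached?], checked from narrowest to widest scope.
  const checks: Array<[keyof BBookLimits, number, boolean]> = [
    ["minStake", stake, stake < limits.minStake],
    ["maxOdds", odds, odds > limits.maxOdds],
    ["userLimit", exposure.user + stake, exposure.user + stake > limits.userLimit],
    ["selectionLimit", exposure.selection + stake, exposure.selection + stake > limits.selectionLimit],
    ["defaultMarketLimit", exposure.market + stake, exposure.market + stake > limits.defaultMarketLimit],
    ["globalPoolLimit", exposure.global + stake, exposure.global + stake > limits.globalPoolLimit],
  ];
  const hit = checks.find(([, , breached]) => breached);
  if (hit) {
    const [name, observed] = hit;
    return { orderId, route: "exchange", breachedLimit: name, observed, limit: limits[name] };
  }
  return { orderId, route: "bbook", breachedLimit: null, observed: null, limit: null };
}
```

Returning the decision as a plain object keeps the routing logic testable and lets the caller log it in one structured line.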

AREA 7: WEBSOCKET EVENTS (Currently ~60%)

Files: backend/src/services/clientWs.ts, backend/src/services/oddsPapiWs.ts

Currently logged:

  • Client connections, fixture subscriptions, disconnections
  • Redis subscriber ready, message handling
  • Broadcast operations, reconnection attempts

MISSING — must add:

  • Connection duration on disconnect
  • Message count per client session
  • Subscription failures
  • Broadcast recipient count (how many clients actually received)
  • Redis subscription drop detection
  • Ping/pong latency tracking
  • Message ordering/gap detection
  • Update frequency per fixture (sampled)
  • Odds extraction failures (when extraction returns zeros)

Additional file NOT in original plan:

  • backend/src/exchanges/adapters/roanuz/RoanuzWebSocketManager.ts — cricket streaming
    • Socket.IO connection state to Roanuz API
    • Subscription protocol events (subscribe/unsubscribe/ack)
    • gzip decompression failures
    • Event normalization errors
    • Auto-reconnection attempts and backoff timing

AREA 8: AI INTERACTIONS (Currently ~5%) ⚠️ CRITICAL

Files:

  • backend/src/ai/core/agentService.ts — Anthropic API calls, tool execution loop
  • backend/src/services/fixtureAnalysisService.ts — AI fixture analysis with caching (NOT IN ORIGINAL PLAN)

MISSING — must add:

  • Every Anthropic API call: model, inputTokens, outputTokens, latencyMs, success
  • Every tool execution in agentic loop: toolName, latencyMs, success/error
  • Conversation context size (message count, estimated tokens)
  • Prompt construction summary (not full prompt, but template + parameters)
  • Streaming latency for AI insights
  • Cache hits (insights from cache vs generated fresh)
  • Failed insight fallback strategy
  • Fixture analysis: fixtureId, analysisType, cacheHit, generationLatencyMs, provider used
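The per-call fields above (model, inputTokens, outputTokens, latencyMs, success) can be derived from the response itself, since the Anthropic Messages API returns `model`, `stop_reason`, and a `usage` object with `input_tokens` and `output_tokens`. The sketch below assumes that subset of the response; the helper name `buildAiCallLog` and the log shape are hypothetical.

```typescript
// The fields of an Anthropic Messages API response that the log line needs.
interface MessagesResponseLike {
  model: string;
  stop_reason: string | null;
  usage: { input_tokens: number; output_tokens: number };
}

interface AiCallLog {
  model: string;
  inputTokens: number | null;
  outputTokens: number | null;
  latencyMs: number;
  success: boolean;
  stopReason: string | null;
  error: string | null;
}

// Hypothetical helper: one log entry per API call, for both success and failure.
function buildAiCallLog(
  requestedModel: string,
  startedAtMs: number,
  finishedAtMs: number,
  outcome: MessagesResponseLike | Error,
): AiCallLog {
  const latencyMs = finishedAtMs - startedAtMs;
  if (outcome instanceof Error) {
    return {
      model: requestedModel,
      inputTokens: null,
      outputTokens: null,
      latencyMs,
      success: false,
      stopReason: null,
      error: outcome.message,
    };
  }
  return {
    model: outcome.model,
    inputTokens: outcome.usage.input_tokens,
    outputTokens: outcome.usage.output_tokens,
    latencyMs,
    success: true,
    stopReason: outcome.stop_reason,
    error: null,
  };
}
```

The same entry shape could be reused for tool executions in the agentic loop by swapping the token fields for toolName.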

AREA 9: MONITORING AGENTS & EVALUATION (Currently ~10%) ⚠️ CRITICAL

Files in original plan:

  • backend/src/services/monitoring/baseMonitoringService.ts
  • backend/src/services/monitoring/playerMonitoringService.ts
  • backend/src/services/monitoring/cricketMonitoringService.ts

Files NOT in original plan:

  • backend/src/services/monitoring/soccerMonitoringService.ts — soccer-specific monitoring
  • backend/src/services/monitoring/soccerCapabilities.ts — soccer capability definitions
  • backend/src/services/monitoring/soccerResolvers.ts — soccer metric resolvers
  • backend/src/services/monitoring/conditionCompiler.ts — condition→evaluable form
  • backend/src/services/monitoring/conditionEvaluator.ts — runtime condition evaluation
  • backend/src/services/monitoring/metricResolver.ts — metric value resolution from fixture data
  • backend/src/services/monitoring/monitoringCrudService.ts — CRUD for monitor configs
  • backend/src/services/monitoring/presets.ts — preset conditions by domain
  • backend/src/services/monitoring/utils.ts — monitoring utilities

MISSING — must add across ALL monitoring files:

  • Each evaluation cycle: monitorId, conditionsCount, conditionsMet, alertFired, evalDurationMs
  • Condition compilation: input condition → compiled form, compilation errors
  • Condition evaluation: metricName, resolvedValue, operator, threshold, result (pass/fail)
  • Metric resolution: which resolver used, data source, resolution latency
  • Alert firing decisions (WHY triggered or suppressed)
  • Deduplication logic (duplicate detection, alerts skipped)
  • Event subscription errors
  • Monitor cache coherency (hits/misses/invalidations)
  • Periodic checker execution timing
  • CRUD operations: monitor created/updated/deleted by userId
  • Preset loading: which presets applied, domain
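A minimal sketch of how one evaluation cycle could produce both the per-condition fields (metricName, resolvedValue, operator, threshold, result) and the cycle summary (conditionsCount, conditionsMet, alertFired, evalDurationMs). The operator set, the `evaluateCycle` name, and the rule that an alert fires only when all conditions pass are assumptions; the real conditionEvaluator.ts may combine conditions differently.

```typescript
type Operator = ">" | ">=" | "<" | "<=" | "==";

interface Condition {
  metricName: string;
  operator: Operator;
  threshold: number;
}

interface ConditionLog extends Condition {
  resolvedValue: number | null; // null when the metric could not be resolved
  result: "pass" | "fail";
}

const compare: Record<Operator, (a: number, b: number) => boolean> = {
  ">": (a, b) => a > b,
  ">=": (a, b) => a >= b,
  "<": (a, b) => a < b,
  "<=": (a, b) => a <= b,
  "==": (a, b) => a === b,
};

function evaluateCondition(condition: Condition, metrics: Record<string, number | undefined>): ConditionLog {
  const value = metrics[condition.metricName];
  const resolvedValue = value === undefined ? null : value;
  const passed = resolvedValue !== null && compare[condition.operator](resolvedValue, condition.threshold);
  return { ...condition, resolvedValue, result: passed ? "pass" : "fail" };
}

// Hypothetical cycle wrapper; assumes AND semantics across conditions.
function evaluateCycle(
  monitorId: string,
  conditions: Condition[],
  metrics: Record<string, number | undefined>,
  now: () => number = Date.now,
) {
  const start = now();
  const results = conditions.map((c) => evaluateCondition(c, metrics));
  const conditionsMet = results.filter((r) => r.result === "pass").length;
  return {
    monitorId,
    conditionsCount: conditions.length,
    conditionsMet,
    alertFired: conditions.length > 0 && conditionsMet === conditions.length,
    evalDurationMs: now() - start,
    results,
  };
}
```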

AREA 10: FINANCIAL TRANSACTIONS & ACCOUNTING (Currently ~15%) ⚠️ CRITICAL

Files in original plan:

  • backend/src/accounting/ledgerService.ts
  • backend/src/accounting/commissionService.ts
  • backend/src/accounting/agentSettlementService.ts

Files NOT in original plan:

  • backend/src/accounting/availableService.ts — available credit (AC) computation
  • backend/src/accounting/carryoverService.ts — deferred settlement (carryover) handling
  • backend/src/accounting/marginRevenueService.ts — margin profit tracking per order
  • backend/src/accounting/reportService.ts — settlement report generation
  • backend/src/accounting/takeService.ts — event-driven take balance updates (T = AC(self) + Σ AC(downline) - CL)
  • backend/src/accounting/transferSettleService.ts — settlement reconciliation between upline/downline
  • backend/src/accounting/settlementPeriodService.ts — period lifecycle (open→grace→finalized)
  • backend/src/accounting/calculations.ts — pure calc functions (round, commission, profitShare, booking)
  • backend/src/accounting/exportService.ts — Excel/CSV export for settlement reports

MISSING — must add:

  • Every ledger entry: accountType, entryType, amount, balanceAfter, transactionRef
  • Commission rate lookup (which config matched, rate applied)
  • Commission calculation steps: grossWinnings × rate = commission
  • Per-agent settlement: totalStakes, totalPayouts, bookingPercent, netSettlement
  • Available credit computation: userId, creditLimit, downlineAllocation, exposure, resultAC
  • Carryover requests: downlineId, amount, accepted/rejected, grace period
  • Margin revenue per order: orderId, placedOdds, displayOdds, marginProfit, slippageType
  • Take balance updates: userId, oldTake, newTake, trigger (which event caused update)
  • Transfer settlement: uplineId→downlineId, obligation, adjustedBalance
  • Settlement period transitions: periodId, oldState→newState, triggeredBy
  • Report generation: reportType (platform/agent/player), periodId, generationLatencyMs
  • Export operations: format (Excel/CSV), rowCount, fileSize, downloadedBy
  • Decimal rounding decisions: input, output, roundingMode, precision
  • Constraint violation context
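Two of the items above are plain arithmetic and can be sketched as pure functions that return their own log fields: the commission step (grossWinnings × rate = commission, with the rounding decision recorded) and the take formula T = AC(self) + Σ AC(downline) - CL quoted for takeService.ts. Working in integer minor units and basis points, the half-up rounding mode, and the function names are assumptions for illustration, not the contents of calculations.ts.

```typescript
interface CommissionLog {
  grossWinningsMinor: number; // amount in minor units (e.g. cents)
  rateBps: number; // commission rate in basis points (250 = 2.5%)
  rawCommissionMinor: number; // before rounding, may be fractional
  commissionMinor: number; // after rounding
  roundingMode: "half-up";
  precision: number; // decimal places kept, in minor units
}

// Hypothetical commission step: grossWinnings x rate = commission.
function computeCommission(grossWinningsMinor: number, rateBps: number): CommissionLog {
  // Assumption: no commission is charged on losses, so negative winnings count as zero.
  const base = Math.max(0, grossWinningsMinor);
  const rawCommissionMinor = (base * rateBps) / 10000;
  // Math.round rounds .5 upward, which is half-up for the non-negative values used here.
  const commissionMinor = Math.round(rawCommissionMinor);
  return { grossWinningsMinor, rateBps, rawCommissionMinor, commissionMinor, roundingMode: "half-up", precision: 0 };
}

// Take balance as quoted in the file list: T = AC(self) + sum of AC(downline) - CL.
function computeTake(acSelf: number, acDownline: number[], creditLimit: number): number {
  const downlineTotal = acDownline.reduce((sum, ac) => sum + ac, 0);
  return acSelf + downlineTotal - creditLimit;
}

console.log(JSON.stringify(computeCommission(12345, 250)));
console.log(computeTake(100, [50, 25], 120));
```

Logging `rawCommissionMinor` next to `commissionMinor` is what makes a rounding dispute auditable after the fact.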

AREA 11: AUTHENTICATION & PERMISSIONS (Currently ~30%)

Files in original plan:

  • backend/src/routes/auth.ts
  • backend/src/middleware/auth.ts

File NOT in original plan:

  • backend/src/middleware/agentPermissions.ts — permission flag checking (T/B/D/W)

MISSING — must add:

  • Every login attempt (wallet address, success/failure, failure reason)
  • Token verification failure reason (expired, invalid, revoked)
  • Session cache hit vs DB lookup path
  • User status check result value
  • Admin role grant/denial
  • Token expiry timing (how long token was valid)
  • New user registration events
  • Referral code validation results
  • Permission checks: userId, requiredFlag (Transfer/Bet/Deposit/Withdraw), granted/denied
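The permission check item could be logged with a small helper like the one below. It is a sketch only: it assumes the T/B/D/W flags are stored as a string of letters (for example "TB"), which the plan does not specify, and `checkPermission` is a hypothetical name.

```typescript
type PermissionFlag = "T" | "B" | "D" | "W";

const FLAG_NAMES: Record<PermissionFlag, string> = {
  T: "Transfer",
  B: "Bet",
  D: "Deposit",
  W: "Withdraw",
};

interface PermissionCheckLog {
  userId: string;
  requiredFlag: string;
  granted: boolean;
}

// Hypothetical check: grantedFlags is assumed to be a string such as "TB".
function checkPermission(userId: string, grantedFlags: string, required: PermissionFlag): PermissionCheckLog {
  return { userId, requiredFlag: FLAG_NAMES[required], granted: grantedFlags.includes(required) };
}

console.log(JSON.stringify(checkPermission("u_456", "TB", "W")));
```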

AREA 12: BACKGROUND JOBS (Currently ~20%) ⚠️ CRITICAL

Files: backend/src/jobs/orderSyncJob.ts, agentSettlementJob.ts, pinnacleSettlementJob.ts, reportEmailJob.ts

Currently logged: Job startup, job stop, "no orders to sync" (debug).

MISSING — must add:

  • Job execution timing (cycle start/end/duration)
  • Batch size per cycle
  • Each order status change detected (orderId, oldStatus → newStatus)
  • Retry logic (which orders retried, why)
  • Settlement check results (how many settled, which orders)
  • Error accumulation (fail count/types per cycle)
  • Agent settlement calculations per agent
  • Pinnacle settlement data received/processed
  • Report generation contents, recipients, send status, SMTP response

AREA 13: CRICKET SERVICES (Currently ~20%) — EXPANDED

File in original plan:

  • backend/src/services/cricket/ballByBallService.ts

Files NOT in original plan:

  • backend/src/services/cricket/cricketAnalysisService.ts — AI cricket analysis (head-to-head, venue, player form, toss, pitch)
  • backend/src/services/cricket/cricketIdResolver.ts — Betfair→Roanuz fuzzy ID matching
  • backend/src/services/cricket/milestoneAlertService.ts — batting/bowling milestone detection (50/100/150/200, 3/5-for, partnerships, powerplay, RRR)
  • backend/src/services/cricket/sessionBettingService.ts — phase analysis (Test/ODI/T20), projected scores vs historical
  • backend/src/services/cricket/tossAnalysisService.ts — toss results, pitch conditions, venue history, AI recommendations

MISSING — must add:

  • Ball-by-ball: ball#, runs, outcome, wicket details, team score, run rate
  • Analysis generation: matchId, analysisType, dataSource (Roanuz), latencyMs, cacheHit
  • ID resolution: betfairEventId, resolvedRoanuzMatchId, matchScore, method (exact/fuzzy), cacheHit
  • Milestones detected: matchId, milestoneType, player, value, timestamp
  • Session analysis: matchId, format, currentPhase, projectedScore, historicalAvg
  • Toss analysis: matchId, tossWinner, decision (bat/field), pitchCondition, venueHistory

AREA 14: NOTIFICATION & PUSH (Currently ~10%) — NOT IN ORIGINAL PLAN

Files:

  • backend/src/services/notificationService.ts — creates notifications, Redis pub/sub, push to mobile
  • backend/src/services/pushService.ts — web push (VAPID) to mobile/browser

MISSING — must add:

  • Notification creation: userId, type, title, channel (in-app/push/both)
  • Redis pub/sub publish: channel, payloadSize, subscriberCount
  • Push notification send: userId, deviceCount, success/failure per device
  • VAPID authentication: key rotation, errors
  • Push delivery failures: endpoint, statusCode, reason (expired subscription, etc.)

AREA 15: TELEGRAM BOT (Currently ~0%) — NOT IN ORIGINAL PLAN

Files:

  • backend/src/services/telegram/bot.ts — main bot (GramMyBot), commands, AI text processing
  • backend/src/services/telegram/linkService.ts — Telegram↔Hannibal account linking
  • backend/src/services/telegram/rateLimit.ts — rate limiting
  • backend/src/services/telegram/uiRenderer.ts — keyboard/message rendering

MISSING — must add:

  • Command handling: chatId, command (/start, /help, /balance, /bets), userId, latencyMs
  • AI text processing: chatId, messageText (truncated), responseLatencyMs, toolCalls
  • Account linking: telegramUserId, hannibalUserId, linked/unlinked, success/failure
  • Rate limit hits: chatId, command, blocked (yes/no)
  • Callback queries: chatId, callbackData, action taken
  • Session management: chatId, sessionCreated/expired, duration

AREA 16: ODDS TRANSFORMATION & MARGINS (Currently ~5%) — NOT IN ORIGINAL PLAN

Files:

  • backend/src/services/oddsTransformer.ts — raw OddsPAPI→frontend format, margin adjustment
  • backend/src/services/marginService.ts — platform margin applied to Betfair odds (revenue source)
  • backend/src/services/limitsService.ts — stake limits per fixture/outcome

MISSING — must add:

  • Odds transformation: fixtureId, marketCount, outcomeCount, transformLatencyMs
  • Margin application: fixtureId, originalOdds, marginRate, adjustedOdds, revenueImpact
  • Limits enrichment: fixtureId, platformMin, depthData, computedLimits

AREA 17: AGENT HIERARCHY & USER AGENTS (Currently ~10%) — NOT IN ORIGINAL PLAN

Files:

  • backend/src/services/agentHierarchyService.ts — recursive CTE traversal, parent-child
  • backend/src/services/agentMonitoringService.ts — agent alerting via Redis/OddsPAPI
  • backend/src/services/userAgentService.ts — autonomous agents (value_hunter, steam_tracker, arbitrage)
  • backend/src/services/watchlistService.ts — fixture/team watchlists with AI alerts

MISSING — must add:

  • Hierarchy traversal: rootAgentId, depth, nodeCount, latencyMs
  • Agent alerts: agentId, alertType (odds change, match finished), fixtureId, triggered
  • Autonomous agent execution: agentId, agentType, evaluationCycle, actionsGenerated
  • Watchlist alerts: watchlistId, userId, alertCondition, matched (yes/no)

AREA 18: ROUTES — API ENDPOINTS (Currently ~30%) — NOT IN ORIGINAL PLAN

Files not covered:

  • backend/src/routes/ai.ts — AI features (analysis, watchlists, agents, chat)
  • backend/src/routes/orders.ts — order placement/retrieval, schema validation
  • backend/src/routes/referrals.ts — referral code generation
  • backend/src/routes/settlements.ts — settlement periods, processing, ledger verification
  • backend/src/routes/users.ts — user profile, balance
  • backend/src/routes/health.ts — health checks (/health, /health/ready)

MISSING — must add:

  • Request/response logging for each route group (method, path, statusCode, latencyMs, userId)
  • Note: most of this will be covered by the HTTP request middleware (Step 1c), but route-specific business logic logging is still needed:
    • Order validation failures with specific field/reason
    • Settlement period transitions triggered via API
    • AI analysis requests with fixtureId and cache status
    • Health check detailed status per service

AREA 19: ERROR HANDLING MIDDLEWARE — NOT IN ORIGINAL PLAN

File: backend/src/middleware/errorHandler.ts

Custom error classes: ValidationError, AuthenticationError, AuthorizationError, NotFoundError, ConflictError, ExternalServiceError

MISSING — must add:

  • Error categorization: operational vs programming bug
  • Error class distribution logging (which error types most frequent)
  • Stack trace persistence with request context
  • User impact flag (blocks request vs non-fatal)
  • External service errors: which service, statusCode, retryable
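A sketch of the categorization step, assuming the custom error classes share a common base. The class names `ValidationError` and `ExternalServiceError` come from the list above, but the `AppError` base, the constructor signatures, and the retry rule (5xx or 429 upstream status) are stand-ins invented for this example and are not the definitions in errorHandler.ts.

```typescript
// Stand-in base class; the real errorHandler.ts classes may be shaped differently.
class AppError extends Error {
  constructor(message: string, readonly statusCode: number) {
    super(message);
    this.name = new.target.name;
    // Keeps instanceof working even when compiled to ES5.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

class ValidationError extends AppError {
  constructor(message: string) {
    super(message, 400);
  }
}

class ExternalServiceError extends AppError {
  constructor(message: string, readonly service: string, readonly upstreamStatus: number) {
    super(message, 502);
  }
}

interface ErrorLog {
  errorClass: string;
  category: "operational" | "programming";
  statusCode: number;
  service: string | null;
  retryable: boolean;
  message: string;
}

// Known error classes are operational; anything else is treated as a programming bug.
function categorizeError(err: unknown): ErrorLog {
  if (err instanceof ExternalServiceError) {
    const retryable = err.upstreamStatus >= 500 || err.upstreamStatus === 429;
    return { errorClass: err.name, category: "operational", statusCode: err.statusCode, service: err.service, retryable, message: err.message };
  }
  if (err instanceof AppError) {
    return { errorClass: err.name, category: "operational", statusCode: err.statusCode, service: null, retryable: false, message: err.message };
  }
  const e = err instanceof Error ? err : new Error(String(err));
  return { errorClass: e.name, category: "programming", statusCode: 500, service: null, retryable: false, message: e.message };
}
```

Counting log lines by `errorClass` in Loki would then give the error class distribution without extra instrumentation.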

AREA 20: REDIS OPERATIONS (Currently ~25%)

File: backend/src/services/redis.ts

Currently logged: Connection state changes only.

MISSING — must add:

  • Cache hits/misses (debug level, controlled by LOG_LEVEL)
  • Cache errors with key name context
  • Lock acquisition failures (key, contention duration)
  • Lock release verification (success or stolen lock)
  • Pub/sub channel health
  • Slow Redis operations

AREA 21: DATABASE OPERATIONS (Currently ~50%)

File: backend/src/services/database.ts

Currently logged: Connection success, slow queries in dev (>100ms).

MISSING — must add:

  • Slow query logging in production (>200ms threshold)
  • Structured error logging for Prisma errors with query context
  • Transaction rollback logging
  • Connection pool metrics (if available)
  • Constraint violation logging with context

AREA 22: EXCHANGE MAPPERS & ID RESOLUTION — NOT IN ORIGINAL PLAN

Files:

  • backend/src/exchanges/core/ProviderRegistry.ts — provider instance management
  • backend/src/exchanges/core/FuzzyMatcher.ts — fuzzy string matching for ID resolution
  • backend/src/exchanges/adapters/betfair/BetfairCache.ts — in-memory market data cache
  • Various mapper files (BetfairMapper, BifrostMapper, OddsPapiMapper, PinnacleMapper)

MISSING — must add:

  • Provider registration/deregistration events
  • Fuzzy match: inputString, candidates, bestMatch, score, threshold
  • Betfair cache: size, evictions, hit rate (periodic summary, not per-access)
  • Mapper transform errors: provider, entityType, fieldName, rawValue, error

AREA 23: FRONTEND — CLIENT-SIDE LOGGING (Currently ~5%) ⚠️ CRITICAL — NOT IN ORIGINAL PLAN

Current state: 33 files with scattered console.log/error/warn. All logs are browser-only (lost when tab closes). No global error handler. No structured logging. No backend error sink.

23a. Error Handling (Currently: ErrorBoundary only)

Files:

  • frontend/src/components/ui/ErrorBoundary.tsx — catches React render errors, dev-only console.error
  • frontend/src/app/layout.tsx — wraps app in ErrorBoundary

What slips through ErrorBoundary (NOT caught):

  • Async errors (promises, setTimeout)
  • Event handler errors (onClick, onChange)
  • WebSocket event errors
  • React Query failures
  • Axios interceptor errors

MISSING — must add:

  • Global window.addEventListener('unhandledrejection') handler
  • Global window.addEventListener('error') handler
  • All unhandled errors should be captured and sent to backend
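The two global handlers can be installed by one function. The sketch below takes the event target as a parameter so it can run outside a browser; in the app the target would be `window`. The function name `installGlobalErrorHandlers` and the report shape are assumptions. The event properties read here (`message`, `filename`, `lineno`, `error` on error events, `reason` on unhandledrejection events) are the standard browser ones.

```typescript
// Minimal structural type so the sketch does not depend on DOM typings.
interface ErrorEventTarget {
  addEventListener(type: string, listener: (event: any) => void): void;
}

interface ClientErrorReport {
  kind: "error" | "unhandledrejection";
  message: string;
  stack?: string;
  source?: string;
  line?: number;
}

// Hypothetical installer: forwards both kinds of uncaught failure to one report callback.
function installGlobalErrorHandlers(target: ErrorEventTarget, report: (r: ClientErrorReport) => void): void {
  target.addEventListener("error", (event) => {
    report({
      kind: "error",
      message: String(event.message ?? "unknown error"),
      stack: event.error?.stack,
      source: event.filename,
      line: event.lineno,
    });
  });
  target.addEventListener("unhandledrejection", (event) => {
    const reason = event.reason;
    report({
      kind: "unhandledrejection",
      message: reason instanceof Error ? reason.message : String(reason),
      stack: reason instanceof Error ? reason.stack : undefined,
    });
  });
}
```

In the browser this would be called once at startup, with `report` handing each entry to the frontend logger so it reaches the backend sink.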

23b. API Error Logging (Currently: toast only)

Files:

  • frontend/src/lib/api.ts — Axios interceptor handles 401 refresh, warns on missing token
  • 47+ files with useMutation/useQuery — most use onError: (err) => toast.error(msg), nothing else

MISSING — must add:

  • Axios response interceptor: log every error response (url, method, status, latencyMs, errorCode)
  • Axios request interceptor: capture request start time for latency calculation
  • React Query global onError handler at QueryClient level
  • Failed mutation details: mutationKey, variables (sanitized), error

23c. WebSocket Logging (Currently: console.log)

File: frontend/src/hooks/useWebSocket.ts — 30 lines of console.log

Currently logged (to browser console only):

  • Connection creation, connect ID, disconnect reason, connect_error
  • Balance/settlement/notification/order status updates received
  • Cricket ball events

MISSING — must add:

  • Reconnection attempts (count, delay, transport fallback)
  • Connection duration on disconnect
  • Message delivery gaps (missed balls, out-of-order events)
  • Subscription failures

23d. Performance Monitoring (Currently: none)

Only tracked: Auth latency and AI response latency (sent to Mixpanel, not logged).

MISSING — must add:

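  • Web Vitals: the Core Web Vitals (LCP, INP, CLS) plus TTFB; FID only as a legacy metric, since INP replaced it as a Core Web Vital in March 2024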
  • Page navigation timing
  • API call latency distribution
  • Long task detection (>50ms)

23e. Authentication Flow (Currently: console.error)

Files:

  • frontend/src/hooks/useAuth.ts
  • frontend/src/components/providers/Web3AuthProvider.tsx (18 lines)
  • frontend/src/components/auth/LoginModal.tsx

MISSING — must add:

  • Login attempt: provider (metamask/phantom/email), success/failure, latencyMs
  • Web3Auth init: success/failure, latencyMs, retryCount
  • Token refresh: success/failure, reason
  • Session expiry: duration, trigger

23f. AI & Cricket UI (Currently: console.error on failure)

Files: 4 AI component files, 6 cricket component files

MISSING — must add:

  • AI analysis request: fixtureId, type, latencyMs, cacheHit, error
  • Cricket subscription: matchId, subscribed/unsubscribed, ballsReceived, gaps
  • Deep analysis: fixtureId, latencyMs, success/error

Frontend Logging Architecture

Frontend cannot write to Loki directly (would expose Loki to the internet). Instead:

Browser
│ 1. frontendLogger captures structured events
│ 2. Batches in memory (max 50 events or 10s flush)
│ 3. POST /api/logs/client (fire-and-forget, no retry on 4xx)
▼
Backend: POST /api/logs/client
│ 1. Validates payload (max 100 events per request, size limit)
│ 2. Rate-limited (100 req/min per IP)
│ 3. Logs each event via Winston with source: "frontend"
▼
Winston (stdout JSON) → Promtail → Loki → S3

Every frontend log line becomes:

{
  "ts": "2026-02-26T10:30:00.123Z",
  "level": "error",
  "msg": "API call failed",
  "service": "hannibal-frontend",
  "source": "client",
  "sessionId": "sess_abc123",
  "userId": "u_456",
  "url": "/api/orders",
  "method": "POST",
  "statusCode": 500,
  "latencyMs": 1234,
  "errorCode": "INTERNAL_SERVER_ERROR",
  "userAgent": "Mozilla/5.0...",
  "page": "/sports/cricket"
}
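The browser-side batching described in the pipeline (flush at 50 events or after 10 seconds) could look roughly like the sketch below. The class name `BatchingLogger` and the injected `send` callback are assumptions; in the real frontendLogger.ts the callback would POST the batch to /api/logs/client without waiting for the response.

```typescript
type LogEvent = Record<string, unknown>;

// Hypothetical batching logger: flushes when the batch is full or the timer fires.
class BatchingLogger {
  private buffer: LogEvent[] = [];
  private timer: ReturnType<typeof setTimeout> | undefined;

  constructor(
    private readonly send: (events: LogEvent[]) => void,
    private readonly maxBatch = 50,
    private readonly flushIntervalMs = 10000,
  ) {}

  log(event: LogEvent): void {
    this.buffer.push({ ts: new Date().toISOString(), ...event });
    if (this.buffer.length >= this.maxBatch) {
      this.flush();
    } else if (this.timer === undefined) {
      // Start the flush timer when the first event of a new batch arrives.
      this.timer = setTimeout(() => this.flush(), this.flushIntervalMs);
    }
  }

  flush(): void {
    if (this.timer !== undefined) {
      clearTimeout(this.timer);
      this.timer = undefined;
    }
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    this.send(batch); // fire-and-forget: the caller never waits on delivery
  }
}
```

Calling `flush()` from a page-hide handler would reduce the number of events lost when the tab closes, which is the failure mode the current console-only logging suffers from.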

New frontend files:

  • frontend/src/lib/frontendLogger.ts — structured logger with batching + backend sink
  • frontend/src/lib/globalErrorHandler.ts — window error + unhandledrejection handlers
  • frontend/src/lib/performanceMonitor.ts — Web Vitals + API latency tracking
  • frontend/src/hooks/useErrorReporting.ts — React hook for component error context

New backend files:

  • backend/src/routes/logs.ts — POST /api/logs/client endpoint (validated, rate-limited)

Modified frontend files:

  • frontend/src/lib/api.ts — Axios interceptors for request/response logging
  • frontend/src/hooks/useWebSocket.ts — structured event logging
  • frontend/src/lib/analytics/index.ts — wire frontendLogger alongside Mixpanel
  • frontend/src/components/ui/ErrorBoundary.tsx — send errors to backend
  • frontend/src/app/layout.tsx — init globalErrorHandler + performanceMonitor
  • frontend/src/components/providers/Web3AuthProvider.tsx — auth flow logging
  • frontend/src/hooks/useAuth.ts — login/logout event logging

What stays in Mixpanel (user behavior analytics):

  • 56 existing user events (bet_placed, fixture_viewed, etc.)
  • These are product analytics, not operational logs — keep them separate

What goes to Loki/S3 via backend sink (operational logs):

  • Errors (unhandled, API failures, WebSocket errors)
  • Performance (Web Vitals, API latency, long tasks)
  • Auth events (login/logout attempts, token refresh)
  • WebSocket health (connect/disconnect, reconnection, message gaps)
  • Critical UI events (ErrorBoundary catches, blank screens)

Implementation Steps

Step 1: Foundation — Winston + Correlation IDs

1a. Switch Winston to Structured JSON Output

Modify: backend/src/utils/logger.ts

Current: plain text with colorized console, printf format, circular reference handling, child logger support.

Target: structured JSON with service metadata, dual-format (JSON prod / colorized dev).

Every log line becomes:
{"ts":"2026-02-26T10:30:00.123Z","level":"info","msg":"Order placed","service":"hannibal-backend","env":"production","requestId":"abc-123","userId":"u_456","orderId":"ord_789","latencyMs":45}

Changes:

  • Replace printf format with winston.format.json() for non-dev environments
  • Add defaultMeta: { service: 'hannibal-backend', env: process.env.NODE_ENV }
  • Keep colorized console format for local dev via NODE_ENV === 'development' check
  • Add winston.format.errors({ stack: true }) for full stack traces
  • Preserve existing safe metadata serialization (circular reference handling already built in)
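To show the target line shape without depending on Winston, the sketch below is a plain function that serializes one log record into the JSON line shown above, including safe handling of Error objects and circular references. It is not the Winston configuration itself: `toJsonLine` and `LogRecord` are hypothetical names, and in logger.ts the equivalent logic would sit inside the Winston format chain.

```typescript
interface LogRecord {
  level: string;
  message: string;
  [key: string]: unknown;
}

interface ServiceMeta {
  service: string;
  env: string;
}

// Hypothetical serializer producing the {"ts","level","msg","service","env",...} line shape.
function toJsonLine(record: LogRecord, meta: ServiceMeta, now: Date = new Date()): string {
  const { level, message, ...rest } = record;
  const seen = new WeakSet<object>();
  const line = { ts: now.toISOString(), level, msg: message, ...meta, ...rest };
  return JSON.stringify(line, (_key, value) => {
    // Errors have no enumerable fields, so copy the useful ones explicitly.
    if (value instanceof Error) {
      return { name: value.name, message: value.message, stack: value.stack };
    }
    if (typeof value === "object" && value !== null) {
      // Any object seen before (circular or merely repeated) is replaced by a marker.
      if (seen.has(value)) return "[Circular]";
      seen.add(value);
    }
    return value;
  });
}

const demoMeta: ServiceMeta = { service: "hannibal-backend", env: "production" };
console.log(toJsonLine({ level: "info", message: "Order placed", orderId: "ord_789", latencyMs: 45 }, demoMeta));
```

Emitting one JSON object per line is what lets Promtail parse fields without custom regex stages.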

1b. Add AsyncLocalStorage for Correlation IDs

New file: backend/src/utils/requestContext.ts

Uses Node.js built-in AsyncLocalStorage to automatically inject requestId, userId, traceId into every log line without passing logger around.

Note: Partial infrastructure already exists:

  • errorHandler.ts already reads x-request-id header
  • CORS already allows X-Request-ID header
  • logger.child({ requestId }) capability exists but isn't used consistently

Modify: backend/src/utils/logger.ts

  • Add format that reads from AsyncLocalStorage and injects context fields into every log line

New middleware: backend/src/middleware/requestContext.ts

  • Wraps every request in AsyncLocalStorage.run() with { requestId, userId, traceId }
  • Generates UUID if no x-request-id header present
  • Sets x-request-id response header for tracing
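  • Adds userId to the stored context once the auth middleware has authenticated the request (the context is created before auth runs, so userId is filled in afterwards rather than at creation)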

Modify: backend/src/index.ts

  • Add requestContextMiddleware before all route handlers (after CORS, before auth)

Impact: All 106 files that import logger automatically get correlation IDs in every log line. Zero changes needed in those files.
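A minimal sketch of the request context built on Node's AsyncLocalStorage. The function names (`runWithRequestContext`, `setUserId`, `currentContextFields`) are hypothetical, and using the requestId as the traceId is a placeholder assumption because the plan does not say where traceId comes from.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";
import { randomUUID } from "node:crypto";

interface RequestContext {
  requestId: string;
  traceId: string;
  userId?: string;
}

const storage = new AsyncLocalStorage<RequestContext>();

// What the middleware would do per request: reuse x-request-id or generate a UUID.
function runWithRequestContext<T>(incomingRequestId: string | undefined, fn: () => T): T {
  const requestId = incomingRequestId ?? randomUUID();
  // Assumption: traceId defaults to requestId when no separate trace header exists.
  return storage.run({ requestId, traceId: requestId }, fn);
}

// Called after authentication; mutates the store so later log lines include userId.
function setUserId(userId: string): void {
  const ctx = storage.getStore();
  if (ctx) ctx.userId = userId;
}

// What the logger format would merge into every log line.
function currentContextFields(): Partial<RequestContext> {
  return storage.getStore() ?? {};
}

const fieldsInside = runWithRequestContext("abc-123", () => {
  setUserId("u_456");
  return currentContextFields();
});
console.log(fieldsInside, currentContextFields());
```

Because the store is a mutable object, the context can be opened before auth and still carry userId for everything logged after auth, and code outside any request simply gets an empty set of fields.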

1c. HTTP Request/Response Middleware

New middleware addition in backend/src/index.ts:

  • Log every HTTP request: method, path, statusCode, latencyMs, userId, requestSize, responseSize
  • Replaces/enhances existing Morgan "combined" format with structured JSON
  • Excludes health check endpoints from verbose logging (or logs at debug level)

1d. Provider API Call Logging Wrapper

New file: backend/src/utils/httpLogger.ts

  • Generic wrapper for external HTTP calls
  • Captures: provider, method, url, latencyMs, statusCode, success, error, requestSize, responseSize
  • Usage: await withHttpLogging('betfair', 'placeOrder', () => axios.post(...))
  • Can be applied as axios interceptor or manual wrapper
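A sketch of the manual-wrapper form, matching the usage line above for the first three arguments. The injectable `log` and `now` parameters are additions for testability, and the entry covers only part of the captured field list (method, url, and payload sizes are left out for brevity). Reading `status` from the result and `response.status` from a thrown error follows the shape of Axios responses and errors; other HTTP clients would need their own extraction.

```typescript
interface HttpCallLog {
  provider: string;
  operation: string;
  latencyMs: number;
  success: boolean;
  statusCode?: number;
  error?: string;
}

type LogSink = (entry: HttpCallLog) => void;

const consoleSink: LogSink = (entry) => console.log(JSON.stringify(entry));

// Hypothetical wrapper: times the call, logs one entry, and never swallows the error.
async function withHttpLogging<T>(
  provider: string,
  operation: string,
  call: () => Promise<T>,
  log: LogSink = consoleSink,
  now: () => number = Date.now,
): Promise<T> {
  const start = now();
  try {
    const result = await call();
    const status = (result as { status?: unknown } | null)?.status;
    log({
      provider,
      operation,
      latencyMs: now() - start,
      success: true,
      statusCode: typeof status === "number" ? status : undefined,
    });
    return result;
  } catch (err) {
    const status = (err as { response?: { status?: unknown } } | null)?.response?.status;
    log({
      provider,
      operation,
      latencyMs: now() - start,
      success: false,
      statusCode: typeof status === "number" ? status : undefined,
      error: err instanceof Error ? err.message : String(err),
    });
    throw err; // callers keep their existing error handling
  }
}
```

Rethrowing after logging keeps the wrapper transparent, so existing retry and failover code in the provider clients would behave as before.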

Step 2: P0 — Critical Business Logic Gaps (~25 files)

2a. Provider Clients

Modify (wrap all API methods with httpLogger):

  • backend/src/exchanges/adapters/betfair/BetfairClient.ts — listMarketCatalogue, placeOrders, cancelOrders, listCurrentOrders, listClearedOrders, getAccountFunds, login, session refresh
  • backend/src/exchanges/adapters/pinnacle/PinnacleClient.ts — all methods + token bucket state
  • backend/src/exchanges/adapters/bifrost/BifrostClient.ts — all methods
  • backend/src/exchanges/adapters/roanuz/RoanuzDataAdapter.ts — all methods
  • backend/src/services/betsApi.ts — BetsAPI calls for odds fetching

2b. Exchange Coordination & Failover

Modify:

  • backend/src/exchanges/core/coordinator/ExchangeCoordinator.ts — routing decisions, provider selection
  • backend/src/exchanges/core/coordinator/FailoverManager.ts — circuit breaker state transitions, health checks
  • backend/src/exchanges/core/coordinator/RoutingStrategy.ts — strategy evaluation, provider ranking
  • backend/src/exchanges/core/coordinator/IDRegistry.ts — ID lookups, cache hits/misses

2c. Bifrost Pipeline

Modify:

  • backend/src/exchanges/adapters/bifrost/BifrostQueueManager.ts — message consumption, protobuf decode, processing latency
  • backend/src/exchanges/adapters/bifrost/BifrostRecoveryManager.ts — recovery cycles, items recovered
  • backend/src/exchanges/adapters/bifrost/BifrostBetConsumer.ts — bet status updates, settlement from queue

2d. B-Book System

Modify:

  • backend/src/services/bbook/bbookConfigService.ts — config changes
  • backend/src/services/bbook/bbookFillService.ts — fill decisions, position creation, capacity checks
  • backend/src/services/bbook/bbookSettlementService.ts — position settlement, P&L
  • backend/src/services/bbook/bbookStateService.ts — exposure levels, Redis/PG sync
  • backend/src/services/bbook/filterEngine.ts — routing decisions, limit checks

2e. Order + Settlement

Modify:

  • backend/src/services/orderService.ts — validation, balance, exposure, slippage details
  • backend/src/services/settlement.ts — outcome determination, commission, void handling
  • backend/src/jobs/orderSyncJob.ts — cycle timing, batch size, status changes
  • backend/src/jobs/agentSettlementJob.ts — per-agent calculations
  • backend/src/jobs/pinnacleSettlementJob.ts — settlement data received/processed
  • backend/src/jobs/reportEmailJob.ts — report generation, SMTP

2f. Financial Transactions & Accounting

Modify:

  • backend/src/accounting/ledgerService.ts — every ledger entry with full context
  • backend/src/accounting/commissionService.ts — rate lookup, calculation steps
  • backend/src/accounting/agentSettlementService.ts — per-agent calculations
  • backend/src/accounting/availableService.ts — AC computation steps
  • backend/src/accounting/takeService.ts — take balance event-driven updates
  • backend/src/accounting/marginRevenueService.ts — margin profit per order
  • backend/src/accounting/carryoverService.ts — deferred settlement requests
  • backend/src/accounting/transferSettleService.ts — upline/downline reconciliation
  • backend/src/accounting/settlementPeriodService.ts — period lifecycle transitions
  • backend/src/accounting/reportService.ts — report generation
  • backend/src/accounting/exportService.ts — Excel/CSV export

Step 3: P1 — High-Priority Operational Gaps (~15 files)

3a. Monitoring Agents (all files)

Modify:

  • backend/src/services/monitoring/baseMonitoringService.ts — evaluation cycles, alert decisions
  • backend/src/services/monitoring/playerMonitoringService.ts — bet event processing, condition results
  • backend/src/services/monitoring/cricketMonitoringService.ts — cricket-specific conditions
  • backend/src/services/monitoring/soccerMonitoringService.ts — soccer-specific conditions
  • backend/src/services/monitoring/conditionCompiler.ts — compilation results/errors
  • backend/src/services/monitoring/conditionEvaluator.ts — evaluation trace per condition
  • backend/src/services/monitoring/metricResolver.ts — metric resolution source/latency
  • backend/src/services/monitoring/monitoringCrudService.ts — CRUD operations
  • backend/src/services/monitoring/presets.ts — preset loading

3b. Authentication & Permissions

Modify:

  • backend/src/routes/auth.ts — login attempts, registration, referral validation
  • backend/src/middleware/auth.ts — token verification, cache vs DB path, user status
  • backend/src/middleware/agentPermissions.ts — permission flag checks (T/B/D/W)

3c. Redis Operations

Modify:

  • backend/src/services/redis.ts — cache hits/misses (debug), lock contention, slow ops

3d. Database Operations

Modify:

  • backend/src/services/database.ts — slow queries in prod (>200ms), transaction errors, constraint violations

3e. Error Handler Enhancement

Modify:

  • backend/src/middleware/errorHandler.ts — error categorization, class distribution, external service context

Step 4: P2 — Comprehensive Coverage (~30 files)

4a. AI Interactions

Modify:

  • backend/src/ai/core/agentService.ts — API calls, tool executions, tokens, latency
  • backend/src/services/fixtureAnalysisService.ts — analysis generation, caching

4b. WebSocket Events

Modify:

  • backend/src/services/clientWs.ts — connection duration, broadcast counts, subscription failures
  • backend/src/services/oddsPapiWs.ts — message rates, extraction failures, ping/pong
  • backend/src/exchanges/adapters/roanuz/RoanuzWebSocketManager.ts — cricket stream health

4c. Cricket Services

Modify:

  • backend/src/services/cricket/ballByBallService.ts — ball events, wickets, milestones
  • backend/src/services/cricket/cricketAnalysisService.ts — AI analysis, data sources
  • backend/src/services/cricket/cricketIdResolver.ts — ID resolution, fuzzy matching
  • backend/src/services/cricket/milestoneAlertService.ts — milestone detection
  • backend/src/services/cricket/sessionBettingService.ts — phase analysis
  • backend/src/services/cricket/tossAnalysisService.ts — toss/pitch analysis

4d. Notification & Push

Modify:

  • backend/src/services/notificationService.ts — creation, pub/sub, delivery
  • backend/src/services/pushService.ts — VAPID auth, delivery success/failure

4e. Telegram Bot

Modify:

  • backend/src/services/telegram/bot.ts — commands, AI processing, sessions
  • backend/src/services/telegram/linkService.ts — account linking
  • backend/src/services/telegram/rateLimit.ts — rate limit events

4f. Odds & Margins

Modify:

  • backend/src/services/oddsTransformer.ts — transformation latency
  • backend/src/services/marginService.ts — margin application per fixture
  • backend/src/services/limitsService.ts — stake limits computation

4g. Agent Hierarchy & User Agents

Modify:

  • backend/src/services/agentHierarchyService.ts — tree traversal
  • backend/src/services/agentMonitoringService.ts — agent alerts
  • backend/src/services/userAgentService.ts — autonomous agent execution
  • backend/src/services/watchlistService.ts — watchlist alerts

4h. Rate Limiting

Modify:

  • backend/src/middleware/rateLimiter.ts — limit hits, repeat offenders

4i. Exchange Internals

Modify:

  • backend/src/exchanges/core/ProviderRegistry.ts — registration events
  • backend/src/exchanges/core/FuzzyMatcher.ts — match scores
  • backend/src/exchanges/adapters/betfair/BetfairCache.ts — cache size, hit rate (periodic)

4j. Routes (business logic in API handlers)

Modify:

  • backend/src/routes/orders.ts — validation failures
  • backend/src/routes/settlements.ts — period transitions
  • backend/src/routes/ai.ts — analysis requests
  • backend/src/routes/referrals.ts — code generation
  • backend/src/routes/users.ts — profile operations

Step 5: Frontend Logging Pipeline (~15 files)

5a. Frontend Logger + Backend Sink

New file: frontend/src/lib/frontendLogger.ts

  • Structured JSON logger mirroring backend format
  • In-memory buffer: max 50 events or 10-second flush interval
  • Sends batched events via POST /api/logs/client
  • Fire-and-forget — never blocks UI, drops on failure (no retry on 4xx)
  • Includes session context: sessionId, userId, page, userAgent
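
The buffering rule (flush at 50 events or every 10 seconds, drop on failure) could be sketched like this. LogBuffer, ClientLogEvent and SendBatch are hypothetical names. In the browser the send function would be a fetch POST to /api/logs/client; it is injected here so the buffer can be exercised without a network.

```typescript
interface ClientLogEvent {
  level: "info" | "warn" | "error";
  message: string;
  timestamp: string;
  [key: string]: unknown;
}
type SendBatch = (events: ClientLogEvent[]) => Promise<void>;

class LogBuffer {
  private events: ClientLogEvent[] = [];
  private timer: ReturnType<typeof setInterval> | undefined = undefined;
  private readonly send: SendBatch;
  private readonly maxEvents: number;
  private readonly flushMs: number;

  constructor(send: SendBatch, maxEvents = 50, flushMs = 10_000) {
    this.send = send;
    this.maxEvents = maxEvents;
    this.flushMs = flushMs;
  }

  // Begin the periodic flush; call stop() on teardown.
  start(): void {
    if (this.timer === undefined) this.timer = setInterval(() => this.flush(), this.flushMs);
  }

  stop(): void {
    if (this.timer !== undefined) clearInterval(this.timer);
    this.timer = undefined;
  }

  add(event: ClientLogEvent): void {
    this.events.push(event);
    if (this.events.length >= this.maxEvents) this.flush();
  }

  flush(): void {
    if (this.events.length === 0) return;
    const batch = this.events;
    this.events = [];
    try {
      // Fire-and-forget: a failed send drops the batch instead of blocking the UI.
      void this.send(batch).catch(() => undefined);
    } catch {
      // A synchronous failure in send is dropped as well.
    }
  }
}
```

Swapping the array out before sending means events logged while a request is in flight land in the next batch rather than being sent twice.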

New file: backend/src/routes/logs.ts

  • POST /api/logs/client — accepts batched frontend log events
  • Validates: max 100 events per request, max 10KB per event, required fields
  • Rate-limited: 100 requests/min per IP (prevents abuse)
  • Logs each event via Winston with service: "hannibal-frontend", source: "client"
  • Flows through same pipeline: Winston → stdout → Promtail → Loki → S3
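
A sketch of the validation step for POST /api/logs/client under the limits listed above. The plan does not say which fields are required or whether the body is a bare array, so the REQUIRED_FIELDS list and the array-shaped body are assumptions.

```typescript
const MAX_EVENTS = 100;
const MAX_EVENT_BYTES = 10 * 1024;
// Assumed required fields; the plan only says "required fields".
const REQUIRED_FIELDS = ["level", "message", "timestamp"];

type ValidationResult =
  | { ok: true; events: Record<string, unknown>[] }
  | { ok: false; reason: string };

function validateClientLogBatch(body: unknown): ValidationResult {
  if (!Array.isArray(body)) return { ok: false, reason: "body must be an array of events" };
  if (body.length > MAX_EVENTS) return { ok: false, reason: `more than ${MAX_EVENTS} events` };

  const events: Record<string, unknown>[] = [];
  for (const item of body as unknown[]) {
    if (typeof item !== "object" || item === null || Array.isArray(item)) {
      return { ok: false, reason: "each event must be an object" };
    }
    const event = item as Record<string, unknown>;
    const missing = REQUIRED_FIELDS.find((field) => typeof event[field] !== "string");
    if (missing !== undefined) return { ok: false, reason: `missing field: ${missing}` };
    // Measure the serialized size in bytes, not in characters.
    const bytes = new TextEncoder().encode(JSON.stringify(event)).length;
    if (bytes > MAX_EVENT_BYTES) return { ok: false, reason: "event larger than 10KB" };
    events.push(event);
  }
  return { ok: true, events };
}
```

Rejecting the whole batch on the first bad event keeps the handler simple; accepting the valid events and counting the rejected ones would be an equally reasonable choice.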

Modify: backend/src/index.ts

  • Mount /api/logs route (before auth middleware — client errors may come from unauthenticated users)

5b. Global Error Handlers

New file: frontend/src/lib/globalErrorHandler.ts

  • window.addEventListener('error', ...) — catches uncaught JS errors
  • window.addEventListener('unhandledrejection', ...) — catches unhandled promise rejections
  • Captures: errorMessage, stack, componentStack, page, userAgent
  • Sends to frontendLogger (which batches and sends to backend)
  • Deduplicates: same error message within 5 seconds = 1 log entry with count
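
The deduplication rule can be sketched as a small window counter. This version holds each message until its 5-second window closes and then emits one entry carrying the count, which is one reading of "1 log entry with count"; logging the first occurrence immediately and the count later would be another. ErrorDeduplicator is a hypothetical name and the clock is injectable for testing.

```typescript
interface DedupedError {
  message: string;
  count: number;
}

class ErrorDeduplicator {
  private windows = new Map<string, { firstSeen: number; count: number }>();
  private readonly windowMs: number;
  private readonly now: () => number;

  constructor(windowMs = 5_000, now: () => number = Date.now) {
    this.windowMs = windowMs;
    this.now = now;
  }

  // Count an occurrence; the window starts at the first occurrence of a message.
  record(message: string): void {
    const w = this.windows.get(message);
    if (w) {
      w.count += 1;
    } else {
      this.windows.set(message, { firstSeen: this.now(), count: 1 });
    }
  }

  // Return one entry per message whose window has elapsed, then forget it.
  drainExpired(): DedupedError[] {
    const t = this.now();
    const out: DedupedError[] = [];
    this.windows.forEach((w, message) => {
      if (t - w.firstSeen >= this.windowMs) {
        out.push({ message, count: w.count });
        this.windows.delete(message);
      }
    });
    return out;
  }
}
```

A caller would invoke record() from the error and unhandledrejection listeners and pass the result of drainExpired() to frontendLogger on a timer.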

Modify: frontend/src/app/layout.tsx

  • Import and initialize globalErrorHandler on mount
  • Initialize performanceMonitor on mount

Modify: frontend/src/components/ui/ErrorBoundary.tsx

  • Send caught errors to frontendLogger (not just console.error in dev)
  • Include componentStack for debugging

5c. API Call Logging

Modify: frontend/src/lib/api.ts

  • Request interceptor: record start timestamp
  • Response interceptor: log every failed request (status >= 400): url, method, statusCode, latencyMs, errorCode
  • Log token refresh attempts: success/failure, latencyMs
  • Log auth:logout events triggered by failed refresh

5d. React Query Global Error Handler

Modify: frontend/src/lib/providers.tsx (or wherever QueryClient is created)

  • Add a global onError callback on the QueryCache passed to the QueryClient (per-query onError callbacks were removed in TanStack Query v5)
  • Logs all failed queries: queryKey, error message, status code
  • This catches the 47+ files that have no onError handler

5e. WebSocket Logging

Modify: frontend/src/hooks/useWebSocket.ts

  • Replace console.log with frontendLogger.info/error
  • Add: reconnection attempt count, transport type, connection duration
  • Add: message gap detection (cricket ball sequence)
  • Keep existing connection/disconnect logging but structured
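
Message gap detection could look roughly like the sketch below. It assumes each message carries a monotonically increasing integer sequence number, which the plan does not confirm; a cricket ball identifier such as over.ball would first need mapping onto such a sequence. SequenceGapDetector is a hypothetical name.

```typescript
interface GapResult {
  gap: boolean;
  missing: number; // number of sequence values skipped
}

class SequenceGapDetector {
  private last: number | undefined = undefined;

  observe(seq: number): GapResult {
    const prev = this.last;
    // Only move forward, so late or duplicate messages do not reset the cursor.
    if (prev === undefined || seq > prev) this.last = seq;
    // First message, duplicate, late arrival, or the expected next value.
    if (prev === undefined || seq <= prev + 1) return { gap: false, missing: 0 };
    return { gap: true, missing: seq - prev - 1 };
  }
}
```

The hook would log a warning through frontendLogger whenever observe() reports a gap, including the missing count.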

5f. Performance Monitoring

New file: frontend/src/lib/performanceMonitor.ts

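  • Core Web Vitals via web-vitals library (LCP, CLS, INP, TTFB; FID is deprecated in favor of INP and has been dropped from recent web-vitals releases)
  • Reports once per page load via frontendLogger
  • Optional: long task detection via a PerformanceObserver observing 'longtask' entries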

5g. Auth Flow Logging

Modify: frontend/src/hooks/useAuth.ts

  • Log login attempt: provider, success/failure, latencyMs
  • Log logout: trigger (manual/session expired/refresh failed)

Modify: frontend/src/components/providers/Web3AuthProvider.tsx

  • Log Web3Auth init: success/failure, latencyMs, retryCount
  • Log wallet connection: chain, address (first 6 chars), success/failure

Step 6: AI Chat Service (Python)

Modify: services/ai-chat-assistant/app/

  • Configure Python logging to output structured JSON to stdout (matching backend format)
  • Add correlation ID propagation from x-request-id request header
  • Log every LLM provider call: provider (Grok/Groq/Ollama/OpenAI/Anthropic), model, tokens, latency, success/error
  • Log provider fallback chain: which provider tried, failed, next selected
  • Log conversation history operations (Redis reads/writes)
  • Log streaming SSE events: chunkCount, totalLatencyMs

Promtail already collects Docker container stdout — logs flow to Loki/S3 automatically.


Step 7: Loki Upgrade + S3 Backend

7a. Upgrade Loki 2.9.2 → 3.x

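CRITICAL PREREQUISITE: complete this upgrade before Step 9. Structured metadata requires schema v13 on the TSDB index and is only experimental in Loki 2.9; Loki 3.x enables it by default.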

Modify: infra/docker/docker-compose.monitoring.yml

  • Change grafana/loki:2.9.2 → grafana/loki:3.3.2 (or latest 3.x stable)
  • Add AWS credentials env vars:
    environment:
    - AWS_ACCESS_KEY_ID=${LOKI_AWS_ACCESS_KEY_ID}
    - AWS_SECRET_ACCESS_KEY=${LOKI_AWS_SECRET_ACCESS_KEY}
    - AWS_REGION=${AWS_REGION:-ap-south-1}

Migration notes:

  • Schema v11 (boltdb-shipper) data remains readable by Loki 3.x
  • New data writes use schema v13 (TSDB) with S3
  • Old data on filesystem will not be migrated (acceptable — current data is ephemeral on container restart anyway)
  • Test on staging before production rollout

7b. Reconfigure Loki for S3

Modify: infra/monitoring/loki/loki-config.yml

Add new schema config entry (keep old for backward compat):

schema_config:
  configs:
    # Legacy: reads old data if any
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h
    # New: all new writes go to S3
    - from: 2026-03-01
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: loki_index_
        period: 24h

# ${VAR} references are only expanded when Loki runs with -config.expand-env=true
storage_config:
  aws:
    # Alternative single-URL form (use one or the other):
    #   s3: s3://${AWS_REGION}/${S3_BUCKET_NAME}
    bucketnames: hannibal-logs
    region: ${AWS_REGION}
  tsdb_shipper:
    active_index_directory: /loki/tsdb-index
    cache_location: /loki/tsdb-cache
    # shared_store is omitted: Loki 3.x removed it and uses object_store from the period config
  # Keep filesystem for legacy reads
  boltdb_shipper:
    active_index_directory: /loki/boltdb-shipper-active
  filesystem:
    directory: /loki/chunks

compactor:
  working_directory: /loki/compactor
  # Loki 3.x replaces compactor shared_store; required when retention is enabled
  delete_request_store: s3
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h

Step 8: S3 Bucket + Lifecycle Policies

New file: infra/scripts/setup-s3-logging.sh

Creates S3 bucket with:

  • Versioning disabled (logs are immutable)
  • Server-side encryption (SSE-S3)
  • Lifecycle policy:
    • Standard: 0-30 days ($0.023/GB-month)
    • Standard-IA: 30-90 days ($0.0125/GB-month)
    • Glacier Instant: 90-365 days ($0.004/GB-month)
    • Deep Archive: 365+ days ($0.00099/GB-month)
    • Expiration: 7 years (compliance)
  • Bucket policy restricting access to Loki service role
  • Block public access enabled

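Estimated cost (assuming ~10GB/day of raw logs, about 2GB/day stored after ~80% compression):

  • ~$50-70/year for the Standard through Glacier Instant tiers at steady state; Deep Archive adds under $1/month for each further year of retained logs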
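
The arithmetic behind an estimate like this can be checked with a short calculation. The sketch below treats the listed prices as per GB-month and ignores request, transition, retrieval and minimum-storage-duration charges, all of which are simplifying assumptions. The function names are hypothetical.

```typescript
interface Tier {
  name: string;
  days: number; // how long an object stays in this tier
  usdPerGbMonth: number;
}

// Tiers covering the first 365 days, from the lifecycle policy above.
const TIERS: Tier[] = [
  { name: "Standard", days: 30, usdPerGbMonth: 0.023 },
  { name: "Standard-IA", days: 60, usdPerGbMonth: 0.0125 },
  { name: "Glacier Instant", days: 275, usdPerGbMonth: 0.004 },
];
const DEEP_ARCHIVE_USD_PER_GB_MONTH = 0.00099;

// At steady state each tier holds (GB per day x days in tier) of data.
function steadyStateMonthlyUsd(storedGbPerDay: number): number {
  return TIERS.reduce((sum, t) => sum + storedGbPerDay * t.days * t.usdPerGbMonth, 0);
}

// Monthly Deep Archive cost once `yearsRetained` years of logs have aged past day 365.
function deepArchiveMonthlyUsd(storedGbPerDay: number, yearsRetained: number): number {
  return storedGbPerDay * 365 * yearsRetained * DEEP_ARCHIVE_USD_PER_GB_MONTH;
}

console.log((steadyStateMonthlyUsd(2) * 12).toFixed(2));
console.log((steadyStateMonthlyUsd(10) * 12).toFixed(2));
```

With 2GB/day stored this comes to about $5.08/month, or roughly $61/year. With 10GB/day stored it is about $25.40/month, or roughly $305/year. The $50-70/year figure therefore corresponds to about 2GB/day landing in S3.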

Step 9: Enhanced Promtail Label Extraction

Modify: infra/monitoring/promtail/promtail-config.yml

Current pipeline already has JSON extraction and log level detection. Enhance:

  • Extract service field as Loki label (enables per-service queries)
  • Extract requestId, userId as structured metadata (Loki 3.x supports this without label cardinality explosion)
  • Keep existing container/compose label extraction
  • Add orderId, provider, fixtureId as structured metadata
  • Enables queries like: {service="hannibal-backend"} | json | provider="betfair" | latencyMs > 1000

Warning: Do NOT add high-cardinality fields (requestId, orderId, userId) as Loki labels — this will destroy performance. Use Loki's structured metadata or | json filter syntax instead.


Step 10: Grafana Dashboards

New dashboards in infra/monitoring/grafana/dashboards/:

  • logging-overview.json — log volume by service/level, error rate trending, top error messages
  • provider-health.json — API latency per provider (from structured logs), error rates, availability
  • order-audit.json — order lifecycle flow (validated→routed→submitted→settled), timing
  • financial-audit.json — balance changes, commissions, settlements, take balance trends
  • bbook-dashboard.json — B-Book fills, exposure levels, routing decisions, settlement P&L
  • exchange-coordination.json — circuit breaker states, failover events, routing distribution
  • frontend-health.json — client errors, Web Vitals, API latency from client perspective, WebSocket health

(Existing system-overview.json dashboard remains unchanged)


Future: Athena for Historical Queries

When log volume justifies:

  • AWS Athena table pointing to S3 bucket for SQL on logs >31 days old
  • Grafana Athena datasource plugin for historical dashboards
  • Nightly JSONL→Parquet ETL for 10x cheaper Athena queries

Complete File Change Summary

New Files — Backend (4)

  • backend/src/utils/requestContext.ts — AsyncLocalStorage singleton
  • backend/src/middleware/requestContext.ts — request context middleware
  • backend/src/utils/httpLogger.ts — provider API call wrapper
  • backend/src/routes/logs.ts — POST /api/logs/client (frontend error sink)

New Files — Frontend (4)

  • frontend/src/lib/frontendLogger.ts — structured logger with batching + backend sink
  • frontend/src/lib/globalErrorHandler.ts — window error + unhandledrejection handlers
  • frontend/src/lib/performanceMonitor.ts — Web Vitals + API latency
  • frontend/src/hooks/useErrorReporting.ts — React hook for component error context

New Files — Infra (1)

  • infra/scripts/setup-s3-logging.sh — S3 bucket setup

Modified Files — Infrastructure (5)

  • backend/src/utils/logger.ts — JSON format, AsyncLocalStorage integration
  • backend/src/index.ts — requestContext middleware, HTTP timing, /api/logs route
  • infra/monitoring/loki/loki-config.yml — S3 backend, schema v13
  • infra/docker/docker-compose.monitoring.yml — Loki 3.x, AWS creds
  • infra/monitoring/promtail/promtail-config.yml — enhanced label extraction

Modified Files — Frontend (~10 files)

  • frontend/src/lib/api.ts — Axios interceptors for request/response logging
  • frontend/src/hooks/useWebSocket.ts — structured event logging via frontendLogger
  • frontend/src/lib/analytics/index.ts — wire frontendLogger alongside Mixpanel
  • frontend/src/components/ui/ErrorBoundary.tsx — send errors to backend
  • frontend/src/app/layout.tsx — init globalErrorHandler + performanceMonitor
  • frontend/src/components/providers/Web3AuthProvider.tsx — auth flow logging
  • frontend/src/hooks/useAuth.ts — login/logout event logging
  • frontend/src/lib/providers.tsx — QueryClient global onError handler

Modified Files — P0 Critical Backend (~35 files)

  • Provider clients: BetfairClient, PinnacleClient, BifrostClient, RoanuzDataAdapter, betsApi
  • Exchange coordination: ExchangeCoordinator, FailoverManager, RoutingStrategy, IDRegistry
  • Bifrost pipeline: BifrostQueueManager, BifrostRecoveryManager, BifrostBetConsumer
  • B-Book: bbookConfigService, bbookFillService, bbookSettlementService, bbookStateService, filterEngine
  • Orders: orderService, settlement
  • Jobs: orderSyncJob, agentSettlementJob, pinnacleSettlementJob, reportEmailJob
  • Accounting: ledgerService, commissionService, agentSettlementService, availableService, takeService, marginRevenueService, carryoverService, transferSettleService, settlementPeriodService, reportService, exportService

Modified Files — P1 High Priority Backend (~15 files)

  • Monitoring: baseMonitoringService, playerMonitoringService, cricketMonitoringService, soccerMonitoringService, conditionCompiler, conditionEvaluator, metricResolver, monitoringCrudService, presets
  • Auth: auth route, auth middleware, agentPermissions
  • Infrastructure: redis, database, errorHandler

Modified Files — P2 Comprehensive Backend (~35 files)

  • AI: agentService, fixtureAnalysisService
  • WebSocket: clientWs, oddsPapiWs, RoanuzWebSocketManager
  • Cricket: ballByBallService, cricketAnalysisService, cricketIdResolver, milestoneAlertService, sessionBettingService, tossAnalysisService
  • Notifications: notificationService, pushService
  • Telegram: bot, linkService, rateLimit
  • Odds: oddsTransformer, marginService, limitsService
  • Agents: agentHierarchyService, agentMonitoringService, userAgentService, watchlistService
  • Exchange: ProviderRegistry, FuzzyMatcher, BetfairCache
  • Routes: orders, settlements, ai, referrals, users
  • Middleware: rateLimiter
  • Python: ai-chat-assistant (5+ files)

New Grafana Dashboards (7)

  • logging-overview.json, provider-health.json, order-audit.json
  • financial-audit.json, bbook-dashboard.json, exchange-coordination.json
  • frontend-health.json

Total: ~95+ files modified, 9 new files, 7 new dashboards


Performance Impact

LATENCY IMPACT: ~0ms (Winston writes to stdout, Promtail reads async)
CPU IMPACT: Minimal (JSON.stringify per log line, ~0.01ms)
MEMORY IMPACT: None (no in-app buffering, Promtail handles collection)
FAILURE MODE: If Loki/S3 is down, Promtail retries with backoff and resumes from its saved file positions; the app is unaffected

Cross-System Impact

- backend: Logger format change, new requestContext middleware, logging additions in ~80 files, /api/logs/client endpoint
- frontend: frontendLogger + global error handler + performance monitor + API/WS logging (~10 modified, 4 new files)
- services/ai-chat: Structured JSON logging, correlation ID propagation
- infra: Loki 3.x upgrade, S3 backend config, S3 bucket + lifecycle, Promtail enhancement, 7 new dashboards

Key Best Practices Applied

  1. Structured JSON logging — machine-parseable, Promtail/Loki/Athena compatible
  2. AsyncLocalStorage correlation IDs — every log line traceable to a request, zero code changes in existing 106 files
  3. External pipeline (Promtail→Loki→S3) — decoupled from app, no backpressure risk
  4. Structured metadata (not labels) for high-cardinality fields — prevents Loki performance degradation
  5. gzip compression — ~80% storage reduction
  6. Lifecycle policies — automatic cost optimization (Standard→IA→Glacier→Deep Archive)
  7. Dev/prod format split — colorized console in dev, JSON in prod
  8. Provider call wrapping — single wrapper captures all external API metrics uniformly
  9. Log level discipline — error (unrecoverable), warn (degraded), info (business events), debug (operational detail)
  10. Dual schema config — old boltdb-shipper for legacy reads, new TSDB+S3 for writes (zero-downtime migration)