Phase 1: Comprehensive Logging to AWS S3
Context
Hannibal has Winston logging in 106+ files, but output goes to console only (plain text, printf format). Promtail + Loki + Grafana already exist in docker-compose, but Loki uses local filesystem storage (lost on container restart) and is on the outdated version 2.9.2. Massive logging gaps exist across critical systems: ~75 backend files have no or minimal logging. This plan fixes both problems: fill all logging gaps AND ship everything to S3 for durable, queryable storage.
Current State vs Target
| Area | Current Coverage | Target | Files Affected |
|---|---|---|---|
| Order placement | ~70% | 100% | 1 file |
| Settlement | ~65% | 100% | 2 files |
| Provider API calls | ~20% | 100% | 4 adapter clients |
| Exchange coordination & failover | ~10% | 100% | 4 files (coordinator, failover, routing, registry) |
| Bifrost queue/recovery pipeline | ~5% | 100% | 3 files (queue, recovery, bet consumer) |
| B-Book system | ~15% | 100% | 5 files (config, fill, settlement, state, filter) |
| Background jobs | ~20% | 100% | 4 job files |
| Financial transactions & accounting | ~15% | 100% | 10 files (ledger, commission, take, carryover, margin, etc.) |
| Monitoring agents & evaluation | ~10% | 100% | 9 files (base, player, cricket, soccer, compiler, evaluator, resolver, crud, presets) |
| AI interactions (backend) | ~5% | 100% | 2 files (agentService, fixtureAnalysisService) |
| AI chat service (Python) | ~10% | 100% | 5+ files |
| Authentication & permissions | ~30% | 100% | 3 files (auth route, auth middleware, agentPermissions) |
| WebSocket events | ~60% | 100% | 3 files (clientWs, oddsPapiWs, RoanuzWebSocketManager) |
| Redis operations | ~25% | 90% | 1 file |
| Cricket services | ~20% | 100% | 6 files (ballByBall, analysis, idResolver, milestones, sessions, toss) |
| Notification & push | ~10% | 100% | 2 files (notificationService, pushService) |
| Telegram bot | ~0% | 100% | 5 files (bot, linkService, rateLimit, uiRenderer, types) |
| Routes (API endpoints) | ~30% | 90% | 6 uncovered routes |
| Odds transformation & margins | ~5% | 100% | 3 files (oddsTransformer, marginService, limitsService) |
| Agent hierarchy & user agents | ~10% | 100% | 3 files (agentHierarchy, agentMonitoring, userAgent) |
| Database operations | ~50% | 90% | 1 file |
| Error handling middleware | ~40% | 90% | 1 file |
Total files requiring logging changes: ~80+
Architecture: Logging Pipeline to S3
Application (Winston JSON → stdout)
│
▼
Promtail (already deployed, reads Docker container logs)
│
▼
Loki 3.x (upgrade from 2.9.2, reconfigure: filesystem → S3 backend)
│
├──► S3 bucket (chunks + indexes)
│ └── Lifecycle: Standard 30d → IA 90d → Glacier 365d → Deep Archive
│
▼
Grafana (real-time queries via Loki, historical via Athena)
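The lifecycle tiering in the diagram could be expressed as an S3 lifecycle policy roughly like the following sketch (rule ID is a placeholder; note that chunks transitioned to Glacier tiers are no longer directly queryable by Loki and would need restore or Athena for historical access):

```json
{
  "Rules": [
    {
      "ID": "loki-log-tiering",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" },
        { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
      ]
    }
  ]
}
```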
Why this approach (not direct Winston→S3):
- Promtail+Loki already deployed — config change only, zero new infra
- Decoupled from app — if S3 is slow, app is unaffected
- Structured JSON stdout → Promtail parses automatically
- Grafana already connected to Loki for real-time querying
- Direct Winston→S3 transport is fragile (blocks event loop on failure, loses logs on crash)
CRITICAL: Loki version upgrade required. Current Loki 2.9.2 does NOT support schema v13 or TSDB store. Must upgrade to Loki 3.0.0+ before enabling S3 backend with TSDB. Loki 2.9.x can use S3 only with boltdb-shipper (legacy), which is deprecated in 3.x. Plan targets Loki 3.x directly.
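The target Loki 3.x storage change can be sketched as the following config fragment (bucket name, region, and schema start date are placeholders to verify against the deployed docker-compose and the Loki 3.x docs):

```yaml
# Sketch only — bucket, region, and the schema "from" date are placeholders.
common:
  storage:
    s3:
      bucketnames: hannibal-loki-logs   # placeholder bucket
      region: us-east-1                 # placeholder region
schema_config:
  configs:
    - from: "2026-03-01"                # day the new schema takes effect
      store: tsdb                       # TSDB index — requires Loki 3.x
      object_store: s3
      schema: v13
      index:
        prefix: index_
        period: 24h
```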
Detailed Logging Gaps Audit
AREA 1: ORDER PLACEMENT FLOW (Currently ~70%)
File: backend/src/services/orderService.ts
Currently logged:
- Point value loading
- B-Book routing decision & rerouting logic
- Agent booking capture
- B-Book fill execution and failures
- PAL readiness check
- Betslip fetch & slippage calculation
- PAL order submission
- Partial refund scenarios
- Batch order completion
MISSING — must add:
- Validation step entry/exit with specific failure reason
- Pre/post balance deduction values (critical for race condition debugging)
- Exposure cache invalidation timing and actual exposure values
- Slippage magnitude (requested odds vs live odds, delta value)
- Tick snapping in batch orders (OLD → NEW values)
- Exchange portion under minimum — threshold check failure reason
- Request UUID lifecycle tracking
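For the slippage item above, the logged fields might look like this sketch (function and field names are illustrative, not from orderService.ts):

```typescript
// Illustrative slippage log fields — names assumed, not the actual orderService.ts API.
export function slippageFields(requestedOdds: number, liveOdds: number) {
  const delta = Number((liveOdds - requestedOdds).toFixed(4));
  return {
    requestedOdds,
    liveOdds,
    slippage: delta,
    // For a back bet, higher live odds pay more and are favorable; for a lay bet this inverts.
    direction: delta > 0 ? "favorable" : delta < 0 ? "adverse" : "none",
  };
}
```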
AREA 2: SETTLEMENT FLOW (Currently ~65%)
Files: backend/src/services/settlement.ts, backend/src/jobs/orderSyncJob.ts
Currently logged:
- Settlement polling start, fixture status verification
- Settlement data retrieval, market-level processing
- Commission processing, B-Book settlement
- Order status outcomes, PnL divergence detection
MISSING — must add:
- Settlement data structure validation (unexpected format detection)
- Fixture status change reasons (why skipping: "LIVE but we only process FINISHED")
- Winning outcome extraction logic (filtering/mapping decisions)
- Market-level outcome determination (what data influenced win/void decision)
- Void order handling reason (refund amount = deducted amount)
- Commission rate lookup result (which config row matched)
- Margin profit edge case detection steps
- Ledger entry constraint error context
AREA 3: PROVIDER API CALLS (Currently ~20%) ⚠️ CRITICAL
Files: backend/src/exchanges/adapters/betfair/BetfairClient.ts, pinnacle/PinnacleClient.ts, bifrost/BifrostClient.ts, roanuz/RoanuzDataAdapter.ts
MISSING — must add for ALL providers:
- Every API request/response with latency timing
- Session refresh timing and success/failure
- Authentication errors with reason
- Rate limit header values and throttle state
- Retry logic (which attempt succeeded, why retries exhausted)
- Request/response payload sizes
- Error code extraction and categorization
- Failover decisions (which provider failed, which chosen as fallback)
AREA 4: EXCHANGE COORDINATION & FAILOVER (Currently ~10%) ⚠️ CRITICAL — NOT IN ORIGINAL PLAN
Files:
- backend/src/exchanges/core/coordinator/ExchangeCoordinator.ts — main routing orchestrator
- backend/src/exchanges/core/coordinator/FailoverManager.ts — circuit breaker (closed/open/half-open)
- backend/src/exchanges/core/coordinator/RoutingStrategy.ts — provider selection logic
- backend/src/exchanges/core/coordinator/IDRegistry.ts — cross-provider ID mapping with Redis cache
MISSING — must add:
- Every routing decision: which provider selected, why, fallback chain considered
- Circuit breaker state transitions: closed→open (with failure count), open→half-open (timeout), half-open→closed (success)
- Provider health check results: latency, error rate, availability score
- ID resolution: canonical→provider mapping lookups, cache hits/misses, fuzzy match scores
- Failover triggers: original provider failure reason, fallback provider selected, degraded mode entered
- Request routing latency: time added by coordination layer
AREA 5: BIFROST QUEUE PIPELINE (Currently ~5%) ⚠️ CRITICAL — NOT IN ORIGINAL PLAN
Files:
- backend/src/exchanges/adapters/bifrost/BifrostQueueManager.ts — RabbitMQ consumer (protobuf)
- backend/src/exchanges/adapters/bifrost/BifrostRecoveryManager.ts — startup + periodic recovery
- backend/src/exchanges/adapters/bifrost/BifrostBetConsumer.ts — live bet status/settlement
MISSING — must add:
- Queue message consumption: queue name, message type, protobuf decode success/failure, processing latency
- Recovery cycles: startup recovery (events + markets + phantoms), periodic 10-min recovery, items recovered count
- Bet status updates: orderId, oldStatus→newStatus, settlement outcome from queue
- Queue connection health: connected/disconnected, prefetch count, consumer tag
- Message acknowledgment: ack/nack/reject decisions with reason
- Protobuf decode errors: which field failed, raw payload size
AREA 6: B-BOOK SYSTEM (Currently ~15%) ⚠️ CRITICAL — NOT IN ORIGINAL PLAN
Files:
- backend/src/services/bbook/bbookConfigService.ts — config, enable/disable, limits
- backend/src/services/bbook/bbookFillService.ts — fill execution, position creation
- backend/src/services/bbook/bbookSettlementService.ts — position settlement, P&L
- backend/src/services/bbook/bbookStateService.ts — Redis + PG position tracking
- backend/src/services/bbook/filterEngine.ts — routing decisions
MISSING — must add:
- Config changes: who changed what setting, old→new values
- Fill decisions: orderId, fillAmount, remainingCapacity, limits checked (global/market/selection/user)
- Position creation: positionId, fixtureId, marketId, side (back/lay), stake, odds
- Settlement: positionId, outcome (win/loss/void), grossPnL, netPnL
- State tracking: exposure at global/market/selection/user levels, Redis vs PG sync status
- Routing decisions: why B-Book (within limits) vs exchange (exceeded limits), specific limit that triggered routing
- Limit breaches: which limit hit (globalPoolLimit, defaultMarketLimit, selectionLimit, userLimit, maxOdds, minStake)
- Balancing alerts: when pool becomes unbalanced (e.g., >80% on one side)
AREA 7: WEBSOCKET EVENTS (Currently ~60%)
Files: backend/src/services/clientWs.ts, backend/src/services/oddsPapiWs.ts
Currently logged:
- Client connections, fixture subscriptions, disconnections
- Redis subscriber ready, message handling
- Broadcast operations, reconnection attempts
MISSING — must add:
- Connection duration on disconnect
- Message count per client session
- Subscription failures
- Broadcast recipient count (how many clients actually received)
- Redis subscription drop detection
- Ping/pong latency tracking
- Message ordering/gap detection
- Update frequency per fixture (sampled)
- Odds extraction failures (when extraction returns zeros)
Additional file NOT in original plan:
backend/src/exchanges/adapters/roanuz/RoanuzWebSocketManager.ts — cricket streaming
- Socket.IO connection state to Roanuz API
- Subscription protocol events (subscribe/unsubscribe/ack)
- gzip decompression failures
- Event normalization errors
- Auto-reconnection attempts and backoff timing
AREA 8: AI INTERACTIONS (Currently ~5%) ⚠️ CRITICAL
Files:
- backend/src/ai/core/agentService.ts — Anthropic API calls, tool execution loop
- backend/src/services/fixtureAnalysisService.ts — AI fixture analysis with caching (NOT IN ORIGINAL PLAN)
MISSING — must add:
- Every Anthropic API call: model, inputTokens, outputTokens, latencyMs, success
- Every tool execution in agentic loop: toolName, latencyMs, success/error
- Conversation context size (message count, estimated tokens)
- Prompt construction summary (not full prompt, but template + parameters)
- Streaming latency for AI insights
- Cache hits (insights from cache vs generated fresh)
- Failed insight fallback strategy
- Fixture analysis: fixtureId, analysisType, cacheHit, generationLatencyMs, provider used
AREA 9: MONITORING AGENTS & EVALUATION (Currently ~10%) ⚠️ CRITICAL
Files in original plan:
- backend/src/services/monitoring/baseMonitoringService.ts
- backend/src/services/monitoring/playerMonitoringService.ts
- backend/src/services/monitoring/cricketMonitoringService.ts
Files NOT in original plan:
- backend/src/services/monitoring/soccerMonitoringService.ts — soccer-specific monitoring
- backend/src/services/monitoring/soccerCapabilities.ts — soccer capability definitions
- backend/src/services/monitoring/soccerResolvers.ts — soccer metric resolvers
- backend/src/services/monitoring/conditionCompiler.ts — condition→evaluable form
- backend/src/services/monitoring/conditionEvaluator.ts — runtime condition evaluation
- backend/src/services/monitoring/metricResolver.ts — metric value resolution from fixture data
- backend/src/services/monitoring/monitoringCrudService.ts — CRUD for monitor configs
- backend/src/services/monitoring/presets.ts — preset conditions by domain
- backend/src/services/monitoring/utils.ts — monitoring utilities
MISSING — must add across ALL monitoring files:
- Each evaluation cycle: monitorId, conditionsCount, conditionsMet, alertFired, evalDurationMs
- Condition compilation: input condition → compiled form, compilation errors
- Condition evaluation: metricName, resolvedValue, operator, threshold, result (pass/fail)
- Metric resolution: which resolver used, data source, resolution latency
- Alert firing decisions (WHY triggered or suppressed)
- Deduplication logic (duplicate detection, alerts skipped)
- Event subscription errors
- Monitor cache coherency (hits/misses/invalidations)
- Periodic checker execution timing
- CRUD operations: monitor created/updated/deleted by userId
- Preset loading: which presets applied, domain
AREA 10: FINANCIAL TRANSACTIONS & ACCOUNTING (Currently ~15%) ⚠️ CRITICAL
Files in original plan:
- backend/src/accounting/ledgerService.ts
- backend/src/accounting/commissionService.ts
- backend/src/accounting/agentSettlementService.ts
Files NOT in original plan:
- backend/src/accounting/availableService.ts — available credit (AC) computation
- backend/src/accounting/carryoverService.ts — deferred settlement (carryover) handling
- backend/src/accounting/marginRevenueService.ts — margin profit tracking per order
- backend/src/accounting/reportService.ts — settlement report generation
- backend/src/accounting/takeService.ts — event-driven take balance updates (T = AC(self) + Σ AC(downline) - CL)
- backend/src/accounting/transferSettleService.ts — settlement reconciliation between upline/downline
- backend/src/accounting/settlementPeriodService.ts — period lifecycle (open→grace→finalized)
- backend/src/accounting/calculations.ts — pure calc functions (round, commission, profitShare, booking)
- backend/src/accounting/exportService.ts — Excel/CSV export for settlement reports
MISSING — must add:
- Every ledger entry: accountType, entryType, amount, balanceAfter, transactionRef
- Commission rate lookup (which config matched, rate applied)
- Commission calculation steps: grossWinnings × rate = commission
- Per-agent settlement: totalStakes, totalPayouts, bookingPercent, netSettlement
- Available credit computation: userId, creditLimit, downlineAllocation, exposure, resultAC
- Carryover requests: downlineId, amount, accepted/rejected, grace period
- Margin revenue per order: orderId, placedOdds, displayOdds, marginProfit, slippageType
- Take balance updates: userId, oldTake, newTake, trigger (which event caused update)
- Transfer settlement: uplineId→downlineId, obligation, adjustedBalance
- Settlement period transitions: periodId, oldState→newState, triggeredBy
- Report generation: reportType (platform/agent/player), periodId, generationLatencyMs
- Export operations: format (Excel/CSV), rowCount, fileSize, downloadedBy
- Decimal rounding decisions: input, output, roundingMode, precision
- Constraint violation context
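The take formula referenced for takeService.ts (T = AC(self) + Σ AC(downline) - CL) is plain arithmetic; a hedged sketch of the computation and the update-log fields the plan asks for (all names here are illustrative, not the actual service API):

```typescript
// Illustrative only — mirrors the plan's take formula: T = AC(self) + Σ AC(downline) - CL.
export function computeTake(acSelf: number, acDownlines: number[], creditLimit: number): number {
  const downlineSum = acDownlines.reduce((sum, ac) => sum + ac, 0);
  return acSelf + downlineSum - creditLimit;
}

// Fields to log on each event-driven update (names assumed).
export function takeUpdateLog(userId: string, oldTake: number, newTake: number, trigger: string) {
  return { msg: "take balance updated", userId, oldTake, newTake, delta: newTake - oldTake, trigger };
}
```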
AREA 11: AUTHENTICATION & PERMISSIONS (Currently ~30%)
Files in original plan:
- backend/src/routes/auth.ts
- backend/src/middleware/auth.ts
File NOT in original plan:
backend/src/middleware/agentPermissions.ts — permission flag checking (T/B/D/W)
MISSING — must add:
- Every login attempt (wallet address, success/failure, failure reason)
- Token verification failure reason (expired, invalid, revoked)
- Session cache hit vs DB lookup path
- User status check result value
- Admin role grant/denial
- Token expiry timing (how long token was valid)
- New user registration events
- Referral code validation results
- Permission checks: userId, requiredFlag (Transfer/Bet/Deposit/Withdraw), granted/denied
AREA 12: BACKGROUND JOBS (Currently ~20%) ⚠️ CRITICAL
Files: backend/src/jobs/orderSyncJob.ts, agentSettlementJob.ts, pinnacleSettlementJob.ts, reportEmailJob.ts
Currently logged: Job startup, job stop, "no orders to sync" (debug).
MISSING — must add:
- Job execution timing (cycle start/end/duration)
- Batch size per cycle
- Each order status change detected (orderId, oldStatus → newStatus)
- Retry logic (which orders retried, why)
- Settlement check results (how many settled, which orders)
- Error accumulation (fail count/types per cycle)
- Agent settlement calculations per agent
- Pinnacle settlement data received/processed
- Report generation contents, recipients, send status, SMTP response
AREA 13: CRICKET SERVICES (Currently ~20%) — EXPANDED
File in original plan:
backend/src/services/cricket/ballByBallService.ts
Files NOT in original plan:
- backend/src/services/cricket/cricketAnalysisService.ts — AI cricket analysis (head-to-head, venue, player form, toss, pitch)
- backend/src/services/cricket/cricketIdResolver.ts — Betfair→Roanuz fuzzy ID matching
- backend/src/services/cricket/milestoneAlertService.ts — batting/bowling milestone detection (50/100/150/200, 3/5-for, partnerships, powerplay, RRR)
- backend/src/services/cricket/sessionBettingService.ts — phase analysis (Test/ODI/T20), projected scores vs historical
- backend/src/services/cricket/tossAnalysisService.ts — toss results, pitch conditions, venue history, AI recommendations
MISSING — must add:
- Ball-by-ball: ball#, runs, outcome, wicket details, team score, run rate
- Analysis generation: matchId, analysisType, dataSource (Roanuz), latencyMs, cacheHit
- ID resolution: betfairEventId, resolvedRoanuzMatchId, matchScore, method (exact/fuzzy), cacheHit
- Milestones detected: matchId, milestoneType, player, value, timestamp
- Session analysis: matchId, format, currentPhase, projectedScore, historicalAvg
- Toss analysis: matchId, tossWinner, decision (bat/field), pitchCondition, venueHistory
AREA 14: NOTIFICATION & PUSH (Currently ~10%) — NOT IN ORIGINAL PLAN
Files:
- backend/src/services/notificationService.ts — creates notifications, Redis pub/sub, push to mobile
- backend/src/services/pushService.ts — web push (VAPID) to mobile/browser
MISSING — must add:
- Notification creation: userId, type, title, channel (in-app/push/both)
- Redis pub/sub publish: channel, payloadSize, subscriberCount
- Push notification send: userId, deviceCount, success/failure per device
- VAPID authentication: key rotation, errors
- Push delivery failures: endpoint, statusCode, reason (expired subscription, etc.)
AREA 15: TELEGRAM BOT (Currently ~0%) — NOT IN ORIGINAL PLAN
Files:
- backend/src/services/telegram/bot.ts — main bot (GramMyBot), commands, AI text processing
- backend/src/services/telegram/linkService.ts — Telegram↔Hannibal account linking
- backend/src/services/telegram/rateLimit.ts — rate limiting
- backend/src/services/telegram/uiRenderer.ts — keyboard/message rendering
MISSING — must add:
- Command handling: chatId, command (/start, /help, /balance, /bets), userId, latencyMs
- AI text processing: chatId, messageText (truncated), responseLatencyMs, toolCalls
- Account linking: telegramUserId, hannibalUserId, linked/unlinked, success/failure
- Rate limit hits: chatId, command, blocked (yes/no)
- Callback queries: chatId, callbackData, action taken
- Session management: chatId, sessionCreated/expired, duration
AREA 16: ODDS TRANSFORMATION & MARGINS (Currently ~5%) — NOT IN ORIGINAL PLAN
Files:
- backend/src/services/oddsTransformer.ts — raw OddsPAPI→frontend format, margin adjustment
- backend/src/services/marginService.ts — platform margin applied to Betfair odds (revenue source)
- backend/src/services/limitsService.ts — stake limits per fixture/outcome
MISSING — must add:
- Odds transformation: fixtureId, marketCount, outcomeCount, transformLatencyMs
- Margin application: fixtureId, originalOdds, marginRate, adjustedOdds, revenueImpact
- Limits enrichment: fixtureId, platformMin, depthData, computedLimits
AREA 17: AGENT HIERARCHY & USER AGENTS (Currently ~10%) — NOT IN ORIGINAL PLAN
Files:
- backend/src/services/agentHierarchyService.ts — recursive CTE traversal, parent-child
- backend/src/services/agentMonitoringService.ts — agent alerting via Redis/OddsPAPI
- backend/src/services/userAgentService.ts — autonomous agents (value_hunter, steam_tracker, arbitrage)
- backend/src/services/watchlistService.ts — fixture/team watchlists with AI alerts
MISSING — must add:
- Hierarchy traversal: rootAgentId, depth, nodeCount, latencyMs
- Agent alerts: agentId, alertType (odds change, match finished), fixtureId, triggered
- Autonomous agent execution: agentId, agentType, evaluationCycle, actionsGenerated
- Watchlist alerts: watchlistId, userId, alertCondition, matched (yes/no)
AREA 18: ROUTES — API ENDPOINTS (Currently ~30%) — NOT IN ORIGINAL PLAN
Files not covered:
- backend/src/routes/ai.ts — AI features (analysis, watchlists, agents, chat)
- backend/src/routes/orders.ts — order placement/retrieval, schema validation
- backend/src/routes/referrals.ts — referral code generation
- backend/src/routes/settlements.ts — settlement periods, processing, ledger verification
- backend/src/routes/users.ts — user profile, balance
- backend/src/routes/health.ts — health checks (/health, /health/ready)
MISSING — must add:
- Request/response logging for each route group (method, path, statusCode, latencyMs, userId)
- Note: most of this will be covered by the HTTP request middleware (Step 1c), but route-specific business logic logging is still needed:
- Order validation failures with specific field/reason
- Settlement period transitions triggered via API
- AI analysis requests with fixtureId and cache status
- Health check detailed status per service
AREA 19: ERROR HANDLING MIDDLEWARE — NOT IN ORIGINAL PLAN
File: backend/src/middleware/errorHandler.ts
Custom error classes: ValidationError, AuthenticationError, AuthorizationError, NotFoundError, ConflictError, ExternalServiceError
MISSING — must add:
- Error categorization: operational vs programming bug
- Error class distribution logging (which error types most frequent)
- Stack trace persistence with request context
- User impact flag (blocks request vs non-fatal)
- External service errors: which service, statusCode, retryable
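The operational-vs-programming split can be sketched from the custom error classes listed above (the class list is from the plan; the mapping approach is an assumption, not the actual errorHandler.ts):

```typescript
// Sketch — classifies errors by class name; the real middleware may use instanceof instead.
const OPERATIONAL = new Set([
  "ValidationError",
  "AuthenticationError",
  "AuthorizationError",
  "NotFoundError",
  "ConflictError",
  "ExternalServiceError",
]);

export function categorizeError(err: Error): "operational" | "programming" {
  // Known, expected error classes are operational; anything else is treated as a bug.
  return OPERATIONAL.has(err.constructor.name) ? "operational" : "programming";
}
```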
AREA 20: REDIS OPERATIONS (Currently ~25%)
File: backend/src/services/redis.ts
Currently logged: Connection state changes only.
MISSING — must add:
- Cache hits/misses (debug level, controlled by LOG_LEVEL)
- Cache errors with key name context
- Lock acquisition failures (key, contention duration)
- Lock release verification (success or stolen lock)
- Pub/sub channel health
- Slow Redis operations
AREA 21: DATABASE OPERATIONS (Currently ~50%)
File: backend/src/services/database.ts
Currently logged: Connection success, slow queries in dev (>100ms).
MISSING — must add:
- Slow query logging in production (>200ms threshold)
- Structured error logging for Prisma errors with query context
- Transaction rollback logging
- Connection pool metrics (if available)
- Constraint violation logging with context
AREA 22: EXCHANGE MAPPERS & ID RESOLUTION — NOT IN ORIGINAL PLAN
Files:
- backend/src/exchanges/core/ProviderRegistry.ts — provider instance management
- backend/src/exchanges/core/FuzzyMatcher.ts — fuzzy string matching for ID resolution
- backend/src/exchanges/adapters/betfair/BetfairCache.ts — in-memory market data cache
- Various mapper files (BetfairMapper, BifrostMapper, OddsPapiMapper, PinnacleMapper)
MISSING — must add:
- Provider registration/deregistration events
- Fuzzy match: inputString, candidates, bestMatch, score, threshold
- Betfair cache: size, evictions, hit rate (periodic summary, not per-access)
- Mapper transform errors: provider, entityType, fieldName, rawValue, error
AREA 23: FRONTEND — CLIENT-SIDE LOGGING (Currently ~5%) ⚠️ CRITICAL — NOT IN ORIGINAL PLAN
Current state: 33 files with scattered console.log/error/warn. All logs are browser-only (lost when tab closes). No global error handler. No structured logging. No backend error sink.
23a. Error Handling (Currently: ErrorBoundary only)
Files:
- frontend/src/components/ui/ErrorBoundary.tsx — catches React render errors, dev-only console.error
- frontend/src/app/layout.tsx — wraps app in ErrorBoundary
What slips through ErrorBoundary (NOT caught):
- Async errors (promises, setTimeout)
- Event handler errors (onClick, onChange)
- WebSocket event errors
- React Query failures
- Axios interceptor errors
MISSING — must add:
- Global window.addEventListener('unhandledrejection') handler
- Global window.addEventListener('error') handler
- All unhandled errors captured and sent to the backend
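A sketch of what globalErrorHandler.ts could look like (the report sink, event shapes, and return value are assumptions; the real file would wire window.addEventListener as shown in the comments, since window does not exist outside the browser):

```typescript
// Sketch of globalErrorHandler.ts — handler wiring assumed; report() is the backend sink.
export function installGlobalHandlers(report: (event: Record<string, unknown>) => void) {
  const onError = (ev: { message?: string; filename?: string; lineno?: number }) =>
    report({ level: "error", msg: ev.message ?? "window.onerror", file: ev.filename, line: ev.lineno, source: "global-error" });

  const onRejection = (ev: { reason?: unknown }) =>
    report({ level: "error", msg: String(ev.reason), source: "unhandledrejection" });

  // In the browser:
  // window.addEventListener("error", onError);
  // window.addEventListener("unhandledrejection", onRejection);
  return { onError, onRejection }; // returned here only so the handlers are testable
}
```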
23b. API Error Logging (Currently: toast only)
Files:
- frontend/src/lib/api.ts — Axios interceptor handles 401 refresh, warns on missing token
- 47+ files with useMutation/useQuery — most use onError: (err) => toast.error(msg), nothing else
MISSING — must add:
- Axios response interceptor: log every error response (url, method, status, latencyMs, errorCode)
- Axios request interceptor: capture request start time for latency calculation
- React Query global onError handler at the QueryClient level
- Failed mutation details: mutationKey, variables (sanitized), error
23c. WebSocket Logging (Currently: console.log)
File: frontend/src/hooks/useWebSocket.ts — 30 lines of console.log
Currently logged (to browser console only):
- Connection creation, connect ID, disconnect reason, connect_error
- Balance/settlement/notification/order status updates received
- Cricket ball events
MISSING — must add:
- Reconnection attempts (count, delay, transport fallback)
- Connection duration on disconnect
- Message delivery gaps (missed balls, out-of-order events)
- Subscription failures
23d. Performance Monitoring (Currently: none)
Only tracked: Auth latency and AI response latency (sent to Mixpanel, not logged).
MISSING — must add:
- Core Web Vitals (LCP, FID, CLS, INP, TTFB)
- Page navigation timing
- API call latency distribution
- Long task detection (>50ms)
23e. Authentication Flow (Currently: console.error)
Files:
- frontend/src/hooks/useAuth.ts
- frontend/src/components/providers/Web3AuthProvider.tsx (18 lines)
- frontend/src/components/auth/LoginModal.tsx
MISSING — must add:
- Login attempt: provider (metamask/phantom/email), success/failure, latencyMs
- Web3Auth init: success/failure, latencyMs, retryCount
- Token refresh: success/failure, reason
- Session expiry: duration, trigger
23f. AI & Cricket UI (Currently: console.error on failure)
Files: 4 AI component files, 6 cricket component files
MISSING — must add:
- AI analysis request: fixtureId, type, latencyMs, cacheHit, error
- Cricket subscription: matchId, subscribed/unsubscribed, ballsReceived, gaps
- Deep analysis: fixtureId, latencyMs, success/error
Frontend Logging Architecture
Frontend cannot write to Loki directly (would expose Loki to the internet). Instead:
Browser
│
│ 1. frontendLogger captures structured events
│ 2. Batches in memory (max 50 events or 10s flush)
│ 3. POST /api/logs/client (fire-and-forget, no retry on 4xx)
│
▼
Backend: POST /api/logs/client
│
│ 1. Validates payload (max 100 events per request, size limit)
│ 2. Rate-limited (100 req/min per IP)
│ 3. Logs each event via Winston with source: "frontend"
│
▼
Winston (stdout JSON) → Promtail → Loki → S3
Every frontend log line becomes:
{
"ts": "2026-02-26T10:30:00.123Z",
"level": "error",
"msg": "API call failed",
"service": "hannibal-frontend",
"source": "client",
"sessionId": "sess_abc123",
"userId": "u_456",
"url": "/api/orders",
"method": "POST",
"statusCode": 500,
"latencyMs": 1234,
"errorCode": "INTERNAL_SERVER_ERROR",
"userAgent": "Mozilla/5.0...",
"page": "/sports/cricket"
}
New frontend files:
- frontend/src/lib/frontendLogger.ts — structured logger with batching + backend sink
- frontend/src/lib/globalErrorHandler.ts — window error + unhandledrejection handlers
- frontend/src/lib/performanceMonitor.ts — Web Vitals + API latency tracking
- frontend/src/hooks/useErrorReporting.ts — React hook for component error context
New backend files:
backend/src/routes/logs.ts — POST /api/logs/client endpoint (validated, rate-limited)
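The endpoint's payload guard could be factored out roughly like this (the 100-event cap is from the plan; the byte limit, names, and return shape are assumptions):

```typescript
// Sketch of the validation step for POST /api/logs/client — limits partly assumed.
export const MAX_EVENTS = 100;            // from the plan: max 100 events per request
export const MAX_BODY_BYTES = 256 * 1024; // size limit value assumed

export function validateClientLogBatch(
  body: unknown,
): { ok: true; events: unknown[] } | { ok: false; reason: string } {
  if (!body || typeof body !== "object" || !Array.isArray((body as { events?: unknown }).events))
    return { ok: false, reason: "events array required" };
  const events = (body as { events: unknown[] }).events;
  if (events.length === 0) return { ok: false, reason: "empty batch" };
  if (events.length > MAX_EVENTS) return { ok: false, reason: `max ${MAX_EVENTS} events per request` };
  if (JSON.stringify(events).length > MAX_BODY_BYTES) return { ok: false, reason: "payload too large" };
  return { ok: true, events };
}
```

The route handler would then log each accepted event via Winston with source: "frontend" and rely on separate middleware for the per-IP rate limit.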
Modified frontend files:
- frontend/src/lib/api.ts — Axios interceptors for request/response logging
- frontend/src/hooks/useWebSocket.ts — structured event logging
- frontend/src/lib/analytics/index.ts — wire frontendLogger alongside Mixpanel
- frontend/src/components/ui/ErrorBoundary.tsx — send errors to backend
- frontend/src/app/layout.tsx — init globalErrorHandler + performanceMonitor
- frontend/src/components/providers/Web3AuthProvider.tsx — auth flow logging
- frontend/src/hooks/useAuth.ts — login/logout event logging
What stays in Mixpanel (user behavior analytics):
- 56 existing user events (bet_placed, fixture_viewed, etc.)
- These are product analytics, not operational logs — keep them separate
What goes to Loki/S3 via backend sink (operational logs):
- Errors (unhandled, API failures, WebSocket errors)
- Performance (Web Vitals, API latency, long tasks)
- Auth events (login/logout attempts, token refresh)
- WebSocket health (connect/disconnect, reconnection, message gaps)
- Critical UI events (ErrorBoundary catches, blank screens)
Implementation Steps
Step 1: Foundation — Winston + Correlation IDs
1a. Switch Winston to Structured JSON Output
Modify: backend/src/utils/logger.ts
Current: plain text with colorized console, printf format, circular reference handling, child logger support.
Target: structured JSON with service metadata, dual-format (JSON prod / colorized dev).
Every log line becomes:
{"ts":"2026-02-26T10:30:00.123Z","level":"info","msg":"Order placed","service":"hannibal-backend","env":"production","requestId":"abc-123","userId":"u_456","orderId":"ord_789","latencyMs":45}
Changes:
- Replace printf format with winston.format.json() for non-dev environments
- Add defaultMeta: { service: 'hannibal-backend', env: process.env.NODE_ENV }
- Keep colorized console format for local dev via NODE_ENV === 'development' check
- Add winston.format.errors({ stack: true }) for full stack traces
- Preserve existing safe metadata serialization (circular reference handling already built in)
1b. Add AsyncLocalStorage for Correlation IDs
New file: backend/src/utils/requestContext.ts
Uses Node.js built-in AsyncLocalStorage to automatically inject requestId, userId, traceId into every log line without passing logger around.
Note: Partial infrastructure already exists:
- errorHandler.ts already reads the x-request-id header
- CORS already allows the X-Request-ID header
- logger.child({ requestId }) capability exists but isn't used consistently
Modify: backend/src/utils/logger.ts
- Add a format that reads from AsyncLocalStorage and injects context fields into every log line
New middleware: backend/src/middleware/requestContext.ts
- Wraps every request in AsyncLocalStorage.run() with { requestId, userId, traceId }
- Generates a UUID if no x-request-id header is present
- Sets the x-request-id response header for tracing
- Extracts userId from the authenticated request (after auth middleware runs)
Modify: backend/src/index.ts
- Add requestContextMiddleware before all route handlers (after CORS, before auth)
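The plumbing for Step 1b can be sketched with Node's built-in AsyncLocalStorage (the middleware signature uses minimal stand-in types instead of Express, and withContext illustrates what the Winston format hook would do):

```typescript
// Sketch of requestContext.ts — Express types replaced by minimal shapes for illustration.
import { AsyncLocalStorage } from "node:async_hooks";
import { randomUUID } from "node:crypto";

interface RequestContext { requestId: string; userId?: string; traceId?: string; }
interface MinimalReq { headers: Record<string, string | undefined>; }
interface MinimalRes { setHeader(name: string, value: string): void; }

export const requestContext = new AsyncLocalStorage<RequestContext>();

// Middleware: every request runs inside its own context store.
export function requestContextMiddleware(req: MinimalReq, res: MinimalRes, next: () => void) {
  const requestId = req.headers["x-request-id"] ?? randomUUID(); // reuse caller's ID or mint one
  res.setHeader("x-request-id", requestId);                      // echo for cross-service tracing
  requestContext.run({ requestId }, next);
}

// What the Winston format hook does conceptually: merge the active context into each entry.
export function withContext(fields: Record<string, unknown>): Record<string, unknown> {
  return { ...requestContext.getStore(), ...fields };
}
```

Because the store follows the async call chain, any service code that logs during the request picks up the same requestId with no parameter passing.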
Impact: All 106 files that import logger automatically get correlation IDs in every log line. Zero changes needed in those files.
1c. HTTP Request/Response Middleware
New middleware addition in backend/src/index.ts:
- Log every HTTP request: method, path, statusCode, latencyMs, userId, requestSize, responseSize
- Replaces/enhances existing Morgan "combined" format with structured JSON
- Excludes health check endpoints from verbose logging (or logs at debug level)
1d. Provider API Call Logging Wrapper
New file: backend/src/utils/httpLogger.ts
- Generic wrapper for external HTTP calls
- Captures: provider, method, url, latencyMs, statusCode, success, error, requestSize, responseSize
- Usage: await withHttpLogging('betfair', 'placeOrder', () => axios.post(...))
- Can be applied as an axios interceptor or a manual wrapper
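A sketch of the wrapper shape (the call signature matches the usage line above; the log call and field names are assumptions, and the real version would route through Winston rather than console.log):

```typescript
// Sketch of httpLogger.ts — logs latency and outcome, never swallows the error.
export async function withHttpLogging<T>(
  provider: string,
  operation: string,
  call: () => Promise<T>,
): Promise<T> {
  const start = Date.now();
  try {
    const result = await call();
    console.log(JSON.stringify({ provider, operation, latencyMs: Date.now() - start, success: true }));
    return result;
  } catch (err) {
    console.log(JSON.stringify({
      provider, operation, latencyMs: Date.now() - start, success: false,
      error: err instanceof Error ? err.message : String(err),
    }));
    throw err; // preserve caller semantics — log, then rethrow
  }
}
```

Rethrowing keeps retry and failover logic in the clients unchanged: the wrapper only observes.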
Step 2: P0 — Critical Business Logic Gaps (~25 files)
2a. Provider Clients
Modify (wrap all API methods with httpLogger):
- backend/src/exchanges/adapters/betfair/BetfairClient.ts — listMarketCatalogue, placeOrders, cancelOrders, listCurrentOrders, listClearedOrders, getAccountFunds, login, session refresh
- backend/src/exchanges/adapters/pinnacle/PinnacleClient.ts — all methods + token bucket state
- backend/src/exchanges/adapters/bifrost/BifrostClient.ts — all methods
- backend/src/exchanges/adapters/roanuz/RoanuzDataAdapter.ts — all methods
- backend/src/services/betsApi.ts — BetsAPI calls for odds fetching
2b. Exchange Coordination & Failover
Modify:
- backend/src/exchanges/core/coordinator/ExchangeCoordinator.ts — routing decisions, provider selection
- backend/src/exchanges/core/coordinator/FailoverManager.ts — circuit breaker state transitions, health checks
- backend/src/exchanges/core/coordinator/RoutingStrategy.ts — strategy evaluation, provider ranking
- backend/src/exchanges/core/coordinator/IDRegistry.ts — ID lookups, cache hits/misses
2c. Bifrost Pipeline
Modify:
- backend/src/exchanges/adapters/bifrost/BifrostQueueManager.ts — message consumption, protobuf decode, processing latency
- backend/src/exchanges/adapters/bifrost/BifrostRecoveryManager.ts — recovery cycles, items recovered
- backend/src/exchanges/adapters/bifrost/BifrostBetConsumer.ts — bet status updates, settlement from queue
2d. B-Book System
Modify:
- backend/src/services/bbook/bbookConfigService.ts — config changes
- backend/src/services/bbook/bbookFillService.ts — fill decisions, position creation, capacity checks
- backend/src/services/bbook/bbookSettlementService.ts — position settlement, P&L
- backend/src/services/bbook/bbookStateService.ts — exposure levels, Redis/PG sync
- backend/src/services/bbook/filterEngine.ts — routing decisions, limit checks
2e. Order + Settlement
Modify:
- backend/src/services/orderService.ts — validation, balance, exposure, slippage details
- backend/src/services/settlement.ts — outcome determination, commission, void handling
- backend/src/jobs/orderSyncJob.ts — cycle timing, batch size, status changes
- backend/src/jobs/agentSettlementJob.ts — per-agent calculations
- backend/src/jobs/pinnacleSettlementJob.ts — settlement data received/processed
- backend/src/jobs/reportEmailJob.ts — report generation, SMTP
2f. Financial Transactions & Accounting
Modify:
- backend/src/accounting/ledgerService.ts — every ledger entry with full context
- backend/src/accounting/commissionService.ts — rate lookup, calculation steps
- backend/src/accounting/agentSettlementService.ts — per-agent calculations
- backend/src/accounting/availableService.ts — AC computation steps
- backend/src/accounting/takeService.ts — take balance event-driven updates
- backend/src/accounting/marginRevenueService.ts — margin profit per order
- backend/src/accounting/carryoverService.ts — deferred settlement requests
- backend/src/accounting/transferSettleService.ts — upline/downline reconciliation
- backend/src/accounting/settlementPeriodService.ts — period lifecycle transitions
- backend/src/accounting/reportService.ts — report generation
- backend/src/accounting/exportService.ts — Excel/CSV export
Step 3: P1 — High-Priority Operational Gaps (~20 files)
3a. Monitoring Agents (all files)
Modify:
- backend/src/services/monitoring/baseMonitoringService.ts — evaluation cycles, alert decisions
- backend/src/services/monitoring/playerMonitoringService.ts — bet event processing, condition results
- backend/src/services/monitoring/cricketMonitoringService.ts — cricket-specific conditions
- backend/src/services/monitoring/soccerMonitoringService.ts — soccer-specific conditions
- backend/src/services/monitoring/conditionCompiler.ts — compilation results/errors
- backend/src/services/monitoring/conditionEvaluator.ts — evaluation trace per condition
- backend/src/services/monitoring/metricResolver.ts — metric resolution source/latency
- backend/src/services/monitoring/monitoringCrudService.ts — CRUD operations
- backend/src/services/monitoring/presets.ts — preset loading
3b. Authentication & Permissions
Modify:
- backend/src/routes/auth.ts — login attempts, registration, referral validation
- backend/src/middleware/auth.ts — token verification, cache vs DB path, user status
- backend/src/middleware/agentPermissions.ts — permission flag checks (T/B/D/W)
3c. Redis Operations
Modify:
- backend/src/services/redis.ts — cache hits/misses (debug), lock contention, slow ops
3d. Database Operations
Modify:
- backend/src/services/database.ts — slow queries in prod (>200ms), transaction errors, constraint violations
3e. Error Handler Enhancement
Modify:
- backend/src/middleware/errorHandler.ts — error categorization, class distribution, external service context
Step 4: P2 — Comprehensive Coverage (~30 files)
4a. AI Interactions
Modify:
- backend/src/ai/core/agentService.ts — API calls, tool executions, tokens, latency
- backend/src/services/fixtureAnalysisService.ts — analysis generation, caching
4b. WebSocket Events
Modify:
- backend/src/services/clientWs.ts — connection duration, broadcast counts, subscription failures
- backend/src/services/oddsPapiWs.ts — message rates, extraction failures, ping/pong
- backend/src/exchanges/adapters/roanuz/RoanuzWebSocketManager.ts — cricket stream health
4c. Cricket Services
Modify:
- backend/src/services/cricket/ballByBallService.ts — ball events, wickets, milestones
- backend/src/services/cricket/cricketAnalysisService.ts — AI analysis, data sources
- backend/src/services/cricket/cricketIdResolver.ts — ID resolution, fuzzy matching
- backend/src/services/cricket/milestoneAlertService.ts — milestone detection
- backend/src/services/cricket/sessionBettingService.ts — phase analysis
- backend/src/services/cricket/tossAnalysisService.ts — toss/pitch analysis
4d. Notification & Push
Modify:
- backend/src/services/notificationService.ts — creation, pub/sub, delivery
- backend/src/services/pushService.ts — VAPID auth, delivery success/failure
4e. Telegram Bot
Modify:
- backend/src/services/telegram/bot.ts — commands, AI processing, sessions
- backend/src/services/telegram/linkService.ts — account linking
- backend/src/services/telegram/rateLimit.ts — rate limit events
4f. Odds & Margins
Modify:
- backend/src/services/oddsTransformer.ts — transformation latency
- backend/src/services/marginService.ts — margin application per fixture
- backend/src/services/limitsService.ts — stake limits computation
4g. Agent Hierarchy & User Agents
Modify:
- backend/src/services/agentHierarchyService.ts — tree traversal
- backend/src/services/agentMonitoringService.ts — agent alerts
- backend/src/services/userAgentService.ts — autonomous agent execution
- backend/src/services/watchlistService.ts — watchlist alerts
4h. Rate Limiting
Modify:
- backend/src/middleware/rateLimiter.ts — limit hits, repeat offenders
4i. Exchange Internals
Modify:
- backend/src/exchanges/core/ProviderRegistry.ts — registration events
- backend/src/exchanges/core/FuzzyMatcher.ts — match scores
- backend/src/exchanges/adapters/betfair/BetfairCache.ts — cache size, hit rate (periodic)
4j. Routes (business logic in API handlers)
Modify:
- backend/src/routes/orders.ts — validation failures
- backend/src/routes/settlements.ts — period transitions
- backend/src/routes/ai.ts — analysis requests
- backend/src/routes/referrals.ts — code generation
- backend/src/routes/users.ts — profile operations
Step 5: Frontend Logging Pipeline (~15 files)
5a. Frontend Logger + Backend Sink
New file: frontend/src/lib/frontendLogger.ts
- Structured JSON logger mirroring backend format
- In-memory buffer: max 50 events or 10-second flush interval
- Sends batched events via POST /api/logs/client
- Fire-and-forget — never blocks UI, drops on failure (no retry on 4xx)
- Includes session context: sessionId, userId, page, userAgent
New file: backend/src/routes/logs.ts
- POST /api/logs/client — accepts batched frontend log events
- Validates: max 100 events per request, max 10KB per event, required fields
- Rate-limited: 100 requests/min per IP (prevents abuse)
- Logs each event via Winston with service: "hannibal-frontend", source: "client"
- Flows through the same pipeline: Winston → stdout → Promtail → Loki → S3
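The validation rules above reduce to a pure function that the route handler can call before anything reaches Winston; validateClientLogBatch and its return shape are assumptions:

```typescript
// Validation core for POST /api/logs/client (a sketch; names are assumptions).
interface IncomingEvent { level?: string; message?: string; [k: string]: unknown }

type ValidationResult =
  | { ok: true; events: IncomingEvent[] }
  | { ok: false; reason: string };

export function validateClientLogBatch(body: unknown): ValidationResult {
  if (!Array.isArray(body)) return { ok: false, reason: "body must be an array" };
  if (body.length === 0 || body.length > 100) return { ok: false, reason: "1-100 events per request" };
  for (const ev of body) {
    if (typeof ev !== "object" || ev === null) return { ok: false, reason: "event must be an object" };
    const e = ev as IncomingEvent;
    if (typeof e.level !== "string" || typeof e.message !== "string")
      return { ok: false, reason: "level and message are required" };
    // Serialized size approximates the 10KB per-event cap.
    if (JSON.stringify(ev).length > 10_000) return { ok: false, reason: "event exceeds 10KB" };
  }
  return { ok: true, events: body as IncomingEvent[] };
}
```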
Modify: backend/src/index.ts
- Mount the /api/logs route (before auth middleware — client errors may come from unauthenticated users)
5b. Global Error Handlers
New file: frontend/src/lib/globalErrorHandler.ts
- window.addEventListener('error', ...) — catches uncaught JS errors
- window.addEventListener('unhandledrejection', ...) — catches unhandled promise rejections
- Captures: errorMessage, stack, componentStack, page, userAgent
- Sends to frontendLogger (which batches and sends to backend)
- Deduplicates: same error message within 5 seconds = 1 log entry with count
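The 5-second dedup window can be sketched as a small class with an injectable clock (useful for testing); ErrorDeduper and its method names are assumptions:

```typescript
// Dedup window for globalErrorHandler (a sketch; names are assumptions).
export class ErrorDeduper {
  private seen = new Map<string, { count: number; firstAt: number }>();
  constructor(private windowMs = 5000, private now: () => number = Date.now) {}

  // true => log this error now; duplicates inside the window only bump the count.
  shouldLog(message: string): boolean {
    const t = this.now();
    const entry = this.seen.get(message);
    if (entry && t - entry.firstAt < this.windowMs) {
      entry.count += 1;
      return false;
    }
    this.seen.set(message, { count: 1, firstAt: t });
    return true;
  }

  // How many duplicates were suppressed (attached to the single emitted entry).
  suppressedCount(message: string): number {
    return (this.seen.get(message)?.count ?? 1) - 1;
  }
}
```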
Modify: frontend/src/app/layout.tsx
- Import and initialize globalErrorHandler on mount
- Initialize performanceMonitor on mount
Modify: frontend/src/components/ui/ErrorBoundary.tsx
- Send caught errors to frontendLogger (not just console.error in dev)
- Include componentStack for debugging
5c. API Call Logging
Modify: frontend/src/lib/api.ts
- Request interceptor: record start timestamp
- Response interceptor: log every failed request (status >= 400): url, method, statusCode, latencyMs, errorCode
- Log token refresh attempts: success/failure, latencyMs
- Log auth:logout events triggered by failed refresh
5d. React Query Global Error Handler
Modify: frontend/src/lib/providers.tsx (or wherever QueryClient is created)
- Add a QueryClient default onError callback
- Logs all failed queries: queryKey, error message, status code
- This catches the 47+ files that have no onError handler
5e. WebSocket Logging
Modify: frontend/src/hooks/useWebSocket.ts
- Replace console.log with frontendLogger.info/error
- Add: reconnection attempt count, transport type, connection duration
- Add: message gap detection (cricket ball sequence)
- Keep existing connection/disconnect logging but structured
5f. Performance Monitoring
New file: frontend/src/lib/performanceMonitor.ts
- Core Web Vitals via the web-vitals library (LCP, FID, CLS, INP, TTFB)
- Reports once per page load via frontendLogger
- Optional: long task detection via PerformanceObserver (the 'longtask' entry type)
5g. Auth Flow Logging
Modify: frontend/src/hooks/useAuth.ts
- Log login attempt: provider, success/failure, latencyMs
- Log logout: trigger (manual/session expired/refresh failed)
Modify: frontend/src/components/providers/Web3AuthProvider.tsx
- Log Web3Auth init: success/failure, latencyMs, retryCount
- Log wallet connection: chain, address (first 6 chars), success/failure
Step 6: AI Chat Service (Python)
Modify: services/ai-chat-assistant/app/
- Configure Python logging to output structured JSON to stdout (matching backend format)
- Add correlation ID propagation from the x-request-id request header
- Log every LLM provider call: provider (Grok/Groq/Ollama/OpenAI/Anthropic), model, tokens, latency, success/error
- Log provider fallback chain: which provider tried, failed, next selected
- Log conversation history operations (Redis reads/writes)
- Log streaming SSE events: chunkCount, totalLatencyMs
Promtail already collects Docker container stdout — logs flow to Loki/S3 automatically.
Step 7: Loki Upgrade + S3 Backend
7a. Upgrade Loki 2.9.2 → 3.x
CRITICAL PREREQUISITE: upgrade Loki first. Loki 2.9.2 predates stable support for the schema v13 + TSDB + structured-metadata combination this plan depends on (Steps 7b and 9).
Modify: infra/docker/docker-compose.monitoring.yml
- Change grafana/loki:2.9.2 → grafana/loki:3.3.2 (or latest 3.x stable)
- Add AWS credentials env vars:
environment:
- AWS_ACCESS_KEY_ID=${LOKI_AWS_ACCESS_KEY_ID}
- AWS_SECRET_ACCESS_KEY=${LOKI_AWS_SECRET_ACCESS_KEY}
- AWS_REGION=${AWS_REGION:-ap-south-1}
Migration notes:
- Schema v11 (boltdb-shipper) data remains readable by Loki 3.x
- New data writes use schema v13 (TSDB) with S3
- Old data on filesystem will not be migrated (acceptable — current data is ephemeral on container restart anyway)
- Test on staging before production rollout
7b. Reconfigure Loki for S3
Modify: infra/monitoring/loki/loki-config.yml
Add new schema config entry (keep old for backward compat):
schema_config:
configs:
# Legacy — reads old data if any
- from: 2020-10-24
store: boltdb-shipper
object_store: filesystem
schema: v11
index:
prefix: index_
period: 24h
# New — all new writes go to S3
- from: 2026-03-01
store: tsdb
object_store: s3
schema: v13
index:
prefix: loki_index_
period: 24h
storage_config:
  aws:
    bucketnames: hannibal-logs
    region: ${AWS_REGION}
  tsdb_shipper:
    active_index_directory: /loki/tsdb-index
    cache_location: /loki/tsdb-cache
  # Keep filesystem for legacy boltdb-shipper reads
  boltdb_shipper:
    active_index_directory: /loki/boltdb-shipper-active
  filesystem:
    directory: /loki/chunks
compactor:
  working_directory: /loki/compactor
  delete_request_store: s3
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
Note: Loki 3.x removed the shared_store setting. Each period's object store now comes from its schema_config entry, and the compactor requires delete_request_store when retention_enabled is true.
Step 8: S3 Bucket + Lifecycle Policies
New file: infra/scripts/setup-s3-logging.sh
Creates S3 bucket with:
- Versioning disabled (logs are immutable)
- Server-side encryption (SSE-S3)
- Lifecycle policy:
- Standard: 0-30 days ($0.023/GB)
- Standard-IA: 30-90 days ($0.0125/GB)
- Glacier Instant: 90-365 days ($0.004/GB)
- Deep Archive: 365+ days ($0.00099/GB)
- Expiration: 7 years (compliance)
- Bucket policy restricting access to Loki service role
- Block public access enabled
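The tiering above maps to a single S3 lifecycle configuration, roughly what setup-s3-logging.sh would pass to aws s3api put-bucket-lifecycle-configuration (a sketch; 7 years ≈ 2555 days):

```json
{
  "Rules": [
    {
      "ID": "hannibal-logs-tiering",
      "Status": "Enabled",
      "Filter": {},
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER_IR" },
        { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
      ],
      "Expiration": { "Days": 2555 }
    }
  ]
}
```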
Estimated cost (10GB/day logs after compression):
- ~$50-70/year total storage
Step 9: Enhanced Promtail Label Extraction
Modify: infra/monitoring/promtail/promtail-config.yml
Current pipeline already has JSON extraction and log level detection. Enhance:
- Extract the service field as a Loki label (enables per-service queries)
- Extract requestId, userId as structured metadata (Loki 3.x supports this without label cardinality explosion)
- Keep existing container/compose label extraction
- Add orderId, provider, fixtureId as structured metadata
- Enables queries like:
{service="hannibal-backend"} | json | provider="betfair" | latencyMs > 1000
Warning: Do NOT add high-cardinality fields (requestId, orderId, userId) as Loki labels — this will destroy performance. Use Loki's structured metadata or | json filter syntax instead.
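A sketch of the pipeline-stage additions, assuming the app's JSON fields are named as in this plan (the structured_metadata stage requires Promtail 2.9+ with schema v13 on the Loki side):

```yaml
pipeline_stages:
  - json:
      expressions:
        service: service
        level: level
        requestId: requestId
        orderId: orderId
        provider: provider
  - labels:                 # low-cardinality only — safe as Loki labels
      service:
      level:
  - structured_metadata:    # high-cardinality — queryable, but not indexed as labels
      requestId:
      orderId:
      provider:
```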
Step 10: Grafana Dashboards
New dashboards in infra/monitoring/grafana/dashboards/:
- logging-overview.json — log volume by service/level, error rate trending, top error messages
- provider-health.json — API latency per provider (from structured logs), error rates, availability
- order-audit.json — order lifecycle flow (validated → routed → submitted → settled), timing
- financial-audit.json — balance changes, commissions, settlements, take balance trends
- bbook-dashboard.json — B-Book fills, exposure levels, routing decisions, settlement P&L
- exchange-coordination.json — circuit breaker states, failover events, routing distribution
- frontend-health.json — client errors, Web Vitals, API latency from client perspective, WebSocket health
(Existing system-overview.json dashboard remains unchanged)
Future: Athena for Historical Queries
When log volume justifies:
- AWS Athena table pointing to S3 bucket for SQL on logs >31 days old
- Grafana Athena datasource plugin for historical dashboards
- Nightly JSONL→Parquet ETL for 10x cheaper Athena queries
Complete File Change Summary
New Files — Backend (4)
- backend/src/utils/requestContext.ts — AsyncLocalStorage singleton
- backend/src/middleware/requestContext.ts — request context middleware
- backend/src/utils/httpLogger.ts — provider API call wrapper
- backend/src/routes/logs.ts — POST /api/logs/client (frontend error sink)
New Files — Frontend (4)
- frontend/src/lib/frontendLogger.ts — structured logger with batching + backend sink
- frontend/src/lib/globalErrorHandler.ts — window error + unhandledrejection handlers
- frontend/src/lib/performanceMonitor.ts — Web Vitals + API latency
- frontend/src/hooks/useErrorReporting.ts — React hook for component error context
New Files — Infra (1)
- infra/scripts/setup-s3-logging.sh — S3 bucket setup
Modified Files — Infrastructure (5)
- backend/src/utils/logger.ts — JSON format, AsyncLocalStorage integration
- backend/src/index.ts — requestContext middleware, HTTP timing, /api/logs route
- infra/monitoring/loki/loki-config.yml — S3 backend, schema v13
- infra/docker/docker-compose.monitoring.yml — Loki 3.x, AWS creds
- infra/monitoring/promtail/promtail-config.yml — enhanced label extraction
Modified Files — Frontend (~10 files)
- frontend/src/lib/api.ts — Axios interceptors for request/response logging
- frontend/src/hooks/useWebSocket.ts — structured event logging via frontendLogger
- frontend/src/lib/analytics/index.ts — wire frontendLogger alongside Mixpanel
- frontend/src/components/ui/ErrorBoundary.tsx — send errors to backend
- frontend/src/app/layout.tsx — init globalErrorHandler + performanceMonitor
- frontend/src/components/providers/Web3AuthProvider.tsx — auth flow logging
- frontend/src/hooks/useAuth.ts — login/logout event logging
- frontend/src/lib/providers.tsx — QueryClient global onError handler
Modified Files — P0 Critical Backend (~25 files)
- Provider clients: BetfairClient, PinnacleClient, BifrostClient, RoanuzDataAdapter, betsApi
- Exchange coordination: ExchangeCoordinator, FailoverManager, RoutingStrategy, IDRegistry
- Bifrost pipeline: BifrostQueueManager, BifrostRecoveryManager, BifrostBetConsumer
- B-Book: bbookConfigService, bbookFillService, bbookSettlementService, bbookStateService, filterEngine
- Orders: orderService, settlement
- Jobs: orderSyncJob, agentSettlementJob, pinnacleSettlementJob, reportEmailJob
- Accounting: ledgerService, commissionService, agentSettlementService, availableService, takeService, marginRevenueService, carryoverService, transferSettleService, settlementPeriodService, reportService, exportService
Modified Files — P1 High Priority Backend (~15 files)
- Monitoring: baseMonitoringService, playerMonitoringService, cricketMonitoringService, soccerMonitoringService, conditionCompiler, conditionEvaluator, metricResolver, monitoringCrudService
- Auth: auth route, auth middleware, agentPermissions
- Infrastructure: redis, database, errorHandler
Modified Files — P2 Comprehensive Backend (~35 files)
- AI: agentService, fixtureAnalysisService
- WebSocket: clientWs, oddsPapiWs, RoanuzWebSocketManager
- Cricket: ballByBallService, cricketAnalysisService, cricketIdResolver, milestoneAlertService, sessionBettingService, tossAnalysisService
- Notifications: notificationService, pushService
- Telegram: bot, linkService, rateLimit
- Odds: oddsTransformer, marginService, limitsService
- Agents: agentHierarchyService, agentMonitoringService, userAgentService, watchlistService
- Exchange: ProviderRegistry, FuzzyMatcher, BetfairCache
- Routes: orders, settlements, ai, referrals, users
- Middleware: rateLimiter
- Python: ai-chat-assistant (5+ files)
New Grafana Dashboards (7)
- logging-overview.json, provider-health.json, order-audit.json
- financial-audit.json, bbook-dashboard.json, exchange-coordination.json
- frontend-health.json
Total: ~95+ files modified, 9 new files, 7 new dashboards
Performance Impact
LATENCY IMPACT: ~0ms (Winston writes to stdout, Promtail reads async)
CPU IMPACT: Minimal (JSON.stringify per log line, ~0.01ms)
MEMORY IMPACT: None (no in-app buffering, Promtail handles collection)
FAILURE MODE: If Loki/S3 is down, Promtail queues locally and retries — app unaffected
Cross-System Impact
- backend: Logger format change, new requestContext middleware, logging additions in ~80 files, /api/logs/client endpoint
- frontend: frontendLogger + global error handler + performance monitor + API/WS logging (~10 modified, 4 new files)
- services/ai-chat: Structured JSON logging, correlation ID propagation
- infra: Loki 3.x upgrade, S3 backend config, S3 bucket + lifecycle, Promtail enhancement, 7 new dashboards
Key Best Practices Applied
- Structured JSON logging — machine-parseable, Promtail/Loki/Athena compatible
- AsyncLocalStorage correlation IDs — every log line traceable to a request, zero code changes in existing 106 files
- External pipeline (Promtail→Loki→S3) — decoupled from app, no backpressure risk
- Structured metadata (not labels) for high-cardinality fields — prevents Loki performance degradation
- gzip compression — ~80% storage reduction
- Lifecycle policies — automatic cost optimization (Standard→IA→Glacier→Deep Archive)
- Dev/prod format split — colorized console in dev, JSON in prod
- Provider call wrapping — single wrapper captures all external API metrics uniformly
- Log level discipline — error (unrecoverable), warn (degraded), info (business events), debug (operational detail)
- Dual schema config — old boltdb-shipper for legacy reads, new TSDB+S3 for writes (zero-downtime migration)