Phase 1: Comprehensive Logging to AWS S3
Context
Hannibal has Winston logging in 106+ files, but output goes to console only (plain text, printf format). Promtail + Loki + Grafana already exist in docker-compose, but Loki uses local filesystem storage (lost on container restart) and is on the outdated version 2.9.2. Massive logging gaps exist across critical systems: ~75 backend files have no or minimal logging. This plan fixes both problems: fill all logging gaps AND ship everything to S3 for durable, queryable storage.
Current State vs Target
| Area | Current Coverage | Target | Files Affected |
|---|---|---|---|
| Order placement | ~70% | 100% | 1 file |
| Settlement | ~65% | 100% | 2 files |
| Provider API calls | ~20% | 100% | 4 adapter clients |
| Exchange coordination & failover | ~10% | 100% | 4 files (coordinator, failover, routing, registry) |
| Bifrost queue/recovery pipeline | ~5% | 100% | 3 files (queue, recovery, bet consumer) |
| B-Book system | ~15% | 100% | 5 files (config, fill, settlement, state, filter) |
| Background jobs | ~20% | 100% | 4 job files |
| Financial transactions & accounting | ~15% | 100% | 10 files (ledger, commission, take, carryover, margin, etc.) |
| Monitoring agents & evaluation | ~10% | 100% | 9 files (base, player, cricket, soccer, compiler, evaluator, resolver, crud, presets) |
| AI interactions (backend) | ~5% | 100% | 2 files (agentService, fixtureAnalysisService) |
| AI chat service (Python) | ~10% | 100% | 5+ files |
| Authentication & permissions | ~30% | 100% | 3 files (auth route, auth middleware, agentPermissions) |
| WebSocket events | ~60% | 100% | 3 files (clientWs, oddsPapiWs, RoanuzWebSocketManager) |
| Redis operations | ~25% | 90% | 1 file |
| Cricket services | ~20% | 100% | 6 files (ballByBall, analysis, idResolver, milestones, sessions, toss) |
| Notification & push | ~10% | 100% | 2 files (notificationService, pushService) |
| Telegram bot | ~0% | 100% | 5 files (bot, linkService, rateLimit, uiRenderer, types) |
| Routes (API endpoints) | ~30% | 90% | 6 uncovered routes |
| Odds transformation & margins | ~5% | 100% | 3 files (oddsTransformer, marginService, limitsService) |
| Agent hierarchy & user agents | ~10% | 100% | 3 files (agentHierarchy, agentMonitoring, userAgent) |
| Database operations | ~50% | 90% | 1 file |
| Error handling middleware | ~40% | 90% | 1 file |
Total files requiring logging changes: ~80+
Architecture: Logging Pipeline to S3
Application (Winston JSON → stdout)
│
▼
Promtail (already deployed, reads Docker container logs)
│
▼
Loki 3.x (upgrade from 2.9.2, reconfigure: filesystem → S3 backend)
│
├──► S3 bucket (chunks + indexes)
│ └── Lifecycle: Standard 30d → IA 90d → Glacier 365d → Deep Archive
│
▼
Grafana (real-time queries via Loki, historical via Athena)
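The lifecycle tiering in the diagram could be expressed as an S3 lifecycle policy roughly like the following sketch (rule ID is a placeholder; note that chunks transitioned to Glacier tiers are no longer directly queryable by Loki and would need restore or Athena for historical access):

```json
{
  "Rules": [
    {
      "ID": "loki-log-tiering",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" },
        { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
      ]
    }
  ]
}
```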
Why this approach (not direct Winston→S3):
- Promtail+Loki already deployed — config change only, zero new infra
- Decoupled from app — if S3 is slow, app is unaffected
- Structured JSON stdout → Promtail parses automatically
- Grafana already connected to Loki for real-time querying
- Direct Winston→S3 transport is fragile (blocks event loop on failure, loses logs on crash)
CRITICAL: Loki version upgrade required. Current Loki 2.9.2 does NOT support schema v13 or TSDB store. Must upgrade to Loki 3.0.0+ before enabling S3 backend with TSDB. Loki 2.9.x can use S3 only with boltdb-shipper (legacy), which is deprecated in 3.x. Plan targets Loki 3.x directly.
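The target Loki 3.x storage change can be sketched as the following config fragment (bucket name, region, and schema start date are placeholders to verify against the deployed docker-compose and the Loki 3.x docs):

```yaml
# Sketch only — bucket, region, and the schema "from" date are placeholders.
common:
  storage:
    s3:
      bucketnames: hannibal-loki-logs   # placeholder bucket
      region: us-east-1                 # placeholder region
schema_config:
  configs:
    - from: "2026-03-01"                # day the new schema takes effect
      store: tsdb                       # TSDB index — requires Loki 3.x
      object_store: s3
      schema: v13
      index:
        prefix: index_
        period: 24h
```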
Detailed Logging Gaps Audit
AREA 1: ORDER PLACEMENT FLOW (Currently ~70%)
File: backend/src/services/orderService.ts
Currently logged:
- Point value loading
- B-Book routing decision & rerouting logic
- Agent booking capture
- B-Book fill execution and failures
- PAL readiness check
- Betslip fetch & slippage calculation
- PAL order submission
- Partial refund scenarios
- Batch order completion
MISSING — must add:
- Validation step entry/exit with specific failure reason
- Pre/post balance deduction values (critical for race condition debugging)
- Exposure cache invalidation timing and actual exposure values
- Slippage magnitude (requested odds vs live odds, delta value)
- Tick snapping in batch orders (OLD → NEW values)
- Exchange portion under minimum — threshold check failure reason
- Request UUID lifecycle tracking
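For the slippage item above, the logged fields might look like this sketch (function and field names are illustrative, not from orderService.ts):

```typescript
// Illustrative slippage log fields — names assumed, not the actual orderService.ts API.
export function slippageFields(requestedOdds: number, liveOdds: number) {
  const delta = Number((liveOdds - requestedOdds).toFixed(4));
  return {
    requestedOdds,
    liveOdds,
    slippage: delta,
    // For a back bet, higher live odds pay more and are favorable; for a lay bet this inverts.
    direction: delta > 0 ? "favorable" : delta < 0 ? "adverse" : "none",
  };
}
```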
AREA 2: SETTLEMENT FLOW (Currently ~65%)
Files: backend/src/services/settlement.ts, backend/src/jobs/orderSyncJob.ts
Currently logged:
- Settlement polling start, fixture status verification
- Settlement data retrieval, market-level processing
- Commission processing, B-Book settlement
- Order status outcomes, PnL divergence detection
MISSING — must add:
- Settlement data structure validation (unexpected format detection)
- Fixture status change reasons (why skipping: "LIVE but we only process FINISHED")
- Winning outcome extraction logic (filtering/mapping decisions)
- Market-level outcome determination (what data influenced win/void decision)
- Void order handling reason (refund amount = deducted amount)
- Commission rate lookup result (which config row matched)
- Margin profit edge case detection steps
- Ledger entry constraint error context
AREA 3: PROVIDER API CALLS (Currently ~20%) ⚠️ CRITICAL
Files: backend/src/exchanges/adapters/betfair/BetfairClient.ts, pinnacle/PinnacleClient.ts, bifrost/BifrostClient.ts, roanuz/RoanuzDataAdapter.ts
MISSING — must add for ALL providers:
- Every API request/response with latency timing
- Session refresh timing and success/failure
- Authentication errors with reason
- Rate limit header values and throttle state
- Retry logic (which attempt succeeded, why retries exhausted)
- Request/response payload sizes
- Error code extraction and categorization
- Failover decisions (which provider failed, which chosen as fallback)
AREA 4: EXCHANGE COORDINATION & FAILOVER (Currently ~10%) ⚠️ CRITICAL — NOT IN ORIGINAL PLAN
Files:
- backend/src/exchanges/core/coordinator/ExchangeCoordinator.ts — main routing orchestrator
- backend/src/exchanges/core/coordinator/FailoverManager.ts — circuit breaker (closed/open/half-open)
- backend/src/exchanges/core/coordinator/RoutingStrategy.ts — provider selection logic
- backend/src/exchanges/core/coordinator/IDRegistry.ts — cross-provider ID mapping with Redis cache
MISSING — must add:
- Every routing decision: which provider selected, why, fallback chain considered
- Circuit breaker state transitions: closed→open (with failure count), open→half-open (timeout), half-open→closed (success)
- Provider health check results: latency, error rate, availability score
- ID resolution: canonical→provider mapping lookups, cache hits/misses, fuzzy match scores
- Failover triggers: original provider failure reason, fallback provider selected, degraded mode entered
- Request routing latency: time added by coordination layer
AREA 5: BIFROST QUEUE PIPELINE (Currently ~5%) ⚠️ CRITICAL — NOT IN ORIGINAL PLAN
Files:
- backend/src/exchanges/adapters/bifrost/BifrostQueueManager.ts — RabbitMQ consumer (protobuf)
- backend/src/exchanges/adapters/bifrost/BifrostRecoveryManager.ts — startup + periodic recovery
- backend/src/exchanges/adapters/bifrost/BifrostBetConsumer.ts — live bet status/settlement
MISSING — must add:
- Queue message consumption: queue name, message type, protobuf decode success/failure, processing latency
- Recovery cycles: startup recovery (events + markets + phantoms), periodic 10-min recovery, items recovered count
- Bet status updates: orderId, oldStatus→newStatus, settlement outcome from queue
- Queue connection health: connected/disconnected, prefetch count, consumer tag
- Message acknowledgment: ack/nack/reject decisions with reason
- Protobuf decode errors: which field failed, raw payload size
AREA 6: B-BOOK SYSTEM (Currently ~15%) ⚠️ CRITICAL — NOT IN ORIGINAL PLAN
Files:
- backend/src/services/bbook/bbookConfigService.ts — config, enable/disable, limits
- backend/src/services/bbook/bbookFillService.ts — fill execution, position creation
- backend/src/services/bbook/bbookSettlementService.ts — position settlement, P&L
- backend/src/services/bbook/bbookStateService.ts — Redis + PG position tracking
- backend/src/services/bbook/filterEngine.ts — routing decisions
MISSING — must add:
- Config changes: who changed what setting, old→new values
- Fill decisions: orderId, fillAmount, remainingCapacity, limits checked (global/market/selection/user)
- Position creation: positionId, fixtureId, marketId, side (back/lay), stake, odds
- Settlement: positionId, outcome (win/loss/void), grossPnL, netPnL
- State tracking: exposure at global/market/selection/user levels, Redis vs PG sync status
- Routing decisions: why B-Book (within limits) vs exchange (exceeded limits), specific limit that triggered routing
- Limit breaches: which limit hit (globalPoolLimit, defaultMarketLimit, selectionLimit, userLimit, maxOdds, minStake)
- Balancing alerts: when pool becomes unbalanced (e.g., >80% on one side)
AREA 7: WEBSOCKET EVENTS (Currently ~60%)
Files: backend/src/services/clientWs.ts, backend/src/services/oddsPapiWs.ts
Currently logged:
- Client connections, fixture subscriptions, disconnections
- Redis subscriber ready, message handling
- Broadcast operations, reconnection attempts
MISSING — must add:
- Connection duration on disconnect
- Message count per client session
- Subscription failures
- Broadcast recipient count (how many clients actually received)
- Redis subscription drop detection
- Ping/pong latency tracking
- Message ordering/gap detection
- Update frequency per fixture (sampled)
- Odds extraction failures (when extraction returns zeros)
Additional file NOT in original plan:
backend/src/exchanges/adapters/roanuz/RoanuzWebSocketManager.ts — cricket streaming
- Socket.IO connection state to Roanuz API
- Subscription protocol events (subscribe/unsubscribe/ack)
- gzip decompression failures
- Event normalization errors
- Auto-reconnection attempts and backoff timing
AREA 8: AI INTERACTIONS (Currently ~5%) ⚠️ CRITICAL
Files:
- backend/src/ai/core/agentService.ts — Anthropic API calls, tool execution loop
- backend/src/services/fixtureAnalysisService.ts — AI fixture analysis with caching (NOT IN ORIGINAL PLAN)
MISSING — must add:
- Every Anthropic API call: model, inputTokens, outputTokens, latencyMs, success
- Every tool execution in agentic loop: toolName, latencyMs, success/error
- Conversation context size (message count, estimated tokens)
- Prompt construction summary (not full prompt, but template + parameters)
- Streaming latency for AI insights
- Cache hits (insights from cache vs generated fresh)
- Failed insight fallback strategy
- Fixture analysis: fixtureId, analysisType, cacheHit, generationLatencyMs, provider used
AREA 9: MONITORING AGENTS & EVALUATION (Currently ~10%) ⚠️ CRITICAL
Files in original plan:
- backend/src/services/monitoring/baseMonitoringService.ts
- backend/src/services/monitoring/playerMonitoringService.ts
- backend/src/services/monitoring/cricketMonitoringService.ts
Files NOT in original plan:
- backend/src/services/monitoring/soccerMonitoringService.ts — soccer-specific monitoring
- backend/src/services/monitoring/soccerCapabilities.ts — soccer capability definitions
- backend/src/services/monitoring/soccerResolvers.ts — soccer metric resolvers
- backend/src/services/monitoring/conditionCompiler.ts — condition→evaluable form
- backend/src/services/monitoring/conditionEvaluator.ts — runtime condition evaluation
- backend/src/services/monitoring/metricResolver.ts — metric value resolution from fixture data
- backend/src/services/monitoring/monitoringCrudService.ts — CRUD for monitor configs
- backend/src/services/monitoring/presets.ts — preset conditions by domain
- backend/src/services/monitoring/utils.ts — monitoring utilities
MISSING — must add across ALL monitoring files:
- Each evaluation cycle: monitorId, conditionsCount, conditionsMet, alertFired, evalDurationMs
- Condition compilation: input condition → compiled form, compilation errors
- Condition evaluation: metricName, resolvedValue, operator, threshold, result (pass/fail)
- Metric resolution: which resolver used, data source, resolution latency
- Alert firing decisions (WHY triggered or suppressed)
- Deduplication logic (duplicate detection, alerts skipped)
- Event subscription errors
- Monitor cache coherency (hits/misses/invalidations)
- Periodic checker execution timing
- CRUD operations: monitor created/updated/deleted by userId
- Preset loading: which presets applied, domain
AREA 10: FINANCIAL TRANSACTIONS & ACCOUNTING (Currently ~15%) ⚠️ CRITICAL
Files in original plan:
- backend/src/accounting/ledgerService.ts
- backend/src/accounting/commissionService.ts
- backend/src/accounting/agentSettlementService.ts
Files NOT in original plan:
- backend/src/accounting/availableService.ts — available credit (AC) computation
- backend/src/accounting/carryoverService.ts — deferred settlement (carryover) handling
- backend/src/accounting/marginRevenueService.ts — margin profit tracking per order
- backend/src/accounting/reportService.ts — settlement report generation
- backend/src/accounting/takeService.ts — event-driven take balance updates (T = AC(self) + Σ AC(downline) - CL)
- backend/src/accounting/transferSettleService.ts — settlement reconciliation between upline/downline
- backend/src/accounting/settlementPeriodService.ts — period lifecycle (open→grace→finalized)
- backend/src/accounting/calculations.ts — pure calc functions (round, commission, profitShare, booking)
- backend/src/accounting/exportService.ts — Excel/CSV export for settlement reports
MISSING — must add:
- Every ledger entry: accountType, entryType, amount, balanceAfter, transactionRef
- Commission rate lookup (which config matched, rate applied)
- Commission calculation steps: grossWinnings × rate = commission
- Per-agent settlement: totalStakes, totalPayouts, bookingPercent, netSettlement
- Available credit computation: userId, creditLimit, downlineAllocation, exposure, resultAC
- Carryover requests: downlineId, amount, accepted/rejected, grace period
- Margin revenue per order: orderId, placedOdds, displayOdds, marginProfit, slippageType
- Take balance updates: userId, oldTake, newTake, trigger (which event caused update)
- Transfer settlement: uplineId→downlineId, obligation, adjustedBalance
- Settlement period transitions: periodId, oldState→newState, triggeredBy
- Report generation: reportType (platform/agent/player), periodId, generationLatencyMs
- Export operations: format (Excel/CSV), rowCount, fileSize, downloadedBy
- Decimal rounding decisions: input, output, roundingMode, precision
- Constraint violation context
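The take formula referenced for takeService.ts (T = AC(self) + Σ AC(downline) - CL) is plain arithmetic; a hedged sketch of the computation and the update-log fields the plan asks for (all names here are illustrative, not the actual service API):

```typescript
// Illustrative only — mirrors the plan's take formula: T = AC(self) + Σ AC(downline) - CL.
export function computeTake(acSelf: number, acDownlines: number[], creditLimit: number): number {
  const downlineSum = acDownlines.reduce((sum, ac) => sum + ac, 0);
  return acSelf + downlineSum - creditLimit;
}

// Fields to log on each event-driven update (names assumed).
export function takeUpdateLog(userId: string, oldTake: number, newTake: number, trigger: string) {
  return { msg: "take balance updated", userId, oldTake, newTake, delta: newTake - oldTake, trigger };
}
```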
AREA 11: AUTHENTICATION & PERMISSIONS (Currently ~30%)
Files in original plan:
- backend/src/routes/auth.ts
- backend/src/middleware/auth.ts
File NOT in original plan:
backend/src/middleware/agentPermissions.ts — permission flag checking (T/B/D/W)
MISSING — must add:
- Every login attempt (wallet address, success/failure, failure reason)
- Token verification failure reason (expired, invalid, revoked)
- Session cache hit vs DB lookup path
- User status check result value
- Admin role grant/denial
- Token expiry timing (how long token was valid)
- New user registration events
- Referral code validation results
- Permission checks: userId, requiredFlag (Transfer/Bet/Deposit/Withdraw), granted/denied
AREA 12: BACKGROUND JOBS (Currently ~20%) ⚠️ CRITICAL
Files: backend/src/jobs/orderSyncJob.ts, agentSettlementJob.ts, pinnacleSettlementJob.ts, reportEmailJob.ts
Currently logged: Job startup, job stop, "no orders to sync" (debug).
MISSING — must add:
- Job execution timing (cycle start/end/duration)
- Batch size per cycle
- Each order status change detected (orderId, oldStatus → newStatus)
- Retry logic (which orders retried, why)
- Settlement check results (how many settled, which orders)
- Error accumulation (fail count/types per cycle)
- Agent settlement calculations per agent
- Pinnacle settlement data received/processed
- Report generation contents, recipients, send status, SMTP response
AREA 13: CRICKET SERVICES (Currently ~20%) — EXPANDED
File in original plan:
backend/src/services/cricket/ballByBallService.ts
Files NOT in original plan:
- backend/src/services/cricket/cricketAnalysisService.ts — AI cricket analysis (head-to-head, venue, player form, toss, pitch)
- backend/src/services/cricket/cricketIdResolver.ts — Betfair→Roanuz fuzzy ID matching
- backend/src/services/cricket/milestoneAlertService.ts — batting/bowling milestone detection (50/100/150/200, 3/5-for, partnerships, powerplay, RRR)
- backend/src/services/cricket/sessionBettingService.ts — phase analysis (Test/ODI/T20), projected scores vs historical
- backend/src/services/cricket/tossAnalysisService.ts — toss results, pitch conditions, venue history, AI recommendations
MISSING — must add:
- Ball-by-ball: ball#, runs, outcome, wicket details, team score, run rate
- Analysis generation: matchId, analysisType, dataSource (Roanuz), latencyMs, cacheHit
- ID resolution: betfairEventId, resolvedRoanuzMatchId, matchScore, method (exact/fuzzy), cacheHit
- Milestones detected: matchId, milestoneType, player, value, timestamp
- Session analysis: matchId, format, currentPhase, projectedScore, historicalAvg
- Toss analysis: matchId, tossWinner, decision (bat/field), pitchCondition, venueHistory
AREA 14: NOTIFICATION & PUSH (Currently ~10%) — NOT IN ORIGINAL PLAN
Files:
- backend/src/services/notificationService.ts — creates notifications, Redis pub/sub, push to mobile
- backend/src/services/pushService.ts — web push (VAPID) to mobile/browser
MISSING — must add:
- Notification creation: userId, type, title, channel (in-app/push/both)
- Redis pub/sub publish: channel, payloadSize, subscriberCount
- Push notification send: userId, deviceCount, success/failure per device
- VAPID authentication: key rotation, errors
- Push delivery failures: endpoint, statusCode, reason (expired subscription, etc.)
AREA 15: TELEGRAM BOT (Currently ~0%) — NOT IN ORIGINAL PLAN
Files:
- backend/src/services/telegram/bot.ts — main bot (GramMyBot), commands, AI text processing
- backend/src/services/telegram/linkService.ts — Telegram↔Hannibal account linking
- backend/src/services/telegram/rateLimit.ts — rate limiting
- backend/src/services/telegram/uiRenderer.ts — keyboard/message rendering
MISSING — must add:
- Command handling: chatId, command (/start, /help, /balance, /bets), userId, latencyMs
- AI text processing: chatId, messageText (truncated), responseLatencyMs, toolCalls
- Account linking: telegramUserId, hannibalUserId, linked/unlinked, success/failure
- Rate limit hits: chatId, command, blocked (yes/no)
- Callback queries: chatId, callbackData, action taken
- Session management: chatId, sessionCreated/expired, duration
AREA 16: ODDS TRANSFORMATION & MARGINS (Currently ~5%) — NOT IN ORIGINAL PLAN
Files:
- backend/src/services/oddsTransformer.ts — raw OddsPAPI→frontend format, margin adjustment
- backend/src/services/marginService.ts — platform margin applied to Betfair odds (revenue source)
- backend/src/services/limitsService.ts — stake limits per fixture/outcome
MISSING — must add:
- Odds transformation: fixtureId, marketCount, outcomeCount, transformLatencyMs
- Margin application: fixtureId, originalOdds, marginRate, adjustedOdds, revenueImpact
- Limits enrichment: fixtureId, platformMin, depthData, computedLimits
AREA 17: AGENT HIERARCHY & USER AGENTS (Currently ~10%) — NOT IN ORIGINAL PLAN
Files:
- backend/src/services/agentHierarchyService.ts — recursive CTE traversal, parent-child
- backend/src/services/agentMonitoringService.ts — agent alerting via Redis/OddsPAPI
- backend/src/services/userAgentService.ts — autonomous agents (value_hunter, steam_tracker, arbitrage)
- backend/src/services/watchlistService.ts — fixture/team watchlists with AI alerts
MISSING — must add:
- Hierarchy traversal: rootAgentId, depth, nodeCount, latencyMs
- Agent alerts: agentId, alertType (odds change, match finished), fixtureId, triggered
- Autonomous agent execution: agentId, agentType, evaluationCycle, actionsGenerated
- Watchlist alerts: watchlistId, userId, alertCondition, matched (yes/no)
AREA 18: ROUTES — API ENDPOINTS (Currently ~30%) — NOT IN ORIGINAL PLAN
Files not covered:
- backend/src/routes/ai.ts — AI features (analysis, watchlists, agents, chat)
- backend/src/routes/orders.ts — order placement/retrieval, schema validation
- backend/src/routes/referrals.ts — referral code generation
- backend/src/routes/settlements.ts — settlement periods, processing, ledger verification
- backend/src/routes/users.ts — user profile, balance
- backend/src/routes/health.ts — health checks (/health, /health/ready)
MISSING — must add:
- Request/response logging for each route group (method, path, statusCode, latencyMs, userId)
- Note: most of this will be covered by the HTTP request middleware (Step 1c), but route-specific business logic logging is still needed:
- Order validation failures with specific field/reason
- Settlement period transitions triggered via API
- AI analysis requests with fixtureId and cache status
- Health check detailed status per service
AREA 19: ERROR HANDLING MIDDLEWARE — NOT IN ORIGINAL PLAN
File: backend/src/middleware/errorHandler.ts
Custom error classes: ValidationError, AuthenticationError, AuthorizationError, NotFoundError, ConflictError, ExternalServiceError
MISSING — must add:
- Error categorization: operational vs programming bug
- Error class distribution logging (which error types most frequent)
- Stack trace persistence with request context
- User impact flag (blocks request vs non-fatal)
- External service errors: which service, statusCode, retryable
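The operational-vs-programming split can be sketched from the custom error classes listed above (the class list is from the plan; the mapping approach is an assumption, not the actual errorHandler.ts):

```typescript
// Sketch — classifies errors by class name; the real middleware may use instanceof instead.
const OPERATIONAL = new Set([
  "ValidationError",
  "AuthenticationError",
  "AuthorizationError",
  "NotFoundError",
  "ConflictError",
  "ExternalServiceError",
]);

export function categorizeError(err: Error): "operational" | "programming" {
  // Known, expected error classes are operational; anything else is treated as a bug.
  return OPERATIONAL.has(err.constructor.name) ? "operational" : "programming";
}
```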
AREA 20: REDIS OPERATIONS (Currently ~25%)
File: backend/src/services/redis.ts
Currently logged: Connection state changes only.
MISSING — must add:
- Cache hits/misses (debug level, controlled by LOG_LEVEL)
- Cache errors with key name context
- Lock acquisition failures (key, contention duration)
- Lock release verification (success or stolen lock)
- Pub/sub channel health
- Slow Redis operations
AREA 21: DATABASE OPERATIONS (Currently ~50%)
File: backend/src/services/database.ts
Currently logged: Connection success, slow queries in dev (>100ms).
MISSING — must add:
- Slow query logging in production (>200ms threshold)
- Structured error logging for Prisma errors with query context
- Transaction rollback logging
- Connection pool metrics (if available)
- Constraint violation logging with context
AREA 22: EXCHANGE MAPPERS & ID RESOLUTION — NOT IN ORIGINAL PLAN
Files:
- backend/src/exchanges/core/ProviderRegistry.ts — provider instance management
- backend/src/exchanges/core/FuzzyMatcher.ts — fuzzy string matching for ID resolution
- backend/src/exchanges/adapters/betfair/BetfairCache.ts — in-memory market data cache
- Various mapper files (BetfairMapper, BifrostMapper, OddsPapiMapper, PinnacleMapper)
MISSING — must add:
- Provider registration/deregistration events
- Fuzzy match: inputString, candidates, bestMatch, score, threshold
- Betfair cache: size, evictions, hit rate (periodic summary, not per-access)
- Mapper transform errors: provider, entityType, fieldName, rawValue, error
AREA 23: FRONTEND — CLIENT-SIDE LOGGING (Currently ~5%) ⚠️ CRITICAL — NOT IN ORIGINAL PLAN
Current state: 33 files with scattered console.log/error/warn. All logs are browser-only (lost when tab closes). No global error handler. No structured logging. No backend error sink.
23a. Error Handling (Currently: ErrorBoundary only)
Files:
- frontend/src/components/ui/ErrorBoundary.tsx — catches React render errors, dev-only console.error
- frontend/src/app/layout.tsx — wraps app in ErrorBoundary
What slips through ErrorBoundary (NOT caught):
- Async errors (promises, setTimeout)
- Event handler errors (onClick, onChange)
- WebSocket event errors
- React Query failures
- Axios interceptor errors
MISSING — must add:
- Global window.addEventListener('unhandledrejection') handler
- Global window.addEventListener('error') handler
- All unhandled errors captured and sent to the backend
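A sketch of what globalErrorHandler.ts could look like (the report sink, event shapes, and return value are assumptions; the real file would wire window.addEventListener as shown in the comments, since window does not exist outside the browser):

```typescript
// Sketch of globalErrorHandler.ts — handler wiring assumed; report() is the backend sink.
export function installGlobalHandlers(report: (event: Record<string, unknown>) => void) {
  const onError = (ev: { message?: string; filename?: string; lineno?: number }) =>
    report({ level: "error", msg: ev.message ?? "window.onerror", file: ev.filename, line: ev.lineno, source: "global-error" });

  const onRejection = (ev: { reason?: unknown }) =>
    report({ level: "error", msg: String(ev.reason), source: "unhandledrejection" });

  // In the browser:
  // window.addEventListener("error", onError);
  // window.addEventListener("unhandledrejection", onRejection);
  return { onError, onRejection }; // returned here only so the handlers are testable
}
```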
23b. API Error Logging (Currently: toast only)
Files:
- frontend/src/lib/api.ts — Axios interceptor handles 401 refresh, warns on missing token
- 47+ files with useMutation/useQuery — most use onError: (err) => toast.error(msg), nothing else
MISSING — must add:
- Axios response interceptor: log every error response (url, method, status, latencyMs, errorCode)
- Axios request interceptor: capture request start time for latency calculation
- React Query global onError handler at the QueryClient level
- Failed mutation details: mutationKey, variables (sanitized), error
23c. WebSocket Logging (Currently: console.log)
File: frontend/src/hooks/useWebSocket.ts — 30 lines of console.log
Currently logged (to browser console only):
- Connection creation, connect ID, disconnect reason, connect_error
- Balance/settlement/notification/order status updates received
- Cricket ball events
MISSING — must add:
- Reconnection attempts (count, delay, transport fallback)
- Connection duration on disconnect
- Message delivery gaps (missed balls, out-of-order events)
- Subscription failures
23d. Performance Monitoring (Currently: none)
Only tracked: Auth latency and AI response latency (sent to Mixpanel, not logged).
MISSING — must add:
- Core Web Vitals (LCP, FID, CLS, INP, TTFB)
- Page navigation timing
- API call latency distribution
- Long task detection (>50ms)
23e. Authentication Flow (Currently: console.error)
Files:
- frontend/src/hooks/useAuth.ts
- frontend/src/components/providers/Web3AuthProvider.tsx (18 lines)
- frontend/src/components/auth/LoginModal.tsx
MISSING — must add:
- Login attempt: provider (metamask/phantom/email), success/failure, latencyMs
- Web3Auth init: success/failure, latencyMs, retryCount
- Token refresh: success/failure, reason
- Session expiry: duration, trigger
23f. AI & Cricket UI (Currently: console.error on failure)
Files: 4 AI component files, 6 cricket component files
MISSING — must add:
- AI analysis request: fixtureId, type, latencyMs, cacheHit, error
- Cricket subscription: matchId, subscribed/unsubscribed, ballsReceived, gaps
- Deep analysis: fixtureId, latencyMs, success/error
Frontend Logging Architecture
Frontend cannot write to Loki directly (would expose Loki to the internet). Instead:
Browser
│
│ 1. frontendLogger captures structured events
│ 2. Batches in memory (max 50 events or 10s flush)
│ 3. POST /api/logs/client (fire-and-forget, no retry on 4xx)
│
▼
Backend: POST /api/logs/client
│
│ 1. Validates payload (max 100 events per request, size limit)
│ 2. Rate-limited (100 req/min per IP)
│ 3. Logs each event via Winston with source: "frontend"
│
▼
Winston (stdout JSON) → Promtail → Loki → S3
Every frontend log line becomes:
{
"ts": "2026-02-26T10:30:00.123Z",
"level": "error",
"msg": "API call failed",
"service": "hannibal-frontend",
"source": "client",
"sessionId": "sess_abc123",
"userId": "u_456",
"url": "/api/orders",
"method": "POST",
"statusCode": 500,
"latencyMs": 1234,
"errorCode": "INTERNAL_SERVER_ERROR",
"userAgent": "Mozilla/5.0...",
"page": "/sports/cricket"
}
New frontend files:
- frontend/src/lib/frontendLogger.ts — structured logger with batching + backend sink
- frontend/src/lib/globalErrorHandler.ts — window error + unhandledrejection handlers
- frontend/src/lib/performanceMonitor.ts — Web Vitals + API latency tracking
- frontend/src/hooks/useErrorReporting.ts — React hook for component error context
New backend files:
backend/src/routes/logs.ts — POST /api/logs/client endpoint (validated, rate-limited)
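The endpoint's payload guard could be factored out roughly like this (the 100-event cap is from the plan; the byte limit, names, and return shape are assumptions):

```typescript
// Sketch of the validation step for POST /api/logs/client — limits partly assumed.
export const MAX_EVENTS = 100;            // from the plan: max 100 events per request
export const MAX_BODY_BYTES = 256 * 1024; // size limit value assumed

export function validateClientLogBatch(
  body: unknown,
): { ok: true; events: unknown[] } | { ok: false; reason: string } {
  if (!body || typeof body !== "object" || !Array.isArray((body as { events?: unknown }).events))
    return { ok: false, reason: "events array required" };
  const events = (body as { events: unknown[] }).events;
  if (events.length === 0) return { ok: false, reason: "empty batch" };
  if (events.length > MAX_EVENTS) return { ok: false, reason: `max ${MAX_EVENTS} events per request` };
  if (JSON.stringify(events).length > MAX_BODY_BYTES) return { ok: false, reason: "payload too large" };
  return { ok: true, events };
}
```

The route handler would then log each accepted event via Winston with source: "frontend" and rely on separate middleware for the per-IP rate limit.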
Modified frontend files:
- frontend/src/lib/api.ts — Axios interceptors for request/response logging
- frontend/src/hooks/useWebSocket.ts — structured event logging
- frontend/src/lib/analytics/index.ts — wire frontendLogger alongside Mixpanel
- frontend/src/components/ui/ErrorBoundary.tsx — send errors to backend
- frontend/src/app/layout.tsx — init globalErrorHandler + performanceMonitor
- frontend/src/components/providers/Web3AuthProvider.tsx — auth flow logging
- frontend/src/hooks/useAuth.ts — login/logout event logging
What stays in Mixpanel (user behavior analytics):
- 56 existing user events (bet_placed, fixture_viewed, etc.)
- These are product analytics, not operational logs — keep them separate
What goes to Loki/S3 via backend sink (operational logs):
- Errors (unhandled, API failures, WebSocket errors)
- Performance (Web Vitals, API latency, long tasks)
- Auth events (login/logout attempts, token refresh)
- WebSocket health (connect/disconnect, reconnection, message gaps)
- Critical UI events (ErrorBoundary catches, blank screens)
Implementation Steps
Step 1: Foundation — Winston + Correlation IDs
1a. Switch Winston to Structured JSON Output
Modify: backend/src/utils/logger.ts
Current: plain text with colorized console, printf format, circular reference handling, child logger support.
Target: structured JSON with service metadata, dual-format (JSON prod / colorized dev).
Every log line becomes:
{"ts":"2026-02-26T10:30:00.123Z","level":"info","msg":"Order placed","service":"hannibal-backend","env":"production","requestId":"abc-123","userId":"u_456","orderId":"ord_789","latencyMs":45}
Changes:
- Replace printf format with winston.format.json() for non-dev environments
- Add defaultMeta: { service: 'hannibal-backend', env: process.env.NODE_ENV }
- Keep colorized console format for local dev via NODE_ENV === 'development' check
- Add winston.format.errors({ stack: true }) for full stack traces
- Preserve existing safe metadata serialization (circular reference handling already built in)
1b. Add AsyncLocalStorage for Correlation IDs
New file: backend/src/utils/requestContext.ts
Uses Node.js built-in AsyncLocalStorage to automatically inject requestId, userId, traceId into every log line without passing logger around.
Note: Partial infrastructure already exists:
- errorHandler.ts already reads the x-request-id header
- CORS already allows the X-Request-ID header
- logger.child({ requestId }) capability exists but isn't used consistently
Modify: backend/src/utils/logger.ts
- Add a format that reads from AsyncLocalStorage and injects context fields into every log line
New middleware: backend/src/middleware/requestContext.ts
- Wraps every request in AsyncLocalStorage.run() with { requestId, userId, traceId }
- Generates a UUID if no x-request-id header is present
- Sets the x-request-id response header for tracing
- Extracts userId from the authenticated request (after auth middleware runs)
Modify: backend/src/index.ts
- Add requestContextMiddleware before all route handlers (after CORS, before auth)
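The plumbing for Step 1b can be sketched with Node's built-in AsyncLocalStorage (the middleware signature uses minimal stand-in types instead of Express, and withContext illustrates what the Winston format hook would do):

```typescript
// Sketch of requestContext.ts — Express types replaced by minimal shapes for illustration.
import { AsyncLocalStorage } from "node:async_hooks";
import { randomUUID } from "node:crypto";

interface RequestContext { requestId: string; userId?: string; traceId?: string; }
interface MinimalReq { headers: Record<string, string | undefined>; }
interface MinimalRes { setHeader(name: string, value: string): void; }

export const requestContext = new AsyncLocalStorage<RequestContext>();

// Middleware: every request runs inside its own context store.
export function requestContextMiddleware(req: MinimalReq, res: MinimalRes, next: () => void) {
  const requestId = req.headers["x-request-id"] ?? randomUUID(); // reuse caller's ID or mint one
  res.setHeader("x-request-id", requestId);                      // echo for cross-service tracing
  requestContext.run({ requestId }, next);
}

// What the Winston format hook does conceptually: merge the active context into each entry.
export function withContext(fields: Record<string, unknown>): Record<string, unknown> {
  return { ...requestContext.getStore(), ...fields };
}
```

Because the store follows the async call chain, any service code that logs during the request picks up the same requestId with no parameter passing.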
Impact: All 106 files that import logger automatically get correlation IDs in every log line. Zero changes needed in those files.
1c. HTTP Request/Response Middleware
New middleware addition in backend/src/index.ts:
- Log every HTTP request: method, path, statusCode, latencyMs, userId, requestSize, responseSize
- Replaces/enhances existing Morgan "combined" format with structured JSON
- Excludes health check endpoints from verbose logging (or logs at debug level)
1d. Provider API Call Logging Wrapper
New file: backend/src/utils/httpLogger.ts
- Generic wrapper for external HTTP calls
- Captures: provider, method, url, latencyMs, statusCode, success, error, requestSize, responseSize
- Usage: await withHttpLogging('betfair', 'placeOrder', () => axios.post(...))
- Can be applied as an axios interceptor or a manual wrapper
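A sketch of the wrapper shape (the call signature matches the usage line above; the log call and field names are assumptions, and the real version would route through Winston rather than console.log):

```typescript
// Sketch of httpLogger.ts — logs latency and outcome, never swallows the error.
export async function withHttpLogging<T>(
  provider: string,
  operation: string,
  call: () => Promise<T>,
): Promise<T> {
  const start = Date.now();
  try {
    const result = await call();
    console.log(JSON.stringify({ provider, operation, latencyMs: Date.now() - start, success: true }));
    return result;
  } catch (err) {
    console.log(JSON.stringify({
      provider, operation, latencyMs: Date.now() - start, success: false,
      error: err instanceof Error ? err.message : String(err),
    }));
    throw err; // preserve caller semantics — log, then rethrow
  }
}
```

Rethrowing keeps retry and failover logic in the clients unchanged: the wrapper only observes.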
Step 2: P0 — Critical Business Logic Gaps (~25 files)
2a. Provider Clients
Modify (wrap all API methods with httpLogger):
- backend/src/exchanges/adapters/betfair/BetfairClient.ts — listMarketCatalogue, placeOrders, cancelOrders, listCurrentOrders, listClearedOrders, getAccountFunds, login, session refresh
- backend/src/exchanges/adapters/pinnacle/PinnacleClient.ts — all methods + token bucket state
- backend/src/exchanges/adapters/bifrost/BifrostClient.ts — all methods
- backend/src/exchanges/adapters/roanuz/RoanuzDataAdapter.ts — all methods
- backend/src/services/betsApi.ts — BetsAPI calls for odds fetching
2b. Exchange Coordination & Failover
Modify:
- backend/src/exchanges/core/coordinator/ExchangeCoordinator.ts — routing decisions, provider selection
- backend/src/exchanges/core/coordinator/FailoverManager.ts — circuit breaker state transitions, health checks
- backend/src/exchanges/core/coordinator/RoutingStrategy.ts — strategy evaluation, provider ranking
- backend/src/exchanges/core/coordinator/IDRegistry.ts — ID lookups, cache hits/misses
2c. Bifrost Pipeline
Modify:
- backend/src/exchanges/adapters/bifrost/BifrostQueueManager.ts — message consumption, protobuf decode, processing latency
- backend/src/exchanges/adapters/bifrost/BifrostRecoveryManager.ts — recovery cycles, items recovered
- backend/src/exchanges/adapters/bifrost/BifrostBetConsumer.ts — bet status updates, settlement from queue
2d. B-Book System
Modify:
- backend/src/services/bbook/bbookConfigService.ts — config changes
- backend/src/services/bbook/bbookFillService.ts — fill decisions, position creation, capacity checks
- backend/src/services/bbook/bbookSettlementService.ts — position settlement, P&L
- backend/src/services/bbook/bbookStateService.ts — exposure levels, Redis/PG sync
- backend/src/services/bbook/filterEngine.ts — routing decisions, limit checks
2e. Order + Settlement
Modify:
- backend/src/services/orderService.ts — validation, balance, exposure, slippage details
- backend/src/services/settlement.ts — outcome determination, commission, void handling
- backend/src/jobs/orderSyncJob.ts — cycle timing, batch size, status changes
- backend/src/jobs/agentSettlementJob.ts — per-agent calculations
- backend/src/jobs/pinnacleSettlementJob.ts — settlement data received/processed
- backend/src/jobs/reportEmailJob.ts — report generation, SMTP
2f. Financial Transactions & Accounting
Modify:
- backend/src/accounting/ledgerService.ts — every ledger entry with full context
- backend/src/accounting/commissionService.ts — rate lookup, calculation steps
- backend/src/accounting/agentSettlementService.ts — per-agent calculations
- backend/src/accounting/availableService.ts — AC computation steps
- backend/src/accounting/takeService.ts — take balance event-driven updates
- backend/src/accounting/marginRevenueService.ts — margin profit per order
- backend/src/accounting/carryoverService.ts — deferred settlement requests
- backend/src/accounting/transferSettleService.ts — upline/downline reconciliation
- backend/src/accounting/settlementPeriodService.ts — period lifecycle transitions
- backend/src/accounting/reportService.ts — report generation
- backend/src/accounting/exportService.ts — Excel/CSV export
Step 3: P1 — High-Priority Operational Gaps (~20 files)
3a. Monitoring Agents (all files)
Modify:
- backend/src/services/monitoring/baseMonitoringService.ts — evaluation cycles, alert decisions
- backend/src/services/monitoring/playerMonitoringService.ts — bet event processing, condition results
- backend/src/services/monitoring/cricketMonitoringService.ts — cricket-specific conditions
- backend/src/services/monitoring/soccerMonitoringService.ts — soccer-specific conditions
- backend/src/services/monitoring/conditionCompiler.ts — compilation results/errors
- backend/src/services/monitoring/conditionEvaluator.ts — evaluation trace per condition
- backend/src/services/monitoring/metricResolver.ts — metric resolution source/latency
- backend/src/services/monitoring/monitoringCrudService.ts — CRUD operations
- backend/src/services/monitoring/presets.ts — preset loading
3b. Authentication & Permissions
Modify:
- backend/src/routes/auth.ts — login attempts, registration, referral validation
- backend/src/middleware/auth.ts — token verification, cache vs DB path, user status
- backend/src/middleware/agentPermissions.ts — permission flag checks (T/B/D/W)
3c. Redis Operations
Modify:
- backend/src/services/redis.ts — cache hits/misses (debug), lock contention, slow ops
3d. Database Operations
Modify:
- backend/src/services/database.ts — slow queries in prod (>200ms), transaction errors, constraint violations
3e. Error Handler Enhancement
Modify:
- backend/src/middleware/errorHandler.ts — error categorization, class distribution, external service context
Step 4: P2 — Comprehensive Coverage (~30 files)
4a. AI Interactions
Modify:
- backend/src/ai/core/agentService.ts — API calls, tool executions, tokens, latency
- backend/src/services/fixtureAnalysisService.ts — analysis generation, caching
4b. WebSocket Events
Modify:
- backend/src/services/clientWs.ts — connection duration, broadcast counts, subscription failures
- backend/src/services/oddsPapiWs.ts — message rates, extraction failures, ping/pong
- backend/src/exchanges/adapters/roanuz/RoanuzWebSocketManager.ts — cricket stream health
4c. Cricket Services
Modify:
- backend/src/services/cricket/ballByBallService.ts — ball events, wickets, milestones
- backend/src/services/cricket/cricketAnalysisService.ts — AI analysis, data sources
- backend/src/services/cricket/cricketIdResolver.ts — ID resolution, fuzzy matching
- backend/src/services/cricket/milestoneAlertService.ts — milestone detection
- backend/src/services/cricket/sessionBettingService.ts — phase analysis
- backend/src/services/cricket/tossAnalysisService.ts — toss/pitch analysis
4d. Notification & Push
Modify:
- backend/src/services/notificationService.ts — creation, pub/sub, delivery
- backend/src/services/pushService.ts — VAPID auth, delivery success/failure
4e. Telegram Bot
Modify:
- backend/src/services/telegram/bot.ts — commands, AI processing, sessions
- backend/src/services/telegram/linkService.ts — account linking
- backend/src/services/telegram/rateLimit.ts — rate limit events
4f. Odds & Margins
Modify:
- backend/src/services/oddsTransformer.ts — transformation latency
- backend/src/services/marginService.ts — margin application per fixture
- backend/src/services/limitsService.ts — stake limits computation
4g. Agent Hierarchy & User Agents
Modify:
- backend/src/services/agentHierarchyService.ts — tree traversal
- backend/src/services/agentMonitoringService.ts — agent alerts
- backend/src/services/userAgentService.ts — autonomous agent execution
- backend/src/services/watchlistService.ts — watchlist alerts
4h. Rate Limiting
Modify:
- backend/src/middleware/rateLimiter.ts — limit hits, repeat offenders
4i. Exchange Internals
Modify:
- backend/src/exchanges/core/ProviderRegistry.ts — registration events
- backend/src/exchanges/core/FuzzyMatcher.ts — match scores
- backend/src/exchanges/adapters/betfair/BetfairCache.ts — cache size, hit rate (periodic)
4j. Routes (business logic in API handlers)
Modify:
- backend/src/routes/orders.ts — validation failures
- backend/src/routes/settlements.ts — period transitions
- backend/src/routes/ai.ts — analysis requests
- backend/src/routes/referrals.ts — code generation
- backend/src/routes/users.ts — profile operations
Step 5: Frontend Logging Pipeline (~15 files)
5a. Frontend Logger + Backend Sink
New file: frontend/src/lib/frontendLogger.ts
- Structured JSON logger mirroring backend format
- In-memory buffer: max 50 events or 10-second flush interval
- Sends batched events via POST /api/logs/client
- Fire-and-forget — never blocks UI, drops on failure (no retry on 4xx)
- Includes session context: sessionId, userId, page, userAgent
New file: backend/src/routes/logs.ts
- POST /api/logs/client — accepts batched frontend log events
- Validates: max 100 events per request, max 10KB per event, required fields
- Rate-limited: 100 requests/min per IP (prevents abuse)
- Logs each event via Winston with service: "hannibal-frontend", source: "client"
- Flows through the same pipeline: Winston → stdout → Promtail → Loki → S3
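The validation rules above reduce to a pure function that the route handler can call before anything reaches Winston; validateClientLogBatch and its return shape are assumptions:

```typescript
// Validation core for POST /api/logs/client (a sketch; names are assumptions).
interface IncomingEvent { level?: string; message?: string; [k: string]: unknown }

type ValidationResult =
  | { ok: true; events: IncomingEvent[] }
  | { ok: false; reason: string };

export function validateClientLogBatch(body: unknown): ValidationResult {
  if (!Array.isArray(body)) return { ok: false, reason: "body must be an array" };
  if (body.length === 0 || body.length > 100) return { ok: false, reason: "1-100 events per request" };
  for (const ev of body) {
    if (typeof ev !== "object" || ev === null) return { ok: false, reason: "event must be an object" };
    const e = ev as IncomingEvent;
    if (typeof e.level !== "string" || typeof e.message !== "string")
      return { ok: false, reason: "level and message are required" };
    // Serialized size approximates the 10KB per-event cap.
    if (JSON.stringify(ev).length > 10_000) return { ok: false, reason: "event exceeds 10KB" };
  }
  return { ok: true, events: body as IncomingEvent[] };
}
```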
Modify: backend/src/index.ts
- Mount the /api/logs route (before auth middleware — client errors may come from unauthenticated users)
5b. Global Error Handlers
New file: frontend/src/lib/globalErrorHandler.ts
- window.addEventListener('error', ...) — catches uncaught JS errors
- window.addEventListener('unhandledrejection', ...) — catches unhandled promise rejections
- Captures: errorMessage, stack, componentStack, page, userAgent
- Sends to frontendLogger (which batches and sends to backend)
- Deduplicates: same error message within 5 seconds = 1 log entry with count
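The 5-second dedup window can be sketched as a small class with an injectable clock (useful for testing); ErrorDeduper and its method names are assumptions:

```typescript
// Dedup window for globalErrorHandler (a sketch; names are assumptions).
export class ErrorDeduper {
  private seen = new Map<string, { count: number; firstAt: number }>();
  constructor(private windowMs = 5000, private now: () => number = Date.now) {}

  // true => log this error now; duplicates inside the window only bump the count.
  shouldLog(message: string): boolean {
    const t = this.now();
    const entry = this.seen.get(message);
    if (entry && t - entry.firstAt < this.windowMs) {
      entry.count += 1;
      return false;
    }
    this.seen.set(message, { count: 1, firstAt: t });
    return true;
  }

  // How many duplicates were suppressed (attached to the single emitted entry).
  suppressedCount(message: string): number {
    return (this.seen.get(message)?.count ?? 1) - 1;
  }
}
```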
Modify: frontend/src/app/layout.tsx
- Import and initialize globalErrorHandler on mount
- Initialize performanceMonitor on mount
Modify: frontend/src/components/ui/ErrorBoundary.tsx
- Send caught errors to frontendLogger (not just console.error in dev)
- Include componentStack for debugging
5c. API Call Logging
Modify: frontend/src/lib/api.ts
- Request interceptor: record start timestamp
- Response interceptor: log every failed request (status >= 400): url, method, statusCode, latencyMs, errorCode
- Log token refresh attempts: success/failure, latencyMs
- Log auth:logout events triggered by failed refresh
5d. React Query Global Error Handler
Modify: frontend/src/lib/providers.tsx (or wherever QueryClient is created)
- Add a QueryClient default onError callback
- Logs all failed queries: queryKey, error message, status code
- This catches the 47+ files that have no onError handler
5e. WebSocket Logging
Modify: frontend/src/hooks/useWebSocket.ts
- Replace console.log with frontendLogger.info/error
- Add: reconnection attempt count, transport type, connection duration
- Add: message gap detection (cricket ball sequence)
- Keep existing connection/disconnect logging but structured
5f. Performance Monitoring
New file: frontend/src/lib/performanceMonitor.ts
- Core Web Vitals via the web-vitals library (LCP, FID, CLS, INP, TTFB)
- Reports once per page load via frontendLogger
- Optional: long task detection via PerformanceObserver (the 'longtask' entry type)
5g. Auth Flow Logging
Modify: frontend/src/hooks/useAuth.ts
- Log login attempt: provider, success/failure, latencyMs
- Log logout: trigger (manual/session expired/refresh failed)
Modify: frontend/src/components/providers/Web3AuthProvider.tsx
- Log Web3Auth init: success/failure, latencyMs, retryCount
- Log wallet connection: chain, address (first 6 chars), success/failure
Step 6: AI Chat Service (Python)
Modify: services/ai-chat-assistant/app/
- Configure Python logging to output structured JSON to stdout (matching backend format)
- Add correlation ID propagation from the x-request-id request header
- Log every LLM provider call: provider (Grok/Groq/Ollama/OpenAI/Anthropic), model, tokens, latency, success/error
- Log provider fallback chain: which provider tried, failed, next selected
- Log conversation history operations (Redis reads/writes)
- Log streaming SSE events: chunkCount, totalLatencyMs
Promtail already collects Docker container stdout — logs flow to Loki/S3 automatically.
Step 7: Loki Upgrade + S3 Backend
7a. Upgrade Loki 2.9.2 → 3.x
CRITICAL PREREQUISITE: upgrade Loki first. Loki 2.9.2 predates stable support for the schema v13 + TSDB + structured-metadata combination this plan depends on (Steps 7b and 9).
Modify: infra/docker/docker-compose.monitoring.yml
- Change grafana/loki:2.9.2 → grafana/loki:3.3.2 (or latest 3.x stable)
- Add AWS credentials env vars:
environment:
- AWS_ACCESS_KEY_ID=${LOKI_AWS_ACCESS_KEY_ID}
- AWS_SECRET_ACCESS_KEY=${LOKI_AWS_SECRET_ACCESS_KEY}
- AWS_REGION=${AWS_REGION:-ap-south-1}
Migration notes:
- Schema v11 (boltdb-shipper) data remains readable by Loki 3.x
- New data writes use schema v13 (TSDB) with S3
- Old data on filesystem will not be migrated (acceptable — current data is ephemeral on container restart anyway)
- Test on staging before production rollout
7b. Reconfigure Loki for S3
Modify: infra/monitoring/loki/loki-config.yml
Add new schema config entry (keep old for backward compat):
schema_config:
configs:
# Legacy — reads old data if any
- from: 2020-10-24
store: boltdb-shipper
object_store: filesystem
schema: v11
index:
prefix: index_
period: 24h
# New — all new writes go to S3
- from: 2026-03-01
store: tsdb
object_store: s3
schema: v13
index:
prefix: loki_index_
period: 24h
storage_config:
  aws:
    bucketnames: hannibal-logs
    region: ${AWS_REGION}
  tsdb_shipper:
    active_index_directory: /loki/tsdb-index
    cache_location: /loki/tsdb-cache
  # Keep filesystem for legacy boltdb-shipper reads
  boltdb_shipper:
    active_index_directory: /loki/boltdb-shipper-active
  filesystem:
    directory: /loki/chunks
compactor:
  working_directory: /loki/compactor
  delete_request_store: s3
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
Note: Loki 3.x removed the shared_store setting. Each period's object store now comes from its schema_config entry, and the compactor requires delete_request_store when retention_enabled is true.
Step 8: S3 Bucket + Lifecycle Policies
New file: infra/scripts/setup-s3-logging.sh
Creates S3 bucket with:
- Versioning disabled (logs are immutable)
- Server-side encryption (SSE-S3)
- Lifecycle policy:
- Standard: 0-30 days ($0.023/GB)
- Standard-IA: 30-90 days ($0.0125/GB)
- Glacier Instant: 90-365 days ($0.004/GB)
- Deep Archive: 365+ days ($0.00099/GB)
- Expiration: 7 years (compliance)
- Bucket policy restricting access to Loki service role
- Block public access enabled
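The tiering above maps to a single S3 lifecycle configuration, roughly what setup-s3-logging.sh would pass to aws s3api put-bucket-lifecycle-configuration (a sketch; 7 years ≈ 2555 days):

```json
{
  "Rules": [
    {
      "ID": "hannibal-logs-tiering",
      "Status": "Enabled",
      "Filter": {},
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER_IR" },
        { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
      ],
      "Expiration": { "Days": 2555 }
    }
  ]
}
```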
Estimated cost (10GB/day logs after compression):
- ~$50-70/year total storage
Step 9: Enhanced Promtail Label Extraction
Modify: infra/monitoring/promtail/promtail-config.yml
Current pipeline already has JSON extraction and log level detection. Enhance:
- Extract the service field as a Loki label (enables per-service queries)
- Extract requestId, userId as structured metadata (Loki 3.x supports this without label cardinality explosion)
- Keep existing container/compose label extraction
- Add orderId, provider, fixtureId as structured metadata
- Enables queries like:
{service="hannibal-backend"} | json | provider="betfair" | latencyMs > 1000
Warning: Do NOT add high-cardinality fields (requestId, orderId, userId) as Loki labels — this will destroy performance. Use Loki's structured metadata or | json filter syntax instead.
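A sketch of the pipeline-stage additions, assuming the app's JSON fields are named as in this plan (the structured_metadata stage requires Promtail 2.9+ with schema v13 on the Loki side):

```yaml
pipeline_stages:
  - json:
      expressions:
        service: service
        level: level
        requestId: requestId
        orderId: orderId
        provider: provider
  - labels:                 # low-cardinality only — safe as Loki labels
      service:
      level:
  - structured_metadata:    # high-cardinality — queryable, but not indexed as labels
      requestId:
      orderId:
      provider:
```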
Step 10: Grafana Dashboards
New dashboards in infra/monitoring/grafana/dashboards/:
- logging-overview.json — log volume by service/level, error rate trending, top error messages
- provider-health.json — API latency per provider (from structured logs), error rates, availability
- order-audit.json — order lifecycle flow (validated → routed → submitted → settled), timing
- financial-audit.json — balance changes, commissions, settlements, take balance trends
- bbook-dashboard.json — B-Book fills, exposure levels, routing decisions, settlement P&L
- exchange-coordination.json — circuit breaker states, failover events, routing distribution
- frontend-health.json — client errors, Web Vitals, API latency from client perspective, WebSocket health
(Existing system-overview.json dashboard remains unchanged)
Future: Athena for Historical Queries
When log volume justifies:
- AWS Athena table pointing to S3 bucket for SQL on logs >31 days old
- Grafana Athena datasource plugin for historical dashboards
- Nightly JSONL→Parquet ETL for 10x cheaper Athena queries
Complete File Change Summary
New Files — Backend (4)
- backend/src/utils/requestContext.ts — AsyncLocalStorage singleton
- backend/src/middleware/requestContext.ts — request context middleware
- backend/src/utils/httpLogger.ts — provider API call wrapper
- backend/src/routes/logs.ts — POST /api/logs/client (frontend error sink)
New Files — Frontend (4)
- frontend/src/lib/frontendLogger.ts — structured logger with batching + backend sink
- frontend/src/lib/globalErrorHandler.ts — window error + unhandledrejection handlers
- frontend/src/lib/performanceMonitor.ts — Web Vitals + API latency
- frontend/src/hooks/useErrorReporting.ts — React hook for component error context
New Files — Infra (1)
- infra/scripts/setup-s3-logging.sh — S3 bucket setup
Modified Files — Infrastructure (5)
- backend/src/utils/logger.ts — JSON format, AsyncLocalStorage integration
- backend/src/index.ts — requestContext middleware, HTTP timing, /api/logs route
- infra/monitoring/loki/loki-config.yml — S3 backend, schema v13
- infra/docker/docker-compose.monitoring.yml — Loki 3.x, AWS creds
- infra/monitoring/promtail/promtail-config.yml — enhanced label extraction
Modified Files — Frontend (~10 files)
- frontend/src/lib/api.ts — Axios interceptors for request/response logging
- frontend/src/hooks/useWebSocket.ts — structured event logging via frontendLogger
- frontend/src/lib/analytics/index.ts — wire frontendLogger alongside Mixpanel
- frontend/src/components/ui/ErrorBoundary.tsx — send errors to backend
- frontend/src/app/layout.tsx — init globalErrorHandler + performanceMonitor
- frontend/src/components/providers/Web3AuthProvider.tsx — auth flow logging
- frontend/src/hooks/useAuth.ts — login/logout event logging
- frontend/src/lib/providers.tsx — QueryClient global onError handler
Modified Files — P0 Critical Backend (~25 files)
- Provider clients: BetfairClient, PinnacleClient, BifrostClient, RoanuzDataAdapter, betsApi
- Exchange coordination: ExchangeCoordinator, FailoverManager, RoutingStrategy, IDRegistry
- Bifrost pipeline: BifrostQueueManager, BifrostRecoveryManager, BifrostBetConsumer
- B-Book: bbookConfigService, bbookFillService, bbookSettlementService, bbookStateService, filterEngine
- Orders: orderService, settlement
- Jobs: orderSyncJob, agentSettlementJob, pinnacleSettlementJob, reportEmailJob
- Accounting: ledgerService, commissionService, agentSettlementService, availableService, takeService, marginRevenueService, carryoverService, transferSettleService, settlementPeriodService, reportService, exportService
Modified Files — P1 High Priority Backend (~15 files)
- Monitoring: baseMonitoringService, playerMonitoringService, cricketMonitoringService, soccerMonitoringService, conditionCompiler, conditionEvaluator, metricResolver, monitoringCrudService
- Auth: auth route, auth middleware, agentPermissions
- Infrastructure: redis, database, errorHandler
Modified Files — P2 Comprehensive Backend (~35 files)
- AI: agentService, fixtureAnalysisService
- WebSocket: clientWs, oddsPapiWs, RoanuzWebSocketManager
- Cricket: ballByBallService, cricketAnalysisService, cricketIdResolver, milestoneAlertService, sessionBettingService, tossAnalysisService
- Notifications: notificationService, pushService
- Telegram: bot, linkService, rateLimit
- Odds: oddsTransformer, marginService, limitsService
- Agents: agentHierarchyService, agentMonitoringService, userAgentService, watchlistService
- Exchange: ProviderRegistry, FuzzyMatcher, BetfairCache
- Routes: orders, settlements, ai, referrals, users
- Middleware: rateLimiter
- Python: ai-chat-assistant (5+ files)
New Grafana Dashboards (7)
- logging-overview.json, provider-health.json, order-audit.json
- financial-audit.json, bbook-dashboard.json, exchange-coordination.json
- frontend-health.json
Total: ~95+ files modified, 9 new files, 7 new dashboards
Performance Impact
LATENCY IMPACT: ~0ms (Winston writes to stdout, Promtail reads async)
CPU IMPACT: Minimal (JSON.stringify per log line, ~0.01ms)
MEMORY IMPACT: None (no in-app buffering, Promtail handles collection)
FAILURE MODE: If Loki/S3 is down, Promtail queues locally and retries — app unaffected
Cross-System Impact
- backend: Logger format change, new requestContext middleware, logging additions in ~80 files, /api/logs/client endpoint
- frontend: frontendLogger + global error handler + performance monitor + API/WS logging (~10 modified, 4 new files)
- services/ai-chat: Structured JSON logging, correlation ID propagation
- infra: Loki 3.x upgrade, S3 backend config, S3 bucket + lifecycle, Promtail enhancement, 7 new dashboards
Key Best Practices Applied
- Structured JSON logging — machine-parseable, Promtail/Loki/Athena compatible
- AsyncLocalStorage correlation IDs — every log line traceable to a request, zero code changes in existing 106 files
- External pipeline (Promtail→Loki→S3) — decoupled from app, no backpressure risk
- Structured metadata (not labels) for high-cardinality fields — prevents Loki performance degradation
- gzip compression — ~80% storage reduction
- Lifecycle policies — automatic cost optimization (Standard→IA→Glacier→Deep Archive)
- Dev/prod format split — colorized console in dev, JSON in prod
- Provider call wrapping — single wrapper captures all external API metrics uniformly
- Log level discipline — error (unrecoverable), warn (degraded), info (business events), debug (operational detail)
- Dual schema config — old boltdb-shipper for legacy reads, new TSDB+S3 for writes (zero-downtime migration)