Hannibal Production Readiness Guide
Version: 1.0
Last Updated: January 28, 2026
Status: Planning
This document outlines all the requirements, configurations, and best practices needed to make Hannibal a production-ready, rock-solid betting platform that can run autonomously for years with minimal human intervention.
Table of Contents
- Executive Summary
- Current State Assessment
- Monitoring & Observability
- Alerting Strategy
- Backup & Disaster Recovery
- Security
- Infrastructure & Resource Management
- Performance Optimization
- Deployment & CI/CD
- Analytics & Business Intelligence
- High Availability & Reliability
- Operational Runbooks
- Implementation Roadmap
- Appendix: Configuration Examples
1. Executive Summary
1.1 Goals
Hannibal must be:
- Reliable: 99.9% uptime (< 8.76 hours downtime/year)
- Resilient: Auto-recover from failures without human intervention
- Observable: Full visibility into system health and business metrics
- Secure: Protected against attacks and data breaches
- Recoverable: Ability to restore from any disaster within defined RTO/RPO
- Scalable: Handle 10x current load without architecture changes
1.2 Key Metrics Targets
| Metric | Target |
|---|---|
| Uptime | 99.9% |
| API Response Time (p95) | < 500ms |
| Order Placement Latency | < 1 second |
| Settlement Time | < 5 minutes after match end |
| Recovery Time Objective (RTO) | < 1 hour |
| Recovery Point Objective (RPO) | < 1 hour |
| Mean Time to Detection (MTTD) | < 2 minutes |
| Mean Time to Recovery (MTTR) | < 30 minutes |
1.3 Technology Stack Overview
┌─────────────────────────────────────────────────────────────────┐
│ MONITORING LAYER │
│ Prometheus │ Grafana │ Loki │ Uptime Kuma │ Jaeger │
└─────────────────────────────────────────────────────────────────┘
│
┌─────────────────────────────────────────────────────────────────┐
│ APPLICATION LAYER │
│ Frontend (Next.js) │ Backend (Node.js/Express) │ Jobs │
└─────────────────────────────────────────────────────────────────┘
│
┌─────────────────────────────────────────────────────────────────┐
│ DATA LAYER │
│ PostgreSQL │ Redis │ Betfair API │ Roanuz API │
└─────────────────────────────────────────────────────────────────┘
│
┌─────────────────────────────────────────────────────────────────┐
│ INFRASTRUCTURE LAYER │
│ Docker │ Nginx │ Certbot │ UFW │ Systemd │
└─────────────────────────────────────────────────────────────────┘
2. Current State Assessment
2.1 What We Have
| Component | Status | Notes |
|---|---|---|
| Docker Compose deployment | ✅ Working | Single host |
| PostgreSQL database | ✅ Working | No backups configured |
| Redis cache | ✅ Working | No memory limits |
| Nginx reverse proxy | ✅ Working | Basic config |
| SSL certificates | ✅ Working | Certbot auto-renewal |
| Backend API | ✅ Working | No metrics exposed |
| Frontend | ✅ Working | No analytics |
| Betfair integration | ✅ Working | Session management needs monitoring |
| Settlement jobs | ✅ Working | No failure alerts |
2.2 Critical Gaps
| Gap | Risk | Priority |
|---|---|---|
| No database backups | Data loss | 🔴 Critical |
| No monitoring | Blind to issues | 🔴 Critical |
| No alerting | Delayed response | 🔴 Critical |
| No log rotation | Disk exhaustion | 🔴 Critical |
| No resource limits | OOM crashes | 🟡 High |
| No uptime monitoring | Unknown downtime | 🟡 High |
| No security hardening | Vulnerabilities | 🟡 High |
| No CI/CD pipeline | Manual deployments | 🟢 Medium |
| No analytics | No business insights | 🟢 Medium |
3. Monitoring & Observability
3.1 Metrics Collection (Prometheus)
Prometheus will be the central metrics collection system, scraping metrics from all components.
3.1.1 Infrastructure Metrics (Node Exporter)
Collect from every host:
- CPU: Usage, load average, steal time
- Memory: Used, available, cached, swap
- Disk: Usage, I/O, latency
- Network: Bandwidth, packets, errors
- File descriptors: Open files, limits
3.1.2 Database Metrics (postgres_exporter)
Key Metrics:
- pg_stat_activity_count # Active connections
- pg_stat_database_tup_fetched # Rows read
- pg_stat_database_tup_inserted # Rows inserted
- pg_stat_database_xact_commit # Transactions committed
- pg_stat_database_deadlocks # Deadlock count
- pg_replication_lag # Replication lag (if replica)
- pg_database_size_bytes # Database size
- pg_stat_user_tables_n_dead_tup # Dead tuples (vacuum needed)
3.1.3 Redis Metrics (redis_exporter)
Key Metrics:
- redis_memory_used_bytes # Memory consumption
- redis_memory_max_bytes # Memory limit
- redis_connected_clients # Client connections
- redis_commands_processed_total # Command throughput
- redis_keyspace_hits_total # Cache hits
- redis_keyspace_misses_total # Cache misses
- redis_evicted_keys_total # Evicted keys (memory pressure)
- redis_rejected_connections_total # Rejected connections
3.1.4 Application Metrics (Custom)
Backend should expose metrics at /metrics endpoint:
// Key application metrics to implement
const metrics = {
// HTTP metrics
http_requests_total: Counter, // Total requests by method, path, status
http_request_duration_seconds: Histogram, // Request latency
// Business metrics
orders_placed_total: Counter, // Orders by type, venue
orders_settled_total: Counter, // Settlements by outcome
settlement_duration_seconds: Histogram, // Time to settle
bbook_exposure_dollars: Gauge, // Current B-Book liability
// External API metrics
betfair_api_calls_total: Counter, // Betfair API calls
betfair_api_errors_total: Counter, // Betfair API errors
betfair_api_latency_seconds: Histogram, // Betfair API latency
betfair_session_status: Gauge, // 1 = active, 0 = expired
// Job metrics
job_runs_total: Counter, // Job executions
job_duration_seconds: Histogram, // Job duration
job_errors_total: Counter, // Job failures
};
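In the Node.js backend these metrics would normally be registered with a library such as prom-client. As a dependency-free illustration of what the /metrics endpoint ultimately has to emit, here is a minimal counter and the Prometheus text exposition format (the class and variable names are illustrative, not existing Hannibal code):

```typescript
// Minimal sketch of a labeled counter and its Prometheus text exposition.
type Labels = Record<string, string | number>;

class Counter {
  name: string;
  help: string;
  private values = new Map<string, number>();

  constructor(name: string, help: string) {
    this.name = name;
    this.help = help;
  }

  inc(labels: Labels = {}, by = 1): void {
    const key = JSON.stringify(labels);
    this.values.set(key, (this.values.get(key) ?? 0) + by);
  }

  // Render in Prometheus text format, one sample line per label set
  expose(): string {
    const lines = [`# HELP ${this.name} ${this.help}`, `# TYPE ${this.name} counter`];
    for (const [key, value] of this.values) {
      const labelPairs = Object.entries(JSON.parse(key) as Labels)
        .map(([k, v]) => `${k}="${v}"`)
        .join(',');
      lines.push(labelPairs ? `${this.name}{${labelPairs}} ${value}` : `${this.name} ${value}`);
    }
    return lines.join('\n');
  }
}

const httpRequestsTotal = new Counter('http_requests_total', 'Total HTTP requests');
httpRequestsTotal.inc({ method: 'GET', path: '/api/fixtures', status: 200 });
httpRequestsTotal.inc({ method: 'GET', path: '/api/fixtures', status: 200 });
```

Prometheus scrapes this plain-text output; each counter increment per label combination becomes one sample line.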
3.1.5 Container Metrics (cAdvisor)
Key Metrics:
- container_cpu_usage_seconds_total
- container_memory_usage_bytes
- container_memory_working_set_bytes
- container_network_receive_bytes_total
- container_network_transmit_bytes_total
- container_fs_usage_bytes
3.2 Dashboards (Grafana)
3.2.1 System Overview Dashboard
┌─────────────────────────────────────────────────────────────────┐
│ SYSTEM OVERVIEW │
├─────────────┬─────────────┬─────────────┬─────────────┬─────────┤
│ CPU % │ Memory % │ Disk % │ Network │ Uptime │
│ 23% │ 67% │ 45% │ 12 Mb/s │ 45 days │
├─────────────┴─────────────┴─────────────┴─────────────┴─────────┤
│ CPU Usage Over Time │
│ ▁▂▃▄▅▆▇█▇▆▅▄▃▂▁▂▃▄▅▆▇█▇▆▅▄▃▂▁▂▃▄▅▆▇█▇▆▅▄▃▂▁ │
├─────────────────────────────────────────────────────────────────┤
│ Memory Usage Over Time │
│ ████████████████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░ │
├─────────────────────────────────────────────────────────────────┤
│ Container Status: backend ✅ | frontend ✅ | postgres ✅ | redis ✅ │
└─────────────────────────────────────────────────────────────────┘
3.2.2 Application Health Dashboard
┌─────────────────────────────────────────────────────────────────┐
│ APPLICATION HEALTH │
├─────────────┬─────────────┬─────────────┬─────────────┬─────────┤
│ Requests/s │ Error Rate │ p50 Lat │ p95 Lat │ p99 Lat │
│ 125 │ 0.02% │ 45ms │ 180ms │ 450ms │
├─────────────────────────────────────────────────────────────────┤
│ Request Rate by Endpoint │
│ /api/fixtures ████████████████████ 45/s │
│ /api/orders ████████ 18/s │
│ /api/odds ██████████████ 32/s │
├─────────────────────────────────────────────────────────────────┤
│ Error Rate Over Time │
│ ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ │
└─────────────────────────────────────────────────────────────────┘
3.2.3 Database Dashboard
┌─────────────────────────────────────────────────────────────────┐
│ DATABASE HEALTH │
├─────────────┬─────────────┬─────────────┬─────────────┬─────────┤
│ Connections │ Query Time │ Size │ Dead Tuples │ TPS │
│ 12/100 │ 15ms avg │ 2.3 GB │ 1,234 │ 85 │
├─────────────────────────────────────────────────────────────────┤
│ Table Sizes │
│ orders ████████████████████ 1.2 GB │
│ users ████ 120 MB │
│ fixtures ██████████ 450 MB │
├─────────────────────────────────────────────────────────────────┤
│ Slow Queries (> 100ms) │
│ SELECT * FROM orders WHERE... │ 234ms │ 15 calls/min │
└─────────────────────────────────────────────────────────────────┘
3.2.4 Business Metrics Dashboard
┌─────────────────────────────────────────────────────────────────┐
│ BUSINESS METRICS │
├─────────────┬─────────────┬─────────────┬─────────────┬─────────┤
│ Orders Today│ GGR Today │ Active Users│ B-Book Exp │ Settled │
│ 1,234 │ $5,678 │ 89 │ $12,345 │ 98.5% │
├─────────────────────────────────────────────────────────────────┤
│ Orders Over Time (24h) │
│ ▁▂▃▄▅▆▇█▇▆▅▄▃▂▁▂▃▄▅▆▇█▇▆▅▄▃▂▁▂▃▄▅▆▇█▇▆▅▄▃▂▁ │
├─────────────────────────────────────────────────────────────────┤
│ Settlement Status │
│ Pending: 23 │ Settled (Win): 456 │ Settled (Lose): 678 │
└─────────────────────────────────────────────────────────────────┘
3.3 Log Aggregation (Loki + Promtail)
3.3.1 Log Format Standard
All services should output JSON logs:
{
"timestamp": "2026-01-28T19:45:23.456Z",
"level": "info",
"service": "backend",
"traceId": "abc123",
"userId": "user_456",
"message": "Order placed successfully",
"orderId": "order_789",
"duration": 145
}
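A minimal logger producing that shape can be a few lines; in production the backend would more likely use pino or winston, but the output contract is the same (one JSON object per line so Promtail can parse each entry):

```typescript
// Minimal structured logger matching the JSON log format above.
type LogLevel = 'debug' | 'info' | 'warn' | 'error';

function formatLog(level: LogLevel, message: string, fields: Record<string, unknown> = {}): string {
  // One JSON object per line; extra fields are spread in at the top level
  return JSON.stringify({
    timestamp: new Date().toISOString(),
    level,
    service: 'backend',
    message,
    ...fields,
  });
}

console.log(formatLog('info', 'Order placed successfully', {
  traceId: 'abc123',
  orderId: 'order_789',
  duration: 145,
}));
```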
3.3.2 Log Retention Policy
| Log Type | Hot Storage | Cold Storage | Total Retention |
|---|---|---|---|
| Application logs | 7 days | 90 days | 90 days |
| Access logs | 7 days | 30 days | 30 days |
| Error logs | 30 days | 1 year | 1 year |
| Audit logs | 30 days | 7 years | 7 years |
| Security logs | 30 days | 1 year | 1 year |
3.3.3 Docker Log Rotation
// /etc/docker/daemon.json
{
"log-driver": "json-file",
"log-opts": {
"max-size": "100m",
"max-file": "5",
"compress": "true"
}
}
3.4 Distributed Tracing (Jaeger)
Implement OpenTelemetry tracing to track requests across services:
User Request
│
▼
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ Nginx │───▶│ Frontend│───▶│ Backend │───▶│ Postgres│
│ 2ms │ │ 15ms │ │ 45ms │ │ 12ms │
└─────────┘ └─────────┘ └─────────┘ └─────────┘
│
▼
┌─────────┐
│ Betfair │
│ 120ms │
└─────────┘
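The essence of the trace tree above is spans sharing one trace ID and linking to their parents. In practice the backend would use the OpenTelemetry SDK and export to Jaeger; this dependency-free sketch only shows the data model, with illustrative names:

```typescript
// What a span carries, stripped to essentials.
interface Span {
  traceId: string;       // shared by every hop of one user request
  spanId: string;
  parentSpanId?: string; // links e.g. the Backend span under the Frontend span
  name: string;
  startMs: number;
  endMs?: number;
}

class Tracer {
  spans: Span[] = [];
  private nextId = 1;

  startSpan(name: string, traceId: string, parentSpanId?: string): Span {
    const span: Span = {
      traceId,
      spanId: `span-${this.nextId++}`,
      parentSpanId,
      name,
      startMs: Date.now(),
    };
    this.spans.push(span);
    return span;
  }

  endSpan(span: Span): void {
    span.endMs = Date.now();
  }
}

// One request flowing nginx -> backend -> postgres
const tracer = new Tracer();
const root = tracer.startSpan('nginx', 'trace-1');
const backend = tracer.startSpan('backend', 'trace-1', root.spanId);
const db = tracer.startSpan('postgres', 'trace-1', backend.spanId);
tracer.endSpan(db);
tracer.endSpan(backend);
tracer.endSpan(root);
```

OpenTelemetry propagates the trace/parent IDs across service boundaries via HTTP headers (W3C `traceparent`), which is what lets Jaeger stitch the tree back together.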
3.5 Uptime Monitoring (Uptime Kuma)
External monitoring from outside the infrastructure:
| Check | URL | Interval | Timeout |
|---|---|---|---|
| Frontend | https://hannibaldev.forsyt.io | 60s | 10s |
| API Health | https://hannibaldev.forsyt.io/api/health | 60s | 10s |
| API Fixtures | https://hannibaldev.forsyt.io/api/fixtures/live | 60s | 15s |
| WebSocket | wss://hannibaldev.forsyt.io/ws | 60s | 10s |
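For these HTTP checks to be meaningful, /api/health should aggregate real dependency probes and return 503 when anything is down, so Uptime Kuma fails loudly instead of reporting a green page over a dead database. A sketch of that aggregation logic (probe names and response shape are assumptions, not the existing handler):

```typescript
// Aggregate dependency probes into a single health response.
interface DependencyCheck {
  name: string;      // e.g. 'postgres', 'redis', 'betfair-session'
  healthy: boolean;
  latencyMs: number;
}

function healthResponse(checks: DependencyCheck[]): { status: number; body: Record<string, unknown> } {
  const allHealthy = checks.every((c) => c.healthy);
  return {
    status: allHealthy ? 200 : 503, // 503 makes the uptime check fail
    body: { status: allHealthy ? 'ok' : 'degraded', checks },
  };
}
```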
4. Alerting Strategy
4.1 Alert Channels
| Channel | Use Case | Response Time |
|---|---|---|
| PagerDuty/OpsGenie | Critical alerts requiring immediate action | < 5 min |
| Slack #alerts-critical | Critical alerts for team visibility | < 5 min |
| Slack #alerts-warning | Warning alerts for awareness | < 1 hour |
| Email | Daily reports, non-urgent notifications | < 24 hours |
| SMS (backup) | Critical alerts if PagerDuty fails | < 5 min |
4.2 Alert Severity Levels
| Severity | Description | Response | Examples |
|---|---|---|---|
| P1 - Critical | Service down, data loss risk | Page immediately, 24/7 | Database down, disk full |
| P2 - High | Degraded service, potential impact | Page during business hours | High error rate, slow response |
| P3 - Medium | Warning, needs attention | Slack notification | High memory, certificate expiring |
| P4 - Low | Informational | Email/daily report | Deployment complete |
4.3 Critical Alerts (P1) - Page Immediately
| Alert Name | Condition | Duration | Action |
|---|---|---|---|
| ServiceDown | Container not running | 2 min | Auto-restart, then page |
| DatabaseDown | PostgreSQL not responding | 1 min | Page immediately |
| DiskCritical | Disk usage > 95% | 0 min | Page immediately |
| MemoryCritical | Memory usage > 95% | 5 min | Page immediately |
| HighErrorRate | 5xx errors > 10% | 5 min | Page immediately |
| BetfairSessionLost | Session expired, reconnect failed | 5 min | Page immediately |
| SettlementJobFailed | Job not run in 10 min | 10 min | Page immediately |
| BackupFailed | No successful backup in 25 hours | 0 min | Page immediately |
| SSLCertExpired | Certificate expired | 0 min | Page immediately |
4.4 High Alerts (P2) - Page During Business Hours
| Alert Name | Condition | Duration | Action |
|---|---|---|---|
| HighCPU | CPU > 90% | 15 min | Investigate, scale if needed |
| HighMemory | Memory > 90% | 10 min | Investigate, restart if needed |
| DiskHigh | Disk usage > 85% | 0 min | Clean up, expand disk |
| SlowQueries | Query time > 5s | 5 min | Optimize query |
| HighLatency | p95 latency > 2s | 10 min | Investigate bottleneck |
| ConnectionPoolExhausted | DB connections > 90% | 5 min | Increase pool, investigate |
| RedisMemoryHigh | Redis memory > 90% | 5 min | Check eviction, increase limit |
4.5 Warning Alerts (P3) - Slack Notification
| Alert Name | Condition | Duration |
|---|---|---|
| DiskWarning | Disk usage > 75% | 0 min |
| MemoryWarning | Memory > 80% | 10 min |
| CPUWarning | CPU > 80% | 15 min |
| CertificateExpiring | SSL cert expires in < 14 days | 0 min |
| HighDeadTuples | Dead tuples > 100,000 | 0 min |
| ReplicationLag | Lag > 30 seconds | 5 min |
| BetfairAPIErrors | Error rate > 5% | 5 min |
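The CertificateExpiring threshold is simple date arithmetic on the certificate's notAfter timestamp; a minimal sketch (the function name is illustrative):

```typescript
// Days remaining until a certificate's notAfter date.
function daysUntilExpiry(notAfter: Date, now: Date): number {
  const MS_PER_DAY = 86_400_000;
  return Math.floor((notAfter.getTime() - now.getTime()) / MS_PER_DAY);
}

// Cert expiring 2026-02-05, checked on 2026-01-28: 8 days left, under the
// 14-day threshold, so the warning fires.
const shouldWarn =
  daysUntilExpiry(new Date('2026-02-05T00:00:00Z'), new Date('2026-01-28T00:00:00Z')) < 14;
```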
4.6 Alert Rules (Prometheus)
# prometheus/alerts/critical.yml
groups:
- name: critical
rules:
- alert: ServiceDown
expr: up{job=~"backend|frontend|postgres|redis"} == 0
for: 2m
labels:
severity: critical
annotations:
summary: "Service {{ $labels.job }} is down"
description: "{{ $labels.job }} has been down for more than 2 minutes"
- alert: DiskCritical
expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 5
for: 0m
labels:
severity: critical
annotations:
summary: "Disk almost full on {{ $labels.instance }}"
description: "Disk usage is above 95% on {{ $labels.mountpoint }}"
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) > 0.1
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate detected"
description: "Error rate is above 10% for the last 5 minutes"
- alert: DatabaseDown
expr: pg_up == 0
for: 1m
labels:
severity: critical
annotations:
summary: "PostgreSQL is down"
description: "PostgreSQL database is not responding"
5. Backup & Disaster Recovery
5.1 Backup Strategy Overview
┌─────────────────────────────────────────────────────────────────┐
│ BACKUP ARCHITECTURE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ PostgreSQL ──┬── pg_dump (daily) ──────────▶ S3 (30 days) │
│ │ │
│ └── WAL Archive (continuous) ──▶ S3 (7 days) │
│ │
│ Redis ───────── RDB Snapshot (hourly) ─────▶ Local + S3 │
│ │
│ Config ──────── Git + Encrypted Secrets ───▶ GitHub + Vault │
│ │
│ Logs ────────── Loki ──────────────────────▶ S3 (90 days) │
│ │
└─────────────────────────────────────────────────────────────────┘
5.2 Database Backup Configuration
5.2.1 Daily Full Backup (pg_dump)
#!/bin/bash
# /opt/hannibal/scripts/backup-database.sh
set -e
BACKUP_DIR="/var/backups/postgres"
S3_BUCKET="s3://hannibal-backups/postgres"
RETENTION_DAYS=30
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="hannibal_${TIMESTAMP}.sql.gz"
# Create backup
docker exec hannibal-postgres pg_dump -U hannibal -d hannibal | gzip > "${BACKUP_DIR}/${BACKUP_FILE}"
# Upload to S3
aws s3 cp "${BACKUP_DIR}/${BACKUP_FILE}" "${S3_BUCKET}/${BACKUP_FILE}"
# Verify upload
aws s3 ls "${S3_BUCKET}/${BACKUP_FILE}" || exit 1
# Clean old local backups
find ${BACKUP_DIR} -name "hannibal_*.sql.gz" -mtime +7 -delete
# Clean old S3 backups (lifecycle policy handles this, but belt and suspenders)
aws s3 ls ${S3_BUCKET}/ | while read -r line; do
createDate=$(echo $line | awk '{print $1" "$2}')
createDate=$(date -d "$createDate" +%s)
olderThan=$(date -d "-${RETENTION_DAYS} days" +%s)
if [[ $createDate -lt $olderThan ]]; then
fileName=$(echo $line | awk '{print $4}')
aws s3 rm "${S3_BUCKET}/${fileName}"
fi
done
echo "Backup completed: ${BACKUP_FILE}"
5.2.2 Continuous WAL Archiving
# postgresql.conf additions
wal_level = replica
archive_mode = on
archive_command = 'aws s3 cp %p s3://hannibal-backups/wal/%f'
archive_timeout = 300 # Force a WAL segment switch (and archive) at least every 5 minutes
5.2.3 Point-in-Time Recovery (PITR)
Note: WAL replay requires a physical base backup (e.g. taken with pg_basebackup); the logical pg_dump backups above cannot serve as the PITR starting point. On PostgreSQL 12+, recovery is configured with restore_command in postgresql.conf plus an empty recovery.signal file (recovery.conf is obsolete).
#!/bin/bash
# Restore to a specific point in time
# /opt/hannibal/scripts/restore-pitr.sh
TARGET_TIME="2026-01-28 15:30:00"
BASE_BACKUP="basebackup_20260128_030000.tar.gz"  # physical backup, not pg_dump
# Stop services
docker compose down
# Download the base backup
aws s3 cp "s3://hannibal-backups/postgres/${BASE_BACKUP}" /tmp/
# Unpack it into the (empty) PostgreSQL data directory, then configure recovery:
#   restore_command = 'aws s3 cp s3://hannibal-backups/wal/%f %p'
#   recovery_target_time = '2026-01-28 15:30:00'
# Create recovery.signal and start PostgreSQL; it replays WAL from S3 up to
# the target time.
5.3 Backup Schedule
| Backup Type | Frequency | Time | Retention | Storage |
|---|---|---|---|---|
| PostgreSQL full | Daily | 03:00 UTC | 30 days | S3 |
| PostgreSQL WAL | Continuous | - | 7 days | S3 |
| Redis RDB | Hourly | :00 | 24 hours | Local + S3 |
| Configuration | On change | - | Forever | Git |
| Docker volumes | Daily | 04:00 UTC | 7 days | S3 |
5.4 Backup Verification
#!/bin/bash
# /opt/hannibal/scripts/verify-backup.sh
# Run weekly to verify backups are restorable
# Get latest backup
LATEST_BACKUP=$(aws s3 ls s3://hannibal-backups/postgres/ | sort | tail -n 1 | awk '{print $4}')
# Download and restore to test database
aws s3 cp "s3://hannibal-backups/postgres/${LATEST_BACKUP}" /tmp/
docker exec -i hannibal-postgres createdb -U hannibal hannibal_test || true
gunzip -c /tmp/${LATEST_BACKUP} | docker exec -i hannibal-postgres psql -U hannibal -d hannibal_test
# Verify data integrity
ORDERS_COUNT=$(docker exec hannibal-postgres psql -U hannibal -d hannibal_test -t -c "SELECT COUNT(*) FROM orders")
USERS_COUNT=$(docker exec hannibal-postgres psql -U hannibal -d hannibal_test -t -c "SELECT COUNT(*) FROM users")
# Clean up
docker exec hannibal-postgres dropdb -U hannibal hannibal_test
# Report
echo "Backup verification complete"
echo "Orders: ${ORDERS_COUNT}"
echo "Users: ${USERS_COUNT}"
5.5 Disaster Recovery Procedures
5.5.1 Recovery Time Objectives
| Scenario | RTO | RPO | Procedure |
|---|---|---|---|
| Container crash | 1 min | 0 | Auto-restart |
| Host reboot | 5 min | 0 | Systemd auto-start |
| Database corruption | 1 hour | 1 hour | PITR restore |
| Host failure | 2 hours | 1 hour | New host + restore |
| Data center failure | 4 hours | 1 hour | DR site activation |
| Ransomware/breach | 4 hours | 1 hour | Clean restore from backup |
5.5.2 Recovery Runbook
Database Recovery Procedure
Prerequisites
- Access to AWS S3
- Access to new/clean PostgreSQL instance
- Latest backup file name
Steps
1. Stop application services
docker compose stop backend frontend
2. Download latest backup
aws s3 cp s3://hannibal-backups/postgres/BACKUP_FILE.sql.gz /tmp/
3. Restore database
# Drop and recreate database
docker exec hannibal-postgres dropdb -U hannibal hannibal
docker exec hannibal-postgres createdb -U hannibal hannibal
# Restore from backup
gunzip -c /tmp/BACKUP_FILE.sql.gz | docker exec -i hannibal-postgres psql -U hannibal -d hannibal
4. Verify restoration
docker exec hannibal-postgres psql -U hannibal -d hannibal -c "SELECT COUNT(*) FROM orders"
5. Start application services
docker compose up -d backend frontend
6. Verify application health
curl https://hannibaldev.forsyt.io/api/health
6. Security
6.1 Network Security
6.1.1 Firewall Configuration (UFW)
# Default policies
ufw default deny incoming
ufw default allow outgoing
# Allow SSH (restricted to known IPs)
ufw allow from 203.0.113.0/24 to any port 22
# Allow HTTP/HTTPS
ufw allow 80/tcp
ufw allow 443/tcp
# Allow internal Docker network
ufw allow from 172.16.0.0/12
# Enable firewall
ufw enable
6.1.2 Cloudflare Configuration
| Setting | Value | Purpose |
|---|---|---|
| SSL Mode | Full (Strict) | End-to-end encryption |
| Always Use HTTPS | On | Force HTTPS |
| Minimum TLS Version | 1.2 | Security |
| WAF | On | Block attacks |
| Rate Limiting | 100 req/min per IP | DDoS protection |
| Bot Fight Mode | On | Block bad bots |
| Browser Integrity Check | On | Block suspicious requests |
6.1.3 Rate Limiting (Nginx)
# /etc/nginx/conf.d/rate-limiting.conf
# Define rate limit zones
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;
limit_req_zone $binary_remote_addr zone=login:10m rate=5r/m;
limit_req_zone $binary_remote_addr zone=orders:10m rate=30r/m;
# Apply to locations
location /api/ {
limit_req zone=api burst=20 nodelay;
proxy_pass http://backend:3001;
}
location /api/auth/login {
limit_req zone=login burst=5 nodelay;
proxy_pass http://backend:3001;
}
location /api/orders {
limit_req zone=orders burst=10 nodelay;
proxy_pass http://backend:3001;
}
6.2 Application Security
6.2.1 Security Headers (Nginx)
# Security headers
add_header X-Frame-Options "SAMEORIGIN" always;
add_header X-Content-Type-Options "nosniff" always;
add_header X-XSS-Protection "1; mode=block" always;
add_header Referrer-Policy "strict-origin-when-cross-origin" always;
add_header Content-Security-Policy "default-src 'self'; script-src 'self' 'unsafe-inline' 'unsafe-eval'; style-src 'self' 'unsafe-inline'; img-src 'self' data: https:; font-src 'self' data:; connect-src 'self' wss: https:;" always;
add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
add_header Permissions-Policy "geolocation=(), microphone=(), camera=()" always;
6.2.2 Input Validation
// All API inputs must be validated
import { z } from 'zod';
const placeOrderSchema = z.object({
fixtureId: z.string().min(1).max(50),
marketId: z.string().regex(/^1\.\d+$/), // Betfair market ID format
outcomeId: z.string().min(1).max(100),
stake: z.number().positive().max(10000),
odds: z.number().min(1.01).max(1000),
betType: z.enum(['back', 'lay']),
});
// Sanitize all string inputs
function sanitize(input: string): string {
return input.replace(/[<>\"'&]/g, '');
}
6.2.3 Authentication Security
| Measure | Implementation |
|---|---|
| Password Hashing | bcrypt with cost factor 12 |
| JWT Expiry | Access token: 15 min, Refresh token: 7 days |
| Session Management | Redis-backed sessions |
| Failed Login Lockout | 5 attempts, 15 min lockout |
| Password Requirements | Min 8 chars, 1 upper, 1 lower, 1 number |
| 2FA | TOTP (optional, recommended for admins) |
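The failed-login lockout policy above (5 attempts, 15-minute lockout) can be sketched in a few lines. This version uses an in-memory map for illustration; production would back it with Redis per the session-management row, and the function names are assumptions, not existing code:

```typescript
// Failed-login lockout: 5 attempts triggers a 15-minute lock.
const MAX_ATTEMPTS = 5;
const LOCKOUT_MS = 15 * 60 * 1000;
const attempts = new Map<string, { count: number; lockedUntil: number }>();

function recordFailedLogin(email: string, now = Date.now()): void {
  const entry = attempts.get(email) ?? { count: 0, lockedUntil: 0 };
  entry.count += 1;
  if (entry.count >= MAX_ATTEMPTS) entry.lockedUntil = now + LOCKOUT_MS;
  attempts.set(email, entry);
}

function isLockedOut(email: string, now = Date.now()): boolean {
  const entry = attempts.get(email);
  return !!entry && entry.lockedUntil > now;
}

function recordSuccessfulLogin(email: string): void {
  attempts.delete(email); // successful login resets the counter
}
```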
6.3 Secrets Management
6.3.1 Current Secrets Inventory
| Secret | Location | Rotation |
|---|---|---|
| Database password | .env | 90 days |
| Redis password | .env | 90 days |
| JWT secret | .env | 90 days |
| Betfair API key | .env | Never (API key) |
| Betfair credentials | .env | 90 days |
| Roanuz API key | .env | Never |
| AWS credentials | .env | 90 days |
6.3.2 Secrets Management Best Practices
# Never commit secrets to git
echo ".env" >> .gitignore
echo "*.pem" >> .gitignore
echo "*.key" >> .gitignore
# Use Docker secrets in production
# (docker secret create requires Swarm mode; with plain Docker Compose,
# define file-based secrets under the top-level "secrets:" key instead)
docker secret create db_password db_password.txt
docker secret create jwt_secret jwt_secret.txt
# Reference in docker-compose.yml
services:
backend:
secrets:
- db_password
- jwt_secret
environment:
- DB_PASSWORD_FILE=/run/secrets/db_password
6.4 Audit Logging
// Log all security-relevant events
const auditEvents = [
'user.login',
'user.logout',
'user.login_failed',
'user.password_changed',
'user.created',
'user.deleted',
'order.placed',
'order.cancelled',
'admin.user_modified',
'admin.settings_changed',
'admin.manual_settlement',
'system.backup_completed',
'system.backup_failed',
];
// Audit log format
interface AuditLog {
timestamp: string;
event: string;
userId: string;
ipAddress: string;
userAgent: string;
details: Record<string, unknown>;
success: boolean;
}
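A sketch of constructing an entry in that shape, rejecting event names outside the whitelist so typos don't silently create new event types (buildAuditLog and the trimmed event list are illustrative, not existing code):

```typescript
// Build an audit entry, validating the event name against the whitelist.
const KNOWN_AUDIT_EVENTS = new Set([
  'user.login', 'user.logout', 'user.login_failed', 'user.password_changed',
  'order.placed', 'order.cancelled', 'admin.manual_settlement',
]);

interface AuditEntry {
  timestamp: string;
  event: string;
  userId: string;
  ipAddress: string;
  userAgent: string;
  details: Record<string, unknown>;
  success: boolean;
}

function buildAuditLog(
  event: string,
  userId: string,
  ipAddress: string,
  userAgent: string,
  details: Record<string, unknown>,
  success: boolean,
): AuditEntry {
  if (!KNOWN_AUDIT_EVENTS.has(event)) {
    throw new Error(`Unknown audit event: ${event}`);
  }
  return { timestamp: new Date().toISOString(), event, userId, ipAddress, userAgent, details, success };
}
```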
7. Infrastructure & Resource Management
7.1 Docker Configuration
7.1.1 Resource Limits
# docker-compose.prod.yml
services:
backend:
deploy:
resources:
limits:
cpus: '2'
memory: 2G
reservations:
cpus: '0.5'
memory: 512M
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:3001/api/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 40s
frontend:
deploy:
resources:
limits:
cpus: '1'
memory: 1G
reservations:
cpus: '0.25'
memory: 256M
postgres:
deploy:
resources:
limits:
cpus: '2'
memory: 4G
reservations:
cpus: '0.5'
memory: 1G
healthcheck:
test: ["CMD-SHELL", "pg_isready -U hannibal"]
interval: 10s
timeout: 5s
retries: 5
redis:
deploy:
resources:
limits:
cpus: '1'
memory: 1G
reservations:
cpus: '0.25'
memory: 256M
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 10s
timeout: 5s
retries: 5
7.1.2 Docker Daemon Configuration
// /etc/docker/daemon.json
{
"log-driver": "json-file",
"log-opts": {
"max-size": "100m",
"max-file": "5",
"compress": "true"
},
"storage-driver": "overlay2",
"live-restore": true,
"default-ulimits": {
"nofile": {
"Name": "nofile",
"Hard": 65536,
"Soft": 65536
}
}
}
7.2 Disk Management
7.2.1 Disk Space Monitoring
#!/bin/bash
# /opt/hannibal/scripts/disk-cleanup.sh
# Run daily via cron
# Clean unused Docker resources older than 7 days
# (do NOT pass --volumes here: it deletes any unused named volume, which
# can include database data while a container is stopped)
docker system prune -f --filter "until=168h"
# Clean old logs
find /var/log -name "*.gz" -mtime +30 -delete
find /var/log -name "*.log.*" -mtime +7 -delete
# Clean temp files
find /tmp -type f -mtime +7 -delete
# Clean old local backups (files only; S3 keeps the 30-day copies)
find /var/backups -type f -mtime +7 -delete
# Report disk usage
df -h | grep -E "^/dev" | while read line; do
usage=$(echo $line | awk '{print $5}' | sed 's/%//')
mount=$(echo $line | awk '{print $6}')
if [ $usage -gt 80 ]; then
echo "WARNING: ${mount} is ${usage}% full"
fi
done
7.2.2 PostgreSQL Maintenance
#!/bin/bash
# /opt/hannibal/scripts/postgres-maintenance.sh
# Run weekly
# Vacuum and analyze all tables
docker exec hannibal-postgres psql -U hannibal -d hannibal -c "VACUUM ANALYZE;"
# Reindex if needed (run monthly)
# docker exec hannibal-postgres psql -U hannibal -d hannibal -c "REINDEX DATABASE hannibal;"
# Check for bloat
docker exec hannibal-postgres psql -U hannibal -d hannibal -c "
SELECT schemaname, tablename,
pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) as size,
n_dead_tup as dead_tuples
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC;
"
7.3 Redis Configuration
# redis.conf
maxmemory 512mb
maxmemory-policy allkeys-lru
# Persistence
save 900 1
save 300 10
save 60 10000
# Append-only file
appendonly yes
appendfsync everysec
# Logging
loglevel notice
logfile /var/log/redis/redis.log
7.4 PostgreSQL Tuning
# postgresql.conf optimizations for 4GB RAM server
# Memory
shared_buffers = 1GB
effective_cache_size = 3GB
work_mem = 16MB
maintenance_work_mem = 256MB
# Connections
max_connections = 100
# WAL
wal_buffers = 64MB
checkpoint_completion_target = 0.9
max_wal_size = 2GB
min_wal_size = 1GB
# Query Planning
random_page_cost = 1.1
effective_io_concurrency = 200
# Logging
log_min_duration_statement = 1000 # Log queries > 1 second
log_checkpoints = on
log_connections = on
log_disconnections = on
log_lock_waits = on
# Autovacuum
autovacuum = on
autovacuum_max_workers = 3
autovacuum_naptime = 1min
8. Performance Optimization
8.1 Database Performance
8.1.1 Essential Indexes
-- Orders table indexes
CREATE INDEX idx_orders_status ON orders(status) WHERE status = 'accepted';
CREATE INDEX idx_orders_user_id ON orders(user_id);
CREATE INDEX idx_orders_fixture_id ON orders(fixture_id);
CREATE INDEX idx_orders_created_at ON orders(created_at DESC);
CREATE INDEX idx_orders_market_id ON orders(market_id);
CREATE INDEX idx_orders_routing_venue_status ON orders(routing_venue, status);
-- Composite index for settlement job
CREATE INDEX idx_orders_settlement ON orders(routing_venue, status, market_id)
WHERE status = 'accepted';
-- Users table
CREATE INDEX idx_users_email ON users(email);
CREATE INDEX idx_users_agent_id ON users(agent_id);
8.1.2 Query Optimization Checklist
- All queries use indexes (check with EXPLAIN ANALYZE)
- No N+1 queries (use eager loading)
- Pagination on all list endpoints
- Connection pooling configured
- Prepared statements used
- No SELECT * in production code
8.2 Caching Strategy
8.2.1 Cache Layers
| Layer | Data | TTL | Invalidation |
|---|---|---|---|
| Redis | Fixtures list | 30s | On update |
| Redis | Market odds | 5s | Streaming update |
| Redis | User sessions | 24h | On logout |
| Redis | Betfair session | 12h | On expiry |
| CDN | Static assets | 1 year | Version hash |
| Browser | API responses | 0 | No-cache |
8.2.2 Cache Implementation
// Redis caching pattern
class CacheService {
async get<T>(key: string): Promise<T | null> {
const cached = await redis.get(key);
return cached ? JSON.parse(cached) : null;
}
async set(key: string, value: unknown, ttlSeconds: number): Promise<void> {
await redis.setex(key, ttlSeconds, JSON.stringify(value));
}
async invalidate(pattern: string): Promise<void> {
// Use SCAN instead of KEYS: KEYS blocks Redis while it walks the entire
// keyspace and should not be used in production
let cursor = '0';
do {
const [nextCursor, keys] = await redis.scan(cursor, 'MATCH', pattern, 'COUNT', 100);
if (keys.length > 0) {
await redis.del(...keys);
}
cursor = nextCursor;
} while (cursor !== '0');
}
}
// Usage
const fixtures = await cache.get('fixtures:live:1');
if (!fixtures) {
const data = await fetchFixtures();
await cache.set('fixtures:live:1', data, 30);
}
8.3 API Performance
8.3.1 Response Time Targets
| Endpoint | Target p50 | Target p95 | Target p99 |
|---|---|---|---|
| GET /api/health | 5ms | 10ms | 50ms |
| GET /api/fixtures | 50ms | 150ms | 300ms |
| GET /api/fixtures/:id | 30ms | 100ms | 200ms |
| POST /api/orders | 100ms | 300ms | 500ms |
| GET /api/orders | 50ms | 150ms | 300ms |
8.3.2 Performance Checklist
- Gzip compression enabled
- HTTP/2 enabled
- Keep-alive connections
- Response pagination
- Field filtering (sparse fieldsets)
- Async processing for heavy operations
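The "response pagination" item above starts with clamping client-supplied parameters before they reach SQL, so a single request can never ask for an unbounded result set. A minimal sketch (the 100-row cap is an assumed policy; keyset pagination scales better for deep pages, but offset pagination is the simpler baseline):

```typescript
// Clamp page/pageSize into safe LIMIT/OFFSET values.
function paginationParams(
  page: number,
  pageSize: number,
  maxPageSize = 100,
): { limit: number; offset: number } {
  const size = Math.min(Math.max(1, Math.floor(pageSize) || 1), maxPageSize);
  const p = Math.max(1, Math.floor(page) || 1);
  return { limit: size, offset: (p - 1) * size };
}
```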
9. Deployment & CI/CD
9.1 GitHub Actions Pipeline
# .github/workflows/deploy.yml
name: Deploy to Production
on:
push:
branches: [main]
workflow_dispatch:
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Setup Node.js
uses: actions/setup-node@v4
with:
node-version: '20'
cache: 'npm'
- name: Install dependencies
run: |
cd backend && npm ci
cd ../frontend && npm ci
- name: Run tests
run: |
cd backend && npm test
cd ../frontend && npm test
- name: Type check
run: |
cd backend && npm run typecheck
cd ../frontend && npm run typecheck
security-scan:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Run Snyk security scan
uses: snyk/actions/node@master
env:
SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}
build:
needs: [test, security-scan]
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Build Docker images
run: |
docker build -t hannibal-backend:${{ github.sha }} ./backend
docker build -t hannibal-frontend:${{ github.sha }} ./frontend
- name: Push to registry
run: |
docker push hannibal-backend:${{ github.sha }}
docker push hannibal-frontend:${{ github.sha }}
deploy:
needs: build
runs-on: ubuntu-latest
environment: production
steps:
- name: Deploy to production
uses: appleboy/ssh-action@master
with:
host: ${{ secrets.PROD_HOST }}
username: ${{ secrets.PROD_USER }}
key: ${{ secrets.PROD_SSH_KEY }}
script: |
cd /opt/hannibal
git pull
docker compose -f docker-compose.prod.yml pull
docker compose -f docker-compose.prod.yml up -d
- name: Health check
run: |
sleep 30
curl -f https://hannibaldev.forsyt.io/api/health || exit 1
- name: Notify Slack
uses: slackapi/slack-github-action@v1
with:
payload: |
{
"text": "✅ Deployed to production: ${{ github.sha }}"
}
env:
SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
9.2 Rollback Procedure
#!/bin/bash
# /opt/hannibal/scripts/rollback.sh
# Get previous version
PREVIOUS_VERSION=$(docker images hannibal-backend --format "{{.Tag}}" | head -2 | tail -1)
echo "Rolling back to version: ${PREVIOUS_VERSION}"
# Update docker-compose to use previous version
sed -i "s/hannibal-backend:.*/hannibal-backend:${PREVIOUS_VERSION}/" docker-compose.prod.yml
sed -i "s/hannibal-frontend:.*/hannibal-frontend:${PREVIOUS_VERSION}/" docker-compose.prod.yml
# Restart services
docker compose -f docker-compose.prod.yml up -d
# Verify health
sleep 30
curl -f http://localhost:3001/api/health || echo "WARNING: Health check failed"
echo "Rollback complete"
9.3 Database Migrations
// Always use backward-compatible migrations
// BAD: Rename column directly
// ALTER TABLE orders RENAME COLUMN old_name TO new_name;
// GOOD: Add new column, migrate data, then remove old
// Step 1: Add new column
// ALTER TABLE orders ADD COLUMN new_name VARCHAR(255);
// Step 2: Migrate data (in application)
// UPDATE orders SET new_name = old_name WHERE new_name IS NULL;
// Step 3: Remove old column (after deployment is stable)
// ALTER TABLE orders DROP COLUMN old_name;
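Step 2 above (the data backfill) is safest run in small, bounded batches so the migration never holds long row locks or bloats a single transaction. A minimal sketch of a batch runner — the per-batch UPDATE is injected so it works with any client; with Prisma it could be something like `() => prisma.$executeRaw`UPDATE orders SET new_name = old_name WHERE id IN (SELECT id FROM orders WHERE new_name IS NULL LIMIT 1000)`` (column names are the example's, not real schema):

```typescript
// Runs an UPDATE-in-batches backfill until no rows remain.
// runBatch executes one bounded UPDATE and returns the affected row count.
async function backfillInBatches(
  runBatch: () => Promise<number>,
  pauseMs = 100
): Promise<number> {
  let total = 0;
  for (;;) {
    const updated = await runBatch();
    total += updated;
    if (updated === 0) return total; // nothing left to migrate
    // Brief pause keeps the backfill from starving foreground queries.
    await new Promise(resolve => setTimeout(resolve, pauseMs));
  }
}
```

Because each batch commits independently, the backfill can be interrupted and resumed at any point — the `WHERE new_name IS NULL` predicate makes it idempotent.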
10. Analytics & Business Intelligence
10.1 Google Analytics 4 Setup
// Frontend analytics integration
import { gtag } from './gtag';
// Track page views
gtag('config', 'G-XXXXXXXXXX', {
page_path: window.location.pathname,
});
// Track events
gtag('event', 'place_order', {
event_category: 'betting',
event_label: fixture.homeTeam + ' vs ' + fixture.awayTeam,
value: order.stake,
});
gtag('event', 'login', {
method: 'email',
});
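A thin typed wrapper keeps event names and payload shapes consistent across the frontend, so misspelled events fail at compile time instead of silently polluting GA. This is a sketch; the event map below only covers the two examples above and would grow with the app:

```typescript
// Hypothetical event map: one entry per GA event the app emits.
type AnalyticsEvents = {
  place_order: { event_category: 'betting'; event_label: string; value: number };
  login: { method: 'email' | 'google' };
};

// Builds a (name, params) pair with the payload type checked against the map.
function buildEvent<K extends keyof AnalyticsEvents>(
  name: K,
  params: AnalyticsEvents[K]
): [K, AnalyticsEvents[K]] {
  return [name, params];
}

// Usage: spread the pair into gtag('event', ...).
const [evName, evParams] = buildEvent('place_order', {
  event_category: 'betting',
  event_label: 'Arsenal vs Chelsea',
  value: 25,
});
```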
10.2 Business Metrics to Track
| Metric | Description | Query/Calculation |
|---|---|---|
| DAU | Daily Active Users | Unique users with activity per day |
| Orders/Day | Total orders placed | COUNT(orders) WHERE date = today |
| GGR | Gross Gaming Revenue | SUM(stake) - SUM(payouts) |
| Handle | Total amount wagered | SUM(stake) |
| Hold % | House edge realized | GGR / Handle * 100 |
| B-Book Exposure | Current liability | SUM(potential_payout) for unsettled B-Book |
| Settlement Rate | Orders settled on time | Settled within 1hr / Total settled |
| User Retention | 7-day retention | Users active day 7 / Users signed up day 0 |
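The GGR and Hold % formulas in the table can be computed directly from settled orders. A sketch, assuming `payout` is the total amount returned to the user (stake × odds for a win, 0 for a loss):

```typescript
type SettledOrder = { stake: number; payout: number };

// Handle = total wagered; GGR = handle minus payouts; Hold % = GGR / handle.
function businessMetrics(orders: SettledOrder[]) {
  const handle = orders.reduce((sum, o) => sum + o.stake, 0);
  const payouts = orders.reduce((sum, o) => sum + o.payout, 0);
  const ggr = handle - payouts;
  const holdPct = handle > 0 ? (ggr / handle) * 100 : 0;
  return { handle, ggr, holdPct };
}

// Example: one losing bet (house keeps 100) and one winning bet (house pays out 150 on a 100 stake).
const metrics = businessMetrics([
  { stake: 100, payout: 0 },
  { stake: 100, payout: 150 },
]);
// → handle 200, GGR 50, hold 25%
```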
10.3 Business Dashboard Queries
-- Daily summary
SELECT
DATE(created_at) as date,
COUNT(*) as total_orders,
COUNT(DISTINCT user_id) as unique_users,
SUM(stake) as total_stake,
SUM(CASE WHEN settlement_outcome = 'win' THEN -profit_loss ELSE profit_loss END) as ggr
FROM orders
WHERE created_at >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY DATE(created_at)
ORDER BY date DESC;
-- B-Book exposure
SELECT
SUM(stake * (odds - 1)) as max_exposure
FROM orders
WHERE routing_venue = 'bbook'
AND status = 'accepted';
-- Settlement performance
SELECT
AVG(EXTRACT(EPOCH FROM (settled_at - created_at))) as avg_settlement_seconds
FROM orders
WHERE status = 'settled'
AND settled_at >= CURRENT_DATE - INTERVAL '7 days';
11. High Availability & Reliability
11.1 Current Architecture (Single Node)
┌─────────────┐
│ Cloudflare │
│ (CDN) │
└──────┬──────┘
│
┌──────▼──────┐
│ Nginx │
│ (Proxy) │
└──────┬──────┘
│
┌────────────┼────────────┐
│ │ │
┌──────▼──────┐ ┌───▼───┐ ┌──────▼──────┐
│ Frontend │ │Backend│ │ Backend │
│ (Next.js) │ │ (API) │ │ (Jobs) │
└─────────────┘ └───┬───┘ └──────┬──────┘
│ │
┌────────────┼────────────┘
│ │
┌──────▼──────┐ ┌───▼───┐
│ PostgreSQL │ │ Redis │
└─────────────┘ └───────┘
11.2 Future HA Architecture
┌─────────────┐
│ Cloudflare │
│ (CDN + WAF) │
└──────┬──────┘
│
┌────────────┼────────────┐
│ │ │
┌──────▼──────┐ │ ┌──────▼──────┐
│ Nginx 1 │ │ │ Nginx 2 │
│ (Primary) │ │ │ (Standby) │
└──────┬──────┘ │ └──────┬──────┘
│ │ │
└────────────┼────────────┘
│
┌──────▼──────┐
│ HAProxy │
│ (LB) │
└──────┬──────┘
│
┌─────────────────┼─────────────────┐
│ │ │
┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
│ Backend 1 │ │ Backend 2 │ │ Backend 3 │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │
└─────────────────┼─────────────────┘
│
┌─────────────────┼─────────────────┐
│ │ │
┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
│ Postgres │ │ Postgres │ │ Redis │
│ (Primary) │◄──│ (Replica) │ │ (Sentinel) │
└─────────────┘ └─────────────┘ └─────────────┘
11.3 Reliability Patterns
11.3.1 Circuit Breaker (Betfair API)
import CircuitBreaker from 'opossum';
const betfairBreaker = new CircuitBreaker(callBetfairAPI, {
timeout: 10000, // 10 second timeout
errorThresholdPercentage: 50, // Open circuit if 50% fail
resetTimeout: 30000, // Try again after 30 seconds
volumeThreshold: 10, // Minimum 10 requests before tripping
});
betfairBreaker.on('open', () => {
logger.warn('Betfair circuit breaker opened');
alerting.send('warning', 'Betfair API circuit breaker opened');
});
betfairBreaker.on('halfOpen', () => {
logger.info('Betfair circuit breaker half-open, testing...');
});
betfairBreaker.on('close', () => {
logger.info('Betfair circuit breaker closed');
});
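For reference, the state machine opossum manages can be sketched without the library. This is a deliberately simplified illustration (a consecutive-failure counter instead of opossum's error-percentage window, and no volume threshold); the class and option names are this sketch's, not opossum's:

```typescript
type BreakerState = 'closed' | 'open' | 'halfOpen';

class SimpleBreaker<T> {
  private state: BreakerState = 'closed';
  private failures = 0;
  private openedAt = 0;

  constructor(
    private fn: () => Promise<T>,
    private failureThreshold = 5,   // consecutive failures before opening
    private resetTimeout = 30_000   // ms to wait before allowing a probe
  ) {}

  async fire(): Promise<T> {
    if (this.state === 'open') {
      if (Date.now() - this.openedAt < this.resetTimeout) {
        throw new Error('Circuit open: failing fast');
      }
      this.state = 'halfOpen'; // reset window elapsed: let one probe through
    }
    try {
      const result = await this.fn();
      this.state = 'closed'; // probe (or normal call) succeeded
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures++;
      if (this.state === 'halfOpen' || this.failures >= this.failureThreshold) {
        this.state = 'open';
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}
```

The key property is the fail-fast path: while the circuit is open, callers get an immediate error without touching the failing dependency, which is what protects Betfair (and your own event loop) during an outage.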
11.3.2 Retry with Exponential Backoff
async function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}

async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries: number = 3,
  baseDelay: number = 1000
): Promise<T> {
  let lastError: Error | undefined;
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error as Error;
      if (attempt === maxRetries - 1) break; // final attempt failed: rethrow, don't sleep
      const delay = baseDelay * Math.pow(2, attempt); // 1s, 2s, 4s, ...
      logger.warn(`Attempt ${attempt + 1} failed, retrying in ${delay}ms`);
      await sleep(delay);
    }
  }
  throw lastError!;
}
11.3.3 Graceful Degradation
// If Betfair is down, show cached odds with warning
async function getOdds(fixtureId: string) {
try {
return await betfairAdapter.getOdds(fixtureId);
} catch (error) {
logger.error('Failed to get live odds, using cache', error);
const cached = await cache.get(`odds:${fixtureId}`);
if (cached) {
return {
...cached,
stale: true,
staleSince: cached.timestamp,
};
}
throw new ServiceUnavailableError('Odds temporarily unavailable');
}
}
11.4 Health Checks
// /api/health endpoint
app.get('/api/health', async (req, res) => {
const checks = {
database: await checkDatabase(),
redis: await checkRedis(),
betfair: await checkBetfairSession(),
};
const healthy = Object.values(checks).every(c => c.status === 'healthy');
res.status(healthy ? 200 : 503).json({
status: healthy ? 'healthy' : 'unhealthy',
timestamp: new Date().toISOString(),
checks,
});
});
async function checkDatabase(): Promise<HealthCheck> {
  const start = Date.now();
  try {
    await prisma.$queryRaw`SELECT 1`;
    return { status: 'healthy', latency: Date.now() - start };
  } catch (error) {
    return { status: 'unhealthy', error: (error as Error).message };
  }
}
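The per-dependency checks all share the same shape, so a generic helper avoids repeating the timing and error handling. A sketch (the commented probes are assumptions matching the checks above):

```typescript
type HealthCheck =
  | { status: 'healthy'; latency: number }
  | { status: 'unhealthy'; error: string };

// Wraps any async probe with latency measurement and error capture.
async function timedCheck(probe: () => Promise<void>): Promise<HealthCheck> {
  const start = Date.now();
  try {
    await probe();
    return { status: 'healthy', latency: Date.now() - start };
  } catch (error) {
    return { status: 'unhealthy', error: (error as Error).message };
  }
}

// Usage sketch:
// const database = await timedCheck(async () => { await prisma.$queryRaw`SELECT 1`; });
// const redis = await timedCheck(async () => { await redisClient.ping(); });
```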
12. Operational Runbooks
12.1 Runbook: Service Restart
````markdown
## Service Restart Runbook

### When to Use
- Service is unresponsive
- Memory leak detected
- After configuration change

### Procedure
1. **Check current status**
   ```bash
   docker compose ps
   docker logs hannibal-backend --tail 100
   ```
2. **Restart specific service**
   ```bash
   docker compose restart backend
   ```
3. **Verify service is healthy**
   ```bash
   curl http://localhost:3001/api/health
   docker logs hannibal-backend --tail 20
   ```
4. **If restart fails, recreate container**
   ```bash
   docker compose up -d --force-recreate backend
   ```

### Escalation
If the service won't start after 3 attempts, escalate to the on-call engineer.
````
### 12.2 Runbook: Database Connection Issues
````markdown
## Database Connection Issues Runbook

### Symptoms
- "Connection refused" errors
- "Too many connections" errors
- Slow queries

### Diagnosis
1. **Check PostgreSQL status**
   ```bash
   docker exec hannibal-postgres pg_isready
   ```
2. **Check connection count**
   ```bash
   docker exec hannibal-postgres psql -U hannibal -c "SELECT count(*) FROM pg_stat_activity;"
   ```
3. **Check for blocking queries**
   ```bash
   docker exec hannibal-postgres psql -U hannibal -c "
   SELECT pid, now() - pg_stat_activity.query_start AS duration, query
   FROM pg_stat_activity
   WHERE state != 'idle'
   ORDER BY duration DESC
   LIMIT 10;
   "
   ```

### Resolution
**Too many connections:**
```bash
# Kill idle connections
docker exec hannibal-postgres psql -U hannibal -c "
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND query_start < now() - interval '10 minutes';
"
```

**Blocking query:**
```bash
# Kill the specific query, using the pid found during diagnosis
docker exec hannibal-postgres psql -U hannibal -c "SELECT pg_terminate_backend(<PID>);"
```
````
### 12.3 Runbook: Betfair Session Recovery
````markdown
## Betfair Session Recovery Runbook

### Symptoms
- "Session expired" errors in logs
- Orders failing to place
- No odds updates

### Diagnosis
1. **Check Betfair session status**
   ```bash
   docker logs hannibal-backend --tail 100 | grep -i betfair
   ```
2. **Check session in Redis**
   ```bash
   docker exec hannibal-redis redis-cli GET betfair:session
   ```

### Resolution
1. **Force session refresh**
   ```bash
   # Clear cached session
   docker exec hannibal-redis redis-cli DEL betfair:session
   # Restart backend to trigger re-login
   docker compose restart backend
   ```
2. **Verify new session**
   ```bash
   docker logs hannibal-backend --tail 50 | grep -i "logged in"
   ```

### If Login Fails
- Check Betfair credentials in .env
- Check if Betfair API is down: https://status.betfair.com
- Check if IP is blocked (contact Betfair support)
````
### 12.4 Runbook: Disk Space Emergency
````markdown
## Disk Space Emergency Runbook

### Symptoms
- Disk usage > 90% alert
- Services failing to write

### Immediate Actions
1. **Check disk usage**
   ```bash
   df -h
   du -sh /* | sort -hr | head -20
   ```
2. **Quick cleanup**
   ```bash
   # Docker cleanup
   # CAUTION: --volumes removes any volume not attached to a running container;
   # make sure postgres/redis are running before using it
   docker system prune -af --volumes
   # Clear old logs
   find /var/log -name "*.gz" -delete
   truncate -s 0 /var/log/*.log
   # Clear temp files
   rm -rf /tmp/*
   ```
3. **Check large files**
   ```bash
   find / -type f -size +100M -exec ls -lh {} \;
   ```

### If Still Critical
- Delete old database backups (verify off-site copies in S3 first):
  ```bash
  rm /var/backups/postgres/hannibal_*.sql.gz
  ```
- Expand disk volume (cloud provider console)
- Move data to larger volume
````
---
## 13. Implementation Roadmap
### Phase 1: Critical (Week 1-2) 🔴
| Task | Effort | Owner | Status |
|------|--------|-------|--------|
| Configure Docker log rotation | 1h | DevOps | ⬜ |
| Set up daily database backups to S3 | 4h | DevOps | ⬜ |
| Deploy Prometheus + Node Exporter | 4h | DevOps | ⬜ |
| Deploy Grafana with basic dashboards | 4h | DevOps | ⬜ |
| Configure critical alerts (disk, memory, service down) | 4h | DevOps | ⬜ |
| Set up Uptime Kuma for external monitoring | 2h | DevOps | ⬜ |
| Configure UFW firewall | 2h | DevOps | ⬜ |
| Add resource limits to Docker containers | 2h | DevOps | ⬜ |
### Phase 2: Important (Week 3-4) 🟡
| Task | Effort | Owner | Status |
|------|--------|-------|--------|
| Deploy Loki + Promtail for log aggregation | 4h | DevOps | ⬜ |
| Add application metrics to backend | 8h | Backend | ⬜ |
| Create business metrics dashboard | 4h | DevOps | ⬜ |
| Set up backup verification automation | 4h | DevOps | ⬜ |
| Implement security headers in Nginx | 2h | DevOps | ⬜ |
| Set up GitHub Actions CI/CD pipeline | 8h | DevOps | ⬜ |
| Configure Cloudflare WAF rules | 4h | DevOps | ⬜ |
| Add database indexes for performance | 4h | Backend | ⬜ |
### Phase 3: Enhancement (Month 2) 🟢
| Task | Effort | Owner | Status |
|------|--------|-------|--------|
| Implement distributed tracing (Jaeger) | 8h | Backend | ⬜ |
| Add circuit breaker for Betfair API | 4h | Backend | ⬜ |
| Create operational runbooks | 8h | DevOps | ⬜ |
| Set up Google Analytics 4 | 4h | Frontend | ⬜ |
| Implement audit logging | 8h | Backend | ⬜ |
| Performance optimization (caching, queries) | 16h | Backend | ⬜ |
| Security audit and penetration testing | 16h | Security | ⬜ |
### Phase 4: Scale (Month 3+) 🔵
| Task | Effort | Owner | Status |
|------|--------|-------|--------|
| Set up PostgreSQL replication | 8h | DevOps | ⬜ |
| Configure Redis Sentinel | 4h | DevOps | ⬜ |
| Deploy multiple backend instances | 8h | DevOps | ⬜ |
| Set up load balancer | 4h | DevOps | ⬜ |
| Implement chaos testing | 16h | DevOps | ⬜ |
| Multi-region failover setup | 24h | DevOps | ⬜ |
---
## 14. Appendix: Configuration Examples
### 14.1 Complete docker-compose.prod.yml
```yaml
version: '3.8'
services:
backend:
image: hannibal-backend:latest
restart: unless-stopped
environment:
- NODE_ENV=production
- DATABASE_URL=postgresql://hannibal:${DB_PASSWORD}@postgres:5432/hannibal
- REDIS_URL=redis://redis:6379
depends_on:
postgres:
condition: service_healthy
redis:
condition: service_healthy
deploy:
resources:
limits:
cpus: '2'
memory: 2G
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:3001/api/health"]
interval: 30s
timeout: 10s
retries: 3
logging:
driver: "json-file"
options:
max-size: "100m"
max-file: "5"
frontend:
image: hannibal-frontend:latest
restart: unless-stopped
environment:
- NODE_ENV=production
deploy:
resources:
limits:
cpus: '1'
memory: 1G
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:3000"]
interval: 30s
timeout: 10s
retries: 3
logging:
driver: "json-file"
options:
max-size: "100m"
max-file: "5"
postgres:
image: postgres:15-alpine
restart: unless-stopped
environment:
- POSTGRES_USER=hannibal
- POSTGRES_PASSWORD=${DB_PASSWORD}
- POSTGRES_DB=hannibal
volumes:
- postgres_data:/var/lib/postgresql/data
- ./postgres/postgresql.conf:/etc/postgresql/postgresql.conf
command: postgres -c config_file=/etc/postgresql/postgresql.conf
deploy:
resources:
limits:
cpus: '2'
memory: 4G
healthcheck:
test: ["CMD-SHELL", "pg_isready -U hannibal"]
interval: 10s
timeout: 5s
retries: 5
logging:
driver: "json-file"
options:
max-size: "100m"
max-file: "5"
redis:
image: redis:7-alpine
restart: unless-stopped
command: redis-server /usr/local/etc/redis/redis.conf
volumes:
- redis_data:/data
- ./redis/redis.conf:/usr/local/etc/redis/redis.conf
deploy:
resources:
limits:
cpus: '1'
memory: 1G
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 10s
timeout: 5s
retries: 5
logging:
driver: "json-file"
options:
max-size: "50m"
max-file: "3"
prometheus:
image: prom/prometheus:latest
restart: unless-stopped
volumes:
- ./prometheus:/etc/prometheus
- prometheus_data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.retention.time=30d'
ports:
- "9090:9090"
grafana:
image: grafana/grafana:latest
restart: unless-stopped
volumes:
- grafana_data:/var/lib/grafana
- ./grafana/provisioning:/etc/grafana/provisioning
environment:
- GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
- GF_USERS_ALLOW_SIGN_UP=false
ports:
- "3000:3000"
loki:
image: grafana/loki:latest
restart: unless-stopped
volumes:
- ./loki:/etc/loki
- loki_data:/loki
command: -config.file=/etc/loki/loki-config.yml
promtail:
image: grafana/promtail:latest
restart: unless-stopped
volumes:
- ./promtail:/etc/promtail
- /var/lib/docker/containers:/var/lib/docker/containers:ro
- /var/run/docker.sock:/var/run/docker.sock
command: -config.file=/etc/promtail/promtail-config.yml
node-exporter:
image: prom/node-exporter:latest
restart: unless-stopped
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
command:
- '--path.procfs=/host/proc'
- '--path.sysfs=/host/sys'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
postgres-exporter:
image: prometheuscommunity/postgres-exporter:latest
restart: unless-stopped
environment:
- DATA_SOURCE_NAME=postgresql://hannibal:${DB_PASSWORD}@postgres:5432/hannibal?sslmode=disable
redis-exporter:
image: oliver006/redis_exporter:latest
restart: unless-stopped
environment:
- REDIS_ADDR=redis://redis:6379
volumes:
postgres_data:
redis_data:
prometheus_data:
grafana_data:
loki_data:
```

### 14.2 Prometheus Configuration
# prometheus/prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093  # requires an Alertmanager service (not shown in the 14.1 compose file)
rule_files:
- /etc/prometheus/alerts/*.yml
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- job_name: 'node'
static_configs:
- targets: ['node-exporter:9100']
- job_name: 'postgres'
static_configs:
- targets: ['postgres-exporter:9187']
- job_name: 'redis'
static_configs:
- targets: ['redis-exporter:9121']
- job_name: 'backend'
static_configs:
- targets: ['backend:3001']
metrics_path: /metrics
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']  # requires a cAdvisor service (not shown in the 14.1 compose file)
### 14.3 Crontab for Maintenance Tasks
# /etc/cron.d/hannibal
# Database backup - daily at 3 AM
0 3 * * * root /opt/hannibal/scripts/backup-database.sh >> /var/log/hannibal/backup.log 2>&1
# Backup verification - weekly on Sunday at 4 AM
0 4 * * 0 root /opt/hannibal/scripts/verify-backup.sh >> /var/log/hannibal/backup-verify.log 2>&1
# Disk cleanup - daily at 2 AM
0 2 * * * root /opt/hannibal/scripts/disk-cleanup.sh >> /var/log/hannibal/cleanup.log 2>&1
# PostgreSQL maintenance - weekly on Sunday at 5 AM
0 5 * * 0 root /opt/hannibal/scripts/postgres-maintenance.sh >> /var/log/hannibal/postgres.log 2>&1
# SSL certificate renewal check - daily at 6 AM
0 6 * * * root certbot renew --quiet --post-hook "nginx -s reload"
# Docker cleanup - weekly on Saturday at 3 AM
0 3 * * 6 root docker system prune -af >> /var/log/hannibal/docker-cleanup.log 2>&1
Document History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2026-01-28 | AI Assistant | Initial document |