Hannibal Production Readiness Guide
Version: 1.0
Last Updated: January 28, 2026
Status: Planning
This document outlines all the requirements, configurations, and best practices needed to make Hannibal a production-ready, rock-solid betting platform that can run autonomously for years with minimal human intervention.
Table of Contents
- Executive Summary
- Current State Assessment
- Monitoring & Observability
- Alerting Strategy
- Backup & Disaster Recovery
- Security
- Infrastructure & Resource Management
- Performance Optimization
- Deployment & CI/CD
- Analytics & Business Intelligence
- High Availability & Reliability
- Operational Runbooks
- Implementation Roadmap
- Appendix: Configuration Examples
1. Executive Summary
1.1 Goals
Hannibal must be:
- Reliable: 99.9% uptime (< 8.76 hours downtime/year)
- Resilient: Auto-recover from failures without human intervention
- Observable: Full visibility into system health and business metrics
- Secure: Protected against attacks and data breaches
- Recoverable: Ability to restore from any disaster within defined RTO/RPO
- Scalable: Handle 10x current load without architecture changes
1.2 Key Metrics Targets
| Metric | Target |
|---|---|
| Uptime | 99.9% |
| API Response Time (p95) | < 500ms |
| Order Placement Latency | < 1 second |
| Settlement Time | < 5 minutes after match end |
| Recovery Time Objective (RTO) | < 1 hour |
| Recovery Point Objective (RPO) | < 1 hour |
| Mean Time to Detection (MTTD) | < 2 minutes |
| Mean Time to Recovery (MTTR) | < 30 minutes |
1.3 Technology Stack Overview
┌─────────────────────────────────────────────────────────────────┐
│ MONITORING LAYER │
│ Prometheus │ Grafana │ Loki │ Uptime Kuma │ Jaeger │
└─────────────────────────────────────────────────────────────────┘
│
┌─────────────────────────────────────────────────────────────────┐
│ APPLICATION LAYER │
│ Frontend (Next.js) │ Backend (Node.js/Express) │ Jobs │
└─────────────────────────────────────────────────────────────────┘
│
┌─────────────────────────────────────────────────────────────────┐
│ DATA LAYER │
│ PostgreSQL │ Redis │ Betfair API │ Roanuz API │
└─────────────────────────────────────────────────────────────────┘
│
┌─────────────────────────────────────────────────────────────────┐
│ INFRASTRUCTURE LAYER │
│ Docker │ Nginx │ Certbot │ UFW │ Systemd │
└─────────────────────────────────────────────────────────────────┘
2. Current State Assessment
2.1 What We Have
| Component | Status | Notes |
|---|---|---|
| Docker Compose deployment | ✅ Working | Single host |
| PostgreSQL database | ✅ Working | No backups configured |
| Redis cache | ✅ Working | No memory limits |
| Nginx reverse proxy | ✅ Working | Basic config |
| SSL certificates | ✅ Working | Certbot auto-renewal |
| Backend API | ✅ Working | No metrics exposed |
| Frontend | ✅ Working | No analytics |
| Betfair integration | ✅ Working | Session management needs monitoring |
| Settlement jobs | ✅ Working | No failure alerts |
2.2 Critical Gaps
| Gap | Risk | Priority |
|---|---|---|
| No database backups | Data loss | 🔴 Critical |
| No monitoring | Blind to issues | 🔴 Critical |
| No alerting | Delayed response | 🔴 Critical |
| No log rotation | Disk exhaustion | 🔴 Critical |
| No resource limits | OOM crashes | 🟡 High |
| No uptime monitoring | Unknown downtime | 🟡 High |
| No security hardening | Vulnerabilities | 🟡 High |
| No CI/CD pipeline | Manual deployments | 🟢 Medium |
| No analytics | No business insights | 🟢 Medium |
3. Monitoring & Observability
3.1 Metrics Collection (Prometheus)
Prometheus will be the central metrics collection system, scraping metrics from all components.
3.1.1 Infrastructure Metrics (Node Exporter)
Collect from every host:
- CPU: Usage, load average, steal time
- Memory: Used, available, cached, swap
- Disk: Usage, I/O, latency
- Network: Bandwidth, packets, errors
- File descriptors: Open files, limits
3.1.2 Database Metrics (postgres_exporter)
Key Metrics:
- pg_stat_activity_count # Active connections
- pg_stat_database_tup_fetched # Rows read
- pg_stat_database_tup_inserted # Rows inserted
- pg_stat_database_xact_commit # Transactions committed
- pg_stat_database_deadlocks # Deadlock count
- pg_replication_lag # Replication lag (if replica)
- pg_database_size_bytes # Database size
- pg_stat_user_tables_n_dead_tup # Dead tuples (vacuum needed)
3.1.3 Redis Metrics (redis_exporter)
Key Metrics:
- redis_memory_used_bytes # Memory consumption
- redis_memory_max_bytes # Memory limit
- redis_connected_clients # Client connections
- redis_commands_processed_total # Command throughput
- redis_keyspace_hits_total # Cache hits
- redis_keyspace_misses_total # Cache misses
- redis_evicted_keys_total # Evicted keys (memory pressure)
- redis_rejected_connections_total # Rejected connections
3.1.4 Application Metrics (Custom)
Backend should expose metrics at /metrics endpoint:
// Key application metrics to implement
const metrics = {
// HTTP metrics
http_requests_total: Counter, // Total requests by method, path, status
http_request_duration_seconds: Histogram, // Request latency
// Business metrics
orders_placed_total: Counter, // Orders by type, venue
orders_settled_total: Counter, // Settlements by outcome
settlement_duration_seconds: Histogram, // Time to settle
bbook_exposure_dollars: Gauge, // Current B-Book liability
// External API metrics
betfair_api_calls_total: Counter, // Betfair API calls
betfair_api_errors_total: Counter, // Betfair API errors
betfair_api_latency_seconds: Histogram, // Betfair API latency
betfair_session_status: Gauge, // 1 = active, 0 = expired
// Job metrics
job_runs_total: Counter, // Job executions
job_duration_seconds: Histogram, // Job duration
job_errors_total: Counter, // Job failures
};
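In the Node.js backend these metrics would normally be registered with a library such as prom-client. As a dependency-free illustration of what the /metrics endpoint ultimately has to emit, here is a minimal counter and the Prometheus text exposition format (the class and variable names are illustrative, not existing Hannibal code):

```typescript
// Minimal sketch of a labeled counter and its Prometheus text exposition.
type Labels = Record<string, string | number>;

class Counter {
  name: string;
  help: string;
  private values = new Map<string, number>();

  constructor(name: string, help: string) {
    this.name = name;
    this.help = help;
  }

  inc(labels: Labels = {}, by = 1): void {
    const key = JSON.stringify(labels);
    this.values.set(key, (this.values.get(key) ?? 0) + by);
  }

  // Render in Prometheus text format, one sample line per label set
  expose(): string {
    const lines = [`# HELP ${this.name} ${this.help}`, `# TYPE ${this.name} counter`];
    for (const [key, value] of this.values) {
      const labelPairs = Object.entries(JSON.parse(key) as Labels)
        .map(([k, v]) => `${k}="${v}"`)
        .join(',');
      lines.push(labelPairs ? `${this.name}{${labelPairs}} ${value}` : `${this.name} ${value}`);
    }
    return lines.join('\n');
  }
}

const httpRequestsTotal = new Counter('http_requests_total', 'Total HTTP requests');
httpRequestsTotal.inc({ method: 'GET', path: '/api/fixtures', status: 200 });
httpRequestsTotal.inc({ method: 'GET', path: '/api/fixtures', status: 200 });
```

Prometheus scrapes this plain-text output; each counter increment per label combination becomes one sample line.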
3.1.5 Container Metrics (cAdvisor)
Key Metrics:
- container_cpu_usage_seconds_total
- container_memory_usage_bytes
- container_memory_working_set_bytes
- container_network_receive_bytes_total
- container_network_transmit_bytes_total
- container_fs_usage_bytes
3.2 Dashboards (Grafana)
3.2.1 System Overview Dashboard
┌─────────────────────────────────────────────────────────────────┐
│ SYSTEM OVERVIEW │
├─────────────┬─────────────┬─────────────┬─────────────┬─────────┤
│ CPU % │ Memory % │ Disk % │ Network │ Uptime │
│ 23% │ 67% │ 45% │ 12 Mb/s │ 45 days │
├─────────────┴─────────────┴─────────────┴─────────────┴─────────┤
│ CPU Usage Over Time │
│ ▁▂▃▄▅▆▇█▇▆▅▄▃▂▁▂▃▄▅▆▇█▇▆▅▄▃▂▁▂▃▄▅▆▇█▇▆▅▄▃▂▁ │
├─────────────────────────────────────────────────────────────────┤
│ Memory Usage Over Time │
│ ████████████████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░ │
├─────────────────────────────────────────────────────────────────┤
│ Container Status: backend ✅ | frontend ✅ | postgres ✅ | redis ✅ │
└─────────────────────────────────────────────────────────────────┘
3.2.2 Application Health Dashboard
┌─────────────────────────────────────────────────────────────────┐
│ APPLICATION HEALTH │
├─────────────┬─────────────┬─────────────┬─────────────┬─────────┤
│ Requests/s │ Error Rate │ p50 Lat │ p95 Lat │ p99 Lat │
│ 125 │ 0.02% │ 45ms │ 180ms │ 450ms │
├─────────────────────────────────────────────────────────────────┤
│ Request Rate by Endpoint │
│ /api/fixtures ████████████████████ 45/s │
│ /api/orders ████████ 18/s │
│ /api/odds ██████████████ 32/s │
├─────────────────────────────────────────────────────────────────┤
│ Error Rate Over Time │
│ ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ │
└─────────────────────────────────────────────────────────────────┘
3.2.3 Database Dashboard
┌─────────────────────────────────────────────────────────────────┐
│ DATABASE HEALTH │
├─────────────┬─────────────┬─────────────┬─────────────┬─────────┤
│ Connections │ Query Time │ Size │ Dead Tuples │ TPS │
│ 12/100 │ 15ms avg │ 2.3 GB │ 1,234 │ 85 │
├─────────────────────────────────────────────────────────────────┤
│ Table Sizes │
│ orders ████████████████████ 1.2 GB │
│ users ████ 120 MB │
│ fixtures ██████████ 450 MB │
├─────────────────────────────────────────────────────────────────┤
│ Slow Queries (> 100ms) │
│ SELECT * FROM orders WHERE... │ 234ms │ 15 calls/min │
└─────────────────────────────────────────────────────────────────┘
3.2.4 Business Metrics Dashboard
┌─────────────────────────────────────────────────────────────────┐
│ BUSINESS METRICS │
├─────────────┬─────────────┬─────────────┬─────────────┬─────────┤
│ Orders Today│ GGR Today │ Active Users│ B-Book Exp │ Settled │
│ 1,234 │ $5,678 │ 89 │ $12,345 │ 98.5% │
├─────────────────────────────────────────────────────────────────┤
│ Orders Over Time (24h) │
│ ▁▂▃▄▅▆▇█▇▆▅▄▃▂▁▂▃▄▅▆▇█▇▆▅▄▃▂▁▂▃▄▅▆▇█▇▆▅▄▃▂▁ │
├─────────────────────────────────────────────────────────────────┤
│ Settlement Status │
│ Pending: 23 │ Settled (Win): 456 │ Settled (Lose): 678 │
└─────────────────────────────────────────────────────────────────┘
3.3 Log Aggregation (Loki + Promtail)
3.3.1 Log Format Standard
All services should output JSON logs:
{
"timestamp": "2026-01-28T19:45:23.456Z",
"level": "info",
"service": "backend",
"traceId": "abc123",
"userId": "user_456",
"message": "Order placed successfully",
"orderId": "order_789",
"duration": 145
}
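A minimal logger producing that shape can be a few lines; in production the backend would more likely use pino or winston, but the output contract is the same (one JSON object per line so Promtail can parse each entry):

```typescript
// Minimal structured logger matching the JSON log format above.
type LogLevel = 'debug' | 'info' | 'warn' | 'error';

function formatLog(level: LogLevel, message: string, fields: Record<string, unknown> = {}): string {
  // One JSON object per line; extra fields are spread in at the top level
  return JSON.stringify({
    timestamp: new Date().toISOString(),
    level,
    service: 'backend',
    message,
    ...fields,
  });
}

console.log(formatLog('info', 'Order placed successfully', {
  traceId: 'abc123',
  orderId: 'order_789',
  duration: 145,
}));
```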
3.3.2 Log Retention Policy
| Log Type | Hot Storage | Cold Storage | Total Retention |
|---|---|---|---|
| Application logs | 7 days | 90 days | 90 days |
| Access logs | 7 days | 30 days | 30 days |
| Error logs | 30 days | 1 year | 1 year |
| Audit logs | 30 days | 7 years | 7 years |
| Security logs | 30 days | 1 year | 1 year |
3.3.3 Docker Log Rotation
// /etc/docker/daemon.json
{
"log-driver": "json-file",
"log-opts": {
"max-size": "100m",
"max-file": "5",
"compress": "true"
}
}
3.4 Distributed Tracing (Jaeger)
Implement OpenTelemetry tracing to track requests across services:
User Request
│
▼
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ Nginx │───▶│ Frontend│───▶│ Backend │───▶│ Postgres│
│ 2ms │ │ 15ms │ │ 45ms │ │ 12ms │
└─────────┘ └─────────┘ └─────────┘ └─────────┘
│
▼
┌─────────┐
│ Betfair │
│ 120ms │
└─────────┘
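The essence of the trace tree above is spans sharing one trace ID and linking to their parents. In practice the backend would use the OpenTelemetry SDK and export to Jaeger; this dependency-free sketch only shows the data model, with illustrative names:

```typescript
// What a span carries, stripped to essentials.
interface Span {
  traceId: string;       // shared by every hop of one user request
  spanId: string;
  parentSpanId?: string; // links e.g. the Backend span under the Frontend span
  name: string;
  startMs: number;
  endMs?: number;
}

class Tracer {
  spans: Span[] = [];
  private nextId = 1;

  startSpan(name: string, traceId: string, parentSpanId?: string): Span {
    const span: Span = {
      traceId,
      spanId: `span-${this.nextId++}`,
      parentSpanId,
      name,
      startMs: Date.now(),
    };
    this.spans.push(span);
    return span;
  }

  endSpan(span: Span): void {
    span.endMs = Date.now();
  }
}

// One request flowing nginx -> backend -> postgres
const tracer = new Tracer();
const root = tracer.startSpan('nginx', 'trace-1');
const backend = tracer.startSpan('backend', 'trace-1', root.spanId);
const db = tracer.startSpan('postgres', 'trace-1', backend.spanId);
tracer.endSpan(db);
tracer.endSpan(backend);
tracer.endSpan(root);
```

OpenTelemetry propagates the trace/parent IDs across service boundaries via HTTP headers (W3C `traceparent`), which is what lets Jaeger stitch the tree back together.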
3.5 Uptime Monitoring (Uptime Kuma)
External monitoring from outside the infrastructure:
| Check | URL | Interval | Timeout |
|---|---|---|---|
| Frontend | https://hannibaldev.forsyt.io | 60s | 10s |
| API Health | https://hannibaldev.forsyt.io/api/health | 60s | 10s |
| API Fixtures | https://hannibaldev.forsyt.io/api/fixtures/live | 60s | 15s |
| WebSocket | wss://hannibaldev.forsyt.io/ws | 60s | 10s |
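For these HTTP checks to be meaningful, /api/health should aggregate real dependency probes and return 503 when anything is down, so Uptime Kuma fails loudly instead of reporting a green page over a dead database. A sketch of that aggregation logic (probe names and response shape are assumptions, not the existing handler):

```typescript
// Aggregate dependency probes into a single health response.
interface DependencyCheck {
  name: string;      // e.g. 'postgres', 'redis', 'betfair-session'
  healthy: boolean;
  latencyMs: number;
}

function healthResponse(checks: DependencyCheck[]): { status: number; body: Record<string, unknown> } {
  const allHealthy = checks.every((c) => c.healthy);
  return {
    status: allHealthy ? 200 : 503, // 503 makes the uptime check fail
    body: { status: allHealthy ? 'ok' : 'degraded', checks },
  };
}
```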
4. Alerting Strategy
4.1 Alert Channels
| Channel | Use Case | Response Time |
|---|---|---|
| PagerDuty/OpsGenie | Critical alerts requiring immediate action | < 5 min |
| Slack #alerts-critical | Critical alerts for team visibility | < 5 min |
| Slack #alerts-warning | Warning alerts for awareness | < 1 hour |
| Email | Daily reports, non-urgent notifications | < 24 hours |
| SMS (backup) | Critical alerts if PagerDuty fails | < 5 min |
4.2 Alert Severity Levels
| Severity | Description | Response | Examples |
|---|---|---|---|
| P1 - Critical | Service down, data loss risk | Page immediately, 24/7 | Database down, disk full |
| P2 - High | Degraded service, potential impact | Page during business hours | High error rate, slow response |
| P3 - Medium | Warning, needs attention | Slack notification | High memory, certificate expiring |
| P4 - Low | Informational | Email/daily report | Deployment complete |
4.3 Critical Alerts (P1) - Page Immediately
| Alert Name | Condition | Duration | Action |
|---|---|---|---|
| ServiceDown | Container not running | 2 min | Auto-restart, then page |
| DatabaseDown | PostgreSQL not responding | 1 min | Page immediately |
| DiskCritical | Disk usage > 95% | 0 min | Page immediately |
| MemoryCritical | Memory usage > 95% | 5 min | Page immediately |
| HighErrorRate | 5xx errors > 10% | 5 min | Page immediately |
| BetfairSessionLost | Session expired, reconnect failed | 5 min | Page immediately |
| SettlementJobFailed | Job not run in 10 min | 10 min | Page immediately |
| BackupFailed | No successful backup in 25 hours | 0 min | Page immediately |
| SSLCertExpired | Certificate expired | 0 min | Page immediately |
4.4 High Alerts (P2) - Page During Business Hours
| Alert Name | Condition | Duration | Action |
|---|---|---|---|
| HighCPU | CPU > 90% | 15 min | Investigate, scale if needed |
| HighMemory | Memory > 90% | 10 min | Investigate, restart if needed |
| DiskHigh | Disk usage > 85% | 0 min | Clean up, expand disk |
| SlowQueries | Query time > 5s | 5 min | Optimize query |
| HighLatency | p95 latency > 2s | 10 min | Investigate bottleneck |
| ConnectionPoolExhausted | DB connections > 90% | 5 min | Increase pool, investigate |
| RedisMemoryHigh | Redis memory > 90% | 5 min | Check eviction, increase limit |
4.5 Warning Alerts (P3) - Slack Notification
| Alert Name | Condition | Duration |
|---|---|---|
| DiskWarning | Disk usage > 75% | 0 min |
| MemoryWarning | Memory > 80% | 10 min |
| CPUWarning | CPU > 80% | 15 min |
| CertificateExpiring | SSL cert expires in < 14 days | 0 min |
| HighDeadTuples | Dead tuples > 100,000 | 0 min |
| ReplicationLag | Lag > 30 seconds | 5 min |
| BetfairAPIErrors | Error rate > 5% | 5 min |
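The CertificateExpiring threshold is simple date arithmetic on the certificate's notAfter timestamp; a minimal sketch (the function name is illustrative):

```typescript
// Days remaining until a certificate's notAfter date.
function daysUntilExpiry(notAfter: Date, now: Date): number {
  const MS_PER_DAY = 86_400_000;
  return Math.floor((notAfter.getTime() - now.getTime()) / MS_PER_DAY);
}

// Cert expiring 2026-02-05, checked on 2026-01-28: 8 days left, under the
// 14-day threshold, so the warning fires.
const shouldWarn =
  daysUntilExpiry(new Date('2026-02-05T00:00:00Z'), new Date('2026-01-28T00:00:00Z')) < 14;
```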
4.6 Alert Rules (Prometheus)
# prometheus/alerts/critical.yml
groups:
- name: critical
rules:
- alert: ServiceDown
expr: up{job=~"backend|frontend|postgres|redis"} == 0
for: 2m
labels:
severity: critical
annotations:
summary: "Service {{ $labels.job }} is down"
description: "{{ $labels.job }} has been down for more than 2 minutes"
- alert: DiskCritical
expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 5
for: 0m
labels:
severity: critical
annotations:
summary: "Disk almost full on {{ $labels.instance }}"
description: "Disk usage is above 95% on {{ $labels.mountpoint }}"
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) > 0.1
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate detected"
description: "Error rate is above 10% for the last 5 minutes"
- alert: DatabaseDown
expr: pg_up == 0
for: 1m
labels:
severity: critical
annotations:
summary: "PostgreSQL is down"
description: "PostgreSQL database is not responding"
5. Backup & Disaster Recovery
5.1 Backup Strategy Overview
┌─────────────────────────────────────────────────────────────────┐
│ BACKUP ARCHITECTURE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ PostgreSQL ──┬── pg_dump (daily) ──────────▶ S3 (30 days) │
│ │ │
│ └── WAL Archive (continuous) ──▶ S3 (7 days) │
│ │
│ Redis ───────── RDB Snapshot (hourly) ─────▶ Local + S3 │
│ │
│ Config ──────── Git + Encrypted Secrets ───▶ GitHub + Vault │
│ │
│ Logs ────────── Loki ──────────────────────▶ S3 (90 days) │
│ │
└─────────────────────────────────────────────────────────────────┘
5.2 Database Backup Configuration
5.2.1 Daily Full Backup (pg_dump)
#!/bin/bash
# /opt/hannibal/scripts/backup-database.sh
set -e
BACKUP_DIR="/var/backups/postgres"
S3_BUCKET="s3://hannibal-backups/postgres"
RETENTION_DAYS=30
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="hannibal_${TIMESTAMP}.sql.gz"
# Create backup
docker exec hannibal-postgres pg_dump -U hannibal -d hannibal | gzip > "${BACKUP_DIR}/${BACKUP_FILE}"
# Upload to S3
aws s3 cp "${BACKUP_DIR}/${BACKUP_FILE}" "${S3_BUCKET}/${BACKUP_FILE}"
# Verify upload
aws s3 ls "${S3_BUCKET}/${BACKUP_FILE}" || exit 1
# Clean old local backups
find ${BACKUP_DIR} -name "hannibal_*.sql.gz" -mtime +7 -delete
# Clean old S3 backups (lifecycle policy handles this, but belt and suspenders)
aws s3 ls ${S3_BUCKET}/ | while read -r line; do
createDate=$(echo $line | awk '{print $1" "$2}')
createDate=$(date -d "$createDate" +%s)
olderThan=$(date -d "-${RETENTION_DAYS} days" +%s)
if [[ $createDate -lt $olderThan ]]; then
fileName=$(echo $line | awk '{print $4}')
aws s3 rm "${S3_BUCKET}/${fileName}"
fi
done
echo "Backup completed: ${BACKUP_FILE}"
5.2.2 Continuous WAL Archiving
# postgresql.conf additions
wal_level = replica
archive_mode = on
archive_command = 'aws s3 cp %p s3://hannibal-backups/wal/%f'
archive_timeout = 300 # Force a WAL segment switch (and archive) at least every 5 minutes
5.2.3 Point-in-Time Recovery (PITR)
Note: WAL replay requires a physical base backup (e.g. taken with pg_basebackup); the logical pg_dump backups above cannot serve as the PITR starting point. On PostgreSQL 12+, recovery is configured with restore_command in postgresql.conf plus an empty recovery.signal file (recovery.conf is obsolete).
#!/bin/bash
# Restore to a specific point in time
# /opt/hannibal/scripts/restore-pitr.sh
TARGET_TIME="2026-01-28 15:30:00"
BASE_BACKUP="basebackup_20260128_030000.tar.gz"  # physical backup, not pg_dump
# Stop services
docker compose down
# Download the base backup
aws s3 cp "s3://hannibal-backups/postgres/${BASE_BACKUP}" /tmp/
# Unpack it into the (empty) PostgreSQL data directory, then configure recovery:
#   restore_command = 'aws s3 cp s3://hannibal-backups/wal/%f %p'
#   recovery_target_time = '2026-01-28 15:30:00'
# Create recovery.signal and start PostgreSQL; it replays WAL from S3 up to
# the target time.
5.3 Backup Schedule
| Backup Type | Frequency | Time | Retention | Storage |
|---|---|---|---|---|
| PostgreSQL full | Daily | 03:00 UTC | 30 days | S3 |
| PostgreSQL WAL | Continuous | - | 7 days | S3 |
| Redis RDB | Hourly | :00 | 24 hours | Local + S3 |
| Configuration | On change | - | Forever | Git |
| Docker volumes | Daily | 04:00 UTC | 7 days | S3 |
5.4 Backup Verification
#!/bin/bash
# /opt/hannibal/scripts/verify-backup.sh
# Run weekly to verify backups are restorable
# Get latest backup
LATEST_BACKUP=$(aws s3 ls s3://hannibal-backups/postgres/ | sort | tail -n 1 | awk '{print $4}')
# Download and restore to test database
aws s3 cp "s3://hannibal-backups/postgres/${LATEST_BACKUP}" /tmp/
docker exec -i hannibal-postgres createdb -U hannibal hannibal_test || true
gunzip -c /tmp/${LATEST_BACKUP} | docker exec -i hannibal-postgres psql -U hannibal -d hannibal_test
# Verify data integrity
ORDERS_COUNT=$(docker exec hannibal-postgres psql -U hannibal -d hannibal_test -t -c "SELECT COUNT(*) FROM orders")
USERS_COUNT=$(docker exec hannibal-postgres psql -U hannibal -d hannibal_test -t -c "SELECT COUNT(*) FROM users")
# Clean up
docker exec hannibal-postgres dropdb -U hannibal hannibal_test
# Report
echo "Backup verification complete"
echo "Orders: ${ORDERS_COUNT}"
echo "Users: ${USERS_COUNT}"
5.5 Disaster Recovery Procedures
5.5.1 Recovery Time Objectives
| Scenario | RTO | RPO | Procedure |
|---|---|---|---|
| Container crash | 1 min | 0 | Auto-restart |
| Host reboot | 5 min | 0 | Systemd auto-start |
| Database corruption | 1 hour | 1 hour | PITR restore |
| Host failure | 2 hours | 1 hour | New host + restore |
| Data center failure | 4 hours | 1 hour | DR site activation |
| Ransomware/breach | 4 hours | 1 hour | Clean restore from backup |
5.5.2 Recovery Runbook
Database Recovery Procedure
Prerequisites
- Access to AWS S3
- Access to new/clean PostgreSQL instance
- Latest backup file name
Steps
1. Stop application services
docker compose stop backend frontend
2. Download latest backup
aws s3 cp s3://hannibal-backups/postgres/BACKUP_FILE.sql.gz /tmp/
3. Restore database
# Drop and recreate database
docker exec hannibal-postgres dropdb -U hannibal hannibal
docker exec hannibal-postgres createdb -U hannibal hannibal
# Restore from backup
gunzip -c /tmp/BACKUP_FILE.sql.gz | docker exec -i hannibal-postgres psql -U hannibal -d hannibal
4. Verify restoration
docker exec hannibal-postgres psql -U hannibal -d hannibal -c "SELECT COUNT(*) FROM orders"
5. Start application services
docker compose up -d backend frontend
6. Verify application health
curl https://hannibaldev.forsyt.io/api/health
6. Security
6.1 Network Security
6.1.1 Firewall Configuration (UFW)
# Default policies
ufw default deny incoming
ufw default allow outgoing
# Allow SSH (restricted to known IPs)
ufw allow from 203.0.113.0/24 to any port 22
# Allow HTTP/HTTPS
ufw allow 80/tcp
ufw allow 443/tcp
# Allow internal Docker network
ufw allow from 172.16.0.0/12
# Enable firewall
ufw enable
6.1.2 Cloudflare Configuration
| Setting | Value | Purpose |
|---|---|---|
| SSL Mode | Full (Strict) | End-to-end encryption |
| Always Use HTTPS | On | Force HTTPS |
| Minimum TLS Version | 1.2 | Security |
| WAF | On | Block attacks |
| Rate Limiting | 100 req/min per IP | DDoS protection |
| Bot Fight Mode | On | Block bad bots |
| Browser Integrity Check | On | Block suspicious requests |
6.1.3 Rate Limiting (Nginx)
# /etc/nginx/conf.d/rate-limiting.conf
# Define rate limit zones
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;
limit_req_zone $binary_remote_addr zone=login:10m rate=5r/m;
limit_req_zone $binary_remote_addr zone=orders:10m rate=30r/m;
# Apply to locations
location /api/ {
limit_req zone=api burst=20 nodelay;
proxy_pass http://backend:3001;
}
location /api/auth/login {
limit_req zone=login burst=5 nodelay;
proxy_pass http://backend:3001;
}
location /api/orders {
limit_req zone=orders burst=10 nodelay;
proxy_pass http://backend:3001;
}
6.2 Application Security
6.2.1 Security Headers (Nginx)
# Security headers
add_header X-Frame-Options "SAMEORIGIN" always;
add_header X-Content-Type-Options "nosniff" always;
add_header X-XSS-Protection "1; mode=block" always;
add_header Referrer-Policy "strict-origin-when-cross-origin" always;
add_header Content-Security-Policy "default-src 'self'; script-src 'self' 'unsafe-inline' 'unsafe-eval'; style-src 'self' 'unsafe-inline'; img-src 'self' data: https:; font-src 'self' data:; connect-src 'self' wss: https:;" always;
add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
add_header Permissions-Policy "geolocation=(), microphone=(), camera=()" always;
6.2.2 Input Validation
// All API inputs must be validated
import { z } from 'zod';
const placeOrderSchema = z.object({
fixtureId: z.string().min(1).max(50),
marketId: z.string().regex(/^1\.\d+$/), // Betfair market ID format
outcomeId: z.string().min(1).max(100),
stake: z.number().positive().max(10000),
odds: z.number().min(1.01).max(1000),
betType: z.enum(['back', 'lay']),
});
// Sanitize all string inputs
function sanitize(input: string): string {
return input.replace(/[<>\"'&]/g, '');
}
6.2.3 Authentication Security
| Measure | Implementation |
|---|---|
| Password Hashing | bcrypt with cost factor 12 |
| JWT Expiry | Access token: 15 min, Refresh token: 7 days |
| Session Management | Redis-backed sessions |
| Failed Login Lockout | 5 attempts, 15 min lockout |
| Password Requirements | Min 8 chars, 1 upper, 1 lower, 1 number |
| 2FA | TOTP (optional, recommended for admins) |
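The failed-login lockout policy above (5 attempts, 15-minute lockout) can be sketched in a few lines. This version uses an in-memory map for illustration; production would back it with Redis per the session-management row, and the function names are assumptions, not existing code:

```typescript
// Failed-login lockout: 5 attempts triggers a 15-minute lock.
const MAX_ATTEMPTS = 5;
const LOCKOUT_MS = 15 * 60 * 1000;
const attempts = new Map<string, { count: number; lockedUntil: number }>();

function recordFailedLogin(email: string, now = Date.now()): void {
  const entry = attempts.get(email) ?? { count: 0, lockedUntil: 0 };
  entry.count += 1;
  if (entry.count >= MAX_ATTEMPTS) entry.lockedUntil = now + LOCKOUT_MS;
  attempts.set(email, entry);
}

function isLockedOut(email: string, now = Date.now()): boolean {
  const entry = attempts.get(email);
  return !!entry && entry.lockedUntil > now;
}

function recordSuccessfulLogin(email: string): void {
  attempts.delete(email); // successful login resets the counter
}
```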
6.3 Secrets Management
6.3.1 Current Secrets Inventory
| Secret | Location | Rotation |
|---|---|---|
| Database password | .env | 90 days |
| Redis password | .env | 90 days |
| JWT secret | .env | 90 days |
| Betfair API key | .env | Never (API key) |
| Betfair credentials | .env | 90 days |
| Roanuz API key | .env | Never |
| AWS credentials | .env | 90 days |
6.3.2 Secrets Management Best Practices
# Never commit secrets to git
echo ".env" >> .gitignore
echo "*.pem" >> .gitignore
echo "*.key" >> .gitignore
# Use Docker secrets in production
# (docker secret create requires Swarm mode; with plain Docker Compose,
# define file-based secrets under the top-level "secrets:" key instead)
docker secret create db_password db_password.txt
docker secret create jwt_secret jwt_secret.txt
# Reference in docker-compose.yml
services:
backend:
secrets:
- db_password
- jwt_secret
environment:
- DB_PASSWORD_FILE=/run/secrets/db_password
6.4 Audit Logging
// Log all security-relevant events
const auditEvents = [
'user.login',
'user.logout',
'user.login_failed',
'user.password_changed',
'user.created',
'user.deleted',
'order.placed',
'order.cancelled',
'admin.user_modified',
'admin.settings_changed',
'admin.manual_settlement',
'system.backup_completed',
'system.backup_failed',
];
// Audit log format
interface AuditLog {
timestamp: string;
event: string;
userId: string;
ipAddress: string;
userAgent: string;
details: Record<string, unknown>;
success: boolean;
}
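A sketch of constructing an entry in that shape, rejecting event names outside the whitelist so typos don't silently create new event types (buildAuditLog and the trimmed event list are illustrative, not existing code):

```typescript
// Build an audit entry, validating the event name against the whitelist.
const KNOWN_AUDIT_EVENTS = new Set([
  'user.login', 'user.logout', 'user.login_failed', 'user.password_changed',
  'order.placed', 'order.cancelled', 'admin.manual_settlement',
]);

interface AuditEntry {
  timestamp: string;
  event: string;
  userId: string;
  ipAddress: string;
  userAgent: string;
  details: Record<string, unknown>;
  success: boolean;
}

function buildAuditLog(
  event: string,
  userId: string,
  ipAddress: string,
  userAgent: string,
  details: Record<string, unknown>,
  success: boolean,
): AuditEntry {
  if (!KNOWN_AUDIT_EVENTS.has(event)) {
    throw new Error(`Unknown audit event: ${event}`);
  }
  return { timestamp: new Date().toISOString(), event, userId, ipAddress, userAgent, details, success };
}
```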
7. Infrastructure & Resource Management
7.1 Docker Configuration
7.1.1 Resource Limits
# docker-compose.prod.yml
services:
backend:
deploy:
resources:
limits:
cpus: '2'
memory: 2G
reservations:
cpus: '0.5'
memory: 512M
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:3001/api/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 40s
frontend:
deploy:
resources:
limits:
cpus: '1'
memory: 1G
reservations:
cpus: '0.25'
memory: 256M
postgres:
deploy:
resources:
limits:
cpus: '2'
memory: 4G
reservations:
cpus: '0.5'
memory: 1G
healthcheck:
test: ["CMD-SHELL", "pg_isready -U hannibal"]
interval: 10s
timeout: 5s
retries: 5
redis:
deploy:
resources:
limits:
cpus: '1'
memory: 1G
reservations:
cpus: '0.25'
memory: 256M
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 10s
timeout: 5s
retries: 5
7.1.2 Docker Daemon Configuration
// /etc/docker/daemon.json
{
"log-driver": "json-file",
"log-opts": {
"max-size": "100m",
"max-file": "5",
"compress": "true"
},
"storage-driver": "overlay2",
"live-restore": true,
"default-ulimits": {
"nofile": {
"Name": "nofile",
"Hard": 65536,
"Soft": 65536
}
}
}
7.2 Disk Management
7.2.1 Disk Space Monitoring
#!/bin/bash
# /opt/hannibal/scripts/disk-cleanup.sh
# Run daily via cron
# Clean unused Docker resources older than 7 days
# (do NOT pass --volumes here: it deletes any unused named volume, which
# can include database data while a container is stopped)
docker system prune -f --filter "until=168h"
# Clean old logs
find /var/log -name "*.gz" -mtime +30 -delete
find /var/log -name "*.log.*" -mtime +7 -delete
# Clean temp files
find /tmp -type f -mtime +7 -delete
# Clean old local backups (files only; S3 keeps the 30-day copies)
find /var/backups -type f -mtime +7 -delete
# Report disk usage
df -h | grep -E "^/dev" | while read line; do
usage=$(echo $line | awk '{print $5}' | sed 's/%//')
mount=$(echo $line | awk '{print $6}')
if [ $usage -gt 80 ]; then
echo "WARNING: ${mount} is ${usage}% full"
fi
done
7.2.2 PostgreSQL Maintenance
#!/bin/bash
# /opt/hannibal/scripts/postgres-maintenance.sh
# Run weekly
# Vacuum and analyze all tables
docker exec hannibal-postgres psql -U hannibal -d hannibal -c "VACUUM ANALYZE;"
# Reindex if needed (run monthly)
# docker exec hannibal-postgres psql -U hannibal -d hannibal -c "REINDEX DATABASE hannibal;"
# Check for bloat
docker exec hannibal-postgres psql -U hannibal -d hannibal -c "
SELECT schemaname, tablename,
pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) as size,
n_dead_tup as dead_tuples
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC;
"
7.3 Redis Configuration
# redis.conf
maxmemory 512mb
maxmemory-policy allkeys-lru
# Persistence
save 900 1
save 300 10
save 60 10000
# Append-only file
appendonly yes
appendfsync everysec
# Logging
loglevel notice
logfile /var/log/redis/redis.log
7.4 PostgreSQL Tuning
# postgresql.conf optimizations for 4GB RAM server
# Memory
shared_buffers = 1GB
effective_cache_size = 3GB
work_mem = 16MB
maintenance_work_mem = 256MB
# Connections
max_connections = 100
# WAL
wal_buffers = 64MB
checkpoint_completion_target = 0.9
max_wal_size = 2GB
min_wal_size = 1GB
# Query Planning
random_page_cost = 1.1
effective_io_concurrency = 200
# Logging
log_min_duration_statement = 1000 # Log queries > 1 second
log_checkpoints = on
log_connections = on
log_disconnections = on
log_lock_waits = on
# Autovacuum
autovacuum = on
autovacuum_max_workers = 3
autovacuum_naptime = 1min
8. Performance Optimization
8.1 Database Performance
8.1.1 Essential Indexes
-- Orders table indexes
CREATE INDEX idx_orders_status ON orders(status) WHERE status = 'accepted';
CREATE INDEX idx_orders_user_id ON orders(user_id);
CREATE INDEX idx_orders_fixture_id ON orders(fixture_id);
CREATE INDEX idx_orders_created_at ON orders(created_at DESC);
CREATE INDEX idx_orders_market_id ON orders(market_id);
CREATE INDEX idx_orders_routing_venue_status ON orders(routing_venue, status);
-- Composite index for settlement job
CREATE INDEX idx_orders_settlement ON orders(routing_venue, status, market_id)
WHERE status = 'accepted';
-- Users table
CREATE INDEX idx_users_email ON users(email);
CREATE INDEX idx_users_agent_id ON users(agent_id);
8.1.2 Query Optimization Checklist
- All queries use indexes (check with EXPLAIN ANALYZE)
- No N+1 queries (use eager loading)
- Pagination on all list endpoints
- Connection pooling configured
- Prepared statements used
- No SELECT * in production code
8.2 Caching Strategy
8.2.1 Cache Layers
| Layer | Data | TTL | Invalidation |
|---|---|---|---|
| Redis | Fixtures list | 30s | On update |
| Redis | Market odds | 5s | Streaming update |
| Redis | User sessions | 24h | On logout |
| Redis | Betfair session | 12h | On expiry |
| CDN | Static assets | 1 year | Version hash |
| Browser | API responses | 0 | No-cache |
8.2.2 Cache Implementation
// Redis caching pattern
class CacheService {
async get<T>(key: string): Promise<T | null> {
const cached = await redis.get(key);
return cached ? JSON.parse(cached) : null;
}
async set(key: string, value: unknown, ttlSeconds: number): Promise<void> {
await redis.setex(key, ttlSeconds, JSON.stringify(value));
}
async invalidate(pattern: string): Promise<void> {
// Use SCAN instead of KEYS: KEYS blocks Redis while it walks the entire
// keyspace and should not be used in production
let cursor = '0';
do {
const [nextCursor, keys] = await redis.scan(cursor, 'MATCH', pattern, 'COUNT', 100);
if (keys.length > 0) {
await redis.del(...keys);
}
cursor = nextCursor;
} while (cursor !== '0');
}
}
// Usage
const fixtures = await cache.get('fixtures:live:1');
if (!fixtures) {
const data = await fetchFixtures();
await cache.set('fixtures:live:1', data, 30);
}
8.3 API Performance
8.3.1 Response Time Targets
| Endpoint | Target p50 | Target p95 | Target p99 |
|---|---|---|---|
| GET /api/health | 5ms | 10ms | 50ms |
| GET /api/fixtures | 50ms | 150ms | 300ms |
| GET /api/fixtures/:id | 30ms | 100ms | 200ms |
| POST /api/orders | 100ms | 300ms | 500ms |
| GET /api/orders | 50ms | 150ms | 300ms |
8.3.2 Performance Checklist
- Gzip compression enabled
- HTTP/2 enabled
- Keep-alive connections
- Response pagination
- Field filtering (sparse fieldsets)
- Async processing for heavy operations
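The "response pagination" item above starts with clamping client-supplied parameters before they reach SQL, so a single request can never ask for an unbounded result set. A minimal sketch (the 100-row cap is an assumed policy; keyset pagination scales better for deep pages, but offset pagination is the simpler baseline):

```typescript
// Clamp page/pageSize into safe LIMIT/OFFSET values.
function paginationParams(
  page: number,
  pageSize: number,
  maxPageSize = 100,
): { limit: number; offset: number } {
  const size = Math.min(Math.max(1, Math.floor(pageSize) || 1), maxPageSize);
  const p = Math.max(1, Math.floor(page) || 1);
  return { limit: size, offset: (p - 1) * size };
}
```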
9. Deployment & CI/CD
9.1 GitHub Actions Pipeline
# .github/workflows/deploy.yml
name: Deploy to Production
on:
push:
branches: [main]
workflow_dispatch:
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Setup Node.js
uses: actions/setup-node@v4
with:
node-version: '20'
cache: 'npm'
- name: Install dependencies
run: |
cd backend && npm ci
cd ../frontend && npm ci
- name: Run tests
run: |
cd backend && npm test
cd ../frontend && npm test
- name: Type check
run: |
cd backend && npm run typecheck
cd ../frontend && npm run typecheck
security-scan:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Run Snyk security scan
uses: snyk/actions/node@master
env:
SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}
build:
needs: [test, security-scan]
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Build Docker images
run: |
docker build -t hannibal-backend:${{ github.sha }} ./backend
docker build -t hannibal-frontend:${{ github.sha }} ./frontend
- name: Push to registry
run: |
docker push hannibal-backend:${{ github.sha }}
docker push hannibal-frontend:${{ github.sha }}
deploy:
needs: build
runs-on: ubuntu-latest
environment: production
steps:
- name: Deploy to production
uses: appleboy/ssh-action@master
with:
host: ${{ secrets.PROD_HOST }}
username: ${{ secrets.PROD_USER }}
key: ${{ secrets.PROD_SSH_KEY }}
script: |
cd /opt/hannibal
git pull
docker compose -f docker-compose.prod.yml pull
docker compose -f docker-compose.prod.yml up -d
- name: Health check
run: |
sleep 30
curl -f https://hannibaldev.forsyt.io/api/health || exit 1
- name: Notify Slack
uses: slackapi/slack-github-action@v1
with:
payload: |
{
"text": "✅ Deployed to production: ${{ github.sha }}"
}
env:
SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
9.2 Rollback Procedure
#!/bin/bash
# /opt/hannibal/scripts/rollback.sh
# Get previous version
PREVIOUS_VERSION=$(docker images hannibal-backend --format "{{.Tag}}" | head -2 | tail -1)
echo "Rolling back to version: ${PREVIOUS_VERSION}"
# Update docker-compose to use previous version
sed -i "s/hannibal-backend:.*/hannibal-backend:${PREVIOUS_VERSION}/" docker-compose.prod.yml
sed -i "s/hannibal-frontend:.*/hannibal-frontend:${PREVIOUS_VERSION}/" docker-compose.prod.yml
# Restart services
docker compose -f docker-compose.prod.yml up -d
# Verify health
sleep 30
curl -f http://localhost:3001/api/health || echo "WARNING: Health check failed"
echo "Rollback complete"
9.3 Database Migrations
// Always use backward-compatible migrations
// BAD: Rename column directly
// ALTER TABLE orders RENAME COLUMN old_name TO new_name;
// GOOD: Add new column, migrate data, then remove old
// Step 1: Add new column
// ALTER TABLE orders ADD COLUMN new_name VARCHAR(255);
// Step 2: Migrate data (in application)
// UPDATE orders SET new_name = old_name WHERE new_name IS NULL;
// Step 3: Remove old column (after deployment is stable)
// ALTER TABLE orders DROP COLUMN old_name;
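Step 2 above (the data backfill) is safest run in small, bounded batches so the migration never holds long row locks or bloats a single transaction. A minimal sketch of a batch runner — the per-batch UPDATE is injected so it works with any client; with Prisma it could be something like `() => prisma.$executeRaw`UPDATE orders SET new_name = old_name WHERE id IN (SELECT id FROM orders WHERE new_name IS NULL LIMIT 1000)`` (column names are the example's, not real schema):

```typescript
// Runs an UPDATE-in-batches backfill until no rows remain.
// runBatch executes one bounded UPDATE and returns the affected row count.
async function backfillInBatches(
  runBatch: () => Promise<number>,
  pauseMs = 100
): Promise<number> {
  let total = 0;
  for (;;) {
    const updated = await runBatch();
    total += updated;
    if (updated === 0) return total; // nothing left to migrate
    // Brief pause keeps the backfill from starving foreground queries.
    await new Promise(resolve => setTimeout(resolve, pauseMs));
  }
}
```

Because each batch commits independently, the backfill can be interrupted and resumed at any point — the `WHERE new_name IS NULL` predicate makes it idempotent.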
10. Analytics & Business Intelligence
10.1 Google Analytics 4 Setup
// Frontend analytics integration
import { gtag } from './gtag';
// Track page views
gtag('config', 'G-XXXXXXXXXX', {
page_path: window.location.pathname,
});
// Track events
gtag('event', 'place_order', {
event_category: 'betting',
event_label: fixture.homeTeam + ' vs ' + fixture.awayTeam,
value: order.stake,
});
gtag('event', 'login', {
method: 'email',
});
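A thin typed wrapper keeps event names and payload shapes consistent across the frontend, so misspelled events fail at compile time instead of silently polluting GA. This is a sketch; the event map below only covers the two examples above and would grow with the app:

```typescript
// Hypothetical event map: one entry per GA event the app emits.
type AnalyticsEvents = {
  place_order: { event_category: 'betting'; event_label: string; value: number };
  login: { method: 'email' | 'google' };
};

// Builds a (name, params) pair with the payload type checked against the map.
function buildEvent<K extends keyof AnalyticsEvents>(
  name: K,
  params: AnalyticsEvents[K]
): [K, AnalyticsEvents[K]] {
  return [name, params];
}

// Usage: spread the pair into gtag('event', ...).
const [evName, evParams] = buildEvent('place_order', {
  event_category: 'betting',
  event_label: 'Arsenal vs Chelsea',
  value: 25,
});
```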
10.2 Business Metrics to Track
| Metric | Description | Query/Calculation |
|---|---|---|
| DAU | Daily Active Users | Unique users with activity per day |
| Orders/Day | Total orders placed | COUNT(orders) WHERE date = today |
| GGR | Gross Gaming Revenue | SUM(stake) - SUM(payouts) |
| Handle | Total amount wagered | SUM(stake) |
| Hold % | House edge realized | GGR / Handle * 100 |
| B-Book Exposure | Current liability | SUM(potential_payout) for unsettled B-Book |
| Settlement Rate | Orders settled on time | Settled within 1hr / Total settled |
| User Retention | 7-day retention | Users active day 7 / Users signed up day 0 |
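The GGR and Hold % formulas in the table can be computed directly from settled orders. A sketch, assuming `payout` is the total amount returned to the user (stake × odds for a win, 0 for a loss):

```typescript
type SettledOrder = { stake: number; payout: number };

// Handle = total wagered; GGR = handle minus payouts; Hold % = GGR / handle.
function businessMetrics(orders: SettledOrder[]) {
  const handle = orders.reduce((sum, o) => sum + o.stake, 0);
  const payouts = orders.reduce((sum, o) => sum + o.payout, 0);
  const ggr = handle - payouts;
  const holdPct = handle > 0 ? (ggr / handle) * 100 : 0;
  return { handle, ggr, holdPct };
}

// Example: one losing bet (house keeps 100) and one winning bet (house pays out 150 on a 100 stake).
const metrics = businessMetrics([
  { stake: 100, payout: 0 },
  { stake: 100, payout: 150 },
]);
// → handle 200, GGR 50, hold 25%
```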
10.3 Business Dashboard Queries
-- Daily summary
SELECT
DATE(created_at) as date,
COUNT(*) as total_orders,
COUNT(DISTINCT user_id) as unique_users,
SUM(stake) as total_stake,
SUM(CASE WHEN settlement_outcome = 'win' THEN -profit_loss ELSE profit_loss END) as ggr
FROM orders
WHERE created_at >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY DATE(created_at)
ORDER BY date DESC;
-- B-Book exposure
SELECT
SUM(stake * (odds - 1)) as max_exposure
FROM orders
WHERE routing_venue = 'bbook'
AND status = 'accepted';
-- Settlement performance
SELECT
AVG(EXTRACT(EPOCH FROM (settled_at - created_at))) as avg_settlement_seconds
FROM orders
WHERE status = 'settled'
AND settled_at >= CURRENT_DATE - INTERVAL '7 days';
11. High Availability & Reliability
11.1 Current Architecture (Single Node)
┌─────────────┐
│ Cloudflare │
│ (CDN) │
└──────┬──────┘
│
┌──────▼──────┐
│ Nginx │
│ (Proxy) │
└──────┬──────┘
│
┌────────────┼────────────┐
│ │ │
┌──────▼──────┐ ┌───▼───┐ ┌──────▼──────┐
│ Frontend │ │Backend│ │ Backend │
│ (Next.js) │ │ (API) │ │ (Jobs) │
└─────────────┘ └───┬───┘ └──────┬──────┘
│ │
┌────────────┼────────────┘
│ │
┌──────▼──────┐ ┌───▼───┐
│ PostgreSQL │ │ Redis │
└─────────────┘ └───────┘
11.2 Future HA Architecture
┌─────────────┐
│ Cloudflare │
│ (CDN + WAF) │
└──────┬──────┘
│
┌────────────┼────────────┐
│ │ │
┌──────▼──────┐ │ ┌──────▼──────┐
│ Nginx 1 │ │ │ Nginx 2 │
│ (Primary) │ │ │ (Standby) │
└──────┬──────┘ │ └──────┬──────┘
│ │ │
└────────────┼────────────┘
│
┌──────▼──────┐
│ HAProxy │
│ (LB) │
└──────┬──────┘
│
┌─────────────────┼─────────────────┐
│ │ │
┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
│ Backend 1 │ │ Backend 2 │ │ Backend 3 │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │
└─────────────────┼─────────────────┘
│
┌─────────────────┼─────────────────┐
│ │ │
┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
│ Postgres │ │ Postgres │ │ Redis │
│ (Primary) │◄──│ (Replica) │ │ (Sentinel) │
└─────────────┘ └─────────────┘ └─────────────┘
11.3 Reliability Patterns
11.3.1 Circuit Breaker (Betfair API)
import CircuitBreaker from 'opossum';
const betfairBreaker = new CircuitBreaker(callBetfairAPI, {
timeout: 10000, // 10 second timeout
errorThresholdPercentage: 50, // Open circuit if 50% fail
resetTimeout: 30000, // Try again after 30 seconds
volumeThreshold: 10, // Minimum 10 requests before tripping
});
betfairBreaker.on('open', () => {
logger.warn('Betfair circuit breaker opened');
alerting.send('warning', 'Betfair API circuit breaker opened');
});
betfairBreaker.on('halfOpen', () => {
logger.info('Betfair circuit breaker half-open, testing...');
});
betfairBreaker.on('close', () => {
logger.info('Betfair circuit breaker closed');
});
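For reference, the state machine opossum manages can be sketched without the library. This is a deliberately simplified illustration (a consecutive-failure counter instead of opossum's error-percentage window, and no volume threshold); the class and option names are this sketch's, not opossum's:

```typescript
type BreakerState = 'closed' | 'open' | 'halfOpen';

class SimpleBreaker<T> {
  private state: BreakerState = 'closed';
  private failures = 0;
  private openedAt = 0;

  constructor(
    private fn: () => Promise<T>,
    private failureThreshold = 5,   // consecutive failures before opening
    private resetTimeout = 30_000   // ms to wait before allowing a probe
  ) {}

  async fire(): Promise<T> {
    if (this.state === 'open') {
      if (Date.now() - this.openedAt < this.resetTimeout) {
        throw new Error('Circuit open: failing fast');
      }
      this.state = 'halfOpen'; // reset window elapsed: let one probe through
    }
    try {
      const result = await this.fn();
      this.state = 'closed'; // probe (or normal call) succeeded
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures++;
      if (this.state === 'halfOpen' || this.failures >= this.failureThreshold) {
        this.state = 'open';
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}
```

The key property is the fail-fast path: while the circuit is open, callers get an immediate error without touching the failing dependency, which is what protects Betfair (and your own event loop) during an outage.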
11.3.2 Retry with Exponential Backoff
async function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}

async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries: number = 3,
  baseDelay: number = 1000
): Promise<T> {
  let lastError: Error | undefined;
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error as Error;
      if (attempt === maxRetries - 1) break; // final attempt failed: rethrow, don't sleep
      const delay = baseDelay * Math.pow(2, attempt); // 1s, 2s, 4s, ...
      logger.warn(`Attempt ${attempt + 1} failed, retrying in ${delay}ms`);
      await sleep(delay);
    }
  }
  throw lastError!;
}
11.3.3 Graceful Degradation
// If Betfair is down, show cached odds with warning
async function getOdds(fixtureId: string) {
try {
return await betfairAdapter.getOdds(fixtureId);
} catch (error) {
logger.error('Failed to get live odds, using cache', error);
const cached = await cache.get(`odds:${fixtureId}`);
if (cached) {
return {
...cached,
stale: true,
staleSince: cached.timestamp,
};
}
throw new ServiceUnavailableError('Odds temporarily unavailable');
}
}
11.4 Health Checks
// /api/health endpoint
app.get('/api/health', async (req, res) => {
const checks = {
database: await checkDatabase(),
redis: await checkRedis(),
betfair: await checkBetfairSession(),
};
const healthy = Object.values(checks).every(c => c.status === 'healthy');
res.status(healthy ? 200 : 503).json({
status: healthy ? 'healthy' : 'unhealthy',
timestamp: new Date().toISOString(),
checks,
});
});
async function checkDatabase(): Promise<HealthCheck> {
  const start = Date.now();
  try {
    await prisma.$queryRaw`SELECT 1`;
    return { status: 'healthy', latency: Date.now() - start };
  } catch (error) {
    return { status: 'unhealthy', error: (error as Error).message };
  }
}
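The per-dependency checks all share the same shape, so a generic helper avoids repeating the timing and error handling. A sketch (the commented probes are assumptions matching the checks above):

```typescript
type HealthCheck =
  | { status: 'healthy'; latency: number }
  | { status: 'unhealthy'; error: string };

// Wraps any async probe with latency measurement and error capture.
async function timedCheck(probe: () => Promise<void>): Promise<HealthCheck> {
  const start = Date.now();
  try {
    await probe();
    return { status: 'healthy', latency: Date.now() - start };
  } catch (error) {
    return { status: 'unhealthy', error: (error as Error).message };
  }
}

// Usage sketch:
// const database = await timedCheck(async () => { await prisma.$queryRaw`SELECT 1`; });
// const redis = await timedCheck(async () => { await redisClient.ping(); });
```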
12. Operational Runbooks
12.1 Runbook: Service Restart
````markdown
## Service Restart Runbook

### When to Use
- Service is unresponsive
- Memory leak detected
- After configuration change

### Procedure
1. **Check current status**
   ```bash
   docker compose ps
   docker logs hannibal-backend --tail 100
   ```
2. **Restart specific service**
   ```bash
   docker compose restart backend
   ```
3. **Verify service is healthy**
   ```bash
   curl http://localhost:3001/api/health
   docker logs hannibal-backend --tail 20
   ```
4. **If restart fails, recreate container**
   ```bash
   docker compose up -d --force-recreate backend
   ```

### Escalation
If the service won't start after 3 attempts, escalate to the on-call engineer.
````
### 12.2 Runbook: Database Connection Issues
````markdown
## Database Connection Issues Runbook

### Symptoms
- "Connection refused" errors
- "Too many connections" errors
- Slow queries

### Diagnosis
1. **Check PostgreSQL status**
   ```bash
   docker exec hannibal-postgres pg_isready
   ```
2. **Check connection count**
   ```bash
   docker exec hannibal-postgres psql -U hannibal -c "SELECT count(*) FROM pg_stat_activity;"
   ```
3. **Check for blocking queries**
   ```bash
   docker exec hannibal-postgres psql -U hannibal -c "
   SELECT pid, now() - pg_stat_activity.query_start AS duration, query
   FROM pg_stat_activity
   WHERE state != 'idle'
   ORDER BY duration DESC
   LIMIT 10;
   "
   ```

### Resolution
**Too many connections:**
```bash
# Kill idle connections
docker exec hannibal-postgres psql -U hannibal -c "
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND query_start < now() - interval '10 minutes';
"
```

**Blocking query:**
```bash
# Kill the specific query, using the pid found during diagnosis
docker exec hannibal-postgres psql -U hannibal -c "SELECT pg_terminate_backend(<PID>);"
```
````
### 12.3 Runbook: Betfair Session Recovery
````markdown
## Betfair Session Recovery Runbook

### Symptoms
- "Session expired" errors in logs
- Orders failing to place
- No odds updates

### Diagnosis
1. **Check Betfair session status**
   ```bash
   docker logs hannibal-backend --tail 100 | grep -i betfair
   ```
2. **Check session in Redis**
   ```bash
   docker exec hannibal-redis redis-cli GET betfair:session
   ```

### Resolution
1. **Force session refresh**
   ```bash
   # Clear cached session
   docker exec hannibal-redis redis-cli DEL betfair:session
   # Restart backend to trigger re-login
   docker compose restart backend
   ```
2. **Verify new session**
   ```bash
   docker logs hannibal-backend --tail 50 | grep -i "logged in"
   ```

### If Login Fails
- Check Betfair credentials in .env
- Check if Betfair API is down: https://status.betfair.com
- Check if IP is blocked (contact Betfair support)
````
### 12.4 Runbook: Disk Space Emergency
````markdown
## Disk Space Emergency Runbook

### Symptoms
- Disk usage > 90% alert
- Services failing to write

### Immediate Actions
1. **Check disk usage**
   ```bash
   df -h
   du -sh /* | sort -hr | head -20
   ```
2. **Quick cleanup**
   ```bash
   # Docker cleanup
   # CAUTION: --volumes removes any volume not attached to a running container;
   # make sure postgres/redis are running before using it
   docker system prune -af --volumes
   # Clear old logs
   find /var/log -name "*.gz" -delete
   truncate -s 0 /var/log/*.log
   # Clear temp files
   rm -rf /tmp/*
   ```
3. **Check large files**
   ```bash
   find / -type f -size +100M -exec ls -lh {} \;
   ```

### If Still Critical
- Delete old database backups (verify off-site copies in S3 first):
  ```bash
  rm /var/backups/postgres/hannibal_*.sql.gz
  ```
- Expand disk volume (cloud provider console)
- Move data to larger volume
````
---
## 13. Implementation Roadmap
### Phase 1: Critical (Week 1-2) 🔴
| Task | Effort | Owner | Status |
|------|--------|-------|--------|
| Configure Docker log rotation | 1h | DevOps | ⬜ |
| Set up daily database backups to S3 | 4h | DevOps | ⬜ |
| Deploy Prometheus + Node Exporter | 4h | DevOps | ⬜ |
| Deploy Grafana with basic dashboards | 4h | DevOps | ⬜ |
| Configure critical alerts (disk, memory, service down) | 4h | DevOps | ⬜ |
| Set up Uptime Kuma for external monitoring | 2h | DevOps | ⬜ |
| Configure UFW firewall | 2h | DevOps | ⬜ |
| Add resource limits to Docker containers | 2h | DevOps | ⬜ |
### Phase 2: Important (Week 3-4) 🟡
| Task | Effort | Owner | Status |
|------|--------|-------|--------|
| Deploy Loki + Promtail for log aggregation | 4h | DevOps | ⬜ |
| Add application metrics to backend | 8h | Backend | ⬜ |
| Create business metrics dashboard | 4h | DevOps | ⬜ |
| Set up backup verification automation | 4h | DevOps | ⬜ |
| Implement security headers in Nginx | 2h | DevOps | ⬜ |
| Set up GitHub Actions CI/CD pipeline | 8h | DevOps | ⬜ |
| Configure Cloudflare WAF rules | 4h | DevOps | ⬜ |
| Add database indexes for performance | 4h | Backend | ⬜ |
### Phase 3: Enhancement (Month 2) 🟢
| Task | Effort | Owner | Status |
|------|--------|-------|--------|
| Implement distributed tracing (Jaeger) | 8h | Backend | ⬜ |
| Add circuit breaker for Betfair API | 4h | Backend | ⬜ |
| Create operational runbooks | 8h | DevOps | ⬜ |
| Set up Google Analytics 4 | 4h | Frontend | ⬜ |
| Implement audit logging | 8h | Backend | ⬜ |
| Performance optimization (caching, queries) | 16h | Backend | ⬜ |
| Security audit and penetration testing | 16h | Security | ⬜ |
### Phase 4: Scale (Month 3+) 🔵
| Task | Effort | Owner | Status |
|------|--------|-------|--------|
| Set up PostgreSQL replication | 8h | DevOps | ⬜ |
| Configure Redis Sentinel | 4h | DevOps | ⬜ |
| Deploy multiple backend instances | 8h | DevOps | ⬜ |
| Set up load balancer | 4h | DevOps | ⬜ |
| Implement chaos testing | 16h | DevOps | ⬜ |
| Multi-region failover setup | 24h | DevOps | ⬜ |
---
## 14. Appendix: Configuration Examples
### 14.1 Complete docker-compose.prod.yml
```yaml
version: '3.8'
services:
backend:
image: hannibal-backend:latest
restart: unless-stopped
environment:
- NODE_ENV=production
- DATABASE_URL=postgresql://hannibal:${DB_PASSWORD}@postgres:5432/hannibal
- REDIS_URL=redis://redis:6379
depends_on:
postgres:
condition: service_healthy
redis:
condition: service_healthy
deploy:
resources:
limits:
cpus: '2'
memory: 2G
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:3001/api/health"]
interval: 30s
timeout: 10s
retries: 3
logging:
driver: "json-file"
options:
max-size: "100m"
max-file: "5"
frontend:
image: hannibal-frontend:latest
restart: unless-stopped
environment:
- NODE_ENV=production
deploy:
resources:
limits:
cpus: '1'
memory: 1G
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:3000"]
interval: 30s
timeout: 10s
retries: 3
logging:
driver: "json-file"
options:
max-size: "100m"
max-file: "5"
postgres:
image: postgres:15-alpine
restart: unless-stopped
environment:
- POSTGRES_USER=hannibal
- POSTGRES_PASSWORD=${DB_PASSWORD}
- POSTGRES_DB=hannibal
volumes:
- postgres_data:/var/lib/postgresql/data
- ./postgres/postgresql.conf:/etc/postgresql/postgresql.conf
command: postgres -c config_file=/etc/postgresql/postgresql.conf
deploy:
resources:
limits:
cpus: '2'
memory: 4G
healthcheck:
test: ["CMD-SHELL", "pg_isready -U hannibal"]
interval: 10s
timeout: 5s
retries: 5
logging:
driver: "json-file"
options:
max-size: "100m"
max-file: "5"
redis:
image: redis:7-alpine
restart: unless-stopped
command: redis-server /usr/local/etc/redis/redis.conf
volumes:
- redis_data:/data
- ./redis/redis.conf:/usr/local/etc/redis/redis.conf
deploy:
resources:
limits:
cpus: '1'
memory: 1G
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 10s
timeout: 5s
retries: 5
logging:
driver: "json-file"
options:
max-size: "50m"
max-file: "3"
prometheus:
image: prom/prometheus:latest
restart: unless-stopped
volumes:
- ./prometheus:/etc/prometheus
- prometheus_data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.retention.time=30d'
ports:
- "9090:9090"
grafana:
image: grafana/grafana:latest
restart: unless-stopped
volumes:
- grafana_data:/var/lib/grafana
- ./grafana/provisioning:/etc/grafana/provisioning
environment:
- GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
- GF_USERS_ALLOW_SIGN_UP=false
ports:
- "3000:3000"
loki:
image: grafana/loki:latest
restart: unless-stopped
volumes:
- ./loki:/etc/loki
- loki_data:/loki
command: -config.file=/etc/loki/loki-config.yml
promtail:
image: grafana/promtail:latest
restart: unless-stopped
volumes:
- ./promtail:/etc/promtail
- /var/lib/docker/containers:/var/lib/docker/containers:ro
- /var/run/docker.sock:/var/run/docker.sock
command: -config.file=/etc/promtail/promtail-config.yml
node-exporter:
image: prom/node-exporter:latest
restart: unless-stopped
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
command:
- '--path.procfs=/host/proc'
- '--path.sysfs=/host/sys'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
postgres-exporter:
image: prometheuscommunity/postgres-exporter:latest
restart: unless-stopped
environment:
- DATA_SOURCE_NAME=postgresql://hannibal:${DB_PASSWORD}@postgres:5432/hannibal?sslmode=disable
redis-exporter:
image: oliver006/redis_exporter:latest
restart: unless-stopped
environment:
- REDIS_ADDR=redis://redis:6379
volumes:
postgres_data:
redis_data:
prometheus_data:
grafana_data:
loki_data:
```

### 14.2 Prometheus Configuration
# prometheus/prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093  # requires an Alertmanager service (not shown in the 14.1 compose file)
rule_files:
- /etc/prometheus/alerts/*.yml
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- job_name: 'node'
static_configs:
- targets: ['node-exporter:9100']
- job_name: 'postgres'
static_configs:
- targets: ['postgres-exporter:9187']
- job_name: 'redis'
static_configs:
- targets: ['redis-exporter:9121']
- job_name: 'backend'
static_configs:
- targets: ['backend:3001']
metrics_path: /metrics
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']  # requires a cAdvisor service (not shown in the 14.1 compose file)
### 14.3 Crontab for Maintenance Tasks
# /etc/cron.d/hannibal
# Database backup - daily at 3 AM
0 3 * * * root /opt/hannibal/scripts/backup-database.sh >> /var/log/hannibal/backup.log 2>&1
# Backup verification - weekly on Sunday at 4 AM
0 4 * * 0 root /opt/hannibal/scripts/verify-backup.sh >> /var/log/hannibal/backup-verify.log 2>&1
# Disk cleanup - daily at 2 AM
0 2 * * * root /opt/hannibal/scripts/disk-cleanup.sh >> /var/log/hannibal/cleanup.log 2>&1
# PostgreSQL maintenance - weekly on Sunday at 5 AM
0 5 * * 0 root /opt/hannibal/scripts/postgres-maintenance.sh >> /var/log/hannibal/postgres.log 2>&1
# SSL certificate renewal check - daily at 6 AM
0 6 * * * root certbot renew --quiet --post-hook "nginx -s reload"
# Docker cleanup - weekly on Saturday at 3 AM
0 3 * * 6 root docker system prune -af >> /var/log/hannibal/docker-cleanup.log 2>&1
Document History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2026-01-28 | AI Assistant | Initial document |