
Hannibal Production Readiness Guide

Version: 1.0
Last Updated: January 28, 2026
Status: Planning

This document outlines all the requirements, configurations, and best practices needed to make Hannibal a production-ready, rock-solid betting platform that can run autonomously for years with minimal human intervention.


Table of Contents

  1. Executive Summary
  2. Current State Assessment
  3. Monitoring & Observability
  4. Alerting Strategy
  5. Backup & Disaster Recovery
  6. Security
  7. Infrastructure & Resource Management
  8. Performance Optimization
  9. Deployment & CI/CD
  10. Analytics & Business Intelligence
  11. High Availability & Reliability
  12. Operational Runbooks
  13. Implementation Roadmap
  14. Appendix: Configuration Examples

1. Executive Summary

1.1 Goals

Hannibal must be:

  • Reliable: 99.9% uptime (< 8.76 hours downtime/year)
  • Resilient: Auto-recover from failures without human intervention
  • Observable: Full visibility into system health and business metrics
  • Secure: Protected against attacks and data breaches
  • Recoverable: Ability to restore from any disaster within defined RTO/RPO
  • Scalable: Handle 10x current load without architecture changes

1.2 Key Metrics Targets

| Metric | Target |
|--------|--------|
| Uptime | 99.9% |
| API Response Time (p95) | < 500ms |
| Order Placement Latency | < 1 second |
| Settlement Time | < 5 minutes after match end |
| Recovery Time Objective (RTO) | < 1 hour |
| Recovery Point Objective (RPO) | < 1 hour |
| Mean Time to Detection (MTTD) | < 2 minutes |
| Mean Time to Recovery (MTTR) | < 30 minutes |

1.3 Technology Stack Overview

┌─────────────────────────────────────────────────────────────────┐
│ MONITORING LAYER │
│ Prometheus │ Grafana │ Loki │ Uptime Kuma │ Jaeger │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ APPLICATION LAYER │
│ Frontend (Next.js) │ Backend (Node.js/Express) │ Jobs │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ DATA LAYER │
│ PostgreSQL │ Redis │ Betfair API │ Roanuz API │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ INFRASTRUCTURE LAYER │
│ Docker │ Nginx │ Certbot │ UFW │ Systemd │
└─────────────────────────────────────────────────────────────────┘

2. Current State Assessment

2.1 What We Have

| Component | Status | Notes |
|-----------|--------|-------|
| Docker Compose deployment | ✅ Working | Single host |
| PostgreSQL database | ✅ Working | No backups configured |
| Redis cache | ✅ Working | No memory limits |
| Nginx reverse proxy | ✅ Working | Basic config |
| SSL certificates | ✅ Working | Certbot auto-renewal |
| Backend API | ✅ Working | No metrics exposed |
| Frontend | ✅ Working | No analytics |
| Betfair integration | ✅ Working | Session management needs monitoring |
| Settlement jobs | ✅ Working | No failure alerts |

2.2 Critical Gaps

| Gap | Risk | Priority |
|-----|------|----------|
| No database backups | Data loss | 🔴 Critical |
| No monitoring | Blind to issues | 🔴 Critical |
| No alerting | Delayed response | 🔴 Critical |
| No log rotation | Disk exhaustion | 🔴 Critical |
| No resource limits | OOM crashes | 🟡 High |
| No uptime monitoring | Unknown downtime | 🟡 High |
| No security hardening | Vulnerabilities | 🟡 High |
| No CI/CD pipeline | Manual deployments | 🟢 Medium |
| No analytics | No business insights | 🟢 Medium |

3. Monitoring & Observability

3.1 Metrics Collection (Prometheus)

Prometheus will be the central metrics collection system, scraping metrics from all components.

3.1.1 Infrastructure Metrics (Node Exporter)

Collect from every host:

  • CPU: Usage, load average, steal time
  • Memory: Used, available, cached, swap
  • Disk: Usage, I/O, latency
  • Network: Bandwidth, packets, errors
  • File descriptors: Open files, limits

3.1.2 Database Metrics (postgres_exporter)

Key metrics:

```
pg_stat_activity_count           # Active connections
pg_stat_database_tup_fetched     # Rows read
pg_stat_database_tup_inserted    # Rows inserted
pg_stat_database_xact_commit     # Transactions committed
pg_stat_database_deadlocks       # Deadlock count
pg_replication_lag               # Replication lag (if replica)
pg_database_size_bytes           # Database size
pg_stat_user_tables_n_dead_tup   # Dead tuples (vacuum needed)
```

3.1.3 Redis Metrics (redis_exporter)

Key metrics:

```
redis_memory_used_bytes           # Memory consumption
redis_memory_max_bytes            # Memory limit
redis_connected_clients           # Client connections
redis_commands_processed_total    # Command throughput
redis_keyspace_hits_total         # Cache hits
redis_keyspace_misses_total       # Cache misses
redis_evicted_keys_total          # Evicted keys (memory pressure)
redis_rejected_connections        # Rejected connections
```

3.1.4 Application Metrics (Custom)

The backend should expose custom metrics at a /metrics endpoint:

```typescript
// Key application metrics to implement
const metrics = {
  // HTTP metrics
  http_requests_total: Counter,              // Total requests by method, path, status
  http_request_duration_seconds: Histogram,  // Request latency

  // Business metrics
  orders_placed_total: Counter,              // Orders by type, venue
  orders_settled_total: Counter,             // Settlements by outcome
  settlement_duration_seconds: Histogram,    // Time to settle
  bbook_exposure_dollars: Gauge,             // Current B-Book liability

  // External API metrics
  betfair_api_calls_total: Counter,          // Betfair API calls
  betfair_api_errors_total: Counter,         // Betfair API errors
  betfair_api_latency_seconds: Histogram,    // Betfair API latency
  betfair_session_status: Gauge,             // 1 = active, 0 = expired

  // Job metrics
  job_runs_total: Counter,                   // Job executions
  job_duration_seconds: Histogram,           // Job duration
  job_errors_total: Counter,                 // Job failures
};
```
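
As a concrete starting point, here is a minimal sketch of wiring this up with prom-client in the Express backend. The metric name matches the table above, but the bucket boundaries and label set are assumptions to tune against real traffic:

```typescript
import express from 'express';
import client from 'prom-client';

const register = new client.Registry();
client.collectDefaultMetrics({ register }); // process CPU, memory, event loop lag, etc.

const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request latency in seconds',
  labelNames: ['method', 'path', 'status'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5], // assumed starting buckets
  registers: [register],
});

const app = express();

// Time every request; record labels when the response finishes.
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on('finish', () =>
    end({ method: req.method, path: req.route?.path ?? req.path, status: String(res.statusCode) })
  );
  next();
});

// Scrape target for the Prometheus "backend" job.
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', register.contentType);
  res.send(await register.metrics());
});
```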

3.1.5 Container Metrics (cAdvisor)

Key metrics:

```
container_cpu_usage_seconds_total
container_memory_usage_bytes
container_memory_working_set_bytes
container_network_receive_bytes_total
container_network_transmit_bytes_total
container_fs_usage_bytes
```

3.2 Dashboards (Grafana)

3.2.1 System Overview Dashboard

┌─────────────────────────────────────────────────────────────────┐
│ SYSTEM OVERVIEW │
├─────────────┬─────────────┬─────────────┬─────────────┬─────────┤
│ CPU % │ Memory % │ Disk % │ Network │ Uptime │
│ 23% │ 67% │ 45% │ 12 Mb/s │ 45 days │
├─────────────┴─────────────┴─────────────┴─────────────┴─────────┤
│ CPU Usage Over Time │
│ ▁▂▃▄▅▆▇█▇▆▅▄▃▂▁▂▃▄▅▆▇█▇▆▅▄▃▂▁▂▃▄▅▆▇█▇▆▅▄▃▂▁ │
├─────────────────────────────────────────────────────────────────┤
│ Memory Usage Over Time │
│ ████████████████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░ │
├─────────────────────────────────────────────────────────────────┤
│ Container Status: backend ✅ | frontend ✅ | postgres ✅ | redis ✅ │
└─────────────────────────────────────────────────────────────────┘

3.2.2 Application Health Dashboard

┌─────────────────────────────────────────────────────────────────┐
│ APPLICATION HEALTH │
├─────────────┬─────────────┬─────────────┬─────────────┬─────────┤
│ Requests/s │ Error Rate │ p50 Lat │ p95 Lat │ p99 Lat │
│ 125 │ 0.02% │ 45ms │ 180ms │ 450ms │
├─────────────────────────────────────────────────────────────────┤
│ Request Rate by Endpoint │
│ /api/fixtures ████████████████████ 45/s │
│ /api/orders ████████ 18/s │
│ /api/odds ██████████████ 32/s │
├─────────────────────────────────────────────────────────────────┤
│ Error Rate Over Time │
│ ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ │
└─────────────────────────────────────────────────────────────────┘

3.2.3 Database Dashboard

┌─────────────────────────────────────────────────────────────────┐
│ DATABASE HEALTH │
├─────────────┬─────────────┬─────────────┬─────────────┬─────────┤
│ Connections │ Query Time │ Size │ Dead Tuples │ TPS │
│ 12/100 │ 15ms avg │ 2.3 GB │ 1,234 │ 85 │
├─────────────────────────────────────────────────────────────────┤
│ Table Sizes │
│ orders ████████████████████ 1.2 GB │
│ users ████ 120 MB │
│ fixtures ██████████ 450 MB │
├─────────────────────────────────────────────────────────────────┤
│ Slow Queries (> 100ms) │
│ SELECT * FROM orders WHERE... │ 234ms │ 15 calls/min │
└─────────────────────────────────────────────────────────────────┘

3.2.4 Business Metrics Dashboard

┌─────────────────────────────────────────────────────────────────┐
│ BUSINESS METRICS │
├─────────────┬─────────────┬─────────────┬─────────────┬─────────┤
│ Orders Today│ GGR Today │ Active Users│ B-Book Exp │ Settled │
│ 1,234 │ $5,678 │ 89 │ $12,345 │ 98.5% │
├─────────────────────────────────────────────────────────────────┤
│ Orders Over Time (24h) │
│ ▁▂▃▄▅▆▇█▇▆▅▄▃▂▁▂▃▄▅▆▇█▇▆▅▄▃▂▁▂▃▄▅▆▇█▇▆▅▄▃▂▁ │
├─────────────────────────────────────────────────────────────────┤
│ Settlement Status │
│ Pending: 23 │ Settled (Win): 456 │ Settled (Lose): 678 │
└─────────────────────────────────────────────────────────────────┘

3.3 Log Aggregation (Loki + Promtail)

3.3.1 Log Format Standard

All services should output JSON logs:

```json
{
  "timestamp": "2026-01-28T19:45:23.456Z",
  "level": "info",
  "service": "backend",
  "traceId": "abc123",
  "userId": "user_456",
  "message": "Order placed successfully",
  "orderId": "order_789",
  "duration": 145
}
```
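
One way to produce logs in exactly this shape is a shared pino logger; pino is an assumption here (any structured JSON logger works), and the field names below simply mirror the example above:

```typescript
import pino from 'pino';

// Base logger: every line carries the service name and an ISO-8601 "timestamp" field.
const logger = pino({
  base: { service: 'backend' },
  messageKey: 'message', // pino's default message key is "msg"
  timestamp: () => `,"timestamp":"${new Date().toISOString()}"`, // default key is "time"
});

// Bind per-request context once; all subsequent lines inherit it.
const requestLogger = logger.child({ traceId: 'abc123', userId: 'user_456' });
requestLogger.info({ orderId: 'order_789', duration: 145 }, 'Order placed successfully');
```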

3.3.2 Log Retention Policy

| Log Type | Hot Storage | Cold Storage | Total Retention |
|----------|-------------|--------------|-----------------|
| Application logs | 7 days | 90 days | 90 days |
| Access logs | 7 days | 30 days | 30 days |
| Error logs | 30 days | 1 year | 1 year |
| Audit logs | 30 days | 7 years | 7 years |
| Security logs | 30 days | 1 year | 1 year |

3.3.3 Docker Log Rotation

```
// /etc/docker/daemon.json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "5",
    "compress": "true"
  }
}
```

3.4 Distributed Tracing (Jaeger)

Implement OpenTelemetry tracing to track requests across services:

User Request
     │
     ▼
┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐
│  Nginx  │───▶│ Frontend│───▶│ Backend │───▶│ Postgres│
│   2ms   │    │  15ms   │    │  45ms   │    │  12ms   │
└─────────┘    └─────────┘    └─────────┘    └─────────┘
                                   │
                                   ▼
                              ┌─────────┐
                              │ Betfair │
                              │  120ms  │
                              └─────────┘
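
A minimal bootstrap for this, assuming the OpenTelemetry Node SDK with auto-instrumentation and Jaeger's OTLP/HTTP endpoint (the `jaeger` hostname and port are deployment assumptions):

```typescript
// tracing.ts — import this before anything else in the backend entrypoint.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  serviceName: 'hannibal-backend',
  // Jaeger accepts OTLP over HTTP on port 4318.
  traceExporter: new OTLPTraceExporter({ url: 'http://jaeger:4318/v1/traces' }),
  // Auto-instruments http, express, pg, ioredis, and other common libraries.
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
process.on('SIGTERM', () => sdk.shutdown());
```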

3.5 Uptime Monitoring (Uptime Kuma)

Uptime checks run from outside the infrastructure:

| Check | URL | Interval | Timeout |
|-------|-----|----------|---------|
| Frontend | https://hannibaldev.forsyt.io | 60s | 10s |
| API Health | https://hannibaldev.forsyt.io/api/health | 60s | 10s |
| API Fixtures | https://hannibaldev.forsyt.io/api/fixtures/live | 60s | 15s |
| WebSocket | wss://hannibaldev.forsyt.io/ws | 60s | 10s |

4. Alerting Strategy

4.1 Alert Channels

| Channel | Use Case | Response Time |
|---------|----------|---------------|
| PagerDuty/OpsGenie | Critical alerts requiring immediate action | < 5 min |
| Slack #alerts-critical | Critical alerts for team visibility | < 5 min |
| Slack #alerts-warning | Warning alerts for awareness | < 1 hour |
| Email | Daily reports, non-urgent notifications | < 24 hours |
| SMS (backup) | Critical alerts if PagerDuty fails | < 5 min |

4.2 Alert Severity Levels

| Severity | Description | Response | Examples |
|----------|-------------|----------|----------|
| P1 - Critical | Service down, data loss risk | Page immediately, 24/7 | Database down, disk full |
| P2 - High | Degraded service, potential impact | Page during business hours | High error rate, slow response |
| P3 - Medium | Warning, needs attention | Slack notification | High memory, certificate expiring |
| P4 - Low | Informational | Email/daily report | Deployment complete |

4.3 Critical Alerts (P1) - Page Immediately

| Alert Name | Condition | Duration | Action |
|------------|-----------|----------|--------|
| ServiceDown | Container not running | 2 min | Auto-restart, then page |
| DatabaseDown | PostgreSQL not responding | 1 min | Page immediately |
| DiskCritical | Disk usage > 95% | 0 min | Page immediately |
| MemoryCritical | Memory usage > 95% | 5 min | Page immediately |
| HighErrorRate | 5xx errors > 10% | 5 min | Page immediately |
| BetfairSessionLost | Session expired, reconnect failed | 5 min | Page immediately |
| SettlementJobFailed | Job not run in 10 min | 10 min | Page immediately |
| BackupFailed | No successful backup in 25 hours | 0 min | Page immediately |
| SSLCertExpired | Certificate expired | 0 min | Page immediately |

4.4 High Alerts (P2) - Page During Business Hours

| Alert Name | Condition | Duration | Action |
|------------|-----------|----------|--------|
| HighCPU | CPU > 90% | 15 min | Investigate, scale if needed |
| HighMemory | Memory > 90% | 10 min | Investigate, restart if needed |
| DiskHigh | Disk usage > 85% | 0 min | Clean up, expand disk |
| SlowQueries | Query time > 5s | 5 min | Optimize query |
| HighLatency | p95 latency > 2s | 10 min | Investigate bottleneck |
| ConnectionPoolExhausted | DB connections > 90% | 5 min | Increase pool, investigate |
| RedisMemoryHigh | Redis memory > 90% | 5 min | Check eviction, increase limit |

4.5 Warning Alerts (P3) - Slack Notification

| Alert Name | Condition | Duration |
|------------|-----------|----------|
| DiskWarning | Disk usage > 75% | 0 min |
| MemoryWarning | Memory > 80% | 10 min |
| CPUWarning | CPU > 80% | 15 min |
| CertificateExpiring | SSL cert expires in < 14 days | 0 min |
| HighDeadTuples | Dead tuples > 100,000 | 0 min |
| ReplicationLag | Lag > 30 seconds | 5 min |
| BetfairAPIErrors | Error rate > 5% | 5 min |

4.6 Alert Rules (Prometheus)

```yaml
# prometheus/alerts/critical.yml
groups:
  - name: critical
    rules:
      - alert: ServiceDown
        expr: up{job=~"backend|frontend|postgres|redis"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
          description: "{{ $labels.job }} has been down for more than 2 minutes"

      - alert: DiskCritical
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 5
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Disk almost full on {{ $labels.instance }}"
          description: "Disk usage is above 95% on {{ $labels.mountpoint }}"

      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) /
          sum(rate(http_requests_total[5m])) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is above 10% for the last 5 minutes"

      - alert: DatabaseDown
        expr: pg_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL is down"
          description: "PostgreSQL database is not responding"
```

5. Backup & Disaster Recovery

5.1 Backup Strategy Overview

┌─────────────────────────────────────────────────────────────────┐
│ BACKUP ARCHITECTURE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ PostgreSQL ──┬── pg_dump (daily) ──────────▶ S3 (30 days) │
│ │ │
│ └── WAL Archive (continuous) ──▶ S3 (7 days) │
│ │
│ Redis ───────── RDB Snapshot (hourly) ─────▶ Local + S3 │
│ │
│ Config ──────── Git + Encrypted Secrets ───▶ GitHub + Vault │
│ │
│ Logs ────────── Loki ──────────────────────▶ S3 (90 days) │
│ │
└─────────────────────────────────────────────────────────────────┘

5.2 Database Backup Configuration

5.2.1 Daily Full Backup (pg_dump)

```bash
#!/bin/bash
# /opt/hannibal/scripts/backup-database.sh

set -e

BACKUP_DIR="/var/backups/postgres"
S3_BUCKET="s3://hannibal-backups/postgres"
RETENTION_DAYS=30
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="hannibal_${TIMESTAMP}.sql.gz"

# Create backup
docker exec hannibal-postgres pg_dump -U hannibal -d hannibal | gzip > "${BACKUP_DIR}/${BACKUP_FILE}"

# Upload to S3
aws s3 cp "${BACKUP_DIR}/${BACKUP_FILE}" "${S3_BUCKET}/${BACKUP_FILE}"

# Verify upload
aws s3 ls "${S3_BUCKET}/${BACKUP_FILE}" || exit 1

# Clean old local backups
find "${BACKUP_DIR}" -name "hannibal_*.sql.gz" -mtime +7 -delete

# Clean old S3 backups (lifecycle policy handles this, but belt and suspenders)
aws s3 ls "${S3_BUCKET}/" | while read -r line; do
  createDate=$(echo "$line" | awk '{print $1" "$2}')
  createDate=$(date -d "$createDate" +%s)
  olderThan=$(date -d "-${RETENTION_DAYS} days" +%s)
  if [[ $createDate -lt $olderThan ]]; then
    fileName=$(echo "$line" | awk '{print $4}')
    aws s3 rm "${S3_BUCKET}/${fileName}"
  fi
done

echo "Backup completed: ${BACKUP_FILE}"
```

5.2.2 Continuous WAL Archiving

```
# postgresql.conf additions
wal_level = replica
archive_mode = on
archive_command = 'aws s3 cp %p s3://hannibal-backups/wal/%f'
archive_timeout = 300   # Archive at least every 5 minutes
```

5.2.3 Point-in-Time Recovery (PITR)

```bash
#!/bin/bash
# Restore to a specific point in time
# /opt/hannibal/scripts/restore-pitr.sh

TARGET_TIME="2026-01-28 15:30:00"
BACKUP_FILE="hannibal_20260128_030000.sql.gz"

# Stop services
docker compose down

# Download base backup
aws s3 cp "s3://hannibal-backups/postgres/${BACKUP_FILE}" /tmp/

# Restore base backup
gunzip -c "/tmp/${BACKUP_FILE}" | docker exec -i hannibal-postgres psql -U hannibal -d hannibal

# Apply WAL files up to TARGET_TIME
# NOTE: true PITR replays WAL over a *physical* base backup (pg_basebackup), not a
# pg_dump restore, and requires restore_command plus recovery_target_time.
# (recovery.conf was replaced by recovery.signal in PostgreSQL 12+.)
```

5.3 Backup Schedule

| Backup Type | Frequency | Time | Retention | Storage |
|-------------|-----------|------|-----------|---------|
| PostgreSQL full | Daily | 03:00 UTC | 30 days | S3 |
| PostgreSQL WAL | Continuous | - | 7 days | S3 |
| Redis RDB | Hourly | :00 | 24 hours | Local + S3 |
| Configuration | On change | - | Forever | Git |
| Docker volumes | Daily | 04:00 UTC | 7 days | S3 |

5.4 Backup Verification

```bash
#!/bin/bash
# /opt/hannibal/scripts/verify-backup.sh
# Run weekly to verify backups are restorable

# Get latest backup
LATEST_BACKUP=$(aws s3 ls s3://hannibal-backups/postgres/ | sort | tail -n 1 | awk '{print $4}')

# Download and restore to a test database
aws s3 cp "s3://hannibal-backups/postgres/${LATEST_BACKUP}" /tmp/
docker exec hannibal-postgres createdb -U hannibal hannibal_test || true
gunzip -c "/tmp/${LATEST_BACKUP}" | docker exec -i hannibal-postgres psql -U hannibal -d hannibal_test

# Verify data integrity
ORDERS_COUNT=$(docker exec hannibal-postgres psql -U hannibal -d hannibal_test -t -c "SELECT COUNT(*) FROM orders")
USERS_COUNT=$(docker exec hannibal-postgres psql -U hannibal -d hannibal_test -t -c "SELECT COUNT(*) FROM users")

# Clean up
docker exec hannibal-postgres dropdb -U hannibal hannibal_test

# Report
echo "Backup verification complete"
echo "Orders: ${ORDERS_COUNT}"
echo "Users: ${USERS_COUNT}"
```

5.5 Disaster Recovery Procedures

5.5.1 Recovery Time Objectives

| Scenario | RTO | RPO | Procedure |
|----------|-----|-----|-----------|
| Container crash | 1 min | 0 | Auto-restart |
| Host reboot | 5 min | 0 | Systemd auto-start |
| Database corruption | 1 hour | 1 hour | PITR restore |
| Host failure | 2 hours | 1 hour | New host + restore |
| Data center failure | 4 hours | 1 hour | DR site activation |
| Ransomware/breach | 4 hours | 1 hour | Clean restore from backup |

5.5.2 Recovery Runbook

**Database Recovery Procedure**

Prerequisites:

  • Access to AWS S3
  • Access to a new/clean PostgreSQL instance
  • The latest backup file name

Steps:

1. **Stop application services**

```bash
docker compose stop backend frontend
```

2. **Download the latest backup**

```bash
aws s3 cp s3://hannibal-backups/postgres/BACKUP_FILE.sql.gz /tmp/
```

3. **Restore the database**

```bash
# Drop and recreate database
docker exec hannibal-postgres dropdb -U hannibal hannibal
docker exec hannibal-postgres createdb -U hannibal hannibal

# Restore from backup
gunzip -c /tmp/BACKUP_FILE.sql.gz | docker exec -i hannibal-postgres psql -U hannibal -d hannibal
```

4. **Verify restoration**

```bash
docker exec hannibal-postgres psql -U hannibal -d hannibal -c "SELECT COUNT(*) FROM orders"
```

5. **Start application services**

```bash
docker compose up -d backend frontend
```

6. **Verify application health**

```bash
curl https://hannibaldev.forsyt.io/api/health
```

6. Security

6.1 Network Security

6.1.1 Firewall Configuration (UFW)

```bash
# Default policies
ufw default deny incoming
ufw default allow outgoing

# Allow SSH (restricted to known IPs)
ufw allow from 203.0.113.0/24 to any port 22

# Allow HTTP/HTTPS
ufw allow 80/tcp
ufw allow 443/tcp

# Allow internal Docker network
ufw allow from 172.16.0.0/12

# Enable firewall
ufw enable
```

6.1.2 Cloudflare Configuration

| Setting | Value | Purpose |
|---------|-------|---------|
| SSL Mode | Full (Strict) | End-to-end encryption |
| Always Use HTTPS | On | Force HTTPS |
| Minimum TLS Version | 1.2 | Security |
| WAF | On | Block attacks |
| Rate Limiting | 100 req/min per IP | DDoS protection |
| Bot Fight Mode | On | Block bad bots |
| Browser Integrity Check | On | Block suspicious requests |

6.1.3 Rate Limiting (Nginx)

```nginx
# /etc/nginx/conf.d/rate-limiting.conf

# Define rate limit zones
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;
limit_req_zone $binary_remote_addr zone=login:10m rate=5r/m;
limit_req_zone $binary_remote_addr zone=orders:10m rate=30r/m;

# Apply to locations
location /api/ {
    limit_req zone=api burst=20 nodelay;
    proxy_pass http://backend:3001;
}

location /api/auth/login {
    limit_req zone=login burst=5 nodelay;
    proxy_pass http://backend:3001;
}

location /api/orders {
    limit_req zone=orders burst=10 nodelay;
    proxy_pass http://backend:3001;
}
```

6.2 Application Security

6.2.1 Security Headers (Nginx)

```nginx
# Security headers
add_header X-Frame-Options "SAMEORIGIN" always;
add_header X-Content-Type-Options "nosniff" always;
add_header X-XSS-Protection "1; mode=block" always;
add_header Referrer-Policy "strict-origin-when-cross-origin" always;
add_header Content-Security-Policy "default-src 'self'; script-src 'self' 'unsafe-inline' 'unsafe-eval'; style-src 'self' 'unsafe-inline'; img-src 'self' data: https:; font-src 'self' data:; connect-src 'self' wss: https:;" always;
add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
add_header Permissions-Policy "geolocation=(), microphone=(), camera=()" always;
```

6.2.2 Input Validation

```typescript
// All API inputs must be validated
import { z } from 'zod';

const placeOrderSchema = z.object({
  fixtureId: z.string().min(1).max(50),
  marketId: z.string().regex(/^1\.\d+$/), // Betfair market ID format
  outcomeId: z.string().min(1).max(100),
  stake: z.number().positive().max(10000),
  odds: z.number().min(1.01).max(1000),
  betType: z.enum(['back', 'lay']),
});

// Sanitize all string inputs
function sanitize(input: string): string {
  return input.replace(/[<>"'&]/g, '');
}
```
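
At the route boundary, the schema can gate every request via zod's safeParse. A sketch, where `orderService.place` is a placeholder for the real order pipeline:

```typescript
app.post('/api/orders', async (req, res) => {
  const parsed = placeOrderSchema.safeParse(req.body);
  if (!parsed.success) {
    // Return field-level errors without echoing raw input back to the client.
    return res.status(400).json({ errors: parsed.error.flatten().fieldErrors });
  }
  // parsed.data is now fully typed and validated.
  const order = await orderService.place(parsed.data); // placeholder for the real pipeline
  return res.status(201).json(order);
});
```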

6.2.3 Authentication Security

| Measure | Implementation |
|---------|----------------|
| Password Hashing | bcrypt with cost factor 12 |
| JWT Expiry | Access token: 15 min, Refresh token: 7 days |
| Session Management | Redis-backed sessions |
| Failed Login Lockout | 5 attempts, 15 min lockout |
| Password Requirements | Min 8 chars, 1 upper, 1 lower, 1 number |
| 2FA | TOTP (optional, recommended for admins) |

6.3 Secrets Management

6.3.1 Current Secrets Inventory

| Secret | Location | Rotation |
|--------|----------|----------|
| Database password | .env | 90 days |
| Redis password | .env | 90 days |
| JWT secret | .env | 90 days |
| Betfair API key | .env | Never (API key) |
| Betfair credentials | .env | 90 days |
| Roanuz API key | .env | Never |
| AWS credentials | .env | 90 days |

6.3.2 Secrets Management Best Practices

```bash
# Never commit secrets to git
echo ".env" >> .gitignore
echo "*.pem" >> .gitignore
echo "*.key" >> .gitignore

# Use Docker secrets in production
# (note: "docker secret" requires Swarm mode; plain Compose supports file-based secrets)
docker secret create db_password db_password.txt
docker secret create jwt_secret jwt_secret.txt
```

Reference the secrets in docker-compose.yml:

```yaml
services:
  backend:
    secrets:
      - db_password
      - jwt_secret
    environment:
      - DB_PASSWORD_FILE=/run/secrets/db_password
```
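
On the application side, a small sketch of honoring the `*_FILE` convention so the same code runs with either plain .env values or mounted secret files:

```typescript
import { readFileSync } from 'node:fs';

// Prefer a mounted secret file (e.g. /run/secrets/db_password); fall back to a plain env var.
function readSecret(name: string): string | undefined {
  const filePath = process.env[`${name}_FILE`];
  if (filePath) return readFileSync(filePath, 'utf8').trim();
  return process.env[name];
}

const dbPassword = readSecret('DB_PASSWORD');
```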

6.4 Audit Logging

```typescript
// Log all security-relevant events
const auditEvents = [
  'user.login',
  'user.logout',
  'user.login_failed',
  'user.password_changed',
  'user.created',
  'user.deleted',
  'order.placed',
  'order.cancelled',
  'admin.user_modified',
  'admin.settings_changed',
  'admin.manual_settlement',
  'system.backup_completed',
  'system.backup_failed',
];

// Audit log format
interface AuditLog {
  timestamp: string;
  event: string;
  userId: string;
  ipAddress: string;
  userAgent: string;
  details: Record<string, unknown>;
  success: boolean;
}
```
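
A sketch of the write path, assuming audit entries go both to the structured log stream (picked up by Promtail for Loki) and to an append-only table; `db.auditLogs.insert` is a placeholder for the real data layer:

```typescript
async function recordAudit(entry: Omit<AuditLog, 'timestamp'>): Promise<void> {
  const record: AuditLog = { ...entry, timestamp: new Date().toISOString() };

  // Structured log line -> Promtail -> queryable in Loki.
  logger.info({ audit: record }, `audit:${record.event}`);

  // Durable copy in an append-only table (7-year retention per 3.3.2).
  await db.auditLogs.insert(record); // placeholder data-access call
}

// Usage:
// await recordAudit({ event: 'user.login', userId, ipAddress, userAgent, details: {}, success: true });
```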

7. Infrastructure & Resource Management

7.1 Docker Configuration

7.1.1 Resource Limits

```yaml
# docker-compose.prod.yml
services:
  backend:
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 2G
        reservations:
          cpus: '0.5'
          memory: 512M
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3001/api/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

  frontend:
    deploy:
      resources:
        limits:
          cpus: '1'
          memory: 1G
        reservations:
          cpus: '0.25'
          memory: 256M

  postgres:
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 4G
        reservations:
          cpus: '0.5'
          memory: 1G
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U hannibal"]
      interval: 10s
      timeout: 5s
      retries: 5

  redis:
    deploy:
      resources:
        limits:
          cpus: '1'
          memory: 1G
        reservations:
          cpus: '0.25'
          memory: 256M
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5
```

7.1.2 Docker Daemon Configuration

```
// /etc/docker/daemon.json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "5",
    "compress": "true"
  },
  "storage-driver": "overlay2",
  "live-restore": true,
  "default-ulimits": {
    "nofile": {
      "Name": "nofile",
      "Hard": 65536,
      "Soft": 65536
    }
  }
}
```

7.2 Disk Management

7.2.1 Disk Space Monitoring

```bash
#!/bin/bash
# /opt/hannibal/scripts/disk-cleanup.sh
# Run daily via cron

# Clean Docker resources older than 7 days
# (the "until" filter cannot be combined with --volumes; prune volumes manually and carefully)
docker system prune -f --filter "until=168h"

# Clean old logs
find /var/log -name "*.gz" -mtime +30 -delete
find /var/log -name "*.log.*" -mtime +7 -delete

# Clean temp files
find /tmp -type f -mtime +7 -delete

# Clean old backups
find /var/backups -type f -mtime +7 -delete

# Report disk usage
df -h | grep -E "^/dev" | while read -r line; do
  usage=$(echo "$line" | awk '{print $5}' | sed 's/%//')
  mount=$(echo "$line" | awk '{print $6}')
  if [ "$usage" -gt 80 ]; then
    echo "WARNING: ${mount} is ${usage}% full"
  fi
done
```

7.2.2 PostgreSQL Maintenance

```bash
#!/bin/bash
# /opt/hannibal/scripts/postgres-maintenance.sh
# Run weekly

# Vacuum and analyze all tables
docker exec hannibal-postgres psql -U hannibal -d hannibal -c "VACUUM ANALYZE;"

# Reindex if needed (run monthly)
# docker exec hannibal-postgres psql -U hannibal -d hannibal -c "REINDEX DATABASE hannibal;"

# Check for bloat
docker exec hannibal-postgres psql -U hannibal -d hannibal -c "
SELECT schemaname, tablename,
       pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) AS size,
       n_dead_tup AS dead_tuples
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC;
"
```

7.3 Redis Configuration

```
# redis.conf
maxmemory 512mb
# allkeys-lru can evict session keys under memory pressure;
# consider volatile-lru if sessions must survive
maxmemory-policy allkeys-lru

# Persistence
save 900 1
save 300 10
save 60 10000

# Append-only file
appendonly yes
appendfsync everysec

# Logging
loglevel notice
logfile /var/log/redis/redis.log
```

7.4 PostgreSQL Tuning

```
# postgresql.conf optimizations for a 4GB RAM server

# Memory
shared_buffers = 1GB
effective_cache_size = 3GB
work_mem = 16MB
maintenance_work_mem = 256MB

# Connections
max_connections = 100

# WAL
wal_buffers = 64MB
checkpoint_completion_target = 0.9
max_wal_size = 2GB
min_wal_size = 1GB

# Query planning (values tuned for SSD storage)
random_page_cost = 1.1
effective_io_concurrency = 200

# Logging
log_min_duration_statement = 1000   # Log queries > 1 second
log_checkpoints = on
log_connections = on
log_disconnections = on
log_lock_waits = on

# Autovacuum
autovacuum = on
autovacuum_max_workers = 3
autovacuum_naptime = 1min
```

8. Performance Optimization

8.1 Database Performance

8.1.1 Essential Indexes

```sql
-- Orders table indexes
CREATE INDEX idx_orders_status ON orders(status) WHERE status = 'accepted';
CREATE INDEX idx_orders_user_id ON orders(user_id);
CREATE INDEX idx_orders_fixture_id ON orders(fixture_id);
CREATE INDEX idx_orders_created_at ON orders(created_at DESC);
CREATE INDEX idx_orders_market_id ON orders(market_id);
CREATE INDEX idx_orders_routing_venue_status ON orders(routing_venue, status);

-- Composite index for the settlement job
CREATE INDEX idx_orders_settlement ON orders(routing_venue, status, market_id)
  WHERE status = 'accepted';

-- Users table
CREATE INDEX idx_users_email ON users(email);
CREATE INDEX idx_users_agent_id ON users(agent_id);
```

8.1.2 Query Optimization Checklist

  • All queries use indexes (check with EXPLAIN ANALYZE)
  • No N+1 queries (use eager loading)
  • Pagination on all list endpoints
  • Connection pooling configured
  • Prepared statements used
  • No SELECT * in production code

8.2 Caching Strategy

8.2.1 Cache Layers

| Layer | Data | TTL | Invalidation |
|-------|------|-----|--------------|
| Redis | Fixtures list | 30s | On update |
| Redis | Market odds | 5s | Streaming update |
| Redis | User sessions | 24h | On logout |
| Redis | Betfair session | 12h | On expiry |
| CDN | Static assets | 1 year | Version hash |
| Browser | API responses | 0 | No-cache |

8.2.2 Cache Implementation

```typescript
// Redis caching pattern
class CacheService {
  async get<T>(key: string): Promise<T | null> {
    const cached = await redis.get(key);
    return cached ? (JSON.parse(cached) as T) : null;
  }

  async set(key: string, value: unknown, ttlSeconds: number): Promise<void> {
    await redis.setex(key, ttlSeconds, JSON.stringify(value));
  }

  async invalidate(pattern: string): Promise<void> {
    // NOTE: KEYS is O(N) and blocks Redis; prefer SCAN for large keyspaces.
    const keys = await redis.keys(pattern);
    if (keys.length > 0) {
      await redis.del(...keys);
    }
  }
}

// Usage
const fixtures = await cache.get('fixtures:live:1');
if (!fixtures) {
  const data = await fetchFixtures();
  await cache.set('fixtures:live:1', data, 30);
}
```

8.3 API Performance

8.3.1 Response Time Targets

| Endpoint | Target p50 | Target p95 | Target p99 |
|----------|------------|------------|------------|
| GET /api/health | 5ms | 10ms | 50ms |
| GET /api/fixtures | 50ms | 150ms | 300ms |
| GET /api/fixtures/:id | 30ms | 100ms | 200ms |
| POST /api/orders | 100ms | 300ms | 500ms |
| GET /api/orders | 50ms | 150ms | 300ms |

8.3.2 Performance Checklist

  • Gzip compression enabled
  • HTTP/2 enabled
  • Keep-alive connections
  • Response pagination
  • Field filtering (sparse fieldsets)
  • Async processing for heavy operations
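
Two of the items above, compression and pagination, in a minimal Express sketch; `db.listOrders` is a hypothetical data-access call:

```typescript
import express from 'express';
import compression from 'compression';

const app = express();
app.use(compression()); // gzip response compression, negotiated via Accept-Encoding

// Paginated list endpoint: clamp limit so a client cannot request the whole table.
app.get('/api/orders', async (req, res) => {
  const limit = Math.min(Number(req.query.limit) || 50, 200);
  const offset = Math.max(Number(req.query.offset) || 0, 0);
  const orders = await db.listOrders({ limit, offset }); // hypothetical data-access call
  res.json({ data: orders, paging: { limit, offset } });
});
```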

9. Deployment & CI/CD

9.1 GitHub Actions Pipeline

```yaml
# .github/workflows/deploy.yml
name: Deploy to Production

on:
  push:
    branches: [main]
  workflow_dispatch:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'

      - name: Install dependencies
        run: |
          cd backend && npm ci
          cd ../frontend && npm ci

      - name: Run tests
        run: |
          cd backend && npm test
          cd ../frontend && npm test

      - name: Type check
        run: |
          cd backend && npm run typecheck
          cd ../frontend && npm run typecheck

  security-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run Snyk security scan
        uses: snyk/actions/node@master
        env:
          SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}

  build:
    needs: [test, security-scan]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Build Docker images
        run: |
          docker build -t hannibal-backend:${{ github.sha }} ./backend
          docker build -t hannibal-frontend:${{ github.sha }} ./frontend

      - name: Push to registry
        run: |
          docker push hannibal-backend:${{ github.sha }}
          docker push hannibal-frontend:${{ github.sha }}

  deploy:
    needs: build
    runs-on: ubuntu-latest
    environment: production
    steps:
      - name: Deploy to production
        uses: appleboy/ssh-action@master
        with:
          host: ${{ secrets.PROD_HOST }}
          username: ${{ secrets.PROD_USER }}
          key: ${{ secrets.PROD_SSH_KEY }}
          script: |
            cd /opt/hannibal
            git pull
            docker compose -f docker-compose.prod.yml pull
            docker compose -f docker-compose.prod.yml up -d

      - name: Health check
        run: |
          sleep 30
          curl -f https://hannibaldev.forsyt.io/api/health || exit 1

      - name: Notify Slack
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "✅ Deployed to production: ${{ github.sha }}"
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
```

9.2 Rollback Procedure

```bash
#!/bin/bash
# /opt/hannibal/scripts/rollback.sh

# Get previous version (second-newest local image tag)
PREVIOUS_VERSION=$(docker images hannibal-backend --format "{{.Tag}}" | head -2 | tail -1)

echo "Rolling back to version: ${PREVIOUS_VERSION}"

# Update docker-compose to use the previous version
sed -i "s/hannibal-backend:.*/hannibal-backend:${PREVIOUS_VERSION}/" docker-compose.prod.yml
sed -i "s/hannibal-frontend:.*/hannibal-frontend:${PREVIOUS_VERSION}/" docker-compose.prod.yml

# Restart services
docker compose -f docker-compose.prod.yml up -d

# Verify health
sleep 30
curl -f http://localhost:3001/api/health || echo "WARNING: Health check failed"

echo "Rollback complete"
```

9.3 Database Migrations

```sql
-- Always use backward-compatible migrations.

-- BAD: rename a column directly
-- ALTER TABLE orders RENAME COLUMN old_name TO new_name;

-- GOOD: add the new column, migrate data, then remove the old one
-- Step 1: Add new column
ALTER TABLE orders ADD COLUMN new_name VARCHAR(255);
-- Step 2: Backfill data (in the application or a batch job)
UPDATE orders SET new_name = old_name WHERE new_name IS NULL;
-- Step 3: Remove old column (after the deployment is stable)
ALTER TABLE orders DROP COLUMN old_name;
```
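
If migrations run through a Node tool, the expand step might look like this (Knex is shown purely as an illustration; the project's actual migration runner may differ):

```typescript
import type { Knex } from 'knex';

// Expand phase: additive only, safe to deploy before the new application code.
export async function up(knex: Knex): Promise<void> {
  await knex.schema.alterTable('orders', (table) => {
    table.string('new_name', 255).nullable(); // nullable so old writers keep working
  });
}

// The down migration only reverses the additive step.
export async function down(knex: Knex): Promise<void> {
  await knex.schema.alterTable('orders', (table) => {
    table.dropColumn('new_name');
  });
}
```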

10. Analytics & Business Intelligence

10.1 Google Analytics 4 Setup

```typescript
// Frontend analytics integration
import { gtag } from './gtag';

// Track page views
gtag('config', 'G-XXXXXXXXXX', {
  page_path: window.location.pathname,
});

// Track events
gtag('event', 'place_order', {
  event_category: 'betting',
  event_label: `${fixture.homeTeam} vs ${fixture.awayTeam}`,
  value: order.stake,
});

gtag('event', 'login', {
  method: 'email',
});
```

10.2 Business Metrics to Track

| Metric | Description | Query/Calculation |
|--------|-------------|-------------------|
| DAU | Daily Active Users | Unique users with activity per day |
| Orders/Day | Total orders placed | COUNT(orders) WHERE date = today |
| GGR | Gross Gaming Revenue | SUM(stake) - SUM(payouts) |
| Handle | Total amount wagered | SUM(stake) |
| Hold % | House edge realized | GGR / Handle * 100 |
| B-Book Exposure | Current liability | SUM(potential_payout) for unsettled B-Book |
| Settlement Rate | Orders settled on time | Settled within 1hr / Total settled |
| User Retention | 7-day retention | Users active day 7 / Users signed up day 0 |

10.3 Business Dashboard Queries

```sql
-- Daily summary
SELECT
  DATE(created_at) AS date,
  COUNT(*) AS total_orders,
  COUNT(DISTINCT user_id) AS unique_users,
  SUM(stake) AS total_stake,
  SUM(CASE WHEN settlement_outcome = 'win' THEN -profit_loss ELSE profit_loss END) AS ggr
FROM orders
WHERE created_at >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY DATE(created_at)
ORDER BY date DESC;

-- B-Book exposure
SELECT
  SUM(stake * (odds - 1)) AS max_exposure
FROM orders
WHERE routing_venue = 'bbook'
  AND status = 'accepted';

-- Settlement performance
SELECT
  AVG(EXTRACT(EPOCH FROM (settled_at - created_at))) AS avg_settlement_seconds
FROM orders
WHERE status = 'settled'
  AND settled_at >= CURRENT_DATE - INTERVAL '7 days';
```

11. High Availability & Reliability

11.1 Current Architecture (Single Node)

                    ┌─────────────┐
│ Cloudflare │
│ (CDN) │
└──────┬──────┘

┌──────▼──────┐
│ Nginx │
│ (Proxy) │
└──────┬──────┘

┌────────────┼────────────┐
│ │ │
┌──────▼──────┐ ┌───▼───┐ ┌──────▼──────┐
│ Frontend │ │Backend│ │ Backend │
│ (Next.js) │ │ (API) │ │ (Jobs) │
└─────────────┘ └───┬───┘ └──────┬──────┘
│ │
┌────────────┼────────────┘
│ │
┌──────▼──────┐ ┌───▼───┐
│ PostgreSQL │ │ Redis │
└─────────────┘ └───────┘

11.2 Future HA Architecture

                    ┌─────────────┐
│ Cloudflare │
│ (CDN + WAF) │
└──────┬──────┘

┌────────────┼────────────┐
│ │ │
┌──────▼──────┐ │ ┌──────▼──────┐
│ Nginx 1 │ │ │ Nginx 2 │
│ (Primary) │ │ │ (Standby) │
└──────┬──────┘ │ └──────┬──────┘
│ │ │
└────────────┼────────────┘

┌──────▼──────┐
│ HAProxy │
│ (LB) │
└──────┬──────┘

┌─────────────────┼─────────────────┐
│ │ │
┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
│ Backend 1 │ │ Backend 2 │ │ Backend 3 │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │
└─────────────────┼─────────────────┘

┌─────────────────┼─────────────────┐
│ │ │
┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
│ Postgres │ │ Postgres │ │ Redis │
│ (Primary) │◄──│ (Replica) │ │ (Sentinel) │
└─────────────┘ └─────────────┘ └─────────────┘

11.3 Reliability Patterns

11.3.1 Circuit Breaker (Betfair API)

```typescript
import CircuitBreaker from 'opossum';

const betfairBreaker = new CircuitBreaker(callBetfairAPI, {
  timeout: 10000,               // 10 second timeout
  errorThresholdPercentage: 50, // Open circuit if 50% fail
  resetTimeout: 30000,          // Try again after 30 seconds
  volumeThreshold: 10,          // Minimum 10 requests before tripping
});

betfairBreaker.on('open', () => {
  logger.warn('Betfair circuit breaker opened');
  alerting.send('warning', 'Betfair API circuit breaker opened');
});

betfairBreaker.on('halfOpen', () => {
  logger.info('Betfair circuit breaker half-open, testing...');
});

betfairBreaker.on('close', () => {
  logger.info('Betfair circuit breaker closed');
});
```

11.3.2 Retry with Exponential Backoff

```typescript
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries: number = 3,
  baseDelay: number = 1000
): Promise<T> {
  let lastError: Error;

  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error as Error;
      if (attempt === maxRetries - 1) break; // no point sleeping after the final attempt
      const delay = baseDelay * Math.pow(2, attempt);
      logger.warn(`Attempt ${attempt + 1} failed, retrying in ${delay}ms`);
      await sleep(delay);
    }
  }

  throw lastError!;
}
```
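
Combined with the circuit breaker from 11.3.1, a call site might look like this (the request payload shape is illustrative, not Betfair's actual schema):

```typescript
// Retry transient failures, but let the breaker short-circuit a dead Betfair API.
const marketBook = await withRetry(
  () => betfairBreaker.fire({ marketIds: ['1.234567890'] }), // illustrative payload
  3,
  500
);
```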

11.3.3 Graceful Degradation

```typescript
// If Betfair is down, show cached odds with a warning
async function getOdds(fixtureId: string) {
  try {
    return await betfairAdapter.getOdds(fixtureId);
  } catch (error) {
    logger.error('Failed to get live odds, using cache', error);

    const cached = await cache.get(`odds:${fixtureId}`);
    if (cached) {
      return {
        ...cached,
        stale: true,
        staleSince: cached.timestamp,
      };
    }

    throw new ServiceUnavailableError('Odds temporarily unavailable');
  }
}
```

11.4 Health Checks

```typescript
// /api/health endpoint
app.get('/api/health', async (req, res) => {
  const checks = {
    database: await checkDatabase(),
    redis: await checkRedis(),
    betfair: await checkBetfairSession(),
  };

  const healthy = Object.values(checks).every((c) => c.status === 'healthy');

  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'healthy' : 'unhealthy',
    timestamp: new Date().toISOString(),
    checks,
  });
});

async function checkDatabase(): Promise<HealthCheck> {
  const start = Date.now();
  try {
    await prisma.$queryRaw`SELECT 1`;
    return { status: 'healthy', latency: Date.now() - start };
  } catch (error) {
    return { status: 'unhealthy', error: (error as Error).message };
  }
}
```
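
The Redis and Betfair checks follow the same shape; a sketch of checkRedis, assuming an ioredis client:

```typescript
async function checkRedis(): Promise<HealthCheck> {
  const start = Date.now();
  try {
    await redis.ping(); // ioredis resolves with "PONG"
    return { status: 'healthy', latency: Date.now() - start };
  } catch (error) {
    return { status: 'unhealthy', error: (error as Error).message };
  }
}
```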

12. Operational Runbooks

12.1 Runbook: Service Restart

When to use:

  • Service is unresponsive
  • Memory leak detected
  • After a configuration change

Procedure:

1. **Check current status**

```bash
docker compose ps
docker logs hannibal-backend --tail 100
```

2. **Restart the specific service**

```bash
docker compose restart backend
```

3. **Verify the service is healthy**

```bash
curl http://localhost:3001/api/health
docker logs hannibal-backend --tail 20
```

4. **If the restart fails, recreate the container**

```bash
docker compose up -d --force-recreate backend
```

Escalation: if the service won't start after 3 attempts, escalate to the on-call engineer.


12.2 Runbook: Database Connection Issues

Symptoms:

  • "Connection refused" errors
  • "Too many connections" errors
  • Slow queries

Diagnosis:

1. **Check PostgreSQL status**

```bash
docker exec hannibal-postgres pg_isready
```

2. **Check connection count**

```bash
docker exec hannibal-postgres psql -U hannibal -c "SELECT count(*) FROM pg_stat_activity;"
```

3. **Check for blocking queries**

```bash
docker exec hannibal-postgres psql -U hannibal -c "
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC
LIMIT 10;
"
```

Resolution:

Too many connections:

```bash
# Kill idle connections
docker exec hannibal-postgres psql -U hannibal -c "
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND query_start < now() - interval '10 minutes';
"
```

Blocking query:

```bash
# Kill specific query
docker exec hannibal-postgres psql -U hannibal -c "SELECT pg_terminate_backend(PID);"
```

12.3 Runbook: Betfair Session Recovery

Symptoms:

  • "Session expired" errors in logs
  • Orders failing to place
  • No odds updates

Diagnosis:

1. **Check Betfair session status**

```bash
docker logs hannibal-backend --tail 100 | grep -i betfair
```

2. **Check session in Redis**

```bash
docker exec hannibal-redis redis-cli GET betfair:session
```

Resolution:

1. **Force session refresh**

```bash
# Clear cached session
docker exec hannibal-redis redis-cli DEL betfair:session

# Restart backend to trigger re-login
docker compose restart backend
```

2. **Verify new session**

```bash
docker logs hannibal-backend --tail 50 | grep -i "logged in"
```

If login fails:

  • Check Betfair credentials in .env
  • Check whether the Betfair API is down: https://status.betfair.com
  • Check whether the IP is blocked (contact Betfair support)

12.4 Runbook: Disk Space Emergency

Symptoms:

  • Disk usage > 90% alert
  • Services failing to write

Immediate actions:

1. **Check disk usage**

```bash
df -h
du -sh /* | sort -hr | head -20
```

2. **Quick cleanup**

```bash
# Docker cleanup
# CAUTION: --volumes removes unused volumes; verify no critical container is stopped first
docker system prune -af --volumes

# Clear old logs
find /var/log -name "*.gz" -delete
truncate -s 0 /var/log/*.log

# Clear temp files
rm -rf /tmp/*
```

3. **Check large files**

```bash
find / -type f -size +100M -exec ls -lh {} \;
```

If still critical:

  • Delete old database backups: `rm /var/backups/postgres/hannibal_*.sql.gz`
  • Expand the disk volume (cloud provider console)
  • Move data to a larger volume

13. Implementation Roadmap

Phase 1: Critical (Week 1-2) 🔴

| Task | Effort | Owner | Status |
|------|--------|-------|--------|
| Configure Docker log rotation | 1h | DevOps | ⬜ |
| Set up daily database backups to S3 | 4h | DevOps | ⬜ |
| Deploy Prometheus + Node Exporter | 4h | DevOps | ⬜ |
| Deploy Grafana with basic dashboards | 4h | DevOps | ⬜ |
| Configure critical alerts (disk, memory, service down) | 4h | DevOps | ⬜ |
| Set up Uptime Kuma for external monitoring | 2h | DevOps | ⬜ |
| Configure UFW firewall | 2h | DevOps | ⬜ |
| Add resource limits to Docker containers | 2h | DevOps | ⬜ |

Phase 2: Important (Week 3-4) 🟡

| Task | Effort | Owner | Status |
|------|--------|-------|--------|
| Deploy Loki + Promtail for log aggregation | 4h | DevOps | ⬜ |
| Add application metrics to backend | 8h | Backend | ⬜ |
| Create business metrics dashboard | 4h | DevOps | ⬜ |
| Set up backup verification automation | 4h | DevOps | ⬜ |
| Implement security headers in Nginx | 2h | DevOps | ⬜ |
| Set up GitHub Actions CI/CD pipeline | 8h | DevOps | ⬜ |
| Configure Cloudflare WAF rules | 4h | DevOps | ⬜ |
| Add database indexes for performance | 4h | Backend | ⬜ |

Phase 3: Enhancement (Month 2) 🟢

| Task | Effort | Owner | Status |
|------|--------|-------|--------|
| Implement distributed tracing (Jaeger) | 8h | Backend | ⬜ |
| Add circuit breaker for Betfair API | 4h | Backend | ⬜ |
| Create operational runbooks | 8h | DevOps | ⬜ |
| Set up Google Analytics 4 | 4h | Frontend | ⬜ |
| Implement audit logging | 8h | Backend | ⬜ |
| Performance optimization (caching, queries) | 16h | Backend | ⬜ |
| Security audit and penetration testing | 16h | Security | ⬜ |

Phase 4: Scale (Month 3+) 🔵

| Task | Effort | Owner | Status |
|------|--------|-------|--------|
| Set up PostgreSQL replication | 8h | DevOps | ⬜ |
| Configure Redis Sentinel | 4h | DevOps | ⬜ |
| Deploy multiple backend instances | 8h | DevOps | ⬜ |
| Set up load balancer | 4h | DevOps | ⬜ |
| Implement chaos testing | 16h | DevOps | ⬜ |
| Multi-region failover setup | 24h | DevOps | ⬜ |

14. Appendix: Configuration Examples

14.1 Complete docker-compose.prod.yml

```yaml
version: '3.8'

services:
  backend:
    image: hannibal-backend:latest
    restart: unless-stopped
    environment:
      - NODE_ENV=production
      - DATABASE_URL=postgresql://hannibal:${DB_PASSWORD}@postgres:5432/hannibal
      - REDIS_URL=redis://redis:6379
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 2G
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3001/api/health"]
      interval: 30s
      timeout: 10s
      retries: 3
    logging:
      driver: "json-file"
      options:
        max-size: "100m"
        max-file: "5"

  frontend:
    image: hannibal-frontend:latest
    restart: unless-stopped
    environment:
      - NODE_ENV=production
    deploy:
      resources:
        limits:
          cpus: '1'
          memory: 1G
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000"]
      interval: 30s
      timeout: 10s
      retries: 3
    logging:
      driver: "json-file"
      options:
        max-size: "100m"
        max-file: "5"

  postgres:
    image: postgres:15-alpine
    restart: unless-stopped
    environment:
      - POSTGRES_USER=hannibal
      - POSTGRES_PASSWORD=${DB_PASSWORD}
      - POSTGRES_DB=hannibal
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./postgres/postgresql.conf:/etc/postgresql/postgresql.conf
    command: postgres -c config_file=/etc/postgresql/postgresql.conf
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 4G
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U hannibal"]
      interval: 10s
      timeout: 5s
      retries: 5
    logging:
      driver: "json-file"
      options:
        max-size: "100m"
        max-file: "5"

  redis:
    image: redis:7-alpine
    restart: unless-stopped
    command: redis-server /usr/local/etc/redis/redis.conf
    volumes:
      - redis_data:/data
      - ./redis/redis.conf:/usr/local/etc/redis/redis.conf
    deploy:
      resources:
        limits:
          cpus: '1'
          memory: 1G
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5
    logging:
      driver: "json-file"
      options:
        max-size: "50m"
        max-file: "3"

  prometheus:
    image: prom/prometheus:latest
    restart: unless-stopped
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:latest
    restart: unless-stopped
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
      - GF_USERS_ALLOW_SIGN_UP=false
    ports:
      - "3000:3000"

  loki:
    image: grafana/loki:latest
    restart: unless-stopped
    volumes:
      - ./loki:/etc/loki
      - loki_data:/loki
    command: -config.file=/etc/loki/loki-config.yml

  promtail:
    image: grafana/promtail:latest
    restart: unless-stopped
    volumes:
      - ./promtail:/etc/promtail
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/run/docker.sock:/var/run/docker.sock
    command: -config.file=/etc/promtail/promtail-config.yml

  node-exporter:
    image: prom/node-exporter:latest
    restart: unless-stopped
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'

  postgres-exporter:
    image: prometheuscommunity/postgres-exporter:latest
    restart: unless-stopped
    environment:
      - DATA_SOURCE_NAME=postgresql://hannibal:${DB_PASSWORD}@postgres:5432/hannibal?sslmode=disable

  redis-exporter:
    image: oliver006/redis_exporter:latest
    restart: unless-stopped
    environment:
      - REDIS_ADDR=redis://redis:6379

volumes:
  postgres_data:
  redis_data:
  prometheus_data:
  grafana_data:
  loki_data:
```

14.2 Prometheus Configuration

```yaml
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - /etc/prometheus/alerts/*.yml

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']

  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']

  - job_name: 'backend'
    static_configs:
      - targets: ['backend:3001']
    metrics_path: /metrics

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
```

14.3 Crontab for Maintenance Tasks

```bash
# /etc/cron.d/hannibal

# Database backup - daily at 3 AM
0 3 * * * root /opt/hannibal/scripts/backup-database.sh >> /var/log/hannibal/backup.log 2>&1

# Backup verification - weekly on Sunday at 4 AM
0 4 * * 0 root /opt/hannibal/scripts/verify-backup.sh >> /var/log/hannibal/backup-verify.log 2>&1

# Disk cleanup - daily at 2 AM
0 2 * * * root /opt/hannibal/scripts/disk-cleanup.sh >> /var/log/hannibal/cleanup.log 2>&1

# PostgreSQL maintenance - weekly on Sunday at 5 AM
0 5 * * 0 root /opt/hannibal/scripts/postgres-maintenance.sh >> /var/log/hannibal/postgres.log 2>&1

# SSL certificate renewal check - daily at 6 AM
0 6 * * * root certbot renew --quiet --post-hook "nginx -s reload"

# Docker cleanup - weekly on Saturday at 3 AM
0 3 * * 6 root docker system prune -af >> /var/log/hannibal/docker-cleanup.log 2>&1
```

Document History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | 2026-01-28 | AI Assistant | Initial document |
