Monitoring
Current stack
| Tool | Status | Purpose |
|---|---|---|
| Prometheus / node_exporter | Deployed | Host metrics (CPU, memory, disk, network) |
| Fail2ban | Deployed | Intrusion detection, SSH brute-force protection |
| Uptime Kuma | Not deployed | External uptime monitoring, status page |
| Grafana | Not deployed | Metrics visualisation and dashboards |
| Promtail / Loki | Partial | Log shipping (Promtail container deployed; Loki not yet) |
Dashboard widgets
Metrics are surfaced in the ops dashboard via dedicated widgets:
| Widget | Data source | API route |
|---|---|---|
| Prometheus | node_exporter scrape | /api/widgets/prometheus |
| Fail2ban | log / socket | /api/widgets/fail2ban |
| Uptime Kuma | REST API | /api/widgets/uptime-kuma |
See Widgets for the full widget reference.
node_exporter
node_exporter exposes host metrics from the web VPS over HTTP. The Prometheus widget polls node_exporter directly from the Next.js container.
    # Environment variable
    PROMETHEUS_URL=http://<host>:9100/metrics

Metrics exposed: CPU usage, memory, disk I/O, filesystem usage, network throughput.
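For illustration, the text returned by the `/metrics` endpoint can be reduced to a single figure with a small parser. This is a minimal sketch in Python rather than the widget's actual code (which runs in the Next.js container and is not shown here). The function names are hypothetical, and the regular expression is a simplification that assumes label values contain no `}` character. `node_memory_MemTotal_bytes` and `node_memory_MemAvailable_bytes` are the gauges node_exporter publishes on Linux hosts.

```python
import re

# One sample line of the Prometheus text exposition format:
#   metric_name{label="value",...} 123.4
# Simplification: assumes label values never contain a "}" character.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)')


def parse_samples(text):
    """Return (name, labels, value) tuples, skipping comments and blank lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # "# HELP" / "# TYPE" comment lines
        match = SAMPLE_RE.match(line)
        if match:
            name, labels, value = match.groups()
            samples.append((name, labels or "", float(value)))
    return samples


def memory_used_percent(text):
    """Derive used memory (%) from the MemTotal and MemAvailable gauges."""
    values = {name: value for name, labels, value in parse_samples(text) if not labels}
    total = values["node_memory_MemTotal_bytes"]
    available = values["node_memory_MemAvailable_bytes"]
    return 100.0 * (total - available) / total
```

In practice the `text` argument would be the response body fetched from `PROMETHEUS_URL`; the parsing is kept separate from the HTTP call so it can be exercised against a saved sample.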
Fail2ban
Monitors /var/log/auth.log and other log sources. Bans IPs that exceed failed authentication thresholds.
The Fail2ban widget reads ban counts and recently banned IPs via the fail2ban socket or log scrape.
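The log-scrape path can be sketched as follows. This is an illustrative Python example, not the widget's implementation: `summarise_bans` is a hypothetical helper, and the pattern assumes Fail2ban's default log line layout, in which a ban appears as `NOTICE [<jail>] Ban <ip>`. A customised log format would need a different pattern.

```python
import re
from collections import Counter

# Matches ban events only; "Unban" and "Restore Ban" lines do not match
# because "Ban" must directly follow the "[jail]" field.
BAN_RE = re.compile(r"NOTICE\s+\[(?P<jail>[^\]]+)\]\s+Ban\s+(?P<ip>\S+)")


def summarise_bans(lines, recent=5):
    """Count ban events per jail and list the most recently banned IPs."""
    per_jail = Counter()
    banned = []
    for line in lines:
        match = BAN_RE.search(line)
        if match:
            per_jail[match["jail"]] += 1
            banned.append(match["ip"])
    return {
        "total": len(banned),
        "per_jail": dict(per_jail),
        # Log lines are chronological, so the tail holds the newest bans.
        "recent": banned[-recent:][::-1],
    }
```

Reading from the socket instead would avoid depending on the log format, at the cost of needing access to the Fail2ban socket from wherever the widget runs.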
Uptime Kuma
[placeholder — not yet deployed. See Uptime Kuma for planned deployment.]
Target monitors once deployed:
- web.level147.net: ops dashboard (HTTP 200)
- docs.level147.net: docs site (HTTP 200)
- Gitea internal health endpoint
- Woodpecker CI
Alerting
[placeholder — define alerting channels and thresholds]
Proposed alerting rules:
- Dashboard unreachable > 5 minutes → immediate
- Disk usage > 85% → warn; > 95% → critical
- Memory usage > 90% for 10 minutes → warn
- Failed SSH logins spike (> 50/min) → immediate
- CI pipeline failure → notify via [channel TBD]
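As a rough illustration of how the disk and memory rules above could be evaluated, the following Python sketch encodes the proposed thresholds. Nothing here is deployed: the function names are hypothetical, the thresholds are the proposed values from the list, and the sample format (timestamp in seconds, used percent) is an assumption.

```python
def disk_severity(used_percent):
    """Map disk usage to a severity using the proposed 85% / 95% thresholds."""
    if used_percent > 95:
        return "critical"
    if used_percent > 85:
        return "warn"
    return "ok"


def memory_warn(samples, threshold=90.0, window_seconds=600):
    """Return True if memory stayed above the threshold for the whole window.

    `samples` is a list of (timestamp_seconds, used_percent) tuples in
    ascending time order. The rule fires only when every sample in the
    trailing run is above the threshold and that run spans the window.
    """
    if not samples:
        return False
    latest_ts = samples[-1][0]
    breach_start = None
    for ts, used in reversed(samples):
        if used <= threshold:
            break  # the continuous breach ends here
        breach_start = ts
    return breach_start is not None and latest_ts - breach_start >= window_seconds
```

Requiring a sustained breach rather than a single sample keeps short memory spikes from raising a warning.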
Log retention
| Log type | Retention |
|---|---|
| Docker container logs | 14 days (Docker log driver) |
| Woodpecker pipeline logs | 90 days |
| Fail2ban logs | 30 days |
| System auth logs | 30 days |
Health check endpoints
| Service | Endpoint | Expected response |
|---|---|---|
| Ops dashboard | http://localhost:3000/api/health | 200 OK |
| Cloudflare tunnel | $CLOUDFLARED_METRICS_URL | metrics text |
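A simple poller for HTTP endpoints like the ops dashboard one might look like the sketch below. It is a minimal Python example using only the standard library; `http_status` and `check_services` are hypothetical names, and the 5-second timeout is an arbitrary assumption. It checks for a 200 status only, so it does not validate the body of the Cloudflare tunnel metrics response.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for `url`, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered, but with a 4xx/5xx status
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, refused connection, timeout, ...


def check_services(endpoints, fetch=http_status):
    """Map each service name to True when its endpoint returns 200."""
    return {name: fetch(url) == 200 for name, url in endpoints.items()}
```

For example, `check_services({"Ops dashboard": "http://localhost:3000/api/health"})` returns a dictionary with one boolean per service. The `fetch` parameter exists so the logic can be tested without network access.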