Monitoring Architecture
Layers
Monitoring is split across two planes: real-time cluster observability (Prometheus + Grafana) and scheduled operational checks (Windmill flows). Image update tracking (Diun and Keel) runs alongside them.
graph LR
subgraph realtime["Real-time (Prometheus stack)"]
prom["Prometheus\n150Gi PVC"]
grafana["Grafana"]
alerts["Alertmanager"]
exporters["Node exporters\nLonghorn exporter\nGitLab exporter"]
end
subgraph scheduled["Scheduled (Windmill)"]
windmill["Windmill flows\n(Monday 07:00–07:55)"]
pushover["Pushover\nnotifications"]
end
subgraph image_updates["Image Updates"]
diun["Diun\nwatchByDefault: true\nall pod images"]
keel["Keel\ndigest poll @every 4h"]
end
exporters --> prom --> grafana
prom --> alerts
windmill --> pushover
diun --> pushover
keel -->|"rolling deploy"| cluster["K8s workloads"]
Diun: Image Update Detection
Diun watches all running pod images cluster-wide (watchByDefault: true) and sends a Pushover notification whenever an upstream image digest changes. This is complementary to Keel: Diun notifies, Keel acts.
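The "notify when a digest changes" idea can be illustrated with a small sketch. This is not Diun's implementation, only a conceptual model: the function name `changed_images` and the dictionary shape (image reference mapped to digest) are assumptions made for the example.

```python
from typing import Dict, List


def changed_images(previous: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Return image references whose digest differs from the last recorded one.

    Both arguments map an image reference (e.g. "nginx:1.27") to a digest
    string. Images seen for the first time are not reported; that is an
    assumption of this sketch, not documented Diun behaviour.
    """
    changed = []
    for image, digest in sorted(current.items()):
        old = previous.get(image)
        if old is not None and old != digest:
            changed.append(image)
    return changed
```

Each reference returned here would correspond to one notification; acting on the change (a rolling deploy) stays with Keel.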
Infrastructure Health Checks (Windmill)
Each Monday morning, nine Windmill flows check different aspects of the infrastructure:
| Flow | What it checks | Alert condition |
|---|---|---|
| weekly_version_check | RKE2, Rancher, Harvester, 9 Helm charts vs GitHub releases | Any upgrade available |
| weekly_infra_health | Fleet bundle status, cert expiry, Longhorn backup ages | ErrApplied bundles, cert expiry < 30d, backup > 35d old |
| storage_health | TrueNAS pool status, Synology volume health | Degraded pool, missing volume |
| rancher_backup_check | Age of last Rancher backup | > 7 days old |
| bpir4_health | SSH connectivity, uptime, OpenWrt version | SSH unreachable |
| keel_update_log | Recent Keel deploy events | Summary notification |
| resource_waste | PVCs without pods, zero-replica ReplicaSets | Any found |
| registry_cleanup_audit | GitLab registry tags older than N days | Stale tags by project |
| backup_label_enforcer | Longhorn backup labels on all PVCs | Missing labels (auto-fixes) |
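As an illustration of how a flow might turn collected state into alert conditions, here is a minimal sketch of two of the weekly_infra_health rules from the table (ErrApplied bundles and certificates expiring in under 30 days). The function name `infra_alerts`, the input shapes, and the message wording are hypothetical; the actual flows may gather and report this data differently.

```python
from datetime import datetime, timedelta
from typing import Dict, List

CERT_WARN_DAYS = 30  # from the table: cert expiry < 30d


def infra_alerts(
    bundles: Dict[str, str],
    cert_expiry: Dict[str, datetime],
    now: datetime,
) -> List[str]:
    """Build alert messages from bundle states and certificate expiry times.

    bundles maps a bundle name to its state string; cert_expiry maps a
    certificate name to its notAfter timestamp (hypothetical input shapes).
    """
    alerts = []
    for name, state in sorted(bundles.items()):
        if state == "ErrApplied":
            alerts.append(f"bundle {name} is ErrApplied")
    for name, not_after in sorted(cert_expiry.items()):
        remaining = not_after - now
        if remaining < timedelta(days=CERT_WARN_DAYS):
            alerts.append(f"cert {name} expires in {remaining.days}d")
    return alerts
```

An empty result means neither condition fired; a non-empty list could be joined into a single notification body.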
Non-K8s Version Checks
A separate weekly flow checks firmware and OS versions on infrastructure outside Kubernetes:
| Host | Method | Checks |
|---|---|---|
| salt / pepper | TrueNAS REST API | Pool health, available updates |
| santillo / nastillo | Synology DSM REST API | DSM version, volume status |
| BPI-R4 | SSH | OpenWrt version, uptime |
Internal DNS from Windmill pods
Windmill pods use K8s cluster DNS, which does not resolve infrastructure hostnames like salt or santillo. Monitoring scripts use IP addresses directly.
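One way to keep direct IP usage manageable is a single lookup table shared by the scripts. The sketch below assumes this approach; the addresses are placeholders from the documentation range 192.0.2.0/24, not the real ones, and `INFRA_HOSTS` and `base_url` are hypothetical names.

```python
import ipaddress

# Placeholder addresses (RFC 5737 documentation range), not the real hosts.
INFRA_HOSTS = {
    "salt": "192.0.2.10",
    "santillo": "192.0.2.20",
}


def base_url(host: str, scheme: str = "https") -> str:
    """Return a URL that targets the host by IP, bypassing cluster DNS."""
    ip = INFRA_HOSTS[host]  # KeyError if the host is not in the table
    ipaddress.ip_address(ip)  # ValueError if the entry is not a literal IP
    return f"{scheme}://{ip}"
```

Keeping the mapping in one place means an address change touches a single table rather than every script.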
Longhorn Backup Monitoring Thresholds
Thresholds are set to match the actual backup cadence rather than arbitrary defaults:
| Label group | Expected cadence | Alert after |
|---|---|---|
| default (daily snapshots) | Daily | 35 days |
| weekly (off-site to MinIO) | Weekly | 70 days |
These values were calibrated after the initial monitoring run produced false positives: the default 10d/60d thresholds did not match the operational schedule.
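The threshold table can be expressed as a small staleness check. This is a sketch under stated assumptions: the names `ALERT_AFTER` and `is_backup_stale` are hypothetical, and a backup exactly at the threshold is treated as still fresh.

```python
from datetime import datetime, timedelta

# Alert thresholds per label group, taken from the table above.
ALERT_AFTER = {
    "default": timedelta(days=35),
    "weekly": timedelta(days=70),
}


def is_backup_stale(label_group: str, last_backup: datetime, now: datetime) -> bool:
    """Return True when the newest backup is older than the group's threshold."""
    return now - last_backup > ALERT_AFTER[label_group]
```

With these values, a 36-day-old backup is stale in the default group but still acceptable in the weekly group.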