Monitoring Architecture

Layers

Monitoring is split across two planes: real-time cluster observability (Prometheus + Grafana) and scheduled operational checks (Windmill flows). Image update detection and rollout (Diun and Keel) run alongside both.

graph LR
    subgraph realtime["Real-time (Prometheus stack)"]
        prom["Prometheus\n150Gi PVC"]
        grafana["Grafana"]
        alerts["Alertmanager"]
        exporters["Node exporters\nLonghorn exporter\nGitLab exporter"]
    end

    subgraph scheduled["Scheduled (Windmill)"]
        windmill["Windmill flows\n(Monday 07:00–07:55)"]
        pushover["Pushover\nnotifications"]
    end

    subgraph image_updates["Image Updates"]
        diun["Diun\nwatchByDefault: true\nall pod images"]
        keel["Keel\ndigest poll @every 4h"]
    end

    exporters --> prom --> grafana
    prom --> alerts
    windmill --> pushover
    diun --> pushover
    keel -->|"rolling deploy"| cluster["K8s workloads"]

Diun — Image Update Detection

Diun watches all running pod images cluster-wide (watchByDefault: true) and sends a Pushover notification whenever an upstream image digest changes. This is complementary to Keel: Diun notifies, Keel acts.

Infrastructure Health Checks (Windmill)

Each Monday morning, nine Windmill flows check different aspects of the infrastructure:

| Flow | What it checks | Alert condition |
| --- | --- | --- |
| weekly_version_check | RKE2, Rancher, Harvester, 9 Helm charts vs GitHub releases | Any upgrade available |
| weekly_infra_health | Fleet bundle status, cert expiry, Longhorn backup ages | ErrApplied bundles, cert expiry < 30d, backup > 35d old |
| storage_health | TrueNAS pool status, Synology volume health | Degraded pool, missing volume |
| rancher_backup_check | Age of last Rancher backup | > 7 days old |
| bpir4_health | SSH connectivity, uptime, OpenWrt version | SSH unreachable |
| keel_update_log | Recent Keel deploy events | Summary notification |
| resource_waste | PVCs without pods, zero-replica ReplicaSets | Any found |
| registry_cleanup_audit | GitLab registry tags older than N days | Stale tags by project |
| backup_label_enforcer | Longhorn backup labels on all PVCs | Missing labels (auto-fixes) |
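
As an illustration of how the alert conditions for weekly_infra_health could be evaluated, here is a minimal Python sketch. It is not the actual Windmill flow: the `Bundle` and `Cert` types, their fields, and the function name are assumptions made for the example, and the data would in practice come from the Fleet and certificate APIs. Only the `ErrApplied` state and the 30-day expiry window are taken from the table.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# "cert expiry < 30d" from the table above.
CERT_EXPIRY_WARN = timedelta(days=30)


@dataclass
class Bundle:  # hypothetical shape of a Fleet bundle status record
    name: str
    state: str  # e.g. "Ready" or "ErrApplied"


@dataclass
class Cert:  # hypothetical shape of a certificate record
    name: str
    not_after: datetime  # timezone-aware expiry timestamp


def infra_health_alerts(bundles, certs, now=None):
    """Return one alert line per problem; an empty list means healthy."""
    now = now or datetime.now(timezone.utc)
    alerts = []
    for b in bundles:
        if b.state == "ErrApplied":
            alerts.append(f"bundle {b.name} is ErrApplied")
    for c in certs:
        remaining = c.not_after - now
        if remaining < CERT_EXPIRY_WARN:
            alerts.append(f"cert {c.name} expires in {remaining.days}d")
    return alerts
```

Returning a plain list keeps the check separate from the notification step, so the same result could be sent to Pushover only when the list is non-empty.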

Non-K8s Version Checks

A separate weekly flow checks firmware and OS versions, along with basic health, on infrastructure outside Kubernetes:

| Host | Method | Checks |
| --- | --- | --- |
| salt / pepper | TrueNAS REST API | Pool health, available updates |
| santillo / nastillo | Synology DSM REST API | DSM version, volume status |
| BPI-R4 | SSH | OpenWrt version, uptime |

Internal DNS from Windmill pods

Windmill pods use K8s cluster DNS, which does not resolve infrastructure hostnames such as salt or santillo. Monitoring scripts therefore connect to these hosts by IP address.
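
One way to keep hard-coded addresses manageable is a single lookup table shared by the monitoring scripts. The sketch below is an assumption about how that could look, not the scripts' actual code: the IP addresses are placeholders from the RFC 5737 documentation range, and `INFRA_HOSTS`, `endpoint`, and the `/status` path are hypothetical names.

```python
import ipaddress

# Placeholder addresses (RFC 5737 documentation range); replace with real ones.
INFRA_HOSTS = {
    "salt": "192.0.2.10",
    "pepper": "192.0.2.11",
    "santillo": "192.0.2.20",
    "nastillo": "192.0.2.21",
}


def endpoint(host: str, path: str = "/", scheme: str = "https") -> str:
    """Build a URL for an infrastructure host without relying on DNS."""
    try:
        ip = INFRA_HOSTS[host]
    except KeyError:
        raise KeyError(f"no static IP configured for {host!r}") from None
    ipaddress.ip_address(ip)  # raises ValueError if the entry is malformed
    return f"{scheme}://{ip}{path}"
```

For example, `endpoint("salt", "/status")` yields a URL that points at the placeholder address for salt. Keeping the mapping in one place means an address change touches a single table rather than every script.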

Longhorn Backup Monitoring Thresholds

Thresholds are set to match the actual backup cadence rather than arbitrary defaults:

| Label group | Expected cadence | Alert after |
| --- | --- | --- |
| default (daily snapshots) | Daily | 35 days |
| weekly (off-site to MinIO) | Weekly | 70 days |

These values were calibrated after the initial monitoring run produced false positives: the default 10d/60d thresholds did not match the operational schedule.
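
The per-group thresholds can be applied with a small lookup, sketched here in Python. The 35-day and 70-day values come from the table above; the function name and the input shape (volume name mapped to a label group and last-backup timestamp) are assumptions for illustration, and a real check would read them from Longhorn.

```python
from datetime import datetime, timedelta, timezone

# Alert thresholds per label group, taken from the table above.
ALERT_AFTER = {
    "default": timedelta(days=35),
    "weekly": timedelta(days=70),
}


def stale_backups(last_backup, now=None):
    """Return (volume, group, age_in_days) for backups past their threshold.

    last_backup maps volume name -> (label group, timezone-aware timestamp).
    """
    now = now or datetime.now(timezone.utc)
    stale = []
    for volume, (group, timestamp) in sorted(last_backup.items()):
        age = now - timestamp
        if age > ALERT_AFTER[group]:  # strictly older than the threshold
            stale.append((volume, group, age.days))
    return stale
```

A backup exactly at the threshold is not reported; only one strictly older than it is, which matches the "backup > 35d old" wording used for weekly_infra_health.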