
Windmill Automation Platform

Windmill is an open-source workflow automation platform running in the windmill namespace. It serves as the operational brain of the homelab — running scheduled health checks, version monitors, and maintenance flows.

Monday morning health train

Every flow sends a Pushover alert when it has findings. The Monday morning jobs start at staggered times between 07:00 and 09:00 (Europe/Zurich) to limit parallel load on the cluster API.

flowchart LR
    sched["Cron scheduler\nMon 07:00–09:00\nEurope/Zurich"]

    sched --> v["07:00\nweekly_version_check\nRKE2 / Rancher / Harvester\n+ 9 Helm charts"]
    sched --> n["07:00\nnon_k8s_update_check\nDSM (Synology) /\nTrueNAS firmware"]
    sched --> s["07:30\nstorage_health\nTrueNAS + Synology\npool health via REST/SSH"]
    sched --> b["07:45\nbpir4_health\nUptime / WAN\nkernel errors"]
    sched --> k["07:55\nkeel_update_log\nRecent image updates"]
    sched --> rw["08:00\nresource_waste\nUnused PVCs / zero-replica RSes\nSpot Unknown-node recovery"]
    sched --> rs["09:00\nrackspace_spend\nSpot cost vs $50/mo cap"]

    v & n & s & b & k & rw & rs --> pushover["Pushover\nnotification"]
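
The "alert only on findings" behaviour can be sketched as a small Python helper of the kind a Windmill script might call at the end of a check. This is a minimal sketch, not the homelab's actual code: the function names and the title format are assumptions, and the token and user key would come from wherever the secrets are stored. The endpoint and the `token` / `user` / `title` / `message` form fields are those of the public Pushover Message API.

```python
import urllib.parse
import urllib.request
from typing import Dict, List, Optional

PUSHOVER_URL = "https://api.pushover.net/1/messages.json"


def build_pushover_payload(
    token: str, user: str, job: str, findings: List[str]
) -> Optional[Dict[str, str]]:
    """Return Pushover form fields, or None when the check found nothing."""
    if not findings:
        return None  # no findings, no notification
    body = "\n".join(f"- {item}" for item in findings)
    return {
        "token": token,  # Pushover application token
        "user": user,    # Pushover user or group key
        "title": f"{job}: {len(findings)} finding(s)",  # hypothetical title format
        "message": body[:1024],  # Pushover caps messages at 1024 characters
    }


def send_pushover(payload: Dict[str, str]) -> int:
    """POST the payload as form data and return the HTTP status code."""
    data = urllib.parse.urlencode(payload).encode()
    request = urllib.request.Request(PUSHOVER_URL, data=data, method="POST")
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

A check would collect its findings in a list, call `build_pushover_payload(...)`, and only call `send_pushover(...)` when the result is not `None`.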

Other cadences

| Cadence | Path | Type | Purpose |
| --- | --- | --- | --- |
| Every 5m | f/infra_health/longhorn_rebuild_stuck | script | Tracks active Longhorn replica rebuilds via Pushgateway gauge; auto-deletes a stalled WO replica after 30 min of no progress (cap 1 per run) so Longhorn re-schedules on a healthier node |
| Every 5m | f/infra_health/ntp_health | script | Functional NTP probe: queries the ntppool chrony server over NTP and publishes sync state to Pushgateway; backs the NtpChronyNotServing / NtpChronyUnsynced alerts. Pod-level availability is not proof the daemon is serving time |
| Every 4h | f/gitlab/upgrade_auto_flow | flow | Unattended GitLab patch / one-minor upgrade; see GitLab → Automated upgrades |
| Hourly :15 | f/infra_health/daily_infra_health | script | Fleet bundle readiness, cert expiry, Longhorn backup ages, node disk pressure. Publishes _last_run_timestamp_seconds + per-job gauges to Pushgateway. Runs hourly so a transient bad value self-corrects within the hour instead of pinning an alert until the next run |
| Daily 01:30 | f/infra_health/rancher_backup_check | script | Rancher backup snapshot age + error check |
| Daily 01:30 | f/infra_health/gitlab_backup_health | script | GitLab backup age, runner pod + group runner status |
| Daily 02:00 | f/security/pod_image_cve_scan | script | Grype+Syft CVE scan of all running pod images, all registries (incl. docker.io), 12/day cap |
| Daily 03:00 | f/config_backup/config_backup | script | Daily snapshot of mdapi configs |
| Daily | f/infra_health/schedule_cadence_map | script | Publishes windmill_schedule_interval_seconds{job=...} for every scheduled probe so the BatchJobStale Prometheus rule can fire at 2 × interval per-job (replaces the previous regex-based daily/weekly threshold split) |
| Wed 04:00 | f/longhorn/backup_label_enforcer | script | Ensures all PVCs carry the correct Longhorn recurring-job labels |
| 1st of month 08:00 | f/gitlab/registry_cleanup_audit | script | GitLab container registry cleanup-policy audit |
| 1st of month 09:00 | f/infra_health/fleet_maxhistory_audit | script | Fleet maxHistory audit across all GitRepos |
| On-demand | f/gitlab/gitlab_upgrade_flow | flow | Semi-automated GitLab upgrade with manual approvals; used for multi-minor hops |
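
The "2 × interval" staleness rule that schedule_cadence_map enables can be illustrated in plain Python. The real check is a Prometheus rule (BatchJobStale) whose expression is not shown here, so the following is only a sketch of the arithmetic under assumed names: `is_stale` and `stale_jobs` are hypothetical helpers, and the dictionaries stand in for the last-run timestamps and per-job intervals published to Pushgateway.

```python
import time
from typing import Dict, List, Optional


def is_stale(
    last_run_ts: float,
    interval_seconds: float,
    now: Optional[float] = None,
    factor: float = 2.0,
) -> bool:
    """True when more than `factor` intervals have passed since the last run."""
    current = time.time() if now is None else now
    return (current - last_run_ts) > factor * interval_seconds


def stale_jobs(
    last_runs: Dict[str, float], intervals: Dict[str, float], now: float
) -> List[str]:
    """Names of jobs whose last run is older than 2 x their own interval."""
    return sorted(
        job
        for job, ts in last_runs.items()
        if job in intervals and is_stale(ts, intervals[job], now)
    )
```

With per-job intervals, an hourly probe is flagged after a little over two hours while a weekly one gets two weeks, which is the point of replacing a fixed daily/weekly threshold split.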

Version Checking Pattern

The version check flow compares running versions against upstream release APIs:

flowchart LR
    flow["weekly_version_check"]

    flow -->|"Rancher API"| k8s["K8s version\n(mdapi-prod + mdapi-rancher)"]
    flow -->|"GitHub Releases API"| gh["Rancher / Longhorn / cert-manager\n+ 6 other Helm charts"]
    flow -->|"SSH + REST"| nas["TrueNAS / Synology\nfirmware versions"]

    k8s & gh & nas --> compare["Compare running vs latest"]
    compare -->|"upgrade available"| alert["Pushover alert\nwith versions"]
    compare -->|"up to date"| silence["No notification"]

Only actionable upgrades generate alerts — no noise for services that are current.
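
The compare step can be sketched as follows. This is an illustrative sketch rather than the flow's actual implementation: the function names are hypothetical, and the parser deliberately looks only at the numeric `major.minor.patch` part, so build suffixes (for example an RKE2 `+rke2r1` revision) and pre-release tags are ignored.

```python
import re
from typing import Dict, List, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Extract the leading dotted number, e.g. 'v1.30.4+rke2r1' -> (1, 30, 4)."""
    match = re.match(r"v?(\d+(?:\.\d+)*)", text.strip())
    if not match:
        raise ValueError(f"unparseable version: {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def is_newer(latest: str, running: str) -> bool:
    """Compare numerically, padding with zeros so '1.30' equals '1.30.0'."""
    a, b = parse_version(latest), parse_version(running)
    width = max(len(a), len(b))
    a_padded = a + (0,) * (width - len(a))
    b_padded = b + (0,) * (width - len(b))
    return a_padded > b_padded


def upgrades_available(running: Dict[str, str], latest: Dict[str, str]) -> List[str]:
    """One line per component with a newer upstream release; empty if all current."""
    return [
        f"{name}: {running[name]} -> {latest[name]}"
        for name in sorted(running)
        if name in latest and is_newer(latest[name], running[name])
    ]
```

An empty list maps to the "No notification" branch; a non-empty list becomes the body of the alert with both versions shown.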

Postgres Persistence

Windmill's Postgres runs as a StatefulSet with a 5 Gi harvester-longhorn-2replicas PVC. The PV reclaim policy is Retain (manually set — the default for dynamically provisioned PVs is Delete).

Fresh Windmill install = data loss

The default Helm chart does not configure persistence out of the box. A fresh install without a pre-existing PVC starts with an empty database — all workflows, schedules, and variables are lost. Always verify the PVC is bound and the PV reclaim policy is Retain before any Helm operation.
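
The pre-Helm verification can be expressed as a small check over the JSON that `kubectl get pvc <name> -n windmill -o json` and `kubectl get pv <name> -o json` return. This is a sketch under assumptions: `persistence_problems` is a hypothetical helper, and the object names are placeholders for the real PVC and PV. The fields it reads (`status.phase`, `spec.volumeName`, `spec.persistentVolumeReclaimPolicy`) are standard Kubernetes API fields.

```python
from typing import List


def persistence_problems(pvc: dict, pv: dict) -> List[str]:
    """Return reasons not to run a Helm operation; an empty list means safe."""
    problems = []

    phase = pvc.get("status", {}).get("phase")
    if phase != "Bound":
        problems.append(f"PVC phase is {phase!r}, expected 'Bound'")

    policy = pv.get("spec", {}).get("persistentVolumeReclaimPolicy")
    if policy != "Retain":
        problems.append(f"PV reclaim policy is {policy!r}, expected 'Retain'")

    # Guard against inspecting a PV that is not the one backing this PVC.
    if pvc.get("spec", {}).get("volumeName") != pv.get("metadata", {}).get("name"):
        problems.append("PVC is not bound to the PV that was inspected")

    return problems
```

Feeding it the two parsed JSON documents (for example via `json.load`) and aborting when the returned list is non-empty gives a simple gate in front of `helm upgrade` or `helm install`.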