Windmill Automation Platform
Windmill is an open-source workflow automation platform running in the windmill namespace. It serves as the operational brain of the homelab — running scheduled health checks, version monitors, and maintenance flows.
Monday morning health train
All flows send Pushover alerts on findings. The Monday morning jobs are staggered between 07:00 and 09:00 to limit parallel load on the cluster API.
flowchart LR
sched["Cron scheduler\nMon 07:00–09:00\nEurope/Zurich"]
sched --> v["07:00\nweekly_version_check\nRKE2 / Rancher / Harvester\n+ 9 Helm charts"]
sched --> n["07:00\nnon_k8s_update_check\nDSM (Synology) /\nTrueNAS firmware"]
sched --> s["07:30\nstorage_health\nTrueNAS + Synology\npool health via REST/SSH"]
sched --> b["07:45\nbpir4_health\nUptime / WAN\nkernel errors"]
sched --> k["07:55\nkeel_update_log\nRecent image updates"]
sched --> rw["08:00\nresource_waste\nUnused PVCs / zero-replica RSes\nSpot Unknown-node recovery"]
sched --> rs["09:00\nrackspace_spend\nSpot cost vs $50/mo cap"]
v & n & s & b & k & rw & rs --> pushover["Pushover\nnotification"]
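The alert-on-findings behaviour can be sketched as follows. This is a minimal illustration, not the flows' actual code: the function names `build_pushover_payload` and `send_pushover` are hypothetical, and the title format is an assumption. The endpoint and the `token`, `user`, `title`, and `message` fields are those of Pushover's public Message API.

```python
import json
import urllib.parse
import urllib.request

PUSHOVER_URL = "https://api.pushover.net/1/messages.json"
MAX_MESSAGE_CHARS = 1024  # Pushover's documented message length limit


def build_pushover_payload(flow_name, findings, token, user):
    """Return the form fields for one alert, or None when there is nothing to report.

    Hypothetical helper: the title format is an assumption, not the flows' real output.
    """
    if not findings:
        return None  # a clean run stays silent
    message = "\n".join(f"- {item}" for item in findings)
    return {
        "token": token,  # Pushover application token
        "user": user,    # Pushover user or group key
        "title": f"{flow_name}: {len(findings)} finding(s)",
        "message": message[:MAX_MESSAGE_CHARS],
    }


def send_pushover(payload):
    """POST the payload to Pushover and return the decoded JSON response."""
    data = urllib.parse.urlencode(payload).encode("utf-8")
    request = urllib.request.Request(PUSHOVER_URL, data=data, method="POST")
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

A flow would call `send_pushover` only when `build_pushover_payload` returns a dict, so a run with no findings produces no notification.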
Other cadences
| Cadence | Path | Type | Purpose |
|---|---|---|---|
| Every 5m | f/infra_health/longhorn_rebuild_stuck | script | Tracks active Longhorn replica rebuilds via Pushgateway gauge; auto-deletes a stalled WO replica after 30 min of no progress (cap 1 per run) so Longhorn re-schedules on a healthier node |
| Every 5m | f/infra_health/ntp_health | script | Functional NTP probe: queries the ntppool chrony server over NTP and publishes sync state to Pushgateway; backs the NtpChronyNotServing / NtpChronyUnsynced alerts. Pod-level availability is not proof the daemon is serving time |
| Every 4h | f/gitlab/upgrade_auto_flow | flow | Unattended GitLab patch / one-minor upgrade; see GitLab → Automated upgrades |
| Hourly :15 | f/infra_health/daily_infra_health | script | Fleet bundle readiness, cert expiry, Longhorn backup ages, node disk pressure. Publishes _last_run_timestamp_seconds + per-job gauges to Pushgateway. Runs hourly so a transient bad value self-corrects within the hour instead of pinning an alert until the next run |
| Daily 01:30 | f/infra_health/rancher_backup_check | script | Rancher backup snapshot age + error check |
| Daily 01:30 | f/infra_health/gitlab_backup_health | script | GitLab backup age, runner pod + group runner status |
| Daily 02:00 | f/security/pod_image_cve_scan | script | Grype+Syft CVE scan of all running pod images, all registries (incl. docker.io), 12/day cap |
| Daily 03:00 | f/config_backup/config_backup | script | Daily snapshot of mdapi configs |
| Daily | f/infra_health/schedule_cadence_map | script | Publishes windmill_schedule_interval_seconds{job=...} for every scheduled probe so the BatchJobStale Prometheus rule can fire at 2 × interval per-job (replaces the previous regex-based daily/weekly threshold split) |
| Wed 04:00 | f/longhorn/backup_label_enforcer | script | Ensures all PVCs carry the correct Longhorn recurring-job labels |
| 1st of month 08:00 | f/gitlab/registry_cleanup_audit | script | GitLab container registry cleanup-policy audit |
| 1st of month 09:00 | f/infra_health/fleet_maxhistory_audit | script | Fleet maxHistory audit across all GitRepos |
| On-demand | f/gitlab/gitlab_upgrade_flow | flow | Semi-automated GitLab upgrade with manual approvals, used for multi-minor hops |
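The per-job staleness threshold that schedule_cadence_map enables (BatchJobStale firing at 2 × interval) can be modelled in a few lines. The real check is a Prometheus rule; the Python below only illustrates the comparison. The function name `stale_jobs`, the dict-based inputs, and the strict greater-than comparison are assumptions made for the sketch.

```python
import time

STALE_FACTOR = 2  # the rule fires at 2 x the job's own interval


def stale_jobs(last_run, intervals, now=None):
    """Return the sorted names of jobs whose last run is older than 2 x their interval.

    last_run:  job name -> Unix timestamp of the last run
    intervals: job name -> schedule interval in seconds
    Jobs without a published interval are skipped rather than guessed at.
    """
    now = time.time() if now is None else now
    stale = []
    for job, timestamp in last_run.items():
        interval = intervals.get(job)
        if interval is None:
            continue  # no cadence published for this job, so no threshold exists
        if now - timestamp > STALE_FACTOR * interval:
            stale.append(job)
    return sorted(stale)
```

Because each job carries its own interval, a 5-minute probe is flagged after a little over 10 minutes of silence while an hourly job gets 2 hours, with no need to classify jobs by name.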
Version Checking Pattern
The version check flow compares running versions against the latest releases reported by upstream APIs:
flowchart LR
flow["weekly_version_check"]
flow -->|"Rancher API"| k8s["K8s version\n(mdapi-prod + mdapi-rancher)"]
flow -->|"GitHub Releases API"| gh["Rancher / Longhorn / cert-manager\n+ 6 other Helm charts"]
flow -->|"SSH + REST"| nas["TrueNAS / Synology\nfirmware versions"]
k8s & gh & nas --> compare["Compare running vs latest"]
compare -->|"upgrade available"| alert["Pushover alert\nwith versions"]
compare -->|"up to date"| silence["No notification"]
Only actionable upgrades generate alerts — no noise for services that are current.
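The compare step can be sketched as below. This is an illustration under assumptions rather than the flow's real code: `parse_version` and `find_upgrades` are hypothetical names, versions are assumed to be dotted integers with an optional `v` prefix and an optional `+` or `-` suffix (for example a build tag), and pre-release ordering is ignored.

```python
import re


def parse_version(text):
    """Turn a version string into a comparable tuple of at least three integers.

    Assumes dotted integers, an optional leading 'v', and an optional suffix
    introduced by '+' or '-', which is dropped. Raises ValueError otherwise.
    """
    core = re.split(r"[+-]", text.strip().lstrip("vV"), maxsplit=1)[0]
    parts = tuple(int(part) for part in core.split("."))
    # Pad short versions so that '1.14' and '1.14.0' compare as equal.
    return parts + (0,) * (3 - len(parts))


def find_upgrades(running, latest):
    """Return (name, running, latest) for every component with a newer upstream release.

    Components that are current, or that have no upstream entry, are omitted,
    so an empty list means there is nothing to report.
    """
    upgrades = []
    for name, current in sorted(running.items()):
        newest = latest.get(name)
        if newest is None:
            continue
        if parse_version(newest) > parse_version(current):
            upgrades.append((name, current, newest))
    return upgrades
```

Comparing integer tuples rather than raw strings matters here: as strings, `1.9.9` sorts after `1.10.0`, which would hide a real upgrade.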
Postgres Persistence
Windmill's Postgres runs as a StatefulSet with a 5 Gi harvester-longhorn-2replicas PVC. The PV reclaim policy is Retain (manually set — the default for dynamically provisioned PVs is Delete).
Fresh Windmill install = data loss
The default Helm chart does not configure persistence out of the box. A fresh install without a pre-existing PVC starts with an empty database — all workflows, schedules, and variables are lost. Always verify the PVC is bound and the PV reclaim policy is Retain before any Helm operation.
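The pre-flight check described above can be sketched as a small function over the JSON that `kubectl get pvc <name> -n windmill -o json` and `kubectl get pv <name> -o json` return. The function name `persistence_problems` is hypothetical and the PVC and PV names are deployment-specific. The fields read (`status.phase`, `spec.volumeName`, `spec.persistentVolumeReclaimPolicy`) are standard Kubernetes API fields.

```python
def persistence_problems(pvc, pv):
    """Return a list of human-readable problems; an empty list means the checks passed.

    pvc and pv are dicts parsed from `kubectl get ... -o json` output.
    """
    problems = []

    phase = pvc.get("status", {}).get("phase")
    if phase != "Bound":
        problems.append(f"PVC phase is {phase!r}, expected 'Bound'")

    # Make sure the PV being inspected is the one this PVC is actually bound to.
    volume_name = pvc.get("spec", {}).get("volumeName")
    pv_name = pv.get("metadata", {}).get("name")
    if volume_name != pv_name:
        problems.append(f"PVC is bound to {volume_name!r}, but the PV checked is {pv_name!r}")

    policy = pv.get("spec", {}).get("persistentVolumeReclaimPolicy")
    if policy != "Retain":
        problems.append(f"PV reclaim policy is {policy!r}, expected 'Retain'")

    return problems
```

A wrapper script could parse the two `kubectl` outputs with `json.loads`, call this function, and refuse to continue with the Helm operation if the returned list is non-empty.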