Windmill Automation Platform

Windmill is an open-source workflow automation platform running in the windmill namespace. It serves as the operational brain of the homelab — running scheduled health checks, version monitors, and maintenance flows.

Monday morning health train

All flows send Pushover alerts on findings. The Monday morning jobs use staggered start times between 07:00 and 09:00 (Europe/Zurich) to avoid parallel load on the cluster API.

flowchart LR
    sched["Cron scheduler\nMon 07:00–09:00\nEurope/Zurich"]

    sched --> v["07:00\nweekly_version_check\nRKE2 / Rancher / Harvester\n+ 9 Helm charts"]
    sched --> n["07:00\nnon_k8s_update_check\nDSM (Synology) /\nTrueNAS firmware"]
    sched --> s["07:30\nstorage_health\nTrueNAS + Synology\npool health via REST/SSH"]
    sched --> b["07:45\nbpir4_health\nUptime / WAN\nkernel errors"]
    sched --> k["07:55\nkeel_update_log\nRecent image updates"]
    sched --> rw["08:00\nresource_waste\nUnused PVCs / zero-replica RSes\nSpot Unknown-node recovery"]
    sched --> rs["09:00\nrackspace_spend\nSpot cost vs $50/mo cap"]

    v & n & s & b & k & rw & rs --> pushover["Pushover\nnotification"]
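
The alert-on-findings behaviour can be sketched roughly as follows. This is a minimal illustration, not the homelab's actual script: the function names, the title format, and the shape of `findings` (a list of strings) are assumptions. The endpoint and the `token`, `user`, `title`, and `message` form fields are those of the public Pushover message API.

```python
import urllib.parse
import urllib.request

PUSHOVER_URL = "https://api.pushover.net/1/messages.json"


def build_pushover_payload(check_name, findings, token, user_key):
    """Return the form fields for one alert, or None when there is nothing to report."""
    if not findings:
        return None  # healthy run: stay silent
    return {
        "token": token,      # Pushover application token
        "user": user_key,    # Pushover user or group key
        "title": f"[homelab] {check_name}",  # hypothetical title format
        "message": "\n".join(f"- {item}" for item in findings),
    }


def send_pushover(payload, timeout=10):
    """POST the form-encoded payload to Pushover and return the HTTP status code."""
    data = urllib.parse.urlencode(payload).encode()
    request = urllib.request.Request(PUSHOVER_URL, data=data, method="POST")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def main(check_name, findings, token, user_key):
    """Entry point in the style of a Windmill Python script (assumed signature)."""
    payload = build_pushover_payload(check_name, findings, token, user_key)
    if payload is None:
        return {"notified": False}
    return {"notified": True, "status": send_pushover(payload)}
```

Keeping the payload construction separate from the network call means the "no findings, no notification" rule can be checked without contacting Pushover.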

Other cadences

Cadence	Path	Type	Purpose
Every 5m	`f/infra_health/longhorn_rebuild_stuck`	script	Tracks active Longhorn replica rebuilds via Pushgateway gauge; auto-deletes a stalled `WO` replica after 30 min of no progress (cap 1 per run) so Longhorn re-schedules on a healthier node
Every 5m	`f/infra_health/ntp_health`	script	Functional NTP probe — queries the `ntppool` chrony server over NTP and publishes sync state to Pushgateway; backs the `NtpChronyNotServing` / `NtpChronyUnsynced` alerts. Pod-level availability is not proof the daemon is serving time
Every 4h	`f/gitlab/upgrade_auto_flow`	flow	Unattended GitLab patch / one-minor upgrade — see GitLab → Automated upgrades
Hourly :15	`f/infra_health/daily_infra_health`	script	Fleet bundle readiness, cert expiry, Longhorn backup ages, node disk pressure. Publishes `_last_run_timestamp_seconds` + per-job gauges to Pushgateway. Runs hourly so a transient bad value self-corrects within the hour instead of pinning an alert until the next run
Daily 01:30	`f/infra_health/rancher_backup_check`	script	Rancher backup snapshot age + error check
Daily 01:30	`f/infra_health/gitlab_backup_health`	script	GitLab backup age, runner pod + group runner status
Daily 02:00	`f/security/pod_image_cve_scan`	script	Grype+Syft CVE scan of all running pod images, all registries (incl. docker.io), 12/day cap
Daily 03:00	`f/config_backup/config_backup`	script	Daily snapshot of mdapi configs
Daily	`f/infra_health/schedule_cadence_map`	script	Publishes `windmill_schedule_interval_seconds{job=...}` for every scheduled probe so the `BatchJobStale` Prometheus rule can fire at `2 × interval` per-job (replaces the previous regex-based daily/weekly threshold split)
Wed 04:00	`f/longhorn/backup_label_enforcer`	script	Ensures all PVCs carry the correct Longhorn recurring-job labels
1st of month 08:00	`f/gitlab/registry_cleanup_audit`	script	GitLab container registry cleanup-policy audit
1st of month 09:00	`f/infra_health/fleet_maxhistory_audit`	script	Fleet `maxHistory` audit across all GitRepos
On-demand	`f/gitlab/gitlab_upgrade_flow`	flow	Semi-automated GitLab upgrade with manual approvals — used for multi-minor hops
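
As an illustration of the per-job staleness rule described for `schedule_cadence_map`, the sketch below renders `windmill_schedule_interval_seconds{job=...}` samples in Prometheus text exposition format and applies the `2 × interval` threshold in plain Python. The job names, interval values, and helper names are assumptions for the example. The real `BatchJobStale` rule is evaluated by Prometheus, not by this code.

```python
def interval_gauge_lines(intervals):
    """Render one gauge sample per job in Prometheus text exposition format."""
    lines = ["# TYPE windmill_schedule_interval_seconds gauge"]
    for job, seconds in sorted(intervals.items()):
        lines.append(f'windmill_schedule_interval_seconds{{job="{job}"}} {seconds}')
    return "\n".join(lines) + "\n"


def is_stale(now, last_run, interval, factor=2):
    """True when the time since the last run exceeds factor * interval (all in seconds)."""
    return (now - last_run) > factor * interval


# Hypothetical cadence map: a 5-minute probe and an hourly probe.
intervals = {"ntp_health": 300, "daily_infra_health": 3600}
body = interval_gauge_lines(intervals)
```

With these values, a 5-minute probe is considered stale once more than 600 seconds have passed since its last run, while the hourly probe is given 7200 seconds. That per-job threshold is what a single daily/weekly split cannot express.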

Version Checking Pattern

The version check flow compares running versions against upstream release APIs:

flowchart LR
    flow["weekly_version_check"]

    flow -->|"Rancher API"| k8s["K8s version\n(mdapi-prod + mdapi-rancher)"]
    flow -->|"GitHub Releases API"| gh["Rancher / Longhorn / cert-manager\n+ 6 other Helm charts"]
    flow -->|"SSH + REST"| nas["TrueNAS / Synology\nfirmware versions"]

    k8s & gh & nas --> compare["Compare running vs latest"]
    compare -->|"upgrade available"| alert["Pushover alert\nwith versions"]
    compare -->|"up to date"| silence["No notification"]

Only actionable upgrades generate alerts — no noise for services that are current.
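
The compare step can be sketched as below. This is a simplified example under stated assumptions: versions are compared on their numeric `major.minor.patch` components only (build suffixes such as `+rke2r1` and pre-release tags are ignored), and the function names and component names are illustrative, not taken from the actual flow.

```python
import re


def parse_version(text):
    """Extract numeric components, e.g. 'v1.30.4+rke2r1' becomes (1, 30, 4)."""
    match = re.match(r"v?(\d+)\.(\d+)(?:\.(\d+))?", text.strip())
    if match is None:
        raise ValueError(f"unrecognised version: {text!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def find_upgrades(running, latest):
    """Return one line per component whose latest version is newer than the running one."""
    upgrades = []
    for name, current in sorted(running.items()):
        newest = latest.get(name)
        if newest is not None and parse_version(newest) > parse_version(current):
            upgrades.append(f"{name}: {current} -> {newest}")
    return upgrades


# Hypothetical inputs: running versions vs. latest upstream releases.
running = {"rancher": "v2.8.2", "longhorn": "v1.6.0"}
latest = {"rancher": "v2.8.3", "longhorn": "v1.6.0"}
upgrades = find_upgrades(running, latest)
```

An empty result means every component is current, in which case no notification would be sent. Comparing integer tuples rather than strings avoids ordering `1.10.0` below `1.9.9`.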

Postgres Persistence

Windmill's Postgres runs as a StatefulSet with a 5 Gi harvester-longhorn-2replicas PVC. The PV reclaim policy is Retain (manually set — the default for dynamically provisioned PVs is Delete).

Fresh Windmill install = data loss

The Helm chart does not configure persistence by default. A fresh install without a pre-existing PVC starts with an empty database, so all workflows, schedules, and variables are lost. Always verify that the PVC is bound and the PV reclaim policy is Retain before any Helm operation.
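
One way to make that check repeatable is a small pre-flight function such as the sketch below. It assumes the PVC and PV objects have already been fetched as JSON (for example with `kubectl get pvc <name> -n windmill -o json` and `kubectl get pv <name> -o json`) and parsed into dictionaries. The function name and the sample object names are hypothetical. The fields it reads (`status.phase`, `spec.volumeName`, `spec.persistentVolumeReclaimPolicy`) are standard Kubernetes fields.

```python
def persistence_problems(pvc, pv):
    """Return a list of reasons why a Helm operation should not proceed (empty if safe)."""
    problems = []

    phase = pvc.get("status", {}).get("phase")
    if phase != "Bound":
        problems.append(f"PVC phase is {phase!r}, expected 'Bound'")

    policy = pv.get("spec", {}).get("persistentVolumeReclaimPolicy")
    if policy != "Retain":
        problems.append(f"PV reclaim policy is {policy!r}, expected 'Retain'")

    # Guard against checking a PV that is not the one backing this PVC.
    bound_to = pvc.get("spec", {}).get("volumeName")
    pv_name = pv.get("metadata", {}).get("name")
    if bound_to != pv_name:
        problems.append(f"PVC is bound to {bound_to!r}, but PV checked is {pv_name!r}")

    return problems


# Hypothetical objects, trimmed to the fields the check reads.
pvc = {"spec": {"volumeName": "pvc-1234"}, "status": {"phase": "Bound"}}
pv = {"metadata": {"name": "pvc-1234"},
      "spec": {"persistentVolumeReclaimPolicy": "Delete"}}
problems = persistence_problems(pvc, pv)
```

In this example the check reports the `Delete` policy, which is the state a dynamically provisioned PV is in before the policy is changed to Retain by hand.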