Longhorn Backup Policy

Longhorn provides replicated block storage across the three bare-metal nodes. Snapshot scheduling and off-site backup to Garage (S3) are enforced cluster-wide by a labelling script.

Label Groups

longhorn-backup-labels.sh runs as part of the weekly maintenance flow and ensures every PVC has the correct labels. Longhorn maps labels to recurring job groups.

| Label | Applied to | Effect |
| --- | --- | --- |
| recurring-job.longhorn.io/source=enabled | All PVCs | Opts the PVC into the recurring job system |
| recurring-job-group.longhorn.io/default=enabled | All PVCs | Joins the default jobs: monthly backup to Garage (0 5 1 * *, retain 2) + daily filesystem-trim (27 15 * * *) |
| recurring-job-group.longhorn.io/weekly=enabled | Selected namespaces | Weekly off-site backup to Garage (Fri 01:00, retain 2) |
| recurring-job-group.longhorn.io/weeklysnapshot=enabled | Selected namespaces | Weekly local snapshot (Thu 01:00, retain 2) |
| recurring-job-group.longhorn.io/nosnapshots=enabled | Cache / metrics PVCs | No snapshots; only the every-4h snapshot-cleanup |

Namespaces in the weekly group: appdaemon, bootstrap, envuassu, esphome, frigate, home-assistant, openldap, and others where data loss would be significant.
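
As a rough illustration of the label mapping above, the sketch below builds the kubectl label command for one PVC. It is not the contents of longhorn-backup-labels.sh: the function names are hypothetical, the namespace list is only the partial list given on this page, and it assumes the weekly and weeklysnapshot labels go to the same namespaces. The nosnapshots group is left out because the rule for selecting cache / metrics PVCs is not stated here. The command is printed rather than executed, so the sketch is safe to run.

```shell
#!/usr/bin/env bash
# Sketch only: the real longhorn-backup-labels.sh may be structured differently.

# Namespaces this page names as weekly-group members (partial list).
WEEKLY_NAMESPACES="appdaemon bootstrap envuassu esphome frigate home-assistant openldap"

# Print the labels a PVC in namespace $1 should carry, one per line.
labels_for_namespace() {
  local ns="$1"
  echo "recurring-job.longhorn.io/source=enabled"
  echo "recurring-job-group.longhorn.io/default=enabled"
  case " $WEEKLY_NAMESPACES " in
    *" $ns "*)
      # Assumption: both weekly labels apply to the same namespaces.
      echo "recurring-job-group.longhorn.io/weekly=enabled"
      echo "recurring-job-group.longhorn.io/weeklysnapshot=enabled"
      ;;
  esac
}

# Print (do not run) the kubectl command that labels PVC $2 in namespace $1.
label_command() {
  local ns="$1" pvc="$2"
  # Unquoted on purpose: one label per line becomes one argument per label.
  echo kubectl --context mdapi-prod -n "$ns" label pvc "$pvc" --overwrite $(labels_for_namespace "$ns")
}
```

Calling label_command frigate data prints a single kubectl label line carrying all four labels, while a namespace outside the list gets only the source and default labels.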

Source of truth: kubectl --context mdapi-prod -n longhorn-system get recurringjob.

Backup Target

Backups ship to the external Garage cluster. The backup target is configured as a Longhorn BackupTarget resource with the URL s3://harvester-backup@garage/.
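
Longhorn reads S3 backup target URLs as s3://bucket@region/path. The helper below (a hypothetical name, not part of any script mentioned here) splits the URL above into those parts, which can help when checking the target against the Garage bucket configuration.

```shell
# Split a Longhorn-style S3 backup target URL into bucket, region and path.
parse_s3_target() {
  local url="$1" rest bucket region path
  case "$url" in
    s3://*@*/*) ;;
    *) echo "not an s3://bucket@region/path URL: $url" >&2; return 1 ;;
  esac
  rest="${url#s3://}"     # strip the scheme
  bucket="${rest%%@*}"    # text before the first @
  rest="${rest#*@}"       # text after the first @
  region="${rest%%/*}"    # text before the first /
  path="${rest#*/}"       # text after the first / (may be empty)
  echo "bucket=$bucket region=$region path=/$path"
}
```

For the target on this page, parse_s3_target 's3://harvester-backup@garage/' prints bucket=harvester-backup region=garage path=/.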

VM Backups

Longhorn also backs up KubeVirt VMs (e.g. the CipherTrust Manager appliance) as standard Longhorn volumes. This is what makes the CipherTrust Manager recoverable — a Longhorn snapshot restore brings the entire VM disk back without needing to reconfigure Akeyless.

Monitoring Thresholds

The daily_infra_health Windmill flow monitors backup ages:

| Job group | Expected cadence | Alert threshold |
| --- | --- | --- |
| default (monthly) | Monthly | 35 days without snapshot |
| weekly (Garage) | Weekly | 70 days without backup |

Thresholds are intentionally wider than the cadence to absorb missed runs without false-positive alerts.
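
The age check itself reduces to a comparison against the thresholds in the table above. The sketch below is an assumption about the shape of that check, not the code of the daily_infra_health flow. The function names are hypothetical, and timestamps are passed in as epoch seconds so the logic can be exercised without a cluster.

```shell
# Alert thresholds from the table above, in days.
threshold_days_for_group() {
  case "$1" in
    default) echo 35 ;;
    weekly)  echo 70 ;;
    *)       return 1 ;;
  esac
}

# Exit 0 when the newest backup is older than the group's threshold.
# $1 = job group, $2 = newest backup time (epoch seconds), $3 = now (epoch seconds)
backup_is_stale() {
  local group="$1" last="$2" now="$3" days
  days="$(threshold_days_for_group "$group")" || return 2   # unknown group
  [ $(( now - last )) -gt $(( days * 86400 )) ]
}
```

With these values, a default-group backup that is 36 days old counts as stale and one that is 34 days old does not.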

PV Reclaim Policy

Production PVs default to reclaimPolicy: Delete — disk space is finite and recovery from a deleted PVC almost always means restoring from backup anyway. Retain is reserved for the small set of PVCs where no backup path exists and accidental deletion would be unrecoverable. See Storage → PV Reclaim Policy for the cross-check alert that guards this.
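
For a quick manual look at which PVs deviate from the default, a filter along these lines may help. It is a sketch, not the cross-check alert referenced above. The function name is hypothetical, and the input is assumed to come from kubectl --context mdapi-prod get pv --no-headers -o custom-columns=NAME:.metadata.name,POLICY:.spec.persistentVolumeReclaimPolicy.

```shell
# Read "NAME POLICY" lines on stdin and print the PVs whose policy is not Delete.
non_delete_pvs() {
  awk '$2 != "Delete" { print $1 " " $2 }'
}
```

Every PV this prints should belong to the small set of PVCs with no backup path.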

Mount-stall auto-recovery

A persistent Longhorn behavior on the current hardware: under disk / network pressure, individual pods occasionally end up with a stale mount — Longhorn reports the volume attached + healthy, dmesg on the node is clean, but the pod's writes return EIO or its ext4 silently flips read-only. The fix is always the same — delete the pod, the new one gets a fresh mount and writes resume.

Two automated paths cover this without a human in the loop:

| Pod state | What catches it | What it does |
| --- | --- | --- |
| CrashLoopBackOff with EIO / os error 5 / input/output error in recent logs | longhorn_eio_autoheal Windmill flow (per-pod 30 min cooldown, global cap 6/hour) | Deletes the pod; new pod attaches and recovers in ~20 s |
| Running but ext4 went read-only mid-flight (writes failing silently, no log signal) | LonghornVolumeReadOnlyRemount alert (longhorn_volume_file_system_read_only > 0 for 2 min) | Pages with a runbook link; same pod-delete recovery |

The auto-healer also scans Running pods, applying the same regex to the last 300 seconds of logs (sinceSeconds=300), for the case where the logs do carry the EIO signature even though the pod hasn't crashed. The Pushgateway exposes _last_run_* plus a rate_limit_hits counter, so a runaway healer would itself trip BatchJobStale.
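
The two decisions the healer makes, whether the logs carry the EIO signature and whether the rate limits allow another pod delete, can be sketched as below. This is an illustration under assumptions, not the longhorn_eio_autoheal flow itself. The function names are hypothetical, the pattern is built only from the three signatures named in the table, and the caller is assumed to track heal timestamps and the hourly count.

```shell
# Exit 0 when stdin contains one of the EIO signatures (whole-word, case-insensitive).
log_has_eio() {
  grep -Eiqw 'EIO|os error 5|input/output error'
}

# Exit 0 when a pod may be healed now.
# $1 = epoch of this pod's last heal (0 if never)
# $2 = heals performed cluster-wide in the past hour
# $3 = now (epoch seconds)
may_heal() {
  local last_heal="$1" heals_last_hour="$2" now="$3"
  [ $(( now - last_heal )) -ge 1800 ] || return 1   # per-pod 30 min cooldown
  [ "$heals_last_hour" -lt 6 ] || return 1          # global cap of 6 per hour
}
```

A caller would pipe recent pod logs (for example from kubectl logs --since=300s) into log_has_eio and delete the pod only when may_heal also succeeds. When may_heal refuses, that is the point at which a counter like rate_limit_hits would be incremented.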

The complementary PodIOError log-pattern detector (see Monitoring → log-based alerts) fires on the same EIO signature from any namespace and tags the alert with namespace / pod / container, so an operator running the manual kubectl delete pod knows exactly which one before opening Lens.