Longhorn Backup Policy

Longhorn provides replicated block storage across the three bare-metal nodes. Snapshot scheduling and off-site backup to Garage (S3) are enforced cluster-wide by a labelling script.

Label Groups

longhorn-backup-labels.sh runs as part of the weekly maintenance flow and ensures every PVC has the correct labels. Longhorn maps labels to recurring job groups.

Label	Applied to	Effect
`recurring-job.longhorn.io/source=enabled`	All PVCs	Opts the PVC into the recurring job system
`recurring-job-group.longhorn.io/default=enabled`	All PVCs	Joins the `default` jobs: monthly backup to Garage (`0 5 1 * *`, retain 2) + daily `filesystem-trim` (`27 15 * * *`)
`recurring-job-group.longhorn.io/weekly=enabled`	Selected namespaces	Weekly off-site `backup` to Garage (Fri 01:00, retain 2)
`recurring-job-group.longhorn.io/weeklysnapshot=enabled`	Selected namespaces	Weekly local `snapshot` (Thu 01:00, retain 2)
`recurring-job-group.longhorn.io/nosnapshots=enabled`	Cache / metrics PVCs	No snapshots; only the every-4h `snapshot-cleanup`

Namespaces in the weekly group: appdaemon, bootstrap, envuassu, esphome, frigate, home-assistant, openldap, and others where data loss would be significant.
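The rules in the table and namespace list above can be sketched as a pure function that a labelling script might call per PVC. This is an illustration, not the contents of longhorn-backup-labels.sh: the function names and the `is_cache` flag are hypothetical, the namespace set contains only the names listed above, and it assumes the `weekly` and `weeklysnapshot` groups share one namespace list and that cache / metrics PVCs are kept out of both.

```python
# Label keys come from the table above. Helper names are hypothetical.
SOURCE = "recurring-job.longhorn.io/source"
GROUP_PREFIX = "recurring-job-group.longhorn.io/"

# Only the namespaces named explicitly in the text; the real list is longer.
WEEKLY_NAMESPACES = {
    "appdaemon", "bootstrap", "envuassu", "esphome",
    "frigate", "home-assistant", "openldap",
}


def desired_labels(namespace: str, is_cache: bool = False) -> dict[str, str]:
    """Return the recurring-job labels a PVC should carry."""
    # Every PVC opts in and joins the default group, as the table states.
    labels = {SOURCE: "enabled", GROUP_PREFIX + "default": "enabled"}
    if is_cache:
        # Assumption: how a PVC is classified as cache / metrics is decided elsewhere.
        labels[GROUP_PREFIX + "nosnapshots"] = "enabled"
    elif namespace in WEEKLY_NAMESPACES:
        labels[GROUP_PREFIX + "weekly"] = "enabled"
        labels[GROUP_PREFIX + "weeklysnapshot"] = "enabled"
    return labels


def missing_labels(current: dict[str, str], desired: dict[str, str]) -> dict[str, str]:
    """Labels that are absent from the PVC or carry the wrong value."""
    return {k: v for k, v in desired.items() if current.get(k) != v}
```

Keeping the decision separate from the `kubectl label` call makes the policy easy to unit-test and lets the script run as a dry-run diff before it changes anything.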

Source of truth: `kubectl --context mdapi-prod -n longhorn-system get recurringjob`.

Backup Target

Backups ship to the external Garage cluster. The backup target is configured as a Longhorn BackupTarget resource with the URL s3://harvester-backup@garage/.
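Longhorn's S3 target URLs follow the shape `s3://<bucket>@<region>/<path>`, so the URL above names the bucket `harvester-backup` in the region slot `garage` with an empty path. A small parser, sketched here with hypothetical names (`S3Target`, `parse_backup_target`), can be handy when a health check needs the bucket name. The S3 endpoint itself is not part of the URL and is assumed to come from the credential secret.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class S3Target(NamedTuple):
    bucket: str
    region: str
    path: str


def parse_backup_target(url: str) -> S3Target:
    """Split an s3://<bucket>@<region>/<path> backup target URL into its parts."""
    parts = urlsplit(url)
    if parts.scheme != "s3" or "@" not in parts.netloc:
        raise ValueError(f"not an s3://bucket@region/ URL: {url!r}")
    bucket, region = parts.netloc.split("@", 1)
    # Drop the leading slash so an empty path is reported as "".
    return S3Target(bucket, region, parts.path.lstrip("/"))
```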

VM Backups

Longhorn also backs up KubeVirt VMs (e.g. the CipherTrust Manager appliance) as standard Longhorn volumes. This is what makes the CipherTrust Manager recoverable — a Longhorn snapshot restore brings the entire VM disk back without needing to reconfigure Akeyless.

Monitoring Thresholds

The daily_infra_health Windmill flow monitors backup ages:

Job group	Expected cadence	Alert threshold
`default` (monthly)	Monthly	35 days without snapshot
`weekly` (Garage)	Weekly	70 days without backup

Thresholds are intentionally wider than the cadence to absorb missed runs without false-positive alerts.
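The age check reduces to comparing the time since the last successful run against the per-group threshold. The sketch below is not the daily_infra_health flow itself: `ALERT_AFTER_DAYS` and `is_stale` are hypothetical names, and the caller is assumed to supply timezone-aware timestamps.

```python
from datetime import datetime, timedelta, timezone

# Thresholds in days, copied from the table above.
ALERT_AFTER_DAYS = {"default": 35, "weekly": 70}


def is_stale(group: str, last_success: datetime, now: datetime) -> bool:
    """True when the newest snapshot/backup is older than the group's threshold."""
    return now - last_success > timedelta(days=ALERT_AFTER_DAYS[group])


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # A monthly job that last ran 31 days ago is still inside the 35-day window.
    print(is_stale("default", now - timedelta(days=31), now))
```

Because the longest gap between two runs on the 1st of the month is 31 days, the 35-day threshold leaves four days of slack before the `default` group alerts.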

PV Reclaim Policy

Production PVs default to reclaimPolicy: Delete — disk space is finite and recovery from a deleted PVC almost always means restoring from backup anyway. Retain is reserved for the small set of PVCs where no backup path exists and accidental deletion would be unrecoverable. See Storage → PV Reclaim Policy for the cross-check alert that guards this.

Mount-stall auto-recovery

A persistent Longhorn behavior on the current hardware: under disk / network pressure, individual pods occasionally end up with a stale mount — Longhorn reports the volume attached + healthy, dmesg on the node is clean, but the pod's writes return EIO or its ext4 silently flips read-only. The fix is always the same — delete the pod, the new one gets a fresh mount and writes resume.

Two automated paths cover this without a human in the loop:

Pod state	What catches it	What it does
CrashLoopBackOff with `EIO` / `os error 5` / `input/output error` in recent logs	`longhorn_eio_autoheal` Windmill flow (per-pod 30 min cooldown, global cap 6/hour)	Deletes the pod; new pod attaches and recovers in ~20 s
Running but ext4 went read-only mid-flight (writes failing silently, no log signal)	`LonghornVolumeReadOnlyRemount` alert (`longhorn_volume_file_system_read_only > 0` for 2 min)	Pages with a runbook link; same pod-delete recovery

The auto-healer also scans Running pods, applying the same regex to the last 300 seconds of logs (sinceSeconds=300), for the case where the logs do carry the EIO signature even though the pod hasn't crashed. The Pushgateway exposes _last_run_* plus a rate_limit_hits counter, so a runaway healer would itself trip BatchJobStale.
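The healer's decision logic (log signature match, per-pod 30 min cooldown, global cap of 6 deletions per hour) can be sketched as below. This is a minimal illustration, not the longhorn_eio_autoheal flow's actual code: `HealLimiter`, `should_heal`, the case-insensitive match, and counting both cooldown and cap refusals in `rate_limit_hits` are all assumptions.

```python
import re
from collections import deque

# Signatures listed in the table above; matching case-insensitively is an assumption.
EIO_PATTERN = re.compile(r"\bEIO\b|os error 5|input/output error", re.IGNORECASE)

POD_COOLDOWN_S = 30 * 60   # per-pod cooldown: 30 minutes
GLOBAL_CAP = 6             # at most 6 pod deletions...
WINDOW_S = 60 * 60         # ...in any rolling hour


class HealLimiter:
    """Tracks recent heals so a flapping pod cannot trigger a delete storm."""

    def __init__(self) -> None:
        self._last_heal: dict[str, float] = {}
        self._recent: deque[float] = deque()
        self.rate_limit_hits = 0

    def allow(self, pod: str, now: float) -> bool:
        # Forget heals that have left the rolling one-hour window.
        while self._recent and now - self._recent[0] >= WINDOW_S:
            self._recent.popleft()
        last = self._last_heal.get(pod)
        if last is not None and now - last < POD_COOLDOWN_S:
            self.rate_limit_hits += 1
            return False
        if len(self._recent) >= GLOBAL_CAP:
            self.rate_limit_hits += 1
            return False
        self._last_heal[pod] = now
        self._recent.append(now)
        return True


def should_heal(log_tail: str, pod: str, limiter: HealLimiter, now: float) -> bool:
    """Heal only when the logs carry the EIO signature and the limiter agrees."""
    return bool(EIO_PATTERN.search(log_tail)) and limiter.allow(pod, now)
```

The limiter is consulted only after the signature matches, so pods with clean logs never consume rate-limit budget.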

The complementary PodIOError log-pattern detector (see Monitoring → log-based alerts) fires on the same EIO signature from any namespace and tags the alert with namespace / pod / container, so an operator doing the manual kubectl delete pod knows exactly which pod to target before opening Lens.