Backups & Disaster Recovery

Backup is a cross-cutting concern — every data class in the stack (block volumes, relational databases, GitLab object stores, cluster configuration, network device snapshots) has its own native backup mechanism, but they all converge on the same destinations and the same off-site path. Reading any one page in this site only shows you that page's slice; this one is the cross-section.

The design goal is two independent off-site copies for every byte that matters, with a recovery path that doesn't depend on any single piece of on-prem hardware staying up. The fast-restore tier (local snapshots) is a convenience on top of that, not a substitute for it.

Architecture at a Glance

flowchart LR
    subgraph apps["Workloads"]
        pvc["Block PVCs<br/>(Longhorn)"]
        cnpg["PostgreSQL<br/>(CloudNativePG x3)"]
        gl["GitLab object stores<br/>(uploads · artifacts · LFS · registry · …)"]
        rk["Rancher cluster config<br/>(CRDs · RBAC · secrets)"]
        net["Network devices + cluster config<br/>(BPI-R4 · Technitium · OpenObserve · Scrypted)"]
    end

    subgraph snap["L0 — Local snapshots"]
        lh_snap["Longhorn snapshots<br/>(per-volume, on-cluster)"]
    end

    subgraph onsite["L1 — On-site backup (external Garage S3)"]
        g_hb["harvester-backup"]
        g_pg["*-pg-backup<br/>(barman stores)"]
        g_gl["gitlab-backups"]
        g_rk["rancher"]
        g_cb["config-backup"]
    end

    subgraph off["L2 — Off-site (daily rclone, 04:00 UTC)"]
        b2["Backblaze B2"]
        sftp["o2switch SFTP"]
    end

    pvc -->|"recurring snapshot"| lh_snap
    pvc -->|"recurring backup"| g_hb
    cnpg -->|"continuous WAL +<br/>nightly base"| g_pg
    gl -->|"daily backup-utility"| g_gl
    rk -->|"daily Rancher Backup CR"| g_rk
    net -->|"daily Windmill snapshot"| g_cb
    onsite --> b2
    onsite --> sftp

Arrows show scheduled data flow. The diagram draws no "L0 → L1" arrow (snapshot, then backup) because Longhorn runs snapshots and backups as separate recurring jobs on the same volume: they are configured side by side, not in series.

The Four Layers

L0 — Local Longhorn snapshots are copy-on-write snapshots of the underlying volume image, kept on the same nodes as the live data. They cost almost nothing in time and storage, and a restore is a volume.spec.fromBackup flip plus a pod restart. They're the right tool for "I just broke a PVC, get the previous state back" but they're worthless against losing the cluster — they don't leave the cluster.

L1 — On-site backup to the external Garage cluster moves the data off the production nodes onto separate hardware (the salt + pepper TrueNAS hosts plus a baremetal witness for quorum). This is the layer that survives a Harvester-node-class failure. The Garage cluster itself is 3-node, so it tolerates losing one node without losing the data. Each consumer has its own bucket scoped by its own access key — see Storage → Garage S3 for the full bucket list.

L2 — Off-site to Backblaze B2 is a daily rclone sync of every L1 bucket to a Backblaze B2 bucket. B2 is in a different jurisdiction and a different failure domain from the home lab. This is the layer that survives losing the entire on-prem site.

L3 — Off-site to o2switch SFTP is the same daily sync, fanned out to an SFTP account at a second provider. Two independent off-site destinations means losing either one (provider outage, credential revocation, account-level mishap) still leaves a recoverable copy.

The L2/L3 fan-out is a single Windmill flow (f/config_backup/garage_offsite_backup) that runs at 04:00 UTC daily. It is idempotent (re-runs only transfer changed objects), and an exception handler wraps each bucket pair so a failure on one bucket doesn't abort the others.

By Data Class

Block PVCs — Longhorn

PVCs are labelled at provision time by longhorn-backup-labels.sh, which maps labels to Longhorn recurring-job groups. Every PVC gets at least the default group (monthly backup to the on-site Garage cluster + daily filesystem trim); namespaces holding active state (Home Assistant, GitLab, OwnCloud, Frigate, ...) additionally get weekly (weekly backup) and/or weeklysnapshot (weekly on-cluster snapshot). Cache and metrics PVCs (Prometheus, Valkey, ...) get nosnapshots since rebuilding them from scratch is faster than restoring.
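The script itself is not reproduced on this page. As a rough illustration of the mapping it is described as performing, here is a minimal Python sketch. It assumes Longhorn's PVC label convention (`recurring-job-group.longhorn.io/<group>: enabled`, with `recurring-job.longhorn.io/source: enabled` so Longhorn picks the labels up from the PVC). The per-namespace table is hypothetical, and so is the choice to simply add `nosnapshots` alongside `default`.

```python
from __future__ import annotations

# Hypothetical per-namespace policy; the real assignments live in
# longhorn-backup-labels.sh and may differ.
EXTRA_GROUPS: dict[str, set[str]] = {
    "home-assistant": {"weekly", "weeklysnapshot"},
    "gitlab": {"weekly", "weeklysnapshot"},
    "frigate": {"weeklysnapshot"},
    "prometheus": {"nosnapshots"},
    "valkey": {"nosnapshots"},
}

GROUP_PREFIX = "recurring-job-group.longhorn.io/"


def recurring_job_labels(namespace: str) -> dict[str, str]:
    """Return the PVC labels for a namespace: 'default' plus any extras."""
    groups = {"default"} | EXTRA_GROUPS.get(namespace, set())
    # Tells Longhorn to sync recurring-job labels from the PVC to the volume.
    labels = {"recurring-job.longhorn.io/source": "enabled"}
    for group in sorted(groups):
        labels[GROUP_PREFIX + group] = "enabled"
    return labels
```

A namespace that is not in the table falls through to the `default` group only, which matches the "every PVC gets at least the default group" rule above.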

Backups ship to the harvester-backup bucket on the external Garage cluster via Longhorn's BackupTarget resource (s3://harvester-backup@garage/). See Longhorn Backup Policy for the per-label schedule.
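The target URL follows Longhorn's `s3://<bucket>@<region>/<path>` shape, so `garage` here occupies the region slot rather than naming a host; the actual endpoint is normally supplied through the credential secret. The helper below is an illustrative sketch (its name and return type are not part of Longhorn) that splits such a URL into its parts.

```python
from typing import NamedTuple


class S3BackupTarget(NamedTuple):
    bucket: str
    region: str
    prefix: str


def parse_backup_target(url: str) -> S3BackupTarget:
    """Split an s3://<bucket>@<region>/<path> backup target URL."""
    scheme, sep, rest = url.partition("://")
    if scheme != "s3" or not sep:
        raise ValueError(f"not an s3:// backup target: {url!r}")
    location, _, prefix = rest.partition("/")
    bucket, at, region = location.partition("@")
    if not (bucket and at and region):
        raise ValueError(f"expected <bucket>@<region> in {url!r}")
    return S3BackupTarget(bucket, region, prefix.strip("/"))
```

For the target above, `parse_backup_target("s3://harvester-backup@garage/")` yields bucket `harvester-backup`, region `garage`, and an empty prefix.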

PostgreSQL — CloudNativePG with Barman

The stack runs three CloudNativePG clusters: bootstrap/gitlab-pg (the GitLab Rails database), home-assistant/ha-recorder-pg (the Home Assistant recorder), and paperless/paperless-pg (the Paperless DMS metadata store). All three use the same backup pattern: continuous WAL streaming via Barman Cloud to a dedicated Garage S3 bucket, plus a nightly base backup. This gives every Postgres a point-in-time restore path independent of any Longhorn snapshot of the underlying PVC — the PVC backup is a consistency hedge, the Barman store is the actual recovery substrate.

GitLab object stores — backup-utility

GitLab Rails has its own backup-utility that snapshots every internal object store (uploads, artifacts, LFS, packages, dependency proxy, container registry, terraform state, pages) into a single tarball and ships it to the gitlab-backups bucket. The job runs on schedule from the gitlab-toolbox pod. Combined with the CNPG backup of gitlab-pg and the Longhorn backup of the GitLab PVCs, a GitLab restore can pick whichever consistency boundary is cheapest for the recovery scenario.

Rancher cluster config — Rancher Backup Operator

The rancher-backup operator dumps the Rancher management cluster's CRDs, RBAC, secrets and project mappings to the rancher bucket on a daily schedule. This is what lets the management plane (Fleet bindings, downstream cluster registrations, Keycloak integrations) be reconstructed if the local cluster is rebuilt from scratch.

Network devices and cluster configuration — Windmill config_backup

Things that don't live in a PVC also need backing up: the BPI-R4 OpenWrt config tarball, the Technitium DNS zone exports, OpenObserve internal metadata, Scrypted's encrypted device store. A Windmill flow (f/config_backup/config_backup) snapshots these daily into the config-backup Garage bucket with 14-day retention — long enough to roll back a bad change, short enough that storage cost stays trivial.
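How the flow enforces the 14-day window is not described here. One simple way to express it, sketched under the assumption that each snapshot object carries a last-modified timestamp, is to compute a cutoff and select everything older. The function name and object keys are illustrative.

```python
from __future__ import annotations

from datetime import datetime, timedelta

RETENTION = timedelta(days=14)


def expired_keys(
    objects: dict[str, datetime],
    now: datetime,
    retention: timedelta = RETENTION,
) -> list[str]:
    """Return keys whose last-modified time is older than the retention window."""
    cutoff = now - retention
    return sorted(key for key, modified in objects.items() if modified < cutoff)
```

Passing `now` explicitly keeps the function deterministic, which makes the retention boundary easy to test.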

Off-site Sync — garage_offsite_backup

The single daily Windmill flow that owns the L2/L3 fan-out reads every L1 bucket and rclone-syncs each one to both Backblaze B2 and an o2switch SFTP account. It runs at 04:00 UTC, after the per-class backup jobs above have finished. Each bucket-pair sync is wrapped in its own exception handler, so a transient B2 outage on one bucket doesn't strand the other thirteen.
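The per-pair isolation described above can be sketched as follows. This is not the flow's actual source: the rclone remote names, the destination base paths, and the rule that `garage_offsite_backup_success` is 1 only when no pair failed are all assumptions made for illustration.

```python
import subprocess
from typing import Callable, Iterable

# Hypothetical rclone remotes and base paths; the real flow's names are not shown here.
SOURCE_REMOTE = "garage"
DESTINATIONS = {"b2": "b2:offsite-backups", "sftp": "o2switch:backups"}


def rclone_sync(source: str, destination: str) -> None:
    """Run one rclone sync; check=True raises CalledProcessError on a non-zero exit."""
    subprocess.run(["rclone", "sync", source, destination], check=True)


def fan_out(
    buckets: Iterable[str],
    sync: Callable[[str, str], None] = rclone_sync,
) -> dict:
    """Sync every bucket to every destination, isolating failures per pair."""
    ok, failures = 0, []
    for bucket in buckets:
        for name, base in DESTINATIONS.items():
            try:
                sync(f"{SOURCE_REMOTE}:{bucket}", f"{base}/{bucket}")
                ok += 1
            except Exception as exc:  # one bad pair must not abort the rest
                failures.append((bucket, name, str(exc)))
    return {
        "garage_offsite_backup_success": 0 if failures else 1,
        "pairs_ok": ok,
        "pairs_failed": len(failures),
        "failures": failures,
    }
```

Because `rclone sync` only transfers objects that differ between source and destination, re-running the loop after a partial failure repeats little work, which is what makes the flow safe to retry.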

Object-shape matters for tuning here: buckets holding a few large objects (SQL dumps, archives) sync fastest with --fast-list; buckets holding millions of small objects (Longhorn block storage in harvester-backup) need streaming listing instead, or they OOM the worker by materializing the full listing in memory. This is why the flow carries a per-bucket policy rather than one set of flags for all of them.
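A per-bucket policy of this kind might look like the sketch below. It assumes that "streaming listing" means omitting `--fast-list`, so rclone falls back to its default directory-by-directory listing, which uses less memory at the cost of more API calls. The policy table and function name are illustrative.

```python
from __future__ import annotations

# Buckets not listed here default to "fast-list". Only harvester-backup is
# named in the text as needing the low-memory path; other entries would be guesses.
LISTING_POLICY: dict[str, str] = {
    "harvester-backup": "streaming",
}


def rclone_args(
    bucket: str,
    source: str,
    destination: str,
    policy: dict[str, str] = LISTING_POLICY,
) -> list[str]:
    """Build the rclone command line for one bucket according to its listing policy."""
    args = ["rclone", "sync", source, destination]
    if policy.get(bucket, "fast-list") == "fast-list":
        # Fewer listing calls, but the whole listing is held in memory.
        args.append("--fast-list")
    return args
```

Keeping the policy in a table rather than in branching code means a newly added bucket gets a sensible default and can be switched to the low-memory path with one line.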

Monitoring

Backups that fail silently are worse than no backups at all, so every layer has an age-based liveness check.
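An age-based check reduces to comparing the time since the last successful backup against the interval the schedule promises, plus some slack. The sketch below illustrates that comparison only; the interval table and the six-hour grace period are hypothetical values, not the probes' real thresholds.

```python
from __future__ import annotations

from datetime import datetime, timedelta
from typing import Optional

# Hypothetical expected intervals per label group.
EXPECTED_INTERVAL = {
    "default": timedelta(days=31),
    "weekly": timedelta(days=7),
}
GRACE = timedelta(hours=6)  # slack for slow or delayed runs (assumed value)


def is_stale(last_success: Optional[datetime], group: str, now: datetime) -> bool:
    """True if the newest backup is missing or older than its schedule allows."""
    if last_success is None:
        return True  # never backed up counts as stale
    return now - last_success > EXPECTED_INTERVAL[group] + GRACE
```

Treating "no backup at all" as stale is the important case: a job that has never run produces no failure event, only an absence.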

| Probe | Cadence | What fires |
| --- | --- | --- |
| f/infra_health/daily_infra_health | Hourly | Per-PVC Longhorn backup age (against the label group's expected schedule), per-CNPG cluster last-base-backup age |
| f/infra_health/rancher_backup_check | Daily 01:30 | Rancher Backup CR age + error state |
| f/infra_health/gitlab_backup_health | Daily 01:30 | GitLab backup-utility tarball age, gitlab-toolbox pod state, group-runner status |
| garage_offsite_backup self-metrics | Per-run | garage_offsite_backup_success, pairs_ok, pairs_failed pushed to Pushgateway |
| PV_DELETE_POLICY_WITHOUT_BACKUP_STRATEGY | Continuous | Any Delete-policy PV without a backup label group or an explicit Windmill snapshot allowlist |

The last guard is the most important: it makes "this PVC has no backup path" a Prometheus alert rather than a discovery made at restore time. See Storage → PV Reclaim Policy for the cross-check.
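The logic behind that guard can be illustrated with a short sketch. The real rule is a Prometheus alert whose expression is not shown here; this version assumes PV objects as plain dictionaries in the Kubernetes API shape, with the backup group labels already resolved onto each PV, and a hypothetical allowlist of PV names covered by a Windmill snapshot.

```python
from __future__ import annotations

GROUP_PREFIX = "recurring-job-group.longhorn.io/"


def unprotected_pvs(pvs: list[dict], allowlist: set[str]) -> list[str]:
    """Names of Delete-policy PVs with no backup group label and no allowlist entry."""
    flagged = []
    for pv in pvs:
        if pv["spec"].get("persistentVolumeReclaimPolicy") != "Delete":
            continue  # Retain-policy PVs survive claim deletion; not this guard's concern
        name = pv["metadata"]["name"]
        labels = pv["metadata"].get("labels") or {}
        has_group = any(
            key.startswith(GROUP_PREFIX) and value == "enabled"
            for key, value in labels.items()
        )
        if not has_group and name not in allowlist:
            flagged.append(name)
    return sorted(flagged)
```

An empty result is the healthy state; any name returned is a volume that would be lost for good if its claim were deleted.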

Recovery Targets

The four layers translate into different recovery characteristics per data class. These are design targets, not hard guarantees, and they assume the destination hardware is itself healthy (a corrupt Garage cluster doesn't get you to L1).

| Data class | Fast restore (RPO / RTO) | Off-site restore (RPO / RTO) |
| --- | --- | --- |
| Block PVCs (Longhorn) | Snapshot interval (≤ 1 week) / minutes | Daily L2/L3 / hours (rclone pull + Longhorn restore) |
| PostgreSQL (CNPG) | WAL gap (seconds) / minutes (PITR from local Barman cache) | WAL gap / hours (Barman bootstrap from B2) |
| GitLab object stores | Daily backup-utility / 1–2 h | Daily L2/L3 / a working day end-to-end |
| Rancher cluster config | Daily / minutes | Daily L2/L3 / minutes (the dumps are small) |
| Network / config snapshots | Daily / minutes (paste a config back) | Daily L2/L3 / minutes |

The worst-case end-to-end scenario, "lose the entire on-prem site and recover from B2 + SFTP alone", is the one this architecture is engineered to make merely tedious rather than impossible.