
Storage Architecture

Overview

Storage is split across four complementary tiers: Longhorn for replicated block storage on the cluster nodes (now on the FusionIO flash tier), an in-cluster Rook Ceph cluster for bulk RWX and object capacity on the repurposed SSDs, Garage for S3-compatible object storage, and TrueNAS via democratic-csi for NFS/iSCSI (currently idle because the pool has no spare space).

graph LR
    subgraph k8s["Kubernetes workloads"]
        pvc_lh["Longhorn PVCs\nharvester-longhorn-2replicas"]
        pvc_nfs["NFS / iSCSI PVCs\nsalt-* (idle)"]
        s3["S3 clients\n(GitLab, Rancher,\nHarvester, config-backup, …)"]
    end

    subgraph longhorn["Longhorn"]
        lh_ctrl["Longhorn controller\n2-replica default"]
        qui_d["qui — local disks"]
        quo_d["quo — local disks"]
        qua_d["qua — local disks"]
    end

    subgraph dcsi["democratic-csi"]
        nfs_driver["NFS / iSCSI CSI"]
    end

    subgraph nas["TrueNAS hosts"]
        salt["salt — TrueNAS CORE\nNFS/iSCSI (idle)\n+ Garage"]
        pepper["pepper\nGarage"]
        witness["baremetal witness\nGarage quorum"]
    end

    subgraph obj["Object Storage"]
        garage["Garage\ngarage.home.tillo.ch\n3-node quorum"]
    end

    pvc_lh --> lh_ctrl
    lh_ctrl --> qui_d & quo_d & qua_d
    lh_ctrl -->|"backup snapshots"| garage
    pvc_nfs -.-> nfs_driver -.-> salt
    s3 --> garage
    garage --- salt
    garage --- pepper
    garage --- witness

Storage hosts at a glance

salt + pepper: two TrueNAS hosts running Garage. Together with a baremetal witness they form the 3-node external cluster; the witness is there for multipart quorum. salt's ZFS pool also backs salt-nfs and salt-iscsi via democratic-csi. The pool is full, so no PVCs are provisioned there, but the manifests remain so the classes can be re-enabled quickly.

Baremetal witness: the third Garage node (quorum-only, no data). It is required because a 2-node Garage cluster races on multipart commit (see lesson below).

Storage Classes

| Class | Type | Replication | Use case |
|---|---|---|---|
| harvester-longhorn-2replicas | Longhorn block (RWO) | 2× across nodes | All production PVCs |
| longhorn | Longhorn block (RWO) | | Non-critical / ephemeral |
| harvester-longhorn-2replicas-notmigratable | Longhorn block (RWX) | 2× fixed nodes | Fixed-node RWX (legacy; the media library has since moved to CephFS) |
| longhorn-iomemory | Longhorn block (RWO) on FusionIO | 2× across nodes | Latency-sensitive volumes (databases, hot data) |
| ceph-block | Ceph RBD (RWO) via Rook | 2× / min 1 | In-cluster block for general / app volumes |
| ceph-filesystem | CephFS (RWX) via Rook | data 2× / min 1, metadata 3× | Shared media library (tv / film / serie) |
| ceph-bucket | Ceph RGW S3 (ObjectBucketClaim) via Rook | data 2× / min 1, metadata 3× | In-cluster S3 buckets |
| salt-nfs | NFS v4 via democratic-csi | ZFS on salt (currently full → no PVCs) | Reserved for large shared volumes |
| salt-iscsi | iSCSI via democratic-csi | ZFS on salt (currently full → no PVCs) | Reserved for block volumes from TrueNAS |

Node-local Performance Tiers

Beyond replica count, Longhorn's per-node disks are split into deliberate performance tiers, exposed as disk tags and selectable per StorageClass. Latency-sensitive workloads land on the fast tier while bulk data uses cheaper, higher-capacity disks.

| Tier | Longhorn tag | Backing hardware | Role |
|---|---|---|---|
| FusionIO | fio, iomemory | ioMemory PCIe flash; direct PCIe, no RAID controller in the path | All Longhorn volumes now land here: etcd, databases, replicated block |
| SSD (P420i) | (not in Longhorn) | SSDs behind an HP Smart Array P420i | Reassigned to Ceph OSDs (the in-cluster bulk tier); no longer a Longhorn tier |
| Rotational | (not in Longhorn) | HP SAS 10K in RAID-5 | Node OS and pod ephemeral storage (where a pod writes when it has no PVC) |

The longhorn-iomemory StorageClass pins a volume's replicas to the FusionIO tier; since the SSDs were handed to Ceph, every Longhorn volume effectively lives on FusionIO. etcd itself runs on a dedicated FusionIO partition on every control-plane node, so cluster-state writes never share a controller with bulk volume I/O.
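As a rough illustration of how a StorageClass can pin replicas to a disk tier, the sketch below builds a manifest for longhorn-iomemory in Python. The `diskSelector` and `numberOfReplicas` parameters are standard Longhorn StorageClass parameters, but the exact parameter set of the real class is not shown in this page, so treat the values (in particular selecting on both the `fio` and `iomemory` tags) as assumptions.

```python
import json


def longhorn_storage_class(name, disk_tags, replicas=2):
    """Build a StorageClass manifest that restricts Longhorn replicas to tagged disks."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": "driver.longhorn.io",
        "reclaimPolicy": "Delete",  # matches the production default described on this page
        "parameters": {
            # StorageClass parameters must be strings.
            "numberOfReplicas": str(replicas),
            # Longhorn only schedules replicas on disks carrying every listed tag.
            # Assumption: the real class may select on a single tag instead.
            "diskSelector": ",".join(disk_tags),
        },
    }


manifest = longhorn_storage_class("longhorn-iomemory", ["fio", "iomemory"])
print(json.dumps(manifest, indent=2))
```

The printed JSON can be piped to `kubectl apply -f -`, since Kubernetes accepts JSON manifests as well as YAML.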

Representative figures from production telemetry (node-exporter → VictoriaMetrics): the FusionIO tier commits durable writes in ~0.4 ms (etcd WAL fsync) and sustains low-latency I/O bursting past 1 GB/s — an order of magnitude faster, and far more predictable, than the controller-backed tiers. That predictability is the reason the most latency- and integrity-sensitive stores live there.

Longhorn Backup Policy

Longhorn snapshots ship to Garage (S3) on a schedule enforced by longhorn-backup-labels.sh, which labels every PVC at provisioning time.

| Label | Applied to | Schedule |
|---|---|---|
| recurring-job-group.longhorn.io/default=enabled | All PVCs | Monthly off-site backup to Garage (1st @ 05:00) + daily filesystem-trim |
| recurring-job-group.longhorn.io/weekly=enabled | Selected namespaces | Weekly off-site backup to Garage (Fri 01:00) |
| recurring-job-group.longhorn.io/weeklysnapshot=enabled | Selected namespaces | Weekly local snapshot (Thu 01:00) |
| recurring-job-group.longhorn.io/nosnapshots=enabled | Cache / metrics PVCs | No snapshots (prometheus, redis, ...) |

Weekly backup namespaces: appdaemon, bootstrap, envuassu, esphome, frigate, home-assistant, openldap, and others.
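The labelling rules above can be sketched as a small decision function. This is not the contents of longhorn-backup-labels.sh, which is not reproduced here; it is a Python approximation under stated assumptions. The namespace set only contains the names listed above (the real list continues), matching cache PVCs by name substring is hypothetical, and treating `nosnapshots` as replacing the other groups is an assumption. The `weeklysnapshot` group is left out because its namespace selection is not specified.

```python
LABEL_PREFIX = "recurring-job-group.longhorn.io/"

# Partial list: the source names these namespaces and then says "and others".
WEEKLY_NAMESPACES = {
    "appdaemon", "bootstrap", "envuassu", "esphome",
    "frigate", "home-assistant", "openldap",
}

# Hypothetical heuristic: PVC name substrings that mark cache / metrics volumes.
NO_SNAPSHOT_HINTS = ("prometheus", "redis")


def backup_labels(namespace, pvc_name):
    """Return the recurring-job-group labels a PVC would receive."""
    if any(hint in pvc_name for hint in NO_SNAPSHOT_HINTS):
        # Assumption: opting out replaces the default group rather than adding to it.
        return {LABEL_PREFIX + "nosnapshots": "enabled"}
    labels = {LABEL_PREFIX + "default": "enabled"}
    if namespace in WEEKLY_NAMESPACES:
        labels[LABEL_PREFIX + "weekly"] = "enabled"
    return labels


for ns, pvc in [("frigate", "frigate-config"), ("monitoring", "prometheus-data")]:
    print(ns, pvc, backup_labels(ns, pvc))
```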

democratic-csi — External NFS/iSCSI

democratic-csi bridges Kubernetes persistent volumes to TrueNAS CORE on salt via the TrueNAS REST API. The driver config (including API credentials) is stored in Akeyless and fetched via an ExternalSecret at deploy time.

This allows large volumes (like the mirror's 200 Gi) to be provisioned from ZFS pools with full snapshot and clone support — without running storage inside the cluster.

Garage — S3 Object Storage

Garage is the off-site / backup S3 tier: it backs Longhorn volume backups, the database Barman stores, GitLab's backup tarballs, the Rancher backup, and the daily config snapshots. The cluster's live in-cluster object storage — GitLab's object stores and the OpenObserve log store — runs on the in-cluster Ceph RGW gateway instead (see the note below). The live Garage tier is a single external Garage cluster:

| Deployment | Where | Backs |
|---|---|---|
| External, 3-node | salt + pepper + baremetal witness | harvester-backup, config-backup, gitlab-backups, rancher, homeassistant-pg-backup |

The external cluster is reached through a round-robin DNS name (garage.home.tillo.ch) and exposes five buckets, each with per-consumer scoped keys. A dedicated config-backup bucket holds daily snapshots of network gear, storage configs, OpenObserve metadata and Scrypted state, with 14-day retention.

In-cluster object storage runs on Ceph RGW

The live S3 object stores that used to run on single-node in-cluster Garage now sit on the in-cluster Ceph RGW S3 gateway (ceph-objectstore, in rook-ceph): GitLab's object stores — container registry, CI artifacts, LFS, uploads, packages, Pages, dependency proxy, Terraform state — and the CI runner cache, plus the OpenObserve Parquet log store. The multi-OSD Ceph cluster gives the object data real redundancy and benchmarked several times faster than single-node Garage on small-object workloads. GitLab's own backup tarballs still ship to the gitlab-backups bucket on the external Garage cluster, so the off-site copy is unchanged.

Lesson: 2-node Garage races on multipart commit

With replication_factor=2 and consistency_mode=consistent, CompleteMultipartUpload reads the parts list from the peer. If the peer hasn't yet acknowledged the last UploadPart RPC, the read returns InvalidPart / NoSuchUpload — sequential single-client uploads fail ~80% of the time. The fix is a third node: with 3 nodes a single lagging peer no longer blocks reads. The external cluster runs salt + pepper + a baremetal witness for exactly this reason.

Off-site Backup

The external Garage cluster holds the only on-prem copy of several datasets that have no other home — Longhorn volume backups, the GitLab backups, and the daily config snapshots. A daily Windmill job (04:00 UTC) rclone-syncs every Garage bucket to two independent off-site destinations: a Backblaze B2 bucket and an off-site SFTP account. Losing the entire salt + pepper site therefore still leaves a recoverable copy in two geographically separate locations.
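The fan-out described above (every bucket synced to two destinations) might look roughly like the following. The actual Windmill job is not shown on this page, so this is only a sketch: the rclone remote names `garage`, `b2-offsite` and `sftp-offsite` and the destination paths are hypothetical, and the commands are built but not executed.

```python
# Bucket names come from the Garage deployment table on this page.
BUCKETS = [
    "harvester-backup", "config-backup", "gitlab-backups",
    "rancher", "homeassistant-pg-backup",
]

SOURCE_REMOTE = "garage"  # hypothetical rclone remote for the external Garage cluster
DESTINATIONS = [
    "b2-offsite:garage-mirror",  # hypothetical Backblaze B2 remote and bucket
    "sftp-offsite:garage-mirror",  # hypothetical SFTP remote and base path
]


def sync_commands(buckets, source, destinations):
    """Build one `rclone sync` invocation per (destination, bucket) pair."""
    commands = []
    for dest in destinations:
        for bucket in buckets:
            commands.append(
                ["rclone", "sync", f"{source}:{bucket}", f"{dest}/{bucket}"]
            )
    return commands


commands = sync_commands(BUCKETS, SOURCE_REMOTE, DESTINATIONS)
for cmd in commands:
    print(" ".join(cmd))
```

Keeping the two destinations as separate invocations means a failure against one provider does not prevent the other copy from being refreshed.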

PV Reclaim Policy

Production PVCs default to reclaimPolicy: Delete — disk space here is finite and recovery from a deleted PVC almost always means restoring from backup anyway. Retain is reserved for the small set of PVCs where no backup path exists and accidental deletion would be unrecoverable.

A Prometheus alert (PV_DELETE_POLICY_WITHOUT_BACKUP_STRATEGY) cross-checks every Delete-policy PV against either a Longhorn recurring-job-group label or an explicit Windmill snapshot allowlist (OpenObserve, Scrypted, …) — a Delete PV only stops alerting if it provably has a backup path. The Windmill pv_reclaim_policy_analysis flow publishes the same data as a Pushgateway gauge with a 24h staleness rule, so a silent failure of the audit script also fires.
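The cross-check behind the alert can be sketched in a few lines of Python. This is an illustration of the rule as described, not the real pv_reclaim_policy_analysis flow: the `PV` shape is hypothetical, keying the Windmill allowlist on namespace is an assumption, and any enabled recurring-job-group label is counted as a backup path because the text does not say whether `nosnapshots` is treated differently.

```python
from dataclasses import dataclass, field

GROUP_PREFIX = "recurring-job-group.longhorn.io/"

# Hypothetical allowlist keys; the source names OpenObserve and Scrypted among others.
WINDMILL_ALLOWLIST = {"openobserve", "scrypted"}


@dataclass
class PV:
    name: str
    namespace: str
    reclaim_policy: str
    labels: dict = field(default_factory=dict)


def has_backup_path(pv):
    """True if the PV carries a recurring-job-group label or is on the allowlist."""
    labelled = any(
        key.startswith(GROUP_PREFIX) and value == "enabled"
        for key, value in pv.labels.items()
    )
    return labelled or pv.namespace in WINDMILL_ALLOWLIST


def unbacked_delete_pvs(pvs):
    """Names of Delete-policy PVs with no provable backup path (these would alert)."""
    return [
        pv.name for pv in pvs
        if pv.reclaim_policy == "Delete" and not has_backup_path(pv)
    ]


sample = [
    PV("pv-a", "frigate", "Delete", {GROUP_PREFIX + "weekly": "enabled"}),
    PV("pv-b", "openobserve", "Delete"),
    PV("pv-c", "scratch", "Delete"),
    PV("pv-d", "scratch", "Retain"),
]
print(unbacked_delete_pvs(sample))
```

In this sample only pv-c is reported: pv-a has a group label, pv-b is allowlisted, and pv-d uses Retain so it is outside the scope of the check.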