Home Assistant

Home Assistant runs in the home-assistant namespace and is the single integration point for everything physical in the house — heating (Elco / Ariston Cloud), lights, smart plugs, presence, doors, NVR cross-references, GPS trackers, and ambient sensors. It is the only workload in the cluster that talks to a meaningful number of external proprietary APIs (Apple, Google, Hue, Tractive, Shelly, ESPHome, Ariston), so its operational shape is deliberately conservative: minimal cluster surface, all state on a single PVC, explicit reverse-proxy posture, and a fast restart path for the cases where some upstream integration insists on it.

Deployment shape

Concern	Where it lives
All HA state (registry, automations, scripts, scenes, themes, `.storage/`, logs, addon config, `configuration.yaml`)	PVC `home-assistant-data` (Longhorn, recurring backup + weekly group)
Recorder database (history, states, long-term statistics)	External PostgreSQL — CloudNativePG `Cluster` `ha-recorder-pg`, same namespace
Secrets pushed as env vars	Kubernetes `Secret` `home-assistant-env-secret` (populated by External Secrets from Akeyless)
`/config/secrets.yaml`	Kubernetes `Secret` `home-assistant-secrets`, mounted via `subPath`
Disaster-recovery seed for `configuration.yaml`	`ConfigMap` `home-assistant-bootstrap` (kept in Fleet, not mounted by default)
Ingress	`home.mdapi.ch` → `rke2-ingress-nginx` (cert-manager / Let's Encrypt, ModSec WAF upstream of HA)
Pod-attached USB radios (Zigbee dongle etc.)	Privileged container pinned to the node that owns the USB ports via `nodeSelector: kubernetes.io/hostname=qui`, exactly one replica

The PVC carries everything HA writes except the recorder database. The Deployment uses strategy: Recreate with replicas: 1; there is no horizontal scaling story here because the USB radio pins the workload to a single pod on a single node. Losing the pod is a 10–20 s outage; a lost PVC is restored from a Longhorn snapshot held in S3-compatible storage (Garage).
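The shape described above could be expressed roughly as the following Deployment fragment. This is a sketch, not the manifest actually deployed: the Deployment name, labels, image reference, container name, and the key name inside the secrets Secret are assumptions. The namespace, PVC and Secret names, node selector, replica count, and strategy come from the table and paragraph above.

```yaml
# Sketch only: values marked "assumed" are illustrative, not taken from the real manifest.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: home-assistant                # assumed
  namespace: home-assistant
spec:
  replicas: 1                         # the USB radio can only be attached to one pod
  strategy:
    type: Recreate                    # terminate the old pod before starting the new one
  selector:
    matchLabels:
      app: home-assistant             # assumed label
  template:
    metadata:
      labels:
        app: home-assistant
    spec:
      nodeSelector:
        kubernetes.io/hostname: qui   # node that owns the USB ports
      containers:
        - name: home-assistant        # assumed container name
          image: ghcr.io/home-assistant/home-assistant:stable   # assumed image and tag
          securityContext:
            privileged: true          # direct access to the USB radios
          envFrom:
            - secretRef:
                name: home-assistant-env-secret
          volumeMounts:
            - name: home-assistant-data
              mountPath: /config
            - name: home-assistant-secrets
              mountPath: /config/secrets.yaml
              subPath: secrets.yaml   # assumed key name inside the Secret
              readOnly: true
      volumes:
        - name: home-assistant-data
          persistentVolumeClaim:
            claimName: home-assistant-data
        - name: home-assistant-secrets
          secret:
            secretName: home-assistant-secrets
```

With Recreate, the old pod is terminated before the new one starts, so the two never contend for the USB device or the PVC.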

Recorder database

Home Assistant's recorder — the history of every state change plus the long-term statistics that feed the Energy dashboard — runs on a dedicated in-cluster PostgreSQL rather than the default SQLite file on the PVC. The database is a CloudNativePG Cluster (ha-recorder-pg) in the home-assistant namespace; configuration.yaml points the recorder at it through recorder.db_url.
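As an illustration of the recorder.db_url wiring, a minimal sketch might look like the following. It assumes the default CloudNativePG service naming, where the read-write service is called `<cluster>-rw`; the secret key name, database user, password, and database name are hypothetical placeholders.

```yaml
# configuration.yaml (on the PVC)
recorder:
  db_url: !secret recorder_db_url     # hypothetical key name in secrets.yaml
---
# /config/secrets.yaml (mounted from the home-assistant-secrets Secret)
# User, password, and database name below are placeholders, not real values.
recorder_db_url: postgresql://ha_user:CHANGE_ME@ha-recorder-pg-rw.home-assistant.svc:5432/homeassistant
```

Keeping the URL in secrets.yaml keeps the database password out of configuration.yaml, which is edited far more often.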

The move off SQLite was driven by scale. With several million states rows and a metadata table grown into the hundreds of thousands, every recorder.purge hit SQLite's bound-variable ceiling (~32k) and failed. PostgreSQL has no IN-list cap, serves concurrent reads while the recorder writes, and autovacuums incrementally — the same reasoning that already puts Keycloak and the envuassu services on Postgres.

CloudNativePG streams continuous WAL backups (Barman) to the Garage S3 cluster and takes a nightly base backup, giving the recorder a point-in-time restore path independent of the Longhorn snapshot of the PVC.
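A backup setup of this kind could be declared roughly as below. This is a sketch under stated assumptions: it uses the in-tree `barmanObjectStore` stanza (newer CloudNativePG releases favor a plugin-based setup, so the real manifest may differ), and the bucket, Garage endpoint, credential Secret, instance count, storage size, retention, and backup time are all hypothetical.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: ha-recorder-pg
  namespace: home-assistant
spec:
  instances: 1                                   # assumed
  storage:
    size: 20Gi                                   # assumed
  backup:
    retentionPolicy: "30d"                       # assumed
    barmanObjectStore:
      destinationPath: s3://ha-recorder-backups/         # hypothetical bucket
      endpointURL: http://garage.example.internal:3900   # hypothetical Garage endpoint
      s3Credentials:
        accessKeyId:
          name: garage-s3-credentials            # hypothetical Secret
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: garage-s3-credentials
          key: SECRET_ACCESS_KEY
      wal:
        compression: gzip                        # continuous WAL archiving
---
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: ha-recorder-pg-nightly                   # hypothetical
  namespace: home-assistant
spec:
  schedule: "0 0 2 * * *"                        # six fields, seconds first: 02:00:00 daily (assumed time)
  backupOwnerReference: self
  cluster:
    name: ha-recorder-pg
```

The WAL archive plus a periodic base backup is what makes point-in-time recovery possible: a restore replays WAL on top of the most recent base backup up to the chosen timestamp.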

Why `configuration.yaml` lives on the PVC

configuration.yaml is on the PVC rather than rendered from a ConfigMap because Kubernetes ConfigMap mounts via subPath do not auto-refresh when the underlying ConfigMap is updated. Kubelet bind-mounts the file once at pod start; subsequent ConfigMap changes are invisible inside the container until the pod is restarted.

For most workloads that's a non-issue. For Home Assistant it is a real cost: several integrations (notably the bundled iCloud one) lose their trust tokens across restarts and force an interactive 2FA reauthentication on every boot. If every routine sensor tweak means "edit YAML → restart pod → physically retrieve a 2FA code", small iterations on the YAML become disproportionately expensive.

Living on the PVC means edits happen in place and apply with homeassistant.reload_all — no pod restart, no reauth ritual.

The trade-off:

Given up: the audit trail and PR-review surface that GitOps provides on configuration.yaml changes. There is no git log over routine config edits.
Compensating coverage: Longhorn snapshots of the PVC (shipped to Garage) and Home Assistant's own Auto-Backup add-on (writes to the PVC, also Longhorn-backed). Two layers that already cover everything else under /config, now covering this file too.
Considered and rejected: a periodic kubectl cp + git commit snapshot job. It would add a moving piece without changing the recovery story.

The bootstrap ConfigMap

The home-assistant-bootstrap ConfigMap is reconciled into the cluster by Fleet but is never mounted in steady state. Its role is disaster recovery: a minimal, device-free starting configuration.yaml that lets Home Assistant boot cleanly behind the same reverse-proxy posture, even if the PVC has just been recreated from scratch and is missing its config.

Bootstrap content

default_config:                              # bundled HA basics (recovery essential)
frontend:
  themes: !include_dir_merge_named themes    # empty themes dir is fine

tts:
  - platform: google_translate               # free, no credentials

automation: !include automations.yaml         # PVC-resident; comment out for a fresh PVC
script:     !include scripts.yaml
scene:      !include scenes.yaml

http:
  cors_allowed_origins:
    - https://home.mdapi.ch
  use_x_forwarded_for: true
  trusted_proxies:                            # cluster pod/service CIDRs + LAN
    - 10.0.0.0/8
    - 172.16.0.0/12
    - 192.168.0.0/16
  ip_ban_enabled: true
  login_attempts_threshold: 50

shelly:                                       # zero-config auto-discovery
bluetooth:

Anything beyond this — Google Assistant project + service account, REST/template/integration sensors, REST endpoints for iLO probes, command_line monitors, input_* helpers, utility_meter, shell_command blocks, derived solar/freezer sensors — is device-specific and stays out of the bootstrap. The bootstrap deliberately reads as a starting-point template, not a snapshot.

Restoring from bootstrap

If /config/configuration.yaml on the PVC is missing, empty, or corrupt, the bootstrap can be promoted back into the pod via the normal subPath pattern: re-add the home-assistant-config volume and its mount to the Deployment, then redeploy. HA starts with the bootstrap content; integrations are then added back via the UI (or configuration.yaml is restored from a Longhorn snapshot).
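The volume and mount described above might look like the following pod-template fragment. It is a sketch: the container name and the `configuration.yaml` key inside the ConfigMap are assumptions, and the existing `home-assistant-data` volume definition is omitted for brevity.

```yaml
# Temporary recovery fragment for the Deployment's pod template.
# Remove it again once a real configuration.yaml is back on the PVC.
spec:
  template:
    spec:
      containers:
        - name: home-assistant                  # assumed container name
          volumeMounts:
            - name: home-assistant-data
              mountPath: /config
            - name: home-assistant-config
              mountPath: /config/configuration.yaml
              subPath: configuration.yaml       # assumed key in the ConfigMap
              readOnly: true
      volumes:
        # existing home-assistant-data volume omitted
        - name: home-assistant-config
          configMap:
            name: home-assistant-bootstrap
```

While this mount is in place the file is read-only and fixed at pod start, so it is a recovery aid rather than a working state: in-place edits and reloads only become possible again once the mount is removed.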

The bootstrap initContainer

The Deployment carries a small initContainer that handles the case where the PVC is in a pre-bootstrapped state — a file /config/configuration.yaml.bootstrap exists but no real configuration.yaml does (either missing entirely, or present but zero-length).

initContainers:
- name: bootstrap-configuration
  image: busybox:1.36
  command:
  - sh
  - -c
  - |
    if [ -f /config/configuration.yaml.bootstrap ] && [ ! -s /config/configuration.yaml ]; then
      mv -f /config/configuration.yaml.bootstrap /config/configuration.yaml
    fi
  volumeMounts:
  - name: home-assistant-data
    mountPath: /config

The check is ! -s (file is missing or zero-length), not ! -f (file is missing). This matters: a stale zero-byte file at the target path is indistinguishable from a missing one from HA's perspective — it boots into an effectively empty config — but -f would treat it as "present" and skip the promotion. -s is the safer invariant for a "promote when the target is unusable" pattern.

On a healthy steady-state pod the init is a no-op: configuration.yaml is present and non-empty, so the mv is skipped.

Editing workflow

flowchart LR
    edit["Edit /config/configuration.yaml\n(File Editor / kubectl exec / kubectl cp)"]
    reload["homeassistant.reload_all\n(or domain-specific .reload)"]
    edit --> reload
    reload --> live["Live HA picks up changes\n— no pod restart"]

Most domain configs (automation, script, scene, template, rest, etc.) have individual <domain>.reload services. homeassistant.reload_all runs all of them. Changes HA's reload machinery doesn't cover — adding a new integration that wasn't in default_config, changing http: trusted_proxies, replacing a deprecated platform — still require a pod restart; they are infrequent enough that the lost-token cost is acceptable.
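For convenience, the reload step can be wrapped in a script so it is one tap away in the UI. The entries below are a sketch for scripts.yaml; the script names and aliases are hypothetical, and recent Home Assistant releases also accept `action:` in place of `service:`.

```yaml
# scripts.yaml (included via "script: !include scripts.yaml")
reload_yaml_config:                        # hypothetical script name
  alias: Reload all YAML configuration
  sequence:
    - service: homeassistant.reload_all    # runs every reloadable domain

reload_automations_only:                   # hypothetical script name
  alias: Reload automations only
  sequence:
    - service: automation.reload           # domain-specific reload
```

The domain-specific variant is useful when only one file changed and a full reload would needlessly re-evaluate every template and REST sensor.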

What Home Assistant integrates today (selected)

Heating: Elco / Ariston Cloud bridge via AppDaemon.
Lights & plugs: Philips Hue bridge, ESPHome devices, Shelly auto-discovery. One Hue smart plug carries both the FTTH ONU and the OpenWrt edge router on one circuit — see the router watchdog automations in the BPI-R4 writeup.
Presence: Mobile-app companion (Android, iOS), iCloud, Tractive (pet GPS).
Cameras: Frigate cross-references for notifications and motion events.
NTP / atomic time: see Chrony GPS NTP.
Voice: Google Assistant smart-home integration (devices exposed via project-scoped service account on the PVC).
Observability: HA database size, response time, and uptime are exported as sensors and scraped by Prometheus alongside the rest of the cluster.
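The text does not say how the sensors reach Prometheus. One common arrangement, shown here purely as an assumption, is Home Assistant's built-in `prometheus` integration scraped over `/api/prometheus` with a long-lived access token. The namespace, filter, job name, service address, and token placeholder are all hypothetical.

```yaml
# configuration.yaml: expose selected entities at /api/prometheus
prometheus:
  namespace: hass                 # assumed metric prefix
  filter:
    include_domains:
      - sensor
---
# Prometheus configuration fragment
scrape_configs:
  - job_name: home-assistant      # hypothetical job name
    scheme: http
    metrics_path: /api/prometheus
    authorization:
      credentials: "<long-lived access token>"   # placeholder, never commit a real token
    static_configs:
      - targets:
          - home-assistant.home-assistant.svc:8123   # assumed in-cluster service and port
```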

Most automations are managed via the HA UI (and persist into .storage/); the YAML-defined ones live in automations.yaml on the PVC.

Backups & recovery

Layer	Covers	Schedule
HA Auto-Backup add-on	Snapshots of `/config` into a folder on the PVC	Nightly
Longhorn recurring jobs (`default` + `weekly` groups)	PVC `home-assistant-data`	Local snapshots; weekly group ships to Garage (S3)
CloudNativePG (Barman)	Recorder PostgreSQL — continuous WAL + nightly base backup	Continuous, off-site to Garage (S3)
Bootstrap ConfigMap (`home-assistant-bootstrap`)	A clean minimal `configuration.yaml` to boot HA against	Always-current in Fleet
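As an illustration of the Longhorn layer, a `weekly` group job and the matching PVC labels could look like the sketch below. The job name, cron expression, and retention count are assumptions, and the PVC labels assume a Longhorn version that syncs recurring-job labels from PVCs.

```yaml
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: weekly-backup                 # hypothetical
  namespace: longhorn-system
spec:
  task: backup                        # ships to the configured backup target (Garage S3)
  cron: "0 3 * * 0"                   # assumed: Sundays at 03:00
  groups:
    - weekly
  retain: 4                           # assumed
  concurrency: 1
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: home-assistant-data
  namespace: home-assistant
  labels:
    recurring-job.longhorn.io/source: enabled        # let Longhorn read job labels from the PVC
    recurring-job-group.longhorn.io/weekly: enabled  # join the weekly group
# spec omitted: only the labels matter for this example
```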

Worst case: restore the PVC from a Longhorn S3 snapshot and redeploy; HA comes back where it was. If the snapshot itself is unrecoverable, the bootstrap ConfigMap brings the service back online with an empty configuration, and integrations are rebuilt from the UI.