Home Assistant¶
Home Assistant runs in the home-assistant namespace and is the single integration point for everything physical in the house — heating (Elco / Ariston Cloud), lights, smart plugs, presence, doors, NVR cross-references, GPS trackers, and ambient sensors. It is the only workload in the cluster that talks to a meaningful number of external proprietary APIs (Apple, Google, Hue, Tractive, Shelly, ESPHome, Ariston), so its operational shape is deliberately conservative: minimal cluster surface, all state on a single PVC, explicit reverse-proxy posture, and a fast restart path for the cases where some upstream integration insists on it.
Deployment shape¶
| Concern | Where it lives |
|---|---|
| All HA state (registry, automations, scripts, scenes, themes, .storage/, logs, addon config, configuration.yaml) | PVC home-assistant-data (Longhorn, recurring backup + weekly group) |
| Recorder database (history, states, long-term statistics) | External PostgreSQL — CloudNativePG Cluster ha-recorder-pg, same namespace |
| Secrets pushed as env vars | Kubernetes Secret home-assistant-env-secret (populated by External Secrets from Akeyless) |
| /config/secrets.yaml | Kubernetes Secret home-assistant-secrets, mounted via subPath |
| Disaster-recovery seed for configuration.yaml | ConfigMap home-assistant-bootstrap (kept in Fleet, not mounted by default) |
| Ingress | home.mdapi.ch → rke2-ingress-nginx (cert-manager / Let's Encrypt, ModSec WAF upstream of HA) |
| Pod-attached USB radios (Zigbee dongle etc.) | Privileged container pinned to the node that owns the USB ports via nodeSelector: kubernetes.io/hostname=qui, exactly one replica |
The PVC carries everything HA writes except the recorder database. The Deployment uses strategy: Recreate with replicas: 1 — there is no horizontal scaling story here because the USB radio slot pins to a single pod. Loss of the pod is a 10–20 s outage; loss of the PVC is restored from a Longhorn snapshot in S3-compatible storage (Garage).
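The shape described above could be expressed roughly as the following Deployment. This is a sketch, not the manifest actually deployed: the `app` label, the container image and tag, and the `secrets.yaml` key inside the Secret are assumptions, while the names of the PVC, Secrets, namespace and node come from the table.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: home-assistant
  namespace: home-assistant
spec:
  replicas: 1                  # the USB radio can be attached to one pod only
  strategy:
    type: Recreate             # old pod is stopped before the new one starts
  selector:
    matchLabels:
      app: home-assistant      # assumed label
  template:
    metadata:
      labels:
        app: home-assistant    # assumed label
    spec:
      nodeSelector:
        kubernetes.io/hostname: qui   # node that owns the USB ports
      containers:
        - name: home-assistant
          image: ghcr.io/home-assistant/home-assistant:stable   # assumed image and tag
          securityContext:
            privileged: true   # access to host devices for the radios
          envFrom:
            - secretRef:
                name: home-assistant-env-secret
          volumeMounts:
            - name: home-assistant-data
              mountPath: /config
            - name: home-assistant-secrets
              mountPath: /config/secrets.yaml
              subPath: secrets.yaml       # assumed key name in the Secret
      volumes:
        - name: home-assistant-data
          persistentVolumeClaim:
            claimName: home-assistant-data
        - name: home-assistant-secrets
          secret:
            secretName: home-assistant-secrets
```

With `Recreate` and a ReadWriteOnce volume, the brief outage on every rollout is the price of never having two pods contend for the PVC or the USB device.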
Recorder database¶
Home Assistant's recorder — the history of every state change plus the long-term statistics that feed the Energy dashboard — runs on a dedicated in-cluster PostgreSQL rather than the default SQLite file on the PVC. The database is a CloudNativePG Cluster (ha-recorder-pg) in the home-assistant namespace; configuration.yaml points the recorder at it through recorder.db_url.
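A minimal sketch of the recorder wiring, assuming the connection string is kept in secrets.yaml under a hypothetical key `recorder_db_url`. The host name follows CloudNativePG's convention of exposing a `<cluster>-rw` Service for the primary; the database user, password and database name are placeholders, not values from this deployment.

```yaml
# configuration.yaml
recorder:
  db_url: !secret recorder_db_url   # hypothetical secret key name

# secrets.yaml (placeholder credentials and database name)
# recorder_db_url: postgresql://ha_recorder:CHANGE_ME@ha-recorder-pg-rw.home-assistant.svc:5432/ha_recorder
```

Keeping the URL in secrets.yaml keeps the password out of configuration.yaml, which is the file most likely to be pasted into an issue or a chat.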
The move off SQLite was driven by scale. With several million states rows and a metadata table grown into the hundreds of thousands, every recorder.purge hit SQLite's bound-variable ceiling (~32k) and failed. PostgreSQL has no IN-list cap, serves concurrent reads while the recorder writes, and autovacuums incrementally — the same reasoning that already puts Keycloak and the envuassu services on Postgres.
CloudNativePG streams continuous WAL backups (Barman) to the Garage S3 cluster and takes a nightly base backup, giving the recorder a point-in-time restore path independent of the Longhorn snapshot of the PVC.
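For illustration, a backup setup of this kind might look like the manifests below, using CloudNativePG's in-tree `barmanObjectStore` configuration. Bucket name, Garage endpoint, credential Secret, storage size, retention and schedule are all assumptions; newer CloudNativePG releases also offer a Barman Cloud plugin, which may be what is actually in use.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: ha-recorder-pg
  namespace: home-assistant
spec:
  instances: 1                       # assumed instance count
  storage:
    size: 20Gi                       # assumed size
  backup:
    retentionPolicy: "14d"           # assumed retention window
    barmanObjectStore:
      destinationPath: s3://ha-recorder-backups/        # hypothetical bucket
      endpointURL: http://garage.example.internal:3900  # hypothetical Garage endpoint
      s3Credentials:
        accessKeyId:
          name: garage-s3-credentials                   # hypothetical Secret
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: garage-s3-credentials
          key: SECRET_ACCESS_KEY
      wal:
        compression: gzip            # WAL segments are archived continuously
---
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: ha-recorder-pg-nightly       # hypothetical name
  namespace: home-assistant
spec:
  schedule: "0 0 2 * * *"            # six fields (seconds first): 02:00 every night
  backupOwnerReference: self
  cluster:
    name: ha-recorder-pg
```

The base backup plus the archived WAL is what makes point-in-time recovery possible: a restore replays WAL on top of the most recent base backup up to the chosen timestamp.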
Why configuration.yaml lives on the PVC¶
configuration.yaml is on the PVC rather than rendered from a ConfigMap because Kubernetes ConfigMap mounts via subPath do not auto-refresh when the underlying ConfigMap is updated. Kubelet bind-mounts the file once at pod start; subsequent ConfigMap changes are invisible inside the container until the pod is restarted.
For most workloads that's a non-issue. For Home Assistant it isn't: several integrations (notably the bundled iCloud one) lose their trust tokens across restarts and force an interactive 2FA reauthentication every boot. Treating every routine sensor tweak as "edit YAML → restart pod → physically retrieve a 2FA code" makes small iterations on the YAML disproportionately expensive.
Living on the PVC means edits happen in place and apply with homeassistant.reload_all — no pod restart, no reauth ritual.
The trade-off:
- Given up: the audit trail and PR-review surface that GitOps provides on configuration.yaml changes. There is no git log over routine config edits.
- Compensating coverage: Longhorn snapshots of the PVC (shipped to Garage) and Home Assistant's own Auto-Backup add-on (writes to the PVC, also Longhorn-backed). Two layers that already cover everything else under /config, now covering this file too.
- Considered and rejected: a periodic kubectl cp + git commit snapshot job. It would add a moving piece without changing the recovery story.
The bootstrap ConfigMap¶
The home-assistant-bootstrap ConfigMap is reconciled into the cluster by Fleet but is never mounted in steady state. Its role is disaster recovery: a minimal, device-free starting configuration.yaml that lets Home Assistant boot cleanly behind the same reverse-proxy posture, even if the PVC has just been restored from scratch and is missing its config.
Bootstrap content¶
default_config:                  # bundled HA basics (recovery essential)
frontend:
  themes: !include_dir_merge_named themes   # empty themes dir is fine
tts:
  - platform: google_translate   # free, no credentials
automation: !include automations.yaml       # PVC-resident; comment out for a fresh PVC
script: !include scripts.yaml
scene: !include scenes.yaml
http:
  cors_allowed_origins:
    - https://home.mdapi.ch
  use_x_forwarded_for: true
  trusted_proxies:               # cluster pod/service CIDRs + LAN
    - 10.0.0.0/8
    - 172.16.0.0/12
    - 192.168.0.0/16
  ip_ban_enabled: true
  login_attempts_threshold: 50
shelly:                          # zero-config auto-discovery
bluetooth:
Anything beyond this — Google Assistant project + service account, REST/template/integration sensors, REST endpoints for iLO probes, command_line monitors, input_* helpers, utility_meter, shell_command blocks, derived solar/freezer sensors — is device-specific and stays out of the bootstrap. The bootstrap deliberately reads as a starting-point template, not a snapshot.
Restoring from bootstrap¶
If /config/configuration.yaml on the PVC is missing, empty, or corrupt, the bootstrap can be promoted back into the pod via the normal subPath pattern: re-add the home-assistant-config volume and its mount to the Deployment, then redeploy. HA starts with the bootstrap content; integrations are then added back via the UI, or configuration.yaml is restored from a Longhorn snapshot.
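The volume and mount to re-add might look like the fragment below, written as a partial Deployment spec. The container name and the `configuration.yaml` key inside the ConfigMap are assumptions; the volume and ConfigMap names are the ones used in this page.

```yaml
spec:
  template:
    spec:
      containers:
        - name: home-assistant               # assumed container name
          volumeMounts:
            - name: home-assistant-config
              mountPath: /config/configuration.yaml
              subPath: configuration.yaml    # assumed key in the ConfigMap
              readOnly: true
      volumes:
        - name: home-assistant-config
          configMap:
            name: home-assistant-bootstrap
```

While this mount is in place the file shadows whatever is on the PVC and cannot be edited in place, so it would presumably be removed again once a real configuration.yaml has been written back to the PVC.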
The bootstrap initContainer¶
The Deployment carries a small initContainer that handles the case where the PVC is in a pre-bootstrapped state — a file /config/configuration.yaml.bootstrap exists but no real configuration.yaml does (either missing entirely, or present but zero-length).
initContainers:
  - name: bootstrap-configuration
    image: busybox:1.36
    command:
      - sh
      - -c
      - |
        if [ -f /config/configuration.yaml.bootstrap ] && [ ! -s /config/configuration.yaml ]; then
          mv -f /config/configuration.yaml.bootstrap /config/configuration.yaml
        fi
    volumeMounts:
      - name: home-assistant-data
        mountPath: /config
The check is ! -s (file is missing or zero-length), not ! -f (file is missing). This matters: a stale zero-byte file at the target path is indistinguishable from a missing one from HA's perspective — it boots into an effectively empty config — but -f would treat it as "present" and skip the promotion. -s is the safer invariant for a "promote when the target is unusable" pattern.
On a healthy steady-state pod the init is a no-op: configuration.yaml is present and non-empty, so the mv is skipped.
Editing workflow¶
flowchart LR
edit["Edit /config/configuration.yaml\n(File Editor / kubectl exec / kubectl cp)"]
reload["homeassistant.reload_all\n(or domain-specific .reload)"]
edit --> reload
reload --> live["Live HA picks up changes\n— no pod restart"]
Most domain configs (automation, script, scene, template, rest, etc.) have individual <domain>.reload services. homeassistant.reload_all runs all of them. Changes HA's reload machinery doesn't cover — adding a new integration that wasn't in default_config, changing http: trusted_proxies, replacing a deprecated platform — still require a pod restart; they are infrequent enough that the lost-token cost is acceptable.
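As a small convenience, the reload step can be wrapped in a script so it is one tap in the UI. The sketch below assumes a hypothetical script id `reload_yaml_config` in scripts.yaml; `homeassistant.check_config` and `homeassistant.reload_all` are built-in services. Recent Home Assistant releases prefer the key `action:` over `service:`, though `service:` is still accepted.

```yaml
# scripts.yaml
reload_yaml_config:                # hypothetical script id
  alias: Reload YAML configuration
  sequence:
    # Validate the files first; errors surface in the log and as a notification.
    - service: homeassistant.check_config
    # Then reload every domain that supports reloading without a restart.
    - service: homeassistant.reload_all
```

Note that the sequence does not stop when the check reports errors; it only makes them visible next to the reload, so a broken edit is noticed immediately rather than at the next restart.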
What Home Assistant integrates today (selected)¶
- Heating: Elco / Ariston Cloud bridge via AppDaemon.
- Lights & plugs: Philips Hue bridge, ESPHome devices, Shelly auto-discovery. One Hue smart plug carries both the FTTH ONU and the OpenWrt edge router on one circuit — see the router watchdog automations in the BPI-R4 writeup.
- Presence: Mobile-app companion (Android, iOS), iCloud, Tractive (pet GPS).
- Cameras: Frigate cross-references for notifications and event motion.
- NTP / atomic time: see Chrony GPS NTP.
- Voice: Google Assistant smart-home integration (devices exposed via project-scoped service account on the PVC).
- Observability: HA database size, response time, and uptime are exported as sensors and scraped by Prometheus alongside the rest of the cluster.
Most automations are managed via the HA UI (and persist into .storage/); the YAML-defined ones live in automations.yaml on the PVC.
Backups & recovery¶
| Layer | Covers | Schedule |
|---|---|---|
| HA Auto-Backup add-on | Snapshots of /config into a folder on the PVC | Nightly |
| Longhorn recurring jobs (default + weekly groups) | PVC home-assistant-data | Local snapshots; weekly group ships to Garage (S3) |
| CloudNativePG (Barman) | Recorder PostgreSQL — continuous WAL + nightly base backup | Continuous, off-site to Garage (S3) |
| Bootstrap ConfigMap (home-assistant-bootstrap) | A clean minimal configuration.yaml to boot HA against | Always-current in Fleet |
Worst case: restore the PVC from a Longhorn S3 snapshot and redeploy, and HA comes back where it was. If the snapshot itself is unrecoverable, the bootstrap ConfigMap brings the service back online with an empty configuration, and integrations are rebuilt from the UI.
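The weekly Longhorn group in the table could be defined along these lines. This is a sketch under assumptions: the job name, cron expression and retention count are invented for illustration, and the PVC labels follow Longhorn's label-based recurring-job assignment, which should be checked against the Longhorn version in use.

```yaml
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: weekly-backup              # hypothetical job name
  namespace: longhorn-system
spec:
  task: backup                     # "backup" ships to the backup target (S3); "snapshot" stays local
  cron: "0 3 * * 0"                # assumed: Sundays at 03:00
  groups:
    - weekly
  retain: 4                        # assumed number of backups to keep
  concurrency: 1
---
# Labels on the PVC that opt it into the "weekly" group (storage spec omitted).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: home-assistant-data
  namespace: home-assistant
  labels:
    recurring-job.longhorn.io/source: enabled
    recurring-job-group.longhorn.io/weekly: enabled
```

The distinction between the `snapshot` and `backup` tasks matters for the worst case above: only backups leave the node and land in Garage.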