Monitoring & Alerting Architecture¶

Layers¶

Three planes, each owning a different signal class, all converging on Alertmanager:

graph TB
    subgraph metrics["Metrics plane (VictoriaMetrics primary + Prometheus)"]
        vm["VictoriaMetrics<br/>vmsingle 3y + vmagent<br/>(primary store · Grafana DS)"]
        prom["Prometheus<br/>(rancher-monitoring, ~15d)<br/>operator CRDs + Alertmanager"]
        kss["kube-state-metrics"]
        ne["node-exporter"]
        bb["blackbox-exporter<br/>(synthetic probes)"]
        pg["Pushgateway<br/>(batch job heartbeat + gauges)"]
    end

    subgraph logs["Log plane"]
        fb["fluent-bit DS<br/>(rancher-logging)"]
        flu["fluentd aggregator"]
        crib["Cribl LogStream<br/>(routing + filtering)"]
        relay["cribl-am-relay<br/>(Python; per-pattern detectors)"]
        o2["OpenObserve<br/>(logs.mdapi.ch, 7d on Ceph RGW)"]
    end

    subgraph alerting["Alert plane"]
        am["Alertmanager<br/>(send_resolved: true)"]
        adapt["statuspage adapter<br/>(webhook receiver)"]
        push["Pushover"]
        sp["statuspage.io<br/>(public)"]
        hc["healthchecks.io<br/>(dead-man switch)"]
    end

    kss --> prom
    ne --> prom
    bb --> prom
    pg --> prom
    kss & ne & bb & pg -->|"vmagent scrape"| vm
    fb --> flu
    flu --> crib
    flu --> relay
    crib -->|"ModSec route → ES bulk"| o2

    prom -->|"PrometheusRule firing"| am
    relay -->|"PostableAlert /api/v2/alerts"| am
    am -->|"team=mdapi route"| adapt --> sp
    am -->|"team=mdapi route"| push
    am -->|"alertname=Watchdog"| hc

Design principles¶

VictoriaMetrics is the long-term metrics store; rancher-monitoring Prometheus is the operator + alert plane. A vm-operator stack (vmsingle, 3-year retention on Longhorn, fed by a vmagent that scrapes every target via the operator's Prometheus-CR converter) is the primary Grafana datasource. The bundled rancher-monitoring Prometheus is retained at ~15d as the fallback datasource and still owns the operator CRDs and Alertmanager. Supplemental observability lives as PrometheusRule, ServiceMonitor, Probe, and AlertmanagerConfig CRDs that the chart's operator picks up automatically ({} selectors for SM/PM/Rule/AC; Probe selector requires the release: rancher-monitoring label) — and vmagent mirrors the same scrape targets into VictoriaMetrics.
Cribl is a routing + filtering plane. fluent-bit + fluentd ship to Cribl's HTTP source; pipelines either (a) extract alert signals and POST to Alertmanager /api/v2/alerts via the cribl-am-relay, or (b) parse and forward into OpenObserve for indexed search. Real-time tail is via Cribl's Live Capture against the same stream.
OpenObserve is the indexed-search layer. Single-node deployment in the openobserve namespace, writing Parquet to the in-cluster Ceph RGW gateway (ceph-objectstore in rook-ceph). Streams are created on first ingest; retention is governed by ZO_DATA_RETENTION_DAYS (currently 7 days globally, per-stream override via PUT /api/<org>/streams/<name>/settings). Three streams are populated today: modsec (WAF audit records, parsed into top-level columns), pod_logs (the catch-all all-logs ClusterFlow piped through Cribl, with the full fluentd record kept in _raw), and technitium_queries (the internal-DNS query log, ingested every 60s by a Windmill probe using an ISO-timestamp cursor and surfaced through a per-client / per-qname / per-rcode dashboard cross-linked from Grafana). Additional streams just need a Cribl route — or a small Windmill probe — to the same ES-compatible bulk endpoint with the desired index: value.
GitOps via Fleet. Every supplemental CRD lives in the fleet repo. The handful of out-of-band touches that the upstream chart owns are documented inline so the recovery command is one copy-paste away if a chart upgrade resets them.
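To make "the same ES-compatible bulk endpoint with the desired index: value" more concrete, here is a minimal sketch of an Elasticsearch-style bulk body built in Python. The helper name `build_bulk_body`, the stream name `my_stream`, and the sample record are hypothetical; the endpoint URL and credentials are deliberately left out.

```python
import json

def build_bulk_body(stream, records):
    """Build an Elasticsearch-style bulk (NDJSON) body.

    Each record is preceded by an action line whose _index names the
    target stream. `stream` and the record fields are illustrative only.
    """
    lines = []
    for record in records:
        lines.append(json.dumps({"index": {"_index": stream}}))
        lines.append(json.dumps(record))
    # The bulk format expects a newline after the final line as well.
    return "\n".join(lines) + "\n"

body = build_bulk_body("my_stream", [{"message": "hello", "level": "info"}])
```

A Cribl route or a Windmill probe would POST a body of this shape; the `_index` value decides which stream receives the records.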

Alert label convention¶

Every alert MUST carry these labels — the Alertmanager route, statuspage adapter, and Pushover template all key on them:

Label	Values	Purpose
`severity`	`critical`, `warning`, `info`	Maps to statuspage status: critical → `major_outage`, warning → `degraded_performance`
`component`	one of the 10 statuspage component names	Adapter looks this up in its component map and patches the matching statuspage component
`team`	`mdapi`	Distinguishes our alerts from rancher-monitoring chart defaults so Alertmanager routes can scope cleanly
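The convention can be checked mechanically. The sketch below is not the adapter's code; it only restates the table in Python. The function names are hypothetical, and returning `None` for `info` is an assumption, since the table gives no statuspage mapping for it.

```python
REQUIRED_LABELS = ("severity", "component", "team")

# Severity -> statuspage status, as listed in the table above.
# `info` has no mapping in the table, so this sketch returns None for it.
STATUSPAGE_STATUS = {
    "critical": "major_outage",
    "warning": "degraded_performance",
}

def missing_labels(labels):
    """Return the required labels that are absent or empty."""
    return [name for name in REQUIRED_LABELS if not labels.get(name)]

def statuspage_status(labels):
    """Map an alert's severity to a statuspage status, or None if unmapped."""
    return STATUSPAGE_STATUS.get(labels.get("severity"))
```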

Public statuspage components¶

Ten flat components (no groups, fits the free-tier ceiling). Each maps to one or more namespaces / probes:

Component	Covers
Mail	SMTP, IMAP, webmail, autoconfig
Sign-in (SSO)	Keycloak + OpenLDAP single sign-on (`idp.mdapi.ch`)
Platform	GitLab (source, CI, container registry), logging, automation, and shared platform tooling
Files & Documents	file sync (`cloud.mdapi.ch`), file gateways, notes (`notes.mdapi.ch`), document archive (`documents.mdapi.ch`), Debian/Ubuntu mirror
Backups	off-site backup pipelines + storage health (volumes, databases, config, Garage clusters, NAS)
Websites	personal + customer static sites
DNS	authoritative + recursive resolver + OpenNIC tier-2
Smart Home	Home Assistant, Frigate, MQTT, smart-plug integrations, remote-access VPN
Internet	WAN uplink, IPv6 + NAT64, BPI-R4 router
NTP	public NTP time server (`mirror.mdapi.ch`), `pool.ntp.org` contributor

Alertmanager configuration prerequisite¶

The bundled rancher-monitoring chart ships an Alertmanager CR. By default alertmanagerConfigMatcherStrategy is OnNamespace — every AlertmanagerConfig route gets namespace=<ac-namespace> prepended, so routes never match alerts from other namespaces. Required one-time patch:

kubectl -n cattle-monitoring-system patch alertmanager <name> --type=merge \
  -p '{"spec":{"alertmanagerConfigMatcherStrategy":{"type":"None"}}}'

Risk: chart upgrades may revert this. Documented in the bundle's README so the recovery command is one copy-paste away.
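The effect of the two strategies can be modelled in a few lines. This is a simplified illustration, not the operator's implementation: matchers are treated as plain equality checks, and the AlertmanagerConfig namespace `example-ns` is hypothetical.

```python
def effective_matchers(route_matchers, ac_namespace, strategy="OnNamespace"):
    """Model the matchers a route ends up with under each strategy."""
    matchers = dict(route_matchers)
    if strategy == "OnNamespace":
        # The operator adds namespace=<ac-namespace> to the route.
        matchers["namespace"] = ac_namespace
    return matchers

def route_matches(matchers, alert_labels):
    """A route matches when every matcher equals the alert's label."""
    return all(alert_labels.get(k) == v for k, v in matchers.items())

alert = {"team": "mdapi", "namespace": "mail"}
route = {"team": "mdapi"}

on_namespace_match = route_matches(
    effective_matchers(route, "example-ns"), alert)
none_match = route_matches(
    effective_matchers(route, "example-ns", strategy="None"), alert)
```

Under `OnNamespace` the alert from the `mail` namespace is not matched, because the route now also requires `namespace=example-ns`; under `None` the `team=mdapi` matcher alone decides.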

Synthetic probes¶

Each statuspage component is backed by one or more Probe CRDs targeting blackbox-exporter:

HTTP /health/ready style endpoints where the app exposes one
HTTP 200 | 301 | 302 | 401 | 403 (http_any_redirect module) for OIDC-gated services that 302 to login
TCP banner check for SMTP, TCP+TLS handshake for IMAPS
DNS A query for the authoritative resolver

probe_success == 0 for 5m → critical alert with the matching component label. Cert expiry < 14d on the probe_ssl_earliest_cert_expiry series → warning alert (backstop on cert-manager).
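The cert-expiry condition is a plain timestamp comparison, because blackbox-exporter reports probe_ssl_earliest_cert_expiry as a Unix timestamp. A rough Python equivalent of the threshold check (the function name is hypothetical, and the real rule is PromQL, not Python):

```python
from datetime import datetime, timedelta, timezone

def cert_expires_soon(expiry_ts, now=None, threshold=timedelta(days=14)):
    """True when the certificate expires within `threshold`.

    `expiry_ts` is a Unix timestamp, the unit used by
    probe_ssl_earliest_cert_expiry.
    """
    now = now or datetime.now(timezone.utc)
    expiry = datetime.fromtimestamp(expiry_ts, tz=timezone.utc)
    return expiry - now < threshold
```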

Workload health¶

PrometheusRule files keyed off kube-state-metrics:

Deployment unavailable replicas > 0 for 5m → critical
StatefulSet ready replicas < desired for 5m → critical
Pod container restart rate > 0.5/min over 15m → warning
Pod OOMKilled within last 1h → warning
PVC fill > 85% → warning

Each rule sets team=mdapi and the appropriate component label so the routing pipeline does the right thing without rule-by-rule receiver wiring.

Log-based alerts (cribl-am-relay)¶

A small Python relay (cribl-am-relay, in the cribl namespace) runs a chain of pattern detectors on log records, builds Alertmanager v2 PostableAlerts with proper labels + annotations + endsAt, and POSTs to /api/v2/alerts. Each detector is fed by its own narrow ClusterFlow that grep-prefilters at fluentd so only candidate lines hit the relay:

Detector	ClusterFlow (`monitoring-rules/`)	Pattern	Severity	Auto-resolve
`ModSecBlockLikelyFalsePositive`	`15a-logging-flow-ingress-nginx.yml`	`ModSecurity: Access denied` + client IP in trusted CIDR (192.168/16 except 192.168.164.2, 10/8, 100.64/10)	warning	5 min
`PodIOError`	`14a-logging-flow-pod-io.yml`	`[Errno 5]`, `blk_update_request.*I/O error`, `EXT4-fs error`, `XFS metadata/writeback error`, `buffer_io_error`, `EIO read/write`, `stale file handle`, `pcfg_openfile: Permission denied`	warning	5 min
`PostfixDeliveryFailure`	`14-logging-flow-postfix.yml`	`status=bounced|deferred`, `loops back`, `Connection refused` in the mail namespace docker-mailserver pod	warning	5 min
`RedisBGSAVEFailure`	`14b-logging-flow-redis-bgsave.yml`	`Background saving error`, `Failed opening the temp RDB file`, `Error saving DB on disk`	warning	10 min

The relay tags each alert with team=mdapi plus a component derived from the namespace (Mail / Sign-in (SSO) / Platform / Files & Documents / Smart Home / Websites) so the existing Alertmanager route delivers it to statuspage + Pushover with no per-detector receiver wiring.

The all-logs catch-all ClusterFlow ships everything (except kube-system and cattle-logging-system) to Cribl for search only: Live Capture against the same fluentd HTTP stream for real-time tail, and the pod_logs stream in OpenObserve for indexed history. It does not feed the relay. The catch-all volume saturates the relay's fluentd output buffer and triggers drop_oldest_chunk, so signal detection has to run from narrow grep-prefiltered flows.

The relay's modsec detector excludes 192.168.164.2 from the trusted-source check explicitly. That address is the Jool NAT64 v4 egress for external IPv6 clients (see external traffic on bpi-r4) — blocks from there are real attacks, not false positives. The companion address 192.168.164.3 (NAT64 v4 egress for internal LAN IPv6 clients) stays inside 192.168/16 and is treated as trusted, so blocks from there do alert as likely false positives.
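The trusted-source rule described above can be sketched with the standard `ipaddress` module. This is an illustration of the rule, not the relay's actual code; the names `TRUSTED_NETWORKS`, `UNTRUSTED_EXCEPTIONS`, and `is_trusted_source` are hypothetical.

```python
import ipaddress

TRUSTED_NETWORKS = [
    ipaddress.ip_network("192.168.0.0/16"),
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("100.64.0.0/10"),
]
# Jool NAT64 v4 egress for external IPv6 clients: never treated as trusted.
UNTRUSTED_EXCEPTIONS = {ipaddress.ip_address("192.168.164.2")}

def is_trusted_source(client_ip):
    """True when a block from this IP is a likely false positive."""
    try:
        addr = ipaddress.ip_address(client_ip)
    except ValueError:
        return False  # unparseable address: do not treat as trusted
    if addr in UNTRUSTED_EXCEPTIONS:
        return False
    return any(addr in net for net in TRUSTED_NETWORKS)
```

Checking the exception set before the networks is what keeps 192.168.164.2 out even though it sits inside 192.168/16, while 192.168.164.3 still counts as trusted.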

Adding a new log-pattern detector¶

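# In monitoring-cribl-am-relay/relay.yml, inside the server.py ConfigMap

# Needed by the snippet below if server.py does not already import them.
import re
from datetime import datetime, timedelta, timezone

MY_PATTERN_RE = re.compile(r"keyword1|keyword2", re.I)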

def detect_my_signal(ev):
    line = (ev.get("log") or ev.get("message") or "").strip()
    if not line or not MY_PATTERN_RE.search(line):
        return None
    k = ev.get("kubernetes") or {}
    # (optional) scope by namespace / label
    if k.get("namespace_name") != "mynamespace":
        return None
    return {
        "labels": {
            "alertname": "MySignalDetected",
            "team":      "mdapi",
            "component": ns_to_component(k.get("namespace_name") or ""),
            "severity":  "warning",
            "signal":    "my_signal",
            "namespace": k.get("namespace_name", "?"),
            "pod":       k.get("pod_name", "?"),
        },
        "annotations": {
            "summary":     f"... {line[:120]}",
            "description": "...",
        },
        "endsAt": (datetime.now(timezone.utc) + timedelta(minutes=5)).isoformat().replace("+00:00", "Z"),
    }

DETECTORS = [detect_modsec, detect_pod_io, detect_postfix_failure, detect_redis_bgsave, detect_my_signal]

Steps:

1. Add detect_<name>(ev) and append it to DETECTORS in relay.yml.
2. Add a narrow ClusterFlow under monitoring-rules/14<letter>-logging-flow-<name>.yml that grep-prefilters on the same pattern your detector matches, then ships to am-relay. Keep the grep regex aligned with the relay's regex: too narrow at the flow level = false negatives, too wide = relay buffer overflow. Do not widen the all-logs catch-all to feed the relay.
3. Commit + push.
4. Verify by injecting a synthetic event:

PAYLOAD='[{"log":"<line matching your pattern>","kubernetes":{"namespace_name":"<ns>","pod_name":"<pod>","container_name":"<c>"}}]'
kubectl -n windmill exec deploy/windmill-workers-default -- curl -s -X POST \
  http://cribl-am-relay.cribl.svc.cluster.local:9999/ingest \
  -H 'Content-Type: application/json' -d "$PAYLOAD"

# Then check AM:
kubectl -n cattle-monitoring-system exec alertmanager-rancher-monitoring-alertmanager-0 -c alertmanager -- \
  wget -qO- 'http://localhost:9093/api/v2/alerts' | jq '.[] | select(.labels.alertname=="MySignalDetected")'

The relay's per-event check (a cheap substring test before the regex) keeps up with the ~6k events/min volume, delivered in fluentd batches roughly every 5s, without measurable load.
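A substring-before-regex gate might look like the sketch below. The regex alternatives are the RedisBGSAVEFailure patterns from the detector table; the `NEEDLES` literals and the function name are assumptions chosen for illustration, not taken from the relay.

```python
import re

# Hypothetical cheap literals: every line the regex can match contains
# at least one of them (compared in lowercase).
NEEDLES = ("saving", "rdb")

PATTERN_RE = re.compile(
    r"Background saving error"
    r"|Failed opening the temp RDB file"
    r"|Error saving DB on disk",
    re.I,
)

def line_matches(line):
    """Run the regex only on lines that pass the substring gate."""
    lowered = line.lower()
    if not any(needle in lowered for needle in NEEDLES):
        return False  # most lines exit here without touching the regex
    return PATTERN_RE.search(line) is not None
```

The gate is only safe if the needles cover every regex alternative; otherwise it introduces false negatives of its own.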

Log search¶

Two surfaces, picked by time horizon.

Indexed historical search (up to retention): OpenObserve at logs.mdapi.ch. SQL-over-Parquet on Ceph RGW. Three streams are populated; the two that carry general log data are:

modsec — every WAF audit JSON record and every nginx ModSecurity: Access denied line, with matched_rules, matched_messages, hostname, uri, client_ip, unique_id lifted to top-level columns by the Cribl pipeline.
pod_logs — the catch-all all-logs ClusterFlow (everything except kube-system and cattle-logging-system). Kubernetes metadata isn't lifted yet — the full fluentd record lives in _raw, so queries are _raw LIKE '%pattern%' style. Baseline ingest is ~5k events/min ≈ 7M docs/day; watch the Ceph RGW pool capacity if retention is bumped.

Default retention is 7 days; override per-stream when a longer horizon is justified. New streams just need a Cribl route + pipeline pointing at the same ES-compatible bulk endpoint.

-- Where did this 403 come from? Root rule(s):
SELECT _timestamp, hostname, uri, method, http_code,
       matched_rules, matched_messages, client_ip
FROM modsec
WHERE hostname = 'rspamd.mdapi.ch' AND modsec_kind = 'audit'
ORDER BY _timestamp DESC LIMIT 10;

-- Root-cause a CronJob that was GC'd before you could read its logs:
SELECT _timestamp, _raw
FROM pod_logs
WHERE _raw LIKE '%gitlab-toolbox-backup%' AND _raw LIKE '%error%'
ORDER BY _timestamp DESC LIMIT 50;

Real-time tail — Cribl Live Capture. Source / Route / Pipeline → Live Data tab (10-second sliding window of incoming events). Useful for ad-hoc investigations that are happening now.

All logs from one namespace: kubernetes.namespace_name === '<ns>'
Errors anywhere: /error|fail|exception|fatal|panic/i.test(log || '')
Specific container: kubernetes.container_name === '<container>'

For aggregate metrics (pod restart counts, OOM rates, PVC fill), the kube_pod_* series are queried from VictoriaMetrics (3-year retention) by default, with the rancher-monitoring Prometheus (~15d) as the fallback datasource.

Batch-job metrics (Pushgateway)¶

Short-lived jobs — cluster sweeps, external API checks, anything that runs on a cron and exits — publish their results as Prometheus metrics through Pushgateway:

batch job ──push──▶ Pushgateway ──scrape──▶ Prometheus ──rule──▶ Alertmanager
                    (HTTP)        (15s)              (PromQL)    (statuspage + Pushover)

Each job pushes:

<name>_last_run_timestamp_seconds — backstop "did it actually run?"
<name>_last_run_success (1/0) — backstop "did it succeed?"
Domain gauges produced by the check (backup age, days since last snapshot, S3 object count, …)

A meta BatchJobStale rule fires per job when now - _last_run_timestamp_seconds > 2 × cadence_seconds, where cadence_seconds comes from windmill_schedule_interval_seconds, published daily by f/infra_health/schedule_cadence_map. New scheduled probes are picked up automatically, with no central regex list to maintain, but the probe's Pushgateway job label must equal basename(script_path) or the join silently misses. Each domain gauge has its own rule with the appropriate severity/component/team labels.
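The staleness condition and the label join can be restated in Python. This is a model of the logic only (the real rule is PromQL); the function names and the dict-based inputs are hypothetical.

```python
import posixpath

def expected_job_label(script_path):
    """The job label a probe must push under for the join to work."""
    return posixpath.basename(script_path)

def is_stale(now, last_run_ts, cadence_seconds):
    """Stale when the last run is older than twice the cadence."""
    return now - last_run_ts > 2 * cadence_seconds

def stale_jobs(now, last_runs, cadences):
    """last_runs: job -> last-run timestamp; cadences: job -> seconds."""
    stale = []
    for job, cadence in cadences.items():
        last = last_runs.get(job)
        if last is None:
            # No series under this job label: the join silently misses.
            continue
        if is_stale(now, last, cadence):
            stale.append(job)
    return stale
```

The `continue` branch shows why a mismatched job label is dangerous: the job is never reported as stale, no matter how long ago it last ran.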

Pushgateway PUT replaces the whole job group

Probes pushing into a shared job group (i.e. anything that publishes to a /metrics/job/<name> where another script also publishes) must use POST. PUT wipes every metric in the group, including peers' _last_run_timestamp_seconds — which silently disables the BatchJobStale join. Use PUT only when the script owns its job group exclusively.
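The difference is easy to see in a small in-memory model of a job group. This sketch mimics the documented PUT/POST semantics and is not Pushgateway code; the metric names are hypothetical.

```python
def push(groups, job, metrics, method):
    """Apply a push to an in-memory model of Pushgateway job groups."""
    if method not in ("PUT", "POST"):
        raise ValueError(f"unsupported method: {method}")
    group = groups.setdefault(job, {})
    if method == "PUT":
        group.clear()  # PUT replaces every metric in the group
    group.update(metrics)  # POST replaces only metrics with the same name
    return groups

groups = {}
push(groups, "shared", {"a_last_run_timestamp_seconds": 100}, "POST")
push(groups, "shared", {"b_last_run_timestamp_seconds": 200}, "POST")
after_post = dict(groups["shared"])

push(groups, "shared", {"b_last_run_timestamp_seconds": 300}, "PUT")
after_put = dict(groups["shared"])
```

After the two POSTs both scripts' timestamps coexist; after the PUT only script b's metric remains, and script a's staleness series is gone.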

Dead-man switch¶

The bundled rancher-monitoring chart emits an always-firing Watchdog alert. An AlertmanagerConfig forwards it to a healthchecks.io ping URL every 4 minutes:

Prometheus emits Watchdog → AM matches alertname=Watchdog → webhook to hc-ping.com/<uuid>

If Alertmanager or Prometheus is fundamentally broken (can't even deliver the keepalive), healthchecks.io sees the missed heartbeat and fires its own independent alert via email/Pushover. The ping URL is held in akeyless under /mdapi/pushover/healthchecks-watchdog-url and pulled via ExternalSecret.

Lessons¶

Always inventory before deploying a parallel stack

Before deploying, check whether the thing already exists: kubectl get ns (no filter), helm ls -A, and kubectl get crd | grep <expected-group>. A grep over namespaces with the wrong filter will silently miss a stack that has been running for years. Spending an hour scaffolding a parallel kube-prometheus-stack only to discover the cluster already had one is avoidable with a five-second inventory.

Helm-chart fleet bundles ignore sibling raw manifests

A fleet bundle directory with helm.chart set in fleet.yaml uses the upstream chart and ignores other YAMLs in the same dir. Wrapper-chart bundles (no helm.chart) DO include them in templating. For mixed deployments (chart + supplemental CRDs), put the chart in its own bundle and the CRDs in a wrapper bundle.

Helm-shaped fleet bundles must NOT include a Namespace manifest

Fleet creates defaultNamespace itself. A Namespace YAML in the bundle makes Helm try to claim the same ns Fleet just created → ownership conflict → release failed → bundle stuck NotReady forever. Recovery: helm uninstall <release>, drop the Namespace manifest, push.

Fleet's ${VAR} substitution eats ${VAR} even inside YAML scalar comments

When embedding shell scripts inside valuesFiles (e.g. ConfigMap data sections), prefer bare $VAR over ${VAR}. The two forms are equivalent in bash when the variable is followed by a non-identifier character. With ${VAR}, Fleet substitutes against its own variables and the render fails with function "VAR" not defined.
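A pre-commit style check can catch these before Fleet does. The sketch below is a hypothetical lint, not part of the fleet repo: it flags every `${...}` occurrence in a text blob, including ones in comments, since the substitution applies there too.

```python
import re

# Matches ${...}, the form Fleet tries to evaluate.
BRACED_VAR_RE = re.compile(r"\$\{[^}]*\}")

def find_braced_vars(text):
    """Return (line_number, match) pairs for every ${...} occurrence."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in BRACED_VAR_RE.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits

script = 'echo "$HOME/bin"\n# uses ${HOME} here\necho "${USER}"\n'
hits = find_braced_vars(script)
```

In this sample the bare `$HOME` on line 1 passes, while the comment on line 2 and the braced variable on line 3 are both reported.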

Prometheus alert templates aren't sprig

Alert annotations support Go's text/template plus a small set of Prometheus helpers (humanize, humanizeDuration, query, reReplaceAll, safeHtml, title). Sprig functions like default, lower, quote are NOT available. The operator's admission webhook rejects the rule with function "default" not defined.

fluentd flush_interval is silently ignored unless flush_mode: interval is set

Default flush_mode: lazy uses timekey-based flushing — events sit in buffer for ~10min+ before shipping. Set both options explicitly when you want responsive output.

logging-operator match: [{exclude: …}] doesn't route on its own

A ClusterFlow with only an exclude block produces a fluentd label_router route that never matches. Add an explicit select: {} first (match everything), then the exclude. Symptom: fluentd_output_status_emit_count = 0 on the flow despite ClusterFlow status showing Active.

OWASP CRS rule 911100 method enforcement: id:900200 BEFORE Include

Override tx.allowed_methods with a SecAction id:900200 placed before the CRS Include. ModSec keeps the first registration of a duplicate ID, so the CRS crs-setup.conf's own id:900200 is silently dropped and the expanded methods list survives. Rule 911100 then reads the expanded list at phase 1 and doesn't bump anomaly_score for PATCH/PUT/DELETE. Placing the SecAction after the Include doesn't work — 911100 reads tx.allowed_methods in load order before any later SecAction runs. See the ModSecurity page for the full ingress-annotation pattern.

Resource Sizing & VPA¶

The Vertical Pod Autoscaler runs cluster-wide. Roughly half of workloads are in Auto mode (VPA mutates live pod requests on eviction), the rest in Off mode (recommendations are observed but not applied — used for spiky workloads where surprise eviction would be disruptive: GitLab webservice/sidekiq/gitaly, Plex, transcoders, Frigate, Home Assistant, etc.).

For Off-mode workloads, requests and limits are set explicitly in the manifest from the VPA's target and upperBound recommendations. The scheduler then has real numbers to work with and noisy neighbours stay bounded.

VPA Auto preserves limit:request ratio

When VPA Auto bumps a request to its target, it scales the limit by the same factor. A Deployment written with request 200m / limit 2 (a 10× ratio) and a VPA target of 2 cores ends up with a 20-core live limit — enough to push a single node past 100% CPU-limit overcommit on its own and trip KubeCPUOvercommit, without any obvious culprit (the Deployment manifest still says limit: 2).

For workloads under VPA Auto, write request:limit close to 1:1. If a workload genuinely needs burst headroom (transcoder, batch job), keep its VPA in Off mode and size the limit explicitly — VPA's proportional scaling is the wrong tool for spiky workloads anyway. When debugging unexpected per-node CPU overcommit, compare the live pod spec against the Deployment spec — a divergence is the fingerprint of VPA mutation.
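The 20-core figure follows directly from the ratio-preserving behaviour described above. A small worked calculation, using integer millicores to avoid float rounding (the function name is hypothetical):

```python
def live_limit_millicores(request_m, limit_m, target_m):
    """Limit after VPA Auto moves the request to target, keeping the ratio."""
    # new_limit = limit * (target / request), computed in integers.
    return limit_m * target_m // request_m

# Manifest: request 200m, limit 2 cores; VPA target: 2 cores.
burst_ratio = live_limit_millicores(request_m=200, limit_m=2000, target_m=2000)

# Same target, but the manifest written 1:1.
one_to_one = live_limit_millicores(request_m=2000, limit_m=2000, target_m=2000)
```

With the 10× manifest the live limit comes out at 20000m (20 cores); with a 1:1 manifest it stays at 2000m, which is why near-1:1 ratios are the safe choice under VPA Auto.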

Longhorn backup monitoring thresholds¶

Thresholds are set to match the actual backup cadence — not arbitrary defaults:

Label group	Expected cadence	Alert after
`default` (monthly off-site + daily trim)	Monthly	35 days
`weekly` (off-site to Garage)	Weekly	70 days

Thresholds wider than the cadence absorb occasional missed runs without triggering false-positive alerts.

Stale Longhorn mount: detection + remediation¶

A class of failures manifests as filesystem EIO from a pod even though Longhorn reports the volume attached + healthy and dmesg on the node is clean. Symptoms:

Pod up for many days, files in the mount have stale modification times
Writes return EIO (I/O error, [Errno 5] Input/output error)
/proc/mounts shows rw, perms are correct
The Longhorn engine has restarted out from under the pod at some point and the pod's I/O session is stale

Fix is always a pod restart — the new pod gets a fresh mount and writes resume immediately. A Windmill flow does this automatically for CrashLoopBackOff + EIO-in-logs, and a companion LonghornVolumeReadOnlyRemount alert covers the silent Running-but-readonly case the log signature misses. See Longhorn → Mount-stall auto-recovery for the full picture.

The PodIOError detector in cribl-am-relay matches the EIO log signature across all namespaces and fires an alert tagged with namespace / pod / container / component, so an operator who wants to act before the auto-healer wakes up has the offending pod in the notification. A separate LonghornVolumeDegraded / LonghornVolumeFaulted rule keys on longhorn_volume_robustness so a degraded volume is alerted even before a workload tries to read or write to it.