Secret Management

This page covers how secrets are handled at runtime — where the plaintext lives, who reconstructs it, how workloads consume it. The complementary Supply Chain Security page covers how secrets are kept out of source control in the first place (gitleaks pre-commit + CI + .gitignore baseline). The two together make leaked credentials rare and well-handled when they do occur.

Architecture

Secrets are stored in Akeyless — a zero-knowledge secrets management platform. The critical distinction is the customer fragment: a secret shard stored on-premise in a CipherTrust Manager VM. Akeyless SaaS holds only an encrypted fragment; without the customer shard it cannot reconstruct any secret.

flowchart LR
    dev["Operator"]
    ak["Akeyless SaaS\n(encrypted fragment only)"]
    cm["CipherTrust Manager\ncm.home.tillo.ch\ncustomer fragment"]
    eso["External Secrets Operator\nClusterSecretStore: cm-akeyless"]
    sec["K8s Secret"]
    pod["Pod"]

    dev -->|"store secret"| ak
    ak <-->|"customer fragment\nnever transmitted\nover internet"| cm
    eso -->|"fetch at runtime\n/mdapi/* paths"| cm
    eso --> sec --> pod

The gateway is embedded in the CipherTrust Manager VM and exposed at cm.home.tillo.ch. The External Secrets Operator (ESO) talks to the v2 API endpoint at /akeyless-api/v2.
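A minimal sketch of what the cm-akeyless ClusterSecretStore could look like, using the Akeyless provider fields of ESO. The store name and the gateway host and path come from this page; the https scheme, the credentials Secret (akeyless-creds), its namespace (external-secrets), and its key names are assumptions for illustration only.

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: cm-akeyless
spec:
  provider:
    akeyless:
      # Gateway embedded in the CipherTrust Manager VM, v2 API endpoint
      # (https scheme is assumed)
      akeylessGWApiURL: "https://cm.home.tillo.ch/akeyless-api/v2"
      authSecretRef:
        secretRef:
          # Hypothetical Secret holding the Akeyless access credentials.
          # A ClusterSecretStore must name the namespace of each reference.
          accessID:
            name: akeyless-creds
            namespace: external-secrets
            key: accessId
          accessType:
            name: akeyless-creds
            namespace: external-secrets
            key: accessType
          accessTypeParam:
            name: akeyless-creds
            namespace: external-secrets
            key: accessTypeParam
```

Because the store is cluster-scoped, every namespace can reference it by name, which is why the path convention described on this page carries the per-namespace separation.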

CipherTrust Manager — the on-premise root

The CipherTrust Manager itself runs as a KubeVirt VM on the Harvester cluster, with a Longhorn-backed persistent disk that holds the customer fragment. Two consequences:

  • The fragment never leaves the LAN. A subpoena against Akeyless SaaS, a credential compromise on the SaaS side, or a complete Akeyless outage cannot reconstruct any secret without this VM cooperating. The on-prem half of the split-knowledge model is what makes "zero-knowledge" mean something at the customer level rather than just the vendor level.
  • The VM is itself a recovery target. Its disk is included in the weekly Longhorn snapshot group and in the daily off-site rclone sync of harvester-backup (see Backups & DR). When a CM upgrade has broken the fragment service in the past, the recovery path has been to restore the disk from the most recent known-good Longhorn backup — typically under five minutes.

Path Convention

Every secret lives under a path of the form /mdapi/<namespace>/<secret-name>/<key>. The namespace segment matches the Kubernetes namespace that consumes the secret, which means an ExternalSecret in joplin can never accidentally fetch a key intended for bootstrap: access control on the customer fragment is scoped to the path, not just to the gateway.

Examples currently in use:

| Path | Consumer |
|------|----------|
| /mdapi/joplin/tsig-secret/tsig-secret-key | external-dns DDNS updates from the joplin namespace |
| /mdapi/<ns>/mdapi-registry | per-namespace dockerconfigjson for the internal registry |
| /mdapi/gitlab/api-token | non-expiring GitLab API token used by every CI consumer |
| /mdapi/openobserve/o2/root-password | OpenObserve admin password fetched by the WAF and DNS audit skills |
| /mdapi/pushover/mdapi-alertmanager-token | Alertmanager → Pushover notification token |

The convention is enforced by code review and by the /akeyless-secret-manage skill rather than by Akeyless itself, but the resulting structure is what makes the audit log readable and the rotation runbooks trivial.

Secret Lifecycle

  1. Write — operator creates/updates a secret via Akeyless CLI or UI; it is stored encrypted, split between Akeyless SaaS and the on-premise fragment.
  2. Fetch — ESO reads the ExternalSecret CR, authenticates to the gateway, and reconstructs the plaintext. The plaintext exists only in memory and in the resulting K8s Secret.
  3. Consume — pods reference the K8s Secret as environment variables or volume mounts; they never talk to Akeyless directly.
  4. Rotate — the secret is updated in Akeyless; ESO refreshes the K8s Secret at the next refreshInterval (set to 1m in the ExternalSecret pattern on this page; the ESO default is 1h when the field is omitted).
  5. Revoke — deleting the ExternalSecret CR also removes the K8s Secret: with the default creationPolicy: Owner, the Secret is owned by the ExternalSecret and is garbage-collected with it.
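The Consume step can be sketched as a Pod that reads the synced K8s Secret both as an environment variable and as a mounted volume. The Secret name my-secret and key MY_VAR follow the ExternalSecret pattern on this page; the Pod name, image, and mount path are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app                 # hypothetical
  namespace: my-namespace
spec:
  containers:
    - name: app
      image: registry.example.com/my-app:1.0   # hypothetical image
      env:
        # Option 1: inject a single key as an environment variable
        - name: MY_VAR
          valueFrom:
            secretKeyRef:
              name: my-secret
              key: MY_VAR
      volumeMounts:
        # Option 2: mount every key of the Secret as a file
        - name: secrets
          mountPath: /etc/secrets   # hypothetical path
          readOnly: true
  volumes:
    - name: secrets
      secret:
        secretName: my-secret
```

The two options behave differently on rotation: environment variables are read once at container start, so a rotated value only takes effect after the pod restarts, whereas files in a mounted Secret volume are updated in place by the kubelet after a short delay (unless mounted with subPath).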

ExternalSecret Pattern

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: my-secret
  namespace: my-namespace
spec:
  refreshInterval: 1m
  secretStoreRef:
    name: cm-akeyless
    kind: ClusterSecretStore
  target:
    name: my-secret
  data:
    - secretKey: MY_VAR
      remoteRef:
        key: /mdapi/my-namespace/my-secret/MY_VAR

Template field order

When templating a secret from multiple Akeyless keys, NAME (direct value) must come before SELECTOR (path reference) in template.data. Wrong order silently produces empty values — no error is shown.

Long-lived Credentials Belong in Akeyless

A consequence of the architecture rather than a separate feature: every long-lived service credential (API tokens, dockerconfigjson, TSIG keys, database passwords) lives in Akeyless and is fetched via ESO at runtime or via Windmill g/all/* variables at job time. Tokens in .git/config, secrets in environment files, kubeconfigs in working directories — all of those are anti-patterns the Supply Chain Security page is built to catch.

The pattern is most visible on the GitLab side. Personal Access Tokens generated through the GitLab UI inherit a maximum lifetime, which means a long-running consumer (a Windmill schedule, an external-dns reconcile loop) silently breaks the day the token expires. The replacement pattern is to mint a non-expiring token via the Rails console (Gitlab::AccessToken.create! with no expires_at), store it once at /mdapi/gitlab/api-token, and have every consumer fetch it from there. One rotation event, every consumer updates on the next ESO refresh, no UI-token sprawl.

The same logic applies to dockerconfigjson per namespace: a single bootstrap creates the secret in Akeyless and an ExternalSecret in each namespace pulls it. After a registry password rotation, fixing image pulls across the cluster takes a single Akeyless edit instead of a kubectl edit in each of 40 namespaces.
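As an illustration, a per-namespace image-pull Secret could be declared roughly as follows. The joplin namespace is used as the example, and the path follows the /mdapi/<ns>/mdapi-registry convention; the resource name mdapi-registry, the intermediate key dockerconfig, and the assumption that the Akeyless value is a complete dockerconfigjson document are illustrative.

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: mdapi-registry         # hypothetical name
  namespace: joplin
spec:
  refreshInterval: 1m
  secretStoreRef:
    name: cm-akeyless
    kind: ClusterSecretStore
  target:
    name: mdapi-registry
    template:
      # Produce a Secret type that imagePullSecrets accepts
      type: kubernetes.io/dockerconfigjson
      data:
        # Assumes the remote value is already a full dockerconfigjson document
        .dockerconfigjson: "{{ .dockerconfig }}"
  data:
    - secretKey: dockerconfig
      remoteRef:
        key: /mdapi/joplin/mdapi-registry
```

Workloads in the namespace would then list mdapi-registry under imagePullSecrets, and a registry password change propagates on the next ESO refresh.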

Failure Modes

| Symptom | Cause | Resolution |
|---------|-------|------------|
| ExternalSecrets stuck in SecretSyncedError | CipherTrust Manager VM unreachable | Check VM health; restore from Longhorn snapshot if needed |
| /api/derived-key 404 | CM upgrade broke the fragment endpoint | Restore CM VM from last known-good Longhorn backup |
| 500 on derived-key | Fragment service crashed | Same as above |

The CipherTrust Manager VM has a Longhorn-backed persistent disk and is snapshotted on the weekly schedule. Recovery typically takes under 5 minutes.