Document Management

The household's paper archive (every bill, bank statement, insurance policy, contract and certificate worth keeping) lives at documents.mdapi.ch, served by Paperless-ngx in the paperless namespace. Documents arrive by three independent paths, are OCRed and classified automatically, and end up filed in a {year}/{correspondent}/{title}.pdf archive tree that is browsable read-only through OwnCloud, so nobody has to log in to Paperless itself.

This page is less about Paperless the app than about the integration around it: how a single source of truth for documents is fed by three ingestion paths, exposed through two UIs (Paperless for tagging and search, OwnCloud for "just give me the PDF"), and backed up alongside everything else.

Architecture

flowchart LR
    subgraph in["Ingestion"]
        oc_in["OwnCloud<br/>Paperless-Consume share"]
        mail["IMAP mailbox<br/>documents@mdapi.ch"]
        api["Direct API upload"]
    end

    subgraph pod["paperless pod"]
        puller["rclone puller<br/>(sidecar, 5-min)"]
        unzip["archive-extractor<br/>(sidecar, .zip/.7z/.tar*)"]
        app["paperless-ngx<br/>web · consumer · celery"]
        pusher["library-pusher<br/>(sidecar, 5-min)"]
    end

    subgraph svc["In-namespace services"]
        valkey["Valkey<br/>(Redis-compatible cache)"]
        pg["paperless-pg<br/>(CloudNativePG)"]
    end

    subgraph store["Storage"]
        pvc["paperless-data PVC<br/>100Gi Longhorn<br/>{year}/{correspondent}/{title}.pdf"]
    end

    subgraph out["Read-only browse"]
        oc_out["OwnCloud<br/>Paperless-Library share"]
    end

    oc_in --> puller --> unzip --> app
    mail --> app
    api --> app
    app <--> valkey
    app <--> pg
    app --> pvc
    pvc --> pusher --> oc_out

The paperless pod runs four containers in the same network namespace: the app itself plus three sidecars that make the integrations possible without modifying paperless-ngx upstream.

Ingestion Paths

OwnCloud Paperless-Consume is the everyday path. Tillo and Elisa each have a write-enabled share into a directory that the rclone puller sidecar polls every 5 minutes. Anything that lands there, including subdirectories (tag names are inherited from the path), is pulled into paperless-ngx's /consume/ directory and removed from OwnCloud once processed. The puller carries --delete-empty-src-dirs so the user-visible folder structure stays clean: an empty subdir means "no documents waiting", not "something is stuck".
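One pass of the puller could look roughly like the sketch below. Only the 5-minute interval, the /consume target and the `--delete-empty-src-dirs` flag come from the description above; the use of `rclone move`, the remote name `owncloud:Paperless-Consume` and the function names are assumptions made for illustration.

```python
import subprocess
import time

POLL_SECONDS = 5 * 60  # the 5-minute polling interval described above


def build_pull_command(remote: str, consume_dir: str) -> list[str]:
    """Build an `rclone move` call that drains the share into the consume dir."""
    return [
        "rclone", "move", remote, consume_dir,
        "--delete-empty-src-dirs",  # remove source folders once they are empty
    ]


def run_forever(remote: str = "owncloud:Paperless-Consume",  # hypothetical remote name
                consume_dir: str = "/consume") -> None:
    """Poll the share forever; a failed pass is simply retried on the next tick."""
    while True:
        subprocess.run(build_pull_command(remote, consume_dir), check=False)
        time.sleep(POLL_SECONDS)
```

Keeping command construction separate from the loop makes the flags easy to check without touching the network.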

Archives (.zip, .tar(.gz/.bz2/.xz), .7z) get unpacked by the archive-extractor sidecar before paperless-ngx sees them. paperless-ngx upstream doesn't natively handle archives, so this sidecar exists to bridge "user drops a ZIP of receipts on the share" to "paperless sees individual PDFs in the consume queue". Charset-aware ZIP handling recovers cp437 filenames from old Windows-era exports.
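For the ZIP case, the extraction step might be sketched as follows. This is not the sidecar's actual code: the function name is hypothetical, and flattening every member into the top of the consume directory is an assumption. It relies on the standard-library behaviour that `zipfile` decodes member names as cp437 unless the archive sets the UTF-8 flag.

```python
import zipfile
from pathlib import Path


def extract_zip(archive: Path, consume_dir: Path) -> list[Path]:
    """Unpack every file in `archive` directly into `consume_dir`."""
    written = []
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            # zipfile has already decoded info.filename: UTF-8 when the
            # archive sets flag bit 11, cp437 otherwise (old Windows exports).
            # Keep only the last path component so an entry such as
            # "../../x.pdf" cannot escape the consume directory.
            name = Path(info.filename.replace("\\", "/")).name
            if not name:
                continue
            target = consume_dir / name
            target.write_bytes(zf.read(info))
            written.append(target)
    return written
```

Flattening discards the folder structure inside the archive and lets same-named members overwrite each other; a real extractor would likely need to handle both, as well as the tar and 7z formats.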

IMAP mailbox documents@mdapi.ch is polled every 5 minutes. Any attachment matching the rule (PDF, scanned image, common office formats) is extracted and fed to the consume pipeline, and the source email is then moved to a Processed folder. This is the path for "forwarded receipt from a vendor" or "scan from a multifunction printer that emails to documents@".
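The attachment rule can be illustrated with the standard-library `email` package. The extension list below is an assumption (the text only names the broad categories), and the function is a sketch of the filtering idea rather than how paperless-ngx implements its mail rules.

```python
from email.message import EmailMessage
from pathlib import PurePosixPath

# Assumed extension list; the text above only names the categories.
ALLOWED = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".docx", ".xlsx", ".odt"}


def matching_attachments(msg: EmailMessage) -> list[tuple[str, bytes]]:
    """Return (filename, payload) for each attachment with an allowed extension."""
    found = []
    for part in msg.iter_attachments():
        name = part.get_filename()
        if name and PurePosixPath(name).suffix.lower() in ALLOWED:
            # decode=True undoes the base64 / quoted-printable transfer encoding
            found.append((name, part.get_payload(decode=True)))
    return found
```

When parsing raw messages fetched over IMAP, pass `policy=email.policy.default` to the parser so that the result is an `EmailMessage` and `iter_attachments()` is available.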

Direct API upload is available for scripts and for paperless-ngx mobile clients; in practice it's used rarely because the OwnCloud path is easier for non-technical users.
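A script-side upload might be prepared as in the sketch below, which uses only the standard library. It targets paperless-ngx's `post_document` endpoint with token authentication; the function name and the way the token is supplied are illustrative assumptions.

```python
import urllib.request
import uuid
from pathlib import Path


def build_upload_request(base_url: str, token: str, pdf: Path) -> urllib.request.Request:
    """Prepare a multipart POST carrying one file in the `document` form field."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="document"; filename="{pdf.name}"\r\n'.encode(),
        b"Content-Type: application/pdf\r\n\r\n",
        pdf.read_bytes(),
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/documents/post_document/",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )

# Sending is a single call once the request exists:
# urllib.request.urlopen(build_upload_request("https://documents.mdapi.ch", token, Path("scan.pdf")))
```

The upload only queues the file for consumption; OCR and classification happen asynchronously afterwards, as with the other two paths.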

Classification, Workflows and Custom Fields

paperless-ngx's auto-classifier is retrained nightly against the manually curated tag / correspondent / document-type set. Tags created in "Any word" mode propagate from filenames and folder paths; tags created in "Automatic" mode are predicted by the classifier from the OCR text. Both modes coexist on the same document.

Workflows route documents based on the correspondent field (which represents the issuing entity — bank, government, insurer — not the recipient). Six are deployed today, each with a conservative trigger (correspondent contains "<name>") and a small downstream effect (tag adjustment, document-type assignment, archive-filename template). The list — PostFinance, BCBE, Etat de Vaud, Assura, Kyos, CPEV — covers the recurring sources of paperwork in the household; everything else falls through to the default catch-all.
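The trigger logic amounts to a first-match lookup with a fallback. The sketch below models it in plain Python; it is not how paperless-ngx stores workflows, and the case-insensitive comparison is an assumption.

```python
# The six deployed workflows, in the order listed above.
WORKFLOWS = ["PostFinance", "BCBE", "Etat de Vaud", "Assura", "Kyos", "CPEV"]


def route(correspondent: str) -> str:
    """Return the first workflow whose name occurs in the correspondent."""
    haystack = correspondent.casefold()
    for name in WORKFLOWS:
        if name.casefold() in haystack:
            return name
    return "default"  # the catch-all for everything else
```

For example, `route("PostFinance AG")` selects the PostFinance workflow, while a correspondent matching none of the six names falls through to `"default"`.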

Custom fields add structured metadata that paperless-ngx's tag/type/correspondent model can't express. Six fields are defined: amount, due_date, paid, valid_until, iban, policy_number. Two have automatic extraction: a currency-aware regex pulls amount (CHF / EUR / USD) and a Swiss-IBAN-with-checksum-validation routine pulls iban. The other four require manual entry — the extraction signal is too weak (random "policy numbers" look like everything else) to backfill without supervision.
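The two automatic extractors could look something like the sketch below. The regular expressions and function names are illustrative assumptions, not the deployed code; the only standard part is the mod-97 check, which moves the first four characters to the end, maps letters to numbers (A=10 through Z=35) and requires a remainder of 1.

```python
import re
from decimal import Decimal
from typing import Optional

# Currency code followed by an amount such as 1234.50 or 1'234.50.
AMOUNT_RE = re.compile(r"\b(CHF|EUR|USD)\s?(\d+(?:['’]\d{3})*(?:\.\d{2})?)")
# Swiss IBAN: "CH", 2 check digits, 17 further characters, optional spaces.
IBAN_RE = re.compile(r"\bCH\d{2}(?: ?[0-9A-Z]{4}){4} ?[0-9A-Z]\b")


def extract_amount(text: str) -> Optional[tuple]:
    """Return (currency, Decimal amount) for the first match, else None."""
    m = AMOUNT_RE.search(text)
    if not m:
        return None
    digits = m.group(2).replace("'", "").replace("’", "")
    return m.group(1), Decimal(digits)


def iban_is_valid(iban: str) -> bool:
    """Check the length/shape of a Swiss IBAN and its mod-97 checksum."""
    s = iban.replace(" ", "").upper()
    if not re.fullmatch(r"CH\d{2}[0-9A-Z]{17}", s):
        return False
    rearranged = s[4:] + s[:4]
    # int(ch, 36) maps digits to themselves and A..Z to 10..35
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


def extract_iban(text: str) -> Optional[str]:
    """Return the first checksum-valid Swiss IBAN in `text`, without spaces."""
    for m in IBAN_RE.finditer(text):
        if iban_is_valid(m.group(0)):
            return m.group(0).replace(" ", "")
    return None
```

The checksum step is what keeps the IBAN extractor usable unsupervised: a 21-character string that merely looks like an IBAN is rejected unless the remainder is exactly 1.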

The combination of correspondent for routing, tags for retrieval, and custom fields for structured queries is what makes "show me every PostFinance invoice over CHF 1000 that's unpaid" expressible as a saved view.
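To make the combination concrete, the example query can be written as a predicate over a document record. The dict shape below is hypothetical and is not the Paperless API representation; it only shows which dimension contributes which condition.

```python
from decimal import Decimal


def is_large_unpaid_postfinance(doc: dict) -> bool:
    """True for unpaid PostFinance invoices above CHF 1000."""
    fields = doc.get("custom_fields", {})
    return (
        doc.get("correspondent") == "PostFinance"        # routing dimension
        and doc.get("document_type") == "Fattura"        # invoice document type
        and fields.get("paid") is False                  # custom field: paid
        and fields.get("amount", Decimal(0)) > Decimal(1000)  # custom field: amount
    )
```

Without the `amount` and `paid` fields, the last two conditions would have to be approximated with tags, which cannot express a numeric threshold.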

Storage and the Archive Tree

The paperless-data PVC is 100 GiB on Longhorn (2-replica). The default Paperless layout is a flat documents/ directory keyed by document ID, which works for the app but makes any direct-filesystem use (a recovery, an audit, a manual bulk operation) painful. The deployed layout overrides Paperless's filename format to {year}/{correspondent}/{title}.pdf, which:

Mirrors the way a human would organize a physical filing cabinet — open the binder for 2025, open the PostFinance tab, find the statement.
Survives a Paperless DB restore better — if the metadata database is lost, the filesystem alone still makes sense.
Makes the library-pusher sidecar (below) trivial: it can mirror paperless-data/archive/{year}/... straight into OwnCloud without translating IDs back to names.
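How a document's metadata maps onto that tree can be sketched as below. Paperless renders the path itself from its filename-format setting; this stand-alone version only illustrates the mapping, and the sanitising rule and the `none` fallback for empty values are assumptions.

```python
import re
from datetime import date
from pathlib import PurePosixPath


def sanitize(part: str) -> str:
    """Replace separators and control characters so a value stays one path segment."""
    cleaned = re.sub(r"[/\\\x00-\x1f]", "-", part).strip()
    return cleaned or "none"  # assumed placeholder for empty values


def archive_path(created: date, correspondent: str, title: str) -> PurePosixPath:
    """Render {year}/{correspondent}/{title}.pdf relative to the archive root."""
    return PurePosixPath(
        str(created.year), sanitize(correspondent), sanitize(title) + ".pdf"
    )
```

A statement created on 2025-03-31 by PostFinance and titled "Statement March" would land at `2025/PostFinance/Statement March.pdf`.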

OwnCloud Bidirectional Bridge

The integration is symmetric across two shares:

Paperless-Consume (write, two users) — for ingestion. WebDAV in.
Paperless-Library (read-only, two users) — for browsing. WebDAV out, kept in sync every 5 minutes by the library-pusher sidecar which mirrors the archive tree from the Paperless PVC into OwnCloud via the same WebDAV endpoint.

The architectural note worth recording is that OwnCloud's OCS API (the documented mechanism for managing shares) consistently returns 401 even with valid credentials — this is a permanent OwnCloud-side behaviour, not a misconfiguration. The shares are created instead by direct PostgreSQL inserts into oc_share, which OwnCloud reads on every share validation. WebDAV traffic — both directions — goes through unchanged. The pattern (use the supported channel where it works, fall back to the data plane where it doesn't) generalises to other OwnCloud integrations.
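The difference between the two shares comes down to the permission bitmask stored with each row. The sketch below uses OwnCloud's share permission bits (read 1, update 2, create 4, delete 8, share 16). The column names in `share_row`, the exact mask chosen for the write share and the owner account are assumptions that should be checked against the live oc_share schema before inserting anything.

```python
from enum import IntFlag


class Perm(IntFlag):
    READ = 1
    UPDATE = 2
    CREATE = 4
    DELETE = 8
    SHARE = 16


# Assumed masks: write-enabled without resharing, and plain read-only.
CONSUME = Perm.READ | Perm.UPDATE | Perm.CREATE | Perm.DELETE
LIBRARY = Perm.READ


def share_row(owner: str, recipient: str, file_id: int, target: str, perms: Perm) -> dict:
    """Values for one oc_share row; column names are assumed, not verified."""
    return {
        "share_type": 0,            # 0 = share with a single user
        "share_with": recipient,
        "uid_owner": owner,
        "item_type": "folder",
        "file_source": file_id,     # id of the shared folder in the file cache
        "file_target": target,      # path as seen by the recipient
        "permissions": int(perms),
    }
```

Passing such a dict as named parameters to a parameterised INSERT, rather than formatting values into the SQL string, keeps the direct-to-database fallback from introducing an injection risk.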

Net effect: a user who never logs into Paperless can drop a PDF into their Paperless-Consume folder and find the OCRed, classified, filed copy in Paperless-Library within roughly ten minutes: up to five for the consume puller to pick it up and up to five for the library pusher to mirror it, plus whatever OCR and classification take in between.

Identity, Backup, Monitoring

Identity — Paperless is fronted by OIDC against the tilloch realm on the MDAPI Keycloak. Same realm as Joplin, OwnCloud and the rest of the household-user services; one account, one MFA enrollment.
Backup — Standard story (see Backups & DR): the 100 GiB paperless-data PVC is in the Longhorn weekly backup group; paperless-pg is a CloudNativePG cluster with Barman continuous WAL + nightly base backup to Garage; the off-site rclone fan-out at 04:00 UTC ships both to Backblaze B2 and o2switch SFTP. No service-specific backup plumbing.
Monitoring — paperless-ngx exposes Celery queue depth, consumer worker state and OCR job latency through the standard Prometheus annotations; the WAF audit on documents.mdapi.ch is part of the daily /waf-audit pass.

Saved Views

Five dashboard views ship on the home page and the left sidebar, covering the queues an operator-of-the-archive looks at routinely:

| View | Query | Notes |
|---|---|---|
| Inbox | `tag:inbox` | Triage queue; every newly consumed document starts here |
| Needs review | `tag:needs-review` | Classifier marked the document low-confidence; human verdict needed |
| Open invoices | `document_type:Fattura AND custom_field:paid=false` | 36 today |
| Insurance policies | `document_type:Polizza` | 11 today |
| Bank statements | `document_type:Estratto` | 45 today |

A current Paperless limitation is that saved views can't use relative dates ("this month", "in the last 90 days"): the filter expressions are absolute. The workaround, if that becomes painful enough, is a weekly Windmill regenerator script that rewrites the date bounds in place; for now the absolute-date views are good enough.
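The date arithmetic such a regenerator would need is small. The sketch below only computes the absolute bounds; writing them back into the saved views is left out, and the function names are hypothetical.

```python
from datetime import date, timedelta


def this_month(today: date) -> tuple[date, date]:
    """First and last day of the month containing `today`."""
    first = today.replace(day=1)
    # Jumping 32 days from the 1st always lands in the following month.
    next_first = (first + timedelta(days=32)).replace(day=1)
    return first, next_first - timedelta(days=1)


def last_n_days(today: date, n: int) -> tuple[date, date]:
    """Window starting n days before `today` and ending on `today`."""
    return today - timedelta(days=n), today


def main(today_iso: str) -> dict:
    """Entry point in the style of a script function: returns ISO date bounds."""
    today = date.fromisoformat(today_iso)
    month_start, month_end = this_month(today)
    window_start, window_end = last_n_days(today, 90)
    return {
        "this_month": [month_start.isoformat(), month_end.isoformat()],
        "last_90_days": [window_start.isoformat(), window_end.isoformat()],
    }
```

Run weekly, the "last 90 days" window drifts by at most seven days between refreshes, which is usually acceptable for a dashboard view.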

Backups & DR — where the 100 GiB PVC and the CNPG cluster land in the wider backup architecture.
Authentication & Identity — the OIDC realm Paperless authenticates against.
Storage Architecture — the underlying Longhorn class the PVC uses, and the Garage buckets the Postgres Barman stream lands in.
ModSecurity WAF — the WAF carveouts (or absence thereof) that govern public access to documents.mdapi.ch.