BPI-R4 OpenWrt Router¶
The edge router is a Banana Pi BPI-R4 running a custom fork of OpenWrt 25.12 targeting the MT7988a SoC. It is the single piece of equipment between the FTTH ONT and the LAN — it terminates PPPoE on the WAN, relays DHCP to the in-cluster Technitium server, and terminates two WireGuard VPNs. The build is fully automated via GitLab CI.
Build Pipeline¶
flowchart LR
push["git push\nopenwrt-25.12-tillo branch"]
ci["GitLab CI\nproject ID 33"]
build["OpenWrt build system\ncross-compile MT7988a"]
artifact["sysupgrade.itb\n(firmware artifact)"]
flash["SSH to bpi-r4\ncat > /tmp/sysupgrade.itb\nsysupgrade -v"]
push --> ci --> build --> artifact --> flash
CI is triggered on every push. The firmware artifact is stored in the GitLab generic packages registry, keyed by pipeline IID.
Kernel Customizations¶
Custom patches live in target/linux/mediatek/patches-6.12/, applied in lexicographic order via the quilt workflow. Key patches:
- 197-dts-mt7988a-add-ramoops.patch: reserves 1 MiB at 0x42f00000 for ramoops/pstore (record-size 128 KiB ×5, 256 KiB console, 64 KiB pmsg, 64 KiB ftrace), enabling kernel crash dumps to survive reboots.
- 970-net-ethernet-mtk_eth_soc-increase-warm-reset-timeout.patch: raises the mtk_hw_warm_reset RSTCTRL_FE timeout from 1 ms to 100 ms (avoids spurious "warm reset failed" on the SoC).
- target/linux/generic/pending-6.12/989-net-sfp-prefix-match-quirks.patch + 990-add_sfp_quirks.patch: adds SFP_QUIRK_F_PREFIX and registers our XGS-PON ONT sticks (the production FS XGS-SFP-ONT-MACI and the OEM XGSPONST2001 clone kept as fallback) with sfp_fixup_potron. Masks SFP_F_TX_FAULT | SFP_F_LOS in state_hw_mask so the SFP state machine doesn't disable the module on spurious assertions, and bumps T_START_UP to 60 s for the slow PON bring-up. Prefix matching is required because the OEM clone fills vendor_pn past the legitimate string with non-printable garbage instead of SFF-8472 space padding. See SFP TX-fault storm for what this does and doesn't prevent.
- Lockup detectors: SOFTLOCKUP_DETECTOR, HARDLOCKUP_DETECTOR, DETECT_HUNG_TASK, and watchdog pretimeout panic enabled for crash diagnostics.
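The ramoops sizes quoted for the first patch can be checked against the 1 MiB reservation. This is a minimal sketch, assuming "×5" means five dump records; the constant and function names are illustrative and do not come from the patch itself.

```python
KIB = 1024

# Sizes as quoted in the ramoops patch description (names are illustrative).
RECORD_SIZE = 128 * KIB
RECORD_COUNT = 5          # assumption: "x5" means five dump records
CONSOLE_SIZE = 256 * KIB
PMSG_SIZE = 64 * KIB
FTRACE_SIZE = 64 * KIB

REGION_SIZE = 1024 * KIB  # the 1 MiB reservation at 0x42f00000


def ramoops_budget() -> int:
    """Return the total number of bytes claimed by all ramoops sub-areas."""
    return RECORD_SIZE * RECORD_COUNT + CONSOLE_SIZE + PMSG_SIZE + FTRACE_SIZE


if __name__ == "__main__":
    used = ramoops_budget()
    print(f"used {used // KIB} KiB of {REGION_SIZE // KIB} KiB")
```

Under that reading the sub-areas fill the region exactly, so growing any one of them means shrinking another or enlarging the reservation on both the kernel and u-boot side.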
The matching u-boot side (which injects its own ramoops node into the kernel FDT before booting) also has to use 1 MiB at 0x42f00000:
- package/boot/uboot-mediatek/patches/103-04-mt7988-enable-pstore.patch: u-boot's mt7988.dtsi
- package/boot/uboot-mediatek/patches/450-add-bpi-r4.patch: six BPI-R4 per-variant defconfigs that set CONFIG_CMD_PSTORE_MEM_ADDR=0x42f00000
Lesson learned: kernel and u-boot ramoops must match
If the kernel DT and u-boot defconfig disagree on the ramoops region, the kernel logs OF: reserved mem: OVERLAP DETECTED! at boot and falls back to whichever node was registered first, usually the smaller one. pstore then captures only ~16 KiB per record instead of 128 KiB. Verify by counting occurrences of the old address 0x42ff0000 in the FIP after a build (the sanity check under Bootloader (FIP) does this).
The count must be 0. Anything else means a defconfig still has the old address.
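The same check can be written as a small reusable function. This is a sketch modelled on the one-liner in the Bootloader (FIP) section; the function names are hypothetical, and like that one-liner it only looks for the little-endian encoding of the address.

```python
import struct

OLD_RAMOOPS_ADDR = 0x42FF0000          # stale address named in the text
FIP_PARTITION_SIZE = 4 * 1024 * 1024   # the fip partition is 4 MiB


def stale_address_hits(image: bytes, addr: int = OLD_RAMOOPS_ADDR) -> int:
    """Count little-endian 32-bit occurrences of addr in the image."""
    return image.count(struct.pack("<I", addr))


def check_fip(image: bytes) -> list:
    """Return a list of human-readable problems; empty means the image passes."""
    problems = []
    if len(image) > FIP_PARTITION_SIZE:
        problems.append(f"image is {len(image)} bytes, partition holds {FIP_PARTITION_SIZE}")
    hits = stale_address_hits(image)
    if hits:
        problems.append(f"{hits} occurrence(s) of 0x{OLD_RAMOOPS_ADDR:08x} found")
    return problems


if __name__ == "__main__":
    import sys
    with open(sys.argv[1], "rb") as fh:
        found = check_fip(fh.read())
    print("\n".join(found) or "ok")
    sys.exit(1 if found else 0)
```

Run it as `python3 check_fip.py uboot.fip` (the script name is an assumption); a non-zero exit status means the image should not be flashed.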
Kernel config layering
Changes must go into target/linux/mediatek/filogic/Config-kernel.in — not directly into config-6.12. Config-kernel.in is processed last and overrides config-6.12. Edits to config-6.12 alone are silently reverted on the next build.
Flash Procedure¶
Dropbear (the SSH server in OpenWrt) has no sftp-server, so file transfers must go through stdin/stdout. scp fails silently or with a protocol error; never use it.
Kernel + rootfs (sysupgrade)¶
ssh root@bpi-r4 'cat > /tmp/sysupgrade.itb' < /path/to/sysupgrade.itb
ssh root@bpi-r4 'sysupgrade -T /tmp/sysupgrade.itb' # validate first
ssh root@bpi-r4 'sysupgrade /tmp/sysupgrade.itb' # flash + reboot
sysupgrade writes only the production partition (/dev/mmcblk0p5) and reboots (~2-3 min).
Bootloader (FIP)¶
When the change is in u-boot itself (DT, defconfig, BL31), sysupgrade is not enough — u-boot lives in the fip partition (/dev/mmcblk0p3). Flash directly:
# Build artifact: openwrt-mediatek-filogic-bananapi_bpi-r4-emmc-bl31-uboot.fip
# Sanity check before flashing
python3 -c "import sys; d=open('uboot.fip','rb').read(); print('size', len(d), 'must <= 4194304'); print('0x42ff0000 hits', d.count(bytes.fromhex('0000ff42')), 'must be 0')"
ssh root@bpi-r4 'cat > /tmp/uboot.fip' < /path/to/uboot.fip
ssh root@bpi-r4 'dd if=/tmp/uboot.fip of=/dev/mmcblk0p3 bs=1M conv=fsync'
# verify the readback matches
ssh root@bpi-r4 'sha256sum /tmp/uboot.fip; SIZE=$(wc -c < /tmp/uboot.fip); dd if=/dev/mmcblk0p3 bs=1 count=$SIZE 2>/dev/null | sha256sum'
ssh root@bpi-r4 'reboot' # u-boot only takes effect after a reboot
The BL2 preloader (emmc-preloader.bin → mmcblk0boot0/1) is rarely changed and isn't needed for u-boot DT or defconfig changes.
Recovery (TFTP)¶
If a flash bricks u-boot, the BPI-R4's u-boot environment has TFTP recovery macros baked in:
- bootmenu_4 → "Load production system via TFTP then write to eMMC"
- boot_tftp_write_bl2 → TFTP-load bootfile_bl2 (= openwrt-mediatek-filogic-bananapi_bpi-r4-emmc-preloader.bin) and write
- boot_tftp_write_fip → TFTP-load bootfile_fip (= openwrt-mediatek-filogic-bananapi_bpi-r4-emmc-bl31-uboot.fip) and write
Requires serial console + a TFTP server with the correct file names.
eMMC Partition Layout¶
| Partition | Label | Size | Contents |
|---|---|---|---|
| mmcblk0boot0/1 | (eMMC HW boot) | 4 MiB ea | BL2 preloader |
| mmcblk0p1 | ubootenv | 512 KiB | u-boot environment |
| mmcblk0p2 | factory | 2 MiB | factory data |
| mmcblk0p3 | fip | 4 MiB | BL31 + u-boot proper |
| mmcblk0p4 | recovery | 32 MiB | recovery FIT |
| mmcblk0p5 | production | 2 GiB | kernel + rootfs (target of sysupgrade) |
MT7988a Reserved Memory¶
| Address | Size | Purpose |
|---|---|---|
| 0x42f00000 | 1 MiB | ramoops / pstore (kernel crash dumps) |
| 0x43000000 | 320 KiB | ATF / secmon |
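The two reservations sit back to back: the ramoops region ends exactly where the ATF / secmon region begins. A short sketch that checks this from the table values (the dictionary keys and helper name are illustrative):

```python
KIB = 1024
MIB = 1024 * KIB

# (start address, size in bytes), copied from the table above.
REGIONS = {
    "ramoops": (0x42F00000, 1 * MIB),
    "atf-secmon": (0x43000000, 320 * KIB),
}


def overlaps(a, b) -> bool:
    """True if two (start, size) regions share at least one byte."""
    a_start, a_size = a
    b_start, b_size = b
    return a_start < b_start + b_size and b_start < a_start + a_size


if __name__ == "__main__":
    ramoops, secmon = REGIONS["ramoops"], REGIONS["atf-secmon"]
    print("ramoops ends at", hex(ramoops[0] + ramoops[1]))
    print("overlap:", overlaps(ramoops, secmon))
```

Because there is no gap between the two, the ramoops region cannot grow upward without running into secmon.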
Post-Crash Diagnosis¶
After a kernel crash and reboot, pstore holds the previous kernel's dmesg:
ls /sys/fs/pstore/ # look for dmesg-ramoops-N or console-ramoops-0
cat /sys/fs/pstore/dmesg-ramoops-0
/etc/rc.local mounts pstore on boot. The boot-reason logger (see Diagnostic Infrastructure) auto-summarises the previous boot's pstore state into /etc/last_boot.json and pushes a sensor to Home Assistant.
Diagnostic Infrastructure¶
A set of helper scripts lives in /usr/local/bin/ (all preserved across firmware upgrades via /etc/sysupgrade.conf):
| Script | Purpose |
|---|---|
| boot-reason.sh | Runs from rc.local at boot. Snapshots pstore record count, dmesg head, page_pool state, kernel version, build into /etc/last_boot.json and pushes sensors to HA (sensor.bpi_r4_boot_cause, sensor.bpi_r4_pstore_records, sensor.bpi_r4_uptime_at_capture, sensor.bpi_r4_page_pool_inflight_boot). |
| health-snapshot.sh | Cron * * * * *. Captures per-minute time-series of WAN/LAN/PPPoE counters, softnet, IRQ totals, page_pool, conntrack into /etc/last_health.jsonl (10 080 lines = 7 days at 1/min, persisted across sysupgrade). The forensic ring used to reconstruct outage timelines. Also pushes derived sensors (sensor.bpi_r4_wan_carrier, sensor.bpi_r4_wan_tx_bps, sensor.bpi_r4_wan_tx_pps, sensor.bpi_r4_softnet_drop_rate, sensor.bpi_r4_heartbeat, sensor.bpi_r4_pppoe_state) to HA. |
| page-pool-watch.sh | Cron * * * * *. Reads dmesg \| grep page_pool_release_retry \| tail -1, parses inflight Nsec, pushes sensor.bpi_r4_page_pool_age and sensor.bpi_r4_page_pool_inflight to HA. At 600 s of stall: triggers echo t > /proc/sysrq-trigger (task stack dump → netconsole). At 1800 s with /etc/page-pool-watch.reboot=1: triggers echo c > /proc/sysrq-trigger (controlled panic → captures pstore + clean reboot). |
| sfp-txfault-watch.sh | Cron * * * * *. Computes sfp1-tx-fault IRQ rate per minute from /proc/interrupts, pushes sensor.bpi_r4_sfp1_txfault_rate and sensor.bpi_r4_sfp1_txfault_event to HA. Baseline ~1–2/min; storm ~110 000/min (1 800/sec). At 3 consecutive minutes >1 000/min: dumps task stacks via sysrq-t into /etc/last_sfp_storm.txt. At 5 minutes with /etc/sfp-storm-watch.reboot=1: controlled panic via sysrq-c. Mirrors the page-pool-watch escalation pattern. |
| sfp-recover.sh | Soft-reset the WAN SFP without physically pulling the module. Three escalation levels: A) ip link sfp-wan down/up, B) ifdown/ifup wan, C) unbind/bind the SFP platform driver via /sys/bus/platform/drivers/sfp/. Run sfp-recover.sh for auto-escalation, or pass 1/2/3 for a single step. |
| haproxy-watch.sh | Cron * * * * *. Restarts haproxy if it's down while vip is up for ≥120 s; pushes sensor.bpi_r4_haproxy_state, ..._down_age, ..._event to HA. Backstop for procd respawn. |
Both crash detection and the MTK ethernet page_pool_release_retry leak are then surfaced in Home Assistant — see Home Assistant Integration.
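The escalation pattern that sfp-txfault-watch.sh and page-pool-watch.sh share can be sketched as a small state machine. This is an illustration only, not the scripts' actual code: the class and action names are hypothetical, the thresholds are the ones quoted in the table, and it assumes the stack dump fires once, on the third consecutive storm minute.

```python
from dataclasses import dataclass
from typing import Optional

STORM_THRESHOLD = 1000     # IRQs per minute, from the table above
DUMP_AFTER_MINUTES = 3     # consecutive storm minutes before sysrq-t
PANIC_AFTER_MINUTES = 5    # consecutive storm minutes before sysrq-c


@dataclass
class StormWatch:
    reboot_enabled: bool = False       # stands in for /etc/sfp-storm-watch.reboot=1
    prev_count: Optional[int] = None   # last cumulative IRQ count seen
    storm_minutes: int = 0             # consecutive minutes above the threshold

    def sample(self, irq_count: int) -> str:
        """Feed one per-minute cumulative IRQ count; return the action to take."""
        if self.prev_count is None or irq_count < self.prev_count:
            # First sample, or the counter went backwards (reboot): no rate yet.
            self.prev_count = irq_count
            self.storm_minutes = 0
            return "none"
        rate = irq_count - self.prev_count
        self.prev_count = irq_count
        self.storm_minutes = self.storm_minutes + 1 if rate > STORM_THRESHOLD else 0
        if self.storm_minutes >= PANIC_AFTER_MINUTES and self.reboot_enabled:
            return "sysrq-c"
        if self.storm_minutes == DUMP_AFTER_MINUTES:
            return "sysrq-t"
        return "none"


if __name__ == "__main__":
    watch = StormWatch(reboot_enabled=True)
    for minute, count in enumerate(range(0, 660_000, 110_000)):
        print(minute, watch.sample(count))
```

Keeping the panic behind an explicit opt-in flag means the default behaviour only gathers evidence; the destructive step has to be enabled deliberately.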
Hardware watchdog + sysctls¶
The MT7988a hardware watchdog is configured to panic on starvation:
$ cat /sys/class/watchdog/watchdog0/{identity,pretimeout_governor,pretimeout,timeout}
mtk-wdt
panic
15
30
Combined with these sysctls (in /etc/sysctl.d/99-tillo-panic.conf and the OpenWrt defaults):
kernel.panic = 10
kernel.panic_on_oops = 1
kernel.sysrq = 1
kernel.hung_task_panic = 1
kernel.softlockup_panic = 1
If the kernel hangs for >15 s without kicking the watchdog, pretimeout fires panic() (which writes to ramoops console), and 10 s later the SoC resets. Pstore captures the panic on next boot.
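The timing follows from the three values shown above, since the pretimeout is counted backwards from the hardware timeout. A small worked sketch (function and key names are illustrative; the final hardware reset is a backstop that only matters if the panic-triggered reboot itself fails):

```python
WDT_TIMEOUT_S = 30      # /sys/class/watchdog/watchdog0/timeout
WDT_PRETIMEOUT_S = 15   # /sys/class/watchdog/watchdog0/pretimeout
KERNEL_PANIC_S = 10     # kernel.panic sysctl: seconds between panic and reboot


def hang_timeline() -> dict:
    """Seconds after the last watchdog kick at which each event happens."""
    panic_at = WDT_TIMEOUT_S - WDT_PRETIMEOUT_S   # pretimeout governor calls panic()
    reboot_at = panic_at + KERNEL_PANIC_S         # kernel.panic triggers the reboot
    return {
        "panic_at": panic_at,
        "panic_reboot_at": reboot_at,
        "hw_reset_at": WDT_TIMEOUT_S,             # hardware backstop
    }


if __name__ == "__main__":
    for event, seconds in hang_timeline().items():
        print(f"{event}: t+{seconds}s")
```

With these values the panic-driven reboot lands 5 s before the hardware timeout, so the ramoops write has time to complete first.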
Home Assistant Integration¶
The router-side scripts push to HA via /api/states/<sensor>. The automations below build on those sensors.
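The push to /api/states/<sensor> mentioned above can be sketched with the Python standard library. The router scripts are shell, so this is an illustration rather than their code; the base URL, token, and function name are placeholders, and it assumes a Home Assistant long-lived access token is used for authentication.

```python
import json
import urllib.request


def build_state_request(base_url, token, entity_id, state, attributes=None):
    """Build (but do not send) a POST that sets one entity's state in HA."""
    body = json.dumps({"state": str(state), "attributes": attributes or {}}).encode()
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/states/{entity_id}",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    # Placeholder host and token; replace before use.
    req = build_state_request(
        "http://homeassistant.example:8123", "LONG_LIVED_TOKEN",
        "sensor.bpi_r4_boot_cause", "panic", {"pstore_records": 3},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        print(resp.status)
```

Separating request construction from sending keeps the payload easy to inspect without a live Home Assistant instance.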
automation.bpi_r4_crash_captured_via_pstore¶
Triggers when sensor.bpi_r4_boot_cause becomes panic or reset_with_pstore (i.e., the previous boot crashed and pstore captured something). Sends Pushover priority+1 + persistent_notification with the pstore record count, uptime at capture, and the SSH command to dump the actual log.
automation.bpi_r4_page_pool_stall_mtk_leak¶
Triggers when sensor.bpi_r4_page_pool_age > 600 (10 min stall). Escalation:
- t=0: Initial dashboard + Pushover; the router-side script has already triggered sysrq-t for task stacks.
- t+20min, if still > 1800 s: fires shell_command.bpi_r4_reboot (defined in home-assistant-cm.yml, SSH-keyed via /config/.ssh/bpi_r4_ed25519 → /etc/dropbear/authorized_keys on the router), then waits 4 min and confirms recovery from the new boot_cause.
- Otherwise: dismisses the dashboard notification (recovery happened on its own).
The SSH key is HA-side only; the matching pubkey on the router is registered in /etc/sysupgrade.conf so it survives flashes.
automation.bpi_r4_sfp_tx_fault_storm_carrier_down_onu_power_cycle¶
First-line response for an XGS-PON wedge. Triggers on sensor.bpi_r4_sfp1_txfault_rate > 10000 sustained for 2 min with sensor.bpi_r4_wan_carrier = 0. Power-cycles the Hue plug (15 s), then waits 4 min for boot + PLOAM re-registration. Recovery takes ~5 min. The 2-min hysteresis filters single-sample blips (60 s carrier flaps where the rate spikes for one sample then returns to baseline). See SFP TX-fault storm.
automation.bpi_r4_sfp_tx_fault_storm_clean_reboot (carrier-up only)¶
Triggers on sensor.bpi_r4_sfp1_txfault_rate > 1000 for 5 min with sensor.bpi_r4_wan_carrier = 1. Runs shell_command.bpi_r4_reboot. Reserved for the kernel-IRQ-wedge case where the link is technically still up but the kernel is buried under IRQ load. Carrier-down storms route to the sibling automation above — rebooting the SoC during a PON wedge is a no-op (the laser is at the optical front-end, below the kernel's reach).
automation.onu_watchdog_power_cycle_if_ploam_not_in_o5¶
If sensor.onu_ploam_state != 51 (O5) for 7 min, tries a soft pon reboot via ubus first; falls back to a Hue-plug power cycle after 3 min. Last-resort safety net for PON state issues that don't manifest as a TX-fault storm.
automation.router_watchdog_power_cycle_on_unreachable (last resort)¶
If binary_sensor.192_168_1_254 stays off for 15 min, power-cycles the router via the Hue smart plug. Untouched by the new layer — it's the safety net for the case where everything else has failed.
Known Issue: sfp-lan Zero Traffic¶
If the SFP LAN port (GMAC1) shows link-up but zero hardware TX/RX bytes, the RSS/LRO patches (999-eth-08*, 999-eth-09*) have corrupted the GMAC1 GDMA data path. Removing those patch files resolves it.
Known Issue: MTK page_pool teardown leak¶
Symptom: a page_pool_release_retry warning in dmesg (and in /var/log/bpi-r4.log on mbptillo via netconsole).
The line repeats every 60 s with Nsec ticking up forever. After a few hours of accumulation, the next pppoe-wan flap (e.g., ISP-side LCP timeout) wedges the SoC ethernet completely and the watchdog hard-resets, usually without leaving a useful pstore record because the kernel's already too sick to schedule the panic write.
Mitigations applied:
- "net: ethernet: mtk_eth_soc: initialize PPE per-tag-layer MTU registers" (upstream commit 2dddb34dd0, already in 6.12.85 stable). Fixes the actual root cause: PPE was punting PPPoE-encapsulated frames to the CPU because VLAN_MTU registers were uninitialised, which kept page_pool refs alive across teardown.
- Removed 999-9907-2-mtk-use-net_prefetch-for-non-pagepool-path.patch, an out-of-tree MTK SDK patch that operates on the same RX hot path Felix Fietkau reverted upstream (79d3db7447, "Revert: improve mtk_eth_soc performance — stability issues").
- page-pool-watch.sh as a third-line defence: dumps task stacks at 10 min stall, optionally triggers controlled panic+reboot at 30 min if /etc/page-pool-watch.reboot=1. HA's bpi_r4_page_pool_stall_mtk_leak automation does the same via clean SSH reboot if the watcher's gate is off.
Known Issue: SFP TX-fault storm (XGS-PON wedge)¶
The WAN SFP is an FS XGS-SFP-ONT-MAC-I MAC-mode XGS-PON ONT (MaxLinear PRX126 silicon, accessible as ssh onu), which replaced an OEM XGSPONST2001 stick of the same silicon family — so the failure mode and the kernel quirk below are unchanged. Under certain optical-layer disturbances (OLT-side events, fiber bend, laser bias instability) the laser cycles rapidly. Observed pattern: TX-fault assertions hit roughly five orders of magnitude above baseline (baseline ~1–2/min, storms reach ~110 000/min ≈ 1 800/sec) and persist for minutes. The OLT loses our upstream signal, stops sending downstream, and our 10GBASE-R SerDes loses bit-sync → wan_carrier=0.
Layer model¶
PRX126 laser → cage TX_FAULT pin → GPIO IRQ (line 69) → SFP state machine → phylink/SerDes → HA sensors
The kernel quirk (SFP_QUIRK_F_PREFIX("FS", "XGS-SFP-ONT-MACI", sfp_fixup_potron)) masks SFP_F_TX_FAULT | SFP_F_LOS in state_hw_mask, so the SFP state machine doesn't disable the module on these signals. The IRQ still fires (the GPIO handler is wired regardless), which is why /proc/interrupts count is the canary that surfaces the storm. The carrier still drops when it does, because that's reported by phylink/SerDes from real loss of received frames — below the SFP code's reach.
Lesson learned: a kernel-side TX-fault quirk can't paper over a real optical event
sfp_fixup_potron does its job (no spurious module-offline), but the carrier still drops when the laser actually cycles. Don't add more aggressive host-side suppression — it would only hide the failure, not prevent it.
Recovery¶
The only effective recovery is power-cycling the ONU (the "Internet" Hue plug carries both router + ONU on one circuit). A SoC reboot doesn't clear the storm: the laser sits in the optical front-end, below the kernel's reach. The carrier-down automation (above) routes directly to ONU power-cycle for this reason; the carrier-up clean reboot is reserved for the kernel-IRQ-wedge case.
Lesson learned: don't daisy-chain watchdogs on a single failure mode
Three watchdogs serialised on a PON wedge (unreachable-watchdog at 15 min, then SoC reboot 5 min after that, then ONU power-cycle 7 min after that) produce ~10 min recovery, and the middle step is wasted (it does nothing for this fault mode). Disambiguate by carrier state (wan_carrier=0 vs =1) and route directly to the right action.
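Routing by carrier state can be expressed as a single decision function. A minimal sketch using the thresholds of the two SFP storm automations described earlier; the function and action names are hypothetical, and the real logic lives in Home Assistant automations rather than Python.

```python
def choose_recovery(txfault_rate: float, wan_carrier: int, sustained_min: float) -> str:
    """Pick one recovery action from the storm rate, carrier state and duration."""
    if wan_carrier == 0 and txfault_rate > 10_000 and sustained_min >= 2:
        # PON wedge: only an ONU power-cycle clears it.
        return "onu_power_cycle"
    if wan_carrier == 1 and txfault_rate > 1_000 and sustained_min >= 5:
        # Link is up but the kernel is buried under IRQ load.
        return "clean_reboot"
    return "wait"


if __name__ == "__main__":
    print(choose_recovery(110_000, wan_carrier=0, sustained_min=2))
    print(choose_recovery(110_000, wan_carrier=1, sustained_min=5))
    print(choose_recovery(2, wan_carrier=1, sustained_min=30))
```

Each fault mode maps to exactly one action, so no step runs that cannot help.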
ONU side¶
ssh onu lands on the module's stock FS firmware on the same MaxLinear PRX126 platform (the retired OEM stick ran LEDE plus the 8311 was-software community mod). The pon CLI and the laser/optic knobs in /etc/config/optic (laser timings, rogue_auto_en, tx_pup_mode) and /etc/config/gpon are the same platform tooling and should not be tweaked without a specific reason and a way to validate.
Forensic reconstruction¶
After a storm, replay the timeline from /etc/last_health.jsonl (health-snapshot.sh ring, 7 days, persisted across sysupgrade). Note: timestamps in the file are local time, while the HA logbook is UTC — easy to mistake for two different events.
ssh root@bpi-r4 'grep -E "<YYYY-MM-DD>T<HH>:" /etc/last_health.jsonl' | python3 -c "
import sys, json
prev=None
for l in sys.stdin:
    d=json.loads(l); irq=d['irq']['sfp1tx']
    delta=irq-prev if prev is not None else 0
    prev=irq
    print(f\"{d['ts']} car={d['wan']['car']} op={d['wan']['op']:>7} sfp1tx={irq} delta={delta:>6}\")"
A delta of 100 000+ per minute with car=0 is the storm signature. A negative delta (sfp1tx dropping back to a small value mid-window) means the IRQ counter was reset, i.e. the kernel rebooted that minute.
Known Issue: dnsmasq early-boot crash loop¶
dnsmasq's init script is at S19dnsmasq, network is S20network. At early boot dnsmasq tries to bind on the WAN VIPs (e.g. 31.3.128.59:53) before pppoe-wan has finished negotiating, fails 6 times in <1 s with Address in use, and procd's default respawn 3600 5 5 circuit breaker gives up. The router then runs without dnsmasq until manual intervention.
Fix in /etc/rc.local (idempotent, applied on every boot, persisted via /etc/sysupgrade.conf):
sed -i 's/procd_set_param respawn\b.*/procd_set_param respawn 3600 10 30/' /etc/init.d/dnsmasq 2>/dev/null
This loosens the respawn budget to 30 retries × 10 s wait = ~5 min of patience, long enough for pppoe-wan to come up. Same pattern is already in place for haproxy.
Lesson learned: never signal $(cat /var/run/<svc>/<svc>.pid) for jailed services
OpenWrt's ujail puts each service in its own PID namespace, so the pidfile contains the jail-internal PID 1 — which on the host is procd. Signaling that PID (e.g. kill -USR1 $(cat /var/run/dnsmasq/dnsmasq.pid)) sends the signal to procd; on many builds that's enough to reboot the router. Always use pgrep -f <pattern> or pidof <binary> (from outside the jail) to get the real host PID before signalling.
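Resolving the PID from the host's own /proc view, rather than from the pidfile, avoids the trap. The sketch below does what pidof does by matching /proc/<pid>/comm; the function name and the proc_root parameter are illustrative (the parameter exists so the function can be exercised against a fake tree).

```python
import os


def host_pids(name: str, proc_root: str = "/proc") -> list:
    """Return host-namespace PIDs whose command name equals `name`."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "comm")) as fh:
                if fh.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            continue  # the process exited while we were scanning
    return sorted(pids)


if __name__ == "__main__":
    # Run on the host, outside the jail.
    print(host_pids("dnsmasq"))
```

Run from outside the jail, the result never contains the jail-internal PID 1, so a signal sent to it cannot land on procd.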
Network Topology¶
| Interface | Subnet | Role |
|---|---|---|
| pppoe-wan | 31.3.128.50–.59, .62 (and transit) | PPPoE WAN with the ISP-allocated /28 |
| br-lan (untagged) | 192.168.1.0/24, gateway 192.168.1.254 | Trusted LAN |
| sfp-lan.20 | 192.168.7.0/24, gateway 192.168.7.254 | DMZ |
| sfp-lan.70 | 192.168.77.0/24, gateway 192.168.77.254 | ADLAN (admin / OOB) |
| wg_iot | 10.8.0.0/24 (10.8.0.1) | IoT-restricted WireGuard tunnel |
| wg_s2s | 10.8.1.0/24 (10.8.1.1) | Site-to-site WireGuard |
| jool | 192.168.164.0/24 | NAT64 stateful translation in a dedicated netns; pool4 mark split (.2 = external v6, .3 = internal LAN v6); see external traffic → NAT64 |
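An addressing plan like this can be sanity-checked with the standard ipaddress module. The sketch below covers only the rows that list both a subnet and a router address (pppoe-wan and jool are left out), and the helper names are illustrative.

```python
import ipaddress
from itertools import combinations

# interface -> (subnet, router address), copied from the table above.
SEGMENTS = {
    "br-lan": ("192.168.1.0/24", "192.168.1.254"),
    "sfp-lan.20": ("192.168.7.0/24", "192.168.7.254"),
    "sfp-lan.70": ("192.168.77.0/24", "192.168.77.254"),
    "wg_iot": ("10.8.0.0/24", "10.8.0.1"),
    "wg_s2s": ("10.8.1.0/24", "10.8.1.1"),
}


def misplaced_gateways(segments: dict) -> list:
    """Interfaces whose router address lies outside their own subnet."""
    return [
        name for name, (net, gw) in segments.items()
        if ipaddress.ip_address(gw) not in ipaddress.ip_network(net)
    ]


def overlapping_segments(segments: dict) -> list:
    """Pairs of interfaces whose subnets overlap."""
    nets = {name: ipaddress.ip_network(net) for name, (net, _) in segments.items()}
    return [(a, b) for a, b in combinations(nets, 2) if nets[a].overlaps(nets[b])]


if __name__ == "__main__":
    print("misplaced gateways:", misplaced_gateways(SEGMENTS))
    print("overlapping subnets:", overlapping_segments(SEGMENTS))
```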
Split-Horizon DNS¶
Split-horizon resolution — answering *.mdapi.ch on internal VIPs so LAN traffic skips the WAN-side hairpin — no longer runs on the router. It moved into an in-cluster unbound resolver that owns the well-known LAN DNS VIP 192.168.1.1 (a MetalLB L2 address advertised on mgmt-br). See External Traffic → Split-Horizon DNS.
The router's dnsmasq keeps no address= overrides of its own. It still answers DNS on the DMZ and ad-LAN segments, but purely as a forwarder — no-resolv with a single upstream of 192.168.1.1 — so those clients resolve against the same unbound override set as the LAN.
Lesson learned: never override a root domain
An override for a zone apex (mdapi.ch itself, rather than gitlab.mdapi.ch) shadows MX, NS, TXT, and SPF for the whole zone — mail and registrar delegation break for internal clients. The override list maps third-level hostnames only; the apex is always left to resolve normally.
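A guard against apex overrides is easy to automate. A minimal sketch, assuming the override list is available as plain hostnames; the function names are hypothetical and not part of the unbound configuration.

```python
def is_apex(name: str, zone: str) -> bool:
    """True if `name` is the zone apex itself (case and trailing dot ignored)."""
    def normalise(value: str) -> str:
        return value.rstrip(".").lower()
    return normalise(name) == normalise(zone)


def rejected_overrides(names, zone: str) -> list:
    """Return the overrides that would shadow the whole zone."""
    return [name for name in names if is_apex(name, zone)]


if __name__ == "__main__":
    overrides = ["gitlab.mdapi.ch", "mdapi.ch"]
    print(rejected_overrides(overrides, "mdapi.ch"))
```

Running such a check before the resolver configuration is deployed catches an apex entry before it can break mail or delegation for internal clients.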
DHCP — Centralised on Technitium¶
Local DHCP servers are disabled on every BPI-R4 scope (option ignore '1' in /etc/config/dhcp). Three relay sections forward DISCOVER/REQUEST traffic to the cluster instead:
config relay 'relay_lan' { interface 'lan'; local_addr '192.168.1.254'; server '192.168.1.55' }
config relay 'relay_dmz' { interface 'sfp-lan-20'; local_addr '192.168.7.254'; server '192.168.1.55' }
config relay 'relay_adlan' { interface 'sfp-lan-70'; local_addr '192.168.77.254'; server '192.168.1.55' }
Technitium owns all reservations, registers leases into its home.tillo.ch zone, and hands out each segment's local DNS address as the DNS server in the OFFER — 192.168.1.1 on the LAN, 192.168.7.1 / 192.168.77.1 on the DMZ / ad-LAN. On the LAN that address is the in-cluster split-horizon resolver (see Split-Horizon DNS).
Lesson learned: K8s SNAT vs DHCP relay
Replies from Technitium come back with the K8s-node source IP after CNI SNAT, not the LB IP. A custom nftables rule in /etc/nftables.d/dhcp-relay-fix.nft rewrites the reply source from {node-IPs}:67 back to 192.168.1.55:67 so dnsmasq's relay socket accepts them. The file is registered in sysupgrade.conf so it survives firmware upgrades.
WireGuard¶
Two separate WireGuard interfaces run on the BPI-R4, both listening on the public WAN:
| Interface | Port | Subnet | Purpose |
|---|---|---|---|
| wg_iot | 51820 | 10.8.0.0/24 | Phones, laptops, and other roaming endpoints; full-tunnel into the LAN |
| wg_s2s | 51821 | 10.8.1.0/24 | Site-to-site to the gum peer |
The public endpoint is pctillo.tillo.ch (DNS pointing at 31.3.128.62, an alias on pppoe-wan). Peer config is managed in /etc/config/network (config wireguard_wg_iot and config wireguard_wg_s2s sections).
IPv6 TX Checksum Workaround¶
The MT7988a NAT engine produces incorrect IPv6 TX checksums on the SFP-LAN VLANs in some flows. Two ethtool -K calls in /etc/rc.local disable hardware TX checksumming on those interfaces and fall back to software checksums.
Without this, IPv6 traffic crossing the VLAN boundary is dropped by destinations that strictly verify checksums.