Replacing Pangolin with Caddy
wireguard · pangolin · caddy · authelia · crowdsec · kubernetes · selfhosting · networking
This was meant to be a quick swap of Newt, the Pangolin tunnel agent running in my cluster, for a tinier WireGuard pod I wrote in 30 lines of shell. It turned into an end-to-end teardown of Pangolin itself, replaced with Caddy, Authelia and CrowdSec on the VPS. This is the long version.
Why I wanted Newt gone
Newt worked, but it was the chubbiest pod in my cluster relative to what it did, and it leaked memory on top of that, getting OOMKilled often. On a Raspi5 node every MiB matters. I started looking at whether I could replace it with something simpler.
What Newt actually does
Pangolin ran on a small VPS. Public traffic hit Traefik there, and resources were forwarded over a WireGuard tunnel to a Newt agent inside the target network. Newt registered with Pangolin and acted as the L3/L7 bridge between the VPS-side WG network and cluster services.
Functionally, the VPS just needed to reach things inside the cluster. WireGuard already did that; Newt was a convenience layer with a control plane on top. I already ran a similar stripped-down pod, pangolin-ssh: wg-quick + socat, about 32 MiB, so I could SSH back to the VPS from inside the cluster. I figured I could build the inverse.
The gateway pod
The new pod was modelled exactly on pangolin-ssh:
- An init container set net.ipv4.ip_forward=1.
- A main container based on alpine:3.21 brought up the WG interface with wg-quick and configured iptables.
- A SealedSecret held wg0.conf.
- No Service: it was purely a receiver of inbound traffic from the VPS over WG.
The entrypoint was dead simple:
#!/bin/sh
set -e
apk add --no-cache wireguard-tools iptables >/dev/null 2>&1

# work on a copy of the config from the (read-only) secret mount
cp /etc/wireguard/wg0.conf /tmp/wg0.conf
wg-quick down /tmp/wg0.conf 2>/dev/null || true
wg-quick up /tmp/wg0.conf

# allow new flows from the tunnel into the cluster, and only return traffic back out
iptables -A FORWARD -i wg0 -o eth0 -j ACCEPT
iptables -A FORWARD -i eth0 -o wg0 -m state --state RELATED,ESTABLISHED -j ACCEPT
# SNAT forwarded packets to the pod IP so Cilium treats them as ordinary pod traffic
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE

echo "WireGuard tunnel up, forwarding wg0 -> cluster service network"
exec sleep infinity
The MASQUERADE on eth0 is the trick: packets arriving on wg0 get SNAT’d to the pod’s IP when forwarded out, so Cilium handles them like any other intra-cluster traffic. Return traffic finds its way back via conntrack.
On the VPS side, in wg-easy, I added a new peer with Server Allowed IPs = 10.43.0.0/16 (the cluster service CIDR). wg-easy adds a kernel route for that prefix automatically when the interface comes up, no PostUp hook required.
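For completeness, the pod-side wg0.conf held by the SealedSecret is the mirror image of that. A sketch; the addresses and endpoint below are placeholders for the WG transfer subnet, not my real config:

[Interface]
Address = 10.8.0.2/24
PrivateKey = <pod private key>

[Peer]
PublicKey = <vps public key>
Endpoint = vps.example.com:51820
# route the WG subnet itself back through the tunnel so replies reach the VPS
AllowedIPs = 10.8.0.0/24
PersistentKeepalive = 25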
After bringing it all up, from the VPS:
$ dig +short @10.43.0.10 coredns.kube-system.svc.cluster.local
10.43.0.10
End-to-end working. The VPS can now reach any cluster service directly.
Adding TCP to CoreDNS
For DNS responses larger than 512 bytes, the VPS needs TCP/53 to CoreDNS. The CoreDNS Helm chart supports it via a use_tcp: true flag on the zone:
servers:
  - zones:
      - zone: .
        use_tcp: true
    port: 53
helm template showed the chart correctly produced two Service ports, udp-53 and tcp-53, so I committed the change and synced.
Except the live Service still showed only UDP/53.
The puzzle
This is where things got stuck.
ArgoCD reported the app as Synced, the diff comparedTo had use_tcp: true, and Helm rendered both ports locally. But the Service object in the cluster wasn’t getting the TCP entry. The sync result message was telling: service/coredns unchanged.
The cause turned out to be a Server-Side Apply ownership conflict. kubectl get svc coredns -n kube-system -o yaml --show-managed-fields showed:
- manager: helm owned f:ports from a 2025 install, with an entry only for the UDP port.
- manager: argocd-controller owned the metadata fields but not f:ports.
The Service had originally been applied directly by Helm before ArgoCD took over, and ArgoCD’s client-side kubectl apply couldn’t modify the field that another manager owned. The TCP addition silently dropped, and ArgoCD’s diff, which compares against last-applied-configuration rather than checking field ownership, thought everything was fine.
I tried to patch the Service to add the TCP port:
kubectl patch svc coredns -n kube-system --type=strategic -p \
'{"spec":{"ports":[{"name":"udp-53",...,"protocol":"UDP",...},{"name":"tcp-53",...,"protocol":"TCP",...}]}}'
This is where I learned that Service.spec.ports uses port as its strategic-merge key, not port + protocol. The two entries with port: 53 got deduplicated, the TCP one won, UDP/53 was dropped, and cluster DNS broke for any pod doing fresh UDP lookups.
The fix was to use a JSON patch instead, which doesn’t dedupe:
kubectl patch svc coredns -n kube-system --type=json -p \
'[{"op":"replace","path":"/spec/ports","value":[
{"name":"udp-53","port":53,"protocol":"UDP","targetPort":53},
{"name":"tcp-53","port":53,"protocol":"TCP","targetPort":53}
]}]'
Then I added ServerSideApply=true to the CoreDNS Application’s syncOptions so ArgoCD takes proper field ownership going forward. After the next sync, argocd-controller owns f:ports and Helm’s stale entry no longer wins.
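That's one line on the Application; a sketch with only the relevant fields:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: coredns
  namespace: argocd
spec:
  syncPolicy:
    syncOptions:
      - ServerSideApply=true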
Two takeaways I’m writing down so I don’t forget:
- Don’t strategic-merge Service.spec.ports when two entries share the same port number. JSON patch or SSA, always.
- When ArgoCD takes over a Helm-installed resource, set ServerSideApply=true from day one. Otherwise the original manager keeps owning fields and ArgoCD’s writes silently no-op.
The half-fix
The plumbing worked: the gateway pod was up, the VPS could reach the cluster service CIDR, CoreDNS spoke UDP and TCP, and Pangolin could in principle resolve *.svc.cluster.local and target services directly without Newt.
Except: I still had to wire a DNS forwarder on the VPS, repoint every Pangolin resource from Newt → cluster hostnames, make sure a cluster outage didn’t also take out DNS for non-cluster targets, and then tear Newt down. Two self-inflicted SSH lockouts in, the remaining value felt small against the risk. One rule those taught me: always have an emergency SSH path before touching network plumbing, firewalled at the hosting provider by default and openable on demand.
Sitting with that, the real question changed. Not is the gateway pod worth it?, but is Pangolin worth it?
What I actually used Pangolin for:
- A reverse proxy. Traefik does that.
- A forward-auth gate (Pangolin calls it badger) for some private endpoints. Any SSO would do that.
- CrowdSec wiring. That’s a community Traefik plugin, not a Pangolin feature.
Pangolin’s resource UI and identity layer are real, but I’m one person with a small pile of resources. I’m not going to outgrow a text file. And Newt was on the teardown list either way.
I’d also noticed a pattern putting me off: more of what I’d reach for sat behind a “supporter” tier or commercial edition. Nothing wrong with a project needing revenue, but the homelab had started to feel like a dashboard that was also a sales funnel. That was the final nudge.
I started researching what plain Caddy would look like.
The target stack
- Caddy on the VPS, custom-built with xcaddy so caddy-l4 and caddy-crowdsec-bouncer are compiled in; a single binary owns HTTP, TLS, L4 forwarding, and bouncer decisions.
- Authelia in one container for forward-auth.
- CrowdSec kept as-is. Only the bouncer changed, from Traefik plugin to Caddy module.
- The gateway pod from earlier stays and becomes load-bearing: every VPS → cluster packet goes through it.
- Newt, gerbil, Pangolin, Traefik: gone.
I dumped Pangolin’s resource table out of its SQLite to get the real inventory: a roughly-even mix of public and gated HTTP routes, plus a handful of L4 listeners (Forgejo SSH and the wg-easy UDP pools). That became the Caddyfile source of truth.
Pre-flight went fine. xcaddy build, Authelia secrets, a staged Caddyfile.cutover under /opt/replacement/. Validated. Authelia came up on 127.0.0.1:9091. CrowdSec got a bouncer key for the Caddy module.
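The exact module list lives in BUILD.txt; the build was roughly this shape (module paths as documented upstream, so verify before copying):

xcaddy build \
  --with github.com/mholt/caddy-l4 \
  --with github.com/hslatman/caddy-crowdsec-bouncer/http \
  --with github.com/hslatman/caddy-crowdsec-bouncer/layer4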
Cutover itself was three commands:
cd /opt/pangolin && docker compose stop pangolin gerbil traefik
cp /opt/replacement/Caddyfile.cutover /etc/caddy/Caddyfile
systemctl restart caddy
This is where things turned into a series of smaller puzzles.
Puzzle 1: caddy-l4 doesn’t use cluster DNS
My first Caddyfile pointed the L4 block at forgejo-ssh.forgejo.svc.cluster.local:22 and the two wg-easy UDP services similarly. That’s how the HTTP routes work: reverse_proxy with transport http { resolvers 10.43.0.10 } happily resolves cluster service names via CoreDNS.
caddy-l4’s proxy directive uses the system resolver, which on the VPS is Hetzner’s public DNS. Nothing .cluster.local resolves. I replaced the hostnames with ClusterIPs resolved via dig @10.43.0.10 and baked them into the Caddyfile. ClusterIPs for stable Services don’t drift in practice.
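A sketch of the two flavours; the hostnames, ClusterIPs and ports are placeholders, and check the caddy-l4 README for the exact layer4 Caddyfile shape:

# global options: L4 listeners get ClusterIPs baked in
{
    layer4 {
        :2222 {
            route {
                proxy 10.43.12.34:22
            }
        }
        udp/:51821 {
            route {
                proxy udp/10.43.56.78:51820
            }
        }
    }
}

# HTTP sites can keep cluster hostnames, resolved via CoreDNS per request
app.example.com {
    reverse_proxy http://some-app.some-ns.svc.cluster.local:8080 {
        transport http {
            resolvers 10.43.0.10
        }
    }
}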
Puzzle 2: my own VPN quietly ate the LAN
A while after the cutover, I noticed my laptop couldn’t reach 192.168.1.9, the cluster LB VIP, even though I could ping my router. Another PC on the LAN worked fine.
$ ip route | grep 192.168.1
192.168.1.0/24 dev wg_home ... metric 53 ← wins
192.168.1.0/24 dev wlp0s20f3 ... metric 600
192.168.1.1 dev wlp0s20f3 ... metric 50 ← why the gateway still pings
I keep a wg-easy peer on my laptop pointing at my LAN-only home tunnel so I can reach the LAN from outside. That UDP port used to be forwarded by Pangolin’s L4; with Pangolin stopped and Caddy’s L4 block still commented out (I’d staged it but not uncommented it during cutover), the peer had no live endpoint. WireGuard kept the interface up anyway, and the route kept winning over the LAN route, so anything LAN-ward disappeared into a dead tunnel. The /32 route to the router was metric 50 and specifically hardcoded, which is why it still worked and misled me.
Taking the tunnel down restored LAN access. Uncommenting the L4 block in the Caddyfile, filling in the ClusterIPs, and reloading Caddy restored the tunnel: the two wg-easy UDP pools plus TCP/22 for Forgejo SSH.
Puzzle 3: reaching TrueNAS via the cluster
I moved LanguageTool off the cluster last week; it lives on my TrueNAS on the LAN now, fronted by an ExternalName Service. Pangolin/Newt made this work by accident: requests landed on Newt inside the cluster, the ExternalName resolved via CoreDNS, and the traffic egressed to the LAN out of the cluster node’s NIC.
From Caddy-on-VPS this doesn’t work. The VPS → cluster WG tunnel only covers the Service CIDR; the ExternalName resolves to a LAN IP the VPS has no route to.
First attempt: a selector-less Service with manually specified Endpoints pointing at the fixed LAN IP. The theory: kube-proxy should DNAT ClusterIP traffic to an off-cluster IP, and the node’s masquerade rule for cluster-CIDR sources hides the return path.
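Roughly what that attempt looked like; the names, namespace and TrueNAS address are placeholders:

apiVersion: v1
kind: Service
metadata:
  name: languagetool-truenas
  namespace: languagetool
spec:
  # no selector: endpoints are managed by hand below
  ports:
    - port: 8010
      targetPort: 8010
      protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  # must match the Service name to be picked up
  name: languagetool-truenas
  namespace: languagetool
subsets:
  - addresses:
      - ip: 192.168.1.50   # the TrueNAS box on the LAN (placeholder)
    ports:
      - port: 8010
        protocol: TCP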
In practice, Cilium’s kube-proxy-replacement doesn’t program that case when the source is external (traffic arriving from the WG tunnel). TCP connects succeed against the ClusterIP but no response ever comes back. I eliminated other possibilities before accepting that I was hitting an unsupported path.
What worked: a tiny nginx pod inside the languagetool namespace, reverse-proxying to the TrueNAS host. The pod is a normal in-cluster egress source, the node masquerade fires the normal way for pod-originated external traffic, and the VPS just talks to this new ClusterIP like any other service. Two Deployment/ConfigMap/Service files and one kustomization entry. I might look into fixing this later.
Next trap: the nginx pod returned 502 on every request and logged "unexpected A record in DNS response" every 5 seconds. I was using a runtime resolver directive (so nginx resolves the upstream on each request rather than caching forever), and CoreDNS’s AAAA response was tripping up nginx’s resolver. Adding ipv6=off to the resolver directive: problem gone.
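The relay config ended up something like this; the upstream hostname and port are placeholders:

server {
    listen 8010;

    location / {
        # re-resolve the upstream via CoreDNS on each request instead of caching forever;
        # ipv6=off stops the AAAA answers from breaking resolution
        resolver 10.43.0.10 valid=30s ipv6=off;
        set $upstream http://truenas.lan.example:8010;
        proxy_pass $upstream;
        proxy_set_header Host $host;
    }
}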
Puzzle 4: Authelia’s missing subcommand
My compose had test: ["CMD", "authelia", "healthcheck"] as the Docker healthcheck, lifted from an example I don’t remember. Current Authelia (4.39 at time of writing) no longer has a healthcheck subcommand; it returns unknown command and the container stays stuck at unhealthy, even though Authelia itself is happily gating requests.
The image ships /app/healthcheck.sh. It reads /app/.healthcheck.env (which Authelia writes at startup) and runs wget against localhost:9091/api/health. test: ["CMD", "/app/healthcheck.sh"] and the container turns healthy on the next interval.
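In compose terms that’s just (the interval and retry values are my own defaults, not from the image):

healthcheck:
  test: ["CMD", "/app/healthcheck.sh"]
  interval: 30s
  timeout: 5s
  retries: 3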
Puzzle 5: the CrowdSec log split
One more. crowdsecurity/caddy-logs was erroring on every log line with UnmarshalJSON: unexpected end of JSON input. Cause: my Caddyfile had a single log default block that captured everything (HTTP access logs, reverse_proxy warnings, tls cache maintenance, admin API chatter) and dumped it all into access.log. CrowdSec expects that file to be pure HTTP access log JSON. One buffered, partially-flushed runtime line was enough to poison the parser.
Split the logger:
log default {
    output file /var/log/caddy/runtime.log
    exclude http.log.access
}
log access {
    include http.log.access
    output file /var/log/caddy/access.log
    format json
}
Then access.log went empty, because Caddy doesn’t emit HTTP access logs unless a site explicitly asks for them. I have shared (public) and (gated) snippets that every site imports; adding log to each snippet enables access logging for every route for free, and crowdsecurity/caddy-logs immediately started parsing clean access entries again.
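A sketch of the idea, with a placeholder site and upstream:

(public) {
    log
}

app.example.com {
    import public
    reverse_proxy 10.43.12.34:8080
}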
Can I actually see what’s happening?
With the proxy swapped and everything serving, a question I’d been half-answering with cscli decisions list and tailing access.log kept nagging: what’s actually hitting my edge, what’s getting blocked, is the blocklist doing anything useful? Fine tools for firefighting. Terrible for pattern recognition.
The cluster already runs Victoria Metrics and Grafana. Caddy has a native Prometheus endpoint on its admin API once metrics goes into the servers block; CrowdSec exposes its own on :6060. A single vmagent on the VPS scrapes both and remote-writes to Victoria Metrics in the cluster, the gateway WG tunnel already routes 10.43.0.0/16, so writing to a cluster ClusterIP is free.
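The vmagent side is a stock Prometheus-style scrape config plus a remote-write flag; the targets and the Victoria Metrics ClusterIP below are placeholders:

# scrape.yml
scrape_configs:
  - job_name: caddy
    static_configs:
      - targets: ["127.0.0.1:2019"]   # Caddy admin API serves /metrics
  - job_name: crowdsec
    static_configs:
      - targets: ["127.0.0.1:6060"]

# and the container runs roughly:
# vmagent -promscrape.config=/etc/vmagent/scrape.yml \
#         -remoteWrite.url=http://10.43.20.30:8428/api/v1/write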
Logs are the other half. Loki runs in single-binary mode in the cluster; Promtail tails access.log on the VPS and pushes to Loki through the same tunnel. Caddy’s access log is JSON with remote_ip, method, host, uri, status, tls and so on; Loki’s | json pipe turns that into queryable fields at read time without a separate ingest pipeline.
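That makes ad-hoc questions one-liners in Explore; the job label depends on Promtail’s scrape config, assumed here to be caddy. Everything that didn’t return 2xx/3xx:

{job="caddy"} | json | status >= 400

Top talkers by source IP over the last hour:

topk(10, sum by (request_remote_ip) (count_over_time({job="caddy"} | json [1h])))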
Ten metric panels and three log panels later, there’s a purpose-built dashboard: request rate, latency percentiles, CrowdSec decisions, AppSec rule hits, top source IPs and paths, a live tail. I can now actually answer what should I block. Or, more usefully, stop pre-blocking and start recognizing the shape of who’s showing up.
The tidy
All that refactoring had left the VPS filesystem a mess I still needed to clean up.
So: surgery, one directory per compose project.
/opt/
├── authelia/
├── caddy/ # just the CrowdSec API key env for the systemd drop-in
├── crowdsec/ # engine + web-ui + config + GeoLite MMDBs + the refresher script
├── promtail/
├── vmagent/
└── wg-ssh/ # renamed from wg-pangolin-ssh
The WG peer container’s old name (wg-pangolin-ssh) was the most persistent Pangolin-era reference on the box, so that went too. Moved the bind-mounted state, updated the systemd EnvironmentFile path and the cron entry, restarted. /opt/pangolin/ and /opt/replacement/ both rm -rf’d at the end.
Configs in git
Pangolin owned my proxy configs via its admin interface, so they weren’t in the repo. With Caddy, the configs are text and go in git:
edge-gateway/
├── caddy/{Caddyfile,BUILD.txt,caddy.service.d/}
├── authelia/{docker-compose.yml,configuration.yml.tpl,build-config.sh}
├── crowdsec/{docker-compose.yml,acquis-*.yaml,update-geolite-db.sh}
├── wg-ssh/docker-compose.yml
├── vmagent/{docker-compose.yml,config/}
└── promtail/{docker-compose.yml,config/}
The Caddy binary isn’t tracked (reproducible from the module list in BUILD.txt). Runtime state (SQLite databases, ACME cert storage) isn’t. Secret provisioning is documented outside the repo.
Merging the tunnel pods
Two pangolin-ish strings still in the repo: ns/pangolin-ssh/ (the SSH tunnel peer) and ns/pangolin-gateway/ (the Service-CIDR tunnel peer). Both were “WG client to the edge gateway with forwarding rules”, close enough that one pod could do both. Merged into ns/edge-gateway/, one wg-quick up, iptables forwards from the gateway side, socat forwarding the SSH port through the tunnel from the SSH side. SealedSecrets are bound to {namespace, name}, so I pulled the plain wg0.conf out of the old namespace with kubectl get secret and re-sealed it for the new one.
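The re-seal itself was mechanical; secret and file names below are placeholders:

# pull the decrypted config out of the old namespace
kubectl -n pangolin-gateway get secret wg0-conf \
  -o jsonpath='{.data.wg0\.conf}' | base64 -d > wg0.conf

# re-seal it for the new {namespace, name} pair
kubectl -n edge-gateway create secret generic wg0-conf \
  --from-file=wg0.conf --dry-run=client -o yaml \
  | kubeseal --format yaml > ns/edge-gateway/wg0-conf-sealedsecret.yaml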