Today, lemmy.amxl.com suffered an outage because the rootful Lemmy podman container crashed out, and wouldn’t restart.
Fixing it turned out to be more complicated than I expected, so I’m documenting the steps here in case anyone else has a similar issue with a podman container.
I tried restarting it, but got an unexpected error: the internal IP address (which I hand-assign to containers) was already in use, despite the fact the container wasn't running.
I create my Lemmy services with podman-compose, so I deleted the Lemmy services with podman-compose down, and then re-created them with podman-compose up - that usually fixes things when they are really broken. But this time, I got a message like:
level=error msg="IPAM error: requested ip address 172.19.10.11 is already allocated to container ID 36e1a622f261862d592b7ceb05db776051003a4422d6502ea483f275b5c390f2"
The only problem is that the referenced container didn't exist at all in the output of podman ps -a - in other words, podman thought the IP address was in use by a container that it didn't know anything about! The IP address had effectively been 'leaked'.
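If you hit this yourself, it's worth confirming podman really has no record of the container before touching anything - a quick check, substituting the container ID from your own error message:

```
# Substitute the container ID from your own IPAM error message
CTR=36e1a622f261862d592b7ceb05db776051003a4422d6502ea483f275b5c390f2

# It shows up in neither a full listing nor an explicit existence check
podman ps -a --no-trunc | grep "$CTR" || echo "not in podman ps -a"
podman container exists "$CTR" || echo "podman says it does not exist"
```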
After digging into the internals, and a few false starts trying to track down where the leaked info was kept, I found it in a BoltDB file at /run/containers/networks/ipam.db - that's apparently the 'IP allocation' (IPAM) database. Now, the good thing about /run is that it is wiped on system restart - although I didn't really want to restart the whole system (and all my containers) just to fix Lemmy.
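If you're unsure whether your distribution also keeps /run on a tmpfs, you can check before deciding between a reboot and a database edit:

```
# tmpfs here means the IPAM database is recreated from scratch on reboot
findmnt -n -o FSTYPE /run
sudo ls -l /run/containers/networks/ipam.db
```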
BoltDB doesn't come with a lot of tools, but you can install a TUI editor like this: go install github.com/br0xen/boltbrowser@latest.
I made a backup of /run/containers/networks/ipam.db just in case I screwed it up.
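Something like this does the job - the destination is arbitrary, as long as it's outside /run:

```
# Keep a copy outside /run in case the edit goes wrong
sudo cp -a /run/containers/networks/ipam.db /root/ipam.db.bak
```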
Then I ran sudo ~/go/bin/boltbrowser /run/containers/networks/ipam.db to open the DB (this locks the DB, and stops any containers from starting or otherwise changing IP allocations until you exit).
I found the networks that were impacted and expanded their buckets (BoltDB stores a hierarchy of buckets, with key/value pairs at the leaves), then the buckets for the CIDR ranges the leaked IP was in. In that list, I found a record whose value was the ID of the container that didn't actually exist, and used D to tell boltbrowser to delete that key/value pair. I also cleaned up under ids - where this time the key was the container ID that no longer existed - and repeated the process for both networks my container was in.
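If you'd prefer a read-only cross-check of the bucket names (in my case, one top-level bucket per podman network), the upstream bbolt CLI can list them. Run it before launching boltbrowser or after exiting, since both need the file lock:

```
# Read-only sanity check of the top-level (per-network) buckets
go install go.etcd.io/bbolt/cmd/bbolt@latest
sudo ~/go/bin/bbolt buckets /run/containers/networks/ipam.db
```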
I then exited out of boltbrowser with q.
After that, I brought my Lemmy containers back up with podman-compose up -d - and everything then worked cleanly.
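As a final check, you can confirm the container actually came back with its hand-assigned address - the container name below is hypothetical, so check podman ps for yours:

```
# "lemmy_lemmy_1" is a hypothetical name - check podman ps for yours
podman inspect lemmy_lemmy_1 \
  --format '{{range .NetworkSettings.Networks}}{{.IPAddress}} {{end}}'
```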
I've inherited Traefik on production systems before; automated service discovery and certificate renewal are definitely what admins should have in 2025. I thought the label/annotation system it uses on Docker had some ergonomics/documentation issues, but nothing serious.
It feels like it's more meant for Docker/Podman though. On Kubernetes I use cert-manager and Gateway API + Project Contour. It does seem like Traefik has support for the Gateway API too, so it's probably a good choice for Kubernetes as well?
We're thinking of moving to it from a custom CoreDNS and Flannel implementation in a 33-node k3s cluster.
Ah, interesting. What kind of customization are you using CoreDNS for? If you don’t have Ingress/Gateway API for your HTTP traffic, Traefik is likely a good option for adopting it.
CoreDNS and an nginx reverse proxy are handling DNS, failover, and some other redirects. However, it's not ideal, as it's a custom implementation a previous engineer set up.