Quick Hit - NIC driver hang/reset under heavy load
One of the reasons it took me almost six months to relaunch the blog is that I kept running into instability problems at almost every layer. I couldn't tell whether the culprit was hardware, Proxmox, GlusterFS, Longhorn, or K3s. The one thing the failures had in common was that they happened when the system was under heavy load - backups, large file transfers, and so on.
Every time I thought I had the issues largely resolved, the new Proxmox node would stop responding.
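For anyone chasing something similar: a common way to confirm a NIC driver hang is to watch the kernel log during heavy load, and a common mitigation is to disable the hardware offloads that trigger it. This is a generic sketch, not necessarily the fix I landed on, and the interface name eno1 is a placeholder.

```bash
# Watch the kernel log while the system is under load; driver hangs
# typically show up as "Hardware Unit Hang" or adapter reset messages.
dmesg --follow | grep -iE 'hang|reset'

# Common mitigation: turn off TCP/generic segmentation and generic
# receive offloads (eno1 is a placeholder interface name).
ethtool -K eno1 tso off gso off gro off

# Confirm the offloads are now disabled.
ethtool -k eno1 | grep -E 'segmentation|generic-receive'
```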
April 2022 Update
It’s been six months since my last update. Wow, I knew it had been some time, but that’s obviously way longer than I expected. I’ve had plenty to say and plenty of updates, but I was waiting for a specific event. Let’s take a step back so I can explain:
Six months ago I ran into an issue where LDAP broke after a TLS certificate expired. It expired because it was not set up to renew automatically.
I broke authentication, but it’s not my fault.
How this came about
This weekend I was trying to log in to Matrix (which uses OpenLDAP as its password store) on a new device and it was failing. The logs complained about an expired TLS certificate. Weird. First, the certificate was set up with cert-manager to renew automatically with Let's Encrypt. Second, the certificate had been expired for a year and Synapse had never complained about it before.
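For context, this is roughly what the cert-manager side of such a setup looks like. A minimal sketch, assuming a ClusterIssuer named letsencrypt-prod and a matrix namespace; every name here is a placeholder, not my actual config.

```bash
# Declare a certificate for cert-manager to obtain and renew.
kubectl apply -f - <<'EOF'
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: ldap-tls
  namespace: matrix
spec:
  secretName: ldap-tls        # Secret the cert and key are written to
  dnsNames:
    - ldap.example.com        # placeholder hostname
  issuerRef:
    name: letsencrypt-prod    # placeholder ClusterIssuer
    kind: ClusterIssuer
EOF

# cert-manager renews well before expiry; check renewal status with:
kubectl describe certificate ldap-tls -n matrix
```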
Kubernetes Native Storage and a Load Balancer
As I continue to evolve my self-hosted environment to be more robust and fault-tolerant, I have completed setting up the Longhorn storage system and a bare-metal load balancer, MetalLB.
Longhorn
Longhorn provides block storage for a Kubernetes cluster, provisioned and managed with containers and microservices. It manages the disk devices on the nodes and creates a pool for Kubernetes persistent volumes (PVs), which are replicated and distributed across the nodes in the cluster.
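To make that concrete, here is what consuming Longhorn looks like from the workload side: a standard PersistentVolumeClaim against the longhorn StorageClass ("longhorn" is the default name the project installs; the claim name and size below are placeholders). Longhorn handles replicating the volume across nodes behind the scenes.

```bash
# Request a replicated Longhorn volume through an ordinary PVC.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-example          # placeholder claim name
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn  # StorageClass installed by Longhorn
  resources:
    requests:
      storage: 10Gi           # placeholder size
EOF
```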
Towards High Availability
To make my new two-node Proxmox cluster highly available, I need shared storage for the VMs and a quorum in the cluster.
Shared storage is currently available as an NFS mount from the QNAP, but my goal is to retire the QNAP and move two TB disks into the first Proxmox node.
There are a number of ways to do this, but I chose to use GlusterFS volumes backed by ZFS.
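At a high level, the setup looks something like the sketch below. The pool name tank, the node names pve1 and pve2, and the volume name vmstore are all placeholders, not my actual layout.

```bash
# On each node: create a ZFS dataset to back the Gluster brick.
zfs create -o mountpoint=/data/brick1 tank/brick1

# From one node: create a two-way replicated Gluster volume across
# both Proxmox nodes, then start it.
gluster volume create vmstore replica 2 \
  pve1:/data/brick1/vmstore pve2:/data/brick1/vmstore
gluster volume start vmstore

# Proxmox can then consume the volume as GlusterFS storage
# (Datacenter -> Storage -> Add -> GlusterFS).
```

Note that a two-way replica is vulnerable to split-brain if the nodes lose sight of each other, which is part of why the cluster also needs a quorum.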