The Great Migration
Back in April, I wrote a summary post about the project I had been working on to migrate from the original Rancher Kubernetes cluster I created in 2020 to a new k3s cluster. My intention was to continue that series and detail each part of the project. Unfortunately, I ran into some technical problems, which meant that not only did the cluster get shut down, but I didn’t have time to even look at it. Since this site had been moved to the new k3s cluster, it went down too.
I had also started a new business in 2020, which I’d been running part-time alongside my full-time job as an IT leader at a financial services company. In April, shortly after the post I referred to above, I went to a marketing conference where I learned how to market the business. Since then, business has picked up enough that I was able to resign from my salaried job and work on the business full-time. We’re not profitable yet, but we’re on track to be within a year.
Unfortunately, that meant I didn’t have time to finish the migration project and had simply shut down the VMs. A few weeks ago, though, the Rancher cluster suffered a catastrophic crash and would not come back up, leaving me no choice but to make time to finish the job.
Automation to the rescue
Fortunately, the guiding principles I set out when I started the project paid off. All of the VMs were provisioned on the Proxmox cluster using Terraform, k3s was installed using Ansible, and all of the workloads were deployed using Helm and Ansible-driven templated Kubernetes manifests. That meant that when the new cluster ran into problems (more on that in a minute), it was just a matter of deleting the VMs, recreating everything with a few commands, and restoring the Longhorn volumes from backup.
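In practice the rebuild was a handful of commands, roughly like this (the playbook names are illustrative, not my exact files):

$ terraform apply                                      # recreate the k3s node VMs on Proxmox
$ ansible-playbook -i inventory k3s-install.yml        # install k3s on the fresh VMs
$ ansible-playbook -i inventory deploy-workloads.yml   # Helm releases and templated manifests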
The Problems
It turns out that most of the stability issues with the new cluster came down to two things. First, a bug in the NIC driver was causing the network connection on one of the Proxmox hosts to drop under heavy load. Cluster storage is provided by Longhorn, which relies on each volume being replicated across three Kubernetes nodes (VMs). In my two-node Proxmox cluster, three k3s nodes run on each physical server, so at least one replica of every volume lived on the other host and its replication traffic crossed the failing link. As load on the cluster increased, it became more and more unstable.
The second problem appears to be ReadWriteMany volumes on Longhorn. As opposed to ReadWriteOnce, ReadWriteMany allows pods on different nodes to read and write to the same volume. Under the hood, I believe Longhorn implements ReadWriteMany by exporting the volume over NFS so that it can be mounted on the other nodes in the cluster. Perhaps it’s related to the first problem as well, but every time I created a ReadWriteMany volume, the entire cluster became very unstable.
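For reference, the access mode is just a field on the PersistentVolumeClaim. A minimal example, with an illustrative name and size:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data            # illustrative name
spec:
  storageClassName: longhorn
  accessModes:
    - ReadWriteMany            # pods on multiple nodes may mount it; ReadWriteOnce limits it to one node
  resources:
    requests:
      storage: 5Gi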
The Migration
In the original project, since I was new to Kubernetes, I used the Rancher web GUI to create many of the objects. Later, I started using Ansible playbooks to apply templated manifests to the cluster to create and maintain applications. For example, when a new version of an application was released, I only had to update the version variable in the playbook and run it again.
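The pattern looks roughly like this minimal sketch, assuming the kubernetes.core.k8s module and illustrative variable and template names:

- hosts: localhost
  vars:
    app_version: "1.2.3"   # the only line that changes for an upgrade
  tasks:
    - name: Apply the templated application manifest
      kubernetes.core.k8s:
        state: present
        definition: "{{ lookup('template', 'app-deployment.yaml.j2') }}"

The Jinja2 template interpolates app_version into the image tag, so upgrading an application is a one-variable change followed by a playbook run.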
The applications deployed using the GUI were the toughest, as they required me to recover the manifests from the cluster and convert them into Ansible playbooks:
$ kubectl -n namespace get deployment name -o yaml > name.yaml
Fortunately, I was able to restore an old backup of the VMs and bring them up long enough to dump everything to manifest files with the script below.
#!/usr/bin/env bash
# Dump every resource in every namespace of a cluster to one YAML file per object.
set -e

CONTEXT="$1"
if [[ -z "${CONTEXT}" ]]; then
    echo "Usage: $0 KUBE-CONTEXT"
    exit 1
fi

NAMESPACES=$(kubectl --context "${CONTEXT}" get -o json namespaces | jq -r '.items[].metadata.name')
RESOURCES="pvc pv configmap serviceaccount secret ingress service deployment statefulset hpa job cronjob"

for ns in ${NAMESPACES}; do
    for resource in ${RESOURCES}; do
        # List the names of this resource type in this namespace
        rsrcs=$(kubectl --context "${CONTEXT}" -n "${ns}" get -o json "${resource}" | jq -r '.items[].metadata.name')
        for r in ${rsrcs}; do
            dir="${CONTEXT}/${ns}/${resource}"
            mkdir -p "${dir}"
            # Write the object to <context>/<namespace>/<resource>/<name>.yaml
            kubectl --context "${CONTEXT}" -n "${ns}" get -o yaml "${resource}" "${r}" > "${dir}/${r}.yaml"
        done
    done
done
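Pointed at the old cluster’s kubeconfig context, it writes one file per object (the script and context names here are whatever you choose):

$ ./dump-cluster.sh old-cluster
# produces old-cluster/<namespace>/<resource>/<name>.yaml for every object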
The basic process for each application was then to:
- Create a new Longhorn volume
- Create a temporary volume pointing to the old NFS location
- Run a Kubernetes Job to rsync the data from the source volume to the destination volume (a sketch follows this list)
- Delete the temporary source volume
- Convert the old manifest or playbook into a new playbook that deploys all of the Kubernetes objects
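Here’s a minimal sketch of that copy Job, assuming hypothetical PVC names old-nfs-data (the temporary NFS-backed source) and new-longhorn-data (the Longhorn destination) and any container image that ships rsync:

apiVersion: batch/v1
kind: Job
metadata:
  name: migrate-data
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: rsync
          image: instrumentisto/rsync-ssh          # any image with rsync will do
          command: ["rsync", "-avh", "/source/", "/destination/"]
          volumeMounts:
            - name: source
              mountPath: /source
            - name: destination
              mountPath: /destination
      volumes:
        - name: source
          persistentVolumeClaim:
            claimName: old-nfs-data                # hypothetical source PVC
        - name: destination
          persistentVolumeClaim:
            claimName: new-longhorn-data           # hypothetical destination PVC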
What’s next?
I know I’ve made this promise before, but I’m still planning to write additional posts going into more detail on each of the components of the project:
- Terraform with Proxmox plugin for provisioning of VMs
- HAProxy for routing traffic between clusters based on hostname
- Keepalived for high availability of the Kubernetes API and HAProxy VMs
- MetalLB for Kubernetes load balancer services
- Traefik for Kubernetes ingress controller
- Longhorn for Kubernetes-native block storage
- k3s for the Kubernetes engine
- OpenLDAP for authentication
- Authelia for single sign-on (replaces Keycloak)
- Kubernetes Dashboard for management of k3s cluster
- Cloudflare DDNS for updating DNS records when the external IP changes
- Dashy for a user dashboard of available services
- Kube Prometheus / Prometheus Operator for observability
- Drone for CI/CD pipelines
- Gitea for code repositories
- Docker Registry for private image repository
- Ntfy.sh for notifications (replaces Gotify)
- PostgreSQL Operator for Postgres database clusters
- Hugo for static site generation
- Hello Friend theme for Hugo site
In order to keep this promise, I’ve added writing these posts to my weekly to-do list. This site has been focused on self-hosting and data sovereignty, but I also want to expand into other topics that interest me: productivity, organization, time management, gaming, politics, my experience starting a new business, and my thoughts on the IT services industry.