The Rabbit Hole

I've just recently passed the one-year anniversary of setting up my home Kubernetes cluster, in which I used VMs running RancherOS on a Proxmox hypervisor to quickly spin up nodes. I then used Rancher server to initiate the cluster, which also provided a convenient GUI to get some workloads up and going without needing to learn all of the concepts at once. It was a good strategy.

Even as I've been setting up new workloads and automating changes through the use of Ansible along with more traditional workload definition using YAML, I continue to manage some of the early workloads such as Nextcloud directly using the Rancher GUI. If it ain't broke, right?

Yesterday I attempted to log in to Rancher in order to do something (can't remember what it was now) and was not able to connect. All of the other workloads seemed to be running fine, so I thought maybe that VM was down... nope. As I investigated, I found messages in the Rancher logs like this:

http: TLS handshake error from remote error: tls: bad certificate

This led me to a GitHub issue referencing the same messages. It turns out that when Rancher is set up the way I did at the time, using the default self-signed certificates, they are set to expire after 1 year by default. The other options at install time were to generate my own self-signed certificate or provide one signed by a known Certificate Authority such as Let's Encrypt. If I had known this or noticed it, I could have gone into Rancher and initiated a certificate rotation, which would have generated new certificates with a 10 year expiration date. Unfortunately, Rancher was already down, though fortunately all of the workloads were still running.
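A periodic check along these lines could flag an expiring certificate before it takes anything down. This is only a sketch: cert_expires_within and cert_end_date are hypothetical helper names, and the certificate path you pass in is an assumption, since where the certificates live depends on how Rancher was installed. It relies on the openssl x509 -checkend option.

```shell
#!/bin/sh
# Sketch: warn when a PEM certificate is close to expiring.
# The helper names below are illustrative, not part of Rancher.

# Returns 0 (true) if the certificate in $1 expires within $2 days.
# An unreadable or missing file also counts as "needs attention".
cert_expires_within() {
    cert_file=$1
    days=$2
    seconds=$((days * 86400))
    # "openssl x509 -checkend N" exits 0 only if the certificate
    # is still valid N seconds from now.
    if openssl x509 -checkend "$seconds" -noout -in "$cert_file" >/dev/null 2>&1; then
        return 1
    fi
    return 0
}

# Print the expiry date of the certificate in $1 (a "notAfter=..." line).
cert_end_date() {
    openssl x509 -enddate -noout -in "$1"
}
```

Something like `cert_expires_within <PATH_TO_CERT> 30 && echo "rotate certificates soon"` in a cron job would have been enough of a nudge.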

Unfortunately, my F**** Up was not reading all the way through the issue and realizing that there was a new version of Rancher, v2.5.8, which addressed this issue. The good news is that I've previously upgraded Rancher and knew that the process included backing up all of the Rancher data to a tar file, and I actually did that first before attempting some of the other methods to correct this problem.

Backing up Rancher data

  • Log into kubemgr node running Rancher
  • Run docker ps to get the name of the Rancher container (e.g. festive_hypatia)
  • Stop Rancher
  • If not already done, create a named data volume for the Rancher data
docker create --volumes-from <RANCHER_CONTAINER_NAME> --name rancher-data-<DATE> rancher/rancher:<RANCHER_CONTAINER_TAG>
  • Execute command to create tar file
docker run --volumes-from rancher-data-<DATE> -v $PWD:/backup:z busybox tar pzcvf /backup/rancher-data-backup-<RANCHER_VERSION>-<DATE>.tar.gz /var/lib/rancher
  • Verify the tar file was created in the current directory and move it someplace for safekeeping.
  • Start Rancher again
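The steps above can be strung together in a small script. This is a sketch under assumptions: backup_archive_name and backup_rancher are made-up function names, and the container name and version tag are passed in by hand after checking docker ps. The docker commands are the same ones listed in the steps.

```shell
#!/bin/sh
# Sketch of the backup steps as one script.
# Function names are illustrative; the docker commands mirror the steps above.

# Build the archive name used in the steps above.
backup_archive_name() {
    version=$1
    date_stamp=$2
    printf 'rancher-data-backup-%s-%s.tar.gz\n' "$version" "$date_stamp"
}

# Stop Rancher, snapshot its data into a tar file in $PWD, start it again.
backup_rancher() {
    container=$1   # from "docker ps", e.g. festive_hypatia
    version=$2     # tag of the running image, e.g. v2.5.7
    date_stamp=$(date +%Y-%m-%d)
    archive=$(backup_archive_name "$version" "$date_stamp")

    docker stop "$container" || return 1
    docker create --volumes-from "$container" \
        --name "rancher-data-$date_stamp" "rancher/rancher:$version" || return 1
    docker run --volumes-from "rancher-data-$date_stamp" \
        -v "$PWD:/backup:z" busybox \
        tar pzcvf "/backup/$archive" /var/lib/rancher || return 1
    # Refuse to carry on if the archive is missing or empty.
    [ -s "$archive" ] || return 1
    docker start "$container"
}
```

Calling `backup_rancher festive_hypatia v2.5.7` from the directory where the archive should land would run the whole sequence. Rancher is only restarted if the archive actually exists.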

What Not To Do

What I attempted and failed to do, in several different ways, was to create or replace the current certificates with new ones I generated with a 10 year expiration. It was an interesting exercise with a high stakes "production down" situation. Following along in the issue, I tried:

  • Changing the date/time on the VM
  • Running a shell in the Rancher container and removing or renaming various directories associated with the certificates (bad idea!).
  • Using easyrsa scripts to generate a new certificate following this blog post, but the easyrsa I had looked completely different from the process described in the post.

For some reason, I wrongly thought that the most recent version of Rancher was the one that I was already running, v2.5.7, so I didn't consider upgrading to a newer version until I saw this comment in the GitHub issue. Version v2.5.8 seemingly resolved this issue by replacing the expired self-signed certificates.

I knew there was a FU when, after upgrading, I started getting Golang errors on starting Rancher. So, it was time to restore the Rancher data back to its state before I started down this path.

Restoring Rancher data

  • Log into kubemgr node running Rancher
  • Run docker ps to get the name of the Rancher container (e.g. festive_hypatia)
  • Stop Rancher
  • Make sure the original tar file created from the backup is back in the current directory and execute this command to remove the existing data and restore from the tar file
docker run  --volumes-from <RANCHER_CONTAINER_NAME> -v $PWD:/backup \
busybox sh -c "rm /var/lib/rancher/* -rf  && \
tar pzxvf /backup/rancher-data-backup-<RANCHER_VERSION>-<DATE>.tar.gz"
  • Start Rancher
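Since the restore wipes /var/lib/rancher before unpacking, it seems worth checking the archive first. A minimal sketch, assuming the archive was made with the backup command above (so its entries start with var/lib/rancher). The names backup_looks_valid and restore_rancher are hypothetical.

```shell
#!/bin/sh
# Sketch: sanity-check a backup archive before restoring from it.
# Function names are illustrative; the restore command mirrors the step above.

# Returns 0 if $1 is a readable gzip'd tar containing var/lib/rancher entries.
backup_looks_valid() {
    archive=$1
    [ -s "$archive" ] || return 1
    # tar normally strips the leading "/" on create, so allow both forms.
    # grep -c reads the whole listing and exits 0 only if something matched.
    tar tzf "$archive" 2>/dev/null | grep -Ec '^/?var/lib/rancher' >/dev/null
}

# Stop Rancher, replace its data with the archive contents, start it again.
restore_rancher() {
    container=$1   # from "docker ps"
    archive=$2     # file name of the tar file in the current directory
    backup_looks_valid "$archive" || { echo "bad archive: $archive" >&2; return 1; }
    docker stop "$container" || return 1
    docker run --volumes-from "$container" -v "$PWD:/backup" \
        busybox sh -c "rm /var/lib/rancher/* -rf && tar pzxvf /backup/$archive" || return 1
    docker start "$container"
}
```

The check runs before anything is stopped or deleted, so a truncated or misnamed tar file fails early instead of leaving an empty data directory.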

Upgrading Rancher

The process of upgrading Rancher involves:

  • Following the backup procedure to create a named data volume
  • Pulling the new image
docker pull rancher/rancher:<RANCHER_VERSION_TAG>
  • Creating a new container from the Rancher data volume, using the same method as the original installation (in my case, a generated self-signed certificate)
docker run -d --volumes-from rancher-data \
  --restart=unless-stopped \
  -p 80:80 -p 443:443 \
  rancher/rancher:<RANCHER_VERSION_TAG>

Once the data was restored and Rancher was upgraded to the newer version, everything was working again.

How Deep Does It Go?

In order to put this issue in context, I need to break down just how many layers of abstraction exist. I had no idea it went this deep. From lowest to highest:

  • Physical server
  • Linux host (Debian/Proxmox)
  • Hypervisor (QEMU)
  • VM (RancherOS)
  • System Docker - RancherOS is the Linux kernel plus Docker
  • OS container (Debian console)
  • Docker running on OS container
  • Rancher and Rancher-agent containers
  • k3s (a lightweight Kubernetes implementation)
  • Rancher server using CRDs (Custom Resource Definitions) to manage the k3s (aka local) and one or more k8s clusters.

As best as I can tell, the certificates that had expired were the k3s certificates running inside that entire stack. Whew!