One of the reasons it took me almost six months to relaunch the blog is that I kept running into instability problems at almost every layer. I didn’t know if it was hardware, Proxmox, GlusterFS, Longhorn, or K3s. The one thing the failures had in common was that they happened when the system was under heavy load: backups, large file transfers, and so on.

Every time I thought I had the issues largely resolved, the new Proxmox node would stop responding. I thought it was a hardware hang, so I would hit the power button to reboot it. I really should have dug in sooner. This weekend it finally happened at a time when I could take a few moments to troubleshoot further.

I hooked up a monitor to the system and found these messages in the system logs:

 Apr 17 07:44:08 athena kernel: [357199.815359] e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
 Apr 17 07:44:08 athena kernel: [357199.815359]   TDH                  <f3>
 Apr 17 07:44:08 athena kernel: [357199.815359]   TDT                  <9e>
 Apr 17 07:44:08 athena kernel: [357199.815359]   next_to_use          <9e>
 Apr 17 07:44:08 athena kernel: [357199.815359]   next_to_clean        <f3>
 Apr 17 07:44:08 athena kernel: [357199.815359] buffer_info[next_to_clean]:
 Apr 17 07:44:08 athena kernel: [357199.815359]   time_stamp           <1055169b2>
 Apr 17 07:44:08 athena kernel: [357199.815359]   next_to_watch        <f4>
 Apr 17 07:44:08 athena kernel: [357199.815359]   jiffies              <105517109>
 Apr 17 07:44:08 athena kernel: [357199.815359]   next_to_watch.status <0>
 Apr 17 07:44:08 athena kernel: [357199.815359] MAC Status             <40080083>
 Apr 17 07:44:08 athena kernel: [357199.815359] PHY Status             <796d>
 Apr 17 07:44:08 athena kernel: [357199.815359] PHY 1000BASE-T Status  <3800>
 Apr 17 07:44:08 athena kernel: [357199.815359] PHY Extended Status    <3000>
 Apr 17 07:44:08 athena kernel: [357199.815359] PCI Status             <10>
 Apr 17 07:44:09 athena pmxcfs[9081]: [status] notice: cpg_send_message retry 10
 Apr 17 07:44:09 athena kernel: [357201.059155] e1000e 0000:00:1f.6 eno1: Reset adapter unexpectedly

After a bit of DuckDuckGo’ing, I found this Proxmox forum thread. It seems that this is a known issue, and it just so happens that the NIC built into the motherboard of the new node uses the e1000e driver.

The workaround is to disable offloading:

apt install -y ethtool
ethtool -K eno1 gso off gro off tso off tx off rx off rxvlan off txvlan off sg off

Diff of output from “ethtool -k eno1” before and after the command above:

2,3c2,3
< rx-checksumming: on
< tx-checksumming: on
---
> rx-checksumming: off
> tx-checksumming: off
5c5
< 	tx-checksum-ip-generic: on
---
> 	tx-checksum-ip-generic: off
9,10c9,10
< scatter-gather: on
< 	tx-scatter-gather: on
---
> scatter-gather: off
> 	tx-scatter-gather: off
12,13c12,13
< tcp-segmentation-offload: on
< 	tx-tcp-segmentation: on
---
> tcp-segmentation-offload: off
> 	tx-tcp-segmentation: off
16,18c16,18
< 	tx-tcp6-segmentation: on
< generic-segmentation-offload: on
< generic-receive-offload: on
---
> 	tx-tcp6-segmentation: off
> generic-segmentation-offload: off
> generic-receive-offload: off
20,21c20,21
< rx-vlan-offload: on
< tx-vlan-offload: on
---
> rx-vlan-offload: off
> tx-vlan-offload: off
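
If you would rather not eyeball a diff each time, a small filter can report which of these offloads are still on. This is only a rough sketch: the function name offloads_still_on is made up for illustration, and it assumes the long feature names that “ethtool -k” prints in the diff above.

```shell
#!/bin/sh
# Hypothetical helper: reads `ethtool -k <iface>` output on stdin and prints
# the name of every offload feature from the diff above that is still "on".
offloads_still_on() {
    awk -F': *' '
        # Sub-features are indented with a tab; strip leading whitespace.
        { gsub(/^[ \t]+/, "", $1) }
        # Only look at the top-level features the workaround turns off.
        $1 ~ /^(rx-checksumming|tx-checksumming|scatter-gather|tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload|rx-vlan-offload|tx-vlan-offload)$/ && $2 ~ /^on/ { print $1 }
    '
}

# Usage on the host (assumes the interface is eno1):
#   ethtool -k eno1 | offloads_still_on
# No output means every offload in the list is off.
```

Reading from stdin keeps the filter usable on saved output as well, for example a copy of “ethtool -k eno1” captured before a reboot.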

This is great and it seems to help, but it needs to be executed after every boot unless you add this to /etc/network/interfaces:

iface eno1 inet manual
    post-up /usr/bin/logger -p debug -t ifup "Disabling offload for eno1" && /sbin/ethtool -K $IFACE gso off gro off tso off tx off rx off rxvlan off txvlan off sg off && /usr/bin/logger -p debug -t ifup "Disabled offload for eno1"

You may try just disabling tso and gso, but that didn’t seem to help in my case. It’s been stable so far. I’m sure there is also a performance hit, but that’s better than it going down several times a day.
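
To keep an eye on whether it really stays stable, one option is to count how often the hang message shows up in the kernel log. A minimal sketch, assuming the message text matches the log excerpt above; count_nic_hangs is a hypothetical name, not an existing tool.

```shell
#!/bin/sh
# Hypothetical helper: reads kernel log lines on stdin and prints how many
# times the e1000e driver reported a hardware unit hang.
count_nic_hangs() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so force a successful exit status for use in scripts.
    grep -c 'Detected Hardware Unit Hang' || true
}

# Usage on the host (kernel messages from the journal, or dmesg):
#   journalctl -k | count_nic_hangs
#   dmesg | count_nic_hangs
```

A count that stays at 0 across a few backup runs would be a reasonable sign that the workaround is holding.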