I just switched some of my Kubernetes nodes to run on a root ZFS system. It was mostly painless, but there were a few places that required special configuration. Here are my notes.
Disk setup
The cheap server I got from Hetzner’s server auction has two 225G SSDs and two 1.8T HDDs. The former are the root filesystem, and the latter are used for slow data.
$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 1.8T 0 disk
sdb 8:16 0 1.8T 0 disk
sdc 8:32 0 223.6G 0 disk
sdd 8:48 0 223.6G 0 disk
...
The first few gigs on each disk are used for the mirrored bootloader, and we add the rest to two ZFS pools. The zroot pool is on the fast disks, and the zroot/nixos dataset is the root filesystem. That’s not going to use up all the 225G available, and we’re going to put the zroot/containerd dataset on it later. The zdata pool is on the slow spinning disks, and it will contain the zdata/longhorn-ext4 volume.
# zpool create ... zroot mirror SDC_STABLE_ID SDD_STABLE_ID
# zpool create ... zdata mirror SDA_STABLE_ID SDB_STABLE_ID
# zfs create -o mountpoint=legacy zroot/nixos
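The SDC_STABLE_ID-style arguments are placeholders for the /dev/disk/by-id names of the partitions (the exact ones I used are in the appendix). If you’re following along, listing that directory shows which stable name points at which sdX device from the lsblk output above:
$ ls -l /dev/disk/by-id/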
In this setup, ZFS replaces the entire mdadm/LVM/cryptsetup/ext4 stack. In other words, there’s no additional RAID, volume management, or encryption setup to do.
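For a sense of what that saves, here’s a rough sketch of the traditional stack for a mirrored, encrypted ext4 root. The device names and sizes are hypothetical; this is not something I ran on this machine:
### Roughly the stack that ZFS replaces (hypothetical devices and sizes).
# mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdc3 /dev/sdd3
# cryptsetup luksFormat /dev/md0
# cryptsetup open /dev/md0 cryptroot
# pvcreate /dev/mapper/cryptroot
# vgcreate vg0 /dev/mapper/cryptroot
# lvcreate -L 100G -n root vg0
# mkfs.ext4 /dev/vg0/root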
Ideally, we’d want Kubernetes to just use some part of zroot/nixos and Longhorn (the storage provider) to use the entirety of the zdata pool. In practice, we have to jump through a couple of hoops first.
Kubernetes (or rather, overlayfs)
We’re using Kubernetes 1.21.6 with containerd 1.5.7 on NixOS 21.05.
The problem is that Kubernetes uses containerd, which in turn uses overlayfs, which doesn’t work on ZFS. The errors in the containerd logs look like this:
Nov 18 17:12:59 fsn-qws-app2 containerd[31371]: time="2021-11-18T17:12:59.141191730Z" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:ingress-nginx-controller-vxsr4,Uid:85176d99-f1a4-42ef-9125-41ea50d7757c,Namespace:ingress-nginx,Attempt:0,} failed, error" error="failed to create containerd task: failed to create shim: failed to mount rootfs component &{overlay overlay[index=off workdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/2/work upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/2/fs lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/1/fs]}: invalid argument: unknown"
And the corresponding errors in the kernel logs look like this:
Nov 18 17:12:59 fsn-qws-app2 kernel: overlayfs: upper fs does not support RENAME_WHITEOUT.
Nov 18 17:12:59 fsn-qws-app2 kernel: overlayfs: upper fs missing required features.
The issue is that overlayfs uses some options for renameat2 that aren’t implemented on ZFS (openzfs/zfs#9414). The workaround is to configure containerd to use the zfs snapshotter plugin.
We create a dataset for containerd and then add the following lines to its config:
# zfs create zroot/containerd -o mountpoint=/var/lib/containerd/io.containerd.snapshotter.v1.zfs
[plugins."io.containerd.grpc.v1.cri".containerd]
snapshotter = "zfs"
With this, Kubernetes should be able to run pods on the host. Containerd will also create a lot of ZFS datasets; see the “Resulting ZFS setup” section below for an example.
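As a quick sanity check, ctr (which ships with containerd) can list the loaded plugins and their status; the zfs snapshotter should show up as ok. The exact output format may vary between containerd versions:
# ctr plugins ls | grep zfs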
Longhorn
I use Longhorn as the container storage provider in my cluster. It’s easy to set up and it gets the job done. That said, Longhorn requires file extents, which ZFS doesn’t support, so we’re in a pickle.
The workaround is to use ZFS as a volume manager instead of a filesystem. We create a fixed-size volume in the big ZFS pool, format it as ext4, and mount that to the Longhorn directory.
# zfs create zdata/longhorn-ext4 -V 528G
# mkfs.ext4 /dev/zvol/zdata/longhorn-ext4
# mount -o noatime,discard /dev/zvol/zdata/longhorn-ext4 /var/lib/longhorn
I’m using NixOS, so I add the following to my config to make the mount permanent.
systemd.mounts = [{
  what = "/dev/zvol/zdata/longhorn-ext4";
  type = "ext4";
  where = "/var/lib/longhorn";
  wantedBy = [ "kubernetes.target" ];
  requiredBy = [ "kubernetes.target" ];
  options = "noatime,discard";
}];
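After a nixos-rebuild, the mount can be verified. Systemd names the generated unit after the escaped mount point, so /var/lib/longhorn becomes var-lib-longhorn.mount:
$ systemctl status var-lib-longhorn.mount
$ findmnt /var/lib/longhorn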
Note that we use the discard option for mount. This enables the TRIM command on the filesystem, and makes it so that ext4 tells the zpool when blocks become unused. Without this, the volume will eventually expand to its maximum size in the pool, regardless of how much data is actually used by the ext4 filesystem.
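If the filesystem spent some time mounted without discard, a one-off fstrim should hand the already-freed blocks back to the pool, and the zvol’s actual space usage can be checked on the ZFS side. This is just a sanity check, not something the setup requires:
# fstrim -v /var/lib/longhorn
# zfs get volsize,used,referenced zdata/longhorn-ext4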
Resulting ZFS setup
The resulting ZFS setup looks like this:
$ zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
zdata 1.81T 9.44G 1.80T - - 0% 0% 1.00x ONLINE -
zroot 220G 4.45G 216G - - 0% 2% 1.00x ONLINE -
$ zfs list
NAME USED AVAIL REFER MOUNTPOINT
zdata 545G 1.22T 192K none
zdata/longhorn-ext4 545G 1.75T 9.43G -
zroot 38.5G 175G 192K none
zroot/containerd 744M 175G 216K /var/lib/containerd/io.containerd.snapshotter.v1.zfs
...
zroot/nixos 3.71G 175G 3.71G legacy
zroot/swap 34.0G 209G 92K -
It turns out that containerd takes full advantage of ZFS and creates a lot of datasets. The full listing on this lightly used node actually looks like this:
$ zfs list
NAME USED AVAIL REFER MOUNTPOINT
zdata 545G 1.22T 192K none
zdata/longhorn-ext4 545G 1.75T 9.43G -
zroot 38.5G 175G 192K none
zroot/containerd 744M 175G 216K /var/lib/containerd/io.containerd.snapshotter.v1.zfs
zroot/containerd/1 21.7M 175G 21.7M legacy
zroot/containerd/10 104M 175G 156M legacy
zroot/containerd/11 19.1M 175G 175M legacy
zroot/containerd/116 144K 175G 21.7M legacy
zroot/containerd/120 544K 175G 136M legacy
zroot/containerd/139 144K 175G 21.7M legacy
zroot/containerd/14 5.41M 175G 5.41M legacy
zroot/containerd/140 264K 175G 204M legacy
zroot/containerd/15 65.9M 175G 71.0M legacy
zroot/containerd/155 144K 175G 21.7M legacy
zroot/containerd/156 336K 175G 16.6M legacy
zroot/containerd/16 192K 175G 71.1M legacy
zroot/containerd/160 144K 175G 21.7M legacy
zroot/containerd/161 336K 175G 175M legacy
zroot/containerd/166 144K 175G 21.7M legacy
zroot/containerd/167 144K 175G 21.7M legacy
zroot/containerd/168 304K 175G 16.6M legacy
zroot/containerd/169 548K 175G 136M legacy
zroot/containerd/17 30.8M 175G 102M legacy
zroot/containerd/170 144K 175G 21.7M legacy
zroot/containerd/171 144K 175G 21.7M legacy
zroot/containerd/173 400K 175G 175M legacy
zroot/containerd/174 296K 175G 204M legacy
zroot/containerd/175 144K 175G 21.7M legacy
zroot/containerd/176 476K 175G 190M legacy
zroot/containerd/177 144K 175G 21.7M legacy
zroot/containerd/178 484K 175G 190M legacy
zroot/containerd/18 14.8M 175G 116M legacy
zroot/containerd/182 424K 175G 175M legacy
zroot/containerd/183 424K 175G 175M legacy
zroot/containerd/19 5.65M 175G 114M legacy
zroot/containerd/20 752K 175G 115M legacy
zroot/containerd/21 3.61M 175G 118M legacy
zroot/containerd/22 16.5M 175G 134M legacy
zroot/containerd/23 1.78M 175G 136M legacy
zroot/containerd/24 256K 175G 136M legacy
zroot/containerd/25 24.4M 175G 136M legacy
zroot/containerd/26 200K 175G 136M legacy
zroot/containerd/28 122M 175G 173M legacy
zroot/containerd/29 5.59M 175G 179M legacy
zroot/containerd/30 24.6M 175G 203M legacy
zroot/containerd/31 244K 175G 203M legacy
zroot/containerd/32 424K 175G 203M legacy
zroot/containerd/33 1.38M 175G 204M legacy
zroot/containerd/34 216K 175G 204M legacy
zroot/containerd/35 224K 175G 204M legacy
zroot/containerd/38 6.77M 175G 6.76M legacy
zroot/containerd/39 9.97M 175G 16.6M legacy
zroot/containerd/70 54.7M 175G 54.7M legacy
zroot/containerd/71 124M 175G 173M legacy
zroot/containerd/72 5.59M 175G 179M legacy
zroot/containerd/73 10.3M 175G 189M legacy
zroot/containerd/74 236K 175G 189M legacy
zroot/containerd/75 424K 175G 189M legacy
zroot/containerd/76 1.41M 175G 189M legacy
zroot/containerd/77 216K 175G 190M legacy
zroot/containerd/78 224K 175G 190M legacy
zroot/containerd/9 54.7M 175G 54.7M legacy
zroot/nixos 3.71G 175G 3.71G legacy
zroot/swap 34.0G 209G 92K -
Appendix: Full ZFS commands
In the interest of clarity, I elided most of the options of the ZFS commands above, but in case anyone wants to try this out, here they are:
$ zfs version
zfs-2.0.6-1
zfs-kmod-2.0.6-1
### Create zpools with mirroring, compression, encryption, and the
### ashift optimisation for modern drives. The long ids are the
### /dev/disk/by-id names of the big partitions on each drive.
### Do not use sd* because those can change from boot to boot.
# zpool create \
  -O mountpoint=none -o ashift=12 -O atime=off -O acltype=posixacl \
  -O xattr=sa -O compression=lz4 \
  -O encryption=aes-256-gcm -O keyformat=passphrase \
  zroot mirror \
  ata-INTEL_SSDSC2CW240A3_CVCV326400E1240FGN-part3 ata-INTEL_SSDSC2CW240A3_CVCV316202CW240CGN-part3
# zpool create \
  -O mountpoint=none -o ashift=12 -O atime=off -O acltype=posixacl \
  -O xattr=sa -O compression=lz4 \
  -O encryption=aes-256-gcm -O keyformat=passphrase \
  zdata mirror \
  ata-HGST_HUS724020ALA640_PN1134P6HVHR7W-part3 ata-ST2000NM0033-9ZM175_Z1X08M9Q-part3
### We use a legacy mountpoint because nixos is going to mount the
### root for us.
# zfs create -o mountpoint=legacy zroot/nixos
### Swap on ZFS might not work. Try it out anyway.
### https://github.com/openzfs/zfs/issues/7734
### The options are the ones recommended by the FAQ:
### https://openzfs.github.io/openzfs-docs/Project%20and%20Community/FAQ.html?highlight=faq#using-a-zvol-for-a-swap-device-on-linux
# zfs create -V 32G -b $(getconf PAGESIZE) \
  -o logbias=throughput \
  -o sync=always \
  -o primarycache=metadata \
  -o com.sun:auto-snapshot=false \
  zroot/swap
# mkswap /dev/zroot/swap
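### The zvol still has to be enabled as swap to find out whether it
### misbehaves. A quick manual test looks like this; the permanent
### version would go through the NixOS swapDevices option instead.
# swapon /dev/zroot/swap
# swapon --show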
### Nothing changed from the listing above. This is just
### a regular zfs filesystem auto-mounted to the given
### mountpoint.
# zfs create zroot/containerd \
-o mountpoint=/var/lib/containerd/io.containerd.snapshotter.v1.zfs
### Nothing changed from the listing above. I use less
### than the 1.8T available in the pool here to match
### the sizes of the other nodes in my cluster. I will
### resize this volume when I get bigger machines.
# zfs create zdata/longhorn-ext4 -V 528G
# mkfs.ext4 /dev/zvol/zdata/longhorn-ext4