I just switched some of my Kubernetes nodes to run on a ZFS root filesystem. It was mostly painless, but a few places required special configuration. Here are my notes.

Disk setup

The cheap server I got from Hetzner’s server auction has two 225G SSDs and two 1.8T HDDs. The former are the root filesystem, and the latter are used for slow data.

$ lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    0   1.8T  0 disk
sdb      8:16   0   1.8T  0 disk
sdc      8:32   0 223.6G  0 disk
sdd      8:48   0 223.6G  0 disk
...

The first few gigs on each disk are used for the mirrored bootloader, and we add the rest to two ZFS pools. The zroot pool is on the fast disks, and the zroot/nixos dataset is the root filesystem. That won't use up all of the 225G available, and we're going to put the zroot/containerd dataset in the same pool later. The zdata pool is on the slow spinning disks, and it will contain the zdata/longhorn-ext4 volume.
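
For reference, the partitioning on each drive looks roughly like this. This is only a sketch, assuming GPT and sgdisk; the exact boot partitions depend on your bootloader setup, but the big last partition on each drive is the one that goes into a pool (hence the -part3 device names in the appendix below).

# sgdisk -n 1:0:+1G -t 1:EF00 /dev/sdc   ### bootloader partition (assumed layout)
# sgdisk -n 3:0:0   -t 3:BF01 /dev/sdc   ### the rest becomes the ZFS partition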

# zpool create ... zroot mirror SDC_STABLE_ID SDD_STABLE_ID
# zpool create ... zdata mirror SDA_STABLE_ID SDB_STABLE_ID
# zfs create -o mountpoint=legacy zroot/nixos
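
As a quick sanity check, zpool status should report both pools as healthy two-way mirrors before we go any further:

# zpool status zroot zdata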

In this setup, ZFS replaces the entire mdadm/LVM/cryptsetup/ext4 stack. In other words, there’s no additional RAID, volume management, or encryption setup to do.

Ideally, we’d want Kubernetes to just use some part of zroot/nixos and Longhorn (the storage provider) to use the entirety of the zdata pool. In practice, we have to jump through a couple of hoops first.

Kubernetes (or rather, overlayfs)

We’re using Kubernetes 1.21.6 with containerd 1.5.7 on NixOS 21.05.

The problem is that Kubernetes uses containerd, whose default snapshotter relies on overlayfs, and overlayfs doesn't work on top of ZFS. The errors in the containerd logs look like this:

... failed to create containerd task: failed to create shim: failed to mount rootfs component ... invalid argument: unknown
Nov 18 17:12:59 fsn-qws-app2 containerd[31371]: time="2021-11-18T17:12:59.141191730Z" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:ingress-nginx-controller-vxsr4,Uid:85176d99-f1a4-42ef-9125-41ea50d7757c,Namespace:ingress-nginx,Attempt:0,} failed, error" error="failed to create containerd task: failed to create shim: failed to mount rootfs component &{overlay overlay[index=off workdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/2/work upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/2/fs lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/1/fs]}: invalid argument: unknown"

And the corresponding errors in the kernel logs look like this:

Nov 18 17:12:59 fsn-qws-app2 kernel: overlayfs: upper fs does not support RENAME_WHITEOUT.
Nov 18 17:12:59 fsn-qws-app2 kernel: overlayfs: upper fs missing required features.

The issue is that overlayfs uses renameat2 flags like RENAME_WHITEOUT that aren’t implemented on ZFS (openzfs/zfs#9414). The workaround is to configure containerd to use the zfs snapshotter plugin.
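
Before switching snapshotters, you can reproduce the failure without containerd at all by mounting an overlay whose upperdir sits on the ZFS root. This is just a sketch; it assumes /var/tmp lives on zroot/nixos, so adjust the paths to somewhere on your ZFS root.

# mkdir -p /var/tmp/ovl/{lower,upper,work,merged}
# mount -t overlay overlay \
    -o lowerdir=/var/tmp/ovl/lower,upperdir=/var/tmp/ovl/upper,workdir=/var/tmp/ovl/work \
    /var/tmp/ovl/merged
### On affected ZFS versions this fails with "invalid argument" and logs the
### same RENAME_WHITEOUT / missing features messages shown above.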

We create a dataset mounted where the zfs snapshotter expects it, and then add the following lines to containerd’s config:

# zfs create zroot/containerd -o mountpoint=/var/lib/containerd/io.containerd.snapshotter.v1.zfs

[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "zfs"

With this, Kubernetes should be able to run pods on the host. Containerd will also create a lot of ZFS datasets; see the “Resulting ZFS setup” section below for an example.
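
To double-check that containerd really switched over, the zfs snapshotter plugin should now be listed as “ok”, and datasets should start appearing under zroot/containerd as pods get scheduled. The exact ctr invocation is an assumption and may need adjusting for your containerd socket and namespace:

# ctr plugins ls | grep zfs
# zfs list -r zroot/containerd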

Longhorn

I use Longhorn as the container storage provider in my cluster. It’s easy to set up and it gets the job done. That said, Longhorn requires file extents, which ZFS doesn’t support, so we’re in a pickle.

The workaround is to use ZFS as a volume manager instead of a filesystem: we create a fixed-size volume (a zvol) in the big ZFS pool, format it as ext4, and mount it at Longhorn’s data directory.

# zfs create zdata/longhorn-ext4 -V 528G
# mkfs.ext4 /dev/zvol/zdata/longhorn-ext4
# mount -o noatime,discard /dev/zvol/zdata/longhorn-ext4 /var/lib/longhorn

I’m using NixOS, so I add the following to my config to make the mount permanent.

systemd.mounts = [{
  what = "/dev/zvol/zdata/longhorn-ext4";
  type = "ext4";
  where = "/var/lib/longhorn";
  wantedBy = [ "kubernetes.target" ];
  requiredBy = [ "kubernetes.target" ];
  options = "noatime,discard";
}];

Note that we use the discard option for the mount. This enables TRIM/discard on the filesystem, so ext4 tells the underlying zvol when blocks become unused. Without it, the volume will eventually grow to its maximum size in the pool, regardless of how much data the ext4 filesystem actually holds.
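
If a volume has already ballooned, or you prefer periodic batch discards over the discard mount option, fstrim releases the unused blocks after the fact, and zfs get shows how much space the zvol actually references:

# fstrim -v /var/lib/longhorn
# zfs get volsize,refreservation,referenced zdata/longhorn-ext4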

Resulting ZFS setup

The resulting ZFS setup looks like this:

$ zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
zdata  1.81T  9.44G  1.80T        -         -     0%     0%  1.00x    ONLINE  -
zroot   220G  4.45G   216G        -         -     0%     2%  1.00x    ONLINE  -

$ zfs list
NAME                   USED  AVAIL     REFER  MOUNTPOINT
zdata                  545G  1.22T      192K  none
zdata/longhorn-ext4    545G  1.75T     9.43G  -
zroot                 38.5G   175G      192K  none
zroot/containerd       744M   175G      216K  /var/lib/containerd/io.containerd.snapshotter.v1.zfs
...
zroot/nixos           3.71G   175G     3.71G  legacy
zroot/swap            34.0G   209G       92K  -

It turns out that containerd takes full advantage of ZFS and creates a lot of datasets. The full listing on this lightly used node actually looks like this:

$ zfs list
NAME                   USED  AVAIL     REFER  MOUNTPOINT
zdata                  545G  1.22T      192K  none
zdata/longhorn-ext4    545G  1.75T     9.43G  -
zroot                 38.5G   175G      192K  none
zroot/containerd       744M   175G      216K  /var/lib/containerd/io.containerd.snapshotter.v1.zfs
zroot/containerd/1    21.7M   175G     21.7M  legacy
zroot/containerd/10    104M   175G      156M  legacy
zroot/containerd/11   19.1M   175G      175M  legacy
zroot/containerd/116   144K   175G     21.7M  legacy
zroot/containerd/120   544K   175G      136M  legacy
zroot/containerd/139   144K   175G     21.7M  legacy
zroot/containerd/14   5.41M   175G     5.41M  legacy
zroot/containerd/140   264K   175G      204M  legacy
zroot/containerd/15   65.9M   175G     71.0M  legacy
zroot/containerd/155   144K   175G     21.7M  legacy
zroot/containerd/156   336K   175G     16.6M  legacy
zroot/containerd/16    192K   175G     71.1M  legacy
zroot/containerd/160   144K   175G     21.7M  legacy
zroot/containerd/161   336K   175G      175M  legacy
zroot/containerd/166   144K   175G     21.7M  legacy
zroot/containerd/167   144K   175G     21.7M  legacy
zroot/containerd/168   304K   175G     16.6M  legacy
zroot/containerd/169   548K   175G      136M  legacy
zroot/containerd/17   30.8M   175G      102M  legacy
zroot/containerd/170   144K   175G     21.7M  legacy
zroot/containerd/171   144K   175G     21.7M  legacy
zroot/containerd/173   400K   175G      175M  legacy
zroot/containerd/174   296K   175G      204M  legacy
zroot/containerd/175   144K   175G     21.7M  legacy
zroot/containerd/176   476K   175G      190M  legacy
zroot/containerd/177   144K   175G     21.7M  legacy
zroot/containerd/178   484K   175G      190M  legacy
zroot/containerd/18   14.8M   175G      116M  legacy
zroot/containerd/182   424K   175G      175M  legacy
zroot/containerd/183   424K   175G      175M  legacy
zroot/containerd/19   5.65M   175G      114M  legacy
zroot/containerd/20    752K   175G      115M  legacy
zroot/containerd/21   3.61M   175G      118M  legacy
zroot/containerd/22   16.5M   175G      134M  legacy
zroot/containerd/23   1.78M   175G      136M  legacy
zroot/containerd/24    256K   175G      136M  legacy
zroot/containerd/25   24.4M   175G      136M  legacy
zroot/containerd/26    200K   175G      136M  legacy
zroot/containerd/28    122M   175G      173M  legacy
zroot/containerd/29   5.59M   175G      179M  legacy
zroot/containerd/30   24.6M   175G      203M  legacy
zroot/containerd/31    244K   175G      203M  legacy
zroot/containerd/32    424K   175G      203M  legacy
zroot/containerd/33   1.38M   175G      204M  legacy
zroot/containerd/34    216K   175G      204M  legacy
zroot/containerd/35    224K   175G      204M  legacy
zroot/containerd/38   6.77M   175G     6.76M  legacy
zroot/containerd/39   9.97M   175G     16.6M  legacy
zroot/containerd/70   54.7M   175G     54.7M  legacy
zroot/containerd/71    124M   175G      173M  legacy
zroot/containerd/72   5.59M   175G      179M  legacy
zroot/containerd/73   10.3M   175G      189M  legacy
zroot/containerd/74    236K   175G      189M  legacy
zroot/containerd/75    424K   175G      189M  legacy
zroot/containerd/76   1.41M   175G      189M  legacy
zroot/containerd/77    216K   175G      190M  legacy
zroot/containerd/78    224K   175G      190M  legacy
zroot/containerd/9    54.7M   175G     54.7M  legacy
zroot/nixos           3.71G   175G     3.71G  legacy
zroot/swap            34.0G   209G       92K  -

Appendix: Full ZFS commands

In the interest of clarity, I elided most of the options of the ZFS commands above, but in case anyone wants to try this out, here they are:

$ zfs version
zfs-2.0.6-1
zfs-kmod-2.0.6-1

### Create zpools with mirroring, compression, encryption, and the
### ashift optimisation for modern drives.  The long ids are the
### /dev/disk/by-id names of the big partitions on each drive.
### Do not use sd* because those can change from boot to boot.
# zpool create \
    -O mountpoint=none -o ashift=12 -O atime=off -O acltype=posixacl \
    -O xattr=sa -O compression=lz4 \
    -O encryption=aes-256-gcm -O keyformat=passphrase \
    zroot mirror \
    ata-INTEL_SSDSC2CW240A3_CVCV326400E1240FGN-part3 ata-INTEL_SSDSC2CW240A3_CVCV316202CW240CGN-part3
# zpool create \
    -O mountpoint=none -o ashift=12 -O atime=off -O acltype=posixacl \
    -O xattr=sa -O compression=lz4 \
    -O encryption=aes-256-gcm -O keyformat=passphrase \
    zdata mirror \
    ata-HGST_HUS724020ALA640_PN1134P6HVHR7W-part3 ata-ST2000NM0033-9ZM175_Z1X08M9Q-part3

### We use a legacy mountpoint because nixos is going to mount the
### root for us.
# zfs create -o mountpoint=legacy zroot/nixos

### Swap on ZFS might not work.  Try it out anyway.
### https://github.com/openzfs/zfs/issues/7734
### The options are the ones recommended by the FAQ:
### https://openzfs.github.io/openzfs-docs/Project%20and%20Community/FAQ.html?highlight=faq#using-a-zvol-for-a-swap-device-on-linux
# zfs create -V 32G -b $(getconf PAGESIZE) \
    -o logbias=throughput \
    -o sync=always \
    -o primarycache=metadata \
    -o com.sun:auto-snapshot=false \
    zroot/swap
# mkswap /dev/zroot/swap
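
### (Sketch) Enable the swap device; on NixOS this would normally be
### declared via the swapDevices option rather than run by hand.
# swapon /dev/zroot/swap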

### Nothing changed from the listing above.  This is just
### a regular zfs filesystem auto-mounted to the given
### mountpoint.
# zfs create zroot/containerd \
    -o mountpoint=/var/lib/containerd/io.containerd.snapshotter.v1.zfs

### Nothing changed from the listing above.  I use less
### than the 1.8T available in the pool here to match
### the sizes of the other nodes in my cluster.  I will
### resize this volume when I get bigger machines.
# zfs create zdata/longhorn-ext4 -V 528G
# mkfs.ext4 /dev/zvol/zdata/longhorn-ext4