I figured out how to access Kubernetes services on my cluster without the need for kubectl port-forward or an ingress. It can all be done with Linux routing tables, and Wireguard makes this trivial to set up.

If you’d rather skip the preamble, and jump straight to “how to bridge subnets with Wireguard”, go here.

The goal

My setup is this:

  • There are three machines in Hetzner running a Kubernetes (K8s) cluster,

  • There is a VPN machine which proxies Internet connections from my laptop, and

  • There is a Wireguard mesh connecting all the machines.

Overview of the setup. The laptop, vpn1, kube* machines are connected in a full mesh. The kube* hosts are also part of the Service subnet, and the Pod subnet.

The K8s cluster hosts several public Internet-facing services like this blog, but also some private services such as a Syncthing file dump. Concretely, each node hosts an ingress-nginx pod and an assortment of pods backing the aforementioned services.

The pods are the ones running the actual application code, and a client has several ways of reaching them.

  • Client → Pod1 IP:Port. A client could theoretically connect directly to the IP:Port of a specific pod. In practice, this would only work on the K8s hosts themselves since that’s the only place where the pod routing rules are set up. It would generally be a bad idea anyway since pods are constantly being re-created, and their IPs change every time.

  • Client → Service ClusterIP:Port → Pod*. A client could also theoretically connect to the virtual IP:Port of a ClusterIP service. The connection would then be redirected to a live pod for the service. This is the cleanest solution, but it doesn’t work in practice because the service routing rules are only set up on the K8s hosts. Later in this post, we will fix this, but first, let’s look at the standard K8s solutions.

  • Client → Service NodePort → Pod*. If we create a NodePort service, then the service gets bound to a static port on all the K8s hosts. Clients can connect to any of the hosts on the static port, and their connections will be redirected to a pod for the service. This is the first solution we’ve seen that works in general, but it has the big downside of requiring us to manually assign a port to each service (there’s a small sketch of both kinds of Service below).

  • Client → Ingress → Service ClusterIP:Port → Pod*. The recommended solution is to set up an ingress, which is essentially a reverse proxy that redirects inbound HTTP(S) connections to the right service based on some property of the request like its Host header. In my setup, I have ingress-nginx running on all the machines, and this page was served by the blog service through this ingress. The downsides are that it only works for HTTP(S), it requires an extra process to middle-man connections, and it requires a bit of extra configuration.

💭 There’s no fundamental reason why ingresses should only work with HTTP(S), but in practice, this is the case. HAProxy is the only one I could find that works for general TCP connections, but it’s clear from the clunkiness of the configuration that this was an afterthought.
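
As a concrete illustration of the ClusterIP and NodePort options, here is roughly how the two kinds of Service could be created for a hypothetical Deployment called blog (all names and ports here are made up):

$ kubectl expose deployment blog --name blog --port 80 --target-port 8080
$ kubectl create service nodeport blog-nodeport --tcp 80:8080 --node-port 30080

The first command creates a ClusterIP Service: it gets a virtual IP from the service subnet, and nothing is bound on the hosts. The second creates a NodePort Service, and 30080 is the static port we had to pick by hand.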

With the preamble out of the way, here’s our goal: we want laptop to access some private K8s services over the encrypted Wireguard network. Since these connections are already “secure”, they shouldn’t require authorization, or even TLS.

We could add the private services to the public ingress and restrict access based on some properties of the requests like the source IP address, but that seems brittle. One misconfiguration and the private services are open to the Internet.

We could create a second ingress just for the private services, and this feels like the right solution given K8s’ APIs, but it’s very awkward to do in practice. All the ingress implementations have defaults that work against us: they try to bind to all interfaces, they listen on many ports forcing us to pick and configure many ports for the private ingress, and they sometimes don’t even support running more than one instance in the same cluster. K8s itself sometimes tries to be helpful, and might expose a service through the wrong ingress. Specifically, it might assign an Ingress for a Service to the public IngressClass if we forgot to explicitly assign it to the private one.

💭 Note on terminology: What I’m calling an ingress in this post is actually an IngressClass in K8s. An Ingress is technically the configuration that exposes a Service through an IngressClass. However, the previous paragraph is the only place where the distinction between Ingress and IngressClass matters, so I’m going to continue calling either just “ingress”.

So, if we want to minimize the chance that a private service gets exposed to the Internet by mistake, we can’t use ingresses. This leaves the first three options listed above. As mentioned, connecting directly to pods doesn’t work because pods change their addresses frequently, and connecting to a service on a static port is annoying because we have to manage the assignment of static ports (and also it’s very hard to restrict services to only listen on the private Wireguard interface).

This leaves connecting to a service via its virtual ClusterIP address. This doesn’t work out of the box because we’re missing some routing rules, but this is something we can fix.

💭 If we have a certificate to access the cluster, we could also use kubectl port-forward to tunnel into a service from any machine that has access to the K8s apiserver. But the command is cumbersome to write, we have to manually assign ports, and the tunnel sometimes breaks after running for a while. It’s good enough for development, but we’re looking for a longer term solution.
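
For reference, this is the kind of invocation I mean (the Service name and ports are just examples):

$ kubectl --namespace default port-forward service/blog 8080:80

While it runs, localhost:8080 on the laptop is forwarded (through the apiserver) to a pod backing the blog Service.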

K8s service routing

How do K8s services actually work? When we create a Service, the apiserver assigns an IP address to it (from the subnet passed in --service-cluster-ip-range). This address will not change unless the Service is deleted and recreated. The interesting thing about this IP address is that it isn’t bound to any network interface anywhere. It is a “fake” address, and routing to it can never work.
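
For example, this blog is exposed through a Service called blog, and its virtual address can be read back with something like:

$ kubectl get service blog --output jsonpath='{.spec.clusterIP}{"\n"}'
10.33.0.111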

Let’s look at a concrete example. For a node in my cluster, I have the following addresses and subnets:

name                      address        prefix length
public IP                 5.9.100.179    /32
Wireguard IP              10.10.0.11     /24
container networking IP   10.32.2.1      /24
blog Service IP           10.33.0.111    -

For a full explanation of all these, see Kubernetes networking, but we can glean enough of an understanding by inspecting the routing table:

# ip route show
default via 5.9.100.161 dev enp2s0 proto static
5.9.100.161 dev enp2s0 proto static scope link
10.10.0.0/24 dev wg0 proto kernel scope link src 10.10.0.11
...
10.32.1.0/24 via 10.32.1.0 dev flannel.1 onlink
10.32.2.0/24 dev mynet proto kernel scope link src 10.32.2.1
10.32.4.0/24 via 10.32.4.0 dev flannel.1 onlink
...

We see:

  • The default gateway is 5.9.100.161 which is directly reachable through the physical enp2s0 interface.

  • All traffic to the Wireguard subnet of 10.10.0.0/24 goes through the wg0 network interface which knows how to reach the other Wireguard peers.

  • Traffic to containers on different nodes (10.32.1.0/24 and 10.32.4.0/24) goes through the flannel.1 interface.

  • Traffic to containers on the same node (10.32.2.0/24) goes through the mynet bridge to which the container veth devices are connected (see Kubernetes networking).

What is noticeably absent is any rule for connecting to the Service subnet of 10.33.0.0/24. So, what happens to a packet sent to the blog service 10.33.0.111?

The answer is that kube-proxy redirects it to one of the pods backing the service. How it does this depends on how kube-proxy is configured:

  • In iptables mode, it adds rules like the following to the firewall configuration:

    Chain KUBE-SERVICES (2 references)
    target     prot opt source               destination
    ...
    KUBE-SVC-TCOU7JCQXEZGVUNU  tcp  --  0.0.0.0/0            10.33.0.111          tcp dpt:80
    ...
    
    Chain KUBE-SVC-TCOU7JCQXEZGVUNU (1 references)
    KUBE-SEP-JMPR5F7DNF2VZGXN  all  --  0.0.0.0/0            0.0.0.0/0            statistic mode random probability 0.50000000000
    KUBE-SEP-GETXBXEJNGKLZGS3  all  --  0.0.0.0/0            0.0.0.0/0
    
    Chain KUBE-SEP-JMPR5F7DNF2VZGXN (1 references)
    target     prot opt source               destination
    ...
    DNAT       tcp  --  0.0.0.0/0            0.0.0.0/0            tcp to:10.32.1.84:80
    
    Chain KUBE-SEP-GETXBXEJNGKLZGS3 (1 references)
    target     prot opt source               destination
    ...
    DNAT       tcp  --  0.0.0.0/0            0.0.0.0/0            tcp to:10.32.4.149:80
    

    So, packets with destination 10.33.0.111:80 are passed through one of two chains with equal probability, and each chain rewrites the destination address to that of one of the pods backing the service (10.32.1.84:80 or 10.32.4.149:80). If the service and pod ports were different, they would be rewritten here.

  • In ipvs mode, the same thing happens, but the setup is easier to inspect:

    # ipvsadm -Ln --tcp-service 10.33.0.111:80
    Prot LocalAddress:Port Scheduler Flags
      -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
    TCP  10.33.0.111:80 rr
      -> 10.32.1.84:80                Masq    1      0          0
      -> 10.32.4.149:80               Masq    1      0          0
    

    So, a virtual server is created listening on 10.33.0.111:80, and the kernel distributes connections to it round-robin (rr) across the pods backing the service (10.32.1.84:80 or 10.32.4.149:80), with equal weights. If the service and pod ports were different, they would be rewritten here.

To summarise, services have “fake” IP addresses, and although routing to them doesn’t work, there are rules on the K8s hosts that rewrite packets addressed to them so that they go to one of the pods backing the service instead. So, if we want to send a packet to a service from outside the K8s cluster, we just need to get it to one of the nodes, and then everything should just work.
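
If you want to poke at this on your own cluster, kube-proxy will tell you which mode it is running in (assuming its metrics endpoint is listening on the default local port, which not every setup exposes), and in iptables mode, iptables-save is the quickest way to find the rules that mention a given service address:

# curl -s http://localhost:10249/proxyMode
iptables

# iptables-save -t nat | grep 10.33.0.111
... KUBE-SVC/DNAT rules for the service ...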

Wireguard routing

In order to get packets to the K8s hosts, we go over Wireguard, but how does that work exactly? Let’s first look at the mesh configuration for the K8s hosts, and later we’ll look at the more complicated laptop configuration.

The Wireguard config file for the K8s hosts looks like this:

[Interface]
Address = 10.10.0.11/24
Address = fd86:ea04:1111::11/64
ListenPort = 8123

[Peer]
PublicKey = BVDPF763oS0iL3+DihB5p0oF6FWbj+9XZx9dPK2LTkI=
Endpoint = 168.119.33.208:8123
AllowedIPs = 10.10.0.10/32,fd86:ea04:1111::10/128

... more peers ...
Wireguard config for K8s machine

Bringing up the interface involves executing commands like the following:


modprobe wireguard || true
ip link add dev "wg0" type wireguard
ip address add "10.10.0.11/24" dev "wg0"
ip address add "fd86:ea04:1111::11/64" dev "wg0"
wg set "wg0" \
  private-key "/etc/wireguard/key.priv" \
  listen-port "8123"
ip link set up dev "wg0"

wg set wg0 peer "BVDPF763oS0iL3+DihB5p0oF6FWbj+9XZx9dPK2LTkI=" \
  endpoint "168.119.33.208:8123" \
  allowed-ips "10.10.0.10/32,fd86:ea04:1111::10/128"
ip route replace "10.10.0.10/32" dev "wg0" table "main"
ip route replace "fd86:ea04:1111::10/128" dev "wg0" table "main"

... more peers ...

The wg0 device is created, its address is set, the private key is configured, and the listen port is assigned. Each peer is then set up individually: the peer is created with a specific public Internet-facing address (endpoint), and is configured to accept packets for a given subnet (allowed-ips). Routing to the subnet is then configured to go through the wg0 interface.
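
Once the interface is up, the running state can be double-checked with wg itself (output abridged):

# wg show wg0
interface: wg0
  public key: ...
  private key: (hidden)
  listening port: 8123

peer: BVDPF763oS0iL3+DihB5p0oF6FWbj+9XZx9dPK2LTkI=
  endpoint: 168.119.33.208:8123
  allowed ips: 10.10.0.10/32, fd86:ea04:1111::10/128
  latest handshake: ...
  transfer: ...

... more peers ...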

The resulting routing table looks like this:

# ip rule show
0:      from all lookup local
32766:  from all lookup main
32767:  from all lookup default

# ip route show table main
default via 5.9.100.161 dev enp2s0 proto static
5.9.100.161 dev enp2s0 proto static scope link
10.10.0.0/24 dev wg0 proto kernel scope link src 10.10.0.11
10.10.0.2 dev wg0 scope link
10.10.0.7 dev wg0 scope link
10.10.0.9 dev wg0 scope link
10.10.0.10 dev wg0 scope link
10.10.0.12 dev wg0 scope link
10.10.0.13 dev wg0 scope link
... kubernetes routes ...

So, when we send a packet to 10.10.0.10, Linux first checks the local table, but doesn’t find anything there because it’s just localhost and broadcast rules. Then, it tries the main table and finds the longest matching route “10.10.0.10 dev wg0 scope link”. It gives the packet to the wg0 interface which takes care of encapsulating it and sending it to 168.119.33.208 (the endpoint of the peer). This new packet wouldn’t match any specific rule in main, so it would match the default rule, and get sent to the default gateway of 5.9.100.161. This is pretty simple stuff, but it gets more complicated…
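
A handy way to double-check which route a given destination would take is ip route get. On this host, it gives something like the following (output trimmed):

# ip route get 10.10.0.10
10.10.0.10 dev wg0 src 10.10.0.11

# ip route get 8.8.8.8
8.8.8.8 via 5.9.100.161 dev enp2s0 src 5.9.100.179

The first lookup resolves to the Wireguard peer route, the second to the default gateway.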

If you’ve never encountered ip rule before, read the manpage, and search the Internet for “linux policy-based routing”. There are a lot of resources, but none of them are particularly good.

My laptop uses wg-quick to ensure that all outbound Internet traffic goes through my VPN. The trick it uses is described in the official docs, and we can see it in action by reading the wg-quick shell script.

The Wireguard config file looks very similar to the one for the K8s hosts (this is a mesh after all, so all the configs should have the same structure). The important difference is that the AllowedIPs for peer vpn1 contains 0.0.0.0/0:

...
[Peer]
PublicKey = JFvpmNopPF/r+P3A6co9pU+lZleXoT1ppvEv1jraW34=
Endpoint = 157.90.123.245:8123
AllowedIPs = 0.0.0.0/0,::/0,10.10.0.7/32,fd86:ea04:1111::7/128
...
Wireguard config for laptop

A prefix length of /0 means “all addresses”, and when wg-quick encounters this in add_route, it calls add_default to add the rules that send all outgoing traffic through wg0:

wg set wg0 fwmark 51820
ip route add "0.0.0.0/0" dev "wg0" table 51820
ip rule add not fwmark 51820 table 51820
ip rule add table main suppress_prefixlength 0

The resulting routing table looks like this:

# ip rule show
0:      from all lookup local
32764:  from all lookup main suppress_prefixlength 0
32765:  not from all fwmark 0xca6c lookup 51820
32766:  from all lookup main
32767:  from all lookup default

# ip route show table 51820
default dev wg0 scope link

# ip route show table main
default via 192.168.1.1 dev enp0s31f6 proto dhcp src 192.168.1.117 metric 203
10.10.0.0/24 dev wg0 proto kernel scope link src 10.10.0.2
192.168.1.0/24 dev enp0s31f6 proto dhcp scope link src 192.168.1.117 metric 203
...

The goal of these rules is to reroute all packets that would have gone through the default gateway to go through wg0 instead. However, Wireguard’s own traffic still has to go through the main routing table and the default route there (otherwise we would have a routing loop). Additionally, any routes which were specifically configured in the main routing table should still work.

Let’s follow a packet for 8.8.8.8 (i.e. an Internet packet) through this:

  • 0: from all lookup local

    It first goes through the local table, but doesn't match anything there.
  • 32764: from all lookup main suppress_prefixlength 0

    It hits the suppress_prefixlength 0 rule, and does a lookup in the main table. There are no specific rules for 8.8.8.8, so it only matches the default rule. However, that has a prefix length of 0 (because default just means 0.0.0.0/0), so this gets suppressed, and the matching continues.
  • 32765: not from all fwmark 0xca6c lookup 51820

    The next rule parses like not ((from all) && (fwmark 0xca6c)). The 0xca6c number is 51820 in hexadecimal. The fwmark referenced is the "firewall mark" that can be used to tag network packets in Linux. There's no fwmark on this packet, so the rule matches, and the lookup continues in table 51820. There, the packet matches the only route and gets sent to interface wg0.
  • Once in wg0, the packet is encapsulated into a Wireguard envelope, and the fwmark is set because of the wg set wg0 fwmark 51820 from earlier. This new packet has destination 157.90.123.245 (vpn1) because we configured the VPN peer to accept all IPs.

  • 0: from all lookup local

    The new packet goes through the routing rules again. It doesn't match anything in the local table.
  • 32764: from all lookup main suppress_prefixlength 0

    It hits the suppress_prefixlength 0 rule, goes through the main table, matches only the default route, so gets suppressed.
  • 32765: not from all fwmark 0xca6c lookup 51820

    The next rule doesn't match because the fwmark is set.
  • 32766: from all lookup main

    The next rule is the main routing table. It doesn't match anything specific, so it goes to the default route. From here, the packet leaves localhost, and hopefully reaches the VPN machine.

If we instead followed a packet for 192.168.1.127 (i.e. a local area network packet), we’d hit the suppress_prefixlength 0 rule, find the 192.168.1.0/24 dev enp0s31f6 rule in the main table, and since its prefix is greater than 0, it would not get suppressed, so matching would end, and the packet wouldn’t go through Wireguard.
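
ip route get makes the fwmark trickery visible too. On the laptop, asking for a route with and without Wireguard’s mark gives roughly this (output trimmed):

$ ip route get 8.8.8.8
8.8.8.8 dev wg0 table 51820 src 10.10.0.2

$ ip route get 8.8.8.8 mark 0xca6c
8.8.8.8 via 192.168.1.1 dev enp0s31f6 src 192.168.1.117

Without the mark, the packet is destined for the tunnel; with it, it takes the real default route.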

To summarise, the Wireguard routing on the K8s hosts is fairly simple: there are a few routes that configure packets destined for the other peers to go through the Wireguard interface. The routing configuration on laptop is more complicated because it tries to reroute all traffic through the default gateway to go through Wireguard instead. We need to understand this because we’re going to mess with the routing rules in the next section.

Wireguard firewall rules

In addition to configuring routes, add_default also sets up some firewall rules. The commands are gnarly, but the resulting configuration is not:


# nft -s list ruleset
...
table ip raw {
        chain PREROUTING {
                type filter hook prerouting priority raw; policy accept;
                iifname != "wg0" ip daddr 10.10.0.2 fib saddr type != local  counter drop
        }
}
table ip mangle {
        chain POSTROUTING {
                type filter hook postrouting priority mangle; policy accept;
                meta l4proto udp mark 0xca6c  counter ct mark set mark
        }

        chain PREROUTING {
                type filter hook prerouting priority mangle; policy accept;
                meta l4proto udp  counter meta mark set ct mark
        }
}

These rules copy the fwmark from the packet to conntrack’s state, and back to the packet. Even though UDP is connectionless, the kernel still tries to guess which packets form a related “connection”, and these rules ensure the reply packets get the fwmark.

However, why we need related UDP packets to all have the mark is unclear to me. It’s certainly not necessary for the “route all traffic through the VPN” functionality. I suspect it’s a defence mechanism against somebody flooding the Wireguard port with bogus packets. As these wouldn’t have the fwmark, they could easily be filtered by a firewall rule and never reach Wireguard itself. But I haven’t been able to get confirmation of this from the Wireguard devs.

Bridging subnets

We’ve now seen all the subnets and routing tricks involved:

  • The physical machines are all connected in a Wireguard mesh.

  • The K8s hosts and the pods are connected in the container network (see Kubernetes networking).

  • The “fake” subnet of services is accessible from the K8s hosts.

Venn diagram of the subnets involved

We want to get from laptop, through Blog Service, to blog pod 1 or blog pod 2. The last hop will work thanks to K8s, but we have to handle laptop → Blog Service ourselves.

Let’s see what happens if we try to reach the service from laptop now:

$ curl -v 10.33.0.111:80
*   Trying 10.33.0.111:80...
... hangs ...

$ traceroute -n 10.33.0.111
traceroute to 10.33.0.111 (10.33.0.111), 30 hops max, 60 byte packets
 1  10.10.0.7  33.055 ms  33.179 ms  33.170 ms
 2  * * *
 3  * * *
 4  * * *
 5  * * *
 6  * * *
 ...

The packets go through 10.10.0.7 (vpn1), and then proceed to go on an adventure. We know the pod is just 1-3 hops away (depending on the path), so this is clearly wrong. More specifically, packets shouldn’t be going through the VPN box at all. They should go straight to a K8s host like 10.10.0.11.

We’ve seen enough ip route calls by now to know what to do:

# ip route add 10.33.0.0/24 dev wg0 scope link

# traceroute -n 10.33.0.111
traceroute to 10.33.0.111 (10.33.0.111), 30 hops max, 60 byte packets
 1  10.10.0.7  33.372 ms  33.275 ms  33.341 ms
 2  * * *
 3  * * *
 4  * * *
 5  * * *
 6  * * *

… That didn’t work. Let’s look at the routing table to figure out why:

# ip rule show
0:      from all lookup local
32764:  from all lookup main suppress_prefixlength 0
32765:  not from all fwmark 0xca6c lookup 51820
32766:  from all lookup main
32767:  from all lookup default

# ip route show table main
default via 192.168.1.1 dev enp0s31f6 proto dhcp src 192.168.1.117 metric 203
10.10.0.0/24 dev wg0 proto kernel scope link src 10.10.0.2
10.33.0.0/24 dev wg0 scope link
192.168.1.0/24 dev enp0s31f6 proto dhcp scope link src 192.168.1.117 metric 203

The route we added was in fact redundant. Because there was no fwmark, a packet to 10.33.0.111 would have gone through wg0 anyway thanks to rule 32765. So, the problem wasn’t the lack of a route. However, Wireguard doesn’t have any configuration for the 10.33.0.0/24 subnet, so it defaults to sending the packet to the 10.10.0.7 (vpn1) peer which has AllowedIPs = 0.0.0.0/0. But that peer also doesn’t know what to do with a 10.33.0.0/24 packet, so it sends it through its default gateway to the Internet, and that doesn’t work.
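
Wireguard’s side of this decision can be inspected directly: wg show prints the cryptokey routing table, i.e. which destinations are mapped to which peer. On the laptop, it looks roughly like this (one line per peer, public key first):

# wg show wg0 allowed-ips
JFvpmNopPF/r+P3A6co9pU+lZleXoT1ppvEv1jraW34=    0.0.0.0/0 ::/0 10.10.0.7/32 fd86:ea04:1111::7/128
... more peers ...

The first line is the vpn1 peer, and its 0.0.0.0/0 entry is the only match for 10.33.0.111, so that’s the peer that gets the packet.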

Let’s remove the manual route and try something else:

# ip route del 10.33.0.0/24 dev wg0 scope link

We add 10.33.0.0/24 to AllowedIPs for the kube2 peer:

[Peer] # vpn1 peer
PublicKey = JFvpmNopPF/r+P3A6co9pU+lZleXoT1ppvEv1jraW34=
Endpoint = 157.90.123.245:8123
AllowedIPs = 0.0.0.0/0,::/0,10.10.0.7/32,fd86:ea04:1111::7/128

[Peer] # kube2 peer
PublicKey = xCH3Y+h8rdrSz8DHjlUt+Gi44s9WOAs/95srtCeuDxE=
Endpoint = 5.9.100.179:8123
AllowedIPs = 10.33.0.0/24,10.10.0.11/32,fd86:ea04:1111::11/128

This causes wg-quick to add the same route we added manually before:

# ip route show table main
default via 192.168.1.1 dev enp0s31f6 proto dhcp src 192.168.1.117 metric 203
10.10.0.0/24 dev wg0 proto kernel scope link src 10.10.0.2
10.33.0.0/24 dev wg0 scope link
192.168.1.0/24 dev enp0s31f6 proto dhcp scope link src 192.168.1.117 metric 203

Additionally, the wg0 interface now knows to route 10.33.0.111 to the kube2 peer, so traceroute and curl work:

# traceroute -n 10.33.0.111
traceroute to 10.33.0.111 (10.33.0.111), 30 hops max, 60 byte packets
 1  10.33.0.111  37.428 ms  37.323 ms  37.264 ms

# curl -s 10.33.0.111 | head -n1
<!DOCTYPE html>

So, all we needed to do was add the services subnet of 10.33.0.0/24 to AllowedIPs for the K8s hosts.
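
The config file change requires re-running wg-quick to take effect. The same change can also be applied to a running interface by hand, mirroring the commands we saw when bringing up the mesh; note that wg set … allowed-ips replaces the peer’s whole list, so the existing entries have to be repeated:

wg set wg0 peer "xCH3Y+h8rdrSz8DHjlUt+Gi44s9WOAs/95srtCeuDxE=" \
  allowed-ips "10.33.0.0/24,10.10.0.11/32,fd86:ea04:1111::11/128"
ip route replace "10.33.0.0/24" dev "wg0" table "main"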

💭 Above, we added the route “10.33.0.0/24 dev wg0 scope link”. We could have instead tried “10.33.0.0/24 via 10.10.0.11 dev wg0 scope link” in order to specify an explicit gateway for the subnet. However, this fails with Error: Nexthop has invalid gateway.

I think what is going on here is that a gateway must be a machine that is reachable directly through a network device. Wireguard peers don’t work because the kernel doesn’t actually know how to reach them (they aren’t in ip neighbour). The Wireguard interface knows how to reach the peers, and all the kernel knows is to dump packets for them into the interface.

Mind you, I might be completely wrong about this. It’s hard to find documentation on how routing is supposed to work and what the specific rules are. There are lots of forum/mailing list/StackOverflow routing-related posts, but it all tends to be the same basic stuff.
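
One observation that is at least consistent with this theory: the kernel has neighbour entries for the LAN gateway, but none at all for the Wireguard peers, since a Wireguard interface has no link layer to speak of (the MAC address below is made up):

$ ip neighbour show dev enp0s31f6
192.168.1.1 lladdr aa:bb:cc:dd:ee:ff REACHABLE

$ ip neighbour show dev wg0
... no output ...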

Looking back

Wireguard is great in general, and it happens to make stitching together different subnets easy too. The trick described in this post let me get rid of the private ingress in my K8s cluster which makes for a simpler setup and removes the unnecessary middle-man process.

💭 This post is around 3,000 words long. It’s a lot of text to just say “add the services subnet to AllowedIPs”, but I think this is one of those situations where it’s important to understand how something works.

When I had to fix this problem for myself, I immediately guessed that I was missing a route, but after adding it, connecting to services still didn’t work. Because I had little faith in my understanding of the routing rules, I couldn’t conclude that the problem was somewhere else, so I spent hours fiddling with routing before looking at the Wireguard configuration and realizing that AllowedIPs needed to be changed.

Thanks to Francesco Mazzoli for reading drafts of this post.