I figured out how to access Kubernetes services on my cluster without the need for kubectl port-forward
or an ingress. It can all be done with Linux routing tables, and Wireguard makes this trivial to set up.
If you’d rather skip the preamble, and jump straight to “how to bridge subnets with Wireguard”, go here.
The goal
My setup is this:
- There are three machines in Hetzner running a Kubernetes (K8s) cluster,
- There is a VPN machine which proxies Internet connections from my laptop, and
- There is a Wireguard mesh connecting all the machines.

The laptop, vpn1, and kube* machines are connected in a full mesh. The kube* hosts are also part of the Service subnet and the Pod subnet.
The K8s cluster hosts several public Internet-facing services like this blog, but also some private services such as a Syncthing file dump. Concretely, each node hosts an ingress-nginx pod and an assortment of pods backing the aforementioned services.
The pods are the ones running the actual application code, and a client has several ways of reaching them.
- Client → Pod1 IP:Port. A client could theoretically connect directly to the IP:Port of a specific pod. In practice, this would only work on the K8s hosts themselves since that's the only place where the pod routing rules are set up. It would generally be a bad idea anyway since pods are constantly being re-created, and their IPs change every time.
- Client → Service ClusterIP:Port → Pod*. A client could also theoretically connect to the virtual IP:Port of a ClusterIP service. The connection would then be redirected to a live pod for the service. This is the cleanest solution, but it doesn't work in practice because the service routing rules are only set up on the K8s hosts. Later in this post, we will fix this, but first, let's look at the standard K8s solutions.
- Client → Service NodePort → Pod*. If we create a NodePort service, then the service gets bound to a static port on all the K8s hosts. Clients can connect to any of the hosts on that static port, and their connections will be redirected to a pod for the service (see the sketch below). This is the first solution we've seen that works in general, but it has the big downside of requiring us to manually assign a port to each service.
- Client → Ingress → Service ClusterIP:Port → Pod*. The recommended solution is to set up an ingress, which is essentially a reverse proxy that redirects inbound HTTP(S) connections to the right service based on some property of the request like its Host header. In my setup, I have ingress-nginx running on all the machines, and this page was served by the blog service through this ingress. The downsides are that it only works for HTTP(S), it requires an extra process to middle-man connections, and it requires a bit of extra configuration.
💭 There’s no fundamental reason why ingresses should only work with HTTP(S), but in practice, this is the case. HAProxy is the only one I could find that works for general TCP connections, but it’s clear from the clunkiness of the configuration that this was an afterthought.
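To make the NodePort option concrete, here is a sketch (the "blog" deployment name is just an assumed example, not necessarily what runs in my cluster):

$ kubectl expose deployment blog --port=80 --type=NodePort
$ kubectl get service blog -o jsonpath='{.spec.ports[0].nodePort}'

The second command prints the statically assigned port (by default from the 30000-32767 range) that every K8s host will now accept connections on for this service.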
With the preamble out of the way, here’s our goal: We want laptop to access some private K8s services over the encrypted Wireguard network. Since these are “secure” connections, they shouldn’t require authorization, or even TLS.
We could add the private services to the public ingress and restrict access based on some properties of the requests like the source IP address, but that seems brittle. One misconfiguration and the private services are open to the Internet.
We could create a second ingress just for the private services, and this feels like the right solution given K8s’ APIs, but it’s very awkward to do in practice. All the ingress implementations have defaults that work against us: they try to bind to all interfaces, they listen on many ports forcing us to pick and configure many ports for the private ingress, and they sometimes don’t even support running more than one instance in the same cluster. K8s itself sometimes tries to be helpful, and might expose a service through the wrong ingress. Specifically, it might assign an Ingress for a Service to the public IngressClass if we forgot to explicitly assign it to the private one.
💭 Note on terminology: What I’m calling an ingress in this post is actually an IngressClass in K8s. An Ingress is technically the configuration that exposes a Service through an IngressClass. However, the previous paragraph is the only place where the distinction between Ingress and IngressClass matters, so I’m going to continue calling either just “ingress”.
So, if we want to minimize the chance that a private service gets exposed to the Internet by mistake, we can’t use ingresses. This leaves the first three options listed above. As mentioned, connecting directly to pods doesn’t work because pods change their addresses frequently, and connecting to a service on a static port is annoying because we have to manage the assignment of static ports (and also it’s very hard to restrict services to only listen on the private Wireguard interface).
This leaves connecting to a service via its virtual ClusterIP address. This doesn’t work out of the box because we’re missing some routing rules, but this is something we can fix.
💭 If we have a certificate to access the cluster, we could also use kubectl port-forward to tunnel into a service from any machine that has access to the K8s apiserver. But the command is cumbersome to write, we have to manually assign ports, and the tunnel sometimes breaks after running for a while. It’s good enough for development, but we’re looking for a longer term solution.
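For reference, the tunnel in question looks something like this (a sketch; the service name and local port are assumptions):

$ kubectl port-forward service/blog 8080:80
$ curl localhost:8080   # in another terminal, while the tunnel is running

It works, but every service needs its own manually chosen local port, which is exactly the bookkeeping we’re trying to avoid.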
K8s service routing
How do K8s services actually work? When we create a Service, the apiserver assigns an IP address to it (from the subnet passed in --service-cluster-ip-range). This address will not change unless the Service is deleted and recreated. The interesting thing about this IP address is that it isn’t bound to any network interface anywhere. It is a “fake” address, and routing to it can never work.
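A quick way to see the assigned address (a sketch; "blog" is an assumed service name):

$ kubectl get service blog -o jsonpath='{.spec.clusterIP}'

The printed address comes from the --service-cluster-ip-range subnet, which in my cluster is 10.33.0.0/24.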
Let’s look at a concrete example. For a node in my cluster, I have the following addresses and subnets:
| name                    | address     | subnet mask |
| ----------------------- | ----------- | ----------- |
| public IP               | 5.9.100.179 | /32         |
| Wireguard IP            | 10.10.0.11  | /24         |
| container networking IP | 10.32.2.1   | /24         |
| blog Service IP         | 10.33.0.111 | -           |
For a full explanation of all these, see Kubernetes networking, but we can glean enough of an understanding by inspecting the routing table:
# ip route show
default via 5.9.100.161 dev enp2s0 proto static
5.9.100.161 dev enp2s0 proto static scope link
10.10.0.0/24 dev wg0 proto kernel scope link src 10.10.0.11
...
10.32.1.0/24 via 10.32.1.0 dev flannel.1 onlink
10.32.2.0/24 dev mynet proto kernel scope link src 10.32.2.1
10.32.4.0/24 via 10.32.4.0 dev flannel.1 onlink
...
We see:
- The default gateway is 5.9.100.161 which is directly reachable through the physical enp2s0 interface.
- All traffic to the Wireguard subnet of 10.10.0.0/24 goes through the wg0 network interface which knows how to reach the other Wireguard peers.
- Traffic to containers on different nodes (10.32.1.0/24 and 10.32.4.0/24) goes through the flannel.1 interface.
- Traffic to containers on the same node (10.32.2.0/24) goes through the mynet bridge to which the container veth devices are connected (see Kubernetes networking).
What is noticeably absent is any rule for connecting to the Service subnet of 10.33.0.0/24. So, what happens to a packet sent to the blog service at 10.33.0.111?
The answer is that kube-proxy redirects it to one of the pods backing the service. How it does this depends on how kube-proxy is configured:
- In iptables mode, it adds rules like the following to the firewall configuration (see below for how to inspect them yourself):

  Chain KUBE-SERVICES (2 references)
  target                    prot opt source     destination
  ...
  KUBE-SVC-TCOU7JCQXEZGVUNU tcp  --  0.0.0.0/0  10.33.0.111  tcp dpt:80
  ...

  Chain KUBE-SVC-TCOU7JCQXEZGVUNU (1 references)
  KUBE-SEP-JMPR5F7DNF2VZGXN all  --  0.0.0.0/0  0.0.0.0/0    statistic mode random probability 0.50000000000
  KUBE-SEP-GETXBXEJNGKLZGS3 all  --  0.0.0.0/0  0.0.0.0/0

  Chain KUBE-SEP-JMPR5F7DNF2VZGXN (1 references)
  target                    prot opt source     destination
  ...
  DNAT                      tcp  --  0.0.0.0/0  0.0.0.0/0    tcp to:10.32.1.84:80

  Chain KUBE-SEP-GETXBXEJNGKLZGS3 (1 references)
  target                    prot opt source     destination
  ...
  DNAT                      tcp  --  0.0.0.0/0  0.0.0.0/0    tcp to:10.32.4.149:80

  So, packets with destination 10.33.0.111:80 are passed through one of two chains with equal probability, and each chain rewrites the destination address to that of one of the pods backing the service (10.32.1.84:80 or 10.32.4.149:80). If the service and pod ports were different, they would be rewritten here.

- In ipvs mode, the same thing happens, but the setup is easier to inspect:

  # ipvsadm -Ln --tcp-service 10.33.0.111:80
  Prot LocalAddress:Port Scheduler Flags
    -> RemoteAddress:Port     Forward Weight ActiveConn InActConn
  TCP  10.33.0.111:80 rr
    -> 10.32.1.84:80          Masq    1      0          0
    -> 10.32.4.149:80         Masq    1      0          0

  So, a virtual server is created listening on 10.33.0.111:80, and the kernel round-robins (rr) connections to it to the pods backing the service (10.32.1.84:80 or 10.32.4.149:80) with equal weights. If the service and pod ports were different, they would be rewritten here.
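If you want to poke at this yourself on one of the K8s hosts, here is a sketch of pulling out the rules for a single service in iptables mode:

# iptables -t nat -L KUBE-SERVICES -n | grep 10.33.0.111

The DNAT chains live in the nat table, which is why they don’t show up in a plain iptables -L (that lists the filter table).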
To summarise, services have “fake” IP addresses, and although routing to them doesn’t work, the K8s hosts have rules that rewrite packets addressed to a service so that they go to one of the pods backing it instead. So, if we want to send a packet to a service from outside the K8s cluster, we just need to get it to one of the nodes, and then everything should just work.
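In fact, from one of the K8s hosts this already works today; it’s the same curl we’ll eventually run from laptop:

# curl -s 10.33.0.111 | head -n1

The rest of this post is about making that work from anywhere on the Wireguard mesh.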
Wireguard routing
In order to get packets to the K8s hosts, we go over Wireguard, but how does that work exactly? Let’s first look at the mesh configuration for the K8s hosts, and later we’ll look at the more complicated laptop configuration.
The Wireguard config file for the K8s hosts looks like this:
[Interface]
Address = 10.10.0.11/24
Address = fd86:ea04:1111::11/64
ListenPort = 8123
[Peer]
PublicKey = BVDPF763oS0iL3+DihB5p0oF6FWbj+9XZx9dPK2LTkI=
Endpoint = 168.119.33.208:8123
AllowedIPs = 10.10.0.10/32,fd86:ea04:1111::10/128
... more peers ...
Bringing up the interface involves executing commands like the following:
modprobe wireguard || true
ip link add dev "wg0" type wireguard
ip address add "10.10.0.11/24" dev "wg0"
ip address add "fd86:ea04:1111::11/64" dev "wg0"
wg set "wg0" \
private-key "/etc/wireguard/key.priv" \
listen-port "8123"
ip link set up dev "wg0"
wg set wg0 peer "BVDPF763oS0iL3+DihB5p0oF6FWbj+9XZx9dPK2LTkI=" \
endpoint "168.119.33.208:8123" \
allowed-ips "10.10.0.10/32,fd86:ea04:1111::10/128"
ip route replace "10.10.0.10/32" dev "wg0" table "main"
ip route replace "fd86:ea04:1111::10/128" dev "wg0" table "main"
... more peers ...
The wg0 device is created, its address is set, the private key is configured, and the listen port is assigned. Each peer is then set up individually: the peer is created with a specific public Internet-facing address (endpoint), and is configured to accept packets for a given subnet (allowed-ips). Routing to the subnet is then configured to go through the wg0 interface.
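As a sanity check, wg can list which subnets each peer is configured to accept:

# wg show wg0 allowed-ips

This prints one line per peer: its public key followed by the allowed-ips configured above. We’ll come back to this list when we bridge the service subnet.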
The resulting routing table looks like this:
# ip rule show
0: from all lookup local
32766: from all lookup main
32767: from all lookup default
# ip route show table main
default via 5.9.100.161 dev enp2s0 proto static
5.9.100.161 dev enp2s0 proto static scope link
10.10.0.0/24 dev wg0 proto kernel scope link src 10.10.0.11
10.10.0.2 dev wg0 scope link
10.10.0.7 dev wg0 scope link
10.10.0.9 dev wg0 scope link
10.10.0.10 dev wg0 scope link
10.10.0.12 dev wg0 scope link
10.10.0.13 dev wg0 scope link
... kubernetes routes ...
So, when we send a packet to 10.10.0.10, Linux first checks the local table, but doesn’t find anything there because it’s just localhost and broadcast rules. Then, it tries the main table and finds the longest matching route “10.10.0.10 dev wg0 scope link”. It gives the packet to the wg0 interface which takes care of encapsulating it and sending it to 168.119.33.208 (the endpoint of the peer). This new packet wouldn’t match any specific rule in main, so it would match the default rule, and get sent to the default gateway of 5.9.100.161. This is pretty simple stuff, but it gets more complicated…
If you’ve never encountered ip rule before, read the manpage, and search the Internet for “linux policy-based routing”. There are a lot of resources, but none of them are particularly good.
My laptop uses wg-quick to ensure that all outbound Internet traffic goes through my VPN. The trick it uses is described in the official docs, and we can see it in action by reading the wg-quick shell script.
The Wireguard config file looks very similar to the one for the K8s hosts (this is a mesh after all, so all the configs should have the same structure). The important difference is that on laptop, the AllowedIPs for peer vpn1 contains 0.0.0.0/0:
...
[Peer]
PublicKey = JFvpmNopPF/r+P3A6co9pU+lZleXoT1ppvEv1jraW34=
Endpoint = 157.90.123.245:8123
AllowedIPs = 0.0.0.0/0,::/0,10.10.0.7/32,fd86:ea04:1111::7/128
...
A subnet mask of /0 means “all addresses”, and when wg-quick encounters this in add_route, it calls add_default to add rules that route all outgoing traffic through wg0:
wg set wg0 fwmark 51820
ip route add "0.0.0.0/0" dev "wg0" table 51820
ip rule add not fwmark 51820 table 51820
ip rule add table main suppress_prefixlength 0
The resulting routing table looks like this:
# ip rule show
0: from all lookup local
32764: from all lookup main suppress_prefixlength 0
32765: not from all fwmark 0xca6c lookup 51820
32766: from all lookup main
32767: from all lookup default
# ip route show table 51820
default dev wg0 scope link
# ip route show table main
default via 192.168.1.1 dev enp0s31f6 proto dhcp src 192.168.1.117 metric 203
10.10.0.0/24 dev wg0 proto kernel scope link src 10.10.0.2
192.168.1.0/24 dev enp0s31f6 proto dhcp scope link src 192.168.1.117 metric 203
...
The goal of these rules is to reroute all packets that would have gone through the default gateway to go through wg0 instead. However, Wireguard’s own traffic still has to go through the main routing table and the default route there (otherwise we would have a routing loop). Additionally, any routes which were specifically configured in the main routing table should still work.
Let’s follow a packet for 8.8.8.8 (i.e. an Internet packet) through this:
- 0: from all lookup local
  It first goes through the local table, but doesn't match anything there.
- 32764: from all lookup main suppress_prefixlength 0
  It hits the suppress_prefixlength 0 rule, and does a lookup in the main table. There are no specific rules for 8.8.8.8, so it only matches the default rule. However, that has a prefix length of 0 (because default just means 0.0.0.0/0), so this gets suppressed, and the matching continues.
- 32765: not from all fwmark 0xca6c lookup 51820
  The next rule parses like not ((from all) && (fwmark 0xca6c)). The 0xca6c number is 51820 in hexadecimal. The fwmark referenced is the "firewall mark" that can be used to tag network packets in Linux. There's no fwmark on this packet, so this rule evaluates to true, and the packet goes through table 51820. Here, it matches the only route, so it gets sent to interface wg0.
- Once in wg0, the packet is encapsulated into a Wireguard envelope, and the fwmark is set because of the wg set wg0 fwmark 51820 from earlier. This new packet has destination 157.90.123.245 (vpn1) because we configured the VPN peer to accept all IPs.
- 0: from all lookup local
  The new packet goes through the routing rules again. It doesn't match anything in the local table.
- 32764: from all lookup main suppress_prefixlength 0
  It hits the suppress_prefixlength 0 rule, goes through the main table, matches only the default route, so gets suppressed.
- 32765: not from all fwmark 0xca6c lookup 51820
  The next rule doesn't match because the fwmark is set.
- 32766: from all lookup main
  The next rule is the main routing table. It doesn't match anything specific, so it goes to the default route. From here, the packet leaves localhost, and hopefully reaches the VPN machine.
If we instead followed a packet for 192.168.1.127 (i.e. a local area network packet), we’d hit the suppress_prefixlength 0 rule, find the 192.168.1.0/24 dev enp0s31f6 route in the main table, and since its prefix is greater than 0, it would not get suppressed, so matching would end, and the packet wouldn’t go through Wireguard.
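You don’t have to trace this by hand every time; the kernel can be asked which route it would pick (the mark argument simulates a packet that already carries the fwmark):

# ip route get 8.8.8.8
# ip route get 8.8.8.8 mark 0xca6c
# ip route get 192.168.1.127

The first resolves through table 51820 to wg0, the second (marked, i.e. Wireguard’s own traffic) goes to the default gateway in main, and the third matches the LAN route in main.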
To summarise, the Wireguard routing on the K8s hosts is fairly simple: there are a few routes that send packets destined for the other peers through the Wireguard interface. The routing configuration on laptop is more complicated because it reroutes all traffic that would have gone through the default gateway through Wireguard instead. We need to understand this because we’re going to mess with the routing rules in the next section.
Wireguard firewall rules
In addition to configuring routes, add_default also sets up some firewall rules. The commands are gnarly, but the resulting configuration is not:
# nft -s list ruleset
...
table ip raw {
chain PREROUTING {
type filter hook prerouting priority raw; policy accept;
iifname != "wg0" ip daddr 10.10.0.2 fib saddr type != local counter drop
}
}
table ip mangle {
chain POSTROUTING {
type filter hook postrouting priority mangle; policy accept;
meta l4proto udp mark 0xca6c counter ct mark set mark
}
chain PREROUTING {
type filter hook prerouting priority mangle; policy accept;
meta l4proto udp counter meta mark set ct mark
}
}
These rules copy the fwmark from the packet to conntrack’s state, and back to the packet. Even though we’re talking about stateless UDP connections, the kernel still tries to guess which packets form a related “connection”, and these rules will ensure the reply packets get the fwmark.
However, why we need related UDP packets to all have the mark is unclear to me. It’s certainly not necessary for the “route all traffic through the VPN” functionality. I suspect it’s a defence mechanism against somebody flooding the Wireguard port with bogus packets. As these wouldn’t have the fwmark, they could easily be filtered by a firewall rule and never reach Wireguard itself. But I haven’t been able to get confirmation of this from the Wireguard devs.
Bridging subnets
We’ve now seen all the subnets and routing tricks involved:
- The physical machines are all connected in a Wireguard mesh.
- The K8s hosts and the pods are connected in the container network (see Kubernetes networking).
- The “fake” subnet of services is accessible from the K8s hosts.
We want to get from laptop, through Blog Service, to blog pod 1 or blog pod 2. The last hop will work thanks to K8s, but we have to handle laptop → Blog Service ourselves.
Let’s see what happens if we try to reach the service from laptop now:
$ curl -v 10.33.0.111:80
* Trying 10.33.0.111:80...
... hangs ...
$ traceroute -n 10.33.0.111
traceroute to 10.33.0.111 (10.33.0.111), 30 hops max, 60 byte packets
1 10.10.0.7 33.055 ms 33.179 ms 33.170 ms
2 * * *
3 * * *
4 * * *
5 * * *
6 * * *
...
The packets go through 10.10.0.7 (vpn1), and then proceed to go on an adventure. We know the pod is just 1-3 hops away (depending on the path), so this is clearly wrong. More specifically, packets shouldn’t be going through the VPN box at all. They should go straight to a K8s host like 10.10.0.11.
We’ve seen enough ip route calls by now to know what to do:
# ip route add 10.33.0.0/24 dev wg0 scope link
# traceroute -n 10.33.0.111
traceroute to 10.33.0.111 (10.33.0.111), 30 hops max, 60 byte packets
1 10.10.0.7 33.372 ms 33.275 ms 33.341 ms
2 * * *
3 * * *
4 * * *
5 * * *
6 * * *
… That didn’t work. Let’s look at the routing table to figure out why:
# ip rule show
0: from all lookup local
32764: from all lookup main suppress_prefixlength 0
32765: not from all fwmark 0xca6c lookup 51820
32766: from all lookup main
32767: from all lookup default
# ip route show table main
default via 192.168.1.1 dev enp0s31f6 proto dhcp src 192.168.1.117 metric 203
10.10.0.0/24 dev wg0 proto kernel scope link src 10.10.0.2
10.33.0.0/24 dev wg0 scope link
192.168.1.0/24 dev enp0s31f6 proto dhcp scope link src 192.168.1.117 metric 203
The route we added was in fact redundant. Because there was no fwmark, a packet to 10.33.0.111 would have gone through wg0 anyway thanks to rule 32765. So, the problem wasn’t the lack of a route. However, Wireguard doesn’t have any configuration for the 10.33.0.0/24 subnet, so it defaults to sending the packet to the 10.10.0.7 (vpn1) peer which has AllowedIPs = 0.0.0.0/0. But that peer also doesn’t know what to do with a 10.33.0.0/24 packet, so it sends it through its default gateway to the Internet, and that doesn’t work.
Let’s remove the manual route and try something else:
# ip route del 10.33.0.0/24 dev wg0 scope link
We add 10.33.0.0/24 to AllowedIPs for the kube2 peer:
[Peer] # vpn1 peer
PublicKey = JFvpmNopPF/r+P3A6co9pU+lZleXoT1ppvEv1jraW34=
Endpoint = 157.90.123.245:8123
AllowedIPs = 0.0.0.0/0,::/0,10.10.0.7/32,fd86:ea04:1111::7/128
[Peer] # kube2 peer
PublicKey = xCH3Y+h8rdrSz8DHjlUt+Gi44s9WOAs/95srtCeuDxE=
Endpoint = 5.9.100.179:8123
AllowedIPs = 10.33.0.0/24,10.10.0.11/32,fd86:ea04:1111::11/128
This causes Wireguard to add the same route:
# ip route show table main
default via 192.168.1.1 dev enp0s31f6 proto dhcp src 192.168.1.117 metric 203
10.10.0.0/24 dev wg0 proto kernel scope link src 10.10.0.2
10.33.0.0/24 dev wg0 scope link
192.168.1.0/24 dev enp0s31f6 proto dhcp scope link src 192.168.1.117 metric 203
Additionally, the wg0 interface now knows to route 10.33.0.111 to the kube2 peer, so traceroute and curl work:
# traceroute -n 10.33.0.111
traceroute to 10.33.0.111 (10.33.0.111), 30 hops max, 60 byte packets
1 10.33.0.111 37.428 ms 37.323 ms 37.264 ms
# curl -s 10.33.0.111 | head -n1
<!DOCTYPE html>
So, all we needed to do was add the services subnet of 10.33.0.0/24 to AllowedIPs for the K8s hosts.
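For completeness, here’s a sketch of making the same change at runtime instead of re-creating the interface (public key and subnets taken from the config above). Note that wg set ... allowed-ips replaces the peer’s whole list, so the existing entries have to be repeated:

# wg set wg0 peer "xCH3Y+h8rdrSz8DHjlUt+Gi44s9WOAs/95srtCeuDxE=" \
    allowed-ips "10.33.0.0/24,10.10.0.11/32,fd86:ea04:1111::11/128"
# ip route replace "10.33.0.0/24" dev "wg0" scope link

The config file change is still what makes this stick across the interface going down and up again.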
💭 Above, we added the route “10.33.0.0/24 dev wg0 scope link”. We could have instead tried “10.33.0.0/24 via 10.10.0.11 dev wg0 scope link” in order to specify an explicit gateway for the subnet. However, this fails with Error: Nexthop has invalid gateway.
I think what is going on here is that a gateway must be a machine that is reachable directly through a network device. Wireguard peers don’t work because the kernel doesn’t actually know how to reach them (they aren’t in ip neighbour). The Wireguard interface knows how to reach the peers, and all the kernel knows is to dump packets for them into the interface.
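A quick check on laptop is consistent with this (a sketch):

# ip neighbour show dev wg0

This prints nothing, because a Wireguard interface has no link-layer neighbours, whereas the same command for enp0s31f6 will typically show the LAN gateway.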
Mind you, I might be completely wrong about this. It’s hard to find documentation on how routing is supposed to work and what the specific rules are. There are lots of forum/mailing list/StackOverflow routing-related posts, but it all tends to be the same basic stuff.
Looking back
Wireguard is great in general, and it happens to make stitching together different subnets easy too. The trick described in this post let me get rid of the private ingress in my K8s cluster, which makes for a simpler setup and removes the unnecessary middle-man process.
💭 This post is around 3,000 words long. It’s a lot of text to just say “add the services subnet to AllowedIPs”, but I think this is one of those situations where it’s important to understand how something works.
When I had to fix this problem for myself, I immediately guessed that I was missing a route, but after adding it, connecting to services still didn’t work. Because I had little faith in my understanding of the routing rules, I couldn’t conclude that the problem was somewhere else, so I spent hours fiddling with routing before looking at the Wireguard configuration and realizing that AllowedIPs probably needed to be changed.
Thanks to Francesco Mazzoli for reading drafts of this post.