Kubernetes networking is complicated. It’s not complex, mind you, as there’s no emergent behaviour. It’s just complicated because there are lots of moving parts that are used in different circumstances. Let’s explore how the parts fit together by walking through several scenarios:

There are many ways to configure this in Kubernetes. Here, we specifically talk about a 1.21 cluster with iptables proxying and flannel networking.

Pod-to-pod connections on the same node

In Kubernetes, containers are grouped into pods. Each pod has its own network namespace, which roughly means it has its own interfaces with their own addresses. These are private and not shared with either the host or any other pods.
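
These namespaces are visible from the host. As a quick, hedged example (assuming util-linux's lsns is installed on the node), we can list them and the processes that hold them:

## List all network namespaces and the processes that hold them
root@fsn-qws-app1 ~ # lsns --type net
## The CNI-created namespaces (named cni-...) are also visible to iproute2
root@fsn-qws-app1 ~ # ip netns list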

Let’s start a fedora:34 pod on host fsn-qws-app1 and check its networking:

scvalex@toolbox ~ $ kubectl run debug-pod1 \
  --image=fedora:34 \
  --restart=Never \
  --overrides='{ "apiVersion": "v1", "spec": { "nodeSelector": { "kubernetes.io/hostname": "fsn-qws-app1" } } }' \
  -- sleep 1d
scvalex@toolbox ~ $ kubectl exec --stdin --tty debug-pod1 -- /bin/bash
[root@debug-pod1 /]# dnf install dnsutils ethtool iproute iputils traceroute
[root@debug-pod1 /]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
3: eth0@if183: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1370 qdisc noqueue state UP group default
    link/ether 42:ef:cb:6d:77:8e brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.32.1.105/24 brd 10.32.1.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::40ef:cbff:fe6d:778e/64 scope link
       valid_lft forever preferred_lft forever

We see that the pod has a loopback interface, lo. This is shared among all the containers in the pod, and they can use it to talk to each other.

The pod also has an eth0 network interface with IP address 10.32.1.105. This is a veth device, which is basically a virtual ethernet cable. The veth device comes as a pair of network interfaces: one in the pod's network namespace and one on the host.

[root@debug-pod1 /]# ip -details link show eth0
3: eth0@if183: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1370 qdisc noqueue state UP mode DEFAULT group default
    link/ether 42:ef:cb:6d:77:8e brd ff:ff:ff:ff:ff:ff link-netnsid 0 promiscuity 0 minmtu 68 maxmtu 65535
    veth addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
[root@debug-pod1 /]# ethtool -S eth0 | grep ifindex
     peer_ifindex: 183

root@fsn-qws-app1 ~ # ip link | grep '^183:'
183: vethc7f94ee2@if3:...
root@fsn-qws-app1 ~ # ip -details link show vethc7f94ee2
183: vethc7f94ee2@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1370 qdisc noqueue master mynet state UP mode DEFAULT group default
    link/ether e6:be:1f:86:d5:78 brd ff:ff:ff:ff:ff:ff link-netns cni-24a578d7-6d3a-b593-575c-b629c8591f70 promiscuity 1 minmtu 68 maxmtu 65535
    veth
    bridge_slave state forwarding priority 32 cost 2 hairpin off guard off root_block off fastleave off learning on flood on port_id 0x8017 port_no 0x17 designated_port 32791 designated_cost 0 designated_bridge 8000.e:56:4e:fe:d:46 designated_root 8000.e:56:4e:fe:d:46 hold_timer    0.00 message_age_timer    0.00 forward_delay_timer    0.00 topology_change_ack 0 config_pending 0 proxy_arp off proxy_arp_wifi off mcast_router 1 mcast_fast_leave off mcast_flood on mcast_to_unicast off neigh_suppress off group_fwd_mask 0 group_fwd_mask_str 0x0 vlan_tunnel off isolated off addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535

There are lots of veth interfaces on the host (one for each pod), and we identify the one for this particular pod by its peer_ifindex.
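
As an aside, nothing about this is Kubernetes-specific. Here's a rough sketch of the kind of thing the CNI bridge plugin does for each pod, using made-up names (test-ns, veth-host, veth-pod) and a made-up address:

## Create a scratch network namespace to stand in for a pod
ip netns add test-ns
## Create the veth pair and move one end into the namespace
ip link add veth-host type veth peer name veth-pod
ip link set veth-pod netns test-ns
## Give the "pod" end an address and bring both ends up
ip netns exec test-ns ip addr add 10.32.1.200/24 dev veth-pod
ip netns exec test-ns ip link set veth-pod up
ip link set veth-host up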

What is the pod’s eth0 connected to? Looking at the routes, we see that:

  • packets to the 10.32.1.0/24 subnet go directly through eth0; this is the subnet of pods on the same host and is the subject of this section,
  • packets to the 10.32.0.0/16 subnet are routed via 10.32.1.1; this is the subnet of all pods (regardless of host) and is the subject of the “Pod-to-other-node connections” section, and
  • other packets are routed via the default gateway of 10.32.1.1.
[root@debug-pod1 /]# ip route
default via 10.32.1.1 dev eth0
10.32.0.0/16 via 10.32.1.1 dev eth0
10.32.1.0/24 dev eth0 proto kernel scope link src 10.32.1.105

We can see this setup in action by tracerouting another pod on the same host, a pod on a different host, and an external host like example.com:

[root@debug-pod1 /]# traceroute -n 10.32.1.8
traceroute to 10.32.1.8 (10.32.1.8), 30 hops max, 60 byte packets
 1  10.32.1.8  0.053 ms  0.016 ms  0.012 ms

[root@debug-pod1 /]# traceroute -n 10.32.0.197
traceroute to 10.32.0.197 (10.32.0.197), 30 hops max, 60 byte packets
 1  10.32.1.1  0.034 ms  0.009 ms  0.008 ms
 2  10.32.0.0  25.107 ms  25.085 ms  25.071 ms
 3  10.32.0.197  25.084 ms  25.068 ms  25.077 ms

[root@debug-pod1 /]# traceroute example.com
traceroute to example.com (93.184.216.34), 30 hops max, 60 byte packets
 1  _gateway (10.32.1.1)  0.025 ms  0.011 ms  0.013 ms
 2  static.193.33.119.168.clients.your-server.de (168.119.33.193)  0.501 ms  0.525 ms  0.508 ms
 3  core24.fsn1.hetzner.com (213.239.245.209)  3.912 ms *  10.690 ms
...

How do packets from one pod get to another pod on the same host? Looking at the host’s interfaces, we find that the 10.32.1.1 address is assigned to a bridge called mynet.

root@fsn-qws-app1 ~ # ip addr show mynet
6: mynet: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1370 qdisc noqueue state UP group default qlen 1000
    link/ether 0e:56:4e:fe:0d:46 brd ff:ff:ff:ff:ff:ff
    inet 10.32.1.1/24 brd 10.32.1.255 scope global mynet
       valid_lft forever preferred_lft forever
    inet6 fe80::c56:4eff:fefe:d46/64 scope link
       valid_lft forever preferred_lft forever

root@fsn-qws-app1 ~ # ip -details link show mynet
6: mynet: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1370 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 0e:56:4e:fe:0d:46 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
    bridge forward_delay 1500 hello_time 200 max_age 2000 ageing_time 30000 stp_state 0 priority 32768 vlan_filtering 0 vlan_protocol 802.1Q bridge_id 8000.e:56:4e:fe:d:46 designated_root 8000.e:56:4e:fe:d:46 root_port 0 root_path_cost 0 topology_change 0 topology_change_detected 0 hello_timer    0.00 tcn_timer    0.00 topology_change_timer    0.00 gc_timer  201.70 vlan_default_pvid 1 vlan_stats_enabled 0 vlan_stats_per_port 0 group_fwd_mask 0 group_address 01:80:c2:00:00:00 mcast_snooping 1 mcast_router 1 mcast_query_use_ifaddr 0 mcast_querier 0 mcast_hash_elasticity 16 mcast_hash_max 4096 mcast_last_member_count 2 mcast_startup_query_count 2 mcast_last_member_interval 100 mcast_membership_interval 26000 mcast_querier_interval 25500 mcast_query_interval 12500 mcast_query_response_interval 1000 mcast_startup_query_interval 3125 mcast_stats_enabled 0 mcast_igmp_version 2 mcast_mld_version 1 nf_call_iptables 0 nf_call_ip6tables 0 nf_call_arptables 0 addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535

The mynet bridge connects all the pods on the same host into a single network. Looking at the veth device above, we see that it is a bridge_slave whose master is mynet.
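
Continuing the sketch from the earlier aside, creating a bridge like mynet and slaving the host-side veth to it boils down to something like this (again, illustrative names and addresses):

## Create the bridge and give it the gateway address for the node's pod subnet
ip link add mynet type bridge
ip addr add 10.32.1.1/24 dev mynet
ip link set mynet up
## Slave the host-side veth to the bridge
ip link set veth-host master mynet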

So, when a pod sends a packet to another pod on the same host:

  • it sends the packet through its veth interface,
  • the packet comes out of the veth pair on the host,
  • gets sent to the bridge,
  • gets broadcast to all the veth devices slaved to the bridge, and
  • emerges out of the other pod’s veth interface.

That’s all there is to it. To summarize, each pod has one half of a veth device pair, and the host has the other half. The pod’s routing table is set up to send packets to other pods on the same node through this interface. The host has a bridge interface, and all the pod veth interfaces are slaved to it.

Aside: Pod IP address allocation

We’ve seen the pod’s IP address 10.32.1.105, the node’s subnet 10.32.1.0/24, and the subnet of all pods 10.32.0.0/16. Where did these specific values come from?

Going largest-to-smallest, the subnet of all pods, 10.32.0.0/16, is the --cluster-cidr argument passed to kube-controller-manager and kube-proxy, and is also the Network option in flannel’s configuration file.

The node got its subnet of 10.32.1.0/24 when it first joined the cluster. The kube-controller-manager’s ipam plugin allocated a /24 subnet to it at that point. This means that our cluster can have at most 256 nodes (although the size of each node’s subnet can be tweaked in the kube-controller-manager arguments). We can find a node’s subnet in the podCIDR field in its spec:

scvalex@toolbox ~ $ kubectl get node fsn-qws-app1 -o json | jq '.spec.podCIDR'
"10.32.1.0/24"

The pod was allocated its IP address of 10.32.1.105 from the node’s subnet when it was created by the kubelet. This means each of our nodes can have at most 256 pods (although the actual number is lower due to special addresses).
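
The pod’s address is likewise visible from outside the cluster; assuming debug-pod1 is still running:

## The pod's address, straight from the API server
scvalex@toolbox ~ $ kubectl get pod debug-pod1 -o jsonpath='{.status.podIP}'
10.32.1.105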

Pod-to-other-node connections

We’ve seen how intra-node networking works: it’s a combination of veth devices, routes, and bridges. What happens if a pod sends a packet to a pod on a different node?

Starting with the pod, there’s no specific route for “pod on a different node”, but there is a route for the whole pod subnet (10.32.0.0/16). Using it, the packet gets routed via the host (the host has the address ending in .1 from the node’s subnet of 10.32.1.0/24).

[root@debug-pod1 /]# ip route
...
10.32.0.0/16 via 10.32.1.1 dev eth0
...

On the host side, the packet first appears on the host’s half of the veth pair, is then sent to the bridge (because the veth device is slaved to it), and finally reaches the mynet interface itself, which has the 10.32.1.1 address.

Assuming the host has IPv4 forwarding set up, it will route the packet to somewhere else. The forwarding settings relevant here are the net.ipv4.conf.(all|default).forwarding options in sysctl. Additionally, the FORWARD chain in iptables filters the packets, and the POSTROUTING chain in the nat table rewrites the packets’ source addresses. The PREROUTING chain also plays a part, but only when Kubernetes services are involved, so we’ll look at it in a later section.

root@fsn-qws-app1 ~ # sysctl net.ipv4.conf.all.forwarding
net.ipv4.conf.all.forwarding = 1
root@fsn-qws-app1 ~ # sysctl net.ipv4.conf.default.forwarding
net.ipv4.conf.default.forwarding = 1

root@fsn-qws-app1 ~ # iptables -nL FORWARD
Chain FORWARD (policy DROP)
target     prot opt source               destination
...
ACCEPT     all  --  10.32.0.0/16         0.0.0.0/0
ACCEPT     all  --  0.0.0.0/0            10.32.0.0/16
...

root@fsn-qws-app1 ~ # iptables -nL POSTROUTING -t nat
Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination
...
CNI-c92e884a69aa6d8a543a5e80  all  --  10.32.1.105          0.0.0.0/0            /* name: "mynet" id: "aa83dddeefb3336b4c5dd4342b57269f89fc78ea60f97ff22b8ffaa7881188bc" */
...

root@fsn-qws-app1 ~ # iptables -nL CNI-c92e884a69aa6d8a543a5e80 -t nat
Chain CNI-c92e884a69aa6d8a543a5e80 (1 references)
target     prot opt source               destination
ACCEPT     all  --  0.0.0.0/0            10.32.1.0/24         /* name: "mynet" id: "aa83dddeefb3336b4c5dd4342b57269f89fc78ea60f97ff22b8ffaa7881188bc" */
MASQUERADE  all  --  0.0.0.0/0           !224.0.0.0/4          /* name: "mynet" id: "aa83dddeefb3336b4c5dd4342b57269f89fc78ea60f97ff22b8ffaa7881188bc" */

The FORWARD chain rules say “accept packets for forwarding to and from the subnet of all pods”. The POSTROUTING chain rules say “when forwarding a packet from a valid pod IP address, just accept it if it’s destined for a pod on the same host, or enable masquerading for it otherwise”. So, when a pod’s packet emerges from the mynet interface, the host will accept it locally, or it will send the packet onward and pretend like it came from the host itself.

Judging by the CNI- chain names and the “mynet” comments, these per-pod POSTROUTING rules are installed by the CNI bridge plugin when the pod is created; the per-service PREROUTING rules, by contrast, are managed by kube-proxy.

One more thing to add here is that hosts need to be configured to accept packets coming from the mynet bridge interface in the INPUT chain in iptables. Otherwise, packets will be dropped before they have a chance of being forwarded.
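
I won’t reproduce my whole firewall configuration here, but the rule in question would be along these lines:

## Accept traffic arriving on the pod bridge; without something like this, a
## restrictive INPUT policy drops pod traffic addressed to the host
iptables -A INPUT -i mynet -j ACCEPT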

To see where the masqueraded packet gets sent, we look at the host’s routing table:

root@fsn-qws-app1 ~ # ip route
...
10.32.0.0/24 via 10.32.0.0 dev flannel.1 onlink
10.32.1.0/24 dev mynet proto kernel scope link src 10.32.1.1
...

Packets destined for pods on the same host (10.32.1.0/24) get sent to the bridge, and otherwise, packets destined for the CIDR of a different node (10.32.0.0/24) go through the flannel.1 interface. Note that the prefix length is /24 because it’s the CIDR of a single other node, not /16, which would cover the whole pod subnet.

Looking at the devices, we see flannel.1 is a vxlan device:

root@fsn-qws-app1 ~ # ip -details link show flannel.1
86: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1370 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/ether 16:3f:5b:3b:d2:dc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
    vxlan id 1 local 10.10.0.10 dev wg0 srcport 0 0 dstport 8472 nolearning ttl auto ageing 300 udpcsum noudp6zerocsumtx noudp6zerocsumrx addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535

This flannel.1 interface is a VXLAN device managed by the flanneld daemon, which keeps track of the other hosts running Flannel. Packets routed to it get encapsulated in UDP/IP envelopes and forwarded to the right node. These envelopes aren’t encrypted, but that’s fine in my case because the nodes are actually connected over a Wireguard mesh. On the other host, the packets come out of its flannel.1 interface, but we’ll talk about that in the “Other-node-to-pod connections” section.
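
Note the nolearning flag in the output above: flanneld itself programs the VXLAN forwarding database and neighbour entries for the other nodes, and we can peek at them with something like:

## Which remote VTEPs (other nodes) does the kernel know about?
root@fsn-qws-app1 ~ # bridge fdb show dev flannel.1
## Which gateway MACs map to which node subnets?
root@fsn-qws-app1 ~ # ip neigh show dev flannel.1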


To summarize, when a pod sends a packet to a pod on a different node, it goes on quite the journey:

  • the packet is sent via the pod’s veth device pair,
  • emerges on the host’s half of the pair,
  • gets sent to the bridge,
  • emerges out of the bridge’s interface,
  • maybe gets masqueraded by iptables, and
  • gets routed to the other node via the flannel.1 interface.

An interesting effect of the above is that, outside the host, nobody can tell which pod the packet actually came from because it’s rewritten to originate from the host itself.

Other-node-to-pod connections

We’ve seen what happens when a pod sends packets, but how does it receive them? A pod can receive packets from another pod on the same node, a pod on a different node, or from outside the cluster. The first case was the subject of the “Pod-to-pod connections on the same node” section, this section deals with the second case, and we’ll leave discussing the third case for a different post.

When a node forwards a pod’s packet to a different host, it enables masquerading and sends the packet via the flannel.1 interface. On the receiving end, the packet emerges from the flannel.1 interface with the source address rewritten to be that of the sending node.

In order to keep talking about the same addresses, let’s flip the discussion here and say that our host is receiving a packet destined for our pod. So, a packet with destination 10.32.1.105 comes out of the flannel.1 interface. Looking at the host routes, this will be sent to the mynet bridge:

root@fsn-qws-app1 ~ # ip route
...
10.32.1.0/24 dev mynet proto kernel scope link src 10.32.1.1
...

The iptables rules in the FORWARD chain allow this packet through because it’s destined for the pod CIDR, and the bridge broadcasts this packet to all the slaved veth devices. The packet then emerges from the pod’s half of the veth pair.

On the return trip, the packets have as their destination the other node’s flannel.1 address, and not the other pod’s address. This is because the source address was rewritten. They get sent as described in the “Pod-to-other-node connections” section. The one extra bit of trickiness is how the other node knows to route the packets addressed to it to the right pod. The answer is that, as part of masquerading, it keeps track of live connections, and thus knows how to “unmasquerade” the returning packets.
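
The connection tracking table that makes this un-masquerading possible can be inspected on the host; assuming conntrack-tools is installed, something like this shows the tracked connections for our pod’s traffic:

## Show tracked connections involving the pod's address
root@fsn-qws-app1 ~ # conntrack -L | grep 10.32.1.105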


To summarize, when a node receives a packet destined for one of its pods:

  • the packet comes out of the flannel.1 interface,
  • gets routed to the bridge,
  • gets broadcast to all the slaved veth devices, and
  • emerges out of the pod’s veth interface.

Aside: Cluster DNS

We’ve seen how pods talk to each other, be they on the same node or on different nodes. However, we’d rather not refer to pods by their ephemeral IP addresses, but by their names, or better yet, by the names of their services. To do this name resolution, the Kubernetes cluster is expected to run a kube-system/kube-dns service. In practice, this is likely to be a CoreDNS deployment.

The address of this DNS service is configured statically (10.33.0.254 in our case) and is passed as the --cluster-dns argument to kubelet. The same address must be set as the clusterIP in CoreDNS’s Service spec.
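
We can confirm the address by looking at the service itself:

scvalex@toolbox ~ $ kubectl -n kube-system get svc kube-dns -o jsonpath='{.spec.clusterIP}'
10.33.0.254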

When a pod is started, kubelet sets this address as the nameserver in the pod’s /etc/resolv.conf:

search default.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.33.0.254
options ndots:5
A pod's resolv.conf

So, a pod knows the address of the DNS service and can query it. However, this address is not a pod’s address because it doesn’t belong to the pod CIDR of 10.32.0.0/16. It is a service address, and reaching services is the subject of the next section.

Pod-to-service connections

Pod-to-service connections are ultimately just pod-to-pod connections with an extra translation done at the very beginning.

The basic trick is that services have IPs from a range that never overlaps with either that of pod IPs or that of node IPs. This range is the --service-cluster-ip-range argument passed to kube-apiserver and kube-controller-manager. In our example, this range is 10.33.0.0/24. Services get allocated their IP when they’re first created and it never changes unless the service is deleted and recreated.

As an example, we try reaching the kube-system/kube-dns service from our example pod. Note that we have to get the protocol and port of the service right (DNS is just UDP on port 53); otherwise, the routing doesn’t work.

## Can we query the kube-dns service for its own address?
[root@debug-pod1 /]# dig +short kube-dns.kube-system.svc.cluster.local a
10.33.0.254

## How do we reach said address?
[root@debug-pod1 /]# traceroute --udp -p 53 -n 10.33.0.254
traceroute to 10.33.0.254 (10.33.0.254), 30 hops max, 60 byte packets
 1  10.33.0.254  0.035 ms  0.286 ms  0.174 ms

## We get the port wrong here, so the packets are routed to outside the cluster
[root@debug-pod1 /]# traceroute --udp -p 1000 10.33.0.254
traceroute to 10.33.0.254 (10.33.0.254), 30 hops max, 60 byte packets
 1  _gateway (10.32.1.1)  0.296 ms  0.246 ms  0.226 ms
 2  static.193.33.119.168.clients.your-server.de (168.119.33.193)  1.399 ms  1.379 ms  1.365 ms
 3  core24.fsn1.hetzner.com (213.239.245.209)  3.335 ms  3.323 ms *
...

Let’s follow a packet leaving our pod with destination 10.33.0.254. Looking at the pod’s routing table, there’s no special rule for the service subnet of 10.33.0.0/24, so the default gateway of 10.32.1.1 is used:

[root@debug-pod1 /]# ip route
default via 10.32.1.1 dev eth0
10.32.0.0/16 via 10.32.1.1 dev eth0
10.32.1.0/24 dev eth0 proto kernel scope link src 10.32.1.105

On the host, we’ve already seen that the packets come out of the veth device for the pod, get sent to the bridge, and come out of the mynet interface which has the address 10.32.1.1.

The host now has to figure out how to route packets to the service IP of 10.33.0.254. As soon as the packet emerges from the bridge, the iptables rules kick in. The PREROUTING chain in the nat table rewrites the destination IP address to that of one of the pods in the service, and since the packet came from the pod IP range, it’s then accepted by the FORWARD chain.

root@fsn-qws-app1 ~ # iptables -nL PREROUTING -t nat
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
KUBE-SERVICES  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service portals */

## This chain matches on the destination address, protocol, and port
root@fsn-qws-app1 ~ # iptables -nL KUBE-SERVICES -t nat
Chain KUBE-SERVICES (2 references)
target     prot opt source               destination
...
KUBE-SVC-TCOU7JCQXEZGVUNU  udp  --  0.0.0.0/0            10.33.0.254          /* kube-system/kube-dns:dns cluster IP */ udp dpt:53
...

## This chain picks one of the two kube-dns pods randomly
root@fsn-qws-app1 ~ # iptables -nL KUBE-SVC-TCOU7JCQXEZGVUNU -t nat
Chain KUBE-SVC-TCOU7JCQXEZGVUNU (1 references)
target     prot opt source               destination
KUBE-SEP-JMPR5F7DNF2VZGXN  all  --  0.0.0.0/0            0.0.0.0/0            /* kube-system/kube-dns:dns */ statistic mode random probability 0.50000000000
KUBE-SEP-GETXBXEJNGKLZGS3  all  --  0.0.0.0/0            0.0.0.0/0            /* kube-system/kube-dns:dns */

## This chain rewrites the address to the first kube-dns pod
root@fsn-qws-app1 ~ # iptables -nL KUBE-SEP-JMPR5F7DNF2VZGXN -t nat
Chain KUBE-SEP-JMPR5F7DNF2VZGXN (1 references)
target     prot opt source               destination
KUBE-MARK-MASQ  all  --  10.32.0.172          0.0.0.0/0            /* kube-system/kube-dns:dns */
DNAT       udp  --  0.0.0.0/0            0.0.0.0/0            /* kube-system/kube-dns:dns */ udp to:10.32.0.172:10053

## This chain rewrites the address to the second kube-dns pod
root@fsn-qws-app1 ~ # iptables -nL KUBE-SEP-GETXBXEJNGKLZGS3 -t nat
Chain KUBE-SEP-GETXBXEJNGKLZGS3 (1 references)
target     prot opt source               destination
KUBE-MARK-MASQ  all  --  10.32.1.116          0.0.0.0/0            /* kube-system/kube-dns:dns */
DNAT       udp  --  0.0.0.0/0            0.0.0.0/0            /* kube-system/kube-dns:dns */ udp to:10.32.1.116:10053

Note that the rule in the KUBE-SERVICES chain matches on both the service IP address and the service’s port. If a packet matches both, it gets sent to the KUBE-SVC-... chain, which then sends it, with equal probability, to one of the KUBE-SEP-... chains. These just DNAT the packet, rewriting the destination address and port to those of the actual kube-dns pods. From here, the packet follows the same path as that described in one of the pod-to-pod sections.

The iptables rules for each service are smashed into place by kube-proxy periodically (assuming it’s running in iptables mode).


To summarize, when sending packets to a service, the source host uses iptables to rewrite the destination of the packets to the actual addresses and ports of the pods implementing the service.

Corollary: host-to-pod and host-to-service connections

An interesting side-effect of the above is that the host can also access Kubernetes pods and services by their IP (because the iptables rules are at the host level). This is particularly helpful when debugging cluster networking because it removes the need for messing around with containers.
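
For example, we can query the DNS service directly from the host, no pod required (assuming dig is installed on the node):

## Same query as before, but from the host instead of from a pod
root@fsn-qws-app1 ~ # dig +short kube-dns.kube-system.svc.cluster.local @10.33.0.254
10.33.0.254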

That said, the iptables rules in this case are a bit different. Let’s follow a packet being sent from the host to the kube-dns service. Since the packet is generated locally, it goes through the OUTPUT chain of the nat table rather than PREROUTING, but both chains jump to the same KUBE-SERVICES chain (which is why it shows two references below):

root@fsn-qws-app1 ~ # iptables -nL PREROUTING -t nat
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
KUBE-SERVICES  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service portals */

root@fsn-qws-app1 ~ # iptables -nL KUBE-SERVICES -t nat
Chain KUBE-SERVICES (2 references)
target     prot opt source               destination
...
KUBE-MARK-MASQ  udp  -- !10.32.0.0/16         10.33.0.254          /* kube-system/kube-dns:dns cluster IP */ udp dpt:53
KUBE-SVC-TCOU7JCQXEZGVUNU  udp  --  0.0.0.0/0            10.33.0.254          /* kube-system/kube-dns:dns cluster IP */ udp dpt:53
...

Because the source of the packet isn’t in the pod CIDR of 10.32.0.0/16, the first rule gets triggered and we proceed to the KUBE-MARK-MASQ chain.

root@fsn-qws-app1 ~ # iptables -nL KUBE-MARK-MASQ -t nat
Chain KUBE-MARK-MASQ (74 references)
target     prot opt source               destination
MARK       all  --  0.0.0.0/0            0.0.0.0/0            MARK or 0x4000

This marks the packet with 0x4000, and we continue to the second rule in the KUBE-SERVICES chain. Like in the “Pod-to-service connections” section, the destination is rewritten to that of one of the pods implementing the service.

Afterwards, as the packet is leaving the host, it goes through the POSTROUTING chain in the nat table and matches only the first rule:

root@fsn-qws-app1 ~ # iptables -nvL POSTROUTING -t nat
Chain POSTROUTING (policy ACCEPT 10401 packets, 1037K bytes)
 pkts bytes target     prot opt in     out     source               destination
  17M 1319M KUBE-POSTROUTING  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes postrouting rules */
  ... ## many more rules for pod-to-other-node masquerading

root@fsn-qws-app1 ~ # iptables -nvL KUBE-POSTROUTING -t nat
Chain KUBE-POSTROUTING (1 references)
 pkts bytes target     prot opt in     out     source               destination
28445 2444K RETURN     all  --  *      *       0.0.0.0/0            0.0.0.0/0            mark match ! 0x4000/0x4000
    8   449 MARK       all  --  *      *       0.0.0.0/0            0.0.0.0/0            MARK xor 0x4000
    8   449 MASQUERADE  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes service traffic requiring SNAT */ random-fully

Only packets with the 0x4000 mark make it past the RETURN rule. For them, the mark is removed, and masquerading is enabled. This corresponds to how masquerading was enabled for pod-to-other-node connections in the POSTROUTING chain by checking that the source of the packets matched the address of one of the pods. Mind you, in our example, this masquerading doesn’t do anything because the packet was already coming from this host. If the packet had come from a different host, then this would have made the forwarding work.

From here, the packet follows the same path as in the “Pod-to-other-node connections” section.

Pod-to-outside-cluster connections

We’ve seen what happens when a pod sends a packet to another pod on a different host, but what if it sends the packet to a host outside the cluster?

At the beginning, the process is the same as in “Pod-to-other-node connections”, but it diverges when the packet has to be routed out of the host:

root@fsn-qws-app1 ~ # ip route
default via 168.119.33.193 dev enp35s0 proto static
...
10.32.0.0/24 via 10.32.0.0 dev flannel.1 onlink
10.32.1.0/24 dev mynet proto kernel scope link src 10.32.1.1
...

Since the packet’s destination is neither a pod on the same host (10.32.1.0/24) nor a pod on a different node in the cluster (10.32.0.0/24), the default route is chosen and the packet is sent via the host’s regular networking. Again, since the packet was masqueraded, no other host will be able to tell that the packet came from a pod and not the host itself.
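
We can watch the masquerading happen by capturing on the external interface while a pod talks to example.com; the captured packets should show the node’s public address as their source, not 10.32.1.105:

## Capture the pod's traffic to example.com as it leaves the host
root@fsn-qws-app1 ~ # tcpdump -ni enp35s0 host 93.184.216.34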

Conclusion

At first glance, Kubernetes networking seems very complicated, but on closer inspection, it’s just veth devices, bridges, routes, and iptables rules chained in very specific ways. Remembering the various scenarios is hard, but luckily, all the parts are easy to introspect, so we don’t have to.