
Kubernetes network stack fundamentals: How pods on different nodes communicate

Learn how pods communicate with each other when they are on different Kubernetes nodes.

My previous article showed how containers communicate within a pod through the same network namespace. This article looks at how pods communicate with each other when those pods exist on different Kubernetes nodes.

Kubernetes defines a network model, but the actual implementation relies on network plugins that conform to the Container Network Interface (CNI) specification. The network plugin is responsible for allocating Internet Protocol (IP) addresses to pods and for enabling pods to communicate with each other within the Kubernetes cluster. There are a variety of network plugins for Kubernetes, but this article uses Flannel. Flannel is very simple and uses a Virtual Extensible LAN (VXLAN) overlay by default.

You often hear about overlay networks in the context of Kubernetes networking. While this may sound complicated, an overlay network simply involves another layer of encapsulation for network traffic. For example, the Flannel network plugin takes traffic from a pod and encapsulates it inside the VXLAN protocol. This article takes a deep dive into how this encapsulation works and how the traffic appears on the wire.


Environment setup

The environment for this article uses a two-node minikube cluster with the Flannel network plugin. You can start the necessary minikube environment using:

$ minikube start --nodes 2 --network-plugin=cni --cni=flannel
😄  minikube v1.25.2 on Ubuntu 20.04
✨  Automatically selected the kvm2 driver. Other choices: virtualbox, ssh
❗  With --network-plugin=cni, you will need to provide your own CNI. See --cni flag as a user-friendly alternative
👍  Starting control plane node minikube in cluster minikube
🔥  Creating kvm2 VM (CPUs=2, Memory=2200MB, Disk=20000MB) ...
🐳  Preparing Kubernetes v1.23.3 on Docker 20.10.12 ...
    ▪ kubelet.housekeeping-interval=5m
    ▪ Generating certificates and keys ...
    ▪ Booting up control plane ...
    ▪ Configuring RBAC rules ...
🔗  Configuring Flannel (Container Networking Interface) ...
🔎  Verifying Kubernetes components...
    ▪ Using image gcr.io/k8s-minikube/storage-provisioner:v5
🌟  Enabled addons: storage-provisioner, default-storageclass

👍  Starting worker node minikube-m02 in cluster minikube
🔥  Creating kvm2 VM (CPUs=2, Memory=2200MB, Disk=20000MB) ...
🌐  Found network options:
    ▪ NO_PROXY=192.168.50.43
🐳  Preparing Kubernetes v1.23.3 on Docker 20.10.12 ...
    ▪ env NO_PROXY=192.168.50.43
🔎  Verifying Kubernetes components...
🏄  Done! kubectl is now configured to use "minikube" cluster and "default" namespace by default

# Verify that nodes are up and ready
$ kubectl get nodes
NAME           STATUS   ROLES                  AGE   VERSION
minikube       Ready    control-plane,master   65s   v1.23.3
minikube-m02   Ready    <none>                 34s   v1.23.3

The diagram below shows the basic topology for this environment. I will describe each component of this topology in more detail throughout the article.

[Image: Network diagram of two pods and a physical network (Anthony Critelli, CC BY-SA 4.0)]

Below are the Kubernetes manifests that define the fedora-1 and fedora-2 pods. A nodeSelector is used to ensure that each pod runs on a separate host for this experiment.

Here is k8s_flannel_fedora_1_definition.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: fedora-1
spec:
  nodeSelector:
    kubernetes.io/hostname: minikube
  containers:
  - command:
    - sleep
    - infinity
    image: fedora
    name: fedora

And k8s_flannel_fedora_2_definition.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: fedora-2
spec:
  nodeSelector:
    kubernetes.io/hostname: minikube-m02
  containers:
  - command:
    - sleep
    - infinity
    image: fedora
    name: fedora
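
With both manifests saved, you can create the pods and confirm that the nodeSelector placed each one on a separate node. This is a quick sketch using the filenames above; the pod IP addresses you see will differ from the ones referenced later in this article, which come from my environment:

# Create both pods
$ kubectl apply -f k8s_flannel_fedora_1_definition.yaml
$ kubectl apply -f k8s_flannel_fedora_2_definition.yaml

# The -o wide output includes each pod's IP address and the node it landed on.
# Expect fedora-1 on minikube and fedora-2 on minikube-m02.
$ kubectl get pods -o wide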

Note that throughout this article, I enter the pods and execute commands within their namespaces. Many of these commands are not installed by default, but you can install them using DNF. I'm entering pods only for explanatory reasons, and you should avoid "logging in" to pods and directly installing utilities in a production environment.

The Open Systems Interconnection (OSI) model is the standard reference for describing communication between computing and telecommunications systems. The model is divided into seven layers, each covering a different responsibility within the network communication process. This article focuses on Layer 2 (the data link layer) and Layer 3 (the network layer).

The Layer 2 network

This exercise sends a ping from the fedora-1 pod to the fedora-2 pod and traces it through the network stack. Connecting to the fedora-1 pod and inspecting its network stack reveals that it has an eth0@if10 interface. This interface is one side of a virtual Ethernet pair.

A virtual Ethernet pair allows connections between network namespaces, such as between a pod's namespace and the host's default namespace. The @if10 suffix and the ethtool output both indicate that the peer interface's index on the host is 10.
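
If you haven't worked with virtual Ethernet pairs outside of Kubernetes, a quick hand-rolled sketch shows the mechanism. The namespace and interface names below are invented for illustration; this is not what Flannel or the CNI bridge plugin literally run, but the kernel objects involved are the same:

# Create a scratch namespace and a veth pair (run as root; names are made up)
$ sudo ip netns add demo-ns
$ sudo ip link add veth-host type veth peer name veth-pod

# Move one end into the namespace and bring both ends up
$ sudo ip link set veth-pod netns demo-ns
$ sudo ip link set veth-host up
$ sudo ip netns exec demo-ns ip link set veth-pod up

# Each end now shows an @ifN suffix pointing at its peer's interface index
$ ip link show veth-host

Back in the cluster, you can see the real thing by entering the fedora-1 pod: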

# Enter the fedora-1 pod
$ kubectl exec -it fedora-1 -- /bin/bash

# Inspect the network configuration within the fedora-1 pod
[root@fedora-1 /]# ip link sh
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: sit0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/sit 0.0.0.0 brd 0.0.0.0
4: eth0@if10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default 
    link/ether 92:73:8f:43:67:1b brd ff:ff:ff:ff:ff:ff link-netnsid 0

[root@fedora-1 /]# ethtool -S eth0
NIC statistics:
     peer_ifindex: 10

You can now inspect the network configuration on the minikube host. There are several additional interfaces, but the network configuration within the fedora-1 pod indicates that the interface with index 10 is the remote end of the virtual Ethernet pair. On the minikube host, the interface with index 10 is vetheaa97948@if4, and both the @if4 suffix and the ethtool output indicate that the remote peer index is 4. This index corresponds to the eth0 interface in the fedora-1 pod, which has an interface index of 4.

# Inspect the network configuration on the minikube host
$ ip link sh
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether 52:54:00:37:09:5c brd ff:ff:ff:ff:ff:ff
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether 52:54:00:ea:1b:4a brd ff:ff:ff:ff:ff:ff
4: sit0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/sit 0.0.0.0 brd 0.0.0.0
5: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default 
    link/ether 02:42:7e:1b:3c:46 brd ff:ff:ff:ff:ff:ff
6: cni-podman0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether aa:8b:b8:3b:ae:0d brd ff:ff:ff:ff:ff:ff
8: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default 
    link/ether 3e:7d:76:2a:39:82 brd ff:ff:ff:ff:ff:ff
9: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether c6:ae:d0:e6:66:cc brd ff:ff:ff:ff:ff:ff
    
10: vetheaa97948@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP mode DEFAULT group default 
    link/ether 0e:52:e0:c4:14:74 brd ff:ff:ff:ff:ff:ff link-netnsid 0

$ ethtool -S vetheaa97948
NIC statistics:
     peer_ifindex: 4

The virtual Ethernet pair also connects to a bridge on the minikube host. This allows pods on the same host to communicate directly with each other over the bridge. The cni0 bridge interface has an IP address assigned to it, which will be important in the Layer 3 routing process:

# Bridge configuration on minikube host
$ brctl show
bridge name	bridge id		STP enabled	interfaces
cni-podman0		8000.aa8bb83bae0d	no		
cni0		8000.c6aed0e666cc	no		vetheaa97948
docker0		8000.02427e1b3c46	no	

# Bridge interface IP address on minikube host
$ ip -br addr sh cni0
cni0             UP             10.244.0.1/24 
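
If you want to see how little is involved in this plumbing, the same bridge-and-veth arrangement can be reproduced by hand with iproute2. The bridge name and address below are invented for illustration, and veth-host refers to the hypothetical pair from the earlier sketch; none of this is required for Flannel to work:

# Create a bridge, give it an address, and attach a veth end to it (run as root)
$ sudo ip link add name demo-br0 type bridge
$ sudo ip addr add 10.244.99.1/24 dev demo-br0
$ sudo ip link set demo-br0 up
$ sudo ip link set veth-host master demo-br0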

After investigating the network stack within the fedora-1 pod and on the minikube host, here is how the network diagram appears:

[Image: Virtual Ethernet bridge (Anthony Critelli, CC BY-SA 4.0)]

The Layer 3 network

At this point, a good portion of the Layer 2 part of the diagram is filled out. However, this only provides local connectivity for the pod. Next, consider how fedora-1 at 10.244.0.2/24 can communicate with fedora-2 at 10.244.1.3/24, which is in a different Layer 3 network.


First, the fedora-1 pod must decide where to send traffic for the remote network. The pod's routing table contains no route specific to the 10.244.1.0/24 network, so the longest match for 10.244.1.3 is the cluster-wide 10.244.0.0/16 route (the default route points the same way). Either way, the next hop is 10.244.0.1, which is the IP address of the cni0 bridge on the minikube host:

# The next hop is 10.244.0.1, which is the cni0 bridge on the minikube host
[root@fedora-1 /]# ip route sh
default via 10.244.0.1 dev eth0 
10.244.0.0/24 dev eth0 proto kernel scope link src 10.244.0.2 
10.244.0.0/16 via 10.244.0.1 dev eth0 
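
To confirm which route the kernel actually selects for the fedora-2 address, you can ask it directly with ip route get. The exact output depends on your environment, but the next hop should be 10.244.0.1:

# Ask the routing subsystem which route matches 10.244.1.3
[root@fedora-1 /]# ip route get 10.244.1.3
# Expect something along the lines of:
#   10.244.1.3 via 10.244.0.1 dev eth0 src 10.244.0.2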

Once the traffic reaches the cni0 bridge on the minikube host, the next routing decision determines how to forward it to the destination network (10.244.1.0/24). The host's routing table shows that traffic destined for the 10.244.1.0/24 network is sent via 10.244.1.0 on the flannel.1 interface:

# Routing table on minikube node
$ ip route sh
default via 192.168.122.1 dev eth1 proto dhcp src 192.168.122.215 metric 1024 
10.88.0.0/16 dev cni-podman0 proto kernel scope link src 10.88.0.1 linkdown 
10.244.0.0/24 dev cni0 proto kernel scope link src 10.244.0.1 
10.244.1.0/24 via 10.244.1.0 dev flannel.1 onlink 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown 
192.168.61.0/24 dev eth0 proto kernel scope link src 192.168.61.95 
192.168.122.0/24 dev eth1 proto kernel scope link src 192.168.122.215 
192.168.122.1 dev eth1 proto dhcp scope link src 192.168.122.215 metric 1024
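
The same ip route get check works on the host and should confirm that the flannel.1 route is the longest match for the destination pod (the output will vary slightly by environment):

# On the minikube host
$ ip route get 10.244.1.3
# Expect the route via 10.244.1.0 on the flannel.1 interface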

The flannel.1 interface is a VXLAN interface with a VXLAN ID of 1 and an IP address of 10.244.0.0/32:

# flannel.1 interface on minikube host
$ ip -d link sh flannel.1
8: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default 
    link/ether 3e:7d:76:2a:39:82 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535 
    vxlan id 1 local 192.168.122.215 dev eth1 srcport 0 0 dstport 8472 nolearning ttl auto ageing 300 udpcsum noudp6zerocsumtx noudp6zerocsumrx addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 

$ ip -br addr sh flannel.1
flannel.1        UNKNOWN        10.244.0.0/32 
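
Flannel programs this device through the kernel's netlink interface, but a comparable VXLAN device could be created by hand with iproute2. The command below is illustrative only and uses a different VXLAN ID, since flannel.1 already owns VNI 1 on this node; the other parameters mirror the ip -d link output above:

# Illustrative only: create a VXLAN device similar to flannel.1 (run as root)
$ sudo ip link add vxlan-demo type vxlan id 100 local 192.168.122.215 dev eth1 dstport 8472 nolearning
$ sudo ip link set vxlan-demo up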

The next hop for a packet from fedora-1 to fedora-2 is 10.244.1.0. However, that address is not on any network the minikube host is directly connected to, so ordinary ARP resolution cannot find it. Instead, a static Address Resolution Protocol (ARP) entry exists in the ARP table on the minikube host. This static entry indicates that the MAC address for 10.244.1.0 is d2:d8:a1:85:9e:38. The bridge forwarding database (FDB) then directs traffic for this MAC address to a remote destination of 192.168.122.7. This remote address, which is the physical interface on the minikube-m02 node, is the other side of the VXLAN tunnel.

# ARP table on minikube host. Note that some entries have been removed for brevity.
$ arp -a
? (10.244.1.0) at d2:d8:a1:85:9e:38 [ether] PERM on flannel.1


# Determine the container ID of the Flannel container and obtain a shell within the container.
$ docker ps | grep flannel
ee112410bd61   4e9f801d2217           "/opt/bin/flanneld -…"   8 hours ago   Up 8 hours             k8s_kube-flannel_kube-flannel-ds-amd64-2qqld_kube-system_bc079a14-d045-44ce-9dc2-fa1369a20c30_0
11eee23ddcb0   k8s.gcr.io/pause:3.6   "/pause"                 8 hours ago   Up 8 hours             k8s_POD_kube-flannel-ds-amd64-2qqld_kube-system_bc079a14-d045-44ce-9dc2-fa1369a20c30_0
$ docker exec -it ee112410bd61 /bin/bash

# Bridge forwarding database information within the Flannel container on the minikube host. Irrelevant entries have been removed for brevity.
bash-5.0# bridge fdb show
d2:d8:a1:85:9e:38 dev flannel.1 dst 192.168.122.7 self permanent
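
The flanneld daemon maintains these ARP and FDB entries for you, so there is no need to create them yourself. Still, it is worth seeing that they are ordinary kernel state that could be programmed by hand with iproute2, using the same values shown above:

# How equivalent entries could be added manually (illustrative; Flannel already does this)
$ sudo ip neigh replace 10.244.1.0 lladdr d2:d8:a1:85:9e:38 dev flannel.1 nud permanent
$ sudo bridge fdb append d2:d8:a1:85:9e:38 dev flannel.1 dst 192.168.122.7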

This additional information provides the overall traffic diagram for one side of the topology:

[Image: Pod network diagram with Flannel plugin (Anthony Critelli, CC BY-SA 4.0)]

You can see the entire picture by viewing the configuration on the other side of the connection for the fedora-2 pod and the minikube-m02 node:

# MAC and IP addresses within fedora-2 pod
[root@fedora-2 /]# ip -br link sh
lo               UNKNOWN        00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP> 
sit0@NONE        DOWN           0.0.0.0 <NOARP> 
eth0@if9         UP             1e:59:e5:f0:89:56 <BROADCAST,MULTICAST,UP,LOWER_UP> 

[root@fedora-2 /]# ip -br addr sh
lo               UNKNOWN        127.0.0.1/8 
sit0@NONE        DOWN           
eth0@if9         UP             10.244.1.3/24 

# Peer interface within fedora-2 pod for veth
[root@fedora-2 /]# ethtool -S eth0
NIC statistics:
     peer_ifindex: 9

# MAC and IP addresses on the minikube-m02 node
$ ip -br link sh
lo               UNKNOWN        00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP> 
eth0             UP             52:54:00:4b:22:10 <BROADCAST,MULTICAST,UP,LOWER_UP> 
eth1             UP             52:54:00:b4:90:59 <BROADCAST,MULTICAST,UP,LOWER_UP> 
sit0@NONE        DOWN           0.0.0.0 <NOARP> 
docker0          DOWN           02:42:df:f0:b5:92 <NO-CARRIER,BROADCAST,MULTICAST,UP> 
flannel.1        UNKNOWN        d2:d8:a1:85:9e:38 <BROADCAST,MULTICAST,UP,LOWER_UP> 
cni0             UP             ba:b6:aa:32:0a:ed <BROADCAST,MULTICAST,UP,LOWER_UP> 
vethd63c7e02@if4 UP             96:ba:a1:c6:7c:b8 <BROADCAST,MULTICAST,UP,LOWER_UP> 
veth4427a55c@if4 UP             6a:df:dc:89:7a:7f <BROADCAST,MULTICAST,UP,LOWER_UP> 

$ ip -br addr sh
lo               UNKNOWN        127.0.0.1/8 
eth0             UP             192.168.61.149/24 
eth1             UP             192.168.122.7/24 
sit0@NONE        DOWN           
docker0          DOWN           172.17.0.1/16 
flannel.1        UNKNOWN        10.244.1.0/32 
cni0             UP             10.244.1.1/24 
vethd63c7e02@if4 UP             
veth4427a55c@if4 UP 

# Peer interface on host for veth
$ ethtool -S veth4427a55c
NIC statistics:
     peer_ifindex: 4

The complete traffic flow can now be understood at a high level: The pod sends traffic to its next hop, which is the CNI bridge on the host. The host forwards this traffic through the local flannel.1 interface to the flannel.1 interface on the remote node. This traffic is encapsulated in a VXLAN tunnel. It is decapsulated and forwarded to the target pod once the remote node receives it:

[Image: Pod network diagram with full topology (Anthony Critelli, CC BY-SA 4.0)]

Communication on the wire

Now that you understand how the network stack works, it's time to see it in action. The best way to take this knowledge from conceptual to concrete is to send traffic between the pods and observe the traffic on the wire.

To do this, I set up a ping between fedora-1 and fedora-2 while running a Wireshark packet capture on the physical network. In my setup, minikube runs on a Linux host, which requires capturing packets on the virtual network interface controllers (NICs) for the minikube VMs. With the packet capture running, I set up a ping from fedora-1 to fedora-2:

[root@fedora-1 /]# ping 10.244.1.3
PING 10.244.1.3 (10.244.1.3) 56(84) bytes of data.
64 bytes from 10.244.1.3: icmp_seq=1 ttl=62 time=0.301 ms
64 bytes from 10.244.1.3: icmp_seq=2 ttl=62 time=0.771 ms
64 bytes from 10.244.1.3: icmp_seq=3 ttl=62 time=0.762 ms
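
If you don't have an easy way to capture on the hypervisor's virtual NICs, you can capture the encapsulated traffic directly on a node's physical interface instead. The interface name and capture path below are specific to my environment, and tcpdump may need to be installed first:

# Capture the VXLAN-encapsulated traffic on the minikube node (run as root)
$ tcpdump -ni eth1 -w /tmp/vxlan.pcap udp port 8472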

Once I had captured some packets, I used Wireshark to analyze them. Notice that the communication appears as plain User Datagram Protocol (UDP) traffic, not Internet Control Message Protocol (ICMP) traffic. This is because it is encapsulated in the VXLAN tunnel.

[Image: Packet capture displaying UDP traffic (Anthony Critelli, CC BY-SA 4.0)]

You can tell Wireshark to decode the traffic as VXLAN by right-clicking any packet, selecting Decode As…, and setting the protocol for UDP port 8472 (the port the VXLAN tunnel uses) to VXLAN:

[Image: Packet capture decoded as VXLAN (Anthony Critelli, CC BY-SA 4.0)]
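
The same decoding can be applied from the command line with tshark, which ships alongside Wireshark. The capture filename below refers to the hypothetical file from the earlier tcpdump sketch:

# Read the capture and decode UDP port 8472 as VXLAN
$ tshark -r /tmp/vxlan.pcap -d udp.port==8472,vxlan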

Once you set the decoding correctly, the traffic will appear as ICMP traffic. Expanding a particular packet shows the complete picture:

[Image: Packet capture displaying echo results (Anthony Critelli, CC BY-SA 4.0)]

I will start at the innermost part of the packet. This is an ICMP packet being sent from 10.244.0.2 (fedora-1) to 10.244.1.3 (fedora-2). The innermost Ethernet header specifies a source MAC address of 3e:7d:76:2a:39:82 (flannel.1 on minikube) and a destination MAC address of d2:d8:a1:85:9e:38 (flannel.1 on minikube-m02).

Next, you can see a VXLAN Network Identifier of 1, which corresponds to the configuration you previously saw for the flannel.1 interface. This is the VXLAN header, which enables the overlay part of this network. The inner traffic is encapsulated within this VXLAN header, and the pods that communicate with each other don't know anything about the physical network topology.

Moving up the stack, you can see that this is a UDP packet sent from 192.168.122.215 (eth1 on minikube) to 192.168.122.7 (eth1 on minikube-m02). At Layer 2, the outer Ethernet frame travels from 52:54:00:ea:1b:4a (eth1 on minikube) to 52:54:00:b4:90:59 (eth1 on minikube-m02), since the two minikube nodes are on the same physical network.

These packets demonstrate the entire flow of traffic from the fedora-1 to fedora-2 pods. Traffic is encapsulated with a VXLAN header and then sent directly between the two minikube nodes. When a node receives the traffic, it removes the VXLAN header and then forwards the inner packet to the appropriate pod. This allows full end-to-end communication between the pods without needing them to understand the underlying physical network.
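
A quick way to convince yourself that decapsulation happens on the node, not in the pod, is to capture inside the destination pod, where only plain ICMP should be visible. As noted earlier, installing tools inside pods is fine for experimentation but not something to do in production:

# Inside the fedora-2 pod: only decapsulated ICMP traffic appears on eth0
[root@fedora-2 /]# dnf install -y tcpdump
[root@fedora-2 /]# tcpdump -ni eth0 icmp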

Wrap up

This article showed how a packet travels from one Kubernetes pod to another through a VXLAN overlay network. This process is very complicated, and this article provided the in-depth commands necessary to fully trace this traffic flow. While Flannel is only one example of a Kubernetes network plugin, the general commands and concepts from this article can help you to understand traffic flows in Kubernetes clusters that use different network approaches.
