Tour Around Kernel Stack

In this blog post, I share my (admittedly shallow) understanding of the Linux kernel networking stack by touring a simple network topology: two containers connected to a Linux bridge through veth pairs, as shown below.



Our discussion is based mainly on the Linux kernel 4.14.4 source code. We cover both the sending and the receiving path, starting from the write() and read() system calls, and then look at some details of the veth and bridge drivers.

Note that the kernel is fairly complex, and the networking subsystem interacts with other subsystems (e.g., the softirq subsystem). We do not delve into those interactions, nor into the details of specific data structures; the focus here is on the path a packet takes.

Walking Path

Userspace applications usually interact with the kernel stack through system calls (e.g., write() and read()). We talk about the sending path first.

Sending Path

write() // [fs/read_write.c]
|-> vfs_write() -> new_sync_write() -> call_write_iter() // [fs/read_write.c]
|-> sock_write_iter() // [net/socket.c]
|-> sock_sendmsg()
|-> inet_sendmsg() // [net/ipv4/af_inet.c]
|-> tcp_sendmsg() -> tcp_sendmsg_locked() -> skb_add_data_nocache() // [net/ipv4/tcp.c] NOTE: user data is copied into the skb here
|-> tcp_push() -> __tcp_push_pending_frames() -> tcp_write_xmit() // [net/ipv4/tcp_output.c]
|-> tcp_transmit_skb() // builds the TCP header
|-> ip_queue_xmit() -> ip_local_out() -> ip_output() -> ip_finish_output() -> ip_finish_output2() // [net/ipv4/ip_output.c] builds the IP header; IP fragmentation
|-> neigh_output() -> neigh_hh_output() // [include/net/neighbour.h]
|-> dev_queue_xmit() -> __dev_queue_xmit() // [net/core/dev.c]
  1. select the transmit hardware queue, if supported
  2. if the device has no queue (loopback, tunnel, etc.), call dev_hard_start_xmit() directly
  3. otherwise, enqueue the skb into the device's queuing discipline (qdisc)
|-> xmit_one() -> dev_queue_xmit_nit() // NOTE: skb_clone() here for packet taps (tracing, e.g., tcpdump)
|-> netdev_start_xmit()
|-> ndo_start_xmit()
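
To make the queue / no-queue decision in __dev_queue_xmit() more concrete, here is a minimal userspace sketch. It is not kernel code: every toy_* name is hypothetical, a plain FIFO stands in for the qdisc, and locking, hardware queue selection, and error handling are left out.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_QLEN_MAX 8

/* Hypothetical stand-ins for struct sk_buff and struct net_device. */
struct toy_pkt { int id; };

struct toy_dev {
    int has_qdisc;                        /* 0 models a queue-less device */
    struct toy_pkt *queue[TOY_QLEN_MAX];  /* a plain FIFO stands in for the qdisc */
    size_t qlen;
    int sent;                             /* packets handed to the "driver" */
};

enum toy_path { TOY_DIRECT, TOY_QUEUED, TOY_DROPPED };

/* Plays the role of dev_hard_start_xmit(): hand the packet to the driver. */
static void toy_hard_start_xmit(struct toy_dev *dev, struct toy_pkt *pkt)
{
    assert(dev != NULL && pkt != NULL);
    dev->sent++;
}

/* Plays the role of __dev_queue_xmit(): transmit directly or enqueue. */
static enum toy_path toy_dev_queue_xmit(struct toy_dev *dev, struct toy_pkt *pkt)
{
    if (!dev->has_qdisc) {
        /* No queue (loopback, tunnel, ...): go straight to the driver. */
        toy_hard_start_xmit(dev, pkt);
        return TOY_DIRECT;
    }
    if (dev->qlen == TOY_QLEN_MAX)
        return TOY_DROPPED;               /* queue full */
    dev->queue[dev->qlen++] = pkt;        /* otherwise, enter the queue */
    return TOY_QUEUED;
}

/* Drains the FIFO, as a later qdisc run would do. */
static void toy_qdisc_run(struct toy_dev *dev)
{
    for (size_t i = 0; i < dev->qlen; i++)
        toy_hard_start_xmit(dev, dev->queue[i]);
    dev->qlen = 0;
}
```

Calling toy_dev_queue_xmit() on a device with has_qdisc == 0 bumps sent immediately, while a device with a queue only sees sent grow after toy_qdisc_run().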

Receiving Path

Although the receiving path looks like the exact reverse of the sending path, there are some differences. In my view, the receiving path is much more 'asynchronous': calling the read() system call and the kernel receiving data from the underlying NIC are two largely independent procedures.

We start by discussing read() and then talk about how the kernel receives data frames.

read() [fs/read_write.c]
|-> vfs_read() -> new_sync_read() -> call_read_iter() // [fs/read_write.c]
|-> sock_read_iter() // [net/socket.c]
|-> sock_recvmsg()
|-> inet_recvmsg() // [net/ipv4/af_inet.c]
|-> tcp_recvmsg() // [net/ipv4/tcp.c] [NOTE: 1. sk_wait_data() blocks until data arrives 2. skb_copy_datagram_msg() copies from kernel to userspace]

On the NIC side, an incoming frame is handled as follows:

  1. the NIC DMAs the frame into a pre-allocated memory region
  2. the NIC raises an IRQ
  3. typically, the hardware IRQ handler schedules NAPI, which raises a softirq
|-> net_rx_action()
|-> napi_poll()
|-> [NOTE: device-specific poll implementation]
|-> napi_gro_receive() [net/core/dev.c]
|-> dev_gro_receive()
|-> napi_gro_complete()
|-> netif_receive_skb_internal() [NOTE: XDP and RPS are handled here]
  1. when receive-side steering (e.g., RPS) queues the skb on a per-CPU backlog, process_backlog() picks it up from there
|-> __netif_receive_skb()
|-> deliver_skb() [NOTE: calls the handler registered for the packet type]
|-> ip_rcv() [NOTE: passes through netfilter] [net/ipv4/ip_input.c]
|-> ip_rcv_finish() -> dst_input() [NOTE: calls the pre-set input handler, ip_local_deliver() for local traffic]
|-> ip_local_deliver_finish()

OPTION-1 [stack, calls pre-defined tcp_v4_rcv()]

|-> tcp_v4_do_rcv() [net/ipv4/tcp_ipv4.c] [NOTE: some parts here relate to ACKs (transmit path)]
|-> tcp_rcv_established() [NOTE: main TCP stack-related operations]
|-> tcp_data_queue() [NOTE: handles the out-of-order (OOO) queue and the data queue, calls pre-defined sock_def_readable()]

OPTION-2 [raw]

|-> raw_local_deliver()
|-> raw_v4_input() [NOTE: skb_clone() happens here, one clone per matching raw socket]
|-> raw_rcv()
|-> raw_rcv_skb()
|-> sock_queue_rcv_skb()
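
Both options start from ip_local_deliver_finish(), which, as far as I understand, picks the upper-layer handler by IP protocol number. The following userspace sketch models only that lookup. All toy_* names are hypothetical, and the table is a simplification of the kernel's protocol registration (raw socket delivery is not modeled).

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_PROTOS  256
#define TOY_IPPROTO_TCP 6    /* IP protocol number of TCP */
#define TOY_IPPROTO_UDP 17   /* IP protocol number of UDP */

/* Hypothetical stand-in for struct sk_buff. */
struct toy_skb {
    unsigned char protocol;  /* protocol field of the IP header */
    int handled_by;          /* set by whichever handler ran */
};

typedef int (*toy_proto_handler_t)(struct toy_skb *skb);

/* One handler slot per IP protocol number. */
static toy_proto_handler_t toy_protos[TOY_MAX_PROTOS];

/* Registers a handler; fails if the slot is already taken. */
static int toy_add_protocol(toy_proto_handler_t handler, unsigned char protocol)
{
    assert(handler != NULL);
    if (toy_protos[protocol])
        return -1;
    toy_protos[protocol] = handler;
    return 0;
}

/* Plays the role of tcp_v4_rcv(). */
static int toy_tcp_rcv(struct toy_skb *skb)
{
    skb->handled_by = TOY_IPPROTO_TCP;
    return 0;
}

/* Plays the role of ip_local_deliver_finish(): dispatch on the protocol. */
static int toy_local_deliver_finish(struct toy_skb *skb)
{
    toy_proto_handler_t handler = toy_protos[skb->protocol];

    if (!handler)
        return -1;           /* no handler registered for this protocol */
    return handler(skb);
}
```

With only toy_tcp_rcv() registered, a TCP packet is handled and a UDP packet is rejected, which mirrors the "calls pre-defined tcp_v4_rcv()" note above.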

From Driver View

ethtool -i makes it easy to check which driver backs the vNIC, both inside the container and on the host:

inside container:
root@c0026747a805:/# ethtool -i eth0
driver: veth
...

in host:
~$ ethtool -i veth0c5cf6a
driver: veth
...

veth driver [drivers/net/veth.c]

veth devices always come as a pair of interconnected virtual Ethernet devices: a frame transmitted on one end is received on the other. The veth_newlink() function creates both devices of a pair.

Transmit Path

Recall that the final stage of the transmit path is a call to ndo_start_xmit(), which is driver-specific.

static const struct net_device_ops veth_netdev_ops = {
    ...
    .ndo_start_xmit = veth_xmit,
    ...
};
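
The ops table is plain function-pointer dispatch. As a minimal userspace sketch of the idea (the toy_* names are hypothetical and the structs are heavily reduced compared to struct net_device and struct net_device_ops):

```c
#include <assert.h>
#include <stddef.h>

struct toy_dev;

/* Hypothetical stand-in for struct sk_buff. */
struct toy_skb { int len; };

/* Reduced stand-in for struct net_device_ops. */
struct toy_dev_ops {
    int (*start_xmit)(struct toy_skb *skb, struct toy_dev *dev);
};

/* Reduced stand-in for struct net_device. */
struct toy_dev {
    const struct toy_dev_ops *ops;
    int tx_bytes;
};

/* Driver-specific transmit routine, in the role of veth_xmit(). */
static int toy_veth_xmit(struct toy_skb *skb, struct toy_dev *dev)
{
    dev->tx_bytes += skb->len;
    return 0;
}

static const struct toy_dev_ops toy_veth_ops = {
    .start_xmit = toy_veth_xmit,
};

/* Generic layer: it only knows the ops table, not the driver behind it. */
static int toy_netdev_start_xmit(struct toy_skb *skb, struct toy_dev *dev)
{
    assert(dev->ops != NULL && dev->ops->start_xmit != NULL);
    return dev->ops->start_xmit(skb, dev);
}
```

The generic code calls toy_netdev_start_xmit() for every device; which driver routine actually runs depends only on the ops pointer stored in the device.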

Let us take a closer look at veth_xmit().

static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
{
    ...
    rcv = rcu_dereference(priv->peer); // get the peer net device
    ...
    if (likely(dev_forward_skb(rcv, skb) == NET_RX_SUCCESS)) {
        ...
    }
    ...
}

dev_forward_skb() then injects the skb into the receive path of the peer device:

dev_forward_skb()
|-> __dev_forward_skb() // checks whether skb is forwardable
|-> netif_rx_internal()
|-> enqueue_to_backlog() // enqueues the skb on the per-CPU input queue kept in struct softnet_data
|-> ____napi_schedule() // raises NET_RX_SOFTIRQ

[NOTE: this softirq ends up in process_backlog(), which net_dev_init() registers as the poll function of the per-CPU backlog.]
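
The hand-off from veth_xmit() to the peer can be pictured with a small userspace model. This is only a sketch under simplifying assumptions: a single ring buffer plays the per-CPU backlog, the toy_* names are hypothetical, and no real softirq is involved (a flag marks that one would be pending).

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BACKLOG_MAX 4

struct toy_dev {
    struct toy_dev *peer;   /* the other end of the veth pair */
    int rx_packets;
};

/* Hypothetical stand-in for struct sk_buff. */
struct toy_skb {
    int id;
    struct toy_dev *dev;    /* device the packet currently belongs to */
};

/* Stand-in for the per-CPU struct softnet_data backlog. */
struct toy_softnet {
    struct toy_skb *q[TOY_BACKLOG_MAX];
    size_t head, len;
    int softirq_pending;
};

/* In the role of enqueue_to_backlog(): queue the skb, mark the softirq. */
static int toy_enqueue_to_backlog(struct toy_softnet *sd, struct toy_skb *skb)
{
    if (sd->len == TOY_BACKLOG_MAX)
        return -1;                       /* backlog full: drop */
    sd->q[(sd->head + sd->len) % TOY_BACKLOG_MAX] = skb;
    sd->len++;
    sd->softirq_pending = 1;
    return 0;
}

/* In the role of veth_xmit(): "transmit" means "receive on the peer". */
static int toy_veth_xmit(struct toy_softnet *sd, struct toy_skb *skb,
                         struct toy_dev *dev)
{
    assert(dev->peer != NULL);
    skb->dev = dev->peer;
    return toy_enqueue_to_backlog(sd, skb);
}

/* In the role of process_backlog(): drain the queue in softirq context. */
static int toy_process_backlog(struct toy_softnet *sd)
{
    int done = 0;

    while (sd->len > 0) {
        struct toy_skb *skb = sd->q[sd->head];

        sd->head = (sd->head + 1) % TOY_BACKLOG_MAX;
        sd->len--;
        skb->dev->rx_packets++;          /* delivered on the peer device */
        done++;
    }
    sd->softirq_pending = 0;
    return done;
}
```

The point of the model is the decoupling: toy_veth_xmit() returns as soon as the skb is queued, and the peer only counts the packet once toy_process_backlog() runs.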

process_backlog()
|-> __netif_receive_skb()
|-> __netif_receive_skb_core()
|-> deliver_skb() // if the packet is for the local stack, deliver it to the upper layer, e.g., ip_rcv()
|-> rx_handler() // calls the device's registered rx_handler, which is br_handle_frame() for a bridge port

We explain the related bridge kernel module in the next section.

bridge kernel module

Remember that one end of each veth pair is attached to the bridge with the brctl addif command, which ends up calling br_add_if().

int br_add_if(struct net_bridge *br, struct net_device *dev)
{
    ...
    err = netdev_rx_handler_register(dev, br_handle_frame, p);
    ...
}

In the forward case, br_handle_frame() then proceeds as follows:

br_handle_frame() 
|-> br_handle_frame_finish()
|-> br_forward()
|-> __br_forward()
|-> br_forward_finish()
|-> br_dev_queue_push_xmit()
|-> dev_queue_xmit()

As we can see, when handling incoming frames, the bridge eventually calls dev_queue_xmit() on the outgoing port, which ends in ndo_start_xmit(), a device-specific driver procedure. Here the outgoing port is again a veth device, so this is veth_xmit() as discussed above.

[NOTE: when we talk about br_handle_frame() here, we focus on the forward case. It also handles some frame-specific cases (e.g., IEEE pause frames, STP, etc.).]
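
The rx_handler hook that connects a bridge port to br_handle_frame() can be sketched in userspace as follows. The toy_* names are hypothetical, the handler signature is simplified, and the sketch assumes that a device accepts at most one handler, which is my reading of netdev_rx_handler_register().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct sk_buff. */
struct toy_skb { int id; };

enum { TOY_TO_STACK = 0, TOY_CONSUMED = 1 };

typedef int (*toy_rx_handler_t)(struct toy_skb *skb, void *data);

/* Reduced stand-in for struct net_device. */
struct toy_dev {
    toy_rx_handler_t rx_handler;
    void *rx_handler_data;
};

/* Reduced stand-in for the bridge state passed to the handler. */
struct toy_bridge { int forwarded; };

/* In the role of netdev_rx_handler_register(): one handler per device. */
static int toy_rx_handler_register(struct toy_dev *dev,
                                   toy_rx_handler_t handler, void *data)
{
    assert(dev != NULL && handler != NULL);
    if (dev->rx_handler)
        return -1;                  /* a handler is already attached */
    dev->rx_handler = handler;
    dev->rx_handler_data = data;
    return 0;
}

/* In the role of br_handle_frame(): the bridge takes over the frame. */
static int toy_br_handle_frame(struct toy_skb *skb, void *data)
{
    struct toy_bridge *br = data;

    (void)skb;
    br->forwarded++;
    return TOY_CONSUMED;
}

/* In the role of __netif_receive_skb_core(): try the rx_handler first. */
static int toy_receive_core(struct toy_dev *dev, struct toy_skb *skb,
                            int *stack_rx)
{
    if (dev->rx_handler &&
        dev->rx_handler(skb, dev->rx_handler_data) == TOY_CONSUMED)
        return TOY_CONSUMED;        /* e.g., forwarded by the bridge */
    (*stack_rx)++;                  /* otherwise continue up the local stack */
    return TOY_TO_STACK;
}
```

A device without a registered handler passes frames up the local stack, while a bridge port hands them to the bridge, which is why attaching a veth end to the bridge changes where its frames go.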

Misc

  • use grep -nr "SYSCALL_DEFINE" in the kernel source tree to locate system call definitions (e.g., write())
