The packet path through the Linux kernel
Before tuning anything, you need to understand what you're tuning. When a packet arrives at your NIC, here's the abbreviated path it takes:
- NIC DMA: the hardware writes the packet into a ring buffer in memory and raises a hardware interrupt (IRQ)
- NAPI poll: the kernel's NAPI ("New API") layer masks further IRQs for that queue and polls the ring buffer in a loop, processing packets in batches
- netif_receive_skb: the packet is handed to the network stack as a socket buffer (sk_buff)
- IP/TCP layer: routing decisions, TCP state machine, socket demultiplexing
- Application recv(): data lands in userspace
Each step has tunable parameters. Most guides stop at sysctl. We're going deeper.
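To see where packets are lost along this path, a reasonable first stop is /proc/net/softnet_stat, which has one row of hex counters per CPU. The sketch below decodes the first three columns (packets processed, packets dropped because the backlog queue was full, and the number of times a NAPI poll ran out of budget). The function name decode_softnet_line is made up for illustration, and later columns vary by kernel version, so it ignores them.

```shell
#!/bin/bash
# softnet-stat.sh: decode /proc/net/softnet_stat (one row per CPU, hex columns)
decode_softnet_line() {
    # Word-split the row: $1=processed, $2=dropped, $3=time_squeeze
    set -- $1
    printf 'processed=%d dropped=%d time_squeeze=%d\n' \
        "$((0x$1))" "$((0x$2))" "$((0x$3))"
}
# Print one decoded row per CPU when the file is readable
if [ -r /proc/net/softnet_stat ]; then
    while read -r row; do
        decode_softnet_line "$row"
    done < /proc/net/softnet_stat
fi
```

A dropped count that keeps growing points at the backlog queue, while a growing time_squeeze suggests the poll loop is running out of budget before the ring is empty.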
sysctl baseline — the non-negotiables
# /etc/sysctl.d/99-network-performance.conf
# Increase socket buffer sizes (the default of 212992 is prehistoric)
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.core.rmem_default = 31457280
net.core.wmem_default = 31457280
# TCP socket buffer auto-tuning range [min, default, max] in bytes
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728
# Increase the per-CPU input packet queue and the listen() accept backlog
net.core.netdev_max_backlog = 250000
net.core.somaxconn = 65535
# Enable TCP BBR congestion control (requires kernel >= 4.9)
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
# Reuse TIME_WAIT sockets for new outbound connections and shorten the FIN_WAIT_2 timeout
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15
# Increase local port range for outbound connections
net.ipv4.ip_local_port_range = 1024 65535
# Disable slow start after idle (bad for bursty workloads)
net.ipv4.tcp_slow_start_after_idle = 0
sysctl -p /etc/sysctl.d/99-network-performance.conf
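One way to sanity-check the 128 MiB buffer ceiling is the bandwidth-delay product: a single TCP flow needs roughly link speed times round-trip time of buffer to keep the pipe full. The sketch below does that arithmetic in shell. The 10 Gbit/s link and 100 ms RTT are illustrative assumptions, not measurements from this article, and bdp_bytes is a hypothetical helper name.

```shell
# bdp_bytes LINK_MBPS RTT_MS -> bandwidth-delay product in bytes
bdp_bytes() {
    # (Mbit/s * 1e6 / 8) bytes/s * (ms / 1000) s  ==  Mbit/s * ms * 125
    echo $(( $1 * $2 * 125 ))
}
bdp_bytes 10000 100   # 10 Gbit/s path with a 100 ms RTT: prints 125000000
```

At about 125 MB, that hypothetical path fits just under the 134217728-byte maximum set above. Shorter paths need far less.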
IRQ affinity and RSS
Receive Side Scaling (RSS) distributes incoming packets across multiple hardware queues, each mapped to a different CPU core. Without RSS, every packet's interrupt and softirq processing lands on a single core (typically CPU 0), creating a bottleneck that no amount of sysctl tuning can fix.
Check how many queues your NIC has
ethtool -l eth0
# Channel parameters for eth0:
# Pre-set maximums:
# RX: 0
# TX: 0
# Other: 1
# Combined: 8 ← this NIC supports up to 8 hardware queues
# Set to match your CPU core count (or half for NUMA systems)
ethtool -L eth0 combined 8
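The ethtool -l output has two sections, the hardware maximums and the current settings, and it is easy to read the wrong one. A small filter like the following pulls out both Combined values. It is a sketch that assumes the usual "Pre-set maximums" and "Current hardware settings" section titles, and queue_counts is a made-up helper name.

```shell
# queue_counts: read `ethtool -l` output on stdin, print max and current combined queues
queue_counts() {
    awk '
        /Pre-set maximums/          { section = "max" }
        /Current hardware settings/ { section = "current" }
        /^Combined:/                { print section "=" $2 }
    '
}
# Usage on a live system: ethtool -l eth0 | queue_counts
```

If current is lower than max, the NIC has spare queues that ethtool -L can enable.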
Pin each queue's IRQ to a specific CPU core
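#!/bin/bash
# irq-affinity.sh: pin NIC queues to CPU cores (run as root; stop irqbalance first or it will undo this)
IFACE=eth0
CPU=0
NCPU=$(nproc)
for IRQ in $(grep "${IFACE}" /proc/interrupts | awk -F: '{print $1}' | tr -d ' '); do
    # smp_affinity_list takes a plain CPU number (smp_affinity expects a hex bitmask)
    echo "${CPU}" > "/proc/irq/${IRQ}/smp_affinity_list"
    echo "IRQ ${IRQ} → CPU ${CPU}"
    CPU=$(( (CPU + 1) % NCPU ))
done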
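A detail worth knowing when pinning IRQs: /proc/irq/<N>/smp_affinity is read as a hexadecimal bitmask, written as comma-separated 32-bit groups with the most significant group first. A value such as 16 therefore means CPUs 1, 2 and 4 rather than CPU 4. The helper below, with the hypothetical name cpu_to_mask, sketches the conversion from a CPU number to that format.

```shell
# cpu_to_mask CPU -> hex bitmask in the format /proc/irq/<N>/smp_affinity expects
cpu_to_mask() {
    word=$(( $1 / 32 ))                        # which 32-bit group the CPU falls in
    mask=$(printf '%x' $(( 1 << ($1 % 32) )))  # bit position inside that group, as hex
    while [ "$word" -gt 0 ]; do                # append all-zero groups for the lower CPUs
        mask="${mask},00000000"
        word=$(( word - 1 ))
    done
    echo "$mask"
}
cpu_to_mask 4    # prints 10 (hex), not 16
```

Writing a CPU number to smp_affinity_list avoids the conversion entirely. The mask form is mainly useful when one IRQ should be allowed on several cores.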
Writing your first XDP program
eXpress Data Path (XDP) is a programmable fast path in the Linux kernel. In native mode, XDP programs run inside the NIC driver's receive path, before the kernel allocates an sk_buff or the packet enters the network stack. This means you can drop, redirect, or modify packets at or near line rate with minimal CPU overhead.
A minimal XDP program that counts packets per source IP:
// xdp_count.c
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>
// BPF map: source IP → packet count
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __type(key, __u32);   // source IPv4 address
    __type(value, __u64); // packet count
    __uint(max_entries, 65536);
} packet_count SEC(".maps");
SEC("xdp")
int count_packets(struct xdp_md *ctx) {
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    // Parse Ethernet header (the verifier requires this bounds check)
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;
    // Only handle IPv4
    if (bpf_ntohs(eth->h_proto) != ETH_P_IP)
        return XDP_PASS;
    // Parse IP header
    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;
    __u32 src_ip = ip->saddr; // network byte order
    __u64 *count = bpf_map_lookup_elem(&packet_count, &src_ip);
    if (count) {
        __sync_fetch_and_add(count, 1);
    } else {
        __u64 init = 1;
        // BPF_NOEXIST: don't overwrite an entry another CPU inserted in the meantime
        bpf_map_update_elem(&packet_count, &src_ip, &init, BPF_NOEXIST);
    }
    return XDP_PASS; // let the packet continue
}
char _license[] SEC("license") = "GPL";
Compile and attach
# Compile with clang (-g emits the BTF that libbpf needs for SEC(".maps") definitions)
clang -O2 -g -target bpf -c xdp_count.c -o xdp_count.o \
    -I/usr/include/$(uname -m)-linux-gnu
# Attach to interface in native (driver) mode, the fastest; plain "xdp" may fall back to generic mode
ip link set dev eth0 xdpdrv obj xdp_count.o sec xdp
# Read the map — requires bpftool
bpftool map dump name packet_count
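The map keys are IPv4 addresses stored in network byte order, so the dump is not directly readable. Depending on the bpftool version and whether type information is available, a key may show up as four hex bytes or as a single integer. The two helpers below are a sketch for turning either form into a dotted quad. Their names are made up for illustration, and the integer variant assumes a little-endian host.

```shell
# bytes_to_ip B0 B1 B2 B3 -> dotted quad, from the four hex key bytes in memory order
bytes_to_ip() {
    printf '%d.%d.%d.%d\n' "0x$1" "0x$2" "0x$3" "0x$4"
}
# u32_to_ip N -> dotted quad, for a key printed as one integer (assumes a little-endian host)
u32_to_ip() {
    printf '%d.%d.%d.%d\n' $(( $1 & 255 )) $(( ($1 >> 8) & 255 )) \
        $(( ($1 >> 16) & 255 )) $(( ($1 >> 24) & 255 ))
}
bytes_to_ip c0 a8 01 0a   # prints 192.168.1.10
u32_to_ip 167880896       # prints 192.168.1.10
```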
A trivial XDP DROP program on a modern NIC can process 24 million packets per second on a single core. Compared to iptables DROP at ~1.5 Mpps, XDP is roughly 16x more efficient for packet filtering workloads.
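To put those packet rates in context, it helps to work out what line rate means in packets per second and how much time each packet gets. The sketch below does both with integer shell arithmetic. The 10GbE link with minimum-size 64-byte frames is an illustrative assumption, and the function names are hypothetical.

```shell
# line_rate_pps LINK_BPS FRAME_BYTES -> max packets/s on the wire
# Each Ethernet frame also costs 20 bytes of preamble, start delimiter, and inter-frame gap.
line_rate_pps() {
    echo $(( $1 / (($2 + 20) * 8) ))
}
# ns_per_packet PPS -> time budget per packet on one core, in whole nanoseconds
ns_per_packet() {
    echo $(( 1000000000 / $1 ))
}
line_rate_pps 10000000000 64   # 10GbE, minimum-size frames: prints 14880952
ns_per_packet 24000000         # the XDP figure above: prints 41
ns_per_packet 1500000          # the iptables figure above: prints 666
```

At roughly 41 ns per packet there is little room for per-packet work, which is why skipping the network stack matters so much at these rates.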
Measuring the result
# Baseline throughput with iperf3
iperf3 -s &
iperf3 -c server-ip -t 30 -P 8 # 8 parallel streams
# Monitor NIC stats in real time
watch -n1 'ethtool -S eth0 | grep -E "(rx_packets|tx_packets|rx_bytes|tx_bytes|rx_dropped)"'
# Check IRQ distribution across CPUs
watch -n1 'cat /proc/interrupts | grep eth0'
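The raw counters are easier to compare against iperf3 once they are turned into a rate. The sketch below samples the kernel's per-interface rx_bytes counter under /sys/class/net twice and converts the difference to Mbit/s. The helper names are made up for illustration, and the integer division truncates, so treat the result as approximate.

```shell
# mbps_between OLD_BYTES NEW_BYTES SECONDS -> average throughput in Mbit/s
mbps_between() {
    echo $(( ($2 - $1) * 8 / $3 / 1000000 ))
}
# sample_rx_mbps IFACE SECONDS: read the rx_bytes counter twice, SECONDS apart
sample_rx_mbps() {
    f="/sys/class/net/$1/statistics/rx_bytes"
    old=$(cat "$f"); sleep "$2"; new=$(cat "$f")
    mbps_between "$old" "$new" "$2"
}
# Usage on a live system: sample_rx_mbps eth0 5
```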
After applying all of the above on a dual-socket server with a Mellanox ConnectX-5, I measured sustained 23 Gbps bidirectional throughput — up from 7 Gbps with default settings. The bottleneck shifted from the kernel to the PCIe bus, which is where it should be.