Linux kernel tuning for high-throughput networking

The packet path through the Linux kernel

Before tuning anything, you need to understand what you're tuning. When a packet arrives at your NIC, here's the abbreviated path it takes:

1. NIC DMA: the hardware writes the packet into a ring buffer in memory and raises a hardware interrupt (IRQ)
2. NAPI poll: the kernel's NAPI ("New API") layer masks further IRQs for that queue and polls the ring buffer in a loop, processing packets in batches
3. netif_receive_skb: the packet is handed to the network stack as a socket buffer (sk_buff)
4. IP/TCP layer: routing decisions, TCP state machine, socket demultiplexing
5. Application recv(): data lands in userspace
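
To see whether the NAPI stage is keeping up, the kernel exposes per-CPU counters in /proc/net/softnet_stat. The sketch below decodes the first three hex columns (packets processed, packets dropped because the backlog queue was full, and the number of times the poll loop ran out of budget). The function name softnet_summary is my own, and it assumes one line per CPU in CPU order, which may not hold if some CPUs are offline.

```shell
# Summarize /proc/net/softnet_stat-style input read from stdin.
# Each line is one CPU; all columns are hexadecimal.
# Column 1 = packets processed, 2 = dropped (backlog full), 3 = time_squeeze.
softnet_summary() {
    cpu=0
    while read -r processed dropped squeezed _rest; do
        printf 'cpu%d processed=%d dropped=%d time_squeeze=%d\n' \
            "$cpu" "$((0x$processed))" "$((0x$dropped))" "$((0x$squeezed))"
        cpu=$((cpu + 1))
    done
}

# Usage on a live system:
#   softnet_summary < /proc/net/softnet_stat
```

A steadily growing dropped or time_squeeze count on one CPU is usually the hint that the queue depth or the IRQ distribution needs attention.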

Each step has tunable parameters. Most guides stop at sysctl. We're going deeper.

sysctl baseline — the non-negotiables

# /etc/sysctl.d/99-network-performance.conf

# Increase socket buffer sizes — default 212992 is prehistoric
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.core.rmem_default = 31457280
net.core.wmem_default = 31457280

# TCP socket buffer auto-tuning range [min, default, max] in bytes
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728

# Increase the backlog queue depth
net.core.netdev_max_backlog = 250000
net.core.somaxconn = 65535

# Enable TCP BBR congestion control (requires kernel >= 4.9)
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr

# Reuse TIME_WAIT sockets for new outbound connections and shorten the FIN_WAIT_2 timeout
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15

# Increase local port range for outbound connections
net.ipv4.ip_local_port_range = 1024 65535

# Disable slow start after idle (bad for bursty workloads)
net.ipv4.tcp_slow_start_after_idle = 0

sysctl -p /etc/sysctl.d/99-network-performance.conf
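
A quick way to sanity-check the 128 MiB buffer ceiling is the bandwidth-delay product: the number of bytes that must be in flight to keep a path full. The helper below is a rough sketch with a name I made up (bdp_bytes). It takes whole numbers only, and the 10 Gbit/s and 100 ms inputs are illustrative assumptions, not measurements.

```shell
# bdp_bytes LINK_GBPS RTT_MS -> bandwidth-delay product in bytes
bdp_bytes() {
    # 1 Gbit/s for 1 ms is 1,000,000 bits, i.e. 125,000 bytes
    echo $(( $1 * $2 * 125000 ))
}

bdp_bytes 10 100    # 10 Gbit/s path with a 100 ms RTT; prints 125000000
```

125,000,000 bytes is a little under the 134217728 maximum set above, so that ceiling is enough for a single flow to fill a 10 Gbit/s path at 100 ms. Faster or longer paths would need a larger value.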

IRQ affinity and RSS

Receive Side Scaling (RSS) hashes each incoming flow to one of several hardware receive queues, each with its own IRQ that can be serviced by a different CPU core. Without RSS, every packet arrives on a single queue and is processed by one core (often CPU 0), creating a bottleneck that no amount of sysctl tuning can fix.

Check how many queues your NIC has

ethtool -l eth0
# Channel parameters for eth0:
# Pre-set maximums:
# RX:     0
# TX:     0
# Other:  1
# Combined:       8     ← the NIC supports up to 8 combined RX/TX queues

# Set to match your CPU core count (or half for NUMA systems)
ethtool -L eth0 combined 8

Pin each queue's IRQ to a specific CPU core

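#!/bin/bash
# irq-affinity.sh: pin each NIC queue IRQ to one CPU core, round-robin.
# Run as root, and stop irqbalance first or it will overwrite these settings.
IFACE=eth0
NCPU=$(nproc)
CPU=0
# IRQ names vary by driver; check /proc/interrupts if nothing matches ${IFACE}.
for IRQ in $(grep "${IFACE}" /proc/interrupts | awk -F: '{print $1}' | tr -d ' '); do
    # smp_affinity expects a hex bitmask; smp_affinity_list takes a plain CPU number
    echo "${CPU}" > "/proc/irq/${IRQ}/smp_affinity_list"
    echo "IRQ ${IRQ} → CPU ${CPU}"
    CPU=$(( (CPU + 1) % NCPU ))
done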
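
The kernel exposes two affinity files per IRQ: smp_affinity takes a hexadecimal CPU bitmask, and smp_affinity_list takes a list of CPU numbers. Because the mask is parsed as hex, CPU 4 has to be written as 10, not 16. Here is a small sketch of the conversion, including the comma-separated 32-bit groups used on machines with more than 32 CPUs. The helper name cpu_to_mask is hypothetical.

```shell
# cpu_to_mask CPU -> hex bitmask in the format /proc/irq/<N>/smp_affinity accepts
cpu_to_mask() {
    word=$(( $1 / 32 ))                          # which 32-bit group the CPU falls in
    mask=$(printf '%x' $(( 1 << ($1 % 32) )))    # bit position inside that group
    while [ "$word" -gt 0 ]; do                  # append one zero group per lower word
        mask="${mask},00000000"
        word=$(( word - 1 ))
    done
    echo "$mask"
}

cpu_to_mask 4     # prints 10
cpu_to_mask 33    # prints 2,00000000
```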

Writing your first XDP program

eXpress Data Path (XDP) is a programmable fast path in the Linux kernel. XDP programs run in the NIC driver context — before the packet even enters the kernel network stack. This means you can drop, redirect, or modify packets at line rate with minimal CPU overhead.

A minimal XDP program that counts packets per source IP:

// xdp_count.c
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

// BPF map: source IP → packet count
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __type(key, __u32);    // source IPv4 address
    __type(value, __u64);  // packet count
    __uint(max_entries, 65536);
} packet_count SEC(".maps");

SEC("xdp")
int count_packets(struct xdp_md *ctx) {
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    // Parse Ethernet header
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;

    // Only handle IPv4
    if (bpf_ntohs(eth->h_proto) != ETH_P_IP)
        return XDP_PASS;

    // Parse IP header
    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    __u32 src_ip = ip->saddr;
    __u64 *count = bpf_map_lookup_elem(&packet_count, &src_ip);
    if (count) {
        __sync_fetch_and_add(count, 1);
    } else {
        __u64 init = 1;
        bpf_map_update_elem(&packet_count, &src_ip, &init, BPF_ANY);
    }

    return XDP_PASS;  // let the packet continue
}

char _license[] SEC("license") = "GPL";

Compile and attach

# Compile with clang (-g emits the BTF that libbpf needs for the .maps section)
clang -O2 -g -target bpf -c xdp_count.c -o xdp_count.o \
    -I/usr/include/$(uname -m)-linux-gnu

# Attach in native (driver) mode; plain "xdp" falls back to generic mode if the driver lacks native support
ip link set dev eth0 xdpdrv obj xdp_count.o sec xdp

# Read the map — requires bpftool
bpftool map dump name packet_count
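
The map key is the raw saddr field, so it is stored in network byte order. If bpftool shows the key as a single integer (which I would expect when the object carries BTF, but treat that as an assumption), the sketch below turns it back into a dotted quad. It assumes a little-endian host such as x86-64, and the helper name key_to_ipv4 is my own.

```shell
# key_to_ipv4 KEY -> dotted quad, where KEY is the __u32 map key as read on a
# little-endian host (the address bytes themselves are in network order)
key_to_ipv4() {
    printf '%d.%d.%d.%d\n' \
        $(( $1 & 255 )) $(( ($1 >> 8) & 255 )) \
        $(( ($1 >> 16) & 255 )) $(( ($1 >> 24) & 255 ))
}

key_to_ipv4 16777226    # prints 10.0.0.1
```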

Performance numbers

A trivial XDP DROP program on a modern NIC can process about 24 million packets per second (Mpps) on a single core. Compared to iptables DROP at ~1.5 Mpps, that is roughly 16x the per-core throughput for packet filtering workloads.
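
To put the 24 Mpps figure in context, it helps to work out two numbers: the maximum frame rate a link can carry, and the time budget per packet. The sketch below does both with integer arithmetic. The function names are hypothetical, and the 10 Gbit/s link with minimum-size 64-byte frames is an illustrative assumption.

```shell
# line_rate_pps LINK_GBPS FRAME_BYTES -> max frames per second on the wire
line_rate_pps() {
    # each frame also occupies 20 bytes of preamble, start delimiter and inter-frame gap
    echo $(( $1 * 1000000000 / (($2 + 20) * 8) ))
}

# ns_per_packet PPS -> time budget per packet in whole nanoseconds
ns_per_packet() {
    echo $(( 1000000000 / $1 ))
}

line_rate_pps 10 64       # prints 14880952 (about 14.88 Mpps on 10GbE)
ns_per_packet 24000000    # prints 41
```

At 24 Mpps a core has roughly 41 ns per packet, which is why skipping sk_buff allocation and the rest of the stack matters so much.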

Measuring the result

# Baseline throughput with iperf3
iperf3 -s &                      # on the receiving host
iperf3 -c server-ip -t 30 -P 8   # on the sending host: 8 parallel streams for 30 s

# Monitor NIC stats in real time
watch -n1 'ethtool -S eth0 | grep -E "(rx_packets|tx_packets|rx_bytes|tx_bytes|rx_dropped)"'

# Check IRQ distribution across CPUs
watch -n1 'cat /proc/interrupts | grep eth0'
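
Eyeballing /proc/interrupts gets tedious with many queues. A rough sketch that adds up the per-CPU counts for every line matching a pattern is below. The function name irq_per_cpu is my own, and it assumes the usual layout: a header row of CPU labels, then an IRQ number followed by one count per CPU.

```shell
# irq_per_cpu PATTERN: sum interrupt counts per CPU for matching lines of
# /proc/interrupts-style input on stdin
irq_per_cpu() {
    awk -v pat="$1" '
        NR == 1 { ncpu = NF; next }    # header row: one CPUn label per column
        $0 ~ pat { for (i = 1; i <= ncpu; i++) sum[i] += $(i + 1) }
        END { for (i = 1; i <= ncpu; i++) printf "CPU%d %d\n", i - 1, sum[i] }
    '
}

# Usage on a live system:
#   irq_per_cpu eth0 < /proc/interrupts
```

If one CPU carries most of the total, the affinity settings have probably not taken effect.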

After applying all of the above on a dual-socket server with a Mellanox ConnectX-5, I measured sustained 23 Gbps bidirectional throughput — up from 7 Gbps with default settings. The bottleneck shifted from the kernel to the PCIe bus, which is where it should be.
