Linux kernel tuning for high-throughput networking: sysctl, IRQ affinity and XDP

How to push a commodity server past 10 Gbps without buying new hardware. A deep dive into interrupt coalescing, RSS, RPS, and writing your first XDP program.

The packet path through the Linux kernel

Before tuning anything, you need to understand what you're tuning. When a packet arrives at your NIC, here's the abbreviated path it takes:

  1. NIC DMA: the hardware writes the packet into a ring buffer in memory and raises a hardware interrupt (IRQ)
  2. NAPI poll: the kernel's NAPI ("New API") layer masks further IRQs for that queue and polls the ring buffer from a softirq, processing packets in batches
  3. netif_receive_skb: the packet is handed to the network stack as a socket buffer (sk_buff)
  4. IP/TCP layer: routing decisions, TCP state machine, socket demultiplexing
  5. Application recv(): data lands in userspace

Each step has tunable parameters. Most guides stop at sysctl. We're going deeper.
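To see where along this path packets are being lost, the per-CPU counters in /proc/net/softnet_stat are a reasonable first stop. The sketch below decodes the first three hex columns (packets processed, drops because the backlog queue was full, and time_squeeze, which counts polls that ran out of budget with work remaining). The function name softnet_summary is made up for illustration, and it assumes the row index matches the CPU number, which holds when all CPUs are online.

```bash
# softnet_summary: read /proc/net/softnet_stat-style rows on stdin and
# print the first three counters per CPU in decimal.
softnet_summary() {
    local cpu=0 processed dropped squeezed _rest
    while read -r processed dropped squeezed _rest; do
        # Columns are zero-padded hex; $((0x...)) converts them to decimal.
        printf 'cpu%d processed=%d dropped=%d time_squeeze=%d\n' \
            "$cpu" "$((0x$processed))" "$((0x$dropped))" "$((0x$squeezed))"
        cpu=$((cpu + 1))
    done
}

# Usage on a live system:
#   softnet_summary < /proc/net/softnet_stat
```

A steadily growing dropped column points at netdev_max_backlog, while a growing time_squeeze column points at the NAPI poll budget or an overloaded core.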

sysctl baseline — the non-negotiables

# /etc/sysctl.d/99-network-performance.conf

# Increase socket buffer sizes — default 212992 is prehistoric
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.core.rmem_default = 31457280
net.core.wmem_default = 31457280

# TCP socket buffer auto-tuning range [min, default, max] in bytes
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728

# Increase the backlog queue depth
net.core.netdev_max_backlog = 250000
net.core.somaxconn = 65535

# Enable TCP BBR congestion control (requires kernel >= 4.9)
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr

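# Reuse TIME_WAIT sockets for new outbound connections and shorten the FIN_WAIT_2 timeout
# (tcp_tw_recycle, the old "fast recycling" knob, was removed in kernel 4.12)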
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15

# Increase local port range for outbound connections
net.ipv4.ip_local_port_range = 1024 65535

# Disable slow start after idle (bad for bursty workloads)
net.ipv4.tcp_slow_start_after_idle = 0

# Apply the file without rebooting (run as root)
sysctl -p /etc/sysctl.d/99-network-performance.conf
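A quick way to sanity-check the 134217728-byte ceilings is the bandwidth-delay product: a single TCP flow needs roughly link speed times round-trip time of buffer to keep the pipe full. The helper below is a minimal sketch of that arithmetic. The name bdp_bytes and the choice of Mbit/s and milliseconds as units are assumptions made for this example.

```bash
# bdp_bytes LINK_MBPS RTT_MS
# Bandwidth-delay product in bytes:
#   (Mbit/s * 1e6 / 8) bytes/s * (ms / 1000) s  =  Mbit/s * ms * 125
bdp_bytes() {
    echo $(( $1 * $2 * 125 ))
}

# bdp_bytes 10000 100   → 125000000  (10 Gbps, 100 ms RTT)
# bdp_bytes 10000 1     → 1250000    (10 Gbps, 1 ms RTT)
```

At 10 Gbps and 100 ms the result sits just under the 128 MiB maximum, which is why that value is a common choice for long-haul links. On a low-latency LAN a far smaller maximum would do.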

IRQ affinity and RSS

Receive Side Scaling (RSS) distributes incoming packets across multiple hardware queues by hashing each flow's addresses and ports, and each queue's IRQ can be serviced by a different CPU core. Without RSS, every packet lands in one queue serviced by one core (often CPU 0), creating a bottleneck that no amount of sysctl tuning can fix.

Check how many queues your NIC has

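ethtool -l eth0
# Channel parameters for eth0:
# Pre-set maximums:
# RX:     0
# TX:     0
# Other:  1
# Combined:       8     ← the NIC supports up to 8 combined queues
# (the "Current hardware settings" block that follows shows how many are active now)

# Set to match your CPU core count, up to the maximum above
# (on a dual-socket NUMA system, use the cores of the socket the NIC is attached to)
ethtool -L eth0 combined 8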
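Because `ethtool -l` prints both a "Pre-set maximums" block and a "Current hardware settings" block, it is easy to read the wrong number. The sketch below pulls the Combined count out of each block. The function name combined_channels is hypothetical, and it assumes the NIC reports combined channels rather than separate RX/TX ones.

```bash
# combined_channels: read `ethtool -l` output on stdin and print the
# Combined channel count for the maximum and the current settings.
combined_channels() {
    awk '
        /^Pre-set maximums:/          { section = "max" }
        /^Current hardware settings:/ { section = "current" }
        /^Combined:/                  { print section, $2 }
    '
}

# Usage on a live system:
#   ethtool -l eth0 | combined_channels
```

If the current value is below the maximum, `ethtool -L` has room to raise it.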

Pin each queue's IRQ to a specific CPU core

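#!/bin/bash
# irq-affinity.sh: pin NIC queues to CPU cores (run as root)
# Stop irqbalance first, or it will overwrite these settings.
IFACE=eth0
CPU=0
NCPU=$(nproc)
for IRQ in $(grep "${IFACE}" /proc/interrupts | awk -F: '{print $1}' | tr -d ' '); do
    # smp_affinity expects a hex bitmask, so writing a decimal value there pins
    # the wrong CPUs; smp_affinity_list takes plain CPU numbers instead.
    echo "${CPU}" > "/proc/irq/${IRQ}/smp_affinity_list"
    echo "IRQ ${IRQ} → CPU ${CPU}"
    CPU=$(( (CPU + 1) % NCPU ))
done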
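On a multi-socket machine it usually pays to pin queues only to cores on the NIC's own NUMA node. sysfs exposes that node and its CPUs as a compact list such as 0-3,8-11. The helper below is a sketch that expands such a list into one CPU per line. The name expand_cpulist is invented for this example, and it does not handle the rarer stride syntax (for example 0-6:2).

```bash
# expand_cpulist LIST
# Expand a kernel cpulist such as "0-3,8-11" into one CPU number per line.
expand_cpulist() {
    echo "$1" | tr ',' '\n' | while read -r part; do
        case $part in
            *-*) seq "${part%-*}" "${part#*-}" ;;   # range "a-b"
            '')  ;;                                 # ignore empty fields
            *)   echo "$part" ;;                    # single CPU
        esac
    done
}

# Usage on a live system (eth0 is this article's example interface):
#   node=$(cat /sys/class/net/eth0/device/numa_node)   # -1 means no NUMA info
#   expand_cpulist "$(cat /sys/devices/system/node/node${node}/cpulist)"
```

The resulting CPU numbers can be written one at a time to /proc/irq/<IRQ>/smp_affinity_list.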

Writing your first XDP program

eXpress Data Path (XDP) is a programmable fast path in the Linux kernel. XDP programs run in the NIC driver's receive path, before the kernel allocates an sk_buff or the packet enters the network stack. This means you can drop, redirect, or modify packets at or near line rate with minimal CPU overhead.

A minimal XDP program that counts packets per source IP:

// xdp_count.c
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

// BPF map: source IP → packet count
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __type(key, __u32);    // source IPv4 address
    __type(value, __u64);  // packet count
    __uint(max_entries, 65536);
} packet_count SEC(".maps");

SEC("xdp")
int count_packets(struct xdp_md *ctx) {
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    // Parse Ethernet header
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;

    // Only handle IPv4
    if (bpf_ntohs(eth->h_proto) != ETH_P_IP)
        return XDP_PASS;

    // Parse IP header
    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    __u32 src_ip = ip->saddr;
    __u64 *count = bpf_map_lookup_elem(&packet_count, &src_ip);
    if (count) {
        __sync_fetch_and_add(count, 1);
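    } else {
        // First packet from this source. BPF_NOEXIST avoids overwriting
        // an entry that another CPU inserted after our lookup missed.
        __u64 init = 1;
        bpf_map_update_elem(&packet_count, &src_ip, &init, BPF_NOEXIST);
    }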

    return XDP_PASS;  // let the packet continue
}

char _license[] SEC("license") = "GPL";
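The map key is the raw ip->saddr value, which is stored in network byte order. Depending on the bpftool version and whether the object file carries BTF, a map dump shows that key either as four hex bytes (already in address order) or as a single integer. The sketch below handles the integer form and assumes a little-endian host such as x86-64. The function name key_to_ipv4 is hypothetical.

```bash
# key_to_ipv4 KEY
# KEY is the __u32 map key as a decimal integer, as read on a little-endian host.
# Because saddr is in network byte order, the lowest byte is the first octet.
key_to_ipv4() {
    local k=$1
    printf '%d.%d.%d.%d\n' \
        $(( k & 255 )) $(( (k >> 8) & 255 )) $(( (k >> 16) & 255 )) $(( (k >> 24) & 255 ))
}

# key_to_ipv4 167880896   → 192.168.1.10
```

On a big-endian host the byte order of the integer would be reversed, so the shifts would need to run in the opposite direction.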

Compile and attach

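# Compile with clang (-g emits the BTF that libbpf needs for the .maps section)
clang -O2 -g -target bpf -c xdp_count.c -o xdp_count.o \
    -I/usr/include/$(uname -m)-linux-gnu

# Attach to interface in native (driver) mode, the fastest option short of hardware offload.
# Plain "xdp" lets the kernel fall back to the slower generic mode if the driver lacks XDP support.
ip link set dev eth0 xdpdrv obj xdp_count.o sec xdp

# Read the map (requires bpftool)
bpftool map dump name packet_count

Performance numbers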

A trivial XDP DROP program on a modern NIC can process about 24 million packets per second on a single core. Compared to iptables DROP at ~1.5 Mpps, that is roughly 16x the per-core drop rate for packet filtering workloads.
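To put those packet rates in context, it helps to know how many packets per second a link can physically carry. Each Ethernet frame occupies its own length plus 20 bytes on the wire (preamble, start delimiter and inter-frame gap). The sketch below does that arithmetic. The name line_rate_pps is made up for this example, and the frame size is assumed to include the 4-byte FCS.

```bash
# line_rate_pps LINK_GBPS FRAME_BYTES
# Maximum frames per second on an Ethernet link.
# Wire cost per frame = FRAME_BYTES + 20 (7 preamble + 1 SFD + 12 inter-frame gap).
line_rate_pps() {
    echo $(( $1 * 1000000000 / (($2 + 20) * 8) ))
}

# line_rate_pps 10 64   → 14880952  (about 14.88 Mpps, 10GbE with minimum-size frames)
# line_rate_pps 25 64   → 37202380  (about 37.2 Mpps, 25GbE with minimum-size frames)
```

By this measure, 24 Mpps on one core is more than a 10GbE link can deliver even with minimum-size frames, while 1.5 Mpps is about a tenth of that line rate.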

Measuring the result

# Baseline throughput with iperf3
iperf3 -s &                        # on the receiving host
iperf3 -c server-ip -t 30 -P 8     # on the sending host: 30 s, 8 parallel streams

# Monitor NIC stats in real time
watch -n1 'ethtool -S eth0 | grep -E "(rx_packets|tx_packets|rx_bytes|tx_bytes|rx_dropped)"'

# Check IRQ distribution across CPUs
watch -n1 'cat /proc/interrupts | grep eth0'
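Eyeballing /proc/interrupts gets hard once there are many queues and cores. The sketch below totals the interrupt counts per CPU for every row that matches a pattern, so a lopsided distribution stands out. The function name irq_per_cpu is hypothetical, and the pattern has to match however your driver names its IRQs (some drivers do not use the interface name).

```bash
# irq_per_cpu PATTERN
# Read /proc/interrupts-style text on stdin and print the interrupt total
# per CPU across all rows that match PATTERN.
irq_per_cpu() {
    awk -v pat="$1" '
        NR == 1 {                       # header row: CPU0 CPU1 ...
            ncpu = NF
            for (i = 1; i <= NF; i++) name[i] = $i
            next
        }
        $0 ~ pat {                      # $1 is "IRQ:", counts start at $2
            for (i = 1; i <= ncpu; i++) total[i] += $(i + 1)
        }
        END { for (i = 1; i <= ncpu; i++) printf "%s %d\n", name[i], total[i] }
    '
}

# Usage on a live system:
#   irq_per_cpu eth0 < /proc/interrupts
```

With affinity set correctly, the totals should grow at a similar pace on every core that has a queue pinned to it.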

After applying all of the above on a dual-socket server with a Mellanox ConnectX-5, I measured sustained 23 Gbps bidirectional throughput — up from 7 Gbps with default settings. The bottleneck shifted from the kernel to the PCIe bus, which is where it should be.
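If you want to cross-check iperf3 against the NIC's own counters, one option is to sample /sys/class/net/<iface>/statistics/rx_bytes twice and convert the difference to Gbps. The sketch below shows only the conversion. The name rate_gbps is invented for this example, and the numbers in the comment are illustrative, not measurements.

```bash
# rate_gbps DELTA_BYTES SECONDS
# Convert a byte-counter difference over an interval into Gbit/s.
rate_gbps() {
    awk -v b="$1" -v s="$2" 'BEGIN { printf "%.2f\n", b * 8 / s / 1e9 }'
}

# Usage on a live system (eth0 is this article's example interface):
#   a=$(cat /sys/class/net/eth0/statistics/rx_bytes); sleep 30
#   b=$(cat /sys/class/net/eth0/statistics/rx_bytes)
#   rate_gbps $((b - a)) 30
#
# rate_gbps 37500000000 30   → 10.00
```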
