NUMA awareness requires a relatively recent version of the Linux kernel.
Supported platforms:
For the NUMA awareness feature to work as intended, the hardware must also be NUMA balanced. In short, there should be at least one network card attached to each NUMA node.
To enable the NUMA awareness feature in Varnish Enterprise, set the following parameters:
reuseport=on
numa_aware=on
These parameters must be set before Varnish is started, and cannot be changed later through varnishadm.
Example varnishd run command:
varnishd -a :80 -f /etc/varnish/default.vcl -p reuseport=on -p numa_aware=on
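Once varnishd is running, the effective values can be verified with varnishadm. This is a read-only sanity check; as noted above, the parameters themselves cannot be changed at runtime:

# Show the effective values of the two parameters
varnishadm param.show reuseport
varnishadm param.show numa_aware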
We recommend that you perform testing before applying this change to all of your servers. See below for metrics to look for.
Varnish has a concept of thread pools, where each thread pool contains a set of worker threads. This provides a natural abstraction layer for a NUMA-aware Varnish, which is crucial because shared memory access between NUMA nodes should be avoided whenever possible. One way to ensure that tasks running in worker threads never cross over to another NUMA node for task-scoped data structures is to colocate memory pools with thread pools.
When numa_aware is enabled, it is recommended to leave thread_pools at its default value of 2. There should be one thread pool per NUMA node, and if your system has more than two NUMA nodes, Varnish will automatically increase the number of thread pools to match.
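To know how many thread pools to expect, you can check the NUMA node count of the host. The commands below are a minimal sketch, assuming the util-linux (lscpu) and numactl packages are installed:

# Number of NUMA nodes seen by the kernel
lscpu | grep -i 'numa node(s)'

# Equivalent view from numactl; the first line lists the available nodes
numactl --hardware | head -n 1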
To make the transition between the kernel and Varnish efficient, we pin new sessions to the thread pool on the NUMA node where the client connection arrived in the kernel.
Note that it is not possible to eliminate all possible NUMA crossing scenarios. Data structures shared between multiple tasks such as the cache itself are exempt from NUMA node pinning. Likewise, NUMA awareness is not a criterion for book and store selection in persisted caches with MSE. There are also background threads outside the worker thread pools.
The NUMA awareness feature was first introduced in Varnish Enterprise 6.0.8r2. It is generally recommended to upgrade to the latest release when possible.
NUMA (Non-Uniform Memory Access) is a hardware design used in multiprocessing systems. In a NUMA system, memory banks may be local to certain CPUs and remote to others, yet access remains transparent to the application regardless of where it is running.
A local memory bank is cheaper to access, as it is usually directly connected to the NUMA node where the task is currently running. Conversely, a remote memory bank costs more to access: it is further away from the NUMA node, and synchronization is required to present a cache-coherent memory view to the running task.
When an application accesses remote memory banks frequently, it will be subject to performance penalties. This includes higher latency on memory access, higher CPU usage, lower total throughput, and more. This is a potential bottleneck for high throughput systems with multiple NUMA nodes.
The Linux kernel has a NUMA auto-balance feature that migrates pages to where the application is currently running, using heuristics that try to reduce the number of remote accesses. However, we have found that active participation from the application yields better performance than the auto-balance feature by itself.
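For testing purposes, the kernel auto-balance feature can be inspected and toggled through sysctl. This is a hedged sketch, assuming a kernel built with NUMA balancing support:

# 1 means kernel NUMA auto-balancing is enabled, 0 means disabled
cat /proc/sys/kernel/numa_balancing

# Disable it temporarily (requires root) to compare against Varnish's own NUMA awareness
sysctl -w kernel.numa_balancing=0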
A key performance factor on a NUMA system is the balance of the hardware itself. A problem typically observed for NUMA-aware applications on unbalanced NUMA systems is that only some NUMA nodes are busy while other nodes remain mostly idle.
A common cause leading to part of the hardware being idle is the presence of a single network card. Having at least one network card per NUMA node will help prevent network operations from crossing an expensive NUMA boundary.
On some systems it can make sense to make the application NUMA aware even at the expense of some idle NUMA nodes. The question is whether the application will perform better running on one NUMA node than being continuously migrated between NUMA nodes in an unbalanced NUMA system.
lstopo is a useful tool to visualize the topology of a system. It can generate a diagram that graphically presents the layout of the system, including which components are connected to which NUMA nodes. lspci (in verbose mode) can also be used to inspect the system's NUMA layout, as well as the numa_node virtual file under the sysfs subsystem.
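As an example, the sysfs numa_node file can tell you which node a network card is attached to. The interface name and PCI address below are placeholders; the exact lspci output depends on the pciutils version:

# NUMA node the NIC is attached to (-1 means the kernel reports no affinity)
cat /sys/class/net/eth0/device/numa_node

# The same information from lspci in verbose mode
lspci -vv -s 01:00.0 | grep -i 'numa node'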
numastat(8) is a statistics tool that can be used to evaluate a system's page allocation performance.
Example output:
                   node0      node1
numa_hit        76557759   92126519
numa_miss       30772308   30827638
numa_foreign    30827638   30772308
interleave_hit    106507     103832
local_node      76502227   92086995
other_node      30827840   30867162
A balanced system should exhibit a high numa_hit counter and a comparatively low numa_miss counter. The manual page for numastat(8) can be consulted for more details on the other counters.
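numastat can also report per-process allocations, which is useful for checking how varnishd's memory is spread across nodes. A brief sketch, assuming the numastat shipped with the numactl package:

# System-wide counters, as in the example output above
numastat

# Per-node memory usage for processes matching the name varnishd
numastat -p varnishd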
For Intel CPUs, the Intel Performance Counter Monitor (Intel PCM) tool can provide finer details on how much traffic traverses between the NUMA nodes at a given time (have a look at the pcm-numa.x program): https://github.com/intel/pcm
Similar tools may exist for other platforms.
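As a rough illustration, pcm-numa can be run directly once PCM is built; it prints local and remote memory accesses per core at a regular interval. The binary name and location depend on the PCM version and build, so treat this invocation as an assumption:

# Requires root (MSR access); the binary may be named pcm-numa in newer releases
sudo ./pcm-numa.x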