NUMA awareness requires a relatively recent version of the Linux kernel.
Supported platforms:
For the NUMA awareness feature to work as intended, the hardware must also be NUMA balanced. In short, there should be at least one network card attached to each NUMA node.
To enable the NUMA awareness feature in Varnish Enterprise, set the following parameters:
reuseport=on
numa_aware=on
These parameters must be set before Varnish is started, and cannot be changed later through varnishadm.
Example varnishd run command:
varnishd -a :80 -f /etc/varnish/default.vcl -p reuseport=on -p numa_aware=on
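Once varnishd is running, the effective values can be verified with varnishadm. This is a read-only sanity check; as noted above, the parameters themselves cannot be changed at runtime:

# Show the effective values of the two parameters
varnishadm param.show reuseport
varnishadm param.show numa_aware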
We recommend that you perform testing before applying this change to all of your servers. See below for metrics to look for.
Varnish has a concept of thread pools, where each thread pool contains a set of worker threads. This provides a natural abstraction layer for a NUMA-aware Varnish, which is crucial because shared memory access between NUMA nodes should be avoided whenever possible. One way to ensure that tasks running in worker threads never cross over to another NUMA node for task-scoped data structures is to colocate memory pools with thread pools.
When numa_aware is enabled, it is recommended to leave thread_pools at its default value of 2. There should be one thread pool per NUMA node, and if your system has more than two NUMA nodes, Varnish will automatically increase the number of thread pools to match.
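To know how many thread pools to expect, you can check the NUMA node count of the host. The commands below are a minimal sketch, assuming the util-linux (lscpu) and numactl packages are installed:

# Number of NUMA nodes seen by the kernel
lscpu | grep -i 'numa node(s)'

# Equivalent view from numactl; the first line lists the available nodes
numactl --hardware | head -n 1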
To make the transition between the kernel and Varnish efficient, we pin new sessions to the thread pool on the NUMA node where the client connection arrived in the kernel.
Note that it is not possible to eliminate all possible NUMA crossing scenarios. Data structures shared between multiple tasks such as the cache itself are exempt from NUMA node pinning. Likewise, NUMA awareness is not a criterion for book and store selection in persisted caches with MSE. There are also background threads outside the worker thread pools.
The NUMA awareness feature was first introduced in Varnish Enterprise 6.0.8r2. It is generally recommended to upgrade to the latest release when possible.
NUMA (Non-Uniform Memory Access) is a hardware design used in multiprocessing systems. In a NUMA system, memory banks may be local to certain CPUs and remote to others, yet access remains transparent to the application regardless of where it is running.
A local memory bank is cheaper to access, as it is usually directly connected to the NUMA node where the task is currently running. Conversely, a remote memory bank costs more to access: it is further away from the NUMA node, and synchronization is required to present a cache-coherent memory view to the running task.
When an application accesses remote memory banks frequently, it will be subject to performance penalties. This includes higher latency on memory access, higher CPU usage, lower total throughput, and more. This is a potential bottleneck for high throughput systems with multiple NUMA nodes.
The Linux kernel has a NUMA auto-balance feature that migrates pages to where the application is currently running, using heuristics that try to reduce the number of remote accesses. However, we have found that active participation from the application yields better performance than the auto-balance feature by itself.
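For testing purposes, the kernel auto-balance feature can be inspected and toggled through sysctl. This is a hedged sketch, assuming a kernel built with NUMA balancing support:

# 1 means kernel NUMA auto-balancing is enabled, 0 means disabled
cat /proc/sys/kernel/numa_balancing

# Disable it temporarily (requires root) to compare against Varnish's own NUMA awareness
sysctl -w kernel.numa_balancing=0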
A key performance factor on a NUMA system is the balance of the hardware itself. A problem typically observed for NUMA-aware applications on unbalanced NUMA systems is that only some NUMA nodes are busy while other nodes remain mostly idle.
A common cause leading to part of the hardware being idle is the presence of a single network card. Having at least one network card per NUMA node will help prevent network operations from crossing an expensive NUMA boundary.
On some systems it can make sense to make the application NUMA aware even at the expense of some idle NUMA nodes. The question is whether the application will perform better running on one NUMA node than being continuously migrated between NUMA nodes in an unbalanced NUMA system.
lstopo is a useful tool to visualize the topology of a system. It can generate a diagram that graphically presents the layout of the system, including which components are connected to which NUMA nodes. lspci (in verbose mode) can also be used to inspect the system's NUMA layout, as well as the numa_node virtual file under the sysfs subsystem.
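As an example, the sysfs numa_node file can tell you which node a network card is attached to. The interface name and PCI address below are placeholders; the exact lspci output depends on the pciutils version:

# NUMA node the NIC is attached to (-1 means the kernel reports no affinity)
cat /sys/class/net/eth0/device/numa_node

# The same information from lspci in verbose mode
lspci -vv -s 01:00.0 | grep -i 'numa node'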
numastat(8) is a statistics tool that can be used to evaluate a system's page allocation performance.
Example output:
                   node0      node1
numa_hit        76557759   92126519
numa_miss       30772308   30827638
numa_foreign    30827638   30772308
interleave_hit    106507     103832
local_node      76502227   92086995
other_node      30827840   30867162
A balanced system should exhibit a high numa_hit counter and a comparatively low numa_miss counter. The manual page for numastat(8) can be consulted for more details on the other counters.
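numastat can also report per-process allocations, which is useful for checking how varnishd's memory is spread across nodes. A brief sketch, assuming the numastat shipped with the numactl package:

# System-wide counters, as in the example output above
numastat

# Per-node memory usage for processes matching the name varnishd
numastat -p varnishd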
For Intel CPUs, the Intel Performance Counter Monitor (Intel PCM) tool can provide finer details on how much traffic traverses between the NUMA nodes at a given time (have a look at the pcm-numa.x program): https://github.com/intel/pcm
Similar tools may exist for other platforms.
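As a rough illustration, pcm-numa can be run directly once PCM is built; it prints local and remote memory accesses per core at a regular interval. The binary name and location depend on the PCM version and build, so treat this invocation as an assumption:

# Requires root (MSR access); the binary may be named pcm-numa in newer releases
sudo ./pcm-numa.x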