Under the hood

Now that you know what Varnish does, and how powerful this piece of software is, it’s time to take a look behind the curtain. The raw power of Varnish is the direct consequence of an architecture that doesn’t compromise when it comes to performance.

This section is very technical, and covers some concepts that will be clarified later in the book. However, it is useful to talk about some of the internals of Varnish right now, because it will give you a better understanding when we cover the concepts in detail.

This applies to runtime parameters, default behavior, and the impact of certain VCL changes.

Let’s start at the beginning.

When the varnishd program is executed, it starts one additional process resulting in:

  • The manager process
  • The child process

In varnishd there is a separation of privileges: actions that require privileged access to the operating system run in the manager process, while all other actions run in a separate child process.

The manager process

Because of its privileged access to the operating system, the manager process will only perform actions that require this access, and leave all other tasks to the child process.

The manager process is responsible for opening up sockets, and binding them to a selected endpoint.

If you configured varnishd to listen for incoming connections on port 80, which is a privileged port, this is processed by the manager process with elevated privileges. However, the manager process will not handle any data that is sent on this socket. Listening on that socket, accepting new connections, and reading incoming data is the responsibility of the child process.

The manager process will also make sure that the child process is responsive. It continuously pings the child process, and if the child process should become unresponsive or die unexpectedly, the manager process will tear down the old child process and start a new one.

The manager process also owns the command line socket, which is used by the varnishadm management program. Privileged access is required here in order to start or stop the child process using varnishadm.

The manager process is also responsible for opening up the VCL file and reading its contents. However, the compilation of VCL happens in a separate process.

And finally, the manager process is also responsible for keeping track of the different VCL configurations that were loaded into Varnish.

The VCL compiler process

As mentioned, the manager process will open the VCL file, and will read the contents, but it will not process the VCL code. The VCL compiler process, which is a separate process, will take care of the VCL compilation.

The process is named vcc-compiler. However, you won’t often see it appear in your process list, as it is a transient process: it only runs for the duration of the compilation.

The vcc-compiler process isn’t just used on startup; it also runs when the vcl.load or vcl.inline commands are executed by the varnishadm command line tool.

Compilation steps

The first step in the VCL compilation process is to take the raw VCL code read from the VCL file and process any include statements in the VCL code. The files are resolved and inserted verbatim in the code. After that a special source referred to as the built-in VCL code is included last. This adds a sane default behavior that respects best practices of each VCL function, even if you don’t write any additional VCL.

As mentioned earlier, VCL is a language that can be used to extend the behavior of the various states in the Varnish finite state machine. The built-in VCL is just there as a safety net. The details of this finite state machine will be covered in chapters 3 and 4.

With the complete VCL source available, the VCL compiler will then transform the VCL code into C-code. The management process will then spawn a new vcc-compiler process to compile the C-code.

If you want to see the actual C-code that is produced for your VCL file, you can run the following command:

varnishd -C -f <vcl_filename>

The vcc-compiler takes the C-code, and runs the gcc compiler on it. This compiles and optimizes the C statements into object code that can be executed directly by the host system CPU. The output is a shared library in the form of a .so file. The cc_command parameter is used by the vcc-compiler to set the gcc flags used for compilation.

After that, the child process is notified, and will load and run the .so file.

All these steps describe how Varnish compiles the VCL on startup. But using the varnishadm vcl.load and vcl.use commands, you can load new VCL at runtime. varnishadm vcl.load will compile the VCL into a .so file and load it. varnishadm vcl.use will select the .so file to be used for new connections.
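
For example, loading and activating a new configuration at runtime could look like this; the label new_config and the file path are just placeholders:

varnishadm vcl.load new_config /etc/varnish/new.vcl
varnishadm vcl.use new_config

After vcl.use, new connections are handled by the newly activated configuration, while transactions that are already in progress finish under the VCL they started with.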

The child process

From a security point of view, you really want to avoid giving the manager process too many responsibilities. That’s why the child process does most of the work in Varnish.

Accepting the connections, processing the requests, and producing the responses are all done in the child process. Seen from afar the child process basically sits in a loop and waits for incoming connections and requests to be processed. This in turn will activate the numerous mechanisms that make up the Varnish caching engine, such as backend fetches and caching of content.

Threads

To be honest, all this logic doesn’t happen in one place. The child process will distribute the workload across a set of threads. Threads are used to facilitate both parallelism and asynchronous operations.

There are various threads in Varnish. Some of them have a dedicated role, others are general-purpose worker threads that are kept in thread pools.

Here’s an overview of the threading model in Varnish:

Thread name         Amount                                     Task
cache-main          1                                          startup & initialization
acceptor            1 per thread pool per listening endpoint   accept new connections
cache-worker        1 per active connection                    request handling, fetch
ban-lurker          1                                          background ban processing
waiter              1                                          manages idle connections
expiry              1                                          remove expired content
backend-poller      1                                          manage probe tasks
thread-pool-herder  1 per thread pool                          monitor & manage threads
hcb_cleaner         1                                          cleaning up retired hashes

If you’re using Varnish Enterprise, you can use the varnishscoreboard program to display the state of the currently active threads.

Varnish Enterprise has several additional threads:

Thread name         Amount           Task
vsm_publish         1                publish & remove shared memory segments
cache-memory-stats  1                memory statistics gathering
cache-governator    1                memory governor balancing thread
mse_waterlevel      1 per MSE book   MSE book database waterlevel handling
mse_aio             1 per MSE store  MSE store AIO execution
mse_hoic            1 per MSE store  MSE store waterlevel handling

The cache-main thread

The cache-main thread is the entry point at which the management process forks off the child process.

This thread initializes the dependencies of the child process. These are just a set of dedicated threads, which will be covered in just a minute.

As soon as the initialization is finished, the cache-main thread really doesn’t have anything more to do. So it turns into the command line thread: it sits in a loop, waiting for CLI commands to come in.

This may seem confusing because earlier I mentioned that the management process takes care of the command line. Well, in fact, they both do.

There is a Unix pipe between the manager process and the cache-main thread of the child process. Although the command line socket is owned by the manager process, commands that are relevant to the child process will be sent over the Unix pipe.

Commands that require privileged access to the system are the responsibility of the manager process.

The thread pool herder thread

One of the first threads that is initialized by cache-main is the thread-pool-herder thread because a lot of internal components depend on thread pools.

A thread pool is a collection of resources that Varnish reuses while handling incoming requests. These resources include things like the worker threads and the workspaces they use for scratch space. Some of these resources benefit from Non-uniform memory access (NUMA) locality, and are grouped together in a pool. The number of thread pools is configurable through the thread_pools runtime parameter. The default value is two.

A thread pool manages a set of threads that perform work on demand. The threads do not terminate right away. When one of the threads completes a task, the thread becomes idle, is returned to the pool, and is ready to be dispatched to another task.

The thread-pool-herder is a per pool management thread. It will create the number of threads defined by the thread_pool_min runtime parameter at startup, and never goes below that number. The default value is 100.

When the traffic on the Varnish server increases, the thread-pool-herder threads will create new threads in their pools. They will continue to monitor the traffic, and create new threads until the thread_pool_max value is reached. The default value is 5000.

Note that thread_pool_min and thread_pool_max set limits per thread pool.

When the incoming workload exceeds the number of free threads, the thread-pool-herder thread will queue incoming tasks while new threads are being created.

When threads have been idle for too long, the thread-pool-herder thread will remove these threads from the thread pool. The thread_pool_timeout runtime parameter defines the thread idle threshold.

But as mentioned, the number of threads in a thread pool will never go below the value of thread_pool_min.
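
For example, both limits can be raised at startup using -p flags; the values below are purely illustrative, not recommendations:

varnishd -a :80 -f /etc/varnish/default.vcl -p thread_pool_min=200 -p thread_pool_max=4000

The same parameters can also be changed at runtime with varnishadm param.set.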

The acceptor threads

The acceptor threads are the point of entry for incoming connections. They are created by the cache-main thread, and are one of those dependencies I referred to.

The acceptor threads will call accept on a socket that was opened by the management process. This call is the server end part of the TCP handshake. The SYN-ACK part, if you will.

The acceptor threads will then delegate the incoming connections by dispatching a worker thread from the thread pool.

There is one acceptor thread per listening point per thread pool. This means for a single listening point and the default number of thread pools, there will be two acceptor threads that are running.

The waiter thread

The waiter thread is used to manage idle file descriptors.

Behind the scenes, epoll or kqueue are used, depending on the operating system. epoll is a Linux implementation. kqueue is a BSD implementation. Since this book focuses on Linux, we’ll talk about epoll.

epoll is the successor of the poll system call. It polls file descriptors to see if I/O is possible. epoll is a lot more efficient at large scale. The same applies to kqueue on BSD systems.

The term file descriptor is quite broad: on Unix systems everything is a file, so network connections also use file descriptors, and Varnish happens to process a lot of those at large scale.

Varnish leverages the waiter to keep track of open backend connections. Whenever a backend connection is idle, it will sit in the waiter for Varnish to monitor the connection status.

In addition, Varnish will use the waiter for client connections whenever we are done processing a request and the connection goes idle.

Varnish does not use epoll for regular connection handling: client traffic is still processed using blocking I/O. epoll is only used for idle connections.

The expiry thread

The expiry thread is used to remove expired objects from cache.

This thread keeps a heap data structure that tracks the TTL of objects. The object that expires next is always at the top of the data structure.

When an expired object is removed, the heap is re-ordered and again has the object that expires next at the top.

The expiry thread removes expired objects, goes back to sleep, and wakes up to do it all over again. The amount of time that the expiry thread sleeps is the time until the new element at the top of the heap expires.

The backend-poller thread

The backend-poller thread manages a set of health probe tasks. Health probes are used to monitor the health of backends, and to decide whether or not a backend can be considered healthy.

The backend-poller thread keeps track of the health check interval that was defined by the probe, and dispatches the health check at the right time.

As mentioned, this thread manages probes, and dispatches health checks. It doesn’t perform the actual HTTP request itself. Instead the backend-poller thread will farm out the work to a worker thread.
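
As an illustration, this is roughly what a backend definition with a health probe looks like in VCL; the hostname, URL, and timings are placeholder values:

backend origin {
    .host = "origin.example.com";
    .port = "8080";
    .probe = {
        .url = "/health";
        .interval = 5s;
        .timeout = 2s;
        .window = 5;
        .threshold = 3;
    }
}

With this definition, the backend-poller thread dispatches a health check to /health every five seconds, and the backend is considered healthy when at least three of the last five probes succeeded.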

The ban-lurker thread

The ban-lurker thread has the responsibility of removing items from the ban list.

But before we can talk about that, let me briefly explain what banning means.

Banning, and content invalidation will be covered in detail in chapter 6.

A ban is a mechanism in Varnish to ultimately remove one or more objects from the cache.

Bans happen based on a ban expression, and these bans end up on the ban list. This expression can be triggered using the ban() function in VCL, or by calling the ban command in the varnishadm administration tool.

Expressions that match objects in the cache cause these objects to be removed from cache. Once all objects have been checked, the ban is removed from the ban list, because it is no longer relevant.

Bans are evaluated when an object is accessed, causing a ban expression to have an immediate effect on the cache. The ban-lurker thread is responsible for matching ban expressions on the ban list against the objects in cache, including those that are infrequently accessed.

There are some runtime parameters that influence the behavior of the thread:

  • ban_lurker_age: the age a ban should have before the lurker evaluates it. The default value is 60 seconds
  • ban_lurker_batch: the number of bans that are processed during a ban lurker run. The default value is 1000
  • ban_lurker_holdoff: the number of seconds the ban lurker holds off when locking contention occurs. The default value is 0.010 seconds
  • ban_lurker_sleep: the number of seconds the ban-lurker thread sleeps before performing its next run. The default value is also 0.010 seconds
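
For example, a ban can be issued from the command line; the Content-Type pattern below is just an example, and it deliberately only uses obj.* fields so that the ban-lurker thread can evaluate it in the background:

varnishadm ban "obj.http.Content-Type ~ ^image/"

Bans that reference req.* fields can only be evaluated when a matching object is actually requested, so they are not processed by the ban lurker.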

Worker threads

There is one worker thread per active connection when the HTTP/1 protocol is used. For HTTP/2 there are multiple worker threads per connection: one for the HTTP/2 session, and one for each HTTP/2 stream.

Additionally, each backend fetch will consume one worker thread.

Worker threads can be spawned on demand, but spawning new threads comes at a cost. That's why a number of threads is pre-allocated in the thread pools.

Transports

One of the first tasks that the worker thread performs is checking which protocol handler was configured.

In Varnish Cache, this can be the PROXY transport handler, or the regular HTTP transport handler.

In Varnish Enterprise, there’s the addition of the TLS transport handler.

Imagine the following address configuration in Varnish:

varnishd -a :80,PROXY -f /etc/varnish/default.vcl

Because PROXY was used, we first need to handle the PROXY protocol bytes that are part of the TCP preamble. This information contains the IP and port that were used to connect to Varnish.

Varnish will populate the various IP and port variables based on this information.

Once this decoding process is finished, the PROXY transport handler will hand off the work to the HTTP transport handler, which is now able to process the HTTP part of the TCP request.

The HTTP transport handler will parse the HTTP request and populate the necessary internal data structures with request information for later use.

In the example below, we’re not using the PROXY protocol, which means the HTTP transport handler is used immediately:

varnishd -a :80 -f /etc/varnish/default.vcl

In Varnish Enterprise, we can configure native TLS support. Our address configuration may look like this:

varnishd -A /etc/varnish/tls.cfg -a :80 -f /etc/varnish/default.vcl

Because the -A option was used, the TLS transport handler will be used, which will handle the crypto part. But after that, the work is handed off to the HTTP transport handler.

Disembarking

When a worker thread is waiting for a fetch to finish, its internal state can be stored on a waiting list, while the worker thread is put back into the thread pool.

This concept is called disembarking, and it is an optimization that avoids needlessly tying up threads that are merely waiting.

Transactions on the waiting list can be woken up after a fetch finishes, and will be redispatched to another worker thread.

The waiting list

When an incoming request doesn’t result in a cache hit, Varnish has to connect to the origin server to fetch the content. If a lot of connections for the same resource happen at the same time, the origin server has to process a lot of connections through Varnish, and could suffer from increased server load.

To avoid this, Varnish has a waiting list per object, where requests asking for the same object are grouped together.

The first request for this object will result in the creation of a busy object, which tells Varnish there is a fetch in progress. While the busy object is in place, all subsequent requests for this resource are put on the waiting list.

As soon as the response is ready for delivery, all items on the waiting list can be satisfied. However, the rush exponent will make sure the kernel doesn’t choke on a sudden increase of activity.

The rush_exponent runtime parameter defines how many waiting list items are woken up per run, growing exponentially with each run. Its default value is 3. This means that the first run will satisfy three requests, the next run will satisfy nine, and the following one 27. This is a mitigation put in place to avoid the so-called thundering herd problem.

The exponential nature of this mechanism ensures a workload buildup that the kernel will be able to handle.

This concept of satisfying multiple items on the waiting list is called request coalescing because we’re basically coalescing multiple similar requests into a single backend request.

Serialization

Request coalescing is a very powerful feature in Varnish. But when Varnish is not able to get a proper TTL for the object, the object is immediately expired.

The first transaction on the waiting list will be satisfied by the fetch, but since the object was immediately expired it cannot be used to satisfy the rest of the requests on the waiting list.

This means that the other waiting list items are kept there, and are processed serially. This side effect is what we call serialization because the waiting list is processed serially, instead of in parallel.

As you can imagine, serialization is very bad for performance, and for the quality of experience in general.

Imagine that you have a waiting list of 1000 items, and a backend fetch takes two seconds to be completed. When serialization takes effect, the last transaction in the waiting list has to wait 2000 seconds until completion.

The sole reason for serialization is bad VCL configuration. As a Varnish operator, you have the flexibility to override many aspects of the behavior of the cache. The TTL and the cacheability of fetched responses are part of that.

Non-cacheable responses are also cached in the so-called hit-for-miss or hit-for-pass cache. In essence, we’re caching the decision not to cache and by default this happens for a duration of 120 seconds.

Items on the hit-for-miss or hit-for-pass cache will bypass the waiting list to avoid serialization.

A common, but bad, practice is setting the TTL of an object to zero in VCL, when deciding not to cache. This expires the object immediately, and the waiting list no longer has the required information.

The way uncacheable content should be approached is by setting the object to uncacheable in VCL, and ensuring a proper TTL, which will be beneficial for transactions in the waiting list.
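
A minimal VCL sketch of this approach could look like this; the Cache-Control check and the 120-second duration are just examples:

sub vcl_backend_response {
    if (beresp.http.Cache-Control ~ "private") {
        # Cache the decision not to cache (hit-for-miss),
        # so transactions on the waiting list are not serialized.
        set beresp.uncacheable = true;
        set beresp.ttl = 120s;
        return (deliver);
    }
}

Because the response is marked as uncacheable but still has a TTL, a hit-for-miss object stays in cache for 120 seconds, and subsequent requests bypass the waiting list.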

The built-in VCL, which is covered in chapters 3 and 4, has the necessary behavior in place to protect VCL operators from falling into this trap.

When writing custom VCL, please try to fall back on the built-in VCL as much as possible. As we will discover later in the book: built-in VCL behavior takes effect when VCL subroutines don’t explicitly return an action.

Workspaces

The concept of workspaces in Varnish is an optimization to lessen the strain on the system memory allocator. Memory allocation is expensive, especially for short-lived allocations.

Varnish will allocate a chunk of memory for each transaction. A very simple allocator within Varnish can hand out memory from that chunk. We call this workspace memory.

Different parts of Varnish use workspace memory in various ways:

Request handling happens using workspace memory, and is sized using the workspace_client runtime parameter. The default value is 64 KB. This means that the client-side processing of each request, and the subsequent response, receives 64 KB of memory per request.

For transactions that involve a backend fetch, a separate piece of workspace memory is used: the workspace_backend runtime parameter defines how much memory per backend request can be used. By default this is also 64 KB.

The workspace_session runtime parameter reflects how much memory from workspace can be consumed for storing session data. This is mainly information about TCP connection addresses, and other information that is kept for the entire duration of the session. The default value here is 0.5 KB.

There is also a workspace_thread runtime parameter that defines how much auxiliary workspace memory will be assigned as thread local scratch space. This memory space is primarily used for storing I/O-vectors during delivery.
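
These workspace parameters can be inspected and adjusted at runtime through varnishadm; the 128k value below is purely illustrative:

varnishadm param.show workspace_client
varnishadm param.set workspace_client 128k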

Backend fetches

Ever since Varnish 4, there has been a split between client-side logic and backend-side logic. Whereas there was only one thread for this in Varnish 3, it was split up into two separate threads starting with Varnish 4.

A major advantage is that background fetches are supported. This means that a client doesn’t need to wait for the backend response to be returned. A background fetch takes place, and while that happens, a stale object can be served to the client.

As soon as the background fetch is finished, the object is updated, and subsequent requests receive the fresh data.

Streaming

Another advantage of the client and backend split is that it enables streaming delivery. When this is enabled, the body of a backend fetch may be delivered to clients as it is being received.

This of course has the side effect that fetch failures become visible to the clients. The streaming delivery can be turned off if this is not desired by setting the beresp.do_stream VCL variable to false in vcl_backend_response. This will cause the entire object to be received before it is delivered to any waiting clients.
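
For instance, streaming could be disabled for a specific set of responses in vcl_backend_response; the URL pattern is just an example:

sub vcl_backend_response {
    if (bereq.url ~ "^/downloads/") {
        # Fetch the entire object before delivering it,
        # so fetch failures are not exposed to waiting clients.
        set beresp.do_stream = false;
    }
}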

Varnish Fetch and Delivery Processors

When Varnish fetches content from a backend, it flows through a set of filters called Varnish Fetch Processors (VFP). These filters perform different tasks, like compressing the object using GZIP or parsing the content for Edge Side Include (ESI) instructions.

Similarly, when delivering content to clients, a set of filters called Varnish Delivery Processors (VDP) is used. These typically perform tasks like decompressing content if necessary, or stitching together ESI content.
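
As an illustration, the gzip and ESI fetch processors can be enabled from VCL; the Content-Type check is just an example:

sub vcl_backend_response {
    if (beresp.http.Content-Type ~ "text/html") {
        # Parse the response body for ESI instructions (ESI VFP)
        set beresp.do_esi = true;
        # Compress the response before storing it in cache (gzip VFP)
        set beresp.do_gzip = true;
    }
}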

