
Cluster

Introduction

Varnish Cluster is a solution for increasing cache hit rate in a Varnish Enterprise deployment and reducing load on the origin service. It’s dynamic, scalable, and can be enabled with just a few lines of VCL.

Getting Started

Prerequisites

Software:

  • Varnish Enterprise version 6.0.13r6 or higher.

Networking:

  • Varnish nodes can communicate with each other over HTTP(S).
  • For dynamic clusters, all Varnish nodes can make DNS queries.

VCL:

  • The same request must always result in the same request hash. Requests may go through sub vcl_recv multiple times, so manipulation of req.url and req.http.Host must be idempotent (see the sketch after this list).
  • The VCL should be the same on all nodes.
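
For example, here is a minimal sketch of the difference between idempotent and non-idempotent request manipulation (the transformations are illustrative):

sub vcl_recv {
  # Idempotent: stripping a port suffix from the Host header yields the
  # same result no matter how many times vcl_recv runs.
  set req.http.Host = regsub(req.http.Host, ":[0-9]+$", "");

  # NOT idempotent: appending to the URL would change the request hash
  # on every pass through vcl_recv.
  # set req.url = req.url + "?from=cluster";
}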

DNS: (Applies to dynamic clusters only)

  • A single domain name that resolves to the IPs of all nodes in the cluster (see the zone-file sketch after this list).
  • Multiple IPs per node are not supported.
  • A, AAAA, and SRV records are supported.
  • For SRV records, the port, weight, and priority attributes are respected by default. weight must not be changed while the cluster is receiving traffic.
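
As an illustration, the records below sketch what such a zone might contain (names, TTLs, and addresses are hypothetical):

; Three A records under a single name, one per node
varnish.nodes.example.com.  60  IN  A  10.0.0.1
varnish.nodes.example.com.  60  IN  A  10.0.0.2
varnish.nodes.example.com.  60  IN  A  10.0.0.3

; Alternatively, SRV records carrying priority, weight, and port
_varnish._tcp.example.com.  60  IN  SRV  10 50 6081 node-a.example.com.
_varnish._tcp.example.com.  60  IN  SRV  10 50 6081 node-b.example.com.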

Static Cluster

In a static cluster, each node is defined as a separate backend or director in the VCL. This is a good fit for clusters where nodes are added and removed infrequently.

Step 1: Include cluster.vcl near the top of your VCL:

include "cluster.vcl";

Step 2: Create a backend or director for each Varnish node and add them to the cluster director (created by cluster.vcl):

backend node_a { .host = "ip:port"; }
backend node_b { .host = "ip:port"; }
backend node_c { .host = "ip:port"; }

sub vcl_init {
  cluster.add_backend(node_a);
  cluster.add_backend(node_b);
  cluster.add_backend(node_c);
}

Step 3: Set a cluster token. This is used to tell regular client requests apart from internal cluster requests.

sub vcl_init {
  cluster_opts.set("token", "secret");
}

Step 4: Set req.backend_hint or bereq.backend to your origin backend or director.

sub vcl_backend_fetch {
  set bereq.backend = origin;
}

Final VCL:

vcl 4.1;

include "cluster.vcl";

backend node_a { .host = "ip:port"; }
backend node_b { .host = "ip:port"; }
backend node_c { .host = "ip:port"; }
backend origin { .host = "ip:port"; }

sub vcl_init {
  cluster.add_backend(node_a);
  cluster.add_backend(node_b);
  cluster.add_backend(node_c);

  cluster_opts.set("token", "secret");
}

sub vcl_backend_fetch {
  set bereq.backend = origin;
}

Dynamic Cluster

In a dynamic cluster, nodes are resolved from a domain name, allowing the cluster to shrink and grow on demand. This is a good fit for autoscaling clusters.

Step 1: Include cluster.vcl near the top of your VCL:

include "cluster.vcl";

Step 2: Create a DNS group with a domain name that resolves to all of the cluster node IPs and subscribe the cluster director to the group (this requires the activedns VMOD to be imported):

sub vcl_init {
  new cluster_group = activedns.dns_group("varnish.nodes");
  cluster.subscribe(cluster_group.get_tag());
}

Step 3: Set a cluster token. This is used to tell regular client requests apart from internal cluster requests:

sub vcl_init {
  cluster_opts.set("token", "secret");
}

Step 4: Set req.backend_hint or bereq.backend to your origin backend or director:

sub vcl_backend_fetch {
  set bereq.backend = origin;
}

Final VCL:

vcl 4.1;

import activedns;

include "cluster.vcl";

backend origin { .host = "ip:port"; }

sub vcl_init {
  new cluster_group = activedns.dns_group("varnish.nodes");
  cluster.subscribe(cluster_group.get_tag());

  cluster_opts.set("token", "secret");
}

sub vcl_backend_fetch {
  set bereq.backend = origin;
}

Validate configuration

To validate the configuration, the following steps can be taken:

Step 1: Log into any of the Varnish nodes and execute the following command:

varnishstat -1 -f 'KVSTORE.cluster_stats.*'

A set of cluster metrics should appear in the output.

Step 2: Log into any of the Varnish nodes and execute the following command:

varnishadm backend.list -p

The output should contain one backend for each cluster node, including the node itself. Make sure all cluster nodes are marked as healthy.

Step 3: Enable the X-Cluster-Trace response header by setting the trace option to true:

sub vcl_init {
  cluster_opts.set("trace", "true");
}

Step 4: Use curl -I to send a request to any of the cluster nodes. Make sure to request a cacheable object that is not currently in the cache. The response should contain an X-Cluster-Trace header that shows the request's path through the cluster. The header may look like any of the following:

  • v1->MISS, origin: The node has determined itself to be the primary node for this object and fetched it from the origin. Requesting the same object from the v2 Varnish node should then result in a v2->MISS, v1->HIT trace.
  • v1->MISS->RETRY(1), origin: The node has likely self-identified, and its self_identified counter should now be 1.
  • v1->MISS, v2->MISS, origin: The request was autosharded to v2, which fetched the object from the origin.

Here, v1 and v2 are the hostnames of two Varnish nodes (or the server identity if Varnish was started with -i), and origin is the name of the VCL backend or director representing the origin server.
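
For instance, the first trace above can be observed by requesting the same uncached object through two different nodes (hostnames are hypothetical):

curl -sI http://v1.example.com/logo.png | grep -i X-Cluster-Trace
# X-Cluster-Trace: v1->MISS, origin
curl -sI http://v2.example.com/logo.png | grep -i X-Cluster-Trace
# X-Cluster-Trace: v2->MISS, v1->HIT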

Configuration

Token

The token is used to prove cluster membership for requests from one node to another. It must be set to the same value on all nodes in the same cluster.

It is recommended to set this to a hard-to-guess string:

sub vcl_init {
  cluster_opts.set("token", "correct horse battery staple");
}

Fallback

The number of times a node will retry a backend fetch to other nodes in the cluster before going to the origin. The default max_retries value of 4 means that with a fallback value of 3, failed fetches will automatically retry to other cluster nodes up to three times before making a final fetch attempt to origin.

The default value is 3. Setting this parameter to 0 makes the cluster nodes immediately retry to the origin when the fetch to another cluster node fails:

sub vcl_init {
  cluster_opts.set("fallback", "3");
}

Trace

Determines whether or not to return an X-Cluster-Trace response header to the client. The header provides information about the path a request took through the cluster, and can be useful for testing and troubleshooting.

It is normally recommended to leave this setting at its default value (false) to avoid exposing request handling to external clients:

sub vcl_init {
  cluster_opts.set("fallback", "false");
}

Primaries

Determines the number of primary nodes for each object. An object’s primary node is responsible for fetching it from the origin. The default value of 1 means that each object has exactly one primary node in the cluster, ensuring that the object is fetched only once from the origin.

Setting this value to 2 means that any given object has two primary nodes, and may be fetched from the origin by either node. Requests for an object to non-primary nodes are load balanced over the two primary nodes. This may reduce the strain on cluster nodes in extreme cases, at the cost of duplicate requests to origin.
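
For illustration, configuring two primaries per object looks like this:

sub vcl_init {
  cluster_opts.set("primaries", "2");
}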

It is normally recommended to leave this setting at its default value (1):

sub vcl_init {
  cluster_opts.set("primaries", "1");
}

Health Checks

Health checks can be enabled between cluster nodes by adding probes to the cluster backend definition. For the health checks to succeed, a synthetic 200 response can be added to sub vcl_recv.

Step 1: Define a probe:

probe cluster_probe {
  .url = "/health";
}

Step 2: Assign the probe to the cluster nodes.

For static clusters:

backend node_a { .host = "ip:port"; .probe = cluster_probe; }
backend node_b { .host = "ip:port"; .probe = cluster_probe; }
backend node_c { .host = "ip:port"; .probe = cluster_probe; }

For dynamic clusters:

sub vcl_init {
  new cluster_group = activedns.dns_group("varnish.nodes");
  cluster_group.set_probe_template(cluster_probe);
  cluster.subscribe(cluster_group.get_tag());
}

Step 3: Define the health check endpoint at the top of sub vcl_recv:

sub vcl_recv {
  if (req.url == "/health") {
    return (synth(200));
  }
}

TLS

TLS can be enabled between cluster nodes the same way as with regular backends.

For static clusters:

backend node_a { .host = "ip:port"; .ssl = 1; }
backend node_b { .host = "ip:port"; .ssl = 1; }
backend node_c { .host = "ip:port"; .ssl = 1; }

For dynamic clusters:

sub vcl_init {
  new cluster_group = activedns.dns_group("varnish.nodes:443");
  cluster.subscribe(cluster_group.get_tag());
}

Skip

Any request can be marked to skip autosharding and go directly to the origin in case of a cache MISS. This is done by setting the X-Cluster-Skip header to true in sub vcl_recv:

sub vcl_recv {
  if (req.url == "/foo") {
    set req.http.X-Cluster-Skip = "true";
  }
}

Any request can also be marked to skip receiving accounting keys. This is done by setting the X-Cluster-Skip-Accounting header to true in sub vcl_recv:

sub vcl_recv {
  if (req.url == "/foo") {
    set req.http.X-Cluster-Skip-Accounting = "true";
  }
}

Storage sharding

Cluster storage capacity can be scaled horizontally with storage sharding. By using the autosharding algorithm to selectively persist objects to disk, the total storage capacity is increased with each node added to the cluster.

To implement storage sharding, import the mse VMOD and add the following snippet to sub vcl_backend_response:

import mse;

sub vcl_backend_response {
  if (bereq.backend == cluster.backend() && !cluster.self_is_next(1)) {
    # Storage sharding: Mark the response as memory-only
    mse.set_stores("none");
  }
}

By making objects memory-only on all but the primary node, we ensure that any given object is persisted to disk on only one node in the cluster. This type of sharding is called full sharding.

Partial sharding is also possible by changing the cluster.self_is_next() argument from 1 to 2 (or more). This will persist each object on both its primary and secondary node. The cluster can now lose any node without significantly increasing traffic to origin, but the total cluster storage capacity is reduced by 50%.
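
A sketch of the partial sharding variant, which keeps each object on disk on both its primary and secondary node:

import mse;

sub vcl_backend_response {
  if (bereq.backend == cluster.backend() && !cluster.self_is_next(2)) {
    # Partial sharding: memory-only unless this node is the primary
    # or secondary node for the object
    mse.set_stores("none");
  }
}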

Cache invalidation

Cache invalidation can be performed as normal in a cluster, with one significant exception: It must be run twice. Whether PURGEs, BANs, or yKey purges are used, two rounds of invalidation must be performed to guarantee that all matching objects in the cluster have been evaluated.

The first invalidation round will invalidate all primary and non-primary objects currently cached in the cluster. When the first round has been completed, the second round will invalidate all non-primary objects that were created during the first invalidation round. It is important to wait for the first round to complete before starting the second round.
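
As a minimal sketch, assuming a PURGE handler exists in the VCL and using hypothetical node addresses, two invalidation rounds could look like this:

# Round 1: invalidate on every node
for node in v1:6081 v2:6081 v3:6081; do
  curl -s -X PURGE "http://$node/foo"
done
# Wait for round 1 to complete on all nodes, then repeat as round 2
for node in v1:6081 v2:6081 v3:6081; do
  curl -s -X PURGE "http://$node/foo"
done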

For examples on how to invalidate cache, see the cache invalidation tutorial.

Observability

Metrics

The following varnishstat counters are created by cluster.vcl:

error_token: Bad cluster tokens received. This is likely caused by cluster nodes not being configured with the same token, or by overlap between two clusters.

error_fallback_limit: Cluster fallback limit exceeded. Incremented when a backend transaction reaches the cluster fallback limit. This indicates issues with getting successful responses from the other cluster nodes.

error_unhealthy: No healthy nodes in the cluster. Incremented when autosharding was not possible due to all the nodes in the cluster being marked unhealthy. This is likely caused by health probes failing.

skipped: Autosharding was skipped.

passed: Cluster was bypassed. Incremented for PASS requests, which skip autosharding and go directly to the origin.

hitmiss: Cluster was bypassed. Incremented for Hit-For-Miss requests, which skip autosharding and go directly to the origin.

self_identified: Node has self-identified. Set to 1 when the node has identified itself with a backend in the cluster director. Not automatically set to 1 if cluster.set_identity() has been used instead of self-identification.

These can be observed by running the following varnishstat command:

varnishstat -1 -f 'KVSTORE.cluster_stats.*'

Accounting

cluster.vcl uses the accounting VMOD to make it easier to monitor the cache efficiency of a cluster. An accounting namespace called cluster is automatically created and used for every request. The following keys may be added to a cluster transaction:

client_deliver: Added in sub vcl_deliver when a response is being delivered to a real client.

cluster_deliver: Added in sub vcl_deliver when a response is being delivered to a cluster node.

cluster_backend_response: Added in sub vcl_backend_response when a response has been received from a cluster node.

origin_backend_response: Added in sub vcl_backend_response when a response has been received from the origin.

These can be observed by running the following varnishstat command:

varnishstat -1 -f 'ACCG.cluster.*'

The accounting metrics can for example be used to calculate the cluster-wide cache HIT rate:

(client_deliver.client_hit_count + cluster_deliver.client_hit_count) /
client_deliver.client_req_count

This calculates the number of client requests that resulted in a cache HIT on either the first or second hop in the cluster divided by the total number of client requests received by the cluster. To get a more complete picture of cluster request handling, the MISS, SYNTH, PASS, and PIPE rates should also be calculated in a similar way.
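
For example, with illustrative numbers: 800 first-hop hits, 150 second-hop hits, and 1000 client requests give a cluster-wide HIT rate of (800 + 150) / 1000 = 95%.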

If a namespace has already been set when sub vcl_recv is entered in cluster.vcl (for example in a shared deployment with labeled VCLs), keys are added to that namespace instead of cluster.

Requests can be exempted from accounting with the X-Cluster-Skip-Accounting header.

Traces

The X-Cluster-Trace response header contains useful information about a given request's path through the cluster. It is based on each node's server.identity value, which defaults to the server's hostname, but may be changed with the varnishd -i command line argument. The trace header is not transmitted to clients by default, but this can be changed by setting the trace cluster configuration parameter to true.
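
For example, a node could be started with the short identity v1 used in the traces above (the other flags are illustrative):

varnishd -f /etc/varnish/default.vcl -a :80 -i v1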

Logs

cluster.vcl logs are prefixed with Cluster: and are logged with the VCL_Log VSL tag. They can be observed with the following varnishlog command:

varnishlog -g request -q 'VCL_Log ~ "^Cluster:"' -i VCL_Log

When using a dynamic cluster, backend creation and destruction events can be observed with the following command:

varnishlog -g raw -q 'VCL_Log ~ "^udo:"' -i VCL_Log

And DNS events can be observed with the following command:

varnishlog -g raw -q 'ADNS ~ "^libadns:"' -i ADNS

How it works

Autosharding

A consistent hashing algorithm is used to assign each client request to a primary node in the cluster. The primary node for a request is responsible for fetching the object from the origin and optionally persisting it to disk. When a node receives a request it is not the primary for, it will fetch the object from the primary node. We call this autosharding.

Autosharding has two major benefits:

  • Cluster-wide request coalescing: each object is fetched from the origin once per cluster instead of once per node.
  • Storage sharding: pool cache capacity and scale persistent caches horizontally.

A node will not fetch from the primary node if:

  • The X-Cluster-Skip request header is set to true.
  • The fetch has been marked as a PASS.
  • The fetch has been marked as a Hit-For-Miss or Hit-For-Pass.
  • A fetch attempt to the origin has already been made for any reason.
  • The fallback limit has been reached.
  • This node is the primary node for this request.
  • The fetch was triggered by a request from another cluster node (more than 1 hop is disallowed).
  • All backends in the cluster are marked unhealthy.

The request hash is by default based on the request’s Host header and req.url, but this can be changed in sub vcl_hash or overridden with cluster.set_hash().
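
For instance, extra data can be mixed into the hash in standard VCL fashion (the header used here is purely illustrative); note that this also changes which node is primary for a given request:

sub vcl_hash {
  # Cache (and autoshard) device variants separately
  hash_data(req.http.X-Device);
}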

Self-Identification

Each node in the cluster will automatically discover which backend in the cluster director corresponds to itself through a procedure we call Self-Identification. This procedure happens each time the VCL is reloaded.

Before a node has established its own identity, it will autoshard all requests like normal, but each fetch includes an X-Cluster-Identifier header. This identifier is a randomly generated string associated with one of the backends in the cluster director. When the node eventually receives an identifier that it has generated itself, it knows which backend represents its own identity.

From this point on, whenever the autosharding algorithm determines the primary backend for a given request to be the node itself, the node knows to fetch directly from the origin instead of looping back on itself.

FAQ

Q: Can Slicer be used with cluster.vcl?

A: Yes, Slicer can be enabled like normal and will take advantage of autosharding in a cluster. The hash of a Slicer subrequest is based on the top-level request, so all Slicer subrequests for the same object are autosharded to the same primary node.

Q: Can ESI be used with cluster.vcl?

A: Yes, ESI can be enabled like normal and will take advantage of autosharding in a cluster. Unlike Slicer subrequests, each ESI subrequest is autosharded based on its own request hash. Make sure your VCL does not set resp.do_esi to true in sub vcl_deliver.

Q: Can cluster.vcl be used with VCL labels?

A: Yes, each labeled VCL can choose to include cluster.vcl and define the cluster as normal. It is best practice to define a different cluster token for each labeled VCL, as it makes it easier to discover misconfigurations in label routing. The root VCL does not need to include cluster.vcl.

Q: How are background fetches performed between cluster nodes?

A: When a client request hits a stale object in cache on a non-primary node, a background fetch is kicked off as normal to the primary node. For this fetch, any stale object from the primary node is ignored. This happens automatically, and avoids revalidating a stale object with another stale object.

Q: Will PASS requests be autosharded?

A: No, any PASS request will go directly to the origin.

Q: Do cluster nodes communicate through a side-channel?

A: All communication between cluster nodes happens over the regular HTTP(S) listening endpoints (varnishd -a or varnishd -A). There is no side-channel communication outside normal request handling.

Q: Does cluster.vcl increase memory usage?

A: Cluster headers increase workspace usage by a small amount, but memory usage for the system as a whole should not be affected significantly.

Q: Does cluster.vcl increase network usage?

A: Network usage will typically stay the same for each Varnish node when clustering is enabled, but network traffic to the origin should decrease. Network usage may increase if the cluster has a low cache HIT rate.

Availability

cluster.vcl is a versioned VCL shipped with Varnish Enterprise. The version is stated at the top of the VCL.

  • Cluster 1.x is available from Varnish Enterprise version 6.0.9r5.
  • Cluster 2.x is available from Varnish Enterprise version 6.0.13r6.