varnish-otel will export a lot of metrics, which might feel overwhelming at first. This page presents a list of important metrics that you should keep an eye on.
Before starting, it’s important to note one thing: most likely, you don’t really care about “hit ratio” per se. It’s usually a proxy value for what you really want from Varnish, which is “backend traffic reduction”.
This is an important distinction because actually computing the hit ratio doesn’t tell the whole story:
- synth responses never touch the cache nor the backend
- esi, slicer, retries and restarts may generate many internal requests, so the backends may see more requests than Varnish received from the clients
In light of this, the next two sections focus on the traffic reduction, presented as a ratio backend traffic / client traffic, either in terms of request number, or in terms of volume/bandwidth. We generally want this number to be as low as possible, but it's absolutely possible, for the reasons explained above, that it's actually higher than 100%.
Also, note that this page offers backend metrics, but it's absolutely valid to use the metrics reported by the backends themselves, if any are available. Sticking to the Varnish metrics means you only focus on the traffic generated by Varnish, which is usually what you want.
rate(
varnish.main.backend.req / varnish.main.client.requests
)
Alternatively, you can also use accounting metrics if you want to group by namespace/key:
rate(
varnish.accounting.backend.requests / varnish.accounting.client.requests
)
Number of backend requests sent by Varnish to a backend. The first one is the total number while the second one is a collection of metrics, labeled with accounting.namespace and accounting.key (it does require you to set up vmod_accounting to make use of it).
Number of requests actually received by Varnish, so it doesn't account for restarts or esi subrequests. As for the previous section, the first one is a global counter while the other is a collection of labeled counters.
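For example, using the label filter syntax shown later on this page, a hypothetical per-namespace breakdown could look like the sketch below (the "shop" namespace is only an illustration, not a value shipped with varnish-otel):
rate(
varnish.accounting.backend.requests{accounting.namespace="shop"} /
varnish.accounting.client.requests{accounting.namespace="shop"}
)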
rate(
(varnish.backend.response.hdrbytes + varnish.backend.response.bodybytes) /
(varnish.main.response.hdrbytes + varnish.main.response.bodybytes)
)
Alternatively, you can also use accounting metrics if you want to group by namespace/key:
rate(
(varnish.accounting.backend.response.hdrbytes + varnish.accounting.backend.response.bodybytes) /
(varnish.accounting.client.response.hdrbytes + varnish.accounting.client.response.bodybytes)
)
Note: if you are piping a lot, you may want to also account for varnish.backend.pipe.out, which is labeled by vcl and backend.name. However, pipe traffic is usually a negligible part of the traffic, so it's omitted from the general recommendations.
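If you do want to include it, a sketch of the adjusted backend volume could look like this, reusing only the counters already mentioned on this page (the client side is left unchanged):
rate(
(varnish.backend.response.hdrbytes + varnish.backend.response.bodybytes + varnish.backend.pipe.out) /
(varnish.main.response.hdrbytes + varnish.main.response.bodybytes)
)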
This last one is usually much less important, but it can be interesting to track the new connections that Varnish is responding to:
rate(varnish.main.sess)
And the ones it's opening to the backends:
rate(varnish.main.backend.connections)
This section is completely optional; you may not want to track all kinds of requests, if any at all.
Varnish can process requests in many different ways; here is the list of counters you can use to keep track of each:
rate(varnish.main.cache.graced_hits)
rate(varnish.main.cache.hits)
rate(varnish.main.cache.misses)
rate(varnish.main.pass)
rate(varnish.main.pipe)
rate(varnish.main.synth)
rate(varnish.main.cache.hitmisses) # these are also counted in varnish.main.cache.misses
rate(varnish.main.cache.hitpasses) # these are also counted in varnish.main.pass
And accounting will offer almost the same counters, bucketized by namespace/key:
varnish.accounting.client.graced_hits
varnish.accounting.client.hits
varnish.accounting.client.misses
varnish.accounting.client.passes
varnish.accounting.client.pipes
Again, be careful when summing these as they will amount to the number of requests entering vcl_recv{}, so restarts, esi and slicer requests will be tracked here, which can make results confusing to the unaware operator.
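For illustration, here is a sketch of that sum, leaving out the hit-miss and hit-pass counters since, as noted above, they are already included in misses and passes; compare it against varnish.main.client.requests to gauge how much internal traffic you generate:
rate(
varnish.main.cache.graced_hits + varnish.main.cache.hits +
varnish.main.cache.misses + varnish.main.pass +
varnish.main.pipe + varnish.main.synth
)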
If you aren’t using esi
, please ignore this section.
As esi
is essentially a way to multiply internal traffic, it can be a good idea to get a sense of the multiplication factor.
rate(varnish.main.esi.requests / varnish.main.client.requests)
With accounting:
rate(
varnish.accounting.client.requests{accounting.level="sub"} /
varnish.accounting.client.requests{accounting.level="total"}
)
Many invalidation techniques mean multiple counters; depending on your approach, you probably only need one or two of these counters:
rate(varnish.main.purges)
rate(varnish.main.bans.added)
rate(varnish.main.ykey.purges)
As for most traffic metrics, a high number of purges is not necessarily problematic; it's the sudden changes in a short amount of time that should be looked at.
However, there's an exception to this: bans. Because they keep being processed in the background, accumulating too many of them can be detrimental to performance.
In particular, look out for lurker-unfriendly bans using req.* expressions; in an ideal world, their number should be 0. Therefore you'll want to keep these two gauges low:
varnish.main.bans
varnish.main.bans.req
While it’s usually not needed, you may want to also keep track of the objects being invalidated. Note that the following counters don’t track evicted objects (i.e. removed from cache for space reason), these are listed below in the [Saturation] section.
rate(varnish.main.obj.purged) # regular purges
rate(varnish.main.ykey.purges) # ykey purges
rate(
varnish.main.bans.obj.killed +
varnish.main.bans.lurker.obj.killed +
varnish.main.bans.lurker.obj.killed.cutoff
)
The saturation section deals with how busy resources are: threads, cache storage, and memory, for example.
Note that CPU utilization or actual disk usage is left to lower-level exporters, as Varnish can only report metrics from its own abstracted perspective.
# note: this is a gauge
varnish.main.thread.live
Threads are a pretty central component of the Varnish architecture and can be used as a first signal of how busy a server is, since one transaction (client or backend) will consume one thread.
Spikes in thread count are generally signs of increased client traffic, and plateaus probably mean that you hit the maximum number of threads and are, or soon will be, queueing and maybe dropping requests.
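As a rough sketch, a plateau in the thread gauge is worth correlating with the queueing and drop counters listed in the errors further down this page:
# note: the first one is a gauge
varnish.main.thread.live
rate(varnish.main.sess.queued)
rate(varnish.main.session.dropped)
rate(varnish.main.request.dropped)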
Varnish supports multiple cache storages but this guide focuses only on the recommended one, mse4.
Note: Varnish is a cache, and it's normal for the cache to be full. You shouldn't try to autoscale immediately if you reach 100% usage. What is important is the churn rate, which is explained below.
varnish.mse4.bytes.used /
(varnish.mse4.bytes.used + varnish.mse4.bytes.unused)
This is simply the ratio of memory used to store objects over the maximum space usable. If you are using disk, expect this to be fairly close to 100% as Varnish will load objects into memory as they are requested, and will remove them if they are less popular, counting on the disk for long-term storage.
varnish.mse4.store.bytes.used /
(varnish.mse4.store.bytes.used + varnish.mse4.store.bytes.unused)
The disk space equivalent of the previous item. Once you reach 100%, Varnish will possibly need to remove objects to make room for new ones (which is normal and shouldn't immediately worry you).
varnish.mse4.book.slots.used /
(varnish.mse4.book.slots.used + varnish.mse4.book.slots.unused)
Memory isn’t the only resource that can be saturated. When using disk storage, mse4
keeps track of all allocation in books
, where each one has slots corresponding to objects. In practice, with a 8G book, you should almost never run out of slots, but we’re listing the metric for exhaustiveness.
rate(varnish.main.lru.nuked)
rate(varnish.mse4.book.c_objects.liberated)
rate(varnish.mse4.store.c_segment_pruned_objects)
When Varnish runs out of storage, it will evict older and less popular objects to make space. It’s part of normal and sane cache storage management, but of course, the thing to watch out for is excess. It’s ok to churn out dozens or hundreds of objects every second, but this needs to be weighed against the whole size of the cache. If you replace your whole cache every three seconds, you may need more storage.
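If you want to turn this into an explicit number, here is a hypothetical sketch of a "cache replacement period" in seconds; total_objects_in_cache is a placeholder for however you track the object count, as it isn't one of the metrics covered on this page:
total_objects_in_cache / rate(
varnish.main.lru.nuked +
varnish.mse4.book.c_objects.liberated +
varnish.mse4.store.c_segment_pruned_objects
)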
# note: these are all gauges
# worker process private memory usage
varnish.main.mem.private
# worker process RSS memory usage
varnish.main.mem.rss
# worker process file backed page cache memory
varnish.main.mem.file
# worker process swap usage
varnish.main.mem.swap
The memory governor keeps track of the memory and makes sure it doesn't go over a certain limit (80% of the total memory by default). These metrics allow you to peek into the governor's perspective.
varnish.backend.is_healthy
A sick backend doesn't mean your users will necessarily be impacted (make sure to use udo to retry faulty backends), but it should definitely be checked.
# could not retrieve a response (connection issue, time out, etc.)
# check the failure.type attribute
rate(varnish.backend.failed)
# a backend reached its number of max connections and timed out
# (see `backend_wait_timeout` and `backend_wait_limit` in `varnishd -x parameter`)
rate(varnish.main.backend.wait_failed)
# could not get a thread for a background fetch
rate(varnish.main.bgfetch.no_thread)
There are many reasons why a backend fetch could fail, and you should check the logs, but these counters can be the base of your alerts.
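For example, a sketch of an alerting basis could be the fraction of backend fetches that fail, reusing the backend request counter from the traffic section above:
rate(varnish.backend.failed / varnish.main.backend.req)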
rate(varnish.mse4.book.online)
rate(varnish.mse4.store.online)
mse4 has been designed from the ground up to be extremely resilient to disk and store loss, but if any of the components aren't online, it means something pretty bad is currently happening, and the server is likely losing at least one disk.
rate(varnish.mse4.c_allocation_failure)
rate(varnish.mse4.c_eviction_failure)
These should be extremely rare, but they may occur on extremely constrained systems where the default water levels don't give enough leeway to allocate/evict objects to serve traffic.
# low-level failure accepting connections, check the failure.type attribute
rate(varnish.main.session.failed)
# failed to create a thread (possibly because `sysctl max_map_count` is too low)
rate(varnish.main.thread.failed)
# couldn't create a thread because `thread_pool_max` was reached
rate(varnish.main.thread.limited)
# panics
rate(varnish.child.panics)
# workspace overflows
rate(varnish.main.ws.backend.overflow)
rate(varnish.main.ws.client.overflow)
rate(varnish.main.ws.thread.overflow)
rate(varnish.main.ws.session.overflow)
# VCL failures
rate(varnish.main.vcl.failed)
rate(varnish.main.losthdr)
# Tasks queued or dropped for lack of threads
rate(varnish.main.sess.queued)
rate(varnish.main.session.dropped)
rate(varnish.main.request.dropped)
Most of these errors are usually due to limits set too low, but remember that said limits exist to prevent single requests from consuming too many resources.
rate(varnish.main.esi.warnings)
rate(varnish.main.esi.maxdepth)
rate(varnish.main.esi.req_abort)
ESI errors can be hard to debug as content is composed from multiple objects; however, these metrics will be strong signals to check the logs for ESI errors.
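If you want a single ratio to alert on, a sketch could be the number of warnings per ESI request, reusing the counter from the ESI section above:
rate(varnish.main.esi.warnings / varnish.main.esi.requests)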