varnish-otel will export a lot of metrics, which might feel overwhelming at first. This page presents a list of important metrics that you should keep an eye on.
Before starting, it’s important to note one thing: most likely, you don’t really care about “hit ratio” per se. It’s usually a proxy value for what you really want from Varnish, which is “backend traffic reduction”.
This is an important distinction because actually computing the hit ratio doesn’t tell the whole story:
- synth responses never touch the cache nor the backend
- esi, slicer, retries and restarts may generate many internal requests, so the backends may see more requests than Varnish received from the clients
In light of this, the next two sections focus on the traffic reduction, presented as a ratio backend traffic / client traffic, either in terms of request number, or in terms of volume/bandwidth. We generally want this number to be as low as possible, but it's absolutely possible, for the reasons explained above, that it's actually higher than 100%.
Also, note that this page offers backend metrics, but it's absolutely valid to use the metrics reported by the backends themselves, if any are available. Sticking to the Varnish metrics means you only focus on the traffic generated by Varnish, which is usually what you want.
rate(
varnish.main.backend.req / varnish.main.client.requests
)
Alternatively, you can also use accounting metrics if you want to group by namespace/key:
rate(
varnish.accounting.backend.requests / varnish.accounting.client.requests
)
Number of backend requests sent by Varnish to a backend. The first one is the total number while the second one is a collection of metrics, labeled with accounting.namespace and accounting.key (it does require you to set up vmod_accounting to make use of it).
Number of requests actually received by Varnish, so it doesn't account for restarts or esi subrequests. As for the previous section, the first one is a global counter while the other is a collection of labeled counters.
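For example, using the label filter syntax shown later on this page, a hypothetical per-namespace breakdown could look like the sketch below (the "shop" namespace is only an illustration, not a value shipped with varnish-otel):
rate(
varnish.accounting.backend.requests{accounting.namespace="shop"} /
varnish.accounting.client.requests{accounting.namespace="shop"}
)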
rate(
(varnish.backend.response.hdrbytes + varnish.backend.response.bodybytes) /
(varnish.main.response.hdrbytes + varnish.main.response.bodybytes)
)
Alternatively, you can also use accounting metrics if you want to group by namespace/key:
rate(
(varnish.accounting.backend.response.hdrbytes + varnish.accounting.backend.response.bodybytes) /
(varnish.accounting.client.response.hdrbytes + varnish.accounting.client.response.bodybytes)
)
Note: if you are piping a lot, you may want to also account for varnish.backend.pipe.out, which is labeled by vcl and backend.name. However, pipe traffic is usually a negligible part of the traffic, so it's omitted from the general recommendations.
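If you do want to include it, a sketch of the adjusted backend volume could look like this, reusing only the counters already mentioned on this page (the client side is left unchanged):
rate(
(varnish.backend.response.hdrbytes + varnish.backend.response.bodybytes + varnish.backend.pipe.out) /
(varnish.main.response.hdrbytes + varnish.main.response.bodybytes)
)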
This last one is usually much less important, but it can be interesting to track the new connections that Varnish is responding to:
rate(varnish.main.sess)
And the ones it's opening to the backends:
rate(varnish.main.backend.connections)
This section is completely optional; you may not want to track all kinds of requests, if any at all.
Varnish can process requests in many different ways; here is the list of counters you can use to keep track of each:
rate(varnish.main.cache.graced_hits)
rate(varnish.main.cache.hits)
rate(varnish.main.cache.misses)
rate(varnish.main.pass)
rate(varnish.main.pipe)
rate(varnish.main.synth)
rate(varnish.main.cache.hitmisses) # these are also counted in varnish.main.cache.misses
rate(varnish.main.cache.hitpasses) # these are also counted in varnish.main.pass
And accounting will offer almost the same counters, bucketized by namespace/key:
varnish.accounting.client.graced_hits
varnish.accounting.client.hits
varnish.accounting.client.misses
varnish.accounting.client.passes
varnish.accounting.client.pipes
Again, be careful when summing these as they will amount to the number of requests entering vcl_recv{}, so restarts, esi and slicer requests will be tracked here, which can make results confusing to the unaware operator.
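For illustration, here is a sketch of that sum, leaving out the hit-miss and hit-pass counters since, as noted above, they are already included in misses and passes; compare it against varnish.main.client.requests to gauge how much internal traffic you generate:
rate(
varnish.main.cache.graced_hits + varnish.main.cache.hits +
varnish.main.cache.misses + varnish.main.pass +
varnish.main.pipe + varnish.main.synth
)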
If you aren’t using esi
, please ignore this section.
As esi
is essentially a way to multiply internal traffic, it can be a good idea to get a sense of the multiplication factor.
rate(varnish.main.esi.requests / varnish.main.client.requests)
With accounting:
rate(
varnish.accounting.client.requests{accounting.level="sub"} /
varnish.accounting.client.requests{accounting.level="total"}
)
Many invalidation techniques mean multiple counters; depending on your approach, you probably only need one or two of these counters:
rate(varnish.main.purges)
rate(varnish.main.bans.added)
rate(varnish.main.ykey.purges)
As for most traffic metrics, a high number of purges is not necessarily problematic; it's the sudden changes in a short amount of time that should be looked at.
However, there's an exception to this: bans. Because they keep being processed in the background, accumulating too many of them can be detrimental to performance.
In particular, look out for lurker-unfriendly bans using req.* expressions; in an ideal world, their number should be 0. Therefore you'll want to keep these two gauges low:
varnish.main.bans
varnish.main.bans.req
While it’s usually not needed, you may want to also keep track of the objects being invalidated. Note that the following counters don’t track evicted objects (i.e. removed from cache for space reason), these are listed below in the [Saturation] section.
rate(varnish.main.obj.purged) # regular purges
rate(varnish.main.ykey.purges) # ykey purges
rate(
varnish.main.bans.obj.killed +
varnish.main.bans.lurker.obj.killed +
varnish.main.bans.lurker.obj.killed.cutoff
)
The saturation section deals with how busy resources are: threads, cache storage, and memory, for example.
Note that CPU utilization or actual disk usage is left to lower-level exporters, as Varnish can only report metrics from its own abstracted perspective.
# note: this is a gauge
varnish.main.thread.live
Threads are a pretty central component of the Varnish architecture and can be used as a first signal of how busy a server is, since one transaction (client or backend) will consume one thread.
Spikes in thread count are generally signs of increased client traffic, and plateaus probably mean that you hit the maximum number of threads and are, or soon will be, queueing and maybe dropping requests.
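As a rough sketch, a plateau in the thread gauge is worth correlating with the queueing and drop counters listed in the errors further down this page:
# note: the first one is a gauge
varnish.main.thread.live
rate(varnish.main.sess.queued)
rate(varnish.main.session.dropped)
rate(varnish.main.request.dropped)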
Varnish supports multiple cache storages but this guide focuses only on the recommended one, mse4.
Note: Varnish is a cache, and it's normal for the cache to be full. You shouldn't try to autoscale immediately if you reach 100% usage. What is important is the churn rate, which is explained below.
varnish.mse4.bytes.used /
(varnish.mse4.bytes.used + varnish.mse4.bytes.unused)
This is simply the ratio of memory used to store objects over the maximum space usable. If you are using disk, expect this to be fairly close to 100% as Varnish will load objects into memory as they are requested, and will remove them if they are less popular, counting on the disk for long-term storage.
varnish.mse4.store.bytes.used /
(varnish.mse4.store.bytes.used + varnish.mse4.store.bytes.unused)
The disk space equivalent of the previous item. Once you reach 100%, Varnish will possibly need to remove objects to make room for new ones (which is normal and shouldn't immediately worry you).
varnish.mse4.book.slots.used /
(varnish.mse4.book.slots.used + varnish.mse4.book.slots.unused)
Memory isn’t the only resource that can be saturated. When using disk storage, mse4
keeps track of all allocation in books
, where each one has slots corresponding to objects. In practice, with a 8G book, you should almost never run out of slots, but we’re listing the metric for exhaustiveness.
rate(varnish.main.lru.nuked)
rate(varnish.mse4.book.c_objects.liberated)
rate(varnish.mse4.store.c_segment_pruned_objects)
When Varnish runs out of storage, it will evict older and less popular objects to make space. It’s part of normal and sane cache storage management, but of course, the thing to watch out for is excess. It’s ok to churn out dozens or hundreds of objects every second, but this needs to be weighed against the whole size of the cache. If you replace your whole cache every three seconds, you may need more storage.
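If you want to turn this into an explicit number, here is a hypothetical sketch of a "cache replacement period" in seconds; total_objects_in_cache is a placeholder for however you track the object count, as it isn't one of the metrics covered on this page:
total_objects_in_cache / rate(
varnish.main.lru.nuked +
varnish.mse4.book.c_objects.liberated +
varnish.mse4.store.c_segment_pruned_objects
)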
# note: these are all gauges
# worker process private memory usage
varnish.main.mem.private
# worker process RSS memory usage
varnish.main.mem.rss
# worker process file backed page cache memory
varnish.main.mem.file
# worker process swap usage
varnish.main.mem.swap
The memory governor keeps track of the memory and makes sure it doesn't go over a certain limit (80% of the total memory by default). These metrics allow you to peek into the governor's perspective.
varnish.backend.is_healthy
A sick backend doesn't mean your users will necessarily be impacted (make sure to use udo to retry faulty backends), but it should definitely be checked.
# could not retrieve a response (connection issue, time out, etc.)
# check the failure.type attribute
rate(varnish.backend.failed)
# a backend reached its number of max connections and timed out
# (see `backend_wait_timeout` and `backend_wait_limit` in `varnishd -x parameter`)
rate(varnish.main.backend.wait_failed)
# could not get a thread for a background fetch
rate(varnish.main.bgfetch.no_thread)
There are many reasons why a backend fetch could fail, and you should check the logs, but these counters can be the base of your alerts.
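For example, a sketch of an alerting basis could be the fraction of backend fetches that fail, reusing the backend request counter from the traffic section above:
rate(varnish.backend.failed / varnish.main.backend.req)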
rate(varnish.mse4.book.online)
rate(varnish.mse4.store.online)
mse4 has been designed from the ground up to be extremely resilient to disk and store loss, but if any of the components aren't online, it means something pretty bad is currently happening, and the server is likely losing at least one disk.
rate(varnish.mse4.c_allocation_failure)
rate(varnish.mse4.c_eviction_failure)
These should be extremely rare, but they may occur on extremely constrained systems where the default water levels don't give enough leeway to allocate/evict objects to serve traffic.
# low-level failure accepting connections, check the failure.type attribute
rate(varnish.main.session.failed)
# failed to create a thread (possibly because `sysctl max_map_count` is too low)
rate(varnish.main.thread.failed)
# couldn't create a thread because `thread_pool_max` was reached
rate(varnish.main.thread.limited)
# panics
rate(varnish.child.panics)
# workspace overflows
rate(varnish.main.ws.backend.overflow)
rate(varnish.main.ws.client.overflow)
rate(varnish.main.ws.thread.overflow)
rate(varnish.main.ws.session.overflow)
# VCL failures
rate(varnish.main.vcl.failed)
rate(varnish.main.losthdr)
# Tasks queued or dropped for lack of threads
rate(varnish.main.sess.queued)
rate(varnish.main.session.dropped)
rate(varnish.main.request.dropped)
Most of these errors are usually due to limits set too low, but remember that said limits exist to prevent single requests from consuming too many resources.
rate(varnish.main.esi.warnings)
rate(varnish.main.esi.maxdepth)
rate(varnish.main.esi.req_abort)
ESI errors can be hard to debug as content is composed from multiple objects; however, these metrics will be strong signals to check the logs for ESI errors.
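If you want a single ratio to alert on, a sketch could be the number of warnings per ESI request, reusing the counter from the ESI section above:
rate(varnish.main.esi.warnings / varnish.main.esi.requests)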