The Massive Storage Engine is probably the most significant feature that Varnish Enterprise offers. It combines the speed of memory and the reliability of disk to offer a stevedore that can cache terabytes of data on a single Varnish server.
The cache is persisted and can survive a restart. This means you don’t need to rewarm the cache when a Varnish server recovers from a restart.
MSE is commonly used to build custom CDNs and to accelerate OTT video platforms.
But it’s not only about caching large volumes of data: MSE’s memory
implementation represents a clear improvement over Varnish’s original
malloc stevedore.
Before we dive into the details, we need to talk about the history of MSE, and why this stevedore was developed in the first place.
Before MSE was a viable stevedore for caching large volumes of data, the Varnish project had two disk-based stevedores:
- file
- persistence

The file stevedore stores objects on disk, but not in a persistent way: a restart will result in an empty cache.
That is because the file stevedore is nothing more than memory storage that is backed by a file: if the size of the cache exceeds the available memory, content will be sent to the disk, and content that is not available in memory will be loaded from the disk.
It’s just like the operating system’s swap mechanism, where the disk is only used to extend the capacity of the memory without offering any persistence guarantees.
The operating system’s page cache will decide what the hot content is that should be buffered in memory. This ensures that not every access to the file system results in I/O instructions.
The problem with the file stevedore is that the page cache isn’t
really optimized for caches like Varnish. The page cache will guess
what content should be in memory, and if that guess is wrong, file
system access is required.
This results in a lot of context switching, which consumes server resources.
The biggest problem, however, is disk fragmentation, which increases
over time. Because the file stevedore has no mechanism to efficiently
allocate objects on disk, objects might be scattered across multiple
blocks. This results in a lot more disk I/O taking place to assemble
the object.
Over time, this becomes a real performance killer. The only solution is to restart Varnish, which will allow you to start over again. Unfortunately, this will also empty the cache, which is never good.
As the name indicates, the persistence stevedore can persist objects
on disk. These objects can survive a server restart. This way your cache
remains intact, even after a potential failure.
However, we would not advise you to ever use this stevedore. This is a quote from the official Varnish documentation for this feature:
> The persistent storage backend has multiple issues with it and will likely be removed from a future version of Varnish.
The main problem is that the persistence stevedore is designed as a
log-structured file system that enforces a strict FIFO approach.
This means that the first object that is inserted will be the first one to expire. Although this works fine from a write performance point of view, it totally sidelines key Varnish features, such as LRU eviction, banning, and many more.
Since this stevedore is basically unusable in most real-life
situations, and since the file stevedore is not persistent and prone
to disk fragmentation, there is a need for a better solution. Enter the
Massive Storage Engine.
In the very first year of Varnish Software’s existence, a plan was drawn up to develop a good file-based stevedore.
The initial focus was not to develop a persistent stevedore, but the critical goal in the beginning was to offer the capability to cache objects that are larger than the available memory.
Even the initial implementation was not that different from the file
stevedore. It was mostly a matter of smoothing the rough edges to
create an improved version of the file stevedore.
Behind the scenes, memory-mapped files were still used to store the objects. This means that the operating system decides which portions of the file it keeps in the page cache. Since the page cache corresponds to a part of the system’s actual physical memory, the operating system’s page cache effectively serves as the memory cache for Varnish, storing the hot objects, while the file system contains all objects.
The critical goal for the second version of MSE was to add persistent storage that would survive a server restart. The metadata, residing in memory in version 1, also needed to be persisted on disk. Memory-mapped files were also used to accomplish this.
A major side effect of memory-mapped files, in combination with large volumes of data, is the overhead of the very large memory maps that are created by the kernel: for each page in a memory-mapped file, the kernel will allocate some bytes in its page tables.
If for example the backing files grow to 20 TB worth of cached objects, an overhead of 90 GB of memory could occur.
This is memory that is not limited by the storage engine and is expected to be readily available on top of memory that was already allocated to Varnish for object storage. Managing these memory maps can become CPU intensive too.
With the release of Varnish Enterprise 6, a new version of MSE was released as well. MSE 3 was designed with the limitations of prior versions in mind.
Let’s look at the improved architecture of MSE, and learn why this is such an important feature in Varnish Enterprise 6.
In previous versions of MSE the operating system’s page cache mechanism was responsible for deciding what content from a memory-mapped file would end up in memory. This is the so-called hot content.
Not only did this result in large memory maps, which create overhead, it is also tough for the page cache to determine which parts from disk should be cached in memory. What is the hot data, and how can the page cache know without the proper context?
That’s the reason why, in MSE, we decided to reimplement the page cache in user space. From an architecture point of view, MSE version 3 stepped away from memory-mapped files for object storage and implemented the logic for loading data from files into memory itself. This also implies that the persistence layer had to be redesigned.
By no longer depending on these memory-mapped files, the overhead from keeping track of pages in the kernel has been eliminated.
And because the page cache mechanism in MSE version 3 is a custom implementation, MSE has the necessary context and can more accurately determine the hot content. The result is higher efficiency and less pressure on the disks, compared to previous versions.
As before, memory only contains a portion of the total cached content, while the persistence layer stores the full cache. However, with the increased control MSE has over what goes where, Varnish can greatly reduce the number of I/O operations, a limiting factor in many SSD drives. MSE can even decide to make an object memory only if it realizes it is lagging behind in writing to disk, where previous versions would slow down considerably.
Traditionally, non-volatile storage options have been perceived as slow, but now it is possible to configure a system that can read hundreds of gigabits per second from an array of persistent storage drives. MSE is designed to work well with all kinds of hardware, all the way down to spinning magnetic drives.
As we’ll explain next, the disk-based storage layer consists of a metadata database, which we call a book, and the persistent object storage, which we call the store. Both are configurable: you can use multiple disks to store data on, and you can have multiple books, each of which can reference multiple stores.
Let’s talk about books first. As mentioned, these books contain metadata for each of the objects in the MSE.
The main questions these books answer are which objects are in the cache, and where in the store each object is located. In terms of metadata, the book holds information such as the object’s hash key, its location within the store, and its expiry information.
This information is kept in a Lightning Memory-Mapped Database (LMDB). This is a low-level transactional database that lives inside a memory-mapped file. This is a specific case where we still rely on the operating system’s page cache to manage what lives in memory and what is stored on disk. To keep the database safe to use after a power failure, the database code will force write pages to disk.
The size of the book is set in the MSE configuration, and directly impacts the number of objects you can have in the corresponding stores. It does this by imposing a maximum amount of metadata that can reside in the book at any point in time.
If you have few Ykeys and your objects are not too big, a good rule of thumb is to have 2 kB book space per object.
Let’s look at a concrete example for this rule of thumb: if you have stores with a total of 10 TB of data, and your average object size is 1 MB, then your maximum number of objects is ten million. With 2 kB per object, the rule indicates that the book should be at least 20 GB.
If it turns out that you have sized the book much bigger than your needs, the extra space in the book will simply never be used, and its contents will never make it into the page cache or consume any memory. If the book gets too close to full, Varnish will start removing objects from the store, resulting in potential under-usage of the store. For this reason it is better to err on the safe side when calculating the optimal size for the book, unless you have space available to expand the book after the fact.
In most cases, you should account for memory corresponding to the number of bytes in use in the book. This will let the kernel keep the book in memory at all times instead of having to page in parts of the book that have not been used in a while. The exception is when the book contains a high proportion of objects that are very infrequently accessed, and paging in data does not significantly reduce the performance of the system. We will get back to this when discussing the memory governor later in this chapter.
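The arithmetic behind this rule of thumb can be sketched in a few lines of shell, using the decimal units from the example above:

```shell
# Book sizing rule of thumb: roughly 2 kB of book space per object.
store_size_tb=10    # total store capacity: 10 TB
avg_object_mb=1     # average object size: 1 MB

# 10 TB / 1 MB = 10 million objects
objects=$((store_size_tb * 1000 * 1000 / avg_object_mb))

# 10 million objects * 2 kB = 20 GB of book space
book_gb=$((objects * 2 / 1000 / 1000))

echo "objects=${objects} minimum_book_size=${book_gb}G"
# → objects=10000000 minimum_book_size=20G
```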
It is possible to configure multiple books. This is especially useful when you use multiple disks for cache storage, which improves performance. In case of disk failures, partitioning the cache reduces potential data loss.
The standard location of the books is in /var/lib/mse. This is a
folder that is allowed by our SELinux rules, which is part of our
packaging.
What we call the book is actually a directory that contains multiple files:

- MSE.lck is a lock file that protects the storage from potentially being accessed by multiple Varnish instances.
- data.mdb is the actual LMDB database that contains the metadata.
- lock.mdb is the internal lock file from LMDB.
- varnish-mse-banlist-15f19907.dat is a per-book ban list, containing the currently active bans.
- varnish-mse-journal-75b6069b.dat is a per-store journal that keeps track of incremental changes prior to the final state being stored in data.mdb.

The stores are the physical files on disk that contain the cached objects. The stores are designed as pre-allocated large files. Inside these files, a custom file system is used to organize content.
Stores are associated with a book. The book holds the location of each object in the store. Without the book, there is no way to retrieve objects from cache because MSE wouldn’t know where to look.
If you lose the book, or the book gets corrupted, there is no way to regenerate it. That part of your persisted cache is lost. But remember: it’s a cache; the data can always be regenerated from the origin.
Every store is represented as a single file, and the standard location
of these files is /var/lib/mse. This is also because our SELinux
rules allow this folder to be used by varnishd.
There are significant performance benefits when using pre-allocated large files.
The most obvious one is reducing I/O overhead that results from
opening files and iterating through directories. Unlike typical storage
systems that entirely rely on the file system, MSE doesn’t create a
file per object. Only a single file is opened, and this happens when
varnishd is started.
Disk fragmentation is also a huge one: by pre-allocating one large file per store, disk fragmentation is heavily reduced. The size of the file and its location on disk are fixed, and all access to persisted objects is done within that file.
MSE also has algorithms to find and create continuous free space within that file. This results in fewer I/O operations and allows the system to get more out of the disk’s bandwidth without being limited by I/O operations per second (IOPS).
Access to stores is done using asynchronous I/O, which doesn’t block the progress of execution while the system is waiting for data from the disks. This also boosts performance quite a bit.
We just talked about the concept of stores, and disk fragmentation was frequently mentioned.
A disk is fragmented when single files get spread across multiple regions on the disk. For spinning disks, this would cause a mechanical head to have to seek from one area of the disk to another with a significantly reduced performance as a result. In the era of SSDs, disk fragmentation is less of an issue, but it is not free: all drives have a maximum number of I/O operations per second in addition to the maximum bandwidth. When a disk is too fragmented, or objects are very small, this can become a limiting factor. Needless to say, it is important to consider the I/O operations per second (IOPS) for the drives when configuring any server, and NVMe drives generally perform much better than SATA SSDs.
MSE has mechanisms for reducing fragmentation, which work well for most use cases, but huge caches with small objects will still require drives with a high number of IOPS.
MSE makes sure that the fragmentation of the store is low, but it
cannot control the fragmentation of the store file in the file system.
For this reason it is recommended to only put MSE stores on volumes
dedicated to MSE. The easiest way is to put a single store and its book
on each physical drive intended for MSE, and to keep the operating
system on a separate drive. When the store file is created, MSE makes
sure to pre-allocate all the space for the store file to make sure that
the file system cannot introduce more fragmentation after the creation
of the store. Some file systems, like xfs, do not implement this, so
only ext3 and ext4 would be candidates. However, we only support
ext4 for the MSE stores.
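Provisioning a dedicated drive could look like the sketch below. The device name /dev/nvme1n1 is a placeholder; substitute the drive you actually intend to dedicate to MSE:

```shell
# Format the dedicated drive with ext4, the supported file system for MSE stores
# (/dev/nvme1n1 is a placeholder device name)
sudo mkfs.ext4 /dev/nvme1n1

# Mount it at the standard MSE location
sudo mkdir -p /var/lib/mse
sudo mount /dev/nvme1n1 /var/lib/mse
```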
Currently there is no support for using the block device directly with no file system on top, but it might arrive in the future.
When an MSE store or book starts to get almost full, MSE needs to evict objects from the store or book in question. The eviction process, often called nuking, starts when the amount of used space reaches a certain level, explained below. The process tries to delete content that has not been accessed in a while, but the method is slightly different from the least recently used (LRU) eviction found in memory-based stevedores. The difference stems from MSE’s desire to create continuous free space to avoid fragmentation.
There are individual waterlevel parameters for books and stores, but
they both default to 90%. For the stores, the parameter is called
waterlevel, while the book’s waterlevel nuking is controlled by the
database_waterlevel parameter.
If the parameters are left at the default values, and either the book or
the store usage reaches 90%, backend fetches will be paused until the
usage goes below 90%. To avoid performance degradation for backend
fetches, MSE starts evicting objects before the waterlevel is reached.
The runtime parameters waterlevel_hysterisis for stores, and
database_waterlevel_hysterisis for books, both defaulting to 5%,
control this behavior. If all the parameters are left at their default
values, MSE will start evicting objects when the store or the book are
85% full, and this is usually sufficient to keep the usage under
90%, avoiding stalling fetches as a result.
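These directives can be set per book and per store in the MSE configuration file. Below is a hedged sketch with illustrative values that starts eviction earlier than the defaults; the ids and paths are placeholders:

```
env: {
	id = "mse";
	memcache_size = "auto";
	books = ( {
		id = "book";
		directory = "/var/lib/mse/book";
		database_waterlevel = 0.85;           # pause fetches at 85% book usage
		database_waterlevel_hysterisis = 0.1; # start evicting at 75%
		stores = ( {
			id = "store";
			filename = "/var/lib/mse/store.dat";
			size = "100G";
			waterlevel = 0.85;
			waterlevel_hysterisis = 0.1;
		} );
	} );
};
```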
The goal is to evict neighboring segments within the store to create continuous free space without removing objects that have been used recently. The thread that is responsible for freeing space by removing objects scans objects linearly, and tests whether each object is among the least recently used third of objects. If that is the case, it is removed. Once usage gets below the waterlevel, the eviction mechanism is paused and can resume from that position on the next run.
The description above is actually slightly misleading since we have not
yet explained exactly how MSE calculates the amount of used and free
space. Since small free chunks are not usable for placing big objects
without creating significant fragmentation, MSE will only consider
sufficiently large chunks of unused space when calculating the total
number of bytes available for allocation. The waterlevel_minchunksize parameter defines the minimum chunk size that is counted, and it defaults to 512 KB.
In other words, only chunks that are equal or greater than
waterlevel_minchunksize will be considered when making sure that there
is, by default, at least 15% free space in the store. All chunks that
are smaller than the 512 KB default value, remain untouched. However,
these smaller chunks of free space are still eligible when MSE needs to
insert objects that are smaller than 512 KB, and MSE will even select
the smallest one that is big enough for any new allocation.
It might be tempting to reduce the waterlevel_minchunksize to a low
value, but that will increase fragmentation of bigger objects, as they
will often be chopped into pieces equal to the size of
waterlevel_minchunksize. Such fragmentation actually increases usage
in the book, as each chunk will need its own entry in the book.
Basically, waterlevel_minchunksize is a tunable fragmentation
tolerance level, and finding the right value for you depends on how MSE
is used. A high value will minimize fragmentation, which translates into
a leaner book and higher performance due to fewer I/O operations, while
lower values will fit more small objects into the cache.
The malloc stevedore, used by most Varnish Cache servers, needs to
be configured to hold a fixed maximum amount of data.
A common rule of thumb is to configure it to be 80% of the total
memory of a server. For a server with 64 GB of RAM, this translates to
a little over 51 GB. The remaining 20% is then available for the
operating system and for various parts of the varnishd process.
Unfortunately, this rule of thumb does not always work well. The optimal value for your server heavily depends on traffic patterns, on worker thread memory requirements, on object data structures, and on transient storage.
None of these memory needs are accounted for in the malloc stevedore,
which makes it hard to guess what Varnish’s total memory footprint
will be by just looking at the malloc sizing.
The result is that the server will suddenly be out of memory, even when
you apply a seemingly conservative size to your malloc store. If
certain aspects of your VCL require a lot of memory to be executed, or
if your transient storage goes through the roof, you’ll be in trouble,
and there’s no predictable way to counter this. On the other hand, if
your server is very simple, stores a few big objects, and does not serve
a lot of concurrent users, many gigabytes of memory can sit unused when
it would be better to use that memory for caching.
MSE’s memory governor feature solves the problem with using the
stevedore to control the memory usage of Varnish. Instead of assigning a
fixed amount of space for cache payloads, the memory governor aims to
limit the total memory usage of the varnishd process, as seen by the
operating system.
Instead of limiting the size of the cache in MSE by setting the memcache_size configuration directive to an absolute value, we can set it to auto, which limits the total memory usage of the varnishd process.
When the memory governor observes that the memory usage is too high, it will start removing objects from the cache until the memory usage goes under the limit. This means that the actual memory used by object payloads will vary when other memory usage varies, but the total will be near constant. For example, if there are suddenly thousands of connections coming in, and thousands of threads need to be started to serve the connections, the extra memory usage from the connection handling will result in some objects being removed from memory until things calm down. If MSE with persistence is in use, no objects will be removed from the cache, just from the memory part of the cache.
The memory_target runtime parameter, which is set at 80% by default,
will ensure varnishd remains below that memory consumption ratio.
memory_target can also be set to an absolute value, just like you
would with the -s parameter.
When you set memory_target to a negative value, you can indicate how
much memory you want the memory governor to leave unused.
The memory_target can also be changed at runtime. This can be useful
if you need some memory for a different process and need varnishd to
use less memory for a while.
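Because memory_target is a runtime parameter, it can be adjusted through varnishadm on a running instance. A sketch, with illustrative values:

```shell
# Temporarily lower the memory target to 70% of system memory
varnishadm param.set memory_target 70%

# Inspect the current value and its documentation
varnishadm param.show memory_target
```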
The memory_target does not include the part of the kernel’s page
cache that is used to keep frequently used parts of the book in memory
for fast access. For this reason, it might be necessary to tune down the
memory_target parameter if your cache contains a lot of objects, and
the book usage, measured in bytes, starts to creep up towards 10% of
your available memory. It is not necessarily bad to have some paging
activity as long as it stays under control.
When running MSE, it is a good idea to monitor paging behavior on the
system, for example by using the vmstat tool. If it suddenly goes
through the roof, you should consider reducing memory_target and see
if it helps. Remember that this can be done on a running Varnish
without restarting the service.
Enforcing the memory target is done similarly to the waterlevel and hysteresis mechanisms inside the persistence layer: it is also an over/under measurement.
When varnishd requests memory from the OS that results in exceeding
the memory_target, debt is collected, which should quickly be repaid.
Repaying debt is done by removing objects from the cache on an LRU basis. Repaying accumulated debt is a shared responsibility:
- If a fetch needs to store a 2 MB object, and as a consequence surpasses memory_target, it needs to remove 2 MB of content from the cache using LRU.
- The debt collector thread will ensure varnishd’s memory consumption goes below the memory_target by removing objects from cache until the target is reached.
Funnily enough, the debt collector thread is nicknamed the governator because in order to govern, it needs to terminate objects.
Varnish Cache suffers from a concept called the lucky loser effect.
When a fetch in Varnish Cache needs to free up space from the cache in order to facilitate its cache insert, it risks losing that space to a competing fetch.
That competing fetch was also looking for free space and happened to find it because the other fetch freed it up. This one is the lucky loser, but it results in a retry from the original fetch.
In theory the fetch can get stuck in a loop, continuously trying to free
up space but failing to claim it. The nuke_limit runtime parameter
defines how many times a fetch may try to free up space. The standard
value is 50.
This concept can become detrimental to the performance of Varnish. The originating request will be left waiting until the object is stored in cache or until nuke_limit forces varnishd to bail out.
Luckily Varnish Enterprise doesn’t suffer from this limitation when MSE is used. The standard MSE implementation has a level of isolation such that one fetch cannot see space that was freed up by another fetch.
When the memory governor is enabled, a fetch can only allocate objects to memory if it can cover the debt.
Basically, the unfairness is gone, which benefits performance.
Enabling MSE is pretty simple. It’s just a matter of adding -s mse
to the varnishd command, and you’re good to go. This will give you
MSE in memory-only mode with the memory governor enabled. However,
in most cases, you’ll want to specify a bit more configuration.
As mentioned earlier in this chapter, you can point your storage configuration to an MSE configuration file. Here’s a typical example:
varnishd -s mse,/var/lib/mse/mse.conf
The /var/lib/mse/mse.conf contains both the memory-caching
configuration, and the cache-persistence configuration.
You can set the size of the memcache in the MSE config file, and it
will have the same effect as specifying the size of a malloc
stevedore.
Here’s an example:
env: {
	id = "mse";
	memcache_size = "5G";
};
This MSE configuration is named mse and allocates 5 GB for object
caching. This configuration will solve the lucky loser problem
described above but will otherwise be equivalent to a malloc
stevedore set to 5 GB.
You can add some more configuration parameters to the environment. Here’s an example:
env: {
	id = "mse";
	memcache_size = "5G";
	memcache_chunksize = "4M";
	memcache_metachunksize = "4K";
};
These two extra parameters define the maximum memory chunk sizes that can be allocated: memcache_chunksize applies to object data in general, while memcache_metachunksize indicates the size of the metadata for such an object.
As previously mentioned, omitting the configuration file in the
varnishd command line will enable the memory governor. If you do
specify a configuration file, the memory governor can be enabled by
setting memcache_size to "auto", as illustrated below:
env: {
	id = "mse";
	memcache_size = "auto";
};
#### Persistence
Although MSE is a really good memory cache, most people will enable persistence.
While persistence is an important MSE feature, most people just want to cache more objects than they can fit in memory. Either way, you need books and stores.
Here’s a simple example that was already featured earlier in this chapter:
env: {
	id = "mse";
	memcache_size = "auto";
	books = ( {
		id = "book";
		directory = "/var/lib/mse/book";
		database_size = "2G";
		stores = ( {
			id = "store";
			filename = "/var/lib/mse/store.dat";
			size = "100G";
		} );
	} );
};
This example uses a single book, which is located in
/var/lib/mse/book, and a single store, located in
/var/lib/mse/store.dat. The size of the book is limited to 2 GB,
and the store to 100 GB. Meanwhile the memory governor is enabled
to automatically manage the size of the memory cache.
varnishd will not create the files that are required for persistence
to work. You have to initialize those paths yourself. Varnish
Enterprise ships with an mkfs.mse program that reads the
configuration file and creates the necessary files.
The following example uses mkfs.mse to create the necessary files,
based on the /var/lib/mse/mse.conf configuration file:
$ sudo mkfs.mse -c /var/lib/mse/mse.conf
Creating environment 'mse'
Creating book 'mse.book' in '/var/lib/mse/book'
Creating store 'mse.book.store' in '/var/lib/mse/store.dat'
Book 'mse.book' created successfully
Environment 'mse' created successfully
It is also possible to configure multiple stores:
env: {
	id = "mse";
	memcache_size = "auto";
	books = ( {
		id = "book";
		directory = "/var/lib/mse/book";
		database_size = "2G";
		stores = ( {
			id = "store1";
			filename = "/var/lib/mse/store1.dat";
			size = "100G";
		},{
			id = "store2";
			filename = "/var/lib/mse/store2.dat";
			size = "100G";
		} );
	} );
};
The individual store files can be stored on multiple disks to reduce risk, but also to benefit from the improved I/O performance. MSE will cycle through the stores using a round-robin algorithm.
It is also possible to have multiple books, each with their own stores:
env: {
	id = "mse";
	memcache_size = "auto";
	books = ( {
		id = "book1";
		directory = "/var/lib/mse/book1";
		database_size = "2G";
		stores = ( {
			id = "store1";
			filename = "/var/lib/mse/store1.dat";
			size = "100G";
		},{
			id = "store2";
			filename = "/var/lib/mse/store2.dat";
			size = "100G";
		} );
	},{
		id = "book2";
		directory = "/var/lib/mse/book2";
		database_size = "2G";
		stores = ( {
			id = "store3";
			filename = "/var/lib/mse/store3.dat";
			size = "100G";
		},{
			id = "store4";
			filename = "/var/lib/mse/store4.dat";
			size = "100G";
		} );
	} );
};
In this case the metadata databases are in different locations. It would make sense to host them on separate disks as well, just like the stores.
Although books and their stores form a unit, MSE will, when using its round-robin algorithm to find somewhere to store the object, only cycle through the list of stores, ignoring their relationship with books.
Books have various configuration directives, some of which have already been discussed. Although the default values will do for most people, it is worth noting that changing them can be impactful, depending on your use case.
id is a required parameter and is used to name the book. directory
is also required, as it refers to the location where the LMDB
database, lockfiles, and journals that comprise the book will be
hosted.
Here’s a list of parameters that can be tuned:
- database_size: the total size of the LMDB database of the book. Defaults to 1 GB.
- database_readers: the maximum number of simultaneous database readers. Defaults to 4096.
- database_sync: whether or not to wait until the disk has confirmed the disk write of a change in the LMDB database. Defaults to true, which ensures data consistency. Setting it to false will increase performance, but you risk data corruption when a server outage occurs before the latest changes are synchronized to the disk.
- database_waterlevel: the maximum fill level of the LMDB database. Defaults to 0.9, which is 90%.
- database_waterlevel_hysterisis: the over/under ratio to maintain when enforcing the waterlevel of the LMDB database. Defaults to 0.05, which is a 5% over/under on the 90% that was defined in database_waterlevel.
- database_waterlevel_snipecount: the number of objects to remove in one batch when enforcing the waterlevel for the database. Defaults to 10.
- banlist_size: the size of the ban list journal. Defaults to 1 MB. Exceeding this limit will cause new bans to overflow into the LMDB database.

Similar configuration parameters are available for tuning stores. It all starts with two required parameters:
- id: the unique identifier of a store.
- filename: the location on disk of the store file.

Unlike a book, which is a collection of files in a directory, a store consists of a single file. The size of store files is defined by the size parameter, which defaults to 1 GB.
These are the basic settings, but there are more configurable parameters. Here’s a selection:
- align: defaults to 4 KB and defines store allocations to be multiples of this value
- minfreechunk: also defaults to 4 KB and is the minimum size of a free chunk
- aio_requests: the number of simultaneous asynchronous I/O requests. Defaults to 128
- aio_db_handles: the number of simultaneous read-only handles that are available for reading metadata from the corresponding book
- journal_size: the size of the journal that keeps track of incremental changes until they are applied to the corresponding LMDB database. Defaults to 1 MB
- waterlevel_painted: the fraction of objects that is painted as LRU candidates when the waterlevel is reached. By default this is 0.33, which corresponds to 33%
- waterlevel_threads: the number of threads that are responsible for enforcing the waterlevel and removing LRU-painted objects from the cache. Defaults to 1
- waterlevel_minchunksize: the minimum chunk size that is considered when creating continuous free space when the waterlevel is exceeded. Defaults to 512 KB
- waterlevel: the ratio of continuous free space that should be maintained. Defaults to 0.9, which corresponds to 90%
- waterlevel_hysterisis: the over/under ratio to maintain when enforcing the waterlevel of the store. Defaults to 0.05, which is a 5% over/under on the 90% defined in waterlevel
- waterlevel_snipecount: the number of objects to remove in one batch when enforcing the waterlevel. Defaults to 10

The default settings for the book and store configuration have been carefully chosen by our engineers. We advise you to stick with the default values unless you have specific concerns you want to address ahead of time, or you’re experiencing problems.
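To show where these tuning parameters sit, here is a hedged sketch of a store definition that sets a few of them explicitly at the documented defaults. The placement inside the store block follows the configuration examples in this chapter; the exact value syntax for sizes may differ slightly in your version’s documentation:

```
env: {
	id = "mse";
	memcache_size = "auto";
	books = ( {
		id = "book";
		directory = "/var/lib/mse/book";
		stores = ( {
			id = "store1";
			filename = "/var/lib/mse/store1.dat";
			size = "1G";
			// Tuning parameters, shown at their documented defaults:
			aio_requests = 128;
			waterlevel = 0.9;
			waterlevel_snipecount = 10;
		} );
	} );
};
```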
Round-robin is the default way stores are selected in MSE; this can be changed in VCL through vmod_mse.
Stores can be selected individually by name, or you can select all the stores in a book by using the name of the book.
However, in most cases it is more natural to apply tags to the books and/or stores, and use those tags to select the set of stores you want MSE to choose from. Store selection is a core MSE feature; vmod_mse is merely the interface you use when you need to override the default behavior.
Tagging happens in the MSE configuration file, typically
/var/lib/mse/mse.conf, and has not been discussed in this section up
until this point.
In your /var/lib/mse/mse.conf file you can use the tags directive to
associate tags with individual stores.
Here’s an example in which one store is hosted on a large SATA disk, and the other store is hosted on a smaller but much faster SSD disk:
env: {
	id = "mse";
	memcache_size = "auto";
	books = ( {
		id = "book";
		directory = "/var/lib/mse/book";
		stores = ( {
			id = "store1";
			filename = "/var/lib/mse/store1.dat";
			size = "500G";
			tags = ( "small", "ssd" );
		},{
			id = "store2";
			filename = "/var/lib/mse/store2.dat";
			size = "10T";
			tags = ( "big", "sata" );
		} );
	} );
};
In this case store1 has 500 GB of SSD storage at its disposal. At least, that is what the tag indicates.
store2, on the other hand, is 10 TB in size and has a sata tag linked to it, implying a larger but slower disk.
It is up to you to decide on the naming of tags; their names have no underlying significance. You can easily change the tag ssd into fast, and sata into slow, if that is more intuitive to you.
These tags will be used in VCL when vmod_mse comes into play.
It is also possible to apply these tags on the book level. This means that all underlying stores will be tagged with these values.
Here’s a multi-book example:
env: {
	id = "mse";
	memcache_size = "auto";
	books = ( {
		id = "book1";
		directory = "/var/lib/mse/book1";
		tags = ( "small", "ssd" );
		stores = ( {
			id = "store1";
			filename = "/var/lib/mse/store1.dat";
			size = "500G";
		},{
			id = "store2";
			filename = "/var/lib/mse/store2.dat";
			size = "500G";
		} );
	},{
		id = "book2";
		directory = "/var/lib/mse/book2";
		tags = ( "big", "sata" );
		stores = ( {
			id = "store3";
			filename = "/var/lib/mse/store3.dat";
			size = "10T";
		},{
			id = "store4";
			filename = "/var/lib/mse/store4.dat";
			size = "10T";
		} );
	} );
};
In this case store1 and store2 are tagged as small and ssd because these tags were applied to their corresponding book. As the tags imply, they are probably backed by smaller SSD disks.
For store3 and store4, which are managed by book2, the tags are big and sata. No surprises here either: the size of the stores justifies the big tag, and we can only assume the underlying disks are SATA disks.
Simply adding tags, like in the example above, does not change anything. The default behavior, which is round-robin between all of the stores, still takes place until a set of stores is explicitly selected: either in VCL or in the configuration file.
Let’s have a look at how this is done.
One way of selecting the default stores is by using the default_stores
configuration directive in /var/lib/mse/mse.conf. This directive
refers to a store based on its name, its tag, or the name or tag of the
book.
Based on the example above we could configure the default stores as follows:
default_stores = "sata";
Unless instructed otherwise in VCL, the default stores will be
store3 and store4.
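For completeness, here is a sketch of where default_stores sits in the configuration file. We are assuming it is placed at the environment level, alongside memcache_size; the book and store definitions are abbreviated from the earlier example:

```
env: {
	id = "mse";
	memcache_size = "auto";
	default_stores = "sata";
	books = ( {
		id = "book2";
		directory = "/var/lib/mse/book2";
		tags = ( "big", "sata" );
		stores = ( {
			id = "store3";
			filename = "/var/lib/mse/store3.dat";
			size = "10T";
		} );
	} );
};
```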
There is also a special value none, which does not refer to any tag.
default_stores = "none";
This example will ensure objects only get stored in memory cache unless instructed otherwise in VCL.
As soon as default_stores is set, round-robin store selection no longer applies; it is replaced by a random selection in which a potentially uneven weighting is applied based on store size.
You can have fine-grained control over your store selection, if you
leverage vmod_mse.
This VMOD has an mse.set_stores() function that allows you to refer
to a book, a store, or a tag. Also the special value none is
allowed to bypass the persistence layer.
Here’s an example where we select the stores based on response-header criteria:
vcl 4.1;
import mse;
import std;
sub vcl_backend_response {
	if (beresp.ttl < 120s) {
		mse.set_stores("none");
	} else {
		if (beresp.http.Transfer-Encoding ~ "chunked" || 
		std.integer(beresp.http.Content-Length,0) > std.bytes("100M")) {
			mse.set_stores("sata");
		} else {
			mse.set_stores("ssd");
		}
	}
}
Let’s break it down:
- Objects with a TTL of less than two minutes are not persisted; they are only stored in memory.
- Objects whose Content-Length response header indicates a size of more than 100 MB will be stored on the sata-tagged stores.
- Objects whose Transfer-Encoding response header is set to chunked also end up on the sata-tagged stores. Because data is streamed to Varnish, we have no clue of the size ahead of time.
- All other objects are stored on the ssd-tagged stores.

When one or more stores are selected based on a tag or name, either in VCL or by using default_stores, the actual destination of the object will always be determined by a fast quasi-random number generator.
This means that if you add just a few objects to your MSE, you should expect the distribution to be uneven, but for any reasonable number of objects, the unevenness should be negligible.
You can change the weighting of stores through the mse.set_weighting() function to one of the following:
- size: bigger stores have a higher probability of being selected.
- available: stores with more available space have a higher probability of being selected.
- smooth: store size and availability of space are combined to assign weights to stores.

As you already know, size is the default weighting mechanism when a store is selected through default_stores or mse.set_stores(). The mse.set_weighting() function allows you to choose a different mechanism.
Here’s an example of how to set the weighting to smooth:
vcl 4.1;
import mse;
sub vcl_backend_response {
	mse.set_weighting(smooth);
}
Although there is a dedicated section about monitoring coming up later
in this chapter, we do want to hint at monitoring internal MSE
counters using varnishstat.
Here’s a table with some counters that relate to memory caching:
| Counter | Meaning | 
|---|---|
| MSE.mse.g_bytes | Bytes outstanding | 
| MSE.mse.g_space | Bytes available | 
| MSE.mse.n_lru_nuked | Number of LRU-nuked objects | 
| MSE.mse.n_vary | Number of Vary header keys | 
| MSE.mse.c_memcache_hit | Stored objects cache hits | 
| MSE.mse.c_memcache_miss | Stored objects cache misses | 
| MSE.mse.g_ykey_keys | Number of YKeys registered | 
| MSE.mse.c_ykey_purged | Number of objects purged with YKey | 
If you run the following command, you can use the MSE.mse.g_space
counter to see how much space is left in memory for caching:
varnishstat -f MSE.mse.g_space
The mse keyword in these counters refers to the name of your
environment. In this case it is named mse. If you were to name your
environment server1, the counter would be MSE.server1.g_space. If
you want to make sure you see all environments, you can use an asterisk,
as illustrated below:
varnishstat -f MSE.*.g_space
There are also counters to monitor the state of your books. Here’s a table with some select counters related to books:
| Counter | Meaning | 
|---|---|
| MSE_BOOK.book1.n_vary | Number of Vary header keys | 
| MSE_BOOK.book1.g_bytes | Number of bytes used in the book database | 
| MSE_BOOK.book1.g_space | Number of bytes available in the book database | 
| MSE_BOOK.book1.g_waterlevel_queue | Number of threads queued waiting for database space | 
| MSE_BOOK.book1.c_waterlevel_queue | Number of times a thread has been queued waiting for database space | 
| MSE_BOOK.book1.c_waterlevel_runs | Number of times the waterlevel purge thread was activated | 
| MSE_BOOK.book1.c_waterlevel_purge | Number of objects purged to achieve database waterlevel | 
| MSE_BOOK.book1.c_insert_timeout | Number of times database object insertion timed out | 
| MSE_BOOK.book1.g_banlist_bytes | Number of bytes used from the ban list journal file | 
| MSE_BOOK.book1.g_banlist_space | Number of bytes available in the ban list journal file | 
| MSE_BOOK.book1.g_banlist_database | Number of bytes used in the database for persisted bans | 
These counters specifically refer to book1, but as there are multiple
books, it makes sense to query on all books, as illustrated below:
varnishstat -f MSE_BOOK.*
This command shows all counters for all books.
And finally, stores also have their own counters. Here’s the table:
| Counter | Meaning | 
|---|---|
| MSE_STORE.store1.g_waterlevel_queue | Number of threads queued waiting for store space | 
| MSE_STORE.store1.c_waterlevel_queue | Number of times a thread has been queued waiting for store space | 
| MSE_STORE.store1.c_waterlevel_purge | Number of objects purged to achieve store waterlevel | 
| MSE_STORE.store1.g_objects | Number of objects in the store | 
| MSE_STORE.store1.g_ykey_keys | Number of YKeys registered | 
| MSE_STORE.store1.c_ykey_purged | Number of objects purged with YKey | 
| MSE_STORE.store1.g_alloc_bytes | Total number of bytes in allocation extents | 
| MSE_STORE.store1.g_free_bytes | Total number of bytes in free extents | 
| MSE_STORE.store1.g_free_small_bytes | Number of bytes in free extents smaller than 16k | 
| MSE_STORE.store1.g_free_16k_bytes | Number of bytes in free extents between 16k and 32k | 
| MSE_STORE.store1.g_free_32k_bytes | Number of bytes in free extents between 32k and 64k | 
| MSE_STORE.store1.g_free_64k_bytes | Number of bytes in free extents between 64k and 128k | 
| MSE_STORE.store1.g_free_128k_bytes | Number of bytes in free extents between 128k and 256k | 
| MSE_STORE.store1.g_free_256k_bytes | Number of bytes in free extents between 256k and 512k | 
| MSE_STORE.store1.g_free_512k_bytes | Number of bytes in free extents between 512k and 1m | 
| MSE_STORE.store1.g_free_1m_bytes | Number of bytes in free extents between 1m and 2m | 
| MSE_STORE.store1.g_free_2m_bytes | Number of bytes in free extents between 2m and 4m | 
| MSE_STORE.store1.g_free_4m_bytes | Number of bytes in free extents between 4m and 8m | 
| MSE_STORE.store1.g_free_large_bytes | Number of bytes in free extents larger than 8m | 
Besides the typical free space, bytes allocated, and number of objects
in the store, you’ll also find detailed counters on the extents per
size. This is part of the anti-fragmentation mechanism that aims to
have continuous free space but tolerates a level of fragmentation
below the waterlevel_minchunksize.
In the beginning, there will be no fragmentation within your store, so
the MSE_STORE.store1.g_free_large_bytes counter will be high, the
others will be low or zero.
The following command will monitor the free bytes per extent for all stores:
varnishstat -f MSE_STORE.*.g_free_*
A more basic situation to monitor is the number of objects in cache, the available space in the stores, and the used space. Here’s how you do that:
varnishstat -f MSE_STORE.*.g_objects -f MSE_STORE.*.g_free_bytes -f MSE_STORE.*.g_alloc_bytes
The promise of MSE is persistence. You benefit from this when a server restart occurs: stores and books will be loaded from disk and the full context is restored.
If you take a backup of your stores and books, you can restore this backup and perform a disaster recovery.
Even when your books and stores are corrupted, or even gone, the backup will restore the previous state, and your cache will be warm.
Besides disaster recovery, this strategy can also be used to pre-warm the cache on new Varnish instances.
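As a rough illustration of the mechanics, here is a hedged shell sketch of a cold backup-and-restore cycle. It operates on throwaway stand-in files under a temp directory; in a real setup you would copy the actual book directory and store files (e.g. /var/lib/mse/book1 and the corresponding .dat files) while varnishd is stopped, so that the files on disk are consistent.

```shell
# Simulated MSE layout under a temp dir (stand-ins for the real files).
WORK=$(mktemp -d)
mkdir -p "$WORK/var/lib/mse/book1"
echo "lmdb-metadata" > "$WORK/var/lib/mse/book1/data.mdb"
echo "store-payload" > "$WORK/var/lib/mse/store1.dat"

# 1. Back up the book directory and its store file(s) as one unit.
mkdir -p "$WORK/backup"
cp -a "$WORK/var/lib/mse/book1" "$WORK/backup/"
cp "$WORK/var/lib/mse/store1.dat" "$WORK/backup/"

# 2. Simulate losing the live files, then restore from the backup.
rm -rf "$WORK/var/lib/mse/book1" "$WORK/var/lib/mse/store1.dat"
cp -a "$WORK/backup/book1" "$WORK/var/lib/mse/"
cp "$WORK/backup/store1.dat" "$WORK/var/lib/mse/"

# The restored store contents match what was backed up.
cat "$WORK/var/lib/mse/store1.dat"
```

Note that a book holds the metadata for its stores, so back them up and restore them together as a set rather than individually.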
When you restart varnishd and the recovered books and stores are found, the output can contain the following:
varnish    | Info: Child (24) said Store mse.book1.store1 revived 6 objects
varnish    | Info: Child (24) said Store mse.book1.store1 removed 0 objects (partial=0 age=0 marked=0 noban=0 novary=0)
varnish    | Info: Child (24) said Store mse.book1.store2 revived 2 objects
varnish    | Info: Child (24) said Store mse.book1.store2 removed 0 objects (partial=0 age=0 marked=0 noban=0 novary=0)
varnish    | Info: Child (24) said Store mse.book2.store3 revived 6 objects
varnish    | Info: Child (24) said Store mse.book2.store3 removed 0 objects (partial=0 age=0 marked=0 noban=0 novary=0)
varnish    | Info: Child (24) said Store mse.book2.store4 revived 3 objects
varnish    | Info: Child (24) said Store mse.book2.store4 removed 0 objects (partial=0 age=0 marked=0 noban=0 novary=0)
varnish    | Info: Child (24) said Environment mse fully populated in 0.00 seconds. (0.00 0.00 0.00 17 0 3/4 4 0 4 0)
As you can see, objects were revived. In huge caches with lots of objects, loading the environment might take a bit longer.
MSE can easily load more than a million objects per second. Although the time it takes for MSE to load objects from disk depends on the kind of hardware you use, we can assume that for bigger caches this would only take mere seconds.