Massive Storage Engine

The Massive Storage Engine is probably the most significant feature that Varnish Enterprise offers. It combines the speed of memory and the reliability of disk to offer a stevedore that can cache terabytes of data on a single Varnish server.

The cache is persisted and can survive a restart. This means you don’t need to rewarm the cache when a Varnish server recovers from a restart.

MSE is commonly used to build custom CDNs and to accelerate OTT video platforms.

But it’s not only about caching large volumes of data: MSE’s memory implementation represents a clear improvement over Varnish’s original malloc stevedore.

Before we dive into the details, we need to talk about the history of MSE, and why this stevedore was developed in the first place.

History

Before MSE was a viable stevedore for caching large volumes of data, the Varnish project had two disk-based stevedores:

  • file
  • persistence

The file stevedore

The file stevedore stores objects on disk, but not in a persistent way: a restart will result in an empty cache.

That is because the file stevedore is nothing more than memory storage that is backed by a file: if the size of the cache exceeds the available memory, content will be sent to the disk. And content that is not available in memory will be loaded from the disk.

It’s just like the operating system’s swap mechanism, where the disk is only used to extend the capacity of the memory without offering any persistence guarantees.

The operating system’s page cache will decide what the hot content is that should be buffered in memory. This ensures that not every access to the file system results in I/O instructions.

The problem with the file stevedore is that the page cache isn’t really optimized for caches like Varnish. The page cache will guess what content should be in memory, and if that guess is wrong, file system access is required.

This results in a lot of context switching, which consumes server resources.

The biggest problem, however, is disk fragmentation, which increases over time. Because the file stevedore has no mechanism to efficiently allocate objects on disk, objects might be scattered across multiple blocks. This results in a lot more disk I/O taking place to assemble the object.

Over time, this becomes a real performance killer. The only solution is to restart Varnish, which will allow you to start over again. Unfortunately, this will also empty the cache, which is never good.

The persistence stevedore

As the name indicates, the persistence stevedore can persist objects on disk. These objects can survive a server restart. This way your cache remains intact, even after a potential failure.

However, we would not advise you to ever use this stevedore. This is a quote from the official Varnish documentation for this feature:

The persistent storage backend has multiple issues with it and will likely be removed from a future version of Varnish.

The main problem is that the persistence stevedore is designed as a log-structured file system that enforces a strict FIFO approach.

This means:

  • Objects always come in one way.
  • Objects always come out the other way.

This means that the first object that is inserted will be the first one to be evicted. Although this works fine from a write-performance point of view, it completely sidelines key Varnish features, such as LRU eviction, banning, and many more.

Since this stevedore is basically unusable in most real-life situations, and since the file stevedore is not persistent and prone to disk fragmentation, there is a need for a better solution. Enter the Massive Storage Engine.

Early versions of MSE

In the very first year of Varnish Software’s existence, a plan was drawn up to develop a good file-based stevedore.

The initial focus was not to develop a persistent stevedore, but the critical goal in the beginning was to offer the capability to cache objects that are larger than the available memory.

Even the initial implementation was not that different from the file stevedore. It was mostly a matter of smoothing the rough edges to create an improved version of the file stevedore.

Behind the scenes memory-mapped files were still used to store the objects. This means that the operating system decides which portions of the file it keeps in the page cache. Since the page cache corresponds to a part of the system’s actual physical memory, the result is that the operating system’s page cache serves as the memory cache for Varnish, storing the hot objects, whereas the file system contains all objects.

The critical goal for the second version of MSE was to add persistent storage that would survive a server restart. The metadata, residing in memory in version 1, also needed to be persisted on disk. Memory-mapped files were also used to accomplish this.

A major side effect of memory-mapped files, in combination with large volumes of data, is the overhead of the very large memory maps that are created by the kernel: for each page in a memory-mapped file, the kernel will allocate some bytes in its page tables.

If for example the backing files grow to 20 TB worth of cached objects, an overhead of 90 GB of memory could occur.

This is memory that is not limited by the storage engine and is expected to be readily available on top of memory that was already allocated to Varnish for object storage. Managing these memory maps can become CPU intensive too.

Architecture

With the release of Varnish Enterprise 6, a new version of MSE was released as well. MSE 3 was designed with the limitations of prior versions in mind.

Let’s look at the improved architecture of MSE, and learn why this is such an important feature in Varnish Enterprise 6.

Memory vs disk

In previous versions of MSE the operating system’s page cache mechanism was responsible for deciding what content from a memory-mapped file would end up in memory. This is the so-called hot content.

Not only did this result in large memory maps, which create overhead, it was also tough for the page cache to determine which parts of the disk should be cached in memory. What is the hot data, and how can the page cache know without the proper context?

That’s the reason why, in MSE, we decided to reimplement the page cache in user space. From an architecture point of view, MSE version 3 stepped away from memory-mapped files for object storage and implemented the logic for loading data from files into memory itself. This also implies that the persistence layer had to be redesigned.

By no longer depending on these memory-mapped files, the overhead from keeping track of pages in the kernel has been eliminated.

And because the page cache mechanism in MSE version 3 is a custom implementation, MSE has the necessary context and can more accurately determine the hot content. The result is higher efficiency and less pressure on the disks, compared to previous versions.

As before, memory only contains a portion of the total cached content, while the persistence layer stores the full cache. However, with the increased control MSE has over what goes where, Varnish can greatly reduce the number of I/O operations, a limiting factor in many SSD drives. MSE can even decide to make an object memory-only if it realizes it is lagging behind in writing to disk, whereas previous versions would slow down considerably.

Traditionally, non-volatile storage options have been perceived as slow, but now it is possible to configure a system that can read hundreds of gigabits per second from an array of persistent storage drives. MSE is designed to work well with all kinds of hardware, all the way down to spinning magnetic drives.

As we’ll explain next, the disk-based storage layer consists of a metadata database, which we call a book, and persistent object storage, which we call the store. Both are configurable: you can spread the data across multiple disks, and you can have multiple books, each of which can reference multiple stores.

Books

Let’s talk about books first. As mentioned, these books contain metadata for each of the objects in the MSE.

The main questions these books answer are:

  • Which objects does your cache contain at any point in time?
  • Where on disk should I look to find these objects?

In terms of metadata, the book holds the following information:

  • Lifetime counters, such as TTL, grace, and keep
  • Object hashes
  • Information about cache variations
  • Ban information
  • Ykey indexes
  • Information about free storage
  • The location of an object on disk

This information is kept in a Lightning Memory-Mapped Database (LMDB). This is a low-level transactional database that lives inside a memory-mapped file. This is a specific case where we still rely on the operating system’s page cache to manage what lives in memory and what is stored on disk. To keep the database safe to use after a power failure, the database code will force write pages to disk.

The size of the book is set in the MSE configuration, and directly impacts the number of objects you can have in the corresponding stores. It does this by imposing a maximum amount of metadata that can reside in the book at any point in time.

If you have few Ykeys and your objects are not too big, a good rule of thumb is to have 2 kB book space per object.

Let’s look at a concrete example for this rule of thumb: if you have stores with a total of 10 TB of data, and your average object size is 1 MB, then your maximum number of objects is ten million. With 2 kB per object, the rule indicates that the book should be at least 20 GB.
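
Expressed in the configuration syntax that is covered later in this chapter, this hypothetical sizing could look as follows; the 20 GB and 10 TB values come straight from the example above and are not recommended defaults:

env: {
	id = "mse";
	memcache_size = "auto";

	books = ( {
		id = "book";
		directory = "/var/lib/mse/book";
		database_size = "20G";

		stores = ( {
			id = "store";
			filename = "/var/lib/mse/store.dat";
			size = "10T";
		} );
	} );
};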

If it turns out that you have sized the book much bigger than your needs, then the extra space in the book will simply never be used, and its contents will never make it into the page cache or consume any memory. If the book gets too close to full, Varnish will start removing objects from the store, resulting in a potential under-usage of the store. For this reason it is better to err on the safe side when calculating the optimal size for the book, unless you have space available to expand the book after the fact.

In most cases, you should account for memory corresponding to the number of bytes in use in the book. This will let the kernel keep the book in memory at all times instead of having to page in parts of the book that have not been used in a while. The exception is when the book contains a high proportion of objects that are very infrequently accessed, and paging in data does not significantly reduce the performance of the system. We will get back to this when discussing the memory governor later in this chapter.

It is possible to configure multiple books. This is especially useful when you use multiple disks for cache storage, which improves performance. In case of disk failures, partitioning the cache reduces potential data loss.

The standard location of the books is in /var/lib/mse. This folder is allowed by our SELinux rules, which are part of our packaging.

What we call the book is actually a directory that contains multiple files:

  • MSE.lck is a lock file that protects the storage from potentially being accessed by multiple Varnish instances.
  • data.mdb is the actual LMDB database that contains the metadata.
  • lock.mdb is the internal lock file from LMDB.
  • varnish-mse-banlist-15f19907.dat is a per-book ban list, containing the currently active bans.
  • varnish-mse-journal-75b6069b.dat is a per-store journal that keeps track of incremental changes prior to the final state being stored in data.mdb.

Stores

The stores are the physical files on disk that contain the cached objects. The stores are designed as pre-allocated large files. Inside these files, a custom file system is used to organize content.

Stores are associated with a book. The book holds the location of each object in the store. Without the book, there is no way to retrieve objects from cache because MSE wouldn’t know where to look.

If you lose the book, or the book gets corrupted, there is no way to regenerate it. That part of your persisted cache is lost. But remember: it’s a cache; the data can always be regenerated from the origin.

Every store is represented as a single file, and the standard location of these files is /var/lib/mse. This is also because our SELinux rules allow this folder to be used by varnishd.

There are significant performance benefits when using pre-allocated large files.

The most obvious one is reducing I/O overhead that results from opening files and iterating through directories. Unlike typical storage systems that entirely rely on the file system, MSE doesn’t create a file per object. Only a single file is opened, and this happens when varnishd is started.

Disk fragmentation is also a huge one: by pre-allocating one large file per store, disk fragmentation is heavily reduced. The size of the file and its location on disk are fixed, and all access to persisted objects is done within that file.

MSE also has algorithms to find and create continuous free space within that file. This results in fewer I/O operations and allows the system to get more out of the disk’s bandwidth without being limited by I/O operations per second (IOPS).

Access to stores is done using asynchronous I/O, which doesn’t block the progress of execution while the system is waiting for data from the disks. This also boosts performance quite a bit.

The danger of disk fragmentation

We just talked about the concept of stores, and disk fragmentation was frequently mentioned.

A disk is fragmented when single files get spread across multiple regions on the disk. For spinning disks, this would cause a mechanical head to have to seek from one area of the disk to another with a significantly reduced performance as a result. In the era of SSDs, disk fragmentation is less of an issue, but it is not free: all drives have a maximum number of I/O operations per second in addition to the maximum bandwidth. When a disk is too fragmented, or objects are very small, this can become a limiting factor. Needless to say, it is important to consider the I/O operations per second (IOPS) for the drives when configuring any server, and NVMe drives generally perform much better than SATA SSDs.

MSE has mechanisms for reducing fragmentation, which work well for most use cases, but huge caches with small objects will still require drives with a high number of IOPS.

Selecting a location for the store and book

MSE makes sure that the fragmentation of the store is low, but it cannot control the fragmentation of the store file in the file system. For this reason it is recommended to only put MSE stores on volumes dedicated to MSE. The easiest way is to put a single store and its book on each physical drive intended for MSE, and to keep the operating system on a separate drive. When the store file is created, MSE makes sure to pre-allocate all the space for the store file to make sure that the file system cannot introduce more fragmentation after the creation of the store. Some file systems, like xfs, do not implement this, so only ext3 and ext4 would be candidates. However, we only support ext4 for the MSE stores.
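
As a sketch of that layout, assuming a dedicated drive that the operating system exposes as /dev/nvme1n1 (the device name and mount point are placeholders for this example), preparing an ext4 volume for MSE could look like this:

# create a dedicated ext4 file system on the drive reserved for MSE
sudo mkfs.ext4 /dev/nvme1n1

# mount it at the location where the book and store will live
sudo mkdir -p /var/lib/mse
sudo mount /dev/nvme1n1 /var/lib/mse

Each additional drive intended for MSE would get its own mount point, with its own book and store configured on it.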

Currently there is no support for using the block device directly with no file system on top, but it might arrive in the future.

Making sure there is room for more

When an MSE store or book starts to get almost full, MSE needs to evict objects from the store or book in question. The eviction process, often called nuking, starts when the amount of used space reaches a certain level, explained below. The process tries to delete content that has not been accessed in a while, but the method is slightly different from the least recently used (LRU) eviction found in memory-based stevedores. The difference is based on MSE’s desire to create continuous free space to avoid fragmentation.

There are individual waterlevel parameters for books and stores, but they both default to 90%. For the stores, the parameter is called waterlevel, while the book’s waterlevel nuking is controlled by the database_waterlevel parameter.

If the parameters are left at the default values, and either the book or the store usage reaches 90%, backend fetches will be paused until the usage goes below 90%. To avoid performance degradation for backend fetches, MSE starts evicting objects before the waterlevel is reached. The runtime parameters waterlevel_hysterisis for stores, and database_waterlevel_hysterisis for books, both defaulting to 5%, control this behavior. If all the parameters are left at their default values, MSE will start evicting objects when the store or the book are 85% full, and this is usually sufficient to keep the usage under 90%, avoiding stalling fetches as a result.

The goal is to evict neighboring segments within the store to create continuous free space without removing objects that have been used recently. The thread that is responsible for freeing space by removing objects scans objects linearly and tests whether the object belongs to the least recently used third. If that is the case, it is removed. Once we get below the waterlevel, the eviction mechanism is paused and can resume from that position on the next run.

The description above is actually slightly misleading since we have not yet explained exactly how MSE calculates the amount of used and free space. Since small free chunks are not usable for placing big objects without creating significant fragmentation, MSE will only consider sufficiently large chunks of unused space when calculating the total number of bytes available for allocation. The waterlevel_minchunksize parameter defines the minimum chunk size that is counted, and it defaults to 512 KB.

In other words, only chunks that are equal to or greater than waterlevel_minchunksize will be considered when making sure that there is, by default, at least 15% free space in the store. All chunks that are smaller than the 512 KB default value remain untouched. However, these smaller chunks of free space are still eligible when MSE needs to insert objects that are smaller than 512 KB, and MSE will even select the smallest one that is big enough for any new allocation.

It might be tempting to reduce the waterlevel_minchunksize to a low value, but that will increase fragmentation of bigger objects, as they will often be chopped into pieces equal to the size of waterlevel_minchunksize. Such fragmentation actually increases usage in the book, as each chunk will need its own entry in the book.

Basically, waterlevel_minchunksize is a tunable fragmentation tolerance level, and finding the right value for you depends on how MSE is used. A high value will minimize fragmentation, which translates into a leaner book and higher performance due to fewer I/O operations, while lower values will fit more small objects into the cache.

Problems with the traditional memory allocator

The malloc stevedore, used by most Varnish Cache servers, needs to be configured to hold a fixed maximum amount of data.

A common rule of thumb is to configure it to be 80% of the total memory of a server. For a server with 64 GB of RAM, this translates to a little over 51 GB. The remaining 20% is then available for the operating system and for various parts of the varnishd process.
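
As a point of reference, and purely to illustrate the rule of thumb, the corresponding malloc configuration for that 64 GB server would be passed to varnishd like this (other startup flags omitted):

varnishd -s malloc,51g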

Unfortunately, this rule of thumb does not always work well. The optimal value for your server heavily depends on traffic patterns, on worker thread memory requirements, on object data structures, and on transient storage.

None of these memory needs are accounted for in the malloc stevedore, which makes it hard to guess what Varnish’s total memory footprint will be by just looking at the malloc sizing.

The result is that the server will suddenly be out of memory, even when you apply a seemingly conservative size to your malloc store. If certain aspects of your VCL require a lot of memory to be executed, or if your transient storage goes through the roof, you’ll be in trouble, and there’s no predictable way to counter this. On the other hand, if your server is very simple, stores a few big objects, and does not serve a lot of concurrent users, many gigabytes of memory can sit unused when it would be better to use that memory for caching.

Memory governor

MSE’s memory governor feature solves the problem with using the stevedore to control the memory usage of Varnish. Instead of assigning a fixed amount of space for cache payloads, the memory governor aims to limit the total memory usage of the varnishd process, as seen by the operating system.

Instead of limiting the size of the cache in MSE by setting the memcache_size configuration directive to an absolute value, we can set it to auto, which limits the total memory usage of the varnishd process instead.

When the memory governor observes that the memory usage is too high, it will start removing objects from the cache until the memory usage goes under the limit. This means that the actual memory used by object payloads will vary when other memory usage varies, but the total will be near constant. For example, if there are suddenly thousands of connections coming in, and thousands of threads need to be started to serve the connections, the extra memory usage from the connection handling will result in some objects being removed from memory until things calm down. If MSE with persistence is in use, no objects will be removed from the cache, just from the memory part of the cache.

The memory_target runtime parameter, which is set at 80% by default, will ensure varnishd remains below that memory consumption ratio. memory_target can also be set to an absolute value, just like you would with the -s parameter.

When you set memory_target to a negative value, you can indicate how much memory you want the memory governor to leave unused.

The memory_target can also be changed at runtime. This can be useful if you need some memory for a different process and need varnishd to use less memory for a while.
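
As a sketch of what that could look like, assuming the standard param.set command of the Varnish CLI applies to this parameter, and with arbitrary example values:

# lower the memory target on a running instance, no restart required
varnishadm param.set memory_target 70%

# or give the memory governor an absolute budget instead of a percentage
varnishadm param.set memory_target 48g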

The memory_target does not include the part of the kernel’s page cache that is used to keep frequently used parts of the book in memory for fast access. For this reason, it might be necessary to tune down the memory_target parameter if your cache contains a lot of objects, and the book usage, measured in bytes, starts to creep up towards 10% of your available memory. It is not necessarily bad to have some paging activity as long as it stays under control.

When running MSE, it is a good idea to monitor paging behavior on the system, for example by using the vmstat tool. If it suddenly goes through the roof, you should consider reducing memory_target and see if it helps. Remember that this can be done on a running Varnish without restarting the service.
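
A minimal way of doing that, using nothing but standard tooling, is to let vmstat print statistics at a fixed interval and keep an eye on the swap columns (si/so) and the block I/O columns (bi/bo):

# print memory, swap and block I/O statistics every 5 seconds
vmstat 5

Sustained spikes in those columns are a hint that memory_target might be set too high for the workload.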

Debt collection

Enforcing the memory target is done similarly to the waterlevel & hysteresis mechanisms inside the persistence layer: it is also an over/under measurement.

When varnishd requests memory from the OS and as a result exceeds the memory_target, debt is incurred, which should quickly be repaid.

Repaying debt is done by removing objects from the cache on an LRU basis. Repaying accumulated debt is a shared responsibility:

  • Fetches that contribute to the debt should repay it themselves.
  • General debt is repaid by the debt collector thread.

If a fetch needs to store a 2 MB object, and as a consequence surpasses memory_target, it needs to remove 2 MB of content from the cache using LRU.

The debt collector thread will ensure varnishd’s memory consumption goes below the memory_target by removing objects from cache until the target is reached.

Funnily enough, the debt collector thread is nicknamed the governator because in order to govern, it needs to terminate objects.

Lucky loser

Varnish Cache suffers from a concept called the lucky loser effect.

When a fetch in Varnish Cache needs to free up space from the cache in order to facilitate its cache insert, it risks losing that space to a competing fetch.

That competing fetch was also looking for free space and happened to grab it because the other fetch freed it up. The competing fetch is the lucky loser, and the original fetch is forced to retry.

In theory the fetch can get stuck in a loop, continuously trying to free up space but failing to claim it. The nuke_limit runtime parameter defines how many times a fetch may try to free up space. The standard value is 50.
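
nuke_limit is a regular varnishd runtime parameter, so on a Varnish Cache server it can be raised at startup with -p, or changed on the fly through the CLI; the value 100 below is just an arbitrary example:

# at startup (other flags omitted)
varnishd -p nuke_limit=100

# or on a running instance, through the CLI
varnishadm param.set nuke_limit 100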

This concept can become detrimental to the performance of Varnish. The originating request will be left waiting until the object is stored in cache or until nuke_limit forces varnishd to bail out.

Luckily, Varnish Enterprise doesn’t suffer from this limitation when MSE is used. The standard MSE implementation has a level of isolation such that a fetch cannot see space that was freed up by another fetch.

When the memory governor is enabled, every fetch can only allocate objects to memory if it can cover the debt.

Basically, the unfairness is gone, which benefits performance.

Configuration

Enabling MSE is pretty simple. It’s just a matter of adding -s mse to the varnishd command, and you’re good to go. This will give you MSE in memory-only mode with the memory governor enabled. However, in most cases, you’ll want to specify a bit more configuration.

As mentioned earlier in this chapter, you can point your storage configuration to an MSE configuration file. Here’s a typical example:

varnishd -s mse,/var/lib/mse/mse.conf

The /var/lib/mse/mse.conf file contains both the memory-caching configuration and the cache-persistence configuration.

Memory configuration

You can set the size of the memcache in the MSE config file, and it will have the same effect as specifying the size of a malloc stevedore.

Here’s an example:

env: {
	id = "mse";
	memcache_size = "5G";
};

This MSE configuration is named mse and allocates 5 GB for object caching. This configuration will solve the lucky loser problem described above but will otherwise be equivalent to a malloc stevedore set to 5 GB.

You can add some more configuration parameters to the environment. Here’s an example:

env: {
	id = "mse";
	memcache_size = "5G";
	memcache_chunksize = "4M";
	memcache_metachunksize = "4K";
};

These two extra parameters define maximum chunk sizes for memory allocations: memcache_chunksize applies to the object data in general, while memcache_metachunksize is an indication of the size of the metadata for such an object.

As previously mentioned, omitting the configuration file in the varnishd command line will enable the memory governor. If you do specify a configuration file, the memory governor can be enabled by setting memcache_size to "auto", as illustrated below:

env: {
	id = "mse";
	memcache_size = "auto";
};

Persistence

Although MSE is a really good memory cache, most people will enable persistence.

While persistence is an important MSE feature, most people just want to cache more objects than they can fit in memory. Either way, you need books and stores.

Here’s a simple example that was already featured earlier in this chapter:

env: {
	id = "mse";
	memcache_size = "auto";

	books = ( {
		id = "book";
		directory = "/var/lib/mse/book";
		database_size = "2G";

		stores = ( {
			id = "store";
			filename = "/var/lib/mse/store.dat";
			size = "100G";
		} );
	} );
};

This example uses a single book, which is located in /var/lib/mse/book, and a single store, located in /var/lib/mse/store.dat. The size of the book is limited to 2 GB, and the store to 100 GB. Meanwhile the memory governor is enabled to automatically manage the size of the memory cache.

varnishd will not create the files that are required for persistence to work. You have to initialize those paths yourself. Varnish Enterprise ships with an mkfs.mse program that reads the configuration file and creates the necessary files.

The following example uses mkfs.mse to create the necessary files, based on the /var/lib/mse/mse.conf configuration file:

$ sudo mkfs.mse -c /var/lib/mse/mse.conf
Creating environment 'mse'
Creating book 'mse.book' in '/var/lib/mse/book'
Creating store 'mse.book.store' in '/var/lib/mse/store.dat'
Book 'mse.book' created successfully
Environment 'mse' created successfully

It is also possible to configure multiple stores:

env: {
	id = "mse";
	memcache_size = "auto";

	books = ( {
		id = "book";
		directory = "/var/lib/mse/book";
		database_size = "2G";

		stores = ( {
			id = "store1";
			filename = "/var/lib/mse/store1.dat";
			size = "100G";
		},{
			id = "store2";
			filename = "/var/lib/mse/store2.dat";
			size = "100G";
		} );
	} );
};

The individual store files can be stored on multiple disks to reduce risk, but also to benefit from the improved I/O performance. MSE will cycle through the stores using a round-robin algorithm.

It is also possible to have multiple books, each with their own stores:

env: {
	id = "mse";
	memcache_size = "auto";

	books = ( {
		id = "book1";
		directory = "/var/lib/mse/book1";
		database_size = "2G";

		stores = ( {
			id = "store1";
			filename = "/var/lib/mse/store1.dat";
			size = "100G";
		},{
			id = "store2";
			filename = "/var/lib/mse/store2.dat";
			size = "100G";
		} );
	},{
		id = "book2";
		directory = "/var/lib/mse/book2";
		database_size = "2G";

		stores = ( {
			id = "store3";
			filename = "/var/lib/mse/store3.dat";
			size = "100G";
		},{
			id = "store4";
			filename = "/var/lib/mse/store4.dat";
			size = "100G";
		} );
	} );
};

In this case the metadata databases are in different locations. It would make sense to host them on separate disks as well, just like the stores.

Although books and their stores form a unit, MSE will, when using its round-robin algorithm to find somewhere to store the object, only cycle through the list of stores, ignoring their relationship with books.

Book configuration

Books have various configuration directives, some of which have already been discussed. Although the default values will do for most people, it is worth noting that changing them can be impactful, depending on your use case.

id is a required parameter and is used to name the book. directory is also required, as it refers to the location where the LMDB database, lockfiles, and journals that comprise the book will be hosted.

Here’s a list of parameters that can be tuned, followed by a short configuration sketch:

  • database_size: the total size of the LMDB database of the book. Defaults to 1 GB
  • database_readers: the maximum number of simultaneous database readers. Defaults to 4096
  • database_sync: whether or not to wait until the disk has confirmed the disk write of a change in the LMDB database. Defaults to true, which ensures data consistency. Setting it to false will increase performance, but you risk data corruption when a server outage occurs before the latest changes are synchronized to the disk.
  • database_waterlevel: the maximum fill level of the LMDB database. Defaults to 0.9, which is 90%
  • database_waterlevel_hysterisis: the over/under ratio to maintain when enforcing the waterlevel of the LMDB database. Defaults to 0.05, which is a 5% over/under on the 90% that was defined in database_waterlevel
  • database_waterlevel_snipecount: the number of objects to remove in one batch when enforcing the waterlevel for the database. Defaults to 10
  • banlist_size: the size of the ban list journal. Defaults to 1 MB. Exceeding this limit will cause new bans to overflow into the LMDB database.
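
Purely as an illustration of where these directives live, here is a sketch of a book entry with a few of them changed; the values are arbitrary, and the exact literal syntax for booleans and ratios is an assumption based on the format of the other examples in this chapter:

books = ( {
	id = "book";
	directory = "/var/lib/mse/book";
	database_size = "2G";
	database_sync = true;
	database_waterlevel = 0.9;
	database_waterlevel_hysterisis = 0.05;
	banlist_size = "2M";

	stores = ( {
		id = "store";
		filename = "/var/lib/mse/store.dat";
		size = "100G";
	} );
} );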

Store configuration

Similar configuration parameters are available for tuning stores. It all starts with two required parameters:

  • id: the unique identifier of a store
  • filename: the location on disk of the store file

Unlike a book, which is a collection of files in a directory, a store consists of a single file. The size of store files is defined by the size parameter, which defaults to 1 GB.

These are the basic settings, but there are more tunable parameters. Here’s a selection, followed by a short configuration sketch:

  • align: defaults to 4 KB and defines store allocations to be multiples of this value
  • minfreechunk: also defaults to 4 KB and is the minimum size of a free chunk
  • aio_requests: the number of simultaneous asynchronous I/O requests. Defaults to 128
  • aio_db_handles: the number of simultaneous read-only handles that are available for reading metadata from the corresponding book
  • journal_size: the size of the journal that keeps track of incremental changes until they are applied to the corresponding LMDB database. Defaults to 1 MB
  • waterlevel_painted: the fraction of objects that is painted as LRU candidates when the waterlevel is reached. By default this is 0.33, which corresponds to 33%
  • waterlevel_threads: the number of threads that are responsible for enforcing the waterlevel and removing LRU-painted objects from the cache. Defaults to 1
  • waterlevel_minchunksize: the minimum chunk size that is considered when creating continuous free space when the waterlevel is exceeded. Defaults to 512 KB
  • waterlevel: the maximum fill level of the store, using the free-space accounting described earlier in this chapter. Defaults to 0.9, which corresponds to 90%
  • waterlevel_hysterisis: the over/under ratio to maintain when enforcing the waterlevel of the store. Defaults to 0.05, which is a 5% over/under on the 90% that was defined in waterlevel
  • waterlevel_snipecount: the number of objects to remove in one batch when enforcing the waterlevel. Defaults to 10
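
And similarly for a store entry, as a sketch with arbitrary values, where the surrounding book block is omitted and the value syntax is again an assumption based on the other examples in this chapter:

stores = ( {
	id = "store";
	filename = "/var/lib/mse/store.dat";
	size = "100G";
	align = "4K";
	aio_requests = 128;
	journal_size = "2M";
	waterlevel = 0.9;
	waterlevel_hysterisis = 0.05;
	waterlevel_minchunksize = "512K";
} );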

The default settings for the book and store configuration have been carefully chosen by our engineers. We would advise you to stick with the default values unless you have specific concerns you want to address ahead of time, or if you’re experiencing problems.

Store selection

Round-robin is the default way stores are selected in MSE; this can be changed in VCL through vmod_mse.

Stores can be selected individually by name, or you can select all the stores in a book by using the name of the book.
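
As a minimal sketch of selecting by name, using the store identifiers from the configuration examples earlier in this chapter and a made-up URL pattern, this could look as follows in VCL (vmod_mse is covered in more detail below):

vcl 4.1;

import mse;

sub vcl_backend_response {
	# pin video segments to the store named "store1"
	if (bereq.url ~ "^/video/") {
		mse.set_stores("store1");
	}
}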

However, in most cases it is natural to apply tags to the books and/or stores, and use the tags to select the set of stores you want MSE to choose from. Store selection is a core MSE feature, but vmod_mse serves as the interface for it when you need to override the default behavior.

Tagging happens in the MSE configuration file, typically /var/lib/mse/mse.conf, and has not been discussed in this section up until this point.

Tagging stores

In your /var/lib/mse/mse.conf file you can use the tags directive to associate tags with individual stores.

Here’s an example in which one store is hosted on a large SATA disk, and the other store is hosted on a smaller but much faster SSD disk:

env: {
	id = "mse";
	memcache_size = "auto";

	books = ( {
		id = "book";
		directory = "/var/lib/mse/book";

		stores = ( {
			id = "store1";
			filename = "/var/lib/mse/store1.dat";
			size = "500G";
			tags = ( "small", "ssd" );
		},{
			id = "store2";
			filename = "/var/lib/mse/store2.dat";
			size = "10T";
			tags = ( "big", "sata" );
		} );
	} );
};

In this case store1 has 500 GB of SSD storage at its disposal. That is at least what the tag indicates.

store2 on the other hand is 10 TB in size and has a sata tag linked to it. This would imply that this store has a larger but slower disk.

It is up to you to decide on naming of tags. Their names have no underlying significance. You can easily change the tag ssd into fast, and sata into slow if that is more intuitive to you.

These tags will be used in VCL when vmod_mse comes into play.

Tagging books

It is also possible to apply these tags on the book level. This means that all underlying stores will be tagged with these values.

Here’s a multi-book example:

env: {
	id = "mse";
	memcache_size = "auto";

	books = ( {
		id = "book1";
		directory = "/var/lib/mse/book1";
		tags = ( "small", "ssd" );

		stores = ( {
			id = "store1";
			filename = "/var/lib/mse/store1.dat";
			size = "500G";
		},{
			id = "store2";
			filename = "/var/lib/mse/store2.dat";
			size = "500G";
		} );
	},{
		id = "book2";
		directory = "/var/lib/mse/book2";
		tags = ( "big", "sata" );

		stores = ( {
			id = "store3";
			filename = "/var/lib/mse/store3.dat";
			size = "10T";
		},{
			id = "store4";
			filename = "/var/lib/mse/store4.dat";
			size = "10T";
		} );
	} );
};

In this case store1 and store2 are tagged as small and ssd because these tags were applied to their corresponding book. They probably have smaller SSD disks in them, as the tags may imply.

For store3 and store4, which are managed by book2, the tags are big and sata. No surprises here either: although we know the size of the stores, which justifies the big tag, we can only assume the underlying disks are SATA disks.

Simply adding tags, like in the example above, does not change anything. The default behavior, which is round-robin between all of the stores, still takes place until a set of stores is explicitly selected: either in VCL or in the configuration file.

Let’s have a look at how this is done.

Setting the default stores

One way of selecting the default stores is by using the default_stores configuration directive in /var/lib/mse/mse.conf. This directive refers to a store based on its name, its tag, or the name or tag of the book.

Based on the example above we could configure the default stores as follows:

default_stores = "sata";

Unless instructed otherwise in VCL, the default stores will be store3 and store4.

There is also a special value none, which does not refer to any tag.

default_stores = "none";

This example will ensure objects only get stored in memory cache unless instructed otherwise in VCL.

As soon as default_stores is set, the round-robin store selection no longer applies and is replaced by a random selection, where a potentially uneven weighting is applied based on the store size.

vmod_mse

You can have fine-grained control over your store selection if you leverage vmod_mse.

This VMOD has an mse.set_stores() function that allows you to refer to a book, a store, or a tag. Also the special value none is allowed to bypass the persistence layer.

Here’s an example where we select the stores based on response-header criteria:

vcl 4.1;

import mse;
import std;

sub vcl_backend_response {
	if (beresp.ttl < 120s) {
		mse.set_stores("none");
	} else {
		if (beresp.http.Transfer-Encoding ~ "chunked" || 
		std.integer(beresp.http.Content-Length,0) > std.bytes("100M")) {
			mse.set_stores("sata");
		} else {
			mse.set_stores("ssd");
		}
	}
}

Let’s break it down:

  • Cache inserts with a TTL of less than two minutes will not be persisted and will only be cached in memory.
  • Cache inserts where the Content-Length response header indicates a size of more than 100 MB will be stored on the sata-tagged stores.
  • Cache inserts where the Transfer-Encoding response header is set to chunked also end up on the sata-tagged stores. Because data is streamed to Varnish, we have no clue of the size ahead of time.
  • All other cache inserts are less than 100 MB in size and will end up on the ssd-tagged stores.

When one or more stores are selected based on a tag or name, either in VCL or by using default_stores, the actual destination of the object will always be determined by a fast quasi random number generator. This means that if you add just a few objects to your MSE, you should expect the distribution to be uneven, but for any reasonable number of objects, the unevenness should be negligible.

You can change the weighting of stores through the function mse.set_weighting() to one of the following:

  • size: bigger stores have a higher probability of being selected.
  • available: stores with more available space have a higher probability of being selected.
  • smooth: store size and availability of space are combined to assign weights to stores.

As you already know, size becomes the default weighting mechanism when a store is selected through default_stores or mse.set_stores(). The mse.set_weighting() function also allows you to set weighting mechanisms.

Here’s an example of how to set the weighting to smooth:

vcl 4.1;

import mse;

sub vcl_backend_response {
	mse.set_weighting(smooth);
}

Monitoring

Although there is a dedicated section about monitoring coming up later in this chapter, we do want to hint at monitoring internal MSE counters using varnishstat.

Memory counters

Here’s a table with some counters that relate to memory caching:

Counter Meaning
MSE.mse.g_bytes Bytes outstanding
MSE.mse.g_space Bytes available
MSE.mse.n_lru_nuked Number of LRU-nuked objects
MSE.mse.n_vary Number of Vary header keys
MSE.mse.c_memcache_hit Stored objects cache hits
MSE.mse.c_memcache_miss Stored objects cache misses
MSE.mse.g_ykey_keys Number of YKeys registered
MSE.mse.c_ykey_purged Number of objects purged with YKey

If you run the following command, you can use the MSE.mse.g_space counter to see how much space is left in memory for caching:

varnishstat -f MSE.mse.g_space

The mse keyword in these counters refers to the name of your environment. In this case it is named mse. If you were to name your environment server1, the counter would be MSE.server1.g_space. If you want to make sure you see all environments, you can use an asterisk, as illustrated below:

varnishstat -f MSE.*.g_space

Book counters

There are also counters to monitor the state of your books. Here’s a table with some select counters related to books:

Counter Meaning
MSE_BOOK.book1.n_vary Number of Vary header keys
MSE_BOOK.book1.g_bytes Number of bytes used in the book database
MSE_BOOK.book1.g_space Number of bytes available in the book database
MSE_BOOK.book1.g_waterlevel_queue Number of threads queued waiting for database space
MSE_BOOK.book1.c_waterlevel_queue Number of times a thread has been queued waiting for database space
MSE_BOOK.book1.c_waterlevel_runs Number of times the waterlevel purge thread was activated
MSE_BOOK.book1.c_waterlevel_purge Number of objects purged to achieve database waterlevel
MSE_BOOK.book1.c_insert_timeout Number of times database object insertion timed out
MSE_BOOK.book1.g_banlist_bytes Number of bytes used from the ban list journal file
MSE_BOOK.book1.g_banlist_space Number of bytes available in the ban list journal file
MSE_BOOK.book1.g_banlist_database Number of bytes used in the database for persisted bans

These counters specifically refer to book1, but as there are multiple books, it makes sense to query on all books, as illustrated below:

varnishstat -f MSE_BOOK.*

This command shows all counters for all books.

Store counters

And finally, stores also have their own counters. Here’s the table:

Counter Meaning
MSE_STORE.store1.g_waterlevel_queue Number of threads queued waiting for store space
MSE_STORE.store1.c_waterlevel_queue Number of times a thread has been queued waiting for store space
MSE_STORE.store1.c_waterlevel_purge Number of objects purged to achieve store waterlevel
MSE_STORE.store1.g_objects Number of objects in the store
MSE_STORE.store1.g_ykey_keys Number of YKeys registered
MSE_STORE.store1.c_ykey_purged Number of objects purged with YKey
MSE_STORE.store1.g_alloc_bytes Total number of bytes in allocation extents
MSE_STORE.store1.g_free_bytes Total number of bytes in free extents
MSE_STORE.store1.g_free_small_bytes Number of bytes in free extents smaller than 16k
MSE_STORE.store1.g_free_16k_bytes Number of bytes in free extents between 16k and 32k
MSE_STORE.store1.g_free_32k_bytes Number of bytes in free extents between 32k and 64k
MSE_STORE.store1.g_free_64k_bytes Number of bytes in free extents between 64k and 128k
MSE_STORE.store1.g_free_128k_bytes Number of bytes in free extents between 128k and 256k
MSE_STORE.store1.g_free_256k_bytes Number of bytes in free extents between 256k and 512k
MSE_STORE.store1.g_free_512k_bytes Number of bytes in free extents between 512k and 1m
MSE_STORE.store1.g_free_1m_bytes Number of bytes in free extents between 1m and 2m
MSE_STORE.store1.g_free_2m_bytes Number of bytes in free extents between 2m and 4m
MSE_STORE.store1.g_free_4m_bytes Number of bytes in free extents between 4m and 8m
MSE_STORE.store1.g_free_large_bytes Number of bytes in free extents larger than 8m

Besides the typical free space, bytes allocated, and number of objects in the store, you’ll also find detailed counters on the extents per size. This is part of the anti-fragmentation mechanism that aims to have continuous free space but tolerates a level of fragmentation below the waterlevel_minchunksize.

In the beginning, there will be no fragmentation within your store, so the MSE_STORE.store1.g_free_large_bytes counter will be high, while the others will be low or zero.

The following command will monitor the free bytes per extent for all stores:

varnishstat -f MSE_STORE.*.g_free_*

A more basic situation to monitor is the number of objects in cache, the available space in the stores, and the used space. Here’s how you do that:

varnishstat -f MSE_STORE.*.g_objects -f MSE_STORE.*.g_free_bytes -f MSE_STORE.*.g_alloc_bytes

Cache warming

The promise of MSE is persistence. You benefit from this when a server restart occurs: stores and books will be loaded from disk and the full context is restored.

If you take a backup of your stores and books, you can restore this backup and perform a disaster recovery.

Even when your books and stores are corrupted, or even gone, the backup will restore the previous state, and your cache will be warm.

Besides disaster recovery, this strategy can also be used to pre-warm the cache on new Varnish instances.

When you restart varnishd and the persisted books and stores are found, the output can contain the following:

varnish    | Info: Child (24) said Store mse.book1.store1 revived 6 objects
varnish    | Info: Child (24) said Store mse.book1.store1 removed 0 objects (partial=0 age=0 marked=0 noban=0 novary=0)
varnish    | Info: Child (24) said Store mse.book1.store2 revived 2 objects
varnish    | Info: Child (24) said Store mse.book1.store2 removed 0 objects (partial=0 age=0 marked=0 noban=0 novary=0)
varnish    | Info: Child (24) said Store mse.book2.store3 revived 6 objects
varnish    | Info: Child (24) said Store mse.book2.store3 removed 0 objects (partial=0 age=0 marked=0 noban=0 novary=0)
varnish    | Info: Child (24) said Store mse.book2.store4 revived 3 objects
varnish    | Info: Child (24) said Store mse.book2.store4 removed 0 objects (partial=0 age=0 marked=0 noban=0 novary=0)
varnish    | Info: Child (24) said Environment mse fully populated in 0.00 seconds. (0.00 0.00 0.00 17 0 3/4 4 0 4 0)

As you can see, objects were revived. In huge caches with lots of objects, loading the environment might take a bit longer.

MSE can easily load more than a million objects per second. Although the time it takes for MSE to load objects from disk depends on the kind of hardware you use, we can assume that for bigger caches this would only take mere seconds.

