MSE4 offers the possibility of storing cached content not only in memory, but also on disk. This makes it possible to extend the cache size of the system beyond the available system memory. Upon system restart, the on-disk content is kept and made available again without having to fetch it from the backend. This mode of operation is referred to as persisted caching. The other option, where content is only kept in memory and never written to disk, is referred to as ephemeral caching. This document also uses the terms persisted objects and ephemeral objects, referring to objects that have a disk backing and those that do not.
When Varnish responds to client requests by delivering cached data, that data is always delivered from memory. This is true for both ephemeral and persisted caching. Where persisted caching differs is that for an object known to be on disk, memory buffers are allocated and filled by reading from disk rather than by getting the bytes from a backend. Once in memory, the buffers are used in largely the same way as for ephemeral caching. This means that the data is shared among all client requests requesting the same data, and only a single copy is kept in memory regardless of how many simultaneous clients are using it.
When a persisted object first enters the cache through a backend fetch, memory buffers are allocated as needed to hold the content. As for ephemeral caching, the buffer bytes received from the backend are immediately made available for use by any client requests streaming the data. This means that persisted caching does not have to impose an IO delay for writing the data to disk before it can be used to deliver cached content. Once the buffers are completely filled with data from the backend, the disk writeout is done asynchronously.
Reading data back into memory from disk is done on demand. When it is found that a byte belonging to a persisted object is not in memory, a memory buffer for a byte range that covers the requested byte is allocated and filled by reading from disk. Once the buffer has been filled, any waiting delivery tasks will be notified and they can resume the delivery.
The memory buffers used to hold persisted objects will be kept in memory for as long as possible, even if there are no active client connections requesting the data. It is only when the available memory is getting low and space needs to be made to hold other content that the buffers can be evicted. The algorithm to choose what content to evict is a variant of Least Recently Used (LRU), and ephemeral objects and persisted memory buffers alike can be chosen. If a persisted memory buffer is found to be the least recently used and is chosen for eviction, only that buffer itself is evicted. The other buffers of the object stay in memory until they happen to be chosen as eviction candidates.
Between the on-demand read from disk mechanism and the least recently used buffer eviction, only the specific parts of an object that are actually requested are kept in memory. This means that, for example, a large object that only sees demand for its very beginning will not require the entire object to be held in memory.
A file device is a term used to describe the large files that MSE4 uses to store information on disk. It refers to the actual file itself residing in the file system, and not the drive or device that holds the file system data.
The books and stores are the file devices that are used to store persisted objects to disk. The books hold the meta data about the objects, while the stores hold the actual object data, including object headers and the object body bytes.
The books and stores are kept in separate files to provide flexibility in how the data is laid out on the available drives. If the system has a non-homogeneous set of drives where some are faster than others, it is recommended to provision the system so that the books are kept on the faster drives and the stores on the slower drives. If the drives are all equal, it is recommended to have one book on each drive, and then as many stores as required on each drive using that drive's book for meta data.
The book is a type of custom database developed specifically for MSE4 to provide fast and consistent updates to the set of objects in the cache, while minimizing the amount of IO needed in order to record changes to the set of objects. All data in the book is checksummed for consistency, and any updates are journaled to keep the order of operations consistent.
The book stores, for each object, all of the meta data that Varnish needs in order to figure out if an object matches a client request. This includes the object hash value and Vary match data, as well as the object lifetime parameters (time-to-live, grace period etc.). In addition it stores any Ykeys associated with the object, and finally a list of byte offset, length and checksum triplets which shows where in the store the actual headers and object body data are stored.
The book database is comprised of a number of fixed size slots. A slot is wide enough to describe one persisted object in the cache, with a combined maximum of 4 data chunks or Ykeys. Slots will be chained together to describe larger objects or objects with many Ykeys, with each additional chained slot increasing the number of data chunks or Ykeys that can be described by 9.
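For example, an object stored as 22 data chunks would occupy three chained slots: the first slot describes up to 4 chunks and each of the two additional slots describes up to 9, for a combined capacity of 4 + 9 + 9 = 22.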
The layout of the book and its slot capacity is determined at the time the file device is created, and is influenced by several configuration settings. The maximum number of slots a given book can hold can be queried by giving the headers command to mkfs.mse4 and looking at the maxslots key.
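For example, assuming the headers command accepts the same -c configuration argument as the configure command shown later, and a hypothetical configuration path:

$ mkfs.mse4 -c /etc/varnish/mse4.conf headers

The output will include the maxslots key for each book.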
Each book can provide meta data storage for up to 16 stores. The set of slots available is shared among all of the stores the book is managing.
To speed up operations and avoid blocking on IO in critical data paths, the entire book slot table is kept in memory at runtime. The IO generated by the book after startup will only be writes to record the slot table changes.
The store is where the actual payload bytes of objects are stored. This includes the object HTTP headers, the object body bytes and any auxiliary attributes stored with the objects. All of the store bytes for an object are always kept in the same store.
The data stored does not contain any structure information; it is just a series of byte chunks of varying length stored consecutively in one large file. All of the data needed to stitch an object back together, as well as the checksums for the data chunks of the store, is kept in the book.
To enable persisted caching, the file devices in which the persisted objects will be stored need to be defined. This is done in an MSE4 specific configuration file.
The configuration file declares the file devices to be used, as well as their sizes and their location in the file system. The books and stores are listed hierarchically in the configuration, where the stores sharing a book for meta data storage are listed under that book's configuration.
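As a minimal illustrative sketch of this hierarchical layout, a configuration could look like the following. The key names, paths and sizes shown here are assumptions made for illustration; see the Configuration documentation referenced below for the authoritative syntax:

    env: {
        books = ( {
            id = "book1";
            filename = "/var/lib/mse/book1";       # hypothetical path
            size = "1G";

            stores = ( {
                id = "store1";
                filename = "/var/lib/mse/store1";  # hypothetical path
                size = "100G";
            } );
        } );
    };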
Please see Configuration for information and an example of how to structure the configuration file.
The Varnish daemon needs to be configured to use MSE4, with the path to the configuration file specified. To do this, give a single -s mse4,<path-to-mse4-configuration> argument to the Varnish daemon.
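For example, with a hypothetical configuration file path (the -a and -f arguments are the usual listen address and VCL file arguments to varnishd):

$ varnishd -a :6081 -f /etc/varnish/default.vcl -s mse4,/etc/varnish/mse4.conf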
Before Varnish and MSE4 can start using the file devices, they need to be created. For this purpose a special utility program called mkfs.mse4 is provided.
As a one time operation before starting the Varnish daemon, execute this command:
$ mkfs.mse4 -c <path-to-configuration-file> configure
Please see man mkfs.mse4 for more information about the mkfs.mse4 utility.
When a new object enters the cache, MSE4 needs to make a decision on how to store the content: whether to persist it, and if so, to which store it should be written.
Only regular cached content can be persisted. Special objects like hit-for-pass and hit-for-miss will always become ephemeral. Temporary objects that are created to hold request bodies or to handle passes are also always ephemeral.
The VCL program can also set whether to attempt to persist the object or not. The vmod mse4 has a function called set_storage() for this purpose. See VMOD mse4 for more information.
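As a hedged sketch of what such a VCL program could look like; the argument given to set_storage() here is an assumption made for illustration, and the accepted arguments are documented in VMOD mse4:

    vcl 4.1;

    import mse4;

    sub vcl_backend_response {
        # Hypothetical policy: do not persist short-lived API responses.
        if (bereq.url ~ "^/api/") {
            mse4.set_storage("none");  # argument is an assumption, see VMOD mse4
        }
    }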
The set of stores an object can be persisted to is determined by the object’s assigned content category. Each category has a list of stores assigned, and a store can only be assigned to one category.
Note that when there are no category definitions in the MSE4 configuration file, a default root category is created that has all of the configured stores automatically assigned to it.
If the object’s content category does not list any assigned stores, the object becomes ephemeral.
See Categories for more information on how to configure and use content categories.
When the assigned object category has multiple stores to choose from, one is selected based on the configured store selection algorithm. The algorithm to use is determined by the value of the store_select key in the selected category's configuration, or if not specified, by the environment level configuration key default_store_select (which defaults to smooth).
When running the store selection algorithm, only stores in state ONLINE are considered. Stores in any other state are ignored, and the algorithm is executed as if they were not part of the set. If none of the stores in the set are ONLINE, the object becomes ephemeral. See the “Runtime drive failures” section below for more information about store state.
The possible values for store_select are:

smooth (default)
  The smooth algorithm gives weights to the possible stores by the sum of the available free space in the store and the size of the store. The store is then selected by a random draw from the weighted stores. The smooth algorithm favours a large and mostly empty store to ensure that it is filled efficiently when empty, and once all stores are filled it transitions to a size based store selection.

size
  The size algorithm gives weights to the possible stores by the size of the store, and a store is then selected by a random draw from the weighted stores. This makes the store selection probability proportional to the store size, making a store that is twice the size of another be selected twice as often.

available
  The available algorithm gives weights to the stores according to how much free space is available in the store. This gives priority to the store with the largest amount of free space.

round-robin
  The round-robin algorithm will simply choose a store in a round robin fashion.
Banning is a cache invalidation mechanism where an administrator issues a statement which is tested against each and every object in the cache. If the statement evaluates to true, the object will be invalidated (removed from the cache). It is a very flexible system for cache invalidation, but it does come with a fairly high cost.
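For example, a ban statement can be issued through the CLI; this particular statement invalidates every cached object whose URL starts with /news/:

$ varnishadm ban req.url '~' '^/news/'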
So that creating new ban statements does not take forever while each and every object is tested, the actual execution of the ban statement is done in a lazy manner. This means that the statements are simply added to the cache without doing any searching and matching, and the actual execution of the ban statement is then done as a background task. During cache look-up, any object will first have any not yet matched ban statements applied to it, and it may be invalidated at that time if one of them matches.
For this scheme to work with persisted caches, it becomes necessary to also persist any ban statements that haven’t yet been tested against all of the cached objects. For this purpose, each book sets aside an area to store the most recently issued ban statements on disk. The size of this area is controlled by the banjournal_size book parameter.
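As a sketch, following the illustrative configuration form used above (the placement of the key and the value shown are assumptions):

    books = ( {
        id = "book1";
        banjournal_size = "128M";  # placeholder value
        ...
    } );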
When the Varnish daemon starts up using MSE4 configured for persisted caching, a bootstrap process is performed. The slot table from each configured book will be read into memory, and all of the slots verified by checksum for consistency. Any valid objects described will be added to the main Varnish look-up tree so that the object’s presence is known during cache look-ups.
Invalid slots found during the bootstrap will be cleared and reclaimed before normal cache operations commence. Such slots could for example be the result of an unclean shutdown of the system and subsequent aborted IO operations.
The bootstrap process may also identify any number of invalid objects. The slots they occupy will also be cleared and reclaimed. Reasons for invalidating an object during the bootstrap include:
The lifetime parameters indicate that the object is too old. This can happen when the object expired while Varnish was not running.
The object’s ban timestamp is invalid. This can happen if the book’s ban journal is too small to keep all of the active bans in the system. Since there is no way of knowing whether the object would’ve been invalidated by one of the missing bans, the object is removed as a precaution.
The store that holds the object's cache data is offline, missing or otherwise not available. When the store is not accessible, several Varnish mechanisms like cache invalidations and bans will not be applied to the affected objects. If the store is made accessible again on a later restart, objects that would have been invalidated may resurface. To safeguard against this, inaccessible objects are invalidated during the bootstrap.
All of the store data can be checksummed upon first being written to the stores, and then verified when read back. The checksums for the content are stored in the book.
This functionality is enabled or disabled by the write_checksum and verify_checksum store configuration parameters. By default checksum verification is turned on.
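A sketch in the same illustrative configuration form (the placement of the keys and the values shown are assumptions):

    stores = ( {
        id = "store1";
        write_checksum = true;
        verify_checksum = true;
        ...
    } );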
When data is read back into memory from the store and the checksum verification fails, the object will automatically be evicted from the cache. This ensures that no other client request can make a successful cache hit on that object from that point in time, and subsequent attempts at getting the content will cause a cache fetch from the backend.
However, because content is only read back into memory on demand, a delivery may be well under way, with the beginning of an object already delivered to the client, by the time the checksum verification fails on a later chunk of the object. It is then too late to communicate the bad object to the client, and the only option left is to cause a delivery failure. Clients may experience short reads and a forced connection close in this situation. Note that it is ensured that no byte from the invalid read will ever be communicated to the client.
When either the book runs out of slots or the store is full, content will be evicted from the cache to free up resources.
When new persisted objects are added to the cache, free slots are needed in the store’s book in which to store the object meta data.
Every book holds a reserve of free slots ready to be handed out. The size of this reserve is controlled by the slot_reserve book configuration key.
When the reserve runs low, a background task frees up slots by evicting the least recently used object from among all of the objects stored in the book.
When new persisted objects are added to the cache, free space in the store needs to be assigned. For efficiency reasons, the available free space is not mapped in its entirety. Rather, a background task exists to fill a reserve with known free space regions meeting a minimum requirement for contiguous byte ranges. The fetch tasks grab byte ranges from the reserve and assign them to new persisted objects.
The goal for this background task is controlled by the reserve_size store configuration key. Upon reaching this goal the task will sleep, and be woken again when the reserve drops below the goal. The available reserve is shown in the MSE4_STORE.<store-id>.g_reserve_bytes VSC counter.
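For example, the counter can be inspected for all stores at once using the field filtering option of varnishstat:

$ varnishstat -1 -f 'MSE4_STORE.*.g_reserve_bytes'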
When the background task fails to meet its goals through natural decay (objects being deleted due to their lifetime settings), persisted content will be evicted from the cache to make new space.
When evicting content to free up store space, it is necessary to work towards creating contiguous free areas to avoid store space fragmentation. To achieve this, the space is considered one segment at a time in a round robin fashion, with the segment size determined by the segment_size configuration key. All objects that have one or more allocations in that segment and are not frequently accessed are evicted.
If a drive IO error is reported at runtime, MSE4 is able to take a book or store offline without causing system downtime. The only disruption of service will be that the objects that were hosted on the affected book or store are removed from the cache.
Any error code reported back to MSE4 by the operating system during an IO operation is monitored and will cause the affected file device to be taken offline.
When multiple file devices are hosted on the same drive, a failure of one will not automatically take the others offline. However, when a book device fails, all of the stores using that book will also fail.
When a file device fails, MSE4 will immediately cease all IO operations to the device. All of the objects hosted on the device will be invalidated. Any active delivery tasks using any object being invalidated may have to fail the delivery as a result. Clients may experience short reads and a forced connection close in this situation.
After a file device has failed and become offline, an administrator may reset the device and bring it back up into service again. This would presumably be done after having executed a hot swap of the affected drive to replace the faulty hardware. When a file device reset is performed, the device will always come back into service empty. That means that any cached objects that were on the file device will have been lost.
A file device will be in one of three states at runtime. The states are:

ONLINE
  When a file device is in state ONLINE, it operates normally.

FAILING
  Whenever a file device fails, it will first transition from ONLINE to the FAILING state. While in this state, active clean up tasks are still being performed on the cached objects that this device holds. MSE4 will still be holding file descriptors open on the device during this period, and an administrator should ensure that no file devices on an affected drive are in this state before hotplugging the drive.

  A file device in the FAILING state cannot be reset.

OFFLINE
  Once the clean up tasks have been completed, the file device will transition from the FAILING state to the OFFLINE state. At this point in time, MSE4 will have closed all of the file descriptors to the file device, and it is safe to hotplug the drive.

  Once a file device has reached the OFFLINE state, it is possible to reset the device. This will reinitialize the file device and bring it back to the ONLINE state.
The current state of the file devices can be queried by using the CLI command mse4.status.
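For example, through varnishadm:

$ varnishadm mse4.status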
A boolean online status is also reported through the varnishstat counters (MSE4_BOOK.<book-id>.online and MSE4_STORE.<store-id>.online). This counter value will be 1 when the file device state is ONLINE, and 0 otherwise.
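The counters can be inspected in the same manner, for example:

$ varnishstat -1 -f 'MSE4_BOOK.*.online' -f 'MSE4_STORE.*.online'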
MSE4 uses a statelog file to record state changes to the file devices. This log resides by default in /var/lib/mse/<hostname>.mse4_statelog. The statelog file serves two purposes. The primary use is to initialize a file device’s assumed state after a restart of Varnish, to make sure that a failed drive stays in the failed state until the drive has been replaced. Secondly, it serves as a log of IO errors that have happened on the device.
The log is an ASCII file, and records any state change that has happened to a file device. For each file device, the very last state change recorded is the state assumed after a restart.
The administrator can induce a file device failure manually. This can be a useful tool if e.g. S.M.A.R.T. monitoring indicates that a drive is starting to experience issues, and a proactive drive replacement needs to be performed.
To cause a file device to fail, execute the CLI command mse4.fail <book-or-store-id>. This will have the same effect as if an IO operation on the file device reported an IO error.
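For example, with a hypothetical store id of store1:

$ varnishadm mse4.fail store1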
When a file device reaches the OFFLINE state, MSE4 will have closed all of its file descriptors pointing to the file device. This makes it possible to safely unmount the file system upon which the file resides, and to perform a hotplug of the drive.
Note that the administrator should make sure that all of the MSE4 file devices residing on a given drive are in the OFFLINE state before unmounting the file system.
The instructions for how to hotplug a drive are out of scope for this manual.
After a fresh drive has been installed, a file system needs to be created on the drive, and the drive remounted.
An MSE4 file device in the OFFLINE state can be reset at runtime to bring it back to the ONLINE state.
To do this, execute the CLI command mse4.reset <book-or-store-id>.
A store device cannot be reset if its book device is not ONLINE. If resetting both a book and its stores, reset the book first.
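For example, with hypothetical ids, resetting a failed book and one of its stores in the required order:

$ varnishadm mse4.reset book1
$ varnishadm mse4.reset store1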
The reset command will recreate the file device from scratch, using the configuration parameters as described in the MSE4 configuration file at the time the Varnish daemon was started. If a file of the same name already exists, it will first be deleted.
A successful reset will mark the file device as ONLINE in the statelog.
The Varnish daemon will on startup print a warning message to syslog if the actual file devices presented to it differ in size and layout from what the configuration suggests. It will start up as normal though, and use the applicable settings (file and journal sizes) as stored in the file devices, disregarding the configuration options.
To bring the file devices into compliance with the configuration, it is necessary to perform an offline file device resize operation. This can be done using the mkfs.mse4 utility and the resize command. The Varnish daemon must be stopped in order to execute this command. Please see man mkfs.mse4 for more information about running this command.
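For example, assuming the resize command accepts the same -c configuration argument as the configure command, and a hypothetical configuration path (run this only while the Varnish daemon is stopped):

$ mkfs.mse4 -c /etc/varnish/mse4.conf resize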