MSE4 offers the possibility of storing cached content not only in memory, but also on disk. This makes it possible to extend the cache size of the system beyond the available system memory. Upon system restart, the on-disk content is kept and made available again without having to fetch it from the backend. This mode of operation is referred to as persisted caching. The other option, where content is only kept in memory and never written to disk, is referred to as ephemeral caching. This document also uses the terms persisted objects and ephemeral objects, referring to objects that have a disk backing and those that do not.
When Varnish responds to client requests by delivering cached data, that data is always delivered from memory. This is true for both ephemeral and persisted caching. Where persisted caching differs is that for an object known to be on disk, memory buffers are allocated and filled by reading from disk rather than by getting the bytes from a backend. Once in memory, the buffers are used in largely the same way as for ephemeral caching. This means that the data is shared among all client requests requesting the same data, and only a single copy is kept in memory regardless of how many simultaneous clients are using it.
When a persisted object first enters the cache through a backend fetch, memory buffers are allocated as needed to hold the content. As for ephemeral caching, the buffer bytes received from the backend are immediately made available for use by any client requests streaming the data. This means that persisted caching does not have to impose an IO delay for writing the data to disk before it can be used to deliver cached content. Once the buffers are completely filled with data from the backend, the disk writeout is done asynchronously.
Reading data back into memory from disk is done on demand. When it is found that a byte belonging to a persisted object is not in memory, a memory buffer for a byte range that covers the requested byte is allocated and filled by reading from disk. Once the buffer has been filled, any waiting delivery tasks will be notified and they can resume the delivery.
The memory buffers used to hold persisted objects will be kept in memory for as long as possible, even if there are no active client connections requesting the data. It is only when the available memory is getting low and space needs to be made to hold other content that the buffers can be evicted. The algorithm to choose what content to evict is a variant of Least Recently Used (LRU), and ephemeral objects and persisted memory buffers alike can be chosen. If a persisted memory buffer is found to be the least recently used and is chosen for eviction, only that buffer itself is evicted. The other buffers of the object stay in memory until they happen to be chosen as eviction candidates.
Between the on-demand read from disk mechanism and the least recently used buffer eviction, only the specific parts of an object that are actually requested are kept in memory. This means that, for example, a large object that only sees demand for its very beginning will not require the entire object to be held in memory.
A file device is a term used to describe the large files that MSE4 uses to store information on disk. It refers to the actual file itself residing in the file system, and not the drive or device that holds the file system data.
The books and stores are the file devices that are used to store persisted objects to disk. The books hold the meta data about the objects, while the stores hold the actual object data, including object headers and the object body bytes.
The books and stores are kept in separate files to provide flexibility in how the data is laid out on the available drives. If the system has a non-homogeneous set of drives where some are faster than others, it is recommended to provision the system so that the books are kept on the faster drives and the stores on the slower drives. If the drives are all equal, it is recommended to have one book on each drive, and then as many stores as required on each drive using that drive's book for meta data.
The book is a type of custom database developed specifically for MSE4 to provide fast and consistent updates to the set of objects in the cache, while minimizing the amount of IO needed in order to record changes to the set of objects. All data in the book is checksummed for consistency, and any updates are journaled to keep the order of operations consistent.
The book stores, for each object, all of the meta data that Varnish needs in order to figure out if an object matches a client request. This includes the object hash value and Vary match data, as well as the object lifetime parameters (time-to-live, grace period etc.). In addition it stores any Ykeys associated with the object, and finally a list of byte offset, length and checksum triplets which shows where in the store the actual headers and object body data are stored.
The book database is comprised of a number of fixed size slots. A slot is wide enough to describe one persisted object in the cache, with a combined maximum of 4 data chunks or Ykeys. Slots will be chained together to describe larger objects or objects with many Ykeys, with each additional chained slot increasing the number of data chunks or Ykeys that can be described by 9.
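For example, an object stored as 22 data chunks would occupy three chained slots: the first slot describes up to 4 chunks and each of the two additional slots describes up to 9, for a combined capacity of 4 + 9 + 9 = 22.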
The layout of the book and its slot capacity is determined at the time the file device is created, and is influenced by several configuration settings. The maximum number of slots a given book can hold can be queried by giving the headers command to mkfs.mse4 and looking at the maxslots key.
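For example, assuming the headers command accepts the same -c configuration argument as the configure command shown later, and a hypothetical configuration path:

$ mkfs.mse4 -c /etc/varnish/mse4.conf headers

The output will include the maxslots key for each book.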
Each book can provide meta data storage for up to 16 stores. The set of slots available is shared among all of the stores the book is managing.
To speed up operations and avoid blocking on IO in critical data paths, the entire book slot table is kept in memory at runtime. The IO generated by the book after startup will only be writes to record the slot table changes.
The store is where the actual payload bytes of objects are stored. This includes the object HTTP headers, the object body bytes and any auxiliary attributes stored with the objects. All of the store bytes for an object are always kept in the same store.
The data stored does not contain any structure information; it is just a series of byte chunks of varying length stored consecutively in one large file. All of the data needed to stitch an object back together, as well as the checksums for the data chunks of the store, is kept in the book.
To enable persisted caching, the file devices in which the persisted objects will be stored need to be defined. This is done in an MSE4 specific configuration file.
The configuration file declares the file devices to be used, as well as their sizes and their location in the file system. The books and stores are listed hierarchically in the configuration, where the stores sharing a book for meta data storage are listed under that book's configuration.
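As a minimal illustrative sketch of this hierarchical layout, a configuration could look like the following. The key names, paths and sizes shown here are assumptions made for illustration; see the Configuration documentation referenced below for the authoritative syntax:

    env: {
        books = ( {
            id = "book1";
            filename = "/var/lib/mse/book1";       # hypothetical path
            size = "1G";

            stores = ( {
                id = "store1";
                filename = "/var/lib/mse/store1";  # hypothetical path
                size = "100G";
            } );
        } );
    };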
Please see Configuration for information and an example of how to structure the configuration file.
The Varnish daemon needs to be configured to use MSE4, with the path to the configuration file specified. To do this, give a single -s mse4,<path-to-mse4-configuration> argument to the Varnish daemon.
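For example, with a hypothetical configuration file path (the -a and -f arguments are the usual listen address and VCL file arguments to varnishd):

$ varnishd -a :6081 -f /etc/varnish/default.vcl -s mse4,/etc/varnish/mse4.conf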
Before Varnish and MSE4 can start using the file devices, they need to be created. For this purpose a special utility program called mkfs.mse4 is provided.
As a one time operation before starting the Varnish daemon, execute this command:
$ mkfs.mse4 -c <path-to-configuration-file> configure
Please see man mkfs.mse4 for more information about the mkfs.mse4 utility.
When a new object enters the cache, MSE4 needs to make a decision on how to store the content: whether to persist it, and if so, to which store it should be written.
Only regular cached content can be persisted. Special objects like hit-for-pass and hit-for-miss will always become ephemeral. Temporary objects that are created to hold request bodies or to handle passes are also always ephemeral.
The VCL program can also set whether to attempt to persist the object or not. The vmod mse4 has a function called set_storage() for this purpose. See VMOD mse4 for more information.
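As a hedged sketch of what such a VCL program could look like; the argument given to set_storage() here is an assumption made for illustration, and the accepted arguments are documented in VMOD mse4:

    vcl 4.1;

    import mse4;

    sub vcl_backend_response {
        # Hypothetical policy: do not persist short-lived API responses.
        if (bereq.url ~ "^/api/") {
            mse4.set_storage("none");  # argument is an assumption, see VMOD mse4
        }
    }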
The set of stores an object can be persisted to is determined by the object’s assigned content category. Each category has a list of stores assigned, and a store can only be assigned to one category.
Note that when there are no category definitions in the MSE4 configuration file, a default root category is created that has all of the configured stores automatically assigned to it.
If the object’s content category does not list any assigned stores, the object becomes ephemeral.
See Categories for more information on how to configure and use content categories.
When the assigned object category has multiple stores to choose from, one is selected based on the configured store selection algorithm. The algorithm to use is determined by the value of the store_select key in the selected category's configuration, or if not specified, by the environment level configuration key default_store_select (which defaults to smooth).
When running the store selection algorithm, only stores in state ONLINE are considered. Stores in any other state are ignored, and the algorithm is executed as if they were not part of the set. If none of the stores in the set are ONLINE, the object becomes ephemeral. See the “Runtime drive failures” section below for more information about store state.
The possible values for store_select are:

smooth (default)
  The smooth algorithm gives weights to the possible stores by the sum of the available free space in the store and the size of the store. The store is then selected by a random draw from the weighted stores. The smooth algorithm favours a large and mostly empty store to ensure that it is filled efficiently when empty, and once all stores are filled it transitions to a size based store selection.

size
  The size algorithm gives weights to the possible stores by the size of the store, and a store is then selected by a random draw from the weighted stores. This makes the store selection probability proportional to the store size, making a store that is twice the size of another be selected twice as often.

available
  The available algorithm gives weights to the stores according to how much free space is available in the store. This gives priority to the store with the largest amount of free space.

round-robin
  The round-robin algorithm will simply choose a store in a round robin fashion.
Banning is a cache invalidation mechanism where an administrator issues a statement which is tested against each and every object in the cache. If the statement evaluates to true, the object will be invalidated (removed from the cache). It is a very flexible system for cache invalidation, but it does come with a fairly high cost.
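For example, a ban statement can be issued through the CLI; this particular statement invalidates every cached object whose URL starts with /news/:

$ varnishadm ban req.url '~' '^/news/'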
So that creating new ban statements does not take forever while each and every object is tested, the actual execution of the ban statement is done in a lazy manner. This means that the statements are simply added to the cache without doing any searching and matching, and the actual execution of the ban statement is then done as a background task. During cache look-up, any object will first have any not yet matched ban statements applied to it, and it may be invalidated at that time if one of them matches.
For this scheme to work with persisted caches, it becomes necessary to also persist any ban statements that haven’t yet been tested against all of the cached objects. For this purpose, each book sets aside an area to store the most recently issued ban statements on disk. The size of this area is controlled by the banjournal_size book parameter.
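As a sketch, following the illustrative configuration form used above (the placement of the key and the value shown are assumptions):

    books = ( {
        id = "book1";
        banjournal_size = "128M";  # placeholder value
        ...
    } );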
When the Varnish daemon starts up using MSE4 configured for persisted caching, a bootstrap process is performed. The slot table from each configured book will be read into memory, and all of the slots verified by checksum for consistency. Any valid objects described will be added to the main Varnish look-up tree so that the object’s presence is known during cache look-ups.
Invalid slots found during the bootstrap will be cleared and reclaimed before normal cache operations commence. Such slots could for example be the result of an unclean shutdown of the system and subsequent aborted IO operations.
The bootstrap process may also identify any number of invalid objects. The slots they occupy will also be cleared and reclaimed. Reasons for invalidating an object during the bootstrap include:
The lifetime parameters indicate that the object is too old. This can happen when the object expired while Varnish was not running.
The object’s ban timestamp is invalid. This can happen if the book’s ban journal is too small to keep all of the active bans in the system. Since there is no way of knowing whether the object would’ve been invalidated by one of the missing bans, the object is removed as a precaution.
The store that holds the object's cache data is offline, missing or otherwise not available. When the store is not accessible, several Varnish mechanisms like cache invalidations and bans will not be applied to the affected objects. If the store is made accessible again on a later restart, objects that would have been invalidated may resurface. To safeguard against this, inaccessible objects are invalidated during the bootstrap.
All of the store data can be checksummed upon first being written to the stores, and then verified when read back. The checksums for the content are stored in the book.
This functionality is enabled or disabled by the write_checksum and verify_checksum store configuration parameters. By default checksum verification is turned on.
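A sketch in the same illustrative configuration form (the placement of the keys and the values shown are assumptions):

    stores = ( {
        id = "store1";
        write_checksum = true;
        verify_checksum = true;
        ...
    } );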
When data is read back into memory from the store and the checksum verification fails, the object will automatically be evicted from the cache. This ensures that no other client request can make a successful cache hit on that object from that point in time, and subsequent attempts at getting the content will cause a cache fetch from the backend.
However, because content is only read back into memory on demand, a delivery may be well under way, with the beginning of an object already delivered to the client, by the time the checksum verification fails on a later chunk of the object. It is then too late to communicate the bad object to the client, and the only option left is to cause a delivery failure. Clients may experience short reads and a forced connection close in this situation. Note that it is ensured that no byte from the invalid read will ever be communicated to the client.
When either the book runs out of slots or the store is full, content will be evicted from the cache to free up resources.
When new persisted objects are added to the cache, free slots are needed in the store’s book in which to store the object meta data.
Every book holds a reserve of free slots ready to be handed out. The size of this reserve is controlled by the slot_reserve book configuration key.
When the reserve runs low, a background task frees up slots by evicting the least recently used object from among all of the objects stored in the book.
When new persisted objects are added to the cache, free space in the store needs to be assigned. For efficiency reasons, the available free space is not mapped in its entirety. Rather, a background task exists to fill a reserve with known free space regions meeting a minimum requirement for contiguous byte ranges. The fetch tasks grab byte ranges from the reserve and assign them to new persisted objects.
The goal for this background task is controlled by the reserve_size store configuration key. Upon reaching this goal the task will sleep, and be woken again when the reserve drops below the goal. The available reserve is shown in the MSE4_STORE.<store-id>.g_reserve_bytes VSC counter.
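For example, the counter can be inspected for all stores at once using the field filtering option of varnishstat:

$ varnishstat -1 -f 'MSE4_STORE.*.g_reserve_bytes'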
When the background task fails to meet its goals through natural decay (objects being deleted due to their lifetime settings), persisted content will be evicted from the cache to make new space.
When evicting content to free up store space, it is necessary to work towards creating contiguous free areas to avoid store space fragmentation. To achieve this, the space is considered one segment at a time in a round robin fashion, with the segment size determined by the segment_size configuration key. All objects that have one or more allocations in that segment and are not frequently accessed are evicted.
If a drive IO error is reported at runtime, MSE4 is able to take a book or store offline without causing system downtime. The only disruption of service will be that the objects that were hosted on the affected book or store are removed from the cache.
Any error code reported back to MSE4 by the operating system during an IO operation is monitored and will cause the affected file device to be taken offline.
When multiple file devices are hosted on the same drive, a failure of one will not automatically take the others offline. However, when a book device fails, all of the stores using that book will also fail.
When a file device fails, MSE4 will immediately cease all IO operations to the device. All of the objects hosted on the device will be invalidated. Any active delivery tasks using any object being invalidated may have to fail the delivery as a result. Clients may experience short reads and a forced connection close in this situation.
After a file device has failed and become offline, an administrator may reset the device and bring it back up into service again. This would presumably be done after having executed a hot swap of the affected drive to replace the faulty hardware. When a file device reset is performed, the device will always come back into service empty. That means that any cached objects that were on the file device will have been lost.
A file device will be in one of three states at runtime. The states are:

ONLINE
  When a file device is in state ONLINE, it operates normally.

FAILING
  Whenever a file device fails, it will first transition from ONLINE to the FAILING state. While in this state, active clean up tasks are still being performed on the cached objects that this device holds. MSE4 will still be holding file descriptors open on the device during this period, and an administrator should ensure that no file devices on an affected drive are in this state before hotplugging the drive.

  A file device in the FAILING state cannot be reset.

OFFLINE
  Once the clean up tasks have been completed, the file device will transition from the FAILING state to the OFFLINE state. At this point in time, MSE4 will have closed all of the file descriptors to the file device, and it is safe to hotplug the drive.

  Once a file device has reached the OFFLINE state, it is possible to reset the device. This will reinitialize the file device and bring it back to the ONLINE state.
The current state of the file devices can be queried by using the CLI command mse4.status.
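For example, through varnishadm:

$ varnishadm mse4.status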
A boolean online status is also reported through the varnishstat counters (MSE4_BOOK.<book-id>.online and MSE4_STORE.<store-id>.online). This counter value will be 1 when the file device state is ONLINE, and 0 otherwise.
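The counters can be inspected in the same manner, for example:

$ varnishstat -1 -f 'MSE4_BOOK.*.online' -f 'MSE4_STORE.*.online'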
MSE4 uses a statelog file to record state changes to the file devices. This log resides by default in /var/lib/mse/<hostname>.mse4_statelog. The statelog file serves two purposes. The primary use is to initialize a file device’s assumed state after a restart of Varnish, to make sure that a failed drive stays in the failed state until the drive has been replaced. Secondly, it serves as a log of IO errors that have happened on the device.
The log is an ASCII file, and records any state change that has happened to a file device. For each file device, the very last state change recorded is the state assumed after a restart.
The administrator can induce a file device failure manually. This can be a useful tool if e.g. S.M.A.R.T. monitoring indicates that a drive is starting to experience issues, and a proactive drive replacement needs to be performed.
To cause a file device to fail, execute the CLI command mse4.fail <book-or-store-id>. This will have the same effect as if an IO operation on the file device reported an IO error.
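For example, with a hypothetical store id of store1:

$ varnishadm mse4.fail store1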
When a file device reaches the OFFLINE state, MSE4 will have closed all of its file descriptors pointing to the file device. This makes it possible to safely unmount the file system upon which the file resides, and to perform a hotplug of the drive.
Note that the administrator should make sure that all of the MSE4 file devices residing on a given drive are in the OFFLINE state before unmounting the file system.
The instructions for how to hotplug a drive are out of scope for this manual.
After a fresh drive has been installed, a file system needs to be created on the drive, and the drive remounted.
An MSE4 file device in the OFFLINE state can be reset at runtime to bring it back to the ONLINE state.
To do this, execute the CLI command mse4.reset <book-or-store-id>.
A store device cannot be reset if its book device is not ONLINE. If resetting both a book and its stores, reset the book first.
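For example, with hypothetical ids, resetting a failed book and one of its stores in the required order:

$ varnishadm mse4.reset book1
$ varnishadm mse4.reset store1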
The reset command will recreate the file device from scratch, using the configuration parameters as described in the MSE4 configuration file at the time the Varnish daemon was started. If a file of the same name already exists, it will first be deleted.
A successful reset will mark the file device as ONLINE in the statelog.
The Varnish daemon will on startup print a warning message to syslog if the actual file devices presented to it differ in size and layout from what the configuration suggests. It will start up as normal though, and use the applicable settings (file and journal sizes) as stored in the file devices, disregarding the configuration options.
To bring the file devices into compliance with the configuration, it is necessary to perform an offline file device resize operation. This can be done using the mkfs.mse4 utility and the resize command. The Varnish daemon must be stopped in order to execute this command. Please see man mkfs.mse4 for more information about running this command.
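For example, assuming the resize command accepts the same -c configuration argument as the configure command, and a hypothetical configuration path (run this only while the Varnish daemon is stopped):

$ mkfs.mse4 -c /etc/varnish/mse4.conf resize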