\chapter{Checkpointing} \label{chap-checkpointing} In this chapter, we describe two alternative checkpointing techniques. The first one is inspired by the work on the EROS operating system. The second one is based on work on log-structured file systems. \section{Technique inspired by EROS} The checkpointing mechanism described in this section is inspired by that of the EROS system. The address of an object can be considered as consisting of two parts: the \emph{page number} and the \emph{offset within the page}. The page number directly corresponds to the location on disk of the page. However, when checkpointing is activated, the available disk memory is divided into three parts, and the page number should be multiplied by 3 to get the first of three disk locations where the object might be located.% \footnote{The price to pay for checkpointing is thus that disk memory will cost a factor 3 as much compared to the price when no checkpointing is used.} Checkpointing is divided into \emph{cycles} delimited by \emph{snapshots}. At any point in time, two checkpointing cycles are important. The \emph{current} checkpointing cycle started at the last snapshot and is still going on. The \emph{previous} checkpointing cycle is the one that ended at the last snapshot. A page can exist in one, two, or three \emph{versions}, located in three different places on disk. Version $0$ of the page is the oldest version, and also the version that would be used when the system is rebooted after a crash. Version $0$ of the page always exists. Version $1$ of the page corresponds to the contents of the page as it was at the end of the \emph{previous} checkpoint cycle. Version $1$ of the page exists if and only if the page was modified during the previous checkpoint cycle. Version $2$ of the page is the \emph{current} version of the page. Version $2$ of the page exists if and only if the page has been modified since the beginning of the \emph{current} checkpoint cycle. We use the word \emph{page instance} to refer to a particular version of a particular page. A page can be associated with a \emph{frame}.% \footnote{A \emph{frame} is the main-memory instance of a page.} An attempt to access a page that is not associated with a frame results in a \emph{page fault}. At most one version of a particular page can be associated with a frame, and then it is the version with the highest number. A frame associated with version $0$ or version $1$ of a page is \emph{write protected}, but a frame associated with version $2$ of a page is not. Any attempt to modify the contents of a write-protected frame results in a \emph{write fault}. A frame can be \emph{clean} or \emph{dirty}. By definition, when the frame is clean, its contents are identical to those of the associated page instance. When the frame is dirty, it means that it has been modified after it was associated with the underlying page instance. A frame that is associated with version $0$ of a page can not be dirty. If a frame that is associated with version $1$ of a page is dirty, then it is because it was modified during the \emph{previous} checkpointing cycle, and not the current one. When a page fault occurs, and there are unused frames, an arbitrary unused frame is associated with the latest version of the page. If there are no unused frames when a page fault occurs (which is the normal situation), a frame that is already associated with a page must be freed up. To select the frame to free up, an ordinary ALRU method can be used. If the selected frame is dirty, the contents are written to the page instance associated with the frame. Finally, the latest version of the requested page is associated with the selected frame. If the latest version of the requested page is either version $0$ or version $1$, then the frame is write protected before execution resumes. As indicated above, when a write fault occurs, the frame written to must be associated with either version $0$ or version $1$ of a page. If it is associated with version $0$ of the page, then the frame must be clean. In that case, the association of the frame is modified, so that it henceforth is associated with version $2$ of the page. Before execution resumes, the frame is unprotected. As soon as execution resumes, the frame will be marked as dirty since the reason for the fault was an attempt to write to it. When a write fault occurs and the frame is associated with version $1$ of the associated page, the frame may be either clean or dirty. If it is clean, again, the association of the frame is modified so that it henceforth is associated with version $2$ of the page, and again the frame is unprotected before execution resumes. If the frame is dirty, then its contents are first written to the associated page instance. Then the association is changed as before. To determine the disk location of each version of each page, we use a \emph{version table}. The version table is just a sequence of bytes, one for each page. Only 6 bits in each byte are actually used. The two least significant bits indicate the location of version $0$ of the page. $00$ means the first of the $3$ possible consecutive disk locations, $01$ means the second and $10$ means the third, and $11$ is not used. The next two bits indicate the location of version $1$ of the page, with the same meaning as before, except that $11$ means that there is no version $1$ of the page. The final two bits indicate the location of version $2$ of the page with the same interpretation as for version $1$. At any point in time, there exist three version tables; two on disk and one in main memory. The two versions on disk play the same role as the disk tables in EROS, i.e., while one of them is being updated, the other is still complete and accurate. A single bit in the boot sector of the disk selects which one should be used at boot time. When a new version table needs to be written to disk, it is first written to the place of the unused disk table, and then the boot sector is written with a flipped selection bit. The version table in main memory is represented in two levels with a \emph{directory} of pages. If one page is 4kiB, then one page can hold $2^{12}$ version table entries. For a $300GB$ disk (with room for around $25$ million pages), the directory will contain around $6000$ entries. A directory entry contains not only a pointer to the page of table entries, but also a bit indicating whether any of the table entries in the corresponding page indicates a page which exists in more than one version. It is expected that a relatively small fraction of the directory entries in each checkpointing cycle with have the bit set. When a write fault occurs and as a result a new version of a page is created, the in-memory version table is consulted. The entry for the page indicates the disk location of version $0$ of the page, and sometimes also version $1$ of the page. The disk location for the new version (version $2$) of the page is chosen to be one of the two unused ones (if only version $0$ of the page exists) or the only unused one (if both version $0$ and version $1$ of the page exists). The location for version $2$ of the page is indicated in the version table entry by setting bits $4$ and $5$ of the entry to the corresponding disk location. In parallel with mutator threads, one or more threads scan the page table of the operating system for dirty frames. When a dirty frame corresponding to version $1$ of a page is found, the contents of the frame is saved to its associated page instance, and the dirty-bit is cleared. When there are no more dirty frames corresponding to version $1$ pages, the set of page instances corresponding to all version $1$ pages and version $0$ pages where no version $1$ exists represents the state of the system at the time of the last snapshot. To save the coherent state of the system to disk, the in-memory version table directory is scanned. Whenever a directory entry with the bit indicating the existence of pages with several versions set, the page of the directory entry is saved to disk. When the entire version table has been scanned, a new boot sector is written to indicate that the newly saved table is the current one. The final action to take in order to finish the current checkpointing cycle and begin a new one is an \emph{atomic flip}. This atomic flip consists of turning all version $1$ pages into version $0$ pages and all version $2$ pages into version $1$ pages. To do that, mutator threads must be stopped. Then the in-memory version table is scanned. Whenever an entry is found that has a version other than $0$ in it, it is modified. If both a version $1$ and a version $2$ exists, bits $2$ and $3$ of the entry are moved to position $0$ and $1$, bits $4$ and $5$ are moved to positions $2$ and $3$, and positions $4$, and $5$ are set to $11$. If no version $1$ exists, then bits $4$ and $5$ are moved to positions $2$ and $3$, and positions $4$, and $5$ are set to $11$. Finally, mutator threads are restarted. The easiest way to modify a version table entry is probably to create a 64-byte table in memory which, for each possible version of the existing version table entry gives the new version. Even though it would require a memory access, this table will quickly be in the cache, so access will be fast. To get an idea of performance of the atomic flip, let us take a situation where the \emph{working set} is no bigger than the size of main memory.% \footnote{If the working set is larger than the main memory, performance is likely to deteriorate for more fundamental reasons.} Furthermore, let us say that the size of main memory is $64GiB$ and that around half the pages of the working set are modified in a particular checkpointing cycle. If we assume that the modified pages are concentrated with respect to the version table directory, then we can ignore the time to scan the version table directory. To accomplish the flip, we then need to modify $2^{23}$ entries. If we assume modified entries are adjacent, we can load and store $8$ of them at a time, requiring $2^{21}$ memory accesses. If a memory access takes around $10$ns, the flip will take around $20$ms. The time for a flip can be made shorter by taking more frequent snapshots. %% LocalWords: checkpointing mutator \section{Technique based on log-structured file systems} To make the description more concrete, we imagine a secondary storage device consisting of around $2^{30}$ pages, each containing $2^{12}$ bytes. Recall that \sysname{} treats primary memory as a cache for secondary memory. Therefore, the pages on the secondary storage device can be considered as making up the complete address space of \sysname{}. As such, they have unique numbers, starting at $0$. In the example system, the unique page number would occupy bits $41 - 12$ of a pointer. However, with the technique described in this section, the unique page number does not correspond to any fixed location on the secondary storage device. Instead, the location of a particular page can vary over time. But when a page fault for a particular unique page number occurs, the location of the page on secondary storage must be known. For that reason, we keep a \emph{page map} in main memory. In the example system, this page map would consist of $2^{30}$ $4$-byte entries, for a total of $2^{32}$ bytes of main memory. With the technique described in this section, the secondary storage device represents a very large \emph{circular queue} where each element of the queue is called a \emph{segment}. Such a segment represents a unit of checkpointing. New segments are added to the tail of the queue. Old segments are removed from the head of the queue as described below. A segment consists of: \begin{itemize} \item a \emph{header} containing metadata about the contents of the segment, \item all the registers of the processor, as observed when creating the checkpoint, and \item a certain number of pages that may have been modified since the previous checkpoint. \end{itemize} Again, to make the description more concrete, let us imagine that the number of pages in a segment is around $250$ or so, for a total of around $1MB$ of page data. A segment is written as a unit to the secondary storage device. If that device is a disk, then the seek time and rotation delay of the disk will not significantly impact the transfer of the segment to the disk, because the size of the segment is sufficiently large that the data-transfer time will dominate. Furthermore, it is advantageous to keep the secondary storage device nearly full, because then (if the device is a disk) the head and the tail of the queue will be physically close, thereby minimizing seek time. The header of a segment contains: \begin{itemize} \item A list of the unique page number of each of the pages in the segment. For the example segment size, this information occupies around $1KB$. \item A SHA value calculated from the data in the segment. \item The position of the head of the queue, i.e. the position of the first segment to be removed from the secondary device. \end{itemize} In addition to the queue of segments, the secondary storage device contains a single word of information, indicating the tail of the queue, i.e. the position on the device of the last checkpoint segment that was written. The first thing we need to verify at this point is that it is possible to boot the system, given only the information on the secondary storage device. Here is how the system would be booted: \begin{enumerate} \item Read the information indicating where the tail of the queue is located. \item Using this information, read the metadata of the last checkpointing segment that was written. \item From this metadata, retrieve the information about the head of the queue. \item Read each segment from the head to the tail of the queue, constructing the page map from the metadata of each segment. \item Load initial pages into main memory, setting up the page tables as appropriate. \item Load the registers stored in the checkpointing segment to continue execution, or jump to a default entry point of the system. \end{enumerate} Segments are removed from the head of the queue, by a procedure called \emph{cleaning}. This procedure will be described later. For now, we assume that it is not present. The system maintains three buffers, each one the size of a segment. Two buffers are used to alternate, so that one is being written to secondary memory while the other one (the \emph{active one}) is used to receive pages in main memory. The third buffer is used to read back and compare what was written to secondary storage. Two counters, $M$ and $N$, each with an initial value of $0$ is kept for each of two ordinary segment buffers. $M$ indicates the first free page in the active segment buffer, or equivalently, the number of pages that have already been copied to the buffer. $N$ indicates the number of dirty pages that have not yet been copied to the segment buffer. If ever $M+N$ reaches the value corresponding to the number of pages in the buffer (in our example, $250$, then a \emph{checkpoint} is triggered as described below. When a page fault occurs, a victim page is chosen using some standard technique, such as ``least recently used''. If the victim page is clean, it is simply discarded and the page map is modified to reflect the change. If the victim page is dirty, its contents is copied to the first free page of the active segment buffer, and the value of $M$ is incremented. The unique number of the page is retrieved from the page map and stored in the header of the active segment buffer. All clean pages are read-only. When an attempt is made to modify a page, $N$ is incremented and the page is marked as writable. As mentioned above, when $M+N$ reaches the value corresponding to the number of available pages in the segment buffer, a checkpoint is triggered. The initial operation of a checkpoint is called an \emph{atomic flip} which involves two segment buffers that we shall call $A$ and $B$. $A$ is the current active segment buffer with $M_A+N_A$ having reached its ceiling and $B$ is the next one to be activated with its $M_B$ and $N_B$ equal to $0$. The flipping operation must be done atomically, i.e., all executing threads must be temporarily stopped. First, the $N_A$ dirty pages not yet in the buffer are marked as read-only. The active segment buffer is then set to segment $B$. Then the $N_A$ pages that were dirty are copied to segment buffer $A$. Their respective unique page numbers are retrieved from the page map and copied to the header of segment buffer $A$. The registers of the machine prior to flipping are then stored in the segment buffer. Once this is done, previously stopped threads can continue execution, and the entire segment $A$ is written to the end of the queue on secondary storage, and $M_A$ and $N_A$ are set to $0$. To avoid that the secondary storage device fills up with more and more checkpoint segments, an activity called \emph{cleaning} works in parallel with the activity described above. Conceptually, a segment is read from the head of the queue and processed as follows. The list of unique page numbers in the segment header is examined. For each unique page number, the page map in main memory is consulted. There are two possible outcomes: \begin{enumerate} \item The location of the page as indicated by the page map is different from the location in the segment being processed. Then, there is a segment further back in the queue that contains a newer version of the page. Therefore, this version of the page is obsolete, and is simply discarded. \item The location of the page as indicated by the page map is the same the location in the segment being processed. Then, this version of the page is the most recent one. In this case, the page is copied to the active segment buffer and $M$ is incremented. \end{enumerate} When every page in the head segment has been processed this way, the header of the active segment buffer is updated to reflect that the complete segment at the head of the queue has been processed and the following segment on the queue should be processed next. Notice that there is no danger in processing pages this way multiple times. Thus, if a crash occurs in the middle, there is no harm done. Now, let us turn our attention to performance. Clearly, if a disk the size of the secondary storage device in our example is to be completely read when the system boots, it will take a very long time indeed. We suggest handling this problem by separating the segment headers from the segment pages either to two separate parts of a single storage device or to a second device. Only the headers need to be read for a page map to be constructed in memory. The headers are less than one half of a percent the size of the space occupied by pages in our example, so booting the system is then much faster. Even better, if the segment headers are placed on a persistent solid-state device, they can be read much faster.