|
|
|
\chapter{Checkpointing}
|
|
|
|
\label{chap-checkpointing}
|
|
|
|
|
|
|
|
In this chapter, we describe two alternative checkpointing
|
|
|
|
techniques. The first one is inspired by the work on the EROS
|
|
|
|
operating system. The second one is based on work on log-structured
|
|
|
|
file systems.
|
|
|
|
|
|
|
|
\section{Technique inspired by EROS}
|
|
|
|
|
|
|
|
The checkpointing mechanism described in this section is inspired by
|
|
|
|
that of the EROS system.
|
|
|
|
|
|
|
|
The address of an object can be considered as consisting of two parts:
|
|
|
|
the \emph{page number} and the \emph{offset within the page}. The
|
|
|
|
page number directly corresponds to the location on disk of the page.
|
|
|
|
However, when checkpointing is activated, the available disk memory is
|
|
|
|
divided into three parts, and the page number should be multiplied by
|
|
|
|
3 to get the first of three disk locations where the object might be
|
|
|
|
located.%
|
|
|
|
\footnote{The price to pay for checkpointing is thus that disk memory
|
|
|
|
will cost a factor 3 as much compared to the price when no
|
|
|
|
checkpointing is used.}
|
|
|
|
|
|
|
|
Checkpointing is divided into \emph{cycles} delimited by
|
|
|
|
\emph{snapshots}. At any point in time, two checkpointing cycles are
|
|
|
|
important. The \emph{current} checkpointing cycle started at the
|
|
|
|
last snapshot and is still going on. The \emph{previous}
|
|
|
|
checkpointing cycle is the one that ended at the last snapshot.
|
|
|
|
|
|
|
|
A page can exist in one, two, or three \emph{versions}, located in
|
|
|
|
three different places on disk. Version $0$ of the page is the oldest
|
|
|
|
version, and also the version that would be used when the system is
|
|
|
|
rebooted after a crash. Version $0$ of the page always exists.
|
|
|
|
Version $1$ of the page corresponds to the contents of the page as it
|
|
|
|
was at the end of the \emph{previous} checkpoint cycle. Version $1$
|
|
|
|
of the page exists if and only if the page was modified during the
|
|
|
|
previous checkpoint cycle. Version $2$ of the page is the
|
|
|
|
\emph{current} version of the page. Version $2$ of the page exists if
|
|
|
|
and only if the page has been modified since the beginning of the
|
|
|
|
\emph{current} checkpoint cycle. We use the word \emph{page instance}
|
|
|
|
to refer to a particular version of a particular page.
|
|
|
|
|
|
|
|
A page can be associated with a \emph{frame}.%
|
|
|
|
\footnote{A \emph{frame} is the main-memory instance of a page.} An
|
|
|
|
attempt to access a page that is not associated with a frame results
|
|
|
|
in a \emph{page fault}. At most one version of a particular page can
|
|
|
|
be associated with a frame, and then it is the version with the
|
|
|
|
highest number. A frame associated with version $0$ or version $1$ of
|
|
|
|
a page is \emph{write protected}, but a frame associated with version
|
|
|
|
$2$ of a page is not. Any attempt to modify the contents of a
|
|
|
|
write-protected frame results in a \emph{write fault}.
|
|
|
|
|
|
|
|
A frame can be \emph{clean} or \emph{dirty}. By definition, when the
|
|
|
|
frame is clean, its contents are identical to those of the associated
|
|
|
|
page instance. When the frame is dirty, it means that it has been
|
|
|
|
modified after it was associated with the underlying page instance. A
|
|
|
|
frame that is associated with version $0$ of a page can not be dirty.
|
|
|
|
If a frame that is associated with version $1$ of a page is dirty,
|
|
|
|
then it is because it was modified during the \emph{previous}
|
|
|
|
checkpointing cycle, and not the current one.
|
|
|
|
|
|
|
|
When a page fault occurs, and there are unused frames, an arbitrary
|
|
|
|
unused frame is associated with the latest version of the page. If
|
|
|
|
there are no unused frames when a page fault occurs (which is the
|
|
|
|
normal situation), a frame that is already associated with a page must
|
|
|
|
be freed up. To select the frame to free up, an ordinary ALRU method
|
|
|
|
can be used. If the selected frame is dirty, the contents are written
|
|
|
|
to the page instance associated with the frame. Finally, the latest
|
|
|
|
version of the requested page is associated with the selected frame.
|
|
|
|
If the latest version of the requested page is either version $0$ or
|
|
|
|
version $1$, then the frame is write protected before execution
|
|
|
|
resumes.
|
|
|
|
|
|
|
|
As indicated above, when a write fault occurs, the frame written to
|
|
|
|
must be associated with either version $0$ or version $1$ of a page.
|
|
|
|
If it is associated with version $0$ of the page, then the frame must
|
|
|
|
be clean. In that case, the association of the frame is modified, so
|
|
|
|
that it henceforth is associated with version $2$ of the page. Before
|
|
|
|
execution resumes, the frame is unprotected. As soon as execution
|
|
|
|
resumes, the frame will be marked as dirty since the reason for the
|
|
|
|
fault was an attempt to write to it. When a write fault occurs and
|
|
|
|
the frame is associated with version $1$ of the associated page, the
|
|
|
|
frame may be either clean or dirty. If it is clean, again, the
|
|
|
|
association of the frame is modified so that it henceforth is
|
|
|
|
associated with version $2$ of the page, and again the frame is
|
|
|
|
unprotected before execution resumes. If the frame is dirty, then its
|
|
|
|
contents are first written to the associated page instance. Then the
|
|
|
|
association is changed as before.
|
|
|
|
|
|
|
|
To determine the disk location of each version of each page, we use a
|
|
|
|
\emph{version table}. The version table is just a sequence of bytes,
|
|
|
|
one for each page. Only 6 bits in each byte are actually used. The
|
|
|
|
two least significant bits indicate the location of version $0$ of the
|
|
|
|
page. $00$ means the first of the $3$ possible consecutive disk
|
|
|
|
locations, $01$ means the second and $10$ means the third, and $11$ is
|
|
|
|
not used. The next two bits indicate the location of version $1$ of
|
|
|
|
the page, with the same meaning as before, except that $11$ means that
|
|
|
|
there is no version $1$ of the page. The final two bits indicate the
|
|
|
|
location of version $2$ of the page with the same interpretation as
|
|
|
|
for version $1$.
|
|
|
|
|
|
|
|
At any point in time, there exist three version tables; two on disk
|
|
|
|
and one in main memory. The two versions on disk play the same role
|
|
|
|
as the disk tables in EROS, i.e., while one of them is being updated,
|
|
|
|
the other is still complete and accurate. A single bit in the boot
|
|
|
|
sector of the disk selects which one should be used at boot time.
|
|
|
|
When a new version table needs to be written to disk, it is first
|
|
|
|
written to the place of the unused disk table, and then the boot
|
|
|
|
sector is written with a flipped selection bit.
|
|
|
|
|
|
|
|
The version table in main memory is represented in two levels with a
|
|
|
|
\emph{directory} of pages. If one page is 4kiB, then one page can
|
|
|
|
hold $2^{12}$ version table entries. For a $300GB$ disk (with room
|
|
|
|
for around $25$ million pages), the directory will contain around
|
|
|
|
$6000$ entries. A directory entry contains not only a pointer to the
|
|
|
|
page of table entries, but also a bit indicating whether any of the
|
|
|
|
table entries in the corresponding page indicates a page which exists
|
|
|
|
in more than one version. It is expected that a relatively small
|
|
|
|
fraction of the directory entries in each checkpointing cycle with
|
|
|
|
have the bit set.
|
|
|
|
|
|
|
|
When a write fault occurs and as a result a new version of a page is
|
|
|
|
created, the in-memory version table is consulted. The entry for the
|
|
|
|
page indicates the disk location of version $0$ of the page, and
|
|
|
|
sometimes also version $1$ of the page. The disk location for the new
|
|
|
|
version (version $2$) of the page is chosen to be one of the two
|
|
|
|
unused ones (if only version $0$ of the page exists) or the only
|
|
|
|
unused one (if both version $0$ and version $1$ of the page exists).
|
|
|
|
The location for version $2$ of the page is indicated in the version
|
|
|
|
table entry by setting bits $4$ and $5$ of the entry to the
|
|
|
|
corresponding disk location.
|
|
|
|
|
|
|
|
In parallel with mutator threads, one or more threads scan the page
|
|
|
|
table of the operating system for dirty frames. When a dirty frame
|
|
|
|
corresponding to version $1$ of a page is found, the contents of the
|
|
|
|
frame is saved to its associated page instance, and the dirty-bit is
|
|
|
|
cleared. When there are no more dirty frames corresponding to version
|
|
|
|
$1$ pages, the set of page instances corresponding to all version $1$
|
|
|
|
pages and version $0$ pages where no version $1$ exists represents the
|
|
|
|
state of the system at the time of the last snapshot.
|
|
|
|
|
|
|
|
To save the coherent state of the system to disk, the in-memory
|
|
|
|
version table directory is scanned. Whenever a directory entry with
|
|
|
|
the bit indicating the existence of pages with several versions set,
|
|
|
|
the page of the directory entry is saved to disk. When the entire
|
|
|
|
version table has been scanned, a new boot sector is written
|
|
|
|
to indicate that the newly saved table is the current one.
|
|
|
|
|
|
|
|
The final action to take in order to finish the current checkpointing
|
|
|
|
cycle and begin a new one is an \emph{atomic flip}. This atomic flip
|
|
|
|
consists of turning all version $1$ pages into version $0$ pages and
|
|
|
|
all version $2$ pages into version $1$ pages. To do that, mutator
|
|
|
|
threads must be stopped. Then the in-memory version table is scanned.
|
|
|
|
Whenever an entry is found that has a version other than $0$ in it, it
|
|
|
|
is modified. If both a version $1$ and a version $2$ exists, bits $2$
|
|
|
|
and $3$ of the entry are moved to position $0$ and $1$, bits $4$ and
|
|
|
|
$5$ are moved to positions $2$ and $3$, and positions $4$, and $5$ are
|
|
|
|
set to $11$. If no version $1$ exists, then bits $4$ and $5$ are
|
|
|
|
moved to positions $2$ and $3$, and positions $4$, and $5$ are set to
|
|
|
|
$11$. Finally, mutator threads are restarted.
|
|
|
|
|
|
|
|
The easiest way to modify a version table entry is probably to create
|
|
|
|
a 64-byte table in memory which, for each possible version of the
|
|
|
|
existing version table entry gives the new version. Even though it
|
|
|
|
would require a memory access, this table will quickly be in the
|
|
|
|
cache, so access will be fast.
|
|
|
|
|
|
|
|
To get an idea of performance of the atomic flip, let us take a
|
|
|
|
situation where the \emph{working set} is no bigger than the size of
|
|
|
|
main memory.%
|
|
|
|
\footnote{If the working set is larger than the main memory,
|
|
|
|
performance is likely to deteriorate for more fundamental reasons.}
|
|
|
|
Furthermore, let us say that the size of main memory is $64GiB$ and
|
|
|
|
that around half the pages of the working set are modified in a
|
|
|
|
particular checkpointing cycle. If we assume that the modified pages
|
|
|
|
are concentrated with respect to the version table directory, then we
|
|
|
|
can ignore the time to scan the version table directory. To
|
|
|
|
accomplish the flip, we then need to modify $2^{23}$ entries. If we
|
|
|
|
assume modified entries are adjacent, we can load and store $8$ of
|
|
|
|
them at a time, requiring $2^{21}$ memory accesses. If a memory
|
|
|
|
access takes around $10$ns, the flip will take around $20$ms.
|
|
|
|
|
|
|
|
The time for a flip can be made shorter by taking more frequent
|
|
|
|
snapshots.
|
|
|
|
|
|
|
|
%% LocalWords: checkpointing mutator
|
|
|
|
|
|
|
|
\section{Technique based on log-structured file systems}
|
|
|
|
|
|
|
|
To make the description more concrete, we imagine a secondary storage
|
|
|
|
device consisting of around $2^{30}$ pages, each containing $2^{12}$
|
|
|
|
bytes. Recall that \sysname{} treats primary memory as a cache for
|
|
|
|
secondary memory. Therefore, the pages on the secondary storage
|
|
|
|
device can be considered as making up the complete address space of
|
|
|
|
\sysname{}. As such, they have unique numbers, starting at $0$.
|
|
|
|
In the example system, the unique page number would occupy bits $41 -
|
|
|
|
12$ of a pointer.
|
|
|
|
|
|
|
|
However, with the technique described in this section, the unique page
|
|
|
|
number does not correspond to any fixed location on the secondary
|
|
|
|
storage device. Instead, the location of a particular page can vary
|
|
|
|
over time. But when a page fault for a particular unique page number
|
|
|
|
occurs, the location of the page on secondary storage must be known.
|
|
|
|
For that reason, we keep a \emph{page map} in main memory. In the
|
|
|
|
example system, this page map would consist of $2^{30}$ $4$-byte
|
|
|
|
entries, for a total of $2^{32}$ bytes of main memory.
|
|
|
|
|
|
|
|
With the technique described in this section, the secondary storage
|
|
|
|
device represents a very large \emph{circular queue} where each
|
|
|
|
element of the queue is called a \emph{segment}. Such a segment
|
|
|
|
represents a unit of checkpointing. New segments are added to the
|
|
|
|
tail of the queue. Old segments are removed from the head of the
|
|
|
|
queue as described below.
|
|
|
|
|
|
|
|
A segment consists of:
|
|
|
|
|
|
|
|
\begin{itemize}
|
|
|
|
\item a \emph{header} containing metadata about the contents of the
|
|
|
|
segment,
|
|
|
|
\item all the registers of the processor, as observed when creating
|
|
|
|
the checkpoint, and
|
|
|
|
\item a certain number of pages that may have been modified since the
|
|
|
|
previous checkpoint.
|
|
|
|
\end{itemize}
|
|
|
|
|
|
|
|
Again, to make the description more concrete, let us imagine that the
|
|
|
|
number of pages in a segment is around $250$ or so, for a total of
|
|
|
|
around $1MB$ of page data. A segment is written as a unit to the
|
|
|
|
secondary storage device. If that device is a disk, then the seek
|
|
|
|
time and rotation delay of the disk will not significantly impact the
|
|
|
|
transfer of the segment to the disk, because the size of the segment
|
|
|
|
is sufficiently large that the data-transfer time will dominate.
|
|
|
|
|
|
|
|
Furthermore, it is advantageous to keep the secondary storage device
|
|
|
|
nearly full, because then (if the device is a disk) the head and the
|
|
|
|
tail of the queue will be physically close, thereby minimizing seek
|
|
|
|
time.
|
|
|
|
|
|
|
|
The header of a segment contains:
|
|
|
|
|
|
|
|
\begin{itemize}
|
|
|
|
\item A list of the unique page number of each of the pages in the
|
|
|
|
segment. For the example segment size, this information occupies
|
|
|
|
around $1KB$.
|
|
|
|
\item A SHA value calculated from the data in the segment.
|
|
|
|
\item The position of the head of the queue, i.e. the position of the
|
|
|
|
first segment to be removed from the secondary device.
|
|
|
|
\end{itemize}
|
|
|
|
|
|
|
|
In addition to the queue of segments, the secondary storage device
|
|
|
|
contains a single word of information, indicating the tail of the
|
|
|
|
queue, i.e. the position on the device of the last checkpoint segment
|
|
|
|
that was written.
|
|
|
|
|
|
|
|
The first thing we need to verify at this point is that it is possible
|
|
|
|
to boot the system, given only the information on the secondary
|
|
|
|
storage device. Here is how the system would be booted:
|
|
|
|
|
|
|
|
\begin{enumerate}
|
|
|
|
\item Read the information indicating where the tail of the queue is
|
|
|
|
located.
|
|
|
|
\item Using this information, read the metadata of the last
|
|
|
|
checkpointing segment that was written.
|
|
|
|
\item From this metadata, retrieve the information about the head of
|
|
|
|
the queue.
|
|
|
|
\item Read each segment from the head to the tail of the queue,
|
|
|
|
constructing the page map from the metadata of each segment.
|
|
|
|
\item Load initial pages into main memory, setting up the page tables
|
|
|
|
as appropriate.
|
|
|
|
\item Load the registers stored in the checkpointing segment to
|
|
|
|
continue execution, or jump to a default entry point of the system.
|
|
|
|
\end{enumerate}
|
|
|
|
|
|
|
|
Segments are removed from the head of the queue, by a procedure called
|
|
|
|
\emph{cleaning}. This procedure will be described later. For now, we
|
|
|
|
assume that it is not present.
|
|
|
|
|
|
|
|
The system maintains three buffers, each one the size of a segment.
|
|
|
|
Two buffers are used to alternate, so that one is being written to
|
|
|
|
secondary memory while the other one (the \emph{active one}) is used
|
|
|
|
to receive pages in main memory. The third buffer is used to read
|
|
|
|
back and compare what was written to secondary storage. Two counters,
|
|
|
|
$M$ and $N$, each with an initial value of $0$ is kept for each of two
|
|
|
|
ordinary segment buffers. $M$ indicates the first free page in the
|
|
|
|
active segment buffer, or equivalently, the number of pages that have
|
|
|
|
already been copied to the buffer. $N$ indicates the number of dirty
|
|
|
|
pages that have not yet been copied to the segment buffer. If ever
|
|
|
|
$M+N$ reaches the value corresponding to the number of pages in the
|
|
|
|
buffer (in our example, $250$, then a \emph{checkpoint} is triggered
|
|
|
|
as described below.
|
|
|
|
|
|
|
|
When a page fault occurs, a victim page is chosen using some standard
|
|
|
|
technique, such as ``least recently used''. If the victim page is
|
|
|
|
clean, it is simply discarded and the page map is modified to
|
|
|
|
reflect the change. If the victim page is dirty, its contents is
|
|
|
|
copied to the first free page of the active segment buffer, and the
|
|
|
|
value of $M$ is incremented. The unique number of the page is
|
|
|
|
retrieved from the page map and stored in the header of the active
|
|
|
|
segment buffer.
|
|
|
|
|
|
|
|
All clean pages are read-only. When an attempt is made to modify a
|
|
|
|
page, $N$ is incremented and the page is marked as writable.
|
|
|
|
|
|
|
|
As mentioned above, when $M+N$ reaches the value corresponding to the
|
|
|
|
number of available pages in the segment buffer, a checkpoint is
|
|
|
|
triggered. The initial operation of a checkpoint is called an
|
|
|
|
\emph{atomic flip} which involves two segment buffers that we shall
|
|
|
|
call $A$ and $B$. $A$ is the current active segment buffer with $M_A+N_A$
|
|
|
|
having reached its ceiling and $B$ is the next one to be activated
|
|
|
|
with its $M_B$ and $N_B$ equal to $0$.
|
|
|
|
|
|
|
|
The flipping operation must be done atomically, i.e., all executing
|
|
|
|
threads must be temporarily stopped. First, the $N_A$ dirty pages not
|
|
|
|
yet in the buffer are marked as read-only. The active segment
|
|
|
|
buffer is then set to segment $B$. Then the $N_A$ pages that
|
|
|
|
were dirty are copied to segment buffer $A$. Their respective unique page
|
|
|
|
numbers are retrieved from the page map and copied to the header of
|
|
|
|
segment buffer $A$. The registers of the machine prior to flipping
|
|
|
|
are then stored in the segment buffer. Once this is done, previously
|
|
|
|
stopped threads can continue execution, and the entire segment $A$ is
|
|
|
|
written to the end of the queue on secondary storage, and $M_A$ and
|
|
|
|
$N_A$ are set to $0$.
|
|
|
|
|
|
|
|
To avoid that the secondary storage device fills up with more and more
|
|
|
|
checkpoint segments, an activity called \emph{cleaning} works in
|
|
|
|
parallel with the activity described above. Conceptually, a segment
|
|
|
|
is read from the head of the queue and processed as follows. The
|
|
|
|
list of unique page numbers in the segment header is examined. For
|
|
|
|
each unique page number, the page map in main memory is consulted.
|
|
|
|
There are two possible outcomes:
|
|
|
|
|
|
|
|
\begin{enumerate}
|
|
|
|
\item The location of the page as indicated by the page map is
|
|
|
|
different from the location in the segment being processed. Then,
|
|
|
|
there is a segment further back in the queue that contains a newer
|
|
|
|
version of the page. Therefore, this version of the page is
|
|
|
|
obsolete, and is simply discarded.
|
|
|
|
\item The location of the page as indicated by the page map is the
|
|
|
|
same the location in the segment being processed. Then, this
|
|
|
|
version of the page is the most recent one. In this case, the page
|
|
|
|
is copied to the active segment buffer and $M$ is incremented.
|
|
|
|
\end{enumerate}
|
|
|
|
|
|
|
|
When every page in the head segment has been processed this way, the
|
|
|
|
header of the active segment buffer is updated to reflect that the
|
|
|
|
complete segment at the head of the queue has been processed and the
|
|
|
|
following segment on the queue should be processed next. Notice that
|
|
|
|
there is no danger in processing pages this way multiple times. Thus,
|
|
|
|
if a crash occurs in the middle, there is no harm done.
|
|
|
|
|
|
|
|
Now, let us turn our attention to performance. Clearly, if a disk the
|
|
|
|
size of the secondary storage device in our example is to be
|
|
|
|
completely read when the system boots, it will take a very long time
|
|
|
|
indeed. We suggest handling this problem by separating the segment
|
|
|
|
headers from the segment pages either to two separate parts of a
|
|
|
|
single storage device or to a second device. Only the headers need to
|
|
|
|
be read for a page map to be constructed in memory. The headers are
|
|
|
|
less than one half of a percent the size of the space occupied by
|
|
|
|
pages in our example, so booting the system is then much faster. Even
|
|
|
|
better, if the segment headers are placed on a persistent solid-state
|
|
|
|
device, they can be read much faster.
|