closos/Documentation/chap-checkpointing.tex

\chapter{Checkpointing}
\label{chap-checkpointing}

In this chapter, we describe two alternative checkpointing
techniques.  The first one is inspired by the work on the EROS
operating system.  The second one is based on work on log-structured
file systems.

\section{Technique inspired by EROS}

The checkpointing mechanism described in this section is inspired by
that of the EROS system.

The address of an object can be considered as consisting of two parts:
the \emph{page number} and the \emph{offset within the page}.  The
page number directly corresponds to the location on disk of the page.
However, when checkpointing is activated, the available disk memory is
divided into three parts, and the page number should be multiplied by
3 to get the first of three disk locations where the object might be
located.%
\footnote{The price to pay for checkpointing is thus that disk memory
  will cost a factor 3 as much compared to the price when no
  checkpointing is used.}

Checkpointing is divided into \emph{cycles} delimited by
\emph{snapshots}.  At any point in time, two checkpointing cycles are
important.  The \emph{current} checkpointing cycle started at the
last snapshot and is still going on.  The \emph{previous}
checkpointing cycle is the one that ended at the last snapshot.

A page can exist in one, two, or three \emph{versions}, located in
three different places on disk.  Version $0$ of the page is the oldest
version, and also the version that would be used when the system is
rebooted after a crash.  Version $0$ of the page always exists.
Version $1$ of the page corresponds to the contents of the page as it
was at the end of the \emph{previous} checkpoint cycle.  Version $1$
of the page exists if and only if the page was modified during the
previous checkpoint cycle.  Version $2$ of the page is the
\emph{current} version of the page.  Version $2$ of the page exists if
and only if the page has been modified since the beginning of the
\emph{current} checkpoint cycle.  We use the word \emph{page instance}
to refer to a particular version of a particular page.

A page can be associated with a \emph{frame}.%
\footnote{A \emph{frame} is the main-memory instance of a page.}  An
attempt to access a page that is not associated with a frame results
in a \emph{page fault}.  At most one version of a particular page can
be associated with a frame, and then it is the version with the
highest number.  A frame associated with version $0$ or version $1$ of
a page is \emph{write protected}, but a frame associated with version
$2$ of a page is not.  Any attempt to modify the contents of a
write-protected frame results in a \emph{write fault}.

A frame can be \emph{clean} or \emph{dirty}.  By definition, when the
frame is clean, its contents are identical to those of the associated
page instance.  When the frame is dirty, it means that it has been
modified after it was associated with the underlying page instance.  A
frame that is associated with version $0$ of a page can not be dirty.
If a frame that is associated with version $1$ of a page is dirty,
then it is because it was modified during the \emph{previous}
checkpointing cycle, and not the current one.

When a page fault occurs, and there are unused frames, an arbitrary
unused frame is associated with the latest version of the page.  If
there are no unused frames when a page fault occurs (which is the
normal situation), a frame that is already associated with a page must
be freed up.  To select the frame to free up, an ordinary ALRU method
can be used.  If the selected frame is dirty, the contents are written
to the page instance associated with the frame.  Finally, the latest
version of the requested page is associated with the selected frame.
If the latest version of the requested page is either version $0$ or
version $1$, then the frame is write protected before execution
resumes.

As indicated above, when a write fault occurs, the frame written to
must be associated with either version $0$ or version $1$ of a page.
If it is associated with version $0$ of the page, then the frame must
be clean.  In that case, the association of the frame is modified, so
that it henceforth is associated with version $2$ of the page.  Before
execution resumes, the frame is unprotected.  As soon as execution
resumes, the frame will be marked as dirty since the reason for the
fault was an attempt to write to it.  When a write fault occurs and
the frame is associated with version $1$ of the associated page, the
frame may be either clean or dirty.  If it is clean, again, the
association of the frame is modified so that it henceforth is
associated with version $2$ of the page, and again the frame is
unprotected before execution resumes.  If the frame is dirty, then its
contents are first written to the associated page instance.  Then the
association is changed as before.

To determine the disk location of each version of each page, we use a
\emph{version table}.  The version table is just a sequence of bytes,
one for each page.  Only 6 bits in each byte are actually used.  The
two least significant bits indicate the location of version $0$ of the
page.  $00$ means the first of the $3$ possible consecutive disk
locations, $01$ means the second and $10$ means the third, and $11$ is
not used.  The next two bits indicate the location of version $1$ of
the page, with the same meaning as before, except that $11$ means that
there is no version $1$ of the page.  The final two bits indicate the
location of version $2$ of the page with the same interpretation as
for version $1$.

At any point in time, there exist three version tables; two on disk
and one in main memory.  The two versions on disk play the same role
as the disk tables in EROS, i.e., while one of them is being updated,
the other is still complete and accurate.  A single bit in the boot
sector of the disk selects which one should be used at boot time.
When a new version table needs to be written to disk, it is first
written to the place of the unused disk table, and then the boot
sector is written with a flipped selection bit.

The version table in main memory is represented in two levels with a
\emph{directory} of pages.  If one page is 4kiB, then one page can
hold $2^{12}$ version table entries.  For a $300GB$ disk (with room
for around $25$ million pages), the directory will contain around
$6000$ entries.  A directory entry contains not only a pointer to the
page of table entries, but also a bit indicating whether any of the
table entries in the corresponding page indicates a page which exists
in more than one version.  It is expected that a relatively small
fraction of the directory entries in each checkpointing cycle with
have the bit set.

When a write fault occurs and as a result a new version of a page is
created, the in-memory version table is consulted.  The entry for the
page indicates the disk location of version $0$ of the page, and
sometimes also version $1$ of the page.  The disk location for the new
version (version $2$) of the page is chosen to be one of the two
unused ones (if only version $0$ of the page exists) or the only
unused one (if both version $0$ and version $1$ of the page exists).
The location for version $2$ of the page is indicated in the version
table entry by setting bits $4$ and $5$ of the entry to the
corresponding disk location.

In parallel with mutator threads, one or more threads scan the page
table of the operating system for dirty frames.  When a dirty frame
corresponding to version $1$ of a page is found, the contents of the
frame is saved to its associated page instance, and the dirty-bit is
cleared.  When there are no more dirty frames corresponding to version
$1$ pages, the set of page instances corresponding to all version $1$
pages and version $0$ pages where no version $1$ exists represents the
state of the system at the time of the last snapshot.

To save the coherent state of the system to disk, the in-memory
version table directory is scanned.  Whenever a directory entry with
the bit indicating the existence of pages with several versions set,
the page of the directory entry is saved to disk.  When the entire
version table has been scanned, a new boot sector is written
to indicate that the newly saved table is the current one.

The final action to take in order to finish the current checkpointing
cycle and begin a new one is an \emph{atomic flip}.  This atomic flip
consists of turning all version $1$ pages into version $0$ pages and
all version $2$ pages into version $1$ pages.  To do that, mutator
threads must be stopped.  Then the in-memory version table is scanned.
Whenever an entry is found that has a version other than $0$ in it, it
is modified.  If both a version $1$ and a version $2$ exists, bits $2$
and $3$ of the entry are moved to position $0$ and $1$, bits $4$ and
$5$ are moved to positions $2$ and $3$, and positions $4$, and $5$ are
set to $11$.  If no version $1$ exists, then bits $4$ and $5$ are
moved to positions $2$ and $3$, and positions $4$, and $5$ are set to
$11$.  Finally, mutator threads are restarted.

The easiest way to modify a version table entry is probably to create
a 64-byte table in memory which, for each possible version of the
existing version table entry gives the new version.  Even though it
would require a memory access, this table will quickly be in the
cache, so access will be fast.

To get an idea of performance of the atomic flip, let us take a
situation where the \emph{working set} is no bigger than the size of
main memory.%
\footnote{If the working set is larger than the main memory,
  performance is likely to deteriorate for more fundamental reasons.}
Furthermore, let us say that the size of main memory is $64GiB$ and
that around half the pages of the working set are modified in a
particular checkpointing cycle.  If we assume that the modified pages
are concentrated with respect to the version table directory, then we
can ignore the time to scan the version table directory.  To
accomplish the flip, we then need to modify $2^{23}$ entries.  If we
assume modified entries are adjacent, we can load and store $8$ of
them at a time, requiring $2^{21}$ memory accesses.  If a memory
access takes around $10$ns, the flip will take around $20$ms.

The time for a flip can be made shorter by taking more frequent
snapshots.

%%  LocalWords:  checkpointing mutator

\section{Technique based on log-structured file systems}

To make the description more concrete, we imagine a secondary storage
device consisting of around $2^{30}$ pages, each containing $2^{12}$
bytes.  Recall that \sysname{} treats primary memory as a cache for
secondary memory.  Therefore, the pages on the secondary storage
device can be considered as making up the complete address space of
\sysname{}.  As such, they have unique numbers, starting at $0$.
In the example system, the unique page number would occupy bits $41 -
12$ of a pointer.

However, with the technique described in this section, the unique page
number does not correspond to any fixed location on the secondary
storage device.  Instead, the location of a particular page can vary
over time.  But when a page fault for a particular unique page number
occurs, the location of the page on secondary storage must be known.
For that reason, we keep a \emph{page map} in main memory.  In the
example system, this page map would consist of $2^{30}$ $4$-byte
entries, for a total of $2^{32}$ bytes of main memory.

With the technique described in this section, the secondary storage
device represents a very large \emph{circular queue} where each
element of the queue is called a \emph{segment}.  Such a segment
represents a unit of checkpointing.  New segments are added to the
tail of the queue.  Old segments are removed from the head of the
queue as described below.

A segment consists of:

\begin{itemize}
\item a \emph{header} containing metadata about the contents of the
  segment,
\item all the registers of the processor, as observed when creating
  the checkpoint, and
\item a certain number of pages that may have been modified since the
  previous checkpoint.
\end{itemize}

Again, to make the description more concrete, let us imagine that the
number of pages in a segment is around $250$ or so, for a total of
around $1MB$ of page data.  A segment is written as a unit to the
secondary storage device.  If that device is a disk, then the seek
time and rotation delay of the disk will not significantly impact the
transfer of the segment to the disk, because the size of the segment
is sufficiently large that the data-transfer time will dominate.

Furthermore, it is advantageous to keep the secondary storage device
nearly full, because then (if the device is a disk) the head and the
tail of the queue will be physically close, thereby minimizing seek
time.

The header of a segment contains:

\begin{itemize}
\item A list of the unique page number of each of the pages in the
  segment.  For the example segment size, this information occupies
  around $1KB$.
\item A SHA value calculated from the data in the segment.
\item The position of the head of the queue, i.e. the position of the
  first segment to be removed from the secondary device.
\end{itemize}

In addition to the queue of segments, the secondary storage device
contains a single word of information, indicating the tail of the
queue, i.e. the position on the device of the last checkpoint segment
that was written.

The first thing we need to verify at this point is that it is possible
to boot the system, given only the information on the secondary
storage device.  Here is how the system would be booted:

\begin{enumerate}
\item Read the information indicating where the tail of the queue is
  located.
\item Using this information, read the metadata of the last
  checkpointing segment that was written.
\item From this metadata, retrieve the information about the head of
  the queue.
\item Read each segment from the head to the tail of the queue,
  constructing the page map from the metadata of each segment.
\item Load initial pages into main memory, setting up the page tables
  as appropriate.
\item Load the registers stored in the checkpointing segment to
  continue execution, or jump to a default entry point of the system.
\end{enumerate}

Segments are removed from the head of the queue, by a procedure called
\emph{cleaning}.  This procedure will be described later.  For now, we
assume that it is not present.

The system maintains three buffers, each one the size of a segment.
Two buffers are used to alternate, so that one is being written to
secondary memory while the other one (the \emph{active one}) is used
to receive pages in main memory.  The third buffer is used to read
back and compare what was written to secondary storage.  Two counters,
$M$ and $N$, each with an initial value of $0$ is kept for each of two
ordinary segment buffers.  $M$ indicates the first free page in the
active segment buffer, or equivalently, the number of pages that have
already been copied to the buffer.  $N$ indicates the number of dirty
pages that have not yet been copied to the segment buffer.  If ever
$M+N$ reaches the value corresponding to the number of pages in the
buffer (in our example, $250$, then a \emph{checkpoint} is triggered
as described below.

When a page fault occurs, a victim page is chosen using some standard
technique, such as ``least recently used''.  If the victim page is
clean, it is simply discarded and the page map is modified to
reflect the change.  If the victim page is dirty, its contents is
copied to the first free page of the active segment buffer, and the
value of $M$ is incremented.  The unique number of the page is
retrieved from the page map and stored in the header of the active
segment buffer.

All clean pages are read-only.  When an attempt is made to modify a
page, $N$ is incremented and the page is marked as writable.

As mentioned above, when $M+N$ reaches the value corresponding to the
number of available pages in the segment buffer, a checkpoint is
triggered.  The initial operation of a checkpoint is called an
\emph{atomic flip} which involves two segment buffers that we shall
call $A$ and $B$.  $A$ is the current active segment buffer with $M_A+N_A$
having reached its ceiling and $B$ is the next one to be activated
with its $M_B$ and $N_B$ equal to $0$.

The flipping operation must be done atomically, i.e., all executing
threads must be temporarily stopped.  First, the $N_A$ dirty pages not
yet in the buffer are marked as read-only. The active segment
buffer is then set to segment $B$.  Then the $N_A$ pages that
were dirty are copied to segment buffer $A$.  Their respective unique page
numbers are retrieved from the page map and copied to the header of
segment buffer $A$.  The registers of the machine prior to flipping
are then stored in the segment buffer.  Once this is done, previously
stopped threads can continue execution, and the entire segment $A$ is
written to the end of the queue on secondary storage, and $M_A$ and
$N_A$ are set to $0$.

To avoid that the secondary storage device fills up with more and more
checkpoint segments, an activity called \emph{cleaning} works in
parallel with the activity described above.  Conceptually, a segment
is read from the head of the queue and processed as follows.  The
list of unique page numbers in the segment header is examined.  For
each unique page number, the page map in main memory is consulted.
There are two possible outcomes:

\begin{enumerate}
\item The location of the page as indicated by the page map is
  different from the location in the segment being processed.  Then,
  there is a segment further back in the queue that contains a newer
  version of the page.  Therefore, this version of the page is
  obsolete, and is simply discarded.
\item The location of the page as indicated by the page map is the
  same the location in the segment being processed.  Then, this
  version of the page is the most recent one.  In this case, the page
  is copied to the active segment buffer and $M$ is incremented.
\end{enumerate}

When every page in the head segment has been processed this way, the
header of the active segment buffer is updated to reflect that the
complete segment at the head of the queue has been processed and the
following segment on the queue should be processed next.  Notice that
there is no danger in processing pages this way multiple times.  Thus,
if a crash occurs in the middle, there is no harm done.

Now, let us turn our attention to performance.  Clearly, if a disk the
size of the secondary storage device in our example is to be
completely read when the system boots, it will take a very long time
indeed.  We suggest handling this problem by separating the segment
headers from the segment pages either to two separate parts of a
single storage device or to a second device.  Only the headers need to
be read for a page map to be constructed in memory.  The headers are
less than one half of a percent the size of the space occupied by
pages in our example, so booting the system is then much faster.  Even
better, if the segment headers are placed on a persistent solid-state
device, they can be read much faster.