You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

181 lines
9.9 KiB
TeX

\chapter{Checkpointing}
\label{chap-checkpointing}
The checkpointing mechanism described in this chapter is inspired by
that of the EROS system.
The address of an object can be considered as consisting of two parts:
the \emph{page number} and the \emph{offset within the page}. The
page number directly corresponds to the location on disk of the page.
However, when checkpointing is activated, the available disk memory is
divided into three parts, and the page number should be multiplied by
3 to get the first of three disk locations where the object might be
located.%
\footnote{The price to pay for checkpointing is thus that disk memory
will cost a factor 3 as much compared to the price when no
checkpointing is used.}
Checkpointing is divided into \emph{cycles} delimited by
\emph{snapshots}. At any point in time, two checkpointing cycles are
important. The \emph{current} checkpointing cycle started at the
last snapshot and is still going on. The \emph{previous}
checkpointing cycle is the one that ended at the last snapshot.
A page can exist in one, two, or three \emph{versions}, located in
three different places on disk. Version $0$ of the page is the oldest
version, and also the version that would be used when the system is
rebooted after a crash. Version $0$ of the page always exists.
Version $1$ of the page corresponds to the contents of the page as it
was at the end of the \emph{previous} checkpoint cycle. Version $1$
of the page exists if and only if the page was modified during the
previous checkpoint cycle. Version $2$ of the page is the
\emph{current} version of the page. Version $2$ of the page exists if
and only if the page has been modified since the beginning of the
\emph{current} checkpoint cycle. We use the word \emph{page instance}
to refer to a particular version of a particular page.
A page can be associated with a \emph{frame}.%
\footnote{A \emph{frame} is the main-memory instance of a page.} An
attempt to access a page that is not associated with a frame results
in a \emph{page fault}. At most one version of a particular page can
be associated with a frame, and then it is the version with the
highest number. A frame associated with version $0$ or version $1$ of
a page is \emph{write protected}, but a frame associated with version
$2$ of a page is not. Any attempt to modify the contents of a
write-protected frame results in a \emph{write fault}.
A frame can be \emph{clean} or \emph{dirty}. By definition, when the
frame is clean, its contents are identical to those of the associated
page instance. When the frame is dirty, it means that it has been
modified after it was associated with the underlying page instance. A
frame that is associated with version $0$ of a page can not be dirty.
If a frame that is associated with version $1$ of a page is dirty,
then it is because it was modified during the \emph{previous}
checkpointing cycle, and not the current one.
When a page fault occurs, and there are unused frames, an arbitrary
unused frame is associated with the latest version of the page. If
there are no unused frames when a page fault occurs (which is the
normal situation), a frame that is already associated with a page must
be freed up. To select the frame to free up, an ordinary ALRU method
can be used. If the selected frame is dirty, the contents are written
to the page instance associated with the frame. Finally, the latest
version of the requested page is associated with the selected frame.
If the latest version of the requested page is either version $0$ or
version $1$, then the frame is write protected before execution
resumes.
As indicated above, when a write fault occurs, the frame written to
must be associated with either version $0$ or version $1$ of a page.
If it is associated with version $0$ of the page, then the frame must
be clean. In that case, the association of the frame is modified, so
that it henceforth is associated with version $2$ of the page. Before
execution resumes, the frame is unprotected. As soon as execution
resumes, the frame will be marked as dirty since the reason for the
fault was an attempt to write to it. When a write fault occurs and
the frame is associated with version $1$ of the associated page, the
frame may be either clean or dirty. If it is clean, again, the
association of the frame is modified so that it henceforth is
associated with version $2$ of the page, and again the frame is
unprotected before execution resumes. If the frame is dirty, then its
contents are first written to the associated page instance. Then the
association is changed as before.
To determine the disk location of each version of each page, we use a
\emph{version table}. The version table is just a sequence of bytes,
one for each page. Only 6 bits in each byte are actually used. The
two least significant bits indicate the location of version $0$ of the
page. $00$ means the first of the $3$ possible consecutive disk
locations, $01$ means the second and $10$ means the third, and $11$ is
not used. The next two bits indicate the location of version $1$ of
the page, with the same meaning as before, except that $11$ means that
there is no version $1$ of the page. The final two bits indicate the
location of version $2$ of the page with the same interpretation as
for version $1$.
At any point in time, there exist three version tables; two on disk
and one in main memory. The two versions on disk play the same role
as the disk tables in EROS, i.e., while one of them is being updated,
the other is still complete and accurate. A single bit in the boot
sector of the disk selects which one should be used at boot time.
When a new version table needs to be written to disk, it is first
written to the place of the unused disk table, and then the boot
sector is written with a flipped selection bit.
The version table in main memory is represented in two levels with a
\emph{directory} of pages. If one page is 4kiB, then one page can
hold $2^{12}$ version table entries. For a $300GB$ disk (with room
for around $25$ million pages), the directory will contain around
$6000$ entries. A directory entry contains not only a pointer to the
page of table entries, but also a bit indicating whether any of the
table entries in the corresponding page indicates a page which exists
in more than one version. It is expected that a relatively small
fraction of the directory entries in each checkpointing cycle with
have the bit set.
When a write fault occurs and as a result a new version of a page is
created, the in-memory version table is consulted. The entry for the
page indicates the disk location of version $0$ of the page, and
sometimes also version $1$ of the page. The disk location for the new
version (version $2$) of the page is chosen to be one of the two
unused ones (if only version $0$ of the page exists) or the only
unused one (if both version $0$ and version $1$ of the page exists).
The location for version $2$ of the page is indicated in the version
table entry by setting bits $4$ and $5$ of the entry to the
corresponding disk location.
In parallel with mutator threads, one or more threads scan the page
table of the operating system for dirty frames. When a dirty frame
corresponding to version $1$ of a page is found, the contents of the
frame is saved to its associated page instance, and the dirty-bit is
cleared. When there are no more dirty frames corresponding to version
$1$ pages, the set of page instances corresponding to all version $1$
pages and version $0$ pages where no version $1$ exists represents the
state of the system at the time of the last snapshot.
To save the coherent state of the system to disk, the in-memory
version table directory is scanned. Whenever a directory entry with
the bit indicating the existence of pages with several versions set,
the page of the directory entry is saved to disk. When the entire
version table has been scanned, a new boot sector is written
to indicate that the newly saved table is the current one.
The final action to take in order to finish the current checkpointing
cycle and begin a new one is an \emph{atomic flip}. This atomic flip
consists of turning all version $1$ pages into version $0$ pages and
all version $2$ pages into version $1$ pages. To do that, mutator
threads must be stopped. Then the in-memory version table is scanned.
Whenever an entry is found that has a version other than $0$ in it, it
is modified. If both a version $1$ and a version $2$ exists, bits $2$
and $3$ of the entry are moved to position $0$ and $1$, bits $4$ and
$5$ are moved to positions $2$ and $3$, and positions $4$, and $5$ are
set to $11$. If no version $1$ exists, then bits $4$ and $5$ are
moved to positions $2$ and $3$, and positions $4$, and $5$ are set to
$11$. Finally, mutator threads are restarted.
The easiest way to modify a version table entry is probably to create
a 64-byte table in memory which, for each possible version of the
existing version table entry gives the new version. Even though it
would require a memory access, this table will quickly be in the
cache, so access will be fast.
To get an idea of performance of the atomic flip, let us take a
situation where the \emph{working set} is no bigger than the size of
main memory.%
\footnote{If the working set is larger than the main memory,
performance is likely to deteriorate for more fundamental reasons.}
Furthermore, let us say that the size of main memory is $64GiB$ and
that around half the pages of the working set are modified in a
particular checkpointing cycle. If we assume that the modified pages
are concentrated with respect to the version table directory, then we
can ignore the time to scan the version table directory. To
accomplish the flip, we then need to modify $2^{23}$ entries. If we
assume modified entries are adjacent, we can load and store $8$ of
them at a time, requiring $2^{21}$ memory accesses. If a memory
access takes around $10$ns, the flip will take around $20$ms.
The time for a flip can be made shorter by taking more frequent
snapshots.
%% LocalWords: checkpointing mutator