\chapter{Checkpointing}
\label{chap-checkpointing}

The checkpointing mechanism described in this chapter is inspired by
that of the EROS system.

The address of an object can be considered as consisting of two parts:
the \emph{page number} and the \emph{offset within the page}. The
page number directly corresponds to the location on disk of the page.
However, when checkpointing is activated, the available disk memory is
divided into three parts, and the page number should be multiplied by
$3$ to get the first of three disk locations where the object might be
located.%
\footnote{The price to pay for checkpointing is thus that disk memory
  costs three times as much as it does when no checkpointing is
  used.}

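The address computation above can be sketched as follows. This is
only an illustration: the page size of 4\,KiB is borrowed from the
directory example later in this chapter, disk locations are counted in
units of one page, and the names \texttt{split\_address} and
\texttt{candidate\_disk\_locations} are hypothetical.

```python
PAGE_SIZE = 4096      # assumption: 4 KiB pages, as in the later directory example
SLOTS_PER_PAGE = 3    # from the text: three candidate disk locations per page


def split_address(address):
    """Return (page_number, offset_within_page) for a byte address."""
    return divmod(address, PAGE_SIZE)


def candidate_disk_locations(page_number):
    """Return the three consecutive disk locations that may hold the page."""
    first = SLOTS_PER_PAGE * page_number
    return [first, first + 1, first + 2]
```

Which of the three locations holds which version of the page is
recorded separately, in the version table.
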
Checkpointing is divided into \emph{cycles} delimited by
\emph{snapshots}. At any point in time, two checkpointing cycles are
important. The \emph{current} checkpointing cycle started at the
last snapshot and is still going on. The \emph{previous}
checkpointing cycle is the one that ended at the last snapshot.

A page can exist in one, two, or three \emph{versions}, located in
three different places on disk. Version $0$ of the page is the oldest
version, and also the version that would be used when the system is
rebooted after a crash. Version $0$ of the page always exists.
Version $1$ of the page corresponds to the contents of the page as it
was at the end of the \emph{previous} checkpoint cycle. Version $1$
of the page exists if and only if the page was modified during the
previous checkpoint cycle. Version $2$ of the page is the
\emph{current} version of the page. Version $2$ of the page exists if
and only if the page has been modified since the beginning of the
\emph{current} checkpoint cycle. We use the term \emph{page instance}
to refer to a particular version of a particular page.

A page can be associated with a \emph{frame}.%
\footnote{A \emph{frame} is the main-memory instance of a page.} An
attempt to access a page that is not associated with a frame results
in a \emph{page fault}. At most one version of a particular page can
be associated with a frame, and then it is the version with the
highest number. A frame associated with version $0$ or version $1$ of
a page is \emph{write protected}, but a frame associated with version
$2$ of a page is not. Any attempt to modify the contents of a
write-protected frame results in a \emph{write fault}.

A frame can be \emph{clean} or \emph{dirty}. By definition, when the
frame is clean, its contents are identical to those of the associated
page instance. When the frame is dirty, it has been modified after it
was associated with the underlying page instance. A frame that is
associated with version $0$ of a page cannot be dirty. If a frame
that is associated with version $1$ of a page is dirty, then it is
because it was modified during the \emph{previous} checkpointing
cycle, and not the current one.

When a page fault occurs, and there are unused frames, an arbitrary
unused frame is associated with the latest version of the page. If
there are no unused frames when a page fault occurs (which is the
normal situation), a frame that is already associated with a page must
be freed up. To select the frame to free up, an ordinary ALRU method
can be used. If the selected frame is dirty, the contents are written
to the page instance associated with the frame. Finally, the latest
version of the requested page is associated with the selected frame.
If the latest version of the requested page is either version $0$ or
version $1$, then the frame is write protected before execution
resumes.

As indicated above, when a write fault occurs, the frame written to
must be associated with either version $0$ or version $1$ of a page.
If it is associated with version $0$ of the page, then the frame must
be clean. In that case, the association of the frame is modified, so
that it henceforth is associated with version $2$ of the page. Before
execution resumes, the frame is unprotected. As soon as execution
resumes, the frame will be marked as dirty since the reason for the
fault was an attempt to write to it. When a write fault occurs and
the frame is associated with version $1$ of the associated page, the
frame may be either clean or dirty. If it is clean, again, the
association of the frame is modified so that it henceforth is
associated with version $2$ of the page, and again the frame is
unprotected before execution resumes. If the frame is dirty, then its
contents are first written to the associated page instance. Then the
association is changed as before.

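The write-fault cases described above can be summarized in a small
model. This is a sketch under stated assumptions, not the actual
handler: the \texttt{Frame} class, its field names, and the
\texttt{write\_back} callback are hypothetical and stand in for the
real frame bookkeeping and disk I/O.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    """Hypothetical model of a frame; the field names are illustrative."""
    page: int
    version: int              # version of the associated page instance
    dirty: bool = False
    write_protected: bool = True


def handle_write_fault(frame, write_back):
    """Re-associate a write-protected frame with version 2 of its page.

    write_back(page, version) is a caller-supplied function that saves
    the frame contents to the given page instance.
    """
    if frame.version not in (0, 1):
        raise ValueError("only frames for version 0 or 1 are write protected")
    if frame.version == 0 and frame.dirty:
        raise ValueError("a frame for version 0 can never be dirty")
    if frame.dirty:
        # Only possible for version 1: the frame was modified during the
        # previous cycle, so that state is saved before anything else.
        write_back(frame.page, frame.version)
        frame.dirty = False
    frame.version = 2
    frame.write_protected = False
    # The retried write marks the frame dirty once execution resumes.
    return frame
```

The model leaves out the choice of a disk location for the new
version $2$, which is made with the help of the version table.
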
To determine the disk location of each version of each page, we use a
\emph{version table}. The version table is just a sequence of bytes,
one for each page. Only $6$ bits in each byte are actually used. The
two least significant bits indicate the location of version $0$ of the
page. $00$ means the first of the $3$ possible consecutive disk
locations, $01$ means the second, $10$ means the third, and $11$ is
not used. The next two bits indicate the location of version $1$ of
the page, with the same meaning as before, except that $11$ means that
there is no version $1$ of the page. The final two bits indicate the
location of version $2$ of the page with the same interpretation as
for version $1$.

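The layout of a version table entry can be illustrated with a pair of
helper functions. This is a minimal sketch; the names
\texttt{make\_entry} and \texttt{slot\_of} are hypothetical, and slots
are numbered $0$, $1$, and $2$ for the first, second, and third disk
location.

```python
ABSENT = 0b11   # from the text: 11 means "no such version" for versions 1 and 2


def make_entry(v0, v1=ABSENT, v2=ABSENT):
    """Pack the slot numbers of versions 0, 1 and 2 into one table byte."""
    if v0 not in (0, 1, 2):
        raise ValueError("version 0 always exists and must be in slot 0, 1 or 2")
    return v0 | (v1 << 2) | (v2 << 4)


def slot_of(entry, version):
    """Return the slot (0, 1 or 2) of the given version, or None if absent."""
    field = (entry >> (2 * version)) & 0b11
    return None if field == ABSENT else field
```

For instance, \texttt{make\_entry(1, v2=0)} describes a page whose
version $0$ is in the second location, that has no version $1$, and
whose version $2$ is in the first location.
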
At any point in time, there exist three version tables: two on disk
and one in main memory. The two tables on disk play the same role
as the disk tables in EROS, i.e., while one of them is being updated,
the other is still complete and accurate. A single bit in the boot
sector of the disk selects which one should be used at boot time.
When a new version table needs to be written to disk, it is first
written to the place of the unused disk table, and then the boot
sector is written with a flipped selection bit.

The version table in main memory is represented in two levels with a
\emph{directory} of pages. If one page is $4$\,KiB, then one page can
hold $2^{12}$ version table entries. For a $300$\,GB disk (with room
for around $25$ million pages), the directory will contain around
$6000$ entries. A directory entry contains not only a pointer to the
page of table entries, but also a bit indicating whether any of the
table entries in the corresponding page indicates a page which exists
in more than one version. It is expected that a relatively small
fraction of the directory entries in each checkpointing cycle will
have the bit set.

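The figures above can be checked with a short calculation. It assumes
that $1$\,GB means $10^{9}$ bytes and that each page occupies three
disk locations of $2^{12}$ bytes each, as described at the beginning
of this chapter:
\[
  \frac{300 \times 10^{9}}{3 \times 2^{12}} \approx 2.4 \times 10^{7}
  \mbox{ pages},
  \qquad
  \frac{2.4 \times 10^{7}}{2^{12}} \approx 6.0 \times 10^{3}
  \mbox{ directory entries}.
\]
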
When a write fault occurs and as a result a new version of a page is
created, the in-memory version table is consulted. The entry for the
page indicates the disk location of version $0$ of the page, and
sometimes also version $1$ of the page. The disk location for the new
version (version $2$) of the page is chosen to be one of the two
unused ones (if only version $0$ of the page exists) or the only
unused one (if both version $0$ and version $1$ of the page exist).
The location for version $2$ of the page is indicated in the version
table entry by setting bits $4$ and $5$ of the entry to the
corresponding disk location.

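The choice of a disk location for version $2$ can be sketched as
follows. The function name \texttt{allocate\_version\_2} is
hypothetical, and picking the lowest-numbered free location when two
are available is an arbitrary choice made for this sketch; the text
only requires that an unused location be chosen.

```python
def allocate_version_2(entry):
    """Return the entry with bits 4 and 5 set to a free slot for version 2."""
    absent = 0b11
    v0 = entry & 0b11
    v1 = (entry >> 2) & 0b11
    if (entry >> 4) & 0b11 != absent:
        raise ValueError("version 2 already exists")
    used = {v0} if v1 == absent else {v0, v1}
    # Picking the lowest free slot is an arbitrary choice made for this sketch.
    slot = min(s for s in (0, 1, 2) if s not in used)
    return (entry & 0b001111) | (slot << 4)
```
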
In parallel with mutator threads, one or more threads scan the page
table of the operating system for dirty frames. When a dirty frame
corresponding to version $1$ of a page is found, the contents of the
frame are saved to its associated page instance, and the dirty bit is
cleared. When there are no more dirty frames corresponding to version
$1$ pages, the set of page instances corresponding to all version $1$
pages, and to version $0$ pages where no version $1$ exists,
represents the state of the system at the time of the last snapshot.

To save the coherent state of the system to disk, the in-memory
version table directory is scanned. Whenever a directory entry is
found with the bit indicating the existence of pages with several
versions set, the page of table entries that the directory entry
points to is saved to disk. When the entire version table has been
scanned, a new boot sector is written to indicate that the newly saved
table is the current one.

The final action to take in order to finish the current checkpointing
cycle and begin a new one is an \emph{atomic flip}. This atomic flip
consists of turning all version $1$ pages into version $0$ pages and
all version $2$ pages into version $1$ pages. To do that, mutator
threads must be stopped. Then the in-memory version table is scanned.
Whenever an entry is found that has a version other than $0$ in it, it
is modified. If both a version $1$ and a version $2$ exist, bits $2$
and $3$ of the entry are moved to positions $0$ and $1$, bits $4$ and
$5$ are moved to positions $2$ and $3$, and positions $4$ and $5$ are
set to $11$. If no version $1$ exists, then bits $4$ and $5$ are
moved to positions $2$ and $3$, and positions $4$ and $5$ are set to
$11$. Finally, mutator threads are restarted.

The easiest way to modify a version table entry is probably to create
a 64-byte table in memory which, for each possible value of an
existing version table entry, gives the new value. Even though each
lookup requires a memory access, this table will quickly be in the
cache, so access will be fast.

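The flip rules and the 64-byte lookup table can be sketched together.
This is an illustration only, and the names \texttt{flip\_entry},
\texttt{FLIP\_TABLE}, and \texttt{flip\_table\_page} are hypothetical.
Two assumptions go beyond the text: an entry with a version $1$ but no
version $2$ is handled by the same shift as an entry with both, and
the two unused high bits of each byte are assumed to be zero.

```python
ABSENT = 0b11


def flip_entry(entry):
    """Apply the atomic-flip rules to one 6-bit version table entry."""
    v0 = entry & 0b11
    v1 = (entry >> 2) & 0b11
    v2 = (entry >> 4) & 0b11
    if v1 != ABSENT:
        # Version 1 becomes version 0 and version 2 (if any) becomes version 1.
        # Assumption: the same shift is used when only version 1 exists.
        return v1 | (v2 << 2) | (ABSENT << 4)
    if v2 != ABSENT:
        # No version 1: version 0 stays put and version 2 becomes version 1.
        return v0 | (v2 << 2) | (ABSENT << 4)
    return entry  # only version 0 exists; nothing to do


# The 64-byte lookup table: the new entry value for every possible old value.
# Values that cannot occur (such as a version 0 field of 11) are mapped too,
# which keeps the construction uniform.
FLIP_TABLE = bytes(flip_entry(e) for e in range(64))


def flip_table_page(page):
    """Flip every entry in a bytes object holding version table entries."""
    return bytes(FLIP_TABLE[e] for e in page)
```

With the table in place, flipping an entry is a single indexed load
and no longer involves any branching on the entry contents.
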
To get an idea of the performance of the atomic flip, let us take a
situation where the \emph{working set} is no bigger than the size of
main memory.%
\footnote{If the working set is larger than the main memory,
performance is likely to deteriorate for more fundamental reasons.}
Furthermore, let us say that the size of main memory is $64$\,GiB and
that around half the pages of the working set are modified in a
particular checkpointing cycle. If we assume that the modified pages
are concentrated with respect to the version table directory, then we
can ignore the time to scan the version table directory. To
accomplish the flip, we then need to modify $2^{23}$ entries. If we
assume modified entries are adjacent, we can load and store $8$ of
them at a time, requiring $2^{21}$ memory accesses. If a memory
access takes around $10$\,ns, the flip will take around $20$\,ms.

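The estimate above can be spelled out step by step. The calculation
assumes pages of $2^{12}$ bytes, a working set that fills main memory,
and one load plus one store for each group of $8$ entries:
\[
  \frac{2^{36}}{2^{12}} = 2^{24} \mbox{ pages},
  \qquad
  \frac{2^{24}}{2} = 2^{23} \mbox{ modified entries},
  \qquad
  2 \cdot \frac{2^{23}}{8} = 2^{21} \mbox{ memory accesses},
\]
\[
  2^{21} \times 10\,\mbox{ns} \approx 21\,\mbox{ms}.
\]
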
The time for a flip can be made shorter by taking more frequent
snapshots, since fewer pages are then likely to be modified in each
checkpointing cycle.

%% LocalWords: checkpointing mutator