Add a more detailed explanation about softupdates. I originally wrote
this as an article for a German-language newsgroup, and people felt
there was a lot of misconception about the topic going around, so they urged
me to publish it.  Eventually, Matthias Schuendehuette submitted an
English translation of the article to me and asked me to review it.

Finally, Kirk graciously reviewed it, too.

Submitted by:	Matthias Schuendehuette <msch@snafu.de>
Reviewed by:	joerg, kirk
MFC after:	3 days
This commit is contained in:
Joerg Wunsch 2002-01-20 14:09:34 +00:00
parent d4b2892aa1
commit 1f954d79d5
Notes: svn2git 2020-12-08 03:00:23 +00:00
svn path=/head/; revision=11761

@@ -901,6 +901,151 @@ kern.maxfiles: 2088 -> 5000</screen>
filesystem) which is close to full, doing a major update of it, e.g.
<command>make installworld</command>, can run it out of space and
cause the update to fail.</para>
<sect3>
<title>More details about Soft Updates</title>
<indexterm><primary>Soft Updates (Details)</primary></indexterm>
<para>There are two traditional approaches to writing a
filesystem's metadata back to disk. (Metadata updates are
updates to non-content data like i-nodes or directories.)</para>
<para>Historically, the default behaviour was to write out
metadata updates synchronously. If a directory had been
changed, the system waited until the change was actually
written to disk. The file data buffers (file contents) were
nonetheless passed through the buffer cache and written to
disk asynchronously later on. The advantage of this
implementation is that it operates very safely. If there is
a failure during an update, the metadata are always in a
consistent state. A file has either been created completely
or not at all. If the data blocks of a file did not find
their way out of the buffer cache onto the disk by the time
of the crash, &man.fsck.8; is able to recognize this and
repair the filesystem (e.g. the file length will be set to
0). Additionally, the implementation is clear and simple.
The disadvantage is that metadata changes are very slow. A
<command>rm -r</command>, for instance, touches all the files
in a directory sequentially, but each of these directory
changes (deleting a file) will be written synchronously
to the disk. This includes updates to the directory itself,
to the i-node table, and possibly to indirect blocks
allocated by the file. Similar considerations apply for
unrolling large hierarchies (<command>tar -x</command>).</para>
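<para>For illustration only (the device name below is merely a
placeholder; substitute the partition actually affected), this
is the traditional foreground check that would be run against
an unmounted filesystem after a crash, with the
<option>-y</option> flag answering all repair questions:</para>
<screen>&prompt.root; <userinput>fsck -y /dev/da0s1f</userinput></screen>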
<para>The second approach is asynchronous metadata updates. This
is e.g. the default for Linux/ext2fs, or it can be obtained
with <command>mount -o async</command> for *BSD ufs. All
metadata updates are simply passed through the buffer
cache too, that is, they will be intermixed with the updates
of the file content data. The advantage of this
implementation is that there is no need to wait until each
metadata update has been written to disk, so all operations
which cause huge amounts of metadata updates work much
faster than in the synchronous case. Also, the
implementation is still clear and simple, so there is a low
risk of bugs creeping into the code. The disadvantage is
that there is no guarantee at all for a consistent state of
the filesystem. If there is a failure during an operation
that updated large amounts of metadata (like a power
failure, or someone pressing the reset button),
the filesystem
will be left in an unpredictable state. There is no chance
to examine the state of the filesystem when the system
comes up again; the data blocks of a file could already have
been written to the disk while the updates of the i-node
table or the associated directory were not. It is actually
impossible to implement a <command>fsck</command> which is
able to clean up the resulting chaos (because the necessary
information is just not available on the disk). If the
filesystem has been damaged beyond repair, the only choice
is to <command>newfs</command> it and restore it from backup.
</para>
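<para>To make the point concrete (the device node and mount
point here are placeholders only, not a recommendation),
asynchronous metadata updates would be requested explicitly at
mount time like this:</para>
<screen>&prompt.root; <userinput>mount -o async /dev/da0s1e /mnt</userinput></screen>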
<para>The usual solution for this problem was to implement
<emphasis>dirty region logging</emphasis> (sometimes also
referred to as <emphasis>journalling</emphasis>, albeit that
term is not used consistently and is occasionally applied to
other forms of transaction logging as well). Metadata
updates are still written out synchronously, but only into a
small region of the disk. Later on, they will be moved from
there to their proper location. Because the logging
area is only a small, contiguous region on the disk, there
are no long distances for the disk heads to move, even
during heavy operations, so these operations are quite a bit
faster than classical synchronous updates.
Additionally, the complexity of the implementation is fairly
limited, so the risk of bugs creeping in is still low. A
disadvantage is that all metadata are written twice (once into
the logging region and once to the proper location), so for
normal work a performance <quote>pessimization</quote>
might result. On the other hand, in case of a crash, all
pending metadata operations can quickly be either rolled back
or completed from the logging area after the system comes
up again, resulting in a fast filesystem startup.</para>
<para>Kirk McKusick, the developer of Berkeley FFS, solved
this problem with Soft Updates: all pending
metadata updates are kept in memory and written out to disk
in a sorted sequence (<quote>ordered metadata
updates</quote>). This has the effect that, in case of
heavy metadata operations, later updates to an item
<quote>catch</quote> the earlier ones if those are still in
memory and have not already been written to disk. So all
operations on, say, a directory are generally performed in
memory before the update is written to disk (the data
blocks are sorted according to their position so
that they will not be on the disk ahead of their metadata).
In case of a crash, this causes an implicit <quote>log
rewind</quote>: all operations which did not find their way
to the disk appear as if they had never happened. A
consistent filesystem state is maintained that appears to
be the one of 30 to 60 seconds earlier. The
algorithm used guarantees that all resources actually in use
are marked as such in their appropriate bitmaps (blocks and
i-nodes). After a crash, the only resource allocation error
that can occur is that resources are
marked as <quote>used</quote> which are actually <quote>free</quote>.
&man.fsck.8; recognizes this situation,
and frees up the resources that are no longer used. It is safe
to ignore the dirty state of the filesystem after a crash by
forcibly mounting it with <command>mount -f</command>. In
order to free up resources that may be unused, &man.fsck.8;
needs to be run at a later time. This is the idea behind
the <emphasis>background fsck</emphasis>: at system startup
time, only a <emphasis>snapshot</emphasis> of the
filesystem is recorded, against which <command>fsck</command>
can be run later on. All filesystems can then be mounted
<quote>dirty</quote>, and system startup proceeds to
multiuser mode. Then, background <command>fsck</command>s
will be scheduled for all filesystems that need it, to free
up resources that may be unused. (Filesystems that do not use
soft updates still need the usual foreground
<command>fsck</command> though.)</para>
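<para>As a brief sketch of how this is typically enabled (the
device name <filename>/dev/da0s1f</filename> is a placeholder
only; substitute the partition in question), soft updates can
be switched on for an existing, unmounted filesystem with
&man.tunefs.8;:</para>
<screen>&prompt.root; <userinput>tunefs -n enable /dev/da0s1f</userinput></screen>
<para>A filesystem that is newly created with &man.newfs.8; can
have soft updates enabled from the start by passing the
<option>-U</option> flag.</para>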
<para>The advantage is that metadata operations are nearly as
fast as asynchronous updates (i.e. faster than with
<emphasis>logging</emphasis>, which has to write the
metadata twice). The disadvantages are the complexity of
the code (implying a higher risk of bugs in an area that
is highly sensitive regarding loss of user data) and a
higher memory consumption. Additionally there are some
<quote>idiosyncrasies</quote> one has to get used to.
After a crash, the state of the filesystem appears to be
somewhat <quote>older</quote>; e.g. in situations where
the standard synchronous approach would have left some
zero-length files behind after the
<command>fsck</command>, these files do not exist at all
with a soft updates filesystem because neither the metadata
nor the file contents have ever been written to disk.
After a <command>rm</command>, the released disk space is
not instantly available, but only after the updates have
been written to disk. This can in particular cause problems
when installing large amounts of data into a filesystem
that does not have enough free space to hold all the files
twice.</para>
</sect3>
</sect2>
</sect1>