Add a more detailed explanation about softupdates. I originally wrote
this as an article for a German-language newsgroup, and people felt
there were a lot of misconceptions about the topic, so they urged me
to publish it. Eventually, Matthias Schuendehuette submitted an English
translation of the article to me and asked me to review it. Finally,
Kirk graciously reviewed it, too.

Submitted by: Matthias Schuendehuette <msch@snafu.de>
Reviewed by: joerg, kirk
MFC after: 3 days
parent
d4b2892aa1
commit
1f954d79d5
Notes:
svn2git
2020-12-08 03:00:23 +00:00
svn path=/head/; revision=11761
1 changed files with 145 additions and 0 deletions
@@ -901,6 +901,151 @@ kern.maxfiles: 2088 -> 5000</screen>
filesystem) which is close to full, doing a major update of it, e.g.
<command>make installworld</command>, can run it out of space and
cause the update to fail.</para>
<sect3>
<title>More details about Soft Updates</title>

<indexterm><primary>Soft Updates (Details)</primary></indexterm>
<para>There are two traditional approaches to writing a
filesystem's metadata back to disk. (Metadata updates are updates
to non-content data such as i-nodes or directories.)</para>
<para>Historically, the default behaviour was to write out
metadata updates synchronously. If a directory had been
changed, the system waited until the change was actually
written to disk. The file data buffers (file contents) were
passed through the buffer cache, however, and written
to disk later on asynchronously. The advantage of this
implementation is that it operates very safely. If there is
a failure during an update, the metadata are always in a
consistent state. A file has either been created completely
or not at all. If the data blocks of a file did not find
their way out of the buffer cache onto the disk by the time
of the crash, &man.fsck.8; is able to recognize this and
repair the filesystem (e.g. the file length will be set to
0). Additionally, the implementation is clear and simple.
The disadvantage is that metadata changes are very slow. An
<command>rm -r</command>, for instance, touches all files of a
directory sequentially, but each of these directory
changes (deletion of a file) is written synchronously
to the disk. This includes updates to the directory itself,
to the i-node table, and possibly to indirect blocks
allocated by the file. Similar considerations apply to
unpacking large hierarchies (<command>tar -x</command>).</para>
<para>The second approach is asynchronous metadata updates. This
is e.g. the default for Linux/ext2fs, or is achieved by
<command>mount -o async</command> for *BSD ufs. All
metadata updates are simply passed through the buffer
cache too, that is, they are intermixed with the updates
of the file content data. The advantage of this
implementation is that there is no need to wait until each
metadata update has been written to disk, so all operations
which cause huge amounts of metadata updates work much
faster than in the synchronous case. Also, the
implementation is still clear and simple, so there is a low
risk of bugs creeping into the code. The disadvantage is
that there is no guarantee at all of a consistent state of
the filesystem. If there is a failure during an operation
that updated large amounts of metadata (like a power
failure, or someone pressing the reset button),
the filesystem
will be left in an unpredictable state. There is no opportunity
to examine the state of the filesystem when the system
comes up again; the data blocks of a file could already have
been written to the disk while the updates of the i-node
table or the associated directory were not. It is actually
impossible to implement an <command>fsck</command> which is
able to clean up the resulting chaos (because the necessary
information is just not available on the disk). If the
filesystem has been damaged beyond repair, the only choice
is to <command>newfs</command> it and restore it from backup.</para>
<para>The usual solution to this problem was to implement
<emphasis>dirty region logging</emphasis> (sometimes also
referred to as <emphasis>journalling</emphasis>, although that
term is not used consistently and is occasionally applied
to other forms of transaction logging as well). Metadata
updates are still written out synchronously, but only into a
small region of the disk. Later on they are distributed
from there to their proper location. Because the logging
area is only a small, contiguous region on the disk, there
are no long distances for the disk heads to move, even
during heavy operations, so these operations are
quite a bit faster than with classical synchronous updates.
Additionally, the complexity of the implementation is fairly
limited, and thus the risk of bugs is still low. A disadvantage
is that all metadata are written twice (once into the
logging region and once to the proper location), so for
normal work, a performance <quote>pessimization</quote>
might result. On the other hand, in case of a crash, all
pending metadata operations can quickly be either rolled back
or completed from the logging area after the system comes
up again, resulting in a fast filesystem startup.</para>
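<para>The double write and the replay after a crash can be
illustrated with a small toy model. The following is only a
sketch in Python under simplifying assumptions, not the on-disk
format or code of any real logging filesystem. The class
<literal>ToyJournal</literal> and all of its method and attribute
names are hypothetical and exist only for this illustration.</para>

<programlisting>
```python
class ToyJournal:
    """Toy model of metadata logging (hypothetical, for illustration only)."""

    def __init__(self):
        self.log = []         # small contiguous log region on "disk"
        self.home = {}        # metadata at its proper on-disk location
        self.disk_writes = 0  # counts every simulated disk write

    def update(self, item, value):
        # First write: the record goes synchronously into the log region.
        self.log.append((item, value))
        self.disk_writes += 1

    def checkpoint(self):
        # Second write: distribute logged records to their proper location.
        for item, value in self.log:
            self.home[item] = value
            self.disk_writes += 1
        self.log.clear()

    def recover(self):
        # After a crash the log is still on disk; replaying it completes
        # the pending operations.
        self.checkpoint()


journal = ToyJournal()
journal.update("inode 12", "allocated")
journal.update("dir /tmp", "entry foo, inode 12")
# Simulated crash here: nothing has reached its proper location yet.
state_at_crash = dict(journal.home)
journal.recover()
```
</programlisting>

<para>In this sketch, two metadata updates cost four simulated
disk writes, which corresponds to the <quote>written twice</quote>
overhead described above, while recovery only has to read the
small log.</para>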
<para>Kirk McKusick, the developer of Berkeley FFS, solved
this problem with Soft Updates: all pending
metadata updates are kept in memory and written out to disk
in a sorted sequence (<quote>ordered metadata
updates</quote>). This has the effect that, in case of
heavy metadata operations, later updates to an item
<quote>catch</quote> the earlier ones if those are still in
memory and have not already been written to disk. So all
operations on, say, a directory are generally performed in
memory before the update is written to disk (the data
blocks are ordered accordingly as well, so
that they will not be on the disk ahead of their metadata).
In case of a crash this causes an implicit <quote>log
rewind</quote>: all operations which did not find their way
to the disk appear as if they had never happened. A
consistent filesystem state is maintained that appears to
be the one of 30 to 60 seconds earlier. The
algorithm used guarantees that all resources actually in use
are marked as such in their appropriate bitmaps: blocks and i-nodes.
After a crash, the only resource allocation error
that occurs is that resources are
marked as <quote>used</quote> which are actually <quote>free</quote>.
&man.fsck.8; recognizes this situation
and frees the resources that are no longer used. It is safe to
ignore the dirty state of the filesystem after a crash by
forcibly mounting it with <command>mount -f</command>. In
order to free resources that may be unused, &man.fsck.8;
needs to be run at a later time. This is the idea behind
the <emphasis>background fsck</emphasis>: at system startup
time, only a <emphasis>snapshot</emphasis> of the
filesystem is recorded, which <command>fsck</command> can be
run against later on. All filesystems can then be mounted
<quote>dirty</quote>, and system startup proceeds to
multiuser mode. Then, background <command>fsck</command>s
are scheduled for all filesystems that need it, to free
resources that may be unused. (Filesystems that do not use
Soft Updates still need the usual foreground
<command>fsck</command>, though.)</para>
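<para>The two effects described above, later updates catching
earlier ones in memory and writes reaching the disk in dependency
order, can be sketched with a toy model. This is a deliberately
simplified illustration in Python and not the actual Soft Updates
algorithm: it ignores cyclic dependencies, data blocks, and the
allocation bitmaps. The class <literal>ToySoftUpdates</literal>,
its <literal>needs</literal> parameter, and the item names are
hypothetical.</para>

<programlisting>
```python
class ToySoftUpdates:
    """Toy model of ordered, coalesced metadata writes (illustration only)."""

    def __init__(self):
        self.disk = {}        # metadata as it currently exists on disk
        self.pending = {}     # item -> (latest value, prerequisite items)
        self.disk_writes = 0  # counts every simulated disk write

    def update(self, item, value, needs=()):
        # A later update to the same item replaces ("catches") the pending one.
        self.pending[item] = (value, tuple(needs))

    def flush(self):
        """Write all pending updates, prerequisites first; return the order."""
        order = []

        def write(item):
            if item not in self.pending:
                return  # nothing pending for this item
            value, needs = self.pending.pop(item)
            for prerequisite in needs:
                write(prerequisite)
            self.disk[item] = value
            self.disk_writes += 1
            order.append(item)

        while self.pending:
            write(next(iter(self.pending)))
        return order


fs = ToySoftUpdates()
# The directory entry must not reach the disk before the i-node it names.
fs.update("dir /tmp", "entry foo, inode 12", needs=["inode 12"])
fs.update("inode 12", "size 0")
fs.update("inode 12", "size 4096")  # catches the update still in memory
state_at_crash = dict(fs.disk)      # a crash now would leave the old state
write_order = fs.flush()
```
</programlisting>

<para>In this sketch, three updates result in only two simulated
disk writes, the i-node is written before the directory entry
that refers to it, and a crash before the flush leaves the old
on-disk state untouched, as if the operations had never
happened.</para>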
<para>The advantage is that metadata operations are nearly as
fast as asynchronous updates (i.e. faster than with
<emphasis>logging</emphasis>, which has to write the
metadata twice). The disadvantages are the complexity of
the code (implying a higher risk of bugs in an area that
is highly sensitive regarding loss of user data), and
higher memory consumption. Additionally, there are some
<quote>idiosyncrasies</quote> one has to get used to.
After a crash, the state of the filesystem appears to be
somewhat <quote>older</quote>; e.g. in situations where
the standard synchronous approach would have caused some
zero-length files to remain after the
<command>fsck</command>, these files do not exist at all
with a Soft Updates filesystem, because neither the metadata
nor the file contents have ever been written to disk.
After an <command>rm</command>, the released disk space is
not available instantly, but only after the updates have been
written to disk. This can in particular cause problems
when installing large amounts of data into a filesystem
that does not have enough free space to hold all the files
twice.</para>
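<para>The delayed release of disk space can be illustrated with a
toy model of free-space accounting. This is only a sketch in
Python under simplifying assumptions; the class
<literal>ToyVolume</literal>, its methods, and the block counts
are hypothetical and do not correspond to any real filesystem
interface.</para>

<programlisting>
```python
class ToyVolume:
    """Toy free-space accounting (hypothetical, for illustration only)."""

    def __init__(self, total_blocks):
        self.free = total_blocks
        self.pending_free = 0  # blocks of removed files, not yet released

    def write_file(self, blocks):
        if blocks > self.free:
            raise OSError("filesystem full")
        self.free -= blocks

    def remove_file(self, blocks):
        # The space is released only once the update reaches the disk.
        self.pending_free += blocks

    def flush(self):
        # Pending updates are written out; the space becomes available.
        self.free += self.pending_free
        self.pending_free = 0


volume = ToyVolume(total_blocks=100)
volume.write_file(70)      # old version of the files
volume.remove_file(70)     # removed, but the space is still pending
try:
    volume.write_file(70)  # the new version does not fit yet
    second_copy_fits = True
except OSError:
    second_copy_fits = False
volume.flush()             # pending updates are written out
volume.write_file(70)      # now the new version fits
```
</programlisting>

<para>In this sketch the second copy fails to fit immediately
after the removal and succeeds only after the pending updates have
been flushed, which mirrors the situation described above for a
filesystem that cannot hold all the files twice.</para>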
</sect3>
</sect2>
</sect1>