Whitespace changes for clarity.
Translators please ignore.
parent f8f3aa5727
commit 27ba1359ad
Notes: svn2git 2020-12-08 03:00:23 +00:00
svn path=/head/; revision=16781
1 changed file with 370 additions and 296 deletions
@@ -48,62 +48,70 @@
<indexterm><primary>RAID</primary>
<secondary>Software</secondary></indexterm>

<para><emphasis>Vinum</emphasis> is a so-called
<emphasis>Volume Manager</emphasis>, a virtual disk driver that
addresses these three problems. Let us look at them in more detail.
Various solutions to these problems have been proposed and
implemented:</para>

<para>Disks are getting bigger, but so are data storage requirements.
Often you will find you want a file system that is bigger than the
disks you have available. Admittedly, this problem is not as acute as
it was ten years ago, but it still exists. Some systems have solved
this by creating an abstract device which stores its data on a number
of disks.</para>

</sect1>

<sect1 id="vinum-access-bottlenecks">
<title>Access bottlenecks</title>

<para>Modern systems frequently need to access data in a highly
concurrent manner. For example, large FTP or HTTP servers can
maintain thousands of concurrent sessions and have multiple
100 Mbit/s connections to the outside world, well beyond the
sustained transfer rate of most disks.</para>

<para>Current disk drives can transfer data sequentially at up to
70 MB/s, but this value is of little importance in an environment
where many independent processes access a drive, where they may
achieve only a fraction of these values. In such cases it is more
interesting to view the problem from the viewpoint of the disk
subsystem: the important parameter is the load that a transfer places
on the subsystem, in other words the time for which a transfer
occupies the drives involved in the transfer.</para>

<para>In any disk transfer, the drive must first position the heads,
wait for the first sector to pass under the read head, and then
perform the transfer. These actions can be considered to be atomic:
it does not make any sense to interrupt them.</para>

<para><anchor id="vinum-latency">
Consider a typical transfer of about 10 kB: the current generation
of high-performance disks can position the heads in an average of
3.5 ms. The fastest drives spin at 15,000 rpm, so the average
rotational latency (half a revolution) is 2 ms. At 70 MB/s, the
transfer itself takes about 150 &mu;s, almost nothing compared to
the positioning time. In such a case, the effective transfer rate
drops to a little over 1 MB/s and is clearly highly dependent on
the transfer size.</para>

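The arithmetic above can be verified with a short calculation (a sketch added for illustration, not part of the Handbook; the figures are the ones quoted in the paragraph):

```python
# Effective transfer rate of a small random I/O, using the figures
# quoted above: 3.5 ms average seek, 15,000 rpm (2 ms average
# rotational latency) and a 70 MB/s sequential transfer rate.
seek_s = 3.5e-3                      # average head positioning time
rotational_s = 0.5 * 60 / 15000      # half a revolution at 15,000 rpm
transfer_rate = 70e6                 # bytes per second, sequential
io_size = 10 * 1024                  # a typical 10 kB transfer

transfer_s = io_size / transfer_rate          # ~150 us, almost negligible
total_s = seek_s + rotational_s + transfer_s  # dominated by positioning
effective_rate = io_size / total_s            # bytes per second

print(f"transfer alone:  {transfer_s * 1e6:.0f} us")
print(f"effective rate:  {effective_rate / 1e6:.2f} MB/s")
```

The positioning time dwarfs the transfer itself, so the effective rate ends up around 1–2 MB/s, a small fraction of the 70 MB/s sequential figure.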
<para>The traditional and obvious solution to this bottleneck is
<quote>more spindles</quote>: rather than using one large disk, it
uses several smaller disks with the same aggregate storage space.
Each disk is capable of positioning and transferring independently,
so the effective throughput increases by a factor close to the number
of disks used.</para>

<para>The exact throughput improvement is, of course, smaller than
the number of disks involved: although each drive is capable of
transferring in parallel, there is no way to ensure that the requests
are evenly distributed across the drives. Inevitably the load on one
drive will be higher than on another.</para>

<indexterm>
<primary>disk concatenation</primary>
@@ -113,20 +121,22 @@
<secondary>concatenation</secondary>
</indexterm>

<para>The evenness of the load on the disks is strongly dependent on
the way the data is shared across the drives. In the following
discussion, it is convenient to think of the disk storage as a large
number of data sectors which are addressable by number, rather like
the pages in a book. The most obvious method is to divide the
virtual disk into groups of consecutive sectors the size of the
individual physical disks and store them in this manner, rather like
taking a large book and tearing it into smaller sections. This
method is called <emphasis>concatenation</emphasis> and has the
advantage that the disks are not required to have any specific size
relationships. It works well when the access to the virtual disk is
spread evenly about its address space. When access is concentrated
on a smaller area, the improvement is less marked.
<xref linkend="vinum-concat"> illustrates the sequence in which
storage units are allocated in a concatenated organization.</para>

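Following the book analogy, the concatenated mapping amounts to a simple address calculation: walk the disks in order until the virtual sector falls inside one of them. This is an illustrative sketch added here, not Vinum code, and the disk sizes are invented:

```python
# Map a virtual sector number onto (disk, local sector) in a
# concatenated organization.
def concat_map(sector, disk_sizes):
    for disk, size in enumerate(disk_sizes):
        if sector < size:
            return disk, sector
        sector -= size          # skip past this disk's address range
    raise ValueError("sector beyond end of virtual disk")

# Three disks of different sizes -- concatenation does not require
# any size relationship between the disks.
sizes = [1000, 500, 2000]
print(concat_map(0, sizes))     # first sector of disk 0: (0, 0)
print(concat_map(1200, sizes))  # falls inside disk 1: (1, 200)
```

Note how consecutive sectors stay on the same disk until it is exhausted, which is why concentrated access patterns gain little from concatenation.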
<para>
@@ -144,26 +154,28 @@
<secondary>striping</secondary>
</indexterm>

<para>An alternative mapping is to divide the address space into
smaller, equal-sized components and store them sequentially on
different devices. For example, the first 256 sectors may be stored
on the first disk, the next 256 sectors on the next disk and so on.
After filling the last disk, the process repeats until the disks are
full. This mapping is called <emphasis>striping</emphasis> or
<acronym>RAID-0</acronym>

<footnote>
<indexterm><primary>RAID</primary></indexterm>

<para><acronym>RAID</acronym> stands for <emphasis>Redundant Array
of Inexpensive Disks</emphasis> and offers various forms of fault
tolerance, though the latter term is somewhat misleading: it
provides no redundancy.</para>
</footnote>.

Striping requires somewhat more effort to locate the data, and it can
cause additional I/O load where a transfer is spread over multiple
disks, but it can also provide a more constant load across the disks.
<xref linkend="vinum-striped"> illustrates the sequence in which
storage units are allocated in a striped organization.</para>

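The striped mapping described above (256 sectors on one disk, the next 256 on the next, wrapping around after the last disk) can be sketched the same way. This is a hypothetical illustration, not Vinum's implementation:

```python
# Map a virtual sector onto (disk, local sector) in a striped
# organization with a fixed stripe size.
def stripe_map(sector, num_disks, stripe_size=256):
    stripe, offset = divmod(sector, stripe_size)   # which stripe, where in it
    disk = stripe % num_disks                      # stripes rotate across disks
    local = (stripe // num_disks) * stripe_size + offset
    return disk, local

print(stripe_map(0, 4))     # (0, 0):   first stripe on disk 0
print(stripe_map(256, 4))   # (1, 0):   next stripe moves to disk 1
print(stripe_map(1024, 4))  # (0, 256): after disk 3, wrap back to disk 0
```

Consecutive stripes land on different disks, which is what evens out the load; the extra `divmod` per request is the "somewhat more effort to locate the data" mentioned above.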
<para>
<figure id="vinum-striped">
@@ -175,11 +187,13 @@
<sect1 id="vinum-data-integrity">
<title>Data integrity</title>

<para>The final problem with current disks is that they are
unreliable. Although disk drive reliability has increased
tremendously over the last few years, they are still the most likely
core component of a server to fail. When they do, the results can be
catastrophic: replacing a failed disk drive and restoring data to it
can take days.</para>

<indexterm>
<primary>disk mirroring</primary>
@@ -195,10 +209,11 @@
<para>The traditional way to approach this problem has been
<emphasis>mirroring</emphasis>, keeping two copies of the data on
different physical hardware. Since the advent of the
<acronym>RAID</acronym> levels, this technique has also been called
<acronym>RAID level 1</acronym> or <acronym>RAID-1</acronym>. Any
write to the volume writes to both locations; a read can be satisfied
from either, so if one drive fails, the data is still available on
the other drive.</para>

<para>Mirroring has two problems:</para>
@@ -211,24 +226,26 @@
<listitem>
<para>The performance impact. Writes must be performed to both
drives, so they take up twice the bandwidth of a non-mirrored
volume. Reads do not suffer from a performance penalty: it even
looks as if they are faster.</para>
</listitem>
</itemizedlist>

<para><indexterm><primary>RAID-5</primary></indexterm>An alternative
solution is <emphasis>parity</emphasis>, implemented in the
<acronym>RAID</acronym> levels 2, 3, 4 and 5. Of these,
<acronym>RAID-5</acronym> is the most interesting. As implemented in
Vinum, it is a variant on a striped organization which dedicates one
block of each stripe to parity of the other blocks. As implemented
by Vinum, a <acronym>RAID-5</acronym> plex is similar to a striped
plex, except that it implements <acronym>RAID-5</acronym> by
including a parity block in each stripe. As required by
<acronym>RAID-5</acronym>, the location of this parity block changes
from one stripe to the next. The numbers in the data blocks indicate
the relative block numbers.</para>

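The parity idea can be illustrated with XOR, which is how RAID-5 parity is conventionally computed (a generic sketch added for illustration, not Vinum's code): the parity block is the byte-wise XOR of the data blocks in a stripe, so any single missing block can be recomputed from the survivors.

```python
from functools import reduce

def parity(blocks):
    """Parity block of a stripe: byte-wise XOR of the given blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# One stripe of three data blocks plus its parity block.
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
p = parity(data)

# If the drive holding data[1] fails, its block is the XOR of the
# surviving data blocks and the parity block -- this is exactly the
# "degraded mode" reconstruction described for RAID-5.
rebuilt = parity([data[0], data[2], p])
assert rebuilt == data[1]
```

This also makes the write penalty visible: updating one data block means the parity block must be recomputed and rewritten as well.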
<para>
<figure id="vinum-raid5-org">
@@ -237,13 +254,15 @@
</figure>
</para>

<para>Compared to mirroring, <acronym>RAID-5</acronym> has the
advantage of requiring significantly less storage space. Read access
is similar to that of striped organizations, but write access is
significantly slower, approximately 25% of the read performance. If
one drive fails, the array can continue to operate in degraded mode:
a read from one of the remaining accessible drives continues
normally, but a read from the failed drive is recalculated from the
corresponding block from all the remaining drives.</para>
</sect1>

@@ -261,28 +280,33 @@
</listitem>

<listitem>
<para>Volumes are composed of <emphasis>plexes</emphasis>, each of
which represents the total address space of a volume. This level in
the hierarchy thus provides redundancy. Think of plexes as
individual disks in a mirrored array, each containing the same
data.</para>
</listitem>

<listitem>
<para>Since Vinum exists within the UNIX™ disk storage framework, it
would be possible to use UNIX™ partitions as the building block for
multi-disk plexes, but in fact this turns out to be too inflexible:
UNIX™ disks can have only a limited number of partitions. Instead,
Vinum subdivides a single UNIX™ partition (the
<emphasis>drive</emphasis>) into contiguous areas called
<emphasis>subdisks</emphasis>, which it uses as building blocks for
plexes.</para>
</listitem>

<listitem>
<para>Subdisks reside on Vinum <emphasis>drives</emphasis>,
currently UNIX™ partitions. Vinum drives can contain any number of
subdisks. With the exception of a small area at the beginning of
the drive, which is used for storing configuration and state
information, the entire drive is available for data storage.</para>
</listitem>
</itemizedlist>
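The four-level hierarchy in the list above (volumes contain plexes, plexes contain subdisks, subdisks live on drives) can be sketched as a set of nested records. This is an illustrative model only; the names and fields are invented, not Vinum's internal structures:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Subdisk:              # contiguous area of a Vinum drive
    drive: str
    length_mb: int

@dataclass
class Plex:                 # one complete copy of the volume's address space
    organization: str       # "concat" or "striped"
    subdisks: List[Subdisk] = field(default_factory=list)

@dataclass
class Volume:               # what the user sees; plexes provide redundancy
    name: str
    plexes: List[Plex] = field(default_factory=list)

# A mirrored volume: two plexes, each a full copy on a different drive.
vol = Volume("myvol", [
    Plex("concat", [Subdisk("a", 512)]),
    Plex("concat", [Subdisk("b", 512)]),
])
print(len(vol.plexes))   # 2
```

Redundancy lives only at the plex level: losing a subdisk degrades one plex, but the volume survives as long as some plex still covers the whole address range.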

@@ -292,29 +316,33 @@
<sect2>
<title>Volume size considerations</title>

<para>Plexes can include multiple subdisks spread over all drives in
the Vinum configuration. As a result, the size of an individual
drive does not limit the size of a plex, and thus of a
volume.</para>
</sect2>

<sect2>
<title>Redundant data storage</title>

<para>Vinum implements mirroring by attaching multiple plexes to a
volume. Each plex is a representation of the data in a volume. A
volume may contain between one and eight plexes.</para>

<para>Although a plex represents the complete data of a volume, it
is possible for parts of the representation to be physically
missing, either by design (by not defining a subdisk for parts of
the plex) or by accident (as a result of the failure of a drive).
As long as at least one plex can provide the data for the complete
address range of the volume, the volume is fully functional.</para>
</sect2>

<sect2>
<title>Performance issues</title>

<para>Vinum implements both concatenation and striping at the plex
level:</para>

<itemizedlist>
<listitem>
@@ -324,9 +352,9 @@

<listitem>
<listitem>
<para>A <emphasis>striped plex</emphasis> stripes the data across
each subdisk. The subdisks must all have the same size, and there
must be at least two subdisks in order to distinguish it from a
concatenated plex.</para>
</listitem>
</itemizedlist>
</sect2>

@@ -339,24 +367,29 @@
<itemizedlist>
|
||||
<listitem>
<para>Concatenated plexes are the most flexible: they can contain
any number of subdisks, and the subdisks may be of different length.
The plex may be extended by adding additional subdisks. They
require less <acronym>CPU</acronym> time than striped plexes, though
the difference in <acronym>CPU</acronym> overhead is not measurable.
On the other hand, they are most susceptible to hot spots, where one
disk is very active and others are idle.</para>
</listitem>

<listitem>
<para>The greatest advantage of striped
(<acronym>RAID-0</acronym>) plexes is that they reduce hot spots:
by choosing an optimum sized stripe (about 256 kB), you can even
out the load on the component drives. The disadvantages of this
approach are (fractionally) more complex code and restrictions on
subdisks: they must be all the same size, and extending a plex by
adding new subdisks is so complicated that Vinum currently does not
implement it. Vinum imposes an additional, trivial restriction: a
striped plex must have at least two subdisks, since otherwise it is
indistinguishable from a concatenated plex.</para>
</listitem>
</itemizedlist>

@@ -402,14 +435,16 @@

<sect1 id="vinum-examples">
<title>Some examples</title>

<para>Vinum maintains a <emphasis>configuration database</emphasis>
which describes the objects known to an individual system.
Initially, the user creates the configuration database from one or
more configuration files with the aid of the &man.vinum.8; utility
program. Vinum stores a copy of its configuration database on each
disk slice (which Vinum calls a <emphasis>device</emphasis>) under
its control. This database is updated on each state change, so that
a restart accurately restores the state of each Vinum
object.</para>

<sect2>
<title>The configuration file</title>

@@ -427,11 +462,12 @@
<itemizedlist>
|
||||
<listitem>
<para>The <emphasis>drive</emphasis> line describes a disk
partition (<emphasis>drive</emphasis>) and its location relative to
the underlying hardware. It is given the symbolic name
<emphasis>a</emphasis>. This separation of the symbolic names from
the device names allows disks to be moved from one location to
another without confusion.</para>
</listitem>

<listitem>
@@ -441,23 +477,27 @@
</listitem>

<listitem>
<para>The <emphasis>plex</emphasis> line defines a plex. The only
required parameter is the organization, in this case
<emphasis>concat</emphasis>. No name is necessary: the system
automatically generates a name from the volume name by adding the
suffix <emphasis>.p</emphasis><emphasis>x</emphasis>, where
<emphasis>x</emphasis> is the number of the plex in the volume.
Thus this plex will be called <emphasis>myvol.p0</emphasis>.</para>
</listitem>

<listitem>
<para>The <emphasis>sd</emphasis> line describes a subdisk. The
minimum specifications are the name of a drive on which to store
it, and the length of the subdisk. As with plexes, no name is
necessary: the system automatically assigns names derived from the
plex name by adding the suffix
<emphasis>.s</emphasis><emphasis>x</emphasis>, where
<emphasis>x</emphasis> is the number of the subdisk in the plex.
Thus Vinum gives this subdisk the name
<emphasis>myvol.p0.s0</emphasis>.</para>
</listitem>
</itemizedlist>
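The automatic naming rule described in the list above (suffix <emphasis>.px</emphasis> for plexes, <emphasis>.sx</emphasis> for subdisks) can be sketched in a few lines. The helper names are invented for illustration; only the resulting object names follow the text:

```python
def plex_name(volume, plex_index):
    # Plexes are named after their volume, with the suffix ".p<x>".
    return f"{volume}.p{plex_index}"

def subdisk_name(volume, plex_index, sd_index):
    # Subdisks are named after their plex, with the suffix ".s<x>".
    return f"{plex_name(volume, plex_index)}.s{sd_index}"

print(plex_name("myvol", 0))        # myvol.p0
print(subdisk_name("myvol", 0, 0))  # myvol.p0.s0
```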

@@ -490,23 +530,28 @@
</figure>
</para>

<para>This figure, and the ones which follow, represent a volume,
which contains the plexes, which in turn contain the subdisks. In
this trivial example, the volume contains one plex, and the plex
contains one subdisk.</para>

<para>This particular volume has no specific advantage over a
conventional disk partition. It contains a single plex, so it is
not redundant. The plex contains a single subdisk, so there is no
difference in storage allocation from a conventional disk
partition. The following sections illustrate various more
interesting configuration methods.</para>
</sect2>

<sect2>
<title>Increased resilience: mirroring</title>

<para>The resilience of a volume can be increased by mirroring.
When laying out a mirrored volume, it is important to ensure that
the subdisks of each plex are on different drives, so that a drive
failure will not take down both plexes. The following configuration
mirrors a volume:</para>

<programlisting>
drive b device /dev/da4h
@@ -516,10 +561,11 @@
plex org concat
sd length 512m drive b</programlisting>

<para>In this example, it was not necessary to specify a definition
of drive <emphasis>a</emphasis> again, since Vinum keeps track of
all objects in its configuration database. After processing this
definition, the configuration looks like:</para>

<programlisting>
@@ -552,20 +598,23 @@
</figure>
</para>

<para>In this example, each plex contains the full 512 MB of
address space. As in the previous example, each plex contains only
a single subdisk.</para>
</sect2>

<sect2>
|
||||
<title>Optimizing performance</title>
|
||||
<para>The mirrored volume in the previous example is more resistant to
failure than an unmirrored volume, but its performance is less: each write
to the volume requires a write to both drives, using up a greater
proportion of the total disk bandwidth. Performance considerations demand
a different approach: instead of mirroring, the data is striped across as
many disk drives as possible. The following configuration shows a volume
with a plex striped across four disk drives:</para>
<programlisting>
drive c device /dev/da5h
drive d device /dev/da6h
volume striped
  plex org striped 512k
    sd length 128m drive a
    sd length 128m drive b
    sd length 128m drive c
    sd length 128m drive d</programlisting>
</sect2>
<sect2>
<title>Resilience and performance</title>

<para><anchor id="vinum-resilience">With sufficient hardware, it is
possible to build volumes which show both increased resilience and
increased performance compared to standard UNIX™ partitions. A typical
configuration file might be:</para>
<programlisting>
volume raid10
  plex org striped 512k
    sd length 102480k drive a
    sd length 102480k drive b
    sd length 102480k drive c
    sd length 102480k drive d
    sd length 102480k drive e
  plex org striped 512k
    sd length 102480k drive c
    sd length 102480k drive d
    sd length 102480k drive e
    sd length 102480k drive a
    sd length 102480k drive b</programlisting>
</sect2>
</sect1>

<sect1 id="vinum-object-naming">
<title>Object naming</title>
<para>As described above, Vinum assigns default names to plexes and
subdisks, although they may be overridden. Overriding the default names
is not recommended: experience with the VERITAS volume manager, which
allows arbitrary naming of objects, has shown that this flexibility does
not bring a significant advantage, and it can cause confusion.</para>

<para>Names may contain any non-blank character, but it is recommended to
restrict them to letters, digits and the underscore characters. The names
of volumes, plexes and subdisks may be up to 64 characters long, and the
names of drives may be up to 32 characters long.</para>

<para>Vinum objects are assigned device nodes in the hierarchy
<filename>/dev/vinum</filename>. The configuration shown above would
cause Vinum to create the following device nodes:</para>
<itemizedlist>
<listitem>
<para>The control devices <devicename>/dev/vinum/control</devicename> and
<devicename>/dev/vinum/controld</devicename>, which are used by
&man.vinum.8; and the Vinum daemon respectively.</para>
</listitem>

<listitem>
<para>Block and character device entries for each volume.
These are the main devices used by Vinum. The block device names are
the name of the volume, while the character device names follow the BSD
tradition of prepending the letter <emphasis>r</emphasis> to the name.
Thus the configuration above would include the block devices
<devicename>/dev/vinum/myvol</devicename>,
<devicename>/dev/vinum/mirror</devicename>,
<devicename>/dev/vinum/striped</devicename>,
<devicename>/dev/vinum/raid5</devicename> and
<devicename>/dev/vinum/raid10</devicename>, and the character devices
<devicename>/dev/vinum/rmyvol</devicename>,
<devicename>/dev/vinum/rmirror</devicename>,
<devicename>/dev/vinum/rstriped</devicename>,
<devicename>/dev/vinum/rraid5</devicename> and
<devicename>/dev/vinum/rraid10</devicename>.
There is obviously a problem here: it is possible to have two volumes
called <emphasis>r</emphasis> and <emphasis>rr</emphasis>, but there
will be a conflict creating the device node
<devicename>/dev/vinum/rr</devicename>: is it a character device for
volume <emphasis>r</emphasis> or a block device for volume
<emphasis>rr</emphasis>? Currently Vinum does not address this
conflict: the first-defined volume will get the name.</para>
</listitem>

<listitem>
<para>A directory <devicename>/dev/vinum/drive</devicename>
with entries for each drive. These entries are in fact symbolic links
to the corresponding disk nodes.</para>
</listitem>

<listitem>
<para>A directory <filename>/dev/vinum/volume</filename> with
entries for each volume. It contains subdirectories for each plex,
which in turn contain subdirectories for their component subdisks.</para>
</listitem>

<listitem>
<para>The directories <devicename>/dev/vinum/plex</devicename>,
<devicename>/dev/vinum/sd</devicename>, and
<devicename>/dev/vinum/rsd</devicename>, which contain block device
nodes for each plex and block and character device nodes respectively
for each subdisk.</para>
</listitem>
</itemizedlist>
brwxr-xr-- 1 root wheel 25, 0x20200002 Apr 13 16:46 s64.p0.s2
brwxr-xr-- 1 root wheel 25, 0x20300002 Apr 13 16:46 s64.p0.s3</programlisting>
<para>Although it is recommended that plexes and subdisks should not be
allocated specific names, Vinum drives must be named. This makes it
possible to move a drive to a different location and still recognize it
automatically. Drive names may be up to 32 characters long.</para>
<sect2>
<title>Creating file systems</title>

<para>Volumes appear to the system to be identical to disks, with one exception.
Unlike UNIX™ drives, Vinum does not partition volumes, which thus do
not contain a partition table. This has required modification to some disk
utilities, notably &man.newfs.8;, which previously tried to
interpret the last letter of a Vinum volume name as a partition identifier.
For example, a disk drive may have a name like <devicename>/dev/ad0a</devicename>
or <devicename>/dev/da2h</devicename>. These names represent the first
partition (<devicename>a</devicename>) on the first (0) IDE disk
(<devicename>ad</devicename>) and the eighth partition
(<devicename>h</devicename>) on the third (2) SCSI disk
(<devicename>da</devicename>) respectively. By contrast, a Vinum volume
might be called <devicename>/dev/vinum/concat</devicename>, a name which
has no relationship with a partition name.</para>
<para>Normally, &man.newfs.8; interprets the name of the disk and
complains if it cannot understand it. For example:</para>

<screen>&prompt.root; <userinput>newfs /dev/vinum/concat</userinput>
newfs: /dev/vinum/concat: can't figure out file system partition</screen>
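<para>To create a file system on a Vinum volume, &man.newfs.8; must
therefore be told not to look for a partition table. A minimal sketch,
assuming a version of &man.newfs.8; that supports the
<option>-v</option> flag for treating the device as a whole rather
than as a partitioned disk:</para>

<screen>&prompt.root; <userinput>newfs -v /dev/vinum/concat</userinput></screen>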
<sect1 id="vinum-config">
<title>Configuring Vinum</title>

<para>The <filename>GENERIC</filename> kernel does not contain Vinum. It is
possible to build a special kernel which includes Vinum, but this is not
recommended. The standard way to start Vinum is as a kernel module
(<acronym>kld</acronym>). You do not even need to use &man.kldload.8;
for Vinum: when you start &man.vinum.8;, it checks whether the module
has been loaded, and if it is not, it loads it automatically.</para>
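<para>For example, running any &man.vinum.8; command, such as
<command>vinum list</command>, loads the module as a side effect, and
&man.kldstat.8; can then be used to confirm it. A minimal sketch:</para>

<screen>&prompt.root; <userinput>vinum list</userinput>
&prompt.root; <userinput>kldstat</userinput></screen>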

<sect2>
<title>Startup</title>
<para>Vinum stores configuration information on the disk slices in
essentially the same form as in the configuration files. When reading
from the configuration database, Vinum recognizes a number of keywords
which are not allowed in the configuration files. For example, a disk
configuration might contain the following text:</para>
<programlisting>volume myvol state up
volume bigraid state down
sd name bigraid.p0.s2 drive c plex bigraid.p0 state initializing len 4194304b driveoffset 1573129b plexoffset 8388608b
sd name bigraid.p0.s3 drive d plex bigraid.p0 state initializing len 4194304b driveoffset 1573129b plexoffset 12582912b
sd name bigraid.p0.s4 drive e plex bigraid.p0 state initializing len 4194304b driveoffset 1573129b plexoffset 16777216b</programlisting>
<para>The obvious differences here are the presence of explicit location
information and naming (both of which are also allowed, but discouraged, for
use by the user) and the information on the states (which are not available
to the user). Vinum does not store information about drives in the
configuration information: it finds the drives by scanning the configured
disk drives for partitions with a Vinum label. This enables Vinum to
identify drives correctly even if they have been assigned different UNIX™
drive IDs.</para>
<sect3>
<title>Automatic startup</title>

<para>In order to start Vinum automatically when you boot the system,
ensure that you have the following line in your
<filename>/etc/rc.conf</filename>:</para>

<programlisting>start_vinum="YES" # set to YES to start vinum</programlisting>
<para>If you do not have a file <filename>/etc/rc.conf</filename>, create
one with this content. This will cause the system to load the Vinum
<acronym>kld</acronym> at startup, and to start any objects mentioned in
the configuration. This is done before mounting file systems, so it is
possible to automatically &man.fsck.8; and mount file systems on Vinum
volumes.</para>
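<para>A file system on a Vinum volume can then be listed in
<filename>/etc/fstab</filename> like any other. A minimal sketch, in
which the volume name and mount point are assumptions:</para>

<programlisting>/dev/vinum/concat   /mnt/concat   ufs   rw   2   2</programlisting>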
<para>When you start Vinum with the <command>vinum start</command> command,
Vinum reads the configuration database from one of the Vinum drives.
Under normal circumstances, each drive contains an identical copy of the
configuration database, so it does not matter which drive is read. After
a crash, however, Vinum must determine which drive was updated most
recently and read the configuration from this drive. It then updates the
configuration if necessary from progressively older drives.</para>
</sect3>
</sect2>