Whitespace changes for clarity.

Translators please ignore.
Murray Stokely 2003-05-04 11:45:58 +00:00
parent f8f3aa5727
commit 27ba1359ad
Notes: svn2git 2020-12-08 03:00:23 +00:00
svn path=/head/; revision=16781


@ -48,62 +48,70 @@
    <indexterm><primary>RAID</primary>
      <secondary>Software</secondary></indexterm>

    <para><emphasis>Vinum</emphasis> is a so-called <emphasis>Volume
      Manager</emphasis>, a virtual disk driver that addresses these
      three problems.  Let us look at them in more detail.  Various
      solutions to these problems have been proposed and
      implemented:</para>

    <para>Disks are getting bigger, but so are data storage
      requirements.  Often you will find you want a file system that
      is bigger than the disks you have available.  Admittedly, this
      problem is not as acute as it was ten years ago, but it still
      exists.  Some systems have solved this by creating an abstract
      device which stores its data on a number of disks.</para>
  </sect1>
<sect1 id="vinum-access-bottlenecks"> <sect1 id="vinum-access-bottlenecks">
<title>Access bottlenecks</title> <title>Access bottlenecks</title>
<para>Modern systems frequently need to access data in a highly <para>Modern systems frequently need to access data in a highly
concurrent manner. For example, large FTP or HTTP servers can maintain concurrent manner. For example, large FTP or HTTP servers can
thousands of concurrent sessions and have multiple 100&nbsp;Mbit/s connections maintain thousands of concurrent sessions and have multiple
to the outside world, well beyond the sustained transfer rate of most 100&nbsp;Mbit/s connections to the outside world, well beyond
disks.</para> the sustained transfer rate of most disks.</para>
<para>Current disk drives can transfer data sequentially at up to <para>Current disk drives can transfer data sequentially at up to
70&nbsp;MB/s, but this value is of little importance in an environment 70&nbsp;MB/s, but this value is of little importance in an
where many independent processes access a drive, where they may environment where many independent processes access a drive,
achieve only a fraction of these values. In such cases it is more where they may achieve only a fraction of these values. In such
interesting to view the problem from the viewpoint of the disk cases it is more interesting to view the problem from the
subsystem: the important parameter is the load that a transfer places viewpoint of the disk subsystem: the important parameter is the
on the subsystem, in other words the time for which a transfer occupies load that a transfer places on the subsystem, in other words the
the drives involved in the transfer.</para> time for which a transfer occupies the drives involved in the
transfer.</para>
<para>In any disk transfer, the drive must first position the heads, wait <para>In any disk transfer, the drive must first position the
for the first sector to pass under the read head, and then perform the heads, wait for the first sector to pass under the read head,
transfer. These actions can be considered to be atomic: it does not make and then perform the transfer. These actions can be considered
any sense to interrupt them.</para> to be atomic: it does not make any sense to interrupt
them.</para>
<para><anchor id="vinum-latency"> <para><anchor id="vinum-latency"> Consider a typical transfer of
Consider a typical transfer of about 10&nbsp;kB: the current generation of about 10&nbsp;kB: the current generation of high-performance
high-performance disks can position the heads in an average of 3.5&nbsp;ms. The disks can position the heads in an average of 3.5&nbsp;ms. The
fastest drives spin at 15,000&nbsp;rpm, so the average rotational latency fastest drives spin at 15,000&nbsp;rpm, so the average
(half a revolution) is 2&nbsp;ms. At 70&nbsp;MB/s, the transfer itself takes about rotational latency (half a revolution) is 2&nbsp;ms. At
150&nbsp;&mu;s, almost nothing compared to the positioning time. In such a 70&nbsp;MB/s, the transfer itself takes about 150&nbsp;&mu;s,
case, the effective transfer rate drops to a little over 1&nbsp;MB/s and is almost nothing compared to the positioning time. In such a
clearly highly dependent on the transfer size.</para> case, the effective transfer rate drops to a little over
1&nbsp;MB/s and is clearly highly dependent on the transfer
size.</para>
    <para>The traditional and obvious solution to this bottleneck is
      <quote>more spindles</quote>: rather than using one large disk,
      use several smaller disks with the same aggregate storage
      space.  Each disk is capable of positioning and transferring
      independently, so the effective throughput increases by a
      factor close to the number of disks used.</para>

    <para>The exact throughput improvement is, of course, smaller than
      the number of disks involved: although each drive is capable of
      transferring in parallel, there is no way to ensure that the
      requests are evenly distributed across the drives.  Inevitably
      the load on one drive will be higher than on another.</para>

    <indexterm>
      <primary>disk concatenation</primary>
@ -113,20 +121,22 @@
      <secondary>concatenation</secondary>
    </indexterm>

    <para>The evenness of the load on the disks is strongly dependent
      on the way the data is shared across the drives.  In the
      following discussion, it is convenient to think of the disk
      storage as a large number of data sectors which are addressable
      by number, rather like the pages in a book.  The most obvious
      method is to divide the virtual disk into groups of consecutive
      sectors the size of the individual physical disks and store them
      in this manner, rather like taking a large book and tearing it
      into smaller sections.  This method is called
      <emphasis>concatenation</emphasis> and has the advantage that
      the disks are not required to have any specific size
      relationships.  It works well when the access to the virtual
      disk is spread evenly about its address space.  When access is
      concentrated on a smaller area, the improvement is less marked.
      <xref linkend="vinum-concat"> illustrates the sequence in which
      storage units are allocated in a concatenated
      organization.</para>

    <para>
@ -144,26 +154,28 @@
      <secondary>striping</secondary>
    </indexterm>

    <para>An alternative mapping is to divide the address space into
      smaller, equal-sized components and store them sequentially on
      different devices.  For example, the first 256 sectors may be
      stored on the first disk, the next 256 sectors on the next disk
      and so on.  After filling the last disk, the process repeats
      until the disks are full.  This mapping is called
      <emphasis>striping</emphasis> or <acronym>RAID-0</acronym>
      <footnote>
        <indexterm><primary>RAID</primary></indexterm>

        <para><acronym>RAID</acronym> stands for <emphasis>Redundant
          Array of Inexpensive Disks</emphasis> and offers various
          forms of fault tolerance, though the term is somewhat
          misleading in the case of <acronym>RAID-0</acronym>: it
          provides no redundancy.</para>
      </footnote>.
      Striping requires somewhat more effort to locate the data, and
      it can cause additional I/O load where a transfer is spread over
      multiple disks, but it can also provide a more constant load
      across the disks.  <xref linkend="vinum-striped"> illustrates
      the sequence in which storage units are allocated in a striped
      organization.</para>
    <para>
      <figure id="vinum-striped">
@ -175,11 +187,13 @@
<sect1 id="vinum-data-integrity"> <sect1 id="vinum-data-integrity">
<title>Data integrity</title> <title>Data integrity</title>
<para>The final problem with current disks is that they are unreliable.
Although disk drive reliability has increased tremendously over the last <para>The final problem with current disks is that they are
few years, they are still the most likely core component of a server to unreliable. Although disk drive reliability has increased
fail. When they do, the results can be catastrophic: replacing a failed tremendously over the last few years, they are still the most
disk drive and restoring data to it can take days.</para> likely core component of a server to fail. When they do, the
results can be catastrophic: replacing a failed disk drive and
restoring data to it can take days.</para>
<indexterm> <indexterm>
<primary>disk mirroring</primary> <primary>disk mirroring</primary>
@ -195,10 +209,11 @@
    <para>The traditional way to approach this problem has been
      <emphasis>mirroring</emphasis>, keeping two copies of the data
      on different physical hardware.  Since the advent of the
      <acronym>RAID</acronym> levels, this technique has also been
      called <acronym>RAID level 1</acronym> or
      <acronym>RAID-1</acronym>.  Any write to the volume writes to
      both locations; a read can be satisfied from either, so if one
      drive fails, the data is still available on the other
      drive.</para>

    <para>Mirroring has two problems:</para>
@ -211,24 +226,26 @@
      <listitem>
        <para>The performance impact.  Writes must be performed to
          both drives, so they take up twice the bandwidth of a
          non-mirrored volume.  Reads do not suffer from a performance
          penalty: they can even appear to be faster.</para>
      </listitem>
    </itemizedlist>

    <para><indexterm><primary>RAID-5</primary></indexterm>An
      alternative solution is <emphasis>parity</emphasis>, implemented
      in the <acronym>RAID</acronym> levels 2, 3, 4 and 5.  Of these,
      <acronym>RAID-5</acronym> is the most interesting.  As
      implemented in Vinum, it is a variant on a striped organization
      which dedicates one block of each stripe to parity of the other
      blocks: a <acronym>RAID-5</acronym> plex is similar to a striped
      plex, except that each stripe includes a parity block.  As
      required by <acronym>RAID-5</acronym>, the location of this
      parity block changes from one stripe to the next.  The numbers
      in the data blocks indicate the relative block numbers.</para>
    <para>
      <figure id="vinum-raid5-org">
@ -237,13 +254,15 @@
      </figure>
    </para>

    <para>Compared to mirroring, <acronym>RAID-5</acronym> has the
      advantage of requiring significantly less storage space.  Read
      access is similar to that of striped organizations, but write
      access is significantly slower, approximately 25% of the read
      performance.  If one drive fails, the array can continue to
      operate in degraded mode: a read from one of the remaining
      accessible drives continues normally, but a read from the
      failed drive is recalculated from the corresponding block from
      all the remaining drives.</para>
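    <para>The text above does not spell out how this recalculation is
      done; as a sketch, assuming the conventional exclusive-OR parity
      scheme used by <acronym>RAID-5</acronym>, a stripe with three
      data blocks and one parity block behaves roughly like
      this:</para>

    <programlisting>parity = block0 XOR block1 XOR block2
block1 = block0 XOR block2 XOR parity   (reconstruction after losing block1)</programlisting>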
  </sect1>
@ -261,28 +280,33 @@
      </listitem>

      <listitem>
        <para>Volumes are composed of <emphasis>plexes</emphasis>,
          each of which represents the total address space of a
          volume.  This level in the hierarchy thus provides
          redundancy.  Think of plexes as individual disks in a
          mirrored array, each containing the same data.</para>
      </listitem>

      <listitem>
        <para>Since Vinum exists within the UNIX&trade; disk storage
          framework, it would be possible to use UNIX&trade;
          partitions as the building block for multi-disk plexes, but
          in fact this turns out to be too inflexible: UNIX&trade;
          disks can have only a limited number of partitions.
          Instead, Vinum subdivides a single UNIX&trade; partition
          (the <emphasis>drive</emphasis>) into contiguous areas
          called <emphasis>subdisks</emphasis>, which it uses as
          building blocks for plexes.</para>
      </listitem>

      <listitem>
        <para>Subdisks reside on Vinum <emphasis>drives</emphasis>,
          currently UNIX&trade; partitions.  Vinum drives can contain
          any number of subdisks.  With the exception of a small area
          at the beginning of the drive, which is used for storing
          configuration and state information, the entire drive is
          available for data storage.</para>
      </listitem>
    </itemizedlist>
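    <para>Schematically, the hierarchy just described can be
      summarized as follows (a sketch of the relationships, not a
      listing from the original text):</para>

    <programlisting>volume
  plex      one or more per volume; each represents the volume's whole address space
    subdisk one or more per plex; a contiguous area of a drive
      drive a UNIX partition, holding the subdisks and a small configuration area</programlisting>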
@ -292,29 +316,33 @@
    <sect2>
      <title>Volume size considerations</title>

      <para>Plexes can include multiple subdisks spread over all
        drives in the Vinum configuration.  As a result, the size of
        an individual drive does not limit the size of a plex, and
        thus of a volume.</para>
    </sect2>

    <sect2>
      <title>Redundant data storage</title>

      <para>Vinum implements mirroring by attaching multiple plexes to
        a volume.  Each plex is a representation of the data in a
        volume.  A volume may contain between one and eight
        plexes.</para>

      <para>Although a plex represents the complete data of a volume,
        it is possible for parts of the representation to be
        physically missing, either by design (by not defining a
        subdisk for parts of the plex) or by accident (as a result of
        the failure of a drive).  As long as at least one plex can
        provide the data for the complete address range of the volume,
        the volume is fully functional.</para>
    </sect2>

    <sect2>
      <title>Performance issues</title>

      <para>Vinum implements both concatenation and striping at the
        plex level:</para>

      <itemizedlist>
        <listitem>
@ -324,9 +352,9 @@
        <listitem>
          <para>A <emphasis>striped plex</emphasis> stripes the data
            across each subdisk.  The subdisks must all have the same
            size, and there must be at least two subdisks in order to
            distinguish it from a concatenated plex.  (A brief
            configuration sketch follows this list.)</para>
        </listitem>
      </itemizedlist>
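      <para>As a rough sketch of how the two organizations are
        declared in a configuration file (the drive names, subdisk
        lengths and stripe size here are illustrative assumptions, not
        values taken from this chapter), a concatenated plex simply
        lists its subdisks, which may differ in length, while a
        striped plex names a stripe size and equal-sized
        subdisks:</para>

      <programlisting>plex org concat
  sd length 512m drive a
  sd length 256m drive b

plex org striped 256k
  sd length 512m drive a
  sd length 512m drive b</programlisting>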
    </sect2>
@ -339,24 +367,29 @@
      <itemizedlist>
        <listitem>
          <para>Concatenated plexes are the most flexible: they can
            contain any number of subdisks, and the subdisks may be of
            different length.  The plex may be extended by adding
            additional subdisks.  They require less
            <acronym>CPU</acronym> time than striped plexes, though
            the difference in <acronym>CPU</acronym> overhead is not
            measurable.  On the other hand, they are most susceptible
            to hot spots, where one disk is very active and others are
            idle.</para>
        </listitem>

        <listitem>
          <para>The greatest advantage of striped
            (<acronym>RAID-0</acronym>) plexes is that they reduce hot
            spots: by choosing an optimum-sized stripe (about
            256&nbsp;kB), you can even out the load on the component
            drives.  The disadvantages of this approach are
            (fractionally) more complex code and restrictions on
            subdisks: they must all be the same size, and extending a
            plex by adding new subdisks is so complicated that Vinum
            currently does not implement it.  Vinum imposes an
            additional, trivial restriction: a striped plex must have
            at least two subdisks, since otherwise it is
            indistinguishable from a concatenated plex.</para>
        </listitem>
      </itemizedlist>
@ -402,14 +435,16 @@
<sect1 id="vinum-examples"> <sect1 id="vinum-examples">
<title>Some examples</title> <title>Some examples</title>
<para>Vinum maintains a <emphasis>configuration database</emphasis>
which describes the objects known to an individual system. Initially, the <para>Vinum maintains a <emphasis>configuration
user creates the configuration database from one or more configuration files database</emphasis> which describes the objects known to an
with the aid of the &man.vinum.8; utility program. Vinum stores a copy of individual system. Initially, the user creates the
its configuration database on each disk slice (which Vinum calls a configuration database from one or more configuration files with
<emphasis>device</emphasis>) under its control. This database is updated on the aid of the &man.vinum.8; utility program. Vinum stores a
each state change, so that a restart accurately restores the state of each copy of its configuration database on each disk slice (which
Vinum object.</para> Vinum calls a <emphasis>device</emphasis>) under its control.
This database is updated on each state change, so that a restart
accurately restores the state of each Vinum object.</para>
<sect2> <sect2>
<title>The configuration file</title> <title>The configuration file</title>
@ -427,11 +462,12 @@
      <itemizedlist>
        <listitem>
          <para>The <emphasis>drive</emphasis> line describes a disk
            partition (<emphasis>drive</emphasis>) and its location
            relative to the underlying hardware.  It is given the
            symbolic name <emphasis>a</emphasis>.  This separation of
            the symbolic names from the device names allows disks to
            be moved from one location to another without
            confusion.</para>
        </listitem>

        <listitem>
@ -441,23 +477,27 @@
        </listitem>

        <listitem>
          <para>The <emphasis>plex</emphasis> line defines a plex.
            The only required parameter is the organization, in this
            case <emphasis>concat</emphasis>.  No name is necessary:
            the system automatically generates a name from the volume
            name by adding the suffix
            <emphasis>.p</emphasis><emphasis>x</emphasis>, where
            <emphasis>x</emphasis> is the number of the plex in the
            volume.  Thus this plex will be called
            <emphasis>myvol.p0</emphasis>.</para>
        </listitem>

        <listitem>
          <para>The <emphasis>sd</emphasis> line describes a subdisk.
            The minimum specifications are the name of a drive on
            which to store it, and the length of the subdisk.  As with
            plexes, no name is necessary: the system automatically
            assigns names derived from the plex name by adding the
            suffix <emphasis>.s</emphasis><emphasis>x</emphasis>,
            where <emphasis>x</emphasis> is the number of the subdisk
            in the plex.  Thus Vinum gives this subdisk the name
            <emphasis>myvol.p0.s0</emphasis>.</para>
        </listitem>
      </itemizedlist>
@ -490,23 +530,28 @@
        </figure>
      </para>

      <para>This figure, and the ones which follow, represent a
        volume, which contains the plexes, which in turn contain the
        subdisks.  In this trivial example, the volume contains one
        plex, and the plex contains one subdisk.</para>

      <para>This particular volume has no specific advantage over a
        conventional disk partition.  It contains a single plex, so it
        is not redundant.  The plex contains a single subdisk, so
        there is no difference in storage allocation from a
        conventional disk partition.  The following sections
        illustrate various more interesting configuration
        methods.</para>
    </sect2>

    <sect2>
      <title>Increased resilience: mirroring</title>

      <para>The resilience of a volume can be increased by mirroring.
        When laying out a mirrored volume, it is important to ensure
        that the subdisks of each plex are on different drives, so
        that a drive failure will not take down both plexes.  The
        following configuration mirrors a volume:</para>

      <programlisting>
drive b device /dev/da4h
@ -516,10 +561,11 @@
plex org concat
sd length 512m drive b</programlisting>

      <para>In this example, it was not necessary to specify a
        definition of drive <emphasis>a</emphasis> again, since Vinum
        keeps track of all objects in its configuration database.
        After processing this definition, the configuration looks
        like:</para>

      <programlisting>
@ -552,20 +598,23 @@
        </figure>
      </para>

      <para>In this example, each plex contains the full 512&nbsp;MB
        of address space.  As in the previous example, each plex
        contains only a single subdisk.</para>
    </sect2>

    <sect2>
      <title>Optimizing performance</title>

      <para>The mirrored volume in the previous example is more
        resistant to failure than an unmirrored volume, but its
        performance is less: each write to the volume requires a write
        to both drives, using up a greater proportion of the total
        disk bandwidth.  Performance considerations demand a different
        approach: instead of mirroring, the data is striped across as
        many disk drives as possible.  The following configuration
        shows a volume with a plex striped across four disk
        drives:</para>

      <programlisting>
drive c device /dev/da5h
@ -624,10 +673,12 @@
    <sect2>
      <title>Resilience and performance</title>

      <para><anchor id="vinum-resilience">With sufficient hardware, it
        is possible to build volumes which show both increased
        resilience and increased performance compared to standard
        UNIX&trade; partitions.  A typical configuration file might
        be:</para>

      <programlisting>
volume raid10
@ -662,72 +713,80 @@
<sect1 id="vinum-object-naming"> <sect1 id="vinum-object-naming">
<title>Object naming</title> <title>Object naming</title>
<para>As described above, Vinum assigns default names to plexes and
subdisks, although they may be overridden. Overriding the default names
is not recommended: experience with the VERITAS volume manager, which
allows arbitrary naming of objects, has shown that this flexibility does
not bring a significant advantage, and it can cause confusion.</para>
<para>Names may contain any non-blank character, but it is recommended to <para>As described above, Vinum assigns default names to plexes
restrict them to letters, digits and the underscore characters. The names and subdisks, although they may be overridden. Overriding the
of volumes, plexes and subdisks may be up to 64 characters long, and the default names is not recommended: experience with the VERITAS
names of drives may be up to 32 characters long.</para> volume manager, which allows arbitrary naming of objects, has
shown that this flexibility does not bring a significant
advantage, and it can cause confusion.</para>
<para>Vinum objects <para>Names may contain any non-blank character, but it is
are assigned device nodes in the hierarchy <filename>/dev/vinum</filename>. recommended to restrict them to letters, digits and the
The configuration shown above would cause Vinum to create the following underscore characters. The names of volumes, plexes and
device nodes:</para> subdisks may be up to 64 characters long, and the names of
drives may be up to 32 characters long.</para>
<para>Vinum objects are assigned device nodes in the hierarchy
<filename>/dev/vinum</filename>. The configuration shown above
would cause Vinum to create the following device nodes:</para>
    <itemizedlist>
      <listitem>
        <para>The control devices
          <devicename>/dev/vinum/control</devicename> and
          <devicename>/dev/vinum/controld</devicename>, which are used
          by &man.vinum.8; and the Vinum daemon respectively.</para>
      </listitem>

      <listitem>
        <para>Block and character device entries for each volume.
          These are the main devices used by Vinum.  The block device
          names are the name of the volume, while the character device
          names follow the BSD tradition of prepending the letter
          <emphasis>r</emphasis> to the name.  Thus the configuration
          above would include the block devices
          <devicename>/dev/vinum/myvol</devicename>,
          <devicename>/dev/vinum/mirror</devicename>,
          <devicename>/dev/vinum/striped</devicename>,
          <devicename>/dev/vinum/raid5</devicename> and
          <devicename>/dev/vinum/raid10</devicename>, and the
          character devices
          <devicename>/dev/vinum/rmyvol</devicename>,
          <devicename>/dev/vinum/rmirror</devicename>,
          <devicename>/dev/vinum/rstriped</devicename>,
          <devicename>/dev/vinum/rraid5</devicename> and
          <devicename>/dev/vinum/rraid10</devicename>.  There is
          obviously a problem here: it is possible to have two volumes
          called <emphasis>r</emphasis> and <emphasis>rr</emphasis>,
          but there will be a conflict creating the device node
          <devicename>/dev/vinum/rr</devicename>: is it a character
          device for volume <emphasis>r</emphasis> or a block device
          for volume <emphasis>rr</emphasis>?  Currently Vinum does
          not address this conflict: the first-defined volume will get
          the name.</para>
      </listitem>

      <listitem>
        <para>A directory <devicename>/dev/vinum/drive</devicename>
          with entries for each drive.  These entries are in fact
          symbolic links to the corresponding disk nodes.</para>
      </listitem>

      <listitem>
        <para>A directory <filename>/dev/vinum/volume</filename> with
          entries for each volume.  It contains subdirectories for
          each plex, which in turn contain subdirectories for their
          component subdisks.</para>
      </listitem>

      <listitem>
        <para>The directories
          <devicename>/dev/vinum/plex</devicename>,
          <devicename>/dev/vinum/sd</devicename>, and
          <devicename>/dev/vinum/rsd</devicename>, which contain block
          device nodes for each plex and block and character device
          nodes respectively for each subdisk.</para>
      </listitem>
    </itemizedlist>
@ -806,26 +865,31 @@
brwxr-xr-- 1 root wheel 25, 0x20200002 Apr 13 16:46 s64.p0.s2
brwxr-xr-- 1 root wheel 25, 0x20300002 Apr 13 16:46 s64.p0.s3</programlisting>

    <para>Although it is recommended that plexes and subdisks should
      not be allocated specific names, Vinum drives must be named.
      This makes it possible to move a drive to a different location
      and still recognize it automatically.  Drive names may be up to
      32 characters long.</para>

    <sect2>
      <title>Creating file systems</title>

      <para>Volumes appear to the system to be identical to disks,
        with one exception.  Unlike UNIX&trade; drives, Vinum does not
        partition volumes, which thus do not contain a partition
        table.  This has required modification to some disk utilities,
        notably &man.newfs.8;, which previously tried to interpret the
        last letter of a Vinum volume name as a partition identifier.
        For example, a disk drive may have a name like
        <devicename>/dev/ad0a</devicename> or
        <devicename>/dev/da2h</devicename>.  These names represent the
        first partition (<devicename>a</devicename>) on the first (0)
        IDE disk (<devicename>ad</devicename>) and the eighth
        partition (<devicename>h</devicename>) on the third (2) SCSI
        disk (<devicename>da</devicename>) respectively.  By contrast,
        a Vinum volume might be called
        <devicename>/dev/vinum/concat</devicename>, a name which has
        no relationship with a partition name.</para>

      <para>Normally, &man.newfs.8; interprets the name of the disk
        and complains if it cannot understand it.  For example:</para>
@ -843,21 +907,25 @@ newfs: /dev/vinum/concat: can't figure out file system partition</screen>
<sect1 id="vinum-config"> <sect1 id="vinum-config">
<title>Configuring Vinum</title> <title>Configuring Vinum</title>
<para>The <filename>GENERIC</filename> kernel does not contain Vinum. It is
possible to build a special kernel which includes Vinum, but this is not <para>The <filename>GENERIC</filename> kernel does not contain
recommended. The standard way to start Vinum is as a kernel module Vinum. It is possible to build a special kernel which includes
(<acronym>kld</acronym>). You do not even need to use &man.kldload.8; Vinum, but this is not recommended. The standard way to start
for Vinum: when you start &man.vinum.8;, it checks whether the module Vinum is as a kernel module (<acronym>kld</acronym>). You do
has been loaded, and if it is not, it loads it automatically.</para> not even need to use &man.kldload.8; for Vinum: when you start
&man.vinum.8;, it checks whether the module has been loaded, and
if it is not, it loads it automatically.</para>
<sect2> <sect2>
<title>Startup</title> <title>Startup</title>
<para>Vinum stores configuration information on the disk slices in
essentially the same form as in the configuration files. When reading <para>Vinum stores configuration information on the disk slices
from the configuration database, Vinum recognizes a number of keywords in essentially the same form as in the configuration files.
which are not allowed in the configuration files. For example, a disk When reading from the configuration database, Vinum recognizes
configuration might contain the following text:</para> a number of keywords which are not allowed in the
configuration files. For example, a disk configuration might
contain the following text:</para>
<programlisting>volume myvol state up <programlisting>volume myvol state up
volume bigraid state down volume bigraid state down
@ -879,37 +947,43 @@ sd name bigraid.p0.s2 drive c plex bigraid.p0 state initializing len 4194304b dr
sd name bigraid.p0.s3 drive d plex bigraid.p0 state initializing len 4194304b driveoffset 1573129b plexoffset 12582912b
sd name bigraid.p0.s4 drive e plex bigraid.p0 state initializing len 4194304b driveoffset 1573129b plexoffset 16777216b</programlisting>

      <para>The obvious differences here are the presence of explicit
        location information and naming (both of which are also
        allowed, but discouraged, for use by the user) and the
        information on the states (which are not available to the
        user).  Vinum does not store information about drives in the
        configuration information: it finds the drives by scanning the
        configured disk drives for partitions with a Vinum label.
        This enables Vinum to identify drives correctly even if they
        have been assigned different UNIX&trade; drive IDs.</para>
      <sect3>
        <title>Automatic startup</title>

        <para>In order to start Vinum automatically when you boot the
          system, ensure that you have the following line in your
          <filename>/etc/rc.conf</filename>:</para>

        <programlisting>start_vinum="YES"    # set to YES to start vinum</programlisting>

        <para>If you do not have a file
          <filename>/etc/rc.conf</filename>, create one with this
          content.  This will cause the system to load the Vinum
          <acronym>kld</acronym> at startup, and to start any objects
          mentioned in the configuration.  This is done before
          mounting file systems, so it is possible to automatically
          &man.fsck.8; and mount file systems on Vinum volumes.</para>
        <para>When you start Vinum with the <command>vinum
          start</command> command, Vinum reads the configuration
          database from one of the Vinum drives.  Under normal
          circumstances, each drive contains an identical copy of the
          configuration database, so it does not matter which drive is
          read.  After a crash, however, Vinum must determine which
          drive was updated most recently and read the configuration
          from this drive.  It then updates the configuration if
          necessary from progressively older drives.</para>
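        <para>As a quick sketch of what this looks like in practice
          (the <command>list</command> subcommand is used here on the
          assumption that it prints the current configuration, as
          described in &man.vinum.8;):</para>

        <screen>&prompt.root; <userinput>vinum start</userinput>
&prompt.root; <userinput>vinum list</userinput></screen>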
      </sect3>
    </sect2>