Editorial review of first 1/2 of HAST chapter.
Sponsored by: iXsystems

This commit is contained in: parent aa33908aca, commit c9c8b80069

Notes: svn2git, 2020-12-08 03:00:23 +00:00
svn path=/head/; revision=44485

1 changed file with 141 additions and 173 deletions
@@ -3297,7 +3297,7 @@ Device 1K-blocks Used Avail Capacity
   <sect1 xml:id="disks-hast">
     <info>
-      <title>Highly Available Storage (HAST)</title>
+      <title>Highly Available Storage (<acronym>HAST</acronym>)</title>

       <authorgroup>
         <author>
@@ -3348,75 +3348,24 @@ Device 1K-blocks Used Avail Capacity

     <para>High availability is one of the main requirements in
       serious business applications and highly-available storage is a
-      key component in such environments. Highly Available STorage,
-      or <acronym>HAST<remark role="acronym">Highly
-        Available STorage</remark></acronym>, was developed by
-      &a.pjd.email; as a framework which allows transparent storage of
+      key component in such environments. In &os;, the Highly Available STorage
+      (<acronym>HAST</acronym>)
+      framework allows transparent storage of
       the same data across several physically separated machines
-      connected by a TCP/IP network. <acronym>HAST</acronym> can be
+      connected by a <acronym>TCP/IP</acronym> network. <acronym>HAST</acronym> can be
       understood as a network-based RAID1 (mirror), and is similar to
-      the DRBD® storage system known from the GNU/&linux;
+      the DRBD® storage system used in the GNU/&linux;
       platform. In combination with other high-availability features
       of &os; like <acronym>CARP</acronym>, <acronym>HAST</acronym>
       makes it possible to build a highly-available storage cluster
       that is resistant to hardware failures.</para>

-    <para>After reading this section, you will know:</para>
-
-    <itemizedlist>
-      <listitem>
-        <para>What <acronym>HAST</acronym> is, how it works and
-          which features it provides.</para>
-      </listitem>
-
-      <listitem>
-        <para>How to set up and use <acronym>HAST</acronym> on
-          &os;.</para>
-      </listitem>
-
-      <listitem>
-        <para>How to integrate <acronym>CARP</acronym> and
-          &man.devd.8; to build a robust storage system.</para>
-      </listitem>
-    </itemizedlist>
-
-    <para>Before reading this section, you should:</para>
-
-    <itemizedlist>
-      <listitem>
-        <para>Understand &unix; and <link
-            linkend="basics">&os; basics</link>.</para>
-      </listitem>
-
-      <listitem>
-        <para>Know how to <link
-            linkend="config-tuning">configure</link> network
-          interfaces and other core &os; subsystems.</para>
-      </listitem>
-
-      <listitem>
-        <para>Have a good understanding of <link
-            linkend="network-communication">&os;
-            networking</link>.</para>
-      </listitem>
-    </itemizedlist>
-
-    <para>The <acronym>HAST</acronym> project was sponsored by The
-      &os; Foundation with support from <link
-        xlink:href="http://www.omc.net/">OMCnet Internet Service
-        GmbH</link> and <link
-        xlink:href="http://www.transip.nl/">TransIP
-        BV</link>.</para>
-
     <sect2>
       <title>HAST Features</title>

-      <para>The main features of the <acronym>HAST</acronym> system
-        are:</para>
+      <para>The following are the main features of
+        <acronym>HAST</acronym>:</para>

       <itemizedlist>
         <listitem>
-          <para>Can be used to mask I/O errors on local hard
+          <para>Can be used to mask <acronym>I/O</acronym> errors on local hard
             drives.</para>
         </listitem>
@@ -3426,9 +3375,9 @@ Device 1K-blocks Used Avail Capacity
         </listitem>

         <listitem>
-          <para>Efficient and quick resynchronization, synchronizing
-            only blocks that were modified during the downtime of a
-            node.</para>
+          <para>Efficient and quick resynchronization as
+            only the blocks that were modified during the downtime of a
+            node are synchronized.</para>
         </listitem>

         <!--
@@ -3450,64 +3399,94 @@ Device 1K-blocks Used Avail Capacity
             system.</para>
         </listitem>
       </itemizedlist>
     </sect2>

+    <para>After reading this section, you will know:</para>
+
+    <itemizedlist>
+      <listitem>
+        <para>What <acronym>HAST</acronym> is, how it works, and
+          which features it provides.</para>
+      </listitem>
+
+      <listitem>
+        <para>How to set up and use <acronym>HAST</acronym> on
+          &os;.</para>
+      </listitem>
+
+      <listitem>
+        <para>How to integrate <acronym>CARP</acronym> and
+          &man.devd.8; to build a robust storage system.</para>
+      </listitem>
+    </itemizedlist>
+
+    <para>Before reading this section, you should:</para>
+
+    <itemizedlist>
+      <listitem>
+        <para>Understand &unix; and &os; basics (<xref
+            linkend="basics"/>).</para>
+      </listitem>
+
+      <listitem>
+        <para>Know how to configure network
+          interfaces and other core &os; subsystems (<xref
+            linkend="config-tuning"/>).</para>
+      </listitem>
+
+      <listitem>
+        <para>Have a good understanding of &os;
+          networking (<xref
+            linkend="network-communication"/>).</para>
+      </listitem>
+    </itemizedlist>
+
+    <para>The <acronym>HAST</acronym> project was sponsored by The
+      &os; Foundation with support from <link
+        xlink:href="http://www.omc.net/">http://www.omc.net/</link> and <link
+        xlink:href="http://www.transip.nl/">http://www.transip.nl/</link>.</para>
+
     <sect2>
       <title>HAST Operation</title>

-      <para>As <acronym>HAST</acronym> provides a synchronous
-        block-level replication of any storage media to several
-        machines, it requires at least two physical machines:
-        the <literal>primary</literal>, also known as the
-        <literal>master</literal> node, and the
-        <literal>secondary</literal> or <literal>slave</literal>
+      <para><acronym>HAST</acronym> provides synchronous
+        block-level replication between two
+        physical machines:
+        the <emphasis>primary</emphasis>, also known as the
+        <emphasis>master</emphasis> node, and the
+        <emphasis>secondary</emphasis>, or <emphasis>slave</emphasis>
         node. These two machines together are referred to as a
         cluster.</para>

       <note>
         <para>HAST is currently limited to two cluster nodes in
           total.</para>
       </note>

       <para>Since <acronym>HAST</acronym> works in a
         primary-secondary configuration, it allows only one of the
         cluster nodes to be active at any given time. The
-        <literal>primary</literal> node, also called
-        <literal>active</literal>, is the one which will handle all
-        the I/O requests to <acronym>HAST</acronym>-managed
-        devices. The <literal>secondary</literal> node is
-        automatically synchronized from the <literal>primary</literal>
+        primary node, also called
+        <emphasis>active</emphasis>, is the one which will handle all
+        the <acronym>I/O</acronym> requests to <acronym>HAST</acronym>-managed
+        devices. The secondary node is
+        automatically synchronized from the primary
         node.</para>

       <para>The physical components of the <acronym>HAST</acronym>
-        system are:</para>
-
-      <itemizedlist>
-        <listitem>
-          <para>local disk on primary node, and</para>
-        </listitem>
-
-        <listitem>
-          <para>disk on remote, secondary node.</para>
-        </listitem>
-      </itemizedlist>
+        system are the local disk on primary node, and the
+        disk on the remote, secondary node.</para>

       <para><acronym>HAST</acronym> operates synchronously on a block
         level, making it transparent to file systems and applications.
         <acronym>HAST</acronym> provides regular GEOM providers in
         <filename>/dev/hast/</filename> for use by
-        other tools or applications, thus there is no difference
+        other tools or applications. There is no difference
         between using <acronym>HAST</acronym>-provided devices and
         raw disks or partitions.</para>

-      <para>Each write, delete, or flush operation is sent to the
-        local disk and to the remote disk over TCP/IP. Each read
+      <para>Each write, delete, or flush operation is sent to both the
+        local disk and to the remote disk over <acronym>TCP/IP</acronym>. Each read
         operation is served from the local disk, unless the local disk
-        is not up-to-date or an I/O error occurs. In such case, the
+        is not up-to-date or an <acronym>I/O</acronym> error occurs. In such cases, the
         read operation is sent to the secondary node.</para>

       <para><acronym>HAST</acronym> tries to provide fast failure
-        recovery. For this reason, it is very important to reduce
+        recovery. For this reason, it is important to reduce
         synchronization time after a node's outage. To provide fast
         synchronization, <acronym>HAST</acronym> manages an on-disk
         bitmap of dirty extents and only synchronizes those during a
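The dirty-extent bitmap mentioned in the paragraph above can be illustrated with a toy model. This is an editor's sketch in Python, not HAST's actual C implementation; the extent size and all names are invented for the illustration:

```python
# Toy model of dirty-extent tracking: the primary marks an extent
# dirty before writing and clean once the secondary has the data,
# so after an outage only the dirty extents need to be copied.

EXTENT_SIZE = 2 * 1024 * 1024  # bytes per extent (invented for this sketch)

class DirtyExtentMap:
    def __init__(self, disk_size):
        self.dirty = set()                      # indices of extents awaiting sync
        self.extents = disk_size // EXTENT_SIZE

    def begin_write(self, offset, length):
        # Mark every extent touched by the write as dirty.
        first = offset // EXTENT_SIZE
        last = (offset + length - 1) // EXTENT_SIZE
        for i in range(first, last + 1):
            self.dirty.add(i)

    def remote_ack(self, offset, length):
        # The secondary confirmed the data; those extents are in sync again.
        first = offset // EXTENT_SIZE
        last = (offset + length - 1) // EXTENT_SIZE
        for i in range(first, last + 1):
            self.dirty.discard(i)

    def resync_targets(self):
        # After an outage, only these extents must be resynchronized.
        return sorted(self.dirty)

m = DirtyExtentMap(disk_size=64 * 1024 * 1024)
m.begin_write(0, 4096)                 # write at the start of the disk
m.begin_write(10 * EXTENT_SIZE, 4096)  # write further into the disk
m.remote_ack(0, 4096)                  # the first write reached the secondary
print(m.resync_targets())              # prints [10]
```

The point of the structure is the one the chapter makes: resynchronization cost is proportional to the amount of data modified during the outage, not to the size of the disk.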
@@ -3520,29 +3499,29 @@ Device 1K-blocks Used Avail Capacity

       <itemizedlist>
         <listitem>
-          <para><emphasis>memsync</emphasis>: report write operation
+          <para><emphasis>memsync</emphasis>: This mode reports a write operation
             as completed when the local write operation is finished
             and when the remote node acknowledges data arrival, but
             before actually storing the data. The data on the remote
             node will be stored directly after sending the
             acknowledgement. This mode is intended to reduce
-            latency, but still provides very good
+            latency, but still provides good
             reliability.</para>
         </listitem>

         <listitem>
-          <para><emphasis>fullsync</emphasis>: report write
-            operation as completed when local write completes and
-            when remote write completes. This is the safest and the
+          <para><emphasis>fullsync</emphasis>: This mode reports a write
+            operation as completed when both the local write and the
+            remote write complete. This is the safest and the
             slowest replication mode. This mode is the
             default.</para>
         </listitem>

         <listitem>
-          <para><emphasis>async</emphasis>: report write operation as
-            completed when local write completes. This is the
+          <para><emphasis>async</emphasis>: This mode reports a write operation as
+            completed when the local write completes. This is the
             fastest and the most dangerous replication mode. It
-            should be used when replicating to a distant node where
+            should only be used when replicating to a distant node where
             latency is too high for other modes.</para>
         </listitem>
       </itemizedlist>
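The three modes in this hunk differ only in when a write is acknowledged to the caller. A minimal decision table, as an editor's sketch (the mode names are from the text; the function itself is invented for illustration and is not part of HAST):

```python
# When is a write reported as complete?  One predicate per replication
# mode described above (fullsync / memsync / async).

def write_acknowledged(mode, local_done, remote_received, remote_stored):
    if mode == "fullsync":   # safest, slowest: both writes finished
        return local_done and remote_stored
    if mode == "memsync":    # remote has received the data, not yet stored it
        return local_done and remote_received
    if mode == "async":      # fastest, most dangerous: local write only
        return local_done
    raise ValueError("unknown replication mode: " + mode)

# Local write done, remote has received but not yet stored the data:
state = dict(local_done=True, remote_received=True, remote_stored=False)
print(write_acknowledged("memsync", **state))   # prints True
print(write_acknowledged("fullsync", **state))  # prints False
print(write_acknowledged("async", **state))     # prints True
```

This makes the trade-off explicit: memsync acknowledges one step earlier than fullsync (lower latency, a small window where the remote copy exists only in memory), while async never waits for the remote node at all.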
@@ -3551,65 +3530,64 @@ Device 1K-blocks Used Avail Capacity
     <sect2>
       <title>HAST Configuration</title>

-      <para><acronym>HAST</acronym> requires
-        <literal>GEOM_GATE</literal> support which is not present in
-        the default <literal>GENERIC</literal> kernel. However, the
-        <varname>geom_gate.ko</varname> loadable module is available
-        in the default &os; installation. Alternatively, to build
-        <literal>GEOM_GATE</literal> support into the kernel
-        statically, add this line to the custom kernel configuration
-        file:</para>
-
-      <programlisting>options GEOM_GATE</programlisting>
-
       <para>The <acronym>HAST</acronym> framework consists of several
-        parts from the operating system's point of view:</para>
+        components:</para>

       <itemizedlist>
         <listitem>
-          <para>the &man.hastd.8; daemon responsible for data
-            synchronization,</para>
+          <para>The &man.hastd.8; daemon which provides data
+            synchronization. When this daemon is started, it will
+            automatically load <varname>geom_gate.ko</varname>.</para>
         </listitem>

         <listitem>
-          <para>the &man.hastctl.8; userland management
-            utility,</para>
+          <para>The userland management
+            utility, &man.hastctl.8;.</para>
         </listitem>

         <listitem>
-          <para>and the &man.hast.conf.5; configuration file.</para>
+          <para>The &man.hast.conf.5; configuration file. This file
+            must exist before starting
+            <application>hastd</application>.</para>
         </listitem>
       </itemizedlist>

+      <para>Users who prefer to statically build
+        <literal>GEOM_GATE</literal> support into the kernel
+        should add this line to the custom kernel configuration
+        file, then rebuild the kernel using the instructions in <xref
+          linkend="kernelconfig"/>:</para>
+
+      <programlisting>options GEOM_GATE</programlisting>
+
       <para>The following example describes how to configure two nodes
-        in <literal>master</literal>-<literal>slave</literal> /
-        <literal>primary</literal>-<literal>secondary</literal>
+        in master-slave/primary-secondary
         operation using <acronym>HAST</acronym> to replicate the data
         between the two. The nodes will be called
-        <literal><replaceable>hasta</replaceable></literal> with an IP address of
-        <replaceable>172.16.0.1</replaceable> and
-        <literal><replaceable>hastb</replaceable></literal> with an IP of address
-        <replaceable>172.16.0.2</replaceable>. Both nodes will have a
-        dedicated hard drive <filename>/dev/<replaceable>ad6</replaceable></filename> of the same
+        <literal>hasta</literal>, with an <acronym>IP</acronym> address of
+        <literal>172.16.0.1</literal>, and
+        <literal>hastb</literal>, with an <acronym>IP</acronym> address of
+        <literal>172.16.0.2</literal>. Both nodes will have a
+        dedicated hard drive <filename>/dev/ad6</filename> of the same
         size for <acronym>HAST</acronym> operation. The
-        <acronym>HAST</acronym> pool, sometimes also referred to as a
-        resource or the GEOM provider in
+        <acronym>HAST</acronym> pool, sometimes referred to as a
+        resource or the <acronym>GEOM</acronym> provider in
         <filename class="directory">/dev/hast/</filename>, will be called
-        <filename><replaceable>test</replaceable></filename>.</para>
+        <literal>test</literal>.</para>

       <para>Configuration of <acronym>HAST</acronym> is done using
-        <filename>/etc/hast.conf</filename>. This file should be the
-        same on both nodes. The simplest configuration possible
+        <filename>/etc/hast.conf</filename>. This file should be
+        identical on both nodes. The simplest configuration
         is:</para>

-      <programlisting>resource test {
-    on hasta {
-        local /dev/ad6
-        remote 172.16.0.2
+      <programlisting>resource <replaceable>test</replaceable> {
+    on <replaceable>hasta</replaceable> {
+        local <replaceable>/dev/ad6</replaceable>
+        remote <replaceable>172.16.0.2</replaceable>
     }
-    on hastb {
-        local /dev/ad6
-        remote 172.16.0.1
+    on <replaceable>hastb</replaceable> {
+        local <replaceable>/dev/ad6</replaceable>
+        remote <replaceable>172.16.0.1</replaceable>
     }
}</programlisting>
@@ -3618,18 +3596,18 @@ Device 1K-blocks Used Avail Capacity

       <tip>
         <para>It is also possible to use host names in the
-          <literal>remote</literal> statements. In such a case, make
-          sure that these hosts are resolvable and are defined in
+          <literal>remote</literal> statements if
+          the hosts are resolvable and defined either in
           <filename>/etc/hosts</filename> or in the local
           <acronym>DNS</acronym>.</para>
       </tip>

-      <para>Now that the configuration exists on both nodes,
+      <para>Once the configuration exists on both nodes,
         the <acronym>HAST</acronym> pool can be created. Run these
         commands on both nodes to place the initial metadata onto the
         local disk and to start &man.hastd.8;:</para>

-      <screen>&prompt.root; <userinput>hastctl create test</userinput>
+      <screen>&prompt.root; <userinput>hastctl create <replaceable>test</replaceable></userinput>
&prompt.root; <userinput>service hastd onestart</userinput></screen>

       <note>
@@ -3646,50 +3624,40 @@ Device 1K-blocks Used Avail Capacity
         administrator, or software like
         <application>Heartbeat</application>, using &man.hastctl.8;.
         On the primary node,
-        <literal><replaceable>hasta</replaceable></literal>, issue
+        <literal>hasta</literal>, issue
         this command:</para>

-      <screen>&prompt.root; <userinput>hastctl role primary test</userinput></screen>
+      <screen>&prompt.root; <userinput>hastctl role primary <replaceable>test</replaceable></userinput></screen>

-      <para>Similarly, run this command on the secondary node,
-        <literal><replaceable>hastb</replaceable></literal>:</para>
+      <para>Run this command on the secondary node,
+        <literal>hastb</literal>:</para>

-      <screen>&prompt.root; <userinput>hastctl role secondary test</userinput></screen>
+      <screen>&prompt.root; <userinput>hastctl role secondary <replaceable>test</replaceable></userinput></screen>

       <caution>
         <para>When the nodes are unable to communicate with each
           other, and both are configured as primary nodes, the
           condition is called <literal>split-brain</literal>. To
           troubleshoot this situation, follow the steps described in
           <xref linkend="disks-hast-sb"/>.</para>
       </caution>

-      <para>Verify the result by running &man.hastctl.8; on each
+      <para>Verify the result by running <command>hastctl</command> on each
         node:</para>

-      <screen>&prompt.root; <userinput>hastctl status test</userinput></screen>
+      <screen>&prompt.root; <userinput>hastctl status <replaceable>test</replaceable></userinput></screen>

-      <para>The important text is the <literal>status</literal> line,
-        which should say <literal>complete</literal>
-        on each of the nodes. If it says <literal>degraded</literal>,
-        something went wrong. At this point, the synchronization
-        between the nodes has already started. The synchronization
+      <para>Check the <literal>status</literal> line in the output.
+        If it says <literal>degraded</literal>,
+        something is wrong with the configuration file. It should say <literal>complete</literal>
+        on each node, meaning that the synchronization
+        between the nodes has started. The synchronization
         completes when <command>hastctl status</command>
         reports 0 bytes of <literal>dirty</literal> extents.</para>

       <para>The next step is to create a file system on the
         <filename>/dev/hast/<replaceable>test</replaceable></filename>
-        GEOM provider and mount it. This must be done on the
-        <literal>primary</literal> node, as
-        <filename>/dev/hast/<replaceable>test</replaceable></filename>
-        appears only on the <literal>primary</literal> node. Creating
+        <acronym>GEOM</acronym> provider and mount it. This must be done on the
+        <literal>primary</literal> node. Creating
         the file system can take a few minutes, depending on the size
-        of the hard drive:</para>
+        of the hard drive. This example creates a <acronym>UFS</acronym>
+        file system on <filename>/dev/hast/test</filename>:</para>

-      <screen>&prompt.root; <userinput>newfs -U /dev/hast/test</userinput>
-&prompt.root; <userinput>mkdir /hast/test</userinput>
-&prompt.root; <userinput>mount /dev/hast/test /hast/test</userinput></screen>
+      <screen>&prompt.root; <userinput>newfs -U /dev/hast/<replaceable>test</replaceable></userinput>
+&prompt.root; <userinput>mkdir /hast/<replaceable>test</replaceable></userinput>
+&prompt.root; <userinput>mount /dev/hast/<replaceable>test</replaceable> <replaceable>/hast/test</replaceable></userinput></screen>

       <para>Once the <acronym>HAST</acronym> framework is configured
         properly, the final step is to make sure that
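The split-brain condition described in the caution in this hunk reduces to a one-line predicate over the two nodes' roles. An editor's illustrative sketch (names invented; HAST itself detects this internally):

```python
# A HAST cluster is exactly two nodes, each in role "primary" or
# "secondary".  Split-brain is the state where both nodes believe
# they are primary, e.g. after a network partition plus a failover.

def is_split_brain(role_a, role_b):
    return role_a == "primary" and role_b == "primary"

print(is_split_brain("primary", "secondary"))  # prints False (normal operation)
print(is_split_brain("primary", "primary"))    # prints True  (split-brain)
```

Because each node may have accepted writes independently in this state, recovery requires an operator to pick one node's data and resynchronize the other, which is why the chapter defers to a dedicated troubleshooting section.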