817 lines
		
	
	
	
		
			28 KiB
		
	
	
	
		
			XML
		
	
	
	
	
	
			
		
		
	
	
			817 lines
		
	
	
	
		
			28 KiB
		
	
	
	
		
			XML
		
	
	
	
	
	
| <?xml version="1.0" encoding="iso-8859-1"?>
 | |
| <!DOCTYPE article PUBLIC "-//FreeBSD//DTD DocBook XML V5.0-Based Extension//EN"
 | |
| 	"http://www.FreeBSD.org/XML/share/xml/freebsd50.dtd">
 | |
| <article xmlns="http://docbook.org/ns/docbook"
 | |
|   xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0"
 | |
|   xml:lang="en">
 | |
|   <info>
 | |
|     <title>Writing a GEOM Class</title>
 | |
| 
 | |
|     <authorgroup>
 | |
|       <author>
 | |
| 	<personname>
 | |
| 	  <firstname>Ivan</firstname>
 | |
| 	  <surname>Voras</surname>
 | |
| 	</personname>
 | |
| 	<affiliation>
 | |
| 	  <address>
 | |
| 	    <email>ivoras@FreeBSD.org</email>
 | |
| 	  </address>
 | |
| 	</affiliation>
 | |
|       </author>
 | |
|     </authorgroup>
 | |
| 
 | |
|     <legalnotice xml:id="trademarks" role="trademarks">
 | |
|       &tm-attrib.freebsd;
 | |
|       &tm-attrib.intel;
 | |
|       &tm-attrib.general;
 | |
|     </legalnotice>
 | |
| 
 | |
|     <pubdate>$FreeBSD$</pubdate>
 | |
| 
 | |
|     <releaseinfo>$FreeBSD$</releaseinfo>
 | |
| 
 | |
|     <abstract>
 | |
|       <para>This text documents some starting points in developing
 | |
| 	GEOM classes, and kernel modules in general.  It is assumed
 | |
| 	that the reader is familiar with C userland
 | |
| 	programming.</para>
 | |
|     </abstract>
 | |
|   </info>
 | |
| 
 | |
| <!-- Introduction -->
 | |
|   <sect1 xml:id="intro">
 | |
|     <title>Introduction</title>
 | |
| 
 | |
|     <sect2 xml:id="intro-docs">
 | |
|       <title>Documentation</title>
 | |
| 
 | |
|       <para>Documentation on kernel programming is scarce — it
 | |
| 	is one of few areas where there is nearly nothing in the way
 | |
| 	of friendly tutorials, and the phrase <quote>use the
 | |
| 	  source!</quote> really holds true.  However, there are some
 | |
| 	bits and pieces (some of them seriously outdated) floating
 | |
| 	around that should be studied before beginning to code:</para>
 | |
| 
 | |
|       <itemizedlist>
 | |
| 	<listitem>
 | |
| 	  <para>The <link
 | |
| 	      xlink:href="&url.books.developers-handbook;/index.html">FreeBSD
 | |
| 	      Developer's Handbook</link> — part of the
 | |
| 	    documentation project, it does not contain anything
 | |
| 	    specific to kernel programming, but rather some general
 | |
| 	    useful information.</para>
 | |
| 	</listitem>
 | |
| 
 | |
| 	<listitem>
 | |
| 	  <para>The <link
 | |
| 	      xlink:href="&url.books.arch-handbook;/index.html">FreeBSD
 | |
| 	      Architecture Handbook</link> — also from the
 | |
| 	    documentation project, contains descriptions of several
 | |
| 	    low-level facilities and procedures.  The most important
 | |
| 	    chapter is 13, <link
 | |
| 	      xlink:href="&url.books.arch-handbook;/driverbasics.html">Writing
 | |
| 	      FreeBSD device drivers</link>.</para>
 | |
| 	</listitem>
 | |
| 
 | |
| 	<listitem>
 | |
| 	  <para>The Blueprints section of <link
 | |
| 	      xlink:href="http://www.freebsddiary.org">FreeBSD
 | |
| 	      Diary</link> web site — contains several
 | |
| 	    interesting articles on kernel
 | |
| 	    facilities.</para>
 | |
| 	</listitem>
 | |
| 
 | |
| 	<listitem>
 | |
| 	  <para>The man pages in section 9 — for important
 | |
| 	    documentation on kernel functions.</para>
 | |
| 	</listitem>
 | |
| 
 | |
| 	<listitem>
 | |
| 	  <para>The &man.geom.4; man page and <link
 | |
| 	      xlink:href="http://phk.freebsd.dk/pubs/">PHK's GEOM
 | |
| 	      slides</link> — for general introduction of the
 | |
| 	    GEOM subsystem.</para>
 | |
| 	</listitem>
 | |
| 
 | |
| 	<listitem>
 | |
| 	  <para>Man pages &man.g.bio.9;, &man.g.event.9;,
 | |
| 	    &man.g.data.9;, &man.g.geom.9;, &man.g.provider.9;
 | |
| 	    &man.g.consumer.9;, &man.g.access.9; & others linked
 | |
| 	    from those, for documentation on specific
 | |
| 	    functionalities.</para>
 | |
| 	</listitem>
 | |
| 
 | |
| 	<listitem>
 | |
| 	  <para>The &man.style.9; man page — for documentation
 | |
| 	    on the coding-style conventions which must be followed for
 | |
| 	    any code which is to be committed to the FreeBSD
 | |
| 	    Subversion tree.</para>
 | |
| 	</listitem>
 | |
|       </itemizedlist>
 | |
|     </sect2>
 | |
|   </sect1>
 | |
| 
 | |
|   <sect1 xml:id="prelim">
 | |
|     <title>Preliminaries</title>
 | |
| 
 | |
|     <para>The best way to do kernel development is to have (at least)
 | |
|       two separate computers.  One of these would contain the
 | |
|       development environment and sources, and the other would be used
 | |
|       to test the newly written code by network-booting and
 | |
|       network-mounting filesystems from the first one.  This way if
 | |
|       the new code contains bugs and crashes the machine, it will not
 | |
|       mess up the sources (and other <quote>live</quote> data).  The
 | |
|       second system does not even require a proper display.  Instead,
 | |
|       it could be connected with a serial cable or KVM to the first
 | |
|       one.</para>
 | |
| 
 | |
|     <para>But, since not everybody has two or more computers handy,
 | |
|       there are a few things that can be done to prepare an otherwise
 | |
|       <quote>live</quote> system for developing kernel code.  This
 | |
|       setup is also applicable for developing in a <link
 | |
| 	xlink:href="http://www.vmware.com/">VMWare</link> or <link
 | |
| 	xlink:href="http://www.qemu.org/">QEmu</link> virtual machine
 | |
|       (the next best thing after a dedicated development
 | |
|       machine).</para>
 | |
| 
 | |
|     <sect2 xml:id="prelim-system">
 | |
|       <title>Modifying a System for Development</title>
 | |
| 
 | |
|       <para>For any kernel programming a kernel with
 | |
| 	<option>INVARIANTS</option> enabled is a must-have.  So enter
 | |
| 	these in your kernel configuration file:</para>
 | |
| 
 | |
|       <programlisting>options INVARIANT_SUPPORT
 | |
| options INVARIANTS</programlisting>
 | |
| 
 | |
|       <para>For more debugging you should also include WITNESS
 | |
| 	support, which will alert you of mistakes in locking:</para>
 | |
| 
 | |
|       <programlisting>options WITNESS_SUPPORT
 | |
| options WITNESS</programlisting>
 | |
| 
 | |
|       <para>For debugging crash dumps, a kernel with debug symbols is
 | |
| 	needed:</para>
 | |
| 
 | |
|       <programlisting>  makeoptions    DEBUG=-g</programlisting>
 | |
| 
 | |
|       <para>With the usual way of installing the kernel (<command>make
 | |
| 	  installkernel</command>) the debug kernel will not be
 | |
| 	automatically installed.  It is called
 | |
| 	<filename>kernel.debug</filename> and located in
 | |
| 	<filename>/usr/obj/usr/src/sys/KERNELNAME/</filename>.  For
 | |
| 	convenience it should be copied to
 | |
| 	<filename>/boot/kernel/</filename>.</para>
 | |
| 
 | |
|       <para>Another convenience is enabling the kernel debugger so you
 | |
| 	can examine a kernel panic when it happens.  For this, enter
 | |
| 	the following lines in your kernel configuration file:</para>
 | |
| 
 | |
|       <programlisting>options KDB
 | |
| options DDB
 | |
| options KDB_TRACE</programlisting>
 | |
| 
 | |
|       <para>For this to work you might need to set a sysctl (if it is
 | |
| 	not on by default):</para>
 | |
| 
 | |
|       <programlisting>  debug.debugger_on_panic=1</programlisting>
 | |
| 
 | |
|       <para>Kernel panics will happen, so care should be taken with
 | |
| 	the filesystem cache.  In particular, having softupdates might
 | |
| 	mean the latest file version could be lost if a panic occurs
 | |
| 	before it is committed to storage.  Disabling softupdates
 | |
| 	yields a great performance hit, and still does not guarantee
 | |
| 	data consistency.  Mounting filesystem with the
 | |
| 	<quote>sync</quote> option is needed for that.  For a
 | |
| 	compromise, the softupdates cache delays can be shortened.
 | |
| 	There are three sysctl's that are useful for this (best to be
 | |
| 	set in <filename>/etc/sysctl.conf</filename>):</para>
 | |
| 
 | |
|       <programlisting>kern.filedelay=5
 | |
| kern.dirdelay=4
 | |
| kern.metadelay=3</programlisting>
 | |
| 
 | |
|       <para>The numbers represent seconds.</para>
 | |
| 
 | |
|       <para>For debugging kernel panics, kernel core dumps are
 | |
| 	required.  Since a kernel panic might make filesystems
 | |
| 	unusable, this crash dump is first written to a raw partition.
 | |
| 	Usually, this is the swap partition.  This partition must be
 | |
| 	at least as large as the physical RAM in the machine.  On the
 | |
| 	next boot, the dump is copied to a regular file.  This happens
 | |
| 	after filesystems are checked and mounted, and before swap is
 | |
| 	enabled.  This is controlled with two
 | |
| 	<filename>/etc/rc.conf</filename> variables:</para>
 | |
| 
 | |
|       <programlisting>dumpdev="/dev/ad0s4b"
 | |
| dumpdir="/usr/core </programlisting>
 | |
| 
 | |
|       <para>The <varname>dumpdev</varname> variable specifies the swap
 | |
| 	partition and <varname>dumpdir</varname> tells the system
 | |
| 	where in the filesystem to relocate the core dump on
 | |
| 	reboot.</para>
 | |
| 
 | |
|       <para>Writing kernel core dumps is slow and takes a long time so
 | |
| 	if you have lots of memory (>256M) and lots of panics it
 | |
| 	could be frustrating to sit and wait while it is done (twice
 | |
| 	— first to write it to swap, then to relocate it to
 | |
| 	filesystem).  It is convenient then to limit the amount of RAM
 | |
| 	the system will use via a
 | |
| 	<filename>/boot/loader.conf</filename> tunable:</para>
 | |
| 
 | |
|       <programlisting>  hw.physmem="256M"</programlisting>
 | |
| 
 | |
|       <para>If the panics are frequent and filesystems large (or you
 | |
| 	simply do not trust softupdates+background fsck) it is
 | |
| 	advisable to turn background fsck off via
 | |
| 	<filename>/etc/rc.conf</filename> variable:</para>
 | |
| 
 | |
|       <programlisting>  background_fsck="NO"</programlisting>
 | |
| 
 | |
|       <para>This way, the filesystems will always get checked when
 | |
| 	needed.  Note that with background fsck, a new panic could
 | |
| 	happen while it is checking the disks.  Again, the safest way
 | |
| 	is not to have many local filesystems by using another
 | |
| 	computer as an NFS server.</para>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 xml:id="prelim-starting">
 | |
|       <title>Starting the Project</title>
 | |
| 
 | |
|       <para>For the purpose of creating a new GEOM class, an empty
 | |
| 	subdirectory has to be created under an arbitrary
 | |
| 	user-accessible directory.  You do not have to create the
 | |
| 	module directory under <filename>/usr/src</filename>.</para>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 xml:id="prelim-makefile">
 | |
|       <title>The Makefile</title>
 | |
| 
 | |
|       <para>It is good practice to create
 | |
| 	<filename>Makefile</filename>s for every nontrivial coding
 | |
| 	project, which of course includes kernel modules.</para>
 | |
| 
 | |
|       <para>Creating the <filename>Makefile</filename> is simple
 | |
| 	thanks to an extensive set of helper routines provided by the
 | |
| 	system.  In short, here is how a minimal
 | |
| 	<filename>Makefile</filename> looks for a kernel
 | |
| 	module:</para>
 | |
| 
 | |
|       <programlisting>SRCS=g_journal.c
 | |
| KMOD=geom_journal
 | |
| 
 | |
| .include <bsd.kmod.mk></programlisting>
 | |
| 
 | |
|       <para>This <filename>Makefile</filename> (with changed
 | |
| 	filenames) will do for any kernel module, and a GEOM class can
 | |
| 	reside in just one kernel module.  If more than one file is
 | |
| 	required, list it in the <envar>SRCS</envar> variable,
 | |
| 	separated with whitespace from other filenames.</para>
 | |
|     </sect2>
 | |
|   </sect1>
 | |
| 
 | |
|   <sect1 xml:id="kernelprog">
 | |
|     <title>On FreeBSD Kernel Programming</title>
 | |
| 
 | |
|     <sect2 xml:id="kernelprog-memalloc">
 | |
|       <title>Memory Allocation</title>
 | |
| 
 | |
|       <para>See &man.malloc.9;.  Basic memory allocation is only
 | |
| 	slightly different than its userland equivalent.  Most
 | |
| 	notably, <function>malloc</function>() and
 | |
| 	<function>free</function>() accept additional parameters as is
 | |
| 	described in the man page.</para>
 | |
| 
 | |
|       <para>A <quote>malloc type</quote> must be declared in the
 | |
| 	declaration section of a source file, like this:</para>
 | |
| 
 | |
|       <programlisting>  static MALLOC_DEFINE(M_GJOURNAL, "gjournal data", "GEOM_JOURNAL Data");</programlisting>
 | |
| 
 | |
|       <para>To use this macro, <filename>sys/param.h</filename>,
 | |
| 	<filename>sys/kernel.h</filename> and
 | |
| 	<filename>sys/malloc.h</filename> headers must be
 | |
| 	included.</para>
 | |
| 
 | |
|       <para>There is another mechanism for allocating memory, the UMA
 | |
| 	(Universal Memory Allocator).  See &man.uma.9; for details,
 | |
| 	but it is a special type of allocator mainly used for speedy
 | |
| 	allocation of lists comprised of same-sized items (for
 | |
| 	example, dynamic arrays of structs).</para>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 xml:id="kernelprog-lists">
 | |
|       <title>Lists and Queues</title>
 | |
| 
 | |
|       <para>See &man.queue.3;.  There are a LOT of cases when a list
 | |
| 	of things needs to be maintained.  Fortunately, this data
 | |
| 	structure is implemented (in several ways) by C macros
 | |
| 	included in the system.  The most used list type is TAILQ
 | |
| 	because it is the most flexible.  It is also the one with
 | |
| 	largest memory requirements (its elements are doubly-linked)
 | |
| 	and also the slowest (although the speed variation is on the
 | |
| 	order of several CPU instructions more, so it should not be
 | |
| 	taken seriously).</para>
 | |
| 
 | |
|       <para>If data retrieval speed is very important, see
 | |
| 	&man.tree.3; and &man.hashinit.9;.</para>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 xml:id="kernelprog-bios">
 | |
|       <title>BIOs</title>
 | |
| 
 | |
|       <para>Structure <varname remap="structname">bio</varname> is
 | |
| 	used for any and all Input/Output operations concerning GEOM.
 | |
| 	It basically contains information about what device
 | |
| 	('provider') should satisfy the request, request type, offset,
 | |
| 	length, pointer to a buffer, and a bunch of
 | |
| 	<quote>user-specific</quote> flags and fields that can help
 | |
| 	implement various hacks.</para>
 | |
| 
 | |
|       <para>The important thing here is that <varname
 | |
| 	  remap="structname">bio</varname>s are handled
 | |
| 	asynchronously.  That means that, in most parts of the code,
 | |
| 	there is no analogue to userland's &man.read.2; and
 | |
| 	&man.write.2; calls that do not return until a request is
 | |
| 	done.  Rather, a developer-supplied function is called as a
 | |
| 	notification when the request gets completed (or results in
 | |
| 	error).</para>
 | |
| 
 | |
|       <para>The asynchronous programming model (also called
 | |
| 	<quote>event-driven</quote>) is somewhat harder than the much
 | |
| 	more used imperative one used in userland (at least it takes a
 | |
| 	while to get used to it).  In some cases the helper routines
 | |
| 	<function>g_write_data</function>() and
 | |
| 	<function>g_read_data</function>() can be used, but
 | |
| 	<emphasis>not always</emphasis>.  In particular, they cannot
 | |
| 	be used when a mutex is held; for example, the GEOM topology
 | |
| 	mutex or the internal mutex held during the
 | |
| 	<function>.start</function>() and <function>.stop</function>()
 | |
| 	functions.</para>
 | |
|     </sect2>
 | |
|   </sect1>
 | |
| 
 | |
|   <sect1 xml:id="geom">
 | |
|     <title>On GEOM Programming</title>
 | |
| 
 | |
|     <sect2 xml:id="geom-ggate">
 | |
|       <title>Ggate</title>
 | |
| 
 | |
|       <para>If maximum performance is not needed, a much simpler way
 | |
| 	of making a data transformation is to implement it in userland
 | |
| 	via the ggate (GEOM gate) facility.  Unfortunately, there is
 | |
| 	no easy way to convert between, or even share code between the
 | |
| 	two approaches.</para>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 xml:id="geom-class">
 | |
|       <title>GEOM Class</title>
 | |
| 
 | |
|       <para>GEOM classes are transformations on the data.  These
 | |
| 	transformations can be combined in a tree-like fashion.
 | |
| 	Instances of GEOM classes are called
 | |
| 	<emphasis>geoms</emphasis>.</para>
 | |
| 
 | |
|       <para>Each GEOM class has several <quote>class methods</quote>
 | |
| 	that get called when there is no geom instance available (or
 | |
| 	they are simply not bound to a single instance):</para>
 | |
| 
 | |
|       <itemizedlist>
 | |
| 	<listitem>
 | |
| 	  <para><function>.init</function> is called when GEOM becomes
 | |
| 	    aware of a GEOM class (when the kernel module gets
 | |
| 	    loaded.)</para>
 | |
| 	</listitem>
 | |
| 
 | |
| 	<listitem>
 | |
| 	  <para><function>.fini</function> gets called when GEOM
 | |
| 	    abandons the class (when the module gets
 | |
| 	    unloaded)</para>
 | |
| 	</listitem>
 | |
| 
 | |
| 	<listitem>
 | |
| 	  <para><function>.taste</function> is called next, once for
 | |
| 	    each provider the system has available.  If applicable,
 | |
| 	    this function will usually create and start a geom
 | |
| 	    instance.</para>
 | |
| 	</listitem>
 | |
| 
 | |
| 	<listitem>
 | |
| 	  <para><function>.destroy_geom</function> is called when the
 | |
| 	    geom should be disbanded</para>
 | |
| 	</listitem>
 | |
| 
 | |
| 	<listitem>
 | |
| 	  <para><function>.ctlconf</function> is called when user
 | |
| 	    requests reconfiguration of existing
 | |
| 	    geom</para>
 | |
| 	</listitem>
 | |
|       </itemizedlist>
 | |
| 
 | |
|       <para>Also defined are the GEOM event functions, which will get
 | |
| 	copied to the geom instance.</para>
 | |
| 
 | |
|       <para>Field <function>.geom</function> in the <varname
 | |
| 	  remap="structname">g_class</varname> structure is a LIST of
 | |
| 	geoms instantiated from the class.</para>
 | |
| 
 | |
|       <para>These functions are called from the g_event kernel
 | |
| 	thread.</para>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 xml:id="geom-softc">
 | |
|       <title>Softc</title>
 | |
| 
 | |
|       <para>The name <quote>softc</quote> is a legacy term for
 | |
| 	<quote>driver private data</quote>.  The name most probably
 | |
| 	comes from the archaic term <quote>software control
 | |
| 	  block</quote>.  In GEOM, it is a structure (more precise:
 | |
| 	pointer to a structure) that can be attached to a geom
 | |
| 	instance to hold whatever data is private to the geom
 | |
| 	instance.  Most GEOM classes have the following
 | |
| 	members:</para>
 | |
| 
 | |
|       <itemizedlist>
 | |
| 	<listitem>
 | |
| 	  <para><varname>struct g_provider *provider</varname> : The
 | |
| 	    <quote>provider</quote> this geom
 | |
| 	    instantiates</para>
 | |
| 	</listitem>
 | |
| 
 | |
| 	<listitem>
 | |
| 	  <para><varname>uint16_t n_disks</varname> : Number of
 | |
| 	    consumer this geom consumes</para>
 | |
| 	</listitem>
 | |
| 
 | |
| 	<listitem>
 | |
| 	  <para><varname>struct g_consumer **disks</varname> : Array
 | |
| 	    of <varname>struct g_consumer*</varname>.  (It is not
 | |
| 	    possible to use just single indirection because struct
 | |
| 	    g_consumer* are created on our behalf by
 | |
| 	    GEOM).</para>
 | |
| 	</listitem>
 | |
|       </itemizedlist>
 | |
| 
 | |
|       <para>The <varname remap="structname">softc</varname> structure
 | |
| 	contains all the state of geom instance.  Every geom instance
 | |
| 	has its own softc.</para>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 xml:id="geom-metadata">
 | |
|       <title>Metadata</title>
 | |
| 
 | |
|       <para>Format of metadata is more-or-less class-dependent, but
 | |
| 	MUST start with:</para>
 | |
| 
 | |
|       <itemizedlist>
 | |
| 	<listitem>
 | |
| 	  <para>16 byte buffer for null-terminated signature (usually
 | |
| 	    the class name)</para>
 | |
| 	</listitem>
 | |
| 
 | |
| 	<listitem>
 | |
| 	  <para>uint32 version ID</para>
 | |
| 	</listitem>
 | |
|       </itemizedlist>
 | |
| 
 | |
|       <para>It is assumed that geom classes know how to handle
 | |
| 	metadata with version ID's lower than theirs.</para>
 | |
| 
 | |
|       <para>Metadata is located in the last sector of the provider
 | |
| 	(and thus must fit in it).</para>
 | |
| 
 | |
|       <para>(All this is implementation-dependent but all existing
 | |
| 	code works like that, and it is supported by
 | |
| 	libraries.)</para>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 xml:id="geom-creating">
 | |
|       <title>Labeling/creating a GEOM</title>
 | |
| 
 | |
|       <para>The sequence of events is:</para>
 | |
| 
 | |
|       <itemizedlist>
 | |
| 	<listitem>
 | |
| 	  <para>user calls &man.geom.8; utility (or one of its
 | |
| 	    hardlinked friends)</para>
 | |
| 	</listitem>
 | |
| 
 | |
| 	<listitem>
 | |
| 	  <para>the utility figures out which geom class it is
 | |
| 	    supposed to handle and searches for
 | |
| 	    <filename>geom_<replaceable>CLASSNAME</replaceable>.so</filename>
 | |
| 	    library (usually in
 | |
| 	    <filename>/lib/geom</filename>).</para>
 | |
| 	</listitem>
 | |
| 
 | |
| 	<listitem>
 | |
| 	  <para>it &man.dlopen.3;-s the library, extracts the
 | |
| 	    definitions of command-line parameters and helper
 | |
| 	    functions.</para>
 | |
| 	</listitem>
 | |
|       </itemizedlist>
 | |
| 
 | |
|       <para>In the case of creating/labeling a new geom, this is what
 | |
| 	happens:</para>
 | |
| 
 | |
|       <itemizedlist>
 | |
| 	<listitem>
 | |
| 	  <para>&man.geom.8; looks in the command-line argument for
 | |
| 	    the command (usually <option>label</option>), and calls a
 | |
| 	    helper function.</para>
 | |
| 	</listitem>
 | |
| 
 | |
| 	<listitem>
 | |
| 	  <para>The helper function checks parameters and gathers
 | |
| 	    metadata, which it proceeds to write to all concerned
 | |
| 	    providers.</para>
 | |
| 	</listitem>
 | |
| 
 | |
| 	<listitem>
 | |
| 	  <para>This <quote>spoils</quote> existing geoms (if any) and
 | |
| 	    initializes a new round of <quote>tasting</quote> of the
 | |
| 	    providers.  The intended geom class recognizes the
 | |
| 	    metadata and brings the geom up.</para>
 | |
| 	</listitem>
 | |
|       </itemizedlist>
 | |
| 
 | |
|       <para>(The above sequence of events is implementation-dependent
 | |
| 	but all existing code works like that, and it is supported by
 | |
| 	libraries.)</para>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 xml:id="geom-command">
 | |
|       <title>GEOM Command Structure</title>
 | |
| 
 | |
|       <para>The helper <filename>geom_CLASSNAME.so</filename> library
 | |
| 	exports <varname remap="structname">class_commands</varname>
 | |
| 	structure, which is an array of <varname
 | |
| 	  remap="structname">struct g_command</varname> elements.
 | |
| 	Commands are of uniform format and look like:</para>
 | |
| 
 | |
|       <programlisting>  verb [-options] geomname [other]</programlisting>
 | |
| 
 | |
|       <para>Common verbs are:</para>
 | |
| 
 | |
|       <itemizedlist>
 | |
| 	<listitem>
 | |
| 	  <para>label — to write metadata to devices so they can
 | |
| 	    be recognized at tasting and brought up in
 | |
| 	    geoms</para>
 | |
| 	</listitem>
 | |
| 
 | |
| 	<listitem>
 | |
| 	  <para>destroy — to destroy metadata, so the geoms get
 | |
| 	    destroyed</para>
 | |
| 	</listitem>
 | |
|       </itemizedlist>
 | |
| 
 | |
|       <para>Common options are:</para>
 | |
| 
 | |
|       <itemizedlist>
 | |
| 	<listitem>
 | |
| 	  <para><literal>-v</literal> : be verbose</para>
 | |
| 	</listitem>
 | |
| 
 | |
| 	<listitem>
 | |
| 	  <para><literal>-f</literal> : force</para>
 | |
| 	</listitem>
 | |
|       </itemizedlist>
 | |
| 
 | |
|       <para>Many actions, such as labeling and destroying metadata can
 | |
| 	be performed in userland.  For this, <varname
 | |
| 	  remap="structname">struct g_command</varname> provides field
 | |
| 	<varname>gc_func</varname> that can be set to a function (in
 | |
| 	the same <filename>.so</filename>) that will be called to
 | |
| 	process a verb.  If <varname>gc_func</varname> is NULL, the
 | |
| 	command will be passed to kernel module, to
 | |
| 	<function>.ctlreq</function> function of the geom
 | |
| 	class.</para>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 xml:id="geom-geoms">
 | |
|       <title>Geoms</title>
 | |
| 
 | |
|       <para>Geoms are instances of GEOM classes.  They have internal
 | |
| 	data (a softc structure) and some functions with which they
 | |
| 	respond to external events.</para>
 | |
| 
 | |
|       <para>The event functions are:</para>
 | |
| 
 | |
|       <itemizedlist>
 | |
| 	<listitem>
 | |
| 	  <para><function>.access</function> : calculates permissions
 | |
| 	    (read/write/exclusive)</para>
 | |
| 	</listitem>
 | |
| 
 | |
| 	<listitem>
 | |
| 	  <para><function>.dumpconf</function> : returns XML-formatted
 | |
| 	    information about the geom</para>
 | |
| 	</listitem>
 | |
| 
 | |
| 	<listitem>
 | |
| 	  <para><function>.orphan</function> : called when some
 | |
| 	    underlying provider gets disconnected</para>
 | |
| 	</listitem>
 | |
| 
 | |
| 	<listitem>
 | |
| 	  <para><function>.spoiled</function> : called when some
 | |
| 	    underlying provider gets written to</para>
 | |
| 	</listitem>
 | |
| 
 | |
| 	<listitem>
 | |
| 	  <para><function>.start</function> : handles I/O</para>
 | |
| 	</listitem>
 | |
|       </itemizedlist>
 | |
| 
 | |
|       <para>These functions are called from the
 | |
| 	<function>g_down</function> kernel thread and there can be no
 | |
| 	sleeping in this context, (see definition of sleeping
 | |
| 	elsewhere) which limits what can be done quite a bit, but
 | |
| 	forces the handling to be fast.</para>
 | |
| 
 | |
|       <para>Of these, the most important function for doing actual
 | |
| 	useful work is the <function>.start</function>() function,
 | |
| 	which is called when a BIO request arrives for a provider
 | |
| 	managed by a instance of geom class.</para>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 xml:id="geom-threads">
 | |
|       <title>GEOM Threads</title>
 | |
| 
 | |
|       <para>There are three kernel threads created and run by the GEOM
 | |
| 	framework:</para>
 | |
| 
 | |
|       <itemizedlist>
 | |
| 	<listitem>
 | |
| 	  <para><literal>g_down</literal> : Handles requests coming
 | |
| 	    from high-level entities (such as a userland request) on
 | |
| 	    the way to physical devices</para>
 | |
| 	</listitem>
 | |
| 
 | |
| 	<listitem>
 | |
| 	  <para><literal>g_up</literal> : Handles responses from
 | |
| 	    device drivers to requests made by higher-level
 | |
| 	    entities</para>
 | |
| 	</listitem>
 | |
| 
 | |
| 	<listitem>
 | |
| 	  <para><literal>g_event</literal> : Handles all other cases:
 | |
| 	    creation of geom instances, access counting,
 | |
| 	    <quote>spoil</quote> events, etc.</para>
 | |
| 	</listitem>
 | |
|       </itemizedlist>
 | |
| 
 | |
|       <para>When a user process issues <quote>read data X at offset Y
 | |
| 	  of a file</quote> request, this is what happens:</para>
 | |
| 
 | |
|       <itemizedlist>
 | |
| 	<listitem>
 | |
| 	  <para>The filesystem converts the request into a struct bio
 | |
| 	    instance and passes it to the GEOM subsystem.  It knows
 | |
| 	    what geom instance should handle it because filesystems
 | |
| 	    are hosted directly on a geom instance.</para>
 | |
| 	</listitem>
 | |
| 
 | |
| 	<listitem>
 | |
| 	  <para>The request ends up as a call to the
 | |
| 	    <function>.start</function>() function made on the g_down
 | |
| 	    thread and reaches the top-level geom
 | |
| 	    instance.</para>
 | |
| 	</listitem>
 | |
| 
 | |
| 	<listitem>
 | |
| 	  <para>This top-level geom instance (for example the
 | |
| 	    partition slicer) determines that the request should be
 | |
| 	    routed to a lower-level instance (for example the disk
 | |
| 	    driver).  It makes a copy of the bio request (bio requests
 | |
| 	    <emphasis>ALWAYS</emphasis> need to be copied between
 | |
| 	    instances, with <function>g_clone_bio</function>()!),
 | |
| 	    modifies the data offset and target provider fields and
 | |
| 	    executes the copy with
 | |
| 	    <function>g_io_request</function>()</para>
 | |
| 	</listitem>
 | |
| 
 | |
| 	<listitem>
 | |
| 	  <para>The disk driver gets the bio request also as a call to
 | |
| 	    <function>.start</function>() on the
 | |
| 	    <literal>g_down</literal> thread.  It talks to hardware,
 | |
| 	    gets the data back, and calls
 | |
| 	    <function>g_io_deliver</function>() on the
 | |
| 	    bio.</para>
 | |
| 	</listitem>
 | |
| 
 | |
| 	<listitem>
 | |
| 	  <para>Now, the notification of bio completion <quote>bubbles
 | |
| 	      up</quote> in the <literal>g_up</literal> thread.  First
 | |
| 	    the partition slicer gets <function>.done</function>()
 | |
| 	    called in the <literal>g_up</literal> thread, it uses
 | |
| 	    information stored in the bio to free the cloned <varname
 | |
| 	      remap="structname">bio</varname> structure (with
 | |
| 	    <function>g_destroy_bio</function>()) and calls
 | |
| 	    <function>g_io_deliver</function>() on the original
 | |
| 	    request.</para>
 | |
| 	</listitem>
 | |
| 
 | |
| 	<listitem>
 | |
| 	  <para>The filesystem gets the data and transfers it to
 | |
| 	    userland.</para>
 | |
| 	</listitem>
 | |
|       </itemizedlist>
 | |
| 
 | |
|       <para>See &man.g.bio.9; man page for information how the data is
 | |
| 	passed back and forth in the <varname
 | |
| 	  remap="structname">bio</varname> structure (note in
 | |
| 	particular the <varname>bio_parent</varname> and
 | |
| 	<varname>bio_children</varname> fields and how they are
 | |
| 	handled).</para>
 | |
| 
 | |
|       <para>One important feature is: <emphasis>THERE CAN BE NO
 | |
| 	  SLEEPING IN G_UP AND G_DOWN THREADS</emphasis>.  This means
 | |
| 	that none of the following things can be done in those threads
 | |
| 	(the list is of course not complete, but only
 | |
| 	informative):</para>
 | |
| 
 | |
|       <itemizedlist>
 | |
| 	<listitem>
 | |
| 	  <para>Calls to <function>msleep</function>() and
 | |
| 	    <function>tsleep</function>(),
 | |
| 	    obviously.</para>
 | |
| 	</listitem>
 | |
| 
 | |
| 	<listitem>
 | |
| 	  <para>Calls to <function>g_write_data</function>() and
 | |
| 	    <function>g_read_data</function>(), because these sleep
 | |
| 	    between passing the data to consumers and
 | |
| 	    returning.</para>
 | |
| 	</listitem>
 | |
| 
 | |
| 	<listitem>
 | |
| 	  <para>Waiting for I/O.</para>
 | |
| 	</listitem>
 | |
| 
 | |
| 	<listitem>
 | |
| 	  <para>Calls to &man.malloc.9; and
 | |
| 	    <function>uma_zalloc</function>() with
 | |
| 	    <varname>M_WAITOK</varname> flag set</para>
 | |
| 	</listitem>
 | |
| 
 | |
| 	<listitem>
 | |
| 	  <para>sx and other sleepable locks</para>
 | |
| 	</listitem>
 | |
|       </itemizedlist>
 | |
| 
 | |
|       <para>This restriction is here to stop GEOM code clogging the
 | |
| 	I/O request path, since sleeping is usually not time-bound and
 | |
| 	there can be no guarantees on how long will it take (there are
 | |
| 	some other, more technical reasons also).  It also means that
 | |
| 	there is not much that can be done in those threads; for
 | |
| 	example, almost any complex thing requires memory allocation.
 | |
| 	Fortunately, there is a way out: creating additional kernel
 | |
| 	threads.</para>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 xml:id="geom-kernelthreads">
 | |
|       <title>Kernel Threads for Use in GEOM Code</title>
 | |
| 
 | |
|       <para>Kernel threads are created with &man.kthread.create.9;
 | |
| 	function, and they are sort of similar to userland threads in
 | |
| 	behavior, only they cannot return to caller to signify
 | |
| 	termination, but must call &man.kthread.exit.9;.</para>
 | |
| 
 | |
|       <para>In GEOM code, the usual use of threads is to offload
 | |
| 	processing of requests from <literal>g_down</literal> thread
 | |
| 	(the <function>.start</function>() function).  These threads
 | |
| 	look like <quote>event handlers</quote>: they have a linked
 | |
| 	list of event associated with them (on which events can be
 | |
| 	posted by various functions in various threads so it must be
 | |
| 	protected by a mutex), take the events from the list one by
 | |
| 	one and process them in a big <literal>switch</literal>()
 | |
| 	statement.</para>
 | |
| 
 | |
|       <para>The main benefit of using a thread to handle I/O requests
 | |
| 	is that it can sleep when needed.  Now, this sounds good, but
 | |
| 	should be carefully thought out.  Sleeping is well and very
 | |
| 	convenient but can very effectively destroy performance of the
 | |
| 	geom transformation.  Extremely performance-sensitive classes
 | |
| 	probably should do all the work in
 | |
| 	<function>.start</function>() function call, taking great care
 | |
| 	to handle out-of-memory and similar errors.</para>
 | |
| 
 | |
|       <para>The other benefit of having a event-handler thread like
 | |
| 	that is to serialize all the requests and responses coming
 | |
| 	from different geom threads into one thread.  This is also
 | |
| 	very convenient but can be slow.  In most cases, handling of
 | |
| 	<function>.done</function>() requests can be left to the
 | |
| 	<literal>g_up</literal> thread.</para>
 | |
| 
 | |
|       <para>Mutexes in FreeBSD kernel (see &man.mutex.9;) have one
 | |
| 	distinction from their more common userland cousins —
 | |
| 	the code cannot sleep while holding a mutex).  If the code
 | |
| 	needs to sleep a lot, &man.sx.9; locks may be more
 | |
| 	appropriate.  On the other hand, if you do almost everything
 | |
| 	in a single thread, you may get away with no mutexes at
 | |
| 	all.</para>
 | |
|     </sect2>
 | |
|   </sect1>
 | |
| </article>
 |