diff --git a/en_US.ISO8859-1/articles/smp/Makefile b/en_US.ISO8859-1/articles/smp/Makefile
new file mode 100644
index 0000000000..85675c8e15
--- /dev/null
+++ b/en_US.ISO8859-1/articles/smp/Makefile
@@ -0,0 +1,18 @@
+# $FreeBSD$
+
+MAINTAINER=jhb@FreeBSD.org
+
+DOC?= article
+
+FORMATS?= html
+
+INSTALL_COMPRESSED?=gz
+INSTALL_ONLY_COMPRESSED?=
+
+JADEFLAGS+= -V %generate-article-toc%
+
+SRCS= article.sgml
+
+DOC_PREFIX?= ${.CURDIR}/../../..
+
+.include "${DOC_PREFIX}/share/mk/doc.project.mk"
diff --git a/en_US.ISO8859-1/articles/smp/article.sgml b/en_US.ISO8859-1/articles/smp/article.sgml
new file mode 100644
index 0000000000..3f6b233f60
--- /dev/null
+++ b/en_US.ISO8859-1/articles/smp/article.sgml
@@ -0,0 +1,934 @@
+
+%man;
+
+
+%authors;
+
+
+
+
+]>
+
+
+
+ SMPng Design Document
+
+
+
+ John
+ Baldwin
+
+
+ Robert
+ Watson
+
+
+
+ $FreeBSD$
+
+
+ 2002
+ John Baldwin
+ Robert Watson
+
+
+
+ This document presents the current design and implementation of
+ the SMPng Architecture. First, the basic primitives and tools are
+ introduced. Next, a general architecture for the FreeBSD kernel's
+ synchronization and execution model is laid out. Then, locking
+ strategies for specific subsystems are discussed, documenting the
+ approaches taken to introduce fine-grained synchronization and
+ parallelism for each subsystem. Finally, detailed implementation
+ notes are provided to motivate design choices, and make the reader
+ aware of important implications involving the use of specific
+ primitives.
+
+
+
+
+ Introduction
+
+ This document is a work-in-progress, and will be updated to
+ reflect on-going design and implementation activities associated
+ with the SMPng Project. Many sections currently exist only in
+ outline form, but will be fleshed out as work proceeds. Updates or
+ suggestions regarding the document may be directed to the document
+ editors.
+
+ The goal of SMPng is to allow concurrency in the kernel.
+ The kernel is basically one rather large and complex program. To
+ make the kernel multithreaded we use some of the same tools used
+ to make other programs multithreaded. These include mutexes,
+ reader/writer locks, semaphores, and condition variables. For
+ definitions of many of the terms, please see
+ .
+
+
+
+ Basic Tools and Locking Fundamentals
+
+
+ Atomic Instructions and Memory Barriers
+
+ There are several existing treatments of memory barriers
+ and atomic instructions, so this section will not include a
+ lot of detail. To put it simply, one cannot go around reading
+ variables without a lock if a lock is used to protect writes
+ to that variable. This becomes obvious when you consider that
+ memory barriers simply determine relative order of memory
+ operations; they do not make any guarantee about timing of
+ memory operations. That is, a memory barrier does not force
+ the contents of a CPU's local cache or store buffer to flush.
+ Instead, the memory barrier at lock release simply ensures
+ that all writes to the protected data will be visible to other
+ CPUs or devices if the write to release the lock is visible.
+ The CPU is free to keep that data in its cache or store buffer
+ as long as it wants. However, if another CPU performs an
+ atomic instruction on the same datum, the first CPU must
+ guarantee that the updated value is made visible to the second
+ CPU along with any other operations that memory barriers may
+ require.
+
+ For example, assuming a simple model where data is
+ considered visible when it is in main memory (or a global
+ cache), when an atomic instruction is triggered on one CPU,
+ other CPUs' store buffers and caches must flush any writes to
+ that same cache line along with any pending operations behind
+ a memory barrier.
+
+ This requires one to take special care when using an item
+ protected by atomic instructions. For example, in the sleep
+ mutex implementation, we have to use an
+ atomic_cmpset rather than an
+ atomic_set to turn on the
+ MTX_CONTESTED bit. The reason is that we
+ read the value of mtx_lock into a
+ variable and then make a decision based on that read.
+ However, the value we read may be stale, or it may change
+ while we are making our decision. Thus, when the
+ atomic_set is executed, it may end up
+ setting the bit on a value other than the one we made the
+ decision on. Thus, we have to use an
+ atomic_cmpset to set the value only if
+ the value we made the decision on is up-to-date and
+ valid.
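+
+ The fragment below is an illustrative sketch of the retry loop just
+ described, in the spirit of the sleep mutex code rather than a copy
+ of it; the exact atomic_cmpset_ptr signature and the type of the
+ mtx_lock field are assumptions.
+
+	for (;;) {
+		/* Snapshot the lock word and make our decision on it. */
+		uintptr_t v = m->mtx_lock;
+
+		/*
+		 * The store only happens if mtx_lock still equals v;
+		 * otherwise our snapshot was stale, so re-read and retry.
+		 */
+		if (atomic_cmpset_ptr(&m->mtx_lock, v, v | MTX_CONTESTED))
+			break;
+	}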
+
+ Finally, atomic instructions only allow one item to be
+ updated or read. If one needs to atomically update several
+ items, then a lock must be used instead. For example, if two
+ counters must be read and have values that are consistent
+ relative to each other, then those counters must be protected
+ by a lock rather than by separate atomic instructions.
+
+
+
+ Read Locks versus Write Locks
+
+ Read locks do not need to be as strong as write locks.
+ Both types of locks need to ensure that the data they are
+ accessing is not stale. However, only write access requires
+ exclusive access. Multiple threads can safely read a value.
+ Using different types of locks for reads and writes can be
+ implemented in a number of ways.
+
+ First, sx locks can be used in this manner by using an
+ exclusive lock when writing and a shared lock when reading.
+ This method is quite straightforward.
+
+ A second method is a bit more obscure. You can protect a
+ datum with multiple locks. Then for reading that data you
+ simply need to have a read lock of one of the locks. However,
+ to write to the data, you need to have a write lock of all of
+ the locks. This can make writing rather expensive but can be
+ useful when data is accessed in various ways. For example,
+ the parent process pointer is protected by both the
+ proctree_lock sx lock and the per-process mutex. Sometimes
+ the proc lock is easier to use since we are just checking the
+ parent of a process that we already have locked. However,
+ other places such as inferior need to
+ walk the tree of processes via parent pointers, and locking
+ each process along the way would be prohibitive; it would also
+ be painful to guarantee that the condition being checked
+ remains valid for both the check and the actions taken as a
+ result of the check.
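+
+ As a hedged sketch of the multiple-lock pattern above, assume a
+ hypothetical datum guarded by two sleep mutexes, lock_a and lock_b:
+ a reader may hold either one, while a writer must hold both.
+
+	/* Reader: holding any one of the protecting locks is enough. */
+	mtx_lock(&lock_a);
+	value = shared_datum;
+	mtx_unlock(&lock_a);
+
+	/* Writer: must hold every lock a reader might be relying on. */
+	mtx_lock(&lock_a);
+	mtx_lock(&lock_b);
+	shared_datum = new_value;
+	mtx_unlock(&lock_b);
+	mtx_unlock(&lock_a);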
+
+
+
+ Locking Conditions and Results
+
+ If you need a lock to check the state of a variable so
+ that you can take an action based on the state you read, you
+ can't just hold the lock while reading the variable and then
+ drop the lock before you act on the value you read. Once you
+ drop the lock, the variable can change, rendering your decision
+ invalid. Thus, you must hold the lock both while reading the
+ variable and while performing the action as a result of the
+ test.
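+
+ A minimal sketch of this rule, using a hypothetical flag protected
+ by a hypothetical sleep mutex foo_mtx:
+
+	mtx_lock(&foo_mtx);
+	if (foo_flag == 0) {
+		/*
+		 * Still holding foo_mtx, so the test above remains
+		 * valid while we act on it.
+		 */
+		foo_flag = 1;
+	}
+	mtx_unlock(&foo_mtx);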
+
+
+
+
+ General Architecture and Design
+
+
+ Interrupt Handling
+
+ Following the pattern of several other multithreaded Unix
+ kernels, FreeBSD deals with interrupt handlers by giving them
+ their own thread context. Providing a context for interrupt
+ handlers allows them to block on locks. To help avoid
+ latency, however, interrupt threads run at real-time kernel
+ priority. Thus, interrupt handlers should not execute for very
+ long to avoid starving other kernel threads. In addition,
+ since multiple handlers may share an interrupt thread,
+ interrupt handlers should not sleep or use a sleepable lock to
+ avoid starving another interrupt handler.
+
+ The interrupt threads currently in FreeBSD are referred to
+ as heavyweight interrupt threads. They are called this
+ because switching to an interrupt thread involves a full
+ context switch. In the initial implementation, the kernel was
+ not preemptive and thus interrupts that interrupted a kernel
+ thread would have to wait until the kernel thread blocked or
+ returned to userland before they would have an opportunity to
+ run.
+
+ To deal with the latency problems, the kernel in FreeBSD
+ has been made preemptive. Currently, we only preempt a kernel
+ thread when we release a sleep mutex or when an interrupt
+ comes in. However, the plan is to make the FreeBSD kernel
+ fully preemptive as described below.
+
+ Not all interrupt handlers execute in a thread context.
+ Instead, some handlers execute directly in primary interrupt
+ context. These interrupt handlers are currently misnamed
+ fast interrupt handlers since the
+ INTR_FAST flag used in earlier versions
+ of the kernel is used to mark these handlers. The only
+ interrupts which currently use these types of interrupt
+ handlers are clock interrupts and serial I/O device
+ interrupts. Since these handlers do not have their own
+ context, they may not acquire blocking locks and thus may only
+ use spin mutexes.
+
+ Finally, there is one optional optimization that can be
+ added in MD code called lightweight context switches. Since
+ an interrupt thread executes in a kernel context, it can
+ borrow the vmspace of any process. Thus, in a lightweight
+ context switch, the switch to the interrupt thread does not
+ switch vmspaces but borrows the vmspace of the interrupted
+ thread. In order to ensure that the vmspace of the
+ interrupted thread doesn't disappear out from under us, the
+ interrupted thread is not allowed to execute until the
+ interrupt thread is no longer borrowing its vmspace. This can
+ happen when the interrupt thread either blocks or finishes.
+ If an interrupt thread blocks, then it will use its own
+ context when it is made runnable again. Thus, it can release
+ the interrupted thread.
+
+ The cons of this optimization are that it is very
+ machine specific and complex and thus only worth the effort if
+ there is a large performance improvement. At this point it is
+ probably too early to tell, and in fact it will probably hurt
+ performance as almost all interrupt handlers will immediately
+ block on Giant and require a thread fixup when they block.
+ Also, an alternative method of interrupt handling has been
+ proposed by Mike Smith that works like so:
+
+
+
+ Each interrupt handler has two parts: a predicate
+ which runs in primary interrupt context and a handler
+ which runs in its own thread context.
+
+
+
+ If an interrupt handler has a predicate, then when an
+ interrupt is triggered, the predicate is run. If the
+ predicate returns true then the interrupt is assumed to be
+ fully handled and the kernel returns from the interrupt.
+ If the predicate returns false or there is no predicate,
+ then the threaded handler is scheduled to run.
+
+
+
+ Fitting lightweight context switches into this scheme
+ might prove rather complicated. Since we may want to change
+ to this scheme at some point in the future, it is probably
+ best to defer work on lightweight context switches until we
+ have settled on the final interrupt handling architecture and
+ determined how lightweight context switches might or might
+ not fit into it.
+
+
+
+ Kernel Preemption and Critical Sections
+
+
+ Kernel Preemption in a Nutshell
+
+ Kernel preemption is fairly simple. The basic idea is
+ that a CPU should always be doing the highest priority work
+ available. Well, that is the ideal at least. There are a
+ couple of cases where the expense of achieving the ideal is
+ not worth being perfect.
+
+ Implementing full kernel preemption is very
+ straightforward: when you schedule a thread to be executed
+ by putting it on a runqueue, you check to see if its
+ priority is higher than that of the currently executing thread. If
+ so, you initiate a context switch to that thread.
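+
+ As an illustrative sketch only (the real scheduler code differs),
+ the check might look like this at the end of setrunqueue, keeping
+ in mind that lower numeric priority values are better:
+
+	if (td->td_priority < curthread->td_priority) {
+		/*
+		 * The newly runnable thread is more important than the
+		 * one that is running, so context switch to it.
+		 * mi_switch() is the assumed switch entry point.
+		 */
+		mi_switch();
+	}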
+
+ While locks can protect most data in the case of a
+ preemption, not all of the kernel is preemption safe. For
+ example, if a thread holding a spin mutex is preempted and the
+ new thread attempts to grab the same spin mutex, the new
+ thread may spin forever as the interrupted thread may never
+ get a chance to execute. Also, some code such as the code
+ to assign an address space number for a process during
+ exec() on the Alpha needs to not be preempted as it supports
+ the actual context switch code. Preemption is disabled for
+ these code sections by using a critical section.
+
+
+
+ Critical Sections
+
+ The responsibility of the critical section API is to
+ prevent context switches inside of a critical section. With
+ a fully preemptive kernel, every
+ setrunqueue of a thread other than the
+ current thread is a preemption point. One implementation is
+ for critical_enter to set a per-thread
+ flag that is cleared by its counterpart. If
+ setrunqueue is called with this flag
+ set, it doesn't preempt regardless of the priority of the new
+ thread relative to the current thread. However, since
+ critical sections are used in spin mutexes to prevent
+ context switches and multiple spin mutexes can be acquired,
+ the critical section API must support nesting. For this
+ reason the current implementation uses a nesting count
+ instead of a single per-thread flag.
+
+ In order to minimize latency, preemptions inside of a
+ critical section are deferred rather than dropped. If a
+ thread that would normally be switched to is made runnable
+ while the current thread is inside of a critical section, then
+ a per-thread flag is set
+ to indicate that there is a pending preemption. When the
+ outermost critical section is exited, the flag is checked.
+ If the flag is set, then the current thread is preempted to
+ allow the higher priority thread to run.
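+
+ A simplified sketch of such an implementation follows; the field
+ names td_critnest and td_owepreempt are illustrative only, and the
+ real code must also handle the MD interrupt disabling described
+ below.
+
+	void
+	critical_enter(void)
+	{
+		curthread->td_critnest++;
+	}
+
+	void
+	critical_exit(void)
+	{
+		struct thread *td = curthread;
+
+		if (--td->td_critnest == 0 && td->td_owepreempt) {
+			/* A preemption was deferred while we were inside. */
+			td->td_owepreempt = 0;
+			mi_switch();
+		}
+	}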
+
+ Interrupts pose a problem with regards to spin mutexes.
+ If a low-level interrupt handler needs a lock, it needs to
+ not interrupt any code needing that lock to avoid possible
+ data structure corruption. Currently, providing this
+ mechanism is piggybacked onto the critical section API by means
+ of the cpu_critical_enter and
+ cpu_critical_exit functions. Currently
+ this API disables and reenables interrupts on all of
+ FreeBSD's current platforms. This approach may not be
+ purely optimal, but it is simple to understand and simple to
+ get right. Theoretically, this second API need only be used
+ for spin mutexes that are used in primary interrupt context.
+ However, to make the code simpler, it is used for all spin
+ mutexes and even all critical sections. It may be desirable
+ to split out the MD API from the MI API and only use it in
+ conjunction with the MI API in the spin mutex
+ implementation. If this approach is taken, then the MD API
+ likely would need a rename to show that it is a separate API
+ now.
+
+
+
+ Design Tradeoffs
+
+ As mentioned earlier, a couple of tradeoffs have been
+ made that sacrifice perfect preemption in cases where it may
+ not always provide the best performance.
+
+ The first tradeoff is that the preemption code does not
+ take other CPUs into account. Suppose we have two CPUs, A
+ and B, with the priority of A's thread as 4 and the priority
+ of B's thread as 2. If CPU B makes a thread with priority 1
+ runnable, then in theory, we want CPU A to switch to the new
+ thread so that we will be running the two highest priority
+ runnable threads. However, the cost of determining which
+ CPU to enforce a preemption on as well as actually signaling
+ that CPU via an IPI along with the synchronization that
+ would be required would be enormous. Thus, the current code
+ would instead force CPU B to switch to the higher priority
+ thread. Note that this still puts the system in a better
+ position as CPU B is executing a thread of priority 1 rather
+ than a thread of priority 2.
+
+ The second tradeoff limits immediate kernel preemption
+ to real-time priority kernel threads. In the simple case of
+ preemption defined above, a thread is always preempted
+ immediately (or as soon as a critical section is exited) if
+ a higher priority thread is made runnable. However, many
+ threads executing in the kernel only execute in a kernel
+ context for a short time before either blocking or returning
+ to userland. Thus, if the kernel preempts these threads to
+ run another non-realtime kernel thread, the kernel may
+ switch out the executing thread just before it is about to
+ sleep or execute. The cache on the CPU must then adjust to
+ the new thread. When the kernel returns to the interrupted
+ CPU, it must refill all the cache information that was lost.
+ In addition, two extra context switches are performed that
+ could be avoided if the kernel deferred the preemption until
+ the first thread blocked or returned to userland. Thus, by
+ default, the preemption code will only preempt immediately
+ if the higher priority thread is a real-time priority
+ thread.
+
+ Turning on full kernel preemption for all kernel threads
+ has value as a debugging aid since it exposes more race
+ conditions. It is especially useful on UP systems where many
+ races are hard to simulate otherwise. Thus, there will be a
+ kernel option to enable preemption for all kernel threads
+ that can be used for debugging purposes.
+
+
+
+
+ Thread Migration
+
+ Simply put, a thread migrates when it moves from one CPU
+ to another. In a non-preemptive kernel this can only happen
+ at well-defined points such as when calling
+ tsleep or returning to userland.
+ However, in the preemptive kernel, an interrupt can force a
+ preemption and possible migration at any time. This can have
+ negative effects on per-CPU data since with the exception of
+ curthread and curpcb the
+ data can change whenever you migrate. Since you can
+ potentially migrate at any time this renders per-CPU data
+ rather useless. Thus it is desirable to be able to disable
+ migration for sections of code that need per-CPU data to be
+ stable.
+
+ Critical sections currently prevent migration since they
+ don't allow context switches. However, this may be too strong
+ of a requirement to enforce in some cases since a critical
+ section also effectively blocks interrupt threads on the
+ current processor. As a result, it may be desirable to
+ provide an API whereby code may indicate that if the current
+ thread is preempted it should not migrate to another
+ CPU.
+
+ One possible implementation is to use a per-thread nesting
+ count td_pinnest along with a
+ td_pincpu which is updated to the current
+ CPU on each context switch. Each CPU has its own run queue
+ that holds threads pinned to that CPU. A thread is pinned
+ when its nesting count is greater than zero and a thread
+ starts off unpinned with a nesting count of zero. When a
+ thread is put on a runqueue, we check to see if it is pinned.
+ If so, we put it on the per-CPU runqueue, otherwise we put it
+ on the global runqueue. When
+ choosethread is called to retrieve the
+ next thread, it could either always prefer bound threads to
+ unbound threads or use some sort of bias when comparing
+ priorities. If the nesting count is only ever written to by
+ the thread itself and is only read by other threads when the
+ owning thread is not executing but while holding the
+ sched_lock, then
+ td_pinnest will not need any other locks.
+ The migrate_disable function would
+ increment the nesting count and
+ migrate_enable would decrement the
+ nesting count. Due to the locking requirements specified
+ above, they will only operate on the current thread and thus
+ would not need to handle the case of making a thread
+ migratable that currently resides on a per-CPU run
+ queue.
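+
+ Under those assumptions the two functions reduce to a sketch as
+ simple as this (the field name is the one proposed above, not
+ existing code):
+
+	void
+	migrate_disable(void)
+	{
+		curthread->td_pinnest++;	/* pinned while count > 0 */
+	}
+
+	void
+	migrate_enable(void)
+	{
+		curthread->td_pinnest--;	/* unpinned when it hits zero */
+	}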
+
+ It is still debatable if this API is needed or if the
+ critical section API is sufficient by itself. Many of the
+ places that need to prevent migration also need to prevent
+ preemption as well, and in those places a critical section
+ must be used regardless.
+
+
+
+ Callouts
+
+ The timeout() kernel facility permits
+ kernel services to register functions for execution as part
+ of the softclock() software interrupt.
+ Events are scheduled based on a desired number of clock
+ ticks, and callbacks to the consumer-provided function
+ will occur at approximately the right time.
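+
+ A hedged usage sketch, with a hypothetical consumer name, might
+ look like this:
+
+	static struct callout_handle foo_handle;
+
+	static void
+	foo_expired(void *arg)
+	{
+		/*
+		 * Runs from softclock(); per the text below, Giant is
+		 * held around this call unless the callout was
+		 * registered as CALLOUT_MPSAFE.
+		 */
+		printf("foo timer fired\n");
+	}
+
+	static void
+	foo_start(void)
+	{
+		/* Run foo_expired() roughly one second (hz ticks) from now. */
+		foo_handle = timeout(foo_expired, NULL, hz);
+	}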
+
+ The global list of pending timeout events is protected
+ by a global spin mutex, callout_lock;
+ all access to the timeout list must be performed with this
+ mutex held. When softclock() is
+ woken up, it scans the list of pending timeouts for those
+ that should fire. In order to avoid lock order reversal,
+ the softclock thread will release the
+ callout_lock mutex when invoking the
+ provided timeout() callback function.
+ If the CALLOUT_MPSAFE flag was not set
+ during registration, then Giant will be grabbed before
+ invoking the callout, and then released afterwards. The
+ callout_lock mutex will be re-grabbed
+ before proceeding. The softclock()
+ code is careful to leave the list in a consistent state
+ while releasing the mutex. If DIAGNOSTIC
+ is enabled, then the time taken to execute each function is
+ measured, and a warning generated if it exceeds a
+ threshold.
+
+
+
+
+ Specific Locking Strategies
+
+
+ Credentials
+
+ struct ucred is the system
+ internal credential structure, and is generally used as the
+ basis for process-driven access control. BSD-derived systems
+ use a "copy-on-write" model for credential data: multiple
+ references may exist for a credential structure, and when a
+ change needs to be made, the structure is duplicated,
+ modified, and then the reference replaced. Due to wide-spread
+ caching of the credential to implement access control on open,
+ this results in substantial memory savings. With a move to
+ fine-grained SMP, this model also saves substantially on
+ locking operations by requiring that modification only occur
+ on an unshared credential, avoiding the need for explicit
+ synchronization when consuming a known-shared
+ credential.
+
+ Credential structures with a single reference are
+ considered mutable; shared credential structures must not be
+ modified or a race condition is risked. A mutex,
+ cr_mtxp, protects the reference
+ count of the struct ucred so as to
+ maintain consistency. Any use of the structure requires a
+ valid reference for the duration of the use, or the structure
+ may be released out from under the illegitimate
+ consumer.
+
+ The struct ucred mutex is a leaf
+ mutex, and for performance reasons, is implemented via a mutex
+ pool.
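+
+ As a hedged sketch of the reference discipline described above,
+ consuming another process's credential might look like the
+ following; taking the p_ucred pointer under the proc lock is an
+ assumption about how that pointer itself is protected.
+
+	struct ucred *cred;
+
+	PROC_LOCK(p);
+	cred = p->p_ucred;
+	crhold(cred);			/* take our own reference */
+	PROC_UNLOCK(p);
+	/* ... read the shared, effectively immutable credential ... */
+	crfree(cred);			/* drop the reference when done */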
+
+
+
+ File Descriptors and File Descriptor Tables
+
+ Details to follow.
+
+
+
+ Jail Structures
+
+ struct prison stores
+ administrative details pertinent to the maintenance of jails
+ created using the &man.jail.2; API. This includes the
+ per-jail hostname, IP address, and related settings. This
+ structure is reference-counted since pointers to instances of
+ the structure are shared by many credential structures. A
+ single mutex, pr_mtx protects read
+ and write access to the reference count and all mutable
+ variables inside the struct jail. Some variables are set only
+ when the jail is created, and a valid reference to the
+ struct prison is sufficient to read
+ these values. The precise locking of each entry is documented
+ via comments in jail.h.
+
+
+
+ MAC Framework
+
+ The TrustedBSD MAC Framework maintains data in a variety
+ of kernel objects, in the form of struct
+ label. In general, labels in kernel objects
+ are protected by the same lock as the remainder of the kernel
+ object. For example, the v_label
+ label in struct vnode is protected
+ by the vnode lock on the vnode.
+
+ In addition to labels maintained in standard kernel objects,
+ the MAC Framework also maintains a list of registered and
+ active policies. The policy list is protected by a global
+ mutex (mac_policy_list_lock) and a busy
+ count (also protected by the mutex). Since many access
+ control checks may occur in parallel, entry to the framework
+ for a read-only access to the policy list requires holding the
+ mutex while incrementing (and later decrementing) the busy
+ count. The mutex need not be held for the duration of the
+ MAC entry operation: some operations, such as label operations
+ on file system objects, are long-lived. To modify the policy
+ list, such as during policy registration and deregistration,
+ the mutex must be held and the reference count must be zero,
+ to prevent modification of the list while it is in use.
+
+ A condition variable,
+ mac_policy_list_not_busy, is available to
+ threads that need to wait for the list to become unbusy, but
+ this condition variable must only be waited on if the caller is
+ holding no other locks, or a lock order violation may be
+ possible. The busy count, in effect, acts as a form of
+ reader/writer lock over access to the framework: the difference
+ is that, unlike with an sxlock, consumers waiting for the list
+ to become unbusy may be starved, rather than permitting lock
+ order problems with regards to the busy count and other locks
+ that may be held on entry to (or inside) the MAC Framework.
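+
+ A sketch of the busy count discipline, with an assumed counter name
+ and without the error handling of the real framework, might be:
+
+	static void
+	mac_policy_list_busy(void)
+	{
+		mtx_lock(&mac_policy_list_lock);
+		mac_policy_list_busy_count++;	/* readers in the framework */
+		mtx_unlock(&mac_policy_list_lock);
+	}
+
+	static void
+	mac_policy_list_unbusy(void)
+	{
+		mtx_lock(&mac_policy_list_lock);
+		if (--mac_policy_list_busy_count == 0)
+			cv_signal(&mac_policy_list_not_busy);
+		mtx_unlock(&mac_policy_list_lock);
+	}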
+
+
+
+ Modules
+
+ For the module subsystem there exists a single lock that is
+ used to protect the shared data. This lock is a shared/exclusive
+ (SX) lock and has a good chance of needing to be acquired (shared
+ or exclusively), therefore there are a few macros that have been
+ added to make access to the lock easier. These macros can be
+ located in sys/module.h and are quite basic
+ in terms of usage. The main structures protected under this lock
+ are the module_t structures (when shared)
+ and the global modulelist_t structure,
+ modules. One should review the related source code in
+ kern/kern_module.c to further understand the
+ locking strategy.
+
+
+
+ Newbus Device Tree
+
+ The newbus system will have one sx lock. Readers will
+ lock it &man.sx.slock.9; and writers will lock it
+ &man.sx.xlock.9;. Internal only functions will not do locking
+ at all. The externally visible ones will lock as needed.
+ Items for which it does not matter whether the race is won or
+ lost will not be locked, since they tend to be read all over
+ the place (e.g., &man.device.get.softc.9;). There will be relatively few
+ changes to the newbus data structures, so a single lock should
+ be sufficient and not impose a performance penalty.
+
+
+
+ Pipes
+
+ ...
+
+
+
+ Processes and Threads
+
+ - process hierarchy
+ - proc locks, references
+ - thread-specific copies of proc entries to freeze during system
+ calls, including td_ucred
+ - inter-process operations
+ - process groups and sessions
+
+
+
+ Scheduler
+
+ Lots of references to sched_lock and notes
+ pointing at specific primitives and related magic elsewhere in the
+ document.
+
+
+
+ Select and Poll
+
+ The select() and poll() functions permit threads to block
+ waiting on events on file descriptors--most frequently, whether
+ or not the file descriptors are readable or writable.
+
+ ...
+
+
+
+ SIGIO
+
+ The SIGIO service permits processes to request the delivery
+ of a SIGIO signal to its process group when the read/write status
+ of specified file descriptors changes. At most one process or
+ process group is permitted to register for SIGIO from any given
+ kernel object, and that process or group is referred to as
+ the owner. Each object supporting SIGIO registration contains a
+ pointer field that is NULL if the object is not registered, or
+ points to a struct sigio describing
+ the registration. This field is protected by a global mutex,
+ sigio_lock. Callers to SIGIO maintenance
+ functions must pass in this field "by reference" so that local
+ register copies of the field are not made when unprotected by
+ the lock.
+
+ One struct sigio is allocated for
+ each registered object associated with any process or process
+ group, and contains back-pointers to the object, owner, signal
+ information, a credential, and the general disposition of the
+ registration. Each process or process group contains a list of
+ registered struct sigio structures,
+ p_sigiolst for processes, and
+ pg_sigiolst for process groups.
+ These lists are protected by the process or process group
+ locks respectively. Most fields in each struct
+ sigio are constant for the duration of the
+ registration, with the exception of the
+ sio_pgsigio field which links the
+ struct sigio into the process or
+ process group list. Developers implementing new kernel
+ objects supporting SIGIO will, in general, want to avoid
+ holding structure locks while invoking SIGIO supporting
+ functions, such as fsetown()
+ or funsetown() to avoid
+ defining a lock order between structure locks and the global
+ SIGIO lock. This is generally possible through use of an
+ elevated reference count on the structure, such as reliance
+ on a file descriptor reference to a pipe during a pipe
+ operation.
+
+
+
+ sysctl
+
+ The sysctl() MIB service is invoked
+ from both within the kernel and from userland applications
+ using a system call. At least two issues are raised in locking:
+ first, the protection of the structures maintaining the
+ namespace, and second, interactions with kernel variables and
+ functions that are accessed by the sysctl interface. Since
+ sysctl permits the direct export (and modification) of
+ kernel statistics and configuration parameters, the sysctl
+ mechanism must become aware of appropriate locking semantics
+ for those variables. Currently, sysctl makes use of a
+ single global sxlock to serialize use
+ of sysctl(); however, it is assumed to operate under Giant
+ and other protections are not provided. The remainder of
+ this section speculates on locking and semantic changes
+ to sysctl.
+
+ - Need to change the order of operations for sysctls that
+ update values from read old, copyin and copyout, write new to
+ copyin, lock, read old and write new, unlock, copyout. Normal
+ sysctls that just copyout the old value and set a new value
+ that they copyin may still be able to follow the old model.
+ However, it may be cleaner to use the second model for all of
+ the sysctl handlers to avoid lock operations.
+
+ - To allow for the common case, a sysctl could embed a
+ pointer to a mutex in the SYSCTL_FOO macros and in the struct.
+ This would work for most sysctls. For values protected by sx
+ locks, spin mutexes, or other locking strategies besides a
+ single sleep mutex, SYSCTL_PROC nodes could be used to get the
+ locking right.
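+
+ For example, a hedged sketch of the SYSCTL_PROC approach, with a
+ hypothetical counter protected by a hypothetical sleep mutex:
+
+	static int
+	sysctl_foo_count(SYSCTL_HANDLER_ARGS)
+	{
+		int error, value;
+
+		mtx_lock(&foo_mtx);
+		value = foo_count;	/* read the old value under the lock */
+		mtx_unlock(&foo_mtx);
+		error = sysctl_handle_int(oidp, &value, 0, req);
+		if (error != 0 || req->newptr == NULL)
+			return (error);
+		mtx_lock(&foo_mtx);
+		foo_count = value;	/* write the new value under the lock */
+		mtx_unlock(&foo_mtx);
+		return (0);
+	}
+
+	SYSCTL_PROC(_kern, OID_AUTO, foo_count, CTLTYPE_INT | CTLFLAG_RW,
+	    0, 0, sysctl_foo_count, "I", "hypothetical example counter");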
+
+
+
+ Taskqueue
+
+ The taskqueue's interface has two basic locks associated
+ with it in order to protect the related shared data. The
+ taskqueue_queues_mutex is meant to serve as a
+ lock to protect the taskqueue_queues TAILQ.
+ The other mutex lock associated with this system is the one in the
+ struct taskqueue data structure. The
+ use of the synchronization primitive here is to protect the
+ integrity of the data in the struct
+ taskqueue. It should be noted that there are no
+ separate macros to assist the user in locking down his/her own work
+ since these locks are most likely not going to be used outside of
+ kern/subr_taskqueue.c.
+
+
+
+
+ Implementation Notes
+
+
+ Details of the Mutex Implementation
+
+ - Should we require mutexes to be owned for mtx_destroy()
+ since we can't safely assert that they are unowned by anyone
+ else otherwise?
+
+
+ Spin Mutexes
+
+ - Use a critical section...
+
+
+
+ Sleep Mutexes
+
+ - Describe the races with contested mutexes
+
+ - Why it's safe to read mtx_lock of a contested mutex
+ when holding sched_lock.
+
+ - Priority propagation
+
+
+
+
+ Witness
+
+ - What does it do
+
+ - How does it work
+
+
+
+
+ Miscellaneous Topics
+
+
+ Interrupt Source and ICU Abstractions
+
+ - struct isrc
+
+ - pic drivers
+
+
+
+ Other Random Questions/Topics
+
+ Should we pass an interlock into
+ sema_wait?
+
+ - Generic turnstiles for sleep mutexes and sx locks.
+
+ - Should we have non-sleepable sx locks?
+
+
+
+
+ Definitions
+
+
+ atomic
+
+ An operation is atomic if all of its effects are visible
+ to other CPUs together when the proper access protocol is
+ followed. In the simplest case, these are the atomic instructions
+ provided directly by machine architectures. At a higher
+ level, if several members of a structure are protected by a
+ lock, then a set of operations are atomic if they are all
+ performed while holding the lock without releasing the lock
+ in between any of the operations.
+
+ operation
+
+
+
+
+ block
+
+ A thread is blocked when it is waiting on a lock,
+ resource, or condition. Unfortunately this term is a bit
+ overloaded as a result.
+
+ sleep
+
+
+
+
+ critical section
+
+ A section of code that is not allowed to be preempted.
+ A critical section is entered and exited using the
+ &man.critical.enter.9; API.
+
+
+
+
+ MD
+
+ Machine dependent.
+
+ MI
+
+
+
+
+ memory operation
+
+ A memory operation reads and/or writes to a memory
+ location.
+
+
+
+
+ MI
+
+ Machine independent.
+
+ MD
+
+
+
+
+ operation
+ memory operation
+
+
+
+ primary interrupt context
+
+ Primary interrupt context refers to the code that runs
+ when an interrupt occurs. This code can either run an
+ interrupt handler directly or schedule an asynchronous
+ interrupt thread to execute the interrupt handlers for a
+ given interrupt source.
+
+
+
+
+ realtime kernel thread
+
+ A high priority kernel thread. Currently, the only
+ realtime priority kernel threads are interrupt threads.
+
+ thread
+
+
+
+
+ sleep
+
+ A thread is asleep when it is blocked on a condition
+ variable or a sleep queue via msleep or
+ tsleep.
+
+ block
+
+
+
+
+ sleepable lock
+
+ A sleepable lock is a lock that can be held by a thread
+ which is asleep. Lockmgr locks and sx locks are currently
+ the only sleepable locks in FreeBSD. Eventually, some sx
+ locks such as the allproc and proctree locks may become
+ non-sleepable locks.
+
+ sleep
+
+
+
+
+ thread
+
+ A kernel thread represented by a struct thread. Threads own
+ locks and hold a single execution context.
+
+
+
+
diff --git a/en_US.ISO8859-1/books/arch-handbook/smp/chapter.sgml b/en_US.ISO8859-1/books/arch-handbook/smp/chapter.sgml
new file mode 100644
index 0000000000..3f6b233f60
--- /dev/null
+++ b/en_US.ISO8859-1/books/arch-handbook/smp/chapter.sgml
@@ -0,0 +1,934 @@
+
+%man;
+
+
+%authors;
+
+
+
+
+]>
+
+
+
+ SMPng Design Document
+
+
+
+ John
+ Baldwin
+
+
+ Robert
+ Watson
+
+
+
+ $FreeBSD$
+
+
+ 2002
+ John Baldwin
+ Robert Watson
+
+
+
+ This document presents the current design and implementation of
+ the SMPng Architecture. First, the basic primitives and tools are
+ introduced. Next, a general architecture for the FreeBSD kernel's
+ synchronization and execution model is laid out. Then, locking
+ strategies for specific subsystems are discussed, documenting the
+ approaches taken to introduce fine-grained synchronization and
+ parallelism for each subsystem. Finally, detailed implementation
+ notes are provided to motivate design choices, and make the reader
+ aware of important implications involving the use of specific
+ primitives.
+
+
+
+
+ Introduction
+
+ This document is a work-in-progress, and will be updated to
+ reflect on-going design and implementation activities associated
+ with the SMPng Project. Many sections currently exist only in
+ outline form, but will be fleshed out as work proceeds. Updates or
+ suggestions regarding the document may be directed to the document
+ editors.
+
+ The goal of SMPng is to allow concurrency in the kernel.
+ The kernel is basically one rather large and complex program. To
+ make the kernel multithreaded we use some of the same tools used
+ to make other programs multithreaded. These include mutexes,
+ reader/writer locks, semaphores, and condition variables. For
+ definitions of many of the terms, please see
+ .
+
+
+
+ Basic Tools and Locking Fundamentals
+
+
+ Atomic Instructions and Memory Barriers
+
+ There are several existing treatments of memory barriers
+ and atomic instructions, so this section will not include a
+ lot of detail. To put it simply, one cannot go around reading
+ variables without a lock if a lock is used to protect writes
+ to that variable. This becomes obvious when you consider that
+ memory barriers simply determine relative order of memory
+ operations; they do not make any guarantee about timing of
+ memory operations. That is, a memory barrier does not force
+ the contents of a CPU's local cache or store buffer to flush.
+ Instead, the memory barrier at lock release simply ensures
+ that all writes to the protected data will be visible to other
+ CPUs or devices if the write to release the lock is visible.
+ The CPU is free to keep that data in its cache or store buffer
+ as long as it wants. However, if another CPU performs an
+ atomic instruction on the same datum, the first CPU must
+ guarantee that the updated value is made visible to the second
+ CPU along with any other operations that memory barriers may
+ require.
+
+ For example, assuming a simple model where data is
+ considered visible when it is in main memory (or a global
+ cache), when an atomic instruction is triggered on one CPU,
+ other CPUs' store buffers and caches must flush any writes to
+ that same cache line along with any pending operations behind
+ a memory barrier.
+
+ This requires one to take special care when using an item
+ protected by atomic instructions. For example, in the sleep
+ mutex implementation, we have to use an
+ atomic_cmpset rather than an
+ atomic_set to turn on the
+ MTX_CONTESTED bit. The reason is that we
+ read the value of mtx_lock into a
+ variable and then make a decision based on that read.
+ However, the value we read may be stale, or it may change
+ while we are making our decision. Thus, when the
+ atomic_set is executed, it may end up
+ setting the bit on a value other than the one we made the
+ decision on. Thus, we have to use an
+ atomic_cmpset to set the value only if
+ the value we made the decision on is up-to-date and
+ valid.
+
+ Finally, atomic instructions only allow one item to be
+ updated or read. If one needs to atomically update several
+ items, then a lock must be used instead. For example, if two
+ counters must be read and have values that are consistent
+ relative to each other, then those counters must be protected
+ by a lock rather than by separate atomic instructions.
+
+
+
+ Read Locks versus Write Locks
+
+ Read locks do not need to be as strong as write locks.
+ Both types of locks need to ensure that the data they are
+ accessing is not stale. However, only write access requires
+ exclusive access. Multiple threads can safely read a value.
+ Using different types of locks for reads and writes can be
+ implemented in a number of ways.
+
+ First, sx locks can be used in this manner by using an
+ exclusive lock when writing and a shared lock when reading.
+ This method is quite straightforward.
+
+ A second method is a bit more obscure. You can protect a
+ datum with multiple locks. Then for reading that data you
+ simply need to have a read lock of one of the locks. However,
+ to write to the data, you need to have a write lock of all of
+ the locks. This can make writing rather expensive but can be
+ useful when data is accessed in various ways. For example,
+ the parent process pointer is protected by both the
+ proctree_lock sx lock and the per-process mutex. Sometimes
+ the proc lock is easier to use since we are just checking the
+ parent of a process that we already have locked. However,
+ other places such as inferior need to
+ walk the tree of processes via parent pointers, and locking
+ each process along the way would be prohibitive; it would also
+ be painful to guarantee that the condition being checked
+ remains valid for both the check and the actions taken as a
+ result of the check.
+
+
+
+ Locking Conditions and Results
+
+ If you need a lock to check the state of a variable so
+ that you can take an action based on the state you read, you
+ can't just hold the lock while reading the variable and then
+ drop the lock before you act on the value you read. Once you
+ drop the lock, the variable can change, rendering your decision
+ invalid. Thus, you must hold the lock both while reading the
+ variable and while performing the action as a result of the
+ test.
+
+
+
+
+ General Architecture and Design
+
+
+ Interrupt Handling
+
+ Following the pattern of several other multithreaded Unix
+ kernels, FreeBSD deals with interrupt handlers by giving them
+ their own thread context. Providing a context for interrupt
+ handlers allows them to block on locks. To help avoid
+ latency, however, interrupt threads run at real-time kernel
+ priority. Thus, interrupt handlers should not execute for very
+ long to avoid starving other kernel threads. In addition,
+ since multiple handlers may share an interrupt thread,
+ interrupt handlers should not sleep or use a sleepable lock to
+ avoid starving another interrupt handler.
+
+ The interrupt threads currently in FreeBSD are referred to
+ as heavyweight interrupt threads. They are called this
+ because switching to an interrupt thread involves a full
+ context switch. In the initial implementation, the kernel was
+ not preemptive and thus interrupts that interrupted a kernel
+ thread would have to wait until the kernel thread blocked or
+ returned to userland before they would have an opportunity to
+ run.
+
+ To deal with the latency problems, the kernel in FreeBSD
+ has been made preemptive. Currently, we only preempt a kernel
+ thread when we release a sleep mutex or when an interrupt
+ comes in. However, the plan is to make the FreeBSD kernel
+ fully preemptive as described below.
+
+ Not all interrupt handlers execute in a thread context.
+ Instead, some handlers execute directly in primary interrupt
+ context. These interrupt handlers are currently misnamed
+ fast interrupt handlers since the
+ INTR_FAST flag used in earlier versions
+ of the kernel is used to mark these handlers. The only
+ interrupts which currently use these types of interrupt
+ handlers are clock interrupts and serial I/O device
+ interrupts. Since these handlers do not have their own
+ context, they may not acquire blocking locks and thus may only
+ use spin mutexes.
+
+ Finally, there is one optional optimization that can be
+ added in MD code called lightweight context switches. Since
+ an interrupt thread executes in a kernel context, it can
+ borrow the vmspace of any process. Thus, in a lightweight
+ context switch, the switch to the interrupt thread does not
+ switch vmspaces but borrows the vmspace of the interrupted
+ thread. In order to ensure that the vmspace of the
+ interrupted thread doesn't disappear out from under us, the
+ interrupted thread is not allowed to execute until the
+ interrupt thread is no longer borrowing its vmspace. This can
+ happen when the interrupt thread either blocks or finishes.
+ If an interrupt thread blocks, then it will use its own
+ context when it is made runnable again. Thus, it can release
+ the interrupted thread.
+
+ The cons of this optimization are that it is very
+ machine specific and complex and thus only worth the effort if
+ there is a large performance improvement. At this point it is
+ probably too early to tell, and in fact it will probably hurt
+ performance as almost all interrupt handlers will immediately
+ block on Giant and require a thread fixup when they block.
+ Also, an alternative method of interrupt handling has been
+ proposed by Mike Smith that works like so:
+
+
+
+ Each interrupt handler has two parts: a predicate
+ which runs in primary interrupt context and a handler
+ which runs in its own thread context.
+
+
+
+ If an interrupt handler has a predicate, then when an
+ interrupt is triggered, the predicate is run. If the
+ predicate returns true then the interrupt is assumed to be
+ fully handled and the kernel returns from the interrupt.
+ If the predicate returns false or there is no predicate,
+ then the threaded handler is scheduled to run.
+
+
+
+ Fitting lightweight context switches into this scheme
+ might prove rather complicated. Since we may want to change
+ to this scheme at some point in the future, it is probably
+ best to defer work on lightweight context switches until we
+ have settled on the final interrupt handling architecture and
+ determined how lightweight context switches might or might
+ not fit into it.
+
+
+
+ Kernel Preemption and Critical Sections
+
+
+ Kernel Preemption in a Nutshell
+
+ Kernel preemption is fairly simple. The basic idea is
+ that a CPU should always be doing the highest priority work
+ available. Well, that is the ideal at least. There are a
+ couple of cases where the expense of achieving the ideal is
+ not worth being perfect.
+
+ Implementing full kernel preemption is very
+ straightforward: when you schedule a thread to be executed
+ by putting it on a runqueue, you check to see if its
+ priority is higher than that of the currently executing thread. If
+ so, you initiate a context switch to that thread.
+
+ While locks can protect most data in the case of a
+ preemption, not all of the kernel is preemption safe. For
+ example, if a thread holding a spin mutex is preempted and the
+ new thread attempts to grab the same spin mutex, the new
+ thread may spin forever as the interrupted thread may never
+ get a chance to execute. Also, some code such as the code
+ to assign an address space number for a process during
+ exec() on the Alpha needs to not be preempted as it supports
+ the actual context switch code. Preemption is disabled for
+ these code sections by using a critical section.
+
+
+
+ Critical Sections
+
+ The responsibility of the critical section API is to
+ prevent context switches inside of a critical section. With
+ a fully preemptive kernel, every
+ setrunqueue of a thread other than the
+ current thread is a preemption point. One implementation is
+ for critical_enter to set a per-thread
+ flag that is cleared by its counterpart. If
+ setrunqueue is called with this flag
+ set, it doesn't preempt regardless of the priority of the new
+ thread relative to the current thread. However, since
+ critical sections are used in spin mutexes to prevent
+ context switches and multiple spin mutexes can be acquired,
+ the critical section API must support nesting. For this
+ reason the current implementation uses a nesting count
+ instead of a single per-thread flag.
+
+ In order to minimize latency, preemptions inside of a
+ critical section are deferred rather than dropped. If a
+ thread that would normally be switched to is made runnable
+ while the current thread is inside of a critical section, then
+ a per-thread flag is set
+ to indicate that there is a pending preemption. When the
+ outermost critical section is exited, the flag is checked.
+ If the flag is set, then the current thread is preempted to
+ allow the higher priority thread to run.
+
+ Interrupts pose a problem with regards to spin mutexes.
+ If a low-level interrupt handler needs a lock, it needs to
+ not interrupt any code needing that lock to avoid possible
+ data structure corruption. Currently, providing this
+ mechanism is piggybacked onto the critical section API by means
+ of the cpu_critical_enter and
+ cpu_critical_exit functions. Currently
+ this API disables and reenables interrupts on all of
+ FreeBSD's current platforms. This approach may not be
+ purely optimal, but it is simple to understand and simple to
+ get right. Theoretically, this second API need only be used
+ for spin mutexes that are used in primary interrupt context.
+ However, to make the code simpler, it is used for all spin
+ mutexes and even all critical sections. It may be desirable
+ to split out the MD API from the MI API and only use it in
+ conjunction with the MI API in the spin mutex
+ implementation. If this approach is taken, then the MD API
+ likely would need a rename to show that it is a separate API
+ now.
+
+
+
+ Design Tradeoffs
+
+ As mentioned earlier, a couple of tradeoffs have been
+ made that sacrifice perfect preemption in cases where it may
+ not always provide the best performance.
+
+ The first tradeoff is that the preemption code does not
+ take other CPUs into account. Suppose we have two CPUs, A
+ and B, with the priority of A's thread as 4 and the priority
+ of B's thread as 2. If CPU B makes a thread with priority 1
+ runnable, then in theory, we want CPU A to switch to the new
+ thread so that we will be running the two highest priority
+ runnable threads. However, the cost of determining which
+ CPU to enforce a preemption on as well as actually signaling
+ that CPU via an IPI along with the synchronization that
+ would be required would be enormous. Thus, the current code
+ would instead force CPU B to switch to the higher priority
+ thread. Note that this still puts the system in a better
+ position as CPU B is executing a thread of priority 1 rather
+ than a thread of priority 2.
+
+ The second tradeoff limits immediate kernel preemption
+ to real-time priority kernel threads. In the simple case of
+ preemption defined above, a thread is always preempted
+ immediately (or as soon as a critical section is exited) if
+ a higher priority thread is made runnable. However, many
+ threads executing in the kernel only execute in a kernel
+ context for a short time before either blocking or returning
+ to userland. Thus, if the kernel preempts these threads to
+ run another non-realtime kernel thread, the kernel may
+ switch out the executing thread just before it is about to
+ sleep or execute. The cache on the CPU must then adjust to
+ the new thread. When the kernel returns to the interrupted
+ CPU, it must refill all the cache information that was lost.
+ In addition, two extra context switches are performed that
+ could be avoided if the kernel deferred the preemption until
+ the first thread blocked or returned to userland. Thus, by
+ default, the preemption code will only preempt immediately
+ if the higher priority thread is a real-time priority
+ thread.
+
+ Turning on full kernel preemption for all kernel threads
+ has value as a debugging aid since it exposes more race
+ conditions. It is especially useful on UP systems where many
+ races are hard to simulate otherwise. Thus, there will be a
+ kernel option to enable preemption for all kernel threads
+ that can be used for debugging purposes.
+
+
+
+
+ Thread Migration
+
+ Simply put, a thread migrates when it moves from one CPU
+ to another. In a non-preemptive kernel this can only happen
+ at well-defined points such as when calling
+ tsleep or returning to userland.
+ However, in the preemptive kernel, an interrupt can force a
+ preemption and possible migration at any time. This can have
+ negative effects on per-CPU data since with the exception of
+ curthread and curpcb the
+ data can change whenever you migrate. Since you can
+ potentially migrate at any time this renders per-CPU data
+ rather useless. Thus it is desirable to be able to disable
+ migration for sections of code that need per-CPU data to be
+ stable.
+
+ Critical sections currently prevent migration since they
+ don't allow context switches. However, this may be too strong
+ of a requirement to enforce in some cases since a critical
+ section also effectively blocks interrupt threads on the
+ current processor. As a result, it may be desirable to
+ provide an API whereby code may indicate that if the current
+ thread is preempted it should not migrate to another
+ CPU.
+
+ One possible implementation is to use a per-thread nesting
+ count td_pinnest along with a
+ td_pincpu which is updated to the current
+ CPU on each context switch. Each CPU has its own run queue
+ that holds threads pinned to that CPU. A thread is pinned
+ when its nesting count is greater than zero and a thread
+ starts off unpinned with a nesting count of zero. When a
+ thread is put on a runqueue, we check to see if it is pinned.
+ If so, we put it on the per-CPU runqueue, otherwise we put it
+ on the global runqueue. When
+ choosethread is called to retrieve the
+ next thread, it could either always prefer bound threads to
+ unbound threads or use some sort of bias when comparing
+ priorities. If the nesting count is only ever written to by
+ the thread itself and is only read by other threads when the
+ owning thread is not executing but while holding the
+ sched_lock, then
+ td_pinnest will not need any other locks.
+ The migrate_disable function would
+ increment the nesting count and
+ migrate_enable would decrement the
+ nesting count. Due to the locking requirements specified
+ above, they will only operate on the current thread and thus
+ would not need to handle the case of making a thread
+ migratable that currently resides on a per-CPU run
+ queue.
+
+ It is still debatable if this API is needed or if the
+ critical section API is sufficient by itself. Many of the
+ places that need to prevent migration also need to prevent
+ preemption as well, and in those places a critical section
+ must be used regardless.
+
+
+
+ Callouts
+
+ The timeout() kernel facility permits
+ kernel services to register functions for execution as part
+ of the softclock() software interrupt.
+ Events are scheduled based on a desired number of clock
+ ticks, and callbacks to the consumer-provided function
+ will occur at approximately the right time.
+
+ The global list of pending timeout events is protected
+ by a global spin mutex, callout_lock;
+ all access to the timeout list must be performed with this
+ mutex held. When softclock() is
+ woken up, it scans the list of pending timeouts for those
+ that should fire. In order to avoid lock order reversal,
+ the softclock thread will release the
+ callout_lock mutex when invoking the
+ provided timeout() callback function.
+ If the CALLOUT_MPSAFE flag was not set
+ during registration, then Giant will be grabbed before
+ invoking the callout, and then released afterwards. The
+ callout_lock mutex will be re-grabbed
+ before proceeding. The softclock()
+ code is careful to leave the list in a consistent state
+ while releasing the mutex. If DIAGNOSTIC
+ is enabled, then the time taken to execute each function is
+ measured, and a warning generated if it exceeds a
+ threshold.
+
+
+
+
+ Specific Locking Strategies
+
+
+ Credentials
+
+ struct ucred is the system
+ internal credential structure, and is generally used as the
+ basis for process-driven access control. BSD-derived systems
+ use a "copy-on-write" model for credential data: multiple
+ references may exist for a credential structure, and when a
+ change needs to be made, the structure is duplicated,
+ modified, and then the reference replaced. Due to wide-spread
+ caching of the credential to implement access control on open,
+ this results in substantial memory savings. With a move to
+ fine-grained SMP, this model also saves substantially on
+ locking operations by requiring that modification only occur
+ on an unshared credential, avoiding the need for explicit
+ synchronization when consuming a known-shared
+ credential.
+
+ Credential structures with a single reference are
+ considered mutable; shared credential structures must not be
+ modified or a race condition is risked. A mutex,
+ cr_mtxp, protects the reference
+ count of the struct ucred so as to
+ maintain consistency. Any use of the structure requires a
+ valid reference for the duration of the use, or the structure
+ may be released out from under the illegitimate
+ consumer.
+
+ The struct ucred mutex is a leaf
+ mutex, and for performance reasons, is implemented via a mutex
+ pool.
+
+
+
+ File Descriptors and File Descriptor Tables
+
+ Details to follow.
+
+
+
+ Jail Structures
+
+ struct prison stores
+ administrative details pertinent to the maintenance of jails
+ created using the &man.jail.2; API. This includes the
+ per-jail hostname, IP address, and related settings. This
+ structure is reference-counted since pointers to instances of
+ the structure are shared by many credential structures. A
+ single mutex, pr_mtx protects read
+ and write access to the reference count and all mutable
+ variables inside the struct jail. Some variables are set only
+ when the jail is created, and a valid reference to the
+ struct prison is sufficient to read
+ these values. The precise locking of each entry is documented
+ via comments in jail.h.
+
+
+
+ MAC Framework
+
+ The TrustedBSD MAC Framework maintains data in a variety
+ of kernel objects, in the form of struct
+ label. In general, labels in kernel objects
+ are protected by the same lock as the remainder of the kernel
+ object. For example, the v_label
+ label in struct vnode is protected
+ by the vnode lock on the vnode.
+
+ In addition to labels maintained in standard kernel objects,
+ the MAC Framework also maintains a list of registered and
+ active policies. The policy list is protected by a global
+ mutex (mac_policy_list_lock) and a busy
+ count (also protected by the mutex). Since many access
+ control checks may occur in parallel, entry to the framework
+ for a read-only access to the policy list requires holding the
+ mutex while incrementing (and later decrementing) the busy
+ count. The mutex need not be held for the duration of the
+ MAC entry operation: some operations, such as label operations
+ on file system objects, are long-lived. To modify the policy
+ list, such as during policy registration and deregistration,
+ the mutex must be held and the busy count must be zero,
+ to prevent modification of the list while it is in use.
+
+ A condition variable,
+ mac_policy_list_not_busy, is available to
+ threads that need to wait for the list to become unbusy, but
+ this condition variable must only be waited on if the caller is
+ holding no other locks, or a lock order violation may be
+ possible. The busy count, in effect, acts as a form of
+ reader/writer lock over access to the framework: the difference
+ is that, unlike with an sxlock, consumers waiting for the list
+ to become unbusy may be starved, rather than permitting lock
+ order problems with regard to the busy count and other locks
+ that may be held on entry to (or inside) the MAC Framework.
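+
+ A sketch of the read-only entry and exit protocol follows. The
+ mutex and condition variable names are those given above; the
+ name of the busy counter is illustrative, and the writer path
+ used by policy registration and deregistration is omitted.
+
+    #include <sys/param.h>
+    #include <sys/condvar.h>
+    #include <sys/lock.h>
+    #include <sys/mutex.h>
+
+    extern struct mtx mac_policy_list_lock;
+    extern struct cv mac_policy_list_not_busy;
+    extern int mac_policy_list_busy;        /* illustrative name */
+
+    static void
+    example_mac_check(void)
+    {
+            mtx_lock(&mac_policy_list_lock);
+            mac_policy_list_busy++;
+            mtx_unlock(&mac_policy_list_lock);
+
+            /* ... consult each registered policy for the access check ... */
+
+            mtx_lock(&mac_policy_list_lock);
+            if (--mac_policy_list_busy == 0)
+                    cv_signal(&mac_policy_list_not_busy);
+            mtx_unlock(&mac_policy_list_lock);
+    }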
+
+
+
+ Modules
+
+ For the module subsystem there exists a single lock that is
+ used to protect the shared data. This lock is a shared/exclusive
+ (sx) lock and is likely to be acquired frequently, either shared
+ or exclusively, so a few macros have been added to make access
+ to the lock easier. These macros are located in
+ sys/module.h and are quite basic
+ in terms of usage. The main structures protected by this lock
+ are the module_t structures (when shared)
+ and the global modulelist_t structure,
+ modules. One should review the related source code in
+ kern/kern_module.c to further understand the
+ locking strategy.
+
+
+
+ Newbus Device Tree
+
+ The newbus system will have one sx lock. Readers will
+ lock it with &man.sx.slock.9; and writers will lock it with
+ &man.sx.xlock.9;. Internal-only functions will not do locking
+ at all. The externally visible ones will lock as needed.
+ Items for which it does not matter whether the race is won or
+ lost will not be locked, since they tend to be read all over
+ the place (e.g., &man.device.get.softc.9;). There will be
+ relatively few changes to the newbus data structures, so a
+ single lock should be sufficient and not impose a performance
+ penalty.
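+
+ A minimal sketch of the intended usage follows; the lock name is
+ illustrative, since the text does not name the lock.
+
+    #include <sys/param.h>
+    #include <sys/lock.h>
+    #include <sys/sx.h>
+
+    static struct sx newbus_devtree_lock;
+
+    static void
+    example_devtree_init(void)
+    {
+            sx_init(&newbus_devtree_lock, "newbus device tree");
+    }
+
+    static void
+    example_devtree_read(void)
+    {
+            sx_slock(&newbus_devtree_lock);         /* readers: shared */
+            /* ... walk the device tree ... */
+            sx_sunlock(&newbus_devtree_lock);
+    }
+
+    static void
+    example_devtree_modify(void)
+    {
+            sx_xlock(&newbus_devtree_lock);         /* writers: exclusive */
+            /* ... attach or detach devices ... */
+            sx_xunlock(&newbus_devtree_lock);
+    }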
+
+
+
+ Pipes
+
+ ...
+
+
+
+ Processes and Threads
+
+ - process hierarchy
+ - proc locks, references
+ - thread-specific copies of proc entries to freeze during system
+ calls, including td_ucred
+ - inter-process operations
+ - process groups and sessions
+
+
+
+ Scheduler
+
+ Lots of references to sched_lock and notes
+ pointing at specific primitives and related magic elsewhere in the
+ document.
+
+
+
+ Select and Poll
+
+ The select() and poll() functions permit threads to block
+ waiting on events on file descriptors--most frequently, whether
+ or not the file descriptors are readable or writable.
+
+ ...
+
+
+
+ SIGIO
+
+ The SIGIO service permits a process to request the delivery
+ of a SIGIO signal to its process group when the read/write status
+ of specified file descriptors changes. At most one process or
+ process group is permitted to register for SIGIO from any given
+ kernel object, and that process or group is referred to as
+ the owner. Each object supporting SIGIO registration contains a
+ pointer field that is NULL if the object is not registered, or
+ points to a struct sigio describing
+ the registration. This field is protected by a global mutex,
+ sigio_lock. Callers to SIGIO maintenance
+ functions must pass in this field "by reference" so that local
+ register copies of the field are not made when unprotected by
+ the lock.
+
+ One struct sigio is allocated for
+ each registered object associated with any process or process
+ group, and contains back-pointers to the object, owner, signal
+ information, a credential, and the general disposition of the
+ registration. Each process or process group contains a list of
+ registered struct sigio structures,
+ p_sigiolst for processes, and
+ pg_sigiolst for process groups.
+ These lists are protected by the process or process group
+ locks respectively. Most fields in each struct
+ sigio are constant for the duration of the
+ registration, with the exception of the
+ sio_pgsigio field, which links the
+ struct sigio into the process or
+ process group list. Developers implementing new kernel
+ objects supporting SIGIO will, in general, want to avoid
+ holding structure locks while invoking SIGIO supporting
+ functions, such as fsetown()
+ or funsetown() to avoid
+ defining a lock order between structure locks and the global
+ SIGIO lock. This is generally possible through use of an
+ elevated reference count on the structure, such as reliance
+ on a file descriptor reference to a pipe during a pipe
+ operation.
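+
+ A sketch of a kernel object supporting SIGIO follows. The
+ structure, field, and function names are hypothetical, and the
+ include list is approximate; the important points are that the
+ sigio pointer is handed to fsetown() and
+ funsetown() by reference, and that no object
+ lock is held across those calls.
+
+    #include <sys/param.h>
+    #include <sys/lock.h>
+    #include <sys/mutex.h>
+    #include <sys/sigio.h>
+    #include <sys/filedesc.h>
+
+    struct example_obj {
+            struct mtx       eo_mtx;        /* protects the other fields */
+            struct sigio    *eo_sigio;      /* protected by sigio_lock */
+    };
+
+    static int
+    example_setown(struct example_obj *eo, pid_t pgid)
+    {
+            /* Called without eo_mtx held; fsetown() uses sigio_lock. */
+            return (fsetown(pgid, &eo->eo_sigio));
+    }
+
+    static void
+    example_close(struct example_obj *eo)
+    {
+            funsetown(&eo->eo_sigio);       /* also without eo_mtx held */
+    }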
+
+
+
+ sysctl
+
+ The sysctl() MIB service is invoked
+ from both within the kernel and from userland applications
+ using a system call. At least two issues are raised in locking:
+ first, the protection of the structures maintaining the
+ namespace, and second, interactions with kernel variables and
+ functions that are accessed by the sysctl interface. Since
+ sysctl permits the direct export (and modification) of
+ kernel statistics and configuration parameters, the sysctl
+ mechanism must become aware of appropriate locking semantics
+ for those variables. Currently, sysctl makes use of a
+ single global sxlock to serialize use
+ of sysctl(); however, it is assumed to operate under Giant
+ and other protections are not provided. The remainder of
+ this section speculates on locking and semantic changes
+ to sysctl.
+
+ - Need to change the order of operations for sysctls that
+ update values from "read old, copyin and copyout, write new" to
+ "copyin, lock, read old and write new, unlock, copyout". Normal
+ sysctls that just copyout the old value and set a new value
+ that they copyin may still be able to follow the old model.
+ However, it may be cleaner to use the second model for all of
+ the sysctl handlers to avoid lock operations.
+
+ - To allow for the common case, a sysctl could embed a
+ pointer to a mutex in the SYSCTL_FOO macros and in the struct.
+ This would work for most sysctls. For values protected by sx
+ locks, spin mutexes, or other locking strategies besides a
+ single sleep mutex, SYSCTL_PROC nodes could be used to get the
+ locking right; a sketch of such a handler follows this list.
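+
+ The following sketch approximates the second model using existing
+ primitives: the old value is read under a lock, copied out (and a
+ new value copied in) by sysctl_handle_int(), and
+ the new value is then written back under the lock. The variable,
+ mutex, and OID names are hypothetical, and the mutex is assumed
+ to be initialized elsewhere.
+
+    #include <sys/param.h>
+    #include <sys/kernel.h>
+    #include <sys/lock.h>
+    #include <sys/mutex.h>
+    #include <sys/sysctl.h>
+
+    static int example_value;
+    static struct mtx example_mtx;          /* initialized elsewhere */
+
+    static int
+    sysctl_example(SYSCTL_HANDLER_ARGS)
+    {
+            int error, tmp;
+
+            mtx_lock(&example_mtx);
+            tmp = example_value;            /* read old value under the lock */
+            mtx_unlock(&example_mtx);
+
+            /* Copy out the old value and, if present, copy in a new one. */
+            error = sysctl_handle_int(oidp, &tmp, 0, req);
+            if (error != 0 || req->newptr == NULL)
+                    return (error);
+
+            mtx_lock(&example_mtx);
+            example_value = tmp;            /* write new value under the lock */
+            mtx_unlock(&example_mtx);
+            return (0);
+    }
+
+    SYSCTL_PROC(_kern, OID_AUTO, example, CTLTYPE_INT | CTLFLAG_RW,
+        0, 0, sysctl_example, "I", "Example of a locked sysctl");
+
+ Note that the mutex is deliberately dropped around the copyin and
+ copyout steps so that it is not held while the kernel might fault
+ on user memory.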
+
+
+
+ Taskqueue
+
+ The taskqueue interface has two basic locks associated
+ with it in order to protect the related shared data. The
+ taskqueue_queues_mutex serves as a
+ lock to protect the taskqueue_queues TAILQ.
+ The other mutex lock associated with this system is the one in the
+ struct taskqueue data structure; this
+ synchronization primitive protects the
+ integrity of the data in the struct
+ taskqueue. Note that there are no
+ separate macros to assist the user in locking down his or her own
+ work, since these locks are most likely not going to be used
+ outside of kern/subr_taskqueue.c.
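+
+ For illustration, a minimal sketch of queuing work through the
+ pre-existing taskqueue_swi queue follows; the
+ task function and names are hypothetical. The per-queue mutex
+ described above is taken internally by
+ taskqueue_enqueue() while it manipulates the
+ task list.
+
+    #include <sys/param.h>
+    #include <sys/kernel.h>
+    #include <sys/taskqueue.h>
+
+    static void
+    example_task_fn(void *context, int pending)
+    {
+            /* pending counts enqueues that occurred since the last run */
+    }
+
+    static struct task example_task;
+
+    static void
+    example_submit(void)
+    {
+            TASK_INIT(&example_task, 0, example_task_fn, NULL);
+            taskqueue_enqueue(taskqueue_swi, &example_task);
+    }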
+
+
+
+
+ Implementation Notes
+
+
+ Details of the Mutex Implementation
+
+ - Should we require mutexes to be owned for mtx_destroy()
+ since we can't safely assert that they are unowned by anyone
+ else otherwise?
+
+
+ Spin Mutexes
+
+ - Use a critical section...
+
+
+
+ Sleep Mutexes
+
+ - Describe the races with contested mutexes
+
+ - Why it's safe to read mtx_lock of a contested mutex
+ when holding sched_lock.
+
+ - Priority propagation
+
+
+
+
+ Witness
+
+ - What does it do
+
+ - How does it work
+
+
+
+
+ Miscellaneous Topics
+
+
+ Interrupt Source and ICU Abstractions
+
+ - struct isrc
+
+ - pic drivers
+
+
+
+ Other Random Questions/Topics
+
+ - Should we pass an interlock into
+ sema_wait?
+
+ - Generic turnstiles for sleep mutexes and sx locks.
+
+ - Should we have non-sleepable sx locks?
+
+
+
+
+ Definitions
+
+
+ atomic
+
+ An operation is atomic if all of its effects are visible
+ to other CPUs together when the proper access protocol is
+ followed. In the degenerate case, these are the atomic
+ instructions provided directly by machine architectures. At a
+ higher level, if several members of a structure are protected by
+ a lock, then a set of operations are atomic if they are all
+ performed while holding the lock without releasing the lock
+ in between any of the operations.
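+
+ As a brief illustration (with hypothetical names), the first
+ function below is the degenerate case, a single atomic machine
+ instruction; in the second, both fields are protected by
+ e_mtx, so the paired update is atomic with
+ respect to any consumer that follows the same locking protocol.
+
+    #include <sys/param.h>
+    #include <sys/lock.h>
+    #include <sys/mutex.h>
+    #include <machine/atomic.h>
+
+    static volatile u_int example_counter;
+
+    static void
+    example_bump(void)
+    {
+            atomic_add_int(&example_counter, 1);
+    }
+
+    struct example {
+            struct mtx      e_mtx;
+            int             e_count;
+            int             e_total;
+    };
+
+    static void
+    example_update(struct example *e, int n)
+    {
+            mtx_lock(&e->e_mtx);
+            e->e_count++;
+            e->e_total += n;
+            mtx_unlock(&e->e_mtx);
+    }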
+
+ operation
+
+
+
+
+ block
+
+ A thread is blocked when it is waiting on a lock,
+ resource, or condition. Unfortunately this term is a bit
+ overloaded as a result.
+
+ sleep
+
+
+
+
+ critical section
+
+ A section of code that is not allowed to be preempted.
+ A critical section is entered and exited using the
+ &man.critical.enter.9; API.
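+
+ A minimal sketch (with a hypothetical function name):
+
+    #include <sys/param.h>
+    #include <sys/systm.h>
+
+    static void
+    example_no_preempt(void)
+    {
+            critical_enter();
+            /* ... touch state that must not be preempted ... */
+            critical_exit();
+    }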
+
+
+
+
+ MD
+
+ Machine dependent.
+
+ MI
+
+
+
+
+ memory operation
+
+ A memory operation reads and/or writes to a memory
+ location.
+
+
+
+
+ MI
+
+ Machine independent.
+
+ MD
+
+
+
+
+ operation
+ memory operation
+
+
+
+ primary interrupt context
+
+ Primary interrupt context refers to the code that runs
+ when an interrupt occurs. This code can either run an
+ interrupt handler directly or schedule an asynchronous
+ interrupt thread to execute the interrupt handlers for a
+ given interrupt source.
+
+
+
+
+ realtime kernel thread
+
+ A high priority kernel thread. Currently, the only
+ realtime priority kernel threads are interrupt threads.
+
+ thread
+
+
+
+
+ sleep
+
+ A thread is asleep when it is blocked on a condition
+ variable or a sleep queue via msleep or
+ tsleep.
+
+ block
+
+
+
+
+ sleepable lock
+
+ A sleepable lock is a lock that can be held by a thread
+ which is asleep. Lockmgr locks and sx locks are currently
+ the only sleepable locks in FreeBSD. Eventually, some sx
+ locks such as the allproc and proctree locks may become
+ non-sleepable locks.
+
+ sleep
+
+
+
+
+ thread
+
+ A kernel thread represented by a struct thread. Threads own
+ locks and hold a single execution context.
+
+
+
+