diff --git a/en_US.ISO8859-1/articles/smp/Makefile b/en_US.ISO8859-1/articles/smp/Makefile
new file mode 100644
index 0000000000..85675c8e15
--- /dev/null
+++ b/en_US.ISO8859-1/articles/smp/Makefile
@@ -0,0 +1,18 @@
+# $FreeBSD$
+
+MAINTAINER=jhb@FreeBSD.org
+
+DOC?= article
+
+FORMATS?= html
+
+INSTALL_COMPRESSED?=gz
+INSTALL_ONLY_COMPRESSED?=
+
+JADEFLAGS+= -V %generate-article-toc%
+
+SRCS= article.sgml
+
+DOC_PREFIX?= ${.CURDIR}/../../..
+
+.include "${DOC_PREFIX}/share/mk/doc.project.mk"
diff --git a/en_US.ISO8859-1/articles/smp/article.sgml b/en_US.ISO8859-1/articles/smp/article.sgml
new file mode 100644
index 0000000000..3f6b233f60
--- /dev/null
+++ b/en_US.ISO8859-1/articles/smp/article.sgml
@@ -0,0 +1,934 @@
+
+%man;
+
+
+%authors;
+
+
+
+
+]>
+
+
+
+ SMPng Design Document
+
+
+
+ John
+ Baldwin
+
+
+ Robert
+ Watson
+
+
+
+ $FreeBSD$
+
+
+ 2002
+ John Baldwin
+ Robert Watson
+
+
+
+ This document presents the current design and implementation of
+ the SMPng Architecture. First, the basic primitives and tools are
+ introduced. Next, a general architecture for the FreeBSD kernel's
+ synchronization and execution model is laid out. Then, locking
+ strategies for specific subsystems are discussed, documenting the
+ approaches taken to introduce fine-grained synchronization and
+ parallelism for each subsystem. Finally, detailed implementation
+ notes are provided to motivate design choices, and make the reader
+ aware of important implications involving the use of specific
+ primitives.
+
+
+
+
+ Introduction
+
+ This document is a work-in-progress, and will be updated to
+ reflect on-going design and implementation activities associated
+ with the SMPng Project. Many sections currently exist only in
+ outline form, but will be fleshed out as work proceeds. Updates or
+ suggestions regarding the document may be directed to the document
+ editors.
+
+ The goal of SMPng is to allow concurrency in the kernel.
+ The kernel is basically one rather large and complex program. To
+ make the kernel multithreaded we use some of the same tools used
+ to make other programs multithreaded. These include mutexes,
+ reader/writer locks, semaphores, and condition variables. For
+ definitions of many of the terms, please see
+ .
+
+
+
+ Basic Tools and Locking Fundamentals
+
+
+ Atomic Instructions and Memory Barriers
+
+ There are several existing treatments of memory barriers
+ and atomic instructions, so this section will not include a
+ lot of detail. To put it simply, one cannot go around reading
+ variables without a lock if a lock is used to protect writes
+ to that variable. This becomes obvious when you consider that
+ memory barriers simply determine relative order of memory
+ operations; they do not make any guarantee about timing of
+ memory operations. That is, a memory barrier does not force
+ the contents of a CPU's local cache or store buffer to flush.
+ Instead, the memory barrier at lock release simply ensures
+ that all writes to the protected data will be visible to other
+ CPUs or devices if the write to release the lock is visible.
+ The CPU is free to keep that data in its cache or store buffer
+ as long as it wants. However, if another CPU performs an
+ atomic instruction on the same datum, the first CPU must
+ guarantee that the updated value is made visible to the second
+ CPU along with any other operations that memory barriers may
+ require.
+
+ For example, assuming a simple model where data is
+ considered visible when it is in main memory (or a global
+ cache), when an atomic instruction is triggered on one CPU,
+ other CPUs' store buffers and caches must flush any writes to
+ that same cache line along with any pending operations behind
+ a memory barrier.
+
+ This requires one to take special care when using an item
+ protected by atomic instructions. For example, in the sleep
+ mutex implementation, we have to use an
+ atomic_cmpset rather than an
+ atomic_set to turn on the
+ MTX_CONTESTED bit. The reason is that we
+ read the value of mtx_lock into a
+ variable and then make a decision based on that read.
+ However, the value we read may be stale, or it may change
+ while we are making our decision. Thus, when the
+ atomic_set is executed, it may end up
+ setting the bit on a value other than the one we made the
+ decision on. Thus, we have to use an
+ atomic_cmpset to set the value only if
+ the value we made the decision on is up-to-date and
+ valid.
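+
+ The fragment below is an illustrative sketch of the retry loop just
+ described, in the spirit of the sleep mutex code rather than a copy
+ of it; the exact atomic_cmpset_ptr signature and the type of the
+ mtx_lock field are assumptions.
+
+	for (;;) {
+		/* Snapshot the lock word and make our decision on it. */
+		uintptr_t v = m->mtx_lock;
+
+		/*
+		 * The store only happens if mtx_lock still equals v;
+		 * otherwise our snapshot was stale, so re-read and retry.
+		 */
+		if (atomic_cmpset_ptr(&m->mtx_lock, v, v | MTX_CONTESTED))
+			break;
+	}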
+
+ Finally, atomic instructions only allow one item to be
+ updated or read. If one needs to atomically update several
+ items, then a lock must be used instead. For example, if two
+ counters must be read and have values that are consistent
+ relative to each other, then those counters must be protected
+ by a lock rather than by separate atomic instructions.
+
+
+
+ Read Locks versus Write Locks
+
+ Read locks do not need to be as strong as write locks.
+ Both types of locks need to ensure that the data they are
+ accessing is not stale. However, only write access requires
+ exclusive access. Multiple threads can safely read a value.
+ Using different types of locks for reads and writes can be
+ implemented in a number of ways.
+
+ First, sx locks can be used in this manner by using an
+ exclusive lock when writing and a shared lock when reading.
+ This method is quite straightforward.
+
+ A second method is a bit more obscure. You can protect a
+ datum with multiple locks. Then for reading that data you
+ simply need to have a read lock of one of the locks. However,
+ to write to the data, you need to have a write lock of all of
+ the locks. This can make writing rather expensive but can be
+ useful when data is accessed in various ways. For example,
+ the parent process pointer is protected by both the
+ proctree_lock sx lock and the per-process mutex. Sometimes
+ the proc lock is easier to use since we are just checking the
+ parent of a process that we already have locked. However,
+ other places such as inferior need to
+ walk the tree of processes via parent pointers, and locking
+ each process along the way would be prohibitive; it would also
+ be painful to guarantee that the condition being checked
+ remains valid for both the check and the actions taken as a
+ result of the check.
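+
+ As a hedged sketch of the multiple-lock pattern above, assume a
+ hypothetical datum guarded by two sleep mutexes, lock_a and lock_b:
+ a reader may hold either one, while a writer must hold both.
+
+	/* Reader: holding any one of the protecting locks is enough. */
+	mtx_lock(&lock_a);
+	value = shared_datum;
+	mtx_unlock(&lock_a);
+
+	/* Writer: must hold every lock a reader might be relying on. */
+	mtx_lock(&lock_a);
+	mtx_lock(&lock_b);
+	shared_datum = new_value;
+	mtx_unlock(&lock_b);
+	mtx_unlock(&lock_a);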
+
+
+
+ Locking Conditions and Results
+
+ If you need a lock to check the state of a variable so
+ that you can take an action based on the state you read, you
+ can't just hold the lock while reading the variable and then
+ drop the lock before you act on the value you read. Once you
+ drop the lock, the variable can change, rendering your decision
+ invalid. Thus, you must hold the lock both while reading the
+ variable and while performing the action as a result of the
+ test.
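+
+ A minimal sketch of this rule, using a hypothetical flag protected
+ by a hypothetical sleep mutex foo_mtx:
+
+	mtx_lock(&foo_mtx);
+	if (foo_flag == 0) {
+		/*
+		 * Still holding foo_mtx, so the test above remains
+		 * valid while we act on it.
+		 */
+		foo_flag = 1;
+	}
+	mtx_unlock(&foo_mtx);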
+
+
+
+
+ General Architecture and Design
+
+
+ Interrupt Handling
+
+ Following the pattern of several other multithreaded Unix
+ kernels, FreeBSD deals with interrupt handlers by giving them
+ their own thread context. Providing a context for interrupt
+ handlers allows them to block on locks. To help avoid
+ latency, however, interrupt threads run at real-time kernel
+ priority. Thus, interrupt handlers should not execute for very
+ long to avoid starving other kernel threads. In addition,
+ since multiple handlers may share an interrupt thread,
+ interrupt handlers should not sleep or use a sleepable lock to
+ avoid starving another interrupt handler.
+
+ The interrupt threads currently in FreeBSD are referred to
+ as heavyweight interrupt threads. They are called this
+ because switching to an interrupt thread involves a full
+ context switch. In the initial implementation, the kernel was
+ not preemptive and thus interrupts that interrupted a kernel
+ thread would have to wait until the kernel thread blocked or
+ returned to userland before they would have an opportunity to
+ run.
+
+ To deal with the latency problems, the kernel in FreeBSD
+ has been made preemptive. Currently, we only preempt a kernel
+ thread when we release a sleep mutex or when an interrupt
+ comes in. However, the plan is to make the FreeBSD kernel
+ fully preemptive as described below.
+
+ Not all interrupt handlers execute in a thread context.
+ Instead, some handlers execute directly in primary interrupt
+ context. These interrupt handlers are currently misnamed
+ fast interrupt handlers since the
+ INTR_FAST flag used in earlier versions
+ of the kernel is used to mark these handlers. The only
+ interrupts which currently use these types of interrupt
+ handlers are clock interrupts and serial I/O device
+ interrupts. Since these handlers do not have their own
+ context, they may not acquire blocking locks and thus may only
+ use spin mutexes.
+
+ Finally, there is one optional optimization that can be
+ added in MD code called lightweight context switches. Since
+ an interrupt thread executes in a kernel context, it can
+ borrow the vmspace of any process. Thus, in a lightweight
+ context switch, the switch to the interrupt thread does not
+ switch vmspaces but borrows the vmspace of the interrupted
+ thread. In order to ensure that the vmspace of the
+ interrupted thread doesn't disappear out from under us, the
+ interrupted thread is not allowed to execute until the
+ interrupt thread is no longer borrowing its vmspace. This can
+ happen when the interrupt thread either blocks or finishes.
+ If an interrupt thread blocks, then it will use its own
+ context when it is made runnable again. Thus, it can release
+ the interrupted thread.
+
+ The cons of this optimization are that it is very
+ machine specific and complex and thus only worth the effort if
+ there is a large performance improvement. At this point it is
+ probably too early to tell, and in fact it will probably hurt
+ performance as almost all interrupt handlers will immediately
+ block on Giant and require a thread fixup when they block.
+ Also, an alternative method of interrupt handling has been
+ proposed by Mike Smith that works like so:
+
+
+
+ Each interrupt handler has two parts: a predicate
+ which runs in primary interrupt context and a handler
+ which runs in its own thread context.
+
+
+
+ If an interrupt handler has a predicate, then when an
+ interrupt is triggered, the predicate is run. If the
+ predicate returns true then the interrupt is assumed to be
+ fully handled and the kernel returns from the interrupt.
+ If the predicate returns false or there is no predicate,
+ then the threaded handler is scheduled to run.
+
+
+
+ Fitting lightweight context switches into this scheme
+ might prove rather complicated. Since we may want to change
+ to this scheme at some point in the future, it is probably
+ best to defer work on lightweight context switches until we
+ have settled on the final interrupt handling architecture and
+ determined how lightweight context switches might or might
+ not fit into it.
+
+
+
+ Kernel Preemption and Critical Sections
+
+
+ Kernel Preemption in a Nutshell
+
+ Kernel preemption is fairly simple. The basic idea is
+ that a CPU should always be doing the highest priority work
+ available. Well, that is the ideal at least. There are a
+ couple of cases where the expense of achieving the ideal is
+ not worth being perfect.
+
+ Implementing full kernel preemption is very
+ straightforward: when you schedule a thread to be executed
+ by putting it on a runqueue, you check to see if its
+ priority is higher than that of the currently executing thread. If
+ so, you initiate a context switch to that thread.
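+
+ As an illustrative sketch only (the real scheduler code differs),
+ the check might look like this at the end of setrunqueue, keeping
+ in mind that lower numeric priority values are better:
+
+	if (td->td_priority < curthread->td_priority) {
+		/*
+		 * The newly runnable thread is more important than the
+		 * one that is running, so context switch to it.
+		 * mi_switch() is the assumed switch entry point.
+		 */
+		mi_switch();
+	}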
+
+ While locks can protect most data in the case of a
+ preemption, not all of the kernel is preemption safe. For
+ example, if a thread holding a spin mutex is preempted and the
+ new thread attempts to grab the same spin mutex, the new
+ thread may spin forever as the interrupted thread may never
+ get a chance to execute. Also, some code such as the code
+ to assign an address space number for a process during
+ exec() on the Alpha needs to not be preempted as it supports
+ the actual context switch code. Preemption is disabled for
+ these code sections by using a critical section.
+
+
+
+ Critical Sections
+
+ The responsibility of the critical section API is to
+ prevent context switches inside of a critical section. With
+ a fully preemptive kernel, every
+ setrunqueue of a thread other than the
+ current thread is a preemption point. One implementation is
+ for critical_enter to set a per-thread
+ flag that is cleared by its counterpart. If
+ setrunqueue is called with this flag
+ set, it doesn't preempt regardless of the priority of the new
+ thread relative to the current thread. However, since
+ critical sections are used in spin mutexes to prevent
+ context switches and multiple spin mutexes can be acquired,
+ the critical section API must support nesting. For this
+ reason the current implementation uses a nesting count
+ instead of a single per-thread flag.
+
+ In order to minimize latency, preemptions inside of a
+ critical section are deferred rather than dropped. If a
+ thread that would normally be switched to is made runnable
+ while the current thread is inside of a critical section, then
+ a per-thread flag is set
+ to indicate that there is a pending preemption. When the
+ outermost critical section is exited, the flag is checked.
+ If the flag is set, then the current thread is preempted to
+ allow the higher priority thread to run.
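+
+ A simplified sketch of such an implementation follows; the field
+ names td_critnest and td_owepreempt are illustrative only, and the
+ real code must also handle the MD interrupt disabling described
+ below.
+
+	void
+	critical_enter(void)
+	{
+		curthread->td_critnest++;
+	}
+
+	void
+	critical_exit(void)
+	{
+		struct thread *td = curthread;
+
+		if (--td->td_critnest == 0 && td->td_owepreempt) {
+			/* A preemption was deferred while we were inside. */
+			td->td_owepreempt = 0;
+			mi_switch();
+		}
+	}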
+
+ Interrupts pose a problem with regards to spin mutexes.
+ If a low-level interrupt handler needs a lock, it needs to
+ not interrupt any code needing that lock to avoid possible
+ data structure corruption. Currently, providing this
+ mechanism is piggybacked onto the critical section API by means
+ of the cpu_critical_enter and
+ cpu_critical_exit functions. Currently
+ this API disables and reenables interrupts on all of
+ FreeBSD's current platforms. This approach may not be
+ purely optimal, but it is simple to understand and simple to
+ get right. Theoretically, this second API need only be used
+ for spin mutexes that are used in primary interrupt context.
+ However, to make the code simpler, it is used for all spin
+ mutexes and even all critical sections. It may be desirable
+ to split out the MD API from the MI API and only use it in
+ conjunction with the MI API in the spin mutex
+ implementation. If this approach is taken, then the MD API
+ likely would need a rename to show that it is a separate API
+ now.
+
+
+
+ Design Tradeoffs
+
+ As mentioned earlier, a couple of tradeoffs have been
+ made that sacrifice perfect preemption in cases where it may
+ not always provide the best performance.
+
+ The first tradeoff is that the preemption code does not
+ take other CPUs into account. Suppose we have two CPUs, A
+ and B, with the priority of A's thread as 4 and the priority
+ of B's thread as 2. If CPU B makes a thread with priority 1
+ runnable, then in theory, we want CPU A to switch to the new
+ thread so that we will be running the two highest priority
+ runnable threads. However, the cost of determining which
+ CPU to enforce a preemption on as well as actually signaling
+ that CPU via an IPI along with the synchronization that
+ would be required would be enormous. Thus, the current code
+ would instead force CPU B to switch to the higher priority
+ thread. Note that this still puts the system in a better
+ position as CPU B is executing a thread of priority 1 rather
+ than a thread of priority 2.
+
+ The second tradeoff limits immediate kernel preemption
+ to real-time priority kernel threads. In the simple case of
+ preemption defined above, a thread is always preempted
+ immediately (or as soon as a critical section is exited) if
+ a higher priority thread is made runnable. However, many
+ threads executing in the kernel only execute in a kernel
+ context for a short time before either blocking or returning
+ to userland. Thus, if the kernel preempts these threads to
+ run another non-realtime kernel thread, the kernel may
+ switch out the executing thread just before it is about to
+ sleep or execute. The cache on the CPU must then adjust to
+ the new thread. When the kernel returns to the interrupted
+ CPU, it must refill all the cache information that was lost.
+ In addition, two extra context switches are performed that
+ could be avoided if the kernel deferred the preemption until
+ the first thread blocked or returned to userland. Thus, by
+ default, the preemption code will only preempt immediately
+ if the higher priority thread is a real-time priority
+ thread.
+
+ Turning on full kernel preemption for all kernel threads
+ has value as a debugging aid since it exposes more race
+ conditions. It is especially useful on UP systems where many
+ races are hard to simulate otherwise. Thus, there will be a
+ kernel option to enable preemption for all kernel threads
+ that can be used for debugging purposes.
+
+
+
+
+ Thread Migration
+
+ Simply put, a thread migrates when it moves from one CPU
+ to another. In a non-preemptive kernel this can only happen
+ at well-defined points such as when calling
+ tsleep or returning to userland.
+ However, in the preemptive kernel, an interrupt can force a
+ preemption and possible migration at any time. This can have
+ negative effects on per-CPU data since with the exception of
+ curthread and curpcb the
+ data can change whenever you migrate. Since you can
+ potentially migrate at any time this renders per-CPU data
+ rather useless. Thus it is desirable to be able to disable
+ migration for sections of code that need per-CPU data to be
+ stable.
+
+ Critical sections currently prevent migration since they
+ don't allow context switches. However, this may be too strong
+ of a requirement to enforce in some cases since a critical
+ section also effectively blocks interrupt threads on the
+ current processor. As a result, it may be desirable to
+ provide an API whereby code may indicate that if the current
+ thread is preempted it should not migrate to another
+ CPU.
+
+ One possible implementation is to use a per-thread nesting
+ count td_pinnest along with a
+ td_pincpu which is updated to the current
+ CPU on each context switch. Each CPU has its own run queue
+ that holds threads pinned to that CPU. A thread is pinned
+ when its nesting count is greater than zero and a thread
+ starts off unpinned with a nesting count of zero. When a
+ thread is put on a runqueue, we check to see if it is pinned.
+ If so, we put it on the per-CPU runqueue, otherwise we put it
+ on the global runqueue. When
+ choosethread is called to retrieve the
+ next thread, it could either always prefer bound threads to
+ unbound threads or use some sort of bias when comparing
+ priorities. If the nesting count is only ever written to by
+ the thread itself and is only read by other threads when the
+ owning thread is not executing but while holding the
+ sched_lock, then
+ td_pinnest will not need any other locks.
+ The migrate_disable function would
+ increment the nesting count and
+ migrate_enable would decrement the
+ nesting count. Due to the locking requirements specified
+ above, they will only operate on the current thread and thus
+ would not need to handle the case of making a thread
+ migratable that currently resides on a per-CPU run
+ queue.
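+
+ Under those assumptions the two functions reduce to a sketch as
+ simple as this (the field name is the one proposed above, not
+ existing code):
+
+	void
+	migrate_disable(void)
+	{
+		curthread->td_pinnest++;	/* pinned while count > 0 */
+	}
+
+	void
+	migrate_enable(void)
+	{
+		curthread->td_pinnest--;	/* unpinned when it hits zero */
+	}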
+
+ It is still debatable if this API is needed or if the
+ critical section API is sufficient by itself. Many of the
+ places that need to prevent migration also need to prevent
+ preemption as well, and in those places a critical section
+ must be used regardless.
+
+
+
+ Callouts
+
+ The timeout() kernel facility permits
+ kernel services to register functions for execution as part
+ of the softclock() software interrupt.
+ Events are scheduled based on a desired number of clock
+ ticks, and callbacks to the consumer-provided function
+ will occur at approximately the right time.
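+
+ A hedged usage sketch, with a hypothetical consumer name, might
+ look like this:
+
+	static struct callout_handle foo_handle;
+
+	static void
+	foo_expired(void *arg)
+	{
+		/*
+		 * Runs from softclock(); per the text below, Giant is
+		 * held around this call unless the callout was
+		 * registered as CALLOUT_MPSAFE.
+		 */
+		printf("foo timer fired\n");
+	}
+
+	static void
+	foo_start(void)
+	{
+		/* Run foo_expired() roughly one second (hz ticks) from now. */
+		foo_handle = timeout(foo_expired, NULL, hz);
+	}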
+
+ The global list of pending timeout events is protected
+ by a global spin mutex, callout_lock;
+ all access to the timeout list must be performed with this
+ mutex held. When softclock() is
+ woken up, it scans the list of pending timeouts for those
+ that should fire. In order to avoid lock order reversal,
+ the softclock thread will release the
+ callout_lock mutex when invoking the
+ provided timeout() callback function.
+ If the CALLOUT_MPSAFE flag was not set
+ during registration, then Giant will be grabbed before
+ invoking the callout, and then released afterwards. The
+ callout_lock mutex will be re-grabbed
+ before proceeding. The softclock()
+ code is careful to leave the list in a consistent state
+ while releasing the mutex. If DIAGNOSTIC
+ is enabled, then the time taken to execute each function is
+ measured, and a warning generated if it exceeds a
+ threshold.
+
+
+
+
+ Specific Locking Strategies
+
+
+ Credentials
+
+ struct ucred is the system
+ internal credential structure, and is generally used as the
+ basis for process-driven access control. BSD-derived systems
+ use a "copy-on-write" model for credential data: multiple
+ references may exist for a credential structure, and when a
+ change needs to be made, the structure is duplicated,
+ modified, and then the reference replaced. Due to wide-spread
+ caching of the credential to implement access control on open,
+ this results in substantial memory savings. With a move to
+ fine-grained SMP, this model also saves substantially on
+ locking operations by requiring that modification only occur
+ on an unshared credential, avoiding the need for explicit
+ synchronization when consuming a known-shared
+ credential.
+
+ Credential structures with a single reference are
+ considered mutable; shared credential structures must not be
+ modified or a race condition is risked. A mutex,
+ cr_mtxp, protects the reference
+ count of the struct ucred so as to
+ maintain consistency. Any use of the structure requires a
+ valid reference for the duration of the use, or the structure
+ may be released out from under the illegitimate
+ consumer.
+
+ The struct ucred mutex is a leaf
+ mutex, and for performance reasons, is implemented via a mutex
+ pool.
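+
+ As a hedged sketch of the reference discipline described above,
+ consuming another process's credential might look like the
+ following; taking the p_ucred pointer under the proc lock is an
+ assumption about how that pointer itself is protected.
+
+	struct ucred *cred;
+
+	PROC_LOCK(p);
+	cred = p->p_ucred;
+	crhold(cred);			/* take our own reference */
+	PROC_UNLOCK(p);
+	/* ... read the shared, effectively immutable credential ... */
+	crfree(cred);			/* drop the reference when done */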
+
+
+
+ File Descriptors and File Descriptor Tables
+
+ Details to follow.
+
+
+
+ Jail Structures
+
+ struct prison stores
+ administrative details pertinent to the maintenance of jails
+ created using the &man.jail.2; API. This includes the
+ per-jail hostname, IP address, and related settings. This
+ structure is reference-counted since pointers to instances of
+ the structure are shared by many credential structures. A
+ single mutex, pr_mtx protects read
+ and write access to the reference count and all mutable
+ variables inside the struct jail. Some variables are set only
+ when the jail is created, and a valid reference to the
+ struct prison is sufficient to read
+ these values. The precise locking of each entry is documented
+ via comments in jail.h.
+
+
+
+ MAC Framework
+
+ The TrustedBSD MAC Framework maintains data in a variety
+ of kernel objects, in the form of struct
+ label. In general, labels in kernel objects
+ are protected by the same lock as the remainder of the kernel
+ object. For example, the v_label
+ label in struct vnode is protected
+ by the vnode lock on the vnode.
+
+ In addition to labels maintained in standard kernel objects,
+ the MAC Framework also maintains a list of registered and
+ active policies. The policy list is protected by a global
+ mutex (mac_policy_list_lock) and a busy
+ count (also protected by the mutex). Since many access
+ control checks may occur in parallel, entry to the framework
+ for a read-only access to the policy list requires holding the
+ mutex while incrementing (and later decrementing) the busy
+ count. The mutex need not be held for the duration of the
+ MAC entry operation: some operations, such as label operations
+ on file system objects, are long-lived. To modify the policy
+ list, such as during policy registration and deregistration,
+ the mutex must be held and the reference count must be zero,
+ to prevent modification of the list while it is in use.
+
+ A condition variable,
+ mac_policy_list_not_busy, is available to
+ threads that need to wait for the list to become unbusy, but
+ this condition variable must only be waited on if the caller is
+ holding no other locks, or a lock order violation may be
+ possible. The busy count, in effect, acts as a form of
+ reader/writer lock over access to the framework: the difference
+ is that, unlike with an sxlock, consumers waiting for the list
+ to become unbusy may be starved, rather than permitting lock
+ order problems with regards to the busy count and other locks
+ that may be held on entry to (or inside) the MAC Framework.
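+
+ A sketch of the busy count discipline, with an assumed counter name
+ and without the error handling of the real framework, might be:
+
+	static void
+	mac_policy_list_busy(void)
+	{
+		mtx_lock(&mac_policy_list_lock);
+		mac_policy_list_busy_count++;	/* readers in the framework */
+		mtx_unlock(&mac_policy_list_lock);
+	}
+
+	static void
+	mac_policy_list_unbusy(void)
+	{
+		mtx_lock(&mac_policy_list_lock);
+		if (--mac_policy_list_busy_count == 0)
+			cv_signal(&mac_policy_list_not_busy);
+		mtx_unlock(&mac_policy_list_lock);
+	}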
+
+
+
+ Modules
+
+ For the module subsystem there exists a single lock that is
+ used to protect the shared data. This lock is a shared/exclusive
+ (SX) lock and has a good chance of needing to be acquired (shared
+ or exclusively), therefore there are a few macros that have been
+ added to make access to the lock easier. These macros can be
+ located in sys/module.h and are quite basic
+ in terms of usage. The main structures protected under this lock
+ are the module_t structures (when shared)
+ and the global modulelist_t structure,
+ modules. One should review the related source code in
+ kern/kern_module.c to further understand the
+ locking strategy.
+
+
+
+ Newbus Device Tree
+
+ The newbus system will have one sx lock. Readers will
+ lock it &man.sx.slock.9; and writers will lock it
+ &man.sx.xlock.9;. Internal only functions will not do locking
+ at all. The externally visible ones will lock as needed.
+ Items for which it does not matter whether the race is won or
+ lost will not be locked, since they tend to be read all over
+ the place (e.g., &man.device.get.softc.9;). There will be relatively few
+ changes to the newbus data structures, so a single lock should
+ be sufficient and not impose a performance penalty.
+
+
+
+ Pipes
+
+ ...
+
+
+
+ Processes and Threads
+
+ - process hierarchy
+ - proc locks, references
+ - thread-specific copies of proc entries to freeze during system
+ calls, including td_ucred
+ - inter-process operations
+ - process groups and sessions
+
+
+
+ Scheduler
+
+ Lots of references to sched_lock and notes
+ pointing at specific primitives and related magic elsewhere in the
+ document.
+
+
+
+ Select and Poll
+
+ The select() and poll() functions permit threads to block
+ waiting on events on file descriptors--most frequently, whether
+ or not the file descriptors are readable or writable.
+
+ ...
+
+
+
+ SIGIO
+
+ The SIGIO service permits processes to request the delivery
+ of a SIGIO signal to its process group when the read/write status
+ of specified file descriptors changes. At most one process or
+ process group is permitted to register for SIGIO from any given
+ kernel object, and that process or group is referred to as
+ the owner. Each object supporting SIGIO registration contains a
+ pointer field that is NULL if the object is not registered, or
+ points to a struct sigio describing
+ the registration. This field is protected by a global mutex,
+ sigio_lock. Callers to SIGIO maintenance
+ functions must pass in this field "by reference" so that local
+ register copies of the field are not made when unprotected by
+ the lock.
+
+ One struct sigio is allocated for
+ each registered object associated with any process or process
+ group, and contains back-pointers to the object, owner, signal
+ information, a credential, and the general disposition of the
+ registration. Each process or process group contains a list of
+ registered struct sigio structures,
+ p_sigiolst for processes, and
+ pg_sigiolst for process groups.
+ These lists are protected by the process or process group
+ locks respectively. Most fields in each struct
+ sigio are constant for the duration of the
+ registration, with the exception of the
+ sio_pgsigio field which links the
+ struct sigio into the process or
+ process group list. Developers implementing new kernel
+ objects supporting SIGIO will, in general, want to avoid
+ holding structure locks while invoking SIGIO supporting
+ functions, such as fsetown()
+ or funsetown() to avoid
+ defining a lock order between structure locks and the global
+ SIGIO lock. This is generally possible through use of an
+ elevated reference count on the structure, such as reliance
+ on a file descriptor reference to a pipe during a pipe
+ operation.
+
+
+
+ sysctl
+
+ The sysctl() MIB service is invoked
+ from both within the kernel and from userland applications
+ using a system call. At least two issues are raised in locking:
+ first, the protection of the structures maintaining the
+ namespace, and second, interactions with kernel variables and
+ functions that are accessed by the sysctl interface. Since
+ sysctl permits the direct export (and modification) of
+ kernel statistics and configuration parameters, the sysctl
+ mechanism must become aware of appropriate locking semantics
+ for those variables. Currently, sysctl makes use of a
+ single global sxlock to serialize use
+ of sysctl(); however, it is assumed to operate under Giant
+ and other protections are not provided. The remainder of
+ this section speculates on locking and semantic changes
+ to sysctl.
+
+ - Need to change the order of operations for sysctls that
+ update values from read old, copyin and copyout, write new to
+ copyin, lock, read old and write new, unlock, copyout. Normal
+ sysctls that just copyout the old value and set a new value
+ that they copyin may still be able to follow the old model.
+ However, it may be cleaner to use the second model for all of
+ the sysctl handlers to avoid lock operations.
+
+ - To allow for the common case, a sysctl could embed a
+ pointer to a mutex in the SYSCTL_FOO macros and in the struct.
+ This would work for most sysctls. For values protected by sx
+ locks, spin mutexes, or other locking strategies besides a
+ single sleep mutex, SYSCTL_PROC nodes could be used to get the
+ locking right.
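+
+ For example, a hedged sketch of the SYSCTL_PROC approach, with a
+ hypothetical counter protected by a hypothetical sleep mutex:
+
+	static int
+	sysctl_foo_count(SYSCTL_HANDLER_ARGS)
+	{
+		int error, value;
+
+		mtx_lock(&foo_mtx);
+		value = foo_count;	/* read the old value under the lock */
+		mtx_unlock(&foo_mtx);
+		error = sysctl_handle_int(oidp, &value, 0, req);
+		if (error != 0 || req->newptr == NULL)
+			return (error);
+		mtx_lock(&foo_mtx);
+		foo_count = value;	/* write the new value under the lock */
+		mtx_unlock(&foo_mtx);
+		return (0);
+	}
+
+	SYSCTL_PROC(_kern, OID_AUTO, foo_count, CTLTYPE_INT | CTLFLAG_RW,
+	    0, 0, sysctl_foo_count, "I", "hypothetical example counter");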
+
+
+
+ Taskqueue
+
+ The taskqueue's interface has two basic locks associated
+ with it in order to protect the related shared data. The
+ taskqueue_queues_mutex is meant to serve as a
+ lock to protect the taskqueue_queues TAILQ.
+ The other mutex lock associated with this system is the one in the
+ struct taskqueue data structure. The
+ use of the synchronization primitive here is to protect the
+ integrity of the data in the struct
+ taskqueue. It should be noted that there are no
+ separate macros to assist the user in locking down his/her own work
+ since these locks are most likely not going to be used outside of
+ kern/subr_taskqueue.c.
+
+
+
+
+ Implementation Notes
+
+
+ Details of the Mutex Implementation
+
+ - Should we require mutexes to be owned for mtx_destroy()
+ since we can't safely assert that they are unowned by anyone
+ else otherwise?
+
+
+ Spin Mutexes
+
+ - Use a critical section...
+
+
+
+ Sleep Mutexes
+
+ - Describe the races with contested mutexes
+
+ - Why it's safe to read mtx_lock of a contested mutex
+ when holding sched_lock.
+
+ - Priority propagation
+
+
+
+
+ Witness
+
+ - What does it do
+
+ - How does it work
+
+
+
+
+ Miscellaneous Topics
+
+
+ Interrupt Source and ICU Abstractions
+
+ - struct isrc
+
+ - pic drivers
+
+
+
+ Other Random Questions/Topics
+
+ Should we pass an interlock into
+ sema_wait?
+
+ - Generic turnstiles for sleep mutexes and sx locks.
+
+ - Should we have non-sleepable sx locks?
+
+
+
+
+ Definitions
+
+
+ atomic
+
+ An operation is atomic if all of its effects are visible
+ to other CPUs together when the proper access protocol is
+ followed. In the simplest case, these are the atomic instructions
+ provided directly by machine architectures. At a higher
+ level, if several members of a structure are protected by a
+ lock, then a set of operations are atomic if they are all
+ performed while holding the lock without releasing the lock
+ in between any of the operations.
+
+ operation
+
+
+
+
+ block
+
+ A thread is blocked when it is waiting on a lock,
+ resource, or condition. Unfortunately this term is a bit
+ overloaded as a result.
+
+ sleep
+
+
+
+
+ critical section
+
+ A section of code that is not allowed to be preempted.
+ A critical section is entered and exited using the
+ &man.critical.enter.9; API.
+
+
+
+
+ MD
+
+ Machine dependent.
+
+ MI
+
+
+
+
+ memory operation
+
+ A memory operation reads and/or writes to a memory
+ location.
+
+
+
+
+ MI
+
+ Machine independent.
+
+ MD
+
+
+
+
+ operation
+ memory operation
+
+
+
+ primary interrupt context
+
+ Primary interrupt context refers to the code that runs
+ when an interrupt occurs. This code can either run an
+ interrupt handler directly or schedule an asynchronous
+ interrupt thread to execute the interrupt handlers for a
+ given interrupt source.
+
+
+
+
+ realtime kernel thread
+
+ A high priority kernel thread. Currently, the only
+ realtime priority kernel threads are interrupt threads.
+
+ thread
+
+
+
+
+ sleep
+
+ A thread is asleep when it is blocked on a condition
+ variable or a sleep queue via msleep or
+ tsleep.
+
+ block
+
+
+
+
+ sleepable lock
+
+ A sleepable lock is a lock that can be held by a thread
+ which is asleep. Lockmgr locks and sx locks are currently
+ the only sleepable locks in FreeBSD. Eventually, some sx
+ locks such as the allproc and proctree locks may become
+ non-sleepable locks.
+
+ sleep
+
+
+
+
+ thread
+
+ A kernel thread represented by a struct thread. Threads own
+ locks and hold a single execution context.
+
+
+
+
diff --git a/en_US.ISO8859-1/books/arch-handbook/smp/chapter.sgml b/en_US.ISO8859-1/books/arch-handbook/smp/chapter.sgml
new file mode 100644
index 0000000000..3f6b233f60
--- /dev/null
+++ b/en_US.ISO8859-1/books/arch-handbook/smp/chapter.sgml
@@ -0,0 +1,934 @@
+
+%man;
+
+
+%authors;
+
+
+
+
+]>
+
+
+
+ SMPng Design Document
+
+
+
+ John
+ Baldwin
+
+
+ Robert
+ Watson
+
+
+
+ $FreeBSD$
+
+
+ 2002
+ John Baldwin
+ Robert Watson
+
+
+
+ This document presents the current design and implementation of
+ the SMPng Architecture. First, the basic primitives and tools are
+ introduced. Next, a general architecture for the FreeBSD kernel's
+ synchronization and execution model is laid out. Then, locking
+ strategies for specific subsystems are discussed, documenting the
+ approaches taken to introduce fine-grained synchronization and
+ parallelism for each subsystem. Finally, detailed implementation
+ notes are provided to motivate design choices, and make the reader
+ aware of important implications involving the use of specific
+ primitives.
+
+
+
+
+ Introduction
+
+ This document is a work-in-progress, and will be updated to
+ reflect on-going design and implementation activities associated
+ with the SMPng Project. Many sections currently exist only in
+ outline form, but will be fleshed out as work proceeds. Updates or
+ suggestions regarding the document may be directed to the document
+ editors.
+
+ The goal of SMPng is to allow concurrency in the kernel.
+ The kernel is basically one rather large and complex program. To
+ make the kernel multithreaded we use some of the same tools used
+ to make other programs multithreaded. These include mutexes,
+ reader/writer locks, semaphores, and condition variables. For
+ definitions of many of the terms, please see
+ .
+
+
+
+ Basic Tools and Locking Fundamentals
+
+
+ Atomic Instructions and Memory Barriers
+
+ There are several existing treatments of memory barriers
+ and atomic instructions, so this section will not include a
+ lot of detail. To put it simply, one cannot go around reading
+ variables without a lock if a lock is used to protect writes
+ to that variable. This becomes obvious when you consider that
+ memory barriers simply determine relative order of memory
+ operations; they do not make any guarantee about timing of
+ memory operations. That is, a memory barrier does not force
+ the contents of a CPU's local cache or store buffer to flush.
+ Instead, the memory barrier at lock release simply ensures
+ that all writes to the protected data will be visible to other
+ CPUs or devices if the write to release the lock is visible.
+ The CPU is free to keep that data in its cache or store buffer
+ as long as it wants. However, if another CPU performs an
+ atomic instruction on the same datum, the first CPU must
+ guarantee that the updated value is made visible to the second
+ CPU along with any other operations that memory barriers may
+ require.
+
+ For example, assuming a simple model where data is
+ considered visible when it is in main memory (or a global
+ cache), when an atomic instruction is triggered on one CPU,
+ other CPUs' store buffers and caches must flush any writes to
+ that same cache line along with any pending operations behind
+ a memory barrier.
+
+ This requires one to take special care when using an item
+ protected by atomic instructions. For example, in the sleep
+ mutex implementation, we have to use an
+ atomic_cmpset rather than an
+ atomic_set to turn on the
+ MTX_CONTESTED bit. The reason is that we
+ read the value of mtx_lock into a
+ variable and then make a decision based on that read.
+ However, the value we read may be stale, or it may change
+ while we are making our decision. Thus, when the
+ atomic_set is executed, it may end up
+ setting the bit on a value other than the one we made the
+ decision on. Thus, we have to use an
+ atomic_cmpset to set the value only if
+ the value we made the decision on is up-to-date and
+ valid.
+
+ Finally, atomic instructions only allow one item to be
+ updated or read. If one needs to atomically update several
+ items, then a lock must be used instead. For example, if two
+ counters must be read and have values that are consistent
+ relative to each other, then those counters must be protected
+ by a lock rather than by separate atomic instructions.
+
+
+
+ Read Locks versus Write Locks
+
+ Read locks do not need to be as strong as write locks.
+ Both types of locks need to ensure that the data they are
+ accessing is not stale. However, only write access requires
+ exclusive access. Multiple threads can safely read a value.
+ Using different types of locks for reads and writes can be
+ implemented in a number of ways.
+
+ First, sx locks can be used in this manner by using an
+ exclusive lock when writing and a shared lock when reading.
+ This method is quite straightforward.
+
+ A second method is a bit more obscure. You can protect a
+ datum with multiple locks. Then for reading that data you
+ simply need to have a read lock of one of the locks. However,
+ to write to the data, you need to have a write lock of all of
+ the locks. This can make writing rather expensive but can be
+ useful when data is accessed in various ways. For example,
+ the parent process pointer is protected by both the
+ proctree_lock sx lock and the per-process mutex. Sometimes
+ the proc lock is easier to use since we are just checking the
+ parent of a process that we already have locked. However,
+ other places such as inferior need to
+ walk the tree of processes via parent pointers, and locking
+ each process along the way would be prohibitive; it would also
+ be painful to guarantee that the condition being checked
+ remains valid for both the check and the actions taken as a
+ result of the check.
+
+
+
+ Locking Conditions and Results
+
+ If you need a lock to check the state of a variable so
+ that you can take an action based on the state you read, you
+ can't just hold the lock while reading the variable and then
+ drop the lock before you act on the value you read. Once you
+ drop the lock, the variable can change, rendering your decision
+ invalid. Thus, you must hold the lock both while reading the
+ variable and while performing the action as a result of the
+ test.
+
+
+
+
+ General Architecture and Design
+
+
+ Interrupt Handling
+
+ Following the pattern of several other multithreaded Unix
+ kernels, FreeBSD deals with interrupt handlers by giving them
+ their own thread context. Providing a context for interrupt
+ handlers allows them to block on locks. To help avoid
+ latency, however, interrupt threads run at real-time kernel
+ priority. Thus, interrupt handlers should not execute for very
+ long to avoid starving other kernel threads. In addition,
+ since multiple handlers may share an interrupt thread,
+ interrupt handlers should not sleep or use a sleepable lock to
+ avoid starving another interrupt handler.
+
+ The interrupt threads currently in FreeBSD are referred to
+ as heavyweight interrupt threads. They are called this
+ because switching to an interrupt thread involves a full
+ context switch. In the initial implementation, the kernel was
+ not preemptive and thus interrupts that interrupted a kernel
+ thread would have to wait until the kernel thread blocked or
+ returned to userland before they would have an opportunity to
+ run.
+
+ To deal with the latency problems, the kernel in FreeBSD
+ has been made preemptive. Currently, we only preempt a kernel
+ thread when we release a sleep mutex or when an interrupt
+ comes in. However, the plan is to make the FreeBSD kernel
+ fully preemptive as described below.
+
+ Not all interrupt handlers execute in a thread context.
+ Instead, some handlers execute directly in primary interrupt
+ context. These interrupt handlers are currently misnamed
+ fast interrupt handlers since the
+ INTR_FAST flag used in earlier versions
+ of the kernel is used to mark these handlers. The only
+ interrupts which currently use these types of interrupt
+ handlers are clock interrupts and serial I/O device
+ interrupts. Since these handlers do not have their own
+ context, they may not acquire blocking locks and thus may only
+ use spin mutexes.
+
+ Finally, there is one optional optimization that can be
+ added in MD code called lightweight context switches. Since
+ an interrupt thread executes in a kernel context, it can
+ borrow the vmspace of any process. Thus, in a lightweight
+ context switch, the switch to the interrupt thread does not
+ switch vmspaces but borrows the vmspace of the interrupted
+ thread. In order to ensure that the vmspace of the
+ interrupted thread doesn't disappear out from under us, the
+ interrupted thread is not allowed to execute until the
+ interrupt thread is no longer borrowing its vmspace. This can
+ happen when the interrupt thread either blocks or finishes.
+ If an interrupt thread blocks, then it will use its own
+ context when it is made runnable again. Thus, it can release
+ the interrupted thread.
+
+ The cons of this optimization are that it is very
+ machine specific and complex and thus only worth the effort if
+ there is a large performance improvement. At this point it is
+ probably too early to tell, and in fact it will probably hurt
+ performance as almost all interrupt handlers will immediately
+ block on Giant and require a thread fixup when they block.
+ Also, an alternative method of interrupt handling has been
+ proposed by Mike Smith that works like so:
+
+
+
+ Each interrupt handler has two parts: a predicate
+ which runs in primary interrupt context and a handler
+ which runs in its own thread context.
+
+
+
+ If an interrupt handler has a predicate, then when an
+ interrupt is triggered, the predicate is run. If the
+ predicate returns true then the interrupt is assumed to be
+ fully handled and the kernel returns from the interrupt.
+ If the predicate returns false or there is no predicate,
+ then the threaded handler is scheduled to run.
+
+
+
+ Fitting lightweight context switches into this scheme
+ might prove rather complicated. Since we may want to change
+ to this scheme at some point in the future, it is probably
+ best to defer work on lightweight context switches until we
+ have settled on the final interrupt handling architecture and
+ determined how lightweight context switches might or might
+ not fit into it.
+
+
+
+ Kernel Preemption and Critical Sections
+
+
+ Kernel Preemption in a Nutshell
+
+ Kernel preemption is fairly simple. The basic idea is
+ that a CPU should always be doing the highest priority work
+ available. Well, that is the ideal at least. There are a
+ couple of cases where the expense of achieving the ideal is
+ not worth being perfect.
+
+ Implementing full kernel preemption is very
+ straightforward: when you schedule a thread to be executed
+ by putting it on a runqueue, you check to see if its
+ priority is higher than that of the currently executing thread. If
+ so, you initiate a context switch to that thread.
+
+ While locks can protect most data in the case of a
+ preemption, not all of the kernel is preemption safe. For
+ example, if a thread holding a spin mutex is preempted and the
+ new thread attempts to grab the same spin mutex, the new
+ thread may spin forever as the interrupted thread may never
+ get a chance to execute. Also, some code such as the code
+ to assign an address space number for a process during
+ exec() on the Alpha needs to not be preempted as it supports
+ the actual context switch code. Preemption is disabled for
+ these code sections by using a critical section.
+
+
+
+ Critical Sections
+
+ The responsibility of the critical section API is to
+ prevent context switches inside of a critical section. With
+ a fully preemptive kernel, every
+ setrunqueue of a thread other than the
+ current thread is a preemption point. One implementation is
+ for critical_enter to set a per-thread
+ flag that is cleared by its counterpart. If
+ setrunqueue is called with this flag
+ set, it doesn't preempt regardless of the priority of the new
+ thread relative to the current thread. However, since
+ critical sections are used in spin mutexes to prevent
+ context switches and multiple spin mutexes can be acquired,
+ the critical section API must support nesting. For this
+ reason the current implementation uses a nesting count
+ instead of a single per-thread flag.
+
+ In order to minimize latency, preemptions inside of a
+ critical section are deferred rather than dropped. If a
+ thread that would normally be switched to is made runnable
+ while the current thread is inside of a critical section, then
+ a per-thread flag is set
+ to indicate that there is a pending preemption. When the
+ outermost critical section is exited, the flag is checked.
+ If the flag is set, then the current thread is preempted to
+ allow the higher priority thread to run.
+
+ Interrupts pose a problem with regards to spin mutexes.
+ If a low-level interrupt handler needs a lock, it needs to
+ not interrupt any code needing that lock to avoid possible
+ data structure corruption. Currently, providing this
+ mechanism is piggybacked onto the critical section API by means
+ of the cpu_critical_enter and
+ cpu_critical_exit functions. Currently
+ this API disables and reenables interrupts on all of
+ FreeBSD's current platforms. This approach may not be
+ purely optimal, but it is simple to understand and simple to
+ get right. Theoretically, this second API need only be used
+ for spin mutexes that are used in primary interrupt context.
+ However, to make the code simpler, it is used for all spin
+ mutexes and even all critical sections. It may be desirable
+ to split out the MD API from the MI API and only use it in
+ conjunction with the MI API in the spin mutex
+ implementation. If this approach is taken, then the MD API
+ likely would need a rename to show that it is a separate API
+ now.
+
+
+
+ Design Tradeoffs
+
+ As mentioned earlier, a couple of tradeoffs have been
+ made that sacrifice perfect preemption in cases where it may
+ not always provide the best performance.
+
+ The first tradeoff is that the preemption code does not
+ take other CPUs into account. Suppose we have two CPUs, A
+ and B, with the priority of A's thread as 4 and the priority
+ of B's thread as 2. If CPU B makes a thread with priority 1
+ runnable, then in theory, we want CPU A to switch to the new
+ thread so that we will be running the two highest priority
+ runnable threads. However, the cost of determining which
+ CPU to enforce a preemption on as well as actually signaling
+ that CPU via an IPI along with the synchronization that
+ would be required would be enormous. Thus, the current code
+ would instead force CPU B to switch to the higher priority
+ thread. Note that this still puts the system in a better
+ position as CPU B is executing a thread of priority 1 rather
+ than a thread of priority 2.
+
+ The second tradeoff limits immediate kernel preemption
+ to real-time priority kernel threads. In the simple case of
+ preemption defined above, a thread is always preempted
+ immediately (or as soon as a critical section is exited) if
+ a higher priority thread is made runnable. However, many
+ threads executing in the kernel only execute in a kernel
+ context for a short time before either blocking or returning
+ to userland. Thus, if the kernel preempts these threads to
+ run another non-realtime kernel thread, the kernel may
+ switch out the executing thread just before it is about to
+ sleep or execute. The cache on the CPU must then adjust to
+ the new thread. When the kernel returns to the interrupted
+ CPU, it must refill all the cache information that was lost.
+ In addition, two extra context switches are performed that
+ could be avoided if the kernel deferred the preemption until
+ the first thread blocked or returned to userland. Thus, by
+ default, the preemption code will only preempt immediately
+ if the higher priority thread is a real-time priority
+ thread.
+
+ Turning on full kernel preemption for all kernel threads
+ has value as a debugging aid since it exposes more race
+ conditions. It is especially useful on UP systems where many
+ races are hard to simulate otherwise. Thus, there will be a
+ kernel option to enable preemption for all kernel threads
+ that can be used for debugging purposes.
+
+
+
+
+ Thread Migration
+
+ Simply put, a thread migrates when it moves from one CPU
+ to another. In a non-preemptive kernel this can only happen
+ at well-defined points such as when calling
+ tsleep or returning to userland.
+ However, in the preemptive kernel, an interrupt can force a
+ preemption and possible migration at any time. This can have
+ negative effects on per-CPU data since with the exception of
+ curthread and curpcb the
+ data can change whenever you migrate. Since you can
+ potentially migrate at any time this renders per-CPU data
+ rather useless. Thus it is desirable to be able to disable
+ migration for sections of code that need per-CPU data to be
+ stable.
+
+ Critical sections currently prevent migration since they
+ don't allow context switches. However, this may be too strong
+ of a requirement to enforce in some cases since a critical
+ section also effectively blocks interrupt threads on the
+ current processor. As a result, it may be desirable to
+ provide an API whereby code may indicate that if the current
+ thread is preempted it should not migrate to another
+ CPU.
+
+ One possible implementation is to use a per-thread nesting
+ count td_pinnest along with a
+ td_pincpu which is updated to the current
+ CPU on each context switch. Each CPU has its own run queue
+ that holds threads pinned to that CPU. A thread is pinned
+ when its nesting count is greater than zero and a thread
+ starts off unpinned with a nesting count of zero. When a
+ thread is put on a runqueue, we check to see if it is pinned.
+ If so, we put it on the per-CPU runqueue, otherwise we put it
+ on the global runqueue. When
+ choosethread is called to retrieve the
+ next thread, it could either always prefer bound threads to
+ unbound threads or use some sort of bias when comparing
+ priorities. If the nesting count is only ever written to by
+ the thread itself and is only read by other threads when the
+ owning thread is not executing but while holding the
+ sched_lock, then
+ td_pinnest will not need any other locks.
+ The migrate_disable function would
+ increment the nesting count and
+ migrate_enable would decrement the
+ nesting count. Due to the locking requirements specified
+ above, they will only operate on the current thread and thus
+ would not need to handle the case of making a thread
+ migratable that currently resides on a per-CPU run
+ queue.
+
+ It is still debatable if this API is needed or if the
+ critical section API is sufficient by itself. Many of the
+ places that need to prevent migration also need to prevent
+ preemption as well, and in those places a critical section
+ must be used regardless.
+
+
+
+ Callouts
+
+ The timeout() kernel facility permits
+ kernel services to register functions for execution as part
+ of the softclock() software interrupt.
+ Events are scheduled based on a desired number of clock
+ ticks, and callbacks to the consumer-provided function
+ will occur at approximately the right time.
+
+ The global list of pending timeout events is protected
+ by a global spin mutex, callout_lock;
+ all access to the timeout list must be performed with this
+ mutex held. When softclock() is
+ woken up, it scans the list of pending timeouts for those
+ that should fire. In order to avoid lock order reversal,
+ the softclock thread will release the
+ callout_lock mutex when invoking the
+ provided timeout() callback function.
+ If the CALLOUT_MPSAFE flag was not set
+ during registration, then Giant will be grabbed before
+ invoking the callout, and then released afterwards. The
+ callout_lock mutex will be re-grabbed
+ before proceeding. The softclock()
+ code is careful to leave the list in a consistent state
+ while releasing the mutex. If DIAGNOSTIC
+ is enabled, then the time taken to execute each function is
+ measured, and a warning generated if it exceeds a
+ threshold.
+
+
+
+
+ Specific Locking Strategies
+
+
+ Credentials
+
+ struct ucred is the system
+ internal credential structure, and is generally used as the
+ basis for process-driven access control. BSD-derived systems
+ use a "copy-on-write" model for credential data: multiple
+ references may exist for a credential structure, and when a
+ change needs to be made, the structure is duplicated,
+ modified, and then the reference replaced. Due to wide-spread
+ caching of the credential to implement access control on open,
+ this results in substantial memory savings. With a move to
+ fine-grained SMP, this model also saves substantially on
+ locking operations by requiring that modification only occur
+ on an unshared credential, avoiding the need for explicit
+ synchronization when consuming a known-shared
+ credential.
+
+ Credential structures with a single reference are
+ considered mutable; shared credential structures must not be
+ modified or a race condition is risked. A mutex,
+ cr_mtxp, protects the reference
+ count of the struct ucred so as to
+ maintain consistency. Any use of the structure requires a
+ valid reference for the duration of the use, or the structure
+ may be released out from under the illegitimate
+ consumer.
+
+ The struct ucred mutex is a leaf
+ mutex, and for performance reasons, is implemented via a mutex
+ pool.
+
+
+
+ File Descriptors and File Descriptor Tables
+
+ Details to follow.
+
+
+
+ Jail Structures
+
+ struct prison stores
+ administrative details pertinent to the maintenance of jails
+ created using the &man.jail.2; API. This includes the
+ per-jail hostname, IP address, and related settings. This
+ structure is reference-counted since pointers to instances of
+ the structure are shared by many credential structures. A
+ single mutex, pr_mtx protects read
+ and write access to the reference count and all mutable
+ variables inside the struct jail. Some variables are set only
+ when the jail is created, and a valid reference to the
+ struct prison is sufficient to read
+ these values. The precise locking of each entry is documented
+ via comments in jail.h.
+
+
+
+ MAC Framework
+
+ The TrustedBSD MAC Framework maintains data in a variety
+ of kernel objects, in the form of struct
+ label. In general, labels in kernel objects
+ are protected by the same lock as the remainder of the kernel
+ object. For example, the v_label
+ label in struct vnode is protected
+ by the vnode lock on the vnode.
+
+ In addition to labels maintained in standard kernel objects,
+ the MAC Framework also maintains a list of registered and
+ active policies. The policy list is protected by a global
+ mutex (mac_policy_list_lock) and a busy
+ count (also protected by the mutex). Since many access
+ control checks may occur in parallel, entry to the framework
+ for a read-only access to the policy list requires holding the
+ mutex while incrementing (and later decrementing) the busy
+ count. The mutex need not be held for the duration of the
+ MAC entry operation: some operations, such as label operations
+ on file system objects, are long-lived. To modify the policy
+ list, such as during policy registration and deregistration,
+ the mutex must be held and the busy count must be zero,
+ to prevent modification of the list while it is in use.
+
+ A condition variable,
+ mac_policy_list_not_busy, is available to
+ threads that need to wait for the list to become unbusy, but
+ this condition variable must only be waited on if the caller is
+ holding no other locks, or a lock order violation may be
+ possible. The busy count, in effect, acts as a form of
+ reader/writer lock over access to the framework: the difference
+ is that, unlike with an sxlock, consumers waiting for the list
+ to become unbusy may be starved, rather than permitting lock
+ order problems with regard to the busy count and other locks
+ that may be held on entry to (or inside) the MAC Framework.
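+
+ A sketch of the read-only entry and exit protocol follows. The
+ mutex and condition variable names are those given above; the
+ name of the busy counter is illustrative, and the writer path
+ used by policy registration and deregistration is omitted.
+
+    #include <sys/param.h>
+    #include <sys/condvar.h>
+    #include <sys/lock.h>
+    #include <sys/mutex.h>
+
+    extern struct mtx mac_policy_list_lock;
+    extern struct cv mac_policy_list_not_busy;
+    extern int mac_policy_list_busy;        /* illustrative name */
+
+    static void
+    example_mac_check(void)
+    {
+            mtx_lock(&mac_policy_list_lock);
+            mac_policy_list_busy++;
+            mtx_unlock(&mac_policy_list_lock);
+
+            /* ... consult each registered policy for the access check ... */
+
+            mtx_lock(&mac_policy_list_lock);
+            if (--mac_policy_list_busy == 0)
+                    cv_signal(&mac_policy_list_not_busy);
+            mtx_unlock(&mac_policy_list_lock);
+    }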
+
+
+
+ Modules
+
+ For the module subsystem there exists a single lock that is
+ used to protect the shared data. This lock is a shared/exclusive
+ (sx) lock and is likely to be acquired frequently, either shared
+ or exclusively, so a few macros have been added to make access
+ to the lock easier. These macros are located in
+ sys/module.h and are quite basic
+ in terms of usage. The main structures protected by this lock
+ are the module_t structures (when shared)
+ and the global modulelist_t structure,
+ modules. One should review the related source code in
+ kern/kern_module.c to further understand the
+ locking strategy.
+
+
+
+ Newbus Device Tree
+
+ The newbus system will have one sx lock. Readers will
+ lock it with &man.sx.slock.9; and writers will lock it with
+ &man.sx.xlock.9;. Internal-only functions will not do locking
+ at all. The externally visible ones will lock as needed.
+ Items for which it does not matter whether the race is won or
+ lost will not be locked, since they tend to be read all over
+ the place (e.g., &man.device.get.softc.9;). There will be
+ relatively few changes to the newbus data structures, so a
+ single lock should be sufficient and not impose a performance
+ penalty.
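+
+ A minimal sketch of the intended usage follows; the lock name is
+ illustrative, since the text does not name the lock.
+
+    #include <sys/param.h>
+    #include <sys/lock.h>
+    #include <sys/sx.h>
+
+    static struct sx newbus_devtree_lock;
+
+    static void
+    example_devtree_init(void)
+    {
+            sx_init(&newbus_devtree_lock, "newbus device tree");
+    }
+
+    static void
+    example_devtree_read(void)
+    {
+            sx_slock(&newbus_devtree_lock);         /* readers: shared */
+            /* ... walk the device tree ... */
+            sx_sunlock(&newbus_devtree_lock);
+    }
+
+    static void
+    example_devtree_modify(void)
+    {
+            sx_xlock(&newbus_devtree_lock);         /* writers: exclusive */
+            /* ... attach or detach devices ... */
+            sx_xunlock(&newbus_devtree_lock);
+    }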
+
+
+
+ Pipes
+
+ ...
+
+
+
+ Processes and Threads
+
+ - process hierarchy
+ - proc locks, references
+ - thread-specific copies of proc entries to freeze during system
+ calls, including td_ucred
+ - inter-process operations
+ - process groups and sessions
+
+
+
+ Scheduler
+
+ Lots of references to sched_lock and notes
+ pointing at specific primitives and related magic elsewhere in the
+ document.
+
+
+
+ Select and Poll
+
+ The select() and poll() functions permit threads to block
+ waiting on events on file descriptors--most frequently, whether
+ or not the file descriptors are readable or writable.
+
+ ...
+
+
+
+ SIGIO
+
+ The SIGIO service permits a process to request the delivery
+ of a SIGIO signal to its process group when the read/write status
+ of specified file descriptors changes. At most one process or
+ process group is permitted to register for SIGIO from any given
+ kernel object, and that process or group is referred to as
+ the owner. Each object supporting SIGIO registration contains a
+ pointer field that is NULL if the object is not registered, or
+ points to a struct sigio describing
+ the registration. This field is protected by a global mutex,
+ sigio_lock. Callers to SIGIO maintenance
+ functions must pass in this field "by reference" so that local
+ register copies of the field are not made when unprotected by
+ the lock.
+
+ One struct sigio is allocated for
+ each registered object associated with any process or process
+ group, and contains back-pointers to the object, owner, signal
+ information, a credential, and the general disposition of the
+ registration. Each process or process group contains a list of
+ registered struct sigio structures,
+ p_sigiolst for processes, and
+ pg_sigiolst for process groups.
+ These lists are protected by the process or process group
+ locks respectively. Most fields in each struct
+ sigio are constant for the duration of the
+ registration, with the exception of the
+ sio_pgsigio field, which links the
+ struct sigio into the process or
+ process group list. Developers implementing new kernel
+ objects supporting SIGIO will, in general, want to avoid
+ holding structure locks while invoking SIGIO supporting
+ functions, such as fsetown()
+ or funsetown() to avoid
+ defining a lock order between structure locks and the global
+ SIGIO lock. This is generally possible through use of an
+ elevated reference count on the structure, such as reliance
+ on a file descriptor reference to a pipe during a pipe
+ operation.
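+
+ A sketch of a kernel object supporting SIGIO follows. The
+ structure, field, and function names are hypothetical, and the
+ include list is approximate; the important points are that the
+ sigio pointer is handed to fsetown() and
+ funsetown() by reference, and that no object
+ lock is held across those calls.
+
+    #include <sys/param.h>
+    #include <sys/lock.h>
+    #include <sys/mutex.h>
+    #include <sys/sigio.h>
+    #include <sys/filedesc.h>
+
+    struct example_obj {
+            struct mtx       eo_mtx;        /* protects the other fields */
+            struct sigio    *eo_sigio;      /* protected by sigio_lock */
+    };
+
+    static int
+    example_setown(struct example_obj *eo, pid_t pgid)
+    {
+            /* Called without eo_mtx held; fsetown() uses sigio_lock. */
+            return (fsetown(pgid, &eo->eo_sigio));
+    }
+
+    static void
+    example_close(struct example_obj *eo)
+    {
+            funsetown(&eo->eo_sigio);       /* also without eo_mtx held */
+    }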
+
+
+
+ sysctl
+
+ The sysctl() MIB service is invoked
+ from both within the kernel and from userland applications
+ using a system call. At least two issues are raised in locking:
+ first, the protection of the structures maintaining the
+ namespace, and second, interactions with kernel variables and
+ functions that are accessed by the sysctl interface. Since
+ sysctl permits the direct export (and modification) of
+ kernel statistics and configuration parameters, the sysctl
+ mechanism must become aware of appropriate locking semantics
+ for those variables. Currently, sysctl makes use of a
+ single global sxlock to serialize use
+ of sysctl(); however, it is assumed to operate under Giant
+ and other protections are not provided. The remainder of
+ this section speculates on locking and semantic changes
+ to sysctl.
+
+ - Need to change the order of operations for sysctls that
+ update values from "read old, copyin and copyout, write new" to
+ "copyin, lock, read old and write new, unlock, copyout". Normal
+ sysctls that just copyout the old value and set a new value
+ that they copyin may still be able to follow the old model.
+ However, it may be cleaner to use the second model for all of
+ the sysctl handlers to avoid lock operations.
+
+ - To allow for the common case, a sysctl could embed a
+ pointer to a mutex in the SYSCTL_FOO macros and in the struct.
+ This would work for most sysctls. For values protected by sx
+ locks, spin mutexes, or other locking strategies besides a
+ single sleep mutex, SYSCTL_PROC nodes could be used to get the
+ locking right; a sketch of such a handler follows this list.
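+
+ The following sketch approximates the second model using existing
+ primitives: the old value is read under a lock, copied out (and a
+ new value copied in) by sysctl_handle_int(), and
+ the new value is then written back under the lock. The variable,
+ mutex, and OID names are hypothetical, and the mutex is assumed
+ to be initialized elsewhere.
+
+    #include <sys/param.h>
+    #include <sys/kernel.h>
+    #include <sys/lock.h>
+    #include <sys/mutex.h>
+    #include <sys/sysctl.h>
+
+    static int example_value;
+    static struct mtx example_mtx;          /* initialized elsewhere */
+
+    static int
+    sysctl_example(SYSCTL_HANDLER_ARGS)
+    {
+            int error, tmp;
+
+            mtx_lock(&example_mtx);
+            tmp = example_value;            /* read old value under the lock */
+            mtx_unlock(&example_mtx);
+
+            /* Copy out the old value and, if present, copy in a new one. */
+            error = sysctl_handle_int(oidp, &tmp, 0, req);
+            if (error != 0 || req->newptr == NULL)
+                    return (error);
+
+            mtx_lock(&example_mtx);
+            example_value = tmp;            /* write new value under the lock */
+            mtx_unlock(&example_mtx);
+            return (0);
+    }
+
+    SYSCTL_PROC(_kern, OID_AUTO, example, CTLTYPE_INT | CTLFLAG_RW,
+        0, 0, sysctl_example, "I", "Example of a locked sysctl");
+
+ Note that the mutex is deliberately dropped around the copyin and
+ copyout steps so that it is not held while the kernel might fault
+ on user memory.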
+
+
+
+ Taskqueue
+
+ The taskqueue interface has two basic locks associated
+ with it in order to protect the related shared data. The
+ taskqueue_queues_mutex serves as a
+ lock to protect the taskqueue_queues TAILQ.
+ The other mutex lock associated with this system is the one in the
+ struct taskqueue data structure; this
+ synchronization primitive protects the
+ integrity of the data in the struct
+ taskqueue. Note that there are no
+ separate macros to assist the user in locking down his or her own
+ work, since these locks are most likely not going to be used
+ outside of kern/subr_taskqueue.c.
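+
+ For illustration, a minimal sketch of queuing work through the
+ pre-existing taskqueue_swi queue follows; the
+ task function and names are hypothetical. The per-queue mutex
+ described above is taken internally by
+ taskqueue_enqueue() while it manipulates the
+ task list.
+
+    #include <sys/param.h>
+    #include <sys/kernel.h>
+    #include <sys/taskqueue.h>
+
+    static void
+    example_task_fn(void *context, int pending)
+    {
+            /* pending counts enqueues that occurred since the last run */
+    }
+
+    static struct task example_task;
+
+    static void
+    example_submit(void)
+    {
+            TASK_INIT(&example_task, 0, example_task_fn, NULL);
+            taskqueue_enqueue(taskqueue_swi, &example_task);
+    }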
+
+
+
+
+ Implementation Notes
+
+
+ Details of the Mutex Implementation
+
+ - Should we require mutexes to be owned for mtx_destroy()
+ since we can't safely assert that they are unowned by anyone
+ else otherwise?
+
+
+ Spin Mutexes
+
+ - Use a critical section...
+
+
+
+ Sleep Mutexes
+
+ - Describe the races with contested mutexes
+
+ - Why it's safe to read mtx_lock of a contested mutex
+ when holding sched_lock.
+
+ - Priority propagation
+
+
+
+
+ Witness
+
+ - What does it do
+
+ - How does it work
+
+
+
+
+ Miscellaneous Topics
+
+
+ Interrupt Source and ICU Abstractions
+
+ - struct isrc
+
+ - pic drivers
+
+
+
+ Other Random Questions/Topics
+
+ - Should we pass an interlock into
+ sema_wait?
+
+ - Generic turnstiles for sleep mutexes and sx locks.
+
+ - Should we have non-sleepable sx locks?
+
+
+
+
+ Definitions
+
+
+ atomic
+
+ An operation is atomic if all of its effects are visible
+ to other CPUs together when the proper access protocol is
+ followed. In the degenerate case, these are the atomic
+ instructions provided directly by machine architectures. At a
+ higher level, if several members of a structure are protected by
+ a lock, then a set of operations are atomic if they are all
+ performed while holding the lock without releasing the lock
+ in between any of the operations.
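+
+ As a brief illustration (with hypothetical names), the first
+ function below is the degenerate case, a single atomic machine
+ instruction; in the second, both fields are protected by
+ e_mtx, so the paired update is atomic with
+ respect to any consumer that follows the same locking protocol.
+
+    #include <sys/param.h>
+    #include <sys/lock.h>
+    #include <sys/mutex.h>
+    #include <machine/atomic.h>
+
+    static volatile u_int example_counter;
+
+    static void
+    example_bump(void)
+    {
+            atomic_add_int(&example_counter, 1);
+    }
+
+    struct example {
+            struct mtx      e_mtx;
+            int             e_count;
+            int             e_total;
+    };
+
+    static void
+    example_update(struct example *e, int n)
+    {
+            mtx_lock(&e->e_mtx);
+            e->e_count++;
+            e->e_total += n;
+            mtx_unlock(&e->e_mtx);
+    }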
+
+ operation
+
+
+
+
+ block
+
+ A thread is blocked when it is waiting on a lock,
+ resource, or condition. Unfortunately this term is a bit
+ overloaded as a result.
+
+ sleep
+
+
+
+
+ critical section
+
+ A section of code that is not allowed to be preempted.
+ A critical section is entered and exited using the
+ &man.critical.enter.9; API.
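+
+ A minimal sketch (with a hypothetical function name):
+
+    #include <sys/param.h>
+    #include <sys/systm.h>
+
+    static void
+    example_no_preempt(void)
+    {
+            critical_enter();
+            /* ... touch state that must not be preempted ... */
+            critical_exit();
+    }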
+
+
+
+
+ MD
+
+ Machine dependent.
+
+ MI
+
+
+
+
+ memory operation
+
+ A memory operation reads and/or writes to a memory
+ location.
+
+
+
+
+ MI
+
+ Machine independent.
+
+ MD
+
+
+
+
+ operation
+ memory operation
+
+
+
+ primary interrupt context
+
+ Primary interrupt context refers to the code that runs
+ when an interrupt occurs. This code can either run an
+ interrupt handler directly or schedule an asynchronous
+ interrupt thread to execute the interrupt handlers for a
+ given interrupt source.
+
+
+
+
+ realtime kernel thread
+
+ A high priority kernel thread. Currently, the only
+ realtime priority kernel threads are interrupt threads.
+
+ thread
+
+
+
+
+ sleep
+
+ A thread is asleep when it is blocked on a condition
+ variable or a sleep queue via msleep or
+ tsleep.
+
+ block
+
+
+
+
+ sleepable lock
+
+ A sleepable lock is a lock that can be held by a thread
+ which is asleep. Lockmgr locks and sx locks are currently
+ the only sleepable locks in FreeBSD. Eventually, some sx
+ locks such as the allproc and proctree locks may become
+ non-sleepable locks.
+
+ sleep
+
+
+
+
+ thread
+
+ A kernel thread represented by a struct thread. Threads own
+ locks and hold a single execution context.
+
+
+
+