o Split the FreeBSD Developers Handbook into two books:

  - FreeBSD Architecture Handbook, which is a book about the FreeBSD
    architecture.  The SMP article has also been moved into the new
    arch handbook as a separate chapter.

  - FreeBSD Developers Handbook, which is a book about developing on
    FreeBSD; basically what was left when the architecture parts were
    moved away.

o Hook up the new FreeBSD Architecture Handbook to the build.

o Remove the SMP article since it is now part of the FreeBSD
  Architecture Handbook.

The relevant files from the FreeBSD Developers Handbook have been
repository copied to the new FreeBSD Architecture Handbook.

This is just step one in the split; both books need some work to be
real separate books.  E.g. the FreeBSD Architecture Handbook still
needs an introduction.

Repository copy by:	joe
Requested by:		rwatson
Approved by:		murray, ceri (mentor)
Simon L. B. Nielsen 2003-08-06 23:48:04 +00:00
parent 5421dc303f
commit 541f5ec33d
Notes: svn2git 2020-12-08 03:00:23 +00:00
svn path=/head/; revision=17783
26 changed files with 55 additions and 19061 deletions


@@ -1,18 +0,0 @@
# $FreeBSD$
MAINTAINER=jhb@FreeBSD.org
DOC?= article
FORMATS?= html
INSTALL_COMPRESSED?=gz
INSTALL_ONLY_COMPRESSED?=
WITH_ARTICLE_TOC?=YES
SRCS= article.sgml
DOC_PREFIX?= ${.CURDIR}/../../..
.include "${DOC_PREFIX}/share/mk/doc.project.mk"


@@ -1,959 +0,0 @@
<!DOCTYPE article PUBLIC "-//FreeBSD//DTD DocBook V4.1-Based Extension//EN" [
<!ENTITY % man PUBLIC "-//FreeBSD//ENTITIES DocBook Manual Page Entities//EN">
%man;
<!ENTITY % authors PUBLIC "-//FreeBSD//ENTITIES DocBook Author Entities//EN">
%authors;
<!ENTITY % misc PUBLIC "-//FreeBSD//ENTITIES DocBook Miscellaneous FreeBSD Entities//EN">
%misc;
<!ENTITY % freebsd PUBLIC "-//FreeBSD//ENTITIES DocBook Miscellaneous FreeBSD Entities//EN">
%freebsd;
<!--ENTITY % mailing-lists PUBLIC "-//FreeBSD//ENTITIES DocBook Mailing List Entities//EN"-->
<!--
%mailing-lists;
-->
]>
<article>
<articleinfo>
<title>SMPng Design Document</title>
<authorgroup>
<author>
<firstname>John</firstname>
<surname>Baldwin</surname>
</author>
<author>
<firstname>Robert</firstname>
<surname>Watson</surname>
</author>
</authorgroup>
<pubdate>$FreeBSD$</pubdate>
<copyright>
<year>2002</year>
<year>2003</year>
<holder>John Baldwin</holder>
<holder>Robert Watson</holder>
</copyright>
<abstract>
<para>This document presents the current design and implementation of
the SMPng Architecture. First, the basic primitives and tools are
introduced. Next, a general architecture for the FreeBSD kernel's
synchronization and execution model is laid out. Then, locking
strategies for specific subsystems are discussed, documenting the
approaches taken to introduce fine-grained synchronization and
parallelism for each subsystem. Finally, detailed implementation
notes are provided to motivate design choices, and make the reader
aware of important implications involving the use of specific
primitives. </para>
</abstract>
</articleinfo>
<sect1>
<title>Introduction</title>
<para>This document is a work-in-progress, and will be updated to
reflect on-going design and implementation activities associated
with the SMPng Project. Many sections currently exist only in
outline form, but will be fleshed out as work proceeds. Updates or
suggestions regarding the document may be directed to the document
editors.</para>
<para>The goal of SMPng is to allow concurrency in the kernel.
The kernel is basically one rather large and complex program. To
make the kernel multi-threaded we use some of the same tools used
to make other programs multi-threaded. These include mutexes,
shared/exclusive locks, semaphores, and condition variables. For
the definitions of these and other SMP-related terms, please see
the <xref linkend="glossary"> section of this article.</para>
</sect1>
<sect1>
<title>Basic Tools and Locking Fundamentals</title>
<sect2>
<title>Atomic Instructions and Memory Barriers</title>
<para>There are several existing treatments of memory barriers
and atomic instructions, so this section will not include a
lot of detail. To put it simply, one cannot go around reading
variables without a lock if a lock is used to protect writes
to that variable. This becomes obvious when you consider that
memory barriers simply determine relative order of memory
operations; they do not make any guarantee about timing of
memory operations. That is, a memory barrier does not force
the contents of a CPU's local cache or store buffer to flush.
Instead, the memory barrier at lock release simply ensures
that all writes to the protected data will be visible to other
CPUs or devices if the write to release the lock is visible.
The CPU is free to keep that data in its cache or store buffer
as long as it wants. However, if another CPU performs an
atomic instruction on the same datum, the first CPU must
guarantee that the updated value is made visible to the second
CPU along with any other operations that memory barriers may
require.</para>
<para>For example, assuming a simple model where data is
considered visible when it is in main memory (or a global
cache), when an atomic instruction is triggered on one CPU,
other CPUs' store buffers and caches must flush any writes to
that same cache line along with any pending operations behind
a memory barrier.</para>
<para>This requires one to take special care when using an item
protected by atomic instructions. For example, in the sleep
mutex implementation, we have to use an
<function>atomic_cmpset</function> rather than an
<function>atomic_set</function> to turn on the
<constant>MTX_CONTESTED</constant> bit. The reason is that we
read the value of <structfield>mtx_lock</structfield> into a
variable and then make a decision based on that read.
However, the value we read may be stale, or it may change
while we are making our decision. Thus, when the
<function>atomic_set</function> is executed, it may end up
setting the bit on another value than the one we made the
decision on. Thus, we have to use an
<function>atomic_cmpset</function> to set the value only if
the value we made the decision on is up-to-date and
valid.</para>
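<para>The difference can be illustrated with a small userland
sketch. It uses C11 <literal>stdatomic.h</literal> rather than the
kernel's atomic operations, and the lock-word layout and the
<literal>sk_</literal> names are assumptions made up for the
illustration; this is not the kernel's mutex code. The flag bit is
set only if the lock word still holds the value the decision was
based on.</para>

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical lock-word layout: owner bits plus one flag bit. */
#define SK_UNOWNED   ((uintptr_t)0)
#define SK_CONTESTED ((uintptr_t)1)

struct sk_mtx {
    _Atomic uintptr_t lock_word;
};

/*
 * Try to mark the mutex contested.  'seen' is the value the caller read
 * earlier and based its decision on.  The compare-and-set succeeds only
 * if the lock word still holds exactly that value; a plain "set bit"
 * operation would blindly modify whatever value is there now.
 */
static bool
sk_mark_contested(struct sk_mtx *m, uintptr_t seen)
{
    if (seen == SK_UNOWNED)
        return (false);     /* nothing to contest */
    return (atomic_compare_exchange_strong(&m->lock_word, &seen,
        seen | SK_CONTESTED));
}
```

<para>If the lock was released between the read and the update,
the call returns false and the caller has to re-read the lock word
and decide again.</para>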
<para>Finally, atomic instructions only allow one item to be
updated or read. If one needs to atomically update several
items, then a lock must be used instead. For example, if two
counters must be read and have values that are consistent
relative to each other, then those counters must be protected
by a lock rather than by separate atomic instructions.</para>
</sect2>
<sect2>
<title>Read Locks versus Write Locks</title>
<para>Read locks do not need to be as strong as write locks.
Both types of locks need to ensure that the data they are
accessing is not stale. However, only write access requires
exclusive access. Multiple threads can safely read a value.
Using different types of locks for reads and writes can be
implemented in a number of ways.</para>
<para>First, sx locks can be used in this manner by using an
exclusive lock when writing and a shared lock when reading.
This method is quite straightforward.</para>
<para>A second method is a bit more obscure. You can protect a
datum with multiple locks. Then for reading that data you
simply need to have a read lock of one of the locks. However,
to write to the data, you need to have a write lock of all of
the locks. This can make writing rather expensive but can be
useful when data is accessed in various ways. For example,
the parent process pointer is protected by both the
proctree_lock sx lock and the per-process mutex. Sometimes
the proc lock is easier as we are just checking to see who a
parent of a process is that we already have locked. However,
other places such as <function>inferior</function> need to
walk the tree of processes via parent pointers and locking
each process would be prohibitive as well as a pain to
guarantee that the condition you are checking remains valid
for both the check and the actions taken as a result of the
check.</para>
</sect2>
<sect2>
<title>Locking Conditions and Results</title>
<para>If you need a lock to check the state of a variable so
that you can take an action based on the state you read, you
cannot just hold the lock while reading the variable and then
drop the lock before you act on the value you read. Once you
drop the lock, the variable can change, rendering your decision
invalid. Thus, you must hold the lock both while reading the
variable and while performing the action as a result of the
test.</para>
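<para>A minimal userland sketch of this rule follows. It uses a
POSIX mutex in place of a kernel lock, and the
<literal>sk_pool</literal> structure is a hypothetical example,
not a kernel data structure. The check and the action that depends
on it both happen before the lock is dropped.</para>

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical resource pool; 'lock' protects 'avail'. */
struct sk_pool {
    pthread_mutex_t lock;
    int avail;
};

/* Take one item if any is available. */
static bool
sk_pool_take(struct sk_pool *p)
{
    bool ok;

    pthread_mutex_lock(&p->lock);
    ok = (p->avail > 0);    /* check the state ... */
    if (ok)
        p->avail--;         /* ... and act on it under the same lock */
    pthread_mutex_unlock(&p->lock);
    return (ok);
}
```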
</sect2>
</sect1>
<sect1>
<title>General Architecture and Design</title>
<sect2>
<title>Interrupt Handling</title>
<para>Following the pattern of several other multi-threaded &unix;
kernels, FreeBSD deals with interrupt handlers by giving them
their own thread context. Providing a context for interrupt
handlers allows them to block on locks. To help avoid
latency, however, interrupt threads run at real-time kernel
priority. Thus, interrupt handlers should not execute for very
long to avoid starving other kernel threads. In addition,
since multiple handlers may share an interrupt thread,
interrupt handlers should not sleep or use a sleepable lock to
avoid starving another interrupt handler.</para>
<para>The interrupt threads currently in FreeBSD are referred to
as heavyweight interrupt threads. They are called this
because switching to an interrupt thread involves a full
context switch. In the initial implementation, the kernel was
not preemptive and thus interrupts that interrupted a kernel
thread would have to wait until the kernel thread blocked or
returned to userland before they would have an opportunity to
run.</para>
<para>To deal with the latency problems, the kernel in FreeBSD
has been made preemptive. Currently, we only preempt a kernel
thread when we release a sleep mutex or when an interrupt
comes in. However, the plan is to make the FreeBSD kernel
fully preemptive as described below.</para>
<para>Not all interrupt handlers execute in a thread context.
Instead, some handlers execute directly in primary interrupt
context. These interrupt handlers are currently misnamed
<quote>fast</quote> interrupt handlers since the
<constant>INTR_FAST</constant> flag used in earlier versions
of the kernel is used to mark these handlers. The only
interrupts which currently use these types of interrupt
handlers are clock interrupts and serial I/O device
interrupts. Since these handlers do not have their own
context, they may not acquire blocking locks and thus may only
use spin mutexes.</para>
<para>Finally, there is one optional optimization that can be
added in MD code called lightweight context switches. Since
an interrupt thread executes in a kernel context, it can
borrow the vmspace of any process. Thus, in a lightweight
context switch, the switch to the interrupt thread does not
switch vmspaces but borrows the vmspace of the interrupted
thread. In order to ensure that the vmspace of the
interrupted thread does not disappear out from under us, the
interrupted thread is not allowed to execute until the
interrupt thread is no longer borrowing its vmspace. This can
happen when the interrupt thread either blocks or finishes.
If an interrupt thread blocks, then it will use its own
context when it is made runnable again. Thus, it can release
the interrupted thread.</para>
<para>The drawbacks of this optimization are that it is very
machine specific and complex, and thus only worth the effort if
there is a large performance improvement. At this point it is
probably too early to tell, and in fact, will probably hurt
performance as almost all interrupt handlers will immediately
block on Giant and require a thread fix-up when they block.
Also, an alternative method of interrupt handling has been
proposed by Mike Smith that works like so:</para>
<orderedlist>
<listitem>
<para>Each interrupt handler has two parts: a predicate
which runs in primary interrupt context and a handler
which runs in its own thread context.</para>
</listitem>
<listitem>
<para>If an interrupt handler has a predicate, then when an
interrupt is triggered, the predicate is run. If the
predicate returns true then the interrupt is assumed to be
fully handled and the kernel returns from the interrupt.
If the predicate returns false or there is no predicate,
then the threaded handler is scheduled to run.</para>
</listitem>
</orderedlist>
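<para>The two steps above can be sketched as a small dispatch
routine. The structure and function names are assumptions made up
for the illustration, and scheduling the threaded handler is
reduced to incrementing a counter; this is not an existing kernel
interface.</para>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical two-part interrupt handler. */
struct sk_intr {
    bool (*predicate)(void *arg);   /* runs in primary context; may be NULL */
    void *arg;
    int scheduled;                  /* times the threaded handler was scheduled */
};

static void
sk_intr_dispatch(struct sk_intr *ih)
{
    /* A true predicate means the interrupt was fully handled. */
    if (ih->predicate != NULL && ih->predicate(ih->arg))
        return;
    /* False predicate or no predicate: schedule the threaded handler. */
    ih->scheduled++;
}

/* Trivial predicates used to exercise the dispatch routine. */
static bool
sk_pred_handled(void *arg)
{
    (void)arg;
    return (true);
}

static bool
sk_pred_not_handled(void *arg)
{
    (void)arg;
    return (false);
}
```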
<para>Fitting lightweight context switches into this scheme
might prove rather complicated. Since we may want to change
to this scheme at some point in the future, it is probably
best to defer work on lightweight context switches until we
have settled on the final interrupt handling architecture and
determined how lightweight context switches might or might
not fit into it.</para>
</sect2>
<sect2>
<title>Kernel Preemption and Critical Sections</title>
<sect3>
<title>Kernel Preemption in a Nutshell</title>
<para>Kernel preemption is fairly simple. The basic idea is
that a CPU should always be doing the highest priority work
available. Well, that is the ideal at least. There are a
couple of cases where the expense of achieving the ideal is
not worth being perfect.</para>
<para>Implementing full kernel preemption is very
straightforward: when you schedule a thread to be executed
by putting it on a runqueue, you check to see if its
priority is higher than the currently executing thread. If
so, you initiate a context switch to that thread.</para>
<para>While locks can protect most data in the case of a
preemption, not all of the kernel is preemption safe. For
example, if a thread holding a spin mutex is preempted and the
new thread attempts to grab the same spin mutex, the new
thread may spin forever as the interrupted thread may never
get a chance to execute. Also, some code such as the code
to assign an address space number for a process during
exec() on the Alpha needs to not be preempted as it supports
the actual context switch code. Preemption is disabled for
these code sections by using a critical section.</para>
</sect3>
<sect3>
<title>Critical Sections</title>
<para>The responsibility of the critical section API is to
prevent context switches inside of a critical section. With
a fully preemptive kernel, every
<function>setrunqueue</function> of a thread other than the
current thread is a preemption point. One implementation is
for <function>critical_enter</function> to set a per-thread
flag that is cleared by its counterpart. If
<function>setrunqueue</function> is called with this flag
set, it does not preempt regardless of the priority of the new
thread relative to the current thread. However, since
critical sections are used in spin mutexes to prevent
context switches and multiple spin mutexes can be acquired,
the critical section API must support nesting. For this
reason the current implementation uses a nesting count
instead of a single per-thread flag.</para>
<para>In order to minimize latency, preemptions inside of a
critical section are deferred rather than dropped. If a
thread that would normally preempt the current thread is made
runnable while the current thread is inside a critical section,
then a per-thread flag is set to indicate that there is a
pending preemption. When the
outermost critical section is exited, the flag is checked.
If the flag is set, then the current thread is preempted to
allow the higher priority thread to run.</para>
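<para>The nesting count and the deferred preemption flag can be
modeled in a few lines of userland C. The
<literal>sk_thread</literal> structure and the
<literal>sk_</literal> functions are hypothetical stand-ins, and a
context switch is reduced to incrementing a counter; the sketch
only shows the bookkeeping, not the kernel implementation.</para>

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-thread state for the sketch. */
struct sk_thread {
    int crit_nest;          /* critical section nesting count */
    bool preempt_pending;   /* a preemption was deferred */
    int switches;           /* stand-in for performed context switches */
};

static void
sk_critical_enter(struct sk_thread *td)
{
    td->crit_nest++;
}

/* Called when a higher priority thread becomes runnable. */
static void
sk_request_preempt(struct sk_thread *td)
{
    if (td->crit_nest > 0)
        td->preempt_pending = true;     /* defer, do not drop */
    else
        td->switches++;                 /* preempt right away */
}

static void
sk_critical_exit(struct sk_thread *td)
{
    assert(td->crit_nest > 0);
    /* Only leaving the outermost section acts on a deferred request. */
    if (--td->crit_nest == 0 && td->preempt_pending) {
        td->preempt_pending = false;
        td->switches++;
    }
}
```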
<para>Interrupts pose a problem with regards to spin mutexes.
If a low-level interrupt handler needs a lock, it needs to
not interrupt any code needing that lock to avoid possible
data structure corruption. Currently, providing this
mechanism is piggybacked onto the critical section API by means
of the <function>cpu_critical_enter</function> and
<function>cpu_critical_exit</function> functions. Currently
this API disables and re-enables interrupts on all of
FreeBSD's current platforms. This approach may not be
purely optimal, but it is simple to understand and simple to
get right. Theoretically, this second API need only be used
for spin mutexes that are used in primary interrupt context.
However, to make the code simpler, it is used for all spin
mutexes and even all critical sections. It may be desirable
to split out the MD API from the MI API and only use it in
conjunction with the MI API in the spin mutex
implementation. If this approach is taken, then the MD API
likely would need a rename to show that it is a separate API
now.</para>
</sect3>
<sect3>
<title>Design Tradeoffs</title>
<para>As mentioned earlier, a couple of trade-offs have been
made to sacrifice cases where perfect preemption may not
always provide the best performance.</para>
<para>The first trade-off is that the preemption code does not
take other CPUs into account. Suppose we have two CPUs, A
and B with the priority of A's thread as 4 and the priority
of B's thread as 2. If CPU B makes a thread with priority 1
runnable, then in theory, we want CPU A to switch to the new
thread so that we will be running the two highest priority
runnable threads. However, the cost of determining which
CPU to enforce a preemption on as well as actually signaling
that CPU via an IPI along with the synchronization that
would be required would be enormous. Thus, the current code
would instead force CPU B to switch to the higher priority
thread. Note that this still puts the system in a better
position as CPU B is executing a thread of priority 1 rather
than a thread of priority 2.</para>
<para>The second trade-off limits immediate kernel preemption
to real-time priority kernel threads. In the simple case of
preemption defined above, a thread is always preempted
immediately (or as soon as a critical section is exited) if
a higher priority thread is made runnable. However, many
threads executing in the kernel only execute in a kernel
context for a short time before either blocking or returning
to userland. Thus, if the kernel preempts these threads to
run another non-realtime kernel thread, the kernel may
switch out the executing thread just before it is about to
sleep or return to userland. The cache on the CPU must then
adjust to the new thread. When the kernel returns to the
preempted thread, it must refill all the cache information that
was lost.
In addition, two extra context switches are performed that
could be avoided if the kernel deferred the preemption until
the first thread blocked or returned to userland. Thus, by
default, the preemption code will only preempt immediately
if the higher priority thread is a real-time priority
thread.</para>
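<para>The default policy described above can be written as a
single predicate. As in the two-CPU example, a numerically lower
value means a higher priority. The function name, the
<literal>SK_PRI_MAX_REALTIME</literal> boundary, and the
<literal>full_preemption</literal> switch are assumptions made up
for this sketch, not the kernel's actual values or options.</para>

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical boundary: priorities at or below this are real-time. */
#define SK_PRI_MAX_REALTIME 50

/*
 * Decide whether a newly runnable thread should preempt the current
 * thread immediately.  Lower numbers mean higher priority.
 */
static bool
sk_should_preempt(int new_pri, int cur_pri, bool full_preemption)
{
    if (new_pri >= cur_pri)
        return (false);     /* not a higher priority thread */
    if (full_preemption)
        return (true);      /* debugging mode: always preempt */
    return (new_pri <= SK_PRI_MAX_REALTIME);
}
```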
<para>Turning on full kernel preemption for all kernel threads
has value as a debugging aid since it exposes more race
conditions. It is especially useful on UP systems where many
races are hard to simulate otherwise. Thus, there will be a
kernel option to enable preemption for all kernel threads
that can be used for debugging purposes.</para>
</sect3>
</sect2>
<sect2>
<title>Thread Migration</title>
<para>Simply put, a thread migrates when it moves from one CPU
to another. In a non-preemptive kernel this can only happen
at well-defined points such as when calling
<function>tsleep</function> or returning to userland.
However, in the preemptive kernel, an interrupt can force a
preemption and possible migration at any time. This can have
negative effects on per-CPU data since with the exception of
<varname>curthread</varname> and <varname>curpcb</varname> the
data can change whenever you migrate. Since you can
potentially migrate at any time this renders per-CPU data
rather useless. Thus it is desirable to be able to disable
migration for sections of code that need per-CPU data to be
stable.</para>
<para>Critical sections currently prevent migration since they
do not allow context switches. However, this may be too strong
of a requirement to enforce in some cases since a critical
section also effectively blocks interrupt threads on the
current processor. As a result, it may be desirable to
provide an API whereby code may indicate that if the current
thread is preempted it should not migrate to another
CPU.</para>
<para>One possible implementation is to use a per-thread nesting
count <varname>td_pinnest</varname> along with a
<varname>td_pincpu</varname> which is updated to the current
CPU on each context switch. Each CPU has its own run queue
that holds threads pinned to that CPU. A thread is pinned
when its nesting count is greater than zero and a thread
starts off unpinned with a nesting count of zero. When a
thread is put on a runqueue, we check to see if it is pinned.
If so, we put it on the per-CPU runqueue, otherwise we put it
on the global runqueue. When
<function>choosethread</function> is called to retrieve the
next thread, it could either always prefer bound threads to
unbound threads or use some sort of bias when comparing
priorities. If the nesting count is only ever written to by
the thread itself and is only read by other threads when the
owning thread is not executing but while holding the
<varname>sched_lock</varname>, then
<varname>td_pinnest</varname> will not need any other locks.
The <function>migrate_disable</function> function would
increment the nesting count and
<function>migrate_enable</function> would decrement the
nesting count. Due to the locking requirements specified
above, they will only operate on the current thread and thus
would not need to handle the case of making a thread
migratable that currently resides on a per-CPU run
queue.</para>
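<para>A rough sketch of the pinning bookkeeping is shown below.
The field names follow the text, but the <literal>sk_</literal>
functions and the run queue enumeration are hypothetical, and the
run queues themselves are not modeled; the sketch only shows how
the nesting count selects the queue.</para>

```c
#include <assert.h>
#include <stdbool.h>

/* Which run queue a thread would be placed on. */
enum sk_runq { SK_RUNQ_GLOBAL, SK_RUNQ_PERCPU };

struct sk_thread {
    int td_pinnest;     /* pin nesting count; zero means unpinned */
    int td_pincpu;      /* CPU recorded at the last context switch */
};

/* Both operations act on the current thread only. */
static void
sk_migrate_disable(struct sk_thread *td)
{
    td->td_pinnest++;
}

static void
sk_migrate_enable(struct sk_thread *td)
{
    assert(td->td_pinnest > 0);
    td->td_pinnest--;
}

/* A pinned thread goes on the per-CPU queue, otherwise the global one. */
static enum sk_runq
sk_choose_runq(const struct sk_thread *td)
{
    return (td->td_pinnest > 0 ? SK_RUNQ_PERCPU : SK_RUNQ_GLOBAL);
}
```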
<para>It is still debatable if this API is needed or if the
critical section API is sufficient by itself. Many of the
places that need to prevent migration also need to prevent
preemption as well, and in those places a critical section
must be used regardless.</para>
</sect2>
<sect2>
<title>Callouts</title>
<para>The <function>timeout()</function> kernel facility permits
kernel services to register functions for execution as part
of the <function>softclock()</function> software interrupt.
Events are scheduled based on a desired number of clock
ticks, and callbacks to the consumer-provided function
will occur at approximately the right time.</para>
<para>The global list of pending timeout events is protected
by a global spin mutex, <varname>callout_lock</varname>;
all access to the timeout list must be performed with this
mutex held. When <function>softclock()</function> is
woken up, it scans the list of pending timeouts for those
that should fire. In order to avoid lock order reversal,
the <function>softclock</function> thread will release the
<varname>callout_lock</varname> mutex when invoking the
provided <function>timeout()</function> callback function.
If the <constant>CALLOUT_MPSAFE</constant> flag was not set
during registration, then Giant will be grabbed before
invoking the callout, and then released afterwards. The
<varname>callout_lock</varname> mutex will be re-grabbed
before proceeding. The <function>softclock()</function>
code is careful to leave the list in a consistent state
while releasing the mutex. If <constant>DIAGNOSTIC</constant>
is enabled, then the time taken to execute each function is
measured, and a warning generated if it exceeds a
threshold.</para>
</sect2>
</sect1>
<sect1>
<title>Specific Locking Strategies</title>
<sect2>
<title>Credentials</title>
<para><structname>struct ucred</structname> is the kernel's
internal credential structure, and is generally used as the
basis for process-driven access control within the kernel.
BSD-derived systems use a <quote>copy-on-write</quote> model for credential
data: multiple references may exist for a credential structure,
and when a change needs to be made, the structure is duplicated,
modified, and then the reference replaced. Due to widespread
caching of the credential to implement access control on open,
this results in substantial memory savings. With a move to
fine-grained SMP, this model also saves substantially on
locking operations by requiring that modification only occur
on an unshared credential, avoiding the need for explicit
synchronization when consuming a known-shared
credential.</para>
<para>Credential structures with a single reference are
considered mutable; shared credential structures must not be
modified or a race condition is risked. A mutex,
<structfield>cr_mtxp</structfield>, protects the reference
count of <structname>struct ucred</structname> so as to
maintain consistency. Any use of the structure requires a
valid reference for the duration of the use, or the structure
may be released out from under the illegitimate
consumer.</para>
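<para>The copy-on-write rule can be sketched in userland C as
follows. The <literal>sk_cred</literal> structure and its
functions are hypothetical and much simpler than
<structname>struct ucred</structname>; in particular, the mutex
that protects the reference count is left out, so the sketch is
only valid single-threaded.</para>

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified credential. */
struct sk_cred {
    int ref;            /* reference count (locking omitted) */
    unsigned uid;
};

static struct sk_cred *
sk_cred_new(unsigned uid)
{
    struct sk_cred *c = malloc(sizeof(*c));

    if (c == NULL)
        abort();
    c->ref = 1;
    c->uid = uid;
    return (c);
}

static struct sk_cred *
sk_cred_hold(struct sk_cred *c)
{
    c->ref++;
    return (c);
}

static void
sk_cred_free(struct sk_cred *c)
{
    if (--c->ref == 0)
        free(c);
}

/*
 * Change the uid through the reference *cp.  A shared credential is
 * never modified: it is duplicated, the old reference is dropped, and
 * the caller's reference is replaced by the private copy.
 */
static void
sk_cred_setuid(struct sk_cred **cp, unsigned uid)
{
    struct sk_cred *c = *cp;

    if (c->ref > 1) {
        struct sk_cred *n = sk_cred_new(c->uid);

        sk_cred_free(c);
        *cp = c = n;
    }
    c->uid = uid;
}
```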
<para>The <structname>struct ucred</structname> mutex is a leaf
mutex, and for performance reasons, is implemented via a mutex
pool.</para>
<para>Usually, credentials are used in a read-only manner for access
control decisions, and in this case <structfield>td_ucred</structfield>
is generally preferred because it requires no locking. When a
process' credential is updated, the <literal>proc</literal> lock
must be held across the check and update operations to avoid
races. The process credential <structfield>p_ucred</structfield>
must be used for check and update operations to prevent
time-of-check, time-of-use races.</para>
<para>If system call invocations will perform access control after
an update to the process credential, the value of
<structfield>td_ucred</structfield> must also be refreshed to
the current process value. This will prevent use of a stale
credential following a change. The kernel automatically
refreshes the <structfield>td_ucred</structfield> pointer in
the thread structure from the process
<structfield>p_ucred</structfield> whenever a process enters
the kernel, permitting use of a fresh credential for kernel
access control.</para>
</sect2>
<sect2>
<title>File Descriptors and File Descriptor Tables</title>
<para>Details to follow.</para>
</sect2>
<sect2>
<title>Jail Structures</title>
<para><structname>struct prison</structname> stores
administrative details pertinent to the maintenance of jails
created using the &man.jail.2; API. This includes the
per-jail hostname, IP address, and related settings. This
structure is reference-counted since pointers to instances of
the structure are shared by many credential structures. A
single mutex, <structfield>pr_mtx</structfield>, protects read
and write access to the reference count and all mutable
variables inside the <structname>struct prison</structname>. Some variables are set only
when the jail is created, and a valid reference to the
<structname>struct prison</structname> is sufficient to read
these values. The precise locking of each entry is documented
via comments in <filename>sys/jail.h</filename>.</para>
</sect2>
<sect2>
<title>MAC Framework</title>
<para>The TrustedBSD MAC Framework maintains data in a variety
of kernel objects, in the form of <structname>struct
label</structname>. In general, labels in kernel objects
are protected by the same lock as the remainder of the kernel
object. For example, the <structfield>v_label</structfield>
label in <structname>struct vnode</structname> is protected
by the vnode lock on the vnode.</para>
<para>In addition to labels maintained in standard kernel objects,
the MAC Framework also maintains a list of registered and
active policies. The policy list is protected by a global
mutex (<varname>mac_policy_list_lock</varname>) and a busy
count (also protected by the mutex). Since many access
control checks may occur in parallel, entry to the framework
for a read-only access to the policy list requires holding the
mutex while incrementing (and later decrementing) the busy
count. The mutex need not be held for the duration of the
MAC entry operation, since some operations, such as label operations
on file system objects, are long-lived. To modify the policy
list, such as during policy registration and de-registration,
the mutex must be held and the busy count must be zero,
to prevent modification of the list while it is in use.</para>
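<para>A simplified userland sketch of the busy count protocol
follows. It uses a POSIX mutex, the
<literal>sk_policy_list</literal> structure and its functions are
hypothetical, and a writer that finds the list busy simply fails
instead of waiting; it is not the MAC Framework's code.</para>

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical policy list; 'lock' protects both counters. */
struct sk_policy_list {
    pthread_mutex_t lock;
    int busy;           /* number of readers inside the framework */
    int npolicies;      /* stand-in for the list contents */
};

/* Reader entry: the mutex is held only while adjusting the count. */
static void
sk_list_busy(struct sk_policy_list *pl)
{
    pthread_mutex_lock(&pl->lock);
    pl->busy++;
    pthread_mutex_unlock(&pl->lock);
}

/* Reader exit. */
static void
sk_list_unbusy(struct sk_policy_list *pl)
{
    pthread_mutex_lock(&pl->lock);
    assert(pl->busy > 0);
    pl->busy--;
    pthread_mutex_unlock(&pl->lock);
}

/* Writer: modify the list only with the mutex held and no readers. */
static bool
sk_list_try_register(struct sk_policy_list *pl)
{
    bool ok;

    pthread_mutex_lock(&pl->lock);
    ok = (pl->busy == 0);
    if (ok)
        pl->npolicies++;
    pthread_mutex_unlock(&pl->lock);
    return (ok);
}
```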
<para>A condition variable,
<varname>mac_policy_list_not_busy</varname>, is available to
threads that need to wait for the list to become unbusy, but
this condition variable must only be waited on if the caller is
holding no other locks, or a lock order violation may be
possible. The busy count, in effect, acts as a form of
shared/exclusive lock over access to the framework: the difference
is that, unlike with an sx lock, consumers waiting for the list
to become unbusy may be starved, rather than permitting lock
order problems with regards to the busy count and other locks
that may be held on entry to (or inside) the MAC Framework.</para>
</sect2>
<sect2>
<title>Modules</title>
<para>For the module subsystem there exists a single lock that is
used to protect the shared data. This lock is a shared/exclusive
(SX) lock and has a good chance of needing to be acquired (shared
or exclusively), therefore there are a few macros that have been
added to make access to the lock easier. These macros can be
located in <filename>sys/module.h</filename> and are quite basic
in terms of usage. The main structures protected under this lock
are the <structname>module_t</structname> structures (when shared)
and the global <structname>modulelist_t</structname> structure,
modules. One should review the related source code in
<filename>kern/kern_module.c</filename> to further understand the
locking strategy.</para>
</sect2>
<sect2>
<title>Newbus Device Tree</title>
<para>The newbus system will have one sx lock. Readers will
hold a shared (read) lock (&man.sx.slock.9;) and writers will hold
an exclusive (write) lock (&man.sx.xlock.9;). Internal functions
will not do locking at all. Externally visible ones will lock as
needed.
Items for which it does not matter whether a race is won or lost
will not be locked, since they tend to be read all over the place
(e.g. &man.device.get.softc.9;). There will be relatively few
changes to the newbus data structures, so a single lock should
be sufficient and not impose a performance penalty.</para>
</sect2>
<sect2>
<title>Pipes</title>
<para>...</para>
</sect2>
<sect2>
<title>Processes and Threads</title>
<para>- process hierarchy</para>
<para>- proc locks, references</para>
<para>- thread-specific copies of proc entries to freeze during system
calls, including td_ucred</para>
<para>- inter-process operations</para>
<para>- process groups and sessions</para>
</sect2>
<sect2>
<title>Scheduler</title>
<para>Lots of references to <varname>sched_lock</varname> and notes
pointing at specific primitives and related magic elsewhere in the
document.</para>
</sect2>
<sect2>
<title>Select and Poll</title>
<para>The select() and poll() functions permit threads to block
waiting on events on file descriptors--most frequently, whether
or not the file descriptors are readable or writable.</para>
<para>...</para>
</sect2>
<sect2>
<title>SIGIO</title>
<para>The SIGIO service permits processes to request the delivery
of a SIGIO signal to their process group when the read/write status
of specified file descriptors changes. At most one process or
process group is permitted to register for SIGIO from any given
kernel object, and that process or group is referred to as
the owner. Each object supporting SIGIO registration contains
a pointer field that is NULL if the object is not registered, or
points to a <structname>struct sigio</structname> describing
the registration. This field is protected by a global mutex,
<varname>sigio_lock</varname>. Callers to SIGIO maintenance
functions must pass in this field <quote>by reference</quote> so that local
register copies of the field are not made when unprotected by
the lock.</para>
<para>One <structname>struct sigio</structname> is allocated for
each registered object associated with any process or process
group, and contains back-pointers to the object, owner, signal
information, a credential, and the general disposition of the
registration. Each process or process group contains a list of
registered <structname>struct sigio</structname> structures,
<structfield>p_sigiolst</structfield> for processes, and
<structfield>pg_sigiolst</structfield> for process groups.
These lists are protected by the process or process group
locks respectively. Most fields in each <structname>struct
sigio</structname> are constant for the duration of the
registration, with the exception of the
<structfield>sio_pgsigio</structfield> field which links the
<structname>struct sigio</structname> into the process or
process group list. Developers implementing new kernel
objects supporting SIGIO will, in general, want to avoid
holding structure locks while invoking SIGIO supporting
functions, such as <function>fsetown()</function>
or <function>funsetown()</function>, to avoid
defining a lock order between structure locks and the global
SIGIO lock. This is generally possible through use of an
elevated reference count on the structure, such as reliance
on a file descriptor reference to a pipe during a pipe
operation.</para>
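<para>The <quote>by reference</quote> rule can be illustrated with a
small userland model. Everything in the sketch below is an
assumption made for illustration: the <literal>model_</literal>
names, the reduced structure, and the use of a pthread mutex in
place of <varname>sigio_lock</varname>. It is not the kernel's
implementation of <function>fsetown()</function> or
<function>funsetown()</function>; it only shows why the functions
take the address of the field, so that the field is read and written
solely while the global lock is held.</para>

```c
#include <pthread.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical stand-in; the kernel's struct sigio carries more state. */
struct model_sigio {
	pid_t	owner;			/* registered process or group */
};

/* Hypothetical object that supports registration. */
struct model_object {
	struct model_sigio *sigio;	/* NULL when not registered */
};

/* Plays the role of the global sigio_lock in this model. */
static pthread_mutex_t model_sigio_lock = PTHREAD_MUTEX_INITIALIZER;

/* Install a new owner; the caller passes the address of the field. */
int
model_setown(struct model_sigio **sigiop, pid_t owner)
{
	struct model_sigio *nsio, *old;

	nsio = malloc(sizeof(*nsio));	/* allocate before taking the lock */
	if (nsio == NULL)
		return (-1);
	nsio->owner = owner;

	pthread_mutex_lock(&model_sigio_lock);
	old = *sigiop;			/* field read only under the lock */
	*sigiop = nsio;
	pthread_mutex_unlock(&model_sigio_lock);

	free(old);			/* free(NULL) is a no-op */
	return (0);
}

/* Drop the registration, if any. */
void
model_unsetown(struct model_sigio **sigiop)
{
	struct model_sigio *old;

	pthread_mutex_lock(&model_sigio_lock);
	old = *sigiop;
	*sigiop = NULL;
	pthread_mutex_unlock(&model_sigio_lock);

	free(old);
}

/* Return the current owner, or 0 when the object is not registered. */
pid_t
model_getown(struct model_sigio **sigiop)
{
	pid_t owner;

	pthread_mutex_lock(&model_sigio_lock);
	owner = (*sigiop != NULL) ? (*sigiop)->owner : 0;
	pthread_mutex_unlock(&model_sigio_lock);
	return (owner);
}
```

<para>Had the caller passed the pointer value itself, the value
would have been loaded before the lock was taken and could be stale
by the time the function used it.</para>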
</sect2>
<sect2>
<title>Sysctl</title>
<para>The <function>sysctl()</function> MIB service is invoked
from both within the kernel and from userland applications
using a system call. At least two issues are raised in locking:
first, the protection of the structures maintaining the
namespace, and second, interactions with kernel variables and
functions that are accessed by the sysctl interface. Since
sysctl permits the direct export (and modification) of
kernel statistics and configuration parameters, the sysctl
mechanism must become aware of appropriate locking semantics
for those variables. Currently, sysctl makes use of a
single global sx lock to serialize use of sysctl(); however, it
is assumed to operate under Giant and other protections are not
provided. The remainder of this section speculates on locking
and semantic changes to sysctl.</para>
<para>- Need to change the order of operations for sysctl's that
update values from <quote>read old, copyin and copyout, write new</quote> to
<quote>copyin, lock, read old and write new, unlock, copyout</quote>. Normal
sysctl's that just copyout the old value and set a new value
that they copyin may still be able to follow the old model.
However, it may be cleaner to use the second model for all of
the sysctl handlers to avoid lock operations.</para>
<para>- To allow for the common case, a sysctl could embed a
pointer to a mutex in the SYSCTL_FOO macros and in the struct.
This would work for most sysctl's. For values protected by sx
locks, spin mutexes, or other locking strategies besides a
single sleep mutex, SYSCTL_PROC nodes could be used to get the
locking right.</para>
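<para>The proposed ordering, combined with a mutex pointer stored
alongside the value, might look roughly like the following userland
model. All names prefixed with <literal>model_</literal> are
assumptions made for this sketch, the copy helpers merely stand in
for <function>copyin()</function> and
<function>copyout()</function>, and a pthread mutex stands in for a
sleep mutex. It is not the sysctl implementation.</para>

```c
#include <pthread.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical node: the exported value and the mutex protecting it. */
struct model_oid {
	int		*valuep;
	pthread_mutex_t	*mtx;
};

/* Stand-in for copyin(); the real one may fault or sleep. */
static int
model_copyin(const void *uaddr, void *kaddr, size_t len)
{
	memcpy(kaddr, uaddr, len);
	return (0);
}

/* Stand-in for copyout(); the real one may fault or sleep. */
static int
model_copyout(const void *kaddr, void *uaddr, size_t len)
{
	memcpy(uaddr, kaddr, len);
	return (0);
}

/*
 * Handler following the order: copyin, lock, read old and write new,
 * unlock, copyout.  Either oldp or newp may be NULL.
 */
int
model_sysctl_int(struct model_oid *oid, int *oldp, const int *newp)
{
	int error, newval, oldval;

	newval = 0;

	/* copyin happens first, with no lock held. */
	if (newp != NULL) {
		error = model_copyin(newp, &newval, sizeof(newval));
		if (error != 0)
			return (error);
	}

	/* Read the old value and write the new one under the lock. */
	pthread_mutex_lock(oid->mtx);
	oldval = *oid->valuep;
	if (newp != NULL)
		*oid->valuep = newval;
	pthread_mutex_unlock(oid->mtx);

	/* copyout happens last, again with no lock held. */
	if (oldp != NULL)
		return (model_copyout(&oldval, oldp, sizeof(oldval)));
	return (0);
}
```

<para>Because both copies happen outside the locked region, the
mutex is never held across an operation that might sleep, and the
old and new values are exchanged in a single locked step.</para>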
</sect2>
<sect2>
<title>Taskqueue</title>
<para>The taskqueue interface has two basic locks associated
with it in order to protect the related shared data. The
<varname>taskqueue_queues_mutex</varname> is meant to serve as a
lock to protect the <varname>taskqueue_queues</varname> TAILQ.
The other mutex lock associated with this system is the one in the
<structname>struct taskqueue</structname> data structure. The
use of the synchronization primitive here is to protect the
integrity of the data in the <structname>struct
taskqueue</structname>. It should be noted that there are no
separate macros to assist the user in locking down his/her own work
since these locks are most likely not going to be used outside of
<filename>kern/subr_taskqueue.c</filename>.</para>
</sect2>
</sect1>
<sect1>
<title>Implementation Notes</title>
<sect2>
<title>Details of the Mutex Implementation</title>
<para>- Should we require mutexes to be owned for mtx_destroy()
since we can not safely assert that they are unowned by anyone
else otherwise?</para>
<sect3>
<title>Spin Mutexes</title>
<para>- Use a critical section...</para>
</sect3>
<sect3>
<title>Sleep Mutexes</title>
<para>- Describe the races with contested mutexes</para>
<para>- Why it is safe to read mtx_lock of a contested mutex
when holding sched_lock.</para>
<para>- Priority propagation</para>
</sect3>
</sect2>
<sect2>
<title>Witness</title>
<para>- What does it do</para>
<para>- How does it work</para>
</sect2>
</sect1>
<sect1>
<title>Miscellaneous Topics</title>
<sect2>
<title>Interrupt Source and ICU Abstractions</title>
<para>- struct isrc</para>
<para>- pic drivers</para>
</sect2>
<sect2>
<title>Other Random Questions/Topics</title>
<para>Should we pass an interlock into
<function>sema_wait</function>?</para>
<para>- Generic turnstiles for sleep mutexes and sx locks.</para>
<para>- Should we have non-sleepable sx locks?</para>
</sect2>
</sect1>
<glossary id="glossary">
<title>Glossary</title>
<glossentry id="atomic">
<glossterm>atomic</glossterm>
<glossdef>
<para>An operation is atomic if all of its effects are visible
to other CPUs together when the proper access protocol is
followed. The degenerate case is the atomic instructions
provided directly by machine architectures. At a higher
level, if several members of a structure are protected by a
lock, then a set of operations is atomic if they are all
performed while holding the lock without releasing the lock
in between any of the operations.</para>
<glossseealso>operation</glossseealso>
</glossdef>
</glossentry>
<glossentry id="block">
<glossterm>block</glossterm>
<glossdef>
<para>A thread is blocked when it is waiting on a lock,
resource, or condition. Unfortunately this term is a bit
overloaded as a result.</para>
<glossseealso>sleep</glossseealso>
</glossdef>
</glossentry>
<glossentry id="critical-section">
<glossterm>critical section</glossterm>
<glossdef>
<para>A section of code that is not allowed to be preempted.
A critical section is entered and exited using the
&man.critical.enter.9; API.</para>
</glossdef>
</glossentry>
<glossentry id="MD">
<glossterm>MD</glossterm>
<glossdef>
<para>Machine dependent.</para>
<glossseealso>MI</glossseealso>
</glossdef>
</glossentry>
<glossentry id="memory-operation">
<glossterm>memory operation</glossterm>
<glossdef>
<para>A memory operation reads and/or writes to a memory
location.</para>
</glossdef>
</glossentry>
<glossentry id="MI">
<glossterm>MI</glossterm>
<glossdef>
<para>Machine independent.</para>
<glossseealso>MD</glossseealso>
</glossdef>
</glossentry>
<glossentry id="operation">
<glossterm>operation</glossterm>
<glosssee>memory operation</glosssee>
</glossentry>
<glossentry id="primary-interrupt-context">
<glossterm>primary interrupt context</glossterm>
<glossdef>
<para>Primary interrupt context refers to the code that runs
when an interrupt occurs. This code can either run an
interrupt handler directly or schedule an asynchronous
interrupt thread to execute the interrupt handlers for a
given interrupt source.</para>
</glossdef>
</glossentry>
<glossentry>
<glossterm>realtime kernel thread</glossterm>
<glossdef>
<para>A high priority kernel thread. Currently, the only
realtime priority kernel threads are interrupt threads.</para>
<glossseealso>thread</glossseealso>
</glossdef>
</glossentry>
<glossentry id="sleep">
<glossterm>sleep</glossterm>
<glossdef>
<para>A thread is asleep when it is blocked on a condition
variable or a sleep queue via <function>msleep</function> or
<function>tsleep</function>.</para>
<glossseealso>block</glossseealso>
</glossdef>
</glossentry>
<glossentry id="sleepable-lock">
<glossterm>sleepable lock</glossterm>
<glossdef>
<para>A sleepable lock is a lock that can be held by a thread
which is asleep. Lockmgr locks and sx locks are currently
the only sleepable locks in FreeBSD. Eventually, some sx
locks such as the allproc and proctree locks may become
non-sleepable locks.</para>
<glossseealso>sleep</glossseealso>
</glossdef>
</glossentry>
<glossentry id="thread">
<glossterm>thread</glossterm>
<glossdef>
<para>A kernel thread represented by a struct thread. Threads own
locks and hold a single execution context.</para>
</glossdef>
</glossentry>
</glossary>
</article>

View file

@ -1,6 +1,7 @@
# $FreeBSD$
SUBDIR = corp-net-guide
SUBDIR = arch-handbook
SUBDIR+= corp-net-guide
SUBDIR+= design-44bsd
SUBDIR+= developers-handbook
SUBDIR+= faq

View file

@ -15,9 +15,6 @@ HAS_INDEX= true
INSTALL_COMPRESSED?= gz
INSTALL_ONLY_COMPRESSED?=
# Images
IMAGES_EN= sockets/layers.eps sockets/sain.eps sockets/sainfill.eps sockets/sainlsb.eps sockets/sainmsb.eps sockets/sainserv.eps sockets/serv.eps sockets/serv2.eps sockets/slayers.eps
#
# SRCS lists the individual SGML files that make up the document. Changes
# to any of these files will force a rebuild
@ -26,28 +23,20 @@ IMAGES_EN= sockets/layers.eps sockets/sain.eps sockets/sainfill.eps sockets/sain
# SGML content
SRCS= book.sgml
SRCS+= boot/chapter.sgml
SRCS+= dma/chapter.sgml
SRCS+= driverbasics/chapter.sgml
SRCS+= introduction/chapter.sgml
SRCS+= ipv6/chapter.sgml
SRCS+= isa/chapter.sgml
SRCS+= jail/chapter.sgml
SRCS+= kerneldebug/chapter.sgml
SRCS+= kobj/chapter.sgml
SRCS+= l10n/chapter.sgml
SRCS+= locking/chapter.sgml
SRCS+= mac/chapter.sgml
SRCS+= newbus/chapter.sgml
SRCS+= pci/chapter.sgml
SRCS+= policies/chapter.sgml
SRCS+= scsi/chapter.sgml
SRCS+= secure/chapter.sgml
SRCS+= sockets/chapter.sgml
SRCS+= smp/chapter.sgml
SRCS+= sound/chapter.sgml
SRCS+= sysinit/chapter.sgml
SRCS+= tools/chapter.sgml
SRCS+= usb/chapter.sgml
SRCS+= vm/chapter.sgml
SRCS+= x86/chapter.sgml
# Entities

View file

@ -20,7 +20,7 @@
<book>
<bookinfo>
<title>FreeBSD Developers' Handbook</title>
<title>&os; Architecture Handbook</title>
<corpauthor>The FreeBSD Documentation Project</corpauthor>
@ -38,7 +38,7 @@
&bookinfo.legalnotice;
<abstract>
<para>Welcome to the Developers' Handbook. This manual is a
<para>Welcome to the &os; Architecture Handbook. This manual is a
<emphasis>work in progress</emphasis> and is the work of many
individuals. Many sections do not yet exist and some of those
that do exist need to be updated. If you are interested in
@ -55,33 +55,6 @@
</abstract>
</bookinfo>
<part id="Basics">
<title>Basics</title>
&chap.introduction;
&chap.tools;
&chap.secure;
&chap.l10n;
&chap.policies;
</part>
<part id="ipc">
<title>Interprocess Communication</title>
<chapter id="signals">
<title>* Signals</title>
<para>Signals, pipes, semaphores, message queues, shared memory,
ports, sockets, doors</para>
</chapter>
&chap.sockets;
&chap.ipv6;
</part>
<part id="kernel">
<title>Kernel</title>
@ -92,8 +65,7 @@
&chap.sysinit;
&chap.mac;
&chap.vm;
&chap.dma;
&chap.kerneldebug;
&chap.smp;
<chapter id="ufs">
<title>* UFS</title>
@ -145,21 +117,22 @@
</part>
<!-- XXX - finish me
<part id="architectures">
<title>Architectures</title>
&chap.x86;
<chapter id="i386">
<title>* I386</title>
<para>Talk about <literal>i386</literal> specific &os;
architecture.</para>
</chapter>
<chapter id="alpha">
<title>* Alpha</title>
<para>Talk about the architectural specifics of
FreeBSD/alpha.</para>
<para>Explanation of alignment errors, how to fix, how to
ignore.</para>
<para>Example assembly language code for FreeBSD/alpha.</para>
</chapter>
<chapter id="ia64">
@ -169,56 +142,36 @@
FreeBSD/ia64.</para>
</chapter>
<chapter id="sparc64">
<title>* SPARC64</title>
<para>Talk about <literal>SPARC64</literal> specific &os;
architecture.</para>
</chapter>
<chapter id="amd64">
<title>* AMD64</title>
<para>Talk about <literal>AMD64</literal> specific &os;
architecture.</para>
</chapter>
<chapter id="powerpc">
<title>* PowerPC</title>
<para>Talk about <literal>PowerPC</literal> specific &os;
architecture.</para>
</chapter>
</part>
-->
<part id="appendices">
<title>Appendices</title>
<bibliography>
<biblioentry id="COD" xreflabel="1">
<authorgroup>
<author>
<firstname>Dave</firstname>
<othername role="MI">A</othername>
<surname>Patterson</surname>
</author>
<author>
<firstname>John</firstname>
<othername role="MI">L</othername>
<surname>Hennessy</surname>
</author>
</authorgroup>
<copyright><year>1998</year><holder>Morgan Kaufmann Publishers,
Inc.</holder></copyright>
<isbn>1-55860-428-6</isbn>
<publisher>
<publishername>Morgan Kaufmann Publishers, Inc.</publishername>
</publisher>
<title>Computer Organization and Design</title>
<subtitle>The Hardware / Software Interface</subtitle>
<pagenums>1-2</pagenums>
</biblioentry>
<biblioentry xreflabel="2">
<authorgroup>
<author>
<firstname>W.</firstname>
<othername role="Middle">Richard</othername>
<surname>Stevens</surname>
</author>
</authorgroup>
<copyright><year>1993</year><holder>Addison Wesley Longman,
Inc.</holder></copyright>
<isbn>0-201-56317-7</isbn>
<publisher>
<publishername>Addison Wesley Longman, Inc.</publishername>
</publisher>
<title>Advanced Programming in the Unix Environment</title>
<pagenums>1-2</pagenums>
</biblioentry>
<biblioentry xreflabel="3">
<biblioentry xreflabel="1">
<authorgroup>
<author>
<firstname>Marshall</firstname>
@ -250,50 +203,6 @@
<pagenums>1-2</pagenums>
</biblioentry>
<biblioentry id="Phrack" xreflabel="4">
<authorgroup>
<author>
<firstname>Aleph</firstname>
<surname>One</surname>
</author>
</authorgroup>
<title>Phrack 49; "Smashing the Stack for Fun and Profit"</title>
</biblioentry>
<biblioentry id="StackGuard" xreflabel="5">
<authorgroup>
<author>
<firstname>Chrispin</firstname>
<surname>Cowan</surname>
</author>
<author>
<firstname>Calton</firstname>
<surname>Pu</surname>
</author>
<author>
<firstname>Dave</firstname>
<surname>Maier</surname>
</author>
</authorgroup>
<title>StackGuard; Automatic Adaptive Detection and Prevention of
Buffer-Overflow Attacks</title>
</biblioentry>
<biblioentry id="OpenBSD" xreflabel="6">
<authorgroup>
<author>
<firstname>Todd</firstname>
<surname>Miller</surname>
</author>
<author>
<firstname>Theo</firstname>
<surname>de Raadt</surname>
</author>
</authorgroup>
<title>strlcpy and strlcat -- consistent, safe string copy and
concatenation.</title>
</biblioentry>
</bibliography>
<![ %chap.index; [ &chap.index; ]]>

View file

@ -1,5 +1,5 @@
<!--
Creates entities for each chapter in the FreeBSD Developer's
Creates entities for each chapter in the FreeBSD Architecture
Handbook. Each entity is named chap.foo, where foo is the value
of the id attribute on that chapter, and corresponds to the name of
the directory in which that chapter's .sgml file is stored.
@ -9,29 +9,17 @@
$FreeBSD$
-->
<!-- Part one -->
<!ENTITY chap.introduction SYSTEM "introduction/chapter.sgml">
<!ENTITY chap.tools SYSTEM "tools/chapter.sgml">
<!ENTITY chap.secure SYSTEM "secure/chapter.sgml">
<!ENTITY chap.l10n SYSTEM "l10n/chapter.sgml">
<!ENTITY chap.policies SYSTEM "policies/chapter.sgml">
<!-- Part two - IPC -->
<!ENTITY chap.sockets SYSTEM "sockets/chapter.sgml">
<!ENTITY chap.ipv6 SYSTEM "ipv6/chapter.sgml">
<!-- Part three - Kernel -->
<!-- Part one - Kernel -->
<!ENTITY chap.boot SYSTEM "boot/chapter.sgml">
<!ENTITY chap.kobj SYSTEM "kobj/chapter.sgml">
<!ENTITY chap.sysinit SYSTEM "sysinit/chapter.sgml">
<!ENTITY chap.locking SYSTEM "locking/chapter.sgml">
<!ENTITY chap.vm SYSTEM "vm/chapter.sgml">
<!ENTITY chap.dma SYSTEM "dma/chapter.sgml">
<!ENTITY chap.kerneldebug SYSTEM "kerneldebug/chapter.sgml">
<!ENTITY chap.jail SYSTEM "jail/chapter.sgml">
<!ENTITY chap.mac SYSTEM "mac/chapter.sgml">
<!ENTITY chap.smp SYSTEM "smp/chapter.sgml">
<!-- Part four - Device Drivers -->
<!-- Part Two - Device Drivers -->
<!ENTITY chap.driverbasics SYSTEM "driverbasics/chapter.sgml">
<!ENTITY chap.isa SYSTEM "isa/chapter.sgml">
<!ENTITY chap.pci SYSTEM "pci/chapter.sgml">
@ -40,9 +28,5 @@
<!ENTITY chap.newbus SYSTEM "newbus/chapter.sgml">
<!ENTITY chap.snd SYSTEM "sound/chapter.sgml">
<!-- Part five - Architectures -->
<!ENTITY chap.x86 SYSTEM "x86/chapter.sgml">
<!-- Part six - Appendices -->
<!ENTITY chap.bibliography SYSTEM "bibliography/chapter.sgml">
<!-- Part three - Appendices -->
<!ENTITY chap.index SYSTEM "index.sgml">

View file

@ -1,25 +1,11 @@
<!DOCTYPE article PUBLIC "-//FreeBSD//DTD DocBook V4.1-Based Extension//EN" [
<!ENTITY % man PUBLIC "-//FreeBSD//ENTITIES DocBook Manual Page Entities//EN">
%man;
<!ENTITY % authors PUBLIC "-//FreeBSD//ENTITIES DocBook Author Entities//EN">
%authors;
<!ENTITY % misc PUBLIC "-//FreeBSD//ENTITIES DocBook Miscellaneous FreeBSD Entities//EN">
%misc;
<!ENTITY % freebsd PUBLIC "-//FreeBSD//ENTITIES DocBook Miscellaneous FreeBSD Entities//EN">
%freebsd;
<!--ENTITY % mailing-lists PUBLIC "-//FreeBSD//ENTITIES DocBook Mailing List Entities//EN"-->
<!--
%mailing-lists;
The FreeBSD Documentation Project
The FreeBSD SMP Next Generation Project
$FreeBSD$
-->
]>
<article>
<articleinfo>
<title>SMPng Design Document</title>
<chapter id="smp">
<chapterinfo>
<authorgroup>
<author>
<firstname>John</firstname>
@ -39,8 +25,12 @@
<holder>John Baldwin</holder>
<holder>Robert Watson</holder>
</copyright>
</chapterinfo>
<abstract>
<title>SMPng Design Document</title>
<sect1>
<title>Introduction</title>
<para>This document presents the current design and implementation of
the SMPng Architecture. First, the basic primitives and tools are
introduced. Next, a general architecture for the FreeBSD kernel's
@ -51,11 +41,6 @@
notes are provided to motivate design choices, and make the reader
aware of important implications involving the use of specific
primitives. </para>
</abstract>
</articleinfo>
<sect1>
<title>Introduction</title>
<para>This document is a work-in-progress, and will be updated to
reflect on-going design and implementation activities associated
@ -956,4 +941,4 @@
</glossdef>
</glossentry>
</glossary>
</article>
</chapter>

View file

@ -25,28 +25,15 @@ IMAGES_EN= sockets/layers.eps sockets/sain.eps sockets/sainfill.eps sockets/sain
# SGML content
SRCS= book.sgml
SRCS+= boot/chapter.sgml
SRCS+= dma/chapter.sgml
SRCS+= driverbasics/chapter.sgml
SRCS+= introduction/chapter.sgml
SRCS+= ipv6/chapter.sgml
SRCS+= isa/chapter.sgml
SRCS+= jail/chapter.sgml
SRCS+= kerneldebug/chapter.sgml
SRCS+= kobj/chapter.sgml
SRCS+= l10n/chapter.sgml
SRCS+= locking/chapter.sgml
SRCS+= mac/chapter.sgml
SRCS+= pci/chapter.sgml
SRCS+= policies/chapter.sgml
SRCS+= scsi/chapter.sgml
SRCS+= secure/chapter.sgml
SRCS+= sockets/chapter.sgml
SRCS+= sound/chapter.sgml
SRCS+= sysinit/chapter.sgml
SRCS+= tools/chapter.sgml
SRCS+= usb/chapter.sgml
SRCS+= vm/chapter.sgml
SRCS+= x86/chapter.sgml
# Entities

View file

@ -12,7 +12,6 @@
<!ENTITY % freebsd PUBLIC "-//FreeBSD//ENTITIES DocBook Miscellaneous FreeBSD Entities//EN">
%freebsd;
<!ENTITY % chapters SYSTEM "chapters.ent"> %chapters;
<!ENTITY % mac-entities SYSTEM "mac.ent"> %mac-entities;
<!ENTITY % authors PUBLIC "-//FreeBSD//ENTITIES DocBook Author Entities//EN"> %authors
<!ENTITY % mailing-lists PUBLIC "-//FreeBSD//ENTITIES DocBook Mailing List Entities//EN"> %mailing-lists;
<!ENTITY % chap.index "IGNORE">
@ -85,13 +84,6 @@
<part id="kernel">
<title>Kernel</title>
&chap.boot;
&chap.locking;
&chap.kobj;
&chap.jail;
&chap.sysinit;
&chap.mac;
&chap.vm;
&chap.dma;
&chap.kerneldebug;
@ -131,20 +123,6 @@
</chapter>
</part>
<part id="devicedrivers">
<title>Device Drivers</title>
&chap.driverbasics;
&chap.isa;
&chap.pci;
&chap.scsi;
&chap.usb;
&chap.newbus;
&chap.snd;
</part>
<part id="architectures">
<title>Architectures</title>
@ -153,22 +131,11 @@
<chapter id="alpha">
<title>* Alpha</title>
<para>Talk about the architectural specifics of
FreeBSD/alpha.</para>
<para>Explanation of alignment errors, how to fix, how to
ignore.</para>
<para>Example assembly language code for FreeBSD/alpha.</para>
</chapter>
<chapter id="ia64">
<title>* IA-64</title>
<para>Talk about the architectural specifics of
FreeBSD/ia64.</para>
</chapter>
</part>
<part id="appendices">

File diff suppressed because it is too large

View file

@ -21,28 +21,11 @@
<!ENTITY chap.ipv6 SYSTEM "ipv6/chapter.sgml">
<!-- Part three - Kernel -->
<!ENTITY chap.boot SYSTEM "boot/chapter.sgml">
<!ENTITY chap.kobj SYSTEM "kobj/chapter.sgml">
<!ENTITY chap.sysinit SYSTEM "sysinit/chapter.sgml">
<!ENTITY chap.locking SYSTEM "locking/chapter.sgml">
<!ENTITY chap.vm SYSTEM "vm/chapter.sgml">
<!ENTITY chap.dma SYSTEM "dma/chapter.sgml">
<!ENTITY chap.kerneldebug SYSTEM "kerneldebug/chapter.sgml">
<!ENTITY chap.jail SYSTEM "jail/chapter.sgml">
<!ENTITY chap.mac SYSTEM "mac/chapter.sgml">
<!-- Part four - Device Drivers -->
<!ENTITY chap.driverbasics SYSTEM "driverbasics/chapter.sgml">
<!ENTITY chap.isa SYSTEM "isa/chapter.sgml">
<!ENTITY chap.pci SYSTEM "pci/chapter.sgml">
<!ENTITY chap.scsi SYSTEM "scsi/chapter.sgml">
<!ENTITY chap.usb SYSTEM "usb/chapter.sgml">
<!ENTITY chap.newbus SYSTEM "newbus/chapter.sgml">
<!ENTITY chap.snd SYSTEM "sound/chapter.sgml">
<!-- Part five - Architectures -->
<!ENTITY chap.x86 SYSTEM "x86/chapter.sgml">
<!-- Part six - Appendices -->
<!ENTITY chap.bibliography SYSTEM "bibliography/chapter.sgml">
<!ENTITY chap.index SYSTEM "index.sgml">

View file

@ -1,392 +0,0 @@
<!--
The FreeBSD Documentation Project
$FreeBSD$
-->
<chapter id="driverbasics">
<title>Writing FreeBSD Device Drivers</title>
<para>This chapter was written by &a.murray; with selections from a
variety of sources including the intro(4) manual page by
&a.joerg;.</para>
<sect1 id="driverbasics-intro">
<title>Introduction</title>
<para>This chapter provides a brief introduction to writing device
drivers for FreeBSD. A device in this context is a term used
mostly for hardware-related stuff that belongs to the system,
like disks, printers, or a graphics display with its keyboard.
A device driver is the software component of the operating
system that controls a specific device. There are also
so-called pseudo-devices where a device driver emulates the
behavior of a device in software without any particular
underlying hardware. Device drivers can be compiled into the
system statically or loaded on demand through the dynamic kernel
linker facility `kld'.</para>
<para>Most devices in a Unix-like operating system are accessed
through device-nodes, sometimes also called special files.
These files are usually located under the directory
<filename>/dev</filename> in the filesystem hierarchy.
In releases of FreeBSD older than 5.0-RELEASE, where
&man.devfs.5; support is not integrated into FreeBSD,
each device node must be
created statically and independently of the existence of the
associated device driver. Most device nodes on the system are
created by running <command>MAKEDEV</command>.</para>
<para>Device drivers can roughly be broken down into two
categories: character and network device drivers.</para>
</sect1>
<sect1 id="driverbasics-kld">
<title>Dynamic Kernel Linker Facility - KLD</title>
<para>The kld interface allows system administrators to
dynamically add and remove functionality from a running system.
This allows device driver writers to load their new changes into
a running kernel without constantly rebooting to test
changes.</para>
<para>The kld interface is used through the following
privileged commands:
<itemizedlist>
<listitem><simpara><command>kldload</command> - loads a new kernel
module</simpara></listitem>
<listitem><simpara><command>kldunload</command> - unloads a kernel
module</simpara></listitem>
<listitem><simpara><command>kldstat</command> - lists the currently loaded
modules</simpara></listitem>
</itemizedlist>
</para>
<para>Skeleton Layout of a kernel module</para>
<programlisting>/*
* KLD Skeleton
* Inspired by Andrew Reiter's Daemonnews article
*/
#include &lt;sys/types.h&gt;
#include &lt;sys/module.h&gt;
#include &lt;sys/systm.h&gt; /* uprintf */
#include &lt;sys/errno.h&gt;
#include &lt;sys/param.h&gt; /* defines used in kernel.h */
#include &lt;sys/kernel.h&gt; /* types used in module initialization */
/*
* Load handler that deals with the loading and unloading of a KLD.
*/
static int
skel_loader(struct module *m, int what, void *arg)
{
int err = 0;
switch (what) {
case MOD_LOAD: /* kldload */
uprintf("Skeleton KLD loaded.\n");
break;
case MOD_UNLOAD:
uprintf("Skeleton KLD unloaded.\n");
break;
default:
err = EINVAL;
break;
}
return(err);
}
/* Declare this module to the rest of the kernel */
static moduledata_t skel_mod = {
"skel",
skel_loader,
NULL
};
DECLARE_MODULE(skeleton, skel_mod, SI_SUB_KLD, SI_ORDER_ANY);</programlisting>
<sect2>
<title>Makefile</title>
<para>FreeBSD provides a makefile include that you can use to
quickly compile your kernel addition.</para>
<programlisting>SRCS=skeleton.c
KMOD=skeleton
.include &lt;bsd.kmod.mk&gt;</programlisting>
<para>Simply running <command>make</command> with this makefile
will create a file <filename>skeleton.ko</filename> that can
be loaded into your system by typing:
<screen>&prompt.root; <userinput>kldload -v ./skeleton.ko</userinput></screen>
</para>
</sect2>
</sect1>
<sect1 id="driverbasics-access">
<title>Accessing a device driver</title>
<para>Unix provides a common set of system calls for user
applications to use. The upper layers of the kernel dispatch
these calls to the corresponding device driver when a user
accesses a device node. The <command>/dev/MAKEDEV</command>
script makes most of the device nodes for your system but if you
are doing your own driver development it may be necessary to
create your own device nodes with <command>mknod</command>.
</para>
<sect2>
<title>Creating static device nodes</title>
<para>The <command>mknod</command> command requires four
arguments to create a device node. You must specify the name
of the device node, the type of device, the major number of
the device, and the minor number of the device.</para>
</sect2>
<sect2>
<title>Dynamic device nodes</title>
<para>The device filesystem, or devfs, provides access to the
kernel's device namespace in the global filesystem namespace.
This eliminates the problems of potentially having a device
driver without a static device node, or a device node without
an installed device driver. Devfs is still a work in
progress, but it is already working quite nicely.</para>
</sect2>
</sect1>
<sect1 id="driverbasics-char">
<title>Character Devices</title>
<para>A character device driver is one that transfers data
directly to and from a user process. This is the most common
type of device driver and there are plenty of simple examples in
the source tree.</para>
<para>This simple example pseudo-device remembers whatever values
you write to it and can then supply them back to you when you
read from it.</para>
<programlisting>/*
* Simple `echo' pseudo-device KLD
*
* Murray Stokely
*/
#define MIN(a,b) (((a) < (b)) ? (a) : (b))
#include &lt;sys/types.h&gt;
#include &lt;sys/module.h&gt;
#include &lt;sys/systm.h&gt; /* uprintf */
#include &lt;sys/errno.h&gt;
#include &lt;sys/param.h&gt; /* defines used in kernel.h */
#include &lt;sys/kernel.h&gt; /* types used in module initialization */
#include &lt;sys/conf.h&gt; /* cdevsw struct */
#include &lt;sys/uio.h&gt; /* uio struct */
#include &lt;sys/malloc.h&gt;
#define BUFFERSIZE 256
/* Function prototypes */
d_open_t echo_open;
d_close_t echo_close;
d_read_t echo_read;
d_write_t echo_write;
/* Character device entry points */
static struct cdevsw echo_cdevsw = {
echo_open,
echo_close,
echo_read,
echo_write,
noioctl,
nopoll,
nommap,
nostrategy,
"echo",
33, /* reserved for lkms - /usr/src/sys/conf/majors */
nodump,
nopsize,
D_TTY,
-1
};
typedef struct s_echo {
char msg[BUFFERSIZE];
int len;
} t_echo;
/* vars */
static dev_t sdev;
static int len;
static int count;
static t_echo *echomsg;
MALLOC_DECLARE(M_ECHOBUF);
MALLOC_DEFINE(M_ECHOBUF, "echobuffer", "buffer for echo module");
/*
* This function is called by the kld[un]load(2) system calls to
* determine what actions to take when a module is loaded or unloaded.
*/
static int
echo_loader(struct module *m, int what, void *arg)
{
int err = 0;
switch (what) {
case MOD_LOAD: /* kldload */
sdev = make_dev(<literal>&</literal>echo_cdevsw,
0,
UID_ROOT,
GID_WHEEL,
0600,
"echo");
/* kmalloc memory for use by this driver */
/* malloc(256,M_ECHOBUF,M_WAITOK); */
MALLOC(echomsg, t_echo *, sizeof(t_echo), M_ECHOBUF, M_WAITOK);
printf("Echo device loaded.\n");
break;
case MOD_UNLOAD:
destroy_dev(sdev);
FREE(echomsg,M_ECHOBUF);
printf("Echo device unloaded.\n");
break;
default:
err = EINVAL;
break;
}
return(err);
}
int
echo_open(dev_t dev, int oflags, int devtype, struct proc *p)
{
int err = 0;
uprintf("Opened device \"echo\" successfully.\n");
return(err);
}
int
echo_close(dev_t dev, int fflag, int devtype, struct proc *p)
{
uprintf("Closing device \"echo.\"\n");
return(0);
}
/*
* The read function just takes the buf that was saved via
* echo_write() and returns it to userland for accessing.
* uio(9)
*/
int
echo_read(dev_t dev, struct uio *uio, int ioflag)
{
int err = 0;
int amt;
/* How big is this read operation? Either as big as the user wants,
or as big as the remaining data */
amt = MIN(uio->uio_resid, (echomsg->len - uio->uio_offset > 0) ? echomsg->len - uio->uio_offset : 0);
if ((err = uiomove(echomsg->msg + uio->uio_offset,amt,uio)) != 0) {
uprintf("uiomove failed!\n");
}
return err;
}
/*
* echo_write takes in a character string and saves it
* to buf for later accessing.
*/
int
echo_write(dev_t dev, struct uio *uio, int ioflag)
{
int err = 0;
/* Copy the string in from user memory to kernel memory */
err = copyin(uio->uio_iov->iov_base, echomsg->msg, MIN(uio->uio_iov->iov_len,BUFFERSIZE - 1));
/* Now we need to null terminate; the copy is capped at BUFFERSIZE - 1 to leave room for the terminator */
*(echomsg->msg + MIN(uio->uio_iov->iov_len,BUFFERSIZE - 1)) = 0;
/* Record the length */
echomsg->len = MIN(uio->uio_iov->iov_len,BUFFERSIZE - 1);
if (err != 0) {
uprintf("Write failed: bad address!\n");
}
count++;
return(err);
}
DEV_MODULE(echo,echo_loader,NULL);</programlisting>
<para>To install this driver you will first need to make a node on
your filesystem with a command such as:</para>
<screen>&prompt.root; <userinput>mknod /dev/echo c 33 0</userinput></screen>
<para>With this driver loaded you should now be able to type
something like:</para>
<screen>&prompt.root; <userinput>echo -n "Test Data" &gt; /dev/echo</userinput>
&prompt.root; <userinput>cat /dev/echo</userinput>
Test Data</screen>
<para>Real hardware devices are covered in the next chapter.</para>
<para>Additional Resources
<itemizedlist>
<listitem><simpara><ulink
url="http://www.daemonnews.org/200010/blueprints.html">Dynamic
Kernel Linker (KLD) Facility Programming Tutorial</ulink> -
<ulink url="http://www.daemonnews.org/">Daemonnews</ulink> October 2000</simpara></listitem>
<listitem><simpara><ulink
url="http://www.daemonnews.org/200007/newbus-intro.html">How
to Write Kernel Drivers with NEWBUS</ulink> - <ulink
url="http://www.daemonnews.org/">Daemonnews</ulink> July
2000</simpara></listitem>
</itemizedlist>
</para>
</sect1>
<sect1 id="driverbasics-net">
<title>Network Drivers</title>
<para>Drivers for network devices do not use device nodes in order
to be accessed. Their selection is based on other decisions
made inside the kernel and instead of calling open(), use of a
network device is generally introduced by using the system call
socket(2).</para>
<para>man ifnet(), loopback device, Bill Paul's drivers,
etc..</para>
</sect1>
</chapter>
<!--
Local Variables:
mode: sgml
sgml-declaration: "../chapter.decl"
sgml-indent-data: t
sgml-omittag: nil
sgml-always-quote-attributes: t
sgml-parent-document: ("../book.sgml" "part" "chapter")
End:
-->

File diff suppressed because it is too large

View file

@ -1,597 +0,0 @@
<!--
The FreeBSD Documentation Project
$FreeBSD$
-->
<chapter id="jail">
<chapterinfo>
<author>
<firstname>Evan</firstname>
<surname>Sarmiento</surname>
<affiliation>
<address><email>evms@cs.bu.edu</email></address>
</affiliation>
</author>
<copyright>
<year>2001</year>
<holder role="mailto:evms@cs.bu.edu">Evan Sarmiento</holder>
</copyright>
</chapterinfo>
<title>The Jail Subsystem</title>
<para>On most UNIX systems, root has omnipotent power. This promotes
insecurity: if an attacker gains root on a system, he has every
function at his fingertips. FreeBSD provides sysctls which dilute
the power of root in order to minimize the damage an attacker can
cause. One of these mechanisms is called secure levels. Another,
present in FreeBSD 4.0 and later, is
&man.jail.8;. <application>Jail</application> chroots an
environment and places certain restrictions on processes which are
forked within it. For example, a jailed process cannot affect
processes outside of the jail, utilize certain system calls, or
inflict any damage on the host system.</para>
<para><application>Jail</application> is becoming the new security
model. People are running potentially vulnerable servers such as
Apache, BIND, and sendmail within jails, so that if an attacker
gains root within the <application>Jail</application>, it is only
an annoyance, and not a disaster. This chapter focuses on the
internals (source code) of <application>Jail</application>.
It will also suggest improvements upon the jail code base which
are already being worked on. If you are looking for a how-to on
setting up a <application>Jail</application>, I suggest you look
at my other article in Sys Admin Magazine, May 2001, entitled
"Securing FreeBSD using <application>Jail</application>."</para>
<sect1 id="jail-arch">
<title>Architecture</title>
<para>
<application>Jail</application> consists of two realms: the
user-space program, jail, and the code implemented within the
kernel: the <literal>jail()</literal> system call and associated
restrictions. I will be discussing the user-space program and
then how jail is implemented within the kernel.</para>
<sect2>
<title>Userland code</title>
<para>The source for the userland jail is located in
<filename>/usr/src/usr.sbin/jail</filename>, consisting of
one file, <filename>jail.c</filename>. The program takes these
arguments: the path of the jail, hostname, IP address, and the
command to be executed.</para>
<sect3>
<title>Data Structures</title>
<para>In <filename>jail.c</filename>, the first thing I would
note is the declaration of an important structure,
<literal>struct jail j;</literal>, whose definition is included from
<filename>/usr/include/sys/jail.h</filename>.</para>
<para>The definition of the jail structure is:</para>
<programlisting><filename>/usr/include/sys/jail.h</filename>:
struct jail {
u_int32_t version;
char *path;
char *hostname;
u_int32_t ip_number;
};</programlisting>
<para>As you can see, there is an entry for each of the
arguments passed to the jail program, and indeed, they are
set during its execution.</para>
<programlisting><filename>/usr/src/usr.sbin/jail/jail.c</filename>:
j.version = 0;
j.path = argv[1];
j.hostname = argv[2];</programlisting>
</sect3>
<sect3>
<title>Networking</title>
<para>One of the arguments passed to the jail program is an IP
address with which the jail can be accessed over the
network. Jail converts the IP address given into host
byte order and then stores it in j (the jail structure).</para>
<programlisting><filename>/usr/src/usr.sbin/jail/jail.c</filename>:
struct in_addr in;
...
i = inet_aton(argv[3], <![CDATA[&in]]>);
...
j.ip_number = ntohl(in.s_addr);</programlisting>
<para>The
<citerefentry><refentrytitle>inet_aton</refentrytitle><manvolnum>3</manvolnum></citerefentry>
function "interprets the specified character string as an
Internet address, placing the address into the structure
provided." It stores the address in network byte order, so the
<literal>ip_number</literal> member of the jail structure is set
only after the address placed into the <literal>in</literal>
structure by <function>inet_aton()</function> has been converted
to host byte order by
<function>ntohl()</function>.</para>
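<para>The conversion can be tried outside of jail.c. The following
is a minimal userland sketch, not code from the FreeBSD tree: the
helper name <literal>parse_jail_ip</literal> and its return
convention are assumptions made for illustration.</para>

```c
#define _DEFAULT_SOURCE         /* glibc only: make inet_aton() visible */
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of jail.c): parse a dotted-quad string
 * and store the address in host byte order, the way jail.c fills in
 * j.ip_number.  Returns 0 on success, -1 if the string is not a valid
 * address.
 */
int
parse_jail_ip(const char *text, uint32_t *ip_number)
{
	struct in_addr in;

	assert(text != NULL && ip_number != NULL);
	if (inet_aton(text, &in) == 0)  /* inet_aton() returns 0 on failure */
		return (-1);
	*ip_number = ntohl(in.s_addr);  /* network order -> host order */
	return (0);
}
```

<para>Called with the string 192.168.0.1, the helper stores
0xC0A80001 regardless of the byte order of the machine, because
<function>ntohl()</function> always yields the host
representation.</para>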
</sect3>
<sect3>
<title>Jailing The Process</title>
<para>Finally, the userland program jails the process and
executes the command specified. Jail becomes an
imprisoned process itself and then executes the command
given using &man.execv.3;.</para>
<programlisting><filename>/usr/src/usr.sbin/jail/jail.c</filename>:
i = jail(<![CDATA[&j]]>);
...
i = execv(argv[4], argv + 4);</programlisting>
<para>As you can see, the jail function is being called, and
its argument is the jail structure which has been filled
with the arguments given to the program. Finally, the
program you specify is executed. I will now discuss how Jail
is implemented within the kernel.</para>
</sect3>
</sect2>
<sect2>
<title>Kernel Space</title>
<para>We will now be looking at the file
<filename>/usr/src/sys/kern/kern_jail.c</filename>. This is
the file where the jail system call, appropriate sysctls, and
networking functions are defined.</para>
<sect3>
<title>sysctls</title>
<para>In <filename>kern_jail.c</filename>, the following
sysctls are defined:</para>
<programlisting><filename>/usr/src/sys/kern/kern_jail.c:</filename>
int jail_set_hostname_allowed = 1;
SYSCTL_INT(_jail, OID_AUTO, set_hostname_allowed, CTLFLAG_RW,
    <![CDATA[&jail]]>_set_hostname_allowed, 0,
    "Processes in jail can set their hostnames");
int jail_socket_unixiproute_only = 1;
SYSCTL_INT(_jail, OID_AUTO, socket_unixiproute_only, CTLFLAG_RW,
    <![CDATA[&jail]]>_socket_unixiproute_only, 0,
    "Processes in jail are limited to creating UNIX/IPv4/route sockets only");
int jail_sysvipc_allowed = 0;
SYSCTL_INT(_jail, OID_AUTO, sysvipc_allowed, CTLFLAG_RW,
    <![CDATA[&jail]]>_sysvipc_allowed, 0,
    "Processes in jail can use System V IPC primitives");</programlisting>
<para>Each of these sysctls can be accessed by the user
through the sysctl program. Throughout the kernel, these
specific sysctls are recognized by their name. For example,
the name of the first sysctl is
<literal>jail.set_hostname_allowed</literal>.</para>
</sect3>
<sect3>
<title>&man.jail.2; system call</title>
<para>Like all system calls, the &man.jail.2; system call takes
two arguments, <literal>struct proc *p</literal> and
<literal>struct jail_args
*uap</literal>. <literal>p</literal> is a pointer to a proc
structure which describes the calling process. In this
context, uap is a pointer to a structure which specifies the
arguments given to &man.jail.2; from the userland program
<filename>jail.c</filename>. When I described the userland
program before, you saw that the &man.jail.2; system call was
given a jail structure as its own argument.</para>
<programlisting><filename>/usr/src/sys/kern/kern_jail.c:</filename>
int
jail(p, uap)
struct proc *p;
struct jail_args /* {
syscallarg(struct jail *) jail;
} */ *uap;</programlisting>
<para>Therefore, <literal>uap->jail</literal> would access the
jail structure which was passed to the system call. Next,
the system call copies the jail structure into kernel space
using the <literal>copyin()</literal>
function. <literal>copyin()</literal> takes three arguments:
the address of the data which is to be copied into kernel space
(<literal>uap->jail</literal>), where to store it
(<literal>j</literal>), and the size of the storage. The jail
structure pointed to by <literal>uap->jail</literal> is copied
into kernel space and stored in another jail structure,
<literal>j</literal>.</para>
<programlisting><filename>/usr/src/sys/kern/kern_jail.c: </filename>
error = copyin(uap->jail, <![CDATA[&j]]>, sizeof j);</programlisting>
<para>There is another important structure defined in
jail.h. It is the prison structure
(<literal>pr</literal>). The prison structure is used
exclusively within kernel space. The &man.jail.2; system call
copies everything from the jail structure onto the prison
structure. Here is the definition of the prison structure.</para>
<programlisting><filename>/usr/include/sys/jail.h</filename>:
struct prison {
int pr_ref;
char pr_host[MAXHOSTNAMELEN];
u_int32_t pr_ip;
void *pr_linux;
};</programlisting>
<para>The jail() system call then allocates memory for a
prison structure and copies data from the jail structure
into it.</para>
<programlisting><filename>/usr/src/sys/kern/kern_jail.c</filename>:
MALLOC(pr, struct prison *, sizeof *pr , M_PRISON, M_WAITOK);
bzero((caddr_t)pr, sizeof *pr);
error = copyinstr(j.hostname, <![CDATA[&pr->pr_host]]>, sizeof pr->pr_host, 0);
if (error)
goto bail;</programlisting>
<para>Finally, the jail system call chroots the path
specified. The chroot function is given two arguments. The
first is p, which represents the calling process; the second
is a pointer to a <literal>chroot_args</literal> structure, which
contains the path which is to be chrooted. As
you can see, the path specified in the jail structure is
copied to the <literal>chroot_args</literal> structure and used.</para>
<programlisting><filename>/usr/src/sys/kern/kern_jail.c</filename>:
ca.path = j.path;
error = chroot(p, <![CDATA[&ca]]>);</programlisting>
<para>These next three lines in the source are very important,
as they specify how the kernel recognizes a process as
jailed. Each process on a Unix system is described by its
own proc structure. You can see the whole proc structure in
<filename>/usr/include/sys/proc.h</filename>. For example,
the p argument in any system call is actually a pointer to
that process' proc structure, as stated before. The proc
structure contains nodes which can describe the owner's
identity (<literal>p_cred</literal>), the process resource
limits (<literal>p_limit</literal>), and so on. In the
definition of the process structure, there is a pointer to a
prison structure (<literal>p_prison</literal>).</para>
<programlisting><filename>/usr/include/sys/proc.h: </filename>
struct proc {
...
struct prison *p_prison;
...
};</programlisting>
<para>In <filename>kern_jail.c</filename>, the function then
points <literal>p->p_prison</literal> at the pr structure, which
is filled with all the information from the original jail
structure. It then does a
bitwise OR of <literal>p->p_flag</literal> with the constant
<literal>P_JAILED</literal>, meaning that the calling
process is now recognized as jailed. The parent process of
each process forked within the jail is the program jail
itself, as it calls the &man.jail.2; system call. When the
command is executed through execve, it runs in that same
process and keeps its proc structure; therefore it has
<literal>P_JAILED</literal> set in
<literal>p->p_flag</literal>, and its
<literal>p->p_prison</literal> pointer is filled in.</para>
<programlisting><filename>/usr/src/sys/kern/kern_jail.c</filename>:
p->p_prison = pr;
p->p_flag |= P_JAILED;</programlisting>
<para>When a process is forked from a parent process, the
&man.fork.2; system call deals differently with imprisoned
processes. In the fork system call, there are two pointers
to a <literal>proc</literal> structure, <literal>p1</literal>
and <literal>p2</literal>. <literal>p1</literal> points to
the parent's <literal>proc</literal> structure and
<literal>p2</literal> points
to the child's unfilled <literal>proc</literal>
structure. After copying all relevant data between the
structures, &man.fork.2; checks whether
<literal>p_prison</literal> is set on
<literal>p2</literal>. If it is, it increments the prison's
<literal>pr_ref</literal> by one and sets the
<literal>P_JAILED</literal> bit in the child's
<literal>p_flag</literal>.</para>
<programlisting><filename>/usr/src/sys/kern/kern_fork.c</filename>:
if (p2->p_prison) {
p2->p_prison->pr_ref++;
p2->p_flag |= P_JAILED;
}</programlisting>
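<para>The bookkeeping can be modelled in userland. The sketch below
is an illustration only: the <literal>sk_</literal> names are
hypothetical, the structures keep just the members discussed above,
and the flag value is an arbitrary stand-in for
<literal>P_JAILED</literal>.</para>

```c
#include <assert.h>
#include <stddef.h>

#define SK_P_JAILED 0x1000000   /* stand-in for P_JAILED; value is arbitrary */

struct sk_prison {
	int pr_ref;                     /* number of processes in this jail */
};

struct sk_proc {
	int p_flag;
	struct sk_prison *p_prison;     /* NULL when the process is not jailed */
};

/*
 * Model of the jail-related part of fork: p1 is the parent and p2 the
 * child.  The pointer assignment stands in for the bulk copy of the
 * proc structure that the real fork performs first.
 */
void
sk_fork_inherit(const struct sk_proc *p1, struct sk_proc *p2)
{
	assert(p1 != NULL && p2 != NULL);
	p2->p_flag = 0;
	p2->p_prison = p1->p_prison;
	if (p2->p_prison) {
		p2->p_prison->pr_ref++;         /* one more process in the jail */
		p2->p_flag |= SK_P_JAILED;      /* the child is jailed as well */
	}
}
```

<para>Forking from a parent whose <literal>p_prison</literal> points
at a prison with a reference count of 1 leaves the count at 2 and
the child flagged as jailed; forking from an unjailed parent changes
nothing.</para>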
</sect3>
</sect2>
</sect1>
<sect1 id="jail-restrictions">
<title>Restrictions</title>
<para>Throughout the kernel there are access restrictions relating
to jailed processes. Usually, these restrictions only check whether
the process is jailed, and if so, return an error. For
example:</para>
<programlisting>if (p->p_prison)
    return EPERM;</programlisting>
<sect2>
<title>SysV IPC</title>
<para>System V IPC is based on messages. Processes can send each
other these messages which tell them how to act. The functions
which deal with messages are: <literal>msgsys</literal>,
<literal>msgctl</literal>, <literal>msgget</literal>,
<literal>msgsnd</literal> and <literal>msgrcv</literal>.
Earlier, I mentioned that there were certain sysctls you could
turn on or off in order to affect the behavior of Jail. One of
these sysctls was <literal>jail_sysvipc_allowed</literal>. On
most systems, this sysctl is set to 0. If it were set to 1, it
would defeat the whole purpose of having a jail; privileged
users from within the jail would be able to affect processes
outside of the environment. The difference between a message
and a signal is that a signal only consists of the signal
number, whereas a message can carry data.</para>
<para><filename>/usr/src/sys/kern/sysv_msg.c</filename>:</para>
<itemizedlist>
<listitem> <para>&man.msgget.3;: msgget returns (and possibly
creates) a message descriptor that designates a message queue
for use in other system calls.</para></listitem>
<listitem> <para>&man.msgctl.3;: Using this function, a process
can query the status of a message
descriptor.</para></listitem>
<listitem> <para>&man.msgsnd.3;: msgsnd sends a message to a
message queue.</para></listitem>
<listitem> <para>&man.msgrcv.3;: a process receives messages using
this function.</para></listitem>
</itemizedlist>
<para>In each of these system calls, there is this
conditional:</para>
<programlisting><filename>/usr/src/sys/kern/sysv_msg.c</filename>:
if (!jail_sysvipc_allowed && p->p_prison != NULL)
    return (ENOSYS);</programlisting>
<para>Semaphore system calls allow processes to synchronize
execution by doing a set of operations atomically on a set of
semaphores. Basically, semaphores provide another way for
processes to lock resources. However, a process waiting on a
semaphore that is in use will sleep until the resources
are relinquished. The following semaphore system calls are
blocked inside a jail: <literal>semsys</literal>,
<literal>semget</literal>, <literal>semctl</literal> and
<literal>semop</literal>.</para>
<para><filename>/usr/src/sys/kern/sysv_sem.c</filename>:</para>
<itemizedlist>
<listitem>
<para>&man.semctl.2;<literal>(id, num, cmd, arg)</literal>:
Semctl does the specified cmd on the semaphore queue
indicated by id.</para></listitem>
<listitem>
<para>&man.semget.2;<literal>(key, nsems, flag)</literal>:
Semget creates an array of semaphores, corresponding to
key.</para>
<para>Key and flag take on the same meaning as they
do in msgget.</para></listitem>
<listitem><para>&man.semop.2;<literal>(id, ops, num)</literal>:
Semop does the set of semaphore operations in the array of
structures ops, to the set of semaphores identified by
id.</para></listitem>
</itemizedlist>
<para>System V IPC allows for processes to share
memory. Processes can communicate directly with each other by
sharing parts of their virtual address space and then reading
and writing data stored in the shared memory. These system
calls are blocked within a jailed environment: <literal>shmdt,
shmat, oshmctl, shmctl, shmget</literal>, and
<literal>shmsys</literal>.</para>
<para><filename>/usr/src/sys/kern/sysv_shm.c</filename>:</para>
<itemizedlist>
<listitem><para>&man.shmctl.2;<literal>(id, cmd, buf)</literal>:
shmctl does various control operations on the shared memory
region identified by id.</para></listitem>
<listitem><para>&man.shmget.2;<literal>(key, size,
flag)</literal>: shmget accesses or creates a shared memory
region of size bytes.</para></listitem>
<listitem><para>&man.shmat.2;<literal>(id, addr, flag)</literal>:
shmat attaches a shared memory region identified by id to the
address space of a process.</para></listitem>
<listitem><para>&man.shmdt.2;<literal>(addr)</literal>: shmdt
detaches the shared memory region previously attached at
addr.</para></listitem>
</itemizedlist>
</sect2>
<sect2>
<title>Sockets</title>
<para>Jail treats the &man.socket.2; system call and related
lower-level socket functions in a special manner. In order to
determine whether a certain socket is allowed to be created,
it first checks whether the sysctl
<literal>jail.socket_unixiproute_only</literal> is set. If
set, a jailed process is only allowed to create sockets whose
family is <literal>PF_LOCAL</literal>,
<literal>PF_INET</literal> or
<literal>PF_ROUTE</literal>. Otherwise, it returns an
error.</para>
<programlisting><filename>/usr/src/sys/kern/uipc_socket.c</filename>:
int socreate(dom, aso, type, proto, p)
...
register struct protosw *prp;
...
{
    if (p->p_prison && jail_socket_unixiproute_only &&
        prp->pr_domain->dom_family != PF_LOCAL &&
        prp->pr_domain->dom_family != PF_INET &&
        prp->pr_domain->dom_family != PF_ROUTE)
        return (EPROTONOSUPPORT);
    ...
}</programlisting>
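<para>The same decision can be written as a small stand-alone
function. This is a userland sketch under stated assumptions: the
name <literal>sk_check_socket_family</literal> and its plain integer
parameters are hypothetical and replace the kernel's
<literal>p</literal> and <literal>prp</literal> arguments.</para>

```c
#include <errno.h>
#include <sys/socket.h>

/*
 * Hypothetical model of the check in socreate().
 *   jailed           - nonzero if the caller is inside a jail
 *   unixiproute_only - value of the socket_unixiproute_only sysctl
 *   family           - protocol family requested for the new socket
 * Returns 0 if the socket may be created, EPROTONOSUPPORT otherwise.
 */
int
sk_check_socket_family(int jailed, int unixiproute_only, int family)
{
	if (jailed && unixiproute_only &&
	    family != PF_LOCAL &&
	    family != PF_INET &&
	    family != PF_ROUTE)
		return (EPROTONOSUPPORT);
	return (0);
}
```

<para>With the sysctl enabled, a jailed caller asking for
<literal>PF_INET6</literal> is refused, while the same request from
an unjailed caller, or with the sysctl set to 0, is
allowed.</para>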
</sect2>
<sect2>
<title>Berkeley Packet Filter</title>
<para>The Berkeley Packet Filter provides a raw interface to
data link layers in a protocol independent fashion. The
function <literal>bpfopen()</literal> opens an Ethernet
device. There is a conditional which disallows any jailed
processes from accessing this function.</para>
<programlisting><filename>/usr/src/sys/net/bpf.c</filename>:
static int bpfopen(dev, flags, fmt, p)
...
{
if (p->p_prison)
return (EPERM);
...
}</programlisting>
</sect2>
<sect2>
<title>Protocols</title>
<para>There are certain protocols which are very common, such as
TCP, UDP, IP and ICMP. IP and ICMP are on the same level: the
network layer. Certain precautions are
taken in order to prevent a jailed process from binding a
protocol to an address which does not belong to the jail; they
apply only if the <literal>nam</literal>
parameter is set. nam is a pointer to a sockaddr structure,
which describes the address on which to bind the service. A
more exact definition is that sockaddr "may be used as a
template for referring to the identifying tag and length of
each address"[2]. In the function
<literal>in_pcbbind</literal>, <literal>sin</literal> is a
pointer to a sockaddr_in structure, which contains the port,
address, length and domain family of the socket which is to be
bound. Basically, this prevents a jailed process from binding
to an address other than the jail's own.</para>
<programlisting><filename>/usr/src/sys/netinet/in_pcb.c</filename>:
int in_pcbbind(inp, nam, p)
...
struct sockaddr *nam;
struct proc *p;
{
    ...
    struct sockaddr_in *sin;
    ...
    if (nam) {
        sin = (struct sockaddr_in *)nam;
        ...
        if (sin->sin_addr.s_addr != INADDR_ANY)
            if (prison_ip(p, 0, <![CDATA[&sin]]>->sin_addr.s_addr))
                return (EINVAL);
        ...
    }
    ...
}</programlisting>
<para>You might be wondering what function
<literal>prison_ip()</literal> does. <literal>prison_ip()</literal>
is given three arguments: the current process (represented by
<literal>p</literal>), a flag which says whether the address is
already in host byte order, and a pointer to an IP address. It
returns 1 if the IP address does not belong to the jail of the
calling process, and 0 if it does or if the process is not
jailed. As you can see from the code, if the address does not
belong to the jail, the protocol is not allowed to
bind to it.</para>
<programlisting><filename>/usr/src/sys/kern/kern_jail.c:</filename>
int prison_ip(struct proc *p, int flag, u_int32_t *ip) {
    u_int32_t tmp;

    if (!p->p_prison)
        return (0);                 /* not jailed: nothing to check */
    if (flag)
        tmp = *ip;                  /* caller passed host byte order */
    else
        tmp = ntohl(*ip);           /* caller passed network byte order */
    if (tmp == INADDR_ANY) {
        /* wildcard address: substitute the jail's own address */
        if (flag)
            *ip = p->p_prison->pr_ip;
        else
            *ip = htonl(p->p_prison->pr_ip);
        return (0);
    }
    if (p->p_prison->pr_ip != tmp)
        return (1);                 /* address is not the jail's */
    return (0);
}</programlisting>
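<para>To experiment with these return values outside of the kernel,
the function can be modelled in userland. The sketch below is an
assumption-laden illustration: <literal>sk_prison</literal> and
<literal>sk_prison_ip</literal> are hypothetical names, and the
prison is passed directly instead of being reached through a proc
structure.</para>

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>
#include <stdint.h>

struct sk_prison {
	uint32_t pr_ip;         /* jail address, host byte order */
};

/*
 * Hypothetical userland model of prison_ip().  pr is NULL for an
 * unjailed caller.  flag is nonzero when *ip is already in host byte
 * order.  Returns 1 if *ip is not the jail's address, 0 otherwise.
 */
int
sk_prison_ip(const struct sk_prison *pr, int flag, uint32_t *ip)
{
	uint32_t tmp;

	if (pr == NULL)
		return (0);
	tmp = flag ? *ip : ntohl(*ip);
	if (tmp == INADDR_ANY) {
		/* Wildcard: rewrite it to the jail's own address. */
		*ip = flag ? pr->pr_ip : htonl(pr->pr_ip);
		return (0);
	}
	return (pr->pr_ip != tmp);
}
```

<para>For a jail whose address is 10.0.0.5, passing 10.0.0.5 in
network byte order returns 0, passing 10.0.0.6 returns 1, and
passing the wildcard address returns 0 after rewriting it to
10.0.0.5.</para>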
<para>Jailed users are not allowed to bind services to an IP address
which does not belong to the jail. The restriction is also
written within the function <literal>in_pcbbind</literal>:</para>
<programlisting><filename>/usr/src/sys/netinet/in_pcb.c</filename>:
if (nam) {
    ...
    lport = sin->sin_port;
    ...
    if (lport) {
        ...
        if (p && p->p_prison)
            prison = 1;
        if (prison &&
            prison_ip(p, 0, <![CDATA[&sin]]>->sin_addr.s_addr))
            return (EADDRNOTAVAIL);</programlisting>
</sect2>
<sect2>
<title>Filesystem</title>
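<para>Even root users within the jail are not allowed to change
system file flags, such as the immutable, append, and no unlink
flags. The privileged path below is taken only when the caller is
root and is not jailed, and even then the change is refused if the
file already has one of these flags set and
the securelevel is greater than 0.</para>
<programlisting><filename>/usr/src/sys/ufs/ufs/ufs_vnops.c</filename>:
int ufs_setattr(ap)
...
{
    if ((cred->cr_uid == 0) && (p->p_prison == NULL)) {
        if ((ip->i_flags
            & (SF_NOUNLINK | SF_IMMUTABLE | SF_APPEND)) &&
            securelevel > 0)
            return (EPERM);
    }
    ...
}</programlisting>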
</sect2>
</sect1>
</chapter>

@ -1,298 +0,0 @@
<!--
The FreeBSD Documentation Project
$FreeBSD$
-->
<chapter id="kernel-objects">
<title>Kernel Objects</title>
<para>Kernel Objects, or <firstterm>Kobj</firstterm>, provides an
object-oriented C programming system for the kernel. As such, the
data being operated on carries the description of how to operate
on it. This allows operations to be added to and removed from an
interface at run time without breaking binary
compatibility.</para>
<sect1 id="kernel-objects-term">
<title>Terminology</title>
<variablelist>
<varlistentry>
<term>Object</term>
<listitem><para>A set of data - data structure - data
allocation.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Method</term>
<listitem>
<para>An operation - function.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Class</term>
<listitem>
<para>One or more methods.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Interface</term>
<listitem>
<para>A standard set of one or more methods.</para>
</listitem>
</varlistentry>
</variablelist>
</sect1>
<sect1 id="kernel-objects-operation">
<title>Kobj Operation</title>
<para>Kobj works by generating descriptions of methods. Each
description holds a unique id as well as a default function. The
description's address is used to uniquely identify the method
within a class' method table.</para>
<para>A class is built by creating a method table associating one
or more functions with method descriptions. Before use, the class
is compiled. The compilation allocates a cache and associates it
with the class. A unique id is assigned to each method
description within the method table of the class, unless the
compilation of another class referencing it has already done
so. For every method to be used, a function is generated by a
script to qualify
arguments and automatically reference the method description for
a lookup. The generated function looks up the method by using
the unique id associated with the method description as a hash
into the cache associated with the object's class. If the method
is not cached, the generated function proceeds to use the class'
table to find the method. If the method is found, then the
associated function within the class is used; otherwise, the
default function associated with the method description is
used.</para>
<para>These indirections can be visualized as the
following:</para>
<programlisting>object->cache<->class</programlisting>
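<para>The lookup order described above (cache, then the class'
method table, then the default function) can be sketched in plain
C. This is an illustration only and not the code in
<filename>sys/kobj.h</filename>: all <literal>sk_</literal> names,
the cache size, and the choice to cache the default function are
assumptions.</para>

```c
#include <stddef.h>

#define SK_CACHE_SIZE 8                 /* assumed cache size */

typedef int (*sk_method_fn)(void *obj);

struct sk_method_desc {                 /* one description per method */
	unsigned int id;                /* unique id, used as the hash */
	sk_method_fn deflt;             /* default function */
};

struct sk_method {                      /* one row of a method table */
	struct sk_method_desc *desc;
	sk_method_fn func;
};

struct sk_class {
	struct sk_method *methods;      /* terminated by a NULL desc */
	struct sk_method cache[SK_CACHE_SIZE];
};

/* Return the function to call for desc on objects of class cls. */
sk_method_fn
sk_lookup(struct sk_class *cls, struct sk_method_desc *desc)
{
	struct sk_method *slot = &cls->cache[desc->id % SK_CACHE_SIZE];
	struct sk_method *m;

	if (slot->desc == desc)         /* cache hit */
		return (slot->func);
	slot->desc = desc;              /* cache miss: refill this slot */
	slot->func = desc->deflt;       /* default unless the class has it */
	for (m = cls->methods; m->desc != NULL; m++) {
		if (m->desc == desc) {
			slot->func = m->func;
			break;
		}
	}
	return (slot->func);
}

/* Demonstration data: the class implements foo but not bar. */
int sk_foo_impl(void *obj) { (void)obj; return (1); }
int sk_foo_default(void *obj) { (void)obj; return (0); }
int sk_bar_default(void *obj) { (void)obj; return (-1); }

struct sk_method_desc sk_foo_desc = { 1, sk_foo_default };
struct sk_method_desc sk_bar_desc = { 2, sk_bar_default };

struct sk_method sk_demo_methods[] = {
	{ &sk_foo_desc, sk_foo_impl },
	{ NULL, NULL }
};

struct sk_class sk_demo_class = { sk_demo_methods, { { NULL, NULL } } };
```

<para>Looking up <literal>sk_foo_desc</literal> on the demonstration
class yields the class' own function, while looking up
<literal>sk_bar_desc</literal>, which the class does not implement,
falls back to the default function of that method. A second lookup
of either method is answered from the cache.</para>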
</sect1>
<sect1 id="kernel-objects-using">
<title>Using Kobj</title>
<sect2>
<title>Structures</title>
<programlisting>struct kobj_method</programlisting>
</sect2>
<sect2>
<title>Functions</title>
<programlisting>void kobj_class_compile(kobj_class_t cls);
void kobj_class_compile_static(kobj_class_t cls, kobj_ops_t ops);
void kobj_class_free(kobj_class_t cls);
kobj_t kobj_create(kobj_class_t cls, struct malloc_type *mtype, int mflags);
void kobj_init(kobj_t obj, kobj_class_t cls);
void kobj_delete(kobj_t obj, struct malloc_type *mtype);</programlisting>
</sect2>
<sect2>
<title>Macros</title>
<programlisting>KOBJ_CLASS_FIELDS
KOBJ_FIELDS
DEFINE_CLASS(name, methods, size)
KOBJMETHOD(NAME, FUNC)</programlisting>
</sect2>
<sect2>
<title>Headers</title>
<programlisting>&lt;sys/param.h>
&lt;sys/kobj.h></programlisting>
</sect2>
<sect2>
<title>Creating an interface template</title>
<para>The first step in using Kobj is to create an
Interface. Creating the interface involves creating a template
that the script
<filename>src/sys/kern/makeobjops.pl</filename> can use to
generate the header and code for the method declarations and
method lookup functions.</para>
<para>Within this template the following keywords are used:
<literal>#include</literal>, <literal>INTERFACE</literal>,
<literal>CODE</literal>, <literal>METHOD</literal>,
<literal>STATICMETHOD</literal>, and
<literal>DEFAULT</literal>.</para>
<para>The <literal>#include</literal> statement and what follows
it is copied verbatim to the head of the generated code
file.</para>
<para>For example:</para>
<programlisting>#include &lt;sys/foo.h></programlisting>
<para>The <literal>INTERFACE</literal> keyword is used to define
the interface name. This name is concatenated with each method
name as [interface name]_[method name]. Its syntax is
<literal>INTERFACE [interface name];</literal></para>
<para>For example:</para>
<programlisting>INTERFACE foo;</programlisting>
<para>The <literal>CODE</literal> keyword copies its arguments
verbatim into the code file. Its syntax is
<literal>CODE { [whatever] };</literal></para>
<para>For example:</para>
<programlisting>CODE {
struct foo * foo_alloc_null(struct bar *)
{
return NULL;
}
};</programlisting>
<para>The <literal>METHOD</literal> keyword describes a method. Its syntax is
<literal>METHOD [return type] [method name] { [object [,
arguments]] };</literal></para>
<para>For example:</para>
<programlisting>METHOD int bar {
struct object *;
struct foo *;
struct bar;
};</programlisting>
<para>The <literal>DEFAULT</literal> keyword may follow the
<literal>METHOD</literal> keyword. It extends the
<literal>METHOD</literal> keyword to include the default
function for the method. The extended syntax is
<literal>METHOD [return type] [method name] {
[object; [other arguments]] } DEFAULT [default
function];</literal></para>
<para>For example:</para>
<programlisting>METHOD int bar {
struct object *;
struct foo *;
int bar;
} DEFAULT foo_hack;</programlisting>
<para>The <literal>STATICMETHOD</literal> keyword is used like
the <literal>METHOD</literal> keyword, except that the kobj data is
not at the head of the object structure, so casting to kobj_t would
be incorrect. Instead, <literal>STATICMETHOD</literal> relies on the
Kobj data being referenced as 'ops'. This is also useful for calling
methods directly out of a class's method table.</para>
<para>Other complete examples:</para>
<programlisting>src/sys/kern/bus_if.m
src/sys/kern/device_if.m</programlisting>
</sect2>
<sect2>
<title>Creating a Class</title>
<para>The second step in using Kobj is to create a class. A
class consists of a name, a table of methods, and the size of
objects if Kobj's object handling facilities are used. To
create the class use the macro
<function>DEFINE_CLASS()</function>. To create the method
table create an array of kobj_method_t terminated by a NULL
entry. Each non-NULL entry may be created using the macro
<function>KOBJMETHOD()</function>.</para>
<para>For example:</para>
<programlisting>kobj_method_t foomethods[] = {
    KOBJMETHOD(bar_doo, foo_doo),
    KOBJMETHOD(bar_foo, foo_foo),
    { NULL, NULL }
};
DEFINE_CLASS(fooclass, foomethods, sizeof(struct foodata));</programlisting>
<para>The class must be <quote>compiled</quote>. Depending on
the state of the system at the time that the class is to be
initialized, a statically allocated cache, or <quote>ops
table</quote>, may have to be used. This can be accomplished by
declaring a <structname>struct kobj_ops</structname> and using
<function>kobj_class_compile_static()</function>; otherwise,
<function>kobj_class_compile()</function> should be used.</para>
</sect2>
<sect2>
<title>Creating an Object</title>
<para>The third step in using Kobj involves how to define the
object. Kobj object creation routines assume that Kobj data is
at the head of an object. If this is not appropriate, you will
have to allocate the object yourself and then use
<function>kobj_init()</function> on the Kobj portion of it;
otherwise, you may use <function>kobj_create()</function> to
allocate and initialize the Kobj portion of the object
automatically. <function>kobj_init()</function> may also be
used to change the class that an object uses.</para>
<para>To integrate Kobj into the object you should use the macro
KOBJ_FIELDS.</para>
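<para>For example:</para>
<programlisting>struct foo_data {
    KOBJ_FIELDS;
    int foo_foo;    /* illustrative member; the type is an assumption */
    int foo_bar;    /* illustrative member; the type is an assumption */
};</programlisting>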
</sect2>
<sect2>
<title>Calling Methods</title>
<para>The last step in using Kobj is simply to call the generated
functions to invoke the desired method within the object's
class. This is as simple as using the interface name and the
method name with a few modifications. The interface name
should be concatenated with the method name using a '_'
between them, all in upper case.</para>
<para>For example, if the interface name was foo and the method
was bar then the call would be:</para>
<programlisting>[return value = ] FOO_BAR(object [, other parameters]);</programlisting>
</sect2>
<sect2>
<title>Cleaning Up</title>
<para>When an object allocated through
<function>kobj_create()</function> is no longer needed,
<function>kobj_delete()</function> may be called on it, and
when a class is no longer being used,
<function>kobj_class_free()</function> may be called on it.</para>
</sect2>
</sect1>
</chapter>
<!--
Local Variables:
mode: sgml
sgml-declaration: "../chapter.decl"
sgml-indent-data: t
sgml-omittag: nil
sgml-always-quote-attributes: t
sgml-parent-document: ("../book.sgml" "part" "chapter")
End:
-->

@ -1,313 +0,0 @@
<!--
The FreeBSD Documentation Project
The FreeBSD SMP Next Generation Project
$FreeBSD$
-->
<chapter id="locking">
<title>Locking Notes</title>
<para><emphasis>This chapter is maintained by the FreeBSD SMP Next
Generation Project. Please direct any comments or suggestions
to its &a.smp;.</emphasis></para>
<para>This document outlines the locking used in the FreeBSD kernel
to permit effective multi-processing within the kernel. Locking
can be achieved via several means. Data structures can be
protected by mutexes or &man.lockmgr.9; locks. A few variables
are protected simply by always using atomic operations to access
them.</para>
<sect1 id="locking-mutexes">
<title>Mutexes</title>
<para>A mutex is simply a lock used to guarantee mutual exclusion.
Specifically, a mutex may only be owned by one entity at a time.
If another entity wishes to obtain a mutex that is already
owned, it must wait until the mutex is released. In the FreeBSD
kernel, mutexes are owned by processes.</para>
<para>Mutexes may be recursively acquired, but they are intended
to be held for a short period of time. Specifically, one may
not sleep while holding a mutex. If you need to hold a lock
across a sleep, use a &man.lockmgr.9; lock.</para>
<para>Each mutex has several properties of interest:</para>
<variablelist>
<varlistentry>
<term>Variable Name</term>
<listitem>
<para>The name of the <type>struct mtx</type> variable in
the kernel source.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Logical Name</term>
<listitem>
<para>The name of the mutex assigned to it by
<function>mtx_init</function>. This name is displayed in
KTR trace messages and witness errors and warnings and is
used to distinguish mutexes in the witness code.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Type</term>
<listitem>
<para>The type of the mutex in terms of the
<constant>MTX_*</constant> flags. The meaning for each
flag is related to its meaning as documented in
&man.mutex.9;.</para>
<variablelist>
<varlistentry>
<term><constant>MTX_DEF</constant></term>
<listitem>
<para>A sleep mutex</para>
</listitem>
</varlistentry>
<varlistentry>
<term><constant>MTX_SPIN</constant></term>
<listitem>
<para>A spin mutex</para>
</listitem>
</varlistentry>
<varlistentry>
<term><constant>MTX_RECURSE</constant></term>
<listitem>
<para>This mutex is allowed to recurse.</para>
</listitem>
</varlistentry>
</variablelist>
</listitem>
</varlistentry>
<varlistentry>
<term>Protectees</term>
<listitem>
<para>A list of data structures or data structure members
that this entry protects. For data structure members, the
name will be in the form of
<structname/structure name/.<structfield/member name/.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Dependent Functions</term>
<listitem>
<para>Functions that can only be called if this mutex is
held.</para>
</listitem>
</varlistentry>
</variablelist>
<table frame="all" colsep="1" rowsep="1" pgwide="1">
<title>Mutex List</title>
<tgroup cols="5">
<thead>
<row>
<entry>Variable Name</entry>
<entry>Logical Name</entry>
<entry>Type</entry>
<entry>Protectees</entry>
<entry>Dependent Functions</entry>
</row>
</thead>
<!-- The scheduler lock -->
<tbody>
<row>
<entry>sched_lock</entry>
<entry><quote>sched lock</quote></entry>
<entry>
<constant>MTX_SPIN</constant> |
<constant>MTX_RECURSE</constant>
</entry>
<entry>
<varname>_gmonparam</varname>,
<varname>cnt.v_swtch</varname>,
<varname>cp_time</varname>,
<varname>curpriority</varname>,
<structname/mtx/.<structfield/mtx_blocked/,
<structname/mtx/.<structfield/mtx_contested/,
<structname/proc/.<structfield/p_procq/,
<structname/proc/.<structfield/p_slpq/,
<structname/proc/.<structfield/p_sflag/,
<structname/proc/.<structfield/p_stat/,
<structname/proc/.<structfield/p_estcpu/,
<structname/proc/.<structfield/p_cpticks/,
<structname/proc/.<structfield/p_pctcpu/,
<structname/proc/.<structfield/p_wchan/,
<structname/proc/.<structfield/p_wmesg/,
<structname/proc/.<structfield/p_swtime/,
<structname/proc/.<structfield/p_slptime/,
<structname/proc/.<structfield/p_runtime/,
<structname/proc/.<structfield/p_uu/,
<structname/proc/.<structfield/p_su/,
<structname/proc/.<structfield/p_iu/,
<structname/proc/.<structfield/p_uticks/,
<structname/proc/.<structfield/p_sticks/,
<structname/proc/.<structfield/p_iticks/,
<structname/proc/.<structfield/p_oncpu/,
<structname/proc/.<structfield/p_lastcpu/,
<structname/proc/.<structfield/p_rqindex/,
<structname/proc/.<structfield/p_heldmtx/,
<structname/proc/.<structfield/p_blocked/,
<structname/proc/.<structfield/p_mtxname/,
<structname/proc/.<structfield/p_contested/,
<structname/proc/.<structfield/p_priority/,
<structname/proc/.<structfield/p_usrpri/,
<structname/proc/.<structfield/p_nativepri/,
<structname/proc/.<structfield/p_nice/,
<structname/proc/.<structfield/p_rtprio/,
<varname>pscnt</varname>,
<varname>slpque</varname>,
<varname>itqueuebits</varname>,
<varname>itqueues</varname>,
<varname>rtqueuebits</varname>,
<varname>rtqueues</varname>,
<varname>queuebits</varname>,
<varname>queues</varname>,
<varname>idqueuebits</varname>,
<varname>idqueues</varname>,
<varname>switchtime</varname>,
<varname>switchticks</varname>
</entry>
<entry>
<function>setrunqueue</function>,
<function>remrunqueue</function>,
<function>mi_switch</function>,
<function>chooseproc</function>,
<function>schedclock</function>,
<function>resetpriority</function>,
<function>updatepri</function>,
<function>maybe_resched</function>,
<function>cpu_switch</function>,
<function>cpu_throw</function>,
<function>need_resched</function>,
<function>resched_wanted</function>,
<function>clear_resched</function>,
<function>aston</function>,
<function>astoff</function>,
<function>astpending</function>,
<function>calcru</function>,
<function>proc_compare</function>
</entry>
</row>
<!-- The vm86 pcb lock -->
<row>
<entry>vm86pcb_lock</entry>
<entry><quote>vm86pcb lock</quote></entry>
<entry>
<constant>MTX_DEF</constant>
</entry>
<entry>
<varname>vm86pcb</varname>
</entry>
<entry>
<function>vm86_bioscall</function>
</entry>
</row>
<!-- Giant -->
<row>
<entry>Giant</entry>
<entry><quote>Giant</quote></entry>
<entry>
<constant>MTX_DEF</constant> |
<constant>MTX_RECURSE</constant>
</entry>
<entry>nearly everything</entry>
<entry>lots</entry>
</row>
<!-- The callout lock -->
<row>
<entry>callout_lock</entry>
<entry><quote>callout lock</quote></entry>
<entry>
<constant>MTX_SPIN</constant> |
<constant>MTX_RECURSE</constant>
</entry>
<entry>
<varname>callfree</varname>,
<varname>callwheel</varname>,
<varname>nextsoftcheck</varname>,
<structname/proc/.<structfield/p_itcallout/,
<structname/proc/.<structfield/p_slpcallout/,
<varname>softticks</varname>,
<varname>ticks</varname>
</entry>
<entry>
</entry>
</row>
</tbody>
</tgroup>
</table>
</sect1>
<sect1 id="locking-sx">
<title>Shared Exclusive Locks</title>
<para>These locks provide basic reader-writer type functionality
and may be held by a sleeping process. Currently they are
backed by &man.lockmgr.9;.</para>
<table>
<title>Shared Exclusive Lock List</title>
<tgroup cols="2">
<thead>
<row>
<entry>Variable Name</entry>
<entry>Protectees</entry>
</row>
</thead>
<tbody>
<row>
<entry><varname>allproc_lock</varname></entry>
<entry>
<varname>allproc</varname>,
<varname>zombproc</varname>,
<varname>pidhashtbl</varname>,
<structname/proc/.<structfield/p_list/,
<structname/proc/.<structfield/p_hash/,
<varname>nextpid</varname>
</entry>
</row>
<row>
<entry><varname>proctree_lock</varname></entry>
<entry>
<structname/proc/.<structfield/p_children/,
<structname/proc/.<structfield/p_sibling/
</entry>
</row>
</tbody>
</tgroup>
</table>
</sect1>
<sect1 id="locking-atomic">
<title>Atomically Protected Variables</title>
<para>An atomically protected variable is a special variable that
is not protected by an explicit lock. Instead, all data
accesses to the variables use special atomic operations as
described in &man.atomic.9;. Very few variables are treated
this way, although other synchronization primitives such as
mutexes are implemented with atomically protected
variables.</para>
<itemizedlist>
<listitem>
<para><structname/mtx/.<structfield/mtx_lock/</para>
</listitem>
</itemizedlist>
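<para>As a rough userspace analogy, the sketch below shows how a lock
word can be claimed with a single compare-and-swap and no explicit
lock around it. It uses C11 <literal>stdatomic.h</literal> rather than
the kernel interfaces from &man.atomic.9;, and the names
<literal>toy_lock</literal>, <literal>toy_try_acquire</literal> and
<literal>toy_release</literal> are hypothetical. It is not how
<structname/mtx/.<structfield/mtx_lock/ is actually implemented.</para>

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical lock word: 0 means unowned, otherwise the owner's id. */
#define TOY_UNOWNED ((uintptr_t)0)

struct toy_lock {
    _Atomic uintptr_t owner;
};

/* Try to claim the lock for `self`; succeeds only if it was unowned. */
static bool
toy_try_acquire(struct toy_lock *l, uintptr_t self)
{
    uintptr_t expected = TOY_UNOWNED;

    assert(self != TOY_UNOWNED);
    /* Acquire ordering: later accesses cannot move before the swap. */
    return (atomic_compare_exchange_strong_explicit(&l->owner, &expected,
        self, memory_order_acquire, memory_order_relaxed));
}

/* Release the lock; only the current owner may call this. */
static void
toy_release(struct toy_lock *l, uintptr_t self)
{
    assert(atomic_load_explicit(&l->owner, memory_order_relaxed) == self);
    /* Release ordering: earlier stores are visible before the unlock. */
    atomic_store_explicit(&l->owner, TOY_UNOWNED, memory_order_release);
}
```

<para>A second caller that loses the compare-and-swap simply gets
<literal>false</literal> back; what it does next (spin, sleep, give up)
is a policy decision layered on top of the atomic operation.</para>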
</sect1>
</chapter>


@ -1,110 +0,0 @@
<!-- $FreeBSD$ -->
<!ENTITY mac.mpo "mpo">
<!ENTITY mac.thead '
<colspec colname="first" colwidth="0">
<colspec colwidth="0">
<colspec colname="last" colwidth="0">
<thead>
<row>
<entry>Parameter</entry>
<entry>Description</entry>
<entry>Locking</entry>
</row>
</thead>
'>
<!ENTITY mac.externalize.paramdefs '
<paramdef>struct label *<parameter>label</parameter></paramdef>
<paramdef>char *<parameter>element_name</parameter></paramdef>
<paramdef>struct sbuf *<parameter>sb</parameter></paramdef>
<paramdef>int *<parameter>claimed</parameter></paramdef>
'>
<!ENTITY mac.externalize.tbody '
<tbody>
<row>
<entry><parameter>label</parameter></entry>
<entry>Label to be externalized</entry>
</row>
<row>
<entry><parameter>element_name</parameter></entry>
<entry>Name of the policy whose label should be externalized</entry>
</row>
<row>
<entry><parameter>sb</parameter></entry>
<entry>String buffer to be filled with a text representation of
label</entry>
</row>
<row>
<entry><parameter>claimed</parameter></entry>
<entry>Should be incremented when <parameter>sb</parameter>
can be filled in.</entry>
</row>
</tbody>
'>
<!ENTITY mac.externalize.para "
<para>Produce an externalized label based on the label structure passed.
An externalized label consists of a text representation of the label
contents that can be used with userland applications and read by the
user. Currently, all policies' <function>externalize</function> entry
points will be called, so the implementation should check the contents
of <parameter>element_name</parameter> before attempting to fill in
<parameter>sb</parameter>. If
<parameter>element_name</parameter> does not match the name of your
policy, simply return <returnvalue>0</returnvalue>. Only return nonzero
if an error occurs while externalizing the label data. Once the policy
fills in <parameter>sb</parameter>, <varname>*claimed</varname>
should be incremented.</para>
">
<!ENTITY mac.internalize.paramdefs '
<paramdef>struct label *<parameter>label</parameter></paramdef>
<paramdef>char *<parameter>element_name</parameter></paramdef>
<paramdef>char *<parameter>element_data</parameter></paramdef>
<paramdef>int *<parameter>claimed</parameter></paramdef>
'>
<!ENTITY mac.internalize.tbody '
<tbody>
<row>
<entry><parameter>label</parameter></entry>
<entry>Label to be filled in</entry>
</row>
<row>
<entry><parameter>element_name</parameter></entry>
<entry>Name of the policy whose label should be internalized</entry>
</row>
<row>
<entry><parameter>element_data</parameter></entry>
<entry>Text data to be internalized</entry>
</row>
<row>
<entry><parameter>claimed</parameter></entry>
<entry>Should be incremented when data can be successfully
internalized.</entry>
</row>
</tbody>
'>
<!ENTITY mac.internalize.para "
<para>Produce an internal label structure based on externalized label data
in text format. Currently, all policies' <function>internalize</function>
entry points are called when internalization is requested, so the
implementation should compare the contents of
<parameter>element_name</parameter> to its own name in order to be sure
it should be internalizing the data in <parameter>element_data</parameter>.
Just as in the <function>externalize</function> entry points, the entry
point should return <returnvalue>0</returnvalue> if
<parameter>element_name</parameter> does not match its own name, or when
data can successfully be internalized, in which case
<varname>*claimed</varname> should be incremented.</para>
">

File diff suppressed because it is too large Load diff


@ -1,360 +0,0 @@
<!--
The FreeBSD Documentation Project
$FreeBSD$
Originally by: Jeroen Ruigrok van der Werven
Date: newbus-draft.txt,v 1.8 2001/01/25 08:01:08
Copyright (c) 2000 Jeroen Ruigrok van der Werven (asmodai@wxs.nl)
Copyright (c) 2002 Hiten Mahesh Pandya (hiten@uk.FreeBSD.org)
Future Additions:
o Expand the information about device_t
o Add information about the bus_* functions.
o Add information about bus specific (e.g. PCI) functions.
o Add a reference section for additional information.
o Add more newbus related structures and typedefs.
o Add a 'Terminology' section.
o Add information on resource manager functions, busspace
manager functions, newbus events related functions.
o More cleanup ... !
Provided under the FreeBSD Documentation License.
-->
<chapter id="newbus">
<chapterinfo>
<authorgroup>
<author>
<firstname>Jeroen</firstname>
<surname>Ruigrok van der Werven (asmodai)</surname>
<affiliation><address><email>asmodai@FreeBSD.org</email></address>
</affiliation>
<contrib>Written by </contrib>
</author>
<author>
<firstname>Hiten</firstname>
<surname>Pandya</surname>
<affiliation><address><email>hiten@uk.FreeBSD.org</email></address>
</affiliation>
</author>
</authorgroup>
</chapterinfo>
<title>Newbus</title>
<para><emphasis>Special thanks to Matthew N. Dodd, Warner Losh, Bill Paul,
Doug Rabson, Mike Smith, Peter Wemm and Scott Long</emphasis>.</para>
<para>This chapter explains the Newbus device framework in detail.</para>
<sect1 id="devdrivers">
<title>Device Drivers</title>
<sect2>
<title>Purpose of a Device Driver</title>
<para>A device driver is a software component which provides the
interface between the kernel's generic view of a peripheral
(e.g. disk, network adapter) and the actual implementation of the
peripheral. The <emphasis>device driver interface (DDI)</emphasis> is
the defined interface between the kernel and the device driver component.
</para>
</sect2>
<sect2>
<title>Types of Device Drivers</title>
<para>In the early days of &unix;, and thus FreeBSD, four types of
devices were defined:</para>
<itemizedlist>
<listitem><para>block device drivers</para></listitem>
<listitem><para>character device drivers</para></listitem>
<listitem><para>network device drivers</para></listitem>
<listitem><para>pseudo-device drivers</para></listitem>
</itemizedlist>
<para><emphasis>Block devices</emphasis> transferred data in
fixed-size blocks. This type of driver depended on the
so-called <emphasis>buffer cache</emphasis>, whose purpose was
to cache accessed blocks of data in a dedicated part of memory.
Often this buffer cache was based on write-behind, which meant that
data modified in memory was synced to disk whenever the system
did its periodic disk flushing, thus optimizing writes.</para>
</sect2>
<sect2>
<title>Character devices</title>
<para>However, in FreeBSD 4.0 and later the
distinction between block and character devices no longer exists.
</para>
</sect2>
</sect1>
<sect1 id="newbus-overview">
<!--
Real title:
Newbus, Busspace and the Resource Manager, an Explanation of the Possibilities
-->
<title>Overview of Newbus</title>
<para><emphasis>Newbus</emphasis> is the implementation of a new bus
architecture based on abstraction layers which saw its introduction in
FreeBSD 3.0 when the Alpha port was imported into the source tree. It was
not until 4.0 that it became the default system to use for device
drivers. Its goals are to provide a more object-oriented means of
interconnecting the various busses and devices which a host system
provides to the <emphasis>Operating System</emphasis>.</para>
<para>Its main features include amongst others:</para>
<itemizedlist>
<listitem><para>dynamic attaching</para></listitem>
<listitem><para>easy modularization of drivers</para></listitem>
<listitem><para>pseudo-busses</para></listitem>
</itemizedlist>
<para>One of the most prominent changes is the migration from the flat and
ad-hoc system to a device tree lay-out.</para>
<para>At the top level resides the <emphasis><quote>root</quote></emphasis>
device which is the parent to hang all other devices on. For each
architecture, there is typically a single child of <quote>root</quote>
which has such things as <emphasis>host-to-PCI bridges</emphasis>, etc.
attached to it. For x86, this <quote>root</quote> device is the
<emphasis><quote>nexus</quote></emphasis> device. For Alpha, the
various models have different top-level devices
corresponding to the different hardware chipsets, including
<emphasis>lca</emphasis>, <emphasis>apecs</emphasis>,
<emphasis>cia</emphasis> and <emphasis>tsunami</emphasis>.</para>
<para>A device in the Newbus context represents a single hardware entity
in the system. For instance each PCI device is represented by a Newbus
device. Any device in the system can have children; a device which has
children is often called a <emphasis><quote>bus</quote></emphasis>.
Examples of common busses in the system are ISA and PCI which manage lists
of devices attached to ISA and PCI busses respectively.</para>
<para>Often, a connection between different kinds of bus is represented by
a <emphasis><quote>bridge</quote></emphasis> device which normally has one
child for the attached bus. An example of this is a
<emphasis>PCI-to-PCI bridge</emphasis> which is represented by a device
<emphasis><devicename>pcibN</devicename></emphasis> on the parent PCI bus
and has a child <emphasis><devicename>pciN</devicename></emphasis> for the
attached bus. This layout simplifies the implementation of the PCI bus
tree, allowing common code to be used for both top-level and bridged
busses.</para>
<para>Each device in the Newbus architecture asks its parent to map its
resources. The parent then asks its own parent until the nexus is
reached. So, basically the nexus is the only part of the Newbus system
which knows about all resources.</para>
<tip><para>An ISA device might want to map its IO port at
<literal>0x230</literal>, so it asks its parent, in this case the ISA
bus. The ISA bus hands it over to the PCI-to-ISA bridge which in its turn
asks the PCI bus, which reaches the host-to-PCI bridge and finally the
nexus. The beauty of this transition upwards is that there is room to
translate the requests. For example, the <literal>0x230</literal> IO port
request might become memory-mapped at <literal>0xb0000230</literal> on a
<acronym>MIPS</acronym> box by the PCI bridge.</para></tip>
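<para>The upward hand-off described above can be sketched in plain C.
This is a toy model only: <literal>struct toy_node</literal> and
<literal>toy_map_resource</literal> are hypothetical names, and the
real resource allocation methods carry far more state than a single
address.</para>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified device node; not the kernel's struct device. */
struct toy_node {
    const char      *name;
    struct toy_node *parent;     /* NULL for the top-level (nexus) node */
    uint32_t         translate;  /* offset this level adds to a request */
};

/*
 * Pass an address request up the tree.  Each level may translate the
 * address before handing it to its parent; the topmost node returns
 * the address that is finally mapped.
 */
static uint32_t
toy_map_resource(const struct toy_node *dev, uint32_t addr)
{
    assert(dev != NULL);
    addr += dev->translate;
    if (dev->parent == NULL)
        return (addr);           /* reached the nexus */
    return (toy_map_resource(dev->parent, addr));
}
```

<para>With a chain of ISA device, ISA bus, PCI-to-ISA bridge, PCI bus,
host-to-PCI bridge and nexus, where only the host-to-PCI bridge has a
<literal>translate</literal> value of <literal>0xb0000000</literal>, a
request for <literal>0x230</literal> made by the ISA device comes back
as <literal>0xb0000230</literal>.</para>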
<para>Resource allocation can be controlled at any place in the device
tree. For instance on many Alpha platforms, ISA interrupts are managed
separately from PCI interrupts and resource allocations for ISA interrupts
are managed by the Alpha's ISA bus device. On IA-32, ISA and PCI
interrupts are both managed by the top-level nexus device. For both
ports, memory and port address space is managed by a single entity - nexus
for IA-32 and the relevant chipset driver on Alpha (e.g. CIA or tsunami).
</para>
<para>In order to normalize access to memory and port mapped resources,
Newbus integrates the <literal>bus_space</literal> APIs from NetBSD.
These provide a single API to replace inb/outb and direct memory
reads/writes. The advantage of this is that a single driver can easily
use either memory-mapped registers or port-mapped registers
(some hardware supports both).</para>
<para>This support is integrated into the resource allocation mechanism.
When a resource is allocated, a driver can retrieve the associated
<structfield>bus_space_tag_t</structfield> and
<structfield>bus_space_handle_t</structfield> from the resource.</para>
<para>Newbus also allows for definitions of interface methods in files
dedicated to this purpose. These are the <filename>.m</filename> files
that are found under the <filename>src/sys</filename> hierarchy.</para>
<para>The core of the Newbus system is an extensible
<quote>object-based programming</quote> model. Each device in the system
has a table of methods which it supports. The system and other devices
use those methods to control the device and request services. The
different methods supported by a device are defined by a number of
<quote>interfaces</quote>. An <quote>interface</quote> is simply a group
of related methods which can be implemented by a device.</para>
<para>In the Newbus system, the methods for a device are provided by the
various device drivers in the system. When a device is attached to a
driver during <emphasis>auto-configuration</emphasis>, it uses the method
table declared by the driver. A device can later
<emphasis>detach</emphasis> from its driver and
<emphasis>re-attach</emphasis> to a new driver with a new method table.
This allows dynamic replacement of drivers which can be useful for driver
development.</para>
<para>The interfaces are described by an interface definition language
similar to the language used to define vnode operations for file systems.
The interface is stored in a methods file (normally named
<filename>foo_if.m</filename>).</para>
<example>
<title>Newbus Methods</title>
<programlisting>
# Foo subsystem/driver (a comment...)

INTERFACE foo

METHOD int doit {
    device_t dev;
};

# DEFAULT is the method that will be used, if a method was not
# provided via: DEVMETHOD()

METHOD void doit_to_child {
    device_t dev;
    device_t child;
} DEFAULT doit_generic_to_child;
</programlisting>
</example>
<para>When this interface is compiled, it generates a header file
<quote><filename>foo_if.h</filename></quote> which contains function
declarations:</para>
<programlisting>
int FOO_DOIT(device_t dev);
void FOO_DOIT_TO_CHILD(device_t dev, device_t child);
</programlisting>
<para>A source file, <quote><filename>foo_if.c</filename></quote> is
also created to accompany the automatically generated header file; it
contains implementations of those functions which look up the location
of the relevant functions in the object's method table and call that
function.</para>
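<para>Conceptually, each generated function is a small dispatcher.
The following userspace sketch illustrates the idea under simplifying
assumptions: the method table is a plain struct of function pointers,
and every <literal>toy_</literal> and <literal>mydrv_</literal> name is
hypothetical. The real kobj lookup is more elaborate.</para>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real kobj machinery is more elaborate. */
struct toy_device;
typedef int (*toy_doit_t)(struct toy_device *);

struct toy_methods {
    toy_doit_t doit;             /* NULL if the driver gave no method */
};

struct toy_device {
    const struct toy_methods *methods;  /* table of the attached driver */
    int unit;
};

/* Fallback used when a driver does not implement the method. */
static int
toy_doit_default(struct toy_device *dev)
{
    (void)dev;
    return (-1);
}

/* What a generated FOO_DOIT() wrapper conceptually does. */
static int
TOY_DOIT(struct toy_device *dev)
{
    toy_doit_t fn;

    assert(dev != NULL && dev->methods != NULL);
    fn = dev->methods->doit;
    if (fn == NULL)
        fn = toy_doit_default;   /* plays the role of the DEFAULT clause */
    return (fn(dev));
}

/* A driver-supplied implementation of the method. */
static int
mydrv_doit(struct toy_device *dev)
{
    return (dev->unit + 100);
}
```

<para>Pointing <literal>methods</literal> at a different table changes
what <literal>TOY_DOIT()</literal> does for that device, which mirrors
how a device can be re-attached to a new driver with a new method
table.</para>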
<para>The system defines two main interfaces. The first fundamental
interface is called <emphasis><quote>device</quote></emphasis> and
includes methods which are relevant to all devices. Methods in the
<emphasis><quote>device</quote></emphasis> interface include
<emphasis><quote>probe</quote></emphasis>,
<emphasis><quote>attach</quote></emphasis> and
<emphasis><quote>detach</quote></emphasis> to control detection of
hardware and <emphasis><quote>shutdown</quote></emphasis>,
<emphasis><quote>suspend</quote></emphasis> and
<emphasis><quote>resume</quote></emphasis> for critical event
notification.</para>
<para>The second, more complex interface is
<emphasis><quote>bus</quote></emphasis>. This interface contains
methods suitable for devices which have children, including methods to
access bus specific per-device information
<footnote><para>&man.bus.generic.read.ivar.9; and
&man.bus.generic.write.ivar.9;</para></footnote>, event notification
(<emphasis><literal>child_detached</literal></emphasis>,
<emphasis><literal>driver_added</literal></emphasis>) and resource
management (<emphasis><literal>alloc_resource</literal></emphasis>,
<emphasis><literal>activate_resource</literal></emphasis>,
<emphasis><literal>deactivate_resource</literal></emphasis>,
<emphasis><literal>release_resource</literal></emphasis>).</para>
<para>Many methods in the <quote>bus</quote> interface are performing
services for some child of the bus device. These methods would normally
use the first two arguments to specify the bus providing the service
and the child device which is requesting the service. To simplify
driver code, many of these methods have accessor functions which
lookup the parent and call a method on the parent. For instance the
method
<literal>BUS_TEARDOWN_INTR(device_t dev, device_t child, ...)</literal>
can be called using the function
<literal>bus_teardown_intr(device_t child, ...)</literal>.</para>
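<para>The relationship between a bus method and its accessor can be
sketched as follows. This is a simplified userspace model with
hypothetical names (<literal>struct toy_busdev</literal>,
<literal>TOY_BUS_TEARDOWN_INTR</literal>,
<literal>toy_bus_teardown_intr</literal>); the real functions take
additional arguments that are omitted here.</para>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device with a single bus method. */
struct toy_busdev {
    struct toy_busdev *parent;
    int (*teardown_intr)(struct toy_busdev *bus, struct toy_busdev *child);
    int intr_count;              /* handlers hooked up through this bus */
};

/* Bus method: takes both the bus and the requesting child. */
static int
TOY_BUS_TEARDOWN_INTR(struct toy_busdev *bus, struct toy_busdev *child)
{
    assert(bus != NULL && bus->teardown_intr != NULL);
    return (bus->teardown_intr(bus, child));
}

/* Accessor: the driver passes only itself; the parent is looked up. */
static int
toy_bus_teardown_intr(struct toy_busdev *child)
{
    if (child->parent == NULL)
        return (-1);             /* no bus to ask */
    return (TOY_BUS_TEARDOWN_INTR(child->parent, child));
}

/* Example bus-side implementation of the method. */
static int
toy_pci_teardown(struct toy_busdev *bus, struct toy_busdev *child)
{
    (void)child;
    bus->intr_count--;
    return (0);
}
```

<para>Driver code then only needs its own device handle, which keeps
call sites short and hides the parent lookup in one place.</para>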
<para>Some bus types in the system define additional interfaces to
provide access to bus-specific functionality. For instance, the PCI
bus driver defines the <quote>pci</quote> interface which has two
methods <emphasis><literal>read_config</literal></emphasis> and
<emphasis><literal>write_config</literal></emphasis> for accessing the
configuration registers of a PCI device.</para>
</sect1>
<sect1 id="newbus-api">
<title>Newbus API</title>
<para>As the Newbus API is huge, this section makes some effort at
documenting it. More information to come in the next revision of this
document.</para>
<sect2>
<title>Important locations in the source hierarchy</title>
<para><filename>src/sys/[arch]/[arch]</filename> - Kernel code for a
specific machine architecture resides in this directory, for example
the <literal>i386</literal> architecture or the
<literal>SPARC64</literal> architecture.</para>
<para><filename>src/sys/dev/[bus]</filename> - device support for a
specific <literal>[bus]</literal> resides in this directory.</para>
<para><filename>src/sys/dev/pci</filename> - PCI bus support code
resides in this directory.</para>
<para><filename>src/sys/[isa|pci]</filename> - PCI/ISA device drivers
reside in this directory. The PCI/ISA bus support code used to exist
in this directory in FreeBSD version <literal>4.0</literal>.</para>
</sect2>
<sect2>
<title>Important structures and type definitions</title>
<para><literal>devclass_t</literal> - This is a type definition of a
pointer to a <literal>struct devclass</literal>.</para>
<para><literal>device_method_t</literal> - This is the same as
<literal>kobj_method_t</literal> (see
<filename>src/sys/sys/kobj.h</filename>).</para>
<para><literal>device_t</literal> - This is a type definition of a
pointer to a <literal>struct device</literal>.
<literal>device_t</literal> represents a device in the system. It is
a kernel object. See <filename>src/sys/sys/bus_private.h</filename>
for implementation details.</para>
<para><literal>driver_t</literal> - This is a type definition which
references <literal>struct driver</literal>. The
<literal>driver</literal> struct is a class of the
<literal>device</literal> kernel object; it also holds data private
to the driver.</para>
<figure>
<title><emphasis>driver_t</emphasis> implementation</title>
<programlisting>
struct driver {
    KOBJ_CLASS_FIELDS;
    void    *priv;      /* driver private data */
};
</programlisting>
</figure>
<para><literal>device_state_t</literal> - This is a type definition of
the enumeration <literal>device_state</literal>. It contains
the possible states of a Newbus device before and after the
autoconfiguration process.</para>
<figure>
<title>Device states <emphasis>device_state_t</emphasis></title>
<programlisting>
/*
 * src/sys/sys/bus.h
 */
typedef enum device_state {
    DS_NOTPRESENT,      /* not probed or probe failed */
    DS_ALIVE,           /* probe succeeded */
    DS_ATTACHED,        /* attach method called */
    DS_BUSY             /* device is open */
} device_state_t;
</programlisting>
</figure>
</sect2>
</sect1>
</chapter>


@ -1,337 +0,0 @@
<!--
The FreeBSD Documentation Project
$FreeBSD$
-->
<chapter id="pccard">
<title>PC Card</title>
<para>This chapter will talk about the FreeBSD mechanisms for
writing a device driver for a PC Card or CardBus device. However,
at the present time, it just documents how to add a device to an
existing pccard driver.</para>
<sect1 id="pccard-adddev">
<title>Adding a device</title>
<para>The procedure for adding a new device to the list of supported
pccard devices has changed from the system used through FreeBSD
4. In prior versions, editing a file in /etc to list the device
was necessary. Starting in FreeBSD 5.0, device drivers know what
devices they support. There is now a table of supported devices
in the kernel that drivers use to attach to a device.</para>
<sect2 id="pccard-overview">
<title>Overview</title>
<para>PC Cards are identified in one of two ways, both based on
information in the CIS of the card. The first method is to use
numeric manufacturer and product numbers. The second method is
to use the human readable strings that are also contained in the
CIS. The PC Card bus uses a centralized database and
some macros to facilitate a design pattern to help the driver
writer match devices to the driver.</para>
<para>There is a widespread practice of one company developing a
reference design for a PC Card product and then selling this
design to other companies to market. Those companies refine the
design, market the product to their target audience or
geographic area and put their own name plate onto the card.
However, the refinements to the physical card typically are very
minor, if any changes are made at all. Often, however, to
strengthen their branding of their version of the card, these
vendors will place their company name in the human readable strings in
the CIS space, but leave the manufacturer and product ids
unchanged.</para>
<para>Because of the above practice, it is a smaller work load
for FreeBSD to use the numeric IDs. It also introduces some
minor complications into the process of adding IDs to the
system. One must carefully check to see who really made the
card, especially when it appears that the vendor the card
came from might already have a different manufacturer id listed
in the central database. Linksys, D-Link and NetGear are
examples of US manufacturers of LAN hardware that often sell the
same design. These same designs can be sold in Japan under
names such as Buffalo and Corega. Yet often, these devices will
all have the same manufacturer and product id.</para>
<para>The PC Card bus keeps its central database of card
information, but not which driver is associated with each card, in
/sys/dev/pccard/pccarddevs. It also provides a set of macros
that allow one to easily construct simple entries in the table
the driver uses to claim devices.</para>
<para>Finally, some really low end devices do not contain
manufacturer identification at all. These devices require that
one matches them using the human readable CIS strings. While it
would be nice if we didn't need this method as a fallback, it is
necessary for some very low end CD-ROM players that are quite
popular. This method should generally be avoided, but a number
of devices are listed in this section because they were added
prior to the recognition of the OEM nature of the PC Card
business. When adding new devices, prefer using the numeric
method.</para>
</sect2>
<sect2 id="pccard-pccarddevs">
<title>Format of pccarddevs</title>
<para>There are four sections in the pccarddevs file. The
first section lists the manufacturer numbers for those vendors
that use them. This section is sorted in numerical order. The
next section has all of the products that are used by these
vendors, along with their product ID numbers and a description
string. The description string typically isn't used (instead we
set the device's description based on the human readable CIS,
even if we match on the numeric version). These two sections
are then repeated for those devices that use the string matching
method. Finally, C-style comments are allowed anywhere in the
file.</para>
<para>The first section of the file contains the vendor IDs.
Please keep this list sorted in numeric order. Also, please
coordinate changes to this file because we share it with
NetBSD to help facilitate a common clearing house for this
information. For example:
<programlisting>vendor FUJITSU 0x0004 Fujitsu Corporation
vendor NETGEAR_2 0x000b Netgear
vendor PANASONIC 0x0032 Matsushita Electric Industrial Co.
vendor SANDISK 0x0045 Sandisk Corporation
</programlisting>
shows the first few vendor ids. Chances are very good that the
NETGEAR_2 entry is really an OEM that NETGEAR purchased cards
from and the author of support for those cards was unaware at
the time that Netgear was using someone else's id. These
entries are fairly straightforward. There's the vendor keyword
used to denote the kind of line that this is. There's the name
of the vendor. This name will be repeated later in the
pccarddevs file, as well as used in the driver's match tables,
so keep it short and a valid C identifier. There's a numeric
ID, in hex, for the manufacturer. Do not add IDs of the form
0xffffffff or 0xffff because these are reserved ids (the former
is 'no id set' while the latter is sometimes seen in extremely
poor quality cards to try to indicate 'none'). Finally there's a
string description of the company that makes the card. This
string is not used in FreeBSD for anything but commentary
purposes.</para>
<para>The second section of the file contains the products.
As you can see in the following example:
<programlisting>/* Allied Telesis K.K. */
product ALLIEDTELESIS LA_PCM 0x0002 Allied Telesis LA-PCM
/* Archos */
product ARCHOS ARC_ATAPI 0x0043 MiniCD
</programlisting>
the format is similar to the vendor lines. There is the product
keyword. Then there is the vendor name, repeated from above.
This is followed by the product name, which is used by the
driver and should be a valid C identifier, but may also start
with a number. There's then the product id for this card, in
hex. As with the vendors, there's the same convention for
0xffffffff and 0xffff. Finally, there's a string description of
the device itself. This string typically is not used in
FreeBSD, since FreeBSD's pccard bus driver will construct a
string from the human readable CIS entries, but can be used in
the rare cases where this is somehow insufficient. The products
are in alphabetical order by manufacturer, then numerical order by
product id. They have a C comment before each manufacturer's
entries and there is a blank line between entries.</para>
<para>The third section is like the previous vendor section, but
with all of the manufacturer numeric ids as -1. -1 means 'match
anything you find' in the FreeBSD pccard bus code. Since these
are C identifiers, their names must be unique. Otherwise the
format is identical to the first section of the file.</para>
<para>The final section contains the entries for those cards
that we must match with string entries. This section's format
is a little different from the numeric section:
<programlisting>product ADDTRON AWP100 { "Addtron", "AWP-100&spWireless&spPCMCIA", "Version&sp01.02", NULL }
product ALLIEDTELESIS WR211PCM { "Allied&spTelesis&spK.K.", "WR211PCM", NULL, NULL } Allied Telesis WR211PCM
</programlisting>
We have the familiar product keyword, followed by the vendor
name followed by the card name, just as in the second section of
the file. However, then we deviate from that format. There is
a {} grouping containing a number of strings. These strings
correspond to the vendor, product and extra information that is
defined in a CIS_INFO tuple. These strings are filtered by the
program that generates pccarddevs.h to replace &amp;sp with a
real space. NULL entries mean that that part of the entry
should be ignored. The first entry in the example is a bad
one: it shouldn't contain the version number unless
that's critical for the operation of the card. Sometimes vendors
will have many different versions of the card in the field that
all work, in which case that information only makes it harder
for someone with a similar card to use it with FreeBSD.
Sometimes it is necessary when a vendor wishes to sell many
different parts under the same brand due to market
considerations (availability, price, and so forth). Then it can
be critical to disambiguating the card in those rare cases where
the vendor kept the same manufacturer/product pair. Regular
expression matching is not available at this time.</para>
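<para>The matching rule described above, where a NULL entry means
that the field is ignored, can be sketched as a small comparison
function. This is an illustration only: the name
<literal>toy_cis_match</literal> and the fixed count of four strings
are assumptions made for the example, not the pccard bus code.</para>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_CIS_STRINGS 4

/*
 * Compare a table entry against the strings read from a card.
 * A NULL in the table entry means "do not care about this field".
 * Comparison is exact; there is no pattern matching.
 */
static bool
toy_cis_match(const char *const want[TOY_CIS_STRINGS],
    const char *const card[TOY_CIS_STRINGS])
{
    assert(want != NULL && card != NULL);
    for (size_t i = 0; i < TOY_CIS_STRINGS; i++) {
        if (want[i] == NULL)
            continue;            /* ignored field */
        if (card[i] == NULL || strcmp(want[i], card[i]) != 0)
            return (false);
    }
    return (true);
}
```

<para>An entry that leaves the version string as NULL keeps matching
when a vendor ships a new revision of the card, which is why the
version should only be listed when it really is needed to tell two
cards apart.</para>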
</sect2>
<sect2 id="pccard-probe">
<title>Sample probe routine</title>
<para>To understand how to add a device to the list of supported
devices, one must understand the probe and/or match routines
that many drivers have. It is complicated a little in FreeBSD
5.x because there is a compatibility layer for OLDCARD present
as well. Since only the window-dressing is different, I'll be
presenting an idealized version.</para>
<programlisting>static const struct pccard_product wi_pccard_products[] = {
    PCMCIA_CARD(3COM, 3CRWE737A, 0),
    PCMCIA_CARD(BUFFALO, WLI_PCM_S11, 0),
    PCMCIA_CARD(BUFFALO, WLI_CF_S11G, 0),
    PCMCIA_CARD(TDK, LAK_CD011WL, 0),
    { NULL }
};

static int
wi_pccard_probe(device_t dev)
{
    const struct pccard_product *pp;

    /* Walk the table; a non-NULL result is the first matching entry. */
    if ((pp = pccard_product_lookup(dev, wi_pccard_products,
        sizeof(wi_pccard_products[0]), NULL)) != NULL) {
        if (pp->pp_name != NULL)
            device_set_desc(dev, pp->pp_name);
        return (0);
    }
    return (ENXIO);
}
</programlisting>
<para>Here we have a simple pccard probe routine that matches a
few devices. As stated above, the name may vary (if it isn't
<function>foo_pccard_probe()</function> it will be
<function>foo_pccard_match()</function>). The function
<function>pccard_product_lookup()</function> is a generalized
function that walks the table and returns a pointer to the
first entry that it matches. Some drivers may use this
mechanism to convey additional information about some cards to
the rest of the driver, so there may be some variance in the
table. The only requirement is that if you have a
different table, the first element of each entry's structure
must be a struct pccard_product.</para>
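<para>The reason the generic structure must come first, and the
reason the entry size is passed to the lookup function, can be
seen in a userland sketch. Everything below is a simplified
stand-in: <structname>struct product</structname>,
<structname>struct my_product</structname> and
<function>product_lookup()</function> are hypothetical names, and
the real <function>pccard_product_lookup()</function> takes
different arguments.</para>

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical stand-in for struct pccard_product. */
struct product {
	const char *name;	/* NULL terminates the table in this sketch */
	uint32_t manufacturer;
	uint32_t product;
};

/* A driver-specific entry: the generic part must come first. */
struct my_product {
	struct product base;
	int quirks;		/* extra per-card data for the driver */
};

/*
 * Walk a table whose entries are ent_size bytes apart and return the
 * first entry matching the given IDs, or NULL if none matches.
 */
static const struct product *
product_lookup(const void *table, size_t ent_size,
    uint32_t manufacturer, uint32_t product)
{
	const char *p = table;
	const struct product *ent;

	for (;; p += ent_size) {
		ent = (const struct product *)p;
		if (ent->name == NULL)
			return (NULL);
		if (ent->manufacturer == manufacturer &&
		    ent->product == product)
			return (ent);
	}
}

static const struct my_product my_products[] = {
	{ { "BUFFALO AirStation 11Mbps WLAN", 0x026f, 0x0305 }, 0 },
	{ { "BUFFALO AirStation 11Mbps CF WLAN", 0x026f, 0x030b }, 1 },
	{ { NULL, 0, 0 }, 0 }
};
```

<para>Because the generic part sits at offset zero, the pointer
returned by the lookup can be cast back to the driver's own entry
type to reach the extra fields.</para>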
<para>Looking at the table wi_pccard_products, one notices that
all the entries are of the form PCMCIA_CARD(foo, bar, baz).
The foo part is the manufacturer id from pccarddevs. The bar
part is the product. The baz is the expected function number
for this card. Many pccards can have multiple functions,
and some way to disambiguate function 1 from function 0 is
needed. You may see PCMCIA_CARD_D, which includes the device
description from the pccarddevs file. You may also see
PCMCIA_CARD2 and PCMCIA_CARD2_D which are used when you need
to match both CIS strings and manufacturer numbers, in the
'use the default description' and 'take the description from
pccarddevs' flavors.</para>
</sect2>
<sect2 id="pccard-add">
<title>Putting it all together</title>
<para>So, to add a new device, one must do the following steps.
First, one must obtain the identification information from the
device. The easiest way to do this is to insert the device into
a PC Card or CF slot and issue devinfo -v. You'll likely see
something like:
<programlisting> cbb1 pnpinfo vendor=0x104c device=0xac51 subvendor=0x1265 subdevice=0x0300 class=0x060700 at slot=10 function=1
cardbus1
pccard1
unknown pnpinfo manufacturer=0x026f product=0x030c cisvendor="BUFFALO" cisproduct="WLI2-CF-S11" function_type=6 at function=0
</programlisting>
as part of the output. The manufacturer and product are the
numeric IDs for this product, while the cisvendor and
cisproduct are the strings that are present in the CIS that
describe this product.</para>
<para>Since we first want to prefer the
numeric option, first try to construct an entry based on that.
The above card has been slightly fictionalized for the purpose
of this example. The vendor is BUFFALO, which we see already
has an entry:
<programlisting>vendor BUFFALO 0x026f BUFFALO (Melco Corporation)
</programlisting>
so we're good there. Looking for an entry for this card, we do
not find one. Instead we find:
<programlisting>/* BUFFALO */
product BUFFALO WLI_PCM_S11 0x0305 BUFFALO AirStation 11Mbps WLAN
product BUFFALO LPC_CF_CLT 0x0307 BUFFALO LPC-CF-CLT
product BUFFALO LPC3_CLT 0x030a BUFFALO LPC3-CLT Ethernet Adapter
product BUFFALO WLI_CF_S11G 0x030b BUFFALO AirStation 11Mbps CF WLAN
</programlisting>
so we can just add
<programlisting>product BUFFALO WLI2_CF_S11G 0x030c BUFFALO AirStation ultra 802.11b CF
</programlisting>
to pccarddevs. Presently, there is a manual step to regenerate
the pccarddevs.h file used to convey these identifiers to
the client driver. The following steps must be done before you
can use them in the driver:
<programlisting>cd src/sys/dev/pccard
make -f Makefile.pccarddevs
</programlisting>
</para>
<para>Once these steps are complete, you can add the card to the
driver. That is a simple operation of adding one line:
<programlisting>static const struct pccard_product wi_pccard_products[] = {
PCMCIA_CARD(3COM, 3CRWE737A, 0),
PCMCIA_CARD(BUFFALO, WLI_PCM_S11, 0),
PCMCIA_CARD(BUFFALO, WLI_CF_S11G, 0),
+ PCMCIA_CARD(BUFFALO, WLI2_CF_S11G, 0),
PCMCIA_CARD(TDK, LAK_CD011WL, 0),
{ NULL }
};
</programlisting>
Note that I've put a '+' at the start of the line that I
added, but that is simply to highlight the line. Do not add it
to the actual driver. Once you've added the line, you can
recompile your kernel or module and try to see if it recognizes
the device. If it does and works, please submit a patch. If it
doesn't work, please figure out what is needed to make it work
and submit a patch. If it didn't recognize it at all, you have
done something wrong and should recheck each step.</para>
<para>If you are a FreeBSD src committer, and everything appears
to be working, then you can commit the changes to the tree.
However, there are some minor tricky things that you need to
worry about. First, you must commit the pccarddevs file to the
tree. Only after that commit should you regenerate
pccarddevs.h, which you then commit as a
second commit (this is to make sure that the right $FreeBSD$ tag
is in the latter file). Finally, you need to commit the
additions to the driver.</para>
</sect2>
<sect2 id="pccard-pr">
<title>Submitting a new device</title>
<para>Many people send entries for new devices to the author
directly. Please do not do this. Please submit them as a PR
and send the author the PR number for his records. This makes
sure that entries aren't lost. When submitting a PR, it is
unnecessary to include the pccarddevs.h diffs in the patch, since
those will be regenerated. It is necessary to include a
description of the device, as well as the patches to the client
driver. If you don't know the name, use OEM99 as the name, and
the author will adjust OEM99 accordingly after investigation.
Committers should not commit OEM99, but instead find the highest
OEM entry and commit one more than that.</para>
</sect2>
</sect1>
</chapter>


@ -1,378 +0,0 @@
<!--
The FreeBSD Documentation Project
$FreeBSD$
-->
<chapter id="pci">
<title>PCI Devices</title>
<para>This chapter will talk about the FreeBSD mechanisms for
writing a device driver for a device on a PCI bus.</para>
<sect1 id="pci-probe">
<title>Probe and Attach</title>
<para>This section shows how the PCI bus code iterates through
the unattached devices to see if a newly loaded kld will attach
to any of them.</para>
<programlisting>/*
* Simple KLD to play with the PCI functions.
*
* Murray Stokely
*/
#define MIN(a,b) (((a) < (b)) ? (a) : (b))
#include &lt;sys/types.h&gt;
#include &lt;sys/module.h&gt;
#include &lt;sys/systm.h&gt; /* uprintf */
#include &lt;sys/errno.h&gt;
#include &lt;sys/param.h&gt; /* defines used in kernel.h */
#include &lt;sys/kernel.h&gt; /* types used in module initialization */
#include &lt;sys/conf.h&gt; /* cdevsw struct */
#include &lt;sys/uio.h&gt; /* uio struct */
#include &lt;sys/malloc.h&gt;
#include &lt;sys/bus.h&gt; /* structs, prototypes for pci bus stuff */
#include &lt;pci/pcivar.h&gt; /* For get_pci macros! */
/* Function prototypes */
d_open_t mypci_open;
d_close_t mypci_close;
d_read_t mypci_read;
d_write_t mypci_write;
/* Character device entry points */
static struct cdevsw mypci_cdevsw = {
.d_open = mypci_open,
.d_close = mypci_close,
.d_read = mypci_read,
.d_write = mypci_write,
.d_name = "mypci",
};
/* vars */
static dev_t sdev;
/* We're more interested in probe/attach than with
open/close/read/write at this point */
int
mypci_open(dev_t dev, int oflags, int devtype, struct proc *p)
{
int err = 0;
uprintf("Opened device \"mypci\" successfully.\n");
return(err);
}
int
mypci_close(dev_t dev, int fflag, int devtype, struct proc *p)
{
int err=0;
uprintf("Closing device \"mypci.\"\n");
return(err);
}
int
mypci_read(dev_t dev, struct uio *uio, int ioflag)
{
int err = 0;
uprintf("mypci read!\n");
return err;
}
int
mypci_write(dev_t dev, struct uio *uio, int ioflag)
{
int err = 0;
uprintf("mypci write!\n");
return(err);
}
/* PCI Support Functions */
/*
* Return 0 if this device is ours, ENXIO otherwise.
*/
static int
mypci_probe(device_t dev)
{
uprintf("MyPCI Probe\n"
"Vendor ID : 0x%x\n"
"Device ID : 0x%x\n",pci_get_vendor(dev),pci_get_device(dev));
if (pci_get_vendor(dev) == 0x11c1) {
uprintf("We've got the Winmodem, probe successful!\n");
return 0;
}
return ENXIO;
}
/* Attach function is only called if the probe is successful */
static int
mypci_attach(device_t dev)
{
uprintf("MyPCI Attach for : deviceID : 0x%x\n",pci_get_device(dev));
sdev = make_dev(&amp;mypci_cdevsw,
0,
UID_ROOT,
GID_WHEEL,
0600,
"mypci");
uprintf("Mypci device loaded.\n");
return 0; /* attach succeeded */
}
/* Detach device. */
static int
mypci_detach(device_t dev)
{
uprintf("Mypci detach!\n");
return 0;
}
/* Called during system shutdown after sync. */
static int
mypci_shutdown(device_t dev)
{
uprintf("Mypci shutdown!\n");
return 0;
}
/*
* Device suspend routine.
*/
static int
mypci_suspend(device_t dev)
{
uprintf("Mypci suspend!\n");
return 0;
}
/*
* Device resume routine.
*/
static int
mypci_resume(device_t dev)
{
uprintf("Mypci resume!\n");
return 0;
}
static device_method_t mypci_methods[] = {
/* Device interface */
DEVMETHOD(device_probe, mypci_probe),
DEVMETHOD(device_attach, mypci_attach),
DEVMETHOD(device_detach, mypci_detach),
DEVMETHOD(device_shutdown, mypci_shutdown),
DEVMETHOD(device_suspend, mypci_suspend),
DEVMETHOD(device_resume, mypci_resume),
{ 0, 0 }
};
static driver_t mypci_driver = {
"mypci",
mypci_methods,
0,
/* sizeof(struct mypci_softc), */
};
static devclass_t mypci_devclass;
DRIVER_MODULE(mypci, pci, mypci_driver, mypci_devclass, 0, 0);</programlisting>
<para>Additional Resources
<itemizedlist>
<listitem><simpara><ulink url="http://www.pcisig.org/">PCI
Special Interest Group</ulink></simpara></listitem>
<listitem><simpara>PCI System Architecture, Fourth Edition by
Tom Shanley, et al.</simpara></listitem>
</itemizedlist>
</para>
</sect1>
<sect1 id="pci-bus">
<title>Bus Resources</title>
<para>FreeBSD provides an object-oriented mechanism for requesting
resources from a parent bus. Almost all devices will be a child
member of some sort of bus (PCI, ISA, USB, SCSI, etc) and these
devices need to acquire resources from their parent bus (such as
memory segments, interrupt lines, or DMA channels).</para>
<sect2>
<title>Base Address Registers</title>
<para>To do anything particularly useful with a PCI device you
will need to obtain the <emphasis>Base Address
Registers</emphasis> (BARs) from the PCI Configuration space.
The PCI-specific details of obtaining the BAR are abstracted in
the <function>bus_alloc_resource()</function> function.</para>
<para>For example, a typical driver might have something similar
to this in the <function>attach()</function> function:</para>
<programlisting> sc->bar0id = 0x10;
sc->bar0res = bus_alloc_resource(dev, SYS_RES_MEMORY, &amp;(sc->bar0id),
0, ~0, 1, RF_ACTIVE);
if (sc->bar0res == NULL) {
uprintf("Memory allocation of PCI base register 0 failed!\n");
error = ENXIO;
goto fail1;
}
sc->bar1id = 0x14;
sc->bar1res = bus_alloc_resource(dev, SYS_RES_MEMORY, &amp;(sc->bar1id),
0, ~0, 1, RF_ACTIVE);
if (sc->bar1res == NULL) {
uprintf("Memory allocation of PCI base register 1 failed!\n");
error = ENXIO;
goto fail2;
}
sc->bar0_bt = rman_get_bustag(sc->bar0res);
sc->bar0_bh = rman_get_bushandle(sc->bar0res);
sc->bar1_bt = rman_get_bustag(sc->bar1res);
sc->bar1_bh = rman_get_bushandle(sc->bar1res);
</programlisting>
<para>Handles for each base address register are kept in the
<structname>softc</structname> structure so that they can be
used to write to the device later.</para>
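<para>The resource IDs 0x10 and 0x14 used above are the offsets of
the first two base address registers in PCI configuration space;
each BAR slot is four bytes wide and the first one sits at offset
0x10. The following sketch spells out that arithmetic.
<function>bar_offset()</function> is a hypothetical helper written
for this illustration; later FreeBSD releases are believed to
offer a <literal>PCIR_BAR()</literal> macro for the same purpose,
but check <filename>pcireg.h</filename> on your system before
relying on it.</para>

```c
/* Configuration-space offset of BAR 0 in a type 0 PCI header. */
#define BAR_BASE	0x10

/* Hypothetical helper: offset of base address register n (0 to 5). */
static int
bar_offset(int n)
{
	return (BAR_BASE + n * 4);
}
```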
<para>These handles can then be used to read from or write to the
device registers with the <function>bus_space_*</function>
functions. For example, a driver might contain a shorthand
function to read from a board-specific register like this:</para>
<programlisting>uint16_t
board_read(struct ni_softc *sc, uint16_t address) {
return bus_space_read_2(sc->bar1_bt, sc->bar1_bh, address);
}
</programlisting>
<para>Similarly, one could write to the registers with:</para>
<programlisting>void
board_write(struct ni_softc *sc, uint16_t address, uint16_t value) {
bus_space_write_2(sc->bar1_bt, sc->bar1_bh, address, value);
}
</programlisting>
<para>These functions exist in 8-bit, 16-bit, and 32-bit versions
and you should use
<function>bus_space_{read|write}_{1|2|4}</function>
accordingly.</para>
</sect2>
<sect2>
<title>Interrupts</title>
<para>Interrupts are allocated from the object-oriented bus code
in a way similar to the memory resources. First an IRQ
resource must be allocated from the parent bus, and then the
interrupt handler must be set up to deal with this IRQ.</para>
<para>Again, a sample from a device
<function>attach()</function> function says more than
words.</para>
<programlisting>/* Get the IRQ resource */
sc->irqid = 0x0;
sc->irqres = bus_alloc_resource(dev, SYS_RES_IRQ, &amp;(sc->irqid),
0, ~0, 1, RF_SHAREABLE | RF_ACTIVE);
if (sc->irqres == NULL) {
uprintf("IRQ allocation failed!\n");
error = ENXIO;
goto fail3;
}
/* Now we should set up the interrupt handler */
error = bus_setup_intr(dev, sc->irqres, INTR_TYPE_MISC,
my_handler, sc, &amp;(sc->handler));
if (error) {
printf("Couldn't set up irq\n");
goto fail4;
}
sc->irq_bt = rman_get_bustag(sc->irqres);
sc->irq_bh = rman_get_bushandle(sc->irqres);
</programlisting>
<para>Some care must be taken in the detach routine of the
driver. You must quiesce the device's interrupt stream, and
remove the interrupt handler. Once
<function>bus_teardown_intr()</function> has returned, you
know that your interrupt handler will no longer be called, and
that all threads that might have been in this interrupt handler
have returned. Depending on the locking strategy of your
driver, you will also need to be careful with what locks you
hold when you do this to avoid deadlock.</para>
</sect2>
<sect2>
<title>DMA</title>
<para>This section is obsolete, and present only for historical
reasons. The proper method for dealing with these issues is to
use the <function>bus_dma*()</function> functions instead.
This paragraph can be removed when this section is updated to reflect
that usage. However, at the moment, the API is in a bit of
flux, so once that settles down, it would be good to update this
section to reflect that.</para>
<para>On the PC, peripherals that want to do bus-mastering DMA
must deal with physical addresses. This is a problem since
FreeBSD uses virtual memory and deals almost exclusively with
virtual addresses. Fortunately, there is a function,
<function>vtophys()</function> to help.</para>
<programlisting>#include &lt;vm/vm.h&gt;
#include &lt;vm/pmap.h&gt;
#define vtophys(virtual_address) (...)
</programlisting>
<para>The solution is a bit different on the alpha however, and
what we really want is a function called
<function>vtobus()</function>.</para>
<programlisting>#if defined(__alpha__)
#define vtobus(va) alpha_XXX_dmamap((vm_offset_t)va)
#else
#define vtobus(va) vtophys(va)
#endif
</programlisting>
</sect2>
<sect2>
<title>Deallocating Resources</title>
<para>It is very important to deallocate all of the resources
that were allocated during <function>attach()</function>.
Care must be taken to deallocate the correct resources even on a
failure condition so that the system will remain usable while
your driver dies.</para>
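<para>The attach fragments earlier in this chapter jump to labels
such as <literal>fail1</literal> and <literal>fail2</literal> on
error. One common way to arrange those labels is to release
resources in the reverse order of allocation, so that each failure
point undoes only what was acquired before it. The userland sketch
below models this with a counter; <function>mock_alloc()</function>,
<function>mock_release()</function> and
<function>mock_attach()</function> are hypothetical stand-ins, not
kernel interfaces.</para>

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for bus resources; nothing here is a kernel API. */
static int live_resources;	/* how many mock resources are held */

static bool
mock_alloc(bool succeed)
{
	if (succeed)
		live_resources++;
	return (succeed);
}

static void
mock_release(void)
{
	live_resources--;
}

/*
 * Acquire three resources in order.  fail_at selects which step fails
 * (0 means none).  On failure, release only what was acquired so far,
 * in reverse order, and return ENXIO.
 */
static int
mock_attach(int fail_at)
{
	if (!mock_alloc(fail_at != 1))		/* e.g. BAR 0 */
		goto fail0;
	if (!mock_alloc(fail_at != 2))		/* e.g. BAR 1 */
		goto fail1;
	if (!mock_alloc(fail_at != 3))		/* e.g. IRQ */
		goto fail2;
	return (0);

fail2:
	mock_release();				/* undo BAR 1 */
fail1:
	mock_release();				/* undo BAR 0 */
fail0:
	return (ENXIO);
}
```

<para>Whichever step fails, the counter returns to zero, which is
the property a real attach routine needs: a failed attach leaves
nothing allocated behind.</para>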
</sect2>
</sect1>
</chapter>


@ -1,690 +0,0 @@
<!--
The FreeBSD Documentation Project
$FreeBSD$
-->
<chapter id="oss">
<chapterinfo>
<authorgroup>
<author>
<firstname>Jean-Francois</firstname>
<surname>Dockes</surname>
<contrib>Contributed by </contrib>
</author>
</authorgroup>
<!-- 23 November 2001 -->
</chapterinfo>
<title>Sound subsystem</title>
<sect1 id="oss-intro">
<title>Introduction</title>
<para>The FreeBSD sound subsystem cleanly separates generic sound
handling issues from device-specific ones. This makes it easier
to add support for new hardware.</para>
<para>The &man.pcm.4; framework is the central piece of the sound
subsystem. It mainly implements the following elements:</para>
<itemizedlist>
<listitem>
<para>A system call interface (read, write, ioctls) to
digitized sound and mixer functions. The ioctl command set
is compatible with the legacy <emphasis>OSS</emphasis> or
<emphasis>Voxware</emphasis> interface, allowing common
multimedia applications to be ported without
modification.</para>
</listitem>
<listitem>
<para>Common code for processing sound data (format
conversions, virtual channels).</para>
</listitem>
<listitem>
<para>A uniform software interface to hardware-specific audio
interface modules.</para>
</listitem>
<listitem>
<para>Additional support for some common hardware interfaces
(ac97), or shared hardware-specific code (e.g., ISA DMA
routines).</para>
</listitem>
</itemizedlist>
<para>The support for specific sound cards is implemented by
hardware-specific drivers, which provide channel and mixer interfaces
to plug into the generic <devicename>pcm</devicename> code.</para>
<para>In this chapter, the term <devicename>pcm</devicename> will
refer to the central, common part of the sound driver, as
opposed to the hardware-specific modules.</para>
<para>The prospective driver writer will of course want to start
from an existing module and use the code as the ultimate
reference. But, while the sound code is nice and clean, it is
also mostly devoid of comments. This document tries to give an
overview of the framework interface and answer some questions
that may arise while adapting the existing code.</para>
<para>As an alternative, or in addition to starting from a working
example, you can find a commented driver template at
<ulink url="http://people.FreeBSD.org/~cg/template.c">
http://people.FreeBSD.org/~cg/template.c</ulink>.</para>
</sect1>
<sect1 id="oss-files">
<title>Files</title>
<para>All the relevant code currently (FreeBSD 4.4) lives in
<filename>/usr/src/sys/dev/sound/</filename>, except for the
public ioctl interface definitions, found in
<filename>/usr/src/sys/sys/soundcard.h</filename>.</para>
<para>Under <filename>/usr/src/sys/dev/sound/</filename>, the
<filename>pcm/</filename> directory holds the central code,
while the <filename>isa/</filename> and
<filename>pci/</filename> directories have the drivers for ISA
and PCI boards.</para>
</sect1>
<sect1 id="pcm-probe-and-attach">
<title>Probing, attaching, etc.</title>
<para>Sound drivers probe and attach in almost the same way as any
hardware driver module. You might want to look at the <link
linkend="isa-driver"> ISA</link> or <link
linkend="pci">PCI</link> specific sections of the handbook for
more information.</para>
<para>However, sound drivers differ in some ways:</para>
<itemizedlist>
<listitem>
<para>They declare themselves as <devicename>pcm</devicename>
class devices, with a <structname>struct
snddev_info</structname> device private structure:</para>
<programlisting> static driver_t xxx_driver = {
"pcm",
xxx_methods,
sizeof(struct snddev_info)
};
DRIVER_MODULE(snd_xxxpci, pci, xxx_driver, pcm_devclass, 0, 0);
MODULE_DEPEND(snd_xxxpci, snd_pcm, PCM_MINVER, PCM_PREFVER,PCM_MAXVER);</programlisting>
<para>Most sound drivers need to store additional private
information about their device. A private data structure is
usually allocated in the attach routine. Its address is
passed to <devicename>pcm</devicename> by the calls to
<function>pcm_register()</function> and
<function>mixer_init()</function>.
<devicename>pcm</devicename> later passes back this address
as a parameter in calls to the sound driver
interfaces.</para>
</listitem>
<listitem>
<para>The sound driver attach routine should declare its MIXER
or AC97 interface to <devicename>pcm</devicename> by calling
<function>mixer_init()</function>. For a MIXER interface,
this causes in turn a call to <link linkend="xxxmixer-init">
<function>xxxmixer_init()</function></link>.</para>
</listitem>
<listitem>
<para>The sound driver attach routine declares its general
CHANNEL configuration to <devicename>pcm</devicename> by
calling <function>pcm_register(dev, sc, nplay,
nrec)</function>, where <varname>sc</varname> is the address
for the device data structure, used in further calls from
<devicename>pcm</devicename>, and <varname>nplay</varname>
and <varname>nrec</varname> are the number of play and
record channels.</para>
</listitem>
<listitem>
<para>The sound driver attach routine declares each of its
channel objects by calls to
<function>pcm_addchan()</function>. This sets up the
channel glue in <devicename>pcm</devicename> and causes in
turn a call to
<link linkend="xxxchannel-init">
<function>xxxchannel_init()</function></link>.</para>
</listitem>
<listitem>
<para>The sound driver detach routine should call
<function>pcm_unregister()</function> before releasing its
resources.</para>
</listitem>
</itemizedlist>
<para>There are two possible methods to handle non-PnP devices:</para>
<itemizedlist>
<listitem>
<para>Use a <function>device_identify()</function> method
(example: <filename>sound/isa/es1888.c</filename>). The
<function>device_identify()</function> method probes for the
hardware at known addresses and, if it finds a supported
device, creates a new pcm device which is then passed to
probe/attach.</para>
</listitem>
<listitem>
<para>Use a custom kernel configuration with appropriate hints
for pcm devices (example:
<filename>sound/isa/mss.c</filename>).</para>
</listitem>
</itemizedlist>
<para><devicename>pcm</devicename> drivers should implement
<function>device_suspend</function>,
<function>device_resume</function> and
<function>device_shutdown</function> routines, so that power
management and module unloading function correctly.</para>
</sect1>
<sect1 id="oss-interfaces">
<title>Interfaces</title>
<para>The interface between the <devicename>pcm</devicename> core
and the sound drivers is defined in terms of <link
linkend="kernel-objects">kernel objects</link>.</para>
<para>There are two main interfaces that a sound driver will
usually provide: <emphasis>CHANNEL</emphasis> and either
<emphasis>MIXER</emphasis> or <emphasis>AC97</emphasis>.</para>
<para>The <emphasis>AC97</emphasis> interface is a very small
hardware access (register read/write) interface, implemented by
drivers for hardware with an AC97 codec. In this case, the
actual MIXER interface is provided by the shared AC97 code in
<devicename>pcm</devicename>.</para>
<sect2>
<title>The CHANNEL interface</title>
<sect3>
<title>Common notes for function parameters</title>
<para>Sound drivers usually have a private data structure to
describe their device, and one structure for each play and
record data channel that the device supports.</para>
<para>For all CHANNEL interface functions, the first parameter
is an opaque pointer.</para>
<para>The second parameter is a pointer to the private
channel data structure, except for
<function>channel_init()</function> which has a pointer to the
private device structure (and returns the channel pointer
for further use by <devicename>pcm</devicename>).</para>
</sect3>
<sect3>
<title>Overview of data transfer operations</title>
<para>For sound data transfers, the
<devicename>pcm</devicename> core and the sound drivers
communicate through a shared memory area, described by a
<structname>struct snd_dbuf</structname>.</para>
<para><structname>struct snd_dbuf</structname> is private to
<devicename>pcm</devicename>, and sound drivers obtain
values of interest by calls to accessor functions
(<function>sndbuf_getxxx()</function>).</para>
<para>The shared memory area has a size of
<function>sndbuf_getsize()</function> and is divided into
fixed size blocks of <function>sndbuf_getblksz()</function>
bytes.</para>
<para>When playing, the general transfer mechanism is as
follows (reverse the idea for recording):</para>
<itemizedlist>
<listitem>
<para><devicename>pcm</devicename> initially fills up the
buffer, then calls the sound driver's <link
linkend="channel-trigger">
<function>xxxchannel_trigger()</function></link>
function with a parameter of PCMTRIG_START.</para>
</listitem>
<listitem>
<para>The sound driver then arranges to repeatedly
transfer the whole memory area
(<function>sndbuf_getbuf()</function>,
<function>sndbuf_getsize()</function>) to the device, in
blocks of <function>sndbuf_getblksz()</function> bytes.
It calls back the <function>chn_intr()</function>
<devicename>pcm</devicename> function for each
transferred block (this will typically happen at
interrupt time).</para>
</listitem>
<listitem>
<para><function>chn_intr()</function> arranges to copy new
data to the area that was transferred to the device (now
free), and make appropriate updates to the
<structname>snd_dbuf</structname> structure.</para>
</listitem>
</itemizedlist>
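<para>The playback steps above amount to a ring buffer that the
device drains one block at a time while
<devicename>pcm</devicename> refills the block just drained. The
userland model below makes that cycle concrete. It is a sketch
only: <structname>struct ring</structname> and the three functions
are hypothetical, the sizes are tiny for readability, and the real
<structname>struct snd_dbuf</structname> is private to
<devicename>pcm</devicename>.</para>

```c
#include <stddef.h>

#define BUF_SIZE	8	/* plays the role of sndbuf_getsize() */
#define BLK_SIZE	4	/* plays the role of sndbuf_getblksz() */

/* Hypothetical model of the shared buffer; not the real struct snd_dbuf. */
struct ring {
	unsigned char data[BUF_SIZE];
	size_t hw_pos;		/* next byte the "device" will consume */
	unsigned char next;	/* next sample value the core will supply */
	unsigned refills;	/* how many block refills have happened */
};

/* Core side: write fresh data into the block that was just played. */
static void
core_refill(struct ring *r, size_t blk_start)
{
	size_t i;

	for (i = 0; i < BLK_SIZE; i++)
		r->data[blk_start + i] = r->next++;
	r->refills++;
}

/* Core side: fill the whole buffer before the transfer is started. */
static void
core_prefill(struct ring *r)
{
	size_t i;

	r->hw_pos = 0;
	r->next = 0;
	r->refills = 0;
	for (i = 0; i < BUF_SIZE; i++)
		r->data[i] = r->next++;
}

/*
 * Driver side: "play" one block, advance and wrap the hardware position,
 * then notify the core, as a real driver would by calling chn_intr().
 * Returns the first byte of the block that was played.
 */
static unsigned char
driver_play_block(struct ring *r)
{
	size_t start = r->hw_pos;
	unsigned char first = r->data[start];

	r->hw_pos = (start + BLK_SIZE) % BUF_SIZE;
	core_refill(r, start);
	return (first);
}
```

<para>Playing successive blocks in this model returns a continuous
sample stream even though the buffer holds only two blocks, because
each block is refilled before the position wraps back to it.</para>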
</sect3>
<sect3 id="xxxchannel-init">
<title>channel_init</title>
<para><function>xxxchannel_init()</function> is called to
initialize each of the play or record channels. The calls
are initiated from the sound driver attach routine. (See
the <link linkend="pcm-probe-and-attach">probe and attach
section</link>).</para>
<programlisting> static void *
xxxchannel_init(kobj_t obj, void *data,
struct snd_dbuf *b, struct pcm_channel *c, int dir)<co id="co-chinit-params">
{
struct xxx_info *sc = data;
struct xxx_chinfo *ch;
...
return ch;<co id="co-chinit-return">
}</programlisting>
<calloutlist>
<callout arearefs="co-chinit-params">
<para><varname>b</varname> is the address for the channel
<structname>struct snd_dbuf</structname>. It should be
initialized in the function by calling
<function>sndbuf_alloc()</function>. The buffer size to
use is normally a small multiple of the 'typical' unit
transfer size for your device.</para>
<para><varname>c</varname> is the
<devicename>pcm</devicename> channel control structure
pointer. This is an opaque object. The function should
store it in the local channel structure, to be used in
later calls to <devicename>pcm</devicename> (e.g.,
<function>chn_intr(c)</function>).</para>
<para><varname>dir</varname> indicates the channel
direction (<literal>PCMDIR_PLAY</literal> or
<literal>PCMDIR_REC</literal>).</para>
</callout>
<callout arearefs="co-chinit-return">
<para>The function should return a pointer to the private
area used to control this channel. This will be passed
as a parameter to other channel interface calls.</para>
</callout>
</calloutlist>
</sect3>
<sect3>
<title>channel_setformat</title>
<para><function>xxxchannel_setformat()</function> should set
up the hardware for the specified channel for the specified
sound format.</para>
<programlisting> static int
xxxchannel_setformat(kobj_t obj, void *data, u_int32_t format)<co id="co-chsetformat-params">
{
struct xxx_chinfo *ch = data;
...
return 0;
}</programlisting>
<calloutlist>
<callout arearefs="co-chsetformat-params">
<para><varname>format</varname> is specified as an
<literal>AFMT_XXX</literal> value
(<filename>soundcard.h</filename>).</para>
</callout>
</calloutlist>
</sect3>
<sect3>
<title>channel_setspeed</title>
<para><function>xxxchannel_setspeed()</function> sets up the
channel hardware for the specified sampling speed, and
returns the possibly adjusted speed.</para>
<programlisting> static int
xxxchannel_setspeed(kobj_t obj, void *data, u_int32_t speed)
{
struct xxx_chinfo *ch = data;
...
return speed;
}</programlisting>
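<para>What "possibly adjusted" means depends on the hardware. For
a device that supports only a fixed set of sampling rates, one
plausible policy is to pick the supported rate closest to the
request and return it. The sketch below assumes such a device; the
rate table and <function>xxx_nearest_rate()</function> are
hypothetical and do not come from any existing driver.</para>

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical list of rates a device supports, in Hz. */
static const uint32_t xxx_rates[] = { 8000, 11025, 22050, 44100, 48000 };

/* Return the supported rate closest to the requested one. */
static uint32_t
xxx_nearest_rate(uint32_t speed)
{
	uint32_t best = xxx_rates[0];
	uint32_t best_diff, diff;
	size_t i;

	best_diff = speed > best ? speed - best : best - speed;
	for (i = 1; i < sizeof(xxx_rates) / sizeof(xxx_rates[0]); i++) {
		diff = speed > xxx_rates[i] ?
		    speed - xxx_rates[i] : xxx_rates[i] - speed;
		if (diff < best_diff) {
			best = xxx_rates[i];
			best_diff = diff;
		}
	}
	return (best);
}
```

<para>A setspeed routine built this way would program the hardware
with the chosen rate and return that same value, so that
<devicename>pcm</devicename> knows the rate actually in use.</para>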
</sect3>
<sect3>
<title>channel_setblocksize</title>
<para><function>xxxchannel_setblocksize()</function> sets the
block size, which is the size of unit transactions between
<devicename>pcm</devicename> and the sound driver, and
between the sound driver and the device. Typically, this
would be the number of bytes transferred before an interrupt
occurs. During a transfer, the sound driver should call
<devicename>pcm</devicename>'s
<function>chn_intr()</function> every time this size has
been transferred.</para>
<para>Most sound drivers only take note of the block size
here, for use when an actual transfer is
started.</para>
<programlisting> static int
xxxchannel_setblocksize(kobj_t obj, void *data, u_int32_t blocksize)
{
struct xxx_chinfo *ch = data;
...
return blocksize;<co id="co-chsetblocksize-return">
}</programlisting>
<calloutlist>
<callout arearefs="co-chsetblocksize-return">
<para>The function returns the possibly adjusted block
size. In case the block size is indeed changed,
<function>sndbuf_resize()</function> should be called to
adjust the buffer.</para>
</callout>
</calloutlist>
</sect3>
<sect3 id="channel-trigger">
<title>channel_trigger</title>
<para><function>xxxchannel_trigger()</function> is called by
<devicename>pcm</devicename> to control data transfer
operations in the driver.</para>
<programlisting> static int
xxxchannel_trigger(kobj_t obj, void *data, int go)<co id="co-chtrigger-params">
{
struct xxx_chinfo *ch = data;
...
return 0;
}</programlisting>
<calloutlist>
<callout arearefs="co-chtrigger-params">
<para><varname>go</varname> defines the action for the
current call. The possible values are:</para>
<itemizedlist>
<listitem>
<para><literal>PCMTRIG_START</literal>: the driver
should start a data transfer from or to the channel
buffer. If needed, the buffer base and size can be
retrieved through
<function>sndbuf_getbuf()</function> and
<function>sndbuf_getsize()</function>.</para>
</listitem>
<listitem>
<para><literal>PCMTRIG_EMLDMAWR</literal> /
<literal>PCMTRIG_EMLDMARD</literal>: this tells the
driver that the input or output buffer may have been
updated. Most drivers just ignore these
calls.</para>
</listitem>
<listitem>
<para><literal>PCMTRIG_STOP</literal> /
<literal>PCMTRIG_ABORT</literal>: the driver should
stop the current transfer.</para>
</listitem>
</itemizedlist>
</callout>
</calloutlist>
<note><para>If the driver uses ISA DMA,
<function>sndbuf_isadma()</function> should be called before
performing actions on the device, and will take care of the
DMA chip side of things.</para>
</note>
</sect3>
<sect3>
<title>channel_getptr</title>
<para><function>xxxchannel_getptr()</function> returns the
current offset in the transfer buffer. This will typically
be called by <function>chn_intr()</function>, and this is how
<devicename>pcm</devicename> knows where it can transfer
new data.</para>
</sect3>
<sect3>
<title>channel_free</title>
<para><function>xxxchannel_free()</function> is called to free
up channel resources, for example when the driver is
unloaded, and should be implemented if the channel data
structures are dynamically allocated or if
<function>sndbuf_alloc()</function> was not used for buffer
allocation.</para>
</sect3>
<sect3>
<title>channel_getcaps</title>
<programlisting> struct pcmchan_caps *
xxxchannel_getcaps(kobj_t obj, void *data)
{
return &amp;xxx_caps;<co id="co-chgetcaps-return">
}</programlisting>
<calloutlist>
<callout arearefs="co-chgetcaps-return">
<para>The routine returns a pointer to a (usually
statically-defined) <structname>pcmchan_caps</structname>
structure (defined in
<filename>sound/pcm/channel.h</filename>). The structure holds
the minimum and maximum sampling frequencies, and the
accepted sound formats. Look at any sound driver for an
example.</para>
</callout>
</calloutlist>
</sect3>
<sect3>
<title>More functions</title>
<para><function>channel_reset()</function>,
<function>channel_resetdone()</function>, and
<function>channel_notify()</function> are for special purposes
and should not be implemented in a driver without discussing
it with the authorities (&a.cg;).</para>
<para><function>channel_setdir()</function> is deprecated.</para>
</sect3>
</sect2>
<sect2>
<title>The MIXER interface</title>
<sect3 id="xxxmixer-init">
<title>mixer_init</title>
<para><function>xxxmixer_init()</function> initializes the
hardware and tells <devicename>pcm</devicename> what mixer
devices are available for playing and recording.</para>
<programlisting>    static int
    xxxmixer_init(struct snd_mixer *m)
    {
        struct xxx_info *sc = mix_getdevinfo(m);
        u_int32_t v;

        [Initialize hardware]

        [Set appropriate bits in v for play mixers]<co id="co-mxini-sd">
        mix_setdevs(m, v);
        [Set appropriate bits in v for record mixers]
        mix_setrecdevs(m, v);

        return 0;
    }</programlisting>
<calloutlist>
<callout arearefs="co-mxini-sd">
<para>Set bits in an integer value and call
<function>mix_setdevs()</function> and
<function>mix_setrecdevs()</function> to tell
<devicename>pcm</devicename> what devices exist.</para>
</callout>
</calloutlist>
<para>Mixer bits definitions can be found in
<filename>soundcard.h</filename>
(<literal>SOUND_MASK_XXX</literal> values and
<literal>SOUND_MIXER_XXX</literal> bit shifts).</para>
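<para>As a rough sketch of how such masks are put together, the
fragment below builds a play mask and a record mask for an
imaginary card. The constants are local copies of a few
<filename>soundcard.h</filename> definitions, repeated here (with
assumed values) only so the fragment compiles on its own; the
helper names <function>xxx_play_devs()</function> and
<function>xxx_rec_devs()</function> are hypothetical.</para>

```c
#include <stdint.h>

/*
 * Local copies of a few soundcard.h definitions (assumed values),
 * repeated only so this fragment is self-contained.
 */
#define SOUND_MIXER_VOLUME  0
#define SOUND_MIXER_PCM     4
#define SOUND_MIXER_LINE    6
#define SOUND_MIXER_MIC     7

#define SOUND_MASK_VOLUME   (1 << SOUND_MIXER_VOLUME)
#define SOUND_MASK_PCM      (1 << SOUND_MIXER_PCM)
#define SOUND_MASK_LINE     (1 << SOUND_MIXER_LINE)
#define SOUND_MASK_MIC      (1 << SOUND_MIXER_MIC)

/* Hypothetical helper: mixer devices the imaginary card can play. */
static uint32_t
xxx_play_devs(void)
{
    return SOUND_MASK_VOLUME | SOUND_MASK_PCM |
        SOUND_MASK_LINE | SOUND_MASK_MIC;
}

/* Hypothetical helper: mixer devices the imaginary card can record from. */
static uint32_t
xxx_rec_devs(void)
{
    return SOUND_MASK_LINE | SOUND_MASK_MIC;
}
```

<para>In a driver, the two results would be the values passed to
<function>mix_setdevs()</function> and
<function>mix_setrecdevs()</function>.</para>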
</sect3>
<sect3>
<title>mixer_set</title>
<para><function>xxxmixer_set()</function> sets the volume
level for one mixer device.</para>
<programlisting>    static int
    xxxmixer_set(struct snd_mixer *m, unsigned dev,
                 unsigned left, unsigned right)<co id="co-mxset-params">
    {
        struct xxx_info *sc = mix_getdevinfo(m);

        [set volume level]

        return left | (right << 8);<co id="co-mxset-return">
    }</programlisting>
<calloutlist>
<callout arearefs="co-mxset-params">
<para>The device is specified as a SOUND_MIXER_XXX
value.</para> <para>The volume values are specified in the
range [0-100]. A value of zero should mute the
device.</para>
</callout>
<callout arearefs="co-mxset-return">
<para>As the hardware levels probably won't match the
input scale, and some rounding will occur, the routine
returns the actual level values (in range 0-100) as
shown.</para>
</callout>
</calloutlist>
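<para>The rounding mentioned above can be illustrated with a small
sketch. It assumes a hypothetical 5-bit hardware volume register
(values 0 to 31); the names <literal>XXX_HW_MAX</literal>,
<function>xxx_to_hw()</function>,
<function>xxx_from_hw()</function> and
<function>xxx_set_levels()</function> are made up for this
example and are not part of <devicename>pcm</devicename>.</para>

```c
#include <assert.h>

#define XXX_HW_MAX 31   /* hypothetical 5-bit hardware volume register */

/* Map a 0-100 level onto the hardware range, rounding to nearest. */
static unsigned
xxx_to_hw(unsigned level)
{
    assert(level <= 100);
    return (level * XXX_HW_MAX + 50) / 100;
}

/* Map a hardware value back onto the 0-100 scale, rounding to nearest. */
static unsigned
xxx_from_hw(unsigned hw)
{
    assert(hw <= XXX_HW_MAX);
    return (hw * 100 + XXX_HW_MAX / 2) / XXX_HW_MAX;
}

/*
 * Compute the levels the hardware would really use and return them
 * packed the way mixer_set reports them: left | (right << 8).
 * Programming the register itself is left out of this sketch.
 */
static int
xxx_set_levels(unsigned left, unsigned right)
{
    unsigned l = xxx_from_hw(xxx_to_hw(left));
    unsigned r = xxx_from_hw(xxx_to_hw(right));

    return (int)(l | (r << 8));
}
```

<para>With this scale a requested level of 50 cannot be
represented exactly, so the value reported back differs slightly
from the request.</para>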
</sect3>
<sect3>
<title>mixer_setrecsrc</title>
<para><function>xxxmixer_setrecsrc()</function> sets the
recording source device.</para>
<programlisting> static int
xxxmixer_setrecsrc(struct snd_mixer *m, u_int32_t src)<co id="co-mxsr-params">
{
struct xxx_info *sc = mix_getdevinfo(m);
[look for non zero bit(s) in src, set up hardware]
[update src to reflect actual action]
return src;<co id="co-mxsr-return">
}</programlisting>
<calloutlist>
<callout arearefs="co-mxsr-params">
<para>The desired recording devices are specified as a
bit field.</para>
</callout>
<callout arearefs="co-mxsr-return">
<para>The actual devices set for recording are returned.
Some drivers can only set one device for recording. The
function should return -1 if an error occurs.</para>
</callout>
</calloutlist>
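<para>For a driver that can only record from one device at a time,
the selection step might look like the sketch below. The mask
<literal>XXX_REC_SUPPORTED</literal> and the helper
<function>xxx_pick_recsrc()</function> are hypothetical; the
sketch simply keeps the lowest requested bit that the imaginary
hardware supports.</para>

```c
#include <stdint.h>

/* Hypothetical: the card can record from line (bit 6) or mic (bit 7). */
#define XXX_REC_SUPPORTED 0x000000c0u

/*
 * Return the mask of the single source that would be selected,
 * or -1 if none of the requested sources is supported.
 */
static int
xxx_pick_recsrc(uint32_t src)
{
    uint32_t usable = src & XXX_REC_SUPPORTED;

    if (usable == 0)
        return -1;
    /* Isolate the lowest set bit: only one source at a time. */
    return (int)(usable & (~usable + 1));
}
```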
</sect3>
<sect3>
<title>mixer_uninit, mixer_reinit</title>
<para><function>xxxmixer_uninit()</function> should ensure
that all sound is muted and, if possible, that the mixer hardware
is powered down.</para>
<para><function>xxxmixer_reinit()</function> should ensure
that the mixer hardware is powered up and any settings not
controlled by <function>mixer_set()</function> or
<function>mixer_setrecsrc()</function> are restored.</para>
</sect3>
</sect2>
<sect2>
<title>The AC97 interface</title>
<para>The <emphasis>AC97</emphasis> interface is implemented
by drivers with an AC97 codec. It only has three methods:</para>
<itemizedlist>
<listitem><para><function>xxxac97_init()</function> returns
the number of AC97 codecs found.</para>
</listitem>
<listitem><para><function>ac97_read()</function> and
<function>ac97_write()</function> read or write a specified
register.</para>
</listitem>
</itemizedlist>
<para>The <emphasis>AC97</emphasis> interface is used by the
AC97 code in <devicename>pcm</devicename> to perform higher
level operations. Look at
<filename>sound/pci/maestro3.c</filename> or many others under
<filename>sound/pci/</filename> for an example.</para>
</sect2>
</sect1>
</chapter>
<!--
Local Variables:
mode: sgml
sgml-declaration: "../chapter.decl"
sgml-indent-data: t
sgml-omittag: nil
sgml-always-quote-attributes: t
sgml-parent-document: ("../book.sgml" "part" "chapter")
End:
-->


@ -1,161 +0,0 @@
<!--
The FreeBSD Documentation Project
$FreeBSD$
-->
<chapter id="sysinit">
<title>The Sysinit Framework</title>
<para>Sysinit is the framework for a generic call sort and dispatch
mechanism. FreeBSD currently uses it for the dynamic
initialization of the kernel. Sysinit allows FreeBSD's kernel
subsystems to be reordered, added, removed, and replaced at
kernel link time, when the kernel or one of its modules is loaded,
without having to edit a statically ordered initialization routine
and recompile the kernel. This system also allows kernel modules,
currently called <firstterm>KLDs</firstterm>, to be separately
compiled, linked, and initialized at boot time, or loaded even
later while the system is already running. This is accomplished
using the <quote>kernel linker</quote> and <quote>linker
sets</quote>.</para>
<sect1 id="sysinit-term">
<title>Terminology</title>
<variablelist>
<varlistentry>
<term>Linker Set</term>
<listitem>
<para>A linker technique in which the linker gathers
statically declared data throughout a program's source files
into a single contiguously addressable unit of
data.</para>
</listitem>
</varlistentry>
</variablelist>
</sect1>
<sect1 id="sysinit-operation">
<title>Sysinit Operation</title>
<para>Sysinit relies on the ability of the linker to take static
data declared at multiple locations throughout a program's
source and group it together as a single contiguous chunk of
data. This linker technique is called a <quote>linker
set</quote>. Sysinit uses two linker sets to maintain two data
sets containing each consumer's call order, function, and a
pointer to the data to pass to that function.</para>
<para>Sysinit uses two priorities when ordering the functions for
execution. The first priority is a subsystem ID giving an
overall order for Sysinit's dispatch of functions. Current predeclared
IDs are in <filename>&lt;sys/kernel.h></filename> in the enum
list <literal>sysinit_sub_id</literal>. The second priority used
is an element order within the subsystem. Current predeclared
subsystem element orders are in
<filename>&lt;sys/kernel.h></filename> in the enum list
<literal>sysinit_elem_order</literal>.</para>
<para>There are currently two uses for Sysinit: function dispatch
at system startup and kernel module load, and function dispatch
at system shutdown and kernel module unload.</para>
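<para>The two-level ordering described above can be sketched in
user space as follows. This is only an illustration under stated
assumptions: <structname>demo_sysinit</structname> is a
hypothetical stand-in, not the kernel's own structure, and
<function>qsort()</function> is used purely for brevity, so
entries with identical keys may run in any order.</para>

```c
#include <stdlib.h>

/* Hypothetical stand-in for one Sysinit entry. */
struct demo_sysinit {
    unsigned subsystem;             /* first priority */
    unsigned order;                 /* second priority, within subsystem */
    void (*func)(const void *);     /* function to dispatch */
    const void *udata;              /* data passed to the function */
};

/* Compare by subsystem first, then by element order. */
static int
demo_sysinit_cmp(const void *a, const void *b)
{
    const struct demo_sysinit *x = a, *y = b;

    if (x->subsystem != y->subsystem)
        return (x->subsystem < y->subsystem) ? -1 : 1;
    if (x->order != y->order)
        return (x->order < y->order) ? -1 : 1;
    return 0;
}

/* Sort the set, then call every function in the resulting order. */
static void
demo_sysinit_run(struct demo_sysinit *set, size_t n)
{
    size_t i;

    qsort(set, n, sizeof(set[0]), demo_sysinit_cmp);
    for (i = 0; i < n; i++)
        set[i].func(set[i].udata);
}

/* Demonstration callback: records the first character of its argument. */
static char demo_trace[8];
static size_t demo_trace_len;

static void
demo_note(const void *udata)
{
    if (demo_trace_len < sizeof(demo_trace))
        demo_trace[demo_trace_len++] = *(const char *)udata;
}
```

<para>Registering entries in any order and then calling
<function>demo_sysinit_run()</function> dispatches them sorted by
subsystem, then by element order.</para>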
</sect1>
<sect1 id="sysinit-using">
<title>Using Sysinit</title>
<sect2>
<title>Interface</title>
<sect3>
<title>Headers</title>
<programlisting>&lt;sys/kernel.h></programlisting>
</sect3>
<sect3>
<title>Macros</title>
<programlisting>SYSINIT(uniquifier, subsystem, order, func, ident)
SYSUNINIT(uniquifier, subsystem, order, func, ident)</programlisting>
</sect3>
</sect2>
<sect2>
<title>Startup</title>
<para>The <literal>SYSINIT()</literal> macro creates the
necessary sysinit data in Sysinit's startup data set for
Sysinit to sort and dispatch a function at system startup and
module load. <literal>SYSINIT()</literal> takes a uniquifier
that Sysinit uses to identify the particular function dispatch
data, the subsystem order, the subsystem element order, the
function to call, and the data to pass to the function. All
functions must take a constant pointer argument.
</para>
<para>For example:</para>
<programlisting>#include &lt;sys/kernel.h>

void foo_null(void *unused)
{
        foo_doo();
}
SYSINIT(foo, SI_SUB_FOO, SI_ORDER_FOO, foo_null, NULL);

struct foo foo_voodoo = {
        FOO_VOODOO
};

void foo_arg(void *vdata)
{
        struct foo *foo = (struct foo *)vdata;
        foo_data(foo);
}
SYSINIT(bar, SI_SUB_FOO, SI_ORDER_FOO, foo_arg, &amp;foo_voodoo);
</programlisting>
</sect2>
<sect2>
<title>Shutdown</title>
<para>The <literal>SYSUNINIT()</literal> macro behaves similarly
to the <literal>SYSINIT()</literal> macro except that it adds
the Sysinit data to Sysinit's shutdown data set.</para>
<para>For example:</para>
<programlisting>#include &lt;sys/kernel.h>

void foo_cleanup(void *unused)
{
        foo_kill();
}
SYSUNINIT(foobar, SI_SUB_FOO, SI_ORDER_FOO, foo_cleanup, NULL);

struct foo_stack foo_stack = {
        FOO_STACK_VOODOO
};

void foo_flush(void *vdata)
{
}
SYSUNINIT(barfoo, SI_SUB_FOO, SI_ORDER_FOO, foo_flush, &amp;foo_stack);
</programlisting>
</sect2>
</sect1>
</chapter>
<!--
Local Variables:
mode: sgml
sgml-declaration: "../chapter.decl"
sgml-indent-data: t
sgml-omittag: nil
sgml-always-quote-attributes: t
sgml-parent-document: ("../book.sgml" "part" "chapter")
End:
-->


@ -1,623 +0,0 @@
<!--
The FreeBSD Documentation Project
$FreeBSD$
-->
<chapter id="usb">
<title>USB Devices</title>
<para><emphasis>This chapter was written by &a.nhibma;. Modifications made for
the handbook by &a.murray;.</emphasis></para>
<sect1 id="usb-intro">
<title>Introduction</title>
<para>The Universal Serial Bus (USB) is a new way of attaching
devices to personal computers. The bus architecture features
two-way communication and has been developed as a response to
devices becoming smarter and requiring more interaction with the
host. USB support is included in all current PC chipsets and is
therefore available in all recently built PCs. Apple's
introduction of the USB-only iMac has been a major incentive for
hardware manufacturers to produce USB versions of their devices.
The future PC specifications specify that all legacy connectors
on PCs should be replaced by one or more USB connectors,
providing generic plug and play capabilities. Support for USB
hardware was available at a very early stage in NetBSD and was
developed by Lennart Augustsson for the NetBSD project. The
code has been ported to FreeBSD and we are currently maintaining
a shared code base. For the implementation of the USB subsystem
a number of features of USB are important.</para>
<para><emphasis>Lennart Augustsson has done most of the implementation of
the USB support for the NetBSD project. Many thanks for this
incredible amount of work. Many thanks also to Ardy and Dirk for
their comments and proofreading of this paper.</emphasis></para>
<itemizedlist>
<listitem><para>Devices connect to ports on the computer
directly or on devices called hubs, forming a treelike device
structure.</para></listitem>
<listitem><para>The devices can be connected and disconnected at
run time.</para></listitem>
<listitem><para>Devices can suspend themselves and trigger
resumes of the host system.</para></listitem>
<listitem><para>As the devices can be powered from the bus, the
host software has to keep track of power budgets for each
hub.</para></listitem>
<listitem><para>Different quality of service requirements by the
different device types, together with the maximum of 126
devices that can be connected to the same bus, require proper
scheduling of transfers on the shared bus to take full
advantage of the 12Mbps bandwidth available (over 400Mbps
with USB 2.0).</para></listitem>
<listitem><para>Devices are intelligent and contain easily
accessible information about themselves.</para></listitem>
</itemizedlist>
<para>The development of drivers for the USB subsystem and devices
connected to it is supported by the specifications that have
been developed and will be developed. These specifications are
publicly available from the USB home pages. Apple has been very
strong in pushing for standards-based drivers, by making drivers
for the generic classes available in their operating system
MacOS and discouraging the use of separate drivers for each new
device. This chapter tries to collate essential information for a
basic understanding of the present implementation of the USB
stack in FreeBSD/NetBSD. It is recommended, however, to read it
together with the relevant specifications mentioned in the
references below.</para>
<sect2>
<title>Structure of the USB Stack</title>
<para>The USB support in FreeBSD can be split into three
layers. The lowest layer contains the host controller driver,
providing a generic interface to the hardware and its scheduling
facilities. It supports initialisation of the hardware,
scheduling of transfers and handling of completed and/or failed
transfers. Each host controller driver implements a virtual hub
providing hardware independent access to the registers
controlling the root ports on the back of the machine.</para>
<para>The middle layer handles the device connection and
disconnection, basic initialisation of the device, driver
selection, the communication channels (pipes) and does
resource management. This services layer also controls the
default pipes and the device requests transferred over
them.</para>
<para>The top layer contains the individual drivers supporting
specific (classes of) devices. These drivers implement the
protocol that is used over the pipes other than the default
pipe. They also implement additional functionality to make the
device available to other parts of the kernel or userland. They
use the USB driver interface (USBDI) exposed by the services
layer.</para>
</sect2>
</sect1>
<sect1 id="usb-hc">
<title>Host Controllers</title>
<para>The host controller (HC) controls the transmission of
packets on the bus. Frames of 1 millisecond are used. At the
start of each frame the host controller generates a Start of
Frame (SOF) packet.</para>
<para>The SOF packet is used to synchronise to the start of the
frame and to keep track of the frame number. Within each frame
packets are transferred, either from host to device (out) or
from device to host (in). Transfers are always initiated by the
host (polled transfers). Therefore there can only be one host
per USB bus. Each transfer of a packet has a status stage in
which the recipient of the data can return either ACK
(acknowledge reception), NAK (retry), STALL (error condition) or
nothing (garbled data stage, device not available or
disconnected). Section 8.5 of the <ulink
url="http://www.usb.org/developers/docs.html">USB
specification</ulink> explains packets in more
detail. Four different types of transfers can occur on a USB
bus: control, bulk, interrupt and isochronous. The types of
transfers and their characteristics are described below
(<quote>Pipes</quote> subsection).</para>
<para>Large transfers between the device on the USB bus and the
device driver are split up into multiple packets by the host
controller or the HC driver.</para>
<para>Device requests (control transfers) to the default endpoints
are special. They consist of two or three phases: SETUP, DATA
(optional) and STATUS. The set-up packet is sent to the
device. If there is a data phase, the direction of the data
packet(s) is given in the set-up packet. The direction in the
status phase is the opposite of the direction during the data
phase, or IN if there was no data phase.</para>
<para>The host controller
hardware also provides registers with the current status of the
root ports and the changes that have occurred since the last
reset of the status change register. Access to these registers
is provided through a virtualised hub as suggested in the USB
specification [2]. The virtual hub must comply with the hub
device class given in chapter 11 of that specification. It must
provide a default pipe through which device requests can be sent
to it. It returns the standard and hub class specific set of
descriptors. It should also provide an interrupt pipe that
reports changes happening at its ports.</para>
<para>There are currently two
specifications for host controllers available: <ulink
url="http://developer.intel.com/design/USB/UHCI11D.htm">Universal
Host Controller Interface</ulink> (UHCI; Intel) and <ulink
url="http://www.compaq.com/productinfo/development/openhci.html">Open
Host Controller Interface</ulink> (OHCI; Compaq, Microsoft,
National Semiconductor). The UHCI specification has been
designed to reduce hardware complexity by requiring the host
controller driver to supply a complete schedule of the transfers
for each frame. OHCI type controllers are much more independent:
they provide a more abstract interface and do a lot of the work
themselves.</para>
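<para>The rule for the direction of the status phase of a device
request, as described earlier in this section, can be written
down in a few lines. The enum and function names below are
hypothetical and exist only for this sketch.</para>

```c
/* Direction of a phase of a control transfer (names are hypothetical). */
enum demo_dir {
    DEMO_DIR_NONE,  /* no data phase */
    DEMO_DIR_OUT,   /* host to device */
    DEMO_DIR_IN     /* device to host */
};

/*
 * The status phase runs opposite to the data phase,
 * or IN when the request has no data phase.
 */
static enum demo_dir
demo_status_dir(enum demo_dir data_dir)
{
    switch (data_dir) {
    case DEMO_DIR_IN:
        return DEMO_DIR_OUT;
    case DEMO_DIR_OUT:
        return DEMO_DIR_IN;
    default:
        return DEMO_DIR_IN;
    }
}
```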
<sect2>
<title>UHCI</title>
<para>The UHCI host controller maintains a framelist with 1024
pointers to per frame data structures. It understands two
different data types: transfer descriptors (TD) and queue
heads (QH). Each TD represents a packet to be communicated to
or from a device endpoint. QHs are a means to group TDs (and
QHs) together.</para>
<para>Each transfer consists of one or more packets. The UHCI
driver splits large transfers into multiple packets. For every
transfer, apart from isochronous transfers, a QH is
allocated. For every type of transfer these QHs are collected
at a QH for that type. Isochronous transfers have to be
executed first because of the fixed latency requirement and
are directly referred to by the pointer in the framelist. The
last isochronous TD refers to the QH for interrupt transfers
for that frame. All QHs for interrupt transfers point at the
QH for control transfers, which in turn points at the QH for
bulk transfers. The following diagram gives a graphical
overview of this:</para>
<para>This results in the following schedule being run in each
frame. After fetching the pointer for the current frame from
the framelist, the controller first executes the TDs for all
the isochronous packets in that frame. The last of these TDs
refers to the QH for the interrupt transfers for
that frame. The host controller will then descend from that QH
to the QHs for the individual interrupt transfers. After
finishing that queue, the QH for the interrupt transfers will
refer the controller to the QH for all control transfers. It
will execute all the subqueues scheduled there, followed by
all the transfers queued at the bulk QH.</para>
<para>To facilitate the
handling of finished or failed transfers, different types of
interrupts are generated by the hardware at the end of each
frame. In the last TD for a transfer the Interrupt-On-Completion
bit is set by the HC driver to flag an interrupt
when the transfer has completed. An error interrupt is flagged
if a TD reaches its maximum error count. If the short packet
detect bit is set in a TD and less than the set packet length
is transferred, an interrupt is flagged to notify
the controller driver of the completed transfer. It is the host
controller driver's task to find out which transfer has
completed or produced an error. When called, the interrupt
service routine will locate all the finished transfers and
call their callbacks.</para>
<para>For a more elaborate description, see the <ulink
url="http://developer.intel.com/design/USB/UHCI11D.htm">UHCI
specification</ulink>.</para>
</sect2>
<sect2>
<title>OHCI</title>
<para>Programming an OHCI host controller is much simpler. The
controller assumes that a set of endpoints is available, and
is aware of scheduling priorities and the ordering of the
types of transfers in a frame. The main data structure used by
the host controller is the endpoint descriptor (ED), to which
a queue of transfer descriptors (TDs) is attached. The ED
contains the maximum packet size allowed for an endpoint and
the controller hardware does the splitting into packets. The
pointers to the data buffers are updated after each transfer
and when the start and end pointer are equal, the TD is
retired to the done-queue. The four types of endpoints have
their own queues. Control and bulk endpoints are each queued at
their own queue. Interrupt EDs are queued in a tree, with the
level in the tree defining the frequency at which they
run.</para>
<para>[Figure omitted; labels: framelist, interrupt, isochronous, control, bulk]</para>
<para>The schedule being run by the host controller in each
frame looks as follows. The controller will first run the
non-periodic control and bulk queues, up to a time limit set
by the HC driver. Then the interrupt transfers for that frame
number are run, by using the lower five bits of the frame
number as an index into level 0 of the tree of interrupt
EDs. At the end of this tree the isochronous EDs are connected
and these are traversed subsequently. The isochronous TDs
contain the frame number of the first frame the transfer
should be run in. After all the periodic transfers have been
run, the control and bulk queues are traversed
again. Periodically the interrupt service routine is called to
process the done queue and call the callbacks for each
transfer and reschedule interrupt and isochronous
endpoints.</para>
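<para>The indexing step for the interrupt tree is simple enough to
show directly. The sketch below assumes the 32 level-0 slots
implied by using the lower five bits of the frame number; the
names are hypothetical.</para>

```c
#include <stdint.h>

#define DEMO_INTR_SLOTS 32  /* 2^5 slots at level 0 of the tree */

/* Pick the level-0 slot for a frame from its lower five bits. */
static unsigned
demo_intr_slot(uint32_t frame_number)
{
    return frame_number & (DEMO_INTR_SLOTS - 1);
}
```

<para>Consecutive frames therefore walk through the slots in
order and wrap around every 32 frames.</para>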
<para>For a more elaborate description, see the <ulink
url="http://www.compaq.com/productinfo/development/openhci.html">
OHCI specification</ulink>.</para>
<para>The middle layer, the services layer,
provides access to the device in a controlled way and
maintains resources in use by the different drivers and by the
services layer itself. The layer takes care of the following
aspects:</para>
<itemizedlist>
<listitem><para>The device configuration
information</para></listitem>
<listitem><para>The pipes to communicate with a
device</para></listitem>
<listitem><para>Probing, attaching to and detaching from a
device.</para></listitem>
</itemizedlist>
</sect2>
</sect1>
<sect1 id="usb-dev">
<title>USB Device Information</title>
<sect2>
<title>Device configuration information</title>
<para>Each device provides different levels of configuration
information. Each device has one or more configurations, of
which one is selected during probe/attach. A configuration
provides power and bandwidth requirements. Within each
configuration there can be multiple interfaces. A device
interface is a collection of endpoints. For example USB
speakers can have an interface for the audio data (Audio
Class) and an interface for the knobs, dials and buttons (HID
Class). All interfaces in a configuration are active at the
same time and can be attached to by different drivers. Each
interface can have alternates, providing different quality of
service parameters. In cameras, for example, this is used to
provide different frame sizes and numbers of frames per
second.</para>
<para>Within each interface 0 or more endpoints can be
specified. Endpoints are the unidirectional access points for
communicating with a device. They provide buffers to
temporarily store incoming or outgoing data from the
device. Each endpoint has a unique address within
a configuration, the endpoint's number plus its direction. The
default endpoint, endpoint 0, is not part of any interface and
available in all configurations. It is managed by the services
layer and not directly available to device drivers.</para>
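<para>As an illustration of an address made up of endpoint number
plus direction, the sketch below packs both into one byte the way
the endpoint descriptor of the USB specification does (number in
the low four bits, direction in bit 7). The macro and function
names are hypothetical.</para>

```c
#include <stdint.h>

#define DEMO_EP_DIR_IN   0x80   /* bit 7 set: device to host */
#define DEMO_EP_NUM_MASK 0x0f   /* bits 3..0: endpoint number */

/* Combine an endpoint number and a direction into one address byte. */
static uint8_t
demo_ep_address(unsigned number, int is_in)
{
    return (uint8_t)((number & DEMO_EP_NUM_MASK) |
        (is_in ? DEMO_EP_DIR_IN : 0));
}

/* Extract the endpoint number from an address byte. */
static unsigned
demo_ep_number(uint8_t addr)
{
    return addr & DEMO_EP_NUM_MASK;
}

/* Non-zero if the address describes an IN endpoint. */
static int
demo_ep_is_in(uint8_t addr)
{
    return (addr & DEMO_EP_DIR_IN) != 0;
}
```

<para>Endpoint 1 IN and endpoint 1 OUT thus have different
addresses even though they share a number.</para>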
<para>[Figure omitted; labels: Level 0, Level 1, Level 2, Slot 0 to Slot 3 (only 4 out of 32 slots shown)]</para>
<para>This hierarchical configuration information is described
in the device by a standard set of descriptors (see section 9.6
of the USB specification [2]). They can be requested through
the Get Descriptor Request. The services layer caches these
descriptors to avoid unnecessary transfers on the USB
bus. Access to the descriptors is provided through function
calls.</para>
<itemizedlist>
<listitem><para>Device descriptors: General information about
the device, like Vendor, Product and Revision Id, supported
device class, subclass and protocol if applicable, maximum
packet size for the default endpoint, etc.</para></listitem>
<listitem><para>Configuration descriptors: The number of
interfaces in this configuration, suspend and resume
functionality supported and power
requirements.</para></listitem>
<listitem><para>Interface descriptors: interface class,
subclass and protocol if applicable, number of alternate
settings for the interface and the number of
endpoints.</para></listitem>
<listitem><para>Endpoint descriptors: Endpoint address,
direction and type, maximum packet size supported and
polling frequency if type is interrupt endpoint. There is no
descriptor for the default endpoint (endpoint 0) and it is
never counted in an interface descriptor.</para></listitem>
<listitem><para>String descriptors: In the other descriptors
string indices are supplied for some fields. These can be
used to retrieve descriptive strings, possibly in multiple
languages.</para></listitem>
</itemizedlist>
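<para>To make the device descriptor more concrete, the sketch
below decodes the 18-byte standard device descriptor from a raw
buffer. Field names follow the USB specification; the structure
<structname>demo_device_desc</structname> and the function
<function>demo_parse_device_desc()</function> are hypothetical
and are not the accessor functions offered by the services
layer.</para>

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_DESC_TYPE_DEVICE 1     /* bDescriptorType of a device descriptor */
#define DEMO_DEVICE_DESC_SIZE 18    /* bLength of a device descriptor */

struct demo_device_desc {
    uint16_t bcdUSB;
    uint8_t  bDeviceClass;
    uint8_t  bDeviceSubClass;
    uint8_t  bDeviceProtocol;
    uint8_t  bMaxPacketSize0;       /* max packet size of endpoint 0 */
    uint16_t idVendor;
    uint16_t idProduct;
    uint16_t bcdDevice;
    uint8_t  iManufacturer;         /* string descriptor indices */
    uint8_t  iProduct;
    uint8_t  iSerialNumber;
    uint8_t  bNumConfigurations;
};

/* Multi-byte descriptor fields are little-endian on the wire. */
static uint16_t
demo_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Returns 0 on success, -1 if the buffer is not a device descriptor. */
static int
demo_parse_device_desc(const uint8_t *buf, size_t len,
    struct demo_device_desc *d)
{
    if (len < DEMO_DEVICE_DESC_SIZE ||
        buf[0] != DEMO_DEVICE_DESC_SIZE ||
        buf[1] != DEMO_DESC_TYPE_DEVICE)
        return -1;

    d->bcdUSB = demo_le16(buf + 2);
    d->bDeviceClass = buf[4];
    d->bDeviceSubClass = buf[5];
    d->bDeviceProtocol = buf[6];
    d->bMaxPacketSize0 = buf[7];
    d->idVendor = demo_le16(buf + 8);
    d->idProduct = demo_le16(buf + 10);
    d->bcdDevice = demo_le16(buf + 12);
    d->iManufacturer = buf[14];
    d->iProduct = buf[15];
    d->iSerialNumber = buf[16];
    d->bNumConfigurations = buf[17];
    return 0;
}
```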
<para>Class specifications can add their own descriptor types
that are available through the GetDescriptor Request.</para>
</sect2>
<sect2>
<title>Pipes</title>
<para>Communication to endpoints on a device flows
through so-called pipes. Drivers submit transfers to endpoints
to a pipe and provide a callback to be called on completion or
failure of the transfer (asynchronous transfers) or wait for
completion (synchronous transfer). Transfers to an endpoint
are serialised in the pipe. A transfer can either complete,
fail or time out (if a time-out has been set). There are two
types of time-outs for transfers. Time-outs can happen due to
a time-out on the USB bus (milliseconds). These time-outs are
seen as failures and can be due to disconnection of the
device. A second form of time-out is implemented in software
and is triggered when a transfer does not complete within a
specified amount of time (seconds). These are caused by a
device acknowledging negatively (NAK) the transferred
packets. The cause for this is the device not being ready to
receive data, buffer under- or overrun, or protocol
errors.</para>
<para>If a transfer over a pipe is larger than the maximum
packet size specified in the associated endpoint descriptor,
the host controller (OHCI) or the HC driver (UHCI) will split
the transfer into packets of maximum packet size, with the
last packet possibly smaller than the maximum
packet size.</para>
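<para>The arithmetic behind this splitting is shown in the sketch
below. The names are hypothetical, and zero-length transfers are
simply reported as zero packets here, which glosses over how a
real controller handles them.</para>

```c
#include <stddef.h>

struct demo_split {
    size_t packets;     /* number of packets on the bus */
    size_t last_len;    /* length of the final packet */
};

/* Split a transfer of len bytes into packets of at most max_packet bytes. */
static struct demo_split
demo_split_transfer(size_t len, size_t max_packet)
{
    struct demo_split s = { 0, 0 };

    if (len == 0 || max_packet == 0)
        return s;
    s.packets = (len + max_packet - 1) / max_packet;    /* round up */
    s.last_len = len - (s.packets - 1) * max_packet;
    return s;
}
```

<para>A 200 byte transfer to an endpoint with a maximum packet
size of 64 bytes becomes three full packets and one packet of 8
bytes.</para>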
<para>Sometimes it is not a problem for a device to return less
data than requested. For example, a bulk-in transfer to a modem
might request 200 bytes of data, but the modem has only 5
bytes available at that time. The driver can set the short
packet (SPD) flag. It allows the host controller to accept a
packet even if the amount of data transferred is less than
requested. This flag is only valid for in-transfers, as the
amount of data to be sent to a device is always known
beforehand.</para>
<para>If an unrecoverable error occurs in a device
during a transfer, the pipe is stalled. Before any more data is
accepted or sent, the driver needs to resolve the cause of the
stall and clear the endpoint stall condition by sending the
clear endpoint halt device request over the default
pipe. The default endpoint should never stall.</para>
<para>There are four different types of endpoints and
corresponding pipes:</para>
<itemizedlist>
<listitem><para>Control pipe / default pipe: There is
one control pipe per device, connected to the default endpoint
(endpoint 0). The pipe carries the device requests and
associated data. The difference between transfers over the
default pipe and other pipes is that the protocol for
the transfers is described in the USB specification [2]. These
requests are used to reset and configure the device. A basic
set of commands that must be supported by each device is
provided in chapter 9 of the USB specification [2]. The
commands supported on this pipe can be extended by a device
class specification to support additional
functionality.</para></listitem>
<listitem><para>Bulk pipe: This is the USB equivalent to a raw
transmission medium.</para></listitem>
<listitem><para>Interrupt pipe: The host sends a request for
data to the device and if the device has nothing to send, it
will NAK the data packet. Interrupt transfers are scheduled
at a frequency specified when creating the
pipe.</para></listitem>
<listitem><para>Isochronous pipe: These pipes are intended for
isochronous data, for example video or audio streams, with
fixed latency, but no guaranteed delivery. Some support for
pipes of this type is available in the current
implementation.</para></listitem>
</itemizedlist>
<para>Packets in control, bulk and interrupt
transfers are retried if an error occurs during transmission
or the device acknowledges the packet negatively (NAK) due to,
for example, lack of buffer space to store the incoming
data. Isochronous packets are, however, not retried in case of
failed delivery or NAK of a packet, as this might violate the
timing constraints.</para>
<para>The availability of the necessary bandwidth is calculated
during the creation of the pipe. Transfers are scheduled within
frames of 1 millisecond. The bandwidth allocation within a
frame is prescribed by the USB specification, section 5.6
[2]. Isochronous and interrupt transfers are allowed to consume
up to 90% of the bandwidth within a frame. Packets for control
and bulk transfers are scheduled after all isochronous and
interrupt packets and will consume all the remaining
bandwidth.</para>
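<para>A deliberately simplified version of this admission check is
sketched below. It assumes a 12Mbps bus, so that one 1 millisecond
frame carries 1500 bytes, and it ignores protocol overhead and bit
stuffing, which a real implementation has to account for. All
names are hypothetical.</para>

```c
/* 12 Mbit/s * 1 ms / 8 bits per byte, ignoring protocol overhead. */
#define DEMO_FRAME_BYTES      1500u
/* Share of a frame that periodic transfers may use. */
#define DEMO_PERIODIC_PERCENT 90u

/*
 * Returns non-zero if a new isochronous or interrupt pipe needing
 * `wanted` bytes per frame still fits next to the `allocated`
 * bytes per frame that are already reserved.
 */
static int
demo_periodic_fits(unsigned allocated, unsigned wanted)
{
    unsigned budget = DEMO_FRAME_BYTES * DEMO_PERIODIC_PERCENT / 100;

    return allocated <= budget && wanted <= budget - allocated;
}
```

<para>Under these assumptions the periodic budget is 1350 bytes
per frame; whatever is left over is available to control and bulk
transfers.</para>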
<para>More information on scheduling of transfers and bandwidth
reclamation can be found in chapter 5 of the USB specification
[2], section 1.3 of the UHCI specification [3] and section
3.4.2 of the OHCI specification [4].</para>
</sect2>
</sect1>
<sect1 id="usb-devprobe">
<title>Device probe and attach</title>
<para>After the notification by the hub that a new device has been
connected, the services layer switches on the port, providing the
device with 100 mA of current. At this point the device is in
its default state and listening to device address 0. The
services layer will proceed to retrieve the various descriptors
through the default pipe. After that it will send a Set Address
request to move the device away from the default device address
(address 0).</para>
<para>Multiple device drivers might be able to support
the device. For example, a modem driver might be able to support
an ISDN TA through the AT compatibility interface. A driver for
that specific model of the ISDN adapter might, however, be able to
provide much better support for this device. To support this
flexibility, the probes return priorities indicating their level
of support. Support for a specific revision of a product ranks
the highest and the generic driver the lowest priority. It might
also be that multiple drivers could attach to one device if
there are multiple interfaces within one configuration. Each
driver only needs to support a subset of the interfaces.</para>
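<para>The idea of probes returning priorities can be sketched as
follows. The match levels, the driver table and the two probe
functions are hypothetical and only illustrate the selection
rule: the driver whose probe reports the highest level of support
wins. The vendor and product numbers are made up.</para>

```c
#include <stddef.h>

/* Hypothetical match levels, lowest to highest priority. */
enum demo_match {
    DEMO_MATCH_NONE = 0,
    DEMO_MATCH_GENERIC,
    DEMO_MATCH_CLASS,
    DEMO_MATCH_PRODUCT,
    DEMO_MATCH_PRODUCT_REV
};

struct demo_driver {
    const char *name;
    enum demo_match (*probe)(unsigned vendor, unsigned product);
};

/* Return the driver with the highest match level, or NULL if none matches. */
static const struct demo_driver *
demo_pick_driver(const struct demo_driver *drv, size_t n,
    unsigned vendor, unsigned product)
{
    const struct demo_driver *best = NULL;
    enum demo_match best_match = DEMO_MATCH_NONE;
    size_t i;

    for (i = 0; i < n; i++) {
        enum demo_match m = drv[i].probe(vendor, product);

        if (m > best_match) {
            best_match = m;
            best = &drv[i];
        }
    }
    return best;
}

/* A generic driver claims everything at the lowest priority. */
static enum demo_match
demo_generic_probe(unsigned vendor, unsigned product)
{
    (void)vendor;
    (void)product;
    return DEMO_MATCH_GENERIC;
}

/* A specific driver only claims one made-up vendor/product pair. */
static enum demo_match
demo_specific_probe(unsigned vendor, unsigned product)
{
    if (vendor == 0x1234 && product == 0x0001)
        return DEMO_MATCH_PRODUCT;
    return DEMO_MATCH_NONE;
}
```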
<para>The probing for a driver for a newly attached device checks
first for device specific drivers. If none is found, the probe code
iterates over all supported configurations until a driver
attaches in a configuration. To support devices with multiple
drivers on different interfaces, the probe iterates over all
interfaces in a configuration that have not yet been claimed by
a driver. Configurations that exceed the power budget for the
hub are ignored.</para>
<para>During attach the driver should initialise the
device to its proper state, but not reset it, as this will make
the device disconnect itself from the bus and restart the
probing process for it. To avoid consuming unnecessary bandwidth,
drivers should not claim the interrupt pipe at attach time, but
should postpone allocating the pipe until the file is opened and
the data is actually used. When the file is closed the pipe
should be closed again, even though the device might still be
attached.</para>
<sect2>
<title>Device disconnect and detach</title>
<para>A device driver should expect to receive errors during any
transaction with the device. The design of USB supports and
encourages the disconnection of devices at any point in
time. Drivers should make sure that they do the right thing
when the device disappears.</para>
<para>Furthermore a device that has been disconnected and
reconnected will not be reattached at the same device
instance. This might change in the future when more devices
support serial numbers (see the device descriptor) or other
means of defining an identity for a device have been
developed.</para>
<para>The disconnection of a device is signaled by a hub in the
interrupt packet delivered to the hub driver. The status
change information indicates which port has seen a connection
change. The device detach method of every device driver for
the device connected on that port is called and the structures
are cleaned up. If the port status indicates that in the meantime
a device has been connected to that port, the procedure for
probing and attaching the device will be started. A device
reset will produce a disconnect-connect sequence on the hub
and will be handled as described above.</para>
</sect2>
</sect1>
<sect1 id="usb-protocol">
<title>USB Drivers Protocol Information</title>
<para>The protocol used over pipes other than the default pipe is
undefined by the USB specification. Information on this can be
found from various sources. The most accurate source is the
developer's section on the USB home pages [ 1]. From these pages
a growing number of deviceclass specifications are
available. These specifications specify what a compliant device
should look like from a driver perspective, basic functionality
it needs to provide and the protocol that is to be used over the
communication channels. The USB specification [2] includes the
description of the Hub Class. A class specification for Human
Interface Devices (HID) has been created to cater for keyboards,
tablets, bar-code readers, buttons, knobs, switches, etc. A
third example is the class specification for mass storage
devices. For a full list of device classes see the developers
section on the USB home pages [1].</para>
<para>For many devices, however, the protocol information has not yet
been published. Information on the protocol being used might
be available from the company making the device. Some companies
will require you to sign a Non-Disclosure Agreement (NDA)
before giving you the specifications. This in most cases
precludes making the driver open source.</para>
<para>Another good source of information is the Linux driver
sources, as a number of companies have started to provide drivers
for Linux for their devices. It is always a good idea to contact
the authors of those drivers for their source of
information.</para>
<para>Example: Human Interface Devices. The specification for
Human Interface Devices such as keyboards, mice, tablets, buttons,
dials, etc. is referred to in other device class specifications
and is used in many devices.</para>
<para>For example audio speakers provide endpoints to the digital
to analogue converters and possibly an extra pipe for a
microphone. They also provide a HID endpoint in a separate
interface for the buttons and dials on the front of the
device. The same is true for the monitor control class. It is
straightforward to build support for these interfaces through
the available kernel and userland libraries together with the
HID class driver or the generic driver. Another device that
serves as an example for interfaces within one configuration
driven by different device drivers is a cheap keyboard with a
built-in legacy mouse port. To avoid the cost of
including the hardware for a USB hub in the device,
manufacturers combined the mouse data received from the PS/2 port
on the back of the keyboard and the key presses from the keyboard
into two separate interfaces in the same configuration. The
mouse and keyboard drivers each attach to the appropriate
interface and allocate the pipes to the two independent
endpoints.</para>
<para>Example: Firmware download. Many devices that have been
developed are based on a general purpose processor with
an additional USB core added to it. Because the development of
drivers and firmware for USB devices is still very new, many
devices require the downloading of the firmware after they
have been connected.</para>
<para>The procedure followed is straightforward. The device
identifies itself through a vendor and product Id. The first
driver probes and attaches to it and downloads the firmware into
it. After that the device soft resets itself and the driver is
detached. After a short pause the device announces its presence
on the bus. The device will have changed its
vendor/product/revision Id to reflect the fact that it has been
supplied with firmware and as a consequence a second driver will
probe it and attach to it.</para>
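<para>The two-stage procedure above can be sketched as a probe
function that selects a driver from the Ids the device reports. All
names and Id values in this sketch are made up for illustration; only
the idea that the revision Id changes once firmware has been supplied
comes from the text.</para>

```c
#include <assert.h>
#include <stdint.h>

/* All identifiers and Id values below are made up for illustration. */
#define SKETCH_VENDOR      0x1234
#define SKETCH_PRODUCT     0x0001
#define SKETCH_REV_BARE    0x0000  /* device without firmware */
#define SKETCH_REV_LOADED  0x0001  /* device after firmware download */

struct sketch_ids {
	uint16_t vendor;
	uint16_t product;
	uint16_t revision;
};

enum sketch_driver { SKETCH_NONE, SKETCH_LOADER, SKETCH_FUNCTION };

/* Decide which of the two drivers claims the device during probing. */
static enum sketch_driver
sketch_probe(const struct sketch_ids *id)
{
	if (id->vendor != SKETCH_VENDOR || id->product != SKETCH_PRODUCT)
		return (SKETCH_NONE);
	if (id->revision == SKETCH_REV_BARE)
		return (SKETCH_LOADER);
	if (id->revision == SKETCH_REV_LOADED)
		return (SKETCH_FUNCTION);
	return (SKETCH_NONE);
}

/* Model of the firmware download followed by the device's soft reset. */
static void
sketch_download_and_reset(struct sketch_ids *id)
{
	id->revision = SKETCH_REV_LOADED;
}
```

<para>Probing the same device before and after
<literal>sketch_download_and_reset</literal> yields the loader driver
first and the function driver second, mirroring the
detach and reattach sequence seen on the bus.</para>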
<para>An example of these types of devices is the ActiveWire I/O
board, based on the EZ-USB chip. For this chip a generic firmware
downloader is available. The firmware downloaded into the
ActiveWire board changes the revision Id. It will then perform a
soft reset of the USB part of the EZ-USB chip to disconnect from
the USB bus and again reconnect.</para>
<para>Example: Mass Storage Devices. Support for mass storage
devices is mainly built around existing protocols. The Iomega
USB Zipdrive is based on the SCSI version of their drive. The
SCSI commands and status messages are wrapped in blocks and
transferred over the bulk pipes to and from the device,
emulating a SCSI controller over the USB wire. ATAPI and UFI
commands are supported in a similar fashion.</para>
<para>The Mass Storage Specification supports two different types of
wrapping of the command block. The initial attempt was based on
sending the command and status through the default pipe and
using bulk transfers for the data to be moved between the host
and the device. Based on experience a second approach was
designed that was based on wrapping the command and status
blocks and sending them over the bulk out and in endpoint. The
specification specifies exactly what has to happen when and what
has to be done in case an error condition is encountered. The
biggest challenge when writing drivers for these devices is to
fit the USB-based protocol into the existing support for mass storage
devices. CAM provides hooks to do this in a fairly
straightforward way. ATAPI is less simple as historically the IDE
interface has never had many different appearances.</para>
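<para>As an illustration of the second wrapping scheme, the sketch
below packs a SCSI command into a 31-byte command wrapper of the kind
sent over the bulk out endpoint. It assumes that this scheme is the
one the class documents call Bulk-Only Transport, and the field
layout should be checked against that specification before use. The
function and macro names are hypothetical and do not come from any
FreeBSD driver.</para>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_CBW_SIGNATURE 0x43425355u /* reads "USBC" on the wire */
#define SKETCH_CBW_SIZE      31
#define SKETCH_CBW_DATA_IN   0x80        /* data phase is device to host */

/* Store a 32-bit value in little-endian byte order. */
static void
sketch_put_le32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v & 0xff);
	p[1] = (uint8_t)((v >> 8) & 0xff);
	p[2] = (uint8_t)((v >> 16) & 0xff);
	p[3] = (uint8_t)((v >> 24) & 0xff);
}

/* Wrap a command block; returns 0 on success, -1 on a bad length. */
static int
sketch_build_cbw(uint8_t out[SKETCH_CBW_SIZE], uint32_t tag,
    uint32_t data_len, int data_in, uint8_t lun,
    const uint8_t *cdb, uint8_t cdb_len)
{
	if (cdb_len == 0 || cdb_len > 16)
		return (-1);
	memset(out, 0, SKETCH_CBW_SIZE);
	sketch_put_le32(out + 0, SKETCH_CBW_SIGNATURE);
	sketch_put_le32(out + 4, tag);        /* echoed in the status block */
	sketch_put_le32(out + 8, data_len);   /* bytes in the data phase */
	out[12] = data_in ? SKETCH_CBW_DATA_IN : 0;
	out[13] = lun & 0x0f;
	out[14] = cdb_len;
	memcpy(out + 15, cdb, cdb_len);       /* the wrapped command itself */
	return (0);
}
```

<para>The tag is chosen by the host and returned in the matching
status block, which lets a driver pair each status with the command
that caused it.</para>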
<para>The support for the USB floppy from Y-E Data is again less
straightforward as a new command set has been designed.</para>
</sect1>
</chapter>

@ -1,260 +0,0 @@
<!--
The FreeBSD Documentation Project
$FreeBSD$
-->
<chapter id="vm">
<chapterinfo>
<authorgroup>
<author>
<firstname>Matthew</firstname>
<surname>Dillon</surname>
<contrib>Contributed by </contrib>
</author>
</authorgroup>
<!-- 6 Feb 1999 -->
</chapterinfo>
<title>Virtual Memory System</title>
<sect1 id="vm-physmem">
<title>Management of physical
memory&mdash;<literal>vm_page_t</literal></title>
<para>Physical memory is managed on a page-by-page basis through the
<literal>vm_page_t</literal> structure. Pages of physical memory are
categorized through the placement of their respective
<literal>vm_page_t</literal> structures on one of several paging
queues.</para>
<para>A page can be in a wired, active, inactive, cache, or free state.
Except for the wired state, the page is typically placed in a doubly
linked list queue representing the state that it is in. Wired pages
are not placed on any queue.</para>
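<para>The relationship between page states and queues can be
illustrated with a small model. The sketch below is not the kernel's
<literal>vm_page_t</literal> code; every name in it is hypothetical.
It keeps one doubly linked queue per state and leaves wired pages off
all queues, as described above.</para>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical page states; the wired state has no queue of its own. */
enum sketch_state {
	SKETCH_ST_FREE, SKETCH_ST_CACHE, SKETCH_ST_INACTIVE,
	SKETCH_ST_ACTIVE, SKETCH_ST_WIRED
};
#define SKETCH_NQUEUES SKETCH_ST_WIRED

struct sketch_page {
	enum sketch_state state;
	struct sketch_page *prev, *next; /* links within the state's queue */
};

struct sketch_queue {
	struct sketch_page *head, *tail;
	int count;
};

static struct sketch_queue sketch_queues[SKETCH_NQUEUES];

/* Unlink a page from the queue of its current state, if it has one. */
static void
sketch_unqueue(struct sketch_page *pg)
{
	struct sketch_queue *q;

	if (pg->state == SKETCH_ST_WIRED)
		return;
	q = &sketch_queues[pg->state];
	if (pg->prev != NULL)
		pg->prev->next = pg->next;
	else
		q->head = pg->next;
	if (pg->next != NULL)
		pg->next->prev = pg->prev;
	else
		q->tail = pg->prev;
	pg->prev = pg->next = NULL;
	q->count--;
}

/* Move a page to a new state, appending it to that state's queue. */
static void
sketch_set_state(struct sketch_page *pg, enum sketch_state st)
{
	struct sketch_queue *q;

	sketch_unqueue(pg);
	pg->state = st;
	if (st == SKETCH_ST_WIRED)
		return;                 /* wired pages sit on no queue */
	q = &sketch_queues[st];
	pg->prev = q->tail;
	pg->next = NULL;
	if (q->tail != NULL)
		q->tail->next = pg;
	else
		q->head = pg;
	q->tail = pg;
	q->count++;
}

/* Start a page off-queue, then place it on the free queue. */
static void
sketch_page_init(struct sketch_page *pg)
{
	pg->state = SKETCH_ST_WIRED;
	pg->prev = pg->next = NULL;
	sketch_set_state(pg, SKETCH_ST_FREE);
}
```

<para>Because each page carries its own links, moving it between
queues takes constant time and needs no search of the queue.</para>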
<para>FreeBSD implements a more involved paging queue for cached and
free pages in order to implement page coloring. Each of these states
involves multiple queues arranged according to the size of the
processor's L1 and L2 caches. When a new page needs to be allocated,
FreeBSD attempts to obtain one that is reasonably well aligned from
the point of view of the L1 and L2 caches relative to the VM object
the page is being allocated for.</para>
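<para>The idea behind page coloring can be shown with a little
arithmetic. The sketch below is a generic illustration and not
FreeBSD's allocator: the page size, the number of colors and both
function names are assumptions chosen for the example. A physical
page's color is the cache bin it falls into, and the color wanted for
a new page follows from the object's base color and the page's index
within the object.</para>

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed page size in bytes */
#define SKETCH_NCOLORS   64u   /* assumed number of cache colors */

/* Color of a physical page: which cache bin its address maps to. */
static unsigned
sketch_page_color(uint64_t phys_addr)
{
	return ((unsigned)((phys_addr / SKETCH_PAGE_SIZE) % SKETCH_NCOLORS));
}

/* Color wanted for page number pindex of an object with a base color. */
static unsigned
sketch_wanted_color(unsigned obj_color, uint64_t pindex)
{
	return ((unsigned)((obj_color + pindex) % SKETCH_NCOLORS));
}
```

<para>An allocator built on this would first look in the free queue
whose color equals <literal>sketch_wanted_color</literal>, so that
consecutive pages of one object land in different cache bins.</para>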
<para>Additionally, a page may be held with a reference count or locked
with a busy count. The VM system also implements an <quote>ultimate
locked</quote> state for a page using the PG_BUSY bit in the page's
flags.</para>
<para>In general terms, each of the paging queues operates in an LRU
fashion. A page is typically placed in a wired or active state
initially. When wired, the page is usually associated with a page
table somewhere. The VM system ages the page by scanning pages in a
more active paging queue (LRU) in order to move them to a less-active
paging queue. Pages that get moved into the cache are still
associated with a VM object but are candidates for immediate reuse.
Pages in the free queue are truly free. FreeBSD attempts to minimize
the number of pages in the free queue, but a certain minimum number of
truly free pages must be maintained in order to accommodate page
allocation at interrupt time.</para>
<para>If a process attempts to access a page that does not exist in its
page table but does exist in one of the paging queues (such as the
inactive or cache queues), a relatively inexpensive page reactivation
fault occurs which causes the page to be reactivated. If the page
does not exist in system memory at all, the process must block while
the page is brought in from disk.</para>
<para>FreeBSD dynamically tunes its paging queues and attempts to
maintain reasonable ratios of pages in the various queues as well as
attempts to maintain a reasonable breakdown of clean vs. dirty pages.
The amount of rebalancing that occurs depends on the system's memory
load. This rebalancing is implemented by the pageout daemon and
involves laundering dirty pages (syncing them with their backing
store), noticing when pages are actively referenced (resetting their
position in the LRU queues or moving them between queues), migrating
pages between queues when the queues are out of balance, and so forth.
FreeBSD's VM system is willing to take a reasonable number of
reactivation page faults to determine how active or how idle a page
actually is. This leads to better decisions being made as to when to
launder or swap-out a page.</para>
</sect1>
<sect1 id="vm-cache">
<title>The unified buffer
cache&mdash;<literal>vm_object_t</literal></title>
<para>FreeBSD implements the idea of a generic <quote>VM object</quote>.
VM objects can be associated with backing store of various
types&mdash;unbacked, swap-backed, physical device-backed, or
file-backed storage. Since the filesystem uses the same VM objects to
manage in-core data relating to files, the result is a unified buffer
cache.</para>
<para>VM objects can be <emphasis>shadowed</emphasis>. That is, they
can be stacked on top of each other. For example, you might have a
swap-backed VM object stacked on top of a file-backed VM object in
order to implement a MAP_PRIVATE mmap()ing. This stacking is also
used to implement various sharing properties, including
copy-on-write, for forked address spaces.</para>
<para>It should be noted that a <literal>vm_page_t</literal> can only be
associated with one VM object at a time. The VM object shadowing
implements the perceived sharing of the same page across multiple
instances.</para>
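<para>Shadowing can be illustrated with a simplified model. In the
sketch below, which uses hypothetical names throughout and is not the
kernel's <literal>vm_object_t</literal> code, a read walks down the
stack of objects until it finds the page, while a write places a
private copy in the topmost object only. That is the essence of a
copy-on-write mapping.</para>

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NPAGES 4

/* Hypothetical VM object: a few page slots plus the object it shadows. */
struct sketch_object {
	const char *pages[SKETCH_NPAGES]; /* NULL: page not held here */
	struct sketch_object *backing;    /* object underneath, or NULL */
};

/* Read: search from the top object down through the shadow chain. */
static const char *
sketch_lookup(const struct sketch_object *obj, size_t idx)
{
	for (; obj != NULL; obj = obj->backing)
		if (obj->pages[idx] != NULL)
			return (obj->pages[idx]);
	return (NULL);
}

/* Write: the private copy goes into the top object; lower ones stay put. */
static void
sketch_write(struct sketch_object *top, size_t idx, const char *data)
{
	top->pages[idx] = data;
}
```

<para>After a write, lookups through the top object see the private
copy, while anything that maps the lower object directly still sees
the original page. Each page belongs to exactly one object in this
model, matching the rule stated above.</para>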
</sect1>
<sect1 id="vm-fileio">
<title>Filesystem I/O&mdash;<literal>struct buf</literal></title>
<para>Vnode-backed VM objects, such as file-backed objects, generally
need to maintain their own clean/dirty info independent from the VM
system's idea of clean/dirty. For example, when the VM system decides
to synchronize a physical page to its backing store, the VM system
needs to mark the page clean before the page is actually written to
its backing store. Additionally, filesystems need to be able to map
portions of a file or file metadata into KVM in order to operate on
it.</para>
<para>The entities used to manage this are known as filesystem buffers,
<literal>struct buf</literal>'s, or
<literal>bp</literal>'s. When a filesystem needs to operate on a
portion of a VM object, it typically maps part of the object into a
struct buf and then maps the pages in the struct buf into KVM. In the
same manner, disk I/O is typically issued by mapping portions of
objects into buffer structures and then issuing the I/O on the buffer
structures. The underlying vm_page_t's are typically busied for the
duration of the I/O. Filesystem buffers also have their own notion of
being busy, which is useful to filesystem driver code which would
rather operate on filesystem buffers instead of hard VM pages.</para>
<para>FreeBSD reserves a limited amount of KVM to hold mappings from
struct bufs, but it should be made clear that this KVM is used solely
to hold mappings and does not limit the ability to cache data.
Physical data caching is strictly a function of
<literal>vm_page_t</literal>'s, not filesystem buffers. However,
since filesystem buffers are used to placehold I/O, they do inherently
limit the amount of concurrent I/O possible. As there are usually a
few thousand filesystem buffers available, this is not usually a
problem.</para>
</sect1>
<sect1 id="vm-pagetables">
<title>Mapping Page Tables&mdash;<literal>vm_map_t, vm_entry_t</literal></title>
<para>FreeBSD separates the physical page table topology from the VM
system. All hard per-process page tables can be reconstructed on the
fly and are usually considered throwaway. Special page tables such as
those managing KVM are typically permanently preallocated. These page
tables are not throwaway.</para>
<para>FreeBSD associates portions of vm_objects with address ranges in
virtual memory through <literal>vm_map_t</literal> and
<literal>vm_entry_t</literal> structures. Page tables are directly
synthesized from the
<literal>vm_map_t</literal>/<literal>vm_entry_t</literal>/
<literal>vm_object_t</literal> hierarchy. Recall that I mentioned
that physical pages are only directly associated with a
<literal>vm_object</literal>; that is not quite true.
<literal>vm_page_t</literal>'s are also linked into page tables that
they are actively associated with. One <literal>vm_page_t</literal>
can be linked into several <emphasis>pmaps</emphasis>, as page tables
are called. However, the hierarchical association holds, so all
references to the same page in the same object reference the same
<literal>vm_page_t</literal> and thus give us buffer cache unification
across the board.</para>
</sect1>
<sect1 id="vm-kvm">
<title>KVM Memory Mapping</title>
<para>FreeBSD uses KVM to hold various kernel structures. The single
largest entity held in KVM is the filesystem buffer cache. That is,
mappings relating to <literal>struct buf</literal> entities.</para>
<para>Unlike Linux, FreeBSD does <emphasis>not</emphasis> map all of physical memory into
KVM. This means that FreeBSD can handle memory configurations up to
4GB on 32-bit platforms. In fact, if the MMU were capable of it,
FreeBSD could theoretically handle memory configurations up to 8TB on
a 32-bit platform. However, since most 32-bit platforms are only
capable of mapping 4GB of RAM, this is a moot point.</para>
<para>KVM is managed through several mechanisms. The main mechanism
used to manage KVM is the <emphasis>zone allocator</emphasis>. The
zone allocator takes a chunk of KVM and splits it up into
constant-sized blocks of memory in order to allocate a specific type
of structure. You can use <command>vmstat -m</command> to get an
overview of current KVM utilization broken down by zone.</para>
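<para>The principle of a zone allocator can be sketched as a
fixed-size block allocator. The code below is a minimal userland
model with hypothetical names, not the kernel's implementation. It
assumes the chunk is aligned for pointers and that the item size is a
multiple of the pointer size, because free items are chained together
through their own first bytes.</para>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical zone: a chunk of memory cut into equal-sized items. */
struct sketch_zone {
	size_t item_size;
	void  *free_list; /* free items, linked through their first bytes */
	size_t nfree;
};

/* Carve the chunk into items and put every item on the free list. */
static void
sketch_zone_init(struct sketch_zone *z, void *chunk, size_t chunk_size,
    size_t item_size)
{
	char *base = chunk;
	size_t i, n;

	assert(item_size >= sizeof(void *));
	assert(item_size % sizeof(void *) == 0);
	n = chunk_size / item_size;
	z->item_size = item_size;
	z->free_list = NULL;
	z->nfree = 0;
	for (i = 0; i < n; i++) {
		void *item = base + i * item_size;

		*(void **)item = z->free_list;
		z->free_list = item;
		z->nfree++;
	}
}

/* Allocate one item in constant time; NULL when the zone is exhausted. */
static void *
sketch_zone_alloc(struct sketch_zone *z)
{
	void *item = z->free_list;

	if (item != NULL) {
		z->free_list = *(void **)item;
		z->nfree--;
	}
	return (item);
}

/* Return an item to the zone it came from. */
static void
sketch_zone_free(struct sketch_zone *z, void *item)
{
	*(void **)item = z->free_list;
	z->free_list = item;
	z->nfree++;
}
```

<para>Because every item in a zone has the same size, allocation and
release never split or merge blocks, so the zone cannot fragment
internally.</para>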
</sect1>
<sect1 id="vm-tuning">
<title>Tuning the FreeBSD VM system</title>
<para>A concerted effort has been made to make the FreeBSD kernel
dynamically tune itself. Typically you do not need to mess with
anything beyond the <option>maxusers</option> and
<option>NMBCLUSTERS</option> kernel config options. That is, kernel
compilation options specified in (typically)
<filename>/usr/src/sys/i386/conf/<replaceable>CONFIG_FILE</replaceable></filename>.
A description of all available kernel configuration options can be
found in <filename>/usr/src/sys/i386/conf/LINT</filename>.</para>
<para>In a large system configuration you may wish to increase
<option>maxusers</option>. Values typically range from 10 to 128.
Note that raising <option>maxusers</option> too high can cause the
system to overflow available KVM resulting in unpredictable operation.
It is better to leave <option>maxusers</option> at some reasonable number and add other
options, such as <option>NMBCLUSTERS</option>, to increase specific
resources.</para>
<para>If your system is going to use the network heavily, you may want
to increase <option>NMBCLUSTERS</option>. Typical values range from
1024 to 4096.</para>
<para>The <literal>NBUF</literal> parameter is also traditionally used
to scale the system. This parameter determines the amount of KVA the
system can use to map filesystem buffers for I/O. Note that this
parameter has nothing whatsoever to do with the unified buffer cache!
This parameter is dynamically tuned in 3.0-CURRENT and later kernels
and should generally not be adjusted manually. We recommend that you
<emphasis>not</emphasis> try to specify an <literal>NBUF</literal>
parameter. Let the system pick it. Too small a value can result in
extremely inefficient filesystem operation while too large a value can
starve the page queues by causing too many pages to become wired
down.</para>
<para>By default, FreeBSD kernels are not optimized. You can set
debugging and optimization flags with the
<literal>makeoptions</literal> directive in the kernel configuration.
Note that you should not use <option>-g</option> unless you can
accommodate the large (typically 7 MB+) kernels that result.</para>
<programlisting>makeoptions DEBUG="-g"
makeoptions COPTFLAGS="-O -pipe"</programlisting>
<para>Sysctl provides a way to tune kernel parameters at run-time. You
typically do not need to mess with any of the sysctl variables,
especially the VM related ones.</para>
<para>Run-time VM and system tuning is relatively straightforward.
First, use Soft Updates on your UFS/FFS filesystems whenever possible.
<filename>/usr/src/sys/ufs/ffs/README.softupdates</filename> contains
instructions (and restrictions) on how to configure it.</para>
<para>Second, configure sufficient swap. You should have a swap
partition configured on each physical disk, up to four, even on your
<quote>work</quote> disks. You should have at least twice as much swap
space as you have main memory, and possibly even more if you do not have a
lot of memory. You should also size your swap partition based on the
maximum memory configuration you ever intend to put on the machine so
you do not have to repartition your disks later on. If you want to be
able to accommodate a crash dump, your first swap partition must be at
least as large as main memory and <filename>/var/crash</filename> must
have sufficient free space to hold the dump.</para>
<para>NFS-based swap is perfectly acceptable on 4.X or later systems,
but you must be aware that the NFS server will take the brunt of the
paging load.</para>
</sect1>
</chapter>