<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//FreeBSD//DTD XHTML 1.0 Transitional-Based Extension//EN"
"http://www.FreeBSD.org/XML/doc/share/sgml/xhtml10-freebsd.dtd" [
<!ENTITY title "FreeBSD Network Performance Project (netperf)">
<!ENTITY email 'mux'>

<!ENTITY status.na "<font xmlns='http://www.w3.org/1999/xhtml' color='green'>N/A</font>">
<!ENTITY status.done "<font xmlns='http://www.w3.org/1999/xhtml' color='green'>Done</font>">
<!ENTITY status.prototyped "<font xmlns='http://www.w3.org/1999/xhtml' color='blue'>Prototyped</font>">
<!ENTITY status.head "<font xmlns='http://www.w3.org/1999/xhtml' color='orange'>Merged to HEAD; RELENG_5 candidate</font>">
<!ENTITY status.new "<font xmlns='http://www.w3.org/1999/xhtml' color='red'>Not done</font>">
<!ENTITY status.unknown "<font xmlns='http://www.w3.org/1999/xhtml' color='red'>Unknown</font>">
]>

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>&title;</title>

<cvs:keyword xmlns:cvs="http://www.FreeBSD.org/XML/CVS">$FreeBSD$</cvs:keyword>
</head>

<body class="navinclude.developers">

<h2>Contents</h2>
<ul>
<li><a href="#goal">Project Goal</a></li>
<li><a href="#strategies">Project Strategies</a></li>
<li><a href="#tasks">Project Tasks</a></li>
<li><a href="#cluster">Netperf Cluster</a></li>
<li><a href="#papers">Papers and Reports</a></li>
<li><a href="#links">Links</a></li>
</ul>

<a name="goal"></a>
<h2>Project Goal</h2>

<p>The netperf project is working to enhance the performance of the
FreeBSD network stack. This work grew out of the SMPng Project, which
moved the FreeBSD kernel from a "Giant Lock" to more fine-grained
locking and multi-threading. SMPng brought the network stack both
performance improvements and degradations: it improved parallelism
and preemption, but substantially increased per-packet processing
costs. The netperf project is primarily focused on further improving
parallelism in network processing while reducing SMP synchronization
overhead. This in turn will lead to higher processing throughput and
lower processing latency.</p>

<a name="strategies"></a>
<h2>Project Strategies</h2>
<p>Robert Watson</p>

<p>The two primary focuses of this work are increasing parallelism
and decreasing synchronization overhead. Several activities are under
way that work toward these goals:</p>

<ul>
<li><p>The Netperf project has completed locking work for all components
of the network stack; as of FreeBSD 7.0 we have removed non-MPSAFE
protocol shims, and as of FreeBSD 8.0 we have removed non-MPSAFE
device driver shims.</p></li>

<li><p>Optimize locking strategies to find a better balance between
locking granularity and locking overhead. In the first cut at locking
for the kernel, the goal was to adopt a medium-grained locking
approach based on data locking. This approach identifies critical
data structures and inserts new locks and locking operations to
protect those data structures. Depending on the data model of the
code being protected, this may lead to the introduction of a
substantial number of locks offering unnecessary granularity, where
the overhead of locking overwhelms the benefits of available
parallelism and preemption. By selectively reducing granularity, it
is possible to improve performance by decreasing locking overhead.
</p></li>

<li><p>Amortize the cost of locking by processing queues of packets or
events. While the cost of individual synchronization operations may
be high, it is possible to amortize the cost of synchronization
operations by grouping processing of similar data (packets, events)
under the same protection. This approach focuses on identifying
places where similar locking occurs frequently in succession, and
introducing queueing or coalescing of lock operations across the
body of the work. For example, when a series of packets is inserted
into an outgoing interface queue, a basic locking approach would
lock the queue for each insert operation, unlock it, and hand off to
the interface driver to begin the send, repeating this sequence as
required. With a coalesced approach, the caller would pass off a
queue of packets, reducing the locking overhead and eliminating
unnecessary synchronization while the queue remains thread-local.
This approach can be applied at several levels in the stack, and is
particularly applicable at lower levels of the stack where streams
of packets require almost identical processing. A minimal sketch of
this batched handoff appears after this list.
</p></li>

<li><p>Introduce new synchronization strategies with reduced overhead
relative to traditional strategies. Most traditional strategies
employ a combination of interrupt disabling and atomic operations to
achieve mutual exclusion and non-preemption guarantees. However,
these operations are expensive on modern CPUs, leading to the desire
for cheaper primitives with weaker semantics. Examples include the
application of uni-processor primitives where synchronization is
required only on a single processor, and optimizations to critical
section primitives that avoid the need for interrupt disabling.
</p></li>

<li><p>Modify synchronization strategies to take advantage of
additional, non-locking synchronization primitives. This approach
might take the form of making increased use of per-CPU or per-thread
data structures, which require little or no synchronization. For
example, through the use of critical sections, it is possible to
synchronize access to per-CPU caches and queues. Through the use of
per-thread queues, data can be handed off between stack layers
without the use of synchronization.</p></li>

<li><p>Increase the opportunities for parallelism through increased
threading in the network stack. The current network stack model
offers the opportunity for substantial parallelism, with outbound
processing typically taking place in the context of the sending
thread in the kernel, crypto occurring in crypto worker threads, and
receive processing taking place in a combination of the receiving
ithread and dispatched netisr thread. While handoffs between
threads introduce overhead (synchronization, context switching),
there is the opportunity to increase parallelism in some workloads
by introducing additional worker threads. Identifying work that may
be relocated to new threads must be done carefully to balance
overhead and latency concerns, but can pay off by increasing
effective CPU utilization and hence throughput. For example,
introducing additional netisr threads capable of running on more
than one CPU at a time can increase input parallelism, subject to
maintaining desirable packet ordering (present in FreeBSD
8.0).</p></li>
</ul>
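
<p>The following is a minimal, hypothetical C sketch of the batched
handoff described in the third item above. The names (struct
xmit_queue, xmit_enqueue_one, xmit_enqueue_batch) are invented for
illustration and are not FreeBSD interfaces; the point is simply that
a single lock acquisition can cover an entire list of packets rather
than requiring one acquisition per packet.</p>

<pre>
/*
 * Hypothetical sketch only: shows the locking pattern, not the actual
 * FreeBSD mbuf queue primitive.  xq_mtx is assumed to have been set up
 * with mtx_init() elsewhere.
 */
#include &lt;sys/param.h&gt;
#include &lt;sys/lock.h&gt;
#include &lt;sys/mutex.h&gt;
#include &lt;sys/mbuf.h&gt;

struct xmit_queue {
    struct mtx   xq_mtx;    /* protects xq_head and xq_tail */
    struct mbuf *xq_head;
    struct mbuf *xq_tail;
};

/* Per-packet handoff: one lock/unlock pair for every mbuf. */
static void
xmit_enqueue_one(struct xmit_queue *xq, struct mbuf *m)
{

    m->m_nextpkt = NULL;
    mtx_lock(&amp;xq->xq_mtx);
    if (xq->xq_tail != NULL)
        xq->xq_tail->m_nextpkt = m;
    else
        xq->xq_head = m;
    xq->xq_tail = m;
    mtx_unlock(&amp;xq->xq_mtx);
}

/*
 * Batched handoff: the caller builds a thread-local list of packets,
 * which needs no locking while it is private, and then appends the
 * whole list with a single lock acquisition.
 */
static void
xmit_enqueue_batch(struct xmit_queue *xq, struct mbuf *head,
    struct mbuf *tail)
{

    mtx_lock(&amp;xq->xq_mtx);
    if (xq->xq_tail != NULL)
        xq->xq_tail->m_nextpkt = head;
    else
        xq->xq_head = head;
    xq->xq_tail = tail;
    mtx_unlock(&amp;xq->xq_mtx);
}
</pre>
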
<a name="tasks"></a>
<h2>Project Tasks</h2>

<table class="tblbasic">
<tr>
<th> Task </th>
<th> Responsible </th>
<th> Last updated </th>
<th> Status </th>
<th> Notes </th>
</tr>

<tr>
<td> Prefer file descriptor reference counts to socket reference
counts for system calls. </td>
<td> &a.rwatson; </td>
<td> 20041124 </td>
<td> &status.done; </td>
<td> Sockets and file descriptors both have reference counts in order
to prevent these objects from being freed while in use. However,
if a file descriptor is used to reach the socket, the reference
counts are somewhat interchangeable, as either will prevent
undesired garbage collection. For socket system calls, overhead
can be reduced by relying on the file descriptor reference count,
thus avoiding the synchronized operations necessary to modify the
socket reference count, an approach also taken in the VFS code.
This change has been made for most socket system calls, and has
been committed to HEAD (6.x). It has also been merged to RELENG_5
for inclusion in 5.4.</td>
</tr>

<tr>
<td> Mbuf queue library </td>
<td> &a.rwatson; </td>
<td> 20041124 </td>
<td> &status.prototyped; </td>
<td> In order to facilitate passing off queues of packets between
network stack components, create an mbuf queue primitive, struct
mbufqueue. The initial implementation is complete, and the
primitive is now being applied in several sample cases to determine
whether it offers the desired semantics and benefits. The
implementation can be found in the rwatson_dispatch Perforce
branch. Additional work must also be done to explore the
performance impact of "queues" vs arrays of mbuf pointers, which
are likely to behave better from a caching perspective. </td>
</tr>

<tr>
<td> Employ queued dispatch in interface send API </td>
<td> &a.rwatson; </td>
<td> 20041106 </td>
<td> &status.prototyped; </td>
<td> An experimental if_start_mbufqueue() interface to struct ifnet
has been added, which passes an mbuf queue to the device driver for
processing, avoiding redundant synchronization against the
interface queue, even in the event that additional queueing is
required. This has not yet been benchmarked. A more limited change
that dispatches a single mbuf to the driver has also been
prototyped, and benchmarked at a several percentage point
improvement in packet send rates from user space. </td>
</tr>

<tr>
<td> Employ queued dispatch in the interface receive API </td>
<td> &a.rwatson; </td>
<td> 20041106 </td>
<td> &status.new; </td>
<td> Similar to if_start_mbufqueue, allow input of a queue of mbufs
from the device driver into the lowest protocol layers, such as
ether_input_mbufqueue. </td>
</tr>

<tr>
<td> Employ queued dispatch across netisr dispatch API </td>
<td> &a.rwatson; </td>
<td> 20090601 </td>
<td> &status.done; </td>
<td> Pull all of the mbufs in the netisr queue into a thread-local
mbuf queue to avoid repeated lock operations to access the queue.
This work was completed as part of the netisr2 project, and will
ship with 8.0-RELEASE. </td>
</tr>

<tr>
<td> Modify UMA allocator to use critical sections not mutexes for
per-CPU caches. </td>
<td> &a.rwatson; </td>
<td> 20050429 </td>
<td> &status.done; </td>
<td> The mutexes protecting per-CPU caches require atomic operations
on SMP systems; as the caches are per-CPU objects, the cost of
synchronizing access to them can be reduced by using CPU pinning
and/or critical sections instead. This change has now been
committed and will appear in 6.0-RELEASE; it results in a several
percentage point improvement in UDP send performance from user
space, and there have been reports of 20%+ improvements in
allocation intensive code within the kernel. In micro-benchmarks,
the cost of allocation on SMP is dramatically reduced. </td>
</tr>

<tr>
<td> Modify malloc(9) allocator to use per-CPU statistics with
critical sections to protect malloc_type statistics rather than
global statistics with a mutex. </td>
<td> &a.rwatson; </td>
<td> 20050529 </td>
<td> &status.done; </td>
<td> Previously, malloc(9) used a single statistics structure
protected by a mutex to hold global malloc statistics for each
malloc type. This change moves to per-CPU statistics structures,
which are coalesced when reporting memory allocation statistics to
the user, and protects them using critical sections. This reduces
cache line contention for common allocation types by avoiding
shared lines, and also reduces synchronization costs by using
critical sections to synchronize access instead of a mutex. While
malloc(9) is less frequently used in the network stack than uma(9),
it is used for socket address data, so it is on performance critical
paths for datagram operations. This has been committed and appeared
in 6.0-RELEASE. An illustrative sketch of this per-CPU pattern
appears after this table. </td>
</tr>

<tr>
<td> Optimize critical section performance </td>
<td> &a.jhb; </td>
<td> 20050404 </td>
<td> &status.done; </td>
<td> Critical sections prevent preemption of a thread on a CPU, as
well as preventing migration of that thread to another CPU, and
may be used for synchronizing access to per-CPU data structures, as
well as preventing recursion in interrupt processing. Currently,
critical sections disable interrupts on the CPU. In previous
versions of FreeBSD (4.x and before), optimizations were present
that allowed for software interrupt disabling, which lowers the
cost of critical sections in the common case by avoiding expensive
microcode operations on the CPU. By restoring this model, or a
variation on it, critical sections can be made substantially
cheaper to enter. In particular, this change lowers the cost
of critical sections on UP such that it is approximately the same
cost as a mutex, meaning that optimizations on SMP to use critical
sections instead of mutexes will not harm UP performance. This
change has now been committed, and appeared in 6.0-RELEASE. </td>
</tr>

<tr>
<td> Normalize socket and protocol control block reference model </td>
<td> &a.rwatson; </td>
<td> 20060401 </td>
<td> &status.done; </td>
<td> The socket/protocol boundary is characterized by a set of data
structures and API interfaces, where the socket code acts as both
a consumer and a service library for protocols. This task is to
normalize the reference model by which protocol state is attached
to and detached from socket state in order to strengthen
invariants, allowing the removal of countless unused code paths
(especially error handling), the removal of unnecessary locking
in TCP, and a general improvement in the structure of the code.
This serves both the immediate purpose of improving the quality
and performance of this code, as well as being necessary for future
optimization work. These changes have been prototyped in
Perforce, and now merged to 7-CURRENT. They will be merged into
RELENG_6 once they have been thoroughly tested.</td>
</tr>

<tr>
<td> Add true inpcb reference count support </td>
<td> &a.mohans;, &a.rwatson;, &a.peter; </td>
<td> 20081208 </td>
<td> &status.done; </td>
<td> Historically, the in-bound TCP and UDP socket paths relied on
global pcbinfo locks to prevent the PCBs to which packets were
being delivered from being garbage collected by another thread
while in use. This set of changes introduces a true reference
model for PCBs so that the global lock can be released during
in-bound processing; the changes appear in 8.0-RELEASE.</td>
</tr>

<tr>
<td> Fine-grained locking for UNIX domain sockets </td>
<td> &a.rwatson; </td>
<td> 20070226 </td>
<td> &status.done; </td>
<td> UNIX domain sockets in FreeBSD 5.x and 6.x use a single global
subsystem lock. This is sufficient to allow the subsystem to run
without Giant, but results in contention when large numbers of
processors simultaneously operate on UNIX domain sockets. This task
introduced per-protocol control block locks in order to reduce
contention on the larger subsystem lock, and the results appeared in
7.0-RELEASE. </td>
</tr>

<tr>
<td> Multiple netisr threads </td>
<td> &a.rwatson; </td>
<td> 20090601 </td>
<td> &status.done; </td>
<td> Historically, the BSD network stack has used a single network
software interrupt context for deferred network processing. With
the introduction of multi-processing, this became a single
software interrupt thread. In FreeBSD 8.0, multiple netisr
threads are now supported, up to the number of CPUs present in the
system.</td>
</tr>

</table>
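
<p>As an illustration of the per-CPU pattern used by several of the
tasks above (the UMA per-CPU caches and the malloc(9) statistics
work), the following hypothetical C sketch keeps statistics in a
per-CPU array and updates them inside a critical section, so no mutex
or atomic operation is needed on the fast path. The names (struct
alloc_stats, alloc_stats_update, alloc_stats_collect) are invented
for the example and are not the kernel's actual interfaces.</p>

<pre>
/*
 * Hypothetical sketch only: not the actual malloc(9) or uma(9) code.
 * critical_enter() prevents preemption and migration, so the current
 * CPU's slot can be updated without atomics or a shared lock.
 */
#include &lt;sys/param.h&gt;
#include &lt;sys/systm.h&gt;
#include &lt;sys/pcpu.h&gt;

struct alloc_stats {
    uint64_t as_allocs;     /* number of allocations */
    uint64_t as_bytes;      /* bytes allocated */
};

static struct alloc_stats alloc_stats_pcpu[MAXCPU];

/* Fast path: touch only this CPU's slot; no shared cache line. */
static void
alloc_stats_update(size_t size)
{
    struct alloc_stats *as;

    critical_enter();
    as = &amp;alloc_stats_pcpu[curcpu];
    as->as_allocs++;
    as->as_bytes += size;
    critical_exit();
}

/*
 * Reporting path: coalesce the per-CPU slots into one snapshot.  The
 * reads of remote CPUs' counters are unsynchronized; approximate
 * statistics are acceptable for reporting purposes.
 */
static void
alloc_stats_collect(struct alloc_stats *out)
{
    u_int i;

    out->as_allocs = 0;
    out->as_bytes = 0;
    for (i = 0; i &lt; MAXCPU; i++) {
        out->as_allocs += alloc_stats_pcpu[i].as_allocs;
        out->as_bytes += alloc_stats_pcpu[i].as_bytes;
    }
}
</pre>
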
<a name="cluster"></a>
<h2>Netperf Cluster</h2>

<p>Through the generous donations and investment of Sentex Data
Communications, FreeBSD Systems, IronPort Systems, and the FreeBSD
Foundation, a network performance testbed has been created in Ontario,
Canada for use by FreeBSD developers working in the area of network
performance. A similar cluster, made possible through the generous
donation of Verio, is being prepared for use in more general SMP
performance work in Virginia, US. Each cluster consists of several SMP
systems inter-connected with gigabit Ethernet such that relatively
arbitrary topologies can be constructed in order to test host-host, IP
forwarding, and bridging performance scenarios. Systems are network
booted, and have serial consoles and remote power, in order to maximize
availability and minimize configuration overhead. These systems are
available on a check-out basis for experimentation and performance
measurement to FreeBSD developers working on the Netperf project and
in related areas.</p>

<p><a href="cluster.html">More detailed information on the netperf
cluster can be found by following this link.</a></p>

<a name="papers"></a>
<h2>Papers and Reports</h2>

<p>The following paper(s) have been produced by or are related to the
Netperf Project:</p>

<ul>
<li><p><a href="http://www.watson.org/~robert/freebsd/netperf/20051027-eurobsdcon2005-netperf.pdf">"Introduction to Multithreading and Multiprocessing in the FreeBSD SMPng Network Stack", EuroBSDCon 2005, Basel, Switzerland</a>.</p></li>
</ul>

<a name="links"></a>
<h2>Links</h2>

<p>Some useful links relating to the netperf work:</p>

<ul>
<li><p>SMPng Project -- Project to introduce
finer grained locking in the FreeBSD kernel.</p></li>

<li><p><a href="http://www.watson.org/~robert/freebsd/netperf/">Robert
Watson's netperf web page</a> -- Web page that includes a change log
and performance measurement/debugging information.</p></li>
</ul>

</body>
</html>