<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" [
<!ENTITY base CDATA "../..">
<!ENTITY date "$FreeBSD: www/en/projects/netperf/index.sgml,v 1.10 2004/12/04 12:18:00 ceri Exp $">
<!ENTITY title "FreeBSD Network Performance Project (netperf)">
<!ENTITY email 'mux'>
<!ENTITY % includes SYSTEM "../../includes.sgml"> %includes;

<!ENTITY status.na "<font color=green>N/A</font>">
<!ENTITY status.done "<font color=green>Done</font>">
<!ENTITY status.prototyped "<font color=blue>Prototyped</font>">
<!ENTITY status.head "<font color=orange>Merged to HEAD; RELENG_5 candidate</font>">
<!ENTITY status.new "<font color=red>New task</font>">
<!ENTITY status.unknown "<font color=red>Unknown</font>">

<!ENTITY % developers SYSTEM "../../developers.sgml"> %developers;

]>

<html>
  &header;

    <h2>Contents</h2>
    <ul>
      <li><a href="#goal">Project Goal</a></li>
      <li><a href="#strategies">Project Strategies</a></li>
      <li><a href="#tasks">Project Tasks</a></li>
      <li><a href="#cluster">Netperf Cluster</a></li>
      <li><a href="#links">Links</a></li>
    </ul>

    <a name="goal"></a>
    <h2>Project Goal</h2>

    <p>The netperf project is working to enhance the performance of the
      FreeBSD network stack.  This work grew out of the
      <a href="../../smp">SMPng Project</a>, which moved the FreeBSD kernel from
      a "Giant Lock" to more fine-grained locking and multi-threading.  SMPng
      brought the network stack both performance improvements and
      regressions: it improved parallelism and preemption, but substantially
      increased per-packet processing costs.  The netperf project is
      primarily focused on further improving parallelism in network
      processing while reducing SMP synchronization overhead.  This in
      turn will lead to higher processing throughput and lower processing
      latency.</p>

    <a name="strategies"></a>
    <h2>Project Strategies</h2>
    <p>Robert Watson</p>

    <p>The two primary focuses of this work are increasing parallelism
      and decreasing overhead.  Several activities are being performed
      that work toward these goals:</p>

    <ul>
      <li><p>Complete locking work to make sure all components of the stack
	are able to run without the Giant lock.  While most of the network
	stack, especially mainstream protocols, runs without Giant, some
	components require Giant to be placed back over the stack if compiled
	into the kernel, reducing parallelism.</p></li>

      <li><p>Optimize locking strategies to find better balances between
	locking granularity and locking overhead.  In the first cut at locking
	for the kernel, the goal was to adopt a medium-grained locking
	approach based on data locking.  This approach identifies critical
	data structures, and inserts new locks and locking operations to
	protect those data structures.  Depending on the data model of the
	code being protected, this may lead to the introduction of a
	substantial number of locks offering unnecessary granularity, where
	the overhead of locking overwhelms the benefits of available
	parallelism and preemption.  By selectively reducing granularity, it
	is possible to improve performance by decreasing locking overhead.
	</p></li>

      <li><p>Amortize the cost of locking by processing queues of packets or
	events.  While the cost of individual synchronization operations may
	be high, it is possible to amortize the cost of synchronization
	operations by grouping processing of similar data (packets, events)
	under the same protection.  This approach focuses on identifying
	places where similar locking occurs frequently in succession, and
	introducing queueing or coalescing of lock operations across the
	body of the work.  For example, when a series of packets is inserted
	into an outgoing interface queue, a basic locking approach would
	lock the queue for each insert operation, unlock it, and hand off to
	the interface driver to begin the send, repeating this sequence as
	required.  With a coalesced approach, the caller would pass off a
	queue of packets in order to reduce the locking overhead, as well as
	eliminate unnecessary synchronization due to the queue being
	thread-local (see the first sketch following this list).  This
	approach can be applied at several levels in the stack, and is
	particularly applicable at lower levels of the stack where streams
	of packets require almost identical processing.
	</p></li>

      <li><p>Introduce new synchronization strategies with reduced overhead
	relative to traditional strategies.  Most traditional strategies
	employ a combination of interrupt disabling and atomic operations to
	achieve mutual exclusion and non-preemption guarantees.  However,
	these operations are expensive on modern CPUs, leading to the desire
	for cheaper primitives with weaker semantics.  For example, the
	application of uni-processor primitives where synchronization is
	required only on a single processor, and optimizations to critical
	section primitives to avoid the need for interrupt disabling.
	</p></li>

      <li><p>Modify synchronization strategies to take advantage of
	additional, non-locking, synchronization primitives.  This approach
	might take the form of making increased use of per-CPU or per-thread
	data structures, which require little or no synchronization.  For
	example, through the use of critical sections, it is possible to
	synchronize access to per-CPU caches and queues (see the second
	sketch following this list).  Through the use of per-thread queues,
	data can be handed off between stack layers without the use of
	synchronization.</p></li>

      <li><p>Increase the opportunities for parallelism through increased
	threading in the network stack.  The current network stack model
	offers the opportunity for substantial parallelism, with outbound
	processing typically taking place in the context of the sending
	thread in kernel, crypto occurring in crypto worker threads, and
	receive processing taking place in a combination of the receiving
	ithread and dispatched netisr thread.  While handoffs between
	threads introduce overhead (synchronization, context switching),
	there is the opportunity to increase parallelism in some workloads
	by introducing additional worker threads.  Identifying work
	that may be relocated to new threads must be done carefully to
	balance overhead and latency concerns, but can pay off by
	increasing effective CPU utilization and hence throughput.  For
	example, introducing additional netisr threads capable of running on
	more than one CPU at a time can increase input parallelism, subject
	to maintaining desirable packet ordering.</p></li>
    </ul>
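
    <p>As a concrete illustration of the amortization strategy, the
      sketch below contrasts per-packet locking against a batched
      handoff.  It is a minimal sketch only: the names (struct
      pktqueue, ifq_lock(), driver_start(), and so on) are hypothetical
      and do not correspond to the actual FreeBSD driver API.</p>

    <pre>
struct ifnet;				/* opaque network interface */
struct mbuf;				/* opaque packet buffer */

struct pktqueue {
	struct mbuf	*pq_head;	/* first packet in the queue */
	struct mbuf	*pq_tail;	/* last packet in the queue */
};

extern void ifq_lock(struct ifnet *);
extern void ifq_unlock(struct ifnet *);
extern void ifq_enqueue_locked(struct ifnet *, struct mbuf *);
extern void driver_start(struct ifnet *);
extern struct mbuf *pktqueue_dequeue(struct pktqueue *);

/* Basic approach: one lock/unlock cycle and driver kick per packet. */
void
if_send_unbatched(struct ifnet *ifp, struct pktqueue *pq)
{
	struct mbuf *m;

	while ((m = pktqueue_dequeue(pq)) != NULL) {
		ifq_lock(ifp);
		ifq_enqueue_locked(ifp, m);
		ifq_unlock(ifp);
		driver_start(ifp);
	}
}

/*
 * Coalesced approach: the entire queue is handed off under a single
 * lock acquisition.  The pktqueue itself needs no locking, since it
 * is reachable only from the current thread.
 */
void
if_send_batched(struct ifnet *ifp, struct pktqueue *pq)
{
	struct mbuf *m;

	ifq_lock(ifp);
	while ((m = pktqueue_dequeue(pq)) != NULL)
		ifq_enqueue_locked(ifp, m);
	ifq_unlock(ifp);
	driver_start(ifp);
}
    </pre>

    <p>The batched variant performs one lock round trip and one driver
      dispatch regardless of how many packets are sent, which is where
      the reduction in per-packet synchronization cost is expected to
      come from.</p>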
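
    <p>The per-CPU strategy can likewise be sketched.  In the code
      below, critical_enter() and critical_exit() are the existing
      kernel primitives; the cache structure and its helpers are
      hypothetical, for illustration only.</p>

    <pre>
struct item {
	struct item	*next;		/* free-list linkage */
};

struct percpu_cache {
	struct item	*head;		/* touched only by the owning CPU */
};

extern struct percpu_cache cpu_cache[];	/* one cache slot per CPU */
extern int curcpu_id(void);		/* hypothetical current-CPU accessor */
extern void critical_enter(void);
extern void critical_exit(void);

struct item *
cache_alloc(void)
{
	struct percpu_cache *pc;
	struct item *it;

	/*
	 * A critical section prevents preemption, and therefore also
	 * migration to another CPU, so no other thread can touch this
	 * CPU's cache while we are in it: no atomic operations or
	 * mutexes are required.
	 */
	critical_enter();
	pc = cpu_cache + curcpu_id();
	it = pc->head;
	if (it != NULL)
		pc->head = it->next;
	critical_exit();

	/* A NULL result means falling back to the shared allocator. */
	return (it);
}
    </pre>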

    <a name="tasks"></a>
    <h2>Project Tasks</h2>

    <table border=3>
      <tr>
	<th> Task </th>
	<th> Responsible </th>
	<th> Last updated </th>
	<th> Status </th>
	<th> Notes </th>
      </tr>

      <tr>
	<td> Prefer file descriptor reference counts to socket reference
	  counts for system calls. </td>
	<td> &a.rwatson; </td>
	<td> 20041124 </td>
	<td> &status.done; </td>
	<td> Sockets and file descriptors both have reference counts in order
	  to prevent these objects from being freed while in use.  However,
	  if a file descriptor is used to reach the socket, the reference
	  counts are somewhat interchangeable, as either will prevent
	  undesired garbage collection.  For socket system calls, overhead
	  can be reduced by relying on the file descriptor reference count,
	  thus avoiding the synchronized operations necessary to modify the
	  socket reference count, an approach also taken in the VFS code.
	  This change has been made for most socket system calls, and has
	  been committed to HEAD (6.x).  It has also been merged to RELENG_5
	  for inclusion in 5.4. </td>
      </tr>

      <tr>
	<td> Mbuf queue library </td>
	<td> &a.rwatson; </td>
	<td> 20041124 </td>
	<td> &status.prototyped; </td>
	<td> In order to facilitate passing off queues of packets between
	  network stack components, create an mbuf queue primitive, struct
	  mbufqueue (see the sketch following this table).  The initial
	  implementation is complete, and the primitive is now being applied
	  in several sample cases to determine whether it offers the desired
	  semantics and benefits.  The implementation can be found in the
	  rwatson_dispatch Perforce branch.  Additional work must also be
	  done to explore the performance impact of "queues" vs arrays of
	  mbuf pointers, which are likely to behave better from a caching
	  perspective. </td>
      </tr>

      <tr>
	<td> Employ queued dispatch in interface send API </td>
	<td> &a.rwatson; </td>
	<td> 20041106 </td>
	<td> &status.prototyped; </td>
	<td> An experimental if_start_mbufqueue() interface to struct ifnet
	  has been added, which passes an mbuf queue to the device driver for
	  processing, avoiding redundant synchronization against the
	  interface queue, even in the event that additional queueing is
	  required.  This has not yet been benchmarked.  A subset change to
	  dispatch a single mbuf to a driver has also been prototyped, and
	  benchmarked at a several percentage point improvement in packet
	  send rates from user space. </td>
      </tr>

      <tr>
	<td> Employ queued dispatch in the interface receive API </td>
	<td> &a.rwatson; </td>
	<td> 20041106 </td>
	<td> &status.new; </td>
	<td> Similar to if_start_mbufqueue, allow input of a queue of mbufs
	  from the device driver into the lowest protocol layers, such as
	  ether_input_mbufqueue. </td>
      </tr>

      <tr>
	<td> Employ queued dispatch across netisr dispatch API </td>
	<td> &a.rwatson; </td>
	<td> 20041124 </td>
	<td> &status.prototyped; </td>
	<td> Pull all of the mbufs in the netisr ifqueue out of the ifqueue
	  into a thread-local mbuf queue to avoid repeated lock operations
	  to access the queue.  Also use lock-free operations to test for
	  queue contents being present.  This has been prototyped in the
	  rwatson_netperf branch. </td>
      </tr>

      <tr>
	<td> Modify UMA allocator to use critical sections not mutexes for
	  per-CPU caches. </td>
	<td> &a.rwatson; </td>
	<td> 20041124 </td>
	<td> &status.prototyped; </td>
	<td> The mutexes protecting per-CPU caches require atomic operations
	  on SMP systems; as the caches are per-CPU objects, the cost of
	  synchronizing access to them can be reduced by using CPU pinning
	  and/or critical sections instead.  A prototype of this has been
	  implemented in the rwatson_percpu branch, but is waiting on
	  critical section performance optimizations that will prevent this
	  change from negatively impacting uniprocessor performance.  The
	  critical section optimizations from John Baldwin have been posted
	  for public review. </td>
      </tr>

      <tr>
	<td> Optimize critical section performance </td>
	<td> &a.jhb; </td>
	<td> 20041124 </td>
	<td> &status.prototyped; </td>
	<td> Critical sections prevent preemption of a thread on a CPU, as
	  well as preventing migration of that thread to another CPU, and
	  may be used for synchronizing access to per-CPU data structures, as
	  well as for preventing recursion in interrupt processing.
	  Currently, critical sections disable interrupts on the CPU.  In
	  previous versions of FreeBSD (4.x and before), optimizations were
	  present that allowed for software interrupt disabling, which
	  lowers the cost of critical sections in the common case by
	  avoiding expensive microcode operations on the CPU.  By restoring
	  this model, or a variation on it, critical sections can be made
	  substantially cheaper to enter.  In particular, this change will
	  lower the cost of critical sections on UP to approximately that of
	  a mutex, meaning that optimizations on SMP to use critical
	  sections instead of mutexes will not harm UP performance.  A
	  prototype of this change is present in the jhb_lock Perforce
	  branch, and patches have been posted to per-architecture mailing
	  lists for review. </td>
      </tr>

    </table>
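
    <p>To make the queued dispatch tasks concrete, the sketch below
      shows the consumer-side pattern from the netisr task: the entire
      shared queue is stolen under a single lock acquisition, after
      which packets are processed from a thread-local queue with no
      further locking.  The struct mbufqueue layout and the helper
      names are illustrative guesses, not the actual implementation in
      the rwatson_dispatch branch.</p>

    <pre>
struct mbuf;				/* opaque packet buffer */

struct mbufqueue {
	struct mbuf	*mq_head;	/* first packet */
	struct mbuf	*mq_tail;	/* last packet */
	int		 mq_len;	/* packet count */
};

extern struct mbufqueue netisr_queue;	/* shared; protected by a mutex */
extern void netisr_queue_lock(void);
extern void netisr_queue_unlock(void);
extern struct mbuf *mbufqueue_dequeue(struct mbufqueue *);
extern void protocol_input(struct mbuf *); /* stand-in for e.g. ip_input() */

void
netisr_drain(void)
{
	struct mbufqueue local;
	struct mbuf *m;

	/* One lock round trip moves every pending packet at once. */
	netisr_queue_lock();
	local = netisr_queue;		/* steal the whole queue */
	netisr_queue.mq_head = NULL;
	netisr_queue.mq_tail = NULL;
	netisr_queue.mq_len = 0;
	netisr_queue_unlock();

	/* 'local' is visible only to this thread, so no locking. */
	while ((m = mbufqueue_dequeue(&amp;local)) != NULL)
		protocol_input(m);
}
    </pre>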

    <a name="cluster"></a>
    <h2>Netperf Cluster</h2>

    <p>Through the generous donations and investment of Sentex Data
      Communications, FreeBSD Systems, IronPort Systems, and the FreeBSD
      Foundation, a network performance testbed has been created in Ontario,
      Canada for use by FreeBSD developers working in the area of network
      performance.  A similar cluster, made possible through the generous
      donation of Verio, is being prepared for use in more general SMP
      performance work in Virginia, US.  Each cluster consists of several SMP
      systems inter-connected with gigabit Ethernet such that relatively
      arbitrary topologies can be constructed in order to test host-host, IP
      forwarding, and bridging performance scenarios.  Systems are network
      booted and have serial consoles and remote power control, in order to
      maximize availability and minimize configuration overhead.  These
      systems are available on a check-out basis for experimentation and
      performance measurement to FreeBSD developers working on the netperf
      project and in related areas.</p>

    <p>More detailed information on the netperf cluster can be found by
      following <a href="cluster.html">this link</a>.</p>

    <a name="links"></a>
    <h2>Links</h2>

    <p>Some useful links relating to the netperf work:</p>

    <ul>
      <li><p><a href="../../smp/">SMPng Project</a> -- Project to introduce
	finer-grained locking in the FreeBSD kernel.</p></li>

      <li><p><a href="http://www.watson.org/~robert/freebsd/netperf/">Robert
	Watson's netperf web page</a> -- Web page that includes a change log
	and performance measurement/debugging information.</p></li>
    </ul>

  &footer;
  </body>
</html>