<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" [
<!ENTITY base CDATA "../..">
<!ENTITY date "$FreeBSD: www/en/projects/netperf/index.sgml,v 1.10 2004/12/04 12:18:00 ceri Exp $">
<!ENTITY title "FreeBSD Network Performance Project (netperf)">
<!ENTITY email 'mux'>
<!ENTITY % includes SYSTEM "../../includes.sgml"> %includes;

<!ENTITY status.na "<font color=green>N/A</font>">
<!ENTITY status.done "<font color=green>Done</font>">
<!ENTITY status.prototyped "<font color=blue>Prototyped</font>">
<!ENTITY status.head "<font color=orange>Merged to HEAD; RELENG_5 candidate</font>">
<!ENTITY status.new "<font color=red>New task</font>">
<!ENTITY status.unknown "<font color=red>Unknown</font>">

<!ENTITY % developers SYSTEM "../../developers.sgml"> %developers;

]>

<html>
  &header;

    <h2>Contents</h2>
    <ul>
      <li><a href="#goal">Project Goal</a></li>
      <li><a href="#strategies">Project Strategies</a></li>
      <li><a href="#tasks">Project Tasks</a></li>
      <li><a href="#cluster">Netperf Cluster</a></li>
      <li><a href="#links">Links</a></li>
    </ul>

    <a name="goal"></a>
    <h2>Project Goal</h2>

    <p>The netperf project is working to enhance the performance of the
      FreeBSD network stack.  This work grew out of the
      <a href="../../smp">SMPng Project</a>, which moved the FreeBSD kernel from
      a "Giant Lock" to more fine-grained locking and multi-threading.  SMPng
      brought the network stack both performance improvements and
      regressions: it improved parallelism and preemption, but substantially
      increased per-packet processing costs.  The netperf project is
      primarily focused on further improving parallelism in network
      processing while reducing SMP synchronization overhead.  This in
      turn will lead to higher processing throughput and lower processing
      latency.</p>

    <a name="strategies"></a>
    <h2>Project Strategies</h2>
    <p>Robert Watson</p>

    <p>The two primary focuses of this work are increasing parallelism
      and decreasing overhead.  Several activities are being performed
      that work toward these goals:</p>

    <ul>
      <li><p>Complete locking work to make sure all components of the stack
	are able to run without the Giant lock.  While most of the network
	stack, especially mainstream protocols, runs without Giant, some
	components require Giant to be placed back over the stack if compiled
	into the kernel, reducing parallelism.</p></li>

      <li><p>Optimize locking strategies to find better balances between
	locking granularity and locking overhead.  In the first cut at locking
	for the kernel, the goal was to adopt a medium-grained locking
	approach based on data locking.  This approach identifies critical
	data structures, and inserts new locks and locking operations to
	protect those data structures.  Depending on the data model of the
	code being protected, this may lead to the introduction of a
	substantial number of locks offering unnecessary granularity, where
	the overhead of locking overwhelms the benefits of available
	parallelism and preemption.  By selectively reducing granularity, it
	is possible to improve performance by decreasing locking overhead.
	</p></li>

      <li><p>Amortize the cost of locking by processing queues of packets or
	events.  While the cost of individual synchronization operations may
	be high, it is possible to amortize the cost of synchronization
	operations by grouping processing of similar data (packets, events)
	under the same protection.  This approach focuses on identifying
	places where similar locking occurs frequently in succession, and
	introducing queueing or coalescing of lock operations across the
	body of the work.  For example, when a series of packets is inserted
	into an outgoing interface queue, a basic locking approach would
	lock the queue for each insert operation, unlock it, and hand off to
	the interface driver to begin the send, repeating this sequence as
	required.  With a coalesced approach, the caller would pass off a
	queue of packets in order to reduce the locking overhead, as well as
	eliminate unnecessary synchronization due to the queue being
	thread-local (see the first sketch following this list).  This
	approach can be applied at several levels in the stack, and is
	particularly applicable at lower levels of the stack where streams
	of packets require almost identical processing.
	</p></li>

      <li><p>Introduce new synchronization strategies with reduced overhead
	relative to traditional strategies.  Most traditional strategies
	employ a combination of interrupt disabling and atomic operations to
	achieve mutual exclusion and non-preemption guarantees.  However,
	these operations are expensive on modern CPUs, leading to the desire
	for cheaper primitives with weaker semantics.  For example, the
	application of uni-processor primitives where synchronization is
	required only on a single processor, and optimizations to critical
	section primitives to avoid the need for interrupt disabling.
	</p></li>

      <li><p>Modify synchronization strategies to take advantage of
	additional, non-locking, synchronization primitives.  This approach
	might take the form of making increased use of per-CPU or per-thread
	data structures, which require little or no synchronization.  For
	example, through the use of critical sections, it is possible to
	synchronize access to per-CPU caches and queues (see the second
	sketch following this list).  Through the use of per-thread queues,
	data can be handed off between stack layers without the use of
	synchronization.</p></li>

      <li><p>Increase the opportunities for parallelism through increased
	threading in the network stack.  The current network stack model
	offers the opportunity for substantial parallelism, with outbound
	processing typically taking place in the context of the sending
	thread in kernel, crypto occurring in crypto worker threads, and
	receive processing taking place in a combination of the receiving
	ithread and dispatched netisr thread.  While handoffs between
	threads introduce overhead (synchronization, context switching),
	there is the opportunity to increase parallelism in some workloads
	by introducing additional worker threads.  Identifying work
	that may be relocated to new threads must be done carefully to
	balance overhead and latency concerns, but can pay off by
	increasing effective CPU utilization and hence throughput.  For
	example, introducing additional netisr threads capable of running on
	more than one CPU at a time can increase input parallelism, subject
	to maintaining desirable packet ordering.</p></li>
    </ul>
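
    <p>As a concrete illustration of the amortization strategy, the
      sketch below contrasts per-packet locking against a batched
      handoff.  It is a minimal sketch only: the names (struct
      pktqueue, ifq_lock(), driver_start(), and so on) are hypothetical
      and do not correspond to the actual FreeBSD driver API.</p>

    <pre>
struct ifnet;				/* opaque network interface */
struct mbuf;				/* opaque packet buffer */

struct pktqueue {
	struct mbuf	*pq_head;	/* first packet in the queue */
	struct mbuf	*pq_tail;	/* last packet in the queue */
};

extern void ifq_lock(struct ifnet *);
extern void ifq_unlock(struct ifnet *);
extern void ifq_enqueue_locked(struct ifnet *, struct mbuf *);
extern void driver_start(struct ifnet *);
extern struct mbuf *pktqueue_dequeue(struct pktqueue *);

/* Basic approach: one lock/unlock cycle and driver kick per packet. */
void
if_send_unbatched(struct ifnet *ifp, struct pktqueue *pq)
{
	struct mbuf *m;

	while ((m = pktqueue_dequeue(pq)) != NULL) {
		ifq_lock(ifp);
		ifq_enqueue_locked(ifp, m);
		ifq_unlock(ifp);
		driver_start(ifp);
	}
}

/*
 * Coalesced approach: the entire queue is handed off under a single
 * lock acquisition.  The pktqueue itself needs no locking, since it
 * is reachable only from the current thread.
 */
void
if_send_batched(struct ifnet *ifp, struct pktqueue *pq)
{
	struct mbuf *m;

	ifq_lock(ifp);
	while ((m = pktqueue_dequeue(pq)) != NULL)
		ifq_enqueue_locked(ifp, m);
	ifq_unlock(ifp);
	driver_start(ifp);
}
    </pre>

    <p>The batched variant performs one lock round trip and one driver
      dispatch regardless of how many packets are sent, which is where
      the reduction in per-packet synchronization cost is expected to
      come from.</p>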
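
    <p>The per-CPU strategy can likewise be sketched.  In the code
      below, critical_enter() and critical_exit() are the existing
      kernel primitives; the cache structure and its helpers are
      hypothetical, for illustration only.</p>

    <pre>
struct item {
	struct item	*next;		/* free-list linkage */
};

struct percpu_cache {
	struct item	*head;		/* touched only by the owning CPU */
};

extern struct percpu_cache cpu_cache[];	/* one cache slot per CPU */
extern int curcpu_id(void);		/* hypothetical current-CPU accessor */
extern void critical_enter(void);
extern void critical_exit(void);

struct item *
cache_alloc(void)
{
	struct percpu_cache *pc;
	struct item *it;

	/*
	 * A critical section prevents preemption, and therefore also
	 * migration to another CPU, so no other thread can touch this
	 * CPU's cache while we are in it: no atomic operations or
	 * mutexes are required.
	 */
	critical_enter();
	pc = cpu_cache + curcpu_id();
	it = pc->head;
	if (it != NULL)
		pc->head = it->next;
	critical_exit();

	/* A NULL result means falling back to the shared allocator. */
	return (it);
}
    </pre>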

    <a name="tasks"></a>
    <h2>Project Tasks</h2>

    <table border=3>
      <tr>
	<th> Task </th>
	<th> Responsible </th>
	<th> Last updated </th>
	<th> Status </th>
	<th> Notes </th>
      </tr>

      <tr>
	<td> Prefer file descriptor reference counts to socket reference
	  counts for system calls. </td>
	<td> &a.rwatson; </td>
	<td> 20041124 </td>
	<td> &status.done; </td>
	<td> Sockets and file descriptors both have reference counts in order
	  to prevent these objects from being freed while in use.  However,
	  if a file descriptor is used to reach the socket, the reference
	  counts are somewhat interchangeable, as either will prevent
	  undesired garbage collection.  For socket system calls, overhead
	  can be reduced by relying on the file descriptor reference count,
	  thus avoiding the synchronized operations necessary to modify the
	  socket reference count, an approach also taken in the VFS code.
	  This change has been made for most socket system calls, and has
	  been committed to HEAD (6.x).  It has also been merged to RELENG_5
	  for inclusion in 5.4. </td>
      </tr>

      <tr>
	<td> Mbuf queue library </td>
	<td> &a.rwatson; </td>
	<td> 20041124 </td>
	<td> &status.prototyped; </td>
	<td> In order to facilitate passing off queues of packets between
	  network stack components, create an mbuf queue primitive, struct
	  mbufqueue (see the sketch following this table).  The initial
	  implementation is complete, and the primitive is now being applied
	  in several sample cases to determine whether it offers the desired
	  semantics and benefits.  The implementation can be found in the
	  rwatson_dispatch Perforce branch.  Additional work must also be
	  done to explore the performance impact of "queues" vs arrays of
	  mbuf pointers, which are likely to behave better from a caching
	  perspective. </td>
      </tr>

      <tr>
	<td> Employ queued dispatch in interface send API </td>
	<td> &a.rwatson; </td>
	<td> 20041106 </td>
	<td> &status.prototyped; </td>
	<td> An experimental if_start_mbufqueue() interface to struct ifnet
	  has been added, which passes an mbuf queue to the device driver for
	  processing, avoiding redundant synchronization against the
	  interface queue, even in the event that additional queueing is
	  required.  This has not yet been benchmarked.  A subset change to
	  dispatch a single mbuf to a driver has also been prototyped, and
	  benchmarked at a several percentage point improvement in packet
	  send rates from user space. </td>
      </tr>

      <tr>
	<td> Employ queued dispatch in the interface receive API </td>
	<td> &a.rwatson; </td>
	<td> 20041106 </td>
	<td> &status.new; </td>
	<td> Similar to if_start_mbufqueue, allow input of a queue of mbufs
	  from the device driver into the lowest protocol layers, such as
	  ether_input_mbufqueue. </td>
      </tr>

      <tr>
	<td> Employ queued dispatch across netisr dispatch API </td>
	<td> &a.rwatson; </td>
	<td> 20041124 </td>
	<td> &status.prototyped; </td>
	<td> Pull all of the mbufs in the netisr ifqueue out of the ifqueue
	  into a thread-local mbuf queue to avoid repeated lock operations
	  to access the queue.  Also use lock-free operations to test for
	  queue contents being present.  This has been prototyped in the
	  rwatson_netperf branch. </td>
      </tr>

      <tr>
	<td> Modify UMA allocator to use critical sections not mutexes for
	  per-CPU caches. </td>
	<td> &a.rwatson; </td>
	<td> 20041124 </td>
	<td> &status.prototyped; </td>
	<td> The mutexes protecting per-CPU caches require atomic operations
	  on SMP systems; as the caches are per-CPU objects, the cost of
	  synchronizing access to them can be reduced by using CPU pinning
	  and/or critical sections instead.  A prototype of this has been
	  implemented in the rwatson_percpu branch, but is waiting on
	  critical section performance optimizations that will prevent this
	  change from negatively impacting uniprocessor performance.  The
	  critical section optimizations from John Baldwin have been posted
	  for public review. </td>
      </tr>

      <tr>
	<td> Optimize critical section performance </td>
	<td> &a.jhb; </td>
	<td> 20041124 </td>
	<td> &status.prototyped; </td>
	<td> Critical sections prevent preemption of a thread on a CPU, as
	  well as preventing migration of that thread to another CPU, and
	  may be used for synchronizing access to per-CPU data structures, as
	  well as for preventing recursion in interrupt processing.
	  Currently, critical sections disable interrupts on the CPU.  In
	  previous versions of FreeBSD (4.x and before), optimizations were
	  present that allowed for software interrupt disabling, which
	  lowers the cost of critical sections in the common case by
	  avoiding expensive microcode operations on the CPU.  By restoring
	  this model, or a variation on it, critical sections can be made
	  substantially cheaper to enter.  In particular, this change will
	  lower the cost of critical sections on UP to approximately that of
	  a mutex, meaning that optimizations on SMP to use critical
	  sections instead of mutexes will not harm UP performance.  A
	  prototype of this change is present in the jhb_lock Perforce
	  branch, and patches have been posted to per-architecture mailing
	  lists for review. </td>
      </tr>

    </table>
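
    <p>To make the queued dispatch tasks concrete, the sketch below
      shows the consumer-side pattern from the netisr task: the entire
      shared queue is stolen under a single lock acquisition, after
      which packets are processed from a thread-local queue with no
      further locking.  The struct mbufqueue layout and the helper
      names are illustrative guesses, not the actual implementation in
      the rwatson_dispatch branch.</p>

    <pre>
struct mbuf;				/* opaque packet buffer */

struct mbufqueue {
	struct mbuf	*mq_head;	/* first packet */
	struct mbuf	*mq_tail;	/* last packet */
	int		 mq_len;	/* packet count */
};

extern struct mbufqueue netisr_queue;	/* shared; protected by a mutex */
extern void netisr_queue_lock(void);
extern void netisr_queue_unlock(void);
extern struct mbuf *mbufqueue_dequeue(struct mbufqueue *);
extern void protocol_input(struct mbuf *); /* stand-in for e.g. ip_input() */

void
netisr_drain(void)
{
	struct mbufqueue local;
	struct mbuf *m;

	/* One lock round trip moves every pending packet at once. */
	netisr_queue_lock();
	local = netisr_queue;		/* steal the whole queue */
	netisr_queue.mq_head = NULL;
	netisr_queue.mq_tail = NULL;
	netisr_queue.mq_len = 0;
	netisr_queue_unlock();

	/* 'local' is visible only to this thread, so no locking. */
	while ((m = mbufqueue_dequeue(&amp;local)) != NULL)
		protocol_input(m);
}
    </pre>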

    <a name="cluster"></a>
    <h2>Netperf Cluster</h2>

    <p>Through the generous donations and investment of Sentex Data
      Communications, FreeBSD Systems, IronPort Systems, and the FreeBSD
      Foundation, a network performance testbed has been created in Ontario,
      Canada for use by FreeBSD developers working in the area of network
      performance.  A similar cluster, made possible through the generous
      donation of Verio, is being prepared for use in more general SMP
      performance work in Virginia, US.  Each cluster consists of several SMP
      systems inter-connected with gigabit Ethernet such that relatively
      arbitrary topologies can be constructed in order to test host-host, IP
      forwarding, and bridging performance scenarios.  Systems are network
      booted and have serial consoles and remote power control, in order to
      maximize availability and minimize configuration overhead.  These
      systems are available on a check-out basis for experimentation and
      performance measurement to FreeBSD developers working on the netperf
      project and in related areas.</p>

    <p>More detailed information on the netperf cluster can be found by
      following <a href="cluster.html">this link</a>.</p>

    <a name="links"></a>
    <h2>Links</h2>

    <p>Some useful links relating to the netperf work:</p>

    <ul>
      <li><p><a href="../../smp/">SMPng Project</a> -- Project to introduce
	finer-grained locking in the FreeBSD kernel.</p></li>

      <li><p><a href="http://www.watson.org/~robert/freebsd/netperf/">Robert
	Watson's netperf web page</a> -- Web page that includes a change log
	and performance measurement/debugging information.</p></li>
    </ul>

  &footer;
  </body>
</html>