diff --git a/en/projects/netperf/Makefile b/en/projects/netperf/Makefile new file mode 100644 index 0000000000..0f8c6a2171 --- /dev/null +++ b/en/projects/netperf/Makefile @@ -0,0 +1,17 @@ +# Summary for netperf project status +# +# $FreeBSD: www/en/projects/busdma/Makefile,v 1.1 2002/12/09 21:36:29 rwatson Exp $ + +MAINTAINER= rwatson + +.if exists(../Makefile.conf) +.include "../Makefile.conf" +.endif +.if exists(../Makefile.inc) +.include "../Makefile.inc" +.endif + +DOCS= index.sgml +DATA= style.css + +.include "${WEB_PREFIX}/share/mk/web.site.mk" diff --git a/en/projects/netperf/index.sgml b/en/projects/netperf/index.sgml new file mode 100644 index 0000000000..0a7f651cc6 --- /dev/null +++ b/en/projects/netperf/index.sgml @@ -0,0 +1,196 @@ + + + + + %includes; + +N/A"> +Done"> +In progress"> +Needs testing"> +New task"> +Unknown"> + + %developers; + +]> + + + &header; +
The netperf project is working to enhance the performance of the FreeBSD network stack. This work grew out of the SMPng Project, which moved the FreeBSD kernel from a "Giant Lock" to more fine-grained locking and multi-threading. SMPng brought the network stack both performance improvements and regressions: it improved parallelism and preemption, but substantially increased per-packet processing costs. The netperf project is primarily focused on further improving parallelism in network processing while reducing SMP synchronization overhead. This in turn will lead to higher processing throughput and lower processing latency.
+ + +Robert Watson
The two primary focuses of this work are increasing parallelism and decreasing overhead. Several activities are under way that work towards these goals:
Complete locking work to make sure all components of the stack are able to run without the Giant lock. While most of the network stack, especially the mainstream protocols, runs without Giant, some components still require Giant to be placed back over the stack when they are compiled into the kernel, reducing parallelism.
Optimize locking strategies to find a better balance between locking granularity and locking overhead. In the first-cut locking work on the kernel, the goal was to adopt a medium-grained locking approach based on data locking. This approach identifies critical data structures and inserts new locks and locking operations to protect those data structures. Depending on the data model of the code being protected, this may lead to the introduction of a substantial number of locks offering unnecessary granularity, where the overhead of locking overwhelms the benefits of the available parallelism and preemption. By selectively reducing granularity, it is possible to improve performance by decreasing locking overhead.
Amortize the cost of locking by processing queues of packets or events. While the cost of individual synchronization operations may be high, it is possible to amortize the cost of synchronization operations by grouping processing of similar data (packets, events) under the same protection. This approach focuses on identifying places where similar locking occurs frequently in succession, and introducing queueing or coalescing of lock operations across the body of the work. For example, when a series of packets is inserted into an outgoing interface queue, a basic locking approach would lock the queue for each insert operation, unlock it, and hand off to the interface driver to begin the send, repeating this sequence as required. With a coalesced approach, the caller would pass off a queue of packets in order to reduce the locking overhead, as well as eliminate unnecessary synchronization due to the queue being thread-local. This approach can be applied at several levels in the stack, and is particularly applicable at lower levels of the stack where streams of packets require almost identical processing.
Introduce new synchronization strategies with reduced overhead relative to traditional strategies. Most traditional strategies employ a combination of interrupt disabling and atomic operations to achieve mutual exclusion and non-preemption guarantees. However, these operations are expensive on modern CPUs, leading to the desire for cheaper primitives with weaker semantics. Examples include applying uniprocessor primitives where synchronization is required only on a single processor, and optimizing critical section primitives to avoid the need for interrupt disabling.
Modify synchronization strategies to take advantage of additional, non-locking, synchronization primitives. This approach might take the form of making increased use of per-CPU or per-thread data structures, which require little or no synchronization. For example, through the use of critical sections, it is possible to synchronize access to per-CPU caches and queues. Through the use of per-thread queues, data can be handed off between stack layers without the use of synchronization.
Increase the opportunities for parallelism through increased threading in the network stack. The current network stack model offers the opportunity for substantial parallelism: outbound processing typically takes place in the context of the sending thread in the kernel, crypto occurs in crypto worker threads, and receive processing takes place in a combination of the receiving ithread and the dispatched netisr thread. While handoffs between threads introduce overhead (synchronization, context switching), there is the opportunity to increase parallelism in some workloads by introducing additional worker threads. Identifying work that may be relocated to new threads must be done carefully to balance overhead and latency concerns, but can pay off by increasing effective CPU utilization and hence throughput. For example, introducing additional netisr threads capable of running on more than one CPU at a time can increase input parallelism, subject to maintaining desirable packet ordering.
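The queue-amortization strategy described above can be illustrated with a small userspace sketch in plain C with pthreads. The names here (struct pkt, ifq_enqueue_one, ifq_enqueue_batch) are hypothetical illustrations, not FreeBSD kernel interfaces: the point is that handing off a caller-built chain of packets replaces one lock/unlock pair per packet with a single acquisition for the whole batch.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical packet structure; stands in for an mbuf chain. */
struct pkt {
	struct pkt	*next;
	int		 len;
};

/* A lock-protected interface send queue. */
struct ifqueue {
	pthread_mutex_t	 lock;
	struct pkt	*head;
	struct pkt	*tail;
	int		 depth;
};

/* Per-packet style: one lock/unlock cycle for every packet. */
void
ifq_enqueue_one(struct ifqueue *q, struct pkt *p)
{
	pthread_mutex_lock(&q->lock);
	p->next = NULL;
	if (q->tail != NULL)
		q->tail->next = p;
	else
		q->head = p;
	q->tail = p;
	q->depth++;
	pthread_mutex_unlock(&q->lock);
}

/*
 * Amortized style: the caller batches packets on a thread-local
 * chain (no locking needed there) and hands the whole chain off
 * under a single lock acquisition.
 */
void
ifq_enqueue_batch(struct ifqueue *q, struct pkt *chain_head,
    struct pkt *chain_tail, int n)
{
	pthread_mutex_lock(&q->lock);
	if (q->tail != NULL)
		q->tail->next = chain_head;
	else
		q->head = chain_head;
	q->tail = chain_tail;
	q->depth += n;
	pthread_mutex_unlock(&q->lock);
}
```

For a batch of N packets the coalesced path performs one lock acquisition instead of N, which is exactly the amortization the interface-queue example in the text describes.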
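The per-CPU/per-thread data strategy can likewise be sketched in userspace with C11 thread-local storage (all names here are hypothetical, not kernel API): the common allocation path touches only a per-thread free list and so needs no lock at all, falling back to a shared, mutex-protected pool only when the local cache is empty.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/*
 * Hypothetical allocation cache: each thread keeps a small private
 * free list, so the common alloc/free path takes no shared lock.
 */
struct item {
	struct item	*next;
};

static _Thread_local struct item *local_cache;	/* per-thread, lock-free */

static struct item	*global_pool;		/* shared, lock-protected */
static pthread_mutex_t	 pool_lock = PTHREAD_MUTEX_INITIALIZER;

struct item *
cache_alloc(void)
{
	struct item *it = local_cache;

	if (it != NULL) {			/* fast path: no synchronization */
		local_cache = it->next;
		return (it);
	}
	pthread_mutex_lock(&pool_lock);		/* slow path: refill from pool */
	it = global_pool;
	if (it != NULL)
		global_pool = it->next;
	pthread_mutex_unlock(&pool_lock);
	if (it == NULL)
		it = malloc(sizeof(*it));
	return (it);
}

void
cache_free(struct item *it)
{
	it->next = local_cache;			/* always lock-free */
	local_cache = it;
}
```

In the kernel the equivalent fast path would be guarded by a critical section rather than relying on thread-local storage, but the structure of the optimization is the same: the shared lock is only touched when the private cache runs dry.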
Task | Responsible | Last updated | Status | Notes
---|---|---|---|---
Mbuf queue library | &a.rwatson; | 20041106 | &status.wip; | In order to facilitate passing off queues of packets between network stack components, create an mbuf queue primitive, struct mbufqueue. The initial implementation is complete, and the primitive is now being applied in several sample cases to determine whether it offers the desired semantics and benefits. The implementation can be found in the rwatson_dispatch Perforce branch.
Employ queued dispatch in interface send API | &a.rwatson; | 20041106 | &status.wip; | An experimental if_start_mbufqueue() interface to struct ifnet has been added, which passes an mbuf queue to the device driver for processing, avoiding redundant synchronization against the interface queue, even in the event that additional queueing is required. This has not yet been benchmarked.
Employ queued dispatch in the interface receive API | &a.rwatson; | 20041106 | &status.new; | Similar to if_start_mbufqueue(), allow input of a queue of mbufs from the device driver into the lowest protocol layers, such as ether_input_mbufqueue().
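The actual struct mbufqueue implementation lives in the rwatson_dispatch Perforce branch; as a rough userspace sketch of what such a primitive involves (field and function names here are guesses for illustration, not the branch's API), a packet queue supporting O(1) enqueue, dequeue, and whole-queue concatenation might look like:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mbuf; only the packet link is needed. */
struct mbuf {
	struct mbuf	*m_nextpkt;
};

/*
 * Minimal packet-queue primitive in the spirit of the mbuf queue
 * library: head/tail pointers give O(1) enqueue and dequeue, and
 * whole queues can be concatenated in O(1) for handoff between
 * stack layers.
 */
struct mbufqueue {
	struct mbuf	*mq_head;
	struct mbuf	*mq_tail;
	int		 mq_len;
};

void
mbufq_init(struct mbufqueue *mq)
{
	mq->mq_head = mq->mq_tail = NULL;
	mq->mq_len = 0;
}

void
mbufq_enqueue(struct mbufqueue *mq, struct mbuf *m)
{
	m->m_nextpkt = NULL;
	if (mq->mq_tail != NULL)
		mq->mq_tail->m_nextpkt = m;
	else
		mq->mq_head = m;
	mq->mq_tail = m;
	mq->mq_len++;
}

struct mbuf *
mbufq_dequeue(struct mbufqueue *mq)
{
	struct mbuf *m = mq->mq_head;

	if (m != NULL) {
		mq->mq_head = m->m_nextpkt;
		if (mq->mq_head == NULL)
			mq->mq_tail = NULL;
		mq->mq_len--;
	}
	return (m);
}

/*
 * Append all of src to dst and empty src; this O(1) handoff is the
 * operation behind an if_start_mbufqueue()-style interface.
 */
void
mbufq_concat(struct mbufqueue *dst, struct mbufqueue *src)
{
	if (src->mq_head == NULL)
		return;
	if (dst->mq_tail != NULL)
		dst->mq_tail->m_nextpkt = src->mq_head;
	else
		dst->mq_head = src->mq_head;
	dst->mq_tail = src->mq_tail;
	dst->mq_len += src->mq_len;
	mbufq_init(src);
}
```

Because concatenation splices the two lists by pointer rather than moving packets one at a time, a driver can accept an entire queue under a single lock acquisition, preserving packet order.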
Some useful links relating to the netperf work:
SMPng Project -- Project to introduce finer-grained locking in the FreeBSD kernel.
Robert Watson's netperf web page -- Web page that includes a change log and performance measurement/debugging information.