s/FreeBSD/&os;/
Suggested by: Benjamin Lukas (qavvap att googlemail dott com)
parent 3a6c136eee
commit 77f5bcecff
Notes:
svn2git
2020-12-08 03:00:23 +00:00
svn path=/head/; revision=35896
1 changed file with 46 additions and 46 deletions
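The commit title is written as a sed-style substitution. A minimal sketch of the same edit in Python follows, assuming the change is a plain literal replacement of every `FreeBSD` with the `&os;` entity; the function name `replace_product_name` is hypothetical and is not part of any FreeBSD documentation tooling. Note that in actual sed a bare `&` in the replacement stands for the matched text, so the title taken literally would need the ampersand escaped (`\&os;`).

```python
def replace_product_name(line: str) -> str:
    """Replace the literal product name with the &os; SGML entity.

    Hypothetical helper for illustration only. str.replace treats both
    arguments as literal strings, so '&' needs no escaping here.
    """
    return line.replace("FreeBSD", "&os;")


if __name__ == "__main__":
    before = "<title>Design elements of the FreeBSD VM system</title>"
    print(replace_product_name(before))
```

Lines that do not contain the product name pass through unchanged, which matches the diff below: only lines mentioning the name differ between the two sides.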
@@ -8,7 +8,7 @@

 <article>
 <articleinfo>
-<title>Design elements of the FreeBSD VM system</title>
+<title>Design elements of the &os; VM system</title>

 <authorgroup>
 <author>
@@ -36,7 +36,7 @@
 <para>The title is really just a fancy way of saying that I am going to
 attempt to describe the whole VM enchilada, hopefully in a way that
 everyone can follow. For the last year I have concentrated on a number
-of major kernel subsystems within FreeBSD, with the VM and Swap
+of major kernel subsystems within &os;, with the VM and Swap
 subsystems being the most interesting and NFS being <quote>a necessary
 chore</quote>. I rewrote only small portions of the code. In the VM
 arena the only major rewrite I have done is to the swap subsystem.
@@ -53,7 +53,7 @@
 <para>This article was originally published in the January 2000 issue of
 <ulink url="http://www.daemonnews.org/">DaemonNews</ulink>. This
 version of the article may include updates from Matt and other authors
-to reflect changes in FreeBSD's VM implementation.</para>
+to reflect changes in &os;'s VM implementation.</para>
 </legalnotice>
 </articleinfo>

@@ -71,7 +71,7 @@
 operating system by some people, those of us who work on it tend to view
 it more as a <quote>mature</quote> codebase which has various components
 modified, extended, or replaced with modern code. It has evolved, and
-FreeBSD is at the bleeding edge no matter how old some of the code might
+&os; is at the bleeding edge no matter how old some of the code might
 be. This is an important distinction to make and one that is
 unfortunately lost to many people. The biggest error a programmer can
 make is to not learn from history, and this is precisely the error that
@@ -89,13 +89,13 @@
 right because our marketing department says so</quote>. I have little
 tolerance for anyone who cannot learn from history.</para>

-<para>Much of the apparent complexity of the FreeBSD design, especially in
+<para>Much of the apparent complexity of the &os; design, especially in
 the VM/Swap subsystem, is a direct result of having to solve serious
 performance issues that occur under various conditions. These issues
 are not due to bad algorithmic design but instead rise from
 environmental factors. In any direct comparison between platforms,
 these issues become most apparent when system resources begin to get
-stressed. As I describe FreeBSD's VM/Swap subsystem the reader should
+stressed. As I describe &os;'s VM/Swap subsystem the reader should
 always keep two points in mind. First, the most important aspect of
 performance design is what is known as <quote>Optimizing the Critical
 Path</quote>. It is often the case that performance optimizations add a
@@ -117,7 +117,7 @@
 <sect1 id="vm-objects">
 <title>VM Objects</title>

-<para>The best way to begin describing the FreeBSD VM system is to look at
+<para>The best way to begin describing the &os; VM system is to look at
 it from the perspective of a user-level process. Each user process sees
 a single, private, contiguous VM address space containing several types
 of memory objects. These objects have various characteristics. Program
@@ -157,7 +157,7 @@
 (parent and child) expects their own personal post-fork modifications to
 remain private to themselves and not effect the other.</para>

-<para>FreeBSD manages all of this with a layered VM Object model. The
+<para>&os; manages all of this with a layered VM Object model. The
 original binary program file winds up being the lowest VM Object layer.
 A copy-on-write layer is pushed on top of that to hold those pages which
 had to be copied from the original file. If the program modifies a data
@@ -235,7 +235,7 @@
 The original page in B is now completely hidden since both C1 and C2
 have a copy and B could theoretically be destroyed if it does not
 represent a <quote>real</quote> file; however, this sort of optimization is not
-trivial to make because it is so fine-grained. FreeBSD does not make
+trivial to make because it is so fine-grained. &os; does not make
 this optimization. Now, suppose (as is often the case) that the child
 process does an <function>exec()</function>. Its current address space
 is usually replaced by a new address space representing a new file. In
@@ -274,7 +274,7 @@
 get their own private copies of the page and the original page in B is
 no longer accessible by anyone. That page in B can be freed.</para>

-<para>FreeBSD solves the deep layering problem with a special optimization
+<para>&os; solves the deep layering problem with a special optimization
 called the <quote>All Shadowed Case</quote>. This case occurs if either
 C1 or C2 take sufficient COW faults to completely shadow all pages in B.
 Lets say that C1 achieves this. C1 can now bypass B entirely, so rather
@@ -303,7 +303,7 @@
 copying need take place. The disadvantage is that you can build a
 relatively complex VM Object layering that slows page fault handling
 down a little, and you spend memory managing the VM Object structures.
-The optimizations FreeBSD makes proves to reduce the problems enough
+The optimizations &os; makes proves to reduce the problems enough
 that they can be ignored, leaving no real disadvantage.</para>
 </sect1>

@@ -315,12 +315,12 @@
 backing object (usually a file) can no longer be used to save a copy of
 the page when the VM system needs to reuse it for other purposes. This
 is where SWAP comes in. SWAP is allocated to create backing store for
-memory that does not otherwise have it. FreeBSD allocates the swap
+memory that does not otherwise have it. &os; allocates the swap
 management structure for a VM Object only when it is actually needed.
 However, the swap management structure has had problems
 historically.</para>

-<para>Under FreeBSD 3.X the swap management structure preallocates an
+<para>Under &os; 3.X the swap management structure preallocates an
 array that encompasses the entire object requiring swap backing
 store—even if only a few pages of that object are swap-backed.
 This creates a kernel memory fragmentation problem when large objects
@@ -337,7 +337,7 @@
 fly for additional swap management structures when a swapout occurs. It
 is evident that there was plenty of room for improvement.</para>

-<para>For FreeBSD 4.X, I completely rewrote the swap subsystem. With this
+<para>For &os; 4.X, I completely rewrote the swap subsystem. With this
 rewrite, swap management structures are allocated through a hash table
 rather than a linear array giving them a fixed allocation size and much
 finer granularity. Rather then using a linearly linked list to keep
@@ -373,7 +373,7 @@
 hundreds of thousands of CPU cycles and a noticeable stall of the
 affected processes, so we are willing to endure a significant amount of
 overhead in order to be sure that the right page is chosen. This is why
-FreeBSD tends to outperform other systems when memory resources become
+&os; tends to outperform other systems when memory resources become
 stressed.</para>

 <para>The free page determination algorithm is built upon a history of the
@@ -403,10 +403,10 @@
 then have to go to disk.</para>
 </sidebar>

-<para>FreeBSD makes use of several page queues to further refine the
+<para>&os; makes use of several page queues to further refine the
 selection of pages to reuse as well as to determine when dirty pages
 must be flushed to their backing store. Since page tables are dynamic
-entities under FreeBSD, it costs virtually nothing to unmap a page from
+entities under &os;, it costs virtually nothing to unmap a page from
 the address space of any processes using it. When a page candidate has
 been chosen based on the page-use counter, this is precisely what is
 done. The system must make a distinction between clean pages which can
@@ -423,7 +423,7 @@
 in an LRU (least-recently used) fashion when the system needs to
 allocate new memory.</para>

-<para>It is important to note that the FreeBSD VM system attempts to
+<para>It is important to note that the &os; VM system attempts to
 separate clean and dirty pages for the express reason of avoiding
 unnecessary flushes of dirty pages (which eats I/O bandwidth), nor does
 it move pages between the various page queues gratuitously when the
@@ -433,8 +433,8 @@
 becomes more stressed, it makes a greater effort to maintain the various
 page queues at the levels determined to be the most effective. An urban
 myth has circulated for years that Linux did a better job avoiding
-swapouts than FreeBSD, but this in fact is not true. What was actually
-occurring was that FreeBSD was proactively paging out unused pages in
+swapouts than &os;, but this in fact is not true. What was actually
+occurring was that &os; was proactively paging out unused pages in
 order to make room for more disk cache while Linux was keeping unused
 pages in core and leaving less memory available for cache and process
 pages. I do not know whether this is still true today.</para>
@@ -451,9 +451,9 @@
 not mapped into the page table, then all the pages that will be accessed
 by the program will have to be faulted in every time the program is run.
 This is unnecessary when the pages in question are already in the VM
-Cache, so FreeBSD will attempt to pre-populate a process's page tables
+Cache, so &os; will attempt to pre-populate a process's page tables
 with those pages that are already in the VM Cache. One thing that
-FreeBSD does not yet do is pre-copy-on-write certain pages on exec. For
+&os; does not yet do is pre-copy-on-write certain pages on exec. For
 example, if you run the &man.ls.1; program while running <command>vmstat
 1</command> you will notice that it always takes a certain number of
 page faults, even when you run it over and over again. These are
@@ -480,7 +480,7 @@
 <title>Page Table Optimizations</title>

 <para>The page table optimizations make up the most contentious part of
-the FreeBSD VM design and they have shown some strain with the advent of
+the &os; VM design and they have shown some strain with the advent of
 serious use of <function>mmap()</function>. I think this is actually a
 feature of most BSDs though I am not sure when it was first introduced.
 There are two major optimizations. The first is that hardware page
@@ -488,23 +488,23 @@
 any time with only a minor amount of management overhead. The second is
 that every active page table entry in the system has a governing
 <literal>pv_entry</literal> structure which is tied into the
-<literal>vm_page</literal> structure. FreeBSD can simply iterate
+<literal>vm_page</literal> structure. &os; can simply iterate
 through those mappings that are known to exist while Linux must check
 all page tables that <emphasis>might</emphasis> contain a specific
 mapping to see if it does, which can achieve O(n^2) overhead in certain
-situations. It is because of this that FreeBSD tends to make better
+situations. It is because of this that &os; tends to make better
 choices on which pages to reuse or swap when memory is stressed, giving
-it better performance under load. However, FreeBSD requires kernel
+it better performance under load. However, &os; requires kernel
 tuning to accommodate large-shared-address-space situations such as
 those that can occur in a news system because it may run out of
 <literal>pv_entry</literal> structures.</para>

-<para>Both Linux and FreeBSD need work in this area. FreeBSD is trying to
+<para>Both Linux and &os; need work in this area. &os; is trying to
 maximize the advantage of a potentially sparse active-mapping model (not
 all processes need to map all pages of a shared library, for example),
-whereas Linux is trying to simplify its algorithms. FreeBSD generally
+whereas Linux is trying to simplify its algorithms. &os; generally
 has the performance advantage here at the cost of wasting a little extra
-memory, but FreeBSD breaks down in the case where a large file is
+memory, but &os; breaks down in the case where a large file is
 massively shared across hundreds of processes. Linux, on the other hand,
 breaks down in the case where many processes are sparsely-mapping the
 same shared library and also runs non-optimally when trying to determine
@@ -530,7 +530,7 @@
 even with multi-way set-associative caches (though the effect is
 mitigated somewhat).</para>

-<para>FreeBSD's memory allocation code implements page coloring
+<para>&os;'s memory allocation code implements page coloring
 optimizations, which means that the memory allocation code will attempt
 to locate free pages that are contiguous from the point of view of the
 cache. For example, if page 16 of physical memory is assigned to page 0
@@ -554,7 +554,7 @@
 modular and algorithmic approach that BSD has historically taken allows
 us to study and understand the current implementation as well as
 relatively cleanly replace large sections of the code. There have been a
-number of improvements to the FreeBSD VM system in the last several
+number of improvements to the &os; VM system in the last several
 years, and work is ongoing.</para>
 </sect1>

@@ -566,23 +566,23 @@
 <qandaentry>
 <question>
 <para>What is <quote>the interleaving algorithm</quote> that you
-refer to in your listing of the ills of the FreeBSD 3.X swap
+refer to in your listing of the ills of the &os; 3.X swap
 arrangements?</para>
 </question>

 <answer>
-<para>FreeBSD uses a fixed swap interleave which defaults to 4. This
-means that FreeBSD reserves space for four swap areas even if you
+<para>&os; uses a fixed swap interleave which defaults to 4. This
+means that &os; reserves space for four swap areas even if you
 only have one, two, or three. Since swap is interleaved the linear
 address space representing the <quote>four swap areas</quote> will be
 fragmented if you do not actually have four swap areas. For
-example, if you have two swap areas A and B FreeBSD's address
+example, if you have two swap areas A and B &os;'s address
 space representation for that swap area will be interleaved in
 blocks of 16 pages:</para>

 <literallayout>A B C D A B C D A B C D A B C D</literallayout>

-<para>FreeBSD 3.X uses a <quote>sequential list of free
+<para>&os; 3.X uses a <quote>sequential list of free
 regions</quote> approach to accounting for the free swap areas.
 The idea is that large blocks of free linear space can be
 represented with a single list node
@@ -626,7 +626,7 @@
 <para>I do not get the following:</para>

 <blockquote>
-<para>It is important to note that the FreeBSD VM system attempts
+<para>It is important to note that the &os; VM system attempts
 to separate clean and dirty pages for the express reason of
 avoiding unnecessary flushes of dirty pages (which eats I/O
 bandwidth), nor does it move pages between the various page
@@ -649,7 +649,7 @@
 separate the pages but the reality is that if we are not in a
 memory crunch, we do not really have to.</para>

-<para>What this means is that FreeBSD will not try very hard to
+<para>What this means is that &os; will not try very hard to
 separate out dirty pages (inactive queue) from clean pages (cache
 queue) when the system is not being stressed, nor will it try to
 deactivate pages (active queue -> inactive queue) when the system
@@ -663,14 +663,14 @@
 would not some of the page faults be data page faults (COW from
 executable file to private page)? I.e., I would expect the page
 faults to be some zero-fill and some program data. Or are you
-implying that FreeBSD does do pre-COW for the program data?</para>
+implying that &os; does do pre-COW for the program data?</para>
 </question>

 <answer>
 <para>A COW fault can be either zero-fill or program-data. The
 mechanism is the same either way because the backing program-data
 is almost certainly already in the cache. I am indeed lumping the
-two together. FreeBSD does not pre-COW program data or zero-fill,
+two together. &os; does not pre-COW program data or zero-fill,
 but it <emphasis>does</emphasis> pre-map pages that exist in its
 cache.</para>
 </answer>
@@ -685,7 +685,7 @@
 McKusick, Bostic, Karel, Quarterman)? Specifically, what kind of
 operation/reaction would require scanning the mappings?</para>

-<para>How does Linux do in the case where FreeBSD breaks down
+<para>How does Linux do in the case where &os; breaks down
 (sharing a large file mapping over many processes)?</para>
 </question>

@@ -717,7 +717,7 @@
 index into the page table for each of those 50 processes even if
 only 10 of them have actually mapped the page. So Linux is
 trading off the simplicity of its design against performance.
-Many VM algorithms which are O(1) or (small N) under FreeBSD wind
+Many VM algorithms which are O(1) or (small N) under &os; wind
 up being O(N), O(N^2), or worse under Linux. Since the pte's
 representing a particular page in an object tend to be at the same
 offset in all the page tables they are mapped in, reducing the
@@ -725,12 +725,12 @@
 will often avoid blowing away the L1 cache line for that offset,
 which can lead to better performance.</para>

-<para>FreeBSD has added complexity (the <literal>pv_entry</literal>
+<para>&os; has added complexity (the <literal>pv_entry</literal>
 scheme) in order to increase performance (to limit page table
 accesses to <emphasis>only</emphasis> those pte's that need to be
 modified).</para>

-<para>But FreeBSD has a scaling problem that Linux does not in that
+<para>But &os; has a scaling problem that Linux does not in that
 there are a limited number of <literal>pv_entry</literal>
 structures and this causes problems when you have massive sharing
 of data. In this case you may run out of
@@ -744,10 +744,10 @@
 <literal>pv_entry</literal> scheme: Linux uses
 <quote>permanent</quote> page tables that are not throw away, but
 does not need a <literal>pv_entry</literal> for each potentially
-mapped pte. FreeBSD uses <quote>throw away</quote> page tables but
+mapped pte. &os; uses <quote>throw away</quote> page tables but
 adds in a <literal>pv_entry</literal> structure for each
 actually-mapped pte. I think memory utilization winds up being
-about the same, giving FreeBSD an algorithmic advantage with its
+about the same, giving &os; an algorithmic advantage with its
 ability to throw away page tables at will with very low
 overhead.</para>
 </answer>