s/FreeBSD/&os;/

Suggested by:   Benjamin Lukas (qavvap att googlemail dott com)
Johann Kois 2010-06-17 09:19:58 +00:00
parent 3a6c136eee
commit 77f5bcecff
Notes: svn2git 2020-12-08 03:00:23 +00:00
svn path=/head/; revision=35896

@@ -8,7 +8,7 @@
<article> <article>
<articleinfo> <articleinfo>
<title>Design elements of the FreeBSD VM system</title> <title>Design elements of the &os; VM system</title>
<authorgroup> <authorgroup>
<author> <author>
@@ -36,7 +36,7 @@
<para>The title is really just a fancy way of saying that I am going to <para>The title is really just a fancy way of saying that I am going to
attempt to describe the whole VM enchilada, hopefully in a way that attempt to describe the whole VM enchilada, hopefully in a way that
everyone can follow. For the last year I have concentrated on a number everyone can follow. For the last year I have concentrated on a number
of major kernel subsystems within FreeBSD, with the VM and Swap of major kernel subsystems within &os;, with the VM and Swap
subsystems being the most interesting and NFS being <quote>a necessary subsystems being the most interesting and NFS being <quote>a necessary
chore</quote>. I rewrote only small portions of the code. In the VM chore</quote>. I rewrote only small portions of the code. In the VM
arena the only major rewrite I have done is to the swap subsystem. arena the only major rewrite I have done is to the swap subsystem.
@@ -53,7 +53,7 @@
<para>This article was originally published in the January 2000 issue of <para>This article was originally published in the January 2000 issue of
<ulink url="http://www.daemonnews.org/">DaemonNews</ulink>. This <ulink url="http://www.daemonnews.org/">DaemonNews</ulink>. This
version of the article may include updates from Matt and other authors version of the article may include updates from Matt and other authors
to reflect changes in FreeBSD's VM implementation.</para> to reflect changes in &os;'s VM implementation.</para>
</legalnotice> </legalnotice>
</articleinfo> </articleinfo>
@@ -71,7 +71,7 @@
operating system by some people, those of us who work on it tend to view operating system by some people, those of us who work on it tend to view
it more as a <quote>mature</quote> codebase which has various components it more as a <quote>mature</quote> codebase which has various components
modified, extended, or replaced with modern code. It has evolved, and modified, extended, or replaced with modern code. It has evolved, and
FreeBSD is at the bleeding edge no matter how old some of the code might &os; is at the bleeding edge no matter how old some of the code might
be. This is an important distinction to make and one that is be. This is an important distinction to make and one that is
unfortunately lost to many people. The biggest error a programmer can unfortunately lost to many people. The biggest error a programmer can
make is to not learn from history, and this is precisely the error that make is to not learn from history, and this is precisely the error that
@@ -89,13 +89,13 @@
right because our marketing department says so</quote>. I have little right because our marketing department says so</quote>. I have little
tolerance for anyone who cannot learn from history.</para> tolerance for anyone who cannot learn from history.</para>
<para>Much of the apparent complexity of the FreeBSD design, especially in <para>Much of the apparent complexity of the &os; design, especially in
the VM/Swap subsystem, is a direct result of having to solve serious the VM/Swap subsystem, is a direct result of having to solve serious
performance issues that occur under various conditions. These issues performance issues that occur under various conditions. These issues
are not due to bad algorithmic design but instead rise from are not due to bad algorithmic design but instead rise from
environmental factors. In any direct comparison between platforms, environmental factors. In any direct comparison between platforms,
these issues become most apparent when system resources begin to get these issues become most apparent when system resources begin to get
stressed. As I describe FreeBSD's VM/Swap subsystem the reader should stressed. As I describe &os;'s VM/Swap subsystem the reader should
always keep two points in mind. First, the most important aspect of always keep two points in mind. First, the most important aspect of
performance design is what is known as <quote>Optimizing the Critical performance design is what is known as <quote>Optimizing the Critical
Path</quote>. It is often the case that performance optimizations add a Path</quote>. It is often the case that performance optimizations add a
@@ -117,7 +117,7 @@
<sect1 id="vm-objects"> <sect1 id="vm-objects">
<title>VM Objects</title> <title>VM Objects</title>
<para>The best way to begin describing the FreeBSD VM system is to look at <para>The best way to begin describing the &os; VM system is to look at
it from the perspective of a user-level process. Each user process sees it from the perspective of a user-level process. Each user process sees
a single, private, contiguous VM address space containing several types a single, private, contiguous VM address space containing several types
of memory objects. These objects have various characteristics. Program of memory objects. These objects have various characteristics. Program
@@ -157,7 +157,7 @@
(parent and child) expects their own personal post-fork modifications to (parent and child) expects their own personal post-fork modifications to
remain private to themselves and not effect the other.</para> remain private to themselves and not effect the other.</para>
<para>FreeBSD manages all of this with a layered VM Object model. The <para>&os; manages all of this with a layered VM Object model. The
original binary program file winds up being the lowest VM Object layer. original binary program file winds up being the lowest VM Object layer.
A copy-on-write layer is pushed on top of that to hold those pages which A copy-on-write layer is pushed on top of that to hold those pages which
had to be copied from the original file. If the program modifies a data had to be copied from the original file. If the program modifies a data
@@ -235,7 +235,7 @@
The original page in B is now completely hidden since both C1 and C2 The original page in B is now completely hidden since both C1 and C2
have a copy and B could theoretically be destroyed if it does not have a copy and B could theoretically be destroyed if it does not
represent a <quote>real</quote> file; however, this sort of optimization is not represent a <quote>real</quote> file; however, this sort of optimization is not
trivial to make because it is so fine-grained. FreeBSD does not make trivial to make because it is so fine-grained. &os; does not make
this optimization. Now, suppose (as is often the case) that the child this optimization. Now, suppose (as is often the case) that the child
process does an <function>exec()</function>. Its current address space process does an <function>exec()</function>. Its current address space
is usually replaced by a new address space representing a new file. In is usually replaced by a new address space representing a new file. In
@@ -274,7 +274,7 @@
get their own private copies of the page and the original page in B is get their own private copies of the page and the original page in B is
no longer accessible by anyone. That page in B can be freed.</para> no longer accessible by anyone. That page in B can be freed.</para>
<para>FreeBSD solves the deep layering problem with a special optimization <para>&os; solves the deep layering problem with a special optimization
called the <quote>All Shadowed Case</quote>. This case occurs if either called the <quote>All Shadowed Case</quote>. This case occurs if either
C1 or C2 take sufficient COW faults to completely shadow all pages in B. C1 or C2 take sufficient COW faults to completely shadow all pages in B.
Lets say that C1 achieves this. C1 can now bypass B entirely, so rather Lets say that C1 achieves this. C1 can now bypass B entirely, so rather
@@ -303,7 +303,7 @@
copying need take place. The disadvantage is that you can build a copying need take place. The disadvantage is that you can build a
relatively complex VM Object layering that slows page fault handling relatively complex VM Object layering that slows page fault handling
down a little, and you spend memory managing the VM Object structures. down a little, and you spend memory managing the VM Object structures.
The optimizations FreeBSD makes proves to reduce the problems enough The optimizations &os; makes proves to reduce the problems enough
that they can be ignored, leaving no real disadvantage.</para> that they can be ignored, leaving no real disadvantage.</para>
</sect1> </sect1>
@@ -315,12 +315,12 @@
backing object (usually a file) can no longer be used to save a copy of backing object (usually a file) can no longer be used to save a copy of
the page when the VM system needs to reuse it for other purposes. This the page when the VM system needs to reuse it for other purposes. This
is where SWAP comes in. SWAP is allocated to create backing store for is where SWAP comes in. SWAP is allocated to create backing store for
memory that does not otherwise have it. FreeBSD allocates the swap memory that does not otherwise have it. &os; allocates the swap
management structure for a VM Object only when it is actually needed. management structure for a VM Object only when it is actually needed.
However, the swap management structure has had problems However, the swap management structure has had problems
historically.</para> historically.</para>
<para>Under FreeBSD 3.X the swap management structure preallocates an <para>Under &os; 3.X the swap management structure preallocates an
array that encompasses the entire object requiring swap backing array that encompasses the entire object requiring swap backing
store&mdash;even if only a few pages of that object are swap-backed. store&mdash;even if only a few pages of that object are swap-backed.
This creates a kernel memory fragmentation problem when large objects This creates a kernel memory fragmentation problem when large objects
@@ -337,7 +337,7 @@
fly for additional swap management structures when a swapout occurs. It fly for additional swap management structures when a swapout occurs. It
is evident that there was plenty of room for improvement.</para> is evident that there was plenty of room for improvement.</para>
<para>For FreeBSD 4.X, I completely rewrote the swap subsystem. With this <para>For &os; 4.X, I completely rewrote the swap subsystem. With this
rewrite, swap management structures are allocated through a hash table rewrite, swap management structures are allocated through a hash table
rather than a linear array giving them a fixed allocation size and much rather than a linear array giving them a fixed allocation size and much
finer granularity. Rather then using a linearly linked list to keep finer granularity. Rather then using a linearly linked list to keep
@@ -373,7 +373,7 @@
hundreds of thousands of CPU cycles and a noticeable stall of the hundreds of thousands of CPU cycles and a noticeable stall of the
affected processes, so we are willing to endure a significant amount of affected processes, so we are willing to endure a significant amount of
overhead in order to be sure that the right page is chosen. This is why overhead in order to be sure that the right page is chosen. This is why
FreeBSD tends to outperform other systems when memory resources become &os; tends to outperform other systems when memory resources become
stressed.</para> stressed.</para>
<para>The free page determination algorithm is built upon a history of the <para>The free page determination algorithm is built upon a history of the
@@ -403,10 +403,10 @@
then have to go to disk.</para> then have to go to disk.</para>
</sidebar> </sidebar>
<para>FreeBSD makes use of several page queues to further refine the <para>&os; makes use of several page queues to further refine the
selection of pages to reuse as well as to determine when dirty pages selection of pages to reuse as well as to determine when dirty pages
must be flushed to their backing store. Since page tables are dynamic must be flushed to their backing store. Since page tables are dynamic
entities under FreeBSD, it costs virtually nothing to unmap a page from entities under &os;, it costs virtually nothing to unmap a page from
the address space of any processes using it. When a page candidate has the address space of any processes using it. When a page candidate has
been chosen based on the page-use counter, this is precisely what is been chosen based on the page-use counter, this is precisely what is
done. The system must make a distinction between clean pages which can done. The system must make a distinction between clean pages which can
@@ -423,7 +423,7 @@
in an LRU (least-recently used) fashion when the system needs to in an LRU (least-recently used) fashion when the system needs to
allocate new memory.</para> allocate new memory.</para>
<para>It is important to note that the FreeBSD VM system attempts to <para>It is important to note that the &os; VM system attempts to
separate clean and dirty pages for the express reason of avoiding separate clean and dirty pages for the express reason of avoiding
unnecessary flushes of dirty pages (which eats I/O bandwidth), nor does unnecessary flushes of dirty pages (which eats I/O bandwidth), nor does
it move pages between the various page queues gratuitously when the it move pages between the various page queues gratuitously when the
@@ -433,8 +433,8 @@
becomes more stressed, it makes a greater effort to maintain the various becomes more stressed, it makes a greater effort to maintain the various
page queues at the levels determined to be the most effective. An urban page queues at the levels determined to be the most effective. An urban
myth has circulated for years that Linux did a better job avoiding myth has circulated for years that Linux did a better job avoiding
swapouts than FreeBSD, but this in fact is not true. What was actually swapouts than &os;, but this in fact is not true. What was actually
occurring was that FreeBSD was proactively paging out unused pages in occurring was that &os; was proactively paging out unused pages in
order to make room for more disk cache while Linux was keeping unused order to make room for more disk cache while Linux was keeping unused
pages in core and leaving less memory available for cache and process pages in core and leaving less memory available for cache and process
pages. I do not know whether this is still true today.</para> pages. I do not know whether this is still true today.</para>
@@ -451,9 +451,9 @@
not mapped into the page table, then all the pages that will be accessed not mapped into the page table, then all the pages that will be accessed
by the program will have to be faulted in every time the program is run. by the program will have to be faulted in every time the program is run.
This is unnecessary when the pages in question are already in the VM This is unnecessary when the pages in question are already in the VM
Cache, so FreeBSD will attempt to pre-populate a process's page tables Cache, so &os; will attempt to pre-populate a process's page tables
with those pages that are already in the VM Cache. One thing that with those pages that are already in the VM Cache. One thing that
FreeBSD does not yet do is pre-copy-on-write certain pages on exec. For &os; does not yet do is pre-copy-on-write certain pages on exec. For
example, if you run the &man.ls.1; program while running <command>vmstat example, if you run the &man.ls.1; program while running <command>vmstat
1</command> you will notice that it always takes a certain number of 1</command> you will notice that it always takes a certain number of
page faults, even when you run it over and over again. These are page faults, even when you run it over and over again. These are
@@ -480,7 +480,7 @@
<title>Page Table Optimizations</title> <title>Page Table Optimizations</title>
<para>The page table optimizations make up the most contentious part of <para>The page table optimizations make up the most contentious part of
the FreeBSD VM design and they have shown some strain with the advent of the &os; VM design and they have shown some strain with the advent of
serious use of <function>mmap()</function>. I think this is actually a serious use of <function>mmap()</function>. I think this is actually a
feature of most BSDs though I am not sure when it was first introduced. feature of most BSDs though I am not sure when it was first introduced.
There are two major optimizations. The first is that hardware page There are two major optimizations. The first is that hardware page
@@ -488,23 +488,23 @@
any time with only a minor amount of management overhead. The second is any time with only a minor amount of management overhead. The second is
that every active page table entry in the system has a governing that every active page table entry in the system has a governing
<literal>pv_entry</literal> structure which is tied into the <literal>pv_entry</literal> structure which is tied into the
<literal>vm_page</literal> structure. FreeBSD can simply iterate <literal>vm_page</literal> structure. &os; can simply iterate
through those mappings that are known to exist while Linux must check through those mappings that are known to exist while Linux must check
all page tables that <emphasis>might</emphasis> contain a specific all page tables that <emphasis>might</emphasis> contain a specific
mapping to see if it does, which can achieve O(n^2) overhead in certain mapping to see if it does, which can achieve O(n^2) overhead in certain
situations. It is because of this that FreeBSD tends to make better situations. It is because of this that &os; tends to make better
choices on which pages to reuse or swap when memory is stressed, giving choices on which pages to reuse or swap when memory is stressed, giving
it better performance under load. However, FreeBSD requires kernel it better performance under load. However, &os; requires kernel
tuning to accommodate large-shared-address-space situations such as tuning to accommodate large-shared-address-space situations such as
those that can occur in a news system because it may run out of those that can occur in a news system because it may run out of
<literal>pv_entry</literal> structures.</para> <literal>pv_entry</literal> structures.</para>
<para>Both Linux and FreeBSD need work in this area. FreeBSD is trying to <para>Both Linux and &os; need work in this area. &os; is trying to
maximize the advantage of a potentially sparse active-mapping model (not maximize the advantage of a potentially sparse active-mapping model (not
all processes need to map all pages of a shared library, for example), all processes need to map all pages of a shared library, for example),
whereas Linux is trying to simplify its algorithms. FreeBSD generally whereas Linux is trying to simplify its algorithms. &os; generally
has the performance advantage here at the cost of wasting a little extra has the performance advantage here at the cost of wasting a little extra
memory, but FreeBSD breaks down in the case where a large file is memory, but &os; breaks down in the case where a large file is
massively shared across hundreds of processes. Linux, on the other hand, massively shared across hundreds of processes. Linux, on the other hand,
breaks down in the case where many processes are sparsely-mapping the breaks down in the case where many processes are sparsely-mapping the
same shared library and also runs non-optimally when trying to determine same shared library and also runs non-optimally when trying to determine
@@ -530,7 +530,7 @@
even with multi-way set-associative caches (though the effect is even with multi-way set-associative caches (though the effect is
mitigated somewhat).</para> mitigated somewhat).</para>
<para>FreeBSD's memory allocation code implements page coloring <para>&os;'s memory allocation code implements page coloring
optimizations, which means that the memory allocation code will attempt optimizations, which means that the memory allocation code will attempt
to locate free pages that are contiguous from the point of view of the to locate free pages that are contiguous from the point of view of the
cache. For example, if page 16 of physical memory is assigned to page 0 cache. For example, if page 16 of physical memory is assigned to page 0
@@ -554,7 +554,7 @@
modular and algorithmic approach that BSD has historically taken allows modular and algorithmic approach that BSD has historically taken allows
us to study and understand the current implementation as well as us to study and understand the current implementation as well as
relatively cleanly replace large sections of the code. There have been a relatively cleanly replace large sections of the code. There have been a
number of improvements to the FreeBSD VM system in the last several number of improvements to the &os; VM system in the last several
years, and work is ongoing.</para> years, and work is ongoing.</para>
</sect1> </sect1>
@@ -566,23 +566,23 @@
<qandaentry> <qandaentry>
<question> <question>
<para>What is <quote>the interleaving algorithm</quote> that you <para>What is <quote>the interleaving algorithm</quote> that you
refer to in your listing of the ills of the FreeBSD 3.X swap refer to in your listing of the ills of the &os; 3.X swap
arrangements?</para> arrangements?</para>
</question> </question>
<answer> <answer>
<para>FreeBSD uses a fixed swap interleave which defaults to 4. This <para>&os; uses a fixed swap interleave which defaults to 4. This
means that FreeBSD reserves space for four swap areas even if you means that &os; reserves space for four swap areas even if you
only have one, two, or three. Since swap is interleaved the linear only have one, two, or three. Since swap is interleaved the linear
address space representing the <quote>four swap areas</quote> will be address space representing the <quote>four swap areas</quote> will be
fragmented if you do not actually have four swap areas. For fragmented if you do not actually have four swap areas. For
example, if you have two swap areas A and B FreeBSD's address example, if you have two swap areas A and B &os;'s address
space representation for that swap area will be interleaved in space representation for that swap area will be interleaved in
blocks of 16 pages:</para> blocks of 16 pages:</para>
<literallayout>A B C D A B C D A B C D A B C D</literallayout> <literallayout>A B C D A B C D A B C D A B C D</literallayout>
<para>FreeBSD 3.X uses a <quote>sequential list of free <para>&os; 3.X uses a <quote>sequential list of free
regions</quote> approach to accounting for the free swap areas. regions</quote> approach to accounting for the free swap areas.
The idea is that large blocks of free linear space can be The idea is that large blocks of free linear space can be
represented with a single list node represented with a single list node
@@ -626,7 +626,7 @@
<para>I do not get the following:</para> <para>I do not get the following:</para>
<blockquote> <blockquote>
<para>It is important to note that the FreeBSD VM system attempts <para>It is important to note that the &os; VM system attempts
to separate clean and dirty pages for the express reason of to separate clean and dirty pages for the express reason of
avoiding unnecessary flushes of dirty pages (which eats I/O avoiding unnecessary flushes of dirty pages (which eats I/O
bandwidth), nor does it move pages between the various page bandwidth), nor does it move pages between the various page
@@ -649,7 +649,7 @@
separate the pages but the reality is that if we are not in a separate the pages but the reality is that if we are not in a
memory crunch, we do not really have to.</para> memory crunch, we do not really have to.</para>
<para>What this means is that FreeBSD will not try very hard to <para>What this means is that &os; will not try very hard to
separate out dirty pages (inactive queue) from clean pages (cache separate out dirty pages (inactive queue) from clean pages (cache
queue) when the system is not being stressed, nor will it try to queue) when the system is not being stressed, nor will it try to
deactivate pages (active queue -> inactive queue) when the system deactivate pages (active queue -> inactive queue) when the system
@@ -663,14 +663,14 @@
would not some of the page faults be data page faults (COW from would not some of the page faults be data page faults (COW from
executable file to private page)? I.e., I would expect the page executable file to private page)? I.e., I would expect the page
faults to be some zero-fill and some program data. Or are you faults to be some zero-fill and some program data. Or are you
implying that FreeBSD does do pre-COW for the program data?</para> implying that &os; does do pre-COW for the program data?</para>
</question> </question>
<answer> <answer>
<para>A COW fault can be either zero-fill or program-data. The <para>A COW fault can be either zero-fill or program-data. The
mechanism is the same either way because the backing program-data mechanism is the same either way because the backing program-data
is almost certainly already in the cache. I am indeed lumping the is almost certainly already in the cache. I am indeed lumping the
two together. FreeBSD does not pre-COW program data or zero-fill, two together. &os; does not pre-COW program data or zero-fill,
but it <emphasis>does</emphasis> pre-map pages that exist in its but it <emphasis>does</emphasis> pre-map pages that exist in its
cache.</para> cache.</para>
</answer> </answer>
@@ -685,7 +685,7 @@
McKusick, Bostic, Karel, Quarterman)? Specifically, what kind of McKusick, Bostic, Karel, Quarterman)? Specifically, what kind of
operation/reaction would require scanning the mappings?</para> operation/reaction would require scanning the mappings?</para>
<para>How does Linux do in the case where FreeBSD breaks down <para>How does Linux do in the case where &os; breaks down
(sharing a large file mapping over many processes)?</para> (sharing a large file mapping over many processes)?</para>
</question> </question>
@@ -717,7 +717,7 @@
index into the page table for each of those 50 processes even if index into the page table for each of those 50 processes even if
only 10 of them have actually mapped the page. So Linux is only 10 of them have actually mapped the page. So Linux is
trading off the simplicity of its design against performance. trading off the simplicity of its design against performance.
Many VM algorithms which are O(1) or (small N) under FreeBSD wind Many VM algorithms which are O(1) or (small N) under &os; wind
up being O(N), O(N^2), or worse under Linux. Since the pte's up being O(N), O(N^2), or worse under Linux. Since the pte's
representing a particular page in an object tend to be at the same representing a particular page in an object tend to be at the same
offset in all the page tables they are mapped in, reducing the offset in all the page tables they are mapped in, reducing the
@@ -725,12 +725,12 @@
will often avoid blowing away the L1 cache line for that offset, will often avoid blowing away the L1 cache line for that offset,
which can lead to better performance.</para> which can lead to better performance.</para>
<para>FreeBSD has added complexity (the <literal>pv_entry</literal> <para>&os; has added complexity (the <literal>pv_entry</literal>
scheme) in order to increase performance (to limit page table scheme) in order to increase performance (to limit page table
accesses to <emphasis>only</emphasis> those pte's that need to be accesses to <emphasis>only</emphasis> those pte's that need to be
modified).</para> modified).</para>
<para>But FreeBSD has a scaling problem that Linux does not in that <para>But &os; has a scaling problem that Linux does not in that
there are a limited number of <literal>pv_entry</literal> there are a limited number of <literal>pv_entry</literal>
structures and this causes problems when you have massive sharing structures and this causes problems when you have massive sharing
of data. In this case you may run out of of data. In this case you may run out of
@@ -744,10 +744,10 @@
<literal>pv_entry</literal> scheme: Linux uses <literal>pv_entry</literal> scheme: Linux uses
<quote>permanent</quote> page tables that are not throw away, but <quote>permanent</quote> page tables that are not throw away, but
does not need a <literal>pv_entry</literal> for each potentially does not need a <literal>pv_entry</literal> for each potentially
mapped pte. FreeBSD uses <quote>throw away</quote> page tables but mapped pte. &os; uses <quote>throw away</quote> page tables but
adds in a <literal>pv_entry</literal> structure for each adds in a <literal>pv_entry</literal> structure for each
actually-mapped pte. I think memory utilization winds up being actually-mapped pte. I think memory utilization winds up being
about the same, giving FreeBSD an algorithmic advantage with its about the same, giving &os; an algorithmic advantage with its
ability to throw away page tables at will with very low ability to throw away page tables at will with very low
overhead.</para> overhead.</para>
</answer> </answer>