<!-- $Id: vm.sgml,v 1.3 1999-02-24 22:51:42 dillon Exp $ -->
<!-- The FreeBSD Documentation Project -->

<sect><heading>The FreeBSD VM System<label id="vm"></heading>

<p><em>Contributed by &a.dillon;.<newline>
6 Feb 1999.</em>

<em>An involved description of FreeBSD's VM internals</em>

<sect1><heading>Management of physical memory - vm_page_t</heading>

<p>
Physical memory is managed on a page-by-page basis through the
<em>vm_page_t</em> structure. Pages of physical memory are
categorized through the placement of their respective vm_page_t
structures on one of several paging queues.

<p>
A page can be in a wired, active, inactive, cache, or free state.
Except for the wired state, a page is typically placed on a
doubly-linked list queue representing the state it is in. Wired
pages are not placed on any queue.
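
<p>
The arrangement can be sketched as follows. This is a minimal,
hypothetical illustration built on the sys/queue.h list macros; the
names and layout are not the kernel's actual vm_page definition,
which lives in /usr/src/sys/vm/vm_page.h.
<tscreen><verb>
/*
 * Minimal sketch of per-page state tracking.  Names are illustrative
 * only; the real vm_page structure contains many more fields.
 */
#include <sys/queue.h>

enum page_state { PS_WIRED, PS_ACTIVE, PS_INACTIVE, PS_CACHE, PS_FREE };

struct page {
    TAILQ_ENTRY(page) pageq;      /* linkage in its current queue */
    enum page_state   state;
    int               wire_count;
};

TAILQ_HEAD(pagequeue, page);

/* Move a page between state queues; wired pages stay off all queues. */
static void
page_requeue(struct page *pg, struct pagequeue *from,
             struct pagequeue *to, enum page_state newstate)
{
    if (pg->state != PS_WIRED)
        TAILQ_REMOVE(from, pg, pageq);
    pg->state = newstate;
    if (newstate != PS_WIRED)
        TAILQ_INSERT_TAIL(to, pg, pageq);
}
</verb></tscreen>
The doubly-linked queue makes both insertion at the LRU tail and
removal from the middle O(1), which matters because pages change
state constantly.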

<p>
FreeBSD implements a more involved paging queue scheme for cached and
free pages in order to implement page coloring. Each of these states
involves multiple queues arranged according to the size of the
processor's L1 and L2 caches. When a new page needs to be allocated,
FreeBSD attempts to obtain one that is reasonably well aligned, from
the point of view of the L1 and L2 caches, relative to the VM object
the page is being allocated for.
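
<p>
The selection can be pictured with a simple sketch; the number of
colors here is an assumption, whereas the kernel derives the real
value from the detected cache geometry.
<tscreen><verb>
/*
 * Hypothetical page-color selection.  NCOLORS would be roughly
 * cache size / page size / associativity; 64 is just an assumption.
 * The allocator keeps one free/cache queue per color.
 */
#define NCOLORS 64

static unsigned
page_color(unsigned obj_color_base, unsigned pindex)
{
    /*
     * Consecutive page indices within an object map to consecutive
     * cache colors, so sequentially accessed data does not
     * repeatedly collide in the L1/L2 caches.
     */
    return ((obj_color_base + pindex) % NCOLORS);
}
</verb></tscreen>
A real allocator prefers the free queue matching the computed color
and falls back to a nearby color when that queue is empty.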

<p>
Additionally, a page may be held with a reference count or locked
with a busy count. The VM system also implements an 'ultimate locked'
state for a page using the PG_BUSY bit in the page's flags.
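
<p>
The three levels of protection could be pictured as follows; the
field names are illustrative only.
<tscreen><verb>
/* Illustrative only - the real flags live in the vm_page structure. */
#define PG_BUSY 0x0001          /* the 'ultimate locked' state */

struct page_refs {
    int hold_count;             /* held: page must not be freed      */
    int busy_count;             /* busied: page contents are in flux */
    int flags;                  /* PG_BUSY bit for exclusive access  */
};
</verb></tscreen>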

<p>
In general terms, each of the paging queues operates in an LRU
fashion. A page is typically placed in a wired or active state
initially. When wired, the page is usually associated with a page
table somewhere. The VM system ages the page by scanning pages in a
more active paging queue (LRU) in order to move them to a less-active
paging queue. Pages that get moved into the cache are still
associated with a VM object but are candidates for immediate reuse.
Pages in the free queue are truly free. FreeBSD attempts to minimize
the number of pages in the free queue, but a certain minimum number
of truly free pages must be maintained in order to accommodate page
allocation at interrupt time.
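
<p>
Reusing the hypothetical queue sketch from above, one aging pass over
the active queue might look like the following; page_was_referenced()
stands in for the pmap query a real kernel would make.
<tscreen><verb>
/* Assumed helper: did the MMU see a reference since the last scan? */
extern int page_was_referenced(struct page *pg);

/*
 * One aging pass: pages not referenced since the last scan move one
 * step down the LRU, from the active queue toward the inactive queue.
 */
static void
age_active_pages(struct pagequeue *active, struct pagequeue *inactive)
{
    struct page *pg, *next;

    TAILQ_FOREACH_SAFE(pg, active, pageq, next) {
        if (!page_was_referenced(pg))
            page_requeue(pg, active, inactive, PS_INACTIVE);
    }
}
</verb></tscreen>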

<p>
If a process attempts to access a page that does not exist in its
page table but does exist in one of the paging queues (such as the
inactive or cache queues), a relatively inexpensive page reactivation
fault occurs, causing the page to be reactivated. If the page does
not exist in system memory at all, the process must block while the
page is brought in from disk.
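
<p>
In outline, and again with hypothetical helper names, the fault
handler's decision looks like this:
<tscreen><verb>
struct vmobj;                   /* VM objects are described below */

/* Assumed helpers standing in for the object and pager interfaces. */
extern struct page *object_find_page(struct vmobj *obj, unsigned pindex);
extern struct page *page_alloc(struct vmobj *obj, unsigned pindex);
extern void pager_get_page(struct vmobj *obj, struct page *pg); /* blocks */
extern void pmap_enter_page(struct page *pg);

static void
handle_fault(struct vmobj *obj, unsigned pindex)
{
    struct page *pg = object_find_page(obj, pindex);

    if (pg != NULL) {
        /* Reactivation fault: the page is resident on the inactive
         * or cache queue - just move it back to the active queue. */
        pg->state = PS_ACTIVE;  /* requeue omitted for brevity */
    } else {
        /* Hard fault: allocate a fresh page and sleep on disk I/O. */
        pg = page_alloc(obj, pindex);
        pager_get_page(obj, pg);
    }
    pmap_enter_page(pg);        /* install the page table entry */
}
</verb></tscreen>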

<p>
FreeBSD dynamically tunes its paging queues and attempts to maintain
reasonable ratios of pages in the various queues as well as a
reasonable breakdown of clean versus dirty pages. The amount of
rebalancing that occurs depends on the system's memory load. This
rebalancing is implemented by the pageout daemon and involves
laundering dirty pages (syncing them with their backing store),
noticing when pages are actively referenced (resetting their position
in the LRU queues or moving them between queues), migrating pages
between queues when the queues are out of balance, and so forth.
FreeBSD's VM system is willing to take a reasonable number of
reactivation page faults to determine how active or how idle a page
actually is. This leads to better decisions being made as to when to
launder or swap out a page.

<sect1><heading>The unified buffer cache - vm_object_t</heading>

<p>
FreeBSD implements the idea of a generic 'VM object'. VM objects can
be associated with backing store of various types: unbacked,
swap-backed, physical device-backed, or file-backed storage. Since
the filesystem uses the same VM objects to manage in-core data
relating to files, the result is a unified buffer cache.

<p>
VM objects can be <em>shadowed</em>. That is, they can be stacked on
top of each other. For example, you might have a swap-backed VM
object stacked on top of a file-backed VM object in order to
implement a MAP_PRIVATE mmap()ing. This stacking is also used to
implement various sharing properties, including copy-on-write, for
forked address spaces.

<p>
It should be noted that a vm_page_t can only be associated with one
VM object at a time. VM object shadowing implements the perceived
sharing of the same page across multiple instances.
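
<p>
A page lookup therefore searches the shadow chain from the top down.
A sketch, with object_find_page() again an assumed helper:
<tscreen><verb>
struct vmobj {
    struct vmobj *backing;      /* object this one shadows, or NULL */
    unsigned      backing_off;  /* page offset into backing object  */
};

extern struct page *object_find_page(struct vmobj *obj, unsigned pindex);

/*
 * Search the topmost object first; on a miss, translate the page
 * index and fall through to the object it shadows.  A MAP_PRIVATE
 * write would instead copy the page up into the topmost object.
 */
static struct page *
shadow_lookup(struct vmobj *obj, unsigned pindex)
{
    while (obj != NULL) {
        struct page *pg = object_find_page(obj, pindex);
        if (pg != NULL)
            return (pg);
        pindex += obj->backing_off;
        obj = obj->backing;
    }
    return (NULL);
}
</verb></tscreen>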

<sect1><heading>Filesystem I/O - struct buf</heading>

<p>
vnode-backed VM objects, such as file-backed objects, generally need
to maintain their own clean/dirty information independent of the VM
system's idea of clean/dirty. For example, when the VM system decides
to synchronize a physical page to its backing store, the VM system
needs to mark the page clean before the page is actually written to
its backing store. Additionally, filesystems need to be able to map
portions of a file or file metadata into KVM in order to operate on
them.

<p>
The entities used to manage this are known as filesystem buffers,
<em>struct buf</em>'s, or <em>bp</em>'s. When a filesystem needs to
operate on a portion of a VM object, it typically maps part of the
object into a struct buf and then maps the pages in the struct buf
into KVM. In the same manner, disk I/O is typically issued by
mapping portions of objects into buffer structures and then issuing
the I/O on the buffer structures. The underlying vm_page_t's are
typically busied for the duration of the I/O. Filesystem buffers
also have their own notion of being busy, which is useful to
filesystem driver code that would rather operate on filesystem
buffers instead of hard VM pages.
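
<p>
A filesystem buffer can be pictured as a small window of an object
mapped contiguously into KVM; the fields below are illustrative, not
the real struct buf.
<tscreen><verb>
#define BUF_MAXPAGES 16         /* assumed per-buffer mapping limit */

struct buf_sketch {
    void        *b_kva;                 /* KVM address of the mapping */
    unsigned     b_npages;              /* pages currently mapped     */
    struct page *b_pages[BUF_MAXPAGES]; /* underlying vm_page_t's,
                                           busied during the I/O      */
    int          b_flags;               /* buffer-level busy/dirty    */
};
</verb></tscreen>
Because the pages are mapped contiguously at b_kva, the filesystem
can treat the window as ordinary kernel memory while the underlying
vm_page_t's remain busied.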

<p>
FreeBSD reserves a limited amount of KVM to hold mappings from struct
bufs, but it should be made clear that this KVM is used solely to
hold mappings and does not limit the ability to cache data. Physical
data caching is strictly a function of vm_page_t's, not filesystem
buffers. However, since filesystem buffers are used to placehold
I/O, they do inherently limit the amount of concurrent I/O possible.
As there are usually a few thousand filesystem buffers available,
this is not usually a problem.

<sect1><heading>Mapping Page Tables - vm_map_t, vm_entry_t</heading>

<p>
FreeBSD separates the physical page table topology from the VM
system. All hard per-process page tables can be reconstructed on the
fly and are usually considered throwaway. Special page tables, such
as those managing KVM, are typically permanently preallocated. These
page tables are not throwaway.

<p>
FreeBSD associates portions of vm_objects with address ranges in
virtual memory through vm_map_t and vm_entry_t structures. Page
tables are directly synthesized from the
vm_map_t/vm_entry_t/vm_object_t hierarchy. Remember when I mentioned
that physical pages are only directly associated with a vm_object?
Well, that isn't quite true. vm_page_t's are also linked into the
page tables that they are actively associated with. One vm_page_t
can be linked into several <em>pmaps</em>, as page tables are called.
However, the hierarchical association holds, so all references to the
same page in the same object reference the same vm_page_t and thus
give us buffer cache unification across the board.
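
<p>
The translation from a faulting virtual address to an object page can
be sketched as a walk over the map entries, reusing struct vmobj from
the shadowing sketch above; the real code also keeps a lookup hint,
and PAGE_SIZE is assumed here for illustration.
<tscreen><verb>
#include <stdint.h>

#define PAGE_SIZE 4096          /* assumed for the sketch */

struct map_entry {
    struct map_entry *next;
    uintptr_t         start, end;  /* virtual range [start, end)  */
    struct vmobj     *object;      /* backing VM object           */
    unsigned          offset;      /* object page index of start  */
};

/* Find the object and page index backing a virtual address. */
static struct vmobj *
map_lookup(struct map_entry *head, uintptr_t va, unsigned *pindex)
{
    struct map_entry *e;

    for (e = head; e != NULL; e = e->next) {
        if (va >= e->start && va < e->end) {
            *pindex = e->offset +
                (unsigned)((va - e->start) / PAGE_SIZE);
            return (e->object);
        }
    }
    return (NULL);              /* unmapped address */
}
</verb></tscreen>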

<sect1><heading>KVM Memory Mapping</heading>

<p>
FreeBSD uses KVM to hold various kernel structures. The single
largest entity held in KVM is the filesystem buffer cache, that is,
the mappings relating to struct buf entities.

<p>
Unlike Linux, FreeBSD does NOT map all of physical memory into KVM.
This means that FreeBSD can handle memory configurations up to 4GB
on 32-bit platforms. In fact, if the MMU were capable of it, FreeBSD
could theoretically handle memory configurations up to 8TB on a
32-bit platform. However, since most 32-bit platforms are only
capable of mapping 4GB of RAM, this is a moot point.

<p>
KVM is managed through several mechanisms. The main mechanism used
to manage KVM is the <em>zone allocator</em>. The zone allocator
takes a chunk of KVM and splits it up into constant-sized blocks of
memory in order to allocate a specific type of structure. You can
use the <tt>vmstat -m</tt> command to get an overview of current KVM
utilization broken down by zone.
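
<p>
In miniature, a zone is just a free list threaded through equal-sized
items carved out of a KVM chunk. The sketch below is illustrative,
not the kernel's implementation.
<tscreen><verb>
#include <stddef.h>

struct zone {
    void   *freelist;   /* next free item, link stored in-place */
    size_t  itemsize;   /* must be at least sizeof(void *)      */
};

/* Carve a chunk of KVM into itemsize-sized blocks. */
static void
zone_init(struct zone *z, void *kva, size_t bytes, size_t itemsize)
{
    char *p = kva;

    z->freelist = NULL;
    z->itemsize = itemsize;
    while (bytes >= itemsize) {
        *(void **)p = z->freelist;      /* push block on free list */
        z->freelist = p;
        p += itemsize;
        bytes -= itemsize;
    }
}

static void *
zone_alloc(struct zone *z)
{
    void *item = z->freelist;

    if (item != NULL)
        z->freelist = *(void **)item;   /* pop */
    return (item);
}
</verb></tscreen>
Because every item in a zone has the same size, allocation and
freeing are constant-time and fragmentation within the zone is
impossible.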

<sect1><heading>Tuning the FreeBSD VM system</heading>

<p>
A concerted effort has been made to make the FreeBSD kernel
dynamically tune itself. Typically you do not need to mess with
anything beyond the 'maxusers' and 'NMBCLUSTERS' kernel config
options - that is, kernel compilation options specified in
(typically) /usr/src/sys/i386/conf/XXX. A description of all
available kernel configuration options can be found in
/usr/src/sys/i386/conf/LINT.

<p>
In a large system configuration you may wish to increase 'maxusers'.
Values typically range from 10 to 128. Note that raising maxusers
too high can cause the system to overflow available KVM, resulting
in unpredictable operation. It is better to leave maxusers at some
reasonable number and add other options, such as NMBCLUSTERS, to
increase specific resources.

<p>
If your system is going to use the network heavily, you may want to
increase NMBCLUSTERS. Typical values range from 1024 to 4096.
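
<p>
For example, a busy server's kernel configuration file might contain
lines like these (the values are illustrative):
<tscreen><verb>
maxusers        64
options         NMBCLUSTERS=4096
</verb></tscreen>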

<p>
The NBUF parameter is also traditionally used to scale the system.
This parameter determines the amount of KVA the system can use to
map filesystem buffers for I/O. Note that this parameter has nothing
whatsoever to do with the unified buffer cache! This parameter is
dynamically tuned in -3.x and later kernels and should generally not
be adjusted manually. We recommend that you NOT try to specify an
NBUF parameter - let the system pick it. Too small a value can
result in extremely inefficient filesystem operation, while too
large a value can starve the page queues by causing too many pages
to become wired down.

<p>
By default, FreeBSD kernels are not optimized. You can set debugging
and optimization flags with the 'makeoptions' directive in the
kernel configuration. Note that you should not use -g unless you can
accommodate the large (typically 7 MB+) kernels that result.

<p><tt>makeoptions DEBUG="-g"</tt>

<p><tt>makeoptions COPTFLAGS="-O2 -pipe"</tt>

<p>
Sysctl provides a way to tune kernel parameters at run-time. You
typically do not need to mess with any of the sysctl variables,
especially the VM-related ones.

<p>
Run-time VM and system tuning is relatively straightforward. First,
use softupdates on your UFS/FFS filesystems whenever possible. The
/usr/src/contrib/sys/softupdates/README file contains instructions
(and restrictions) on how to configure it.

<p>
Second, configure sufficient swap. You should have a swap partition
configured on each physical disk, up to four, even on your 'work'
disks. You should have at least 2x the swap space as you have main
memory, and possibly even more if you do not have a lot of memory.
You should also size your swap partition based on the maximum memory
configuration you ever intend to put on the machine so you do not
have to repartition your disks later on. If you want to be able to
accommodate a crash dump, your first swap partition must be at least
as large as main memory and /var/crash must have sufficient free
space to hold the dump.
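
<p>
For example, a machine with 128 MB of main memory and three disks
might use a 128 MB swap partition on the first disk (large enough to
hold a crash dump) plus 64 MB on each of the other two, giving 256 MB
in total and satisfying the 2x guideline.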

<p>
NFS-based swap is perfectly acceptable on -4.x or later systems, but
you must be aware that the NFS server will take the brunt of the
paging load.