
<!-- $Id: vm.sgml,v 1.3 1999-02-24 22:51:42 dillon Exp $ -->
<!-- The FreeBSD Documentation Project -->

<sect><heading>The FreeBSD VM System<label id="vm"></heading>

<p><em>Contributed by &a.dillon;.<newline>
  6 Feb 1999.</em>

<em>An involved description of FreeBSD's VM internals</em>

<sect1><heading>Management of physical memory - vm_page_t</heading>

	<p>
	Physical memory is managed on a page-by-page basis through the
	<em>vm_page_t</em> structure.  Pages of physical memory are
	categorized through the placement of their respective vm_page_t
	structures on one of several paging queues.
	<p>
	A page can be in a wired, active, inactive, cache, or free state.
	Except for the wired state, the page is typically placed on a doubly
	linked list queue representing the state that it is in.  Wired pages
	are not placed on any queue.
	<p>
	FreeBSD implements a more involved paging queue for cached and free
	pages in order to implement page coloring.  Each of these states
	involves multiple queues arranged according to the size of the
	processor's L1 and L2 caches.  When a new page needs to be allocated,
	FreeBSD attempts to obtain one that is reasonably well aligned from
	the point of view of the L1 and L2 caches relative to the VM object the
	page is being allocated for.
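	<p>
	The following is a minimal sketch of the page coloring idea, not the
	kernel's code.  The number of colors (NUM_COLORS), the class and method
	names, and the color formula (object color plus page index, modulo the
	number of colors) are all assumptions made for illustration.

```python
from collections import deque

NUM_COLORS = 8  # assumed; the real count depends on the cache sizes


class FreePageQueues:
    """Free pages kept on one queue per cache color (illustrative model)."""

    def __init__(self, physical_page_numbers):
        self.queues = [deque() for _ in range(NUM_COLORS)]
        for ppn in physical_page_numbers:
            # In this model a physical page's color is its page number
            # modulo the number of colors.
            self.queues[ppn % NUM_COLORS].append(ppn)

    def alloc(self, object_color, pindex):
        """Return a free page whose color suits page `pindex` of an object.

        Tries the ideal color first, then the remaining colors in order,
        so an allocation still succeeds when the ideal queue is empty.
        """
        wanted = (object_color + pindex) % NUM_COLORS
        for step in range(NUM_COLORS):
            queue = self.queues[(wanted + step) % NUM_COLORS]
            if queue:
                return queue.popleft()
        return None  # no free pages at all
```

	Consecutive pages of one object ask for consecutive colors, which is
	what keeps them from competing for the same cache lines in this model.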
	<p>
	Additionally, a page may be held with a reference count or locked
	with a busy count.  The VM system also implements an 'ultimate locked'
	state for a page using the PG_BUSY bit in the page's flags.
	<p>
	In general terms, each of the paging queues operates in an LRU fashion.
	A page is typically placed in a wired or active state initially.  When
	wired, the page is usually associated with a page table somewhere.
	The VM system ages the page by scanning pages in a more active paging
	queue (LRU) in order to move them to a less-active paging queue.  Pages
	that get moved into the cache are still associated with a VM object
	but are candidates for immediate reuse.  Pages in the free queue are
	truly free.  FreeBSD attempts to minimize the number of pages in the
	free queue, but a certain minimum number of truly free pages must be
	maintained in order to accommodate page allocation at interrupt time.
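	<p>
	The aging order described above can be modeled roughly as follows.
	This is a simplified sketch: the class, its method names, and the
	one-page-at-a-time aging step are assumptions, and wired pages are left
	out because they sit on no queue.

```python
from collections import deque

# Queue order from most to least active (wired pages sit on no queue).
ORDER = ["active", "inactive", "cache", "free"]


class PagingQueues:
    def __init__(self):
        self.queues = {name: deque() for name in ORDER}

    def insert(self, page, state="active"):
        # New pages go to the tail; the head is the least recently used.
        self.queues[state].append(page)

    def age(self, state):
        """Move the least recently used page of `state` one queue down."""
        source = self.queues[state]
        if not source or state == "free":
            return None
        page = source.popleft()
        target = ORDER[ORDER.index(state) + 1]
        self.queues[target].append(page)
        return page, target
```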
	<p>
	If a process attempts to access a page that does not exist in its
	page table but does exist in one of the paging queues (such as the
	inactive or cache queues), a relatively inexpensive page reactivation
	fault occurs which causes the page to be reactivated.  If the page
	does not exist in system memory at all, the process must block while
	the page is brought in from disk.
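	<p>
	The difference between the two kinds of fault can be sketched as
	below.  The function name, the use of sets for the page table and
	queues, and the read_from_disk callback are hypothetical stand-ins, not
	kernel interfaces.

```python
def handle_fault(page_id, page_table, queues, read_from_disk):
    """Classify and resolve a fault on `page_id` (illustrative model).

    page_table: set of page ids currently mapped for the process
    queues: dict of queue name -> set of page ids resident in memory
    read_from_disk: callable standing in for blocking I/O
    """
    if page_id in page_table:
        return "no fault"
    for name in ("active", "inactive", "cache"):
        if page_id in queues[name]:
            # Page is still in memory: requeue and remap, no I/O needed.
            queues[name].remove(page_id)
            queues["active"].add(page_id)
            page_table.add(page_id)
            return "reactivation"
    # Page is not resident: the process blocks while it is read in.
    read_from_disk(page_id)
    queues["active"].add(page_id)
    page_table.add(page_id)
    return "hard fault"
```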
	<p>
	FreeBSD dynamically tunes its paging queues and attempts to maintain
	reasonable ratios of pages in the various queues as well as attempts
	to maintain a reasonable breakdown of clean vs. dirty pages.  The
	amount of rebalancing that occurs depends on the system's memory load.
	This rebalancing is implemented by the pageout daemon and involves
	laundering dirty pages (syncing them with their backing store),
	noticing when pages are actively referenced (resetting their position
	in the LRU queues or moving them between queues), migrating pages
	between queues when the queues are out of balance, and so forth.
	FreeBSD's VM system is willing to take a reasonable number of
	reactivation page faults to determine how active or how idle a page
	actually is.  This leads to better decisions being made as to when
	to launder or swap-out a page.
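	<p>
	A single rebalancing pass of the kind attributed to the pageout daemon
	might look like the sketch below.  The Page fields, the function name,
	and the cache_target parameter are assumptions; the real daemon's
	policy is more involved.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Page:
    name: str
    dirty: bool = False
    referenced: bool = False


def pageout_scan(active, inactive, cache, cache_target):
    """One illustrative pass over the inactive queue.

    Referenced pages go back to the active queue, dirty pages are
    laundered (marked clean) and requeued, and clean pages move to the
    cache queue until it holds `cache_target` pages.
    Returns the names of the pages that were laundered.
    """
    laundered = []
    for _ in range(len(inactive)):
        if len(cache) >= cache_target:
            break
        page = inactive.popleft()
        if page.referenced:
            page.referenced = False
            active.append(page)        # still in use: reactivate
        elif page.dirty:
            page.dirty = False         # stands in for a write to backing store
            laundered.append(page.name)
            inactive.append(page)      # clean now; reconsidered on a later pass
        else:
            cache.append(page)         # clean and idle: candidate for reuse
    return laundered
```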

<sect1><heading>The unified buffer cache - vm_object_t</heading>

	<p>
	FreeBSD implements the idea of a generic 'VM object'.  VM objects
	can be associated with backing store of various types - unbacked,
	swap-backed, physical device-backed, or file-backed storage.   Since
	the filesystem uses the same VM objects to manage in-core data relating
	to files, the result is a unified buffer cache.
	<p>
	VM objects can be <em>shadowed</em>.  That is, they can be stacked on
	top of each other.  For example, you might have a swap-backed VM object
	stacked on top of a file-backed VM object in order to implement a
	MAP_PRIVATE mmap()ing.  This stacking is also used to implement various
	sharing properties, including copy-on-write, for forked address
	spaces.
	<p>
	It should be noted that a vm_page_t can only be associated with one
	VM object at a time.  The VM object shadowing implements the
	perceived sharing of the same page across multiple instances.
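	<p>
	Shadowing can be pictured with the small model below, in which a
	private object is stacked on a file-backed one.  The class and its
	methods are illustrative assumptions, not the vm_object_t interface.

```python
class VMObject:
    """A VM object holding pages by index, optionally shadowing another."""

    def __init__(self, backing=None):
        self.pages = {}          # page index -> contents
        self.backing = backing   # the object this one is stacked on

    def read(self, pindex):
        # Walk down the shadow chain until some object holds the page.
        obj = self
        while obj is not None:
            if pindex in obj.pages:
                return obj.pages[pindex]
            obj = obj.backing
        return None

    def write(self, pindex, data):
        # Copy-on-write: the modified page lands in the top object only,
        # so the object underneath is never changed.
        self.pages[pindex] = data
```

	Reads fall through to the object underneath until the top object has
	its own copy of the page, which is how a MAP_PRIVATE mapping keeps its
	changes away from the file in this model.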

<sect1><heading>Filesystem I/O - struct buf</heading>

	<p>
	vnode-backed VM objects, such as file-backed objects, generally need
	to maintain their own clean/dirty info independent from the VM system's
	idea of clean/dirty.  For example, when the VM system decides to
	synchronize a physical page to its backing store, the VM system needs
	to mark the page clean before the page is actually written to its
	backing store.  Additionally, filesystems need to be able to map
	portions of a file or file metadata into KVM in order to operate on it.
	<p>
	The entities used to manage this are known as filesystem buffers,
	<em>struct buf</em>'s, and also known as <em>bp</em>'s.  When a
	filesystem needs to operate on a portion of a VM object, it typically
	maps part of the object into a struct buf and then maps the pages in
	the struct buf into KVM.  In the same manner, disk I/O is typically
	issued by mapping portions of objects into buffer structures and
	then issuing the I/O on the buffer structures.  The underlying
	vm_page_t's are typically busied for the duration of the I/O.
	Filesystem buffers also have their own notion of being busy, which
	is useful to filesystem driver code which would rather operate on
	filesystem buffers instead of hard VM pages.
	<p>
	FreeBSD reserves a limited amount of KVM to hold mappings from struct
	bufs, but it should be made clear that this KVM is used solely to
	hold mappings and does not limit the ability to cache data.  Physical
	data caching is strictly a function of vm_page_t's, not filesystem
	buffers.  However, since filesystem buffers are used as placeholders for I/O,
	they do inherently limit the amount of concurrent I/O possible.  As
	there are usually a few thousand filesystem buffers available, this
	is not usually a problem.
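	<p>
	The distinction between limiting concurrent I/O and limiting cached
	data can be shown with a toy model.  The BufferPool class and its
	fields are hypothetical; they only mirror the relationship described
	above.

```python
class BufferPool:
    """Illustrative model: buffers bound in-flight I/O, not cached data."""

    def __init__(self, nbuf):
        self.free_bufs = nbuf
        self.in_flight = set()
        self.cached_pages = set()

    def start_io(self, page):
        if self.free_bufs == 0:
            return False             # caller would have to wait for a buffer
        self.free_bufs -= 1
        self.in_flight.add(page)
        return True

    def finish_io(self, page):
        self.in_flight.remove(page)
        self.free_bufs += 1          # the buffer is released ...
        self.cached_pages.add(page)  # ... but the page stays cached
```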

<sect1><heading>Mapping Page Tables - vm_map_t, vm_entry_t</heading>

	<p>
	FreeBSD separates the physical page table topology from the VM
	system.  All hard per-process page tables can be reconstructed on
	the fly and are usually considered throwaway.  Special page tables
	such as those managing KVM are typically permanently preallocated.
	These page tables are not throwaway.
	<p>
	FreeBSD associates portions of vm_objects with address ranges in
	virtual memory through vm_map_t and vm_entry_t structures.  Page
	tables are directly synthesized from the vm_map_t/vm_entry_t/
	vm_object_t hierarchy.  Remember when I mentioned that physical pages
	are only directly associated with a vm_object?  Well, that is not
	quite true.  vm_page_t's are also linked into page tables that they
	are actively associated with.  One vm_page_t can be linked into
	several <em>pmaps</em>, as page tables are called.  However, the
	hierarchical association holds so all references to the same
	page in the same object reference the same vm_page_t and thus give
	us buffer cache unification across the board.
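	<p>
	The sketch below shows, under simplifying assumptions, how a page
	table can be rebuilt from map entries alone.  MapEntry, lookup, and
	rebuild_page_table are hypothetical names, a dictionary stands in for a
	VM object, and a 4096-byte page size with page-aligned entries is
	assumed.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed page size


@dataclass
class MapEntry:
    start: int        # first virtual address covered (page aligned)
    end: int          # one past the last virtual address covered
    obj: dict         # stand-in for a VM object: page index -> page
    offset: int       # byte offset of `start` within the object


def lookup(entries, vaddr):
    """Resolve a virtual address to (page, byte offset within the page)."""
    for entry in entries:
        if entry.start <= vaddr < entry.end:
            obj_offset = entry.offset + (vaddr - entry.start)
            pindex = obj_offset // PAGE_SIZE
            return entry.obj.get(pindex), obj_offset % PAGE_SIZE
    return None, None  # address is not mapped


def rebuild_page_table(entries):
    """Recreate a throwaway page table from the map entries alone."""
    table = {}
    for entry in entries:
        for vaddr in range(entry.start, entry.end, PAGE_SIZE):
            page, _ = lookup(entries, vaddr)
            if page is not None:
                table[vaddr // PAGE_SIZE] = page
    return table
```

	Two maps that cover the same object resolve to the same page, which is
	the property the unified buffer cache relies on.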

<sect1><heading>KVM Memory Mapping</heading>

	<p>
	FreeBSD uses KVM to hold various kernel structures.  The single
	largest entity held in KVM is the filesystem buffer cache.  That is,
	mappings relating to struct buf entities.
	<p>
	Unlike Linux, FreeBSD does NOT map all of physical memory into KVM.
	This means that FreeBSD can handle memory configurations up to 4GB
	on 32-bit platforms.  In fact, if the MMU were capable of it, FreeBSD
	could theoretically handle memory configurations up to 8TB on a
	32-bit platform.  However, since most 32-bit platforms are only capable
	of mapping 4GB of RAM, this is a moot point.
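	<p>
	The text does not say where the 8TB figure comes from.  One reading,
	offered here only as an assumption, is that physical page numbers are
	held in a signed 32-bit integer and pages are 4 KB, which works out as
	follows.

```python
PAGE_SIZE = 4096   # assumed: 4 KB pages
INDEX_BITS = 31    # assumed: page numbers kept in a signed 32-bit integer

# Number of distinct pages times the size of each page.
addressable_bytes = (2 ** INDEX_BITS) * PAGE_SIZE
addressable_tb = addressable_bytes // 2 ** 40
```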
	<p>
	KVM is managed through several mechanisms.  The main mechanism used to
	manage KVM is the <em>zone allocator</em>.  The zone allocator takes
	a chunk of KVM and splits it up into constant-sized blocks of memory
	in order to allocate a specific type of structure.  You can use the
	<tt>vmstat -m</tt> command to get an overview of current KVM
	utilization broken down by zone.
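	<p>
	The idea of carving a chunk into constant-sized blocks can be sketched
	as below.  This is a generic fixed-size allocator written for
	illustration; the Zone class and its methods are assumptions and do not
	reproduce the kernel's zone allocator interface.

```python
class Zone:
    """Fixed-size block allocator over one chunk of address space."""

    def __init__(self, base, chunk_size, item_size):
        self.item_size = item_size
        # Carve the chunk into equal blocks and keep them on a free list.
        count = chunk_size // item_size
        self.free = [base + i * item_size for i in range(count)]
        self.used = set()

    def alloc(self):
        if not self.free:
            return None            # zone exhausted
        addr = self.free.pop()
        self.used.add(addr)
        return addr

    def release(self, addr):
        if addr not in self.used:
            raise ValueError("address was not allocated from this zone")
        self.used.remove(addr)
        self.free.append(addr)     # most recently freed block is reused first
```

	Because every block in a zone has the same size, allocation and release
	are constant-time list operations and the zone cannot fragment.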
	<p>

<sect1><heading>Tuning the FreeBSD VM system</heading>
	<p>
	A concerted effort has been made to make the FreeBSD kernel dynamically
	tune itself.  Typically you do not need to mess with anything beyond
	the 'maxusers' and 'NMBCLUSTERS' kernel config options.  That is,
	kernel compilation options specified in (typically)
	/usr/src/sys/i386/conf/XXX.  A description of all available kernel
	configuration options can be found in /usr/src/sys/i386/conf/LINT.
	<p>
	In a large system configuration you may wish to increase 'maxusers'.
	Values typically range from 10 to 128.  Note that raising maxusers
	too high can cause the system to overflow available KVM resulting in
	unpredictable operation.  It is better to leave maxusers at some
	reasonable number and add other options, such as NMBCLUSTERS, to
	increase specific resources.
	<p>
	If your system is going to use the network heavily, you may want
	to increase NMBCLUSTERS.  Typical values range from 1024 to 4096.
	<p>
	The NBUF parameter is also traditionally used to scale the system.
	This parameter determines the amount of KVA the system can use to
	map filesystem buffers for I/O.  Note that this parameter has nothing
	whatsoever to do with the unified buffer cache!   This parameter
	is dynamically tuned in -3.x and later kernels and should generally not
	be adjusted manually.  We recommend that you NOT try to specify an
	NBUF parameter.  Let the system pick it.  Too small a value can result
	in extremely inefficient filesystem operation while too large a value
	can starve the page queues by causing too many pages to become wired
	down.
	<p>
	By default, FreeBSD kernels are not optimized.  You can set debugging
	and optimization flags with the 'makeoptions' directive in the kernel
	configuration.  Note that you should not use -g unless you can
	accommodate the large (typically 7 MB+) kernels that result.
	<p><tt>makeoptions    DEBUG="-g"</tt>
	<p><tt>makeoptions     COPTFLAGS="-O2 -pipe"</tt>
	<p>
	Sysctl provides a way to tune kernel parameters at run-time.  You
	typically do not need to mess with any of the sysctl variables,
	especially the VM related ones.
	<p>
	Run-time VM and system tuning is relatively straightforward.  First,
	use softupdates on your UFS/FFS filesystems whenever possible.
	The /usr/src/contrib/sys/softupdates/README file contains instructions
	(and restrictions) on how to configure it.
	<p>
	Second, configure sufficient swap.  You should have a swap partition
	configured on each physical disk, up to four, even on your 'work'
	disks.  You should have at least twice as much swap space as you have
	main memory, and possibly even more if you do not have a lot of
	memory.  You should also size your swap partition based on the maximum
	memory configuration you ever intend to put on the machine so you do
	not have to repartition your disks later on.  If you want to be able
	to accommodate a crash dump, your first swap partition must be at
	least as large as main memory and /var/crash must have sufficient free
	space to hold the dump.
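	<p>
	These rules of thumb can be turned into a small planning helper.  The
	function below is a hypothetical aid, not a FreeBSD tool, and splitting
	the total evenly across disks is an assumption made to keep the sketch
	simple.

```python
def plan_swap(max_ram_mb, disks, want_crash_dump=True):
    """Return per-disk swap partition sizes in MB (illustrative only).

    Applies the rules of thumb above: total swap of at least twice the
    largest RAM size planned for the machine, spread over at most four
    disks, with the first partition no smaller than RAM when crash dumps
    are wanted.
    """
    if disks < 1:
        raise ValueError("need at least one disk")
    used_disks = min(disks, 4)
    total = 2 * max_ram_mb
    per_disk = -(-total // used_disks)      # ceiling division
    sizes = [per_disk] * used_disks
    if want_crash_dump:
        sizes[0] = max(sizes[0], max_ram_mb)
    return sizes
```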
	<p>
	NFS-based swap is perfectly acceptable on -4.x or later systems, but
	you must be aware that the NFS server will take the brunt of the
	paging load.
