diff --git a/en_US.ISO8859-1/articles/vm-design/Makefile b/en_US.ISO8859-1/articles/vm-design/Makefile new file mode 100644 index 0000000000..6758b4073a --- /dev/null +++ b/en_US.ISO8859-1/articles/vm-design/Makefile @@ -0,0 +1,16 @@ +# $FreeBSD: doc/en_US.ISO_8859-1/articles/mh/Makefile,v 1.8 1999/09/06 06:52:37 peter Exp $ + +DOC?= article + +FORMATS?= html + +IMAGES= fig1.eps fig2.eps fig3.eps fig4.eps + +INSTALL_COMPRESSED?=gz +INSTALL_ONLY_COMPRESSED?= + +SRCS= article.sgml + +DOC_PREFIX?= ${.CURDIR}/../../.. + +.include "${DOC_PREFIX}/share/mk/doc.project.mk" diff --git a/en_US.ISO8859-1/articles/vm-design/article.sgml b/en_US.ISO8859-1/articles/vm-design/article.sgml new file mode 100644 index 0000000000..7479a04cf8 --- /dev/null +++ b/en_US.ISO8859-1/articles/vm-design/article.sgml @@ -0,0 +1,838 @@ + + + + +%man; +]> + +
+ + Design elements of the FreeBSD VM system + + + + Matthew + + Dillon + + +
+ dillon@apollo.backplane.com +
+
+
+
+ + + The title is really just a fancy way of saying that I am going to + attempt to describe the whole VM enchilada, hopefully in a way that + everyone can follow. For the last year I have concentrated on a number + of major kernel subsystems within FreeBSD, with the VM and Swap + subsystems being the most interesting and NFS being &lsquo;a necessary + chore&rsquo;. I rewrote only small portions of the code. In the VM + arena the only major rewrite I have done is to the swap subsystem. + Most of my work was cleanup and maintenance, with only moderate code + rewriting and no major algorithmic adjustments within the VM + subsystem. The bulk of the VM subsystem's theoretical base remains + unchanged and a lot of the credit for the modernization effort in the + last few years belongs to John Dyson and David Greenman. Not being a + historian like Kirk, I will not attempt to tag all the various features + with people's names, since I will invariably get it wrong. + + + + This article was originally published in the January 2000 issue of + DaemonNews. This + version of the article may include updates from Matt and other authors + to reflect changes in FreeBSD's VM implementation. +
+ + + Introduction + + Before moving along to the actual design, let's spend a little time + on the necessity of maintaining and modernizing any long-living + codebase. In the programming world, algorithms tend to be more + important than code and it is precisely due to BSD's academic roots that + a great deal of attention was paid to algorithm design from the + beginning. More attention paid to the design generally leads to a clean + and flexible codebase that can be fairly easily modified, extended, or + replaced over time. While BSD is considered an &lsquo;old&rsquo; + operating system by some people, those of us who work on it tend to view + it more as a &lsquo;mature&rsquo; codebase which has various components + modified, extended, or replaced with modern code. It has evolved, and + FreeBSD is at the bleeding edge no matter how old some of the code might + be. This is an important distinction to make and one that is + unfortunately lost to many people. The biggest error a programmer can + make is to not learn from history, and this is precisely the error that + many other modern operating systems have made. NT is the best example + of this, and the consequences have been dire. Linux also makes this + mistake to some degree&mdash;enough that we BSD folk can make small + jokes about it every once in a while, anyway. Linux's problem is simply + one of a lack of experience and history to compare ideas against, a + problem that is easily and rapidly being addressed by the Linux + community in the same way it has been addressed in the BSD + community&mdash;by continuous code development. The NT folk, on the + other hand, repeatedly make the same mistakes solved by UNIX decades ago + and then spend years fixing them. Over and over again. They have a + severe case of &lsquo;not designed here&rsquo; and &lsquo;we are always + right because our marketing department says so&rsquo;. I have little + tolerance for anyone who cannot learn from history. + + Much of the apparent complexity of the FreeBSD design, especially in + the VM/Swap subsystem, is a direct result of having to solve serious + performance issues that occur under various conditions. These issues + are not due to bad algorithmic design but instead arise from + environmental factors. In any direct comparison between platforms, + these issues become most apparent when system resources begin to get + stressed. As I describe FreeBSD's VM/Swap subsystem, the reader should + always keep two points in mind. First, the most important aspect of + performance design is what is known as &ldquo;Optimizing the Critical + Path&rdquo;. It is often the case that performance optimizations add a + little bloat to the code in order to make the critical path perform + better. Second, a solid, generalized design outperforms a + heavily-optimized design over the long run. While a generalized design + may end up being slower than a heavily-optimized design when they are + first implemented, the generalized design tends to be easier to adapt to + changing conditions and the heavily-optimized design winds up having to + be thrown away. Any codebase that will survive and be maintainable for + years must therefore be designed properly from the beginning even if it + costs some performance. Twenty years ago people were still arguing that + programming in assembly was better than programming in a high-level + language because it produced code that was ten times as fast. Today, + the fallibility of that argument is obvious&mdash;as are the parallels + to algorithmic design and code generalization.
+ + + + VM Objects + + The best way to begin describing the FreeBSD VM system is to look at + it from the perspective of a user-level process. Each user process sees + a single, private, contiguous VM address space containing several types + of memory objects. These objects have various characteristics. Program + code and program data are effectively a single memory-mapped file (the + binary file being run), but program code is read-only while program data + is copy-on-write. Program BSS is just memory allocated and filled with + zeros on demand, called demand zero page fill. Arbitrary files can be + memory-mapped into the address space as well, which is how the shared + library mechanism works. Such mappings can require modifications to + remain private to the process making them. The fork system call adds an + entirely new dimension to the VM management problem on top of the + complexity already given. + + A program binary data page (which is a basic copy-on-write page) + illustrates the complexity. A program binary contains a preinitialized + data section which is initially mapped directly from the program file. + When a program is loaded into a process's VM space, this area is + initially memory-mapped and backed by the program binary itself, + allowing the VM system to free/reuse the page and later load it back in + from the binary. The moment a process modifies this data, however, the + VM system must make a private copy of the page for that process. Since + the private copy has been modified, the VM system may no longer free it, + because there is no longer any way to restore it later on. + + You will notice immediately that what was originally a simple file + mapping has become much more complex. Data may be modified on a + page-by-page basis whereas the file mapping encompasses many pages at + once. The complexity further increases when a process forks. When a + process forks, the result is two processes&mdash;each with its own + private address space, including any modifications made by the original + process prior to the call to fork(). It would be + silly for the VM system to make a complete copy of the data at the time + of the fork() because it is quite possible that at + least one of the two processes will only need to read from that page + from then on, allowing the original page to continue to be used. What + was a private page is made copy-on-write again, since each process + (parent and child) expects its own post-fork modifications to + remain private to itself and not affect the other. + + FreeBSD manages all of this with a layered VM Object model. The + original binary program file winds up being the lowest VM Object layer. + A copy-on-write layer is pushed on top of that to hold those pages which + had to be copied from the original file. If the program modifies a data + page belonging to the original file the VM system takes a fault and + makes a copy of the page in the higher layer. When a process forks, + additional VM Object layers are pushed on. This might make a little + more sense with a fairly basic example. A fork() + is a common operation for any *BSD system, so this example will consider + a program that starts up, and forks. When the process starts, the VM + system creates an object layer, let's call this A: + + + + + + + + +---------------+ +| A | ++---------------+ + + + + A picture + + + + A represents the file&mdash;pages may be paged in and out of the + file's physical media as necessary.
Paging in from the disk is + reasonable for a program, but we really don't want to page back out and + overwrite the executable. The VM system therefore creates a second + layer, B, that will be physically backed by swap space: + + + + + + + + +---------------+ +| B | ++---------------+ +| A | ++---------------+ + + + + On the first write to a page after this, a new page is created in B, + and its contents are initialized from A. All pages in B can be paged in + or out to a swap device. When the program forks, the VM system creates + two new object layers&mdash;C1 for the parent, and C2 for the + child&mdash;that rest on top of B: + + + + + + + + +-------+-------+ +| C1 | C2 | ++-------+-------+ +| B | ++---------------+ +| A | ++---------------+ + + + + In this case, let's say a page in B is modified by the original + parent process. The process will take a copy-on-write fault and + duplicate the page in C1, leaving the original page in B untouched. + Now, let's say the same page in B is modified by the child process. The + process will take a copy-on-write fault and duplicate the page in C2. + The original page in B is now completely hidden since both C1 and C2 + have a copy (and B could theoretically be destroyed if it does not + represent a 'real' file). However, this sort of optimization is not + trivial to make because it is so fine-grained. FreeBSD does not make + this optimization. Now, suppose (as is often the case) that the child + process does an exec(). Its current address space + is usually replaced by a new address space representing a new file. In + this case, the C2 layer is destroyed: + + + + + + + + +-------+ +| C1 | ++-------+-------+ +| B | ++---------------+ +| A | ++---------------+ + + + + In this case, the number of children of B drops to one, and all + accesses to B now go through C1. This means that B and C1 can be + collapsed together. Any pages in B that also exist in C1 are deleted + from B during the collapse. Thus, even though the optimization in the + previous step could not be made, we can recover the dead pages when + either of the processes exits or calls exec(). + + This model creates a number of potential problems. The first is that + you can wind up with a relatively deep stack of layered VM Objects which + can cost scanning time and memory when you take a fault. Deep + layering can occur when processes fork and then fork again (either + parent or child). The second problem is that you can wind up with dead, + inaccessible pages deep in the stack of VM Objects. In our last example, + if both the parent and child processes modify the same page, they both + get their own private copies of the page and the original page in B is + no longer accessible by anyone. That page in B can be freed. + + FreeBSD solves the deep layering problem with a special optimization + called the &ldquo;All Shadowed Case&rdquo;. This case occurs if either + C1 or C2 take sufficient COW faults to completely shadow all pages in B. + Let's say that C1 achieves this. C1 can now bypass B entirely, so rather + than have C1->B->A and C2->B->A we now have C1->A and C2->B->A. But + look what also happened&mdash;now B has only one reference (C2), so we + can collapse B and C2 together. The end result is that B is deleted + entirely and we have C1->A and C2->A. It is often the case that B will + contain a large number of pages and neither C1 nor C2 will be able to + completely overshadow it.
If we fork again and create a set of D + layers, however, it is much more likely that one of the D layers will + eventually be able to completely overshadow the much smaller dataset + represented by C1 or C2. The same optimization will work at any point in + the graph and the grand result of this is that even on a heavily forked + machine VM Object stacks tend not to get much deeper than 4. This is + true of both the parent and the children and true whether the parent is + doing the forking or whether the children cascade forks. + + The dead page problem still exists in the case where C1 or C2 do not + completely overshadow B. Due to our other optimizations this case does + not represent much of a problem and we simply allow the pages to be + dead. If the system runs low on memory, it will swap them out, eating a + little swap, but that's it. + + The advantage to the VM Object model is that + fork() is extremely fast, since no real data + copying need take place. The disadvantage is that you can build a + relatively complex VM Object layering that slows page fault handling + down a little, and you spend memory managing the VM Object structures. + The optimizations FreeBSD makes prove to reduce the problems enough + that they can be ignored, leaving no real disadvantage. + + + + SWAP Layers + + Private data pages are initially either copy-on-write or zero-fill + pages. When a change, and therefore a copy, is made, the original + backing object (usually a file) can no longer be used to save a copy of + the page when the VM system needs to reuse it for other purposes. This + is where SWAP comes in. SWAP is allocated to create backing store for + memory that does not otherwise have it. FreeBSD allocates the swap + management structure for a VM Object only when it is actually needed. + However, the swap management structure has had problems + historically. + + Under FreeBSD 3.x the swap management structure preallocates an + array that encompasses the entire object requiring swap backing + store&mdash;even if only a few pages of that object are swap-backed. + This creates a kernel memory fragmentation problem when large objects + are mapped, or processes with large runsizes (RSS) fork. Also, in order + to keep track of swap space, a &lsquo;list of holes&rsquo; is kept in + kernel memory, and this tends to get severely fragmented as well. Since + the 'list of holes' is a linear list, the swap allocation and freeing + performance is a non-optimal O(n)-per-page. It also requires kernel + memory allocations to take place during the swap freeing process, and + that creates low memory deadlock problems. The problem is further + exacerbated by holes created due to the interleaving algorithm. Also, + the swap block map can become fragmented fairly easily, resulting in + non-contiguous allocations. Kernel memory must also be allocated on the + fly for additional swap management structures when a swapout occurs. It + is evident that there was plenty of room for improvement. + + For FreeBSD 4.x, I completely rewrote the swap subsystem. With this + rewrite, swap management structures are allocated through a hash table + rather than a linear array, giving them a fixed allocation size and much + finer granularity. Rather than using a linearly linked list to keep + track of swap space reservations, it now uses a bitmap of swap blocks + arranged in a radix tree structure with free-space hinting in the radix + node structures. This effectively makes swap allocation and freeing an + O(1) operation.
The entire radix tree bitmap is also preallocated in + order to avoid having to allocate kernel memory during critical low + memory swapping operations. After all, the system tends to swap when it + is low on memory, so we should avoid allocating kernel memory at such + times in order to avoid potential deadlocks. Finally, to reduce + fragmentation the radix tree is capable of allocating large contiguous + chunks at once, skipping over smaller fragmented chunks. I did not take + the final step of having an 'allocating hint pointer' that would trundle + through a portion of swap as allocations were made in order to further + guarantee contiguous allocations or at least locality of reference, but + I ensured that such an addition could be made. + + + + When to free a page + + Since the VM system uses all available memory for disk caching, + there are usually very few truly-free pages. The VM system depends on + being able to properly choose pages which are not in use to reuse for + new allocations. Selecting the optimal pages to free is possibly the + single most important function any VM system can perform because if it + makes a poor selection, the VM system may be forced to unnecessarily + retrieve pages from disk, seriously degrading system performance. + + How much overhead are we willing to suffer in the critical path to + avoid freeing the wrong page? Each wrong choice we make will cost us + hundreds of thousands of CPU cycles and a noticeable stall of the + affected processes, so we are willing to endure a significant amount of + overhead in order to be sure that the right page is chosen. This is why + FreeBSD tends to outperform other systems when memory resources become + stressed. + + The free page determination algorithm is built upon a history of the + use of memory pages. To acquire this history, the system takes advantage + of a page-used bit feature that most hardware page tables have. + + In any case, the page-used bit is cleared and at some later point + the VM system comes across the page again and sees that the page-used + bit has been set. This indicates that the page is still being actively + used. If the bit is still clear, it is an indication that the page is not + being actively used. By testing this bit periodically, a use history (in + the form of a counter) for the physical page is developed. When the VM + system later needs to free up some pages, checking this history becomes + the cornerstone of determining the best candidate page to reuse. + + + What if the hardware has no page-used bit? + + For those platforms that do not have this feature, the system + actually emulates a page-used bit. It unmaps or protects a page, + forcing a page fault if the page is accessed again. When the page + fault is taken, the system simply marks the page as having been used + and unprotects the page so that it may be used. While taking such page + faults just to determine if a page is being used appears to be an + expensive proposition, it is much less expensive than reusing the page + for some other purpose only to find that a process needs it back and + then have to go to disk. + + + FreeBSD makes use of several page queues to further refine the + selection of pages to reuse as well as to determine when dirty pages + must be flushed to their backing store. Since page tables are dynamic + entities under FreeBSD, it costs virtually nothing to unmap a page from + the address space of any processes using it.
When a page candidate has + been chosen based on the page-use counter, this is precisely what is + done. The system must make a distinction between clean pages, which can + theoretically be freed up at any time, and dirty pages, which must first + be written to their backing store before being reusable. When a page + candidate has been found, it is moved to the inactive queue if it is + dirty, or the cache queue if it is clean. A separate algorithm based on + the dirty-to-clean page ratio determines when dirty pages in the + inactive queue must be flushed to disk. Once this is accomplished, the + flushed pages are moved from the inactive queue to the cache queue. At + this point, pages in the cache queue can still be reactivated by a VM + fault at relatively low cost. However, pages in the cache queue are + considered to be &lsquo;immediately freeable&rsquo; and will be reused + in an LRU (least-recently used) fashion when the system needs to + allocate new memory. + + It is important to note that the FreeBSD VM system attempts to + separate clean and dirty pages for the express reason of avoiding + unnecessary flushes of dirty pages (which eat I/O bandwidth), and it + does not move pages between the various page queues gratuitously when + the memory subsystem is not being stressed. This is why you will see + some systems with very low cache queue counts and high active queue + counts when doing a systat -vm command. As the VM + system becomes more stressed, it makes a greater effort to maintain the + various page queues at the levels determined to be the most effective. + An urban myth has circulated for years that Linux did a better job + avoiding swapouts than FreeBSD, but this in fact is not true. What was + actually occurring was that FreeBSD was proactively paging out unused + pages in order to make room for more disk cache, while Linux was keeping + unused pages in core and leaving less memory available for cache and + process pages. I don't know whether this is still true today. + + + + Pre-Faulting and Zeroing Optimizations + + Taking a VM fault is not expensive if the underlying page is already + in core and can simply be mapped into the process, but it can become + expensive if you take a whole lot of them on a regular basis. A good + example of this is running a program such as &man.ls.1; or &man.ps.1; + over and over again. If the program binary is mapped into memory but + not mapped into the page table, then all the pages that will be accessed + by the program will have to be faulted in every time the program is run. + This is unnecessary when the pages in question are already in the VM + Cache, so FreeBSD will attempt to pre-populate a process's page tables + with those pages that are already in the VM Cache. One thing that + FreeBSD does not yet do is pre-copy-on-write certain pages on exec. For + example, if you run the &man.ls.1; program while running vmstat + 1 you will notice that it always takes a certain number of + page faults, even when you run it over and over again. These are + zero-fill faults, not program code faults (which were pre-faulted in + already). Pre-copying pages on exec or fork is an area that could use + more study. + + A large percentage of page faults that occur are zero-fill faults. + You can usually see this by observing the vmstat -s + output. These occur when a process accesses pages in its BSS area. The + BSS area is expected to be initially zero, but the VM system does not + bother to allocate any memory at all until the process actually accesses + it.
When a fault occurs, the VM system must not only allocate a new page, + it must zero it as well. To optimize the zeroing operation, the VM system + has the ability to pre-zero pages and mark them as such, and to request + pre-zeroed pages when zero-fill faults occur. The pre-zeroing occurs + whenever the CPU is idle, but the number of pages the system pre-zeros is + limited in order to avoid blowing away the memory caches. This is an + excellent example of adding complexity to the VM system in order to + optimize the critical path. + + + + Page Table Optimizations + + The page table optimizations make up the most contentious part of + the FreeBSD VM design, and they have shown some strain with the advent of + serious use of mmap(). I think this is actually a + feature of most BSDs, though I am not sure when it was first introduced. + There are two major optimizations. The first is that hardware page + tables do not contain persistent state but instead can be thrown away at + any time with only a minor amount of management overhead. The second is + that every active page table entry in the system has a governing + pv_entry structure which is tied into the + vm_page structure. FreeBSD can simply iterate + through those mappings that are known to exist while Linux must check + all page tables that might contain a specific + mapping to see if it does, which can incur O(n^2) overhead in certain + situations. It is because of this that FreeBSD tends to make better + choices on which pages to reuse or swap when memory is stressed, giving + it better performance under load. However, FreeBSD requires kernel + tuning to accommodate large-shared-address-space situations such as + those that can occur in a news system because it may run out of + pv_entry structures. + + Both Linux and FreeBSD need work in this area. FreeBSD is trying to + maximize the advantage of a potentially sparse active-mapping model (not + all processes need to map all pages of a shared library, for example), + whereas Linux is trying to simplify its algorithms. FreeBSD generally + has the performance advantage here at the cost of wasting a little extra + memory, but FreeBSD breaks down in the case where a large file is + massively shared across hundreds of processes. Linux, on the other hand, + breaks down in the case where many processes are sparsely-mapping the + same shared library and also runs non-optimally when trying to determine + whether a page can be reused or not. + + + + Page Coloring + + We'll end with the page coloring optimizations. Page coloring is a + performance optimization designed to ensure that accesses to contiguous + pages in virtual memory make the best use of the processor cache. In + ancient times (i.e. 10+ years ago) processor caches tended to map + virtual memory rather than physical memory. This led to a huge number of + problems, including having to clear the cache on every context switch in + some cases, and problems with data aliasing in the cache. Modern + processor caches map physical memory precisely to solve those problems. + This means that two side-by-side pages in a process's address space may + not correspond to two side-by-side pages in the cache. In fact, if you + aren't careful, side-by-side pages in virtual memory could wind up using + the same page in the processor cache&mdash;leading to cacheable data + being thrown away prematurely and reducing CPU performance. This is true + even with multi-way set-associative caches (though the effect is + mitigated somewhat).
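+ To make the collision arithmetic concrete, here is a small C sketch + (an illustration only, not FreeBSD's allocator code; the 4-page cache + size is an assumption matching the example in the next paragraph):
+
+#include &lt;stdio.h&gt;
+
+/* A physically indexed cache picks its lines from the physical page
+ * number, so pages whose numbers are equal modulo the number of pages
+ * the cache holds compete for the same cache memory. */
+#define CACHE_PAGES 4                 /* assumed: cache holds 4 pages */
+
+static unsigned
+page_color(unsigned long phys_page)
+{
+    return phys_page % CACHE_PAGES;   /* the page's cache "color" */
+}
+
+int
+main(void)
+{
+    printf("page 16 -> color %u\n", page_color(16)); /* color 0 */
+    printf("page 20 -> color %u\n", page_color(20)); /* color 0: collides with 16 */
+    printf("page 21 -> color %u\n", page_color(21)); /* color 1: cache-friendly */
+    return 0;
+}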
+ + FreeBSD's memory allocation code implements page coloring + optimizations, which means that the memory allocation code will attempt + to locate free pages that are contiguous from the point of view of the + cache. For example, if page 16 of physical memory is assigned to page 0 + of a process's virtual memory and the cache can hold 4 pages, the page + coloring code will not assign page 20 of physical memory to page 1 of a + process's virtual memory. It would, instead, assign page 21 of physical + memory. The page coloring code attempts to avoid assigning page 20 + because this maps over the same cache memory as page 16 and would result + in non-optimal caching. This code adds a significant amount of + complexity to the VM memory allocation subsystem as you can well + imagine, but the result is well worth the effort. Page coloring makes VM + memory as deterministic as physical memory with regard to cache + performance. + + + + Conclusion + + Virtual memory in modern operating systems must address a number of + different issues efficiently and for many different usage patterns. The + modular and algorithmic approach that BSD has historically taken allows + us to study and understand the current implementation as well as + relatively cleanly replace large sections of the code. There have been a + number of improvements to the FreeBSD VM system in the last several + years, and work is ongoing. + + + + Bonus QA session by Allen Briggs + <email>briggs@ninthwonder.com</email> + + + + + What is &ldquo;the interleaving algorithm&rdquo; that you + refer to in your listing of the ills of the FreeBSD 3.x swap + arrangements? + + + + FreeBSD uses a fixed swap interleave which defaults to 4. This + means that FreeBSD reserves space for four swap areas even if you + only have one, two, or three. Since swap is interleaved, the linear + address space representing the &lsquo;four swap areas&rsquo; will be + fragmented if you don't actually have four swap areas. For + example, if you have two swap areas, A and B, FreeBSD's address + space representation for that swap will be interleaved in + blocks of 16 pages: + + A B C D A B C D A B C D A B C D + + FreeBSD 3.x uses a &lsquo;sequential list of free + regions&rsquo; approach to accounting for the free swap areas. + The idea is that large blocks of free linear space can be + represented with a single list node + (kern/subr_rlist.c). But due to the + fragmentation, the sequential list winds up being insanely + fragmented. In the above example, completely unused swap will + have A and B shown as &lsquo;free&rsquo; and C and D shown as + &lsquo;all allocated&rsquo;. Each A-B sequence requires a list + node to account for it because C and D are holes, so the list node + cannot be combined with the next A-B sequence. + + Why do we interleave our swap space instead of just tacking swap + areas onto the end and doing something fancier? Because it's a whole + lot easier to allocate linear swaths of an address space and have + the result automatically be interleaved across multiple disks than + it is to try to put that sophistication elsewhere. + + The fragmentation causes other problems. Being a linear list + under 3.x, and having such a huge amount of inherent + fragmentation, allocating and freeing swap winds up being an O(N) + algorithm instead of an O(1) algorithm. Combine this with other + factors (heavy swapping) and you start getting into O(N^2) and + O(N^3) levels of overhead, which is bad.
The 3.x system may also + need to allocate KVM during a swap operation to create a new list + node, which can lead to a deadlock if the system is trying to + page out pages in a low-memory situation. + + Under 4.x we do not use a sequential list. Instead we use a + radix tree and bitmaps of swap blocks rather than ranged list + nodes. We take the hit of preallocating all the bitmaps required + for the entire swap area up front, but it winds up wasting less + memory due to the use of a bitmap (one bit per block) instead of a + linked list of nodes. The use of a radix tree instead of a + sequential list gives us nearly O(1) performance no matter how + fragmented the tree becomes. + + + + + + I don't get the following: +
+ It is important to note that the FreeBSD VM system attempts + to separate clean and dirty pages for the express reason of + avoiding unnecessary flushes of dirty pages (which eat I/O + bandwidth), and it does not move pages between the various page + queues gratuitously when the memory subsystem is not being + stressed. This is why you will see some systems with very low + cache queue counts and high active queue counts when doing a + systat -vm command. +
+ + How is the separation of clean and dirty (inactive) pages + related to the situation where you see low cache queue counts and + high active queue counts in systat -vm? Do the + systat stats roll the active and dirty pages together for the + active queue count? +
+ + + Yes, that is confusing. The relationship is + &ldquo;goal&rdquo; versus &ldquo;reality&rdquo;. Our goal is to + separate the pages, but the reality is that if we are not in a + memory crunch, we don't really have to. + + What this means is that FreeBSD will not try very hard to + separate out dirty pages (inactive queue) from clean pages (cache + queue) when the system is not being stressed, nor will it try to + deactivate pages (active queue -> inactive queue) when the system + is not being stressed, even if they aren't being used. +
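+ As a toy model (purely illustrative C, not kernel code; the names
+ are mine), the movement rules just described look like this:
+
+#include &lt;stdbool.h&gt;
+
+enum queue { Q_ACTIVE, Q_INACTIVE, Q_CACHE };
+
+struct page {
+    enum queue q;
+    bool       dirty;
+};
+
+/* One pagedaemon visit to an unused page: no gratuitous movement
+ * unless memory is stressed; dirty pages are flushed (laundered)
+ * before they may enter the cache queue. */
+static void
+visit_unused_page(struct page *p, bool mem_stressed)
+{
+    if (!mem_stressed)
+        return;                    /* goal deferred: reality says relax */
+    if (p->q == Q_ACTIVE) {
+        p->q = Q_INACTIVE;         /* deactivate */
+    } else if (p->q == Q_INACTIVE) {
+        if (p->dirty)
+            p->dirty = false;      /* flush to backing store */
+        p->q = Q_CACHE;            /* clean: immediately freeable */
+    }
+}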
+ + + + In the &man.ls.1; / vmstat 1 example, + wouldn't some of the page faults be data page faults (COW from + executable file to private page)? I.e., I would expect the page + faults to be some zero-fill and some program data. Or are you + implying that FreeBSD does do pre-COW for the program data? + + + + A COW fault can be either zero-fill or program-data. The + mechanism is the same either way because the backing program-data + is almost certainly already in the cache. I am indeed lumping the + two together. FreeBSD does not pre-COW program data or zero-fill, + but it does pre-map pages that exist in its + cache. + + + + + + In your section on page table optimizations, can you give a + little more detail about pv_entry and + vm_page (or should vm_page be + vm_pmap&mdash;as in 4.4, cf. pp. 180-181 of + McKusick, Bostic, Karels, Quarterman)? Specifically, what kind of + operation/reaction would require scanning the mappings? + + How does Linux do in the case where FreeBSD breaks down + (sharing a large file mapping over many processes)? + + + + A vm_page represents an (object,index#) + tuple. A pv_entry represents a hardware page + table entry (pte). If you have five processes sharing the same + physical page, and three of those processes' page tables actually + map the page, that page will be represented by a single + vm_page structure and three + pv_entry structures. + + pv_entry structures only represent pages + mapped by the MMU (one pv_entry represents one + pte). This means that when we need to remove all hardware + references to a vm_page (in order to reuse the + page for something else, page it out, clear it, dirty it, and so + forth), we can simply scan the linked list of + pv_entry's associated with that + vm_page to remove or modify the pte's from + their page tables. + + Under Linux there is no such linked list. In order to remove + all the hardware page table mappings for a + vm_page, Linux must index into every VM object + that might have mapped the page. For + example, if you have 50 processes all mapping the same shared + library and want to get rid of page X in that library, you need to + index into the page table for each of those 50 processes, even if + only 10 of them have actually mapped the page. So Linux is + trading off the simplicity of its design against performance. + Many VM algorithms which are O(1) or O(small N) under FreeBSD wind + up being O(N), O(N^2), or worse under Linux. Since the pte's + representing a particular page in an object tend to be at the same + offset in all the page tables they are mapped in, reducing the + number of accesses into the page tables at the same pte offset + will often avoid blowing away the L1 cache line for that offset, + which can lead to better performance. + + FreeBSD has added complexity (the pv_entry + scheme) in order to increase performance (to limit page table + accesses to only those pte's that need to be + modified). + + But FreeBSD has a scaling problem that Linux does not in that + there are a limited number of pv_entry + structures, and this causes problems when you have massive sharing + of data. In this case you may run out of + pv_entry structures even though there is plenty + of free memory available. This can be fixed easily enough by + bumping up the number of pv_entry structures in + the kernel config, but we really need to find a better way to do + it.
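+ A minimal C sketch of that relationship (the declarations are
+ simplified stand-ins, not FreeBSD's actual structures; see the
+ real vm_page and pmap code for those):
+
+struct pmap;                      /* one process's page tables */
+struct vm_object;
+
+struct pv_entry {
+    struct pmap     *pv_pmap;     /* which page table maps the page */
+    unsigned long    pv_va;       /* at which virtual address */
+    struct pv_entry *pv_next;     /* next mapping of the same page */
+};
+
+struct vm_page {
+    struct vm_object *object;     /* (object, pindex) tuple */
+    unsigned long     pindex;
+    struct pv_entry  *pv_list;    /* every pte that currently maps it */
+};
+
+void pmap_pte_remove(struct pmap *, unsigned long); /* hypothetical helper */
+
+/* Remove every hardware mapping of a physical page by walking its pv
+ * list; no other process's page tables ever need to be scanned. */
+static void
+page_remove_all_mappings(struct vm_page *m)
+{
+    struct pv_entry *pv;
+
+    for (pv = m->pv_list; pv != NULL; pv = pv->pv_next)
+        pmap_pte_remove(pv->pv_pmap, pv->pv_va);
+    m->pv_list = NULL;            /* entries would be freed here */
+}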
+ + Regarding the memory overhead of a page table versus the + pv_entry scheme: Linux uses + &lsquo;permanent&rsquo; page tables that are not thrown away, but + does not need a pv_entry for each potentially + mapped pte. FreeBSD uses &lsquo;throw away&rsquo; page tables but + adds in a pv_entry structure for each + actually-mapped pte. I think memory utilization winds up being + about the same, giving FreeBSD an algorithmic advantage with its + ability to throw away page tables at will with very low + overhead. + + + + + + Finally, in the page coloring section, it might help to have a + little more description of what you mean here. I didn't quite + follow it. + + + + Do you know how an L1 hardware memory cache works? I'll + explain: Consider a machine with 16MB of main memory but only 128K + of L1 cache. Generally the way this cache works is that each 128K + block of main memory uses the same 128K of + cache. If you access offset 0 in main memory and then offset 128K + in main memory, you can wind up throwing away the + cached data you read from offset 0! + + Now, I am simplifying things greatly. What I just described + is what is called a &lsquo;direct mapped&rsquo; hardware memory + cache. Most modern caches are what are called + 2-way-set-associative or 4-way-set-associative caches. The + set-associativity allows you to access up to N different memory + regions that overlap the same cache memory without destroying the + previously cached data. But only N. + + So if I have a 4-way set associative cache, I can access offset + 0, offset 128K, offset 256K, and offset 384K and still be able to access + offset 0 again and have it come from the L1 cache. If I then + access offset 512K, however, one of the four previously cached + data objects will be thrown away by the cache. + + It is extremely important&hellip; + extremely important for most of a processor's + memory accesses to be able to come from the L1 cache, because the + L1 cache operates at the processor frequency. The moment you have + an L1 cache miss and have to go to the L2 cache or to main memory, + the processor will stall and potentially sit twiddling its fingers + for hundreds of instructions' worth of time + waiting for a read from main memory to complete. Main memory (the + dynamic RAM you stuff into a computer) is + slow, when compared to the speed of a modern + processor core. + + OK, so now on to page coloring: All modern memory caches are + what are known as physical caches. They + cache physical memory addresses, not virtual memory addresses. + This allows the cache to be left alone across a process context + switch, which is very important. + + But in the UNIX world you are dealing with virtual address + spaces, not physical address spaces. Any program you write will + see the virtual address space given to it. The actual + physical pages underlying that virtual + address space are not necessarily physically contiguous! In fact, + you might have two pages that are side by side in a process's + address space which wind up being at offset 0 and offset 128K in + physical memory. + + A program normally assumes that two side-by-side pages will be + optimally cached. That is, that you can access data objects in + both pages without having them blow away each other's cache entry. + But this is only true if the physical pages underlying the virtual + address space are contiguous (insofar as the cache is + concerned). + + This is what page coloring does.
Instead of assigning + random physical pages to virtual addresses, + which may result in non-optimal cache performance, page coloring + assigns reasonably-contiguous physical pages + to virtual addresses. Thus programs can be written under the + assumption that the characteristics of the underlying hardware + cache are the same for their virtual address space as they would + be if the program had been run directly in a physical address + space. + + Note that I say &lsquo;reasonably&rsquo; contiguous rather + than simply &lsquo;contiguous&rsquo;. From the point of view of a + 128K direct mapped cache, the physical address 0 is the same as + the physical address 128K. So two side-by-side pages in your + virtual address space may wind up being offset 128K and offset + 132K in physical memory, but could also easily be offset 128K and + offset 4K in physical memory and still retain the same cache + performance characteristics. So page coloring does + not have to assign truly contiguous pages of + physical memory to contiguous pages of virtual memory; it just + needs to make sure it assigns contiguous pages from the point of + view of cache performance and operation. + +
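+ To put numbers on that: a toy direct-mapped model (illustrative C
+ only, not a description of any particular CPU) shows why 128K/132K
+ and 128K/4K are both acceptable pairings while 0/128K is not:
+
+#include &lt;stdio.h&gt;
+
+#define CACHE_SIZE (128 * 1024)        /* the 128K cache from the example */
+
+/* Direct mapped: each physical address has exactly one candidate
+ * location in the cache, selected by addr % CACHE_SIZE. */
+static unsigned long
+cache_slot(unsigned long addr)
+{
+    return addr % CACHE_SIZE;
+}
+
+int
+main(void)
+{
+    printf("0    -> slot %lu\n", cache_slot(0));          /* 0 */
+    printf("128K -> slot %lu\n", cache_slot(128 * 1024)); /* 0: evicts offset 0 */
+    printf("132K -> slot %lu\n", cache_slot(132 * 1024)); /* 4096: avoids 128K's slot */
+    printf("4K   -> slot %lu\n", cache_slot(4 * 1024));   /* 4096: also avoids it */
+    return 0;
+}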
+
+
diff --git a/en_US.ISO8859-1/articles/vm-design/fig1.eps b/en_US.ISO8859-1/articles/vm-design/fig1.eps new file mode 100644 index 0000000000..49d2c05a56 --- /dev/null +++ b/en_US.ISO8859-1/articles/vm-design/fig1.eps @@ -0,0 +1,104 @@ +%!PS-Adobe-2.0 EPSF-2.0 +%%Title: fig1.eps +%%Creator: fig2dev Version 3.2.3 Patchlevel +%%CreationDate: Sun Oct 8 19:54:25 2000 +%%For: nik@canyon.nothing-going-on.org (Nik Clayton) +%%BoundingBox: 0 0 119 65 +%%Magnification: 1.0000 +%%EndComments +/$F2psDict 200 dict def +$F2psDict begin +$F2psDict /mtrx matrix put +/col-1 {0 setgray} bind def +/col0 {0.000 0.000 0.000 srgb} bind def +/col1 {0.000 0.000 1.000 srgb} bind def +/col2 {0.000 1.000 0.000 srgb} bind def +/col3 {0.000 1.000 1.000 srgb} bind def +/col4 {1.000 0.000 0.000 srgb} bind def +/col5 {1.000 0.000 1.000 srgb} bind def +/col6 {1.000 1.000 0.000 srgb} bind def +/col7 {1.000 1.000 1.000 srgb} bind def +/col8 {0.000 0.000 0.560 srgb} bind def +/col9 {0.000 0.000 0.690 srgb} bind def +/col10 {0.000 0.000 0.820 srgb} bind def +/col11 {0.530 0.810 1.000 srgb} bind def +/col12 {0.000 0.560 0.000 srgb} bind def +/col13 {0.000 0.690 0.000 srgb} bind def +/col14 {0.000 0.820 0.000 srgb} bind def +/col15 {0.000 0.560 0.560 srgb} bind def +/col16 {0.000 0.690 0.690 srgb} bind def +/col17 {0.000 0.820 0.820 srgb} bind def +/col18 {0.560 0.000 0.000 srgb} bind def +/col19 {0.690 0.000 0.000 srgb} bind def +/col20 {0.820 0.000 0.000 srgb} bind def +/col21 {0.560 0.000 0.560 srgb} bind def +/col22 {0.690 0.000 0.690 srgb} bind def +/col23 {0.820 0.000 0.820 srgb} bind def +/col24 {0.500 0.190 0.000 srgb} bind def +/col25 {0.630 0.250 0.000 srgb} bind def +/col26 {0.750 0.380 0.000 srgb} bind def +/col27 {1.000 0.500 0.500 srgb} bind def +/col28 {1.000 0.630 0.630 srgb} bind def +/col29 {1.000 0.750 0.750 srgb} bind def +/col30 {1.000 0.880 0.880 srgb} bind def +/col31 {1.000 0.840 0.000 srgb} bind def + +end +save +newpath 0 65 moveto 0 0 lineto 119 0 lineto 119 65 lineto closepath clip newpath +-143.0 298.0 translate +1 -1 scale + +/cp {closepath} bind def +/ef {eofill} bind def +/gr {grestore} bind def +/gs {gsave} bind def +/sa {save} bind def +/rs {restore} bind def +/l {lineto} bind def +/m {moveto} bind def +/rm {rmoveto} bind def +/n {newpath} bind def +/s {stroke} bind def +/sh {show} bind def +/slc {setlinecap} bind def +/slj {setlinejoin} bind def +/slw {setlinewidth} bind def +/srgb {setrgbcolor} bind def +/rot {rotate} bind def +/sc {scale} bind def +/sd {setdash} bind def +/ff {findfont} bind def +/sf {setfont} bind def +/scf {scalefont} bind def +/sw {stringwidth} bind def +/tr {translate} bind def +/tnt {dup dup currentrgbcolor + 4 -2 roll dup 1 exch sub 3 -1 roll mul add + 4 -2 roll dup 1 exch sub 3 -1 roll mul add + 4 -2 roll dup 1 exch sub 3 -1 roll mul add srgb} + bind def +/shd {dup dup currentrgbcolor 4 -2 roll mul 4 -2 roll mul + 4 -2 roll mul srgb} bind def +/$F2psBegin {$F2psDict begin /$F2psEnteredState save def} def +/$F2psEnd {$F2psEnteredState restore end} def + +$F2psBegin +%%Page: 1 1 +10 setmiterlimit + 0.06000 0.06000 sc +% Polyline +7.500 slw +n 2400 4200 m 4050 4200 l 4050 4950 l 2400 4950 l + cp gs col0 s gr +% Polyline +n 4050 4200 m + 4350 3900 l gs col0 s gr +% Polyline +n 2400 4200 m 2700 3900 l 4350 3900 l 4350 4650 l + 4050 4950 l gs col0 s gr +/Helvetica-Bold ff 180.00 scf sf +3225 4650 m +gs 1 -1 sc (A) dup sw pop 2 div neg 0 rm col0 sh gr +$F2psEnd +rs diff --git a/en_US.ISO8859-1/articles/vm-design/fig2.eps b/en_US.ISO8859-1/articles/vm-design/fig2.eps new file 
mode 100644 index 0000000000..fcb8bd41ad --- /dev/null +++ b/en_US.ISO8859-1/articles/vm-design/fig2.eps @@ -0,0 +1,115 @@ +%!PS-Adobe-2.0 EPSF-2.0 +%%Title: fig2.eps +%%Creator: fig2dev Version 3.2.3 Patchlevel +%%CreationDate: Sun Oct 8 19:55:31 2000 +%%For: nik@canyon.nothing-going-on.org (Nik Clayton) +%%BoundingBox: 0 0 120 110 +%%Magnification: 1.0000 +%%EndComments +/$F2psDict 200 dict def +$F2psDict begin +$F2psDict /mtrx matrix put +/col-1 {0 setgray} bind def +/col0 {0.000 0.000 0.000 srgb} bind def +/col1 {0.000 0.000 1.000 srgb} bind def +/col2 {0.000 1.000 0.000 srgb} bind def +/col3 {0.000 1.000 1.000 srgb} bind def +/col4 {1.000 0.000 0.000 srgb} bind def +/col5 {1.000 0.000 1.000 srgb} bind def +/col6 {1.000 1.000 0.000 srgb} bind def +/col7 {1.000 1.000 1.000 srgb} bind def +/col8 {0.000 0.000 0.560 srgb} bind def +/col9 {0.000 0.000 0.690 srgb} bind def +/col10 {0.000 0.000 0.820 srgb} bind def +/col11 {0.530 0.810 1.000 srgb} bind def +/col12 {0.000 0.560 0.000 srgb} bind def +/col13 {0.000 0.690 0.000 srgb} bind def +/col14 {0.000 0.820 0.000 srgb} bind def +/col15 {0.000 0.560 0.560 srgb} bind def +/col16 {0.000 0.690 0.690 srgb} bind def +/col17 {0.000 0.820 0.820 srgb} bind def +/col18 {0.560 0.000 0.000 srgb} bind def +/col19 {0.690 0.000 0.000 srgb} bind def +/col20 {0.820 0.000 0.000 srgb} bind def +/col21 {0.560 0.000 0.560 srgb} bind def +/col22 {0.690 0.000 0.690 srgb} bind def +/col23 {0.820 0.000 0.820 srgb} bind def +/col24 {0.500 0.190 0.000 srgb} bind def +/col25 {0.630 0.250 0.000 srgb} bind def +/col26 {0.750 0.380 0.000 srgb} bind def +/col27 {1.000 0.500 0.500 srgb} bind def +/col28 {1.000 0.630 0.630 srgb} bind def +/col29 {1.000 0.750 0.750 srgb} bind def +/col30 {1.000 0.880 0.880 srgb} bind def +/col31 {1.000 0.840 0.000 srgb} bind def + +end +save +newpath 0 110 moveto 0 0 lineto 120 0 lineto 120 110 lineto closepath clip newpath +-174.0 370.0 translate +1 -1 scale + +/cp {closepath} bind def +/ef {eofill} bind def +/gr {grestore} bind def +/gs {gsave} bind def +/sa {save} bind def +/rs {restore} bind def +/l {lineto} bind def +/m {moveto} bind def +/rm {rmoveto} bind def +/n {newpath} bind def +/s {stroke} bind def +/sh {show} bind def +/slc {setlinecap} bind def +/slj {setlinejoin} bind def +/slw {setlinewidth} bind def +/srgb {setrgbcolor} bind def +/rot {rotate} bind def +/sc {scale} bind def +/sd {setdash} bind def +/ff {findfont} bind def +/sf {setfont} bind def +/scf {scalefont} bind def +/sw {stringwidth} bind def +/tr {translate} bind def +/tnt {dup dup currentrgbcolor + 4 -2 roll dup 1 exch sub 3 -1 roll mul add + 4 -2 roll dup 1 exch sub 3 -1 roll mul add + 4 -2 roll dup 1 exch sub 3 -1 roll mul add srgb} + bind def +/shd {dup dup currentrgbcolor 4 -2 roll mul 4 -2 roll mul + 4 -2 roll mul srgb} bind def +/$F2psBegin {$F2psDict begin /$F2psEnteredState save def} def +/$F2psEnd {$F2psEnteredState restore end} def + +$F2psBegin +%%Page: 1 1 +10 setmiterlimit + 0.06000 0.06000 sc +/Helvetica-Bold ff 180.00 scf sf +3750 5100 m +gs 1 -1 sc (B) dup sw pop 2 div neg 0 rm col0 sh gr +% Polyline +7.500 slw +n 4871 5100 m 4879 5100 l gs col0 s gr +% Polyline +n 2925 5400 m 4575 5400 l 4575 6150 l 2925 6150 l + cp gs col0 s gr +% Polyline +n 4575 4650 m + 4875 4350 l gs col0 s gr +% Polyline +n 2925 4650 m 4575 4650 l 4575 5400 l 2925 5400 l + cp gs col0 s gr +% Polyline +n 2925 4650 m 3225 4350 l 4875 4350 l 4875 5100 l + 4575 5400 l gs col0 s gr +/Helvetica-Bold ff 180.00 scf sf +3750 5850 m +gs 1 -1 sc (A) dup sw pop 2 div neg 0 rm col0 sh gr 
+% Polyline +n 4875 5100 m 4875 5850 l + 4575 6150 l gs col0 s gr +$F2psEnd +rs diff --git a/en_US.ISO8859-1/articles/vm-design/fig3.eps b/en_US.ISO8859-1/articles/vm-design/fig3.eps new file mode 100644 index 0000000000..0e3138b2ed --- /dev/null +++ b/en_US.ISO8859-1/articles/vm-design/fig3.eps @@ -0,0 +1,133 @@ +%!PS-Adobe-2.0 EPSF-2.0 +%%Title: fig3.eps +%%Creator: fig2dev Version 3.2.3 Patchlevel +%%CreationDate: Sun Oct 8 19:53:51 2000 +%%For: nik@canyon.nothing-going-on.org (Nik Clayton) +%%BoundingBox: 0 0 120 155 +%%Magnification: 1.0000 +%%EndComments +/$F2psDict 200 dict def +$F2psDict begin +$F2psDict /mtrx matrix put +/col-1 {0 setgray} bind def +/col0 {0.000 0.000 0.000 srgb} bind def +/col1 {0.000 0.000 1.000 srgb} bind def +/col2 {0.000 1.000 0.000 srgb} bind def +/col3 {0.000 1.000 1.000 srgb} bind def +/col4 {1.000 0.000 0.000 srgb} bind def +/col5 {1.000 0.000 1.000 srgb} bind def +/col6 {1.000 1.000 0.000 srgb} bind def +/col7 {1.000 1.000 1.000 srgb} bind def +/col8 {0.000 0.000 0.560 srgb} bind def +/col9 {0.000 0.000 0.690 srgb} bind def +/col10 {0.000 0.000 0.820 srgb} bind def +/col11 {0.530 0.810 1.000 srgb} bind def +/col12 {0.000 0.560 0.000 srgb} bind def +/col13 {0.000 0.690 0.000 srgb} bind def +/col14 {0.000 0.820 0.000 srgb} bind def +/col15 {0.000 0.560 0.560 srgb} bind def +/col16 {0.000 0.690 0.690 srgb} bind def +/col17 {0.000 0.820 0.820 srgb} bind def +/col18 {0.560 0.000 0.000 srgb} bind def +/col19 {0.690 0.000 0.000 srgb} bind def +/col20 {0.820 0.000 0.000 srgb} bind def +/col21 {0.560 0.000 0.560 srgb} bind def +/col22 {0.690 0.000 0.690 srgb} bind def +/col23 {0.820 0.000 0.820 srgb} bind def +/col24 {0.500 0.190 0.000 srgb} bind def +/col25 {0.630 0.250 0.000 srgb} bind def +/col26 {0.750 0.380 0.000 srgb} bind def +/col27 {1.000 0.500 0.500 srgb} bind def +/col28 {1.000 0.630 0.630 srgb} bind def +/col29 {1.000 0.750 0.750 srgb} bind def +/col30 {1.000 0.880 0.880 srgb} bind def +/col31 {1.000 0.840 0.000 srgb} bind def + +end +save +newpath 0 155 moveto 0 0 lineto 120 0 lineto 120 155 lineto closepath clip newpath +-174.0 370.0 translate +1 -1 scale + +/cp {closepath} bind def +/ef {eofill} bind def +/gr {grestore} bind def +/gs {gsave} bind def +/sa {save} bind def +/rs {restore} bind def +/l {lineto} bind def +/m {moveto} bind def +/rm {rmoveto} bind def +/n {newpath} bind def +/s {stroke} bind def +/sh {show} bind def +/slc {setlinecap} bind def +/slj {setlinejoin} bind def +/slw {setlinewidth} bind def +/srgb {setrgbcolor} bind def +/rot {rotate} bind def +/sc {scale} bind def +/sd {setdash} bind def +/ff {findfont} bind def +/sf {setfont} bind def +/scf {scalefont} bind def +/sw {stringwidth} bind def +/tr {translate} bind def +/tnt {dup dup currentrgbcolor + 4 -2 roll dup 1 exch sub 3 -1 roll mul add + 4 -2 roll dup 1 exch sub 3 -1 roll mul add + 4 -2 roll dup 1 exch sub 3 -1 roll mul add srgb} + bind def +/shd {dup dup currentrgbcolor 4 -2 roll mul 4 -2 roll mul + 4 -2 roll mul srgb} bind def +/$F2psBegin {$F2psDict begin /$F2psEnteredState save def} def +/$F2psEnd {$F2psEnteredState restore end} def + +$F2psBegin +%%Page: 1 1 +10 setmiterlimit + 0.06000 0.06000 sc +/Helvetica-Bold ff 180.00 scf sf +4125 4350 m +gs 1 -1 sc (C2) dup sw pop 2 div neg 0 rm col0 sh gr +% Polyline +7.500 slw +n 4871 5100 m 4879 5100 l gs col0 s gr +% Polyline +n 2925 5400 m 4575 5400 l 4575 6150 l 2925 6150 l + cp gs col0 s gr +% Polyline +n 4575 4650 m + 4875 4350 l gs col0 s gr +% Polyline +n 2925 4650 m 4575 4650 l 4575 5400 l 2925 5400 l + cp gs col0 s 
gr +% Polyline +n 4875 3600 m 4875 5100 l + 4575 5400 l gs col0 s gr +% Polyline +n 2925 4650 m 2925 3900 l 3225 3600 l + 4875 3600 l gs col0 s gr +% Polyline +n 2925 3900 m 4425 3900 l 4575 3900 l + 4875 3600 l gs col0 s gr +% Polyline +n 4575 4650 m + 4575 3900 l gs col0 s gr +% Polyline +n 3750 4650 m 3750 3900 l + 4050 3600 l gs col0 s gr +/Helvetica-Bold ff 180.00 scf sf +3750 5850 m +gs 1 -1 sc (A) dup sw pop 2 div neg 0 rm col0 sh gr +/Helvetica-Bold ff 180.00 scf sf +3750 5100 m +gs 1 -1 sc (B) dup sw pop 2 div neg 0 rm col0 sh gr +/Helvetica-Bold ff 180.00 scf sf +3375 4350 m +gs 1 -1 sc (C1) dup sw pop 2 div neg 0 rm col0 sh gr +% Polyline +n 4875 5100 m 4875 5850 l + 4575 6150 l gs col0 s gr +$F2psEnd +rs diff --git a/en_US.ISO8859-1/articles/vm-design/fig4.eps b/en_US.ISO8859-1/articles/vm-design/fig4.eps new file mode 100644 index 0000000000..24fc1b5add --- /dev/null +++ b/en_US.ISO8859-1/articles/vm-design/fig4.eps @@ -0,0 +1,133 @@ +%!PS-Adobe-2.0 EPSF-2.0 +%%Title: fig4.eps +%%Creator: fig2dev Version 3.2.3 Patchlevel +%%CreationDate: Sun Oct 8 19:55:53 2000 +%%For: nik@canyon.nothing-going-on.org (Nik Clayton) +%%BoundingBox: 0 0 120 155 +%%Magnification: 1.0000 +%%EndComments +/$F2psDict 200 dict def +$F2psDict begin +$F2psDict /mtrx matrix put +/col-1 {0 setgray} bind def +/col0 {0.000 0.000 0.000 srgb} bind def +/col1 {0.000 0.000 1.000 srgb} bind def +/col2 {0.000 1.000 0.000 srgb} bind def +/col3 {0.000 1.000 1.000 srgb} bind def +/col4 {1.000 0.000 0.000 srgb} bind def +/col5 {1.000 0.000 1.000 srgb} bind def +/col6 {1.000 1.000 0.000 srgb} bind def +/col7 {1.000 1.000 1.000 srgb} bind def +/col8 {0.000 0.000 0.560 srgb} bind def +/col9 {0.000 0.000 0.690 srgb} bind def +/col10 {0.000 0.000 0.820 srgb} bind def +/col11 {0.530 0.810 1.000 srgb} bind def +/col12 {0.000 0.560 0.000 srgb} bind def +/col13 {0.000 0.690 0.000 srgb} bind def +/col14 {0.000 0.820 0.000 srgb} bind def +/col15 {0.000 0.560 0.560 srgb} bind def +/col16 {0.000 0.690 0.690 srgb} bind def +/col17 {0.000 0.820 0.820 srgb} bind def +/col18 {0.560 0.000 0.000 srgb} bind def +/col19 {0.690 0.000 0.000 srgb} bind def +/col20 {0.820 0.000 0.000 srgb} bind def +/col21 {0.560 0.000 0.560 srgb} bind def +/col22 {0.690 0.000 0.690 srgb} bind def +/col23 {0.820 0.000 0.820 srgb} bind def +/col24 {0.500 0.190 0.000 srgb} bind def +/col25 {0.630 0.250 0.000 srgb} bind def +/col26 {0.750 0.380 0.000 srgb} bind def +/col27 {1.000 0.500 0.500 srgb} bind def +/col28 {1.000 0.630 0.630 srgb} bind def +/col29 {1.000 0.750 0.750 srgb} bind def +/col30 {1.000 0.880 0.880 srgb} bind def +/col31 {1.000 0.840 0.000 srgb} bind def + +end +save +newpath 0 155 moveto 0 0 lineto 120 0 lineto 120 155 lineto closepath clip newpath +-174.0 370.0 translate +1 -1 scale + +/cp {closepath} bind def +/ef {eofill} bind def +/gr {grestore} bind def +/gs {gsave} bind def +/sa {save} bind def +/rs {restore} bind def +/l {lineto} bind def +/m {moveto} bind def +/rm {rmoveto} bind def +/n {newpath} bind def +/s {stroke} bind def +/sh {show} bind def +/slc {setlinecap} bind def +/slj {setlinejoin} bind def +/slw {setlinewidth} bind def +/srgb {setrgbcolor} bind def +/rot {rotate} bind def +/sc {scale} bind def +/sd {setdash} bind def +/ff {findfont} bind def +/sf {setfont} bind def +/scf {scalefont} bind def +/sw {stringwidth} bind def +/tr {translate} bind def +/tnt {dup dup currentrgbcolor + 4 -2 roll dup 1 exch sub 3 -1 roll mul add + 4 -2 roll dup 1 exch sub 3 -1 roll mul add + 4 -2 roll dup 1 exch sub 3 -1 roll mul add srgb} + bind 
def +/shd {dup dup currentrgbcolor 4 -2 roll mul 4 -2 roll mul + 4 -2 roll mul srgb} bind def +/$F2psBegin {$F2psDict begin /$F2psEnteredState save def} def +/$F2psEnd {$F2psEnteredState restore end} def + +$F2psBegin +%%Page: 1 1 +10 setmiterlimit + 0.06000 0.06000 sc +/Helvetica-Bold ff 180.00 scf sf +3375 4350 m +gs 1 -1 sc (C1) dup sw pop 2 div neg 0 rm col0 sh gr +% Polyline +7.500 slw +n 4871 5100 m 4879 5100 l gs col0 s gr +% Polyline +n 2925 5400 m 4575 5400 l 4575 6150 l 2925 6150 l + cp gs col0 s gr +% Polyline +n 4575 4650 m + 4875 4350 l gs col0 s gr +% Polyline +n 2925 4650 m 4575 4650 l 4575 5400 l 2925 5400 l + cp gs col0 s gr +% Polyline +n 4875 4350 m 4875 5100 l + 4575 5400 l gs col0 s gr +% Polyline +n 2925 4650 m 2925 3900 l 3225 3600 l + 4050 3600 l gs col0 s gr +% Polyline +n 3750 4650 m 3750 3900 l + 4050 3600 l gs col0 s gr +% Polyline +n 2925 3900 m + 3750 3900 l gs col0 s gr +% Polyline +n 3750 4650 m 4050 4350 l + 4875 4350 l gs col0 s gr +% Polyline +n 4050 4350 m + 4050 3600 l gs col0 s gr +/Helvetica-Bold ff 180.00 scf sf +3750 5850 m +gs 1 -1 sc (A) dup sw pop 2 div neg 0 rm col0 sh gr +/Helvetica-Bold ff 180.00 scf sf +3750 5100 m +gs 1 -1 sc (B) dup sw pop 2 div neg 0 rm col0 sh gr +% Polyline +n 4875 5100 m 4875 5850 l + 4575 6150 l gs col0 s gr +$F2psEnd +rs diff --git a/en_US.ISO_8859-1/articles/vm-design/Makefile b/en_US.ISO_8859-1/articles/vm-design/Makefile new file mode 100644 index 0000000000..6758b4073a --- /dev/null +++ b/en_US.ISO_8859-1/articles/vm-design/Makefile @@ -0,0 +1,16 @@ +# $FreeBSD: doc/en_US.ISO_8859-1/articles/mh/Makefile,v 1.8 1999/09/06 06:52:37 peter Exp $ + +DOC?= article + +FORMATS?= html + +IMAGES= fig1.eps fig2.eps fig3.eps fig4.eps + +INSTALL_COMPRESSED?=gz +INSTALL_ONLY_COMPRESSED?= + +SRCS= article.sgml + +DOC_PREFIX?= ${.CURDIR}/../../.. + +.include "${DOC_PREFIX}/share/mk/doc.project.mk" diff --git a/en_US.ISO_8859-1/articles/vm-design/article.sgml b/en_US.ISO_8859-1/articles/vm-design/article.sgml new file mode 100644 index 0000000000..7479a04cf8 --- /dev/null +++ b/en_US.ISO_8859-1/articles/vm-design/article.sgml @@ -0,0 +1,838 @@ + + + + +%man; +]> + +
+ + Design elements of the FreeBSD VM system + + + + Matthew + + Dillon + + +
+ dillon@apollo.backplane.com +
+
+
+
+ + + The title is really just a fancy way of saying that I am going to + attempt to describe the whole VM enchilada, hopefully in a way that + everyone can follow. For the last year I have concentrated on a number + of major kernel subsystems within FreeBSD, with the VM and Swap + subsystems being the most interesting and NFS being ‘a necessary + chore’. I rewrote only small portions of the code. In the VM + arena the only major rewrite I have done is to the swap subsystem. + Most of my work was cleanup and maintenance, with only moderate code + rewriting and no major algorithmic adjustments within the VM + subsystem. The bulk of the VM subsystem's theoretical base remains + unchanged and a lot of the credit for the modernization effort in the + last few years belongs to John Dyson and David Greenman. Not being a + historian like Kirk I will not attempt to tag all the various features + with peoples names, since I will invariably get it wrong. + + + + This article was originally published in the January 2000 issue of + DaemonNews. This + version of the article may include updates from Matt and other authors + to reflect changes in FreeBSD's VM implementation. + +
+ + + Introduction + + Before moving along to the actual design let's spend a little time + on the necessity of maintaining and modernizing any long-living + codebase. In the programming world, algorithms tend to be more + important than code and it is precisely due to BSD's academic roots that + a great deal of attention was paid to algorithm design from the + beginning. More attention paid to the design generally leads to a clean + and flexible codebase that can be fairly easily modified, extended, or + replaced over time. While BSD is considered an ‘old’ + operating system by some people, those of us who work on it tend to view + it more as a ‘mature’ codebase which has various components + modified, extended, or replaced with modern code. It has evolved, and + FreeBSD is at the bleeding edge no matter how old some of the code might + be. This is an important distinction to make and one that is + unfortunately lost to many people. The biggest error a programmer can + make is to not learn from history, and this is precisely the error that + many other modern operating systems have made. NT is the best example + of this, and the consequences have been dire. Linux also makes this + mistake to some degree—enough that we BSD folk can make small + jokes about it every once in a while, anyway. Linux's problem is simply + one of a lack of experience and history to compare ideas against, a + problem that is easily and rapidly being addressed by the Linux + community in the same way it has been addressed in the BSD + community—by continuous code development. The NT folk, on the + other hand, repeatedly make the same mistakes solved by UNIX decades ago + and then spend years fixing them. Over and over again. They have a + severe case of ‘not designed here’ and ‘we are always + right because our marketing department says so’. I have little + tolerance for anyone who cannot learn from history. + + Much of the apparent complexity of the FreeBSD design, especially in + the VM/Swap subsystem, is a direct result of having to solve serious + performance issues that occur under various conditions. These issues + are not due to bad algorithmic design but instead rise from + environmental factors. In any direct comparison between platforms, + these issues become most apparent when system resources begin to get + stressed. As I describe FreeBSD's VM/Swap subsystem the reader should + always keep two points in mind. First, the most important aspect of + performance design is what is known as “Optimizing the Critical + Path”. It is often the case that performance optimizations add a + little bloat to the code in order to make the critical path perform + better. Second, a solid, generalized design outperforms a + heavily-optimized design over the long run. While a generalized design + may end up being slower than an heavily-optimized design when they are + first implemented, the generalized design tends to be easier to adapt to + changing conditions and the heavily-optimized design winds up having to + be thrown away. Any codebase that will survive and be maintainable for + years must therefore be designed properly from the beginning even if it + costs some performance. Twenty years ago people were still arguing that + programming in assembly was better than programming in a high-level + language because it produced code that was ten times as fast. Today, + the fallibility of that argument is obvious—as are the parallels + to algorithmic design and code generalization. 
+ + + + VM Objects + + The best way to begin describing the FreeBSD VM system is to look at + it from the perspective of a user-level process. Each user process sees + a single, private, contiguous VM address space containing several types + of memory objects. These objects have various characteristics. Program + code and program data are effectively a single memory-mapped file (the + binary file being run), but program code is read-only while program data + is copy-on-write. Program BSS is just memory allocated and filled with + zeros on demand, called demand zero page fill. Arbitrary files can be + memory-mapped into the address space as well, which is how the shared + library mechanism works. Such mappings can require modifications to + remain private to the process making them. The fork system call adds an + entirely new dimension to the VM management problem on top of the + complexity already given. + + A program binary data page (which is a basic copy-on-write page) + illustrates the complexity. A program binary contains a preinitialized + data section which is initially mapped directly from the program file. + When a program is loaded into a process's VM space, this area is + initially memory-mapped and backed by the program binary itself, + allowing the VM system to free/reuse the page and later load it back in + from the binary. The moment a process modifies this data, however, the + VM system must make a private copy of the page for that process. Since + the private copy has been modified, the VM system may no longer free it, + because there is no longer any way to restore it later on. + + You will notice immediately that what was originally a simple file + mapping has become much more complex. Data may be modified on a + page-by-page basis whereas the file mapping encompasses many pages at + once. The complexity further increases when a process forks. When a + process forks, the result is two processes—each with its own + private address space, including any modifications made by the original + process prior to the call to fork(). It would be + silly for the VM system to make a complete copy of the data at the time + of the fork() because it is quite possible that at + least one of the two processes will only need to read from that page + from then on, allowing the original page to continue to be used. What + was a private page is made copy-on-write again, since each process + (parent and child) expects its own personal post-fork modifications to + remain private to itself and not affect the other. + + FreeBSD manages all of this with a layered VM Object model. The + original binary program file winds up being the lowest VM Object layer. + A copy-on-write layer is pushed on top of that to hold those pages which + had to be copied from the original file. If the program modifies a data + page belonging to the original file the VM system takes a fault and + makes a copy of the page in the higher layer. When a process forks, + additional VM Object layers are pushed on. This might make a little + more sense with a fairly basic example. A fork() + is a common operation for any *BSD system, so this example will consider + a program that starts up and forks. When the process starts, the VM + system creates an object layer, let's call this A: + + + + + + + + +---------------+ +| A | ++---------------+ + + + + A picture + + + + A represents the file—pages may be paged in and out of the + file's physical media as necessary.
Paging in from the disk is + reasonable for a program, but we really don't want to page back out and + overwrite the executable. The VM system therefore creates a second + layer, B, that will be physically backed by swap space: + + + + + + + + +---------------+ +| B | ++---------------+ +| A | ++---------------+ + + + + On the first write to a page after this, a new page is created in B, + and its contents are initialized from A. All pages in B can be paged in + or out to a swap device. When the program forks, the VM system creates + two new object layers—C1 for the parent, and C2 for the + child—that rest on top of B: + + + + + + + + +-------+-------+ +| C1 | C2 | ++-------+-------+ +| B | ++---------------+ +| A | ++---------------+ + + + + In this case, let's say a page in B is modified by the original + parent process. The process will take a copy-on-write fault and + duplicate the page in C1, leaving the original page in B untouched. + Now, let's say the same page in B is modified by the child process. The + process will take a copy-on-write fault and duplicate the page in C2. + The original page in B is now completely hidden since both C1 and C2 + have a copy (and B could theoretically be destroyed if it does not + represent a 'real' file). However, this sort of optimization is not + trivial to make because it is so fine-grained. FreeBSD does not make + this optimization. Now, suppose (as is often the case) that the child + process does an exec(). Its current address space + is usually replaced by a new address space representing a new file. In + this case, the C2 layer is destroyed: + + + + + + + + +-------+ +| C1 | ++-------+-------+ +| B | ++---------------+ +| A | ++---------------+ + + + + In this case, the number of children of B drops to one, and all + accesses to B now go through C1. This means that B and C1 can be + collapsed together. Any pages in B that also exist in C1 are deleted + from B during the collapse. Thus, even though the optimization in the + previous step could not be made, we can recover the dead pages when + either of the processes exits or calls exec(). + + This model creates a number of potential problems. The first is that + you can wind up with a relatively deep stack of layered VM Objects which + can cost scanning time and memory when you take a fault. Deep + layering can occur when processes fork and then fork again (either + parent or child). The second problem is that you can wind up with dead, + inaccessible pages deep in the stack of VM Objects. In our last example + if both the parent and child processes modify the same page, they both + get their own private copies of the page and the original page in B is + no longer accessible by anyone. That page in B can be freed. + + FreeBSD solves the deep layering problem with a special optimization + called the “All Shadowed Case”. This case occurs if either + C1 or C2 take sufficient COW faults to completely shadow all pages in B. + Let's say that C1 achieves this. C1 can now bypass B entirely, so rather + than have C1->B->A and C2->B->A we now have C1->A and C2->B->A. But + look what also happened—now B has only one reference (C2), so we + can collapse B and C2 together. The end result is that B is deleted + entirely and we have C1->A and C2->A. It is often the case that B will + contain a large number of pages and neither C1 nor C2 will be able to + completely overshadow it.
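To make the cost of deep layering concrete, here is a minimal sketch of the chain walk a page fault must perform. The structures are hypothetical, simplified stand-ins, not the real FreeBSD vm_object definitions:

#include <stddef.h>

struct page;                            /* a resident physical page */

struct vmobj {
        struct vmobj *backing;          /* next object down: C1 -> B -> A */
        struct page *(*lookup)(struct vmobj *, size_t pindex);
        int refs;                       /* e.g. C1 and C2 both reference B */
};

/*
 * Walk from the top object toward the backing file, returning the
 * first object that has the page resident.  The deeper the stack,
 * the longer this walk -- which is what the collapse and
 * "all shadowed" optimizations are there to limit.
 */
static struct page *
fault_lookup(struct vmobj *top, size_t pindex, struct vmobj **found_in)
{
        struct vmobj *obj;
        struct page *pg;

        for (obj = top; obj != NULL; obj = obj->backing) {
                pg = obj->lookup(obj, pindex);
                if (pg != NULL) {
                        *found_in = obj;
                        return (pg);
                }
        }
        return (NULL);                  /* not resident: zero-fill or I/O */
}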
If we fork again and create a set of D + layers, however, it is much more likely that one of the D layers will + eventually be able to completely overshadow the much smaller dataset + represented by C1 or C2. The same optimization will work at any point in + the graph and the grand result of this is that even on a heavily forked + machine VM Object stacks tend not to get much deeper than 4. This is + true of both the parent and the children and true whether the parent is + doing the forking or whether the children cascade forks. + + The dead page problem still exists in the case where C1 or C2 do not + completely overshadow B. Due to our other optimizations this case does + not represent much of a problem and we simply allow the pages to be + dead. If the system runs low on memory it will swap them out, eating a + little swap, but that's it. + + The advantage to the VM Object model is that + fork() is extremely fast, since no real data + copying need take place. The disadvantage is that you can build a + relatively complex VM Object layering that slows page fault handling + down a little, and you spend memory managing the VM Object structures. + The optimizations FreeBSD makes prove to reduce the problems enough + that they can be ignored, leaving no real disadvantage. + + + + SWAP Layers + + Private data pages are initially either copy-on-write or zero-fill + pages. When a change, and therefore a copy, is made, the original + backing object (usually a file) can no longer be used to save a copy of + the page when the VM system needs to reuse it for other purposes. This + is where SWAP comes in. SWAP is allocated to create backing store for + memory that does not otherwise have it. FreeBSD allocates the swap + management structure for a VM Object only when it is actually needed. + However, the swap management structure has had problems + historically. + + Under FreeBSD 3.x the swap management structure preallocates an + array that encompasses the entire object requiring swap backing + store—even if only a few pages of that object are swap-backed. + This creates a kernel memory fragmentation problem when large objects + are mapped, or processes with large run sizes (RSS) fork. Also, in order + to keep track of swap space, a ‘list of holes’ is kept in + kernel memory, and this tends to get severely fragmented as well. Since + the ‘list of holes’ is a linear list, the swap allocation and freeing + performance is a non-optimal O(n)-per-page. It also requires kernel + memory allocations to take place during the swap freeing process, and + that creates low memory deadlock problems. The problem is further + exacerbated by holes created due to the interleaving algorithm. Also, + the swap block map can become fragmented fairly easily resulting in + non-contiguous allocations. Kernel memory must also be allocated on the + fly for additional swap management structures when a swapout occurs. It + is evident that there was plenty of room for improvement. + + For FreeBSD 4.x, I completely rewrote the swap subsystem. With this + rewrite, swap management structures are allocated through a hash table + rather than a linear array, giving them a fixed allocation size and much + finer granularity. Rather than using a linearly linked list to keep + track of swap space reservations, it now uses a bitmap of swap blocks + arranged in a radix tree structure with free-space hinting in the radix + node structures. This effectively makes swap allocation and freeing an + O(1) operation.
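As a rough illustration of the idea, here is a one-level sketch of bitmap swap accounting with a free-space hint. It is an assumption-laden toy, not the kernel's allocator: the real FreeBSD 4.x code stacks bitmaps like this one in a radix tree, with a hint in each node, which is what bounds the scan and makes allocation and freeing effectively O(1):

#include <stdint.h>

#define NBLOCKS 1024                    /* swap blocks tracked; toy-sized */

/*
 * One bit per swap block, 1 = free.  Everything starts out
 * "allocated" here; a real system would mark the configured swap
 * free at swap-attach time.
 */
static uint32_t swapmap[NBLOCKS / 32];
static int      hint;                   /* lowest index that may be free */

static void
swp_free(int blk)
{
        swapmap[blk / 32] |= 1U << (blk % 32);
        if (blk < hint)
                hint = blk;             /* cheap free-space hinting */
}

static int
swp_alloc(void)
{
        int blk;

        /* In the real radix tree, per-node hints prune this scan. */
        for (blk = hint; blk < NBLOCKS; blk++) {
                if (swapmap[blk / 32] & (1U << (blk % 32))) {
                        swapmap[blk / 32] &= ~(1U << (blk % 32));
                        hint = blk + 1;
                        return (blk);
                }
        }
        return (-1);                    /* out of swap */
}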
The entire radix tree bitmap is also preallocated in + order to avoid having to allocate kernel memory during critical low + memory swapping operations. After all, the system tends to swap when it + is low on memory so we should avoid allocating kernel memory at such + times in order to avoid potential deadlocks. Finally, to reduce + fragmentation the radix tree is capable of allocating large contiguous + chunks at once, skipping over smaller fragmented chunks. I did not take + the final step of having an ‘allocating hint pointer’ that would trundle + through a portion of swap as allocations were made in order to further + guarantee contiguous allocations or at least locality of reference, but + I ensured that such an addition could be made. + + + + When to free a page + + Since the VM system uses all available memory for disk caching, + there are usually very few truly-free pages. The VM system depends on + being able to properly choose pages which are not in use to reuse for + new allocations. Selecting the optimal pages to free is possibly the + single most important function any VM system can perform because if it + makes a poor selection, the VM system may be forced to unnecessarily + retrieve pages from disk, seriously degrading system performance. + + How much overhead are we willing to suffer in the critical path to + avoid freeing the wrong page? Each wrong choice we make will cost us + hundreds of thousands of CPU cycles and a noticeable stall of the + affected processes, so we are willing to endure a significant amount of + overhead in order to be sure that the right page is chosen. This is why + FreeBSD tends to outperform other systems when memory resources become + stressed. + + The free page determination algorithm is built upon a history of the + use of memory pages. To acquire this history, the system takes advantage + of a page-used bit feature that most hardware page tables have. + + In either case (hardware or emulated, see the sidebar below), the + page-used bit is cleared and at some later point + the VM system comes across the page again and sees that the page-used + bit has been set. This indicates that the page is still being actively + used. If the bit is still clear it is an indication that the page is not + being actively used. By testing this bit periodically, a use history (in + the form of a counter) for the physical page is developed. When the VM + system later needs to free up some pages, checking this history becomes + the cornerstone of determining the best candidate page to reuse. + + + What if the hardware has no page-used bit? + + For those platforms that do not have this feature, the system + actually emulates a page-used bit. It unmaps or protects a page, + forcing a page fault if the page is accessed again. When the page + fault is taken, the system simply marks the page as having been used + and unprotects the page so that it may be used. While taking such page + faults just to determine if a page is being used appears to be an + expensive proposition, it is much less expensive than reusing the page + for some other purpose only to find that a process needs it back and + then have to go to disk. + + + FreeBSD makes use of several page queues to further refine the + selection of pages to reuse as well as to determine when dirty pages + must be flushed to their backing store. Since page tables are dynamic + entities under FreeBSD, it costs virtually nothing to unmap a page from + the address space of any processes using it.
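A sketch of how such a use history can be maintained, with hypothetical names (the real work is done by the FreeBSD pagedaemon together with the pmap layer, which reads and clears the hardware bit):

struct vpage {
        int act_count;                  /* the per-page use history */
};

/* Hypothetical helper: read and clear the hardware page-used bit. */
extern int page_was_referenced(struct vpage *pg);

#define ACT_MAX         64
#define ACT_ADVANCE     3

static void
scan_page(struct vpage *pg)
{
        if (page_was_referenced(pg)) {
                /* Used since the last scan: strengthen the history. */
                if (pg->act_count < ACT_MAX)
                        pg->act_count += ACT_ADVANCE;
        } else {
                /*
                 * Idle since the last scan: weaken it.  A page whose
                 * counter decays to zero is a good reuse candidate.
                 */
                if (pg->act_count > 0)
                        pg->act_count--;
        }
}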
When a page candidate has + been chosen based on the page-use counter, this unmapping is precisely + what is done. The system must make a distinction between clean pages which can + theoretically be freed up at any time, and dirty pages which must first + be written to their backing store before being reusable. When a page + candidate has been found it is moved to the inactive queue if it is + dirty, or the cache queue if it is clean. A separate algorithm based on + the dirty-to-clean page ratio determines when dirty pages in the + inactive queue must be flushed to disk. Once this is accomplished, the + flushed pages are moved from the inactive queue to the cache queue. At + this point, pages in the cache queue can still be reactivated by a VM + fault at relatively low cost. However, pages in the cache queue are + considered to be ‘immediately freeable’ and will be reused + in an LRU (least-recently used) fashion when the system needs to + allocate new memory. + + It is important to note that the FreeBSD VM system attempts to + separate clean and dirty pages for the express reason of avoiding + unnecessary flushes of dirty pages (which eat I/O bandwidth), and it + does not move pages between the various page queues gratuitously when the + memory subsystem is not being stressed. This is why you will see some + systems with very low cache queue counts and high active queue counts + when doing a systat -vm command. As the VM system + becomes more stressed, it makes a greater effort to maintain the various + page queues at the levels determined to be the most effective. An urban + myth has circulated for years that Linux did a better job avoiding + swapouts than FreeBSD, but this in fact is not true. What was actually + occurring was that FreeBSD was proactively paging out unused pages in + order to make room for more disk cache while Linux was keeping unused + pages in core and leaving less memory available for cache and process + pages. I don't know whether this is still true today. + + + + Pre-Faulting and Zeroing Optimizations + + Taking a VM fault is not expensive if the underlying page is already + in core and can simply be mapped into the process, but it can become + expensive if you take a whole lot of them on a regular basis. A good + example of this is running a program such as &man.ls.1; or &man.ps.1; + over and over again. If the program binary is mapped into memory but + not mapped into the page table, then all the pages that will be accessed + by the program will have to be faulted in every time the program is run. + This is unnecessary when the pages in question are already in the VM + Cache, so FreeBSD will attempt to pre-populate a process's page tables + with those pages that are already in the VM Cache. One thing that + FreeBSD does not yet do is pre-copy-on-write certain pages on exec. For + example, if you run the &man.ls.1; program while running vmstat + 1 you will notice that it always takes a certain number of + page faults, even when you run it over and over again. These are + zero-fill faults, not program code faults (which were pre-faulted in + already). Pre-copying pages on exec or fork is an area that could use + more study. + + A large percentage of page faults that occur are zero-fill faults. + You can usually see this by observing the vmstat -s + output. These occur when a process accesses pages in its BSS area. The + BSS area is expected to be initially zero but the VM system does not + bother to allocate any memory at all until the process actually accesses + it.
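A small userland demonstration of this demand-zero behavior (a sketch; the array size is illustrative): the array below consumes address space immediately, but physical pages are allocated and zeroed only as each page is first touched, and each first touch shows up in the zero-fill counters of vmstat -s.

#include <stdio.h>
#include <unistd.h>

#define ARENA   (16 * 1024 * 1024)

static char bss_arena[ARENA];   /* BSS: address space, but no pages yet */

int
main(void)
{
        long pagesize = sysconf(_SC_PAGESIZE);
        long i;

        /* Each first touch of a new page takes one zero-fill fault. */
        for (i = 0; i < ARENA; i += pagesize)
                bss_arena[i] = 1;

        printf("touched %ld pages\n", ARENA / pagesize);
        return (0);
}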
When a fault occurs the VM system must not only allocate a new page, + it must zero it as well. To optimize the zeroing operation the VM system + has the ability to pre-zero pages and mark them as such, and to request + pre-zeroed pages when zero-fill faults occur. The pre-zeroing occurs + whenever the CPU is idle but the number of pages the system pre-zeros is + limited in order to avoid blowing away the memory caches. This is an + excellent example of adding complexity to the VM system in order to + optimize the critical path. + + + + Page Table Optimizations + + The page table optimizations make up the most contentious part of + the FreeBSD VM design and they have shown some strain with the advent of + serious use of mmap(). I think this is actually a + feature of most BSDs though I am not sure when it was first introduced. + There are two major optimizations. The first is that hardware page + tables do not contain persistent state but instead can be thrown away at + any time with only a minor amount of management overhead. The second is + that every active page table entry in the system has a governing + pv_entry structure which is tied into the + vm_page structure. FreeBSD can simply iterate + through those mappings that are known to exist while Linux must check + all page tables that might contain a specific + mapping to see if it does, which can incur O(n^2) overhead in certain + situations. It is because of this that FreeBSD tends to make better + choices on which pages to reuse or swap when memory is stressed, giving + it better performance under load. However, FreeBSD requires kernel + tuning to accommodate large-shared-address-space situations such as + those that can occur in a news system because it may run out of + pv_entry structures. + + Both Linux and FreeBSD need work in this area. FreeBSD is trying to + maximize the advantage of a potentially sparse active-mapping model (not + all processes need to map all pages of a shared library, for example), + whereas Linux is trying to simplify its algorithms. FreeBSD generally + has the performance advantage here at the cost of wasting a little extra + memory, but FreeBSD breaks down in the case where a large file is + massively shared across hundreds of processes. Linux, on the other hand, + breaks down in the case where many processes are sparsely mapping the + same shared library and also runs non-optimally when trying to determine + whether a page can be reused or not. + + + + Page Coloring + + We'll end with the page coloring optimizations. Page coloring is a + performance optimization designed to ensure that accesses to contiguous + pages in virtual memory make the best use of the processor cache. In + ancient times (i.e. 10+ years ago) processor caches tended to map + virtual memory rather than physical memory. This led to a huge number of + problems including having to clear the cache on every context switch in + some cases, and problems with data aliasing in the cache. Modern + processor caches map physical memory precisely to solve those problems. + This means that two side-by-side pages in a process's address space may + not correspond to two side-by-side pages in the cache. In fact, if you + aren't careful, side-by-side pages in virtual memory could wind up using + the same page in the processor cache—leading to cacheable data + being thrown away prematurely and reducing CPU performance. This is true + even with multi-way set-associative caches (though the effect is + mitigated somewhat).
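The arithmetic behind this aliasing is simple enough to sketch. The cache size here is illustrative (a direct mapped cache holding four pages, matching the example that follows), not a claim about any particular CPU:

#define CACHE_PAGES     4       /* illustrative: cache holds 4 pages */

/*
 * Two physical pages collide in a direct mapped cache exactly when
 * they share a color.
 */
static int
page_color(unsigned long phys_page_index)
{
        return ((int)(phys_page_index % CACHE_PAGES));
}

/*
 * page_color(16) == 0 and page_color(20) == 0: a collision.
 * page_color(21) == 1: safe to place next to page 16.
 */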
+ + FreeBSD's memory allocation code implements page coloring + optimizations, which means that the memory allocation code will attempt + to locate free pages that are contiguous from the point of view of the + cache. For example, if page 16 of physical memory is assigned to page 0 + of a process's virtual memory and the cache can hold 4 pages, the page + coloring code will not assign page 20 of physical memory to page 1 of a + process's virtual memory. It would, instead, assign page 21 of physical + memory. The page coloring code attempts to avoid assigning page 20 + because this maps over the same cache memory as page 16 and would result + in non-optimal caching. This code adds a significant amount of + complexity to the VM memory allocation subsystem as you can well + imagine, but the result is well worth the effort. Page coloring makes VM + memory as deterministic as physical memory with regard to cache + performance. + + + + Conclusion + + Virtual memory in modern operating systems must address a number of + different issues efficiently and for many different usage patterns. The + modular and algorithmic approach that BSD has historically taken allows + us to study and understand the current implementation as well as + relatively cleanly replace large sections of the code. There have been a + number of improvements to the FreeBSD VM system in the last several + years, and work is ongoing. + + + + Bonus QA session by Allen Briggs + <email>briggs@ninthwonder.com</email> + + + + + What is “the interleaving algorithm” that you + refer to in your listing of the ills of the FreeBSD 3.x swap + arrangements? + + + + FreeBSD uses a fixed swap interleave which defaults to 4. This + means that FreeBSD reserves space for four swap areas even if you + only have one, two, or three. Since swap is interleaved the linear + address space representing the ‘four swap areas’ will be + fragmented if you don't actually have four swap areas. For + example, if you have two swap areas, A and B, FreeBSD's address + space representation for that swap will be interleaved in + blocks of 16 pages: + + A B C D A B C D A B C D A B C D + + FreeBSD 3.x uses a ‘sequential list of free + regions’ approach to accounting for the free swap areas. + The idea is that large blocks of free linear space can be + represented with a single list node + (kern/subr_rlist.c). But due to the + fragmentation the sequential list winds up being insanely + fragmented. In the above example, completely unused swap will + have A and B shown as ‘free’ and C and D shown as + ‘all allocated’. Each A-B sequence requires its own list + node because C and D are holes, so the list node + cannot be combined with the next A-B sequence. + + Why do we interleave our swap space instead of just tacking swap + areas onto the end and doing something fancier? Because it's a whole + lot easier to allocate linear swaths of an address space and have + the result automatically be interleaved across multiple disks than + it is to try to put that sophistication elsewhere. + + The fragmentation causes other problems. Being a linear list + under 3.x, and having such a huge amount of inherent + fragmentation, allocating and freeing swap winds up being an O(N) + algorithm instead of an O(1) algorithm. Combine that with other + factors (heavy swapping) and you start getting into O(N^2) and + O(N^3) levels of overhead, which is bad.
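Going back to the interleave itself, the arithmetic is easy to sketch (illustrative names and constants, not the kernel's own; the 16-page chunk size and factor of four follow the example above):

#define NSWDEV          4       /* fixed interleave factor */
#define CHUNK_PAGES     16      /* pages per interleave chunk */

/* Which swap area a linear swap page lands on (A = 0, B = 1, ...). */
static int
swap_device(unsigned long swappage)
{
        return ((int)((swappage / CHUNK_PAGES) % NSWDEV));
}

/* The page offset within that area. */
static unsigned long
swap_offset(unsigned long swappage)
{
        unsigned long chunk = swappage / CHUNK_PAGES;

        return ((chunk / NSWDEV) * CHUNK_PAGES + swappage % CHUNK_PAGES);
}

With only areas 0 and 1 (A and B) actually configured, every chunk that maps to areas 2 and 3 (C and D) is a hole, which is exactly the fragmentation described above.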
The 3.x system may also + need to allocate KVM during a swap operation to create a new list + node, which can lead to a deadlock if the system is trying to + page out pages in a low-memory situation. + + Under 4.x we do not use a sequential list. Instead we use a + radix tree and bitmaps of swap blocks rather than ranged list + nodes. We take the hit of preallocating all the bitmaps required + for the entire swap area up front, but it winds up wasting less + memory due to the use of a bitmap (one bit per block) instead of a + linked list of nodes. The use of a radix tree instead of a + sequential list gives us nearly O(1) performance no matter how + fragmented the tree becomes. + + + + + + I don't get the following: +
+ It is important to note that the FreeBSD VM system attempts + to separate clean and dirty pages for the express reason of + avoiding unnecessary flushes of dirty pages (which eat I/O + bandwidth), and it does not move pages between the various page + queues gratuitously when the memory subsystem is not being + stressed. This is why you will see some systems with very low + cache queue counts and high active queue counts when doing a + systat -vm command. +
+ + How is the separation of clean and dirty (inactive) pages + related to the situation where you see low cache queue counts and + high active queue counts in systat -vm? Do the + systat stats roll the active and dirty pages together for the + active queue count? +
+ + + Yes, that is confusing. The relationship is + “goal” versus “reality”. Our goal is to + separate the pages, but the reality is that if we are not in a + memory crunch, we don't really have to. + + What this means is that FreeBSD will not try very hard to + separate out dirty pages (inactive queue) from clean pages (cache + queue) when the system is not being stressed, nor will it try to + deactivate pages (active queue -> inactive queue) when the system + is not being stressed, even if they aren't being used. + +
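For the curious, the queue counts that systat -vm displays can also be read directly. A userland sketch, assuming the vm.stats.vm sysctl names below exist on your system (check with sysctl -a):

#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>

static u_int
queue_count(const char *name)
{
        u_int val;
        size_t len = sizeof(val);

        if (sysctlbyname(name, &val, &len, NULL, 0) == -1)
                return (0);
        return (val);
}

int
main(void)
{
        printf("active:   %u\n", queue_count("vm.stats.vm.v_active_count"));
        printf("inactive: %u\n", queue_count("vm.stats.vm.v_inactive_count"));
        printf("cache:    %u\n", queue_count("vm.stats.vm.v_cache_count"));
        return (0);
}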
+ + + + In the &man.ls.1; / vmstat 1 example, + wouldn't some of the page faults be data page faults (COW from + executable file to private page)? I.e., I would expect the page + faults to be some zero-fill and some program data. Or are you + implying that FreeBSD does do pre-COW for the program data? + + + + A COW fault can be either zero-fill or program-data. The + mechanism is the same either way because the backing program-data + is almost certainly already in the cache. I am indeed lumping the + two together. FreeBSD does not pre-COW program data or zero-fill, + but it does pre-map pages that exist in its + cache. + + + + + + In your section on page table optimizations, can you give a + little more detail about pv_entry and + vm_page (or should vm_page be + vm_pmap—as in 4.4, cf. pp. 180-181 of + McKusick, Bostic, Karels, Quarterman)? Specifically, what kind of + operation/reaction would require scanning the mappings? + + How does Linux do in the case where FreeBSD breaks down + (sharing a large file mapping over many processes)? + + + + A vm_page represents an (object,index#) + tuple. A pv_entry represents a hardware page + table entry (pte). If you have five processes sharing the same + physical page, and three of those processes' page tables actually + map the page, that page will be represented by a single + vm_page structure and three + pv_entry structures. + + pv_entry structures only represent pages + mapped by the MMU (one pv_entry represents one + pte). This means that when we need to remove all hardware + references to a vm_page (in order to reuse the + page for something else, page it out, clear it, dirty it, and so + forth) we can simply scan the linked list of + pv_entry's associated with that + vm_page to remove or modify the pte's from + their page tables. + + Under Linux there is no such linked list. In order to remove + all the hardware page table mappings for a + vm_page, Linux must index into every VM object + that might have mapped the page. For + example, if you have 50 processes all mapping the same shared + library and want to get rid of page X in that library, you need to + index into the page table for each of those 50 processes even if + only 10 of them have actually mapped the page. So Linux is + trading off the simplicity of its design against performance. + Many VM algorithms which are O(1) or O(small N) under FreeBSD wind + up being O(N), O(N^2), or worse under Linux. Since the pte's + representing a particular page in an object tend to be at the same + offset in all the page tables they are mapped in, reducing the + number of accesses into the page tables at the same pte offset + will often avoid blowing away the L1 cache line for that offset, + which can lead to better performance. + + FreeBSD has added complexity (the pv_entry + scheme) in order to increase performance (to limit page table + accesses to only those pte's that need to be + modified). + + But FreeBSD has a scaling problem that Linux does not in that + there are a limited number of pv_entry + structures and this causes problems when you have massive sharing + of data. In this case you may run out of + pv_entry structures even though there is plenty + of free memory available. This can be fixed easily enough by + bumping up the number of pv_entry structures in + the kernel config, but we really need to find a better way to do + it.
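In code form, a hypothetical and heavily simplified version of the relationship just described (the real definitions live in the kernel's vm and pmap code):

#include <stddef.h>

struct pmap;                    /* one process's page tables */
/* Hypothetical helper: remove one pte from one process's page tables. */
extern void pmap_remove_pte(struct pmap *, unsigned long va);

struct pv_entry {
        struct pmap     *pv_pmap;       /* which page tables map the page */
        unsigned long    pv_va;         /* at which virtual address */
        struct pv_entry *pv_next;
};

struct vm_page {
        struct pv_entry *pv_list;       /* every pte known to map this page */
};

/*
 * Remove all hardware references to a page by walking only the
 * mappings known to exist -- no searching of page tables that merely
 * might contain one.
 */
static void
remove_all_mappings(struct vm_page *m)
{
        struct pv_entry *pv;

        for (pv = m->pv_list; pv != NULL; pv = pv->pv_next)
                pmap_remove_pte(pv->pv_pmap, pv->pv_va);
        m->pv_list = NULL;
}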
+ + Regarding the memory overhead of a page table versus the + pv_entry scheme: Linux uses + ‘permanent’ page tables that are not thrown away, but + does not need a pv_entry for each potentially + mapped pte. FreeBSD uses ‘throw away’ page tables but + adds in a pv_entry structure for each + actually-mapped pte. I think memory utilization winds up being + about the same, giving FreeBSD an algorithmic advantage with its + ability to throw away page tables at will with very low + overhead. + + + + + + Finally, in the page coloring section, it might help to have a + little more description of what you mean here. I didn't quite + follow it. + + + + Do you know how an L1 hardware memory cache works? I'll + explain: Consider a machine with 16MB of main memory but only 128K + of L1 cache. Generally the way this cache works is that each 128K + block of main memory uses the same 128K of + cache. If you access offset 0 in main memory and then offset + 128K in main memory you can wind up throwing away the + cached data you read from offset 0! + + Now, I am simplifying things greatly. What I just described + is what is called a ‘direct mapped’ hardware memory + cache. Most modern caches are what are called + 2-way-set-associative or 4-way-set-associative caches. The + set-associativity allows you to access up to N different memory + regions that overlap the same cache memory without destroying the + previously cached data. But only N. + + So if I have a 4-way set associative cache I can access offset + 0, offset 128K, offset 256K, and offset 384K and still be able to access + offset 0 again and have it come from the L1 cache. If I then + access offset 512K, however, one of the four previously cached + data objects will be thrown away by the cache. + + It is extremely important… + extremely important for most of a processor's + memory accesses to be able to come from the L1 cache, because the + L1 cache operates at the processor frequency. The moment you have + an L1 cache miss and have to go to the L2 cache or to main memory, + the processor will stall and potentially sit twiddling its fingers + for hundreds of instructions' worth of time + waiting for a read from main memory to complete. Main memory (the + dynamic RAM you stuff into a computer) is + slow, when compared to the speed of a modern + processor core. + + OK, so now on to page coloring: All modern memory caches are + what are known as physical caches. They + cache physical memory addresses, not virtual memory addresses. + This allows the cache to be left alone across a process context + switch, which is very important. + + But in the UNIX world you are dealing with virtual address + spaces, not physical address spaces. Any program you write will + see the virtual address space given to it. The actual + physical pages underlying that virtual + address space are not necessarily physically contiguous! In fact, + you might have two pages that are side by side in a process's + address space which wind up being at offset 0 and offset 128K in + physical memory. + + A program normally assumes that two side-by-side pages will be + optimally cached. That is, that you can access data objects in + both pages without having them blow away each other's cache entry. + But this is only true if the physical pages underlying the virtual + address space are contiguous (insofar as the cache is + concerned). + + This is what page coloring does.
Instead of assigning + random physical pages to virtual addresses, + which may result in non-optimal cache performance, page coloring + assigns reasonably-contiguous physical pages + to virtual addresses. Thus programs can be written under the + assumption that the characteristics of the underlying hardware + cache are the same for their virtual address space as they would + be if the program had been run directly in a physical address + space. + + Note that I say ‘reasonably’ contiguous rather + than simply ‘contiguous’. From the point of view of a + 128K direct mapped cache, the physical address 0 is the same as + the physical address 128K. So two side-by-side pages in your + virtual address space may wind up being offset 128K and offset + 132K in physical memory, but could also easily be offset 128K and + offset 4K in physical memory and still retain the same cache + performance characteristics. So page coloring does + not have to assign truly contiguous pages of + physical memory to contiguous pages of virtual memory; it just + needs to make sure it assigns contiguous pages from the point of + view of cache performance and operation. + +
+
+
diff --git a/en_US.ISO_8859-1/articles/vm-design/fig1.eps b/en_US.ISO_8859-1/articles/vm-design/fig1.eps new file mode 100644 index 0000000000..49d2c05a56 --- /dev/null +++ b/en_US.ISO_8859-1/articles/vm-design/fig1.eps @@ -0,0 +1,104 @@ +%!PS-Adobe-2.0 EPSF-2.0 +%%Title: fig1.eps +%%Creator: fig2dev Version 3.2.3 Patchlevel +%%CreationDate: Sun Oct 8 19:54:25 2000 +%%For: nik@canyon.nothing-going-on.org (Nik Clayton) +%%BoundingBox: 0 0 119 65 +%%Magnification: 1.0000 +%%EndComments +/$F2psDict 200 dict def +$F2psDict begin +$F2psDict /mtrx matrix put +/col-1 {0 setgray} bind def +/col0 {0.000 0.000 0.000 srgb} bind def +/col1 {0.000 0.000 1.000 srgb} bind def +/col2 {0.000 1.000 0.000 srgb} bind def +/col3 {0.000 1.000 1.000 srgb} bind def +/col4 {1.000 0.000 0.000 srgb} bind def +/col5 {1.000 0.000 1.000 srgb} bind def +/col6 {1.000 1.000 0.000 srgb} bind def +/col7 {1.000 1.000 1.000 srgb} bind def +/col8 {0.000 0.000 0.560 srgb} bind def +/col9 {0.000 0.000 0.690 srgb} bind def +/col10 {0.000 0.000 0.820 srgb} bind def +/col11 {0.530 0.810 1.000 srgb} bind def +/col12 {0.000 0.560 0.000 srgb} bind def +/col13 {0.000 0.690 0.000 srgb} bind def +/col14 {0.000 0.820 0.000 srgb} bind def +/col15 {0.000 0.560 0.560 srgb} bind def +/col16 {0.000 0.690 0.690 srgb} bind def +/col17 {0.000 0.820 0.820 srgb} bind def +/col18 {0.560 0.000 0.000 srgb} bind def +/col19 {0.690 0.000 0.000 srgb} bind def +/col20 {0.820 0.000 0.000 srgb} bind def +/col21 {0.560 0.000 0.560 srgb} bind def +/col22 {0.690 0.000 0.690 srgb} bind def +/col23 {0.820 0.000 0.820 srgb} bind def +/col24 {0.500 0.190 0.000 srgb} bind def +/col25 {0.630 0.250 0.000 srgb} bind def +/col26 {0.750 0.380 0.000 srgb} bind def +/col27 {1.000 0.500 0.500 srgb} bind def +/col28 {1.000 0.630 0.630 srgb} bind def +/col29 {1.000 0.750 0.750 srgb} bind def +/col30 {1.000 0.880 0.880 srgb} bind def +/col31 {1.000 0.840 0.000 srgb} bind def + +end +save +newpath 0 65 moveto 0 0 lineto 119 0 lineto 119 65 lineto closepath clip newpath +-143.0 298.0 translate +1 -1 scale + +/cp {closepath} bind def +/ef {eofill} bind def +/gr {grestore} bind def +/gs {gsave} bind def +/sa {save} bind def +/rs {restore} bind def +/l {lineto} bind def +/m {moveto} bind def +/rm {rmoveto} bind def +/n {newpath} bind def +/s {stroke} bind def +/sh {show} bind def +/slc {setlinecap} bind def +/slj {setlinejoin} bind def +/slw {setlinewidth} bind def +/srgb {setrgbcolor} bind def +/rot {rotate} bind def +/sc {scale} bind def +/sd {setdash} bind def +/ff {findfont} bind def +/sf {setfont} bind def +/scf {scalefont} bind def +/sw {stringwidth} bind def +/tr {translate} bind def +/tnt {dup dup currentrgbcolor + 4 -2 roll dup 1 exch sub 3 -1 roll mul add + 4 -2 roll dup 1 exch sub 3 -1 roll mul add + 4 -2 roll dup 1 exch sub 3 -1 roll mul add srgb} + bind def +/shd {dup dup currentrgbcolor 4 -2 roll mul 4 -2 roll mul + 4 -2 roll mul srgb} bind def +/$F2psBegin {$F2psDict begin /$F2psEnteredState save def} def +/$F2psEnd {$F2psEnteredState restore end} def + +$F2psBegin +%%Page: 1 1 +10 setmiterlimit + 0.06000 0.06000 sc +% Polyline +7.500 slw +n 2400 4200 m 4050 4200 l 4050 4950 l 2400 4950 l + cp gs col0 s gr +% Polyline +n 4050 4200 m + 4350 3900 l gs col0 s gr +% Polyline +n 2400 4200 m 2700 3900 l 4350 3900 l 4350 4650 l + 4050 4950 l gs col0 s gr +/Helvetica-Bold ff 180.00 scf sf +3225 4650 m +gs 1 -1 sc (A) dup sw pop 2 div neg 0 rm col0 sh gr +$F2psEnd +rs diff --git a/en_US.ISO_8859-1/articles/vm-design/fig2.eps b/en_US.ISO_8859-1/articles/vm-design/fig2.eps new 
file mode 100644 index 0000000000..fcb8bd41ad --- /dev/null +++ b/en_US.ISO_8859-1/articles/vm-design/fig2.eps @@ -0,0 +1,115 @@ +%!PS-Adobe-2.0 EPSF-2.0 +%%Title: fig2.eps +%%Creator: fig2dev Version 3.2.3 Patchlevel +%%CreationDate: Sun Oct 8 19:55:31 2000 +%%For: nik@canyon.nothing-going-on.org (Nik Clayton) +%%BoundingBox: 0 0 120 110 +%%Magnification: 1.0000 +%%EndComments +/$F2psDict 200 dict def +$F2psDict begin +$F2psDict /mtrx matrix put +/col-1 {0 setgray} bind def +/col0 {0.000 0.000 0.000 srgb} bind def +/col1 {0.000 0.000 1.000 srgb} bind def +/col2 {0.000 1.000 0.000 srgb} bind def +/col3 {0.000 1.000 1.000 srgb} bind def +/col4 {1.000 0.000 0.000 srgb} bind def +/col5 {1.000 0.000 1.000 srgb} bind def +/col6 {1.000 1.000 0.000 srgb} bind def +/col7 {1.000 1.000 1.000 srgb} bind def +/col8 {0.000 0.000 0.560 srgb} bind def +/col9 {0.000 0.000 0.690 srgb} bind def +/col10 {0.000 0.000 0.820 srgb} bind def +/col11 {0.530 0.810 1.000 srgb} bind def +/col12 {0.000 0.560 0.000 srgb} bind def +/col13 {0.000 0.690 0.000 srgb} bind def +/col14 {0.000 0.820 0.000 srgb} bind def +/col15 {0.000 0.560 0.560 srgb} bind def +/col16 {0.000 0.690 0.690 srgb} bind def +/col17 {0.000 0.820 0.820 srgb} bind def +/col18 {0.560 0.000 0.000 srgb} bind def +/col19 {0.690 0.000 0.000 srgb} bind def +/col20 {0.820 0.000 0.000 srgb} bind def +/col21 {0.560 0.000 0.560 srgb} bind def +/col22 {0.690 0.000 0.690 srgb} bind def +/col23 {0.820 0.000 0.820 srgb} bind def +/col24 {0.500 0.190 0.000 srgb} bind def +/col25 {0.630 0.250 0.000 srgb} bind def +/col26 {0.750 0.380 0.000 srgb} bind def +/col27 {1.000 0.500 0.500 srgb} bind def +/col28 {1.000 0.630 0.630 srgb} bind def +/col29 {1.000 0.750 0.750 srgb} bind def +/col30 {1.000 0.880 0.880 srgb} bind def +/col31 {1.000 0.840 0.000 srgb} bind def + +end +save +newpath 0 110 moveto 0 0 lineto 120 0 lineto 120 110 lineto closepath clip newpath +-174.0 370.0 translate +1 -1 scale + +/cp {closepath} bind def +/ef {eofill} bind def +/gr {grestore} bind def +/gs {gsave} bind def +/sa {save} bind def +/rs {restore} bind def +/l {lineto} bind def +/m {moveto} bind def +/rm {rmoveto} bind def +/n {newpath} bind def +/s {stroke} bind def +/sh {show} bind def +/slc {setlinecap} bind def +/slj {setlinejoin} bind def +/slw {setlinewidth} bind def +/srgb {setrgbcolor} bind def +/rot {rotate} bind def +/sc {scale} bind def +/sd {setdash} bind def +/ff {findfont} bind def +/sf {setfont} bind def +/scf {scalefont} bind def +/sw {stringwidth} bind def +/tr {translate} bind def +/tnt {dup dup currentrgbcolor + 4 -2 roll dup 1 exch sub 3 -1 roll mul add + 4 -2 roll dup 1 exch sub 3 -1 roll mul add + 4 -2 roll dup 1 exch sub 3 -1 roll mul add srgb} + bind def +/shd {dup dup currentrgbcolor 4 -2 roll mul 4 -2 roll mul + 4 -2 roll mul srgb} bind def +/$F2psBegin {$F2psDict begin /$F2psEnteredState save def} def +/$F2psEnd {$F2psEnteredState restore end} def + +$F2psBegin +%%Page: 1 1 +10 setmiterlimit + 0.06000 0.06000 sc +/Helvetica-Bold ff 180.00 scf sf +3750 5100 m +gs 1 -1 sc (B) dup sw pop 2 div neg 0 rm col0 sh gr +% Polyline +7.500 slw +n 4871 5100 m 4879 5100 l gs col0 s gr +% Polyline +n 2925 5400 m 4575 5400 l 4575 6150 l 2925 6150 l + cp gs col0 s gr +% Polyline +n 4575 4650 m + 4875 4350 l gs col0 s gr +% Polyline +n 2925 4650 m 4575 4650 l 4575 5400 l 2925 5400 l + cp gs col0 s gr +% Polyline +n 2925 4650 m 3225 4350 l 4875 4350 l 4875 5100 l + 4575 5400 l gs col0 s gr +/Helvetica-Bold ff 180.00 scf sf +3750 5850 m +gs 1 -1 sc (A) dup sw pop 2 div neg 0 rm col0 
sh gr +% Polyline +n 4875 5100 m 4875 5850 l + 4575 6150 l gs col0 s gr +$F2psEnd +rs diff --git a/en_US.ISO_8859-1/articles/vm-design/fig3.eps b/en_US.ISO_8859-1/articles/vm-design/fig3.eps new file mode 100644 index 0000000000..0e3138b2ed --- /dev/null +++ b/en_US.ISO_8859-1/articles/vm-design/fig3.eps @@ -0,0 +1,133 @@ +%!PS-Adobe-2.0 EPSF-2.0 +%%Title: fig3.eps +%%Creator: fig2dev Version 3.2.3 Patchlevel +%%CreationDate: Sun Oct 8 19:53:51 2000 +%%For: nik@canyon.nothing-going-on.org (Nik Clayton) +%%BoundingBox: 0 0 120 155 +%%Magnification: 1.0000 +%%EndComments +/$F2psDict 200 dict def +$F2psDict begin +$F2psDict /mtrx matrix put +/col-1 {0 setgray} bind def +/col0 {0.000 0.000 0.000 srgb} bind def +/col1 {0.000 0.000 1.000 srgb} bind def +/col2 {0.000 1.000 0.000 srgb} bind def +/col3 {0.000 1.000 1.000 srgb} bind def +/col4 {1.000 0.000 0.000 srgb} bind def +/col5 {1.000 0.000 1.000 srgb} bind def +/col6 {1.000 1.000 0.000 srgb} bind def +/col7 {1.000 1.000 1.000 srgb} bind def +/col8 {0.000 0.000 0.560 srgb} bind def +/col9 {0.000 0.000 0.690 srgb} bind def +/col10 {0.000 0.000 0.820 srgb} bind def +/col11 {0.530 0.810 1.000 srgb} bind def +/col12 {0.000 0.560 0.000 srgb} bind def +/col13 {0.000 0.690 0.000 srgb} bind def +/col14 {0.000 0.820 0.000 srgb} bind def +/col15 {0.000 0.560 0.560 srgb} bind def +/col16 {0.000 0.690 0.690 srgb} bind def +/col17 {0.000 0.820 0.820 srgb} bind def +/col18 {0.560 0.000 0.000 srgb} bind def +/col19 {0.690 0.000 0.000 srgb} bind def +/col20 {0.820 0.000 0.000 srgb} bind def +/col21 {0.560 0.000 0.560 srgb} bind def +/col22 {0.690 0.000 0.690 srgb} bind def +/col23 {0.820 0.000 0.820 srgb} bind def +/col24 {0.500 0.190 0.000 srgb} bind def +/col25 {0.630 0.250 0.000 srgb} bind def +/col26 {0.750 0.380 0.000 srgb} bind def +/col27 {1.000 0.500 0.500 srgb} bind def +/col28 {1.000 0.630 0.630 srgb} bind def +/col29 {1.000 0.750 0.750 srgb} bind def +/col30 {1.000 0.880 0.880 srgb} bind def +/col31 {1.000 0.840 0.000 srgb} bind def + +end +save +newpath 0 155 moveto 0 0 lineto 120 0 lineto 120 155 lineto closepath clip newpath +-174.0 370.0 translate +1 -1 scale + +/cp {closepath} bind def +/ef {eofill} bind def +/gr {grestore} bind def +/gs {gsave} bind def +/sa {save} bind def +/rs {restore} bind def +/l {lineto} bind def +/m {moveto} bind def +/rm {rmoveto} bind def +/n {newpath} bind def +/s {stroke} bind def +/sh {show} bind def +/slc {setlinecap} bind def +/slj {setlinejoin} bind def +/slw {setlinewidth} bind def +/srgb {setrgbcolor} bind def +/rot {rotate} bind def +/sc {scale} bind def +/sd {setdash} bind def +/ff {findfont} bind def +/sf {setfont} bind def +/scf {scalefont} bind def +/sw {stringwidth} bind def +/tr {translate} bind def +/tnt {dup dup currentrgbcolor + 4 -2 roll dup 1 exch sub 3 -1 roll mul add + 4 -2 roll dup 1 exch sub 3 -1 roll mul add + 4 -2 roll dup 1 exch sub 3 -1 roll mul add srgb} + bind def +/shd {dup dup currentrgbcolor 4 -2 roll mul 4 -2 roll mul + 4 -2 roll mul srgb} bind def +/$F2psBegin {$F2psDict begin /$F2psEnteredState save def} def +/$F2psEnd {$F2psEnteredState restore end} def + +$F2psBegin +%%Page: 1 1 +10 setmiterlimit + 0.06000 0.06000 sc +/Helvetica-Bold ff 180.00 scf sf +4125 4350 m +gs 1 -1 sc (C2) dup sw pop 2 div neg 0 rm col0 sh gr +% Polyline +7.500 slw +n 4871 5100 m 4879 5100 l gs col0 s gr +% Polyline +n 2925 5400 m 4575 5400 l 4575 6150 l 2925 6150 l + cp gs col0 s gr +% Polyline +n 4575 4650 m + 4875 4350 l gs col0 s gr +% Polyline +n 2925 4650 m 4575 4650 l 4575 5400 l 2925 5400 l + cp gs 
col0 s gr +% Polyline +n 4875 3600 m 4875 5100 l + 4575 5400 l gs col0 s gr +% Polyline +n 2925 4650 m 2925 3900 l 3225 3600 l + 4875 3600 l gs col0 s gr +% Polyline +n 2925 3900 m 4425 3900 l 4575 3900 l + 4875 3600 l gs col0 s gr +% Polyline +n 4575 4650 m + 4575 3900 l gs col0 s gr +% Polyline +n 3750 4650 m 3750 3900 l + 4050 3600 l gs col0 s gr +/Helvetica-Bold ff 180.00 scf sf +3750 5850 m +gs 1 -1 sc (A) dup sw pop 2 div neg 0 rm col0 sh gr +/Helvetica-Bold ff 180.00 scf sf +3750 5100 m +gs 1 -1 sc (B) dup sw pop 2 div neg 0 rm col0 sh gr +/Helvetica-Bold ff 180.00 scf sf +3375 4350 m +gs 1 -1 sc (C1) dup sw pop 2 div neg 0 rm col0 sh gr +% Polyline +n 4875 5100 m 4875 5850 l + 4575 6150 l gs col0 s gr +$F2psEnd +rs diff --git a/en_US.ISO_8859-1/articles/vm-design/fig4.eps b/en_US.ISO_8859-1/articles/vm-design/fig4.eps new file mode 100644 index 0000000000..24fc1b5add --- /dev/null +++ b/en_US.ISO_8859-1/articles/vm-design/fig4.eps @@ -0,0 +1,133 @@ +%!PS-Adobe-2.0 EPSF-2.0 +%%Title: fig4.eps +%%Creator: fig2dev Version 3.2.3 Patchlevel +%%CreationDate: Sun Oct 8 19:55:53 2000 +%%For: nik@canyon.nothing-going-on.org (Nik Clayton) +%%BoundingBox: 0 0 120 155 +%%Magnification: 1.0000 +%%EndComments +/$F2psDict 200 dict def +$F2psDict begin +$F2psDict /mtrx matrix put +/col-1 {0 setgray} bind def +/col0 {0.000 0.000 0.000 srgb} bind def +/col1 {0.000 0.000 1.000 srgb} bind def +/col2 {0.000 1.000 0.000 srgb} bind def +/col3 {0.000 1.000 1.000 srgb} bind def +/col4 {1.000 0.000 0.000 srgb} bind def +/col5 {1.000 0.000 1.000 srgb} bind def +/col6 {1.000 1.000 0.000 srgb} bind def +/col7 {1.000 1.000 1.000 srgb} bind def +/col8 {0.000 0.000 0.560 srgb} bind def +/col9 {0.000 0.000 0.690 srgb} bind def +/col10 {0.000 0.000 0.820 srgb} bind def +/col11 {0.530 0.810 1.000 srgb} bind def +/col12 {0.000 0.560 0.000 srgb} bind def +/col13 {0.000 0.690 0.000 srgb} bind def +/col14 {0.000 0.820 0.000 srgb} bind def +/col15 {0.000 0.560 0.560 srgb} bind def +/col16 {0.000 0.690 0.690 srgb} bind def +/col17 {0.000 0.820 0.820 srgb} bind def +/col18 {0.560 0.000 0.000 srgb} bind def +/col19 {0.690 0.000 0.000 srgb} bind def +/col20 {0.820 0.000 0.000 srgb} bind def +/col21 {0.560 0.000 0.560 srgb} bind def +/col22 {0.690 0.000 0.690 srgb} bind def +/col23 {0.820 0.000 0.820 srgb} bind def +/col24 {0.500 0.190 0.000 srgb} bind def +/col25 {0.630 0.250 0.000 srgb} bind def +/col26 {0.750 0.380 0.000 srgb} bind def +/col27 {1.000 0.500 0.500 srgb} bind def +/col28 {1.000 0.630 0.630 srgb} bind def +/col29 {1.000 0.750 0.750 srgb} bind def +/col30 {1.000 0.880 0.880 srgb} bind def +/col31 {1.000 0.840 0.000 srgb} bind def + +end +save +newpath 0 155 moveto 0 0 lineto 120 0 lineto 120 155 lineto closepath clip newpath +-174.0 370.0 translate +1 -1 scale + +/cp {closepath} bind def +/ef {eofill} bind def +/gr {grestore} bind def +/gs {gsave} bind def +/sa {save} bind def +/rs {restore} bind def +/l {lineto} bind def +/m {moveto} bind def +/rm {rmoveto} bind def +/n {newpath} bind def +/s {stroke} bind def +/sh {show} bind def +/slc {setlinecap} bind def +/slj {setlinejoin} bind def +/slw {setlinewidth} bind def +/srgb {setrgbcolor} bind def +/rot {rotate} bind def +/sc {scale} bind def +/sd {setdash} bind def +/ff {findfont} bind def +/sf {setfont} bind def +/scf {scalefont} bind def +/sw {stringwidth} bind def +/tr {translate} bind def +/tnt {dup dup currentrgbcolor + 4 -2 roll dup 1 exch sub 3 -1 roll mul add + 4 -2 roll dup 1 exch sub 3 -1 roll mul add + 4 -2 roll dup 1 exch sub 3 -1 roll mul add srgb} 
+ bind def +/shd {dup dup currentrgbcolor 4 -2 roll mul 4 -2 roll mul + 4 -2 roll mul srgb} bind def +/$F2psBegin {$F2psDict begin /$F2psEnteredState save def} def +/$F2psEnd {$F2psEnteredState restore end} def + +$F2psBegin +%%Page: 1 1 +10 setmiterlimit + 0.06000 0.06000 sc +/Helvetica-Bold ff 180.00 scf sf +3375 4350 m +gs 1 -1 sc (C1) dup sw pop 2 div neg 0 rm col0 sh gr +% Polyline +7.500 slw +n 4871 5100 m 4879 5100 l gs col0 s gr +% Polyline +n 2925 5400 m 4575 5400 l 4575 6150 l 2925 6150 l + cp gs col0 s gr +% Polyline +n 4575 4650 m + 4875 4350 l gs col0 s gr +% Polyline +n 2925 4650 m 4575 4650 l 4575 5400 l 2925 5400 l + cp gs col0 s gr +% Polyline +n 4875 4350 m 4875 5100 l + 4575 5400 l gs col0 s gr +% Polyline +n 2925 4650 m 2925 3900 l 3225 3600 l + 4050 3600 l gs col0 s gr +% Polyline +n 3750 4650 m 3750 3900 l + 4050 3600 l gs col0 s gr +% Polyline +n 2925 3900 m + 3750 3900 l gs col0 s gr +% Polyline +n 3750 4650 m 4050 4350 l + 4875 4350 l gs col0 s gr +% Polyline +n 4050 4350 m + 4050 3600 l gs col0 s gr +/Helvetica-Bold ff 180.00 scf sf +3750 5850 m +gs 1 -1 sc (A) dup sw pop 2 div neg 0 rm col0 sh gr +/Helvetica-Bold ff 180.00 scf sf +3750 5100 m +gs 1 -1 sc (B) dup sw pop 2 div neg 0 rm col0 sh gr +% Polyline +n 4875 5100 m 4875 5850 l + 4575 6150 l gs col0 s gr +$F2psEnd +rs