| <!-- $FreeBSD$ -->
 | |
| <!-- The FreeBSD Documentation Project -->
 | |
| 
 | |
| <!DOCTYPE article PUBLIC "-//FreeBSD//DTD DocBook V4.1-Based Extension//EN" [
 | |
| <!ENTITY % articles.ent PUBLIC "-//FreeBSD//ENTITIES DocBook FreeBSD Articles Entity Set//EN">
 | |
| %articles.ent;
 | |
| ]>
 | |
| 
 | |
| <article>
 | |
|   <articleinfo>
 | |
|     <title>&linux; emulation in &os;</title>
 | |
| 
 | |
|     <author>
 | |
|       <firstname>Roman</firstname>
 | |
|       <surname>Divacky</surname>
 | |
| 
 | |
|       <affiliation>
 | |
| 	<address><email>rdivacky@FreeBSD.org</email></address>
 | |
|       </affiliation>
 | |
|     </author>
 | |
| 
 | |
|     <legalnotice id="trademarks" role="trademarks">
 | |
|       &tm-attrib.adobe;
 | |
|       &tm-attrib.ibm;
 | |
|       &tm-attrib.freebsd;
 | |
|       &tm-attrib.linux;
 | |
|       &tm-attrib.netbsd;
 | |
|       &tm-attrib.realnetworks;
 | |
|       &tm-attrib.oracle;
 | |
|       &tm-attrib.sun;
 | |
|       &tm-attrib.general;
 | |
|     </legalnotice>
 | |
| 
 | |
|     <abstract>
 | |
      <para>This master's thesis deals with updating the &linux; emulation layer
	(the so-called <firstterm>Linuxulator</firstterm>).  The task was to update the layer to match
	the functionality of &linux; 2.6.  As a reference implementation, the
	&linux; 2.6.16 kernel was chosen.  The concept is loosely based on
	the NetBSD implementation.  Most of the work was done in the summer
	of 2006 as a part of the Google Summer of Code student program.
	The focus was on bringing the <firstterm>NPTL</firstterm> (new &posix;
	thread library) support into the emulation layer, including
	<firstterm>TLS</firstterm> (thread local storage),
	<firstterm>futexes</firstterm> (fast user space mutexes),
	<firstterm>PID mangling</firstterm>, and some other minor
	things.  Many small problems were identified and fixed in the
	process.  My work was integrated into the main &os; source
	repository and will be shipped in the upcoming 7.0R release.  We,
	the emulation development team, are working on making the
	&linux; 2.6 emulation the default emulation layer in &os;.</para>
 | |
|     </abstract>
 | |
|   </articleinfo>
 | |
| 
 | |
|   <sect1 id="intro">
 | |
|     <title>Introduction</title>
 | |
| 
 | |
    <para>In the last few years the open source &unix;-based operating systems
      started to be widely deployed on server and client machines.  Among
      these operating systems I would like to point out two: &os;, for its BSD
      heritage, time-proven code base and many interesting features, and
      &linux;, for its wide user base, enthusiastic open developer community
      and support from large companies.  &os; tends to be used on server
      class machines serving heavy duty networking tasks, with less usage on
      desktop class machines for ordinary users.  &linux; has the same
      usage on servers, but it is used much more by home users.  This
      leads to a situation where there are many binary-only programs available
      for &linux; that lack support for &os;.</para>
 | |
| 
 | |
|     <para>Naturally, a need for the ability to run &linux; binaries on a &os;
 | |
|       system arises and this is what this thesis deals with: the emulation of
 | |
|       the &linux; kernel in the &os; operating system.</para>
 | |
| 
 | |
|     <para>During the Summer of 2006 Google Inc. sponsored a project which
 | |
|       focused on extending the &linux; emulation layer (the so called Linuxulator)
 | |
|       in &os; to include &linux; 2.6 facilities.  This thesis is written as a
 | |
|       part of this project.</para>
 | |
|   </sect1>
 | |
| 
 | |
|   <sect1 id="inside">
 | |
|     <title>A look inside…</title>
 | |
| 
 | |
    <para>In this section we describe each operating system in question:
      how it deals with syscalls, trapframes, and other low-level
      details.  We also describe the way each system understands common &unix;
      primitives, such as what a PID is, what a thread is, etc.  In the third
      subsection we talk about how &unix; on &unix; emulation could be done
      in general.</para>
 | |
| 
 | |
|     <sect2 id="what-is-unix">
 | |
|       <title>What is &unix;</title>
 | |
| 
 | |
      <para>&unix; is an operating system with a long history that has
	influenced almost every other operating system currently in use.
	Starting in the 1960s, its development continues to this day (although
	in different projects).  &unix; development soon forked into two main
	branches: the BSD and System III/V families.  They influenced each
	other while growing a common &unix; standard.  Among the
	contributions that originated in BSD we can name virtual memory, TCP/IP
	networking, FFS, and many others.  The System V branch contributed
	SysV interprocess communication primitives, copy-on-write, etc.  &unix;
	itself does not exist any more, but its ideas have been used by many
	other operating systems worldwide, thus forming the so-called &unix;-like
	operating systems.  These days the most influential ones are &linux;,
	Solaris, and possibly (to some extent) &os;.  There are in-company
	&unix; derivatives (AIX, HP-UX etc.), but these have increasingly been
	migrated to the aforementioned systems.  Let us summarize typical
	&unix; characteristics.</para>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 id="tech-details">
 | |
|       <title>Technical details</title>
 | |
| 
 | |
      <para>Every running program constitutes a process that represents a state
	of the computation.  A running process is divided between kernel space
	and user space.  Some operations can be done only from kernel space
	(dealing with hardware etc.), but the process should spend most of its
	lifetime in user space.  The kernel is where the management of the
	processes, hardware, and low-level details takes place.  The kernel
	provides a standard unified &unix; API to user space.  The most
	important parts of it are covered below.</para>
 | |
| 
 | |
|       <sect3 id="kern-proc-comm">
 | |
| 	<title>Communication between kernel and user space process</title>
 | |
| 
 | |
	<para>The common &unix; API defines a syscall as a way to issue commands
	  from a user space process to the kernel.  The most common
	  implementation uses either an interrupt or a specialized
	  instruction (think of the
	  <literal>SYSENTER</literal>/<literal>SYSCALL</literal> instructions
	  for ia32).  Syscalls are identified by a number.  For example in &os;,
	  the syscall number 85 is the &man.swapon.2; syscall and the
	  syscall number 132 is &man.mkfifo.2;.  Some syscalls need
	  parameters, which are passed from user space to kernel space
	  in various ways (implementation dependent).  Syscalls are
	  synchronous.</para>
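	<para>The idea of identifying syscalls by number can be sketched as a
	  table of function pointers indexed by the syscall number.  This is
	  only a userspace illustration, not kernel code: the names
	  <literal>ex_sysent</literal>, <literal>ex_dispatch</literal> and the
	  two toy handlers are hypothetical, and only the numbers 85 and 132
	  come from the paragraph above.</para>

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical handler type: takes an argument array, returns a result. */
typedef long (*ex_handler_t)(const long *args);

/* Toy stand-ins for real handlers; they only echo or add their arguments. */
static long ex_swapon(const long *args) { return args[0]; }
static long ex_mkfifo(const long *args) { return args[0] + args[1]; }

#define EX_NSYSCALLS 256

/* Sparse table indexed by syscall number; unset slots stay NULL. */
static const ex_handler_t ex_sysent[EX_NSYSCALLS] = {
    [85]  = ex_swapon,   /* number taken from the text above */
    [132] = ex_mkfifo,   /* number taken from the text above */
};

/* Look up the handler for num and call it; unknown numbers give -ENOSYS. */
static long ex_dispatch(int num, const long *args)
{
    if (num < 0 || num >= EX_NSYSCALLS || ex_sysent[num] == NULL)
        return -ENOSYS;
    return ex_sysent[num](args);
}
```

	<para>Calling <literal>ex_dispatch(132, args)</literal> runs the toy
	  handler registered for number 132, while a number with no entry is
	  rejected instead of being executed.</para>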
 | |
| 
 | |
| 	<para>Another possible way to communicate is by using a
 | |
| 	  <firstterm>trap</firstterm>.  Traps occur asynchronously after
 | |
| 	  some event occurs (division by zero, page fault etc.).  A trap
 | |
| 	  can be transparent for a process (page fault) or can result in
 | |
| 	  a reaction like sending a <firstterm>signal</firstterm>
 | |
| 	  (division by zero).</para>
 | |
|       </sect3>
 | |
| 
 | |
|       <sect3 id="proc-proc-comm">
 | |
| 	<title>Communication between processes</title>
 | |
| 
 | |
	<para>There are other APIs (System V IPC, shared memory etc.) but the
	  single most important one is the signal API.  Signals are sent by processes
	  or by the kernel and received by processes.  Some signals
	  can be ignored or handled by a user-supplied routine; some result
	  in a predefined action that cannot be altered or ignored.</para>
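	<para>A small sketch of both cases, using the standard &posix;
	  <literal>sigaction()</literal> and <literal>raise()</literal>
	  calls: a handler is installed for <literal>SIGUSR1</literal>, and an
	  attempt to do the same for <literal>SIGKILL</literal> is expected to
	  be refused.  The <literal>ex_</literal> names are hypothetical
	  helpers made up for this illustration.</para>

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Set by the handler; sig_atomic_t is safe to write from a signal handler. */
static volatile sig_atomic_t ex_got_signal = 0;

static void ex_on_usr1(int signo)
{
    ex_got_signal = signo;
}

/*
 * Install a handler for SIGUSR1, send the signal to ourselves and report
 * which signal the handler saw (0 if none, -1 on failure).
 */
static int ex_signal_roundtrip(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = ex_on_usr1;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    ex_got_signal = 0;
    if (raise(SIGUSR1) != 0)
        return -1;
    return ex_got_signal;
}

/* Try to install a handler for SIGKILL; returns 1 if the system allowed it. */
static int ex_can_catch_sigkill(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = ex_on_usr1;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGKILL, &sa, NULL) == 0;
}
```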
 | |
|       </sect3>
 | |
| 
 | |
|       <sect3 id="proc-mgmt">
 | |
| 	<title>Process management</title>
 | |
| 
 | |
	<para>The kernel starts the first process in the system (the
	  so-called init).  Every running process can create an identical copy of itself using
	  the &man.fork.2; syscall.  Some slightly modified versions of this
	  syscall were introduced but the basic semantics are the same.  Every
	  running process can morph into some other process using the
	  &man.exec.3; family of calls.  Several variants of it were
	  introduced but all serve the same basic purpose.  Processes end
	  their lives by calling the &man.exit.2; syscall.  Every process is
	  identified by a unique number called the PID.  Every process has a
	  defined parent (identified by its PID).</para>
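	<para>The life cycle described above can be tried out with a few
	  standard &posix; calls.  The following is a minimal sketch; the
	  helper names <literal>ex_fork_and_wait</literal> and
	  <literal>ex_child_sees_parent</literal> are hypothetical, and
	  <literal>waitpid()</literal> is used here only so that the parent can
	  observe how the child ended.</para>

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that immediately exits with the given code; the parent waits
 * for it.  Returns the exit status seen by the parent, or -1 on failure.
 */
static int ex_fork_and_wait(int code)
{
    int status;
    pid_t child = fork();

    if (child < 0)
        return -1;
    if (child == 0)
        _exit(code);          /* child: end its life right away */

    /* parent: child holds the PID of the new process */
    if (waitpid(child, &status, 0) != child)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Returns 1 if the child's getppid() names the process that called fork(). */
static int ex_child_sees_parent(void)
{
    int status;
    pid_t parent = getpid();
    pid_t child = fork();

    if (child < 0)
        return -1;
    if (child == 0)
        _exit(getppid() == parent ? 0 : 1);
    if (waitpid(child, &status, 0) != child)
        return -1;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```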
 | |
|       </sect3>
 | |
| 
 | |
|       <sect3 id="thread-mgmt">
 | |
| 	<title>Thread management</title>
 | |
| 
 | |
	<para>Traditional &unix; does not define any API or implementation
	  for threading, while &posix; defines a threading API but leaves the
	  implementation undefined.  Traditionally there were two ways of
	  implementing threads: handling them as separate processes (1:1
	  threading), or enveloping the whole thread group in one process and
	  managing the threading in userspace (1:N threading).  The
	  main features of each approach compare as follows:</para>
 | |
| 
 | |
| 	<para>1:1 threading</para>
 | |
| 
 | |
| 	<itemizedlist>
 | |
| 	  <listitem>
 | |
| 	    <para>- heavyweight threads</para>
 | |
| 	  </listitem>
 | |
| 	  <listitem>
 | |
| 	    <para>- the scheduling cannot be altered by the user
 | |
| 	      (slightly mitigated by the &posix; API)</para>
 | |
| 	  </listitem>
 | |
| 	  <listitem>
 | |
| 	    <para>+ no syscall wrapping necessary</para>
 | |
| 	  </listitem>
 | |
| 	  <listitem>
 | |
| 	    <para>+ can utilize multiple CPUs</para>
 | |
| 	  </listitem>
 | |
| 	</itemizedlist>
 | |
| 
 | |
| 	<para>1:N threading</para>
 | |
| 
 | |
| 	<itemizedlist>
 | |
| 	  <listitem>
 | |
| 	    <para>+ lightweight threads</para>
 | |
| 	  </listitem>
 | |
| 	  <listitem>
 | |
| 	    <para>+ scheduling can be easily altered by the user</para>
 | |
| 	  </listitem>
 | |
| 	  <listitem>
 | |
| 	    <para>- syscalls must be wrapped </para>
 | |
| 	  </listitem>
 | |
| 	  <listitem>
 | |
| 	    <para>- cannot utilize more than one CPU</para>
 | |
| 	  </listitem>
 | |
	</itemizedlist>
      </sect3>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 id="what-is-freebsd">
 | |
|       <title>What is &os;?</title>
 | |
| 
 | |
      <para>The &os; project is one of the oldest open source operating
	systems currently available for daily use.  It is a direct descendant
	of the genuine &unix;, so it could be claimed that it is a true &unix;,
	although licensing issues do not permit that.  The start of the project
	dates back to the early 1990s when a crew of fellow BSD users patched
	the 386BSD operating system.  Based on this patchkit a new operating
	system arose, named &os; for its liberal license.  Another group created
	the NetBSD operating system with different goals in mind.  We will
	focus on &os;.</para>
 | |
| 
 | |
|       <para>&os; is a modern &unix;-based operating system with all the
 | |
| 	features of &unix;.  Preemptive multitasking, multiuser facilities,
 | |
| 	TCP/IP networking, memory protection, symmetric multiprocessing
 | |
| 	support, virtual memory with merged VM and buffer cache, they are all
 | |
| 	there.  One of the interesting and extremely useful features is the
 | |
| 	ability to emulate other &unix;-like operating systems.  As of
 | |
| 	December 2006 and 7-CURRENT development, the following
 | |
| 	emulation functionalities are supported:</para>
 | |
| 
 | |
|       <itemizedlist>
 | |
| 	<listitem>
 | |
| 	  <para>&os;/i386 emulation on &os;/amd64</para>
 | |
| 	</listitem>
 | |
| 	<listitem>
 | |
| 	  <para>&os;/i386 emulation on &os;/ia64</para>
 | |
| 	</listitem>
 | |
| 	<listitem>
 | |
| 	  <para>&linux;-emulation of &linux; operating system on &os;</para>
 | |
| 	</listitem>
 | |
| 	<listitem>
 | |
| 	  <para>NDIS-emulation of Windows networking drivers interface</para>
 | |
| 	</listitem>
 | |
| 	<listitem>
 | |
| 	  <para>NetBSD-emulation of NetBSD operating system</para>
 | |
| 	</listitem>
 | |
| 	<listitem>
 | |
| 	  <para>PECoff-support for PECoff &os; executables</para>
 | |
| 	</listitem>
 | |
| 	<listitem>
 | |
| 	  <para>SVR4-emulation of System V revision 4 &unix;</para>
 | |
| 	</listitem>
 | |
|       </itemizedlist>
 | |
| 
 | |
      <para>Actively developed emulations are the &linux; layer and various
	&os;-on-&os; layers.  The others are not expected to work properly or
	be usable these days.</para>
 | |
| 
 | |
      <para>&os; development happens in a central CVS repository where only
	a selected team of so-called committers can write.  This repository
	possesses several branches; the most interesting are the HEAD branch,
	in &os; nomenclature called -CURRENT, and RELENG_X branches, where X
	stands for a number indicating a major version of &os;.  As of
	December 2006, there are development branches for 6.X development
	(RELENG_6) and for the 5.X development (RELENG_5).  Other branches are
	closed and not actively maintained or only fed with security patches
	by the Security Officer of the &os; project.</para>
 | |
| 
 | |
      <para>Historically the active development was done in the HEAD branch, so
	it was considered extremely unstable and liable to break
	at any time.  This is not true any more, as the
	<application>Perforce</application> (commercial version control system)
	repository was introduced so that active development happens there.
	There are many branches in <application>Perforce</application> where
	development of certain parts of the system happens, and these branches
	are from time to time merged back to the main CVS repository, thus
	effectively putting the given feature into the &os; operating system.
	The same happened with the <filename>rdivacky_linuxolator</filename>
	branch where development of this thesis code was going on.</para>
 | |
| 
 | |
|       <para>More info about the &os; operating system can be found
 | |
| 	at [2].</para>
 | |
| 
 | |
|       <sect3 id="freebsd-tech-details">
 | |
| 	<title>Technical details</title>
 | |
| 
 | |
	<para>&os; is a traditional flavor of &unix; in the sense of dividing the
	  run of processes into two halves: kernel space and user space run.
	  There are two types of process entry to the kernel: a syscall and a
	  trap.  There is only one way to return.  In the subsequent sections
	  we will describe the three gates to/from the kernel.  The whole
	  description applies to the i386 architecture, as the Linuxulator
	  only exists there, but the concept is similar on other architectures.
	  The information was taken from [1] and the source code.</para>
 | |
| 
 | |
| 	<sect4 id="freebsd-sys-entries">
 | |
| 	  <title>System entries</title>
 | |
| 
 | |
| 	  <para>&os; has an abstraction called an execution class loader,
 | |
| 	    which is a wedge into the &man.execve.2; syscall.  This employs a
 | |
| 	    structure <literal>sysentvec</literal>, which describes an
 | |
| 	    executable ABI.  It contains things like errno translation table,
 | |
| 	    signal translation table, various functions to serve syscall needs
 | |
| 	    (stack fixup, coredumping, etc.).  Every ABI the &os; kernel wants
 | |
| 	    to support must define this structure, as it is used later in the
 | |
| 	    syscall processing code and at some other places.  System entries
 | |
| 	    are handled by trap handlers, where we can access both the
 | |
| 	    kernel-space and the user-space at once.</para>
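	  <para>To make the role of such an ABI descriptor more concrete,
	    here is a heavily reduced sketch.  It is not the real
	    <literal>sysentvec</literal>: the structure
	    <literal>ex_sysentvec</literal>, its three fields, the made-up
	    table contents and the fallback value for out-of-range errors are
	    all assumptions chosen for illustration.  It only shows how a
	    per-ABI errno translation table might be consulted.</para>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced ABI descriptor; not the real sysentvec. */
struct ex_sysentvec {
    const char *name;     /* human-readable ABI name */
    const int  *errtbl;   /* native errno -> ABI errno, NULL for identity */
    size_t      errsize;  /* number of entries in errtbl */
};

/* Made-up translation table: native errno i maps to ex_errtbl[i]. */
static const int ex_errtbl[] = { 0, -1, -2, -3, -4, -5 };

static const struct ex_sysentvec ex_native_abi = { "native", NULL, 0 };
static const struct ex_sysentvec ex_foreign_abi = {
    "foreign", ex_errtbl, sizeof(ex_errtbl) / sizeof(ex_errtbl[0])
};

/* Translate a native error number into what the process's ABI expects. */
static int ex_translate_errno(const struct ex_sysentvec *sv, int error)
{
    if (sv->errtbl == NULL)
        return error;       /* native ABI: nothing to translate */
    if (error < 0 || (size_t)error >= sv->errsize)
        return -1;          /* outside the table: assumed generic fallback */
    return sv->errtbl[error];
}
```

	  <para>The syscall return path would pick the descriptor belonging to
	    the current process and pass every error through it, so native
	    binaries see unchanged values while foreign ones see their own
	    numbering.</para>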
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 id="freebsd-syscalls">
 | |
| 	  <title>Syscalls</title>
 | |
| 
 | |
| 	  <para>Syscalls on &os; are issued by executing interrupt
 | |
| 	    <literal>0x80</literal> with register <varname>%eax</varname> set
 | |
| 	    to a desired syscall number with arguments passed on the stack.</para>
 | |
| 
 | |
	  <para>When a process issues an interrupt <literal>0x80</literal>, the
	    <literal>int0x80</literal> syscall trap handler is invoked (defined
	    in <filename>sys/i386/i386/exception.s</filename>), which prepares
	    arguments (i.e. copies them on to the stack) for a
	    call to the C function <literal>syscall()</literal> (defined in
	    <filename>sys/i386/i386/trap.c</filename>), which processes the
	    passed-in trapframe.  The processing consists of preparing the
	    syscall (depending on the <literal>sysvec</literal> entry) and
	    determining whether the syscall is a 32-bit or 64-bit one (which
	    changes the size of the parameters); then the parameters are
	    copied in, including the
	    syscall.  Next, the actual syscall function is executed and its
	    return code is processed (with special cases for
	    <literal>ERESTART</literal> and <literal>EJUSTRETURN</literal>
	    errors).  Finally, <literal>userret()</literal> is called,
	    switching the process back to userspace.  The parameters to
	    the actual syscall handler are passed in the form of
	    <literal>struct thread *td</literal>,
	    <literal>struct syscall_args *</literal> arguments, where the second
	    parameter is a pointer to the copied-in structure of
	    parameters.</para>
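	  <para>The shape of a handler and the processing of its return code
	    can be sketched as follows.  Everything here is a simplified
	    userspace model: the <literal>ex_</literal> structures are
	    hypothetical stand-ins for the kernel ones, and the numeric values
	    of the two special codes as well as the use of a carry flag to
	    signal failure are assumptions of this sketch rather than
	    statements from the text above.</para>

```c
#include <assert.h>

/* Assumed values for the two pseudo-errors named in the text. */
#define EX_ERESTART    (-1)   /* restart the syscall */
#define EX_EJUSTRETURN (-2)   /* leave the registers untouched */

/* Hypothetical, reduced stand-ins for the kernel structures. */
struct ex_thread    { long retval; };
struct ex_trapframe { long eax; long eip; int carry; };
struct ex_args      { long a; long b; };

/* A handler in the described form: thread pointer plus copied-in arguments. */
static int ex_sys_add(struct ex_thread *td, struct ex_args *uap)
{
    td->retval = uap->a + uap->b;
    return 0;                     /* 0 means success */
}

/* Store the handler's outcome in the trapframe. */
static void ex_set_result(struct ex_trapframe *tf, struct ex_thread *td,
                          int error, long insn_len)
{
    switch (error) {
    case 0:
        tf->eax = td->retval;     /* success: return value, carry clear */
        tf->carry = 0;
        break;
    case EX_ERESTART:
        tf->eip -= insn_len;      /* back up so the syscall is re-issued */
        break;
    case EX_EJUSTRETURN:
        break;                    /* registers are already set up */
    default:
        tf->eax = error;          /* failure: error number, carry set */
        tf->carry = 1;
        break;
    }
}
```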
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 id="freebsd-traps">
 | |
| 	  <title>Traps</title>
 | |
| 
 | |
	  <para>Handling of traps in &os; is similar to the handling of
	    syscalls.  Whenever a trap occurs, an assembler handler is called.
	    It is chosen between <literal>alltraps</literal>,
	    <literal>alltraps_with_regs_pushed</literal> and
	    <literal>calltrap</literal>, depending on the type of the trap.
	    This handler prepares
	    arguments for a call to the C function <literal>trap()</literal>
	    (defined in <filename>sys/i386/i386/trap.c</filename>), which then
	    processes the trap.  After the processing it might send a
	    signal to the process and/or exit to userland using
	    <literal>userret()</literal>.</para>
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 id="freebsd-exits">
 | |
| 	  <title>Exits</title>
 | |
| 
 | |
| 	  <para>Exits from kernel to userspace happen using the assembler
 | |
| 	    routine <literal>doreti</literal> regardless of whether the kernel
 | |
| 	    was entered via a trap or via a syscall.  This restores the program
 | |
| 	    status from the stack and returns to the userspace.</para>
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 id="freebsd-unix-primitives">
 | |
| 	  <title>&unix; primitives</title>
 | |
| 
 | |
	  <para>The &os; operating system adheres to the traditional &unix; scheme,
	    where every process has a unique identification number, the
	    so-called <firstterm>PID</firstterm> (Process ID).  PID numbers are
	    allocated either linearly or randomly, ranging from
	    <literal>0</literal> to <literal>PID_MAX</literal>.  The allocation
	    of PID numbers is done using a linear search of the PID space.  Every
	    thread in a process receives the same PID number as a result of the
	    &man.getpid.2; call.</para>
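	  <para>Linear allocation with a linear search of the PID space can be
	    modelled in a few lines.  This is only a sketch under stated
	    assumptions, not the kernel's allocator: the limit
	    <literal>EX_PID_MAX</literal> is tiny on purpose, PID 0 is treated
	    as reserved, and all <literal>ex_</literal> names are
	    hypothetical.</para>

```c
#include <assert.h>
#include <stdbool.h>

#define EX_PID_MAX 15            /* tiny limit so the sketch is easy to follow */

static bool ex_pid_in_use[EX_PID_MAX + 1];
static int  ex_last_pid = 0;     /* PID 0 is treated as reserved here */

/*
 * Linear allocation: start just after the last PID handed out and scan
 * forward, wrapping around once, until a free slot is found.
 * Returns the new PID, or -1 if every PID is taken.
 */
static int ex_alloc_pid(void)
{
    for (int tried = 0; tried < EX_PID_MAX; tried++) {
        int pid = ex_last_pid + 1;

        if (pid > EX_PID_MAX)
            pid = 1;             /* wrap, skipping reserved PID 0 */
        ex_last_pid = pid;
        if (!ex_pid_in_use[pid]) {
            ex_pid_in_use[pid] = true;
            return pid;
        }
    }
    return -1;
}

/* Mark a PID as free again so a later scan can reuse it. */
static void ex_free_pid(int pid)
{
    if (pid >= 1 && pid <= EX_PID_MAX)
        ex_pid_in_use[pid] = false;
}
```

	  <para>Note that a freed PID is not reused immediately: the scan
	    continues from the last allocated number and only reaches the freed
	    slot after wrapping around.</para>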
 | |
| 
 | |
	  <para>There are currently two ways to implement threading in &os;.
	    The first way is M:N threading, followed by the 1:1 threading model.
	    The default library used is M:N threading
	    (<literal>libpthread</literal>) and you can switch at runtime to
	    1:1 threading (<literal>libthr</literal>).  The plan is to switch
	    to the 1:1 library by default soon.  Although those two libraries use
	    the same kernel primitives, they are accessed through different
	    APIs.  The M:N library uses the <literal>kse_*</literal> family
	    of syscalls while the 1:1 library uses the <literal>thr_*</literal>
	    family of syscalls.  Because of this, there is no general concept
	    of thread ID shared between kernel and userspace.  Of course, both
	    threading libraries implement the pthread thread ID API.  Every
	    kernel thread (as described by <literal>struct thread</literal>)
	    has a <literal>td_tid</literal> identifier, but this is not directly accessible
	    from userland and solely serves the kernel's needs.  It is also
	    used by the 1:1 threading library as the pthread thread ID, but handling
	    of this is internal to the library and cannot be relied on.</para>
 | |
| 
 | |
	  <para>As stated previously, there are two implementations of threading
	    in &os;.  The M:N library divides the work between kernel space and
	    userspace.  A thread is an entity that gets scheduled in the kernel,
	    but it can represent a varying number of userspace threads.
	    M userspace threads get mapped to N kernel threads, thus saving
	    resources while keeping the ability to exploit multiprocessor
	    parallelism.  Further information about the implementation can be
	    obtained from the man page or [1].  The 1:1 library directly maps a
	    userland thread to a kernel thread, thus greatly simplifying the
	    scheme.  Neither of these designs implements a fairness mechanism (such
	    a mechanism was implemented but it was removed recently because it
	    caused serious slowdown and made the code more difficult to deal
	    with).</para>
 | |
| 	</sect4>
 | |
|       </sect3>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 id="what-is-linux">
 | |
|       <title>What is &linux;</title>
 | |
| 
 | |
      <para>&linux; is a &unix;-like kernel originally developed by Linus
	Torvalds, and now being contributed to by a massive crowd of
	programmers all around the world.  From its modest beginnings to today,
	with wide support from companies such as IBM or Google, &linux; has
	been associated with its fast development pace, full hardware support
	and benevolent dictator model of organization.</para>
 | |
| 
 | |
|       <para>&linux; development started in 1991 as a hobbyist project at
 | |
| 	University of Helsinki in Finland.  Since then it has obtained all the
 | |
| 	features of a modern &unix;-like OS: multiprocessing, multiuser
 | |
| 	support, virtual memory, networking, basically everything is there.
 | |
| 	There are also highly advanced features like virtualization etc.</para>
 | |
| 
 | |
|       <para>As of 2006 &linux; seems to be the most widely used open source
 | |
| 	operating system with support from independent software vendors like
 | |
| 	Oracle, RealNetworks, Adobe, etc.  Most of the commercial software
 | |
| 	distributed for &linux; can only be obtained in a binary form so
 | |
| 	recompilation for other operating systems is impossible.</para>
 | |
| 
 | |
|       <para>Most of the &linux; development happens in a
 | |
| 	<application>Git</application> version control system.
 | |
| 	<application>Git</application> is a distributed system so there is
 | |
| 	no central source of the &linux; code, but some branches are considered
 | |
| 	prominent and official.  The version number scheme implemented by
 | |
| 	&linux; consists of four numbers A.B.C.D.  Currently development
 | |
| 	happens in 2.6.C.D, where C represents major version, where new
 | |
| 	features are added or changed while D is a minor version for bugfixes
 | |
| 	only.</para>
 | |
| 
 | |
|       <para>More information can be obtained from [4].</para>
 | |
| 
 | |
|       <sect3 id="linux-tech-details">
 | |
| 	<title>Technical details</title>
 | |
| 
 | |
|       <para>&linux; follows the traditional &unix; scheme of dividing the run
 | |
| 	of a process in two halves: the kernel and user space.  The kernel can
 | |
| 	be entered in two ways: via a trap or via a syscall.  The return is
 | |
| 	handled only in one way.  The further description applies to
 | |
| 	&linux; 2.6 on the &i386; architecture.  This information was
 | |
| 	taken from [3].</para>
 | |
| 
 | |
| 	<sect4 id="linux-syscalls">
 | |
| 	  <title>Syscalls</title>
 | |
| 
 | |
	  <para>Syscalls in &linux; are performed (in userspace) using
	    <literal>_syscallX</literal> macros, where X stands for a number
	    representing the number of parameters of the given syscall.  This
	    macro translates to code that loads the <varname>%eax</varname>
	    register with the number of the syscall and executes interrupt
	    <literal>0x80</literal>.  After this, the
	    <literal>__syscall_return</literal> macro is invoked,
	    which translates negative return values to positive
	    <literal>errno</literal> values and sets <literal>res</literal> to
	    <literal>-1</literal> in case of an error.  Whenever the interrupt
	    <literal>0x80</literal> is called, the process enters the kernel in
	    the system call trap handler.  This routine saves all registers on the
	    stack and calls the selected syscall entry.  Note that the &linux;
	    calling convention expects parameters to the syscall to be passed
	    via registers as shown here:</para>
 | |
| 
 | |
| 	  <orderedlist>
 | |
| 	    <listitem>
 | |
| 	      <para>parameter -> <varname>%ebx</varname></para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>parameter -> <varname>%ecx</varname></para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>parameter -> <varname>%edx</varname></para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>parameter -> <varname>%esi</varname></para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>parameter -> <varname>%edi</varname></para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>parameter -> <varname>%ebp</varname></para>
 | |
| 	    </listitem>
 | |
| 	  </orderedlist>
 | |
| 
 | |
	  <para>There are some exceptions to this, where &linux; uses a different
	    calling convention (most notably the <literal>clone</literal>
	    syscall).</para>
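	  <para>Both halves of the convention described in this section can be
	    modelled in plain C.  The sketch below is an illustration only: the
	    structure <literal>ex_regs</literal> and both functions are
	    hypothetical, and the cutoff of 4095 used to tell error codes from
	    valid results is an assumption of this sketch, not a value taken
	    from the text.</para>

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical register snapshot taken at the int 0x80 entry. */
struct ex_regs {
    long eax;                            /* syscall number */
    long ebx, ecx, edx, esi, edi, ebp;   /* parameters 1 to 6 */
};

/* Collect the up-to-six arguments in the register order listed above. */
static void ex_regs_to_args(const struct ex_regs *r, long args[6])
{
    args[0] = r->ebx;
    args[1] = r->ecx;
    args[2] = r->edx;
    args[3] = r->esi;
    args[4] = r->edi;
    args[5] = r->ebp;
}

/*
 * Userspace side of the return path: the kernel hands back either a result
 * or a negated error number.  Small negative values are turned into a
 * positive errno and a result of -1; the 4095 cutoff is assumed.
 */
static long ex_syscall_return(long res)
{
    if (res < 0 && res >= -4095) {
        errno = (int)-res;
        return -1;
    }
    return res;
}
```

	  <para>An emulation layer that receives arguments in registers but
	    whose native handlers expect them elsewhere needs a conversion step
	    of the first kind on entry and one of the second kind on
	    return.</para>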
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 id="linux-traps">
 | |
| 	  <title>Traps</title>
 | |
| 
 | |
| 	  <para>The trap handlers are introduced in
 | |
| 	    <filename>arch/i386/kernel/traps.c</filename> and most of these
 | |
| 	    handlers live in <filename>arch/i386/kernel/entry.S</filename>,
 | |
| 	    where handling of the traps happens.</para>
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 id="linux-exits">
 | |
| 	  <title>Exits</title>
 | |
| 
 | |
	  <para>Return from the syscall is managed by <literal>syscall_exit</literal>,
	    which checks for the process having unfinished work, then checks
	    whether we used user-supplied selectors.  If so, stack
	    fixing is applied, and finally the registers are restored from the
	    stack and the process returns to userspace.</para>
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 id="linux-unix-primitives">
 | |
| 	  <title>&unix; primitives</title>
 | |
| 
 | |
	  <para>In the 2.6 version, the &linux; operating system redefined some
	    of the traditional &unix; primitives, notably PID, TID and thread.
	    PID is defined not to be unique for every process, so for some
	    processes (threads) &man.getpid.2; returns the same value.  Unique
	    identification of a process is provided by TID.  This is because
	    <firstterm>NPTL</firstterm> (New &posix; Thread Library) defines
	    threads to be normal processes (so-called 1:1 threading).  Spawning
	    a new process in &linux; 2.6 happens using the
	    <literal>clone</literal> syscall (fork variants are reimplemented using
	    it).  This clone syscall defines a set of flags that affect the
	    behaviour of the cloning process regarding thread implementation.
	    The semantics are a bit fuzzy, as there is no single flag telling the
	    syscall to create a thread.</para>
 | |
| 
 | |
| 	  <para>Implemented clone flags are:</para>
 | |
| 
 | |
| 	  <itemizedlist>
 | |
| 	    <listitem>
 | |
| 	      <para><literal>CLONE_VM</literal> - processes share their memory
 | |
| 		space</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para><literal>CLONE_FS</literal> - share umask, cwd and
 | |
| 		namespace</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para><literal>CLONE_FILES</literal> - share open
 | |
| 		files</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para><literal>CLONE_SIGHAND</literal> - share signal handlers
 | |
| 		and blocked signals</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para><literal>CLONE_PARENT</literal> - share parent</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para><literal>CLONE_THREAD</literal> - be thread (further
 | |
| 		explanation below)</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para><literal>CLONE_NEWNS</literal> - new namespace</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para><literal>CLONE_SYSVSEM</literal> - share SysV undo
 | |
| 		structures</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para><literal>CLONE_SETTLS</literal> - setup TLS at supplied
 | |
| 		address</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para><literal>CLONE_PARENT_SETTID</literal> - set TID in the
 | |
| 		parent</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para><literal>CLONE_CHILD_CLEARTID</literal> - clear TID in the
 | |
| 		child</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para><literal>CLONE_CHILD_SETTID</literal> - set TID in the
 | |
| 		child</para>
 | |
| 	    </listitem>
 | |
| 	  </itemizedlist>
 | |
| 
 | |
| 	  <para><literal>CLONE_PARENT</literal> sets the real parent to the
 | |
| 	    parent of the caller.  This is useful for threads because if thread
 | |
| 	    A creates thread B we want thread B to be parented to the parent of
 | |
| 	    the whole thread group.  <literal>CLONE_THREAD</literal> does
 | |
| 	    exactly the same thing as <literal>CLONE_PARENT</literal>,
 | |
| 	    <literal>CLONE_VM</literal> and <literal>CLONE_SIGHAND</literal>,
 | |
| 	    rewrites PID to be the same as PID of the caller, sets exit signal
 | |
| 	    to be none and enters the thread group.
 | |
| 	    <literal>CLONE_SETTLS</literal> sets up GDT entries for TLS
 | |
| 	    handling.  The <literal>CLONE_*_*TID</literal> set of flags
 | |
| 	    sets/clears user supplied address to TID or 0.</para>
 | |
| 
 | |
	  <para>As you can see, <literal>CLONE_THREAD</literal> does most
	    of the work and does not seem to fit the scheme very well.  The
	    original intention is unclear (even to the authors, according to
	    comments in the code), but I think originally there was one
	    threading flag, which was then parcelled among many other flags,
	    but this separation was never fully finished.  It is also unclear
	    what this partition is good for, as glibc does not use it, so only
	    hand-written use of clone permits a programmer to access these
	    features.</para>
 | |
| 
 | |
	  <para>For non-threaded programs the PID and TID are the same.  For
	    threaded programs the PID and TID of the first thread are the same,
	    and every subsequently created thread shares that PID and is
	    assigned a unique TID (because <literal>CLONE_THREAD</literal> is
	    passed in).  The parent is also shared by all processes forming
	    the threaded program.</para>
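	  <para>The PID/TID rule above can be illustrated with a small
	    user-space model.  This is only a sketch: the structure, the
	    function names and the starting ID are invented for the example
	    and do not correspond to any kernel code.</para>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, minimal view of a schedulable entity. */
struct task_ids {
    int pid;    /* shared by every thread of one program */
    int tid;    /* unique per thread */
};

static int next_id = 100;   /* arbitrary starting value for the model */

/* Start a new program: its first thread gets identical PID and TID. */
struct task_ids ids_for_new_process(void)
{
    struct task_ids t;

    t.pid = t.tid = next_id++;
    return t;
}

/* CLONE_THREAD-style creation: inherit the creator's PID, take a fresh TID. */
struct task_ids ids_for_new_thread(const struct task_ids *creator)
{
    struct task_ids t;

    assert(creator != NULL);
    t.pid = creator->pid;
    t.tid = next_id++;
    return t;
}
```

	  <para>A thread created by another thread (not only by the first
	    one) still ends up with the PID of the whole program in this
	    model.</para>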
 | |
| 
 | |
| 	  <para>The code that implements &man.pthread.create.3; in NPTL defines
 | |
| 	    the clone flags like this:</para> 
 | |
| 
 | |
	  <programlisting>int clone_flags = (CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGNAL
    | CLONE_SETTLS | CLONE_PARENT_SETTID
    | CLONE_CHILD_CLEARTID | CLONE_SYSVSEM
#if __ASSUME_NO_CLONE_DETACHED == 0
    | CLONE_DETACHED
#endif
    | 0);</programlisting>
 | |
| 
 | |
	  <para><literal>CLONE_SIGNAL</literal> is defined as:</para>
 | |
| 
 | |
| 	  <programlisting>#define CLONE_SIGNAL (CLONE_SIGHAND | CLONE_THREAD)</programlisting>
 | |
| 
 | |
	  <para>The trailing <literal>0</literal> in the
	    <literal>clone_flags</literal> expression means that no signal is
	    sent when any of the threads exits.</para>
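	  <para>To make the role of that trailing <literal>0</literal>
	    concrete, the following sketch recomputes the flag word in plain
	    C.  The numeric values are the ones found in the &linux; headers,
	    reproduced here as an assumption; the two function names are
	    invented for the example, and <literal>CLONE_DETACHED</literal>
	    is left out.</para>

```c
#include <assert.h>

/* Values as found in the Linux headers; reproduced here as an assumption. */
#define CSIGNAL              0x000000ffUL   /* mask for the exit signal */
#define CLONE_VM             0x00000100UL
#define CLONE_FS             0x00000200UL
#define CLONE_FILES          0x00000400UL
#define CLONE_SIGHAND        0x00000800UL
#define CLONE_THREAD         0x00010000UL
#define CLONE_SYSVSEM        0x00040000UL
#define CLONE_SETTLS         0x00080000UL
#define CLONE_PARENT_SETTID  0x00100000UL
#define CLONE_CHILD_CLEARTID 0x00200000UL
#define CLONE_SIGNAL         (CLONE_SIGHAND | CLONE_THREAD)

/* Mirrors the NPTL expression quoted above (without CLONE_DETACHED). */
unsigned long nptl_clone_flags(void)
{
    return (CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGNAL
        | CLONE_SETTLS | CLONE_PARENT_SETTID
        | CLONE_CHILD_CLEARTID | CLONE_SYSVSEM
        | 0);
}

/* The low byte of the flag word carries the signal sent on exit. */
int clone_exit_signal(unsigned long flags)
{
    return (int)(flags & CSIGNAL);
}
```

	  <para>Because none of the flags touches the low byte, the exit
	    signal extracted from the NPTL flag word is 0.</para>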
 | |
| 	</sect4>
 | |
|       </sect3>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 id="what-is-emu">
 | |
|       <title>What is emulation</title>
 | |
| 
 | |
      <para>According to a dictionary definition, emulation is the ability of
	a program or device to imitate another program or device.  This is
	achieved by providing the same reaction to a given stimulus as the
	emulated object.  In practice, the software world mostly sees three
	types of emulation: a program used to emulate a machine (QEMU, various
	game console emulators etc.), software emulation of a hardware facility
	(OpenGL emulators, floating point unit emulation etc.) and operating
	system emulation (either in the kernel of the operating system or as a
	userspace program).</para>
 | |
| 
 | |
      <para>Emulation is usually used where using the original component is
	not feasible or not possible at all.  For example, someone might want
	to use a program developed for a different operating system than the
	one they use.  Then emulation comes in handy.  Sometimes there is no
	other way but to use emulation, e.g. when the hardware device you try
	to use does not exist (yet/anymore).  This happens often when porting
	an operating system to a new (non-existent) platform.  Sometimes it is
	just cheaper to emulate.</para>
 | |
| 
 | |
      <para>From an implementation point of view, there are two main
	approaches to emulation.  You can emulate the whole thing: accept the
	possible inputs of the original object, maintain inner state and emit
	correct output based on the state and/or input.  This kind of emulation
	does not require any special conditions and can basically be
	implemented anywhere for any device/program.  The drawback is that
	implementing such emulation is quite difficult, time-consuming and
	error-prone.  In some cases we can use a simpler approach.  Imagine you
	want to emulate a printer that prints from left to right on a printer
	that prints from right to left.  Obviously there is no need for a
	complex emulation layer; simply reversing the printed text is
	sufficient.  Sometimes the emulating environment is so similar to the
	emulated one that just a thin layer of translation is necessary to
	provide fully working emulation.  This is much less demanding to
	implement, and so less time-consuming and error-prone, than the
	previous approach.  But the necessary condition is that the two
	environments must be similar enough.  A third approach combines the
	two previous ones.  Most of the time the objects do not provide the
	same capabilities, so when emulating the more powerful one on the less
	powerful one we have to emulate the missing features with the full
	emulation described above.</para>
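      <para>The printer example can be written down in a few lines, which
	shows how small such a translation layer can be.  This is a sketch
	only; the function name is made up for the illustration.</para>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Thin translation layer for the printer example: reverse one line in
 * place so that a right-to-left printer produces left-to-right output.
 */
void reverse_line(char *line)
{
    size_t len = strlen(line);

    for (size_t i = 0; i < len / 2; i++) {
        char tmp = line[i];

        line[i] = line[len - 1 - i];
        line[len - 1 - i] = tmp;
    }
}
```

      <para>No state is kept between calls, which is what distinguishes this
	approach from full emulation.</para>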
 | |
| 
 | |
      <para>This master's thesis deals with emulation of &unix; on &unix;,
	which is exactly the case where only a thin layer of translation is
	sufficient to provide full emulation.  The &unix; API consists of a set
	of syscalls, which are usually self-contained and do not affect any
	global kernel state.</para>
 | |
| 
 | |
|       <para>There are a few syscalls that affect inner state but this can be
 | |
| 	dealt with by providing some structures that maintain the extra
 | |
| 	state.</para>
 | |
| 
 | |
      <para>No emulation is perfect and emulations tend to lack some parts, but
	this usually does not cause any serious drawbacks.  Imagine a game
	console emulator that emulates everything but music output.  No doubt
	the games are playable and one can use the emulator.  It might not be
	as comfortable as the original game console, but it is an acceptable
	compromise between price and comfort.</para>
 | |
| 
 | |
      <para>The same goes for the &unix; API.  Most programs can live with a
	very limited set of working syscalls.  Those syscalls tend to be the
	oldest ones (&man.read.2;/&man.write.2;, &man.fork.2; family,
	&man.signal.3; handling, &man.exit.3;, &man.socket.2; API), hence they
	are easy to emulate because their semantics are shared among all
	&unix;es that exist today.</para>
 | |
|     </sect2>
 | |
|   </sect1>
 | |
| 
 | |
|   <sect1 id="freebsd-emulation">
 | |
|     <title>Emulation</title>
 | |
| 
 | |
|     <sect2>
 | |
|       <title>How emulation works in &os;</title>
 | |
| 
 | |
|       <para>As stated earlier, &os; supports running binaries from several
 | |
| 	other &unix;es.  This works because &os; has an abstraction called the
 | |
| 	execution class loader.  This wedges into the &man.execve.2; syscall,
 | |
| 	so when &man.execve.2; is about to execute a binary it examines its
 | |
| 	type.</para>
 | |
| 
 | |
      <para>There are basically two types of binaries in &os;: shell-like text
	scripts, which are identified by <literal>#!</literal> as their first
	two characters, and normal (typically <firstterm>ELF</firstterm>)
	binaries, which are a representation of a compiled executable object.
	The vast majority (one could say all of them) of binaries in &os; are
	of type ELF.  ELF files contain a header, which specifies the OS ABI
	for this ELF file.  By reading this information, the operating system
	can accurately determine what type of binary the given file is.</para>
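      <para>As an illustration of reading the OS ABI from the header, the
	sketch below inspects the identification bytes of an ELF image held in
	memory.  The offset and the two ABI values come from the ELF
	specification; the function name is invented, and a real loader may
	consult more than this single byte.</para>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Offsets and values from the ELF specification. */
#define EI_NIDENT        16  /* size of the identification array */
#define EI_OSABI         7   /* index of the OS ABI byte */
#define ELFOSABI_LINUX   3
#define ELFOSABI_FREEBSD 9

/*
 * Return the OS ABI byte of an ELF image, or -1 if the buffer is too
 * short or does not start with the ELF magic "\177ELF".
 */
int elf_osabi(const unsigned char *image, size_t len)
{
    static const unsigned char magic[4] = { 0x7f, 'E', 'L', 'F' };

    if (len < EI_NIDENT || memcmp(image, magic, sizeof(magic)) != 0)
        return -1;
    return image[EI_OSABI];
}
```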
 | |
| 
 | |
      <para>Every OS ABI must be registered in the &os; kernel.  This applies
	to the &os; native OS ABI as well.  So when &man.execve.2; executes a
	binary, it iterates through the list of registered ABIs, and when it
	finds the right one it starts to use the information contained in the
	OS ABI description (its syscall table, <literal>errno</literal>
	translation table, etc.).  So every time the process calls a syscall,
	it uses its own set of syscalls instead of some global one.  This
	provides a very elegant and easy way of supporting execution of
	various binary formats.</para>
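      <para>The registration and lookup just described can be modelled in
	user space as a linked list of ABI descriptions.  This is a minimal
	sketch under the assumption that an ABI is selected by a single
	integer; every name in it is hypothetical and much simpler than the
	kernel's real structures.</para>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced description of one registered OS ABI. */
struct abi_model {
    int               osabi;      /* value found in the ELF header */
    const char       *name;
    const int        *errno_map;  /* per-ABI errno translation table */
    struct abi_model *next;       /* registered ABIs form a simple list */
};

static struct abi_model *abi_list;

/* Make an ABI known to the (modelled) kernel. */
void abi_register(struct abi_model *abi)
{
    abi->next = abi_list;
    abi_list = abi;
}

/* What the exec path does: walk the list until an entry claims the binary. */
struct abi_model *abi_lookup(int osabi)
{
    for (struct abi_model *abi = abi_list; abi != NULL; abi = abi->next)
        if (abi->osabi == osabi)
            return abi;
    return NULL;
}
```

      <para>Once the lookup has returned a description, the process keeps
	using the tables of that description, which is why no global
	syscall table is needed.</para>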
 | |
| 
 | |
      <para>The nature of emulation of different OSes (and also some other
	subsystems) led developers to invent an event handler mechanism.  There
	are various places in the kernel where a list of event handlers is
	called.  Every subsystem can register an event handler and the handlers
	are called accordingly.  For example, when a process exits, a handler
	is called that cleans up whatever the subsystem needs to have cleaned
	up.</para>
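      <para>A user-space model of such a handler list might look as follows.
	It is a sketch only: the fixed-size table and all identifiers are
	assumptions made for the example, not the kernel interface.</para>

```c
#include <assert.h>
#include <stddef.h>

#define MAX_EXIT_HANDLERS 8

typedef void (*exit_handler_fn)(int pid, void *arg);

struct exit_handler {
    exit_handler_fn fn;
    void           *arg;
};

static struct exit_handler exit_handlers[MAX_EXIT_HANDLERS];
static size_t n_exit_handlers;

/* A subsystem registers interest in process exit; -1 means the table is full. */
int exit_handler_register(exit_handler_fn fn, void *arg)
{
    if (n_exit_handlers == MAX_EXIT_HANDLERS)
        return -1;
    exit_handlers[n_exit_handlers].fn = fn;
    exit_handlers[n_exit_handlers].arg = arg;
    n_exit_handlers++;
    return 0;
}

/* Called from the modelled exit path: every registered handler runs in turn. */
void exit_handlers_invoke(int pid)
{
    for (size_t i = 0; i < n_exit_handlers; i++)
        exit_handlers[i].fn(pid, exit_handlers[i].arg);
}

/* Example handler: an emulation layer counting the exits it cleaned up. */
void count_exit(int pid, void *arg)
{
    (void)pid;
    (*(int *)arg)++;
}
```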
 | |
| 
 | |
      <para>These simple facilities provide basically everything that is needed
	for the emulation infrastructure; in fact, they are the only things
	necessary to implement the &linux; emulation layer.</para>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 id="freebsd-common-primitives">
 | |
|       <title>Common primitives in the &os; kernel</title>
 | |
| 
 | |
|       <para>Emulation layers need some support from the operating system.  I am
 | |
| 	going to describe some of the supported primitives in the &os;
 | |
| 	operating system.</para>
 | |
| 
 | |
|       <sect3 id="freebsd-locking-primitives">
 | |
| 	<title>Locking primitives</title>
 | |
| 
 | |
| 	<para>Contributed by: &a.attilio;</para>
 | |
| 
 | |
	<para>The &os; synchronization primitive set is based on the idea of
	  supplying a rather large number of different primitives, so that
	  the best-suited one can be used in each particular
	  situation.</para>
 | |
| 
 | |
	<para>From a high-level point of view, you can consider three kinds of
	  synchronization primitives in the &os; kernel:</para>
 | |
| 
 | |
| 	<itemizedlist>
 | |
| 	  <listitem>
 | |
| 	    <para>atomic operations and memory barriers</para>
 | |
| 	  </listitem>
 | |
| 	  <listitem>
 | |
| 	    <para>locks</para>
 | |
| 	  </listitem>
 | |
| 	  <listitem>
 | |
| 	    <para>scheduling barriers</para>
 | |
| 	  </listitem>
 | |
| 	</itemizedlist>
 | |
| 
 | |
	<para>Descriptions of the three families follow.  For every lock, you
	  should check the linked manpage (where possible) for more detailed
	  explanations.</para>
 | |
| 
 | |
| 	<sect4 id="freebsd-atomic-op">
 | |
| 	  <title>Atomic operations and memory barriers</title>
 | |
| 
 | |
	  <para>Atomic operations are implemented through a set of functions
	    performing simple arithmetic on memory operands in an atomic way
	    with respect to external events (interrupts, preemption, etc.).
	    Atomic operations can guarantee atomicity only on small data types
	    (on the order of magnitude of the architecture's
	    <literal>long</literal> C data type), so they should rarely be
	    used directly in end-level code, except for very simple operations
	    (like setting a flag in a bitmap, for example).  In fact, it is
	    rather easy and common to write incorrect code based on just atomic
	    operations (usually referred to as lock-less).  The &os; kernel
	    offers a way to perform atomic operations in conjunction with a
	    memory barrier.  Memory barriers guarantee that an atomic
	    operation happens in some specified order with respect to other
	    memory accesses.  For example, if we need an atomic operation to
	    happen only after all other pending writes (in terms of instruction
	    reordering and buffering activity) are completed, we need to
	    explicitly use a memory barrier in conjunction with this atomic
	    operation.  So it is simple to understand why memory barriers play
	    a key role in building higher-level locks (such as refcounts,
	    mutexes, etc.).  For a detailed explanation of atomic operations,
	    please refer to &man.atomic.9;.  It is worth noting, however, that
	    atomic operations (and memory barriers as well) should ideally
	    only be used for building front-end locks (such as
	    mutexes).</para>
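	  <para>The two uses mentioned above, setting a flag in a bitmap and
	    ordering an atomic operation after earlier writes, can be
	    sketched in user space with C11
	    <filename>stdatomic.h</filename>.  This is a stand-in chosen so
	    that the example runs outside the kernel; it is not the
	    &man.atomic.9; interface, and all identifiers are
	    hypothetical.</para>

```c
#include <assert.h>
#include <stdatomic.h>

/* A shared bitmap of flags; one machine word is small enough to be atomic. */
static atomic_ulong flags_bitmap;

/* Payload published by a writer and consumed by a reader. */
static int payload;
static atomic_int payload_ready;

/* Atomically set one flag bit without disturbing its neighbours. */
void flag_set(unsigned bit)
{
    atomic_fetch_or_explicit(&flags_bitmap, 1UL << bit, memory_order_relaxed);
}

int flag_test(unsigned bit)
{
    unsigned long word =
        atomic_load_explicit(&flags_bitmap, memory_order_relaxed);

    return (int)((word >> bit) & 1UL);
}

/* Release store: the payload write is ordered before the flag becomes set. */
void publish(int value)
{
    payload = value;
    atomic_store_explicit(&payload_ready, 1, memory_order_release);
}

/* Acquire load: if the flag is seen as set, the payload write is seen too. */
int consume(int *out)
{
    if (atomic_load_explicit(&payload_ready, memory_order_acquire) == 0)
        return 0;
    *out = payload;
    return 1;
}
```

	  <para>The release/acquire pair is the barrier: without it the
	    reader could observe the flag before the payload.</para>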
 | |
| 
 | |
	</sect4>

	<sect4 id="freebsd-refcounts">
 | |
| 	  <title>Refcounts</title>
 | |
| 
 | |
	  <para>Refcounts are interfaces for handling reference counters.
	    They are implemented through atomic operations and are intended to
	    be used only in cases where the reference counter is the only
	    thing to be protected, so that even something like a spin mutex
	    is discouraged.  Using the refcount interface for structures where
	    a mutex is already used is often wrong, since we should probably
	    update the reference counter in some already protected path.  A
	    manpage discussing refcount does not exist currently; check
	    <filename>sys/refcount.h</filename> for an overview of the
	    existing API.</para>
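	  <para>The idea of a counter protected by nothing but atomic
	    operations can be sketched as follows.  The code uses C11 atomics
	    in user space, and the names are invented for the example; they
	    are not the functions declared in
	    <filename>sys/refcount.h</filename>.</para>

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical object whose only shared mutable state is its counter. */
struct object {
    atomic_uint refs;
};

/* A freshly created object is owned by its creator: one reference. */
void obj_init(struct object *o)
{
    atomic_init(&o->refs, 1);
}

void obj_acquire(struct object *o)
{
    atomic_fetch_add_explicit(&o->refs, 1, memory_order_relaxed);
}

/* Returns 1 when the caller dropped the last reference and may free it. */
int obj_release(struct object *o)
{
    unsigned old =
        atomic_fetch_sub_explicit(&o->refs, 1, memory_order_acq_rel);

    assert(old > 0);    /* releasing an unreferenced object is a bug */
    return old == 1;
}
```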
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 id="freebsd-locks">
 | |
| 	  <title>Locks</title>
 | |
| 
 | |
	  <para>The &os; kernel has a large number of lock classes.  Every
	    lock is defined by some peculiar properties, but probably the most
	    important is the event linked to contending threads (in other
	    words, the behaviour of threads unable to acquire the lock).
	    &os;'s locking scheme presents three different behaviours for
	    contenders:</para>
 | |
| 
 | |
| 	  <orderedlist>
 | |
| 	    <listitem>
 | |
| 	      <para>spinning</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>blocking</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>sleeping</para>
 | |
| 	    </listitem>
 | |
| 	  </orderedlist>
 | |
| 
 | |
| 	  <note>
 | |
	    <para>The numbering is not arbitrary: it defines the lock levels
	      referred to later.</para>
 | |
| 	  </note>
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 id="freebsd-spinlocks">
 | |
| 	  <title>Spinning locks</title>
 | |
| 
 | |
	  <para>Spin locks let waiters spin until they can acquire the
	    lock.  An important point is that a thread contending on a spin
	    lock is not descheduled.  Since the &os; kernel is preemptive,
	    this exposes spin locks to the risk of deadlocks, which can be
	    solved only by disabling interrupts while the locks are held.
	    For this and other reasons (like lack of priority propagation
	    support, poor load balancing between CPUs, etc.), spin locks are
	    intended to protect very small paths of code, or ideally not to
	    be used at all unless explicitly required (explained
	    later).</para>
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 id="freebsd-blocking">
 | |
| 	  <title>Blocking</title>
 | |
| 
 | |
	  <para>Block locks let waiters be descheduled and blocked until
	    the lock owner drops the lock and wakes up one or more
	    contenders.  In order to avoid starvation issues, blocking locks
	    do priority propagation from the waiters to the owner.  Block
	    locks must be implemented through the turnstile interface and are
	    intended to be the most used kind of lock in the kernel, unless
	    particular conditions apply.</para>
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 id="freebsd-sleeping">
 | |
| 	  <title>Sleeping</title>
 | |
| 
 | |
	  <para>Sleep locks let waiters be descheduled and fall asleep
	    until the lock holder drops the lock and wakes up one or more
	    waiters.  Since sleep locks are intended to protect large paths
	    of code and to cater for asynchronous events, they do not do any
	    form of priority propagation.  They must be implemented through
	    the &man.sleepqueue.9; interface.</para>
 | |
| 
 | |
	  <para>The order used to acquire locks is very important, not only
	    because of the possibility of deadlock due to lock order
	    reversals, but also because lock acquisition should follow
	    specific rules linked to the nature of the locks.  Looking at the
	    numbered list above, the practical rule is that if a thread holds
	    a lock of level n (where the level is the number listed next to
	    the kind of lock) it is not allowed to acquire a lock of a higher
	    level, since this would break the specified semantics for a path.
	    For example, if a thread holds a block lock (level 2), it is
	    allowed to acquire a spin lock (level 1) but not a sleep lock
	    (level 3), since block locks are intended to protect smaller paths
	    than sleep locks (these rules do not apply to atomic operations or
	    scheduling barriers, however).</para>
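	  <para>The level rule can be stated as a one-line predicate.  The
	    sketch below is only a model of the rule as described here; the
	    enumeration and function names are invented, and the kernel does
	    not expose such a function.</para>

```c
#include <assert.h>

/* Levels follow the numbering of the three contender behaviours. */
enum lock_level {
    LOCK_SPIN  = 1,
    LOCK_BLOCK = 2,
    LOCK_SLEEP = 3
};

/*
 * A thread holding a lock of level `held` may only take locks of the
 * same or a lower level.
 */
int may_acquire(enum lock_level held, enum lock_level wanted)
{
    return wanted <= held;
}
```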
 | |
| 
 | |
	  <para>This is a list of locks with their respective behaviours:</para>
 | |
| 
 | |
| 	  <itemizedlist>
 | |
| 	    <listitem>
 | |
| 	      <para>spin mutex - spinning - &man.mutex.9;</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>sleep mutex - blocking - &man.mutex.9;</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>pool mutex - blocking - &man.mtx.pool.9;</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
	      <para>sleep family - sleeping - &man.sleep.9; (pause, tsleep,
		msleep, msleep spin, msleep rw, msleep sx)</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>condvar - sleeping - &man.condvar.9;</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>rwlock - blocking - &man.rwlock.9;</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>sxlock - sleeping - &man.sx.9;</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>lockmgr - sleeping - &man.lockmgr.9;</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>semaphores - sleeping - &man.sema.9;</para>
 | |
| 	    </listitem>
 | |
| 	  </itemizedlist>
 | |
| 
 | |
| 	  <para>Among these locks only mutexes, sxlocks, rwlocks and lockmgrs
 | |
| 	    are intended to handle recursion, but currently recursion is only
 | |
| 	    supported by mutexes and lockmgrs.</para>
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 id="freebsd-scheduling">
 | |
| 	  <title>Scheduling barriers</title>
 | |
| 
 | |
	  <para>Scheduling barriers are intended to be used to drive the
	    scheduling of threads.  They consist mainly of three
	    different stubs:</para>
 | |
| 
 | |
| 	  <itemizedlist>
 | |
| 	    <listitem>
 | |
| 	      <para>critical sections (and preemption)</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>sched_bind</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>sched_pin</para>
 | |
| 	    </listitem>
 | |
| 	  </itemizedlist>
 | |
| 
 | |
	  <para>Generally, these should be used only in a particular context,
	    and even though they can often replace locks, they should be
	    avoided because they do not allow simple problems to be diagnosed
	    with locking debugging tools (such as &man.witness.4;).</para>
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 id="freebsd-critical">
 | |
| 	  <title>Critical sections</title>
 | |
| 
 | |
	  <para>The &os; kernel has been made preemptive basically to deal with
	    interrupt threads.  In fact, in order to avoid high interrupt
	    latency, time-sharing priority threads can be preempted by
	    interrupt threads (in this way, they do not need to wait to be
	    scheduled as the normal path would require).  Preemption, however,
	    introduces new race conditions that need to be handled as well.
	    Often, in order to deal with preemption, the simplest thing to do
	    is to completely disable it.  A critical section defines a piece of
	    code (delimited by the pair of functions &man.critical.enter.9;
	    and &man.critical.exit.9;) where preemption is guaranteed not to
	    happen (until the protected code is fully executed).  This can
	    often replace a lock effectively but should be used carefully in
	    order not to lose the whole advantage that preemption
	    brings.</para>
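	  <para>The effect of a critical section on preemption can be modelled
	    with a nesting counter and a deferred-preemption flag.  The model
	    below is an assumption made for illustration: the structure, its
	    fields and the function names are hypothetical and simplify what a
	    real scheduler does.</para>

```c
#include <assert.h>

/* Hypothetical per-thread state for the model. */
struct thread_model {
    int critnest;       /* nesting depth of critical sections */
    int owe_preempt;    /* a preemption request arrived while inside one */
    int preemptions;    /* how many times the thread was actually preempted */
};

void model_critical_enter(struct thread_model *td)
{
    td->critnest++;
}

/* A preemption request is deferred while the thread is in a critical section. */
void model_request_preempt(struct thread_model *td)
{
    if (td->critnest > 0)
        td->owe_preempt = 1;
    else
        td->preemptions++;
}

/* Leaving the outermost section delivers a deferred preemption, if any. */
void model_critical_exit(struct thread_model *td)
{
    assert(td->critnest > 0);
    if (--td->critnest == 0 && td->owe_preempt) {
        td->owe_preempt = 0;
        td->preemptions++;
    }
}
```

	  <para>The longer the section, the longer a pending preemption is
	    delayed, which is the cost the paragraph above warns about.</para>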
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 id="freebsd-schedpin">
 | |
| 	  <title>sched_pin/sched_unpin</title>
 | |
| 
 | |
	  <para>Another way to deal with preemption is the
	    <function>sched_pin()</function> interface.  If a piece of code
	    is enclosed in the <function>sched_pin()</function> and
	    <function>sched_unpin()</function> pair of functions, it is
	    guaranteed that the respective thread, even if it can be preempted,
	    will always be executed on the same CPU.  Pinning is very
	    effective in the particular case when we have to access
	    per-CPU data and we assume other threads will not change those
	    data.  Under that assumption, a critical section would be too
	    strong a requirement for our code.</para>
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 id="freebsd-schedbind">
 | |
| 	  <title>sched_bind/sched_unbind</title>
 | |
| 
 | |
	  <para><function>sched_bind</function> is an API used to bind
	    a thread to a particular CPU for as long as it executes the code,
	    until a <function>sched_unbind</function> function call
	    unbinds it.  This feature has a key role in situations where you
	    cannot trust the current state of CPUs (for example, at very early
	    stages of boot), as you want to prevent your thread from migrating
	    to inactive CPUs.  Since <function>sched_bind</function> and
	    <function>sched_unbind</function> manipulate internal scheduler
	    structures, they need to be enclosed in
	    <function>sched_lock</function> acquisition and release when
	    used.</para>
 | |
| 	</sect4>
 | |
|       </sect3>
 | |
| 
 | |
|       <sect3 id="freebsd-proc">
 | |
| 	<title>Proc structure</title>
 | |
| 
 | |
	<para>Various emulation layers sometimes require some additional
	  per-process data.  An emulation layer can manage separate structures
	  (a list, a tree etc.) containing these data for every process, but
	  this tends to be slow and memory consuming.  To solve this problem
	  the &os; <literal>proc</literal> structure contains
	  <literal>p_emuldata</literal>, which is a void pointer to some
	  emulation layer specific data.  This <literal>proc</literal> entry
	  is protected by the proc mutex.</para>
 | |
| 
 | |
	<para>The &os; <literal>proc</literal> structure contains a
	  <literal>p_sysent</literal> entry that identifies which ABI this
	  process is running.  In fact, it is a pointer to the
	  <literal>sysentvec</literal> described above.  So by comparing this
	  pointer to the address where the <literal>sysentvec</literal>
	  structure for the given ABI is stored, we can effectively determine
	  whether the process belongs to our emulation layer.  The code
	  typically looks like:</para>
 | |
| 
 | |
	<programlisting>if (__predict_true(p->p_sysent != &amp;elf_linux_sysvec))
	  return;</programlisting>
 | |
| 
 | |
	<para>As you can see, we use the
	  <literal>__predict_true</literal> modifier to collapse the most
	  common case (&os; process) to a simple return operation, thus
	  preserving high performance.  This code should be turned into a
	  macro because currently it is not very flexible, i.e. we support
	  neither &linux;64 emulation nor A.OUT &linux; processes
	  on i386.</para>
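	<para>One possible shape for such a macro is sketched below in a
	  user-space model.  Only <literal>elf_linux_sysvec</literal>,
	  <literal>p_sysent</literal> and <literal>p_emuldata</literal> are
	  names taken from the text; the structure layouts, the second
	  <literal>sysentvec</literal> and the macro name are hypothetical.</para>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, stripped-down stand-ins for the kernel structures. */
struct sysentvec {
    const char *sv_name;
};

struct proc {
    struct sysentvec *p_sysent;     /* ABI the process is running */
    void             *p_emuldata;   /* emulation layer specific data */
};

struct sysentvec native_sysvec     = { "native ELF" };
struct sysentvec elf_linux_sysvec  = { "Linux ELF" };
struct sysentvec aout_linux_sysvec = { "Linux a.out" };  /* assumed name */

/* One place to extend when another Linux ABI variant has to be recognized. */
#define PROC_IS_LINUX(p)                          \
    ((p)->p_sysent == &elf_linux_sysvec ||        \
     (p)->p_sysent == &aout_linux_sysvec)
```

	<para>Callers would then test <literal>!PROC_IS_LINUX(p)</literal>
	  and return early, keeping the fast path for native processes.</para>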
 | |
|       </sect3>
 | |
| 
 | |
|       <sect3 id="freebsd-vfs">
 | |
| 	<title>VFS</title>
 | |
| 
 | |
	<para>The &os; VFS subsystem is very complex, but the &linux; emulation
	  layer uses just a small subset via a well-defined API.  It can
	  operate either on vnodes or on file handlers.  A vnode represents a
	  virtual node, i.e. a representation of a node in VFS.  Another
	  representation is a file handler, which represents an opened file
	  from the perspective of a process.  A file handler can represent a
	  socket or an ordinary file.  A file handler contains a pointer to
	  its vnode.  More than one file handler can point to the same
	  vnode.</para>
 | |
| 
 | |
| 	<sect4 id="freebsd-namei">
 | |
| 	  <title>namei</title>
 | |
| 
 | |
	  <para>The &man.namei.9; routine is a central entry point to pathname
	    lookup and translation.  It traverses the path component by
	    component from the starting point to the end point using a lookup
	    function, which is internal to VFS.  The &man.namei.9; routine can
	    cope with symlinks, absolute and relative paths.  When a path is
	    looked up using &man.namei.9;, it is entered into the name cache.
	    This behaviour can be suppressed.  This routine is used all over
	    the kernel and its performance is very critical.</para>
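	  <para>The step-by-step traversal can be pictured with a small
	    component iterator.  This is a user-space sketch and not the
	    &man.namei.9; implementation: it ignores symlinks and the name
	    cache, and both function names are invented.</para>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* An absolute path starts the walk at the root, a relative one elsewhere. */
int path_is_absolute(const char *path)
{
    return path[0] == '/';
}

/*
 * Copy the next component of *pathp into buf and advance *pathp past it.
 * Returns 1 if a component was produced, 0 at the end of the path or when
 * the component does not fit into buf.
 */
int next_component(const char **pathp, char *buf, size_t buflen)
{
    const char *p = *pathp;
    size_t len;

    while (*p == '/')           /* skip leading and repeated slashes */
        p++;
    len = strcspn(p, "/");
    if (len == 0 || len >= buflen)
        return 0;
    memcpy(buf, p, len);
    buf[len] = '\0';
    *pathp = p + len;
    return 1;
}
```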
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 id="freebsd-vn">
 | |
| 	  <title>vn_fullpath</title>
 | |
| 
 | |
	  <para>The &man.vn.fullpath.9; function makes a best effort to
	   traverse the VFS name cache and returns a path for a given (locked)
	   vnode.  This process is unreliable but works just fine for the most
	   common cases.  It is unreliable because it relies on the VFS cache
	   (it does not traverse the on-medium structures), it does not work
	   with hardlinks, etc.  This routine is used in several places in the
	   Linuxulator.</para>
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 id="freebsd-vnode">
 | |
| 	  <title>Vnode operations</title>
 | |
| 
 | |
| 	  <itemizedlist>
 | |
| 	    <listitem>
 | |
	      <para><function>fgetvp</function> - given a thread and a file
		descriptor number, it returns the associated vnode</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>&man.vn.lock.9; - locks a vnode</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para><function>vn_unlock</function> - unlocks a vnode</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>&man.VOP.READDIR.9; - reads a directory referenced by
 | |
| 		a vnode</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>&man.VOP.GETATTR.9; - gets attributes of a file or a
 | |
| 		directory referenced by a vnode</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>&man.VOP.LOOKUP.9; - looks up a path to a given
 | |
| 		directory</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>&man.VOP.OPEN.9; - opens a file referenced by a
 | |
| 		vnode</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>&man.VOP.CLOSE.9; - closes a file referenced by a
 | |
| 		vnode</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>&man.vput.9; - decrements the use count for a vnode and
 | |
| 		unlocks it</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>&man.vrele.9; - decrements the use count for a vnode</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>&man.vref.9; - increments the use count for a vnode</para>
 | |
| 	    </listitem>
 | |
| 	  </itemizedlist>
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 id="freebsd-file-handler">
 | |
| 	  <title>File handler operations</title>
 | |
| 
 | |
| 	  <itemizedlist>
 | |
| 	    <listitem>
 | |
	      <para><function>fget</function> - given a thread and a file
		descriptor number, it returns the associated file handler and
		references it</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para><function>fdrop</function> - drops a reference to a file
 | |
| 		handler</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para><function>fhold</function> - references a file
 | |
| 		handler</para>
 | |
| 	    </listitem>
 | |
| 	  </itemizedlist>
 | |
| 	</sect4>
 | |
|       </sect3>
 | |
|     </sect2>
 | |
|   </sect1>
 | |
| 
 | |
|   <sect1 id="md">
 | |
    <title>&linux; emulation layer - MD part</title>
 | |
| 
 | |
    <para>This section deals with the implementation of the &linux; emulation
      layer in the &os; operating system.  It first describes the machine
      dependent part, talking about how and where interaction between userland
      and kernel is implemented.  It talks about syscalls, signals, ptrace,
      traps and stack fixup.  This part discusses i386 but it is written
      generally so other architectures should not differ very much.  The next
      part is the machine independent part of the Linuxulator.  This section
      only covers i386 and ELF handling.  A.OUT is obsolete and
      untested.</para>
 | |
| 
 | |
|     <sect2 id="syscall-handling">
 | |
|       <title>Syscall handling</title>
 | |
| 
 | |
      <para>Syscall handling is mostly written in
	<filename>linux_sysvec.c</filename>, which covers most of the routines
	pointed to by the <literal>sysentvec</literal> structure.  When a
	&linux; process running on &os; issues a syscall, the general syscall
	routine calls the &linux; prepsyscall routine for the &linux;
	ABI.</para>
 | |
| 
 | |
|       <sect3 id="linux-prepsyscall">
 | |
| 	<title>&linux; prepsyscall</title>
 | |
| 
 | |
	<para>&linux; passes arguments to syscalls via registers (that is why
	  it is limited to 6 parameters on i386) while &os; uses the stack.
	  The &linux; prepsyscall routine must copy parameters from registers
	  to the stack.  The order of the registers is:
	  <varname>%ebx</varname>, <varname>%ecx</varname>,
	  <varname>%edx</varname>, <varname>%esi</varname>,
	  <varname>%edi</varname>, <varname>%ebp</varname>.  The catch is that
	  this is true for only <emphasis>most</emphasis> of the syscalls.
	  Some (most notably <function>clone</function>) use a different
	  order, but luckily this is easy to fix by inserting a dummy parameter
	  in the <function>linux_clone</function> prototype.</para>
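	<para>The register-to-argument copy can be sketched in a user-space
	  model.  The register order is the one given above; the structure,
	  the function name and the array-based argument layout are
	  assumptions made for the example, as is the use of
	  <varname>%eax</varname> for the syscall number.</para>

```c
#include <assert.h>
#include <stdint.h>

#define LINUX_MAX_ARGS 6

/* Hypothetical subset of an i386 trap frame, named after the registers. */
struct regs_model {
    uint32_t eax;   /* assumed to carry the syscall number */
    uint32_t ebx, ecx, edx, esi, edi, ebp;
};

/* Gather register-passed arguments into the layout a handler expects. */
void model_prepsyscall(const struct regs_model *r,
    uint32_t args[LINUX_MAX_ARGS])
{
    args[0] = r->ebx;
    args[1] = r->ecx;
    args[2] = r->edx;
    args[3] = r->esi;
    args[4] = r->edi;
    args[5] = r->ebp;
}
```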
 | |
|       </sect3>
 | |
| 
 | |
|       <sect3 id="syscall-writing">
 | |
| 	<title>Syscall writing</title>
 | |
| 
 | |
| 	<para>Every syscall implemented in the Linuxulator must have its
 | |
| 	  prototype with various flags in <filename>syscalls.master</filename>.
 | |
| 	  The form of the file is:</para>
 | |
| 
 | |
	<programlisting>...
2	AUE_FORK	STD		{ int linux_fork(void); }
...
6	AUE_CLOSE	NOPROTO		{ int close(int fd); }
...</programlisting>
 | |
| 
 | |
	<para>The first column represents the syscall number.  The second
	  column is for auditing support.  The third column represents the
	  syscall type.  It is one of <literal>STD</literal>,
	  <literal>OBSOL</literal>, <literal>NOPROTO</literal> or
	  <literal>UNIMPL</literal>.  <literal>STD</literal> is a standard
	  syscall with full prototype and implementation.
	  <literal>OBSOL</literal> is obsolete and defines just the prototype.
	  <literal>NOPROTO</literal> means that the syscall is implemented
	  elsewhere, so the ABI prefix is not prepended, etc.
	  <literal>UNIMPL</literal> means that the syscall will be
	  substituted with the <function>nosys</function> syscall
	  (a syscall just printing out a message about the syscall not being
	  implemented and returning <literal>ENOSYS</literal>).</para>
 | |
| 
 | |
| 	<para>From <filename>syscalls.master</filename> a script generates
 | |
| 	  three files: <filename>linux_syscall.h</filename>,
 | |
| 	  <filename>linux_proto.h</filename> and
 | |
| 	  <filename>linux_sysent.c</filename>.  The
 | |
| 	  <filename>linux_syscall.h</filename> contains definitions of syscall
 | |
| 	  names and their numerical value, e.g.:</para>
 | |
| 
 | |
| 	<programlisting>...
 | |
| #define LINUX_SYS_linux_fork 2
 | |
| ...
 | |
| #define LINUX_SYS_close 6
 | |
| ...</programlisting>
 | |
| 
 | |
| 	<para>The <filename>linux_proto.h</filename> contains structure
 | |
| 	  definitions of arguments to every syscall, e.g.:</para>
 | |
| 
 | |
| 	<programlisting>struct linux_fork_args { 
 | |
|   register_t dummy; 
 | |
| };</programlisting>
 | |
| 
 | |
	<para>And finally, <filename>linux_sysent.c</filename> contains
	  the structure describing the system entry table, used to actually
	  dispatch a syscall, e.g.:</para>
 | |
| 
 | |
| 	<programlisting>{ 0, (sy_call_t *)linux_fork, AUE_FORK, NULL, 0, 0 }, /* 2 = linux_fork */ 
 | |
| { AS(close_args), (sy_call_t *)close, AUE_CLOSE, NULL, 0, 0 }, /* 6 = close */</programlisting>
 | |
| 
 | |
| 	<para>As you can see, <function>linux_fork</function> is implemented
 | |
| 	  in the Linuxulator itself, so its definition is of
 | |
| 	  <literal>STD</literal> type.  It takes no arguments, which is
 | |
| 	  reflected by the dummy argument structure.  On the other hand,
 | |
| 	  <function>close</function> is just an alias for the real &os;
 | |
| 	  &man.close.2;, so it has no &linux; argument structure associated
 | |
| 	  with it, and in the system entry table it is not prefixed with
 | |
| 	  <literal>linux</literal> because it calls the real &man.close.2; in
 | |
| 	  the kernel.</para>
 | |
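<para>The way such a generated table is used for dispatch can be sketched in a few lines of userland C.  This is a simplified illustration and not the kernel code: all <literal>sketch_</literal> names, the reduced entry structure and the return values are assumptions made for this example only.</para>

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's system entry type. */
typedef int sketch_call_t(void *args);

struct sketch_sysent {
	int		 narg;	/* number of arguments */
	sketch_call_t	*call;	/* implementing function */
	const char	*name;	/* kept only for illustration */
};

struct sketch_close_args {
	int fd;
};

/* Stand-in for a syscall implemented by the emulation layer (STD). */
int
sketch_linux_fork(void *args)
{
	(void)args;	/* no arguments, like the dummy structure */
	return (0);
}

/* Stand-in for a native syscall that is reused as-is (NOPROTO). */
int
sketch_close(void *args)
{
	struct sketch_close_args *uap = args;

	assert(uap != NULL);
	return (uap->fd >= 0 ? 0 : EBADF);
}

/* The table is indexed by syscall number; unused slots stay empty. */
static const struct sketch_sysent sketch_table[] = {
	[2] = { 0, sketch_linux_fork, "linux_fork" },
	[6] = { 1, sketch_close, "close" },
};

#define	SKETCH_NSYSENT	(sizeof(sketch_table) / sizeof(sketch_table[0]))

/* Dispatch a syscall by number, returning ENOSYS for empty slots. */
int
sketch_dispatch(size_t code, void *args)
{
	if (code >= SKETCH_NSYSENT || sketch_table[code].call == NULL)
		return (ENOSYS);
	return (sketch_table[code].call(args));
}
```

<para>With this table, number 2 reaches the emulation layer's own function, number 6 reaches the shared native one, and any empty slot yields <literal>ENOSYS</literal>.</para>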
|       </sect3>
 | |
| 
 | |
|       <sect3 id="dummy-syscalls">
 | |
| 	<title>Dummy syscalls</title>
 | |
| 
 | |
| 	<para>The &linux; emulation layer is not complete, as some syscalls are
 | |
| 	  not implemented properly and some are not implemented at all.  The
 | |
| 	  emulation layer employs a facility to mark unimplemented syscalls
 | |
| 	  with the <literal>DUMMY</literal> macro.  These dummy definitions
 | |
| 	  reside in <filename>linux_dummy.c</filename> in the form of
 | |
| 	  <literal>DUMMY(syscall);</literal>, which is then translated to
 | |
| 	  various syscall auxiliary files.  The implementation consists
 | |
| 	  of printing a message saying that this syscall is not implemented.
 | |
| 	  The <literal>UNIMPL</literal> prototype is not used because we want
 | |
| 	  to be able to identify the name of the syscall that was called in
 | |
| 	  order to know which syscalls are more important to implement.</para>
 | |
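<para>A macro of this kind can be sketched as follows.  The real macro in <filename>linux_dummy.c</filename> differs in detail; the helper <literal>sketch_unimplemented</literal>, the <literal>sketch_linux_</literal> prefix and the two syscall names are hypothetical and chosen only for illustration.</para>

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical helper: report the missing syscall by name. */
int
sketch_unimplemented(const char *name)
{
	assert(name != NULL);
	fprintf(stderr, "linux: syscall %s not implemented\n", name);
	return (ENOSYS);
}

/*
 * Sketch of a DUMMY-style macro: it expands to a complete function
 * definition whose body only reports its own name.  The trailing
 * declaration swallows the semicolon written after the macro call.
 */
#define	SKETCH_DUMMY(s)						\
int								\
sketch_linux_ ## s(void)					\
{								\
	return (sketch_unimplemented(#s));			\
}								\
struct sketch_dummy_hack

/* The syscall names below are picked only for illustration. */
SKETCH_DUMMY(vhangup);
SKETCH_DUMMY(pivot_root);
```

<para>Because the name is stringified inside the macro, the message identifies the exact syscall that was called, which a shared <function>nosys</function> entry could not do.</para>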
|       </sect3>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 id="signal-handling">
 | |
|       <title>Signal handling</title>
 | |
| 
 | |
|       <para>Signal handling is generally done in the &os; kernel for all
 | |
| 	binary compatibility layers, with a call into a compat-dependent layer.
 | |
| 	The &linux; compatibility layer defines the
 | |
| 	<function>linux_sendsig</function> routine for this purpose.</para>
 | |
| 
 | |
|       <sect3 id="linux-sendsig">
 | |
| 	<title>&linux; sendsig</title>
 | |
| 
 | |
| 	<para>This routine first checks whether the signal handler has been
 | |
| 	  installed with <literal>SA_SIGINFO</literal>, in which case it calls
 | |
| 	  the <function>linux_rt_sendsig</function> routine instead.  Otherwise,
 | |
| 	  it allocates (or reuses an already existing) signal handler context,
 | |
| 	  then it builds a list of arguments for the signal handler.  It
 | |
| 	  translates the signal number based on the signal translation table,
 | |
| 	  assigns a handler, and translates the sigset.  Then it saves context
 | |
| 	  for the <function>sigreturn</function> routine (various registers,
 | |
| 	  translated trap number and signal mask).  Finally, it copies out the
 | |
| 	  signal context to userspace and prepares the context for the actual
 | |
| 	  signal handler to run.</para>
 | |
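<para>Signal numbers differ between the two systems, which is why a translation table is needed.  The following sketch uses the classic i386 signal numbers of both systems.  The table layout, the <literal>sketch_</literal> names and the convention of returning 0 for a signal without a counterpart are assumptions of this example, not the kernel's actual code.</para>

```c
#include <assert.h>

#define	SKETCH_NSIG	32	/* classic (non-realtime) signals only */

/* Indexed by the native signal number; 0 marks "no counterpart". */
static const int sketch_bsd_to_linux[SKETCH_NSIG] = {
	 0,			/*  0: unused */
	 1,  2,  3,  4,  5,  6,	/*  1-6:  HUP INT QUIT ILL TRAP ABRT */
	 0,			/*  7: EMT, not present on Linux */
	 8,  9,			/*  8-9:  FPE KILL */
	 7,			/* 10: BUS */
	11,			/* 11: SEGV */
	31,			/* 12: SYS */
	13, 14, 15,		/* 13-15: PIPE ALRM TERM */
	23,			/* 16: URG */
	19, 20, 18, 17,		/* 17-20: STOP TSTP CONT CHLD */
	21, 22,			/* 21-22: TTIN TTOU */
	29,			/* 23: IO */
	24, 25, 26, 27, 28,	/* 24-28: XCPU XFSZ VTALRM PROF WINCH */
	 0,			/* 29: INFO, not present on Linux */
	10, 12,			/* 30-31: USR1 USR2 */
};

/* Translate a native signal number into the emulated one. */
int
sketch_translate_signal(int sig)
{
	assert(sig >= 0);
	if (sig >= SKETCH_NSIG)
		return (0);	/* realtime signals are left out here */
	return (sketch_bsd_to_linux[sig]);
}
```

<para>Most low-numbered signals map to themselves, while signals such as <literal>SIGBUS</literal>, <literal>SIGCHLD</literal> and <literal>SIGUSR1</literal> change their number on the way to the &linux; handler.</para>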
|       </sect3>
 | |
| 
 | |
|       <sect3 id="linux-rt-sendsig">
 | |
| 	<title>linux_rt_sendsig</title>
 | |
| 
 | |
| 	<para>This routine is similar to <function>linux_sendsig</function>;
 | |
| 	  only the signal context preparation is different.  It adds
 | |
| 	  <literal>siginfo</literal>, <literal>ucontext</literal>, and some
 | |
| 	  &posix; parts.  It might be worth considering whether those two
 | |
| 	  functions could be merged, with the benefit of less code duplication
 | |
| 	  and possibly even faster execution.</para>
 | |
|       </sect3>
 | |
| 
 | |
|       <sect3 id="linux-sigreturn">
 | |
| 	<title>linux_sigreturn</title>
 | |
| 
 | |
| 	<para>This syscall is used to return from the signal handler.  It does
 | |
| 	  some security checks and restores the original process context.  It
 | |
| 	  also unmasks the signal in the process signal mask.</para>
 | |
|       </sect3>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 id="ptrace">
 | |
|       <title>Ptrace</title>
 | |
| 
 | |
|       <para>Many &unix; derivatives implement the &man.ptrace.2; syscall in order
 | |
| 	to allow various tracing and debugging features.  This facility
 | |
| 	enables the tracing process to obtain various information about the
 | |
| 	traced process, like register dumps, any memory from the process
 | |
| 	address space, etc. and also to trace the process, for example by
 | |
| 	stepping over a single instruction or between system entries
 | |
| 	(syscalls and traps).
 | |
| 	&man.ptrace.2; also lets you set various information in the traced
 | |
| 	process (registers etc.).  &man.ptrace.2; is a &unix;-wide standard
 | |
| 	implemented in most &unix;es around the world.</para>
 | |
| 
 | |
|       <para>&linux; emulation in &os; implements the &man.ptrace.2; facility
 | |
| 	in <filename>linux_ptrace.c</filename>.  This file contains the routines
 | |
| 	for converting registers between &linux; and &os; and the actual
 | |
| 	&man.ptrace.2; syscall emulation.  The syscall is a long switch block that
 | |
| 	implements its counterpart in &os; for every &man.ptrace.2; command.
 | |
| 	The &man.ptrace.2; commands are mostly equal between &linux; and &os;
 | |
| 	so usually just a small modification is needed.  For example,
 | |
| 	<literal>PT_GETREGS</literal> in &linux; operates on direct data while
 | |
| 	&os; uses a pointer to the data so after performing a (native)
 | |
| 	&man.ptrace.2; syscall, a copyout must be done to preserve &linux;
 | |
| 	semantics.</para>
 | |
| 
 | |
|       <para>The &man.ptrace.2; implementation in Linuxulator has some known
 | |
| 	weaknesses.  Panics have been seen when using
 | |
| 	<command>strace</command> (which is a &man.ptrace.2; consumer) in the
 | |
| 	Linuxulator environment.  Also <literal>PT_SYSCALL</literal> is not
 | |
| 	implemented.</para>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 id="traps">
 | |
|       <title>Traps</title>
 | |
| 
 | |
|       <para>Whenever a &linux; process running in the emulation layer traps,
 | |
| 	the trap itself is handled transparently, with the only exception being
 | |
| 	the trap translation.  &linux; and &os; differ in opinion on what a
 | |
| 	trap is, so this is dealt with here.  The code is actually very
 | |
| 	short:</para>
 | |
| 
 | |
|       <programlisting>static int 
 | |
| translate_traps(int signal, int trap_code) 
 | |
| { 
 | |
| 
 | |
|   if (signal != SIGBUS) 
 | |
|     return signal;
 | |
| 
 | |
|   switch (trap_code) {
 | |
| 
 | |
|     case T_PROTFLT:
 | |
|     case T_TSSFLT:
 | |
|     case T_DOUBLEFLT:
 | |
|     case T_PAGEFLT:
 | |
|       return SIGSEGV;
 | |
| 
 | |
|     default: 
 | |
|       return signal; 
 | |
|   } 
 | |
| }</programlisting>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 id="stack-fixup">
 | |
|       <title>Stack fixup</title>
 | |
| 
 | |
|       <para>The RTLD run-time link-editor expects so-called AUX tags on the stack
 | |
| 	during an <function>execve</function>, so a fixup must be done to ensure
 | |
| 	this.  Of course, every RTLD system is different, so each emulation layer
 | |
| 	must provide its own stack fixup routine to do this, and so does the
 | |
| 	Linuxulator.  The <function>elf_linux_fixup</function> routine simply
 | |
| 	copies out the AUX tags to the stack and adjusts the stack of the user
 | |
| 	space process to point right after those tags, so that RTLD finds them
 | |
| 	where it expects them.</para>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 id="aout-support">
 | |
|       <title>A.OUT support</title>
 | |
| 
 | |
|       <para>The &linux; emulation layer on i386 also supports &linux; A.OUT
 | |
| 	binaries.  Pretty much everything described in the previous sections
 | |
| 	must be implemented for A.OUT support (besides trap translation and
 | |
| 	signal sending).  The support for A.OUT binaries is no longer
 | |
| 	maintained; in particular, the 2.6 emulation does not work with it.
 | |
| 	This does not cause any problems, as the linux-base in ports probably
 | |
| 	does not support A.OUT binaries at all.  This support will probably be
 | |
| 	removed in the future.  Most of the code necessary for loading &linux;
 | |
| 	A.OUT binaries is in the <filename>imgact_linux.c</filename> file.</para>
 | |
|     </sect2>
 | |
|   </sect1>
 | |
| 
 | |
|   <sect1 id="mi">
 | |
|     <title>&linux; emulation layer - MI part</title>
 | |
| 
 | |
|     <para>This section describes the machine-independent part of the
 | |
|       Linuxulator.  It covers the emulation infrastructure needed for &linux;
 | |
|       2.6 emulation, the thread local storage (TLS) implementation (on i386)
 | |
|       and futexes.  Then we talk briefly about some syscalls.</para>
 | |
| 
 | |
|     <sect2 id="nptl-desc">
 | |
|       <title>Description of NPTL</title>
 | |
| 
 | |
|       <para>One of the major areas of progress in development of &linux; 2.6
 | |
| 	was threading.  Prior to 2.6, the &linux; threading support was
 | |
| 	implemented in the <application>linuxthreads</application> library.
 | |
| 	The library was a partial implementation of &posix; threading.  The
 | |
| 	threading was implemented using separate processes for each thread
 | |
| 	using the <function>clone</function> syscall to let them share the
 | |
| 	address space (and other things).  The main weaknesses of this
 | |
| 	approach were that every thread had a different PID, signal handling
 | |
| 	was broken (from the pthreads perspective), etc.  Also the performance
 | |
| 	was not very good (use of <literal>SIGUSR</literal> signals for
 | |
| 	thread synchronization, kernel resource consumption, etc.) so to
 | |
| 	overcome these problems a new threading system was developed and
 | |
| 	named NPTL.</para>
 | |
| 
 | |
|       <para>The NPTL library focused on two things, but a third thing came
 | |
| 	along, so it is usually considered a part of NPTL.  Those two things
 | |
| 	were embedding of threads into a process structure and futexes.  The
 | |
| 	additional third thing was TLS, which is not directly required by NPTL
 | |
| 	but the whole NPTL userland library depends on it.  Those improvements
 | |
| 	yielded much improved performance and standards conformance.  NPTL
 | |
| 	is the standard threading library in &linux; systems these days.</para>
 | |
| 
 | |
|       <para>The &os; Linuxulator implementation approaches NPTL in three
 | |
| 	main areas: TLS, futexes, and PID mangling, which is meant to
 | |
| 	simulate the &linux; threads.  Further sections describe each of these
 | |
| 	areas.</para>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 id="linux26-emu">
 | |
|       <title>&linux; 2.6 emulation infrastructure</title>
 | |
| 
 | |
|       <para>These sections deal with the way &linux; threads are managed and
 | |
| 	how we simulate that in &os;.</para>
 | |
| 
 | |
|       <sect3 id="linux26-runtime">
 | |
| 	<title>Runtime determining of 2.6 emulation</title>
 | |
| 
 | |
| 	<para>The &linux; emulation layer in &os; supports runtime setting of
 | |
| 	  the emulated version.  This is done via &man.sysctl.8;, namely
 | |
| 	  <literal>compat.linux.osrelease</literal>, which is set to 2.4.2 by
 | |
| 	  default (as of April 2007).  With all &linux; versions up to 2.6
 | |
| 	  it just determines what &man.uname.1; outputs.  It is different with
 | |
| 	  2.6 emulation, where setting this &man.sysctl.8; affects the runtime
 | |
| 	  behaviour of the emulation layer.  When set to 2.6.x it sets the
 | |
| 	  value of <literal>linux_use_linux26</literal>, while setting it to
 | |
| 	  something else keeps it unset.  This variable (plus per-prison
 | |
| 	  variables of the very same kind) determines whether the 2.6
 | |
| 	  infrastructure (mainly PID mangling) is used in the code or not.
 | |
| 	  The version setting is done system-wide and this affects all &linux;
 | |
| 	  processes.  The &man.sysctl.8; should not be changed while any
 | |
| 	  &linux; binary is running, as this might cause problems for it.</para>
 | |
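<para>The decision described above can be sketched as a small parsing routine.  The function and variable names are hypothetical and the error handling is reduced to a return code; the real &man.sysctl.8; handler in the kernel is more involved.</para>

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the flag described in the text. */
static int sketch_use_linux26;

/*
 * Parse a "major.minor[.patch]" release string and decide whether the
 * 2.6 infrastructure should be enabled.  Returns 0 on success and -1
 * if the string cannot be parsed (the flag is left untouched then).
 */
int
sketch_set_osrelease(const char *osrelease)
{
	int major, minor;

	assert(osrelease != NULL);
	if (sscanf(osrelease, "%d.%d", &major, &minor) != 2)
		return (-1);
	sketch_use_linux26 = (major == 2 && minor == 6);
	return (0);
}

/* Report whether the 2.6 code paths are currently enabled. */
int
sketch_uses_linux26(void)
{
	return (sketch_use_linux26);
}
```

<para>Since the flag is a single global in this sketch, changing it affects every emulated process at once, which mirrors the system-wide nature of the setting.</para>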
|       </sect3>
 | |
| 
 | |
|       <sect3 id="linux-proc-thread">
 | |
| 	<title>&linux; processes and thread identifiers</title>
 | |
| 
 | |
| 	<para>The semantics of &linux; threading are a little confusing and
 | |
| 	  use entirely different nomenclature from &os;.  A process in
 | |
| 	  &linux; consists of a <literal>struct task_struct</literal> embedding two
 | |
| 	  identifier fields - PID and TGID.  PID is <emphasis>not</emphasis>
 | |
| 	  a process ID but it is a thread ID.  The TGID identifies a thread
 | |
| 	  group, in other words a process.  For a single-threaded process the
 | |
| 	  PID equals the TGID.</para>
 | |
| 
 | |
| 	<para>The thread in NPTL is just an ordinary process that happens to
 | |
| 	  have TGID not equal to PID and have a group leader not equal to
 | |
| 	  itself (and shared VM etc. of course).  Everything else happens in
 | |
| 	  the same way as for an ordinary process.  There is no separation of
 | |
| 	  shared state into some external structure like in &os;.  This
 | |
| 	  creates some duplication of information and possible data
 | |
| 	  inconsistency.  The &linux; kernel seems to use task -> group
 | |
| 	  information in some places and task information elsewhere, which is
 | |
| 	  not very consistent and looks error-prone.</para>
 | |
| 
 | |
| 	<para>Every NPTL thread is created by a call to the
 | |
| 	  <function>clone</function> syscall with a specific set of flags
 | |
| 	  (more in a later subsection).  NPTL implements strict
 | |
| 	  1:1 threading.</para>
 | |
| 
 | |
| 	<para>In &os; we emulate NPTL threads with ordinary &os; processes that
 | |
| 	  share VM space, etc. and the PID gymnastics are just mimicked in the
 | |
| 	  emulation specific structure attached to the process.  The
 | |
| 	  structure attached to the process looks like:</para>
 | |
| 
 | |
| 	<programlisting>struct linux_emuldata { 
 | |
|   pid_t pid; 
 | |
| 
 | |
|   int *child_set_tid; /* in clone(): Child's TID to set on clone */
 | |
|   int *child_clear_tid;/* in clone(): Child's TID to clear on exit */
 | |
| 
 | |
|   struct linux_emuldata_shared *shared; 
 | |
| 
 | |
|   int pdeath_signal; /* parent death signal */ 
 | |
| 
 | |
|   LIST_ENTRY(linux_emuldata) threads; /* list of linux threads */ 
 | |
| };</programlisting>
 | |
| 
 | |
| 	<para>The PID is used to identify the &os; process to which this
 | |
| 	  structure is attached.  The <varname>child_set_tid</varname> and
 | |
| 	  <varname>child_clear_tid</varname> are used for TID address
 | |
| 	  copyout when a process is created and when it exits.  The
 | |
| 	  <varname>shared</varname> pointer points to a structure shared
 | |
| 	  among threads.  The <varname>pdeath_signal</varname> variable
 | |
| 	  identifies the parent death signal and the
 | |
| 	  <varname>threads</varname> pointer is used to link this structure
 | |
| 	  to the list of threads.  The <literal>linux_emuldata_shared</literal>
 | |
| 	  structure looks like:</para>
 | |
| 
 | |
| 	<programlisting>struct linux_emuldata_shared { 
 | |
| 
 | |
|   int refs; 
 | |
| 
 | |
|   pid_t group_pid; 
 | |
| 
 | |
|   LIST_HEAD(, linux_emuldata) threads; /* head of list of linux threads */ 
 | |
| };</programlisting>
 | |
| 
 | |
| 	<para>The <varname>refs</varname> field is a reference counter used
 | |
| 	  to determine when we can free the structure, to avoid memory leaks.
 | |
| 	  The <varname>group_pid</varname> field identifies the PID (= TGID) of the
 | |
| 	  whole process (= thread group).  The <varname>threads</varname>
 | |
| 	  pointer is the head of the list of threads in the process.</para>
 | |
| 
 | |
| 	<para>The <literal>linux_emuldata</literal> structure can be obtained
 | |
| 	  from the process using <function>em_find</function>.  The prototype
 | |
| 	  of the function is:</para>
 | |
| 
 | |
| 	<programlisting>struct linux_emuldata *em_find(struct proc *, int locked);</programlisting>
 | |
| 
 | |
| 	<para>Here, <varname>proc</varname> is the process we want the emuldata
 | |
| 	  structure from and the <varname>locked</varname> parameter determines
 | |
| 	  whether we want to lock or not.  The accepted values are
 | |
| 	  <literal>EMUL_DOLOCK</literal> and <literal>EMUL_DONTLOCK</literal>.
 | |
| 	  More about locking later.</para>
 | |
|       </sect3>
 | |
| 
 | |
|       <sect3 id="pid-mangling">
 | |
| 	<title>PID mangling</title>
 | |
| 
 | |
| 	<para>Because &os; and &linux; have different views of what a process
 | |
| 	  ID and a thread ID are, as described above, we have to translate
 | |
| 	  between the two views somehow.  We do it by PID mangling.  This means
 | |
| 	  that we fake what a PID (=TGID) and TID (=PID) is between kernel and
 | |
| 	  userland.  The rule of thumb is that in kernel (in Linuxulator)
 | |
| 	  <literal>PID = PID</literal> and
 | |
| 	  <literal>TGID = shared -> group_pid</literal>, and to userland we
 | |
| 	  present <literal>PID = shared -> group_pid</literal> and
 | |
| 	  <literal>TID = proc -> p_pid</literal>.
 | |
| 	  The PID member of the <literal>linux_emuldata</literal> structure is
 | |
| 	  a &os; PID.</para>
 | |
| 
 | |
| 	<para>The above mainly affects the getpid, getppid and gettid
 | |
| 	  syscalls, where we use PID/TGID respectively.  In copyout of TIDs in
 | |
| 	  <varname>child_clear_tid</varname> and
 | |
| 	  <varname>child_set_tid</varname> we copy out the &os; PID.</para>
 | |
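<para>The rule of thumb can be illustrated with reduced versions of the two structures.  This is only a sketch: the list linkage is left out, and the <literal>sketch_</literal> function names are hypothetical stand-ins for what the emulated syscalls report to userland.</para>

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Reduced versions of the structures shown above (no list linkage). */
struct sketch_emuldata_shared {
	int	refs;
	pid_t	group_pid;	/* PID of the thread group leader (TGID) */
};

struct sketch_emuldata {
	pid_t	pid;		/* native PID of this process */
	struct sketch_emuldata_shared *shared;
};

/* What an emulated getpid() presents to userland: the TGID. */
pid_t
sketch_getpid(const struct sketch_emuldata *em)
{
	assert(em != NULL && em->shared != NULL);
	return (em->shared->group_pid);
}

/* What an emulated gettid() presents to userland: the native PID. */
pid_t
sketch_gettid(const struct sketch_emuldata *em)
{
	assert(em != NULL);
	return (em->pid);
}
```

<para>Two emulated threads that point to the same shared structure therefore report the same PID but different TIDs, which is what an NPTL program expects.</para>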
|       </sect3>
 | |
| 
 | |
|       <sect3 id="clone-syscall">
 | |
| 	<title>Clone syscall</title>
 | |
| 
 | |
| 	<para>The <function>clone</function> syscall is the way threads are
 | |
| 	  created in &linux;.  The syscall prototype looks like this:</para>
 | |
| 
 | |
| 	<programlisting>int linux_clone(l_int flags, void *stack, void *parent_tidptr, int dummy,
 | |
| void * child_tidptr);</programlisting>
 | |
| 
 | |
| 	<para>The <varname>flags</varname> parameter tells the syscall how
 | |
| 	  exactly the processes should be cloned.  As described above, &linux;
 | |
| 	  can create processes sharing various things independently, for
 | |
| 	  example two processes can share file descriptors but not VM, etc.
 | |
| 	  The lowest byte of the <varname>flags</varname> parameter is the exit
 | |
| 	  signal of the newly created process.  The <varname>stack</varname>
 | |
| 	  parameter, if non-<literal>NULL</literal>, tells where the thread
 | |
| 	  stack is, and if it is <literal>NULL</literal> we are supposed to
 | |
| 	  copy-on-write the calling process stack (i.e. do what the normal
 | |
| 	  &man.fork.2; routine does).  The <varname>parent_tidptr</varname>
 | |
| 	  parameter is used as an address for copying out the process PID (i.e.
 | |
| 	  thread id) once the process is sufficiently instantiated but is
 | |
| 	  not runnable yet.  The <varname>dummy</varname> parameter is here
 | |
| 	  because of the very strange calling convention of this syscall on
 | |
| 	  i386.  It uses the registers directly and does not let the compiler
 | |
| 	  do it, which results in the need for a dummy parameter.  The
 | |
| 	  <varname>child_tidptr</varname> parameter is used as an address
 | |
| 	  for copying out the PID once the process has finished forking and when
 | |
| 	  the process exits.</para>
 | |
| 
 | |
| 	<para>The syscall itself proceeds by setting corresponding flags
 | |
| 	  depending on the flags passed in.  For example,
 | |
| 	  <literal>CLONE_VM</literal> maps to <literal>RFMEM</literal> (sharing
 | |
| 	  of VM), etc.
 | |
| 	  The only nit here is <literal>CLONE_FS</literal> and
 | |
| 	  <literal>CLONE_FILES</literal> because &os; does not allow setting
 | |
| 	  these separately, so we fake it by not setting
 | |
| 	  <literal>RFFDG</literal> (copying of fd
 | |
| 	  table and other fs information) if either of these is defined.  This
 | |
| 	  does not cause any problems, because those flags are always set
 | |
| 	  together.  After setting the flags the process is forked using
 | |
| 	  the internal <function>fork1</function> routine; the process is
 | |
| 	  instrumented not to be put on a run queue, i.e. not to be set
 | |
| 	  runnable.  After the forking is done we possibly reparent the newly
 | |
| 	  created process to emulate <literal>CLONE_PARENT</literal> semantics.
 | |
| 	  The next part is creating the emulation data.  Threads in &linux; do
 | |
| 	  not signal their parents, so we set the exit signal to 0 to disable
 | |
| 	  this.  After that, <varname>child_set_tid</varname> and
 | |
| 	  <varname>child_clear_tid</varname> are set, enabling the
 | |
| 	  functionality later in the code.  At this point we copy out the PID
 | |
| 	  to the address specified by <varname>parent_tidptr</varname>.  The
 | |
| 	  setting of the process stack is done by simply rewriting the thread
 | |
| 	  frame <varname>%esp</varname> register (<varname>%rsp</varname> on
 | |
| 	  amd64).
 | |
| 	  The next part is setting up TLS for the newly created process.  After
 | |
| 	  this, &man.vfork.2; semantics might be emulated, and finally the newly
 | |
| 	  created process is put on a run queue and its PID is copied out to the
 | |
| 	  parent process via the <function>clone</function> return value.</para>
 | |
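<para>The flag translation at the start of this sequence can be sketched as follows.  The &linux; flag values are the ones from the &linux; headers; the <literal>SK_RF*</literal> constants are stand-ins for the native fork flags and their values should be treated as illustrative, as should the request structure and the function name.</para>

```c
#include <assert.h>
#include <stddef.h>

/* Linux clone() flag values. */
#define	SK_CSIGNAL	0x000000ff	/* mask for the exit signal */
#define	SK_CLONE_VM	0x00000100
#define	SK_CLONE_FS	0x00000200
#define	SK_CLONE_FILES	0x00000400

/* Stand-ins for the native fork flags; values are illustrative. */
#define	SK_RFFDG	(1 << 2)	/* copy the fd table */
#define	SK_RFPROC	(1 << 4)	/* create a new process */
#define	SK_RFMEM	(1 << 5)	/* share the address space */

struct sketch_fork_req {
	int	ff;		/* native fork flags */
	int	exit_signal;	/* signal sent to the parent on exit */
};

/* Translate clone() flags into a native fork request. */
void
sketch_translate_clone_flags(unsigned int flags, struct sketch_fork_req *req)
{
	assert(req != NULL);
	req->exit_signal = (int)(flags & SK_CSIGNAL);
	req->ff = SK_RFPROC;
	if (flags & SK_CLONE_VM)
		req->ff |= SK_RFMEM;
	/* Copy the fd table only if neither kind of sharing is requested. */
	if ((flags & (SK_CLONE_FS | SK_CLONE_FILES)) == 0)
		req->ff |= SK_RFFDG;
}
```

<para>A plain fork-like call, which carries only an exit signal in the lowest byte, ends up with a private copy of the descriptor table, while a thread-like call with VM, FS and FILES sharing ends up with shared memory and no copy.</para>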
| 
 | |
| 	<para>The <function>clone</function> syscall is able to emulate, and in
 | |
| 	  fact is used for emulating, the classic &man.fork.2; and &man.vfork.2;
 | |
| 	  syscalls.  Newer glibc on a 2.6 kernel uses <function>clone</function>
 | |
| 	  to implement the &man.fork.2; and &man.vfork.2; syscalls.</para>
 | |
|       </sect3>
 | |
| 
 | |
|       <sect3 id="locking">
 | |
| 	<title>Locking</title>
 | |
| 
 | |
| 	<para>The locking is implemented to be per-subsystem because we do not
 | |
| 	  expect a lot of contention on these.  There are two locks:
 | |
| 	  <literal>emul_lock</literal> used to protect manipulating of
 | |
| 	  <literal>linux_emuldata</literal> and
 | |
| 	  <literal>emul_shared_lock</literal> used to protect manipulating of
 | |
| 	  <literal>linux_emuldata_shared</literal>.  The
 | |
| 	  <literal>emul_lock</literal> is a nonsleepable blocking mutex while
 | |
| 	  <literal>emul_shared_lock</literal> is a sleepable blocking
 | |
| 	  <literal>sx_lock</literal>.  Because of the per-subsystem locking we
 | |
| 	  can coalesce some locks and that is why <function>em_find</function>
 | |
| 	  offers the non-locking access.</para>
 | |
|       </sect3>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 id="tls">
 | |
|       <title>TLS</title>
 | |
| 
 | |
|       <para>This section deals with TLS also known as thread local
 | |
| 	storage.</para>
 | |
| 
 | |
|       <sect3 id="trheading-intro">
 | |
| 	<title>Introduction to threading</title>
 | |
| 
 | |
| 	<para>Threads in computer science are entities within a process that
 | |
| 	  can be scheduled independently from each other.  The threads in the
 | |
| 	  process share process wide data (file descriptors, etc.) but also
 | |
| 	  have their own stack for their own data.  Sometimes there is a need
 | |
| 	  for process-wide data specific to a given thread.  Imagine, for
 | |
| 	  example, the name of the thread in execution.  The traditional
 | |
| 	  &unix; threading API, <application>pthreads</application>, provides
 | |
| 	  a way to do it via &man.pthread.key.create.3;,
 | |
| 	  &man.pthread.setspecific.3; and &man.pthread.getspecific.3;, where a
 | |
| 	  thread can create a key to the thread local data and use
 | |
| 	  &man.pthread.setspecific.3; or &man.pthread.getspecific.3; to
 | |
| 	  manipulate those data.  You can easily see that this is not the most
 | |
| 	  comfortable way this could be accomplished.  So various producers of
 | |
| 	  C/C++ compilers introduced a better way.  They defined a new modifier
 | |
| 	  keyword, <literal>__thread</literal>, that specifies that a variable
 | |
| 	  is thread specific.  A new method of accessing such variables was
 | |
| 	  developed as well (at least on i386).</para>
 | |
| 
 | |
| 	<para>The <application>pthreads</application> method tends
 | |
| 	  to be implemented in userspace as a trivial lookup table.  The
 | |
| 	  performance of such a solution is not very good.  So the new method
 | |
| 	  uses (on i386) segment registers to address a segment where the TLS
 | |
| 	  area is stored, so accessing a thread variable is just a matter of
 | |
| 	  prefixing the address with the segment register, thus addressing via
 | |
| 	  it.  The segment registers are usually <varname>%gs</varname> and
 | |
| 	  <varname>%fs</varname>, acting like segment selectors.  Every thread
 | |
| 	  has its own area where the thread local data are stored and the
 | |
| 	  segment must be loaded on every context switch.  This method is very
 | |
| 	  fast and used almost exclusively in the whole i386 &unix; world.
 | |
| 	  Both &os; and &linux; implement this approach and it yields very good
 | |
| 	  results.  The only drawback is the need to reload the segment on
 | |
| 	  every context switch, which can slow down context switches.  &os; tries
 | |
| 	  to avoid this overhead by using only 1 segment descriptor for this
 | |
| 	  while &linux; uses 3.  Interestingly, almost nothing uses
 | |
| 	  more than 1 descriptor (only <application>Wine</application> seems to
 | |
| 	  use 2) so &linux; pays this unnecessary price for context
 | |
| 	  switches.</para>
 | |
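<para>The two ways of keeping per-thread data can be put side by side in a short program.  This is a minimal sketch that assumes a compiler supporting the <literal>__thread</literal> extension (GCC and Clang do) and a POSIX threads library; the <literal>sketch_</literal> names are made up for this example.</para>

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

static pthread_key_t sketch_key;	/* classic pthreads API */
static __thread int sketch_tls_value;	/* compiler-supported TLS */

/* Thread body: store the argument both ways and read it back. */
static void *
sketch_worker(void *arg)
{
	intptr_t id = (intptr_t)arg;
	intptr_t seen;

	pthread_setspecific(sketch_key, (void *)id);
	sketch_tls_value = (int)id;

	seen = (intptr_t)pthread_getspecific(sketch_key);
	/* Report success only if both mechanisms return our own value. */
	return ((void *)(intptr_t)(seen == id && sketch_tls_value == (int)id));
}

/* Run two threads; returns 1 if each of them saw only its own data. */
int
sketch_tls_demo(void)
{
	pthread_t t[2];
	void *res[2];
	int i, rc;

	rc = pthread_key_create(&sketch_key, NULL);
	assert(rc == 0);
	for (i = 0; i < 2; i++) {
		rc = pthread_create(&t[i], NULL, sketch_worker,
		    (void *)(intptr_t)(i + 1));
		assert(rc == 0);
	}
	for (i = 0; i < 2; i++)
		pthread_join(t[i], &res[i]);
	pthread_key_delete(sketch_key);
	/* The calling thread never wrote its copy, so it is still zero. */
	return (res[0] != NULL && res[1] != NULL && sketch_tls_value == 0);
}
```

<para>The key-based variant needs a function call for every access, whereas the <literal>__thread</literal> variable is read and written like an ordinary variable and the compiler generates the segment-relative access.</para>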
|       </sect3>
 | |
| 
 | |
|       <sect3 id="i386-segs">
 | |
| 	<title>Segments on i386</title>
 | |
| 
 | |
| 	<para>The i386 architecture implements so-called segments.  A
 | |
| 	  segment is a description of an area of memory: the base address
 | |
| 	  (bottom) of the memory area, the end of it (ceiling), type,
 | |
| 	  protection, etc.  The memory described by a segment can be accessed
 | |
| 	  using segment selector registers (<varname>%cs</varname>,
 | |
| 	  <varname>%ds</varname>, <varname>%ss</varname>,
 | |
| 	  <varname>%es</varname>, <varname>%fs</varname>,
 | |
| 	  <varname>%gs</varname>).  For example, let us suppose we have a
 | |
| 	  segment whose base address is 0x1234 (and of some length) and this
 | |
| 	  code:</para>
 | |
| 
 | |
| 	<programlisting>mov %edx,%gs:0x10</programlisting>
 | |
| 
 | |
| 	<para>This will store the content of the <varname>%edx</varname>
 | |
| 	  register into memory location 0x1244.  Some segment registers have
 | |
| 	  a special use, for example <varname>%cs</varname> is used for code
 | |
| 	  segment and <varname>%ss</varname> is used for stack segment but
 | |
| 	  <varname>%fs</varname> and <varname>%gs</varname> are generally
 | |
| 	  unused.  Segments are either stored in a global GDT table or in a
 | |
| 	  local LDT table.  LDT is accessed via an entry in the GDT.  The
 | |
| 	  LDT can store more types of segments.  LDT can be per process.
 | |
| 	  Both tables define up to 8191 entries.</para>
 | |
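<para>The address arithmetic from the example, and the way a selector value encodes a table index, can be sketched in C.  The reduced segment structure and the function names are assumptions of this example; the selector layout (13-bit index, one table indicator bit, 2-bit privilege level) is the one defined by the i386 architecture.</para>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced view of a segment: just its base and its size in bytes. */
struct sketch_segment {
	uint32_t base;
	uint32_t size;
};

/*
 * Compute the linear address for "segment:offset".  Returns 0 and
 * stores the address on success, or -1 if the offset lies outside the
 * segment (the CPU would raise a fault in that case).
 */
int
sketch_linear_address(const struct sketch_segment *seg, uint32_t offset,
    uint32_t *linear)
{
	assert(seg != NULL && linear != NULL);
	if (offset >= seg->size)
		return (-1);
	*linear = seg->base + offset;
	return (0);
}

/*
 * Build a segment selector: 13-bit table index, one table indicator
 * bit (0 = GDT, 1 = LDT) and a 2-bit requested privilege level.
 */
uint16_t
sketch_selector(unsigned int index, int use_ldt, unsigned int rpl)
{
	assert(index < 8192 && rpl < 4);
	return ((uint16_t)((index << 3) | (use_ldt ? 4u : 0u) | rpl));
}
```

<para>For a segment with base 0x1234, offset 0x10 resolves to 0x1244 as in the example above.  The 13-bit index field is also what bounds the number of entries in each table.</para>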
|       </sect3>
 | |
| 
 | |
|       <sect3 id="linux-i386">
 | |
| 	<title>Implementation on &linux; i386</title>
 | |
| 
 | |
| 	<para>There are two main ways of setting up TLS in &linux;.  It can be
 | |
| 	  set when cloning a process using the <function>clone</function>
 | |
| 	  syscall or a process can call <function>set_thread_area</function>.
 | |
| 	  When a process passes the <literal>CLONE_SETTLS</literal> flag to
 | |
| 	  <function>clone</function>, the kernel expects the memory pointed to
 | |
| 	  by the <varname>%esi</varname> register to contain a &linux; user
 | |
| 	  space representation of a segment, which gets translated to the machine
 | |
| 	  representation of a segment and loaded into a GDT slot.  The
 | |
| 	  GDT slot can be specified with a number or -1 can be used meaning
 | |
| 	  that the system itself should choose the first free slot.  In
 | |
| 	  practice, the vast majority of programs use only one TLS entry and
 | |
| 	  do not care about the number of the entry.  We exploit this in the
 | |
| 	  emulation and in fact depend on it.</para>
 | |
|       </sect3>
 | |
| 
 | |
|       <sect3 id="tls-emu">
 | |
| 	<title>Emulation of &linux; TLS</title>
 | |
| 
 | |
| 	<sect4 id="tls-i386">
 | |
| 	  <title>i386</title>
 | |
| 
 | |
| 	  <para>Loading of TLS for the current thread happens by calling
 | |
| 	    <function>set_thread_area</function> while loading TLS for a
 | |
| 	    second process in <function>clone</function> is done in a
 | |
| 	    separate block in <function>clone</function>.  Those two code paths
 | |
| 	    are very similar.  The only difference is the actual loading of
 | |
| 	    the GDT segment, which happens on the next context switch for the
 | |
| 	    newly created process while <function>set_thread_area</function>
 | |
| 	    must load this directly.  The code basically does the following.
 | |
| 	    It copies the &linux; form segment descriptor from the userland.
 | |
| 	    The code
 | |
| 	    checks for the number of the descriptor but because this differs
 | |
| 	    between &os; and &linux; we fake it a little.  We only support
 | |
| 	    indexes of 6, 3 and -1.  The 6 is the genuine &linux; number, 3 is
 | |
| 	    the genuine &os; one and -1 means autoselection.  Then we set the
 | |
| 	    descriptor number to constant 3 and copy this out to
 | |
| 	    userspace.  We rely on the userspace process using the number from
 | |
| 	    the descriptor, and this works most of the time (we have never seen a
 | |
| 	    case where this did not work) as the userspace process typically
 | |
| 	    passes in -1.  Then we convert the descriptor from the &linux; form
 | |
| 	    to a machine dependent form (i.e. operating system independent
 | |
| 	    form) and copy this to the &os; defined segment descriptor.
 | |
| 	    Finally we can load it.  We assign the descriptor to the thread's PCB
 | |
| 	    (process control block) and load the <varname>%gs</varname>
 | |
| 	    segment using <function>load_gs</function>.  This loading must be
 | |
| 	    done in a critical section so that nothing can interrupt us.
 | |
| 	    The <literal>CLONE_SETTLS</literal> case works exactly like this,
 | |
| 	    except that the loading using <function>load_gs</function> is not
 | |
| 	    performed.  The segment used for this (segment number 3) is
 | |
| 	    shared for this use between &os; processes and &linux; processes
 | |
| 	    so the &linux; emulation layer does not add any overhead over
 | |
| 	    plain &os;.</para>
 | |
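<para>The handling of the descriptor number can be sketched as a small validation routine.  The function name, the constant names and the choice of <literal>EINVAL</literal> as the error value are assumptions made for this example; only the accepted numbers and the rewrite to 3 come from the description above.</para>

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define	SKETCH_LINUX_TLS_ENTRY	6	/* number a Linux program may pass */
#define	SKETCH_NATIVE_TLS_ENTRY	3	/* number used by the host system */

/*
 * Validate the entry number requested by userland and replace it with
 * the single entry the emulation supports.  Returns 0 on success or
 * EINVAL for any other number, leaving the value untouched.
 */
int
sketch_fix_tls_entry(int *entry_number)
{
	assert(entry_number != NULL);
	switch (*entry_number) {
	case SKETCH_LINUX_TLS_ENTRY:
	case SKETCH_NATIVE_TLS_ENTRY:
	case -1:		/* "pick a free slot for me" */
		*entry_number = SKETCH_NATIVE_TLS_ENTRY;
		return (0);
	default:
		return (EINVAL);
	}
}
```

<para>The rewritten number is what gets copied back out, so a program that uses the returned value rather than the one it passed in keeps working.</para>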
| 	</sect4>
 | |
| 
 | |
| 	<sect4 id="tls-amd64">
 | |
| 	  <title>amd64</title>
 | |
| 
 | |
| 	  <para>The amd64 implementation is similar to the i386 one but there
 | |
| 	    was initially no 32-bit segment descriptor used for this purpose
 | |
| 	    (hence not even native 32-bit TLS users worked) so we had to add
 | |
| 	    such a segment and implement its loading on every context switch
 | |
| 	    (when a flag signaling use of 32-bit is set).  Apart from this the
 | |
| 	    TLS loading is exactly the same; only the segment numbers are
 | |
| 	    different, and the descriptor format and the loading differ
 | |
| 	    slightly.</para>
 | |
| 	</sect4>
 | |
|       </sect3>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 id="futexes">
 | |
|       <title>Futexes</title>
 | |
| 
 | |
|       <sect3 id="sync-intro">
 | |
| 	<title>Introduction to synchronization</title>
 | |
| 
 | |
| 	<para>Threads need some kind of synchronization and &posix; provides
 | |
| 	  several mechanisms: mutexes for mutual exclusion, read-write locks for
 | |
| 	  mutual exclusion with biased ratio of reads and writes and condition
 | |
| 	  variables for signaling a status change.  It is interesting to note
 | |
| 	  that the &posix; threading API lacks support for semaphores.  The
 | |
| 	  implementations of those synchronization routines are heavily
 | |
| 	  dependent on the type of threading support we have.  In the pure 1:M
 | |
| 	  (userspace) model
 | |
| 	  the implementation can be solely done in userspace and thus be very
 | |
| 	  fast (the condition variables will probably end up being implemented
 | |
| 	  using signals, i.e. not fast) and simple.  In the 1:1 model, the
 | |
| 	  situation is also quite clear - the threads must be synchronized
 | |
| 	  using kernel facilities (which is very slow because a syscall must be
 | |
| 	  performed).  The mixed M:N scenario just combines the first and
 | |
| 	  second approach or relies solely on the kernel.  Thread synchronization
 | |
| 	  is a vital part of thread-enabled programming and its performance can
 | |
| 	  affect the resulting program a lot.  Recent benchmarks on the &os;
 | |
| 	  operating
 | |
| 	  system showed that an improved sx_lock implementation yielded a 40%
 | |
| 	  speedup in <firstterm>ZFS</firstterm> (a heavy sx user).  This
 | |
| 	  is in-kernel code, but it shows clearly how important the performance
 | |
| 	  of synchronization primitives is.</para>
 | |
| 
 | |
| 	<para>Threaded programs should be written with as little contention on
 | |
| 	  locks as possible.  Otherwise, instead of doing useful work, the
 | |
| 	  thread just waits on a lock.  Because of this, most well-written
 | |
| 	  threaded programs show little lock contention.</para>
 | |
|       </sect3>
 | |
| 
 | |
|       <sect3 id="futex-intro">
 | |
| 	<title>Futexes introduction</title>
 | |
| 
 | |
| 	<para>&linux; implements 1:1 threading, i.e. it has to use in-kernel
 | |
| 	  synchronization primitives.  As stated earlier, well-written threaded
 | |
| 	  programs have little lock contention, so a typical lock/unlock
 | |
| 	  sequence could be performed as two atomic operations, an increase and
 | |
| 	  a decrease of the mutex reference counter, which is very fast.
 | |
| 	  Consider the following example:</para>
 | |
| 
 | |
| 	<programlisting>pthread_mutex_lock(&mutex); 
 | |
| .... 
 | |
| pthread_mutex_unlock(&mutex);</programlisting>
 | |
| 
 | |
| 	<para>1:1 threading forces us to perform two syscalls for those mutex
 | |
| 	  calls, which is very slow.</para>
 | |
| 
 | |
| 	<para>The solution &linux; 2.6 implements is called futexes.
 | |
| 	  Futexes implement the check for contention in userspace and call
 | |
| 	  kernel primitives only in the case of contention.  Thus the typical
 | |
| 	  case takes place without any kernel intervention.  This yields a
 | |
| 	  reasonably fast and flexible implementation of synchronization
 | |
| 	  primitives.</para>
 | |
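<para>The split between the userspace fast path and the kernel slow path can be sketched as follows.  This is a minimal single-threaded model, not the NPTL or kernel code: the names <function>sketch_lock</function> and <function>sketch_unlock</function>, the three-state encoding, and the two counters standing in for the <literal>FUTEX_WAIT</literal> and <literal>FUTEX_WAKE</literal> syscalls are all assumptions made for illustration.</para>

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical counters standing in for the FUTEX_WAIT/FUTEX_WAKE syscalls. */
int simulated_waits;
int simulated_wakes;

/* 0 = unlocked, 1 = locked, 2 = locked and contended. */
typedef struct {
	atomic_int state;
} sketch_mutex;

/*
 * Try to take the lock.  Returns 1 when the uncontended fast path
 * succeeded; returns 0 where a real implementation would now sleep
 * in the kernel via FUTEX_WAIT.
 */
int
sketch_lock(sketch_mutex *m)
{
	int expected = 0;

	if (atomic_compare_exchange_strong(&m->state, &expected, 1))
		return (1);		/* fast path: no syscall */
	atomic_store(&m->state, 2);	/* remember that someone waits */
	simulated_waits++;		/* slow path: kernel would be entered */
	return (0);
}

/* Release the lock; only a contended lock needs a wakeup. */
void
sketch_unlock(sketch_mutex *m)
{
	assert(atomic_load(&m->state) != 0);
	if (atomic_exchange(&m->state, 0) == 2)
		simulated_wakes++;	/* slow path: FUTEX_WAKE */
}
```

<para>In the uncontended case neither counter moves, which corresponds to the typical case taking place without any kernel intervention.</para>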
|       </sect3>
 | |
| 
 | |
|       <sect3 id="futex-api">
 | |
| 	<title>Futex API</title>
 | |
| 
 | |
| 	<para>The futex syscall looks like this:</para>
 | |
| 
 | |
| 	<programlisting>int futex(void *uaddr, int op, int val, struct timespec *timeout, void *uaddr2, int val3);</programlisting>
 | |
| 
 | |
| 	<para>Here <varname>uaddr</varname> is the address of the
 | |
| 	  mutex in userspace, <varname>op</varname> is the operation we are
 | |
| 	  about to perform, and the other parameters have a per-operation
 | |
| 	  meaning.</para>
 | |
| 
 | |
| 	<para>Futexes implement the following operations:</para>
 | |
| 
 | |
| 	<itemizedlist>
 | |
| 	  <listitem>
 | |
| 	    <para><literal>FUTEX_WAIT</literal></para>
 | |
| 	  </listitem>
 | |
| 	  <listitem>
 | |
| 	    <para><literal>FUTEX_WAKE</literal></para>
 | |
| 	  </listitem>
 | |
| 	  <listitem>
 | |
| 	    <para><literal>FUTEX_FD</literal></para>
 | |
| 	  </listitem>
 | |
| 	  <listitem>
 | |
| 	    <para><literal>FUTEX_REQUEUE</literal></para>
 | |
| 	  </listitem>
 | |
| 	  <listitem>
 | |
| 	    <para><literal>FUTEX_CMP_REQUEUE</literal></para>
 | |
| 	  </listitem>
 | |
| 	  <listitem>
 | |
| 	    <para><literal>FUTEX_WAKE_OP</literal></para>
 | |
| 	  </listitem>
 | |
| 	</itemizedlist> 
 | |
| 
 | |
| 	<sect4 id="futex-wait">
 | |
| 	  <title>FUTEX_WAIT</title>
 | |
| 
 | |
| 	  <para>This operation verifies that the value
 | |
| 	    <varname>val</varname> is written at the address
 | |
| 	    <varname>uaddr</varname>.  If not, <literal>EWOULDBLOCK</literal> is
 | |
| 	    returned; otherwise the thread is queued on the futex and gets
 | |
| 	    suspended.  If the argument <varname>timeout</varname> is
 | |
| 	    non-NULL, it specifies the maximum time for sleeping;
 | |
| 	    otherwise the sleep is unbounded.</para>
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 id="futex-wake">
 | |
| 	  <title>FUTEX_WAKE</title>
 | |
| 
 | |
| 	  <para>This operation takes a futex at <varname>uaddr</varname>
 | |
| 	    and wakes up the first <varname>val</varname> threads queued
 | |
| 	    on this futex.</para>
 | |
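<para>The semantics of these two operations can be modelled in a few lines.  The sketch below is a single-threaded model under stated assumptions: <literal>struct model_futex</literal>, <function>model_wait</function> and <function>model_wake</function> are hypothetical names, and a waiter is represented only by a counter instead of a sleeping thread.</para>

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical single-threaded model of one futex and its wait queue. */
struct model_futex {
	int *uaddr;	/* userspace word the futex is keyed on */
	int nwaiters;	/* threads currently queued */
};

/* FUTEX_WAIT: queue the caller only if *uaddr still holds val. */
int
model_wait(struct model_futex *f, int val)
{
	if (*f->uaddr != val)
		return (EWOULDBLOCK);
	f->nwaiters++;		/* the real thread would now sleep */
	return (0);
}

/* FUTEX_WAKE: wake at most val waiters and report how many were woken. */
int
model_wake(struct model_futex *f, int val)
{
	int woken;

	assert(val >= 0);
	woken = f->nwaiters < val ? f->nwaiters : val;
	f->nwaiters -= woken;
	return (woken);
}
```

<para>The value check in <function>model_wait</function> is what lets userspace avoid sleeping when the futex word changed between its own check and the syscall.</para>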
| 	</sect4>
 | |
| 
 | |
| 	<sect4 id="futex-fd">
 | |
| 	  <title>FUTEX_FD</title>
 | |
| 
 | |
| 	  <para>This operation associates a file descriptor with a given
 | |
| 	    futex.</para>
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 id="futex-requeue">
 | |
| 	  <title>FUTEX_REQUEUE</title>
 | |
| 
 | |
| 	  <para>This operation takes <varname>val</varname> threads
 | |
| 	    queued on the futex at <varname>uaddr</varname> and wakes them up,
 | |
| 	    then takes the next <varname>val2</varname> threads and requeues
 | |
| 	    them on the futex at <varname>uaddr2</varname>.</para>
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 id="futex-cmp-requeue">
 | |
| 	  <title>FUTEX_CMP_REQUEUE</title>
 | |
| 
 | |
| 	  <para>This operation does the same as
 | |
| 	    <literal>FUTEX_REQUEUE</literal> but it first checks that
 | |
| 	    the value at <varname>uaddr</varname> equals
 | |
| 	    <varname>val3</varname>.</para>
 | |
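<para>The requeue operations can be illustrated with the same kind of counting model.  This is a sketch only: the structure and function names are hypothetical, the return value is simplified to the number of woken waiters, and the comparison is assumed to be between the word at <varname>uaddr</varname> and <varname>val3</varname>, as in the &linux; implementation.</para>

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model: a futex is a userspace word plus a waiter count. */
struct model_futex {
	int *uaddr;
	int nwaiters;
};

/*
 * FUTEX_REQUEUE: wake up to val waiters of src, then move up to val2 of
 * the remaining waiters onto dst.  Returns the number of woken waiters.
 */
int
model_requeue(struct model_futex *src, struct model_futex *dst,
    int val, int val2)
{
	int woken, moved;

	assert(val >= 0 && val2 >= 0);
	woken = src->nwaiters < val ? src->nwaiters : val;
	src->nwaiters -= woken;
	moved = src->nwaiters < val2 ? src->nwaiters : val2;
	src->nwaiters -= moved;
	dst->nwaiters += moved;
	return (woken);
}

/* FUTEX_CMP_REQUEUE: the same, but only if *uaddr still equals val3. */
int
model_cmp_requeue(struct model_futex *src, struct model_futex *dst,
    int val, int val2, int val3)
{
	if (*src->uaddr != val3)
		return (-EAGAIN);
	return (model_requeue(src, dst, val, val2));
}
```

<para>Requeueing instead of waking avoids waking threads that would immediately block again on the second futex.</para>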
| 	</sect4>
 | |
| 
 | |
| 	<sect4 id="futex-wake-op">
 | |
| 	  <title>FUTEX_WAKE_OP</title>
 | |
| 
 | |
| 	  <para>This operation performs an atomic operation encoded in
 | |
| 	    <varname>val3</varname> (which packs the operation, a comparison,
 | |
| 	    and their arguments) on the value at <varname>uaddr2</varname>.
 | |
| 	    Then it wakes up <varname>val</varname> threads on the futex at
 | |
| 	    <varname>uaddr</varname>, and if the atomic operation returned a
 | |
| 	    positive number it wakes up <varname>val2</varname> threads on
 | |
| 	    the futex at <varname>uaddr2</varname>.</para>
 | |
| 
 | |
| 	  <para>The operations implemented in
 | |
| 	    <literal>FUTEX_WAKE_OP</literal>:</para>
 | |
| 
 | |
| 	  <itemizedlist>
 | |
| 	    <listitem>
 | |
| 	      <para><literal>FUTEX_OP_SET</literal></para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para><literal>FUTEX_OP_ADD</literal></para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para><literal>FUTEX_OP_OR</literal></para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para><literal>FUTEX_OP_ANDN</literal></para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para><literal>FUTEX_OP_XOR</literal></para>
 | |
| 	    </listitem>
 | |
| 	  </itemizedlist>
 | |
| 
 | |
| 	  <note>
 | |
| 	    <para>There is no <varname>val2</varname> parameter in the
 | |
| 	      futex prototype.  The <varname>val2</varname> is taken from the
 | |
| 	      <varname>struct timespec *timeout</varname> parameter
 | |
| 	      for operations <literal>FUTEX_REQUEUE</literal>,
 | |
| 	      <literal>FUTEX_CMP_REQUEUE</literal> and
 | |
| 	      <literal>FUTEX_WAKE_OP</literal>.</para>
 | |
| 	  </note>
 | |
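<para>How a count can travel through a pointer-typed parameter is easiest to see in code.  The helper names below are hypothetical and the sketch assumes the usual flat address model, where an integer survives a round trip through <literal>uintptr_t</literal>.</para>

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/*
 * For FUTEX_REQUEUE, FUTEX_CMP_REQUEUE and FUTEX_WAKE_OP the fourth
 * argument carries a count, not a pointer, so it is converted back
 * to an integer instead of being dereferenced.
 */
int
futex_val2(const struct timespec *timeout)
{
	return ((int)(uintptr_t)timeout);
}

/* Caller side: pass the count through the pointer-typed parameter. */
struct timespec *
futex_pack_val2(int val2)
{
	assert(val2 >= 0);
	return ((struct timespec *)(uintptr_t)val2);
}
```

<para>The pointer produced by <function>futex_pack_val2</function> must never be dereferenced; it is only a container for the number.</para>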
| 	</sect4>
 | |
|       </sect3>
 | |
| 
 | |
|       <sect3 id="futex-emu">
 | |
| 	<title>Futex emulation in &os;</title>
 | |
| 
 | |
| 	<para>The futex emulation in &os; is taken from NetBSD and further
 | |
| 	  extended by us.  It is placed in <filename>linux_futex.c</filename>
 | |
| 	  and <filename>linux_futex.h</filename> files.  The
 | |
| 	  <literal>futex</literal> structure looks like:</para>
 | |
| 
 | |
| 	<programlisting>struct futex {
 | |
|   void *f_uaddr;
 | |
|   int f_refcount;
 | |
| 
 | |
|   LIST_ENTRY(futex) f_list;
 | |
| 
 | |
|   TAILQ_HEAD(lf_waiting_proc, waiting_proc) f_waiting_proc;
 | |
| };</programlisting>
 | |
| 
 | |
| 	<para>And the structure <literal>waiting_proc</literal> is:</para>
 | |
| 
 | |
| 	<programlisting>struct waiting_proc {
 | |
|   struct thread *wp_t;
 | |
|   struct futex *wp_new_futex;
 | |
| 
 | |
|   TAILQ_ENTRY(waiting_proc) wp_list;
 | |
| };</programlisting>
 | |
| 
 | |
| 	<sect4 id="futex-get">
 | |
| 	  <title>futex_get / futex_put</title>
 | |
| 
 | |
| 	  <para>A futex is obtained using the <function>futex_get</function>
 | |
| 	    function, which searches a linear list of futexes and returns the
 | |
| 	    found one or creates a new futex.  When a futex is no longer in
 | |
| 	    use, we call the <function>futex_put</function> function, which
 | |
| 	    decreases the reference counter of the futex and releases the
 | |
| 	    futex when the refcount reaches zero.</para>
 | |
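<para>The lookup-or-create and reference-counting behaviour can be sketched in userspace as follows.  This is an illustration rather than the kernel code: the <literal>sketch_</literal> names are hypothetical, a plain <varname>f_next</varname> pointer replaces the <literal>LIST_ENTRY</literal> linkage, the waiter queue is left out, and no locking is shown.</para>

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified userspace stand-in for the kernel futex structure. */
struct sketch_futex {
	void *f_uaddr;
	int f_refcount;
	struct sketch_futex *f_next;	/* the kernel code uses LIST_ENTRY */
};

struct sketch_futex *futex_list;	/* head of the linear list */

/* Look the futex up by address, creating it on first use. */
struct sketch_futex *
sketch_futex_get(void *uaddr)
{
	struct sketch_futex *f;

	for (f = futex_list; f != NULL; f = f->f_next) {
		if (f->f_uaddr == uaddr) {
			f->f_refcount++;
			return (f);
		}
	}
	f = calloc(1, sizeof(*f));
	if (f == NULL)
		return (NULL);
	f->f_uaddr = uaddr;
	f->f_refcount = 1;
	f->f_next = futex_list;
	futex_list = f;
	return (f);
}

/* Drop one reference; the last reference unlinks and frees the futex. */
void
sketch_futex_put(struct sketch_futex *f)
{
	struct sketch_futex **pp;

	assert(f->f_refcount > 0);
	if (--f->f_refcount > 0)
		return;
	for (pp = &futex_list; *pp != NULL; pp = &(*pp)->f_next) {
		if (*pp == f) {
			*pp = f->f_next;
			break;
		}
	}
	free(f);
}
```

<para>Two lookups of the same address return the same object, so every waiter and waker of one userspace word ends up on one queue.</para>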
| 	</sect4>
 | |
| 
 | |
| 	<sect4 id="futex-sleep">
 | |
| 	  <title>futex_sleep</title>
 | |
| 
 | |
| 	  <para>When a futex queues a thread for sleeping, it creates a
 | |
| 	    <literal>waiting_proc</literal> structure and puts this structure
 | |
| 	    on the list inside the futex structure; then it just performs a
 | |
| 	    &man.tsleep.9; to suspend the thread.  The sleep can be timed out.
 | |
| 	    After &man.tsleep.9; returns (the thread was woken up or it timed
 | |
| 	    out), the <literal>waiting_proc</literal> structure is removed
 | |
| 	    from the list and is destroyed.  All this is done in the
 | |
| 	    <function>futex_sleep</function> function.  If we were woken up
 | |
| 	    by <function>futex_wake</function> with
 | |
| 	    <varname>wp_new_futex</varname> set, we sleep again on that new
 | |
| 	    futex.  This way the actual requeueing is done in this
 | |
| 	    function.</para>
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 id="futex-wake-2">
 | |
| 	  <title>futex_wake</title>
 | |
| 
 | |
| 	  <para>Waking up a thread sleeping on a futex is performed in the
 | |
| 	    <function>futex_wake</function> function.  First, this function
 | |
| 	    mimics the strange &linux; behaviour of waking up N threads
 | |
| 	    for all operations, with the only exception being that the REQUEUE
 | |
| 	    operations are performed on N+1 threads.  This usually does not
 | |
| 	    make any difference, as we are waking up all threads.  Next, in
 | |
| 	    a loop, we wake up n threads, and then we check whether
 | |
| 	    there is a new futex for requeueing.  If so, we requeue up to n2
 | |
| 	    threads on the new futex.  This cooperates with
 | |
| 	    <function>futex_sleep</function>.</para>
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 id="futex-wake-op-2">
 | |
| 	  <title>futex_wake_op</title>
 | |
| 
 | |
| 	  <para>The <literal>FUTEX_WAKE_OP</literal> operation is quite
 | |
| 	    complicated.  First we obtain two futexes at addresses
 | |
| 	    <varname>uaddr</varname> and <varname>uaddr2</varname>, then we
 | |
| 	    perform the atomic operation using <varname>val3</varname> and
 | |
| 	    <varname>uaddr2</varname>.  Then <varname>val</varname> waiters
 | |
| 	    on the first futex are woken up, and if the atomic operation
 | |
| 	    condition holds we wake up <varname>val2</varname> (i.e.
 | |
| 	    <varname>timeout</varname>) waiters on the second futex.</para>
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 id="futex-atomic-op">
 | |
| 	  <title>futex atomic operation</title>
 | |
| 
 | |
| 	  <para>The atomic operation takes two parameters,
 | |
| 	    <varname>encoded_op</varname> and <varname>uaddr</varname> (for
 | |
| 	    <literal>FUTEX_WAKE_OP</literal> the caller passes
 | |
| 	    <varname>uaddr2</varname> here).  The encoded operation encodes
 | |
| 	    the operation itself, the comparison, the operation argument, and
 | |
| 	    the comparison argument.  The pseudocode for the operation looks
 | |
| 	    like this:</para>
 | |
| 
 | |
| 	  <programlisting>oldval = *uaddr
 | |
| *uaddr = oldval OP oparg</programlisting>
 | |
| 
 | |
| 	  <para>This is done atomically.  First the number at
 | |
| 	    <varname>uaddr</varname> is copied in and the operation is
 | |
| 	    performed.  The code handles page faults, and if no page fault
 | |
| 	    occurs, <varname>oldval</varname> is compared to the
 | |
| 	    <varname>cmparg</varname> argument using the cmp comparator.</para>
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 id="futex-locking">
 | |
| 	  <title>Futex locking</title>
 | |
| 
 | |
| 	  <para>The futex implementation uses two locks: an
 | |
| 	    <function>sx_lock</function> protecting the futex list, and a
 | |
| 	    global lock (either Giant or another <function>sx_lock</function>).
 | |
| 	    Every operation is performed locked from start to finish.</para>
 | |
| 	</sect4>
 | |
|       </sect3>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 id="syscall-impl">
 | |
|       <title>Various syscalls implementation</title>
 | |
| 
 | |
|       <para>In this section I am going to describe some smaller syscalls that
 | |
| 	are worth mentioning because their implementation is not obvious or
 | |
| 	because they are interesting from another point of view.</para>
 | |
| 
 | |
|       <sect3 id="syscall-at">
 | |
| 	<title>*at family of syscalls</title>
 | |
| 
 | |
| 	<para>During the development of the &linux; 2.6.16 kernel, the *at
 | |
| 	  syscalls were added.  Those syscalls (<function>openat</function>,
 | |
| 	  for example) work exactly like their at-less counterparts, with the
 | |
| 	  slight exception of the <varname>dirfd</varname> parameter.  This
 | |
| 	  parameter changes where the file on which the syscall is to be
 | |
| 	  performed is looked up.  When the <varname>filename</varname>
 | |
| 	  parameter is absolute, <varname>dirfd</varname> is ignored, but when
 | |
| 	  the path to the file is relative, it comes into play.  The
 | |
| 	  <varname>dirfd</varname> parameter names the directory relative to
 | |
| 	  which the relative pathname is resolved; it is either a file
 | |
| 	  descriptor of some directory or <literal>AT_FDCWD</literal>, which
 | |
| 	  stands for the current working directory.  So, for example, the
 | |
| 	  <function>openat</function> syscall can behave like this:</para>
 | |
| 
 | |
| 	<programlisting>file descriptor 123 = /tmp/foo/, current working directory = /tmp/ 
 | |
| 
 | |
| openat(123, "/tmp/bah", flags, mode)	/* opens /tmp/bah */
 | |
| openat(123, "bah", flags, mode)		/* opens /tmp/foo/bah */
 | |
| openat(AT_FDCWD, "bah", flags, mode)	/* opens /tmp/bah */
 | |
| openat(stdio, "bah", flags, mode)	/* returns error because stdio is not a directory */</programlisting>
 | |
| 
 | |
| 	<para>This infrastructure is necessary to avoid races when opening
 | |
| 	  files outside the working directory.  Imagine that a process consists
 | |
| 	  of two threads, thread A and thread B.  Thread A
 | |
| 	  issues <literal>open("/tmp/foo/bah", flags, mode)</literal> and
 | |
| 	  before returning it gets preempted and thread B runs.
 | |
| 	  Thread B does not care about the needs of thread A and
 | |
| 	  renames or removes <filename>/tmp/foo/</filename>.  We have a race.
 | |
| 	  To avoid this, we can open <filename>/tmp/foo</filename> and use it
 | |
| 	  as the <varname>dirfd</varname> for the <function>openat</function>
 | |
| 	  syscall.  This also enables the user to implement per-thread
 | |
| 	  working directories.</para>
 | |
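<para>From the application side, the race-free pattern described above looks roughly like the following sketch.  It uses only the &posix; <function>open</function> and <function>openat</function> calls; the helper name <function>open_in_dir</function> and the chosen flags and mode are assumptions made for the example.</para>

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Create (or open) the file "name" inside the directory "dir".  The
 * directory is resolved once; later renames of "dir" do not change
 * which directory the descriptor refers to.  Returns a file
 * descriptor, or -1 on error.
 */
int
open_in_dir(const char *dir, const char *name)
{
	int dirfd, fd;

	assert(dir != NULL && name != NULL);
	dirfd = open(dir, O_RDONLY | O_DIRECTORY);
	if (dirfd == -1)
		return (-1);	/* missing, or not a directory */
	/* Resolved relative to dirfd, not to the working directory. */
	fd = openat(dirfd, name, O_CREAT | O_RDWR, 0600);
	close(dirfd);
	return (fd);
}
```

<para>A long-lived program would normally keep the directory descriptor open and reuse it for several *at calls instead of reopening the directory each time.</para>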
| 
 | |
| 	<para>The &linux; family of *at syscalls contains:
 | |
| 	  <function>linux_openat</function>,
 | |
| 	  <function>linux_mkdirat</function>,
 | |
| 	  <function>linux_mknodat</function>,
 | |
| 	  <function>linux_fchownat</function>,
 | |
| 	  <function>linux_futimesat</function>,
 | |
| 	  <function>linux_fstatat64</function>,
 | |
| 	  <function>linux_unlinkat</function>,
 | |
| 	  <function>linux_renameat</function>,
 | |
| 	  <function>linux_linkat</function>,
 | |
| 	  <function>linux_symlinkat</function>,
 | |
| 	  <function>linux_readlinkat</function>,
 | |
| 	  <function>linux_fchmodat</function> and
 | |
| 	  <function>linux_faccessat</function>.  All these are implemented
 | |
| 	  using the modified &man.namei.9; routine and a simple
 | |
| 	  wrapping layer.</para>
 | |
| 
 | |
| 	<sect4 id="implementation">
 | |
| 	  <title>Implementation</title>
 | |
| 
 | |
| 	  <para>The implementation is done by altering the
 | |
| 	     &man.namei.9; routine (described above) to take an
 | |
| 	     additional parameter <varname>dirfd</varname> in its
 | |
| 	     <literal>nameidata</literal> structure, which specifies the
 | |
| 	     starting point of the pathname lookup instead of using the
 | |
| 	     current working directory every time.  The resolution of
 | |
| 	     <varname>dirfd</varname> from a file descriptor number to a
 | |
| 	     vnode is done in the native *at syscalls.  When
 | |
| 	     <varname>dirfd</varname> is <literal>AT_FDCWD</literal>, the
 | |
| 	     <varname>dvp</varname> entry in the <literal>nameidata</literal>
 | |
| 	     structure is <literal>NULL</literal>, but when
 | |
| 	     <varname>dirfd</varname> is a different number, we obtain a
 | |
| 	     file for this file descriptor, check whether this file
 | |
| 	     is valid, and if there is a vnode attached to it, we get the
 | |
| 	     vnode.  Then we check that this vnode is a directory.  In the
 | |
| 	     actual &man.namei.9; routine we simply substitute the
 | |
| 	     <varname>dvp</varname> vnode for the <varname>dp</varname>
 | |
| 	     variable, which determines the starting point.  The
 | |
| 	     &man.namei.9; routine is not used directly but via a chain of
 | |
| 	     different functions on various levels.  For example, the
 | |
| 	     <function>openat</function> call goes like this:</para>
 | |
| 
 | |
| 	  <programlisting>openat() --> kern_openat() --> vn_open() --> namei()</programlisting>
 | |
| 
 | |
| 	  <para>For this reason <function>kern_open</function> and
 | |
| 	    <function>vn_open</function> must be altered to incorporate
 | |
| 	    the additional <varname>dirfd</varname> parameter.  No compat
 | |
| 	    layer is created for those because there are not many users of
 | |
| 	    them and the users can be easily converted.  This general
 | |
| 	    implementation enables &os; to implement its own *at syscalls.
 | |
| 	    This is being discussed right now.</para>
 | |
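<para>The choice of the lookup starting point can be modelled without any kernel structures.  In the sketch below, directories are represented by path strings instead of vnodes; <function>lookup_start</function> and <literal>SKETCH_AT_FDCWD</literal> are hypothetical names, and a <literal>NULL</literal> <varname>dirfd_path</varname> stands for a descriptor that does not refer to a directory.</para>

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder for AT_FDCWD; the real constant comes from fcntl.h. */
#define SKETCH_AT_FDCWD	(-100)

/*
 * Model of the starting-point choice: absolute paths ignore dirfd,
 * AT_FDCWD selects the working directory, and any other descriptor
 * selects the directory it refers to (passed in here as dirfd_path).
 */
const char *
lookup_start(const char *path, int dirfd, const char *dirfd_path,
    const char *cwd)
{
	assert(path != NULL && cwd != NULL);
	if (path[0] == '/')
		return ("/");
	if (dirfd == SKETCH_AT_FDCWD)
		return (cwd);
	return (dirfd_path);	/* NULL: descriptor is not a directory */
}
```

<para>With <varname>cwd</varname> set to <filename>/tmp</filename> and descriptor 123 referring to <filename>/tmp/foo</filename>, the relative name <literal>bah</literal> starts at <filename>/tmp/foo</filename> for descriptor 123 and at <filename>/tmp</filename> for <literal>SKETCH_AT_FDCWD</literal>.</para>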
| 	</sect4>
 | |
|       </sect3>
 | |
| 
 | |
|       <sect3 id="ioctl">
 | |
| 	<title>Ioctl</title>
 | |
| 
 | |
| 	<para>The ioctl interface is quite fragile due to its generality.
 | |
| 	  We have to bear in mind that devices differ between &linux; and &os;,
 | |
| 	  so some care must be taken to get ioctl emulation right.  The
 | |
| 	  ioctl handling is implemented in <filename>linux_ioctl.c</filename>,
 | |
| 	  where the <function>linux_ioctl</function> function is defined.  This
 | |
| 	  function simply iterates over sets of ioctl handlers to find a
 | |
| 	  handler that implements a given command.  The ioctl syscall has three
 | |
| 	  parameters: the file descriptor, the command, and an argument.  The
 | |
| 	  command is a 16-bit number, which in theory is divided into the high
 | |
| 	  8 bits determining the class of the ioctl command and the low
 | |
| 	  8 bits, which are the actual command within the given set.
 | |
| 	  The emulation takes advantage of this division.  We implement
 | |
| 	  handlers for each set, like <function>sound_handler</function>
 | |
| 	  or <function>disk_handler</function>.  Each handler has a maximum
 | |
| 	  command and a minimum command defined, which are used for determining
 | |
| 	  which handler is used.  There are slight problems with this approach
 | |
| 	  because &linux; does not use the set division consistently, so
 | |
| 	  sometimes ioctls for a different set are inside a set they should
 | |
| 	  not belong to (SCSI generic ioctls inside the cdrom set, etc.).  &os;
 | |
| 	  currently does not implement many &linux; ioctls (compared to
 | |
| 	  NetBSD, for example), but the plan is to port those from NetBSD.
 | |
| 	  The trend is to use &linux; ioctls even in the native &os; drivers
 | |
| 	  because it makes porting applications easy.</para>
 | |
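<para>The range-based dispatch can be sketched as follows.  This is an illustration only: the table layout, the command ranges, the return values and the bodies of the two handlers are made up, and only the idea of matching a command against each handler's minimum and maximum comes from the description above.</para>

```c
#include <assert.h>
#include <stddef.h>

typedef int ioctl_fn(unsigned int cmd);

/* Each handler covers the commands in the inclusive range [low, high]. */
struct sketch_ioctl_handler {
	ioctl_fn *func;
	unsigned int low, high;
};

/* Hypothetical handlers; they only tag the low 8 bits of the command. */
int sound_handler(unsigned int cmd) { return (100 + (int)(cmd & 0xff)); }
int disk_handler(unsigned int cmd) { return (200 + (int)(cmd & 0xff)); }

/* Hypothetical ranges: the high 8 bits select the set. */
const struct sketch_ioctl_handler handlers[] = {
	{ sound_handler, 0x5000, 0x50ff },
	{ disk_handler, 0x1200, 0x12ff },
};

/* Walk the handler table and dispatch to the first matching range. */
int
sketch_ioctl(unsigned int cmd)
{
	size_t i;

	for (i = 0; i < sizeof(handlers) / sizeof(handlers[0]); i++) {
		assert(handlers[i].low <= handlers[i].high);
		if (cmd >= handlers[i].low && cmd <= handlers[i].high)
			return (handlers[i].func(cmd));
	}
	return (-1);	/* no handler implements this command */
}
```

<para>Because matching is done on ranges rather than on exact class bytes, a handler can also claim stray commands that &linux; placed in the wrong set.</para>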
|       </sect3>
 | |
| 
 | |
|       <sect3 id="debugging">
 | |
| 	<title>Debugging</title>
 | |
| 
 | |
| 	<para>Every syscall should be debuggable.  For this purpose we
 | |
| 	  introduce a small infrastructure.  We have the ldebug facility, which
 | |
| 	  tells whether a given syscall should be debugged (settable via a
 | |
| 	  sysctl).  For printing we have the LMSG and ARGS macros.  Those are
 | |
| 	  used for altering a printable string to produce uniform debugging
 | |
| 	  messages.</para>
 | |
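<para>A userspace approximation of this infrastructure might look like the sketch below.  The macro names are taken from the text, but their bodies, the <varname>debug_enabled</varname> table, the <varname>current_pid</varname> variable and the <function>trace_open</function> helper are assumptions made for illustration and are not the definitions used in the emulation layer.</para>

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-ins for the per-syscall switch and the process id. */
enum { SYS_open, SYS_close };
int debug_enabled[2] = { 1, 0 };
long current_pid = 42;

/* Is debugging switched on for the named syscall? */
#define ldebug(name)	(debug_enabled[SYS_ ## name])
/* Both macros expand to a format string plus the leading pid argument. */
#define ARGS(nm, fmt)	"linux(%ld): " #nm "(" fmt ")\n", current_pid
#define LMSG(fmt)	"linux(%ld): " fmt "\n", current_pid

/* Format the entry trace of an emulated open() into buf, if enabled. */
int
trace_open(char *buf, size_t len, const char *path, int flags)
{
	assert(buf != NULL && len > 0);
	buf[0] = '\0';
	if (ldebug(open))
		return (snprintf(buf, len, ARGS(open, "%s, 0x%x"), path,
		    (unsigned int)flags));
	return (0);
}
```

<para>Keeping the prefix inside the macros means that every message carries the same tag and process id without each call site repeating them.</para>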
|       </sect3>
 | |
|     </sect2>    
 | |
|   </sect1>
 | |
| 
 | |
|   <sect1 id="conclusion">
 | |
|     <title>Conclusion</title>
 | |
| 
 | |
|     <sect2 id="results">
 | |
|       <title>Results</title>
 | |
| 
 | |
|       <para>As of April 2007 the &linux; emulation layer is capable of
 | |
| 	emulating the &linux; 2.6.16 kernel quite well.  The remaining
 | |
| 	problems concern futexes, the unfinished *at family of syscalls,
 | |
| 	problematic signal delivery, missing <function>epoll</function> and
 | |
| 	<function>inotify</function>, and probably some bugs we have not
 | |
| 	discovered yet.  Despite this we are capable of running basically all
 | |
| 	the &linux; programs included in the &os; Ports Collection with
 | |
| 	Fedora Core 4 at 2.6.16, and there are some rudimentary
 | |
| 	reports of success with Fedora Core 6 at 2.6.16.  The
 | |
| 	Fedora Core 6 linux_base was recently committed, enabling
 | |
| 	some further testing of the emulation layer and giving us some more
 | |
| 	hints about where we should put our effort in implementing missing
 | |
| 	features.</para>
 | |
| 
 | |
|       <para>We are able to run the most used applications like
 | |
| 	<filename role="package">www/linux-firefox</filename>,
 | |
| 	<filename role="package">www/linux-opera</filename>,
 | |
| 	<filename role="package">net-im/skype</filename> and some games from
 | |
| 	the Ports Collection.  Some of the programs exhibit bad behaviour
 | |
| 	under 2.6 emulation, but this is currently under investigation and
 | |
| 	hopefully will be fixed soon.  The only big application that is
 | |
| 	known not to work is the &linux; &java; Development Kit, because
 | |
| 	it requires the <function>epoll</function>
 | |
| 	facility, which is not directly related to the &linux;
 | |
| 	2.6 kernel.</para>
 | |
| 
 | |
|       <para>We hope to enable 2.6.16 emulation by default some time after
 | |
| 	&os; 7.0 is released, at least to expose the 2.6 emulation parts to
 | |
| 	some wider testing.  Once this is done, we can switch to the
 | |
| 	Fedora Core 6 linux_base, which is the ultimate plan.</para>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 id="future-work">
 | |
|       <title>Future work</title>
 | |
| 
 | |
|       <para>Future work should focus on fixing the remaining issues with
 | |
| 	futexes, implementing the rest of the *at family of syscalls, fixing
 | |
| 	the signal delivery, and possibly implementing the
 | |
| 	<function>epoll</function> and <function>inotify</function>
 | |
| 	facilities.</para>
 | |
| 
 | |
|       <para>We hope to be able to run the most important programs flawlessly
 | |
| 	soon, so that we will be able to switch to the 2.6 emulation by
 | |
| 	default and make Fedora Core 6 the default linux_base, because our
 | |
| 	currently used Fedora Core 4 is not supported any
 | |
| 	more.</para>
 | |
| 
 | |
|       <para>Another possible goal is to share our code with NetBSD and
 | |
| 	DragonFly BSD.  NetBSD has some support for 2.6 emulation, but it is
 | |
| 	far from finished and not really tested.  DragonFly BSD has expressed
 | |
| 	some interest in porting the 2.6 improvements.</para>
 | |
| 
 | |
|       <para>Generally, as &linux; develops, we would like to keep up with its
 | |
| 	development by implementing newly added syscalls.  Splice comes to
 | |
| 	mind first.  Some already implemented syscalls are also heavily
 | |
| 	crippled, for example <function>mremap</function> and others.  Some
 | |
| 	performance improvements can also be made, such as finer-grained
 | |
| 	locking.</para>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 id="team">
 | |
|       <title>Team</title>
 | |
| 
 | |
|       <para>I cooperated on this project with (in alphabetical order):</para>
 | |
| 
 | |
|       <itemizedlist>
 | |
| 	<listitem>
 | |
| 	  <para>&a.jhb;</para>
 | |
| 	</listitem>
 | |
| 	<listitem>
 | |
| 	  <para>&a.kib;</para>
 | |
| 	</listitem>
 | |
| 	<listitem>
 | |
| 	  <para>Emmanuel Dreyfus</para>
 | |
| 	</listitem>
 | |
| 	<listitem>
 | |
| 	  <para>Scot Hetzel</para>
 | |
| 	</listitem>
 | |
| 	<listitem>
 | |
| 	  <para>&a.jkim;</para>
 | |
| 	</listitem>
 | |
| 	<listitem>
 | |
| 	  <para>&a.netchild;</para>
 | |
| 	</listitem>
 | |
| 	<listitem>
 | |
| 	  <para>&a.ssouhlal;</para>
 | |
| 	</listitem>
 | |
| 	<listitem>
 | |
| 	  <para>Li Xiao</para>
 | |
| 	</listitem>
 | |
| 	<listitem>
 | |
| 	  <para>&a.davidxu;</para>
 | |
| 	</listitem>
 | |
|       </itemizedlist>
 | |
| 
 | |
|       <para>I would like to thank all those people for their advice, code
 | |
| 	reviews and general support.</para>
 | |
|     </sect2>
 | |
|   </sect1>
 | |
| 
 | |
|   <sect1 id="literatures">
 | |
|     <title>Literature</title>
 | |
| 
 | |
|     <orderedlist>
 | |
|       <listitem>
 | |
| 	<para>Marshall Kirk McKusick, George V. Neville-Neil.  The Design
 | |
| 	  and Implementation of the &os; Operating System.  Addison-Wesley,
 | |
| 	  2005.</para>
 | |
|       </listitem>
 | |
|       <listitem>
 | |
| 	<para><ulink url="http://www.FreeBSD.org"></ulink></para>
 | |
|       </listitem>
 | |
|       <listitem>
 | |
| 	<para><ulink url="http://tldp.org"></ulink></para>
 | |
|       </listitem>
 | |
|       <listitem>
 | |
| 	<para><ulink url="http://www.linux.org"></ulink></para>
 | |
|      </listitem>
 | |
|     </orderedlist>
 | |
|   </sect1>
 | |
| </article>
 | |
| 
 | |
| <!-- 
 | |
|      Local Variables:
 | |
|      mode: sgml
 | |
|      sgml-indent-data: t
 | |
|      sgml-omittag: nil
 | |
|      sgml-always-quote-attributes: t
 | |
|      End:
 | |
| -->
 |