PR: 242495 Submitted by: ultimateninjamaster948@gmail.com Patch by: carlavilla@ Approved by: bcr@ Differential Revision: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=242495
		
			
				
	
	
		
			2545 lines
		
	
	
	
		
			102 KiB
		
	
	
	
		
			XML
		
	
	
	
	
	
			
		
		
	
	
| <?xml version="1.0" encoding="iso-8859-1"?>
 | |
| <!DOCTYPE article PUBLIC "-//FreeBSD//DTD DocBook XML V5.0-Based Extension//EN"
 | |
| 	"http://www.FreeBSD.org/XML/share/xml/freebsd50.dtd">
 | |
| <!-- $FreeBSD$ -->
 | |
| <!-- The FreeBSD Documentation Project -->
 | |
| <article xmlns="http://docbook.org/ns/docbook"
 | |
|   xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0"
 | |
|   xml:lang="en">
 | |
|   <info>
 | |
|     <title>&linux; emulation in &os;</title>
 | |
| 
 | |
|     <author>
 | |
|       <personname>
 | |
| 	<firstname>Roman</firstname>
 | |
| 	<surname>Divacky</surname>
 | |
|       </personname>
 | |
|       <affiliation>
 | |
| 	<address>
 | |
| 	  <email>rdivacky@FreeBSD.org</email>
 | |
| 	</address>
 | |
|       </affiliation>
 | |
|     </author>
 | |
| 
 | |
|     <legalnotice xml:id="trademarks" role="trademarks">
 | |
|       &tm-attrib.adobe;
 | |
|       &tm-attrib.ibm;
 | |
|       &tm-attrib.freebsd;
 | |
|       &tm-attrib.linux;
 | |
|       &tm-attrib.netbsd;
 | |
|       &tm-attrib.realnetworks;
 | |
|       &tm-attrib.oracle;
 | |
|       &tm-attrib.sun;
 | |
|       &tm-attrib.general;
 | |
|     </legalnotice>
 | |
| 
 | |
|     <pubdate>$FreeBSD$</pubdate>
 | |
| 
 | |
|     <releaseinfo>$FreeBSD$</releaseinfo>
 | |
| 
 | |
|     <abstract>
 | |
|       <para>This master's thesis deals with updating the &linux;
 | |
| 	emulation layer (the so-called
 | |
| 	<firstterm>Linuxulator</firstterm>).  The task was to update
 | |
| 	the layer to match the functionality of &linux; 2.6. As a
 | |
| 	reference implementation, the &linux; 2.6.16 kernel was
 | |
| 	chosen.  The concept is loosely based on the NetBSD
 | |
| 	implementation.  Most of the work was done in the summer of
 | |
| 	2006 as a part of the Google Summer of Code student program.
 | |
| 	The focus was on bringing the <firstterm>NPTL</firstterm> (new
 | |
| 	&posix; thread library) support into the emulation layer,
 | |
| 	including <firstterm>TLS</firstterm> (thread local storage),
 | |
| 	<firstterm>futexes</firstterm> (fast user space mutexes),
 | |
| 	<firstterm>PID mangling</firstterm>, and some other minor
 | |
| 	things.  Many small problems were identified and fixed in the
 | |
| 	process.  My work was integrated into the main &os; source
 | |
| 	repository and will be shipped in the upcoming 7.0R release.
 | |
| 	We, the emulation development team, are working on making the
 | |
| 	&linux; 2.6 emulation the default emulation layer in
 | |
| 	&os;.</para>
 | |
|     </abstract>
 | |
|   </info>
 | |
| 
 | |
|   <sect1 xml:id="intro">
 | |
|     <title>Introduction</title>
 | |
| 
 | |
|     <para>In the last few years the open source &unix; based operating
 | |
|       systems started to be widely deployed on server and client
 | |
|       machines.  Among these operating systems I would like to point
 | |
|       out two: &os;, for its BSD heritage, time proven code base and
 | |
|       many interesting features and &linux; for its wide user base,
 | |
|       enthusiastic open developer community and support from large
 | |
|       companies.  &os; tends to be used on server class machines
 | |
|       serving heavy duty networking tasks with less usage on desktop
 | |
|       class machines for ordinary users.  &linux; has the same
 | |
|       usage on servers, but it is used much more by home based users.
 | |
|       This leads to a situation where there are many binary only
 | |
|       programs available for &linux; that lack support for
 | |
|       &os;.</para>
 | |
| 
 | |
|     <para>Naturally, a need for the ability to run &linux; binaries on
 | |
|       a &os; system arises and this is what this thesis deals with:
 | |
|       the emulation of the &linux; kernel in the &os; operating
 | |
|       system.</para>
 | |
| 
 | |
|     <para>During the Summer of 2006 Google Inc. sponsored a project
 | |
|       which focused on extending the &linux; emulation layer (the
 | |
|       so-called Linuxulator) in &os; to include &linux; 2.6 facilities.
 | |
|       This thesis is written as a part of this project.</para>
 | |
|   </sect1>
 | |
| 
 | |
|   <sect1 xml:id="inside">
 | |
|     <title>A look inside…</title>
 | |
| 
 | |
|     <para>In this section we describe each operating system in
 | |
|       question: how it deals with syscalls, trapframes, and other
 | |
|       low-level details.  We also describe the way they
 | |
|       understand common &unix; primitives like what a PID is, what a
 | |
|       thread is, etc.  In the third subsection we talk about how
 | |
|       &unix; on &unix; emulation could be done in general.</para>
 | |
| 
 | |
|     <sect2 xml:id="what-is-unix">
 | |
|       <title>What is &unix;</title>
 | |
| 
 | |
|       <para>&unix; is an operating system with a long history that has
 | |
| 	influenced almost every other operating system currently in
 | |
| 	use.  Starting in the 1960s, its development continues to this
 | |
| 	day (although in different projects).  &unix; development soon
 | |
| 	forked into two main branches: the BSD and System III/V families.
 | |
| 	They influenced each other while growing a common &unix;
 | |
| 	standard.  Among the contributions originating in BSD we can
 | |
| 	name virtual memory, TCP/IP networking, FFS, and many others.
 | |
| 	The System V branch contributed SysV interprocess
 | |
| 	communication primitives, copy-on-write, etc.  &unix; itself
 | |
| 	does not exist any more, but its ideas have been used by many
 | |
| 	other operating systems worldwide, thus forming the so-called
 | |
| 	&unix;-like operating systems.  These days the most
 | |
| 	influential ones are &linux;, Solaris, and possibly (to some
 | |
| 	extent) &os;.  There are in-company &unix; derivatives (AIX,
 | |
| 	HP-UX etc.), but these have been more and more migrated to the
 | |
| 	aforementioned systems.  Let us summarize typical &unix;
 | |
| 	characteristics.</para>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 xml:id="tech-details">
 | |
|       <title>Technical details</title>
 | |
| 
 | |
|       <para>Every running program constitutes a process that
 | |
| 	represents a state of the computation.  A running process is
 | |
| 	divided between kernel-space and user-space.  Some operations
 | |
| 	can be done only from kernel space (dealing with hardware
 | |
| 	etc.), but the process should spend most of its lifetime in
 | |
| 	the user space.  The kernel is where the management of the
 | |
| 	processes, hardware, and low-level details takes place.  The
 | |
| 	kernel provides a standard unified &unix; API to the user
 | |
| 	space.  The most important ones are covered below.</para>
 | |
| 
 | |
|       <sect3 xml:id="kern-proc-comm">
 | |
| 	<title>Communication between kernel and user space
 | |
| 	  process</title>
 | |
| 
 | |
| 	<para>The common &unix; API defines a syscall as a way to issue
 | |
| 	  commands from a user space process to the kernel.  The most
 | |
| 	  common implementation uses either an interrupt or a
 | |
| 	  specialized instruction (think of
 | |
| 	  <literal>SYSENTER</literal>/<literal>SYSCALL</literal>
 | |
| 	  instructions for ia32).  Syscalls are defined by a number.
 | |
| 	  For example in &os;, the syscall number 85 is the
 | |
| 	  &man.swapon.2; syscall and the syscall number 132 is
 | |
| 	  &man.mkfifo.2;.  Some syscalls need parameters, which are
 | |
| 	  passed from the user-space to the kernel-space in various
 | |
| 	  ways (implementation dependent).  Syscalls are
 | |
| 	  synchronous.</para>
 | |
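
	<para>As a minimal sketch (not taken from the &os; or &linux;
	  sources), the following C fragment issues a syscall by its
	  number through the &man.syscall.2; library function instead
	  of the usual wrapper.  The helper name
	  <literal>raw_getpid</literal> is made up for this example,
	  and <literal>_DEFAULT_SOURCE</literal> is defined because
	  glibc only declares <literal>syscall()</literal> when such a
	  feature macro is in effect:</para>

	<programlisting>#define _DEFAULT_SOURCE
#include &lt;sys/syscall.h&gt;
#include &lt;unistd.h&gt;

/*
 * Ask the kernel for our PID by syscall number.  SYS_getpid is the
 * symbolic name of that number; the call returns only after the
 * kernel has finished, which is what "synchronous" means here.
 */
long
raw_getpid(void)
{
	return ((long)syscall(SYS_getpid));
}</programlisting>

	<para>Calling <literal>raw_getpid()</literal> should yield the
	  same value as &man.getpid.2;, since both paths end up in the
	  same kernel handler.</para>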
| 
 | |
| 	<para>Another possible way to communicate is by using a
 | |
| 	  <firstterm>trap</firstterm>.  Traps occur asynchronously
 | |
| 	  after some event occurs (division by zero, page fault etc.).
 | |
| 	  A trap can be transparent for a process (page fault) or can
 | |
| 	  result in a reaction like sending a
 | |
| 	  <firstterm>signal</firstterm> (division by zero).</para>
 | |
|       </sect3>
 | |
| 
 | |
|       <sect3 xml:id="proc-proc-comm">
 | |
| 	<title>Communication between processes</title>
 | |
| 
 | |
| 	<para>There are other APIs (System V IPC, shared memory etc.)
 | |
| 	  but the single most important API is signal.  Signals are
 | |
| 	  sent by processes or by the kernel and received by
 | |
| 	  processes.  Some signals can be ignored or handled by a user
 | |
| 	  supplied routine, some result in a predefined action that
 | |
| 	  cannot be altered or ignored.</para>
 | |
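
	<para>The two kinds of signals mentioned above can be
	  illustrated with a short, hedged example.  It assumes a
	  single-threaded &posix; program; the names
	  <literal>install_handler</literal>,
	  <literal>deliver_usr1</literal> and
	  <literal>can_catch_sigkill</literal> are invented for this
	  sketch:</para>

	<programlisting>#include &lt;signal.h&gt;
#include &lt;string.h&gt;

static volatile sig_atomic_t got_signal;

/* User supplied routine: only records that the signal arrived. */
static void
on_signal(int signo)
{
	(void)signo;
	got_signal = 1;
}

/* Try to register on_signal() for signo; returns what sigaction() returns. */
static int
install_handler(int signo)
{
	struct sigaction sa;

	memset(&amp;sa, 0, sizeof(sa));
	sa.sa_handler = on_signal;
	sigemptyset(&amp;sa.sa_mask);
	return (sigaction(signo, &amp;sa, NULL));
}

/* SIGUSR1 can be handled: send it to ourselves and check the handler ran. */
int
deliver_usr1(void)
{
	if (install_handler(SIGUSR1) != 0)
		return (0);
	got_signal = 0;
	if (raise(SIGUSR1) != 0)
		return (0);
	return (got_signal == 1);
}

/* SIGKILL has a predefined action, so installing a handler must fail. */
int
can_catch_sigkill(void)
{
	return (install_handler(SIGKILL) == 0);
}</programlisting>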
|       </sect3>
 | |
| 
 | |
|       <sect3 xml:id="proc-mgmt">
 | |
| 	<title>Process management</title>
 | |
| 
 | |
| 	<para>The kernel instantiates the first process in the system (the
 | |
| 	  so-called init).  Every running process can create its
 | |
| 	  identical copy using the &man.fork.2; syscall.  Some
 | |
| 	  slightly modified versions of this syscall were introduced
 | |
| 	  but the basic semantics are the same.  Every running process
 | |
| 	  can morph into some other process using the &man.exec.3;
 | |
| 	  syscall.  Some modifications of this syscall were introduced
 | |
| 	  but all serve the same basic purpose.  Processes end their
 | |
| 	  lives by calling the &man.exit.2; syscall.  Every process is
 | |
| 	  identified by a unique number called PID.  Every process has
 | |
| 	  a defined parent (identified by its PID).</para>
 | |
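
	<para>The life cycle described above can be sketched in a few
	  lines of C.  This is an illustration under the assumption
	  of a plain &posix; environment; the function name
	  <literal>fork_and_reap</literal> is made up, and the exec
	  step is left out to keep the example short:</para>

	<programlisting>#include &lt;sys/types.h&gt;
#include &lt;sys/wait.h&gt;
#include &lt;unistd.h&gt;

/*
 * Fork a child that exits immediately with the given status and return
 * the status collected by the parent, or -1 on error.
 */
int
fork_and_reap(int status)
{
	pid_t pid;
	int wstatus;

	pid = fork();
	if (pid == -1)
		return (-1);
	if (pid == 0)
		_exit(status);	/* child: a copy of the parent, ends here */

	/* parent: fork() returned the PID that identifies the child */
	if (waitpid(pid, &amp;wstatus, 0) != pid)
		return (-1);
	if (!WIFEXITED(wstatus))
		return (-1);
	return (WEXITSTATUS(wstatus));
}</programlisting>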
|       </sect3>
 | |
| 
 | |
|       <sect3 xml:id="thread-mgmt">
 | |
| 	<title>Thread management</title>
 | |
| 
 | |
| 	<para>Traditional &unix; does not define any API or
 | |
| 	  implementation for threading, while &posix; defines a
 | |
| 	  threading API but leaves the implementation undefined.
 | |
| 	  Traditionally there were two ways of implementing threads:
 | |
| 	  handling them as separate processes (1:1 threading), or
 | |
| 	  enveloping the whole thread group in one process and managing
 | |
| 	  the threading in userspace (1:N threading).  Comparing the main
 | |
| 	  features of each approach:</para>
 | |
| 
 | |
| 	<para>1:1 threading</para>
 | |
| 
 | |
| 	<itemizedlist>
 | |
| 	  <listitem>
 | |
| 	    <para>- heavyweight threads</para>
 | |
| 	  </listitem>
 | |
| 	  <listitem>
 | |
| 	    <para>- the scheduling cannot be altered by the user
 | |
| 	      (slightly mitigated by the &posix; API)</para>
 | |
| 	  </listitem>
 | |
| 	  <listitem>
 | |
| 	    <para>+ no syscall wrapping necessary</para>
 | |
| 	  </listitem>
 | |
| 	  <listitem>
 | |
| 	    <para>+ can utilize multiple CPUs</para>
 | |
| 	  </listitem>
 | |
| 	</itemizedlist>
 | |
| 
 | |
| 	<para>1:N threading</para>
 | |
| 
 | |
| 	<itemizedlist>
 | |
| 	  <listitem>
 | |
| 	    <para>+ lightweight threads</para>
 | |
| 	  </listitem>
 | |
| 	  <listitem>
 | |
| 	    <para>+ scheduling can be easily altered by the
 | |
| 	      user</para>
 | |
| 	  </listitem>
 | |
| 	  <listitem>
 | |
| 	    <para>- syscalls must be wrapped</para>
 | |
| 	  </listitem>
 | |
| 	  <listitem>
 | |
| 	    <para>- cannot utilize more than one CPU</para>
 | |
| 	  </listitem>
 | |
| 	</itemizedlist>
 | |
|       </sect3>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 xml:id="what-is-freebsd">
 | |
|       <title>What is &os;?</title>
 | |
| 
 | |
|       <para>The &os; project is one of the oldest open source
 | |
| 	operating systems currently available for daily use.  It is a
 | |
| 	direct descendant of the genuine &unix; so it could be claimed
 | |
| 	that it is a true &unix; although licensing issues do not
 | |
| 	permit that.  The start of the project dates back to the early
 | |
| 	1990s when a crew of fellow BSD users patched the 386BSD
 | |
| 	operating system.  Based on this patchkit a new operating
 | |
| 	system arose named &os; for its liberal license.  Another
 | |
| 	group created the NetBSD operating system with different goals
 | |
| 	in mind.  We will focus on &os;.</para>
 | |
| 
 | |
|       <para>&os; is a modern &unix;-based operating system with all
 | |
| 	the features of &unix;.  Preemptive multitasking, multiuser
 | |
| 	facilities, TCP/IP networking, memory protection, symmetric
 | |
| 	multiprocessing support, virtual memory with merged VM and
 | |
| 	buffer cache, they are all there.  One of the interesting and
 | |
| 	extremely useful features is the ability to emulate other
 | |
| 	&unix;-like operating systems.  As of December 2006 and
 | |
| 	7-CURRENT development, the following emulation functionalities
 | |
| 	are supported:</para>
 | |
| 
 | |
|       <itemizedlist>
 | |
| 	<listitem>
 | |
| 	  <para>&os;/i386 emulation on &os;/amd64</para>
 | |
| 	</listitem>
 | |
| 	<listitem>
 | |
| 	  <para>&os;/i386 emulation on &os;/ia64</para>
 | |
| 	</listitem>
 | |
| 	<listitem>
 | |
| 	  <para>&linux;-emulation of &linux; operating system on
 | |
| 	    &os;</para>
 | |
| 	</listitem>
 | |
| 	<listitem>
 | |
| 	  <para>NDIS-emulation of Windows networking drivers
 | |
| 	    interface</para>
 | |
| 	</listitem>
 | |
| 	<listitem>
 | |
| 	  <para>NetBSD-emulation of NetBSD operating system</para>
 | |
| 	</listitem>
 | |
| 	<listitem>
 | |
| 	  <para>PECoff-support for PECoff &os; executables</para>
 | |
| 	</listitem>
 | |
| 	<listitem>
 | |
| 	  <para>SVR4-emulation of System V revision 4 &unix;</para>
 | |
| 	</listitem>
 | |
|       </itemizedlist>
 | |
| 
 | |
|       <para>Actively developed emulations are the &linux; layer and
 | |
| 	various &os;-on-&os; layers.  Others are not expected to work
 | |
| 	properly or to be usable these days.</para>
 | |
| 
 | |
|       <sect3 xml:id="freebsd-tech-details">
 | |
| 	<title>Technical details</title>
 | |
| 
 | |
| 	<para>&os; is a traditional flavor of &unix; in the sense of
 | |
| 	  dividing the run of processes into two halves: kernel space
 | |
| 	  and user space run.  There are two types of process entry to
 | |
| 	  the kernel: a syscall and a trap.  There is only one way to
 | |
| 	  return.  In the subsequent sections we will describe the
 | |
| 	  three gates to/from the kernel.  The whole description
 | |
| 	  applies to the i386 architecture as the Linuxulator only
 | |
| 	  exists there but the concept is similar on other
 | |
| 	  architectures.  The information was taken from [1] and the
 | |
| 	  source code.</para>
 | |
| 
 | |
| 	<sect4 xml:id="freebsd-sys-entries">
 | |
| 	  <title>System entries</title>
 | |
| 
 | |
| 	  <para>&os; has an abstraction called an execution class
 | |
| 	    loader, which is a wedge into the &man.execve.2; syscall.
 | |
| 	    This employs a structure <literal>sysentvec</literal>,
 | |
| 	    which describes an executable ABI.  It contains things
 | |
| 	    like an errno translation table, a signal translation table, and
 | |
| 	    various functions to serve syscall needs (stack fixup,
 | |
| 	    coredumping, etc.).  Every ABI the &os; kernel wants to
 | |
| 	    support must define this structure, as it is used later in
 | |
| 	    the syscall processing code and at some other places.
 | |
| 	    System entries are handled by trap handlers, where we can
 | |
| 	    access both the kernel-space and the user-space at
 | |
| 	    once.</para>
 | |
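
	  <para>To make the idea more concrete, here is a hypothetical
	    and heavily simplified model of such a per-ABI descriptor.
	    It is not the real <literal>sysentvec</literal>: the type
	    <literal>struct abi_desc</literal>, the function
	    <literal>abi_translate_errno</literal> and the error
	    numbers in the table are all invented for this
	    sketch:</para>

	  <programlisting>#include &lt;stddef.h&gt;

/* One descriptor per supported ABI (invented layout). */
struct abi_desc {
	const char	*name;
	size_t		 errno_count;
	const int	*errno_table;	/* native errno -&gt; ABI errno */
};

/* Invented mapping of native errors 0..3 to a foreign ABI's numbers. */
static const int foreign_errno[] = { 0, 1, 2, 35 };

const struct abi_desc foreign_abi = {
	"foreign",
	sizeof(foreign_errno) / sizeof(foreign_errno[0]),
	foreign_errno
};

/* Translate a native error number; unknown values pass through unchanged. */
int
abi_translate_errno(const struct abi_desc *abi, int native)
{
	if (native &lt; 0 || (size_t)native &gt;= abi-&gt;errno_count)
		return (native);
	return (abi-&gt;errno_table[native]);
}</programlisting>

	  <para>The syscall return path of an emulated process would
	    consult such a table before handing an error back to
	    userspace, so that the binary sees the numbers it was
	    compiled against.</para>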
| 	</sect4>
 | |
| 
 | |
| 	<sect4 xml:id="freebsd-syscalls">
 | |
| 	  <title>Syscalls</title>
 | |
| 
 | |
| 	  <para>Syscalls on &os; are issued by executing interrupt
 | |
| 	    <literal>0x80</literal> with register
 | |
| 	    <varname>%eax</varname> set to a desired syscall number
 | |
| 	    with arguments passed on the stack.</para>
 | |
| 
 | |
| 	  <para>When a process issues an interrupt
 | |
| 	    <literal>0x80</literal>, the <literal>int0x80</literal>
 | |
| 	    syscall trap handler is invoked (defined in
 | |
| 	    <filename>sys/i386/i386/exception.s</filename>), which
 | |
| 	    prepares arguments (i.e. copies them onto the stack) for
 | |
| 	    a call to a C function &man.syscall.2; (defined in
 | |
| 	    <filename>sys/i386/i386/trap.c</filename>), which
 | |
| 	    processes the passed in trapframe.  The processing
 | |
| 	    consists of preparing the syscall (depending on the
 | |
| 	    <literal>sysvec</literal> entry), determining if the
 | |
| 	    syscall is a 32-bit or a 64-bit one (which changes the size of the
 | |
| 	    parameters), then the parameters are copied, including the
 | |
| 	    syscall.  Next, the actual syscall function is executed
 | |
| 	    with processing of the return code (special cases for
 | |
| 	    <literal>ERESTART</literal> and
 | |
| 	    <literal>EJUSTRETURN</literal> errors).  Finally a
 | |
| 	    <literal>userret()</literal> is scheduled, switching the
 | |
| 	    process back to userspace.  The parameters to the
 | |
| 	    actual syscall handler are passed in the form of
 | |
| 	    <literal>struct thread *td</literal>, <literal>struct
 | |
| 	      syscall_args *</literal> arguments where the second
 | |
| 	    parameter is a pointer to the copied in structure of
 | |
| 	    parameters.</para>
 | |
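
	  <para>The handler convention described above can be modelled
	    in userland as follows.  This is only a sketch: the
	    <literal>struct thread</literal> shown here is a toy
	    stand-in for the kernel structure, and
	    <literal>sys_getsum</literal>,
	    <literal>sysent_model</literal> and
	    <literal>dispatch</literal> are hypothetical names that do
	    not exist in &os;:</para>

	  <programlisting>#include &lt;stddef.h&gt;

/* Toy stand-in for the kernel's per-thread structure. */
struct thread {
	long	td_retval;	/* value handed back to userspace */
};

/* Copied-in argument structure of one invented syscall. */
struct getsum_args {
	int	a;
	int	b;
};

typedef int syscall_fn(struct thread *, void *);

/* Handler: thread pointer first, pointer to the copied-in args second. */
static int
sys_getsum(struct thread *td, void *v)
{
	struct getsum_args *uap = v;

	td-&gt;td_retval = uap-&gt;a + uap-&gt;b;
	return (0);		/* 0 is success, anything else an errno */
}

/* Table indexed by syscall number, one entry in this model. */
static syscall_fn *const sysent_model[] = { sys_getsum };

/* Look the handler up by number and run it; -1 flags a bad number. */
int
dispatch(struct thread *td, size_t code, void *args)
{
	if (code &gt;= sizeof(sysent_model) / sizeof(sysent_model[0]))
		return (-1);
	return (sysent_model[code](td, args));
}</programlisting>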
| 	</sect4>
 | |
| 
 | |
| 	<sect4 xml:id="freebsd-traps">
 | |
| 	  <title>Traps</title>
 | |
| 
 | |
| 	  <para>Handling of traps in &os; is similar to the handling
 | |
| 	    of syscalls.  Whenever a trap occurs, an assembler handler
 | |
| 	    is called.  It is chosen between alltraps, alltraps with
 | |
| 	    regs pushed or calltrap depending on the type of the trap.
 | |
| 	    This handler prepares arguments for a call to a C function
 | |
| 	    <literal>trap()</literal> (defined in
 | |
| 	    <filename>sys/i386/i386/trap.c</filename>), which then
 | |
| 	    processes the occurred trap.  After the processing it
 | |
| 	    might send a signal to the process and/or exit to userland
 | |
| 	    using <literal>userret()</literal>.</para>
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 xml:id="freebsd-exits">
 | |
| 	  <title>Exits</title>
 | |
| 
 | |
| 	  <para>Exits from kernel to userspace happen using the
 | |
| 	    assembler routine <literal>doreti</literal> regardless of
 | |
| 	    whether the kernel was entered via a trap or via a
 | |
| 	    syscall.  This restores the program status from the stack
 | |
| 	    and returns to the userspace.</para>
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 xml:id="freebsd-unix-primitives">
 | |
| 	  <title>&unix; primitives</title>
 | |
| 
 | |
| 	  <para>The &os; operating system adheres to the traditional
 | |
| 	    &unix; scheme, where every process has a unique
 | |
| 	    identification number, the so-called
 | |
| 	    <firstterm>PID</firstterm> (Process ID).  PID numbers are
 | |
| 	    allocated either linearly or randomly ranging from
 | |
| 	    <literal>0</literal> to <literal>PID_MAX</literal>.  The
 | |
| 	    allocation of PID numbers is done using linear searching
 | |
| 	    of PID space.  Every thread in a process receives the same
 | |
| 	    PID number as a result of the &man.getpid.2; call.</para>
 | |
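
	  <para>The last statement is easy to check from userland.
	    The following sketch assumes a &posix; threads
	    implementation (link with <literal>-pthread</literal>
	    where needed); the helper names
	    <literal>record_pid</literal> and
	    <literal>threads_share_pid</literal> are made up for the
	    example:</para>

	  <programlisting>#include &lt;pthread.h&gt;
#include &lt;sys/types.h&gt;
#include &lt;unistd.h&gt;

/* Thread body: store the PID this thread observes. */
static void *
record_pid(void *arg)
{
	*(pid_t *)arg = getpid();
	return (NULL);
}

/*
 * Start one extra thread and compare the PID it sees with the PID seen
 * by the calling thread.  Returns 1 if they match, 0 if not, -1 on error.
 */
int
threads_share_pid(void)
{
	pthread_t t;
	pid_t seen = -1;

	if (pthread_create(&amp;t, NULL, record_pid, &amp;seen) != 0)
		return (-1);
	if (pthread_join(t, NULL) != 0)
		return (-1);
	return (seen == getpid());
}</programlisting>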
| 
 | |
| 	  <para>There are currently two ways to implement threading in
 | |
| 	    &os;.  The first way is M:N threading followed by the 1:1
 | |
| 	    threading model.  The default library used is M:N
 | |
| 	    threading (<literal>libpthread</literal>) and you can
 | |
| 	    switch at runtime to 1:1 threading
 | |
| 	    (<literal>libthr</literal>).  The plan is to switch to 1:1
 | |
| 	    library by default soon.  Although those two libraries use
 | |
| 	    the same kernel primitives, they are accessed through
 | |
| 	    different APIs.  The M:N library uses the
 | |
| 	    <literal>kse_*</literal> family of syscalls while the 1:1
 | |
| 	    library uses the <literal>thr_*</literal> family of
 | |
| 	    syscalls.  Because of this, there is no general concept of
 | |
| 	    thread ID shared between kernel and userspace.  Of course,
 | |
| 	    both threading libraries implement the pthread thread ID
 | |
| 	    API.  Every kernel thread (as described by <literal>struct
 | |
| 	      thread</literal>) has a <literal>td_tid</literal> identifier but this is not
 | |
| 	    directly accessible from userland and solely serves the
 | |
| 	    kernel's needs.  It is also used by the 1:1 threading library
 | |
| 	    as pthread's thread ID but handling of this is internal to
 | |
| 	    the library and cannot be relied on.</para>
 | |
| 
 | |
| 	  <para>As stated previously there are two implementations of
 | |
| 	    threading in &os;.  The M:N library divides the work
 | |
| 	    between kernel space and userspace.  A thread is an entity
 | |
| 	    that gets scheduled in the kernel but it can represent
 | |
| 	    a varying number of userspace threads.  M userspace threads
 | |
| 	    get mapped to N kernel threads thus saving resources while
 | |
| 	    keeping the ability to exploit multiprocessor parallelism.
 | |
| 	    Further information about the implementation can be
 | |
| 	    obtained from the man page or [1].  The 1:1 library
 | |
| 	    directly maps a userland thread to a kernel thread thus
 | |
| 	    greatly simplifying the scheme.  Neither of these designs
 | |
| 	    implements a fairness mechanism (such a mechanism was
 | |
| 	    implemented but it was removed recently because it caused
 | |
| 	    serious slowdown and made the code more difficult to deal
 | |
| 	    with).</para>
 | |
| 	</sect4>
 | |
|       </sect3>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 xml:id="what-is-linux">
 | |
|       <title>What is &linux;</title>
 | |
| 
 | |
|       <para>&linux; is a &unix;-like kernel originally developed by
 | |
| 	Linus Torvalds, and now being contributed to by a massive
 | |
| 	crowd of programmers all around the world.  From its humble
 | |
| 	beginnings to today, with wide support from companies such as
 | |
| 	IBM or Google, &linux; has been associated with its fast
 | |
| 	development pace, full hardware support and benevolent
 | |
| 	dictator model of organization.</para>
 | |
| 
 | |
|       <para>&linux; development started in 1991 as a hobbyist project
 | |
| 	at the University of Helsinki in Finland.  Since then it has
 | |
| 	obtained all the features of a modern &unix;-like OS:
 | |
| 	multiprocessing, multiuser support, virtual memory,
 | |
| 	networking, basically everything is there.  There are also
 | |
| 	highly advanced features like virtualization etc.</para>
 | |
| 
 | |
|       <para>As of 2006 &linux; seems to be the most widely used open
 | |
| 	source operating system with support from independent software
 | |
| 	vendors like Oracle, RealNetworks, Adobe, etc.  Most of the
 | |
| 	commercial software distributed for &linux; can only be
 | |
| 	obtained in a binary form so recompilation for other operating
 | |
| 	systems is impossible.</para>
 | |
| 
 | |
|       <para>Most of the &linux; development happens in the
 | |
| 	<application>Git</application> version control system.
 | |
| 	<application>Git</application> is a distributed system so
 | |
| 	there is no central source of the &linux; code, but some
 | |
| 	branches are considered prominent and official.  The version
 | |
| 	number scheme implemented by &linux; consists of four numbers
 | |
| 	A.B.C.D.  Currently development happens in 2.6.C.D, where C
 | |
| 	represents a major version, in which new features are added or
 | |
| 	changed, while D is a minor version for bugfixes only.</para>
 | |
| 
 | |
|       <para>More information can be obtained from [3].</para>
 | |
| 
 | |
|       <sect3 xml:id="linux-tech-details">
 | |
| 	<title>Technical details</title>
 | |
| 
 | |
| 	<para>&linux; follows the traditional &unix; scheme of
 | |
| 	  dividing the run of a process in two halves: the kernel and
 | |
| 	  user space.  The kernel can be entered in two ways: via a
 | |
| 	  trap or via a syscall.  The return is handled only in one
 | |
| 	  way.  The further description applies to &linux; 2.6 on
 | |
| 	  the &i386; architecture.  This information was taken from
 | |
| 	  [2].</para>
 | |
| 
 | |
| 	<sect4 xml:id="linux-syscalls">
 | |
| 	  <title>Syscalls</title>
 | |
| 
 | |
| 	  <para>Syscalls in &linux; are performed (in userspace) using
 | |
| 	    <literal>syscallX</literal> macros where X substitutes a
 | |
| 	    number representing the number of parameters of the given
 | |
| 	    syscall.  This macro translates to code that loads the
 | |
| 	    <varname>%eax</varname> register with the number of the
 | |
| 	    syscall and executes interrupt <literal>0x80</literal>.
 | |
| 	    After this, syscall_return is called, which translates
 | |
| 	    negative return values to positive
 | |
| 	    <literal>errno</literal> values and sets
 | |
| 	    <literal>res</literal> to <literal>-1</literal> in case of
 | |
| 	    an error.  Whenever the interrupt <literal>0x80</literal>
 | |
| 	    is called the process enters the kernel in the system call
 | |
| 	    trap handler.  This routine saves all registers on the
 | |
| 	    stack and calls the selected syscall entry.  Note that the
 | |
| 	    &linux; calling convention expects parameters to the
 | |
| 	    syscall to be passed via registers as shown here:</para>
 | |
| 
 | |
| 	  <orderedlist>
 | |
| 	    <listitem>
 | |
| 	      <para>parameter -> <varname>%ebx</varname></para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>parameter -> <varname>%ecx</varname></para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>parameter -> <varname>%edx</varname></para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>parameter -> <varname>%esi</varname></para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>parameter -> <varname>%edi</varname></para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>parameter -> <varname>%ebp</varname></para>
 | |
| 	    </listitem>
 | |
| 	  </orderedlist>
 | |
| 
 | |
| 	  <para>There are some exceptions to this, where &linux; uses
 | |
| 	    a different calling convention (most notably the
 | |
| 	    <literal>clone</literal> syscall).</para>
 | |
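
	  <para>From the point of view of an emulation layer, the
	    register convention listed above means that the arguments
	    have to be collected from the saved registers rather than
	    from the stack.  A minimal model, assuming a hypothetical
	    <literal>struct regs_model</literal> in place of a real
	    trapframe and an invented helper
	    <literal>unpack_linux_args</literal>, could look like
	    this:</para>

	  <programlisting>#include &lt;stdint.h&gt;

/* Hypothetical register snapshot; a real trapframe holds more state. */
struct regs_model {
	uint32_t eax, ebx, ecx, edx, esi, edi, ebp;
};

/*
 * Collect up to six syscall arguments from the registers in the order
 * ebx, ecx, edx, esi, edi, ebp and return the syscall number from eax.
 */
uint32_t
unpack_linux_args(const struct regs_model *r, uint32_t args[6])
{
	args[0] = r-&gt;ebx;
	args[1] = r-&gt;ecx;
	args[2] = r-&gt;edx;
	args[3] = r-&gt;esi;
	args[4] = r-&gt;edi;
	args[5] = r-&gt;ebp;
	return (r-&gt;eax);
}</programlisting>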
| 	</sect4>
 | |
| 
 | |
| 	<sect4 xml:id="linux-traps">
 | |
| 	  <title>Traps</title>
 | |
| 
 | |
| 	  <para>The trap handlers are introduced in
 | |
| 	    <filename>arch/i386/kernel/traps.c</filename> and most of
 | |
| 	    these handlers live in
 | |
| 	    <filename>arch/i386/kernel/entry.S</filename>, where
 | |
| 	    handling of the traps happens.</para>
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 xml:id="linux-exits">
 | |
| 	  <title>Exits</title>
 | |
| 
 | |
| 	  <para>Return from the syscall is managed by
 | |
| 	    <literal>syscall_exit</literal>, which checks for the process having
 | |
| 	    unfinished work, then checks whether we used user-supplied
 | |
| 	    selectors.  If this happens stack fixing is applied and
 | |
| 	    finally the registers are restored from the stack and the
 | |
| 	    process returns to the userspace.</para>
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 xml:id="linux-unix-primitives">
 | |
| 	  <title>&unix; primitives</title>
 | |
| 
 | |
| 	  <para>In the 2.6 version, the &linux; operating system
 | |
| 	    redefined some of the traditional &unix; primitives,
 | |
| 	    notably PID, TID and thread.  PID is defined not to be
 | |
| 	    unique for every process, so for some processes (threads)
 | |
| 	    &man.getpid.2; returns the same value.  Unique
 | |
| 	    identification of a process is provided by TID.  This is
 | |
| 	    because <firstterm>NPTL</firstterm> (New &posix; Thread
 | |
| 	    Library) defines threads to be normal processes (so called
 | |
| 	    1:1 threading).  Spawning a new process in
 | |
| 	    &linux; 2.6 happens using the
 | |
| 	    <literal>clone</literal> syscall (fork variants are
 | |
| 	    reimplemented using it).  This clone syscall defines a set
 | |
| 	    of flags that affect behavior of the cloning process
 | |
| 	    regarding thread implementation.  The semantics are a bit
 | |
| 	    fuzzy as there is no single flag telling the syscall to
 | |
| 	    create a thread.</para>
 | |
| 
 | |
| 	  <para>Implemented clone flags are:</para>
 | |
| 
 | |
| 	  <itemizedlist>
 | |
| 	    <listitem>
 | |
| 	      <para><literal>CLONE_VM</literal> - processes share
 | |
| 		their memory space</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para><literal>CLONE_FS</literal> - share umask, cwd and
 | |
| 		namespace</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para><literal>CLONE_FILES</literal> - share open
 | |
| 		files</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para><literal>CLONE_SIGHAND</literal> - share signal
 | |
| 		handlers and blocked signals</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para><literal>CLONE_PARENT</literal> - share
 | |
| 		parent</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para><literal>CLONE_THREAD</literal> - be a thread
 | |
| 		(further explanation below)</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para><literal>CLONE_NEWNS</literal> - new
 | |
| 		namespace</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para><literal>CLONE_SYSVSEM</literal> - share SysV undo
 | |
| 		structures</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para><literal>CLONE_SETTLS</literal> - setup TLS at
 | |
| 		supplied address</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para><literal>CLONE_PARENT_SETTID</literal> - set TID
 | |
| 		in the parent</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para><literal>CLONE_CHILD_CLEARTID</literal> - clear
 | |
| 		TID in the child</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para><literal>CLONE_CHILD_SETTID</literal> - set TID in
 | |
| 		the child</para>
 | |
| 	    </listitem>
 | |
| 	  </itemizedlist>
 | |
| 
 | |
| 	  <para><literal>CLONE_PARENT</literal> sets the real parent
 | |
| 	    to the parent of the caller.  This is useful for threads
 | |
| 	    because if thread A creates thread B we want thread B to
 | |
| 	    be parented to the parent of the whole thread group.
 | |
| 	    <literal>CLONE_THREAD</literal> does exactly the same
 | |
| 	    thing as <literal>CLONE_PARENT</literal>,
 | |
| 	    <literal>CLONE_VM</literal> and
 | |
| 	    <literal>CLONE_SIGHAND</literal>, rewrites PID to be the
 | |
| 	    same as PID of the caller, sets exit signal to be none and
 | |
| 	    enters the thread group.  <literal>CLONE_SETTLS</literal>
 | |
| 	    sets up GDT entries for TLS handling.  The
 | |
| 	    <literal>CLONE_*_*TID</literal> set of flags sets/clears
 | |
| 	    user supplied address to TID or 0.</para>
 | |
| 
 | |
| 	  <para>As you can see the <literal>CLONE_THREAD</literal>
 | |
| 	    does most of the work and does not seem to fit the scheme
 | |
| 	    very well.  The original intention is unclear (even to the
 | |
| 	    authors, according to comments in the code) but I think
 | |
| 	    originally there was one threading flag, which was then
 | |
| 	    parcelled among many other flags but this separation was
 | |
| 	    never fully finished.  It is also unclear what this
 | |
| 	    partition is good for, as glibc does not use it, so only
 | |
| 	    hand-written use of clone permits a programmer to
 | |
| 	    access these features.</para>
 | |
| 
 | |
| 	  <para>For non-threaded programs the PID and TID are the
 | |
| 	    same.  For threaded programs the first thread PID and TID
 | |
| 	    are the same and every created thread shares the same PID
 | |
| 	    and gets assigned a unique TID (because
 | |
| 	    <literal>CLONE_THREAD</literal> is passed in).  The parent
 | |
| 	    is also shared by all processes forming this threaded
 | |
| 	    program.</para>
 | |
| 
 | |
| 	  <para>The code that implements &man.pthread.create.3; in
 | |
| 	    NPTL defines the clone flags like this:</para>
 | |
| 
 | |
| 	  <programlisting>int clone_flags = (CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGNAL
 | |
|                    | CLONE_SETTLS | CLONE_PARENT_SETTID
 | |
|                    | CLONE_CHILD_CLEARTID | CLONE_SYSVSEM
 | |
| #if __ASSUME_NO_CLONE_DETACHED == 0
 | |
|                    | CLONE_DETACHED
 | |
| #endif
 | |
|                    | 0);</programlisting>
 | |
| 
 | |
| 	  <para><literal>CLONE_SIGNAL</literal> is defined
 | |
| 	    as:</para>
 | |
| 
 | |
| 	  <programlisting>#define CLONE_SIGNAL (CLONE_SIGHAND | CLONE_THREAD)</programlisting>
 | |
| 
 | |
| 	  <para>The trailing <literal>0</literal> in
 | |
| 	    <varname>clone_flags</varname> means that no signal is sent
 | |
| 	    when any of the threads exits.</para>
 | |
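	  <para>As a rough illustration of how these flags combine, the
	    following userland C sketch rebuilds the flag word from the
	    listing above (without the optional
	    <literal>CLONE_DETACHED</literal>) and extracts the exit
	    signal from its low byte.  The <literal>EX_</literal> names
	    and the helper functions are hypothetical; only the numeric
	    values are taken from the &linux; headers.</para>

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the Linux flag values; the EX_ prefix is ours. */
#define EX_CSIGNAL		0x000000ffu	/* mask of the exit signal */
#define EX_CLONE_VM		0x00000100u
#define EX_CLONE_FS		0x00000200u
#define EX_CLONE_FILES		0x00000400u
#define EX_CLONE_SIGHAND	0x00000800u
#define EX_CLONE_THREAD		0x00010000u
#define EX_CLONE_SYSVSEM	0x00040000u
#define EX_CLONE_SETTLS		0x00080000u
#define EX_CLONE_PARENT_SETTID	0x00100000u
#define EX_CLONE_CHILD_CLEARTID	0x00200000u
#define EX_CLONE_SIGNAL		(EX_CLONE_SIGHAND | EX_CLONE_THREAD)

/* Build the flag word a thread library might pass to clone. */
uint32_t
ex_thread_flags(void)
{
	uint32_t flags;

	flags = EX_CLONE_VM | EX_CLONE_FS | EX_CLONE_FILES | EX_CLONE_SIGNAL |
	    EX_CLONE_SETTLS | EX_CLONE_PARENT_SETTID |
	    EX_CLONE_CHILD_CLEARTID | EX_CLONE_SYSVSEM | 0;
	/* The low byte stays 0, so no exit signal is requested. */
	assert((flags & EX_CSIGNAL) == 0);
	return (flags);
}

/* Extract the exit signal number stored in the low byte of the flags. */
int
ex_exit_signal(uint32_t flags)
{
	return ((int)(flags & EX_CSIGNAL));
}

/* Report whether the new task would join the caller's thread group. */
int
ex_is_thread(uint32_t flags)
{
	return ((flags & EX_CLONE_THREAD) != 0);
}
```

	  <para>A &man.fork.2;-like call would instead carry a signal
	    number in the low byte, which
	    <function>ex_exit_signal</function> returns unchanged.</para>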
| 	</sect4>
 | |
|       </sect3>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 xml:id="what-is-emu">
 | |
|       <title>What is emulation</title>
 | |
| 
 | |
|       <para>According to a dictionary definition, emulation is the
 | |
| 	ability of a program or device to imitate another program or
 | |
| 	device.  This is achieved by providing the same reaction to a
 | |
| 	given stimulus as the emulated object.  In practice, the
 | |
| 	software world mostly sees three types of emulation - a
 | |
| 	program used to emulate a machine (QEMU, various game console
 | |
| 	emulators etc.), software emulation of a hardware facility
 | |
| 	(OpenGL emulators, floating point units emulation etc.) and
 | |
| 	operating system emulation (either in the kernel of the operating
 | |
| 	system or as a userspace program).</para>
 | |
| 
 | |
|       <para>Emulation is usually used where using the original
 | |
| 	component is not feasible or not possible at all.  For
 | |
| 	example, someone might want to use a program developed for a
 | |
| 	different operating system than the one they use.  Then
 | |
| 	emulation comes in handy.  Sometimes there is no other way but
 | |
| 	to use emulation, e.g. when the hardware device you try to use
 | |
| 	does not exist (yet or anymore).  This happens often when
 | |
| 	porting an operating system to a new (non-existent) platform.
 | |
| 	Sometimes it is just cheaper to emulate.</para>
 | |
| 
 | |
|       <para>From an implementation point of view, there are two
 | |
| 	main approaches to emulation.  The first is to emulate the
 | |
| 	whole thing: accepting possible inputs
 | |
| 	of the original object, maintaining inner state and emitting
 | |
| 	correct output based on the state and/or input.  This kind of
 | |
| 	emulation does not require any special conditions and
 | |
| 	basically can be implemented anywhere for any device/program.
 | |
| 	The drawback is that implementing such emulation is quite
 | |
| 	difficult, time-consuming and error-prone.  In some cases we
 | |
| 	can use a simpler approach.  Imagine you want to emulate a
 | |
| 	printer that prints from left to right on a printer that
 | |
| 	prints from right to left.  It is obvious that there is no
 | |
| 	need for a complex emulation layer; simply reversing the
 | |
| 	printed text is sufficient.  Sometimes the
 | |
| 	emulating environment is very similar to the emulated one so
 | |
| 	just a thin layer of some translation is necessary to provide
 | |
| 	fully working emulation!  As you can see this is much less
 | |
| 	demanding to implement, so less time-consuming and error-prone
 | |
| 	than the previous approach.  But the necessary condition is
 | |
| 	that the two environments must be similar enough.  A third,
 | |
| 	hybrid approach combines the previous two.  Most of the time the
 | |
| 	objects do not provide the same capabilities so in a case of
 | |
| 	emulating the more powerful one on the less powerful we have
 | |
| 	to emulate the missing features with full emulation described
 | |
| 	above.</para>
 | |
| 
 | |
|       <para>This master's thesis deals with emulation of &unix; on
 | |
| 	&unix;, which is exactly the case where only a thin layer of
 | |
| 	translation is sufficient to provide full emulation.  The
 | |
| 	&unix; API consists of a set of syscalls, which are usually
 | |
| 	self-contained and do not affect any global kernel
 | |
| 	state.</para>
 | |
| 
 | |
|       <para>There are a few syscalls that affect inner state but this
 | |
| 	can be dealt with by providing some structures that maintain
 | |
| 	the extra state.</para>
 | |
| 
 | |
|       <para>No emulation is perfect and emulations tend to lack some
 | |
| 	parts but this usually does not cause any serious drawbacks.
 | |
| 	Imagine a game console emulator that emulates everything but
 | |
| 	music output.  The games are no doubt playable and one
 | |
| 	can use the emulator.  It might not be as comfortable as the
 | |
| 	original game console but it is an acceptable compromise between
 | |
| 	price and comfort.</para>
 | |
| 
 | |
|       <para>The same goes with the &unix; API.  Most programs can live
 | |
| 	with a very limited set of syscalls working.  Those syscalls
 | |
| 	tend to be the oldest ones (&man.read.2;/&man.write.2;,
 | |
| 	&man.fork.2; family, &man.signal.3; handling, &man.exit.3;,
 | |
| 	&man.socket.2; API) and are therefore easy to emulate because
 | |
| 	their semantics are shared among all &unix;es that exist
 | |
| 	today.</para>
 | |
|     </sect2>
 | |
|   </sect1>
 | |
| 
 | |
|   <sect1 xml:id="freebsd-emulation">
 | |
|     <title>Emulation</title>
 | |
| 
 | |
|     <sect2>
 | |
|       <title>How emulation works in &os;</title>
 | |
| 
 | |
|       <para>As stated earlier, &os; supports running binaries from
 | |
| 	several other &unix;es.  This works because &os; has an
 | |
| 	abstraction called the execution class loader.  This wedges
 | |
| 	into the &man.execve.2; syscall, so when &man.execve.2; is
 | |
| 	about to execute a binary it examines its type.</para>
 | |
| 
 | |
|       <para>There are basically two types of binaries in &os;:
 | |
| 	shell-like text scripts, which are identified by
 | |
| 	<literal>#!</literal> as their first two characters and normal
 | |
| 	(typically <firstterm>ELF</firstterm>) binaries, which are a
 | |
| 	representation of a compiled executable object.  The vast
 | |
| 	majority (one could say all of them) of binaries in &os; are
 | |
| 	of type ELF.  ELF files contain a header, which specifies
 | |
| 	the OS ABI for this ELF file.  By reading this information,
 | |
| 	the operating system can accurately determine what type of
 | |
| 	binary the given file is.</para>
 | |
| 
 | |
|       <para>Every OS ABI must be registered in the &os; kernel.  This
 | |
| 	applies to the &os; native OS ABI, as well.  So when
 | |
| 	&man.execve.2; executes a binary it iterates through the list
 | |
| 	of registered ABIs and when it finds the right one it starts
 | |
| 	to use the information contained in the OS ABI description
 | |
| 	(its syscall table, <literal>errno</literal> translation
 | |
| 	table, etc.).  So every time the process calls a syscall, it
 | |
| 	uses its own set of syscalls instead of some global one.  This
 | |
| 	effectively provides a very elegant and easy way of supporting
 | |
| 	execution of various binary formats.</para>
 | |
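      <para>The registration and lookup described above can be
	pictured with the following userland sketch.  It assumes that
	an ABI is selected only by the OS ABI byte of the ELF header,
	which is a simplification; the real matching logic in the
	kernel is more involved.  All <literal>ex_</literal> names are
	hypothetical and are not kernel interfaces.</para>

```c
#include <assert.h>
#include <stddef.h>

#define EX_OSABI_LINUX		3	/* ELFOSABI_LINUX */
#define EX_OSABI_FREEBSD	9	/* ELFOSABI_FREEBSD */
#define EX_MAX_ABIS		4

typedef int (*ex_syscall_t)(void *args);

/* Per-ABI description: what execve would attach to the new process. */
struct ex_abi {
	unsigned char	 osabi;		/* value expected in the ELF header */
	const char	*name;
	ex_syscall_t	*table;		/* this ABI's own syscall table */
	size_t		 nsyscalls;
};

static struct ex_abi *ex_abis[EX_MAX_ABIS];
static size_t ex_nabis;

/* Register an ABI; returns 0 on success, -1 if the registry is full. */
int
ex_register_abi(struct ex_abi *abi)
{
	assert(abi != NULL);
	if (ex_nabis == EX_MAX_ABIS)
		return (-1);
	ex_abis[ex_nabis++] = abi;
	return (0);
}

/* Walk the registered ABIs and return the one matching the binary. */
struct ex_abi *
ex_find_abi(unsigned char osabi)
{
	size_t i;

	for (i = 0; i < ex_nabis; i++)
		if (ex_abis[i]->osabi == osabi)
			return (ex_abis[i]);
	return (NULL);		/* unknown binary type */
}
```

      <para>Once a match is found, every later syscall of that process
	would be dispatched through the <varname>table</varname> of
	the matched entry rather than through a global table.</para>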
| 
 | |
|       <para>The nature of emulation of different OSes (and also some
 | |
| 	other subsystems) led developers to invent an event handler
 | |
| 	mechanism.  There are various places in the kernel where a
 | |
| 	list of event handlers is called.  Every subsystem can
 | |
| 	register an event handler and they are called accordingly.
 | |
| 	For example, when a process exits, a handler is called that
 | |
| 	cleans up whatever the subsystem needs to have cleaned
 | |
| 	up.</para>
 | |
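      <para>The idea of such a handler list can be sketched in plain
	C as follows.  This is only a userland model with hypothetical
	<literal>ex_</literal> names; it does not reproduce the
	kernel's actual event handler interface.</para>

```c
#include <assert.h>
#include <stddef.h>

#define EX_MAX_HANDLERS	8

typedef void (*ex_exit_handler_t)(int pid, void *arg);

struct ex_handler {
	ex_exit_handler_t	 fn;
	void			*arg;	/* private data of the subsystem */
};

static struct ex_handler ex_exit_handlers[EX_MAX_HANDLERS];
static size_t ex_nhandlers;

/* A subsystem registers its interest in process exit. */
int
ex_register_exit_handler(ex_exit_handler_t fn, void *arg)
{
	assert(fn != NULL);
	if (ex_nhandlers == EX_MAX_HANDLERS)
		return (-1);
	ex_exit_handlers[ex_nhandlers].fn = fn;
	ex_exit_handlers[ex_nhandlers].arg = arg;
	ex_nhandlers++;
	return (0);
}

/* Called from the (modelled) exit path; returns the number of handlers run. */
size_t
ex_invoke_exit_handlers(int pid)
{
	size_t i;

	for (i = 0; i < ex_nhandlers; i++)
		ex_exit_handlers[i].fn(pid, ex_exit_handlers[i].arg);
	return (ex_nhandlers);
}

/* Example handler: count exits in a counter owned by the subsystem. */
void
ex_count_exit(int pid, void *arg)
{
	(void)pid;
	++*(int *)arg;
}
```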
| 
 | |
|       <para>Those simple facilities provide basically everything that
 | |
| 	is needed for the emulation infrastructure and in fact these
 | |
| 	are basically the only things necessary to implement the
 | |
| 	&linux; emulation layer.</para>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 xml:id="freebsd-common-primitives">
 | |
|       <title>Common primitives in the &os; kernel</title>
 | |
| 
 | |
|       <para>Emulation layers need some support from the operating
 | |
| 	system.  I am going to describe some of the supported
 | |
| 	primitives in the &os; operating system.</para>
 | |
| 
 | |
|       <sect3 xml:id="freebsd-locking-primitives">
 | |
| 	<title>Locking primitives</title>
 | |
| 
 | |
| 	<para>Contributed by: &a.attilio.email;</para>
 | |
| 
 | |
| 	<para>The &os; synchronization primitive set is based on the
 | |
| 	  idea of supplying a rather large number of different
 | |
| 	  primitives so that the most suitable one can be used in each
 | |
| 	  particular situation.</para>
 | |
| 
 | |
| 	<para>From a high-level point of view you can consider three
 | |
| 	  kinds of synchronization primitives in the &os;
 | |
| 	  kernel:</para>
 | |
| 
 | |
| 	<itemizedlist>
 | |
| 	  <listitem>
 | |
| 	    <para>atomic operations and memory barriers</para>
 | |
| 	  </listitem>
 | |
| 	  <listitem>
 | |
| 	    <para>locks</para>
 | |
| 	  </listitem>
 | |
| 	  <listitem>
 | |
| 	    <para>scheduling barriers</para>
 | |
| 	  </listitem>
 | |
| 	</itemizedlist>
 | |
| 
 | |
| 	<para>The three families are described below.  For
 | |
| 	  every lock, you should check the linked manpage
 | |
| 	  (where possible) for more detailed explanations.</para>
 | |
| 
 | |
| 	<sect4 xml:id="freebsd-atomic-op">
 | |
| 	  <title>Atomic operations and memory barriers</title>
 | |
| 
 | |
| 	  <para>Atomic operations are implemented through a set of
 | |
| 	    functions performing simple arithmetic on memory operands
 | |
| 	    in an atomic way with respect to external events
 | |
| 	    (interrupts, preemption, etc.).  Atomic operations can
 | |
| 	    guarantee atomicity only on small data types (on the order
 | |
| 	    of the size of the architecture's <literal>long</literal>
 | |
| 	    C data type), so they should rarely be used directly in
 | |
| 	    end-level code, except for very simple operations (like
 | |
| 	    setting a flag in a bitmap, for example).  In fact, it is
 | |
| 	    rather easy and common to write incorrect code based on
 | |
| 	    atomic operations alone (usually referred to as
 | |
| 	    lock-less).  The &os; kernel offers a way to perform atomic
 | |
| 	    operations in conjunction with a memory barrier.  Memory
 | |
| 	    barriers guarantee that an atomic operation happens in
 | |
| 	    some specified order with respect to other memory
 | |
| 	    accesses.  For example, if we need an atomic operation to
 | |
| 	    happen only after all other pending writes (in terms of
 | |
| 	    instruction reordering and buffer activity) are completed,
 | |
| 	    we need to explicitly use a memory barrier in conjunction
 | |
| 	    with this atomic operation.  So it is simple to understand
 | |
| 	    why memory barriers play a key role in building
 | |
| 	    higher-level locks (such as refcounts, mutexes, etc.).
 | |
| 	    For a detailed explanation of atomic operations, please
 | |
| 	    refer to &man.atomic.9;.  It is worth noting, however,
 | |
| 	    that atomic operations (and memory barriers as well)
 | |
| 	    should ideally only be used for building front-ending
 | |
| 	    locks (such as mutexes).</para>
 | |
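	  <para>The two uses mentioned above, setting a flag in a
	    bitmap and ordering an atomic store after earlier writes,
	    can be illustrated in userland with C11
	    <filename>stdatomic.h</filename>.  This is an analogy
	    only: it does not use the kernel interface documented in
	    &man.atomic.9;, and all <literal>ex_</literal> names are
	    hypothetical.</para>

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_ulong ex_flags;	/* one machine word used as a bitmap */
static int ex_payload;		/* ordinary data published via ex_ready */
static atomic_bool ex_ready;

/* Atomically set a bit; returns true if the bit was previously clear. */
bool
ex_set_flag(unsigned int bit)
{
	unsigned long mask, old;

	assert(bit < sizeof(unsigned long) * CHAR_BIT);
	mask = 1UL << bit;
	old = atomic_fetch_or_explicit(&ex_flags, mask, memory_order_relaxed);
	return ((old & mask) == 0);
}

/*
 * Publish a value.  The release ordering makes the write to ex_payload
 * visible before any other thread can observe ex_ready as true.
 */
void
ex_publish(int value)
{
	ex_payload = value;
	atomic_store_explicit(&ex_ready, true, memory_order_release);
}

/* Read the value if it was published; pairs with the release above. */
bool
ex_consume(int *out)
{
	if (!atomic_load_explicit(&ex_ready, memory_order_acquire))
		return (false);
	*out = ex_payload;
	return (true);
}
```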
| 	</sect4>
 | |
| 
 | |
| 	<sect4 xml:id="freebsd-refcounts">
 | |
| 	  <title>Refcounts</title>
 | |
| 
 | |
| 	  <para>Refcounts are interfaces for handling reference
 | |
| 	    counters.  They are implemented through atomic operations
 | |
| 	    and are intended to be used only in cases where the
 | |
| 	    reference counter is the only thing to be protected,
 | |
| 	    so even something like a spin mutex is discouraged.
 | |
| 	    Using the refcount interface for structures where a mutex
 | |
| 	    is already used is often wrong, since the reference
 | |
| 	    counter should probably be handled in some already
 | |
| 	    protected path.  A manpage discussing refcount does not
 | |
| 	    exist currently; check
 | |
| 	    <filename>sys/refcount.h</filename> for an overview
 | |
| 	    of the existing API.</para>
 | |
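	  <para>A reference counter built only on atomic operations
	    might look like the following userland sketch, written
	    with C11 atomics.  The <literal>ex_</literal> names are
	    hypothetical and the code is not taken from
	    <filename>sys/refcount.h</filename>.</para>

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct ex_obj {
	atomic_uint	refs;	/* the only field that needs protection */
};

/* Initialize the counter, typically to 1 for the creator's reference. */
void
ex_ref_init(struct ex_obj *o, unsigned int n)
{
	atomic_init(&o->refs, n);
}

/* Take an additional reference. */
void
ex_ref_acquire(struct ex_obj *o)
{
	atomic_fetch_add_explicit(&o->refs, 1, memory_order_relaxed);
}

/* Drop a reference; returns true when the last reference is gone. */
bool
ex_ref_release(struct ex_obj *o)
{
	unsigned int old;

	old = atomic_fetch_sub_explicit(&o->refs, 1, memory_order_acq_rel);
	assert(old > 0);	/* releasing more than was acquired is a bug */
	return (old == 1);
}
```

	  <para>The caller that sees <literal>true</literal> returned
	    from the release function is the one responsible for
	    freeing the object.</para>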
| 	</sect4>
 | |
| 
 | |
| 	<sect4 xml:id="freebsd-locks">
 | |
| 	  <title>Locks</title>
 | |
| 
 | |
| 	  <para>The &os; kernel has a large number of lock classes.
 | |
| 	    Every lock is defined by some peculiar properties, but
 | |
| 	    probably the most important is the event linked to
 | |
| 	    contending threads (in other words, the behavior of
 | |
| 	    threads unable to acquire the lock).  &os;'s locking scheme
 | |
| 	    presents three different behaviors for contenders:</para>
 | |
| 
 | |
| 	  <orderedlist>
 | |
| 	    <listitem>
 | |
| 	      <para>spinning</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>blocking</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>sleeping</para>
 | |
| 	    </listitem>
 | |
| 	  </orderedlist>
 | |
| 
 | |
| 	  <note>
 | |
| 	    <para>The numbering is not arbitrary; it defines the lock
 | |
| 	      levels used later in this section.</para>
 | |
| 	  </note>
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 xml:id="freebsd-spinlocks">
 | |
| 	  <title>Spinning locks</title>
 | |
| 
 | |
| 	  <para>Spin locks let waiters spin until they can
 | |
| 	    acquire the lock.  An important matter to deal with is
 | |
| 	    what happens when a thread contending on a spin lock is
 | |
| 	    not descheduled.  Since the &os; kernel is preemptive, this
 | |
| 	    exposes spin locks to the risk of deadlocks, which can be
 | |
| 	    avoided only by disabling interrupts while they are held.
 | |
| 	    For this and other reasons (like lack of priority
 | |
| 	    propagation support, poor load balancing between CPUs,
 | |
| 	    etc.), spin locks are intended to protect
 | |
| 	    very small paths of code, or ideally not to be used at all
 | |
| 	    if not explicitly requested (explained later).</para>
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 xml:id="freebsd-blocking">
 | |
| 	  <title>Blocking</title>
 | |
| 
 | |
| 	  <para>Block locks let waiters be descheduled and blocked
 | |
| 	    until the lock owner drops the lock and wakes up one or
 | |
| 	    more contenders.  In order to avoid starvation issues,
 | |
| 	    blocking locks do priority propagation from the waiters to
 | |
| 	    the owner.  Block locks must be implemented through the
 | |
| 	    turnstile interface and are intended to be the most used
 | |
| 	    kind of locks in the kernel, unless particular conditions
 | |
| 	    apply.</para>
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 xml:id="freebsd-sleeping">
 | |
| 	  <title>Sleeping</title>
 | |
| 
 | |
| 	  <para>Sleep locks let waiters be descheduled and fall
 | |
| 	    asleep until the lock holder drops the lock and wakes up
 | |
| 	    one or more waiters.  Since sleep locks are intended to
 | |
| 	    protect large paths of code and to cater for asynchronous
 | |
| 	    events, they do not do any form of priority propagation.
 | |
| 	    They must be implemented through the &man.sleepqueue.9;
 | |
| 	    interface.</para>
 | |
| 
 | |
| 	  <para>The order used to acquire locks is very important, not
 | |
| 	    only because of the possibility of deadlock due to lock
 | |
| 	    order reversals, but also because lock acquisition should
 | |
| 	    follow specific rules linked to the nature of each lock.
 | |
| 	    Looking at the list of behaviors above, the practical rule
 | |
| 	    is that if a thread holds a lock of level n (where the
 | |
| 	    level is the number listed next to the kind of lock) it is
 | |
| 	    not allowed to acquire a lock of a higher level, since
 | |
| 	    this would break the specified semantics for a path.  For
 | |
| 	    example, if a thread holds a block lock (level 2), it is
 | |
| 	    allowed to acquire a spin lock (level 1) but not a sleep
 | |
| 	    lock (level 3), since block locks are intended to protect
 | |
| 	    smaller paths than sleep locks (these rules are not about
 | |
| 	    atomic operations or scheduling barriers, however).</para>
 | |
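	  <para>The level rule can be written down as a small
	    predicate.  The sketch below is only a model of the rule as
	    stated in this section, with hypothetical
	    <literal>ex_</literal> names; it is not how the kernel
	    checks lock order.</para>

```c
#include <assert.h>
#include <stdbool.h>

/* Levels follow the numbering of the contender behaviors. */
enum ex_lock_level {
	EX_LOCK_NONE = 0,	/* the thread holds no lock */
	EX_LOCK_SPIN = 1,
	EX_LOCK_BLOCK = 2,
	EX_LOCK_SLEEP = 3
};

/*
 * A thread holding a lock of level `held' may acquire a lock of level
 * `wanted' only if `wanted' is not a higher level than `held'.
 */
bool
ex_may_acquire(enum ex_lock_level held, enum ex_lock_level wanted)
{
	assert(wanted != EX_LOCK_NONE);
	if (held == EX_LOCK_NONE)
		return (true);
	return (wanted <= held);
}
```

	  <para>With a block lock held, a spin lock may be taken but a
	    sleep lock may not, which matches the example in the
	    text.</para>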
| 
 | |
| 	  <para>This is a list of locks with their respective
 | |
| 	    behaviors:</para>
 | |
| 
 | |
| 	  <itemizedlist>
 | |
| 	    <listitem>
 | |
| 	      <para>spin mutex - spinning - &man.mutex.9;</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>sleep mutex - blocking - &man.mutex.9;</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>pool mutex - blocking - &man.mtx.pool.9;</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>sleep family - sleeping - &man.sleep.9;
 | |
| 		(pause, tsleep, msleep, msleep_spin, msleep_rw, msleep_sx)</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>condvar - sleeping - &man.condvar.9;</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>rwlock - blocking - &man.rwlock.9;</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>sxlock - sleeping - &man.sx.9;</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>lockmgr - sleeping - &man.lockmgr.9;</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>semaphores - sleeping - &man.sema.9;</para>
 | |
| 	    </listitem>
 | |
| 	  </itemizedlist>
 | |
| 
 | |
| 	  <para>Among these locks only mutexes, sxlocks, rwlocks and
 | |
| 	    lockmgrs are intended to handle recursion, but currently
 | |
| 	    recursion is only supported by mutexes and
 | |
| 	    lockmgrs.</para>
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 xml:id="freebsd-scheduling">
 | |
| 	  <title>Scheduling barriers</title>
 | |
| 
 | |
| 	  <para>Scheduling barriers are intended to be used in order
 | |
| 	    to drive the scheduling of threads.  They consist mainly of
 | |
| 	    three different stubs:</para>
 | |
| 
 | |
| 	  <itemizedlist>
 | |
| 	    <listitem>
 | |
| 	      <para>critical sections (and preemption)</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>sched_bind</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>sched_pin</para>
 | |
| 	    </listitem>
 | |
| 	  </itemizedlist>
 | |
| 
 | |
| 	  <para>Generally, these should be used only in a particular
 | |
| 	    context and even if they can often replace locks, they
 | |
| 	    should be avoided because they prevent locking debugging
 | |
| 	    tools (such as &man.witness.4;) from diagnosing even simple
 | |
| 	    problems.</para>
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 xml:id="freebsd-critical">
 | |
| 	  <title>Critical sections</title>
 | |
| 
 | |
| 	  <para>The &os; kernel has been made preemptive basically to
 | |
| 	    deal with interrupt threads.  In fact, in order to avoid
 | |
| 	    high interrupt latency, time-sharing priority threads can
 | |
| 	    be preempted by interrupt threads (in this way, they do
 | |
| 	    not need to wait to be scheduled as the normal path
 | |
| 	    would require).  Preemption, however, introduces new race
 | |
| 	    points that need to be handled as well.  Often, in order
 | |
| 	    to deal with preemption, the simplest thing to do is to
 | |
| 	    completely disable it.  A critical section defines a piece
 | |
| 	    of code (delimited by the pair of functions
 | |
| 	    &man.critical.enter.9; and &man.critical.exit.9;) where
 | |
| 	    preemption is guaranteed not to happen (until the
 | |
| 	    protected code is fully executed).  This can often replace
 | |
| 	    a lock effectively but should be used carefully in order
 | |
| 	    not to lose the whole advantage that preemption
 | |
| 	    brings.</para>
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 xml:id="freebsd-schedpin">
 | |
| 	  <title>sched_pin/sched_unpin</title>
 | |
| 
 | |
| 	  <para>Another way to deal with preemption is the
 | |
| 	    <function>sched_pin()</function> interface.  If a piece of
 | |
| 	    code is enclosed in the <function>sched_pin()</function>
 | |
| 	    and <function>sched_unpin()</function> pair of functions
 | |
| 	    it is guaranteed that the respective thread, even if it
 | |
| 	    can be preempted, will always be executed on the same
 | |
| 	    CPU.  Pinning is very effective in the particular case
 | |
| 	    when we have to access per-CPU data and we assume
 | |
| 	    other threads will not change those data.  In that case
 | |
| 	    a critical section would be too strong a condition for
 | |
| 	    our code.</para>
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 xml:id="freebsd-schedbind">
 | |
| 	  <title>sched_bind/sched_unbind</title>
 | |
| 
 | |
| 	  <para><function>sched_bind</function> is an API used to
 | |
| 	    bind a thread to a particular CPU for as long as it
 | |
| 	    executes the code, until a
 | |
| 	    <function>sched_unbind</function> function call
 | |
| 	    unbinds it.  This feature has a key role in situations
 | |
| 	    where you cannot trust the current state of CPUs (for
 | |
| 	    example, at very early stages of boot), as you want to
 | |
| 	    prevent your thread from migrating to inactive CPUs.  Since
 | |
| 	    <function>sched_bind</function> and
 | |
| 	    <function>sched_unbind</function> manipulate internal
 | |
| 	    scheduler structures, they need to be enclosed in
 | |
| 	    <function>sched_lock</function> acquisition and release
 | |
| 	    when used.</para>
 | |
| 	</sect4>
 | |
|       </sect3>
 | |
| 
 | |
|       <sect3 xml:id="freebsd-proc">
 | |
| 	<title>Proc structure</title>
 | |
| 
 | |
| 	<para>Various emulation layers sometimes require some
 | |
| 	  additional per-process data.  A layer can manage separate
 | |
| 	  structures (a list, a tree, etc.) containing these data for
 | |
| 	  every process but this tends to be slow and memory
 | |
| 	  consuming.  To solve this problem the &os;
 | |
| 	  <literal>proc</literal> structure contains
 | |
| 	  <literal>p_emuldata</literal>, which is a void pointer to
 | |
| 	  some emulation layer specific data.  This
 | |
| 	  <literal>proc</literal> entry is protected by the proc
 | |
| 	  mutex.</para>
 | |
| 
 | |
| 	<para>The &os; <literal>proc</literal> structure contains a
 | |
| 	  <literal>p_sysent</literal> entry that identifies which ABI
 | |
| 	  this process is running.  In fact, it is a pointer to the
 | |
| 	  <literal>sysentvec</literal> described above.  So by
 | |
| 	  comparing this pointer to the address where the
 | |
| 	  <literal>sysentvec</literal> structure for the given ABI is
 | |
| 	  stored we can effectively determine whether the process
 | |
| 	  belongs to our emulation layer.  The code typically looks
 | |
| 	  like:</para>
 | |
| 
 | |
| 	<programlisting>if (__predict_true(p->p_sysent != &elf_linux_sysvec))
 | |
| 	  return;</programlisting>
 | |
| 
 | |
| 	<para>As you can see, we effectively use the
 | |
| 	  <literal>__predict_true</literal> modifier to collapse the
 | |
| 	  most common case (&os; process) to a simple return operation
 | |
| 	  thus preserving high performance.  This code should be
 | |
| 	  turned into a macro because currently it is not very
 | |
| 	  flexible, i.e. we do not support &linux;64 emulation nor
 | |
| 	  A.OUT &linux; processes on i386.</para>
 | |
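	<para>A macro of the kind suggested above could look like the
	  following userland mock-up.  The structures are reduced to
	  the two fields discussed in this section and every
	  <literal>ex_</literal> or <literal>EX_</literal> name is
	  hypothetical; the sketch only shows the pointer comparison,
	  not the real kernel definitions.</para>

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the kernel structures, reduced to what is needed here. */
struct ex_sysentvec {
	const char		*sv_name;
};

struct ex_proc {
	struct ex_sysentvec	*p_sysent;	/* ABI this process runs under */
	void			*p_emuldata;	/* emulation layer private data */
};

struct ex_sysentvec ex_freebsd_sysvec = { "FreeBSD ELF" };
struct ex_sysentvec ex_linux_sysvec = { "Linux ELF" };

/* One place to extend if more Linux ABIs had to be recognized. */
#define	EX_IS_LINUX_PROC(p)	((p)->p_sysent == &ex_linux_sysvec)

/* Return the emulation data, or NULL for processes of other ABIs. */
void *
ex_emuldata(struct ex_proc *p)
{
	assert(p != NULL);
	if (!EX_IS_LINUX_PROC(p))
		return (NULL);	/* common case: not our process */
	return (p->p_emuldata);
}
```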
|       </sect3>
 | |
| 
 | |
|       <sect3 xml:id="freebsd-vfs">
 | |
| 	<title>VFS</title>
 | |
| 
 | |
| 	<para>The &os; VFS subsystem is very complex but the &linux;
 | |
| 	  emulation layer uses just a small subset via a well defined
 | |
| 	  API.  It can either operate on vnodes or file handlers.
 | |
| 	  A vnode is a virtual node, i.e. a representation of a
 | |
| 	  node in VFS.  Another representation is a file handler,
 | |
| 	  which represents an opened file from the perspective of a
 | |
| 	  process.  A file handler can represent a socket or an
 | |
| 	  ordinary file.  A file handler contains a pointer to its
 | |
| 	  vnode.  More than one file handler can point to the same
 | |
| 	  vnode.</para>
 | |
| 
 | |
| 	<sect4 xml:id="freebsd-namei">
 | |
| 	  <title>namei</title>
 | |
| 
 | |
| 	  <para>The &man.namei.9; routine is a central entry point to
 | |
| 	    pathname lookup and translation.  It traverses the path
 | |
| 	    component by component from the starting point to the end
 | |
| 	    point using a lookup function, which is internal to VFS.
 | |
| 	    The &man.namei.9; routine can cope with symlinks, absolute
 | |
| 	    and relative paths.  When a path is looked up using
 | |
| 	    &man.namei.9; it is entered into the name cache.  This
 | |
| 	    behavior can be suppressed.  This routine is used all over
 | |
| 	    the kernel and its performance is very critical.</para>
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 xml:id="freebsd-vn">
 | |
| 	  <title>vn_fullpath</title>
 | |
| 
 | |
| 	  <para>The &man.vn.fullpath.9; function makes a best-effort
 | |
| 	    attempt to traverse the VFS name cache and return a path
 | |
| 	    for a given (locked) vnode.  This process is unreliable but
 | |
| 	    works just fine for the most common cases.  It is
 | |
| 	    unreliable because it relies on the VFS cache (it does not
 | |
| 	    traverse the on-medium structures), it does not work with
 | |
| 	    hardlinks, etc.  This routine is used in several places in
 | |
| 	    the Linuxulator.</para>
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 xml:id="freebsd-vnode">
 | |
| 	  <title>Vnode operations</title>
 | |
| 
 | |
| 	  <itemizedlist>
 | |
| 	    <listitem>
 | |
| 	      <para><function>fgetvp</function> - given a thread and a
 | |
| 		file descriptor number it returns the associated
 | |
| 		vnode</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>&man.vn.lock.9; - locks a vnode</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para><function>vn_unlock</function> - unlocks a
 | |
| 		vnode</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>&man.VOP.READDIR.9; - reads a directory referenced
 | |
| 		by a vnode</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>&man.VOP.GETATTR.9; - gets attributes of a file or
 | |
| 		a directory referenced by a vnode</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>&man.VOP.LOOKUP.9; - looks up a path to a given
 | |
| 		directory</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>&man.VOP.OPEN.9; - opens a file referenced by a
 | |
| 		vnode</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>&man.VOP.CLOSE.9; - closes a file referenced by a
 | |
| 		vnode</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>&man.vput.9; - decrements the use count for a
 | |
| 		vnode and unlocks it</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>&man.vrele.9; - decrements the use count for a
 | |
| 		vnode</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para>&man.vref.9; - increments the use count for a
 | |
| 		vnode</para>
 | |
| 	    </listitem>
 | |
| 	  </itemizedlist>
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 xml:id="freebsd-file-handler">
 | |
| 	  <title>File handler operations</title>
 | |
| 
 | |
| 	  <itemizedlist>
 | |
| 	    <listitem>
 | |
| 	      <para><function>fget</function> - given a thread and a
 | |
| 		file descriptor number it returns associated file
 | |
| 		handler and references it</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para><function>fdrop</function> - drops a reference to
 | |
| 		a file handler</para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para><function>fhold</function> - references a file
 | |
| 		handler</para>
 | |
| 	    </listitem>
 | |
| 	  </itemizedlist>
 | |
| 	</sect4>
 | |
|       </sect3>
 | |
|     </sect2>
 | |
|   </sect1>
 | |
| 
 | |
|   <sect1 xml:id="md">
 | |
|     <title>&linux; emulation layer - MD part</title>
 | |
| 
 | |
|     <para>This section deals with the implementation of the &linux;
 | |
|       emulation layer in the &os; operating system.  It first
 | |
|       describes the machine dependent part, covering how and where
 | |
|       interaction between userland and kernel is implemented.  It
 | |
|       talks about syscalls, signals, ptrace, traps and stack fixup.
 | |
|       This part discusses i386
 | |
|       but it is written generally so other architectures should not
 | |
|       differ very much.  The next part is the machine independent part
 | |
|       of the Linuxulator.  This section only covers i386 and ELF
 | |
|       handling.  A.OUT is obsolete and untested.</para>
 | |
| 
 | |
|     <sect2 xml:id="syscall-handling">
 | |
|       <title>Syscall handling</title>
 | |
| 
 | |
|       <para>Syscall handling is mostly written in
 | |
| 	<filename>linux_sysvec.c</filename>, which covers most of the
 | |
| 	routines pointed out in the <literal>sysentvec</literal>
 | |
| 	structure.  When a &linux; process running on &os; issues a
 | |
| 	syscall, the general syscall routine calls the
 | |
| 	<function>linux_prepsyscall</function> routine for the &linux; ABI.</para>
 | |
| 
 | |
|       <sect3 xml:id="linux-prepsyscall">
 | |
| 	<title>&linux; prepsyscall</title>
 | |
| 
 | |
| 	<para>&linux; passes arguments to syscalls via registers (that
 | |
| 	  is why it is limited to 6 parameters on i386) while &os;
 | |
| 	  uses the stack.  The &linux; prepsyscall routine must copy
 | |
| 	  parameters from registers to the stack.  The order of the
 | |
| 	  registers is: <varname>%ebx</varname>,
 | |
| 	  <varname>%ecx</varname>, <varname>%edx</varname>,
 | |
| 	  <varname>%esi</varname>, <varname>%edi</varname>,
 | |
| 	  <varname>%ebp</varname>.  The catch is that this is true for
 | |
| 	  only <emphasis>most</emphasis> of the syscalls.  Some (most
 | |
| 	  notably <function>clone</function>) use a different order
 | |
| 	  but it is luckily easy to fix by inserting a dummy parameter
 | |
| 	  in the <function>linux_clone</function> prototype.</para>
 | |
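	<para>The register-to-argument copy can be sketched as follows.
	  The trap frame below is a hypothetical stand-in holding only
	  the registers named above plus <varname>%eax</varname>, which
	  carries the syscall number on &linux;/i386; the function is a
	  userland model and not the kernel routine itself.</para>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_MAXARGS	6	/* one argument per available register */

/* Hypothetical, reduced view of the saved user registers. */
struct ex_trapframe {
	uint32_t	tf_eax;	/* syscall number */
	uint32_t	tf_ebx;
	uint32_t	tf_ecx;
	uint32_t	tf_edx;
	uint32_t	tf_esi;
	uint32_t	tf_edi;
	uint32_t	tf_ebp;
};

/*
 * Copy the arguments from registers into an array laid out the way a
 * stack-based handler expects them; returns the syscall number.
 */
uint32_t
ex_prepsyscall(const struct ex_trapframe *tf, uint32_t args[EX_MAXARGS])
{
	assert(tf != NULL && args != NULL);
	args[0] = tf->tf_ebx;
	args[1] = tf->tf_ecx;
	args[2] = tf->tf_edx;
	args[3] = tf->tf_esi;
	args[4] = tf->tf_edi;
	args[5] = tf->tf_ebp;
	return (tf->tf_eax);
}
```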
|       </sect3>
 | |
| 
 | |
|       <sect3 xml:id="syscall-writing">
 | |
| 	<title>Syscall writing</title>
 | |
| 
 | |
| 	<para>Every syscall implemented in the Linuxulator must have
 | |
| 	  its prototype with various flags in
 | |
| 	  <filename>syscalls.master</filename>.  The form of the file
 | |
| 	  is:</para>
 | |
| 
 | |
| 	<programlisting>...
 | |
| 2	AUE_FORK STD		{ int linux_fork(void); }
 | |
| ...
 | |
| 6	AUE_CLOSE NOPROTO	{ int close(int fd); }
 | |
| ...</programlisting>
 | |
| 
 | |
| 	<para>The first column represents the syscall number.  The
 | |
| 	  second column is for auditing support.  The third column
 | |
| 	  represents the syscall type.  It is one of
 | |
| 	  <literal>STD</literal>, <literal>OBSOL</literal>,
 | |
| 	  <literal>NOPROTO</literal> or <literal>UNIMPL</literal>.
 | |
| 	  <literal>STD</literal> is a standard syscall with full
 | |
| 	  prototype and implementation.  <literal>OBSOL</literal> is
 | |
| 	  obsolete and defines just the prototype.
 | |
| 	  <literal>NOPROTO</literal> means that the syscall is
 | |
| 	  implemented elsewhere so do not prepend ABI prefix, etc.
 | |
| 	  <literal>UNIMPL</literal> means that the syscall will be
 | |
| 	  substituted with the <function>nosys</function> syscall (a
 | |
| 	  syscall just printing out a message about the syscall not
 | |
| 	  being implemented and returning
 | |
| 	  <literal>ENOSYS</literal>).</para>
 | |
| 
 | |
| 	<para>From <filename>syscalls.master</filename> a script
 | |
| 	  generates three files: <filename>linux_syscall.h</filename>,
 | |
| 	  <filename>linux_proto.h</filename> and
 | |
| 	  <filename>linux_sysent.c</filename>.  The
 | |
| 	  <filename>linux_syscall.h</filename> contains definitions of
 | |
| 	  syscall names and their numerical value, e.g.:</para>
 | |
| 
 | |
| 	<programlisting>...
 | |
| #define LINUX_SYS_linux_fork 2
 | |
| ...
 | |
| #define LINUX_SYS_close 6
 | |
| ...</programlisting>
 | |
| 
 | |
| 	<para>The <filename>linux_proto.h</filename> contains
 | |
| 	  structure definitions of arguments to every syscall,
 | |
| 	  e.g.:</para>
 | |
| 
 | |
| 	<programlisting>struct linux_fork_args {
 | |
|   register_t dummy;
 | |
| };</programlisting>
 | |
| 
 | |
| 	<para>And finally, <filename>linux_sysent.c</filename>
 | |
| 	  contains the structure describing the system entry table,
 | |
| 	  used to actually dispatch a syscall, e.g.:</para>
 | |
| 
 | |
| 	<programlisting>{ 0, (sy_call_t *)linux_fork, AUE_FORK, NULL, 0, 0 }, /* 2 = linux_fork */
 | |
| { AS(close_args), (sy_call_t *)close, AUE_CLOSE, NULL, 0, 0 }, /* 6 = close */</programlisting>
 | |
| 
 | |
| 	<para>As you can see, <function>linux_fork</function> is
 | |
| 	  implemented in the Linuxulator itself, so its definition is
 | |
| 	  of <literal>STD</literal> type.  It takes no arguments, which
 | |
| 	  is why its argument structure holds only a dummy member.  On
 | |
| 	  the other hand, <function>close</function> is just an alias
 | |
| 	  for the real &os; &man.close.2;, so it has no &linux;
 | |
| 	  argument structure associated with it, and in the system
 | |
| 	  entry table it is not prefixed with
 | |
| 	  <literal>linux_</literal> because it calls the real
 | |
| 	  &man.close.2; in the kernel.</para>
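 | |
| 
 | |
| 	<para>The dispatch step can be illustrated with a minimal
 | |
| 	  userland sketch.  It is not the kernel code: the
 | |
| 	  <literal>demo_</literal> names and the reduced entry
 | |
| 	  structure are assumptions made for illustration, and only
 | |
| 	  the slot numbers 2 and 6 come from the listings above.</para>
 | |
| 

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's system entry type. */
typedef int demo_call_t(void *args);

struct demo_sysent {
	int		 narg;	/* number of arguments */
	demo_call_t	*call;	/* handler, NULL if the slot is empty */
};

/* Stand-ins for linux_fork() and the native close(). */
static int
demo_linux_fork(void *args)
{
	(void)args;
	return (0);
}

static int
demo_close(void *args)
{
	(void)args;
	return (0);
}

/* Slots are indexed by syscall number: 2 = linux_fork, 6 = close. */
static const struct demo_sysent demo_table[] = {
	[2] = { 0, demo_linux_fork },
	[6] = { 1, demo_close },
};

#define	DEMO_NSYSCALLS	(sizeof(demo_table) / sizeof(demo_table[0]))

/* Look the number up in the table and call the handler, if any. */
static int
demo_dispatch(size_t code, void *args)
{
	if (code >= DEMO_NSYSCALLS || demo_table[code].call == NULL)
		return (ENOSYS);
	return (demo_table[code].call(args));
}
```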
 | |
|       </sect3>
 | |
| 
 | |
|       <sect3 xml:id="dummy-syscalls">
 | |
| 	<title>Dummy syscalls</title>
 | |
| 
 | |
| 	<para>The &linux; emulation layer is not complete, as some
 | |
| 	  syscalls are not implemented properly and some are not
 | |
| 	  implemented at all.  The emulation layer employs a facility
 | |
| 	  to mark unimplemented syscalls with the
 | |
| 	  <literal>DUMMY</literal> macro.  These dummy definitions
 | |
| 	  reside in <filename>linux_dummy.c</filename> in the form of
 | |
| 	  <literal>DUMMY(syscall);</literal>, which is then translated
 | |
| 	  into the various syscall auxiliary files.  The implementation
 | |
| 	  consists of printing a message saying that the syscall is
 | |
| 	  not implemented.  The <literal>UNIMPL</literal> prototype is
 | |
| 	  not used because we want to be able to identify the name of
 | |
| 	  the syscall that was called in order to know which syscalls
 | |
| 	  are more important to implement.</para>
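 | |
| 
 | |
| 	<para>A macro of this kind could look roughly like the
 | |
| 	  following userland sketch.  It assumes a much simpler
 | |
| 	  function signature than the kernel uses, prints to
 | |
| 	  <literal>stderr</literal> instead of the kernel log, and the
 | |
| 	  two syscall names are picked only for illustration.</para>
 | |
| 

```c
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical, simplified DUMMY() macro: it defines a function named
 * linux_<name>() that reports the missing syscall by name and fails
 * with ENOSYS.  The trailing struct declaration only exists so that
 * the macro can be followed by a semicolon, as in DUMMY(syscall);
 */
#define	DUMMY(name)							\
static int								\
linux_ ## name(void)							\
{									\
	fprintf(stderr, "linux: syscall %s not implemented\n", #name);	\
	return (ENOSYS);						\
}									\
struct demo_dummy_hack

/* Syscall names chosen only for illustration. */
DUMMY(vhangup);
DUMMY(sendfile);
```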
 | |
|       </sect3>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 xml:id="signal-handling">
 | |
|       <title>Signal handling</title>
 | |
| 
 | |
|       <para>Signal handling is generally done in the &os; kernel for
 | |
| 	all binary compatibility layers, with a call to a
 | |
| 	compat-dependent layer.  The &linux; compatibility layer
 | |
| 	defines the <function>linux_sendsig</function> routine for
 | |
| 	this purpose.</para>
 | |
| 
 | |
|       <sect3 xml:id="linux-sendsig">
 | |
| 	<title>&linux; sendsig</title>
 | |
| 
 | |
| 	<para>This routine first checks whether the signal has been
 | |
| 	  installed with the <literal>SA_SIGINFO</literal> flag, in
 | |
| 	  which case it calls the
 | |
| 	  <function>linux_rt_sendsig</function> routine instead.
 | |
| 	  Otherwise, it allocates (or reuses an already existing)
 | |
| 	  signal handler context and builds a list of arguments for
 | |
| 	  the signal handler.  It translates the signal number based
 | |
| 	  on the signal translation table, assigns a handler, and
 | |
| 	  translates the sigset.  Then it saves the context for the
 | |
| 	  <function>sigreturn</function> routine (various registers,
 | |
| 	  translated trap number and signal mask).  Finally, it copies
 | |
| 	  out the signal context to the userspace and prepares context
 | |
| 	  for the actual signal handler to run.</para>
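 | |
| 
 | |
| 	<para>The signal number translation step can be sketched as a
 | |
| 	  table lookup.  This is only an illustration: the table
 | |
| 	  covers three signals, the i386 numbers are defined locally,
 | |
| 	  and treating an empty slot as <quote>same number</quote> is
 | |
| 	  a simplifying assumption, not what the kernel table
 | |
| 	  does.</para>
 | |
| 

```c
/* Signal numbers as used on i386, defined locally for illustration. */
#define	DEMO_BSD_SIGBUS		10
#define	DEMO_BSD_SIGCHLD	20
#define	DEMO_BSD_SIGUSR1	30
#define	DEMO_LINUX_SIGBUS	7
#define	DEMO_LINUX_SIGUSR1	10
#define	DEMO_LINUX_SIGCHLD	17
#define	DEMO_NSIG		32

/*
 * Hypothetical, incomplete table indexed by the FreeBSD signal number.
 * A zero slot means "keep the number" in this sketch.
 */
static const int demo_bsd_to_linux[DEMO_NSIG] = {
	[DEMO_BSD_SIGBUS] = DEMO_LINUX_SIGBUS,
	[DEMO_BSD_SIGCHLD] = DEMO_LINUX_SIGCHLD,
	[DEMO_BSD_SIGUSR1] = DEMO_LINUX_SIGUSR1,
};

/* Translate a FreeBSD signal number into the Linux one. */
static int
demo_translate_signal(int sig)
{
	if (sig <= 0 || sig >= DEMO_NSIG)
		return (sig);
	if (demo_bsd_to_linux[sig] != 0)
		return (demo_bsd_to_linux[sig]);
	return (sig);
}
```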
 | |
|       </sect3>
 | |
| 
 | |
|       <sect3 xml:id="linux-rt-sendsig">
 | |
| 	<title>linux_rt_sendsig</title>
 | |
| 
 | |
| 	<para>This routine is similar to
 | |
| 	  <function>linux_sendsig</function>; only the signal context
 | |
| 	  preparation is different.  It adds
 | |
| 	  <literal>siginfo</literal>, <literal>ucontext</literal>, and
 | |
| 	  some &posix; parts.  It might be worth considering whether
 | |
| 	  those two functions could be merged, with the benefit of
 | |
| 	  less code duplication and possibly even faster
 | |
| 	  execution.</para>
 | |
|       </sect3>
 | |
| 
 | |
|       <sect3 xml:id="linux-sigreturn">
 | |
| 	<title>linux_sigreturn</title>
 | |
| 
 | |
| 	<para>This syscall is used to return from the signal handler.
 | |
| 	  It does some security checks and restores the original
 | |
| 	  process context.  It also unmasks the signal in the process
 | |
| 	  signal mask.</para>
 | |
|       </sect3>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 xml:id="ptrace">
 | |
|       <title>Ptrace</title>
 | |
| 
 | |
|       <para>Many &unix; derivatives implement the &man.ptrace.2; syscall
 | |
| 	in order to allow various tracing and debugging features.
 | |
| 	This facility enables the tracing process to obtain various
 | |
| 	information about the traced process, like register dumps, any
 | |
| 	memory from the process address space, etc. and also to
 | |
| 	control its execution, for example by stepping a single
 | |
| 	instruction or by running between system entries (syscalls
 | |
| 	and traps).  &man.ptrace.2; also lets you set various
 | |
| 	information in the traced process (registers etc.).
 | |
| 	&man.ptrace.2; is a &unix;-wide standard implemented in most
 | |
| 	&unix; systems around the world.</para>
 | |
| 
 | |
|       <para>&linux; emulation in &os; implements the &man.ptrace.2;
 | |
| 	facility in <filename>linux_ptrace.c</filename>.  This file
 | |
| 	contains the routines for converting registers between &linux;
 | |
| 	and &os; and the actual &man.ptrace.2; syscall emulation.  The
 | |
| 	syscall is a long switch block that implements its counterpart
 | |
| 	in &os; for every &man.ptrace.2; command.  The &man.ptrace.2;
 | |
| 	commands are mostly equal between &linux; and &os;, so usually
 | |
| 	just a small modification is needed.  For example,
 | |
| 	<literal>PT_GETREGS</literal> in &linux; operates on direct
 | |
| 	data while &os; uses a pointer to the data, so after
 | |
| 	performing a (native) &man.ptrace.2; syscall, a copyout must
 | |
| 	be done to preserve &linux; semantics.</para>
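 | |
| 
 | |
|       <para>The register conversion can be pictured with the
 | |
| 	following sketch.  Both structures are hypothetical and
 | |
| 	reduced to four registers; the real register sets are much
 | |
| 	larger.  The point is only that the same values sit in a
 | |
| 	different order, so they are copied member by member.</para>
 | |
| 

```c
#include <stdint.h>

/* Hypothetical, heavily reduced native register dump. */
struct demo_bsd_reg {
	uint32_t r_eax;
	uint32_t r_ebx;
	uint32_t r_eip;
	uint32_t r_esp;
};

/* Hypothetical, heavily reduced Linux register dump (other order). */
struct demo_linux_reg {
	uint32_t ebx;
	uint32_t eax;
	uint32_t eip;
	uint32_t esp;
};

/* Convert a native register dump into the Linux layout. */
static void
demo_map_regs_to_linux(const struct demo_bsd_reg *b,
    struct demo_linux_reg *l)
{
	l->ebx = b->r_ebx;
	l->eax = b->r_eax;
	l->eip = b->r_eip;
	l->esp = b->r_esp;
}
```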
 | |
| 
 | |
|       <para>The &man.ptrace.2; implementation in the Linuxulator has
 | |
| 	some known weaknesses.  Panics have been seen when using
 | |
| 	<command>strace</command> (which is a &man.ptrace.2; consumer)
 | |
| 	in the Linuxulator environment.  Also,
 | |
| 	<literal>PT_SYSCALL</literal> is not implemented.</para>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 xml:id="traps">
 | |
|       <title>Traps</title>
 | |
| 
 | |
|       <para>Whenever a &linux; process running in the emulation layer
 | |
| 	traps, the trap itself is handled transparently with the only
 | |
| 	exception of the trap translation.  &linux; and &os; differ
 | |
| 	in opinion on what a trap is, so this is dealt with here.  The
 | |
| 	code is actually very short:</para>
 | |
| 
 | |
|       <programlisting>static int
 | |
| translate_traps(int signal, int trap_code)
 | |
| {
 | |
| 
 | |
|   if (signal != SIGBUS)
 | |
|     return signal;
 | |
| 
 | |
|   switch (trap_code) {
 | |
| 
 | |
|     case T_PROTFLT:
 | |
|     case T_TSSFLT:
 | |
|     case T_DOUBLEFLT:
 | |
|     case T_PAGEFLT:
 | |
|       return SIGSEGV;
 | |
| 
 | |
|     default:
 | |
|       return signal;
 | |
|   }
 | |
| }</programlisting>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 xml:id="stack-fixup">
 | |
|       <title>Stack fixup</title>
 | |
| 
 | |
|       <para>The RTLD run-time link-editor expects so-called AUX tags
 | |
| 	on the stack during an <function>execve</function>, so a fixup
 | |
| 	must be done to ensure this.  Of course, every RTLD system is
 | |
| 	different, so each emulation layer must provide its own stack
 | |
| 	fixup routine, and the Linuxulator does so.  The
 | |
| 	<function>elf_linux_fixup</function> routine simply copies out
 | |
| 	the AUX tags to the stack and adjusts the stack of the user
 | |
| 	space process to point right after those tags, so that RTLD
 | |
| 	works as expected.</para>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 xml:id="aout-support">
 | |
|       <title>A.OUT support</title>
 | |
| 
 | |
|       <para>The &linux; emulation layer on i386 also supports &linux;
 | |
| 	A.OUT binaries.  Pretty much everything described in the
 | |
| 	previous sections must be implemented for A.OUT support
 | |
| 	(besides trap translation and signal sending).  The support
 | |
| 	for A.OUT binaries is no longer maintained; in particular, the
 | |
| 	2.6 emulation does not work with it.  This does not cause any
 | |
| 	problem, as the linux-base in ports probably does not support
 | |
| 	A.OUT binaries at all.  This support will probably be removed
 | |
| 	in the future.  Most of the code necessary for loading &linux;
 | |
| 	A.OUT binaries is in the <filename>imgact_linux.c</filename>
 | |
| 	file.</para>
 | |
|     </sect2>
 | |
|   </sect1>
 | |
| 
 | |
|   <sect1 xml:id="mi">
 | |
|     <title>&linux; emulation layer - MI part</title>
 | |
| 
 | |
|     <para>This section talks about the machine-independent part of
 | |
|       the Linuxulator.  It covers the emulation infrastructure needed
 | |
|       for &linux; 2.6 emulation, the thread local storage (TLS)
 | |
|       implementation (on i386) and futexes.  Then we talk briefly
 | |
|       about some syscalls.</para>
 | |
| 
 | |
|     <sect2 xml:id="nptl-desc">
 | |
|       <title>Description of NPTL</title>
 | |
| 
 | |
|       <para>One of the major areas of progress in development of
 | |
| 	&linux; 2.6 was threading.  Prior to 2.6, the &linux;
 | |
| 	threading support was implemented in the
 | |
| 	<application>linuxthreads</application> library.  The library
 | |
| 	was a partial implementation of &posix; threading.  The
 | |
| 	threading was implemented using separate processes for each
 | |
| 	thread using the <function>clone</function> syscall to let
 | |
| 	them share the address space (and other things).  The main
 | |
| 	weaknesses of this approach were that every thread had a
 | |
| 	different PID, signal handling was broken (from the pthreads
 | |
| 	perspective), etc.  Also, the performance was not very good
 | |
| 	(use of <literal>SIGUSR</literal> signals for thread
 | |
| 	synchronization, kernel resource consumption, etc.), so to
 | |
| 	overcome these problems a new threading system was developed
 | |
| 	and named NPTL.</para>
 | |
| 
 | |
|       <para>The NPTL library focused on two things, but a third thing
 | |
| 	came along, so it is usually considered a part of NPTL.  Those
 | |
| 	two things were embedding of threads into a process structure
 | |
| 	and futexes.  The additional third thing was TLS, which is not
 | |
| 	directly required by NPTL but the whole NPTL userland library
 | |
| 	depends on it.  Those improvements yielded much improved
 | |
| 	performance and standards conformance.  NPTL is the standard
 | |
| 	threading library in &linux; systems these days.</para>
 | |
| 
 | |
|       <para>The &os; Linuxulator implementation approaches NPTL in
 | |
| 	three main areas: TLS, futexes and PID mangling, which is
 | |
| 	meant to simulate the &linux; threads.  Further sections
 | |
| 	describe each of these areas.</para>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 xml:id="linux26-emu">
 | |
|       <title>&linux; 2.6 emulation infrastructure</title>
 | |
| 
 | |
|       <para>These sections deal with the way &linux; threads are
 | |
| 	managed and how we simulate that in &os;.</para>
 | |
| 
 | |
|       <sect3 xml:id="linux26-runtime">
 | |
| 	<title>Runtime determining of 2.6 emulation</title>
 | |
| 
 | |
| 	<para>The &linux; emulation layer in &os; supports runtime
 | |
| 	  setting of the emulated version.  This is done via
 | |
| 	  &man.sysctl.8;, namely
 | |
| 	  <literal>compat.linux.osrelease</literal>.  Setting this
 | |
| 	  &man.sysctl.8; affects runtime behavior of the emulation
 | |
| 	  layer.  Setting it to 2.6.x sets the value of
 | |
| 	  <literal>linux_use_linux26</literal>, while setting it to
 | |
| 	  something else keeps it unset.  This variable (plus
 | |
| 	  per-prison variables of the very same kind) determines
 | |
| 	  whether the 2.6 infrastructure (mainly PID mangling) is used
 | |
| 	  in the code or not.  The version setting is done system-wide
 | |
| 	  and this affects all &linux; processes.  The &man.sysctl.8;
 | |
| 	  should not be changed while any &linux; binary is running, as
 | |
| 	  it might harm things.</para>
 | |
|       </sect3>
 | |
| 
 | |
|       <sect3 xml:id="linux-proc-thread">
 | |
| 	<title>&linux; processes and thread identifiers</title>
 | |
| 
 | |
| 	<para>The semantics of &linux; threading are a little
 | |
| 	  confusing and use entirely different nomenclature to &os;.
 | |
| 	  A process in &linux; consists of a <literal>struct
 | |
| 	    task</literal> embedding two identifier fields: PID and
 | |
| 	  TGID.  PID is <emphasis>not</emphasis> a process ID but
 | |
| 	  a thread ID.  The TGID identifies a thread group, in other
 | |
| 	  words a process.  For a single-threaded process the PID
 | |
| 	  equals the TGID.</para>
 | |
| 
 | |
| 	<para>The thread in NPTL is just an ordinary process that
 | |
| 	  happens to have TGID not equal to PID and have a group
 | |
| 	  leader not equal to itself (and shared VM etc. of course).
 | |
| 	  Everything else happens in the same way as for an ordinary
 | |
| 	  process.  There is no separation of shared state into some
 | |
| 	  external structure like in &os;.  This creates some
 | |
| 	  duplication of information and possible data inconsistency.
 | |
| 	  The &linux; kernel seems to use task -> group information
 | |
| 	  in some places and task information elsewhere, which is
 | |
| 	  not very consistent and looks error-prone.</para>
 | |
| 
 | |
| 	<para>Every NPTL thread is created by a call to the
 | |
| 	  <function>clone</function> syscall with a specific set of
 | |
| 	  flags (more in the next subsection).  The NPTL implements
 | |
| 	  strict 1:1 threading.</para>
 | |
| 
 | |
| 	<para>In &os; we emulate NPTL threads with ordinary &os;
 | |
| 	  processes that share VM space, etc. and the PID gymnastic is
 | |
| 	  just mimicked in the emulation specific structure attached
 | |
| 	  to the process.  The structure attached to the process looks
 | |
| 	  like:</para>
 | |
| 
 | |
| 	<programlisting>struct linux_emuldata {
 | |
|   pid_t pid;
 | |
| 
 | |
|   int *child_set_tid; /* in clone(): Child's TID to set on clone */
 | |
|   int *child_clear_tid;/* in clone(): Child's TID to clear on exit */
 | |
| 
 | |
|   struct linux_emuldata_shared *shared;
 | |
| 
 | |
|   int pdeath_signal; /* parent death signal */
 | |
| 
 | |
|   LIST_ENTRY(linux_emuldata) threads; /* list of linux threads */
 | |
| };</programlisting>
 | |
| 
 | |
| 	<para>The PID is used to identify the &os; process that
 | |
| 	  attaches this structure.  The
 | |
| 	  <varname>child_set_tid</varname> and
 | |
| 	  <varname>child_clear_tid</varname> are used for TID
 | |
| 	  address copyout when a process is created and when it
 | |
| 	  exits.  The
 | |
| 	  <varname>shared</varname> pointer points to a structure
 | |
| 	  shared among threads.  The <varname>pdeath_signal</varname>
 | |
| 	  variable identifies the parent death signal and the
 | |
| 	  <varname>threads</varname> pointer is used to link this
 | |
| 	  structure to the list of threads.  The
 | |
| 	  <literal>linux_emuldata_shared</literal> structure looks
 | |
| 	  like:</para>
 | |
| 
 | |
| 	<programlisting>struct linux_emuldata_shared {
 | |
| 
 | |
|   int refs;
 | |
| 
 | |
|   pid_t group_pid;
 | |
| 
 | |
|   LIST_HEAD(, linux_emuldata) threads; /* head of list of linux threads */
 | |
| };</programlisting>
 | |
| 
 | |
| 	<para>The <varname>refs</varname> member is a reference
 | |
| 	  counter used to determine when we can free the structure to
 | |
| 	  avoid memory leaks.  The <varname>group_pid</varname> member
 | |
| 	  identifies the PID (= TGID) of the whole process (= thread
 | |
| 	  group).  The <varname>threads</varname> member is the head
 | |
| 	  of the list of threads in the process.</para>
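 | |
| 
 | |
| 	<para>The role of the reference counter can be sketched as
 | |
| 	  follows.  The structure is a reduced, hypothetical copy
 | |
| 	  without the thread list, the helper names are invented for
 | |
| 	  this example, and no locking is shown.</para>
 | |
| 

```c
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical, reduced copy of the shared structure. */
struct demo_emuldata_shared {
	int	refs;		/* number of threads using the structure */
	pid_t	group_pid;	/* PID (= TGID) of the whole process */
};

/* Allocate the structure for a new thread group with one user. */
static struct demo_emuldata_shared *
demo_shared_new(pid_t group_pid)
{
	struct demo_emuldata_shared *s;

	s = malloc(sizeof(*s));
	if (s == NULL)
		return (NULL);
	s->refs = 1;
	s->group_pid = group_pid;
	return (s);
}

/* A new thread joins the group. */
static void
demo_shared_hold(struct demo_emuldata_shared *s)
{
	s->refs++;
}

/* A thread leaves; returns 1 when the last user freed the structure. */
static int
demo_shared_rele(struct demo_emuldata_shared *s)
{
	if (--s->refs == 0) {
		free(s);
		return (1);
	}
	return (0);
}
```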
 | |
| 
 | |
| 	<para>The <literal>linux_emuldata</literal> structure can be
 | |
| 	  obtained from the process using
 | |
| 	  <function>em_find</function>.  The prototype of the function
 | |
| 	  is:</para>
 | |
| 
 | |
| 	<programlisting>struct linux_emuldata *em_find(struct proc *, int locked);</programlisting>
 | |
| 
 | |
| 	<para>Here, <varname>proc</varname> is the process we want the
 | |
| 	  emuldata structure from and the locked parameter determines
 | |
| 	  whether we want to lock or not.  The accepted values are
 | |
| 	  <literal>EMUL_DOLOCK</literal> and
 | |
| 	  <literal>EMUL_DOUNLOCK</literal>.  More about locking
 | |
| 	  later.</para>
 | |
|       </sect3>
 | |
| 
 | |
|       <sect3 xml:id="pid-mangling">
 | |
| 	<title>PID mangling</title>
 | |
| 
 | |
| 	<para>Because &os; and &linux; differ, as described above, in
 | |
| 	  what they consider a process ID and a thread ID, we have
 | |
| 	  to translate between the two views somehow.  We do it by PID
 | |
| 	  mangling.  This means that we fake what a PID (=TGID) and TID
 | |
| 	  (=PID) is between kernel and userland.  The rule of thumb is
 | |
| 	  that in the kernel (in the Linuxulator) <literal>PID =
 | |
| 	    PID</literal> and <literal>TGID = shared ->
 | |
| 	    group_pid</literal>, and to userland we present
 | |
| 	  <literal>PID = shared -> group_pid</literal> and
 | |
| 	  <literal>TID = proc -> p_pid</literal>.  The
 | |
| 	  <varname>pid</varname> member of the
 | |
| 	  <literal>linux_emuldata</literal> structure is a &os;
 | |
| 	  PID.</para>
 | |
| 
 | |
| 	<para>The above affects mainly the getpid, getppid and gettid
 | |
| 	  syscalls, where we use PID and TGID respectively.  In the
 | |
| 	  copyout of TIDs in <varname>child_clear_tid</varname> and
 | |
| 	  <varname>child_set_tid</varname> we copy out the &os;
 | |
| 	  PID.</para>
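 | |
| 
 | |
| 	<para>The rule of thumb can be written down as two small
 | |
| 	  helpers.  This is a sketch under stated assumptions: the
 | |
| 	  structures are reduced copies of the ones shown earlier and
 | |
| 	  the helper names are hypothetical, not the names of the
 | |
| 	  real syscall handlers.</para>
 | |
| 

```c
#include <sys/types.h>

/* Hypothetical, reduced copies of the emulation structures. */
struct demo_emul_shared {
	pid_t	group_pid;	/* PID of the thread group leader */
};

struct demo_emuldata {
	pid_t			 pid;		/* native PID of this thread */
	struct demo_emul_shared	*shared;	/* shared by the whole group */
};

/* What a Linux getpid() would report: the thread group ID. */
static pid_t
demo_linux_getpid(const struct demo_emuldata *em)
{
	return (em->shared->group_pid);
}

/* What a Linux gettid() would report: the native PID. */
static pid_t
demo_linux_gettid(const struct demo_emuldata *em)
{
	return (em->pid);
}
```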
 | |
|       </sect3>
 | |
| 
 | |
|       <sect3 xml:id="clone-syscall">
 | |
| 	<title>Clone syscall</title>
 | |
| 
 | |
| 	<para>The <function>clone</function> syscall is the way
 | |
| 	  threads are created in &linux;.  The syscall prototype looks
 | |
| 	  like this:</para>
 | |
| 
 | |
| 	<programlisting>int linux_clone(l_int flags, void *stack, void *parent_tidptr, int dummy,
 | |
| void * child_tidptr);</programlisting>
 | |
| 
 | |
| 	<para>The <varname>flags</varname> parameter tells the syscall
 | |
| 	  how exactly the processes should be cloned.  As described
 | |
| 	  above, &linux; can create processes sharing various things
 | |
| 	  independently, for example two processes can share file
 | |
| 	  descriptors but not VM, etc.  The lowest byte of the
 | |
| 	  <varname>flags</varname> parameter is the exit signal of the
 | |
| 	  newly created process.  The <varname>stack</varname>
 | |
| 	  parameter, if non-<literal>NULL</literal>, tells where the
 | |
| 	  thread stack is; if it is <literal>NULL</literal> we are
 | |
| 	  supposed to copy-on-write the calling process stack (i.e. do
 | |
| 	  what the normal &man.fork.2; routine does).  The
 | |
| 	  <varname>parent_tidptr</varname> parameter is used as an
 | |
| 	  address for copying out the process PID (i.e. thread ID) once
 | |
| 	  the process is sufficiently instantiated but is not runnable
 | |
| 	  yet.  The <varname>dummy</varname> parameter is here because
 | |
| 	  of the very strange calling convention of this syscall on
 | |
| 	  i386.  It uses the registers directly and does not let the
 | |
| 	  compiler do it, which results in the need for a dummy
 | |
| 	  parameter.  The <varname>child_tidptr</varname> parameter is
 | |
| 	  used as an address for copying out the PID once the process
 | |
| 	  has finished forking and when the process exits.</para>
 | |
| 
 | |
| 	<para>The syscall itself proceeds by setting the corresponding
 | |
| 	  &os; flags depending on the flags passed in.  For example,
 | |
| 	  <literal>CLONE_VM</literal> maps to
 | |
| 	  <literal>RFMEM</literal> (sharing of VM), etc.  The only nit
 | |
| 	  here is <literal>CLONE_FS</literal> and
 | |
| 	  <literal>CLONE_FILES</literal> because &os; does not allow
 | |
| 	  setting these separately, so we fake it by not setting
 | |
| 	  <literal>RFFDG</literal> (copying of fd table and other fs
 | |
| 	  information) if either of these is defined.  This does not
 | |
| 	  cause any problems, because those flags are always set
 | |
| 	  together.  After setting the flags the process is forked
 | |
| 	  using the internal <function>fork1</function> routine; the
 | |
| 	  process is instrumented not to be put on a run queue, i.e.
 | |
| 	  not to be set runnable.  After the forking is done we
 | |
| 	  possibly reparent the newly created process to emulate
 | |
| 	  <literal>CLONE_PARENT</literal> semantics.  The next part is
 | |
| 	  creating the emulation data.  Threads in &linux; do not
 | |
| 	  signal their parents, so we set the exit signal to 0 to
 | |
| 	  disable this.  After that,
 | |
| 	  <varname>child_set_tid</varname> and
 | |
| 	  <varname>child_clear_tid</varname> are set, enabling the
 | |
| 	  functionality later in the code.  At this point we copy out
 | |
| 	  the PID to the address specified by
 | |
| 	  <varname>parent_tidptr</varname>.  The process stack is set
 | |
| 	  by simply rewriting the thread frame
 | |
| 	  <varname>%esp</varname> register (<varname>%rsp</varname> on
 | |
| 	  amd64).  The next part is setting up TLS for the newly
 | |
| 	  created process.  After this, &man.vfork.2; semantics might
 | |
| 	  be emulated, and finally the newly created process is put on
 | |
| 	  a run queue and its PID is returned to the parent process
 | |
| 	  via the <function>clone</function> return value.</para>
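 | |
| 
 | |
| 	<para>The flag translation at the start of the syscall can be
 | |
| 	  sketched as below.  The constants are defined locally with
 | |
| 	  a <literal>DEMO_</literal> prefix and are meant to mirror
 | |
| 	  the &linux; and &os; header values; the function name is
 | |
| 	  hypothetical, and only the three flags discussed above are
 | |
| 	  handled.</para>
 | |
| 

```c
/* Linux clone() flag values, defined locally for illustration. */
#define	DEMO_CSIGNAL		0x000000ff	/* exit signal mask */
#define	DEMO_CLONE_VM		0x00000100
#define	DEMO_CLONE_FS		0x00000200
#define	DEMO_CLONE_FILES	0x00000400

/* FreeBSD rfork() flag values, defined locally for illustration. */
#define	DEMO_RFFDG		(1 << 2)	/* copy the fd table */
#define	DEMO_RFPROC		(1 << 4)	/* create a new process */
#define	DEMO_RFMEM		(1 << 5)	/* share the address space */

/*
 * Hypothetical helper: translate Linux clone flags into fork flags
 * and extract the exit signal from the lowest byte.
 */
static int
demo_clone_to_rfork(int flags, int *exit_signal)
{
	int ff = DEMO_RFPROC;

	*exit_signal = flags & DEMO_CSIGNAL;

	if (flags & DEMO_CLONE_VM)
		ff |= DEMO_RFMEM;
	/* The fd table is copied only if neither sharing flag is set. */
	if (!(flags & (DEMO_CLONE_FILES | DEMO_CLONE_FS)))
		ff |= DEMO_RFFDG;

	return (ff);
}
```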
 | |
| 
 | |
| 	<para>The <function>clone</function> syscall can be, and in
 | |
| 	  fact is, used for emulating the classic &man.fork.2; and
 | |
| 	  &man.vfork.2; syscalls.  Newer glibc, when running on a 2.6
 | |
| 	  kernel, uses <function>clone</function> to implement the
 | |
| 	  &man.fork.2; and &man.vfork.2; syscalls.</para>
 | |
|       </sect3>
 | |
| 
 | |
|       <sect3 xml:id="locking">
 | |
| 	<title>Locking</title>
 | |
| 
 | |
| 	<para>The locking is implemented to be per-subsystem because
 | |
| 	  we do not expect a lot of contention on these.  There are
 | |
| 	  two locks: <literal>emul_lock</literal>, used to protect
 | |
| 	  manipulation of <literal>linux_emuldata</literal>, and
 | |
| 	  <literal>emul_shared_lock</literal>, used to protect
 | |
| 	  manipulation of <literal>linux_emuldata_shared</literal>.
 | |
| 	  The <literal>emul_lock</literal> is a nonsleepable blocking
 | |
| 	  mutex while <literal>emul_shared_lock</literal> is a
 | |
| 	  sleepable blocking <literal>sx_lock</literal>.  Because of
 | |
| 	  the per-subsystem locking we can coalesce some locks, and
 | |
| 	  that is why <function>em_find</function> offers
 | |
| 	  non-locking access.</para>
 | |
|       </sect3>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 xml:id="tls">
 | |
|       <title>TLS</title>
 | |
| 
 | |
|       <para>This section deals with TLS, also known as thread local
 | |
| 	storage.</para>
 | |
| 
 | |
|       <sect3 xml:id="trheading-intro">
 | |
| 	<title>Introduction to threading</title>
 | |
| 
 | |
| 	<para>Threads in computer science are entities within a
 | |
| 	  process that can be scheduled independently from each other.
 | |
| 	  The threads in the process share process wide data (file
 | |
| 	  descriptors, etc.) but also have their own stack for their
 | |
| 	  own data.  Sometimes there is a need for process-wide data
 | |
| 	  specific to a given thread.  Imagine a name of the thread in
 | |
| 	  execution or something like that.  The traditional &unix;
 | |
| 	  threading API, <application>pthreads</application>, provides
 | |
| 	  a way to do it via &man.pthread.key.create.3;,
 | |
| 	  &man.pthread.setspecific.3; and &man.pthread.getspecific.3;,
 | |
| 	  where a thread can create a key to the thread local data and
 | |
| 	  use &man.pthread.getspecific.3; or
 | |
| 	  &man.pthread.setspecific.3; to manipulate those data.  You
 | |
| 	  can easily see that this is not the most comfortable way
 | |
| 	  this could be accomplished.  So various producers of C/C++
 | |
| 	  compilers introduced a better way.  They defined a new
 | |
| 	  modifier keyword, <literal>__thread</literal>, that
 | |
| 	  specifies that a variable is thread specific.  A new method
 | |
| 	  of accessing such variables was developed as well (at least
 | |
| 	  on i386).  The
 | |
| 	  <application>pthreads</application> method tends to be
 | |
| 	  implemented in userspace as a trivial lookup table.  The
 | |
| 	  performance of such a solution is not very good.  So the new
 | |
| 	  method uses (on i386) segment registers to address a
 | |
| 	  segment where the TLS area is stored, so accessing a thread
 | |
| 	  variable just means prefixing the address with the segment
 | |
| 	  register and thus addressing via it.  The segment registers
 | |
| 	  used are usually <varname>%gs</varname> and
 | |
| 	  <varname>%fs</varname>, acting as segment selectors.  Every
 | |
| 	  thread has its own area where the thread local data are
 | |
| 	  stored and the segment must be loaded on every context
 | |
| 	  switch.  This method is very fast and used almost
 | |
| 	  exclusively in the whole i386 &unix; world.  Both &os; and
 | |
| 	  &linux; implement this approach and it yields very good
 | |
| 	  results.  The only drawback is the need to reload the
 | |
| 	  segment on every context switch, which can slow down context
 | |
| 	  switches.  &os; tries to avoid this overhead by using only 1
 | |
| 	  segment descriptor for this while &linux; uses 3.
 | |
| 	  Interestingly, almost nothing uses more than 1
 | |
| 	  descriptor (only <application>Wine</application> seems to
 | |
| 	  use 2), so &linux; pays this unnecessary price for context
 | |
| 	  switches.</para>
 | |
|       </sect3>
 | |
| 
 | |
|       <sect3 xml:id="i386-segs">
 | |
| 	<title>Segments on i386</title>
 | |
| 
 | |
| 	<para>The i386 architecture implements so-called segments.
 | |
| 	  A segment is a description of an area of memory: the base
 | |
| 	  address (bottom) of the memory area, the end of it
 | |
| 	  (ceiling), type, protection, etc.  The memory described by a
 | |
| 	  segment can be accessed using segment selector registers
 | |
| 	  (<varname>%cs</varname>, <varname>%ds</varname>,
 | |
| 	  <varname>%ss</varname>, <varname>%es</varname>,
 | |
| 	  <varname>%fs</varname>, <varname>%gs</varname>).  For
 | |
| 	  example, let us suppose that <varname>%gs</varname> selects
 | |
| 	  a segment whose base address is 0x1234 and we run this
 | |
| 	  code:</para>
 | |
| 
 | |
| 	<programlisting>mov %edx,%gs:0x10</programlisting>
 | |
| 
 | |
| 	<para>This will store the content of the
 | |
| 	  <varname>%edx</varname> register into memory location
 | |
| 	  0x1244.  Some segment registers have a special use, for
 | |
| 	  example <varname>%cs</varname> is used for the code segment
 | |
| 	  and <varname>%ss</varname> is used for the stack segment, but
 | |
| 	  <varname>%fs</varname> and <varname>%gs</varname> are
 | |
| 	  generally unused.  Segments are stored either in the global
 | |
| 	  GDT table or in a local LDT table.  The LDT is accessed via
 | |
| 	  an entry in the GDT.  The LDT can store more types of
 | |
| 	  segments and can be per process.  Both tables define up to
 | |
| 	  8191 entries.</para>
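 | |
| 
 | |
| 	<para>The address arithmetic from the example can be checked
 | |
| 	  with a small sketch.  The structure is a hypothetical,
 | |
| 	  reduced segment description, and the segment size used in
 | |
| 	  a call is an assumption, since the text gives only the base
 | |
| 	  address.</para>
 | |
| 

```c
#include <stdint.h>

/* Hypothetical, reduced segment description: base and size only. */
struct demo_segment {
	uint32_t base;	/* bottom of the memory area */
	uint32_t size;	/* assumed length of the area in bytes */
};

/*
 * Compute the linear address for an access at the given offset.
 * Returns 0 on success, -1 if the offset lies outside the segment.
 */
static int
demo_linear_address(const struct demo_segment *seg, uint32_t offset,
    uint32_t *linear)
{
	if (offset >= seg->size)
		return (-1);
	*linear = seg->base + offset;
	return (0);
}
```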
 | |
|       </sect3>
 | |
| 
 | |
|       <sect3 xml:id="linux-i386">
 | |
| 	<title>Implementation on &linux; i386</title>
 | |
| 
 | |
| 	<para>There are two main ways of setting up TLS in &linux;.
 | |
| 	  It can be set when cloning a process using the
 | |
| 	  <function>clone</function> syscall, or the process can call
 | |
| 	  <function>set_thread_area</function>.  When a process passes
 | |
| 	  the <literal>CLONE_SETTLS</literal> flag to
 | |
| 	  <function>clone</function>, the kernel expects the memory
 | |
| 	  pointed to by the <varname>%esi</varname> register to hold a
 | |
| 	  &linux; user space representation of a segment, which gets
 | |
| 	  translated to the machine representation of a segment and
 | |
| 	  loaded into a GDT slot.  The GDT slot can be specified with
 | |
| 	  a number, or -1 can be used, meaning that the system itself
 | |
| 	  should choose the first free slot.  In practice, the vast
 | |
| 	  majority of programs use only one TLS entry and do not
 | |
| 	  care about the number of the entry.  We exploit this in the
 | |
| 	  emulation and in fact depend on it.</para>
 | |
|       </sect3>
 | |
| 
 | |
|       <sect3 xml:id="tls-emu">
 | |
| 	<title>Emulation of &linux; TLS</title>
 | |
| 
 | |
| 	<sect4 xml:id="tls-i386">
 | |
| 	  <title>i386</title>
 | |
| 
 | |
| 	  <para>Loading of TLS for the current thread happens by
 | |
| 	    calling <function>set_thread_area</function>, while loading
 | |
| 	    TLS for a second process in <function>clone</function> is
 | |
| 	    done in a separate block in <function>clone</function>.
 | |
| 	    Those two code paths are very similar.  The only difference
 | |
| 	    is the actual loading of the GDT segment, which happens
 | |
| 	    on the next context switch for the newly created process,
 | |
| 	    while <function>set_thread_area</function> must load it
 | |
| 	    directly.  The code basically does the following.  It
 | |
| 	    copies the &linux; form segment descriptor from userland.
 | |
| 	    The code checks the number of the descriptor, but because
 | |
| 	    this differs between &os; and &linux; we fake it a little.
 | |
| 	    We only support indexes of 6, 3 and -1.  The 6 is the
 | |
| 	    genuine &linux; number, 3 is the genuine &os; one and -1
 | |
| 	    means autoselection.  Then we set the descriptor number to
 | |
| 	    constant 3 and copy this out to userspace.  We rely on
 | |
| 	    the userspace process using the number from the descriptor,
 | |
| 	    but this works most of the time (we have never seen a case
 | |
| 	    where this did not work) as the userspace process
 | |
| 	    typically passes in -1.  Then we convert the descriptor
 | |
| 	    from the &linux; form to a machine dependent form (i.e.
 | |
| 	    operating system independent form) and copy this to the
 | |
| 	    &os; defined segment descriptor.  Finally we can load it.
 | |
| 	    We assign the descriptor to the thread's PCB (process
 | |
| 	    control block) and load the <varname>%gs</varname> segment
 | |
| 	    using <function>load_gs</function>.  This loading must be
 | |
| 	    done in a critical section so that nothing can interrupt
 | |
| 	    us.  The <literal>CLONE_SETTLS</literal> case works exactly
 | |
| 	    like this, except that the loading using
 | |
| 	    <function>load_gs</function> is not performed.  The
 | |
| 	    segment used for this (segment number 3) is shared for
 | |
| 	    this use between &os; processes and &linux; processes so
 | |
| 	    the &linux; emulation layer does not add any overhead over
 | |
| 	    plain &os;.</para>
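 | |
| 
 | |
| 	  <para>The handling of the descriptor number can be sketched
 | |
| 	    as follows.  The function and constant names are
 | |
| 	    hypothetical, and returning
 | |
| 	    <literal>EINVAL</literal> for other numbers is an
 | |
| 	    assumption of this sketch; only the accepted values 6, 3
 | |
| 	    and -1 and the rewrite to 3 come from the text
 | |
| 	    above.</para>
 | |
| 

```c
#include <errno.h>

#define	DEMO_LINUX_TLS_ENTRY	6	/* genuine Linux number */
#define	DEMO_BSD_TLS_ENTRY	3	/* genuine FreeBSD number */
#define	DEMO_TLS_AUTOSELECT	(-1)	/* let the system choose */

/*
 * Hypothetical helper: validate the requested entry number and
 * rewrite it to the single slot used by the emulation, so that the
 * value can be copied back out to userspace.
 */
static int
demo_fixup_tls_entry(int *entry_number)
{
	switch (*entry_number) {
	case DEMO_LINUX_TLS_ENTRY:
	case DEMO_BSD_TLS_ENTRY:
	case DEMO_TLS_AUTOSELECT:
		*entry_number = DEMO_BSD_TLS_ENTRY;
		return (0);
	default:
		return (EINVAL);
	}
}
```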
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 xml:id="tls-amd64">
 | |
| 	  <title>amd64</title>
 | |
| 
 | |
| 	  <para>The amd64 implementation is similar to the i386 one,
 | |
| 	    but there was initially no 32-bit segment descriptor used
 | |
| 	    for this purpose (hence not even native 32-bit TLS users
 | |
| 	    worked), so we had to add such a segment and implement its
 | |
| 	    loading on every context switch (when a flag signaling use
 | |
| 	    of 32-bit is set).  Apart from this the TLS loading is
 | |
| 	    exactly the same; only the segment numbers are different
 | |
| 	    and the descriptor format and the loading differ
 | |
| 	    slightly.</para>
 | |
| 	</sect4>
 | |
|       </sect3>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 xml:id="futexes">
 | |
|       <title>Futexes</title>
 | |
| 
 | |
|       <sect3 xml:id="sync-intro">
 | |
| 	<title>Introduction to synchronization</title>
 | |
| 
 | |
| 	<para>Threads need some kind of synchronization and &posix;
 | |
| 	  provides several mechanisms: mutexes for mutual exclusion,
 | |
| 	  read-write locks for mutual exclusion with a biased ratio of
 | |
| 	  reads and writes, and condition variables for signaling a
 | |
| 	  status change.  It is interesting to note that the &posix;
 | |
| 	  threading API lacks support for semaphores.  The
 | |
| 	  implementations of those synchronization routines are
 | |
| 	  heavily dependent on the type of threading support we have.
 | |
| 	  In a pure 1:M (userspace) model the implementation can be
 | |
| 	  done solely in userspace and thus be very fast (the
 | |
| 	  condition variables will probably end up being implemented
 | |
| 	  using signals, i.e. not fast) and simple.  In the 1:1 model,
 | |
| 	  the situation is also quite clear: the threads must be
 | |
| 	  synchronized using kernel facilities (which is very slow
 | |
| 	  because a syscall must be performed).  The mixed M:N
 | |
| 	  scenario either combines the first and second approach or
 | |
| 	  relies solely on the kernel.  Thread synchronization is a
 | |
| 	  vital part of thread-enabled programming and its
 | |
| 	  performance can affect the resulting program a lot.  Recent
 | |
| 	  benchmarks on the &os; operating system showed that an
 | |
| 	  improved sx_lock implementation yielded a 40% speedup in
 | |
| 	  <firstterm>ZFS</firstterm> (a heavy sx user).  This is
 | |
| 	  in-kernel code, but it shows clearly how important the
 | |
| 	  performance of synchronization primitives is.</para>
 | |
| 
 | |
| 	<para>Threaded programs should be written with as little
 | |
| 	  contention on locks as possible.  Otherwise, instead of
 | |
| 	  doing useful work the thread just waits on a lock.  Because
 | |
| 	  of this, most well-written threaded programs show little
 | |
| 	  lock contention.</para>
 | |
|       </sect3>
 | |
| 
 | |
|       <sect3 xml:id="futex-intro">
 | |
| 	<title>Futexes introduction</title>
 | |
| 
 | |
| 	<para>&linux; implements 1:1 threading, i.e. it has to use
 | |
| 	  in-kernel synchronization primitives.  As stated earlier,
 | |
| 	  well-written threaded programs have little lock contention.
 | |
| 	  So a typical lock and unlock sequence could be performed as
 | |
| 	  two atomic operations that increase and decrease a mutex
 | |
| 	  reference counter, which is very fast.  Consider the
 | |
| 	  following example:</para>
 | |
| 
 | |
| 	<programlisting>pthread_mutex_lock(&amp;mutex);
 | |
| ....
 | |
| pthread_mutex_unlock(&amp;mutex);</programlisting>
 | |
| 
 | |
| 	<para>1:1 threading forces us to perform two syscalls for
 | |
| 	  those mutex calls, which is very slow.</para>
 | |
| 
 | |
| 	<para>The solution &linux; 2.6 implements is called
 | |
| 	  futexes.  Futexes implement the check for contention in
 | |
| 	  userspace and call kernel primitives only in case of
 | |
| 	  contention.  Thus the typical case takes place without any
 | |
| 	  kernel intervention.  This yields a reasonably fast and
 | |
| 	  flexible implementation of synchronization primitives.</para>
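
	<para>The idea can be illustrated with a minimal sketch,
	  assuming C11 atomics.  All names in it
	  (<literal>sketch_mutex</literal>,
	  <function>sketch_lock</function>,
	  <function>sketch_unlock</function> and the two
	  <function>sketch_kernel_*</function> helpers) are
	  hypothetical and come from neither &linux; nor &os;.  The
	  helpers only count how often a real implementation would
	  have to enter the kernel to sleep or to wake a
	  waiter.</para>

```c
#include <stdatomic.h>

/* Lock word: 0 = unlocked, 1 = locked, 2 = locked with possible waiters. */
typedef struct {
    atomic_int word;
    atomic_int kernel_calls;    /* how often the simulated kernel was entered */
} sketch_mutex;

/* Hypothetical stand-ins for the futex wait and wake operations. */
static void sketch_kernel_wait(sketch_mutex *m)
{
    atomic_fetch_add(&m->kernel_calls, 1);
}

static void sketch_kernel_wake(sketch_mutex *m)
{
    atomic_fetch_add(&m->kernel_calls, 1);
}

static void sketch_lock(sketch_mutex *m)
{
    int expected = 0;

    /* Fast path: a single atomic operation, no kernel involvement. */
    if (atomic_compare_exchange_strong(&m->word, &expected, 1))
        return;

    /*
     * Slow path: mark the lock as contended and wait until it is free.
     * With the counting stub above this loop simply spins; a real
     * implementation would sleep in the kernel here.
     */
    while (atomic_exchange(&m->word, 2) != 0)
        sketch_kernel_wait(m);
}

static void sketch_unlock(sketch_mutex *m)
{
    /* Only a contended lock requires waking somebody up. */
    if (atomic_exchange(&m->word, 0) == 2)
        sketch_kernel_wake(m);
}
```

	<para>In the uncontended case a
	  <function>sketch_lock</function> and
	  <function>sketch_unlock</function> pair leaves the
	  <varname>kernel_calls</varname> counter untouched, which is
	  the property futexes are designed around.</para>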
 | |
|       </sect3>
 | |
| 
 | |
|       <sect3 xml:id="futex-api">
 | |
| 	<title>Futex API</title>
 | |
| 
 | |
| 	<para>The futex syscall looks like this:</para>
 | |
| 
 | |
| 	<programlisting>int futex(void *uaddr, int op, int val, struct timespec *timeout, void *uaddr2, int val3);</programlisting>
 | |
| 
 | |
| 	<para>In this prototype <varname>uaddr</varname> is the address
 | |
| 	  of the mutex in userspace, <varname>op</varname> is the
 | |
| 	  operation we are about to perform, and the other parameters
 | |
| 	  have a per-operation meaning.</para>
 | |
| 
 | |
| 	<para>Futexes implement the following operations:</para>
 | |
| 
 | |
| 	<itemizedlist>
 | |
| 	  <listitem>
 | |
| 	    <para><literal>FUTEX_WAIT</literal></para>
 | |
| 	  </listitem>
 | |
| 	  <listitem>
 | |
| 	    <para><literal>FUTEX_WAKE</literal></para>
 | |
| 	  </listitem>
 | |
| 	  <listitem>
 | |
| 	    <para><literal>FUTEX_FD</literal></para>
 | |
| 	  </listitem>
 | |
| 	  <listitem>
 | |
| 	    <para><literal>FUTEX_REQUEUE</literal></para>
 | |
| 	  </listitem>
 | |
| 	  <listitem>
 | |
| 	    <para><literal>FUTEX_CMP_REQUEUE</literal></para>
 | |
| 	  </listitem>
 | |
| 	  <listitem>
 | |
| 	    <para><literal>FUTEX_WAKE_OP</literal></para>
 | |
| 	  </listitem>
 | |
| 	</itemizedlist>
 | |
| 
 | |
| 	<sect4 xml:id="futex-wait">
 | |
| 	  <title>FUTEX_WAIT</title>
 | |
| 
 | |
| 	  <para>This operation verifies that the value
 | |
| 	    <varname>val</varname> is stored at address
 | |
| 	    <varname>uaddr</varname>.  If not,
 | |
| 	    <literal>EWOULDBLOCK</literal> is returned; otherwise the
 | |
| 	    thread is queued on the futex and gets suspended.  If the
 | |
| 	    <varname>timeout</varname> argument is non-NULL, it
 | |
| 	    specifies the maximum time to sleep; otherwise the sleep
 | |
| 	    is unbounded.</para>
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 xml:id="futex-wake">
 | |
| 	  <title>FUTEX_WAKE</title>
 | |
| 
 | |
| 	  <para>This operation takes the futex at
 | |
| 	    <varname>uaddr</varname> and wakes up the first
 | |
| 	    <varname>val</varname> threads queued on this
 | |
| 	    futex.</para>
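
	  <para>The semantics of <literal>FUTEX_WAIT</literal> and
	    <literal>FUTEX_WAKE</literal> can be modelled in a few
	    lines.  The following is a single-threaded sketch under
	    stated assumptions: a counter stands in for the queue of
	    sleeping threads, nothing is actually suspended, and the
	    names <literal>model_futex</literal>,
	    <function>model_futex_wait</function> and
	    <function>model_futex_wake</function> are hypothetical.
	    Returning the number of woken waiters is a choice made
	    for this model.</para>

```c
#include <errno.h>

/* Single-threaded model of one futex. */
struct model_futex {
    int *uaddr;     /* userspace word the futex is tied to */
    int waiters;    /* number of threads queued on the futex */
};

/* FUTEX_WAIT: go to sleep only if *uaddr still holds the expected value. */
static int model_futex_wait(struct model_futex *f, int val)
{
    if (*f->uaddr != val)
        return EWOULDBLOCK;     /* the word changed, do not sleep */
    f->waiters++;               /* a real kernel would suspend the caller */
    return 0;
}

/* FUTEX_WAKE: wake at most val waiters and report how many were woken. */
static int model_futex_wake(struct model_futex *f, int val)
{
    int woken = (val < f->waiters) ? val : f->waiters;

    if (woken < 0)
        woken = 0;
    f->waiters -= woken;
    return woken;
}
```

	  <para>The check of the word against <varname>val</varname>
	    is what lets userspace decide about contention first: a
	    thread only sleeps if the word still has the value it saw
	    when it decided to wait.</para>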
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 xml:id="futex-fd">
 | |
| 	  <title>FUTEX_FD</title>
 | |
| 
 | |
| 	  <para>This operation associates a file descriptor with a
 | |
| 	    given futex.</para>
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 xml:id="futex-requeue">
 | |
| 	  <title>FUTEX_REQUEUE</title>
 | |
| 
 | |
| 	  <para>This operation wakes up <varname>val</varname> threads
 | |
| 	    queued on the futex at <varname>uaddr</varname>, then
 | |
| 	    takes the next <varname>val2</varname> threads and
 | |
| 	    requeues them on the futex at
 | |
| 	    <varname>uaddr2</varname>.</para>
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 xml:id="futex-cmp-requeue">
 | |
| 	  <title>FUTEX_CMP_REQUEUE</title>
 | |
| 
 | |
| 	  <para>This operation does the same as
 | |
| 	    <literal>FUTEX_REQUEUE</literal> but first checks that
 | |
| 	    the value stored at <varname>uaddr</varname> equals
 | |
| 	    <varname>val3</varname>.</para>
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 xml:id="futex-wake-op">
 | |
| 	  <title>FUTEX_WAKE_OP</title>
 | |
| 
 | |
| 	  <para>This operation performs an atomic operation, encoded
 | |
| 	    in <varname>val3</varname>, on the value at
 | |
| 	    <varname>uaddr2</varname>.  Then it wakes up
 | |
| 	    <varname>val</varname> threads on the futex at
 | |
| 	    <varname>uaddr</varname>, and if the atomic operation
 | |
| 	    returned a positive number it also wakes up
 | |
| 	    <varname>val2</varname> threads on the futex at
 | |
| 	    <varname>uaddr2</varname>.</para>
 | |
| 
 | |
| 	  <para>The operations implemented in
 | |
| 	    <literal>FUTEX_WAKE_OP</literal>:</para>
 | |
| 
 | |
| 	  <itemizedlist>
 | |
| 	    <listitem>
 | |
| 	      <para><literal>FUTEX_OP_SET</literal></para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para><literal>FUTEX_OP_ADD</literal></para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para><literal>FUTEX_OP_OR</literal></para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para><literal>FUTEX_OP_ANDN</literal></para>
 | |
| 	    </listitem>
 | |
| 	    <listitem>
 | |
| 	      <para><literal>FUTEX_OP_XOR</literal></para>
 | |
| 	    </listitem>
 | |
| 	  </itemizedlist>
 | |
| 
 | |
| 	  <note>
 | |
| 	    <para>There is no <varname>val2</varname> parameter in the
 | |
| 	      futex prototype.  The <varname>val2</varname> is taken
 | |
| 	      from the <varname>struct timespec *timeout</varname>
 | |
| 	      parameter for operations
 | |
| 	      <literal>FUTEX_REQUEUE</literal>,
 | |
| 	      <literal>FUTEX_CMP_REQUEUE</literal> and
 | |
| 	      <literal>FUTEX_WAKE_OP</literal>.</para>
 | |
| 	  </note>
 | |
| 	</sect4>
 | |
|       </sect3>
 | |
| 
 | |
|       <sect3 xml:id="futex-emu">
 | |
| 	<title>Futex emulation in &os;</title>
 | |
| 
 | |
| 	<para>The futex emulation in &os; is taken from NetBSD and
 | |
| 	  further extended by us.  It resides in the
 | |
| 	  <filename>linux_futex.c</filename> and
 | |
| 	  <filename>linux_futex.h</filename> files.  The
 | |
| 	  <literal>futex</literal> structure looks like:</para>
 | |
| 
 | |
| 	<programlisting>struct futex {
 | |
|   void *f_uaddr;
 | |
|   int f_refcount;
 | |
| 
 | |
|   LIST_ENTRY(futex) f_list;
 | |
| 
 | |
|   TAILQ_HEAD(lf_waiting_proc, waiting_proc) f_waiting_proc;
 | |
| };</programlisting>
 | |
| 
 | |
| 	<para>And the structure <literal>waiting_proc</literal>
 | |
| 	  is:</para>
 | |
| 
 | |
| 	<programlisting>struct waiting_proc {
 | |
| 
 | |
|   struct thread *wp_t;
 | |
| 
 | |
|   struct futex *wp_new_futex;
 | |
| 
 | |
|   TAILQ_ENTRY(waiting_proc) wp_list;
 | |
| };</programlisting>
 | |
| 
 | |
| 	<sect4 xml:id="futex-get">
 | |
| 	  <title>futex_get / futex_put</title>
 | |
| 
 | |
| 	  <para>A futex is obtained using the
 | |
| 	    <function>futex_get</function> function, which searches a
 | |
| 	    linear list of futexes and returns the found one or
 | |
| 	    creates a new futex.  When releasing a futex from the use
 | |
| 	    we call the <function>futex_put</function> function, which
 | |
| 	    decreases a reference counter of the futex and if the
 | |
| 	    refcount reaches zero it is released.</para>
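
	  <para>The lookup and reference counting can be sketched as
	    follows.  This is a simplified userspace model, not the
	    code from <filename>linux_futex.c</filename>: it uses a
	    hand-written singly linked list instead of the
	    &man.queue.3; macros, omits all locking, and the names
	    <literal>sketch_futex</literal>,
	    <function>sketch_futex_get</function> and
	    <function>sketch_futex_put</function> are
	    hypothetical.</para>

```c
#include <stdlib.h>

struct sketch_futex {
    void *f_uaddr;                  /* userspace address identifying the futex */
    int f_refcount;                 /* number of current users */
    struct sketch_futex *f_next;    /* next futex on the linear list */
};

static struct sketch_futex *sketch_futex_list;  /* head of the list */

/* Return the futex for uaddr, creating it on first use; takes a reference. */
static struct sketch_futex *sketch_futex_get(void *uaddr)
{
    struct sketch_futex *f;

    /* Linear search for an already existing futex. */
    for (f = sketch_futex_list; f != NULL; f = f->f_next) {
        if (f->f_uaddr == uaddr) {
            f->f_refcount++;
            return f;
        }
    }

    /* Not found: create a new futex holding one reference. */
    f = malloc(sizeof(*f));
    if (f == NULL)
        return NULL;
    f->f_uaddr = uaddr;
    f->f_refcount = 1;
    f->f_next = sketch_futex_list;
    sketch_futex_list = f;
    return f;
}

/* Drop one reference; unlink and free the futex when it was the last one. */
static void sketch_futex_put(struct sketch_futex *f)
{
    struct sketch_futex **pp;

    if (--f->f_refcount > 0)
        return;
    for (pp = &sketch_futex_list; *pp != NULL; pp = &(*pp)->f_next) {
        if (*pp == f) {
            *pp = f->f_next;
            break;
        }
    }
    free(f);
}
```

	  <para>Two calls to <function>sketch_futex_get</function>
	    with the same address return the same structure with a
	    reference count of two, and the structure disappears from
	    the list only after the second
	    <function>sketch_futex_put</function>.</para>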
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 xml:id="futex-sleep">
 | |
| 	  <title>futex_sleep</title>
 | |
| 
 | |
| 	  <para>When a futex queues a thread for sleeping it creates a
 | |
| 	    <literal>waiting_proc</literal> structure, puts this
 | |
| 	    structure on the list inside the futex structure, and then
 | |
| 	    performs a &man.tsleep.9; to suspend the thread.  The
 | |
| 	    sleep can be timed out.  After &man.tsleep.9; returns (the
 | |
| 	    thread was woken up or it timed out) the
 | |
| 	    <literal>waiting_proc</literal> structure is removed from
 | |
| 	    the list and destroyed.  All this is done in the
 | |
| 	    <function>futex_sleep</function> function.  If we were
 | |
| 	    woken up by <function>futex_wake</function> and
 | |
| 	    <varname>wp_new_futex</varname> is set, we go to sleep on
 | |
| 	    that new futex.  This way the actual requeueing is done in
 | |
| 	    this function.</para>
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 xml:id="futex-wake-2">
 | |
| 	  <title>futex_wake</title>
 | |
| 
 | |
| 	  <para>Waking up a thread sleeping on a futex is performed in
 | |
| 	    the <function>futex_wake</function> function.  First, this
 | |
| 	    function mimics the strange &linux; behavior of waking up
 | |
| 	    N threads for all operations except the REQUEUE
 | |
| 	    operations, which are performed on N+1 threads.  This
 | |
| 	    usually does not make any difference, as we are waking up
 | |
| 	    all threads.  Next, the function wakes up
 | |
| 	    <varname>n</varname> threads in a loop and then checks
 | |
| 	    whether there is a new futex for requeueing.  If so, it
 | |
| 	    requeues up to <varname>n2</varname> threads on the new
 | |
| 	    futex.  This cooperates with
 | |
| 	    <function>futex_sleep</function>.</para>
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 xml:id="futex-wake-op-2">
 | |
| 	  <title>futex_wake_op</title>
 | |
| 
 | |
| 	  <para>The <literal>FUTEX_WAKE_OP</literal> operation is
 | |
| 	    quite complicated.  First we obtain the two futexes at
 | |
| 	    addresses <varname>uaddr</varname> and
 | |
| 	    <varname>uaddr2</varname>, then we perform the atomic
 | |
| 	    operation using <varname>val3</varname> and
 | |
| 	    <varname>uaddr2</varname>.  Then <varname>val</varname>
 | |
| 	    waiters on the first futex are woken up, and if the atomic
 | |
| 	    operation condition holds we wake up
 | |
| 	    <varname>val2</varname> (i.e. <varname>timeout</varname>)
 | |
| 	    waiters on the second futex.</para>
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 xml:id="futex-atomic-op">
 | |
| 	  <title>futex atomic operation</title>
 | |
| 
 | |
| 	  <para>The atomic operation takes two parameters,
 | |
| 	    <varname>encoded_op</varname> and
 | |
| 	    <varname>uaddr</varname> (for
 | |
| 	    <literal>FUTEX_WAKE_OP</literal> the caller passes
 | |
| 	    <varname>uaddr2</varname> here).  The encoded operation
 | |
| 	    encodes the operation itself, the comparison operator, the
 | |
| 	    operation argument, and the comparison argument.  The
 | |
| 	    pseudocode for the operation looks like this:</para>
 | |
| 
 | |
| 	  <programlisting>oldval = *uaddr2
 | |
| *uaddr2 = oldval OP oparg</programlisting>
 | |
| 
 | |
| 	  <para>This is done atomically.  First the number at
 | |
| 	    <varname>uaddr</varname> is copied in and the operation is
 | |
| 	    performed.  The code handles page faults, and if no page
 | |
| 	    fault occurs <varname>oldval</varname> is compared to the
 | |
| 	    <varname>cmparg</varname> argument using the
 | |
| 	    <varname>cmp</varname> comparator.</para>
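
	  <para>A sketch of the decoding and evaluation may make this
	    more concrete.  It assumes the bit layout used by the
	    &linux; futex interface (4 bits of operation, 4 bits of
	    comparison, and two signed 12-bit arguments) and the
	    &linux; numbering of the operations, where code 3 is the
	    and-not operation.  The sketch is not atomic, ignores the
	    flag that turns the operation argument into a shift
	    count, works on a plain <literal>int</literal> instead of
	    user memory, and all <function>sketch_*</function> names
	    are hypothetical.</para>

```c
/* Operation and comparison codes, numbered as in the Linux futex interface. */
enum { OP_SET = 0, OP_ADD = 1, OP_OR = 2, OP_ANDN = 3, OP_XOR = 4 };
enum { CMP_EQ = 0, CMP_NE = 1, CMP_LT = 2, CMP_LE = 3, CMP_GT = 4, CMP_GE = 5 };

/* Pack the fields into one word: op (4 bits), cmp (4), oparg (12), cmparg (12). */
static unsigned sketch_encode(int op, int oparg, int cmp, int cmparg)
{
    return (((unsigned)op & 0xfu) << 28) | (((unsigned)cmp & 0xfu) << 24) |
        (((unsigned)oparg & 0xfffu) << 12) | ((unsigned)cmparg & 0xfffu);
}

/* Interpret the low 12 bits of v as a signed number. */
static int sketch_sext12(unsigned v)
{
    v &= 0xfffu;
    return (v & 0x800u) ? (int)v - 0x1000 : (int)v;
}

/*
 * Apply the encoded operation to *uaddr and return the result of comparing
 * the OLD value with cmparg (1 or 0), or -1 for an unknown code.
 * NOT atomic: this only models the arithmetic.
 */
static int sketch_futex_op(unsigned encoded_op, int *uaddr)
{
    int op = (int)((encoded_op >> 28) & 0x7u);  /* shift flag bit ignored */
    int cmp = (int)((encoded_op >> 24) & 0xfu);
    int oparg = sketch_sext12(encoded_op >> 12);
    int cmparg = sketch_sext12(encoded_op);
    int oldval = *uaddr;
    int newval, result;

    switch (op) {
    case OP_SET:  newval = oparg;           break;
    case OP_ADD:  newval = oldval + oparg;  break;
    case OP_OR:   newval = oldval | oparg;  break;
    case OP_ANDN: newval = oldval & ~oparg; break;
    case OP_XOR:  newval = oldval ^ oparg;  break;
    default:      return -1;
    }

    switch (cmp) {
    case CMP_EQ: result = (oldval == cmparg); break;
    case CMP_NE: result = (oldval != cmparg); break;
    case CMP_LT: result = (oldval < cmparg);  break;
    case CMP_LE: result = (oldval <= cmparg); break;
    case CMP_GT: result = (oldval > cmparg);  break;
    case CMP_GE: result = (oldval >= cmparg); break;
    default:     return -1;
    }

    *uaddr = newval;
    return result;
}
```

	  <para>Note that the comparison uses the value from before
	    the update.  A positive result is what makes
	    <literal>FUTEX_WAKE_OP</literal> wake the waiters on the
	    second futex.</para>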
 | |
| 	</sect4>
 | |
| 
 | |
| 	<sect4 xml:id="futex-locking">
 | |
| 	  <title>Futex locking</title>
 | |
| 
 | |
| 	  <para>The futex implementation uses two locks: an
 | |
| 	    <function>sx_lock</function> protecting the futex list and
 | |
| 	    a global lock (either Giant or another
 | |
| 	    <function>sx_lock</function>).  Every operation is
 | |
| 	    performed with the locks held from the start to the very
 | |
| 	    end.</para>
 | |
| 	</sect4>
 | |
|       </sect3>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 xml:id="syscall-impl">
 | |
|       <title>Various syscalls implementation</title>
 | |
| 
 | |
|       <para>In this section I am going to describe some smaller
 | |
| 	syscalls that are worth mentioning because their
 | |
| 	implementation is not obvious or those syscalls are
 | |
| 	interesting from another point of view.</para>
 | |
| 
 | |
|       <sect3 xml:id="syscall-at">
 | |
| 	<title>*at family of syscalls</title>
 | |
| 
 | |
| 	<para>During development of the &linux; 2.6.16 kernel, the *at
 | |
| 	  syscalls were added.  Those syscalls
 | |
| 	  (<function>openat</function> for example) work exactly like
 | |
| 	  their at-less counterparts with the slight exception of the
 | |
| 	  <varname>dirfd</varname> parameter.  This parameter changes
 | |
| 	  the location relative to which the file operated on by the
 | |
| 	  syscall is looked up.  When the <varname>filename</varname>
 | |
| 	  parameter is absolute, <varname>dirfd</varname> is ignored,
 | |
| 	  but when the path to the file is relative, it comes into
 | |
| 	  play.  The <varname>dirfd</varname> parameter is either a
 | |
| 	  file descriptor of the directory against which the relative
 | |
| 	  pathname is resolved, or <literal>AT_FDCWD</literal>, which
 | |
| 	  stands for the current working directory.  So for example
 | |
| 	  the <function>openat</function> syscall can be used like
 | |
| 	  this:</para>
 | |
| 
 | |
| 	<programlisting>file descriptor 123 = /tmp/foo/, current working directory = /tmp/
 | |
| 
 | |
| openat(123, "/tmp/bah", flags, mode)	/* opens /tmp/bah */
 | |
| openat(123, "bah", flags, mode)		/* opens /tmp/foo/bah */
 | |
| openat(AT_FDCWD, "bah", flags, mode)	/* opens /tmp/bah */
 | |
| openat(stdio, "bah", flags, mode)	/* returns error because stdio is not a directory */</programlisting>
 | |
| 
 | |
| 	<para>This infrastructure is necessary to avoid races when
 | |
| 	  opening files outside the working directory.  Imagine that a
 | |
| 	  process consists of two threads, thread A and
 | |
| 	  thread B.  Thread A issues
 | |
| 	  <literal>open("/tmp/foo/bah", flags, mode)</literal> and
 | |
| 	  before returning it gets preempted and thread B runs.
 | |
| 	  Thread B does not care about the needs of thread A
 | |
| 	  and renames or removes <filename>/tmp/foo/</filename>.  We
 | |
| 	  have a race.  To avoid this we can open
 | |
| 	  <filename>/tmp/foo</filename> and use it as
 | |
| 	  <varname>dirfd</varname> for the <function>openat</function>
 | |
| 	  syscall.  This also enables users to implement per-thread
 | |
| 	  working directories.</para>
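
	<para>From the point of view of an application the pattern
	  looks roughly as follows.  This is a minimal sketch using
	  the &posix; <function>open</function> and
	  <function>openat</function> calls; the helper name
	  <function>open_in_dir</function> is hypothetical, and the
	  helper does not take a mode argument, so it is only meant
	  for flags without <literal>O_CREAT</literal>.</para>

```c
#define _POSIX_C_SOURCE 200809L

#include <fcntl.h>
#include <unistd.h>

/* Open name relative to the directory dirpath; returns a descriptor or -1. */
static int open_in_dir(const char *dirpath, const char *name, int flags)
{
    int dirfd, fd;

    /*
     * Pin the directory first.  Once we hold a descriptor for it, another
     * thread renaming the path no longer changes what we refer to.
     * O_DIRECTORY makes the call fail if dirpath is not a directory.
     */
    dirfd = open(dirpath, O_RDONLY | O_DIRECTORY);
    if (dirfd == -1)
        return -1;

    /* A relative name is resolved against dirfd, not the working directory. */
    fd = openat(dirfd, name, flags);
    close(dirfd);
    return fd;
}
```

	<para>A program that performs several operations in the same
	  directory would keep the directory descriptor open and
	  pass it to each *at call instead of closing it right
	  away.</para>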
 | |
| 
 | |
| 	<para>The &linux; family of *at syscalls contains:
 | |
| 	  <function>linux_openat</function>,
 | |
| 	  <function>linux_mkdirat</function>,
 | |
| 	  <function>linux_mknodat</function>,
 | |
| 	  <function>linux_fchownat</function>,
 | |
| 	  <function>linux_futimesat</function>,
 | |
| 	  <function>linux_fstatat64</function>,
 | |
| 	  <function>linux_unlinkat</function>,
 | |
| 	  <function>linux_renameat</function>,
 | |
| 	  <function>linux_linkat</function>,
 | |
| 	  <function>linux_symlinkat</function>,
 | |
| 	  <function>linux_readlinkat</function>,
 | |
| 	  <function>linux_fchmodat</function> and
 | |
| 	  <function>linux_faccessat</function>.  All these are
 | |
| 	  implemented using the modified &man.namei.9; routine and a
 | |
| 	  simple wrapping layer.</para>
 | |
| 
 | |
| 	<sect4 xml:id="implementation">
 | |
| 	  <title>Implementation</title>
 | |
| 
 | |
| 	  <para>The implementation is done by altering the
 | |
| 	    &man.namei.9; routine (described above) to take an
 | |
| 	    additional parameter <varname>dirfd</varname> in its
 | |
| 	    <literal>nameidata</literal> structure, which specifies
 | |
| 	    the starting point of the pathname lookup instead of using
 | |
| 	    the current working directory every time.  The resolution
 | |
| 	    of <varname>dirfd</varname> from a file descriptor number
 | |
| 	    to a vnode is done in the native *at syscalls.  When
 | |
| 	    <varname>dirfd</varname> is <literal>AT_FDCWD</literal>
 | |
| 	    the <varname>dvp</varname> entry in the
 | |
| 	    <literal>nameidata</literal> structure is
 | |
| 	    <literal>NULL</literal>, but when <varname>dirfd</varname>
 | |
| 	    is a different number we obtain the file for this file
 | |
| 	    descriptor, check whether this file is valid, and if there
 | |
| 	    is a vnode attached to it we take that vnode.  Then we
 | |
| 	    check that this vnode is a directory.  In the actual
 | |
| 	    &man.namei.9; routine we simply substitute the
 | |
| 	    <varname>dvp</varname> vnode for the <varname>dp</varname>
 | |
| 	    variable, which determines the starting point.  The
 | |
| 	    &man.namei.9; routine is not used directly but via a chain
 | |
| 	    of different functions on various levels.  For example,
 | |
| 	    <function>openat</function> goes like this:</para>
 | |
| 
 | |
| 	  <programlisting>openat() --> kern_openat() --> vn_open() --> namei()</programlisting>
 | |
| 
 | |
| 	  <para>For this reason <function>kern_open</function> and
 | |
| 	    <function>vn_open</function> must be altered to
 | |
| 	    incorporate the additional <varname>dirfd</varname>
 | |
| 	    parameter.  No compat layer is created for those because
 | |
| 	    there are not many users of this and the users can be
 | |
| 	    easily converted.  This general implementation enables
 | |
| 	    &os; to implement its own *at syscalls.  This is being
 | |
| 	    discussed right now.</para>
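
	  <para>The choice of the lookup starting point can be
	    modelled in isolation.  The following sketch is not the
	    &man.namei.9; code: the <literal>sketch_vnode</literal>
	    type and the <function>sketch_lookup_start</function>
	    function are hypothetical, and the model performs the
	    directory check itself, whereas the text above places it
	    in the syscall layer.  Following the convention described
	    above, a <literal>NULL</literal> <varname>dvp</varname>
	    stands for <literal>AT_FDCWD</literal>.</para>

```c
#include <stddef.h>

/* Hypothetical, heavily reduced stand-in for a vnode. */
struct sketch_vnode {
    int is_dir;     /* non-zero if the vnode is a directory */
};

/*
 * Pick the vnode a pathname lookup starts from.
 *   root - vnode of the root directory
 *   cwd  - vnode of the current working directory
 *   dvp  - vnode resolved from dirfd, or NULL for AT_FDCWD
 * Returns NULL when dirfd refers to something that is not a directory.
 */
static const struct sketch_vnode *
sketch_lookup_start(const char *path, const struct sketch_vnode *root,
    const struct sketch_vnode *cwd, const struct sketch_vnode *dvp)
{
    if (path[0] == '/')
        return root;        /* absolute path: dirfd is ignored */
    if (dvp == NULL)
        return cwd;         /* AT_FDCWD: behave like the at-less syscall */
    if (!dvp->is_dir)
        return NULL;        /* dirfd must name a directory */
    return dvp;             /* relative path: start at dirfd */
}
```

	  <para>The four cases correspond to the four
	    <function>openat</function> calls in the example given
	    earlier in this section.</para>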
 | |
| 	</sect4>
 | |
|       </sect3>
 | |
| 
 | |
|       <sect3 xml:id="ioctl">
 | |
| 	<title>Ioctl</title>
 | |
| 
 | |
| 	<para>The ioctl interface is quite fragile due to its
 | |
| 	  generality.  We have to bear in mind that devices differ
 | |
| 	  between &linux; and &os;, so some care must be applied to make
 | |
| 	  ioctl emulation work right.  The ioctl handling is
 | |
| 	  implemented in <filename>linux_ioctl.c</filename>, where the
 | |
| 	  <function>linux_ioctl</function> function is defined.  This
 | |
| 	  function simply iterates over sets of ioctl handlers to find
 | |
| 	  a handler that implements a given command.  The ioctl
 | |
| 	  syscall has three parameters: the file descriptor, a command
 | |
| 	  and an argument.  The command is a 16-bit number, which in
 | |
| 	  theory is divided into high 8 bits determining class of
 | |
| 	  the ioctl command and low 8 bits, which are the actual
 | |
| 	  command within the given set.  The emulation takes advantage
 | |
| 	  of this division.  We implement handlers for each set, like
 | |
| 	  <function>sound_handler</function> or
 | |
| 	  <function>disk_handler</function>.  Each handler has a
 | |
| 	  maximum command and a minimum command defined, which is used
 | |
| 	  for determining what handler is used.  There are slight
 | |
| 	  problems with this approach because &linux; does not use the
 | |
| 	  set division consistently so sometimes ioctls for a
 | |
| 	  different set are inside a set they should not belong to
 | |
| 	  (SCSI generic ioctls inside cdrom set, etc.).  &os;
 | |
| 	  currently does not implement many &linux; ioctls (compared
 | |
| 	  to NetBSD, for example) but the plan is to port those from
 | |
| 	  NetBSD.  The trend is to use &linux; ioctls even in the
 | |
| 	  native &os; drivers because this eases the porting of
 | |
| 	  applications.</para>
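
	<para>The range-based dispatch can be sketched as follows.
	  This is an illustration only: the structure, the handler
	  functions and in particular the command ranges are made up
	  for the example and are not taken from
	  <filename>linux_ioctl.c</filename>.</para>

```c
#include <stddef.h>

typedef int (*sketch_ioctl_fn)(unsigned cmd);

/* One handler covers a contiguous range of command numbers. */
struct sketch_ioctl_handler {
    unsigned low;           /* minimum command handled */
    unsigned high;          /* maximum command handled */
    sketch_ioctl_fn func;   /* function implementing the commands */
};

/* Hypothetical handlers; the return values just identify the handler. */
static int sketch_sound_handler(unsigned cmd) { (void)cmd; return 1; }
static int sketch_disk_handler(unsigned cmd)  { (void)cmd; return 2; }

/* The ranges below are invented for the example. */
static const struct sketch_ioctl_handler sketch_handlers[] = {
    { 0x5000, 0x50ff, sketch_sound_handler },
    { 0x0300, 0x03ff, sketch_disk_handler },
};

/* Find the handler whose range contains cmd and call it; -1 if none does. */
static int sketch_ioctl(unsigned cmd)
{
    size_t i;

    cmd &= 0xffff;  /* 16-bit command: high byte class, low byte command */
    for (i = 0; i < sizeof(sketch_handlers) / sizeof(sketch_handlers[0]); i++) {
        if (cmd >= sketch_handlers[i].low && cmd <= sketch_handlers[i].high)
            return sketch_handlers[i].func(cmd);
    }
    return -1;
}
```

	<para>Because the lookup only compares against the minimum
	  and maximum command of each handler, a command that
	  &linux; placed in the wrong class ends up in the wrong
	  handler, which is the inconsistency mentioned above.</para>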
 | |
|       </sect3>
 | |
| 
 | |
|       <sect3 xml:id="debugging">
 | |
| 	<title>Debugging</title>
 | |
| 
 | |
| 	<para>Every syscall should be debuggable.  For this purpose we
 | |
| 	  introduce a small infrastructure.  We have the ldebug
 | |
| 	  facility, which tells whether a given syscall should be
 | |
| 	  debugged (settable via a sysctl).  For printing we have LMSG
 | |
| 	  and ARGS macros.  Those are used for altering a printable
 | |
| 	  string for uniform debugging messages.</para>
 | |
|       </sect3>
 | |
|     </sect2>
 | |
|   </sect1>
 | |
| 
 | |
|   <sect1 xml:id="conclusion">
 | |
|     <title>Conclusion</title>
 | |
| 
 | |
|     <sect2 xml:id="results">
 | |
|       <title>Results</title>
 | |
| 
 | |
|       <para>As of April 2007 the &linux; emulation layer is capable of
 | |
| 	emulating the &linux; 2.6.16 kernel quite well.  The
 | |
| 	remaining problems concern futexes, unfinished *at family of
 | |
| 	syscalls, problematic signal delivery, missing
 | |
| 	<function>epoll</function> and <function>inotify</function>
 | |
| 	and probably some bugs we have not discovered yet.  Despite
 | |
| 	this we are capable of running basically all the &linux;
 | |
| 	programs included in the &os; Ports Collection with
 | |
| 	Fedora Core 4 at 2.6.16 and there are some
 | |
| 	rudimentary reports of success with Fedora Core 6 at
 | |
| 	2.6.16.  The Fedora Core 6 linux_base was recently
 | |
| 	committed enabling some further testing of the emulation layer
 | |
| 	and giving us some more hints where we should put our effort
 | |
| 	in implementing missing stuff.</para>
 | |
| 
 | |
|       <para>We are able to run the most used applications like
 | |
| 	<package>www/linux-firefox</package>,
 | |
| 	<package>net-im/skype</package> and some games from the
 | |
| 	Ports Collection.  Some of the programs exhibit bad
 | |
| 	behavior under 2.6 emulation but this is currently under
 | |
| 	investigation and hopefully will be fixed soon.  The only big
 | |
| 	application that is known not to work is the &linux; &java;
 | |
| 	Development Kit and this is because of the requirement of
 | |
| 	the <function>epoll</function> facility, which is not directly
 | |
| 	related to the &linux; kernel 2.6.</para>
 | |
| 
 | |
|       <para>We hope to enable 2.6.16 emulation by default some time
 | |
| 	after &os; 7.0 is released at least to expose the 2.6
 | |
| 	emulation parts for some wider testing.  Once this is done we
 | |
| 	can switch to Fedora Core 6 linux_base, which is the
 | |
| 	ultimate plan.</para>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 xml:id="future-work">
 | |
|       <title>Future work</title>
 | |
| 
 | |
|       <para>Future work should focus on fixing the remaining issues
 | |
| 	with futexes, implementing the rest of the *at family of
 | |
| 	syscalls, fixing the signal delivery and possibly implementing
 | |
| 	the <function>epoll</function> and
 | |
| 	<function>inotify</function> facilities.</para>
 | |
| 
 | |
|       <para>We hope to be able to run the most important programs
 | |
| 	flawlessly soon, so we will be able to switch to the 2.6
 | |
| 	emulation by default and make Fedora Core 6 the
 | |
| 	default linux_base because our currently used
 | |
| 	Fedora Core 4 is not supported any more.</para>
 | |
| 
 | |
|       <para>The other possible goal is to share our code with NetBSD
 | |
| 	and DragonFly BSD.  NetBSD has some support for 2.6 emulation,
 | |
| 	but it is far from finished and not really tested.
 | |
| 	DragonFly BSD has expressed some interest in porting the 2.6
 | |
| 	improvements.</para>
 | |
| 
 | |
|       <para>Generally, as &linux; develops we would like to keep up
 | |
| 	with their development, implementing newly added syscalls.
 | |
| 	Splice comes to mind first.  Some already implemented syscalls
 | |
| 	are also heavily crippled, for example
 | |
| 	<function>mremap</function> and others.  Some performance
 | |
| 	improvements can also be made, such as finer-grained
 | |
| 	locking.</para>
 | |
|     </sect2>
 | |
| 
 | |
|     <sect2 xml:id="team">
 | |
|       <title>Team</title>
 | |
| 
 | |
|       <para>I cooperated on this project with (in alphabetical
 | |
| 	order):</para>
 | |
| 
 | |
|       <itemizedlist>
 | |
| 	<listitem>
 | |
| 	  <para>&a.jhb.email;</para>
 | |
| 	</listitem>
 | |
| 	<listitem>
 | |
| 	  <para>&a.kib.email;</para>
 | |
| 	</listitem>
 | |
| 	<listitem>
 | |
| 	  <para>Emmanuel Dreyfus</para>
 | |
| 	</listitem>
 | |
| 	<listitem>
 | |
| 	  <para>Scot Hetzel</para>
 | |
| 	</listitem>
 | |
| 	<listitem>
 | |
| 	  <para>&a.jkim.email;</para>
 | |
| 	</listitem>
 | |
| 	<listitem>
 | |
| 	  <para>&a.netchild.email;</para>
 | |
| 	</listitem>
 | |
| 	<listitem>
 | |
| 	  <para>&a.ssouhlal.email;</para>
 | |
| 	</listitem>
 | |
| 	<listitem>
 | |
| 	  <para>Li Xiao</para>
 | |
| 	</listitem>
 | |
| 	<listitem>
 | |
| 	  <para>&a.davidxu.email;</para>
 | |
| 	</listitem>
 | |
|       </itemizedlist>
 | |
| 
 | |
|       <para>I would like to thank all those people for their advice,
 | |
| 	code reviews and general support.</para>
 | |
|     </sect2>
 | |
|   </sect1>
 | |
| 
 | |
|   <sect1 xml:id="literatures">
 | |
|     <title>Literature</title>
 | |
| 
 | |
|     <orderedlist>
 | |
|       <listitem>
 | |
| 	<para>Marshall Kirk McKusick, George V. Neville-Neil.  The
 | |
| 	  Design and Implementation of the &os; Operating System.
 | |
| 	  Addison-Wesley, 2005.</para>
 | |
|       </listitem>
 | |
|       <listitem>
 | |
| 	<para><uri
 | |
| 	    xlink:href="https://tldp.org">https://tldp.org</uri></para>
 | |
|       </listitem>
 | |
|       <listitem>
 | |
| 	<para><uri
 | |
| 	    xlink:href="https://www.kernel.org">https://www.kernel.org</uri></para>
 | |
|       </listitem>
 | |
|     </orderedlist>
 | |
|   </sect1>
 | |
| </article>
 |