<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE article PUBLIC "-//FreeBSD//DTD DocBook XML V5.0-Based Extension//EN"
"http://www.FreeBSD.org/XML/share/xml/freebsd50.dtd">
<!-- $FreeBSD$ -->
<!-- The FreeBSD Documentation Project -->
<article xmlns="http://docbook.org/ns/docbook"
xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0"
xml:lang="en">
<info>
<title>&linux; emulation in &os;</title>

<author>
<personname>
<firstname>Roman</firstname>
<surname>Divacky</surname>
</personname>
<affiliation>
<address>
<email>rdivacky@FreeBSD.org</email>
</address>
</affiliation>
</author>

<legalnotice xml:id="trademarks" role="trademarks">
&tm-attrib.adobe;
&tm-attrib.ibm;
&tm-attrib.freebsd;
&tm-attrib.linux;
&tm-attrib.netbsd;
&tm-attrib.realnetworks;
&tm-attrib.oracle;
&tm-attrib.sun;
&tm-attrib.general;
</legalnotice>

<pubdate>$FreeBSD$</pubdate>

<releaseinfo>$FreeBSD$</releaseinfo>

<abstract>
<para>This master's thesis deals with updating the &linux;
emulation layer (the so-called
<firstterm>Linuxulator</firstterm>). The task was to update
the layer to match the functionality of &linux; 2.6. As a
reference implementation, the &linux; 2.6.16 kernel was
chosen. The concept is loosely based on the NetBSD
implementation. Most of the work was done in the summer of
2006 as a part of the Google Summer of Code student program.
The focus was on bringing <firstterm>NPTL</firstterm> (Native
&posix; Thread Library) support into the emulation layer,
including <firstterm>TLS</firstterm> (thread-local storage),
<firstterm>futexes</firstterm> (fast user space mutexes),
<firstterm>PID mangling</firstterm>, and some other minor
things. Many small problems were identified and fixed in the
process. My work was integrated into the main &os; source
repository and will be shipped in the upcoming 7.0R release.
We, the emulation development team, are working on making the
&linux; 2.6 emulation the default emulation layer in
&os;.</para>
</abstract>
</info>

<sect1 xml:id="intro">
<title>Introduction</title>

<para>Over the last few years, open source &unix;-based operating
systems have become widely deployed on server and client
machines. Among these operating systems I would like to point
out two: &os;, for its BSD heritage, time-proven code base and
many interesting features, and &linux;, for its wide user base,
enthusiastic open developer community and support from large
companies. &os; tends to be used on server-class machines
serving heavy-duty networking tasks, with less usage on
desktop-class machines for ordinary users. &linux; sees the
same usage on servers, but it is used much more by home users.
This leads to a situation where there are many binary-only
programs available for &linux; that lack support for
&os;.</para>

<para>Naturally, a need arises for the ability to run &linux;
binaries on a &os; system, and this is what this thesis deals
with: the emulation of the &linux; kernel in the &os; operating
system.</para>

<para>During the summer of 2006 Google Inc. sponsored a project
which focused on extending the &linux; emulation layer (the
so-called Linuxulator) in &os; to include &linux; 2.6 facilities.
This thesis is written as a part of this project.</para>
</sect1>

<sect1 xml:id="inside">
<title>A look inside…</title>

<para>In this section we describe each operating system in
question: how it deals with syscalls, trapframes and other
low-level details. We also describe how each system
understands common &unix; primitives, such as what a PID is,
what a thread is, etc. In the third subsection we talk about how
&unix; on &unix; emulation could be done in general.</para>

<sect2 xml:id="what-is-unix">
<title>What is &unix;</title>

<para>&unix; is an operating system with a long history that has
influenced almost every other operating system currently in
use. Starting in the 1960s, its development continues to this
day (although in different projects). &unix; development soon
forked into two main branches: the BSD and System III/V families.
They influenced each other while growing a common &unix;
standard. Among the contributions that originated in BSD we can
name virtual memory, TCP/IP networking, FFS, and many others.
The System V branch contributed SysV interprocess
communication primitives, copy-on-write, etc. &unix; itself
does not exist any more, but its ideas have been used by many
other operating systems worldwide, thus forming the so-called
&unix;-like operating systems. These days the most
influential ones are &linux;, Solaris, and possibly (to some
extent) &os;. There are in-company &unix; derivatives (AIX,
HP-UX etc.), but these are increasingly being migrated to the
aforementioned systems. Let us summarize typical &unix;
characteristics.</para>
</sect2>

<sect2 xml:id="tech-details">
<title>Technical details</title>

<para>Every running program constitutes a process that
represents a state of the computation. A running process is
divided between kernel space and user space. Some operations
can be done only from kernel space (dealing with hardware
etc.), but the process should spend most of its lifetime in
user space. The kernel is where the management of
processes, hardware, and low-level details takes place. The
kernel provides a standard unified &unix; API to user
space. The most important parts of it are covered below.</para>

<sect3 xml:id="kern-proc-comm">
<title>Communication between kernel and user space process</title>

<para>The common &unix; API defines a syscall as a way to issue
commands from a user space process to the kernel. The most
common implementation uses either an interrupt or a
specialized instruction (think of the
<literal>SYSENTER</literal>/<literal>SYSCALL</literal>
instructions for ia32). Syscalls are identified by a number.
For example in &os;, syscall number 85 is the
&man.swapon.2; syscall and syscall number 132 is
&man.mkfifo.2;. Some syscalls need parameters, which are
passed from user space to kernel space in various
ways (implementation dependent). Syscalls are
synchronous.</para>
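
<para>The idea of identifying syscalls by number can be
illustrated with a small dispatch table. This is only a sketch:
the handler functions, the <literal>ex_</literal> names and the
error value are hypothetical, and only the two syscall numbers
come from the text above. A real kernel table holds far more
information per entry.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical error value for an unknown syscall number. */
#define EX_ENOSYS (-1)

/* Every toy handler takes one argument and returns a status. */
typedef int (*ex_syscall_fn)(long arg);

/* Stand-ins for real kernel functions; each just reports its own number. */
int ex_swapon(long arg) { (void)arg; return 85; }
int ex_mkfifo(long arg) { (void)arg; return 132; }

#define EX_NSYSCALLS 133

/* Sparse table indexed by syscall number; unlisted slots stay NULL. */
static const ex_syscall_fn ex_table[EX_NSYSCALLS] = {
    [85]  = ex_swapon,
    [132] = ex_mkfifo,
};

/* Look up the handler for num and call it, or fail with EX_ENOSYS. */
int ex_dispatch(unsigned num, long arg)
{
    if (num >= EX_NSYSCALLS || ex_table[num] == NULL)
        return EX_ENOSYS;
    return ex_table[num](arg);
}

/* Usage: known numbers reach their handler, unknown ones are rejected. */
void ex_dispatch_demo(void)
{
    assert(ex_dispatch(85, 0) == 85);
    assert(ex_dispatch(1, 0) == EX_ENOSYS);
}
```
]]></programlisting>

<para>The bounds check comes before the table access, so an
out-of-range number supplied by user space can never index past
the end of the table.</para>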

<para>Another possible way to communicate is by using a
<firstterm>trap</firstterm>. Traps occur asynchronously
after some event occurs (division by zero, page fault etc.).
A trap can be transparent for a process (page fault) or can
result in a reaction like sending a
<firstterm>signal</firstterm> (division by zero).</para>
</sect3>

<sect3 xml:id="proc-proc-comm">
<title>Communication between processes</title>

<para>There are other APIs (System V IPC, shared memory etc.),
but the single most important API is the signal. Signals are
sent by processes or by the kernel and received by
processes. Some signals can be ignored or handled by a
user-supplied routine; some result in a predefined action that
cannot be altered or ignored.</para>
</sect3>

<sect3 xml:id="proc-mgmt">
<title>Process management</title>

<para>The kernel instantiates the first process in the system
(the so-called init). Every running process can create an
identical copy of itself using the &man.fork.2; syscall. Some
slightly modified versions of this syscall were introduced,
but the basic semantics are the same. Every running process
can morph into some other process using the &man.exec.3;
syscall. Some modifications of this syscall were introduced,
but all serve the same basic purpose. Processes end their
lives by calling the &man.exit.2; syscall. Every process is
identified by a unique number called the PID. Every process has
a defined parent (identified by its PID).</para>
</sect3>

<sect3 xml:id="thread-mgmt">
<title>Thread management</title>

<para>Traditional &unix; does not define any API or
implementation for threading, while &posix; defines a
threading API but leaves the implementation undefined.
Traditionally there were two ways of implementing threads:
handling them as separate processes (1:1 threading), or
enveloping the whole thread group in one process and managing
the threading in userspace (1:N threading). The main
features of each approach compare as follows:</para>

<para>1:1 threading</para>

<itemizedlist>
<listitem>
<para>- heavyweight threads</para>
</listitem>
<listitem>
<para>- the scheduling cannot be altered by the user
(slightly mitigated by the &posix; API)</para>
</listitem>
<listitem>
<para>+ no syscall wrapping necessary</para>
</listitem>
<listitem>
<para>+ can utilize multiple CPUs</para>
</listitem>
</itemizedlist>

<para>1:N threading</para>

<itemizedlist>
<listitem>
<para>+ lightweight threads</para>
</listitem>
<listitem>
<para>+ scheduling can be easily altered by the
user</para>
</listitem>
<listitem>
<para>- syscalls must be wrapped</para>
</listitem>
<listitem>
<para>- cannot utilize more than one CPU</para>
</listitem>
</itemizedlist>
</sect3>
</sect2>

<sect2 xml:id="what-is-freebsd">
<title>What is &os;?</title>

<para>The &os; project is one of the oldest open source
operating systems currently available for daily use. It is a
direct descendant of the genuine &unix;, so it could be claimed
that it is a true &unix;, although licensing issues do not
permit that. The start of the project dates back to the early
1990s, when a crew of fellow BSD users patched the 386BSD
operating system. Based on this patchkit a new operating
system arose, named &os; for its liberal license. Another
group created the NetBSD operating system with different goals
in mind. We will focus on &os;.</para>

<para>&os; is a modern &unix;-based operating system with all
the features of &unix;. Preemptive multitasking, multiuser
facilities, TCP/IP networking, memory protection, symmetric
multiprocessing support, virtual memory with merged VM and
buffer cache: they are all there. One of the interesting and
extremely useful features is the ability to emulate other
&unix;-like operating systems. As of December 2006 and
7-CURRENT development, the following emulation functionalities
are supported:</para>

<itemizedlist>
<listitem>
<para>&os;/i386 emulation on &os;/amd64</para>
</listitem>
<listitem>
<para>&os;/i386 emulation on &os;/ia64</para>
</listitem>
<listitem>
<para>&linux;-emulation of &linux; operating system on
&os;</para>
</listitem>
<listitem>
<para>NDIS-emulation of Windows networking drivers
interface</para>
</listitem>
<listitem>
<para>NetBSD-emulation of NetBSD operating system</para>
</listitem>
<listitem>
<para>PECoff-support for PECoff &os; executables</para>
</listitem>
<listitem>
<para>SVR4-emulation of System V revision 4 &unix;</para>
</listitem>
</itemizedlist>

<para>Actively developed emulations are the &linux; layer and
various &os;-on-&os; layers. The others are not expected to work
properly or to be usable these days.</para>

<sect3 xml:id="freebsd-tech-details">
<title>Technical details</title>

<para>&os; is a traditional flavor of &unix; in the sense of
dividing the run of processes into two halves: kernel space
and user space run. There are two types of process entry to
the kernel: a syscall and a trap. There is only one way to
return. In the subsequent sections we will describe the
three gates to/from the kernel. The whole description
applies to the i386 architecture, as the Linuxulator only
exists there, but the concept is similar on other
architectures. The information was taken from [1] and the
source code.</para>

<sect4 xml:id="freebsd-sys-entries">
<title>System entries</title>

<para>&os; has an abstraction called an execution class
loader, which is a wedge into the &man.execve.2; syscall.
This employs a structure <literal>sysentvec</literal>,
which describes an executable ABI. It contains things
like an errno translation table, a signal translation table,
and various functions to serve syscall needs (stack fixup,
coredumping, etc.). Every ABI the &os; kernel wants to
support must define this structure, as it is used later in
the syscall processing code and at some other places.
System entries are handled by trap handlers, where we can
access both the kernel space and the user space at
once.</para>
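
<para>To make the role of such an ABI description more concrete,
here is a minimal sketch of a descriptor that carries an errno
translation table. The structure and all <literal>ex_</literal>
names are hypothetical and far smaller than the real
<literal>sysentvec</literal>. The error numbers shown are the
ones &os; and &linux; are known to disagree on
(<literal>EAGAIN</literal> and <literal>EDEADLK</literal> swap
their values).</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

/* Toy ABI descriptor; the real sysentvec has many more fields. */
struct ex_abi {
    const char *name;        /* human-readable ABI name */
    const int  *errno_map;   /* native errno -> ABI errno, NULL for identity */
    size_t      errno_count; /* number of entries in errno_map */
};

/* A few native (FreeBSD) errno values mapped to their Linux counterparts. */
static const int ex_linux_errno_map[] = {
    [1]  = 1,   /* EPERM:   1 on both systems */
    [2]  = 2,   /* ENOENT:  2 on both systems */
    [11] = 35,  /* EDEADLK: 11 on FreeBSD, 35 on Linux */
    [35] = 11,  /* EAGAIN:  35 on FreeBSD, 11 on Linux */
};

const struct ex_abi ex_native_abi = { "native", NULL, 0 };
const struct ex_abi ex_linux_abi = {
    "linux",
    ex_linux_errno_map,
    sizeof(ex_linux_errno_map) / sizeof(ex_linux_errno_map[0]),
};

/* Translate a native error number into the one the ABI expects. */
int ex_translate_errno(const struct ex_abi *abi, int native_errno)
{
    assert(abi != NULL && native_errno >= 0);
    if (abi->errno_map == NULL || (size_t)native_errno >= abi->errno_count)
        return native_errno;
    int mapped = abi->errno_map[native_errno];
    /* In this sketch a zero entry means "not listed": pass it through. */
    return mapped != 0 ? mapped : native_errno;
}
```
]]></programlisting>

<para>With this sketch, translating native error 35 for the
&linux; descriptor yields 11, while the native descriptor returns
every value unchanged.</para>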
</sect4>

<sect4 xml:id="freebsd-syscalls">
<title>Syscalls</title>

<para>Syscalls on &os; are issued by executing interrupt
<literal>0x80</literal> with register
<varname>%eax</varname> set to the desired syscall number
and with arguments passed on the stack.</para>

<para>When a process issues an interrupt
<literal>0x80</literal>, the <literal>int0x80</literal>
syscall trap handler is invoked (defined in
<filename>sys/i386/i386/exception.s</filename>), which
prepares arguments (i.e. copies them onto the stack) for
a call to the C function &man.syscall.2; (defined in
<filename>sys/i386/i386/trap.c</filename>), which
processes the passed-in trapframe. The processing
consists of preparing the syscall (depending on the
<literal>sysvec</literal> entry) and determining whether the
syscall is a 32-bit or a 64-bit one (which changes the size of
the parameters); then the parameters are copied, including the
syscall. Next, the actual syscall function is executed
and its return code is processed (with special cases for the
<literal>ERESTART</literal> and
<literal>EJUSTRETURN</literal> errors). Finally,
<literal>userret()</literal> is scheduled, switching the
process back to user space. The parameters to the
actual syscall handler are passed in the form of
<literal>struct thread *td</literal>, <literal>struct syscall args *</literal>
arguments, where the second
parameter is a pointer to the copied-in structure of
parameters.</para>
</sect4>

<sect4 xml:id="freebsd-traps">
<title>Traps</title>

<para>Handling of traps in &os; is similar to the handling
of syscalls. Whenever a trap occurs, an assembler handler
is called. It is chosen between alltraps, alltraps with
regs pushed or calltrap depending on the type of the trap.
This handler prepares arguments for a call to the C function
<literal>trap()</literal> (defined in
<filename>sys/i386/i386/trap.c</filename>), which then
processes the trap that occurred. After the processing it
might send a signal to the process and/or exit to userland
using <literal>userret()</literal>.</para>
</sect4>

<sect4 xml:id="freebsd-exits">
<title>Exits</title>

<para>Exits from kernel to userspace happen using the
assembler routine <literal>doreti</literal> regardless of
whether the kernel was entered via a trap or via a
syscall. This restores the program status from the stack
and returns to userspace.</para>
</sect4>

<sect4 xml:id="freebsd-unix-primitives">
<title>&unix; primitives</title>

<para>The &os; operating system adheres to the traditional
&unix; scheme, where every process has a unique
identification number, the so-called
<firstterm>PID</firstterm> (Process ID). PID numbers are
allocated either linearly or randomly, ranging from
<literal>0</literal> to <literal>PID_MAX</literal>. The
allocation of PID numbers is done using a linear search
of the PID space. Every thread in a process receives the same
PID number as the result of the &man.getpid.2; call.</para>

<para>There are currently two ways to implement threading in
&os;: M:N threading and 1:1 threading. The default library
uses M:N threading (<literal>libpthread</literal>), and you can
switch at runtime to 1:1 threading
(<literal>libthr</literal>). The plan is to switch to the 1:1
library by default soon. Although the two libraries use
the same kernel primitives, they are accessed through
different APIs. The M:N library uses the
<literal>kse_*</literal> family of syscalls while the 1:1
library uses the <literal>thr_*</literal> family of
syscalls. Because of this, there is no general concept of
thread ID shared between kernel and userspace. Of course,
both threading libraries implement the pthread thread ID
API. Every kernel thread (as described by <literal>struct
thread</literal>) has a <literal>td_tid</literal> identifier,
but this is not
directly accessible from userland and solely serves the
kernel's needs. It is also used by the 1:1 threading library
as the pthread thread ID, but handling of this is internal to
the library and cannot be relied on.</para>

<para>As stated previously, there are two implementations of
threading in &os;. The M:N library divides the work
between kernel space and userspace. A thread is an entity
that gets scheduled in the kernel, but it can represent
a varying number of userspace threads. M userspace threads
get mapped to N kernel threads, thus saving resources while
keeping the ability to exploit multiprocessor parallelism.
Further information about the implementation can be
obtained from the man page or [1]. The 1:1 library
directly maps a userland thread to a kernel thread, thus
greatly simplifying the scheme. Neither of these designs
implements a fairness mechanism (such a mechanism was
implemented, but it was removed recently because it caused
serious slowdown and made the code more difficult to deal
with).</para>
</sect4>
</sect3>
</sect2>

<sect2 xml:id="what-is-linux">
<title>What is &linux;</title>

<para>&linux; is a &unix;-like kernel originally developed by
Linus Torvalds, and now contributed to by a massive
crowd of programmers all around the world. From its humble
beginnings to today, with wide support from companies such as
IBM or Google, &linux; has been associated with its fast
development pace, full hardware support and benevolent
dictator model of organization.</para>

<para>&linux; development started in 1991 as a hobbyist project
at the University of Helsinki in Finland. Since then it has
obtained all the features of a modern &unix;-like OS:
multiprocessing, multiuser support, virtual memory,
networking; basically everything is there. There are also
highly advanced features like virtualization etc.</para>

<para>As of 2006 &linux; seems to be the most widely used open
source operating system, with support from independent software
vendors like Oracle, RealNetworks, Adobe, etc. Most of the
commercial software distributed for &linux; can only be
obtained in binary form, so recompilation for other operating
systems is impossible.</para>

<para>Most of the &linux; development happens in the
<application>Git</application> version control system.
<application>Git</application> is a distributed system, so
there is no central source of the &linux; code, but some
branches are considered prominent and official. The version
number scheme implemented by &linux; consists of four numbers
A.B.C.D. Currently development happens in 2.6.C.D, where C
represents the major version, in which new features are added or
changed, while D is a minor version for bugfixes only.</para>

<para>More information can be obtained from [3].</para>

<sect3 xml:id="linux-tech-details">
<title>Technical details</title>

<para>&linux; follows the traditional &unix; scheme of
dividing the run of a process in two halves: the kernel and
user space. The kernel can be entered in two ways: via a
trap or via a syscall. The return is handled only in one
way. The further description applies to &linux; 2.6 on
the &i386; architecture. This information was taken from
[2].</para>

<sect4 xml:id="linux-syscalls">
<title>Syscalls</title>

<para>Syscalls in &linux; are performed (in userspace) using
<literal>syscallX</literal> macros, where X stands for the
number of parameters of the given
syscall. This macro translates to code that loads the
<varname>%eax</varname> register with the number of the
syscall and executes interrupt <literal>0x80</literal>.
After this, syscall return is called, which translates
negative return values to positive
<literal>errno</literal> values and sets
<literal>res</literal> to <literal>-1</literal> in case of
an error. Whenever the interrupt <literal>0x80</literal>
is called, the process enters the kernel in the system call
trap handler. This routine saves all registers on the
stack and calls the selected syscall entry. Note that the
&linux; calling convention expects parameters to the
syscall to be passed via registers as shown here:</para>

<orderedlist>
<listitem>
<para>parameter -> <varname>%ebx</varname></para>
</listitem>
<listitem>
<para>parameter -> <varname>%ecx</varname></para>
</listitem>
<listitem>
<para>parameter -> <varname>%edx</varname></para>
</listitem>
<listitem>
<para>parameter -> <varname>%esi</varname></para>
</listitem>
<listitem>
<para>parameter -> <varname>%edi</varname></para>
</listitem>
<listitem>
<para>parameter -> <varname>%ebp</varname></para>
</listitem>
</orderedlist>

<para>There are some exceptions to this, where &linux; uses
a different calling convention (most notably the
<literal>clone</literal> syscall).</para>
</sect4>

<sect4 xml:id="linux-traps">
<title>Traps</title>

<para>The trap handlers are introduced in
<filename>arch/i386/kernel/traps.c</filename> and most of
these handlers live in
<filename>arch/i386/kernel/entry.S</filename>, where
handling of the traps happens.</para>
</sect4>

<sect4 xml:id="linux-exits">
<title>Exits</title>

<para>Return from the syscall is managed by syscall
&man.exit.3;, which checks for the process having
unfinished work, then checks whether we used user-supplied
selectors. If so, stack fixing is applied, and
finally the registers are restored from the stack and the
process returns to userspace.</para>
</sect4>

<sect4 xml:id="linux-unix-primitives">
<title>&unix; primitives</title>

<para>In the 2.6 version, the &linux; operating system
redefined some of the traditional &unix; primitives,
notably PID, TID and thread. PID is defined not to be
unique for every process, so for some processes (threads)
&man.getpid.2; returns the same value. Unique
identification of a process is provided by TID. This is
because <firstterm>NPTL</firstterm> (Native &posix; Thread
Library) defines threads to be normal processes (so-called
1:1 threading). Spawning a new process in
&linux; 2.6 happens using the
<literal>clone</literal> syscall (fork variants are
reimplemented using it). This clone syscall defines a set
of flags that affect the behavior of the cloning process
regarding thread implementation. The semantics are a bit
fuzzy, as there is no single flag telling the syscall to
create a thread.</para>

<para>Implemented clone flags are:</para>

<itemizedlist>
<listitem>
<para><literal>CLONE_VM</literal> - processes share
their memory space</para>
</listitem>
<listitem>
<para><literal>CLONE_FS</literal> - share umask, cwd and
namespace</para>
</listitem>
<listitem>
<para><literal>CLONE_FILES</literal> - share open
files</para>
</listitem>
<listitem>
<para><literal>CLONE_SIGHAND</literal> - share signal
handlers and blocked signals</para>
</listitem>
<listitem>
<para><literal>CLONE_PARENT</literal> - share
parent</para>
</listitem>
<listitem>
<para><literal>CLONE_THREAD</literal> - be a thread
(further explanation below)</para>
</listitem>
<listitem>
<para><literal>CLONE_NEWNS</literal> - new
namespace</para>
</listitem>
<listitem>
<para><literal>CLONE_SYSVSEM</literal> - share SysV undo
structures</para>
</listitem>
<listitem>
<para><literal>CLONE_SETTLS</literal> - set up TLS at
supplied address</para>
</listitem>
<listitem>
<para><literal>CLONE_PARENT_SETTID</literal> - set TID
in the parent</para>
</listitem>
<listitem>
<para><literal>CLONE_CHILD_CLEARTID</literal> - clear
TID in the child</para>
</listitem>
<listitem>
<para><literal>CLONE_CHILD_SETTID</literal> - set TID in
the child</para>
</listitem>
</itemizedlist>

<para><literal>CLONE_PARENT</literal> sets the real parent
to the parent of the caller. This is useful for threads
because if thread A creates thread B we want thread B to
be parented to the parent of the whole thread group.
<literal>CLONE_THREAD</literal> does exactly the same
thing as <literal>CLONE_PARENT</literal>,
<literal>CLONE_VM</literal> and
<literal>CLONE_SIGHAND</literal>, rewrites the PID to be the
same as the PID of the caller, sets the exit signal to none and
enters the thread group. <literal>CLONE_SETTLS</literal>
sets up GDT entries for TLS handling. The
<literal>CLONE_*_*TID</literal> set of flags sets the word at a
user-supplied address to the TID or clears it to 0.</para>

<para>As you can see, <literal>CLONE_THREAD</literal>
does most of the work and does not seem to fit the scheme
very well. The original intention is unclear (even to the
authors, according to comments in the code), but I think
originally there was one threading flag, which was then
parcelled out among many other flags, and this separation was
never fully finished. It is also unclear what this
partition is good for, as glibc does not use it, so only
hand-written use of clone permits a programmer to
access these features.</para>

<para>For non-threaded programs the PID and TID are the
same. For threaded programs the first thread's PID and TID
are the same, and every created thread shares the same PID
and gets assigned a unique TID (because
<literal>CLONE_THREAD</literal> is passed in). The parent
is also shared by all processes forming this threaded
program.</para>

<para>The code that implements &man.pthread.create.3; in
NPTL defines the clone flags like this:</para>

<programlisting>int clone_flags = (CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGNAL
                   | CLONE_SETTLS | CLONE_PARENT_SETTID
                   | CLONE_CHILD_CLEARTID | CLONE_SYSVSEM
#if __ASSUME_NO_CLONE_DETACHED == 0
                   | CLONE_DETACHED
#endif
                   | 0);</programlisting>

<para><literal>CLONE_SIGNAL</literal> is defined
as</para>

<programlisting>#define CLONE_SIGNAL (CLONE_SIGHAND | CLONE_THREAD)</programlisting>

<para>The last 0 means no signal is sent when any of the
threads exits.</para>
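
<para>The PID and TID rules described above can be modelled in a
few lines. This is a toy model, not kernel code: the
<literal>ex_task</literal> structure and the functions are
hypothetical, the flag values are the ones we believe the
&linux; headers use, and the optional
<literal>CLONE_DETACHED</literal> flag is left out.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

/* Flag values as we understand them from the Linux headers; the EX_
 * prefix avoids clashes with any system definitions. */
#define EX_CLONE_VM             0x00000100u
#define EX_CLONE_FS             0x00000200u
#define EX_CLONE_FILES          0x00000400u
#define EX_CLONE_SIGHAND        0x00000800u
#define EX_CLONE_THREAD         0x00010000u
#define EX_CLONE_SYSVSEM        0x00040000u
#define EX_CLONE_SETTLS         0x00080000u
#define EX_CLONE_PARENT_SETTID  0x00100000u
#define EX_CLONE_CHILD_CLEARTID 0x00200000u
#define EX_CLONE_SIGNAL (EX_CLONE_SIGHAND | EX_CLONE_THREAD)

/* The identifiers a task reports to user space in this model. */
struct ex_task {
    int pid;   /* shared by all threads of one program */
    int tid;   /* unique per task */
    int ppid;  /* parent PID */
};

/* The flag set quoted above, without the optional CLONE_DETACHED. */
unsigned ex_nptl_flags(void)
{
    return EX_CLONE_VM | EX_CLONE_FS | EX_CLONE_FILES | EX_CLONE_SIGNAL
         | EX_CLONE_SETTLS | EX_CLONE_PARENT_SETTID
         | EX_CLONE_CHILD_CLEARTID | EX_CLONE_SYSVSEM;
}

/* Derive the child's identifiers from the caller's; new_id is fresh. */
struct ex_task ex_clone(const struct ex_task *caller, unsigned flags, int new_id)
{
    struct ex_task child;

    assert(caller != NULL);
    child.tid = new_id;                 /* the TID is always unique */
    if (flags & EX_CLONE_THREAD) {
        child.pid  = caller->pid;       /* join the caller's thread group */
        child.ppid = caller->ppid;      /* and share its parent */
    } else {
        child.pid  = new_id;            /* ordinary process: PID == TID */
        child.ppid = caller->pid;
    }
    return child;
}
```
]]></programlisting>

<para>Cloning with <literal>ex_nptl_flags()</literal> yields a
task that keeps the caller's PID and parent but has a new TID,
whereas cloning with no flags behaves like a plain
fork.</para>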
</sect4>
</sect3>
</sect2>

<sect2 xml:id="what-is-emu">
<title>What is emulation</title>

<para>According to a dictionary definition, emulation is the
ability of a program or device to imitate another program or
device. This is achieved by providing the same reaction to a
given stimulus as the emulated object. In practice, the
software world mostly sees three types of emulation: a
program used to emulate a machine (QEMU, various game console
emulators etc.), software emulation of a hardware facility
(OpenGL emulators, floating point unit emulation etc.) and
operating system emulation (either in the kernel of the
operating system or as a userspace program).</para>

<para>Emulation is usually used where using the
original component is not feasible or not possible at all. For
example, someone might want to use a program developed for a
different operating system than the one they use. Then emulation
comes in handy. Sometimes there is no other way but to use
emulation, e.g. when the hardware device you try to use does
not exist (yet/anymore). This happens often when porting an
operating system to a new (non-existent) platform. Sometimes it
is just cheaper to emulate.</para>

<para>From an implementation point of view, there are
two main approaches to the implementation of emulation. You
can emulate the whole thing: accepting the possible inputs
of the original object, maintaining inner state and emitting
correct output based on the state and/or input. This kind of
emulation does not require any special conditions and
basically can be implemented anywhere for any device/program.
The drawback is that implementing such emulation is quite
difficult, time-consuming and error-prone. In some cases we
can use a simpler approach. Imagine you want to emulate a
printer that prints from left to right on a printer that
prints from right to left. It is obvious that there is no
need for a complex emulation layer; simply reversing the
printed text is sufficient. Sometimes the
emulating environment is very similar to the emulated one, so
just a thin layer of translation is necessary to provide
fully working emulation! As you can see, this is much less
demanding to implement, and so less time-consuming and
error-prone than the previous approach. But the necessary
condition is that the two environments must be similar enough.
The third approach combines the two previous ones. Most of the
time the objects do not provide the same capabilities, so when
emulating the more powerful one on the less powerful one we have
to emulate the missing features with the full emulation
described above.</para>

<para>This master's thesis deals with emulation of &unix; on
&unix;, which is exactly the case where only a thin layer of
translation is sufficient to provide full emulation. The
&unix; API consists of a set of syscalls, which are usually
self-contained and do not affect any global kernel
state.</para>

<para>There are a few syscalls that affect inner state, but this
can be dealt with by providing some structures that maintain
the extra state.</para>

<para>No emulation is perfect and emulations tend to lack some
parts, but this usually does not cause any serious drawbacks.
Imagine a game console emulator that emulates everything but
music output. No doubt the games are playable and one
can use the emulator. It might not be as comfortable as the
original game console, but it is an acceptable compromise between
price and comfort.</para>

<para>The same goes for the &unix; API. Most programs can live
with a very limited set of syscalls working. Those syscalls
tend to be the oldest ones (&man.read.2;/&man.write.2;,
&man.fork.2; family, &man.signal.3; handling, &man.exit.3;,
&man.socket.2; API), hence they are easy to emulate because their
semantics are shared among all &unix;es that exist
today.</para>
</sect2>
</sect1>

<sect1 xml:id="freebsd-emulation">
|
|
<title>Emulation</title>
|
|
|
|
<sect2>
|
|
<title>How emulation works in &os;</title>
|
|
|
|
<para>As stated earlier, &os; supports running binaries from
|
|
several other &unix;es. This works because &os; has an
|
|
abstraction called the execution class loader. This wedges
|
|
into the &man.execve.2; syscall, so when &man.execve.2; is
|
|
about to execute a binary it examines its type.</para>
|
|
|
|
<para>There are basically two types of binaries in &os;:
shell-like text scripts, which are identified by
<literal>#!</literal> as their first two characters, and normal
(typically <firstterm>ELF</firstterm>) binaries, which are a
representation of a compiled executable object. The vast
majority (one could say all of them) of binaries in &os; are
of type ELF. ELF files contain a header, which specifies
the OS ABI for this ELF file. By reading this information,
the operating system can accurately determine what type of
binary the given file is.</para>
|
|
|
|
<para>Every OS ABI must be registered in the &os; kernel. This
applies to the &os; native OS ABI as well. When
&man.execve.2; executes a binary, it iterates through the list
of registered ABIs, and when it finds the right one it starts
to use the information contained in the OS ABI description
(its syscall table, <literal>errno</literal> translation
table, etc.). So every time the process calls a syscall, it
uses its own set of syscalls instead of some global one. This
provides a very elegant and easy way of supporting
execution of various binary formats.</para>
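<para>The lookup described above can be sketched in user space as
follows. This is only a minimal illustration under stated
assumptions: the <literal>abi_desc</literal> structure, the
<function>abi_lookup</function> and
<function>abi_syscall</function> helpers and the toy syscall
tables are hypothetical names invented for this example, not the
kernel's actual data structures.</para>

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a per-ABI description. */
typedef int (*syscall_fn)(int arg);

struct abi_desc {
    unsigned char     osabi;      /* value read from the ELF header */
    const char       *name;       /* human-readable ABI name */
    const syscall_fn *table;      /* per-ABI syscall table */
    size_t            nsyscalls;  /* number of entries in table */
};

/* Toy syscalls: each ABI answers differently so the dispatch is visible. */
int native_getpid(int arg) { (void)arg; return 100; }
int linux_getpid(int arg)  { (void)arg; return 200; }

static const syscall_fn native_table[] = { native_getpid };
static const syscall_fn linux_table[]  = { linux_getpid };

/* Registered ABIs; the native ABI is registered like any other.
 * The osabi values 9 and 3 are illustrative tags for this sketch. */
static const struct abi_desc registered_abis[] = {
    { 9, "native", native_table, 1 },
    { 3, "linux",  linux_table,  1 },
};

/* Walk the registered ABIs and return the one matching the header. */
const struct abi_desc *
abi_lookup(unsigned char osabi)
{
    size_t i;

    for (i = 0; i < sizeof(registered_abis) / sizeof(registered_abis[0]); i++)
        if (registered_abis[i].osabi == osabi)
            return &registered_abis[i];
    return NULL;
}

/* Dispatch a syscall through the table of the process's own ABI. */
int
abi_syscall(const struct abi_desc *abi, size_t num, int arg)
{
    if (abi == NULL || num >= abi->nsyscalls)
        return -1;
    return abi->table[num](arg);
}
```

<para>In this sketch the same syscall number yields a different
result depending on which ABI description the process was bound
to at exec time, which is the property the paragraph above relies
on.</para>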
|
|
|
|
<para>The nature of emulation of different OSes (and also some
other subsystems) led developers to invent an event handler
mechanism. There are various places in the kernel where a
list of event handlers is called. Every subsystem can
register an event handler, and the handlers are called
accordingly. For example, when a process exits, a handler is
called that can clean up whatever the subsystem needs
cleaned.</para>
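<para>A user space sketch of such a handler list is shown below.
It is not the kernel interface; the names
<function>exit_handler_register</function>,
<function>exit_handlers_invoke</function> and
<function>emul_cleanup</function> are hypothetical and exist only
to illustrate registration followed by invocation at a fixed
point (here, a simulated process exit).</para>

```c
#include <stddef.h>

#define MAX_HANDLERS 8

/* Signature of a handler run when a (simulated) process exits. */
typedef void (*exit_handler_fn)(int pid, void *arg);

struct exit_handler {
    exit_handler_fn fn;   /* function to call */
    void           *arg;  /* opaque per-subsystem argument */
};

static struct exit_handler exit_handlers[MAX_HANDLERS];
static size_t n_exit_handlers;

/* Register a handler; returns 0 on success, -1 if it cannot be added. */
int
exit_handler_register(exit_handler_fn fn, void *arg)
{
    if (fn == NULL || n_exit_handlers == MAX_HANDLERS)
        return -1;
    exit_handlers[n_exit_handlers].fn = fn;
    exit_handlers[n_exit_handlers].arg = arg;
    n_exit_handlers++;
    return 0;
}

/* Run every registered handler in registration order.
 * Returns the number of handlers that were invoked. */
size_t
exit_handlers_invoke(int pid)
{
    size_t i;

    for (i = 0; i < n_exit_handlers; i++)
        exit_handlers[i].fn(pid, exit_handlers[i].arg);
    return n_exit_handlers;
}

/* Example subsystem handler: records which pid it cleaned up after. */
void
emul_cleanup(int pid, void *arg)
{
    *(int *)arg = pid;
}
```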
|
|
|
|
<para>These simple facilities provide basically everything that
is needed for the emulation infrastructure, and in fact they
are the only things necessary to implement the
&linux; emulation layer.</para>
|
|
</sect2>
|
|
|
|
<sect2 xml:id="freebsd-common-primitives">
|
|
<title>Common primitives in the &os; kernel</title>
|
|
|
|
<para>Emulation layers need some support from the operating
|
|
system. I am going to describe some of the supported
|
|
primitives in the &os; operating system.</para>
|
|
|
|
<sect3 xml:id="freebsd-locking-primitives">
|
|
<title>Locking primitives</title>
|
|
|
|
<para>Contributed by: &a.attilio.email;</para>
|
|
|
|
<para>The &os; synchronization primitive set is based on the
idea of supplying a rather large number of different primitives,
so that the most suitable one can be used in each
particular situation.</para>

<para>From a high-level point of view, you can consider three
kinds of synchronization primitives in the &os;
kernel:</para>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>atomic operations and memory barriers</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>locks</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>scheduling barriers</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para>Descriptions of the three families follow. For
every lock, you should check the linked manpage
(where possible) for more detailed explanations.</para>
|
|
|
|
<sect4 xml:id="freebsd-atomic-op">
|
|
<title>Atomic operations and memory barriers</title>
|
|
|
|
<para>Atomic operations are implemented through a set of
functions performing simple arithmetic on memory operands
in an atomic way with respect to external events
(interrupts, preemption, etc.). Atomic operations can
guarantee atomicity only on small data types (on the order
of the size of the architecture's <literal>long</literal>
C data type), so they should rarely be used
directly in end-level code, except for very
simple operations (like setting a flag in a bitmap, for
example). In fact, it is rather easy and common to
write incorrect code based only on atomic
operations (usually referred to as lock-less code). The &os;
kernel offers a way to perform atomic operations in
conjunction with a memory barrier. A memory barrier
guarantees that an atomic operation happens
in a specified order with respect to other
memory accesses. For example, if we need an atomic
operation to happen only after all other pending writes (in
terms of the activity of instruction reordering buffers) have
completed, we need to explicitly use a memory barrier in
conjunction with this atomic operation. It is therefore easy to
understand why memory barriers play a key role in building
higher-level locks (such as refcounts, mutexes,
etc.). For a detailed explanation of atomic operations,
please refer to &man.atomic.9;. It is worth noting, however,
that atomic operations (and memory barriers as
well) should ideally only be used for building
front-end locks (such as mutexes).</para>
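<para>As a user space analogue only, the pairing of an atomic
operation with a memory barrier can be sketched with the C11
<filename>stdatomic.h</filename> interface. This is an
assumption made for illustration: the kernel uses the
&man.atomic.9; functions, not C11 atomics, and the names
<function>publish</function>, <function>consume</function> and
<function>set_flag</function> are hypothetical.</para>

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;        /* plain data published by the writer */
static atomic_bool ready;  /* flag guarding the payload */

/* Writer: the release store orders the payload write before the flag. */
void
publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Reader: the acquire load orders the flag read before the payload read.
 * Returns false if nothing has been published yet. */
bool
consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}

/* The "very simple operation" case: atomically set one bit in a flag
 * word and return the previous value of the word. */
unsigned long
set_flag(atomic_ulong *word, unsigned bit)
{
    return atomic_fetch_or_explicit(word, 1UL << bit, memory_order_relaxed);
}
```

<para>The flag-setting helper needs no ordering at all, while the
publish/consume pair is only correct because of the release and
acquire barriers. That difference is why hand-written lock-less
code is easy to get wrong.</para>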
|
|
</sect4>
|
|
|
|
<sect4 xml:id="freebsd-refcounts">
|
|
<title>Refcounts</title>
|
|
|
|
<para>Refcounts are interfaces for handling reference
counters. They are implemented through atomic operations
and are intended to be used only in cases where the
reference counter is the only thing to be protected,
so that even something like a spin mutex is discouraged. Using
the refcount interface for structures where a mutex is
already used is often wrong, since the reference counter
should probably be updated in some already protected path
instead. A manpage discussing refcount does not exist
currently; check <filename>sys/refcount.h</filename> for an
overview of the existing API.</para>
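<para>The idea of a reference counter built purely on atomic
operations can be sketched as follows. This is a user space
illustration using C11 atomics, not the interface from
<filename>sys/refcount.h</filename>; the
<literal>obj</literal> structure and the
<function>obj_init</function>,
<function>obj_acquire</function> and
<function>obj_release</function> names are hypothetical.</para>

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy object whose only shared state is its reference counter. */
struct obj {
    atomic_uint refs;   /* number of outstanding references */
    int         freed;  /* stand-in for actually freeing the object */
};

/* A new object starts with one reference held by its creator. */
void
obj_init(struct obj *o)
{
    atomic_init(&o->refs, 1);
    o->freed = 0;
}

/* Take an additional reference. */
void
obj_acquire(struct obj *o)
{
    atomic_fetch_add_explicit(&o->refs, 1, memory_order_relaxed);
}

/* Drop a reference; returns true if this was the last one. */
bool
obj_release(struct obj *o)
{
    if (atomic_fetch_sub_explicit(&o->refs, 1, memory_order_acq_rel) == 1) {
        o->freed = 1;
        return true;
    }
    return false;
}
```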
|
|
</sect4>
|
|
|
|
<sect4 xml:id="freebsd-locks">
|
|
<title>Locks</title>
|
|
|
|
<para>The &os; kernel has a large number of lock classes. Every
lock is defined by some particular properties, but probably the
most important is what happens to contending threads (in
other words, the behavior of threads unable to acquire the
lock). &os;'s locking scheme presents three different
behaviors for contenders:</para>
|
|
|
|
<orderedlist>
|
|
<listitem>
|
|
<para>spinning</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>blocking</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>sleeping</para>
|
|
</listitem>
|
|
</orderedlist>
|
|
|
|
<note>
|
|
<para>The numbering is not arbitrary.</para>
|
|
</note>
|
|
</sect4>
|
|
|
|
<sect4 xml:id="freebsd-spinlocks">
|
|
<title>Spinning locks</title>
|
|
|
|
<para>Spin locks let waiters spin until they can
acquire the lock. An important matter to deal with is
the case where a thread contends on a spin lock and is not
descheduled. Since the &os; kernel is preemptive, this
exposes spin locks to the risk of deadlocks, which can be
avoided simply by disabling interrupts while the locks are held.
For this and other reasons (like the lack of priority
propagation support, poor load balancing
between CPUs, etc.), spin locks are intended to protect
very small paths of code, or ideally not to be used at all
unless explicitly required (explained later).</para>
|
|
</sect4>
|
|
|
|
<sect4 xml:id="freebsd-blocking">
|
|
<title>Blocking</title>
|
|
|
|
<para>Block locks let waiters be descheduled and blocked
until the lock owner drops the lock and wakes up one or
more contenders. In order to avoid starvation issues,
blocking locks do priority propagation from the waiters to
the owner. Block locks must be implemented through the
turnstile interface and are intended to be the most used
kind of lock in the kernel, unless particular conditions
apply.</para>
|
|
</sect4>
|
|
|
|
<sect4 xml:id="freebsd-sleeping">
|
|
<title>Sleeping</title>
|
|
|
|
<para>Sleep locks let waiters be descheduled and fall
asleep until the lock holder drops the lock and wakes up
one or more waiters. Since sleep locks are intended to
protect large paths of code and to cater for asynchronous
events, they do not do any form of priority propagation.
They must be implemented through the &man.sleepqueue.9;
interface.</para>
|
|
|
|
<para>The order used to acquire locks is very important, not
only because of the possibility of deadlock due to lock order
reversals, but also because lock acquisition should follow
specific rules linked to the nature of the locks. If you
look at the list of contender behaviors above, the practical
rule is that if a thread holds a lock of level n (where the
level is the number listed next to the kind of lock), it is not
allowed to acquire a lock of a higher level, since this would
break the specified semantics for a path. For example, if
a thread holds a block lock (level 2), it is allowed to
acquire a spin lock (level 1) but not a sleep lock (level
3), since block locks are intended to protect smaller
paths than sleep locks (these rules do not apply to atomic
operations or scheduling barriers, however).</para>
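<para>The level rule can be written down as a one-line check.
The enumeration and the <function>may_acquire</function> helper
below are hypothetical names used only to restate the rule from
the previous paragraph; they are not part of any kernel
interface.</para>

```c
#include <stdbool.h>

/* Levels as numbered in the list of contender behaviors. */
enum lock_level {
    LOCK_SPIN  = 1,
    LOCK_BLOCK = 2,
    LOCK_SLEEP = 3
};

/* A thread holding a lock of level `held` may only acquire a lock
 * whose level is not higher than `held`. */
bool
may_acquire(enum lock_level held, enum lock_level wanted)
{
    return wanted <= held;
}
```

<para>For example, <literal>may_acquire(LOCK_BLOCK,
LOCK_SPIN)</literal> is true, while
<literal>may_acquire(LOCK_BLOCK, LOCK_SLEEP)</literal> is
false.</para>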
|
|
|
|
<para>This is a list of locks with their respective
behaviors:</para>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>spin mutex - spinning - &man.mutex.9;</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>sleep mutex - blocking - &man.mutex.9;</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>pool mutex - blocking - &man.mtx.pool.9;</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>sleep family - sleeping - &man.sleep.9; (pause,
tsleep, msleep, msleep spin, msleep rw, msleep sx)</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>condvar - sleeping - &man.condvar.9;</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>rwlock - blocking - &man.rwlock.9;</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>sxlock - sleeping - &man.sx.9;</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>lockmgr - sleeping - &man.lockmgr.9;</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>semaphores - sleeping - &man.sema.9;</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para>Among these locks only mutexes, sxlocks, rwlocks and
|
|
lockmgrs are intended to handle recursion, but currently
|
|
recursion is only supported by mutexes and
|
|
lockmgrs.</para>
|
|
</sect4>
|
|
|
|
<sect4 xml:id="freebsd-scheduling">
|
|
<title>Scheduling barriers</title>
|
|
|
|
<para>Scheduling barriers are intended to be used in order
to drive the scheduling of threads. They consist mainly of
three different mechanisms:</para>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>critical sections (and preemption)</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>sched_bind</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>sched_pin</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para>Generally, these should be used only in a particular
context. Even though they can often replace locks, they
should be avoided because they prevent lock debugging tools
(such as &man.witness.4;) from diagnosing even simple
problems.</para>
|
|
</sect4>
|
|
|
|
<sect4 xml:id="freebsd-critical">
|
|
<title>Critical sections</title>
|
|
|
|
<para>The &os; kernel has been made preemptive basically to
deal with interrupt threads. In fact, in order to avoid
high interrupt latency, time-sharing priority threads can
be preempted by interrupt threads (in this way, interrupt
threads do not need to wait to be scheduled as the normal path
would require). Preemption, however, introduces new race
conditions that need to be handled as well. Often, the
simplest way to deal with preemption is to
disable it completely. A critical section defines a piece
of code (delimited by the pair of functions
&man.critical.enter.9; and &man.critical.exit.9;) where
preemption is guaranteed not to happen (until the
protected code is fully executed). This can often replace
a lock effectively, but it should be used carefully in order
not to lose the whole advantage that preemption
brings.</para>
|
|
</sect4>
|
|
|
|
<sect4 xml:id="freebsd-schedpin">
|
|
<title>sched_pin/sched_unpin</title>
|
|
|
|
<para>Another way to deal with preemption is the
<function>sched_pin()</function> interface. If a piece of
code is enclosed in the <function>sched_pin()</function>
and <function>sched_unpin()</function> pair of functions,
it is guaranteed that the respective thread, even if it
can be preempted, will always be executed on the same
CPU. Pinning is very effective in the particular case
where we have to access per-CPU data and we assume
other threads will not change those data. Under that
assumption, a critical section would be a stronger
guarantee than our code needs.</para>
|
|
</sect4>
|
|
|
|
<sect4 xml:id="freebsd-schedbind">
|
|
<title>sched_bind/sched_unbind</title>
|
|
|
|
<para><function>sched_bind</function> is an API used
to bind a thread to a particular CPU for the whole
time it executes the code, until a
<function>sched_unbind</function> function call
unbinds it. This feature has a key role in situations
where you cannot trust the current state of the CPUs (for
example, at very early stages of boot), as you want to
avoid having your thread migrate to inactive CPUs. Since
<function>sched_bind</function> and
<function>sched_unbind</function> manipulate internal
scheduler structures, calls to them must be enclosed in a
<function>sched_lock</function> acquisition and
release.</para>
|
|
</sect4>
|
|
</sect3>
|
|
|
|
<sect3 xml:id="freebsd-proc">
|
|
<title>Proc structure</title>
|
|
|
|
<para>Various emulation layers sometimes require some
additional per-process data. A layer can manage separate
structures (a list, a tree, etc.) containing these data for
every process, but this tends to be slow and memory
consuming. To solve this problem, the &os;
<literal>proc</literal> structure contains
<literal>p_emuldata</literal>, which is a void pointer to
some emulation layer specific data. This
<literal>proc</literal> entry is protected by the proc
mutex.</para>
|
|
|
|
<para>The &os; <literal>proc</literal> structure contains a
<literal>p_sysent</literal> entry that identifies which ABI
this process is running. In fact, it is a pointer to the
<literal>sysentvec</literal> described above. So by
comparing this pointer to the address where the
<literal>sysentvec</literal> structure for the given ABI is
stored, we can effectively determine whether the process
belongs to our emulation layer. The code typically looks
like:</para>

<programlisting>if (__predict_true(p-&gt;p_sysent != &amp;elf_linux_sysvec))
	  return;</programlisting>
|
|
|
|
<para>As you can see, we effectively use the
|
|
<literal>__predict_true</literal> modifier to collapse the
|
|
most common case (&os; process) to a simple return operation
|
|
thus preserving high performance. This code should be
|
|
turned into a macro because currently it is not very
|
|
flexible, i.e. we do not support &linux;64 emulation nor
|
|
A.OUT &linux; processes on i386.</para>
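<para>One possible shape for such a macro is sketched below in
user space. Everything here is an assumption made for
illustration: the cut-down <literal>sysentvec</literal> and
<literal>proc</literal> structures, the
<literal>toy_</literal>-prefixed objects and the
<literal>TOY_IS_LINUX_PROC</literal> and
<literal>TOY_PREDICT_TRUE</literal> macros are hypothetical and
do not exist in the kernel.</para>

```c
/* Toy stand-ins for the kernel structures; only the fields used here. */
struct sysentvec {
    const char *sv_name;
};

struct proc {
    const struct sysentvec *p_sysent;
};

static const struct sysentvec toy_native_sysvec     = { "native" };
static const struct sysentvec toy_linux_elf_sysvec  = { "linux ELF" };
static const struct sysentvec toy_linux_aout_sysvec = { "linux a.out" };

/* Branch hint on GCC/Clang; plain expression on other compilers. */
#if defined(__GNUC__)
#define TOY_PREDICT_TRUE(e)  __builtin_expect(!!(e), 1)
#else
#define TOY_PREDICT_TRUE(e)  (e)
#endif

/* One place that knows every sysentvec belonging to the layer. */
#define TOY_IS_LINUX_PROC(p)                        \
    ((p)->p_sysent == &toy_linux_elf_sysvec ||      \
     (p)->p_sysent == &toy_linux_aout_sysvec)

/* Example hook: returns 0 at once for foreign processes (the common
 * case) and 1 after doing layer-specific work for our processes. */
int
linux_only_hook(const struct proc *p)
{
    if (TOY_PREDICT_TRUE(!TOY_IS_LINUX_PROC(p)))
        return 0;
    return 1;
}
```

<para>Supporting another ABI variant would then mean extending
the macro in one place instead of editing every
comparison.</para>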
|
|
</sect3>
|
|
|
|
<sect3 xml:id="freebsd-vfs">
|
|
<title>VFS</title>
|
|
|
|
<para>The &os; VFS subsystem is very complex, but the &linux;
emulation layer uses just a small subset via a well defined
API. It can operate either on vnodes or on file handlers.
A vnode is a virtual node, i.e. a representation of a
node in VFS. Another representation is a file handler,
which represents an opened file from the perspective of a
process. A file handler can represent a socket or an
ordinary file. A file handler contains a pointer to its
vnode. More than one file handler can point to the same
vnode.</para>
|
|
|
|
<sect4 xml:id="freebsd-namei">
|
|
<title>namei</title>
|
|
|
|
<para>The &man.namei.9; routine is a central entry point to
pathname lookup and translation. It traverses the path
component by component from the starting point to the end
point using a lookup function, which is internal to VFS. The
&man.namei.9; routine can cope with symlinks and with absolute
and relative paths. When a path is looked up using
&man.namei.9;, it is entered into the name cache. This
behavior can be suppressed. This routine is used all over
the kernel and its performance is critical.</para>
|
|
</sect4>
|
|
|
|
<sect4 xml:id="freebsd-vn">
|
|
<title>vn_fullpath</title>
|
|
|
|
<para>The &man.vn.fullpath.9; function makes a best effort
to traverse the VFS name cache and return a path for a given
(locked) vnode. This process is unreliable but works just
fine for the most common cases. It is unreliable
because it relies on the VFS cache (it does not traverse the
on-medium structures), it does not work with hardlinks,
etc. This routine is used in several places in the
Linuxulator.</para>
|
|
</sect4>
|
|
|
|
<sect4 xml:id="freebsd-vnode">
|
|
<title>Vnode operations</title>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para><function>fgetvp</function> - given a thread and a
|
|
file descriptor number it returns the associated
|
|
vnode</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>&man.vn.lock.9; - locks a vnode</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><function>vn_unlock</function> - unlocks a
|
|
vnode</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>&man.VOP.READDIR.9; - reads a directory referenced
|
|
by a vnode</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>&man.VOP.GETATTR.9; - gets attributes of a file or
|
|
a directory referenced by a vnode</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>&man.VOP.LOOKUP.9; - looks up a path to a given
|
|
directory</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>&man.VOP.OPEN.9; - opens a file referenced by a
|
|
vnode</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>&man.VOP.CLOSE.9; - closes a file referenced by a
|
|
vnode</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>&man.vput.9; - decrements the use count for a
|
|
vnode and unlocks it</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>&man.vrele.9; - decrements the use count for a
|
|
vnode</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>&man.vref.9; - increments the use count for a
|
|
vnode</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</sect4>
|
|
|
|
<sect4 xml:id="freebsd-file-handler">
|
|
<title>File handler operations</title>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para><function>fget</function> - given a thread and a
|
|
file descriptor number it returns associated file
|
|
handler and references it</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><function>fdrop</function> - drops a reference to
|
|
a file handler</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><function>fhold</function> - references a file
|
|
handler</para>
|
|
</listitem>
|
|
</itemizedlist>
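<para>The relationship between file handlers, their reference
counts and the vnode use count can be sketched in user space as
follows. The structures and the <literal>toy_</literal>-prefixed
functions are hypothetical simplifications invented for this
example; they only mirror the roles of
<function>fhold</function> and <function>fdrop</function>
listed above, not their real signatures or locking.</para>

```c
#include <stddef.h>

/* Toy vnode: just a use count. */
struct toy_vnode {
    int v_usecount;
};

/* Toy file handler: a reference count and a pointer to its vnode. */
struct toy_file {
    struct toy_vnode *f_vnode;
    int               f_count;
};

/* Open: the new file handler holds one use of the vnode. */
void
toy_file_open(struct toy_file *fp, struct toy_vnode *vp)
{
    fp->f_vnode = vp;
    fp->f_count = 1;
    vp->v_usecount++;
}

/* Take an additional reference to the file handler. */
void
toy_fhold(struct toy_file *fp)
{
    fp->f_count++;
}

/* Drop a reference; when the last one goes away, release the vnode.
 * Returns 1 if the file handler was torn down, 0 otherwise. */
int
toy_fdrop(struct toy_file *fp)
{
    if (--fp->f_count > 0)
        return 0;
    fp->f_vnode->v_usecount--;
    fp->f_vnode = NULL;
    return 1;
}
```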
|
|
</sect4>
|
|
</sect3>
|
|
</sect2>
|
|
</sect1>
|
|
|
|
<sect1 xml:id="md">
|
|
<title>&linux; emulation layer - MD part</title>
|
|
|
|
<para>This section deals with the implementation of the &linux;
emulation layer in the &os; operating system. It first describes
the machine dependent part, covering how and where interaction
between userland and kernel is implemented: syscalls,
signals, ptrace, traps and stack fixup. This part discusses i386,
but it is written generally so other architectures should not
differ very much. The next part is the machine independent part
of the Linuxulator. This section only covers i386 and ELF
handling. A.OUT is obsolete and untested.</para>
|
|
|
|
<sect2 xml:id="syscall-handling">
|
|
<title>Syscall handling</title>
|
|
|
|
<para>Syscall handling is mostly written in
<filename>linux_sysvec.c</filename>, which covers most of the
routines pointed to by the <literal>sysentvec</literal>
structure. When a &linux; process running on &os; issues a
syscall, the general syscall routine calls the &linux;
prepsyscall routine for the &linux; ABI.</para>
|
|
|
|
<sect3 xml:id="linux-prepsyscall">
|
|
<title>&linux; prepsyscall</title>
|
|
|
|
<para>&linux; passes arguments to syscalls via registers (that
is why it is limited to 6 parameters on i386) while &os;
uses the stack. The &linux; prepsyscall routine must copy
parameters from registers to the stack. The order of the
registers is: <varname>%ebx</varname>,
<varname>%ecx</varname>, <varname>%edx</varname>,
<varname>%esi</varname>, <varname>%edi</varname>,
<varname>%ebp</varname>. The catch is that this is true for
only <emphasis>most</emphasis> of the syscalls. Some (most
notably <function>clone</function>) use a different order,
but luckily this is easy to fix by inserting a dummy parameter
in the <function>linux_clone</function> prototype.</para>
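<para>The register-to-argument copy can be sketched as follows.
This is a user space illustration under stated assumptions: the
<literal>toy_trapframe</literal> structure contains only the six
argument registers named above, and
<function>toy_prepsyscall</function> is a hypothetical helper,
not the routine from
<filename>linux_sysvec.c</filename>.</para>

```c
#include <stddef.h>

#define TOY_LINUX_MAXARGS 6

/* Toy subset of an i386 trap frame: only the argument registers. */
struct toy_trapframe {
    long tf_ebx, tf_ecx, tf_edx, tf_esi, tf_edi, tf_ebp;
};

/* Copy up to six register-passed arguments into a flat array, in the
 * order ebx, ecx, edx, esi, edi, ebp.  Returns the number copied. */
size_t
toy_prepsyscall(const struct toy_trapframe *tf, long *args, size_t nargs)
{
    const long regs[TOY_LINUX_MAXARGS] = {
        tf->tf_ebx, tf->tf_ecx, tf->tf_edx,
        tf->tf_esi, tf->tf_edi, tf->tf_ebp
    };
    size_t i;

    if (nargs > TOY_LINUX_MAXARGS)
        nargs = TOY_LINUX_MAXARGS;
    for (i = 0; i < nargs; i++)
        args[i] = regs[i];
    return nargs;
}
```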
|
|
</sect3>
|
|
|
|
<sect3 xml:id="syscall-writing">
|
|
<title>Syscall writing</title>
|
|
|
|
<para>Every syscall implemented in the Linuxulator must have
|
|
its prototype with various flags in
|
|
<filename>syscalls.master</filename>. The form of the file
|
|
is:</para>
|
|
|
|
<programlisting>...
2	AUE_FORK	STD	{ int linux_fork(void); }
...
6	AUE_CLOSE	NOPROTO	{ int close(int fd); }
...</programlisting>
|
|
|
|
<para>The first column represents the syscall number. The
second column is for auditing support. The third column
represents the syscall type. It is one of
<literal>STD</literal>, <literal>OBSOL</literal>,
<literal>NOPROTO</literal>, or <literal>UNIMPL</literal>.
<literal>STD</literal> is a standard syscall with full
prototype and implementation. <literal>OBSOL</literal> is
obsolete and defines just the prototype.
<literal>NOPROTO</literal> means that the syscall is
implemented elsewhere, so the ABI prefix is not prepended, etc.
<literal>UNIMPL</literal> means that the syscall will be
substituted with the <function>nosys</function> syscall (a
syscall that just prints a message about the syscall not
being implemented and returns
<literal>ENOSYS</literal>).</para>
|
|
|
|
<para>From <filename>syscalls.master</filename> a script
|
|
generates three files: <filename>linux_syscall.h</filename>,
|
|
<filename>linux_proto.h</filename> and
|
|
<filename>linux_sysent.c</filename>. The
|
|
<filename>linux_syscall.h</filename> contains definitions of
|
|
syscall names and their numerical value, e.g.:</para>
|
|
|
|
<programlisting>...
#define LINUX_SYS_linux_fork 2
...
#define LINUX_SYS_close 6
...</programlisting>
|
|
|
|
<para>The <filename>linux_proto.h</filename> contains
|
|
structure definitions of arguments to every syscall,
|
|
e.g.:</para>
|
|
|
|
<programlisting>struct linux_fork_args {
  register_t dummy;
};</programlisting>
|
|
|
|
<para>And finally, <filename>linux_sysent.c</filename>
contains the structure describing the system entry table, used
to actually dispatch a syscall, e.g.:</para>

<programlisting>{ 0, (sy_call_t *)linux_fork, AUE_FORK, NULL, 0, 0 }, /* 2 = linux_fork */
{ AS(close_args), (sy_call_t *)close, AUE_CLOSE, NULL, 0, 0 }, /* 6 = close */</programlisting>
|
|
|
|
<para>As you can see, <function>linux_fork</function> is
implemented in the Linuxulator itself, so the definition is of
<literal>STD</literal> type. It has no arguments, which is
reflected by the dummy argument structure. On the other
hand, <function>close</function> is just an alias for the real
&os; &man.close.2;, so it has no &linux; arguments structure
associated with it, and in the system entry table it is not
prefixed with linux because it calls the real &man.close.2; in
the kernel.</para>
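<para>Dispatch through such a table can be sketched in user
space as follows. The <literal>toy_sysent</literal> structure
keeps only an argument count and a function pointer, and all
<literal>toy_</literal>-prefixed names are hypothetical; the
real entries carry more fields, as the excerpt above
shows.</para>

```c
#include <errno.h>
#include <stddef.h>

typedef int (*toy_sy_call_t)(void *args);

/* Cut-down system entry: argument count and implementation. */
struct toy_sysent {
    int           sy_narg;
    toy_sy_call_t sy_call;
};

struct toy_close_args {
    int fd;
};

/* Layer-specific implementation taking no arguments. */
int
toy_linux_fork(void *args)
{
    (void)args;
    return 0;
}

/* Shared native implementation reused by the layer. */
int
toy_close(void *args)
{
    const struct toy_close_args *ca = args;

    return ca->fd < 0 ? EBADF : 0;
}

/* Fallback for entries that are not implemented. */
int
toy_nosys(void *args)
{
    (void)args;
    return ENOSYS;
}

/* Table indexed by syscall number; unlisted slots stay NULL. */
static const struct toy_sysent toy_linux_sysent[] = {
    [2] = { 0, toy_linux_fork },
    [6] = { 1, toy_close },
};

/* Look the number up and call through the table. */
int
toy_dispatch(size_t num, void *args)
{
    size_t n = sizeof(toy_linux_sysent) / sizeof(toy_linux_sysent[0]);

    if (num >= n || toy_linux_sysent[num].sy_call == NULL)
        return toy_nosys(args);
    return toy_linux_sysent[num].sy_call(args);
}
```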
|
|
</sect3>
|
|
|
|
<sect3 xml:id="dummy-syscalls">
|
|
<title>Dummy syscalls</title>
|
|
|
|
<para>The &linux; emulation layer is not complete, as some
syscalls are not implemented properly and some are not
implemented at all. The emulation layer employs a facility
to mark unimplemented syscalls with the
<literal>DUMMY</literal> macro. These dummy definitions
reside in <filename>linux_dummy.c</filename> in the form of
<literal>DUMMY(syscall);</literal>, which is then translated
to various syscall auxiliary files, and the implementation
consists of printing a message saying that this syscall is
not implemented. The <literal>UNIMPL</literal> prototype is
not used because we want to be able to identify the name of
the syscall that was called, in order to know which syscalls
are most important to implement.</para>
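<para>A macro of this kind can be sketched as follows. This is
not the macro from the &os; sources: the
<literal>TOY_DUMMY</literal> name, the message text, the
<literal>ENOSYS</literal> return value and the two example
syscall names are assumptions made for illustration.</para>

```c
#include <errno.h>
#include <stdio.h>

/* Define a stub named toy_linux_<name> that reports its own name.
 * The trailing struct declaration swallows the semicolon at the
 * point of use, so TOY_DUMMY(x); is a complete declaration. */
#define TOY_DUMMY(name)                                             \
int                                                                 \
toy_linux_##name(void)                                              \
{                                                                   \
    fprintf(stderr, "linux: syscall %s not implemented\n", #name);  \
    return ENOSYS;                                                  \
}                                                                   \
struct toy_dummy_hack_##name

/* Hypothetical syscall names, used only as examples. */
TOY_DUMMY(example_one);
TOY_DUMMY(example_two);
```

<para>Because the syscall name is stringified into the message,
a log of these messages shows which missing syscalls programs
actually call.</para>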
|
|
</sect3>
|
|
</sect2>
|
|
|
|
<sect2 xml:id="signal-handling">
|
|
<title>Signal handling</title>
|
|
|
|
<para>Signal handling is done generally in the &os; kernel for
all binary compatibilities with a call to a compat-dependent
layer. The &linux; compatibility layer defines the
<function>linux_sendsig</function> routine for this
purpose.</para>
|
|
|
|
<sect3 xml:id="linux-sendsig">
|
|
<title>&linux; sendsig</title>
|
|
|
|
<para>This routine first checks whether the signal handler has
been installed with <literal>SA_SIGINFO</literal>, in which case
it calls the <function>linux_rt_sendsig</function> routine
instead. Otherwise, it allocates (or reuses an already
existing) signal handler context, then it builds a list of
arguments for the signal handler. It translates the signal
number based on the signal translation table, assigns a
handler, and translates the sigset. Then it saves context for
the <function>sigreturn</function> routine (various registers,
translated trap number and signal mask). Finally, it copies
out the signal context to userspace and prepares the context
for the actual signal handler to run.</para>
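<para>The signal number translation step can be sketched as a
simple table lookup. The table below is a hypothetical excerpt:
the <literal>toy_</literal> names are invented, and the numeric
pairs are illustrative values that should be checked against the
real signal headers of both systems before being relied
on.</para>

```c
#define TOY_NSIG 32

/* Index: native signal number.  Value: signal number delivered to
 * the emulated process.  Zero means "no mapping in this excerpt". */
static const int toy_bsd_to_linux[TOY_NSIG] = {
    [1]  = 1,   /* SIGHUP  */
    [2]  = 2,   /* SIGINT  */
    [10] = 7,   /* SIGBUS  */
    [11] = 11,  /* SIGSEGV */
    [20] = 17,  /* SIGCHLD */
    [30] = 10,  /* SIGUSR1 */
    [31] = 12,  /* SIGUSR2 */
};

/* Translate a native signal number; returns 0 when it is out of
 * range or has no entry in the table. */
int
toy_translate_signal(int bsd_sig)
{
    if (bsd_sig <= 0 || bsd_sig >= TOY_NSIG)
        return 0;
    return toy_bsd_to_linux[bsd_sig];
}
```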
|
|
</sect3>
|
|
|
|
<sect3 xml:id="linux-rt-sendsig">
|
|
<title>linux_rt_sendsig</title>
|
|
|
|
<para>This routine is similar to
<function>linux_sendsig</function>; only the signal context
preparation is different. It adds
<literal>siginfo</literal>, <literal>ucontext</literal>, and
some &posix; parts. It might be worth considering whether
those two functions could be merged, with the benefit of
less code duplication and possibly even faster
execution.</para>
|
|
</sect3>
|
|
|
|
<sect3 xml:id="linux-sigreturn">
|
|
<title>linux_sigreturn</title>
|
|
|
|
<para>This syscall is used for return from the signal handler.
|
|
It does some security checks and restores the original
|
|
process context. It also unmasks the signal in process
|
|
signal mask.</para>
|
|
</sect3>
|
|
</sect2>
|
|
|
|
<sect2 xml:id="ptrace">
|
|
<title>Ptrace</title>
|
|
|
|
<para>Many &unix; derivatives implement the &man.ptrace.2; syscall
in order to allow various tracing and debugging features.
This facility enables the tracing process to obtain various
information about the traced process, like register dumps, any
memory from the process address space, etc., and also to trace
the process, for example by stepping one instruction at a time
or between system entries (syscalls and traps). &man.ptrace.2;
also lets you set various information in the traced process
(registers etc.). &man.ptrace.2; is a &unix;-wide standard
implemented in most &unix;es around the world.</para>
|
|
|
|
<para>&linux; emulation in &os; implements the &man.ptrace.2;
facility in <filename>linux_ptrace.c</filename>. This file
contains the routines for converting registers between &linux;
and &os; and the actual &man.ptrace.2; syscall emulation. The
syscall is a long switch block that implements its counterpart
in &os; for every &man.ptrace.2; command. The &man.ptrace.2;
commands are mostly equal between &linux; and &os;, so usually
just a small modification is needed. For example,
<literal>PT_GETREGS</literal> in &linux; operates on direct
data while &os; uses a pointer to the data, so after performing
a (native) &man.ptrace.2; syscall, a copyout must be done to
preserve &linux; semantics.</para>
|
|
|
|
<para>The &man.ptrace.2; implementation in the Linuxulator has
some known weaknesses. Panics have been seen when using
<command>strace</command> (which is a &man.ptrace.2; consumer)
in the Linuxulator environment. Also,
<literal>PT_SYSCALL</literal> is not implemented.</para>
|
|
</sect2>
|
|
|
|
<sect2 xml:id="traps">
|
|
<title>Traps</title>
|
|
|
|
<para>Whenever a &linux; process running in the emulation layer
traps, the trap itself is handled transparently with the only
exception being the trap translation. &linux; and &os; differ
in opinion on what a trap is, so this is dealt with here. The
code is actually very short:</para>
|
|
|
|
<programlisting>static int
translate_traps(int signal, int trap_code)
{

  if (signal != SIGBUS)
    return signal;

  switch (trap_code) {

    case T_PROTFLT:
    case T_TSSFLT:
    case T_DOUBLEFLT:
    case T_PAGEFLT:
      return SIGSEGV;

    default:
      return signal;
  }
}</programlisting>
|
|
</sect2>
|
|
|
|
<sect2 xml:id="stack-fixup">
|
|
<title>Stack fixup</title>
|
|
|
|
<para>The RTLD run-time link-editor expects so-called AUX tags
on the stack during an <function>execve</function>, so a fixup
must be done to ensure this. Of course, every RTLD system is
different, so each emulation layer must provide its own stack
fixup routine, and the Linuxulator does so. The
<function>elf_linux_fixup</function> routine simply copies out
AUX tags to the stack and adjusts the stack of the user space
process to point right after those tags, so that RTLD finds
them where it expects them.</para>
|
|
</sect2>
|
|
|
|
<sect2 xml:id="aout-support">
|
|
<title>A.OUT support</title>
|
|
|
|
<para>The &linux; emulation layer on i386 also supports &linux;
A.OUT binaries. Pretty much everything described in the
previous sections must be implemented for A.OUT support
(besides trap translation and signal sending). The support
for A.OUT binaries is no longer maintained; in particular, the
2.6 emulation does not work with it. This does not cause any
problem, as the linux-base in ports probably does not support
A.OUT binaries at all. This support will probably be removed
in the future. Most of the code necessary for loading &linux;
A.OUT binaries is in the <filename>imgact_linux.c</filename>
file.</para>
|
|
</sect2>
|
|
</sect1>
|
|
|
|
<sect1 xml:id="mi">
|
|
<title>&linux; emulation layer - MI part</title>
|
|
|
|
<para>This section talks about the machine independent part of
the Linuxulator. It covers the emulation infrastructure needed
for &linux; 2.6 emulation, the thread local storage (TLS)
implementation (on i386) and futexes. Then we talk briefly
about some syscalls.</para>
|
|
|
|
<sect2 xml:id="nptl-desc">
|
|
<title>Description of NPTL</title>
|
|
|
|
<para>One of the major areas of progress in the development of
&linux; 2.6 was threading. Prior to 2.6, the &linux;
threading support was implemented in the
<application>linuxthreads</application> library. The library
was a partial implementation of &posix; threading. The
threading was implemented using separate processes for each
thread, using the <function>clone</function> syscall to let
them share the address space (and other things). The main
weaknesses of this approach were that every thread had a
different PID, signal handling was broken (from the pthreads
perspective), etc. Also, the performance was not very good
(use of <literal>SIGUSR</literal> signals for thread
synchronization, kernel resource consumption, etc.), so to
overcome these problems a new threading system was developed
and named NPTL.</para>
|
|
|
|
      <para>The NPTL library focused on two things, but a third thing
        came along with them and is usually considered a part of NPTL.
        Those two things were embedding of threads into a process
        structure and futexes. The additional third thing was TLS,
        which is not directly required by NPTL but the whole NPTL
        userland library depends on it. Those improvements yielded
        much improved performance and standards conformance. NPTL is
        the standard threading library in &linux; systems these
        days.</para>
|
|
|
|
      <para>The &os; Linuxulator implementation approaches NPTL in
        three main areas: TLS, futexes and PID mangling, which is
        meant to simulate the &linux; threads. Further sections
        describe each of these areas.</para>
|
|
</sect2>
|
|
|
|
<sect2 xml:id="linux26-emu">
|
|
<title>&linux; 2.6 emulation infrastructure</title>
|
|
|
|
<para>These sections deal with the way &linux; threads are
|
|
managed and how we simulate that in &os;.</para>
|
|
|
|
<sect3 xml:id="linux26-runtime">
|
|
        <title>Runtime determination of 2.6 emulation</title>
|
|
|
|
<para>The &linux; emulation layer in &os; supports runtime
|
|
setting of the emulated version. This is done via
|
|
&man.sysctl.8;, namely
|
|
<literal>compat.linux.osrelease</literal>. Setting this
|
|
&man.sysctl.8; affects runtime behavior of the emulation
|
|
          layer. When set to 2.6.x it sets the variable
          <literal>linux_use_linux26</literal>, while setting it to
          anything else keeps the variable unset. This variable (plus
          per-prison variables of the very same kind) determines
          whether the 2.6 infrastructure (mainly PID mangling) is used
          in the code or not. The version setting is system-wide
          and affects all &linux; processes. The &man.sysctl.8;
          should not be changed while any &linux; binary is running, as
          doing so might break the running processes.</para>
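        <para>The decision described above can be illustrated with a
          minimal userland sketch. The helper name
          <function>sketch_use_linux26</function> is hypothetical and
          the exact matching rule is an assumption; the real kernel
          code may parse the string differently.</para>

```c
#include <string.h>

/*
 * Hypothetical helper (not the kernel code): returns 1 when an
 * osrelease string such as "2.6.16" selects the 2.6 infrastructure
 * and 0 for anything else.
 */
static int
sketch_use_linux26(const char *osrelease)
{
	/* The string must start with "2.6" ... */
	if (strncmp(osrelease, "2.6", 3) != 0)
		return (0);
	/* ... and continue with a '.' or end right there. */
	return (osrelease[3] == '\0' || osrelease[3] == '.');
}
```

        <para>With this rule <literal>2.6.16</literal> enables the 2.6
          infrastructure while <literal>2.4.2</literal> leaves it
          disabled.</para>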
|
|
</sect3>
|
|
|
|
<sect3 xml:id="linux-proc-thread">
|
|
<title>&linux; processes and thread identifiers</title>
|
|
|
|
        <para>The semantics of &linux; threading are a little
          confusing and use entirely different nomenclature from &os;.
          A process in &linux; consists of a <literal>struct
          task_struct</literal> embedding two identifier fields - PID
          and TGID. PID is <emphasis>not</emphasis> a process ID but
          a thread ID. The TGID identifies a thread group, in other
          words a process. For a single-threaded process the PID
          equals the TGID.</para>
|
|
|
|
        <para>A thread in NPTL is just an ordinary process that
          happens to have a TGID not equal to its PID and a group
          leader not equal to itself (and shared VM etc. of course).
          Everything else happens in the same way as for an ordinary
          process. There is no separation of shared state into some
          external structure like in &os;. This creates some
          duplication of information and possible data inconsistency.
          The &linux; kernel seems to use task -> group information
          in some places and task information elsewhere, which is
          not very consistent and looks error-prone.</para>
|
|
|
|
<para>Every NPTL thread is created by a call to the
|
|
<function>clone</function> syscall with a specific set of
|
|
flags (more in the next subsection). The NPTL implements
|
|
strict 1:1 threading.</para>
|
|
|
|
        <para>In &os; we emulate NPTL threads with ordinary &os;
          processes that share VM space, etc., and the PID gymnastics
          are just mimicked in the emulation specific structure
          attached to the process. The structure attached to the
          process looks like:</para>
|
|
|
|
        <programlisting>struct linux_emuldata {
	pid_t pid;

	int *child_set_tid;   /* in clone(): Child's TID to set on clone */
	int *child_clear_tid; /* in clone(): Child's TID to clear on exit */

	struct linux_emuldata_shared *shared;

	int pdeath_signal;    /* parent death signal */

	LIST_ENTRY(linux_emuldata) threads; /* list of linux threads */
};</programlisting>
|
|
|
|
        <para>The <varname>pid</varname> field identifies the &os;
          process to which this structure is attached. The
          <varname>child_set_tid</varname> and
          <varname>child_clear_tid</varname> fields hold the addresses
          used for TID copyout when a process is created and when it
          exits. The <varname>shared</varname> pointer points to a
          structure shared among threads. The
          <varname>pdeath_signal</varname> variable identifies the
          parent death signal and the <varname>threads</varname> entry
          links this structure into the list of threads. The
          <literal>linux_emuldata_shared</literal> structure looks
          like:</para>
|
|
|
|
        <programlisting>struct linux_emuldata_shared {
	int refs;

	pid_t group_pid;

	LIST_HEAD(, linux_emuldata) threads; /* head of list of linux threads */
};</programlisting>
|
|
|
|
        <para>The <varname>refs</varname> field is a reference counter
          used to determine when we can free the structure, to avoid
          memory leaks. The <varname>group_pid</varname> field
          identifies the PID ( = TGID) of the whole process ( = thread
          group). The <varname>threads</varname> field is the head
          of the list of threads in the process.</para>
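        <para>The role of the reference counter can be illustrated
          with a minimal userland sketch. The structure and the
          <literal>rc_</literal> functions below are hypothetical
          stand-ins, not the kernel code, and the sketch assumes the
          caller provides whatever locking is needed.</para>

```c
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical, simplified stand-in for linux_emuldata_shared. */
struct rc_shared {
	int	refs;		/* number of threads using the structure */
	pid_t	group_pid;	/* PID (= TGID) of the thread group */
};

/* Create the shared part for a new thread group. */
static struct rc_shared *
rc_shared_new(pid_t leader_pid)
{
	struct rc_shared *s;

	s = malloc(sizeof(*s));
	if (s == NULL)
		return (NULL);
	s->refs = 1;		/* the group leader holds the first reference */
	s->group_pid = leader_pid;
	return (s);
}

/* Another thread joins the group. */
static void
rc_shared_hold(struct rc_shared *s)
{
	s->refs++;
}

/*
 * A thread leaves the group.  Returns 1 when this was the last
 * reference and the structure has been freed, 0 otherwise.
 */
static int
rc_shared_rele(struct rc_shared *s)
{
	if (--s->refs > 0)
		return (0);
	free(s);
	return (1);
}
```

        <para>The structure is freed only when the last thread of the
          group drops its reference, so no thread is left with a
          dangling <varname>shared</varname> pointer.</para>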
|
|
|
|
<para>The <literal>linux_emuldata</literal> structure can be
|
|
obtained from the process using
|
|
<function>em_find</function>. The prototype of the function
|
|
is:</para>
|
|
|
|
<programlisting>struct linux_emuldata *em_find(struct proc *, int locked);</programlisting>
|
|
|
|
        <para>Here, <varname>proc</varname> is the process we want the
          emuldata structure from and the <varname>locked</varname>
          parameter determines whether we want to lock or not. The
          accepted values are <literal>EMUL_DOLOCK</literal> and
          <literal>EMUL_DONTLOCK</literal>. More about locking
          later.</para>
|
|
</sect3>
|
|
|
|
<sect3 xml:id="pid-mangling">
|
|
<title>PID mangling</title>
|
|
|
|
        <para>Because &os; and &linux; differ in their view of what a
          process ID and a thread ID are, we have to translate between
          the two views somehow. We do it by PID mangling. This means
          that we fake what a PID (= TGID) and a TID (= PID) is
          between the kernel and userland. The rule of thumb is that
          in the kernel (in the Linuxulator) <literal>PID =
          PID</literal> and <literal>TGID =
          shared->group_pid</literal>, and to userland we present
          <literal>PID = shared->group_pid</literal> and
          <literal>TID = proc->p_pid</literal>. The
          <varname>pid</varname> member of the
          <literal>linux_emuldata</literal> structure is a &os;
          PID.</para>
|
|
|
|
        <para>The above mainly affects the
          <function>getpid</function>, <function>getppid</function>
          and <function>gettid</function> syscalls, where we use the
          PID or the TGID as appropriate. When copying out TIDs
          through <varname>child_clear_tid</varname> and
          <varname>child_set_tid</varname> we copy out the &os;
          PID.</para>
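        <para>The rule of thumb can be illustrated with a minimal
          userland sketch. The <literal>pm_</literal> structures and
          functions are hypothetical simplifications of the emulation
          data and do not reproduce the kernel code.</para>

```c
#include <sys/types.h>

/* Hypothetical, simplified stand-ins for the emulation structures. */
struct pm_shared {
	pid_t	group_pid;		/* PID of the group leader (= TGID) */
};

struct pm_emuldata {
	pid_t			 pid;	/* native PID of this "thread" */
	struct pm_shared	*shared;
};

/* What a Linux getpid() would see: the PID of the whole thread group. */
static pid_t
pm_getpid(const struct pm_emuldata *em)
{
	return (em->shared->group_pid);
}

/* What a Linux gettid() would see: the native PID of this thread. */
static pid_t
pm_gettid(const struct pm_emuldata *em)
{
	return (em->pid);
}
```

        <para>For the group leader both functions return the same
          value, while any other thread of the group reports the
          leader's PID from <function>pm_getpid</function> and its own
          PID from <function>pm_gettid</function>.</para>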
|
|
</sect3>
|
|
|
|
<sect3 xml:id="clone-syscall">
|
|
<title>Clone syscall</title>
|
|
|
|
<para>The <function>clone</function> syscall is the way
|
|
threads are created in &linux;. The syscall prototype looks
|
|
like this:</para>
|
|
|
|
        <programlisting>int linux_clone(l_int flags, void *stack, void *parent_tidptr, int dummy,
    void *child_tidptr);</programlisting>
|
|
|
|
        <para>The <varname>flags</varname> parameter tells the syscall
          how exactly the process should be cloned. As described
          above, &linux; can create processes sharing various things
          independently, for example two processes can share file
          descriptors but not VM, etc. The lowest byte of the
          <varname>flags</varname> parameter is the exit signal of the
          newly created process. The <varname>stack</varname>
          parameter, if non-<literal>NULL</literal>, tells where the
          thread stack is; if it is <literal>NULL</literal> we are
          supposed to copy-on-write the calling process stack (i.e. do
          what the normal &man.fork.2; routine does). The
          <varname>parent_tidptr</varname> parameter is used as an
          address for copying out the process PID (i.e. thread ID) once
          the process is sufficiently instantiated but is not runnable
          yet. The <varname>dummy</varname> parameter is here because
          of the very strange calling convention of this syscall on
          i386. It uses the registers directly and does not let the
          compiler do it, which results in the need for a dummy
          parameter. The <varname>child_tidptr</varname> parameter is
          used as an address for copying out the PID once the process
          has finished forking and when the process exits.</para>
|
|
|
|
        <para>The syscall itself proceeds by setting the corresponding
          &os; flags depending on the flags passed in. For example,
          <literal>CLONE_VM</literal> maps to
          <literal>RFMEM</literal> (sharing of VM), etc. The only nit
          here is <literal>CLONE_FS</literal> and
          <literal>CLONE_FILES</literal>, because &os; does not allow
          setting these separately, so we fake it by not setting
          <literal>RFFDG</literal> (copying of the fd table and other
          fs information) if either of these is defined. This does
          not cause any problems, because those flags are always set
          together. After setting the flags the process is forked
          using the internal <function>fork1</function> routine; the
          new process is instructed not to be put on a run queue, i.e.
          not to be set runnable. After the forking is done we
          possibly reparent the newly created process to emulate
          <literal>CLONE_PARENT</literal> semantics. The next step is
          creating the emulation data. Threads in &linux; do not
          signal their parents, so we set the exit signal to 0 to
          disable this. After that,
          <varname>child_set_tid</varname> and
          <varname>child_clear_tid</varname> are set, enabling the
          functionality later in the code. At this point we copy out
          the PID to the address specified by
          <varname>parent_tidptr</varname>. The process stack is set
          by simply rewriting the thread frame
          <varname>%esp</varname> register (<varname>%rsp</varname> on
          amd64). The next step is setting up TLS for the newly
          created process. After this, &man.vfork.2; semantics might
          be emulated, and finally the newly created process is put on
          a run queue and its PID is returned to the parent process as
          the <function>clone</function> return value.</para>
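        <para>The flag translation at the start of the syscall can be
          sketched as follows. This is a simplified userland
          illustration, not the kernel code: the
          <literal>CF_</literal> names are hypothetical, the
          <literal>CF_CLONE_*</literal> values follow the &linux;
          ABI, and the <literal>CF_RF*</literal> values are assumed
          stand-ins for the native &man.rfork.2; flags.</para>

```c
/* Linux clone() flag bits handled by this sketch. */
#define	CF_CSIGNAL	0x000000ff	/* exit signal lives in the low byte */
#define	CF_CLONE_VM	0x00000100
#define	CF_CLONE_FS	0x00000200
#define	CF_CLONE_FILES	0x00000400

/* Assumed stand-ins for the native rfork() flags. */
#define	CF_RFFDG	(1 << 2)	/* copy the fd table */
#define	CF_RFPROC	(1 << 4)	/* create a new process */
#define	CF_RFMEM	(1 << 5)	/* share the address space */

/*
 * Translate Linux clone flags to fork flags and extract the exit
 * signal of the new process.
 */
static int
cf_clone_to_rfork(int flags, int *exit_signal)
{
	int ff = CF_RFPROC;

	if (flags & CF_CLONE_VM)
		ff |= CF_RFMEM;
	/* The fd table is copied unless either sharing flag is set. */
	if (!(flags & (CF_CLONE_FS | CF_CLONE_FILES)))
		ff |= CF_RFFDG;
	*exit_signal = flags & CF_CSIGNAL;
	return (ff);
}
```

        <para>Treating <literal>CLONE_FS</literal> and
          <literal>CLONE_FILES</literal> as one switch mirrors the
          simplification described above and relies on callers always
          setting the two flags together.</para>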
|
|
|
|
        <para>The <function>clone</function> syscall can be, and in
          fact is, used for emulating the classic &man.fork.2; and
          &man.vfork.2; syscalls. Newer glibc running on a 2.6 kernel
          uses <function>clone</function> to implement the
          &man.fork.2; and &man.vfork.2; syscalls.</para>
|
|
</sect3>
|
|
|
|
<sect3 xml:id="locking">
|
|
<title>Locking</title>
|
|
|
|
        <para>The locking is implemented to be per-subsystem because
          we do not expect a lot of contention on these locks. There
          are two locks: <literal>emul_lock</literal>, used to protect
          manipulation of <literal>linux_emuldata</literal>, and
          <literal>emul_shared_lock</literal>, used to protect
          manipulation of
          <literal>linux_emuldata_shared</literal>. The
          <literal>emul_lock</literal> is a nonsleepable blocking
          mutex while <literal>emul_shared_lock</literal> is a
          sleepable blocking <literal>sx_lock</literal>. Because of
          the per-subsystem locking we can coalesce some locks, and
          that is why <function>em_find</function> offers the
          non-locking access.</para>
|
|
</sect3>
|
|
</sect2>
|
|
|
|
<sect2 xml:id="tls">
|
|
<title>TLS</title>
|
|
|
|
      <para>This section deals with TLS, also known as thread local
        storage.</para>
|
|
|
|
<sect3 xml:id="trheading-intro">
|
|
<title>Introduction to threading</title>
|
|
|
|
<para>Threads in computer science are entities within a
|
|
process that can be scheduled independently from each other.
|
|
The threads in the process share process wide data (file
|
|
descriptors, etc.) but also have their own stack for their
|
|
own data. Sometimes there is a need for process-wide data
|
|
specific to a given thread. Imagine a name of the thread in
|
|
          execution or something like that. The traditional &unix;
          threading API, <application>pthreads</application>, provides
          a way to do it via &man.pthread.key.create.3;,
          &man.pthread.setspecific.3; and &man.pthread.getspecific.3;,
          where a thread can create a key to the thread local data and
          use &man.pthread.getspecific.3; or
          &man.pthread.setspecific.3; to manipulate those data. You
          can easily see that this is not the most comfortable way
          this could be accomplished. So various producers of C/C++
          compilers introduced a better way. They defined a new
          modifier keyword, <literal>__thread</literal>, that
          specifies that a variable is thread specific. A new method
          of accessing such variables was developed as well (at least
          on i386). The
|
|
<application>pthreads</application> method tends to be
|
|
implemented in userspace as a trivial lookup table. The
|
|
          performance of such a solution is not very good. So the new
          method uses (on i386) segment registers to address a
          segment where the TLS area is stored, so accessing a thread
          variable just means prefixing the address with the segment
          register, thus addressing via it. The segment registers
          used are usually <varname>%gs</varname> and
          <varname>%fs</varname>, acting as segment selectors. Every
|
|
thread has its own area where the thread local data are
|
|
stored and the segment must be loaded on every context
|
|
switch. This method is very fast and used almost
|
|
exclusively in the whole i386 &unix; world. Both &os; and
|
|
&linux; implement this approach and it yields very good
|
|
          results. The only drawback is the need to reload the
          segment on every context switch, which can slow down context
          switches. &os; tries to avoid this overhead by using only 1
          segment descriptor for this while &linux; uses 3. An
          interesting thing is that almost nothing uses more than 1
          descriptor (only <application>Wine</application> seems to
          use 2), so &linux; pays this unnecessary price on context
          switches.</para>
|
|
</sect3>
|
|
|
|
<sect3 xml:id="i386-segs">
|
|
<title>Segments on i386</title>
|
|
|
|
        <para>The i386 architecture implements so-called segments.
          A segment is a description of an area of memory: the base
          address (bottom) of the memory area, the end of it
          (ceiling), type, protection, etc. The memory described by a
          segment can be accessed using segment selector registers
          (<varname>%cs</varname>, <varname>%ds</varname>,
          <varname>%ss</varname>, <varname>%es</varname>,
          <varname>%fs</varname>, <varname>%gs</varname>). For
          example, let us suppose we have a segment whose base address
          is 0x1234 and this code:</para>
|
|
|
|
<programlisting>mov %edx,%gs:0x10</programlisting>
|
|
|
|
        <para>This will store the content of the
          <varname>%edx</varname> register into memory location
          0x1244. Some segment registers have a special use, for
|
|
example <varname>%cs</varname> is used for code segment and
|
|
<varname>%ss</varname> is used for stack segment but
|
|
<varname>%fs</varname> and <varname>%gs</varname> are
|
|
generally unused. Segments are either stored in a global
|
|
GDT table or in a local LDT table. LDT is accessed via an
|
|
entry in the GDT. The LDT can store more types of segments.
|
|
LDT can be per process. Both tables define up to 8191
|
|
entries.</para>
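        <para>The address arithmetic from the example, together with
          the way a selector value names a table entry, can be
          sketched in a few lines of C. The function names are
          hypothetical, and the sketch ignores limit and protection
          checks that the hardware performs.</para>

```c
#include <stdint.h>

/* Linear address = segment base + offset used by the instruction. */
static uint32_t
seg_linear_address(uint32_t base, uint32_t offset)
{
	return (base + offset);
}

/*
 * Build a selector value: the descriptor index goes to bits 3-15,
 * bit 2 chooses the table (0 = GDT, 1 = LDT) and bits 0-1 hold the
 * requested privilege level.
 */
static uint16_t
seg_selector(unsigned int index, unsigned int use_ldt, unsigned int rpl)
{
	return ((uint16_t)((index << 3) | ((use_ldt & 1u) << 2) |
	    (rpl & 3u)));
}
```

        <para>With a base of 0x1234 and an offset of 0x10 the first
          function yields 0x1244, matching the example above. The 13
          bits available for the index are what bounds the number of
          entries in each table.</para>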
|
|
</sect3>
|
|
|
|
<sect3 xml:id="linux-i386">
|
|
<title>Implementation on &linux; i386</title>
|
|
|
|
        <para>There are two main ways of setting up TLS in &linux;.
          It can be set when cloning a process using the
          <function>clone</function> syscall, or a process can call
          <function>set_thread_area</function>. When a process passes
          the <literal>CLONE_SETTLS</literal> flag to
          <function>clone</function>, the kernel expects the memory
          pointed to by the <varname>%esi</varname> register to
          contain a &linux; user space representation of a segment,
          which gets translated to the machine representation of a
          segment and loaded into a GDT slot. The GDT slot can be
          specified with a number, or -1 can be used, meaning that the
          system itself should choose the first free slot. In
          practice, the vast majority of programs use only one TLS
          entry and do not care about the number of the entry. We
          exploit this in the emulation and in fact depend on
          it.</para>
|
|
</sect3>
|
|
|
|
<sect3 xml:id="tls-emu">
|
|
<title>Emulation of &linux; TLS</title>
|
|
|
|
<sect4 xml:id="tls-i386">
|
|
<title>i386</title>
|
|
|
|
          <para>Loading of TLS for the current thread happens by
            calling <function>set_thread_area</function>, while loading
            TLS for a second process in <function>clone</function> is
            done in a separate block in <function>clone</function>.
            Those two code paths are very similar. The only difference
            is the actual loading of the GDT segment, which happens
            on the next context switch for the newly created process,
            while <function>set_thread_area</function> must load it
            directly. The code basically does the following. It
            copies the &linux; form segment descriptor from userland.
            The code checks the number of the descriptor, but because
            this differs between &os; and &linux; we fake it a little.
            We only support indexes of 6, 3 and -1. The 6 is the
            genuine &linux; number, 3 is the genuine &os; one and -1
            means autoselection. Then we set the descriptor number to
            the constant 3 and copy this out to userspace. We rely on
            the userspace process using the number from the
            descriptor, but this works most of the time (we have never
            seen a case where this did not work) as the userspace
            process typically passes in -1. Then we convert the
            descriptor from the &linux; form to a machine dependent
            form (i.e. operating system independent form) and copy
            this to the &os; defined segment descriptor. Finally we
            can load it. We assign the descriptor to the thread's PCB
            (process control block) and load the
            <varname>%gs</varname> segment using
            <function>load_gs</function>. This loading must be done
            in a critical section so that nothing can interrupt us.
            The <literal>CLONE_SETTLS</literal> case works exactly
            like this, except that the loading using
            <function>load_gs</function> is not performed. The
            segment used for this (segment number 3) is shared for
            this use between &os; processes and &linux; processes, so
            the &linux; emulation layer does not add any overhead over
            plain &os;.</para>
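          <para>The handling of the descriptor number can be
            illustrated with a minimal sketch. The function and macro
            names are hypothetical and only the index check is shown;
            copying the descriptor in and out and loading the segment
            are left out.</para>

```c
#include <errno.h>

#define	TE_LINUX_ENTRY		6	/* index a Linux binary may ask for */
#define	TE_NATIVE_ENTRY		3	/* index actually used natively */

/*
 * Validate the requested TLS entry number and replace it with the
 * native one.  Returns 0 on success and EINVAL for an unsupported
 * index, in which case the value is left untouched.
 */
static int
te_fix_entry_number(int *entry_number)
{
	switch (*entry_number) {
	case TE_LINUX_ENTRY:
	case TE_NATIVE_ENTRY:
	case -1:			/* let the system choose */
		*entry_number = TE_NATIVE_ENTRY;
		return (0);
	default:
		return (EINVAL);
	}
}
```

          <para>Whatever supported value the process passes in, it
            reads back 3, which is why the emulation depends on the
            process using the number returned in the
            descriptor.</para>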
|
|
</sect4>
|
|
|
|
<sect4 xml:id="tls-amd64">
|
|
<title>amd64</title>
|
|
|
|
          <para>The amd64 implementation is similar to the i386 one,
            but there was initially no 32-bit segment descriptor used
            for this purpose (hence not even native 32-bit TLS users
            worked), so we had to add such a segment and implement its
            loading on every context switch (when a flag signaling use
            of 32-bit is set). Apart from this the TLS loading is
            exactly the same; only the segment numbers are different
            and the descriptor format and the loading differ
            slightly.</para>
|
|
</sect4>
|
|
</sect3>
|
|
</sect2>
|
|
|
|
<sect2 xml:id="futexes">
|
|
<title>Futexes</title>
|
|
|
|
<sect3 xml:id="sync-intro">
|
|
<title>Introduction to synchronization</title>
|
|
|
|
        <para>Threads need some kind of synchronization and &posix;
          provides several mechanisms: mutexes for mutual exclusion,
          read-write locks for mutual exclusion with a biased ratio of
          reads and writes, and condition variables for signaling a
          status change. It is interesting to note that the &posix;
          threading API lacks support for semaphores. The
          implementation of those synchronization routines depends
          heavily on the type of threading support we have. In the
          pure 1:M (userspace) model the implementation can be done
          solely in userspace and thus be very fast (the condition
          variables will probably end up being implemented using
          signals, i.e. not fast) and simple. In the 1:1 model, the
          situation is also quite clear - the threads must be
          synchronized using kernel facilities (which is very slow
          because a syscall must be performed). The mixed M:N
          scenario either combines the first and second approach or
          relies solely on the kernel. Thread synchronization is a
          vital part of thread-enabled programming and its performance
          can affect the resulting program a lot. Recent benchmarks
          on the &os; operating system showed that an improved sx_lock
          implementation yielded a 40% speedup in
          <firstterm>ZFS</firstterm> (a heavy sx user). This is
          in-kernel code, but it clearly shows how important the
          performance of synchronization primitives is.</para>
|
|
|
|
<para>Threaded programs should be written with as little
|
|
contention on locks as possible. Otherwise, instead of
|
|
          doing useful work the thread just waits on a lock. Because
          of this, most well-written threaded programs show little
          lock contention.</para>
|
|
</sect3>
|
|
|
|
<sect3 xml:id="futex-intro">
|
|
<title>Futexes introduction</title>
|
|
|
|
        <para>&linux; implements 1:1 threading, i.e. it has to use
          in-kernel synchronization primitives. As stated earlier,
          well-written threaded programs have little lock contention.
          So a typical lock/unlock sequence could be performed as just
          two atomic operations, an increment and a decrement of the
          mutex reference counter, which is very fast. Consider the
          following example:</para>
|
|
|
|
        <programlisting>pthread_mutex_lock(&mutex);
....
pthread_mutex_unlock(&mutex);</programlisting>
|
|
|
|
<para>1:1 threading forces us to perform two syscalls for
|
|
those mutex calls, which is very slow.</para>
|
|
|
|
        <para>The solution &linux; 2.6 implements is called
          futexes. Futexes implement the check for contention in
          userspace and call kernel primitives only in case of
          contention. Thus the typical case takes place without any
          kernel intervention. This yields a reasonably fast and
          flexible implementation of synchronization
          primitives.</para>
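        <para>The userspace fast path can be illustrated with a
          minimal sketch using C11 atomics. The function names and the
          three-state encoding of the lock word are assumptions made
          for the example (a common scheme, not necessarily the one
          glibc uses), and the slow path that would call into the
          kernel is only indicated by the return values.</para>

```c
#include <stdatomic.h>

/*
 * Assumed lock word states:
 *   0 = unlocked, 1 = locked, 2 = locked and somebody is waiting.
 */

/*
 * Try to take the lock entirely in userspace.  Returns 1 on success.
 * A return value of 0 means contention: only then would the caller
 * mark the word as contended and sleep in the kernel (FUTEX_WAIT).
 */
static int
fp_lock_fast(atomic_int *word)
{
	int expected = 0;

	return (atomic_compare_exchange_strong(word, &expected, 1));
}

/*
 * Release the lock.  Returns 1 when there were waiters, i.e. when the
 * caller would have to wake one of them in the kernel (FUTEX_WAKE),
 * and 0 when no syscall is needed at all.
 */
static int
fp_unlock_fast(atomic_int *word)
{
	return (atomic_exchange(word, 0) == 2);
}
```

        <para>In the uncontended case both functions complete with a
          single atomic instruction each, which is exactly the
          situation futexes are designed to make cheap.</para>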
|
|
</sect3>
|
|
|
|
<sect3 xml:id="futex-api">
|
|
<title>Futex API</title>
|
|
|
|
<para>The futex syscall looks like this:</para>
|
|
|
|
<programlisting>int futex(void *uaddr, int op, int val, struct timespec *timeout, void *uaddr2, int val3);</programlisting>
|
|
|
|
<para>In this example <varname>uaddr</varname> is an address
|
|
of the mutex in userspace, <varname>op</varname> is an
|
|
operation we are about to perform and the other parameters
|
|
have per-operation meaning.</para>
|
|
|
|
<para>Futexes implement the following operations:</para>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para><literal>FUTEX_WAIT</literal></para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><literal>FUTEX_WAKE</literal></para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><literal>FUTEX_FD</literal></para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><literal>FUTEX_REQUEUE</literal></para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><literal>FUTEX_CMP_REQUEUE</literal></para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><literal>FUTEX_WAKE_OP</literal></para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<sect4 xml:id="futex-wait">
|
|
<title>FUTEX_WAIT</title>
|
|
|
|
          <para>This operation verifies that the value stored at
            address <varname>uaddr</varname> is still
            <varname>val</varname>. If not,
            <literal>EWOULDBLOCK</literal> is returned; otherwise the
            thread is queued on the futex and gets suspended. If the
            argument <varname>timeout</varname> is
            non-<literal>NULL</literal> it specifies the maximum time
            for the sleeping; otherwise the sleeping is
            infinite.</para>
|
|
</sect4>
|
|
|
|
<sect4 xml:id="futex-wake">
|
|
<title>FUTEX_WAKE</title>
|
|
|
|
          <para>This operation takes a futex at
            <varname>uaddr</varname> and wakes up the first
            <varname>val</varname> threads queued on this
            futex.</para>
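          <para>The interplay of <literal>FUTEX_WAIT</literal> and
            <literal>FUTEX_WAKE</literal> can be modeled with a small
            sketch. The <literal>wq_</literal> names are hypothetical
            and the model only counts waiters instead of really
            suspending threads.</para>

```c
#include <errno.h>

/* Hypothetical model of one futex: its word and a count of sleepers. */
struct wq_futex {
	const int	*uaddr;		/* the futex word in "userspace" */
	int		 nwaiters;	/* threads currently queued */
};

/*
 * Model of FUTEX_WAIT: refuse to sleep when the word no longer holds
 * the expected value, otherwise queue the caller.
 */
static int
wq_wait(struct wq_futex *f, int val)
{
	if (*f->uaddr != val)
		return (EWOULDBLOCK);
	f->nwaiters++;		/* a real implementation would sleep here */
	return (0);
}

/* Model of FUTEX_WAKE: wake at most val waiters, return how many. */
static int
wq_wake(struct wq_futex *f, int val)
{
	int woken;

	woken = (val < f->nwaiters) ? val : f->nwaiters;
	if (woken < 0)
		woken = 0;
	f->nwaiters -= woken;
	return (woken);
}
```

          <para>The value check in <function>wq_wait</function> is
            what closes the race between the userspace test of the
            lock word and the decision to sleep.</para>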
|
|
</sect4>
|
|
|
|
<sect4 xml:id="futex-fd">
|
|
<title>FUTEX_FD</title>
|
|
|
|
          <para>This operation associates a file descriptor with a
|
|
given futex.</para>
|
|
</sect4>
|
|
|
|
<sect4 xml:id="futex-requeue">
|
|
<title>FUTEX_REQUEUE</title>
|
|
|
|
          <para>This operation takes <varname>val</varname> threads
            queued on the futex at <varname>uaddr</varname> and wakes
            them up, then takes the next <varname>val2</varname>
            threads and requeues them on the futex at
            <varname>uaddr2</varname>.</para>
|
|
</sect4>
|
|
|
|
<sect4 xml:id="futex-cmp-requeue">
|
|
<title>FUTEX_CMP_REQUEUE</title>
|
|
|
|
          <para>This operation does the same as
            <literal>FUTEX_REQUEUE</literal> but it first checks that
            the value at <varname>uaddr</varname> equals
            <varname>val3</varname>.</para>
|
|
</sect4>
|
|
|
|
<sect4 xml:id="futex-wake-op">
|
|
<title>FUTEX_WAKE_OP</title>
|
|
|
|
          <para>This operation performs an atomic operation, encoded
            in <varname>val3</varname> (which packs the operation and
            its arguments into one value), on the word at
            <varname>uaddr2</varname>. Then it wakes up
            <varname>val</varname> threads on the futex at
            <varname>uaddr</varname>, and if the comparison encoded
            in the atomic operation holds it also wakes up
            <varname>val2</varname> threads on the futex at
            <varname>uaddr2</varname>.</para>
|
|
|
|
<para>The operations implemented in
|
|
<literal>FUTEX_WAKE_OP</literal>:</para>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para><literal>FUTEX_OP_SET</literal></para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><literal>FUTEX_OP_ADD</literal></para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><literal>FUTEX_OP_OR</literal></para>
|
|
</listitem>
|
|
<listitem>
|
|
              <para><literal>FUTEX_OP_ANDN</literal></para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><literal>FUTEX_OP_XOR</literal></para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<note>
|
|
<para>There is no <varname>val2</varname> parameter in the
|
|
futex prototype. The <varname>val2</varname> is taken
|
|
from the <varname>struct timespec *timeout</varname>
|
|
parameter for operations
|
|
<literal>FUTEX_REQUEUE</literal>,
|
|
<literal>FUTEX_CMP_REQUEUE</literal> and
|
|
<literal>FUTEX_WAKE_OP</literal>.</para>
|
|
</note>
|
|
</sect4>
|
|
</sect3>
|
|
|
|
<sect3 xml:id="futex-emu">
|
|
<title>Futex emulation in &os;</title>
|
|
|
|
        <para>The futex emulation in &os; is taken from NetBSD and
          further extended by us. It is placed in the
          <filename>linux_futex.c</filename> and
          <filename>linux_futex.h</filename> files. The
          <literal>futex</literal> structure looks like:</para>
|
|
|
|
        <programlisting>struct futex {
	void *f_uaddr;
	int f_refcount;

	LIST_ENTRY(futex) f_list;

	TAILQ_HEAD(lf_waiting_proc, waiting_proc) f_waiting_proc;
};</programlisting>
|
|
|
|
<para>And the structure <literal>waiting_proc</literal>
|
|
is:</para>
|
|
|
|
        <programlisting>struct waiting_proc {
	struct thread *wp_t;

	struct futex *wp_new_futex;

	TAILQ_ENTRY(waiting_proc) wp_list;
};</programlisting>
|
|
|
|
<sect4 xml:id="futex-get">
|
|
<title>futex_get / futex_put</title>
|
|
|
|
<para>A futex is obtained using the
|
|
<function>futex_get</function> function, which searches a
|
|
linear list of futexes and returns the found one or
|
|
            creates a new futex. When we are done with a futex we
            call the <function>futex_put</function> function, which
            decreases the reference counter of the futex, and if the
            refcount reaches zero the futex is released.</para>
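          <para>The get/put pattern can be sketched in userland as
            follows. The <literal>fl_</literal> names are
            hypothetical, a plain singly linked list replaces the
            kernel list macros, and no locking is shown.</para>

```c
#include <stdlib.h>

/* Hypothetical, simplified futex record kept on a linear list. */
struct fl_futex {
	void		*uaddr;		/* address identifying the futex */
	int		 refcount;
	struct fl_futex	*next;
};

static struct fl_futex *fl_list;	/* head of the list of futexes */

/* Find the futex for uaddr or create it; either way take a reference. */
static struct fl_futex *
fl_get(void *uaddr)
{
	struct fl_futex *f;

	for (f = fl_list; f != NULL; f = f->next) {
		if (f->uaddr == uaddr) {
			f->refcount++;
			return (f);
		}
	}
	f = malloc(sizeof(*f));
	if (f == NULL)
		return (NULL);
	f->uaddr = uaddr;
	f->refcount = 1;
	f->next = fl_list;
	fl_list = f;
	return (f);
}

/* Drop a reference; unlink and free the futex when it was the last. */
static void
fl_put(struct fl_futex *f)
{
	struct fl_futex **pp;

	if (--f->refcount > 0)
		return;
	for (pp = &fl_list; *pp != NULL; pp = &(*pp)->next) {
		if (*pp == f) {
			*pp = f->next;
			break;
		}
	}
	free(f);
}
```

          <para>Two lookups of the same address return the same
            record, and the record disappears from the list only after
            both users have called
            <function>fl_put</function>.</para>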
|
|
</sect4>
|
|
|
|
<sect4 xml:id="futex-sleep">
|
|
<title>futex_sleep</title>
|
|
|
|
          <para>When a futex queues a thread for sleeping it creates a
            <literal>waiting_proc</literal> structure and puts this
            structure on the list inside the futex structure; then it
            just performs a &man.tsleep.9; to suspend the thread. The
            sleep can be timed out. After &man.tsleep.9; returns (the
            thread was woken up or it timed out) the
            <literal>waiting_proc</literal> structure is removed from
            the list and is destroyed. All this is done in the
            <function>futex_sleep</function> function. If we were
            woken up by <function>futex_wake</function> with
            <varname>wp_new_futex</varname> set, we go to sleep on
            that new futex. This way the actual requeueing is done in
            this function.</para>
|
|
</sect4>
|
|
|
|
<sect4 xml:id="futex-wake-2">
|
|
<title>futex_wake</title>
|
|
|
|
          <para>Waking up a thread sleeping on a futex is performed in
            the <function>futex_wake</function> function. First in
            this function we mimic the strange &linux; behavior, where
            it wakes up N threads for all operations, with the only
            exception that the REQUEUE operations are performed on
            N+1 threads. This usually does not make any difference,
            as we are waking up all threads. Next, in a loop, we wake
            up n threads, and after this we check if there is a new
            futex for requeueing. If so, we requeue up to n2 threads
            on the new futex. This cooperates with
            <function>futex_sleep</function>.</para>
|
|
</sect4>
|
|
|
|
<sect4 xml:id="futex-wake-op-2">
|
|
<title>futex_wake_op</title>
|
|
|
|
          <para>The <literal>FUTEX_WAKE_OP</literal> operation is
            quite complicated. First we obtain two futexes at
            addresses <varname>uaddr</varname> and
            <varname>uaddr2</varname>, then we perform the atomic
            operation using <varname>val3</varname> and
            <varname>uaddr2</varname>. Then <varname>val</varname>
            waiters on the first futex are woken up, and if the atomic
            operation condition holds we wake up
            <varname>val2</varname> (i.e. <varname>timeout</varname>)
            waiters on the second futex.</para>
|
|
</sect4>
|
|
|
|
<sect4 xml:id="futex-atomic-op">
|
|
<title>futex atomic operation</title>
|
|
|
|
          <para>The atomic operation takes two parameters,
            <varname>encoded_op</varname> and
            <varname>uaddr</varname> (to which the caller passes the
            <varname>uaddr2</varname> of the futex syscall). The
            encoded operation encodes the operation itself, the
            comparison, the operation argument and the comparison
            argument. The pseudocode for the operation looks like
            this:</para>
|
|
|
|
          <programlisting>oldval = *uaddr2
*uaddr2 = oldval OP oparg</programlisting>
|
|
|
|
          <para>And this is done atomically. First the number at the
            given address is copied in and the operation is
            performed. The code handles page faults, and if no page
            fault occurs <varname>oldval</varname> is compared to the
            <varname>cmparg</varname> argument using the
            <varname>cmp</varname> comparator.</para>
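          <para>A minimal userland sketch of decoding and applying
            such an encoded operation follows. The
            <literal>AO_</literal> names are hypothetical, the bit
            layout and the operation and comparison numbers are
            assumed to follow the &linux; futex encoding, and the
            sketch is neither atomic nor aware of page
            faults.</para>

```c
/* Assumed operation numbers (bits 28-30 of the encoded value). */
#define	AO_OP_SET	0
#define	AO_OP_ADD	1
#define	AO_OP_OR	2
#define	AO_OP_ANDN	3
#define	AO_OP_XOR	4

/* Assumed comparison numbers (bits 24-27 of the encoded value). */
#define	AO_CMP_EQ	0
#define	AO_CMP_NE	1
#define	AO_CMP_LT	2
#define	AO_CMP_LE	3
#define	AO_CMP_GT	4
#define	AO_CMP_GE	5

/* Pack op, oparg, cmp and cmparg into one 32-bit value. */
#define	AO_ENCODE(op, oparg, cmp, cmparg)				\
	((((unsigned int)(op) & 0xf) << 28) |				\
	 (((unsigned int)(cmp) & 0xf) << 24) |				\
	 (((unsigned int)(oparg) & 0xfff) << 12) |			\
	 ((unsigned int)(cmparg) & 0xfff))

/* Sign-extend a 12-bit field. */
static int
ao_sext12(unsigned int v)
{
	v &= 0xfff;
	return ((v & 0x800) ? (int)v - 0x1000 : (int)v);
}

/*
 * Apply the encoded operation to *uaddr2 and return the result of
 * comparing the old value with cmparg (1 or 0), or -1 when the
 * encoding is not understood.
 */
static int
ao_futex_op(unsigned int encoded_op, int *uaddr2)
{
	int op = (encoded_op >> 28) & 7;
	int cmp = (encoded_op >> 24) & 15;
	int oparg = ao_sext12(encoded_op >> 12);
	int cmparg = ao_sext12(encoded_op);
	int oldval = *uaddr2;

	switch (op) {
	case AO_OP_SET:  *uaddr2 = oparg; break;
	case AO_OP_ADD:  *uaddr2 = oldval + oparg; break;
	case AO_OP_OR:   *uaddr2 = oldval | oparg; break;
	case AO_OP_ANDN: *uaddr2 = oldval & ~oparg; break;
	case AO_OP_XOR:  *uaddr2 = oldval ^ oparg; break;
	default:         return (-1);
	}
	switch (cmp) {
	case AO_CMP_EQ: return (oldval == cmparg);
	case AO_CMP_NE: return (oldval != cmparg);
	case AO_CMP_LT: return (oldval < cmparg);
	case AO_CMP_LE: return (oldval <= cmparg);
	case AO_CMP_GT: return (oldval > cmparg);
	case AO_CMP_GE: return (oldval >= cmparg);
	default:        return (-1);
	}
}
```

          <para>Note that the comparison uses the value read before
            the modification, which is what lets the caller decide
            whether the second set of waiters needs to be
            woken.</para>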
|
|
</sect4>
|
|
|
|
<sect4 xml:id="futex-locking">
|
|
<title>Futex locking</title>
|
|
|
|
          <para>The futex implementation uses two locks: an
            <function>sx_lock</function> protecting the futex list and
            a global lock (either Giant or another
            <function>sx_lock</function>). Every operation is
            performed locked from the start to the very end.</para>
|
|
</sect4>
|
|
</sect3>
|
|
</sect2>
|
|
|
|
<sect2 xml:id="syscall-impl">
|
|
<title>Various syscalls implementation</title>
|
|
|
|
<para>In this section I am going to describe some smaller
|
|
syscalls that are worth mentioning because their
|
|
implementation is not obvious or those syscalls are
|
|
        interesting from another point of view.</para>
|
|
|
|
<sect3 xml:id="syscall-at">
|
|
<title>*at family of syscalls</title>
|
|
|
|
        <para>During development of the &linux; 2.6.16 kernel, the *at
          syscalls were added. Those syscalls
          (<function>openat</function> for example) work exactly like
          their at-less counterparts with the slight exception of the
          <varname>dirfd</varname> parameter. This parameter changes
          where the given file, on which the syscall is to be
          performed, is looked up. When the
          <varname>filename</varname> parameter is absolute,
          <varname>dirfd</varname> is ignored, but when the path to
          the file is relative, it comes into play. The
          <varname>dirfd</varname> parameter is a directory relative
          to which the relative pathname is resolved. It is either a
          file descriptor of some directory or
          <literal>AT_FDCWD</literal>. So for example the
          <function>openat</function> syscall can be used like
          this:</para>
|
|
|
|
        <programlisting>file descriptor 123 = /tmp/foo/, current working directory = /tmp/

openat(123, "/tmp/bah", flags, mode)    /* opens /tmp/bah */
openat(123, "bah", flags, mode)         /* opens /tmp/foo/bah */
openat(AT_FDCWD, "bah", flags, mode)    /* opens /tmp/bah */
openat(stdio, "bah", flags, mode)       /* returns error because stdio is not a directory */</programlisting>

<para>This infrastructure is necessary to avoid races when
opening files outside the working directory. Imagine a process
consisting of two threads, thread A and thread B. Thread A
issues <literal>open("/tmp/foo/bah", flags, mode)</literal>,
and before the call returns the thread is preempted and
thread B runs. Thread B does not care about the needs of
thread A and renames or removes
<filename>/tmp/foo/</filename>. We have a race. To avoid it, we
can open <filename>/tmp/foo</filename> and use it as the
<varname>dirfd</varname> for the <function>openat</function>
syscall. This also enables users to implement per-thread
working directories.</para>
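<para>The effect of holding a directory descriptor can be
sketched in userland as follows. This is an illustration under
stated assumptions, not code from the emulation layer: the
function name <function>open_survives_rename</function> is
hypothetical, and the <function>rename</function> call stands in
for the interfering thread B.</para>

<programlisting><![CDATA[#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical demonstration: "dir" is renamed to "newdir" between
 * obtaining the directory descriptor and opening "name".
 * Returns 0 if the file can still be opened through the descriptor
 * and -1 otherwise.
 */
int
open_survives_rename(const char *dir, const char *newdir, const char *name)
{
	int dirfd, fd;

	dirfd = open(dir, O_RDONLY);
	if (dirfd == -1)
		return (-1);

	/* Stands in for thread B renaming the directory. */
	if (rename(dir, newdir) == -1) {
		close(dirfd);
		return (-1);
	}

	/* The descriptor still refers to the same directory. */
	fd = openat(dirfd, name, O_RDONLY);
	close(dirfd);
	if (fd == -1)
		return (-1);
	close(fd);
	return (0);
}]]></programlisting>

<para>A plain <function>open</function> using the old absolute
path would fail at the same point, because the path no longer
exists after the rename.</para>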

<para>The &linux; family of *at syscalls contains:
<function>linux_openat</function>,
<function>linux_mkdirat</function>,
<function>linux_mknodat</function>,
<function>linux_fchownat</function>,
<function>linux_futimesat</function>,
<function>linux_fstatat64</function>,
<function>linux_unlinkat</function>,
<function>linux_renameat</function>,
<function>linux_linkat</function>,
<function>linux_symlinkat</function>,
<function>linux_readlinkat</function>,
<function>linux_fchmodat</function> and
<function>linux_faccessat</function>. All of these are
implemented using a modified &man.namei.9; routine and a
simple wrapping layer.</para>

<sect4 xml:id="implementation">
<title>Implementation</title>

<para>The implementation is done by altering the
&man.namei.9; routine (described above) to take an additional
parameter <varname>dirfd</varname> in its
<literal>nameidata</literal> structure, which specifies the
starting point of the pathname lookup instead of using the
current working directory every time. The resolution of
<varname>dirfd</varname> from a file descriptor number to a
vnode is done in the native *at syscalls. When
<varname>dirfd</varname> is <literal>AT_FDCWD</literal>, the
<varname>dvp</varname> entry in the
<literal>nameidata</literal> structure is
<literal>NULL</literal>. When <varname>dirfd</varname> is a
different number, we obtain the file for this file descriptor,
check whether the file is valid and, if a vnode is attached to
it, obtain that vnode. Then we check that this vnode is a
directory. In the actual &man.namei.9; routine we simply
substitute the <varname>dvp</varname> vnode for the
<varname>dp</varname> variable, which determines the starting
point. The &man.namei.9; routine is not used directly but
through a chain of functions on various levels. For example,
<function>openat</function> goes like this:</para>

<programlisting>openat() --> kern_openat() --> vn_open() --> namei()</programlisting>
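<para>The <varname>dirfd</varname> checks described above can be
approximated in userland. The sketch below is not the kernel
code: <function>resolve_start</function> is a hypothetical
function, descriptors stand in for vnodes, and returning
<literal>AT_FDCWD</literal> as the starting point plays the role
of the <literal>NULL</literal> <varname>dvp</varname>.</para>

<programlisting><![CDATA[#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>

/*
 * Hypothetical userland analogue of the dirfd resolution.
 * On success returns 0 and stores the lookup starting point in *start
 * (AT_FDCWD means "use the current working directory").
 * On failure returns an errno value.
 */
int
resolve_start(int dirfd, const char *path, int *start)
{
	struct stat sb;

	/* Absolute paths and AT_FDCWD do not need a directory descriptor. */
	if (path[0] == '/' || dirfd == AT_FDCWD) {
		*start = AT_FDCWD;
		return (0);
	}

	/* The descriptor must refer to an open file ... */
	if (fstat(dirfd, &sb) == -1)
		return (errno);

	/* ... and that file must be a directory. */
	if (!S_ISDIR(sb.st_mode))
		return (ENOTDIR);

	*start = dirfd;
	return (0);
}]]></programlisting>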

<para>For this reason <function>kern_open</function> and
<function>vn_open</function> must be altered to incorporate
the additional <varname>dirfd</varname> parameter. No compat
layer is created for them because they have few users, and
those users can be easily converted. This general
implementation enables &os; to implement its own *at syscalls.
This is being discussed right now.</para>
</sect4>
</sect3>

<sect3 xml:id="ioctl">
<title>Ioctl</title>

<para>The ioctl interface is quite fragile due to its
generality. We have to bear in mind that devices differ
between &linux; and &os;, so some care must be taken to make
ioctl emulation work correctly. The ioctl handling is
implemented in <filename>linux_ioctl.c</filename>, where the
<function>linux_ioctl</function> function is defined. This
function simply iterates over sets of ioctl handlers to find a
handler that implements a given command. The ioctl syscall has
three parameters: the file descriptor, the command and an
argument. The command is a 16-bit number, which in theory is
divided into the high 8 bits, determining the class of the
ioctl command, and the low 8 bits, which are the actual command
within the given set. The emulation takes advantage of this
division. We implement handlers for each set, like
<function>sound_handler</function> or
<function>disk_handler</function>. Each handler has a maximum
command and a minimum command defined, which are used to
determine which handler is used. There are slight problems
with this approach because &linux; does not use the set
division consistently, so sometimes ioctls for a different set
are inside a set they should not belong to (SCSI generic ioctls
inside the cdrom set, etc.). &os; currently does not implement
many &linux; ioctls (compared to NetBSD, for example) but the
plan is to port those from NetBSD. The trend is to use &linux;
ioctls even in the native &os; drivers because this makes
porting applications easy.</para>
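<para>The range-based handler lookup can be sketched as follows.
This is a simplified illustration, not the code in
<filename>linux_ioctl.c</filename>: all names prefixed with
<literal>demo_</literal> are hypothetical, and the command
ranges in the table are made-up values chosen only to show the
class and number split.</para>

<programlisting><![CDATA[#include <stddef.h>

/* Hypothetical handler record: a name and the command range it serves. */
struct demo_ioctl_handler {
	const char	*name;
	unsigned int	 low;	/* minimum command */
	unsigned int	 high;	/* maximum command */
};

/* Made-up ranges, one per set. */
static const struct demo_ioctl_handler demo_handlers[] = {
	{ "disk",  0x0300, 0x03ff },
	{ "sound", 0x5000, 0x50ff },
};

/* High 8 bits of the 16-bit command: the class (set). */
unsigned int
demo_ioctl_class(unsigned int cmd)
{
	return ((cmd >> 8) & 0xff);
}

/* Low 8 bits of the 16-bit command: the command within the set. */
unsigned int
demo_ioctl_number(unsigned int cmd)
{
	return (cmd & 0xff);
}

/*
 * Iterate over the handlers and return the first one whose range
 * contains cmd, or NULL if no handler claims the command.
 */
const struct demo_ioctl_handler *
demo_find_handler(unsigned int cmd)
{
	size_t i;

	cmd &= 0xffff;
	for (i = 0; i < sizeof(demo_handlers) / sizeof(demo_handlers[0]); i++) {
		if (cmd >= demo_handlers[i].low && cmd <= demo_handlers[i].high)
			return (&demo_handlers[i]);
	}
	return (NULL);
}]]></programlisting>

<para>A lookup by range rather than by exact class is what lets
a handler claim commands that &linux; placed in a set where
they do not belong.</para>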
</sect3>

<sect3 xml:id="debugging">
<title>Debugging</title>

<para>Every syscall should be debuggable. For this purpose we
introduce a small infrastructure. The ldebug facility tells
whether a given syscall should be debugged (settable via a
sysctl). For printing we have the LMSG and ARGS macros, which
are used to build the printed string so that debugging messages
have a uniform format.</para>
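<para>The idea behind these macros can be sketched in userland.
The definitions below are assumptions made for illustration and
do not reproduce the kernel macros: the <literal>demo_</literal>
names are hypothetical, the array stands in for the state
controlled via the sysctl, and the message prefix is
invented.</para>

<programlisting><![CDATA[#include <stdio.h>

/* Hypothetical per-syscall debug switches, indexed by syscall number. */
#define	DEMO_NSYSCALLS	8
static unsigned char demo_debug_map[DEMO_NSYSCALLS];

/* Stands in for ldebug: should this syscall be traced? */
#define	demo_ldebug(nr)		(demo_debug_map[(nr)] != 0)

/*
 * Stand in for ARGS and LMSG: both wrap a caller-supplied format in a
 * common prefix and a trailing newline, so all messages look alike.
 */
#define	DEMO_ARGS(name, fmt)	"linux(%ld): " #name "(" fmt ")\n"
#define	DEMO_LMSG(fmt)		"linux(%ld): " fmt "\n"

/* Format the entry message for a close(fd) call made by process pid. */
int
demo_format_entry(char *buf, size_t len, long pid, int fd)
{
	return (snprintf(buf, len, DEMO_ARGS(close, "%d"), pid, fd));
}

/* Format a free-form message for process pid. */
int
demo_format_msg(char *buf, size_t len, long pid, int error)
{
	return (snprintf(buf, len, DEMO_LMSG("close failed: %d"), pid, error));
}]]></programlisting>

<para>A syscall implementation would first test
<literal>demo_ldebug(nr)</literal> and only then format and
print the message.</para>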
</sect3>
</sect2>
</sect1>

<sect1 xml:id="conclusion">
<title>Conclusion</title>

<sect2 xml:id="results">
<title>Results</title>

<para>As of April 2007 the &linux; emulation layer is capable of
emulating the &linux; 2.6.16 kernel quite well. The
remaining problems concern futexes, the unfinished *at family
of syscalls, problematic signal delivery, missing
<function>epoll</function> and <function>inotify</function>,
and probably some bugs we have not discovered yet. Despite
this, we are capable of running basically all the &linux;
programs included in the &os; Ports Collection with
Fedora Core 4 at 2.6.16, and there are some rudimentary
reports of success with Fedora Core 6 at 2.6.16. The
Fedora Core 6 linux_base was recently committed, enabling
further testing of the emulation layer and giving us more
hints about where we should put our effort in implementing
missing functionality.</para>

<para>We are able to run the most used applications, such as
<package>www/linux-firefox</package>,
<package>www/linux-opera</package>,
<package>net-im/skype</package> and some games from the
Ports Collection. Some of the programs exhibit bad behavior
under 2.6 emulation, but this is currently under investigation
and hopefully will be fixed soon. The only big application
that is known not to work is the &linux; &java; Development
Kit, because it requires the <function>epoll</function>
facility, which is not directly related to the &linux; 2.6
kernel.</para>

<para>We hope to enable 2.6.16 emulation by default some time
after &os; 7.0 is released, at least to expose the 2.6
emulation parts to wider testing. Once this is done we can
switch to the Fedora Core 6 linux_base, which is the ultimate
plan.</para>
</sect2>

<sect2 xml:id="future-work">
<title>Future work</title>

<para>Future work should focus on fixing the remaining issues
with futexes, implementing the rest of the *at family of
syscalls, fixing the signal delivery and possibly implementing
the <function>epoll</function> and
<function>inotify</function> facilities.</para>

<para>We hope to be able to run the most important programs
flawlessly soon, so that we can switch to the 2.6 emulation by
default and make Fedora Core 6 the default linux_base, because
the currently used Fedora Core 4 is not supported any
more.</para>

<para>Another possible goal is to share our code with NetBSD
and DragonFly BSD. NetBSD has some support for 2.6 emulation
but it is far from finished and not really tested.
DragonFly BSD has expressed some interest in porting the 2.6
improvements.</para>

<para>Generally, as &linux; develops we would like to keep up
with its development by implementing newly added syscalls;
<function>splice</function> comes to mind first. Some already
implemented syscalls are also heavily crippled, for example
<function>mremap</function>. Some performance improvements can
also be made, such as finer grained locking.</para>
</sect2>

<sect2 xml:id="team">
<title>Team</title>

<para>I cooperated on this project with (in alphabetical
order):</para>

<itemizedlist>
<listitem>
<para>&a.jhb.email;</para>
</listitem>
<listitem>
<para>&a.kib.email;</para>
</listitem>
<listitem>
<para>Emmanuel Dreyfus</para>
</listitem>
<listitem>
<para>Scot Hetzel</para>
</listitem>
<listitem>
<para>&a.jkim.email;</para>
</listitem>
<listitem>
<para>&a.netchild.email;</para>
</listitem>
<listitem>
<para>&a.ssouhlal.email;</para>
</listitem>
<listitem>
<para>Li Xiao</para>
</listitem>
<listitem>
<para>&a.davidxu.email;</para>
</listitem>
</itemizedlist>

<para>I would like to thank all those people for their advice,
code reviews and general support.</para>
</sect2>
</sect1>

<sect1 xml:id="literatures">
<title>Literature</title>

<orderedlist>
<listitem>
<para>Marshall Kirk McKusick and George V. Neville-Neil. The
Design and Implementation of the &os; Operating System.
Addison-Wesley, 2005.</para>
</listitem>
<listitem>
<para><uri
xlink:href="https://tldp.org">https://tldp.org</uri></para>
</listitem>
<listitem>
<para><uri
xlink:href="https://www.kernel.org">https://www.kernel.org</uri></para>
</listitem>
</orderedlist>
</sect1>
</article>