2366 lines
103 KiB
XML
<?xml version="1.0" encoding="iso-8859-1"?>
|
|
<!DOCTYPE article PUBLIC "-//FreeBSD//DTD DocBook XML V5.0-Based Extension//EN"
|
|
"http://www.FreeBSD.org/XML/share/xml/freebsd50.dtd">
|
|
<!-- $FreeBSD$ -->
|
|
<!-- The FreeBSD Documentation Project -->
|
|
<article xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en">
|
|
<info><title>&linux; emulation in &os;</title>
|
|
|
|
|
|
<author><personname><firstname>Roman</firstname><surname>Divacky</surname></personname><affiliation>
|
|
<address><email>rdivacky@FreeBSD.org</email></address>
|
|
</affiliation></author>
|
|
|
|
<legalnotice xml:id="trademarks" role="trademarks">
|
|
&tm-attrib.adobe;
|
|
&tm-attrib.ibm;
|
|
&tm-attrib.freebsd;
|
|
&tm-attrib.linux;
|
|
&tm-attrib.netbsd;
|
|
&tm-attrib.realnetworks;
|
|
&tm-attrib.oracle;
|
|
&tm-attrib.sun;
|
|
&tm-attrib.general;
|
|
</legalnotice>
|
|
|
|
<pubdate>$FreeBSD$</pubdate>
|
|
|
|
<releaseinfo>$FreeBSD$</releaseinfo>
|
|
|
|
<abstract>
|
|
<para>This master's thesis deals with updating the &linux; emulation layer
(the so-called <firstterm>Linuxulator</firstterm>). The task was to update the layer to match
|
|
the functionality of &linux; 2.6. As a reference implementation, the
|
|
&linux; 2.6.16 kernel was chosen. The concept is loosely based on
|
|
the NetBSD implementation. Most of the work was done in the summer
|
|
of 2006 as a part of the Google Summer of Code student program.
|
|
The focus was on bringing the <firstterm>NPTL</firstterm> (new &posix;
|
|
thread library) support into the emulation layer, including
|
|
<firstterm>TLS</firstterm> (thread local storage),
|
|
<firstterm>futexes</firstterm> (fast user space mutexes),
|
|
<firstterm>PID mangling</firstterm>, and some other minor
|
|
things. Many small problems were identified and fixed in the
|
|
process. My work was integrated into the main &os; source
|
|
repository and will be shipped in the upcoming 7.0R release. We,
|
|
the emulation development team, are working on making the
|
|
&linux; 2.6 emulation the default emulation layer in &os;.</para>
|
|
</abstract>
|
|
</info>
|
|
|
|
<sect1 xml:id="intro">
|
|
<title>Introduction</title>
|
|
|
|
<para>In the last few years, open source &unix;-based operating systems
have become widely deployed on server and client machines. Among
these operating systems I would like to point out two: &os;, for its BSD
heritage, time-proven code base and many interesting features, and
&linux;, for its wide user base, enthusiastic open developer community
and support from large companies. &os; tends to be used on server
class machines serving heavy-duty networking tasks, with less usage on
desktop class machines for ordinary users. &linux; sees the same
usage on servers but is used much more by home users. This
leads to a situation where there are many binary-only programs available
for &linux; that lack support for &os;.</para>
|
|
|
|
<para>Naturally, a need for the ability to run &linux; binaries on a &os;
|
|
system arises and this is what this thesis deals with: the emulation of
|
|
the &linux; kernel in the &os; operating system.</para>
|
|
|
|
<para>During the Summer of 2006 Google Inc. sponsored a project which
|
|
focused on extending the &linux; emulation layer (the so called Linuxulator)
|
|
in &os; to include &linux; 2.6 facilities. This thesis is written as a
|
|
part of this project.</para>
|
|
</sect1>
|
|
|
|
<sect1 xml:id="inside">
|
|
<title>A look inside…</title>
|
|
|
|
<para>In this section we describe each operating system in
question: how it deals with syscalls, trapframes, and other low-level
details. We also describe the way each system understands common &unix;
primitives such as what a PID is, what a thread is, etc. In the final
subsection we talk about how &unix; on &unix; emulation could be done
in general.</para>
|
|
|
|
<sect2 xml:id="what-is-unix">
|
|
<title>What is &unix;</title>
|
|
|
|
<para>&unix; is an operating system with a long history that has
influenced almost every other operating system currently in use.
Starting in the 1960s, its development continues to this day (although
in different projects). &unix; development soon forked into two main
branches: the BSD and System III/V families. They influenced each
other while growing toward a common &unix; standard. Among the
contributions originating in BSD we can name virtual memory, TCP/IP
networking, FFS, and many others. The System V branch contributed
SysV interprocess communication primitives, copy-on-write, etc. &unix;
itself does not exist any more, but its ideas have been used by many
other operating systems worldwide, thus forming the so-called &unix;-like
operating systems. These days the most influential ones are &linux;,
Solaris, and possibly (to some extent) &os;. There are in-company
&unix; derivatives (AIX, HP-UX etc.), but their users have been
migrating more and more to the aforementioned systems. Let us summarize typical
&unix; characteristics.</para>
|
|
</sect2>
|
|
|
|
<sect2 xml:id="tech-details">
|
|
<title>Technical details</title>
|
|
|
|
<para>Every running program constitutes a process that represents a state
of the computation. A running process is divided between kernel space
and user space. Some operations can be done only from kernel space
(dealing with hardware etc.), but the process should spend most of its
lifetime in user space. The kernel is where the management of
processes, hardware, and low-level details takes place. The kernel
provides a standard unified &unix; API to user space. The most
important parts of this API are covered below.</para>
|
|
|
|
<sect3 xml:id="kern-proc-comm">
|
|
<title>Communication between kernel and user space process</title>
|
|
|
|
<para>The common &unix; API defines a syscall as a way to issue commands
from a user space process to the kernel. The most common
implementation uses either an interrupt or a specialized
instruction (think of the
<literal>SYSENTER</literal>/<literal>SYSCALL</literal> instructions
for ia32). Syscalls are identified by a number. For example in &os;,
syscall number 85 is the &man.swapon.2; syscall and
syscall number 132 is &man.mkfifo.2;. Some syscalls need
parameters, which are passed from user space to kernel space
in various ways (implementation dependent). Syscalls are
synchronous.</para>
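<para>As a hedged illustration of syscalls being identified by number, the following userland sketch asks the kernel for the process ID through the generic <literal>syscall()</literal> wrapper instead of calling &man.getpid.2; directly. The helper name <literal>getpid_by_number</literal> is made up for this example, and <literal>SYS_getpid</literal> is the symbolic constant for the syscall number, whose numeric value differs between operating systems.</para>

<programlisting><![CDATA[
```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE         /* lets strict glibc modes declare syscall() */
#endif
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Ask the kernel for our PID by raw syscall number instead of getpid(). */
long
getpid_by_number(void)
{
    long pid = syscall(SYS_getpid);

    assert(pid > 0);            /* this particular syscall cannot fail */
    return (pid);
}
```
]]></programlisting>

<para>Calling <literal>getpid_by_number()</literal> should yield the same value as &man.getpid.2;, since both requests end up in the same kernel routine.</para>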
|
|
|
|
<para>Another possible way to communicate is by using a
<firstterm>trap</firstterm>. Unlike syscalls, traps are not requested
explicitly by the process: they occur as a consequence of
some event (division by zero, page fault etc.). A trap
can be transparent for a process (page fault) or can result in
a reaction like sending a <firstterm>signal</firstterm>
(division by zero).</para>
|
|
</sect3>
|
|
|
|
<sect3 xml:id="proc-proc-comm">
|
|
<title>Communication between processes</title>
|
|
|
|
<para>There are other APIs (System V IPC, shared memory etc.), but the
single most important one is signals. Signals are sent by processes
or by the kernel and are received by processes. Some signals
can be ignored or handled by a user-supplied routine; others result
in a predefined action that cannot be altered or ignored.</para>
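<para>Both kinds of signals can be observed from userland. The following is a minimal sketch, assuming a &posix; environment; the function names are invented for this illustration. It installs a handler for <literal>SIGUSR1</literal> and then checks that an attempt to ignore <literal>SIGKILL</literal> is rejected.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <signal.h>

static volatile sig_atomic_t got_usr1;

/* User-supplied routine: only records that the signal arrived. */
static void
on_usr1(int signo)
{
    (void)signo;
    got_usr1 = 1;
}

/* Returns 1 when the user-supplied SIGUSR1 handler actually ran. */
int
handled_signal_is_delivered(void)
{
    int rc;

    got_usr1 = 0;
    if (signal(SIGUSR1, on_usr1) == SIG_ERR)
        return (0);
    rc = raise(SIGUSR1);        /* send the signal to ourselves */
    assert(rc == 0);
    (void)rc;
    signal(SIGUSR1, SIG_DFL);   /* restore the default action */
    return (got_usr1 == 1);
}

/* Returns 1 when the system refuses to let SIGKILL be ignored. */
int
sigkill_cannot_be_ignored(void)
{
    return (signal(SIGKILL, SIG_IGN) == SIG_ERR);
}
```
]]></programlisting>

<para>In a single-threaded program both functions are expected to return 1: the handler runs before <literal>raise()</literal> returns, and the request to ignore <literal>SIGKILL</literal> fails.</para>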
|
|
</sect3>
|
|
|
|
<sect3 xml:id="proc-mgmt">
|
|
<title>Process management</title>
|
|
|
|
<para>The kernel instantiates the first process in the system (the so-called
init). Every running process can create an identical copy of itself using
the &man.fork.2; syscall. Some slightly modified versions of this
syscall were introduced, but the basic semantics are the same. Every
running process can morph into some other process using the
&man.exec.3; syscall. Some modifications of this syscall were
introduced, but all serve the same basic purpose. Processes end
their lives by calling the &man.exit.2; syscall. Every process is
identified by a unique number called the PID. Every process has a
defined parent (identified by its PID).</para>
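<para>The life cycle described above can be sketched in a few lines of userland C. This is only an illustration under the assumption of a &posix; system, and the function name is hypothetical. The child created by &man.fork.2; checks that its parent PID matches the PID of its creator and reports the outcome through its exit status. The exec step is left out to keep the sketch short.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that exits with status 0 if its parent PID matches the
 * PID of the process that created it, and with status 1 otherwise.
 * Returns the child's exit status, or -1 if something went wrong.
 */
int
child_reports_parent(void)
{
    pid_t parent = getpid();
    pid_t child = fork();
    int status;

    if (child == -1)
        return (-1);
    if (child == 0)             /* child: verify the parent and exit */
        _exit(getppid() == parent ? 0 : 1);

    assert(child != parent);    /* the copy received its own PID */
    if (waitpid(child, &status, 0) != child || !WIFEXITED(status))
        return (-1);
    return (WEXITSTATUS(status));
}
```
]]></programlisting>

<para>Because the parent waits for the child, the parent cannot disappear before the child looks up its parent PID.</para>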
|
|
</sect3>
|
|
|
|
<sect3 xml:id="thread-mgmt">
|
|
<title>Thread management</title>
|
|
|
|
<para>Traditional &unix; does not define any API or implementation
for threading, while &posix; defines a threading API but leaves the
implementation undefined. Traditionally there were two ways of
implementing threads: handling them as separate processes (1:1
threading), or enveloping the whole thread group in one process and
managing the threading in userspace (1:N threading). The
main features of each approach compare as follows:</para>
|
|
|
|
<para>1:1 threading</para>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>- heavyweight threads</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>- the scheduling cannot be altered by the user
|
|
(slightly mitigated by the &posix; API)</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>+ no syscall wrapping necessary</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>+ can utilize multiple CPUs</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para>1:N threading</para>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>+ lightweight threads</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>+ scheduling can be easily altered by the user</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>- syscalls must be wrapped </para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>- cannot utilize more than one CPU</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</sect3>
|
|
</sect2>
|
|
|
|
<sect2 xml:id="what-is-freebsd">
|
|
<title>What is &os;?</title>
|
|
|
|
<para>The &os; project is one of the oldest open source operating
|
|
systems currently available for daily use. It is a direct descendant
|
|
of the genuine &unix; so it could be claimed that it is a true &unix;
|
|
although licensing issues do not permit that. The start of the project
|
|
dates back to the early 1990s when a crew of fellow BSD users patched
|
|
the 386BSD operating system. Based on this patchkit a new operating
|
|
system arose named &os; for its liberal license. Another group created
|
|
the NetBSD operating system with different goals in mind. We will
|
|
focus on &os;.</para>
|
|
|
|
<para>&os; is a modern &unix;-based operating system with all the
|
|
features of &unix;. Preemptive multitasking, multiuser facilities,
|
|
TCP/IP networking, memory protection, symmetric multiprocessing
|
|
support, virtual memory with merged VM and buffer cache, they are all
|
|
there. One of the interesting and extremely useful features is the
|
|
ability to emulate other &unix;-like operating systems. As of
|
|
December 2006 and 7-CURRENT development, the following
|
|
emulation functionalities are supported:</para>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>&os;/i386 emulation on &os;/amd64</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>&os;/i386 emulation on &os;/ia64</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>&linux;-emulation of &linux; operating system on &os;</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>NDIS-emulation of Windows networking drivers interface</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>NetBSD-emulation of NetBSD operating system</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>PECoff-support for PECoff &os; executables</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>SVR4-emulation of System V revision 4 &unix;</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para>The actively developed emulations are the &linux; layer and various
&os;-on-&os; layers. The others are not expected to work properly or
be usable these days.</para>
|
|
|
|
<para>&os; development happens in a central CVS repository where only
|
|
a selected team of so called committers can write. This repository
|
|
possesses several branches; the most interesting are the HEAD branch,
|
|
in &os; nomenclature called -CURRENT, and RELENG_X branches, where X
|
|
stands for a number indicating a major version of &os;. As of
|
|
December 2006, there are development branches for 6.X development
|
|
(RELENG_6) and for the 5.X development (RELENG_5). Other branches are
|
|
closed and not actively maintained or only fed with security patches
|
|
by the Security Officer of the &os; project.</para>
|
|
|
|
<para>Historically, active development was done in the HEAD branch, so
it was considered extremely unstable and liable to break
at any time. This is no longer true, as the
<application>Perforce</application> (commercial version control system)
repository was introduced so that active development happens there.
There are many branches in <application>Perforce</application> where
development of certain parts of the system happens, and these branches
are merged back to the main CVS repository from time to time, thus
effectively bringing the given feature into the &os; operating system.
The same happened with the <filename>rdivacky_linuxolator</filename>
branch where development of this thesis code was going on.</para>
|
|
|
|
<para>More info about the &os; operating system can be found
|
|
at [2].</para>
|
|
|
|
<sect3 xml:id="freebsd-tech-details">
|
|
<title>Technical details</title>
|
|
|
|
<para>&os; is a traditional flavor of &unix; in the sense of dividing the
run of processes into two halves: kernel space and user space run.
|
|
There are two types of process entry to the kernel: a syscall and a
|
|
trap. There is only one way to return. In the subsequent sections
|
|
we will describe the three gates to/from the kernel. The whole
|
|
description applies to the i386 architecture as the Linuxulator
|
|
only exists there but the concept is similar on other architectures.
|
|
The information was taken from [1] and the source code.</para>
|
|
|
|
<sect4 xml:id="freebsd-sys-entries">
|
|
<title>System entries</title>
|
|
|
|
<para>&os; has an abstraction called an execution class loader,
|
|
which is a wedge into the &man.execve.2; syscall. This employs a
|
|
structure <literal>sysentvec</literal>, which describes an
|
|
executable ABI. It contains things like an errno translation table,
a signal translation table, and various functions to serve syscall needs
|
|
(stack fixup, coredumping, etc.). Every ABI the &os; kernel wants
|
|
to support must define this structure, as it is used later in the
|
|
syscall processing code and at some other places. System entries
|
|
are handled by trap handlers, where we can access both the
|
|
kernel-space and the user-space at once.</para>
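<para>To make the idea of a per-ABI descriptor more concrete, here is a heavily simplified userland model. It is not the real <literal>sysentvec</literal> layout: the structure, its field names, the error codes and the table values are all assumptions chosen for illustration. The sketch only shows how an errno translation table attached to an ABI descriptor could be consulted.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical native error codes; real kernels define many more. */
enum { NAT_OK, NAT_ENOENT, NAT_EAGAIN, NAT_EDEADLK, NAT_NERR };

#define TOY_EFALLBACK (-22)     /* assumed catch-all for unmapped codes */

/* A heavily simplified stand-in for an ABI descriptor. */
struct toy_abi {
    const char *name;           /* human-readable ABI name */
    const int  *errno_tbl;      /* native code -> ABI code, NULL if native */
    size_t      errno_cnt;      /* number of entries in errno_tbl */
};

/* Illustrative values only: the emulated ABI reports negative numbers. */
static const int toy_guest_errno[NAT_NERR] = {
    [NAT_OK] = 0, [NAT_ENOENT] = -2, [NAT_EAGAIN] = -11, [NAT_EDEADLK] = -35,
};

const struct toy_abi toy_native_abi = { "native", NULL, 0 };
const struct toy_abi toy_guest_abi  = { "guest", toy_guest_errno, NAT_NERR };

/* Translate a native error code into the numbering of the given ABI. */
int
toy_translate_errno(const struct toy_abi *abi, int error)
{
    assert(abi != NULL && error >= 0);
    if (abi->errno_tbl == NULL)
        return (error);         /* native ABI: nothing to translate */
    if ((size_t)error >= abi->errno_cnt)
        return (TOY_EFALLBACK); /* code not covered by the table */
    return (abi->errno_tbl[error]);
}
```
]]></programlisting>

<para>The point of the design is that the generic syscall code never needs to know which ABI it serves; it only follows the pointers stored in the descriptor of the running process.</para>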
|
|
</sect4>
|
|
|
|
<sect4 xml:id="freebsd-syscalls">
|
|
<title>Syscalls</title>
|
|
|
|
<para>Syscalls on &os; are issued by executing interrupt
|
|
<literal>0x80</literal> with register <varname>%eax</varname> set
|
|
to a desired syscall number with arguments passed on the stack.</para>
|
|
|
|
<para>When a process issues an interrupt <literal>0x80</literal>, the
<literal>int0x80</literal> syscall trap handler is invoked (defined
in <filename>sys/i386/i386/exception.s</filename>), which prepares
arguments (i.e. copies them onto the stack) for a
call to a C function &man.syscall.2; (defined in
<filename>sys/i386/i386/trap.c</filename>), which processes the
passed-in trapframe. The processing consists of preparing the
syscall (depending on the <literal>sysvec</literal> entry) and
determining whether the syscall is a 32-bit or a 64-bit one (which changes the size
of the parameters); then the parameters are copied in, including the
syscall. Next, the actual syscall function is executed and its
return code is processed (with special cases for the
<literal>ERESTART</literal> and <literal>EJUSTRETURN</literal>
errors). Finally, <literal>userret()</literal> is scheduled,
switching the process back to user space. The parameters to
the actual syscall handler are passed in the form of
<literal>struct thread *td</literal>,
<literal>struct syscall_args *</literal> arguments, where the second
parameter is a pointer to the copied-in structure of
parameters.</para>
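<para>The return code processing can be modelled in userland as follows. This is a sketch under stated assumptions, not the kernel code: all <literal>toy_</literal> names, the two-entry syscall table, the register subset and the numeric constants are invented for the example. It shows a handler taking a thread pointer plus an argument pointer, and a dispatcher that treats success, restart, just-return and ordinary errors differently.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

/* Values and names below are assumptions made for this model only. */
#define TOY_ERESTART    (-1)    /* handler asks for the syscall to be rerun */
#define TOY_EJUSTRETURN (-2)    /* handler already set up the frame itself */
#define TOY_ENOSYS      78      /* "no such syscall" error number */
#define TOY_INSN_LEN    2       /* length of the "int $0x80" instruction */

struct toy_frame {              /* the part of a trapframe we model */
    unsigned long eax;          /* syscall number in, result or error out */
    unsigned long eip;          /* address at which execution resumes */
    int carry;                  /* error indication for userland */
};

struct toy_thread {
    long retval;                /* set by a successful handler */
    struct toy_frame *frame;
};

typedef int toy_handler_t(struct toy_thread *td, void *args);

int
toy_getpid(struct toy_thread *td, void *args)
{
    (void)args;
    td->retval = 42;            /* pretend PID */
    return (0);
}

int
toy_interrupted(struct toy_thread *td, void *args)
{
    (void)td;
    (void)args;
    return (TOY_ERESTART);
}

toy_handler_t *const toy_table[] = { toy_getpid, toy_interrupted };

void
toy_syscall(struct toy_thread *td, void *args)
{
    struct toy_frame *tf = td->frame;
    size_t nsys = sizeof(toy_table) / sizeof(toy_table[0]);
    int error;

    assert(tf != NULL);
    error = tf->eax < nsys ? toy_table[tf->eax](td, args) : TOY_ENOSYS;

    switch (error) {
    case 0:                     /* success: hand back the result */
        tf->eax = (unsigned long)td->retval;
        tf->carry = 0;
        break;
    case TOY_ERESTART:          /* back up so the instruction runs again */
        tf->eip -= TOY_INSN_LEN;
        break;
    case TOY_EJUSTRETURN:       /* leave the frame untouched */
        break;
    default:                    /* failure: hand back the error number */
        tf->eax = (unsigned long)error;
        tf->carry = 1;
        break;
    }
}
```
]]></programlisting>

<para>In the restart case the saved instruction pointer is moved back, so that returning to user space executes the syscall instruction once more with the syscall number still in place.</para>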
|
|
</sect4>
|
|
|
|
<sect4 xml:id="freebsd-traps">
|
|
<title>Traps</title>
|
|
|
|
<para>Handling of traps in &os; is similar to the handling of
syscalls. Whenever a trap occurs, an assembler handler is called.
It is chosen from alltraps, alltraps_with_regs_pushed, or
calltrap, depending on the type of the trap. This handler prepares
arguments for a call to a C function <literal>trap()</literal>
(defined in <filename>sys/i386/i386/trap.c</filename>), which then
processes the trap. After the processing it might send a
signal to the process and/or exit to userland using
<literal>userret()</literal>.</para>
|
|
</sect4>
|
|
|
|
<sect4 xml:id="freebsd-exits">
|
|
<title>Exits</title>
|
|
|
|
<para>Exits from kernel to userspace happen using the assembler
|
|
routine <literal>doreti</literal> regardless of whether the kernel
|
|
was entered via a trap or via a syscall. This restores the program
|
|
status from the stack and returns to the userspace.</para>
|
|
</sect4>
|
|
|
|
<sect4 xml:id="freebsd-unix-primitives">
|
|
<title>&unix; primitives</title>
|
|
|
|
<para>The &os; operating system adheres to the traditional &unix; scheme,
where every process has a unique identification number, the so-called
<firstterm>PID</firstterm> (Process ID). PID numbers are
allocated either linearly or randomly, ranging from
<literal>0</literal> to <literal>PID_MAX</literal>. The allocation
of PID numbers is done using a linear search of the PID space. Every
thread in a process receives the same PID number as a result of the
&man.getpid.2; call.</para>
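<para>A linear search of the PID space can be pictured with the toy allocator below. It is a sketch only: the names, the tiny <literal>TOY_PID_MAX</literal> limit and the choice to reserve PID 0 are assumptions for the example, and the real kernel code is considerably more involved.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PID_MAX 8           /* tiny on purpose; assumed for the example */

static bool toy_pid_used[TOY_PID_MAX + 1];
static int  toy_lastpid;        /* last PID handed out, 0 at start */

/*
 * Walk the PID space linearly, starting right after the last PID handed
 * out and wrapping around at TOY_PID_MAX.  Returns -1 when every PID is
 * taken.  PID 0 is treated as reserved in this model.
 */
int
toy_pid_alloc(void)
{
    for (int step = 1; step <= TOY_PID_MAX + 1; step++) {
        int pid = (toy_lastpid + step) % (TOY_PID_MAX + 1);

        if (pid == 0 || toy_pid_used[pid])
            continue;
        toy_pid_used[pid] = true;
        toy_lastpid = pid;
        return (pid);
    }
    return (-1);
}

/* Give a PID back so that a later search can find it again. */
void
toy_pid_free(int pid)
{
    assert(pid > 0 && pid <= TOY_PID_MAX && toy_pid_used[pid]);
    toy_pid_used[pid] = false;
}
```
]]></programlisting>

<para>Note that a freed PID is not reused immediately in this model: the search continues from the last allocated value and only reaches the freed slot after wrapping around.</para>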
|
|
|
|
<para>There are currently two ways to implement threading in &os;.
The first is M:N threading and the second is the 1:1 threading model.
The default library uses M:N threading
(<literal>libpthread</literal>) and you can switch at runtime to
1:1 threading (<literal>libthr</literal>). The plan is to switch
to the 1:1 library by default soon. Although those two libraries use
the same kernel primitives, they are accessed through different
APIs. The M:N library uses the <literal>kse_*</literal> family
of syscalls while the 1:1 library uses the <literal>thr_*</literal>
family of syscalls. Because of this, there is no general concept
of a thread ID shared between kernel and userspace. Of course, both
threading libraries implement the pthread thread ID API. Every
kernel thread (as described by <literal>struct thread</literal>)
has a <literal>td_tid</literal> identifier, but this is not directly accessible
from userland and solely serves the kernel's needs. It is also
used by the 1:1 threading library as pthread's thread ID, but the handling
of this is internal to the library and cannot be relied on.</para>
|
|
|
|
<para>As stated previously, there are two implementations of threading
in &os;. The M:N library divides the work between kernel space and
userspace. A thread is an entity that gets scheduled in the kernel,
but it can represent a varying number of userspace threads.
M userspace threads get mapped to N kernel threads, thus saving
resources while keeping the ability to exploit multiprocessor
parallelism. Further information about the implementation can be
obtained from the man page or [1]. The 1:1 library directly maps a
userland thread to a kernel thread, thus greatly simplifying the
scheme. Neither of these designs implements a fairness mechanism (such
a mechanism was implemented but it was removed recently because it
caused serious slowdown and made the code more difficult to deal
with).</para>
|
|
</sect4>
|
|
</sect3>
|
|
</sect2>
|
|
|
|
<sect2 xml:id="what-is-linux">
|
|
<title>What is &linux;</title>
|
|
|
|
<para>&linux; is a &unix;-like kernel originally developed by Linus
Torvalds and now contributed to by a massive crowd of
programmers all around the world. From its humble beginnings to today,
with wide support from companies such as IBM or Google, &linux; has been
associated with its fast development pace, full hardware support
and benevolent dictator model of organization.</para>
|
|
|
|
<para>&linux; development started in 1991 as a hobbyist project at the
University of Helsinki in Finland. Since then it has obtained all the
|
|
features of a modern &unix;-like OS: multiprocessing, multiuser
|
|
support, virtual memory, networking, basically everything is there.
|
|
There are also highly advanced features like virtualization etc.</para>
|
|
|
|
<para>As of 2006 &linux; seems to be the most widely used open source
|
|
operating system with support from independent software vendors like
|
|
Oracle, RealNetworks, Adobe, etc. Most of the commercial software
|
|
distributed for &linux; can only be obtained in a binary form so
|
|
recompilation for other operating systems is impossible.</para>
|
|
|
|
<para>Most of the &linux; development happens in a
|
|
<application>Git</application> version control system.
|
|
<application>Git</application> is a distributed system so there is
|
|
no central source of the &linux; code, but some branches are considered
|
|
prominent and official. The version number scheme implemented by
|
|
&linux; consists of four numbers A.B.C.D. Currently development
|
|
happens in 2.6.C.D, where C represents a major version in which new
features are added or changed, while D is a minor version for bugfixes
only.</para>
|
|
|
|
<para>More information can be obtained from [4].</para>
|
|
|
|
<sect3 xml:id="linux-tech-details">
|
|
<title>Technical details</title>
|
|
|
|
<para>&linux; follows the traditional &unix; scheme of dividing the run
|
|
of a process in two halves: the kernel and user space. The kernel can
|
|
be entered in two ways: via a trap or via a syscall. The return is
|
|
handled only in one way. The further description applies to
|
|
&linux; 2.6 on the &i386; architecture. This information was
|
|
taken from [3].</para>
|
|
|
|
<sect4 xml:id="linux-syscalls">
|
|
<title>Syscalls</title>
|
|
|
|
<para>Syscalls in &linux; are performed (in userspace) using
<literal>syscallX</literal> macros, where X stands for the
number of parameters of the given syscall. This
macro translates to code that loads the <varname>%eax</varname>
register with the number of the syscall and executes interrupt
<literal>0x80</literal>. After this, syscall_return is called,
which translates negative return values to positive
<literal>errno</literal> values and sets <literal>res</literal> to
<literal>-1</literal> in case of an error. Whenever interrupt
<literal>0x80</literal> is issued, the process enters the kernel in the
system call trap handler. This routine saves all registers on the
stack and calls the selected syscall entry. Note that the &linux;
calling convention expects parameters to the syscall to be passed
via registers as shown here:</para>
|
|
|
|
<orderedlist>
|
|
<listitem>
|
|
<para>parameter -> <varname>%ebx</varname></para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>parameter -> <varname>%ecx</varname></para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>parameter -> <varname>%edx</varname></para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>parameter -> <varname>%esi</varname></para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>parameter -> <varname>%edi</varname></para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>parameter -> <varname>%ebp</varname></para>
|
|
</listitem>
|
|
</orderedlist>
|
|
|
|
<para>There are some exceptions to this, where &linux; uses a different
calling convention (most notably the <literal>clone</literal>
syscall).</para>
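<para>The register convention and the return value translation described in this section can be sketched as a small userland model. Nothing here executes a real syscall: the <literal>toy_regs</literal> structure, the function names and the <literal>TOY_MAX_ERRNO</literal> cut-off are assumptions made for illustration.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <errno.h>

/* Hypothetical register file holding only the registers named in the text. */
struct toy_regs {
    long eax, ebx, ecx, edx, esi, edi, ebp;
};

/* Place the syscall number and up to six arguments in the listed order. */
void
toy_load_syscall(struct toy_regs *r, long nr, const long *args, int nargs)
{
    long *slot[6] = {
        &r->ebx, &r->ecx, &r->edx, &r->esi, &r->edi, &r->ebp
    };

    assert(nargs >= 0 && nargs <= 6);
    r->eax = nr;                /* syscall number */
    for (int i = 0; i < nargs; i++)
        *slot[i] = args[i];     /* i-th parameter -> i-th register */
}

#define TOY_MAX_ERRNO 128       /* assumed range of "this is an error" */

/* Turn a raw kernel result into the usual -1 plus errno pair. */
long
toy_syscall_return(long res)
{
    if (res < 0 && res >= -TOY_MAX_ERRNO) {
        errno = (int)-res;      /* negative result becomes positive errno */
        return (-1);
    }
    return (res);
}
```
]]></programlisting>

<para>Keeping the error range small lets large unsigned results, such as addresses, pass through untouched even though they look negative when viewed as signed values.</para>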
|
|
</sect4>
|
|
|
|
<sect4 xml:id="linux-traps">
|
|
<title>Traps</title>
|
|
|
|
<para>The trap handlers are introduced in
|
|
<filename>arch/i386/kernel/traps.c</filename> and most of these
|
|
handlers live in <filename>arch/i386/kernel/entry.S</filename>,
|
|
where handling of the traps happens.</para>
|
|
</sect4>
|
|
|
|
<sect4 xml:id="linux-exits">
|
|
<title>Exits</title>
|
|
|
|
<para>Return from the syscall is managed by <literal>syscall_exit</literal>,
which checks for the process having unfinished work, then checks
whether we used user-supplied selectors. If so, stack
fixing is applied. Finally the registers are restored from the
stack and the process returns to userspace.</para>
|
|
</sect4>
|
|
|
|
<sect4 xml:id="linux-unix-primitives">
|
|
<title>&unix; primitives</title>
|
|
|
|
<para>In the 2.6 version, the &linux; operating system redefined some
of the traditional &unix; primitives, notably PID, TID and thread.
PID is defined not to be unique for every process, so for some
processes (threads) &man.getpid.2; returns the same value. Unique
identification of a process is provided by TID. This is because
<firstterm>NPTL</firstterm> (New &posix; Thread Library) defines
threads to be normal processes (so-called 1:1 threading). Spawning
a new process in &linux; 2.6 happens using the
<literal>clone</literal> syscall (fork variants are reimplemented using
it). This clone syscall defines a set of flags that affect the
behaviour of the cloning process regarding thread implementation.
The semantics are a bit fuzzy, as there is no single flag telling the
syscall to create a thread.</para>
|
|
|
|
<para>Implemented clone flags are:</para>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para><literal>CLONE_VM</literal> - processes share their memory
|
|
space</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><literal>CLONE_FS</literal> - share umask, cwd and
|
|
namespace</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><literal>CLONE_FILES</literal> - share open
|
|
files</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><literal>CLONE_SIGHAND</literal> - share signal handlers
|
|
and blocked signals</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><literal>CLONE_PARENT</literal> - share parent</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><literal>CLONE_THREAD</literal> - be thread (further
|
|
explanation below)</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><literal>CLONE_NEWNS</literal> - new namespace</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><literal>CLONE_SYSVSEM</literal> - share SysV undo
|
|
structures</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><literal>CLONE_SETTLS</literal> - setup TLS at supplied
|
|
address</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><literal>CLONE_PARENT_SETTID</literal> - set TID in the
|
|
parent</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><literal>CLONE_CHILD_CLEARTID</literal> - clear TID in the
|
|
child</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><literal>CLONE_CHILD_SETTID</literal> - set TID in the
|
|
child</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para><literal>CLONE_PARENT</literal> sets the real parent to the
|
|
parent of the caller. This is useful for threads because if thread
|
|
A creates thread B we want thread B to be parented to the parent of
|
|
the whole thread group. <literal>CLONE_THREAD</literal> does
|
|
exactly the same thing as <literal>CLONE_PARENT</literal>,
|
|
<literal>CLONE_VM</literal> and <literal>CLONE_SIGHAND</literal>,
|
|
rewrites PID to be the same as PID of the caller, sets exit signal
|
|
to be none and enters the thread group.
|
|
<literal>CLONE_SETTLS</literal> sets up GDT entries for TLS
|
|
handling. The <literal>CLONE_*_*TID</literal> set of flags
|
|
sets/clears user supplied address to TID or 0.</para>
|
|
|
|
<para>As you can see, <literal>CLONE_THREAD</literal> does most
of the work and does not seem to fit the scheme very well. The
original intention is unclear (even to the authors, according to
comments in the code), but I think originally there was one
threading flag, which was then parcelled among many other flags,
but this separation was never fully finished. It is also unclear
what this partition is good for, as glibc does not use it, so only
hand-written use of clone permits a programmer to access these
features.</para>
|
|
|
|
<para>For non-threaded programs the PID and TID are the same. For
threaded programs the first thread's PID and TID are the same, and
every created thread shares the same PID and gets assigned a
unique TID (because <literal>CLONE_THREAD</literal> is passed in).
The parent is also shared by all processes forming this threaded
program.</para>
|
|
|
|
<para>The code that implements &man.pthread.create.3; in NPTL defines
|
|
the clone flags like this:</para>
|
|
|
|
<programlisting>int clone_flags = (CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGNAL
                   | CLONE_SETTLS | CLONE_PARENT_SETTID
                   | CLONE_CHILD_CLEARTID | CLONE_SYSVSEM
#if __ASSUME_NO_CLONE_DETACHED == 0
                   | CLONE_DETACHED
#endif
                   | 0);</programlisting>
|
|
|
|
<para><literal>CLONE_SIGNAL</literal> is defined as follows:</para>
|
|
|
|
<programlisting>#define CLONE_SIGNAL (CLONE_SIGHAND | CLONE_THREAD)</programlisting>
|
|
|
|
<para>The last 0 means no signal is sent when any of the threads
exits.</para>
|
|
</sect4>
|
|
</sect3>
|
|
</sect2>
|
|
|
|
<sect2 xml:id="what-is-emu">
|
|
<title>What is emulation</title>
|
|
|
|
<para>According to a dictionary definition, emulation is the ability of
|
|
a program or device to imitate another program or device. This is
|
|
achieved by providing the same reaction to a given stimulus as the
|
|
emulated object. In practice, the software world mostly sees three
|
|
types of emulation - a program used to emulate a machine (QEMU, various
|
|
game console emulators etc.), software emulation of a hardware facility
|
|
(OpenGL emulators, floating point units emulation etc.) and operating
|
|
system emulation (either in kernel of the operating system or as a
|
|
userspace program).</para>
|
|
|
|
<para>Emulation is usually used where using the original
component is not feasible or not possible at all. For example, someone
might want to use a program developed for an operating
system other than the one they use. Then emulation comes in handy. Sometimes
there is no other way but to use emulation, e.g. when the hardware
device you try to use does not exist (yet or anymore).
This happens often when porting an operating
system to a new (non-existent) platform. Sometimes it is just
cheaper to emulate.</para>
|
|
|
|
<para>From an implementation point of view, there are two main
approaches to emulation. You can emulate
the whole thing: accepting possible inputs of the original object,
maintaining inner state and emitting correct output based on the state
and/or input. This kind of emulation does not require any special
conditions and can basically be implemented anywhere for any
device/program. The drawback is that implementing such emulation is
quite difficult, time-consuming and error-prone. In some cases we can
use a simpler approach. Imagine you want to emulate a printer that
prints from left to right on a printer that prints from right to left.
It is obvious that there is no need for a complex emulation layer;
simply reversing the printed text is sufficient. Sometimes the
emulating environment is very similar to the emulated one, so just a
thin layer of translation is necessary to provide fully working
emulation. As you can see, this is much less demanding to implement,
and so less time-consuming and error-prone than the previous approach. But
the necessary condition is that the two environments must be similar
enough. A third approach combines the two previous ones. Most of the
time the objects do not provide the same capabilities, so when
emulating the more powerful one on the less powerful one we have to emulate
the missing features with the full emulation described above.</para>
|
|
|
|
<para>This master's thesis deals with emulation of &unix; on &unix;,
which is exactly the case where a thin layer of translation is
sufficient to provide full emulation.  The &unix; API consists of a set
of syscalls, which are usually self-contained and do not affect any
global kernel state.</para>
|
|
|
|
<para>There are a few syscalls that affect inner state but this can be
|
|
dealt with by providing some structures that maintain the extra
|
|
state.</para>
|
|
|
|
<para>No emulation is perfect and emulations tend to lack some parts,
but this usually does not cause any serious drawbacks.  Imagine a game
console emulator that emulates everything but music output.  The games
are still playable and one can use the emulator.  It might not be as
comfortable as the original game console, but it is an acceptable
compromise between price and comfort.</para>
|
|
|
|
<para>The same goes for the &unix; API.  Most programs can live with a
very limited set of working syscalls.  Those syscalls tend to be the
oldest ones (&man.read.2;/&man.write.2;, the &man.fork.2; family,
&man.signal.3; handling, &man.exit.3;, the &man.socket.2; API), hence
they are easy to emulate because their semantics are shared among all
&unix;es that exist today.</para>
|
|
</sect2>
|
|
</sect1>
|
|
|
|
<sect1 xml:id="freebsd-emulation">
|
|
<title>Emulation</title>
|
|
|
|
<sect2>
|
|
<title>How emulation works in &os;</title>
|
|
|
|
<para>As stated earlier, &os; supports running binaries from several
|
|
other &unix;es. This works because &os; has an abstraction called the
|
|
execution class loader. This wedges into the &man.execve.2; syscall,
|
|
so when &man.execve.2; is about to execute a binary it examines its
|
|
type.</para>
|
|
|
|
<para>There are basically two types of binaries in &os;: shell-like
text scripts, which are identified by <literal>#!</literal> as their
first two characters, and normal (typically
<firstterm>ELF</firstterm>) binaries, which are a representation of a
compiled executable object.  The vast majority (one could say all) of
the binaries in &os; are of type ELF.  ELF files contain a header,
which specifies the OS ABI for this ELF file.  By reading this
information, the operating system can accurately determine what type
of binary the given file is.</para>
|
|
|
|
<para>Every OS ABI must be registered in the &os; kernel.  This applies
to the &os; native OS ABI as well.  So when &man.execve.2; executes a
binary, it iterates through the list of registered ABIs, and when it
finds the right one it starts to use the information contained in the
OS ABI description (its syscall table, <literal>errno</literal>
translation table, etc.).  So every time the process calls a syscall,
it uses its own set of syscalls instead of some global one.  This
provides a very elegant and easy way of supporting execution of
various binary formats.</para>
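<para>The lookup described above can be sketched in user-space C.  This
is only an illustration under simplifying assumptions: the names
<literal>struct abi_desc</literal>, <function>abi_register</function>,
and <function>abi_lookup</function> are hypothetical and are not the
kernel's real interface, and the match is reduced to comparing the
OS ABI byte of the ELF identification array against each registered
entry.</para>

<programlisting><![CDATA[#include <stddef.h>

#define EI_OSABI         7   /* index of the OS ABI byte in e_ident */
#define ELFOSABI_LINUX   3
#define ELFOSABI_FREEBSD 9
#define MAX_ABIS         8

/* Hypothetical, heavily reduced ABI description. */
struct abi_desc {
    unsigned char osabi;     /* value expected in e_ident[EI_OSABI] */
    const char *name;        /* human readable name */
};

static const struct abi_desc *abi_list[MAX_ABIS];
static size_t abi_count;

/* Register an ABI; returns 0 on success, -1 if the list is full. */
int
abi_register(const struct abi_desc *abi)
{
    if (abi_count == MAX_ABIS)
        return (-1);
    abi_list[abi_count++] = abi;
    return (0);
}

/* Walk the registered ABIs and return the one matching the header. */
const struct abi_desc *
abi_lookup(const unsigned char *e_ident)
{
    size_t i;

    for (i = 0; i < abi_count; i++) {
        if (abi_list[i]->osabi == e_ident[EI_OSABI])
            return (abi_list[i]);
    }
    return (NULL);
}]]></programlisting>

<para>In this sketch, a binary whose header matches no registered entry
simply yields <literal>NULL</literal>; how the real kernel treats such
binaries is outside the scope of the example.</para>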
|
|
|
|
<para>The nature of emulation of different OSes (and also some other
subsystems) led developers to invent an event handler mechanism.  There
are various places in the kernel where a list of event handlers is
called.  Every subsystem can register an event handler, and the
handlers are called accordingly.  For example, when a process exits, a
handler is called that cleans up whatever the subsystem needs to be
cleaned.</para>
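<para>A minimal user-space sketch of such a mechanism is given here.
All names in it (<function>exit_handler_register</function>,
<function>exit_handlers_invoke</function>,
<function>emul_exit_cleanup</function>) are hypothetical and chosen for
illustration only; the kernel's real facility is more general and is
not reproduced here.</para>

<programlisting><![CDATA[#include <stddef.h>

#define MAX_HANDLERS 4

typedef void (*exit_handler_t)(int pid, void *arg);

struct handler_entry {
    exit_handler_t fn;   /* callback to run */
    void *arg;           /* opaque argument passed back to it */
};

static struct handler_entry exit_handlers[MAX_HANDLERS];
static size_t exit_handler_count;

/* Register a callback for process exit; -1 if the list is full. */
int
exit_handler_register(exit_handler_t fn, void *arg)
{
    if (exit_handler_count == MAX_HANDLERS)
        return (-1);
    exit_handlers[exit_handler_count].fn = fn;
    exit_handlers[exit_handler_count].arg = arg;
    exit_handler_count++;
    return (0);
}

/* Called at the "process exits" point: run handlers in order. */
void
exit_handlers_invoke(int pid)
{
    size_t i;

    for (i = 0; i < exit_handler_count; i++)
        exit_handlers[i].fn(pid, exit_handlers[i].arg);
}

/* Example subscriber: an emulation layer noting which pid it cleaned. */
void
emul_exit_cleanup(int pid, void *arg)
{
    int *last_cleaned = arg;

    *last_cleaned = pid;
}]]></programlisting>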
|
|
|
|
<para>Those simple facilities provide basically everything that is needed
|
|
for the emulation infrastructure and in fact these are basically the
|
|
only things necessary to implement the &linux; emulation layer.</para>
|
|
</sect2>
|
|
|
|
<sect2 xml:id="freebsd-common-primitives">
|
|
<title>Common primitives in the &os; kernel</title>
|
|
|
|
<para>Emulation layers need some support from the operating system. I am
|
|
going to describe some of the supported primitives in the &os;
|
|
operating system.</para>
|
|
|
|
<sect3 xml:id="freebsd-locking-primitives">
|
|
<title>Locking primitives</title>
|
|
|
|
<para>Contributed by: &a.attilio.email;</para>
|
|
|
|
<para>The &os; synchronization primitive set is based on the idea of
supplying a rather large number of different primitives, so that the
most suitable one can be used in each particular situation.</para>
|
|
|
|
<para>From a high-level point of view, you can consider three kinds of
synchronization primitives in the &os; kernel:</para>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>atomic operations and memory barriers</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>locks</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>scheduling barriers</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para>Below are descriptions of the three families.  For every lock,
you should check the linked manpage (where possible) for a more
detailed explanation.</para>
|
|
|
|
<sect4 xml:id="freebsd-atomic-op">
|
|
<title>Atomic operations and memory barriers</title>
|
|
|
|
<para>Atomic operations are implemented through a set of functions
performing simple arithmetic on memory operands in an atomic way with
respect to external events (interrupts, preemption, etc.).  Atomic
operations can guarantee atomicity only on small data types (on the
order of the size of the architecture's <literal>long</literal> C data
type), so they should rarely be used directly in end-level code,
except for very simple operations (like setting a flag in a bitmap,
for example).  In fact, it is rather easy and common to write incorrect
code based on atomic operations alone (usually referred to as
lock-less code).  The &os; kernel offers a way to perform atomic
operations in conjunction with a memory barrier.  Memory barriers
guarantee that an atomic operation happens in some specified order
with respect to other memory accesses.  For example, if we need an
atomic operation to happen only after all other pending writes (in
terms of the activity of instruction reordering buffers) are
completed, we need to explicitly use a memory barrier in conjunction
with this atomic operation.  This is why memory barriers play a key
role in building higher-level locks (such as refcounts, mutexes,
etc.).  For a detailed explanation of atomic operations, please refer
to &man.atomic.9;.  It is worth noting, however, that atomic operations
(and memory barriers as well) should ideally only be used for building
front-end locks (such as mutexes).</para>
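<para>The interplay of an atomic flag update and a memory barrier can
be illustrated with a small user-space sketch.  It uses the C11
<filename>stdatomic.h</filename> interface rather than the kernel's
&man.atomic.9; functions, and the names <function>publish</function>,
<function>consume</function>, and <literal>FLAG_READY</literal> are
made up for the example.  The writer sets a flag with release
semantics after storing the data; the reader tests the flag with
acquire semantics before reading the data.</para>

<programlisting><![CDATA[#include <stdatomic.h>
#include <stdbool.h>

#define FLAG_READY 0x1u

static int payload;          /* ordinary, non-atomic data */
static atomic_uint flags;    /* small bitmap of flags */

/*
 * Writer: store the data first, then set the flag.  The release
 * ordering keeps the store to payload from being reordered after
 * the flag update.
 */
void
publish(int value)
{
    payload = value;
    atomic_fetch_or_explicit(&flags, FLAG_READY, memory_order_release);
}

/*
 * Reader: test the flag with acquire ordering.  If the flag is set,
 * the payload stored before the release is visible.
 */
bool
consume(int *out)
{
    unsigned int f;

    f = atomic_load_explicit(&flags, memory_order_acquire);
    if ((f & FLAG_READY) == 0)
        return (false);
    *out = payload;
    return (true);
}]]></programlisting>

<para>Without the release and acquire orderings, a reader on another
CPU could observe the flag before the data, which is exactly the kind
of subtle error that makes lock-less code hard to get right.</para>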
|
|
</sect4>
|
|
|
|
<sect4 xml:id="freebsd-refcounts">
|
|
<title>Refcounts</title>
|
|
|
|
<para>Refcounts are interfaces for handling reference counters.  They
are implemented through atomic operations and are intended to be used
only in cases where the reference counter is the only thing to be
protected, so that even something like a spin mutex would be
excessive.  Using the refcount interface for structures where a mutex
is already used is often wrong, since the reference counter should
probably be handled in some already protected path.  A manpage
discussing refcount does not currently exist; check
<filename>sys/refcount.h</filename> for an overview of the existing
API.</para>
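<para>The idea of a reference counter built only from atomic
operations can be sketched in user-space C as follows.  This is not
the code from <filename>sys/refcount.h</filename>;
<literal>struct object</literal> and the <function>obj_*</function>
functions are hypothetical stand-ins written with C11 atomics.</para>

<programlisting><![CDATA[#include <stdatomic.h>
#include <stdbool.h>

struct object {
    atomic_uint refs;    /* the only state that needs protection */
};

/* A new object starts with one reference held by its creator. */
void
obj_init(struct object *o)
{
    atomic_init(&o->refs, 1);
}

/* Take an additional reference. */
void
obj_acquire(struct object *o)
{
    atomic_fetch_add_explicit(&o->refs, 1, memory_order_relaxed);
}

/*
 * Drop a reference.  Returns true when the caller dropped the last
 * one and is therefore responsible for destroying the object.
 */
bool
obj_release(struct object *o)
{
    unsigned int old;

    old = atomic_fetch_sub_explicit(&o->refs, 1, memory_order_acq_rel);
    return (old == 1);
}]]></programlisting>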
|
|
</sect4>
|
|
|
|
<sect4 xml:id="freebsd-locks">
|
|
<title>Locks</title>
|
|
|
|
<para>The &os; kernel has a large number of lock classes.  Every lock
is defined by some peculiar properties, but probably the most
important is the event linked to contending threads (in other words,
the behaviour of threads unable to acquire the lock).  &os;'s locking
scheme presents three different behaviours for contenders:</para>
|
|
|
|
<orderedlist>
|
|
<listitem>
|
|
<para>spinning</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>blocking</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>sleeping</para>
|
|
</listitem>
|
|
</orderedlist>
|
|
|
|
<note>
|
|
<para>The numbering is not arbitrary; it gives the level of each kind
of lock.</para>
|
|
</note>
|
|
</sect4>
|
|
|
|
<sect4 xml:id="freebsd-spinlocks">
|
|
<title>Spinning locks</title>
|
|
|
|
<para>Spin locks let waiters spin until they can acquire the lock.  An
important issue to deal with is a thread contending on a spin lock
that is not descheduled.  Since the &os; kernel is preemptive, this
exposes spin locks to the risk of deadlocks, which can be solved only
by disabling interrupts while the locks are held.  For this and other
reasons (like the lack of priority propagation support, poor load
balancing between CPUs, etc.), spin locks are intended to protect very
small paths of code, or ideally not to be used at all unless
explicitly required (explained later).</para>
|
|
</sect4>
|
|
|
|
<sect4 xml:id="freebsd-blocking">
|
|
<title>Blocking</title>
|
|
|
|
<para>Block locks let waiters be descheduled and blocked until the
lock owner drops the lock and wakes up one or more contenders.  In
order to avoid starvation issues, blocking locks do priority
propagation from the waiters to the owner.  Block locks must be
implemented through the turnstile interface and are intended to be the
most used kind of lock in the kernel, unless particular conditions
apply.</para>
|
|
</sect4>
|
|
|
|
<sect4 xml:id="freebsd-sleeping">
|
|
<title>Sleeping</title>
|
|
|
|
<para>Sleep locks let waiters be descheduled and fall asleep until the
lock holder drops the lock and wakes up one or more waiters.  Since
sleep locks are intended to protect large paths of code and to cater
for asynchronous events, they do not do any form of priority
propagation.  They must be implemented through the &man.sleepqueue.9;
interface.</para>
|
|
|
|
<para>The order used to acquire locks is very important, not only
because of the possibility of deadlocks due to lock order reversals,
but also because lock acquisition should follow specific rules linked
to the nature of the locks.  Looking at the numbered list of contender
behaviours above, the practical rule is that a thread holding a lock
of level n (where the level is the number listed next to the kind of
lock) is not allowed to acquire a lock of a higher level, since this
would break the semantics specified for a path.  For example, if a
thread holds a block lock (level 2), it is allowed to acquire a spin
lock (level 1) but not a sleep lock (level 3), since block locks are
intended to protect smaller paths than sleep locks (these rules do not
apply to atomic operations or scheduling barriers, however).</para>
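<para>The level rule can be written down as a one-line check.  The
sketch below is purely illustrative: <literal>enum lock_level</literal>
and <function>lock_order_ok</function> are hypothetical names, and
treating a lock of the same level as acceptable is an assumption of
the sketch (the usual lock order rules still apply in that
case).</para>

<programlisting><![CDATA[#include <stdbool.h>

/* Levels follow the numbering of the contender behaviours. */
enum lock_level {
    LOCK_SPIN = 1,
    LOCK_BLOCK = 2,
    LOCK_SLEEP = 3
};

/*
 * A thread holding a lock of level "held" must not acquire a lock
 * of a higher level.
 */
bool
lock_order_ok(enum lock_level held, enum lock_level wanted)
{
    return (wanted <= held);
}]]></programlisting>

<para>With this check, holding a block lock and taking a spin lock is
accepted, while holding a block lock and taking a sleep lock is
rejected, matching the example in the text.</para>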
|
|
|
|
<para>This is a list of locks with their respective behaviours:</para>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>spin mutex - spinning - &man.mutex.9;</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>sleep mutex - blocking - &man.mutex.9;</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>pool mutex - blocking - &man.mtx.pool.9;</para>
|
|
</listitem>
|
|
<listitem>
|
|
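<para>sleep family - sleeping - &man.sleep.9;
(<function>pause</function>, <function>tsleep</function>,
<function>msleep</function>, <function>msleep_spin</function>,
<function>msleep_rw</function>, <function>msleep_sx</function>)</para>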
|
|
</listitem>
|
|
<listitem>
|
|
<para>condvar - sleeping - &man.condvar.9;</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>rwlock - blocking - &man.rwlock.9;</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>sxlock - sleeping - &man.sx.9;</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>lockmgr - sleeping - &man.lockmgr.9;</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>semaphores - sleeping - &man.sema.9;</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para>Among these locks only mutexes, sxlocks, rwlocks and lockmgrs
|
|
are intended to handle recursion, but currently recursion is only
|
|
supported by mutexes and lockmgrs.</para>
|
|
</sect4>
|
|
|
|
<sect4 xml:id="freebsd-scheduling">
|
|
<title>Scheduling barriers</title>
|
|
|
|
<para>Scheduling barriers are intended to be used to drive the
scheduling of threads.  They consist mainly of three different
stubs:</para>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>critical sections (and preemption)</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>sched_bind</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>sched_pin</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para>Generally, these should be used only in a particular context,
and even though they can often replace locks, they should be avoided
because they prevent even simple problems from being diagnosed with
locking debugging tools (such as &man.witness.4;).</para>
|
|
</sect4>
|
|
|
|
<sect4 xml:id="freebsd-critical">
|
|
<title>Critical sections</title>
|
|
|
|
<para>The &os; kernel has been made preemptive basically to deal with
interrupt threads.  In fact, in order to avoid high interrupt latency,
time-sharing priority threads can be preempted by interrupt threads
(this way, interrupt threads do not need to wait to be scheduled as
the normal path would require).  Preemption, however, introduces new
race points that need to be handled as well.  Often, the simplest way
to deal with preemption is to completely disable it.  A critical
section defines a piece of code (delimited by the pair of functions
&man.critical.enter.9; and &man.critical.exit.9;) where preemption is
guaranteed not to happen (until the protected code is fully executed).
This can often replace a lock effectively, but it should be used
carefully in order not to lose the whole advantage that preemption
brings.</para>
|
|
</sect4>
|
|
|
|
<sect4 xml:id="freebsd-schedpin">
|
|
<title>sched_pin/sched_unpin</title>
|
|
|
|
<para>Another way to deal with preemption is the
<function>sched_pin()</function> interface.  If a piece of code is
enclosed in the <function>sched_pin()</function> and
<function>sched_unpin()</function> pair of functions, it is guaranteed
that the respective thread, even if it can be preempted, will always
be executed on the same CPU.  Pinning is very effective in the
particular case when we have to access per-CPU data and we assume
other threads will not change that data.  Under that assumption, a
critical section would be too strong a condition for our code.</para>
|
|
</sect4>
|
|
|
|
<sect4 xml:id="freebsd-schedbind">
|
|
<title>sched_bind/sched_unbind</title>
|
|
|
|
<para><function>sched_bind</function> is an API used to bind a thread
to a particular CPU for as long as it executes the code, until a
<function>sched_unbind</function> function call unbinds it.  This
feature has a key role in situations where you cannot trust the
current state of the CPUs (for example, at very early stages of boot),
as you want to prevent your thread from migrating to inactive CPUs.
Since <function>sched_bind</function> and
<function>sched_unbind</function> manipulate internal scheduler
structures, they need to be enclosed in a
<function>sched_lock</function> acquisition and release when
used.</para>
|
|
</sect4>
|
|
</sect3>
|
|
|
|
<sect3 xml:id="freebsd-proc">
|
|
<title>Proc structure</title>
|
|
|
|
<para>Emulation layers sometimes require some additional per-process
data.  A layer can manage separate structures (a list, a tree, etc.)
containing this data for every process, but this tends to be slow and
memory consuming.  To solve this problem, the &os;
<literal>proc</literal> structure contains
<literal>p_emuldata</literal>, which is a void pointer to some
emulation layer specific data.  This <literal>proc</literal> entry is
protected by the proc mutex.</para>
|
|
|
|
<para>The &os; <literal>proc</literal> structure contains a
<literal>p_sysent</literal> entry that identifies which ABI this
process is running.  In fact, it is a pointer to the
<literal>sysentvec</literal> described above.  So by comparing this
pointer to the address where the <literal>sysentvec</literal>
structure for the given ABI is stored, we can effectively determine
whether the process belongs to our emulation layer.  The code
typically looks like:</para>
|
|
|
|
<programlisting>if (__predict_true(p->p_sysent != &elf_linux_sysvec))
	return;</programlisting>
|
|
|
|
<para>As you can see, we use the <literal>__predict_true</literal>
modifier to collapse the most common case (a &os; process) to a simple
return operation, thus preserving high performance.  This code should
be turned into a macro because currently it is not very flexible,
i.e., we support neither &linux;64 emulation nor A.OUT &linux;
processes on i386.</para>
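<para>One possible shape of such a macro is sketched here in
user-space C.  Everything in it is an assumption made for
illustration: <literal>IS_LINUX_PROC</literal> is a hypothetical name,
the structures are reduced stand-ins for the kernel ones, and the
second vector (<literal>linux_sysvec</literal>, standing for A.OUT
binaries) is only assumed to exist under that name.</para>

<programlisting><![CDATA[#include <stdbool.h>

/* Reduced stand-ins for the kernel structures. */
struct sysentvec {
    const char *sv_name;
};

struct proc {
    struct sysentvec *p_sysent;
};

/* One vector per supported ABI (names assumed for the sketch). */
struct sysentvec elf_linux_sysvec = { "Linux ELF" };
struct sysentvec linux_sysvec = { "Linux a.out" };
struct sysentvec elf_freebsd_sysvec = { "FreeBSD ELF" };

/* True if the process runs under any of the Linux ABIs. */
#define IS_LINUX_PROC(p)                          \
    ((p)->p_sysent == &elf_linux_sysvec ||        \
     (p)->p_sysent == &linux_sysvec)]]></programlisting>

<para>A caller would then write
<literal>if (!IS_LINUX_PROC(p)) return;</literal>, and supporting a
further ABI would only require extending the macro.</para>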
|
|
</sect3>
|
|
|
|
<sect3 xml:id="freebsd-vfs">
|
|
<title>VFS</title>
|
|
|
|
<para>The &os; VFS subsystem is very complex, but the &linux; emulation
layer uses just a small subset via a well-defined API.  It can operate
on either vnodes or file handlers.  A vnode represents a virtual node,
i.e., a representation of a node in VFS.  The other representation is
a file handler, which represents an opened file from the perspective
of a process.  A file handler can represent a socket or an ordinary
file.  A file handler contains a pointer to its vnode.  More than one
file handler can point to the same vnode.</para>
|
|
|
|
<sect4 xml:id="freebsd-namei">
|
|
<title>namei</title>
|
|
|
|
<para>The &man.namei.9; routine is the central entry point to pathname
lookup and translation.  It traverses the path component by component
from the starting point to the end point using a lookup function,
which is internal to VFS.  The &man.namei.9; routine can cope with
symlinks and with absolute and relative paths.  When a path is looked
up using &man.namei.9;, it is entered into the name cache.  This
behaviour can be suppressed.  This routine is used all over the kernel
and its performance is very critical.</para>
|
|
</sect4>
|
|
|
|
<sect4 xml:id="freebsd-vn">
|
|
<title>vn_fullpath</title>
|
|
|
|
<para>The &man.vn.fullpath.9; function makes a best effort to traverse
the VFS name cache and return a path for a given (locked) vnode.  This
process is unreliable but works just fine for the most common cases.
It is unreliable because it relies on the VFS cache (it does not
traverse the on-medium structures), it does not work with hardlinks,
etc.  This routine is used in several places in the
Linuxulator.</para>
|
|
</sect4>
|
|
|
|
<sect4 xml:id="freebsd-vnode">
|
|
<title>Vnode operations</title>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para><function>fgetvp</function> - given a thread and a file
|
|
descriptor number it returns the associated vnode</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>&man.vn.lock.9; - locks a vnode</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><function>vn_unlock</function> - unlocks a vnode</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>&man.VOP.READDIR.9; - reads a directory referenced by
|
|
a vnode</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>&man.VOP.GETATTR.9; - gets attributes of a file or a
|
|
directory referenced by a vnode</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>&man.VOP.LOOKUP.9; - looks up a path to a given
|
|
directory</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>&man.VOP.OPEN.9; - opens a file referenced by a
|
|
vnode</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>&man.VOP.CLOSE.9; - closes a file referenced by a
|
|
vnode</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>&man.vput.9; - decrements the use count for a vnode and
|
|
unlocks it</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>&man.vrele.9; - decrements the use count for a vnode</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>&man.vref.9; - increments the use count for a vnode</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</sect4>
|
|
|
|
<sect4 xml:id="freebsd-file-handler">
|
|
<title>File handler operations</title>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para><function>fget</function> - given a thread and a file
|
|
descriptor number it returns associated file handler and
|
|
references it</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><function>fdrop</function> - drops a reference to a file
|
|
handler</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><function>fhold</function> - references a file
|
|
handler</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</sect4>
|
|
</sect3>
|
|
</sect2>
|
|
</sect1>
|
|
|
|
<sect1 xml:id="md">
|
|
<title>&linux; emulation layer -MD part</title>
|
|
|
|
<para>This section deals with implementation of &linux; emulation layer in
|
|
&os; operating system. It first describes the machine dependent part
|
|
talking about how and where interaction between userland and kernel is
|
|
implemented. It talks about syscalls, signals, ptrace, traps, stack
|
|
fixup. This part discusses i386 but it is written generally so other
|
|
architectures should not differ very much. The next part is the machine
|
|
independent part of the Linuxulator. This section only covers i386 and ELF
|
|
handling. A.OUT is obsolete and untested.</para>
|
|
|
|
<sect2 xml:id="syscall-handling">
|
|
<title>Syscall handling</title>
|
|
|
|
<para>Syscall handling is mostly written in
<filename>linux_sysvec.c</filename>, which covers most of the routines
pointed to by the <literal>sysentvec</literal> structure.  When a
&linux; process running on &os; issues a syscall, the general syscall
routine calls the <function>linux_prepsyscall</function> routine for
the &linux; ABI.</para>
|
|
|
|
<sect3 xml:id="linux-prepsyscall">
|
|
<title>&linux; prepsyscall</title>
|
|
|
|
<para>&linux; passes arguments to syscalls via registers (that is why
it is limited to 6 parameters on i386) while &os; uses the stack.  The
<function>linux_prepsyscall</function> routine must copy parameters
from the registers to the stack.  The order of the registers is:
<varname>%ebx</varname>, <varname>%ecx</varname>,
<varname>%edx</varname>, <varname>%esi</varname>,
<varname>%edi</varname>, <varname>%ebp</varname>.  The catch is that
this is true for only <emphasis>most</emphasis> of the syscalls.  Some
(most notably <function>clone</function>) use a different order, but
luckily this is easy to fix by inserting a dummy parameter in the
<function>linux_clone</function> prototype.</para>
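<para>The register-to-argument copy can be sketched in user-space C.
The structure <literal>struct regs</literal> and the function
<function>prep_args</function> are hypothetical stand-ins, not the
kernel's trap frame or the real routine; taking the syscall number
from <varname>%eax</varname> follows the usual &linux; i386 convention
and is an assumption that goes beyond the text above.</para>

<programlisting><![CDATA[#include <stdint.h>

#define LINUX_MAXARGS 6

/* Minimal stand-in for the saved user registers on i386. */
struct regs {
    uint32_t eax, ebx, ecx, edx, esi, edi, ebp;
};

/*
 * Copy the syscall arguments from the registers into an argument
 * array, in the order %ebx, %ecx, %edx, %esi, %edi, %ebp, and
 * return the syscall number taken from %eax.
 */
uint32_t
prep_args(const struct regs *r, uint32_t args[LINUX_MAXARGS])
{
    args[0] = r->ebx;
    args[1] = r->ecx;
    args[2] = r->edx;
    args[3] = r->esi;
    args[4] = r->edi;
    args[5] = r->ebp;
    return (r->eax);
}]]></programlisting>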
|
|
</sect3>
|
|
|
|
<sect3 xml:id="syscall-writing">
|
|
<title>Syscall writing</title>
|
|
|
|
<para>Every syscall implemented in the Linuxulator must have its
|
|
prototype with various flags in <filename>syscalls.master</filename>.
|
|
The form of the file is:</para>
|
|
|
|
<programlisting>...
2	AUE_FORK	STD	{ int linux_fork(void); }
...
6	AUE_CLOSE	NOPROTO	{ int close(int fd); }
...</programlisting>
|
|
|
|
<para>The first column represents the syscall number.  The second
column is for auditing support.  The third column represents the
syscall type.  It is one of <literal>STD</literal>,
<literal>OBSOL</literal>, <literal>NOPROTO</literal>, or
<literal>UNIMPL</literal>.  <literal>STD</literal> is a standard
syscall with full prototype and implementation.
<literal>OBSOL</literal> is obsolete and defines just the prototype.
<literal>NOPROTO</literal> means that the syscall is implemented
elsewhere, so no ABI prefix is prepended, etc.
<literal>UNIMPL</literal> means that the syscall will be substituted
with the <function>nosys</function> syscall (a syscall that just
prints a message about the syscall not being implemented and returns
<literal>ENOSYS</literal>).</para>
|
|
|
|
<para>From <filename>syscalls.master</filename> a script generates
|
|
three files: <filename>linux_syscall.h</filename>,
|
|
<filename>linux_proto.h</filename> and
|
|
<filename>linux_sysent.c</filename>. The
|
|
<filename>linux_syscall.h</filename> contains definitions of syscall
|
|
names and their numerical value, e.g.:</para>
|
|
|
|
<programlisting>...
#define LINUX_SYS_linux_fork 2
...
#define LINUX_SYS_close 6
...</programlisting>
|
|
|
|
<para>The <filename>linux_proto.h</filename> contains structure
|
|
definitions of arguments to every syscall, e.g.:</para>
|
|
|
|
<programlisting>struct linux_fork_args {
	register_t dummy;
};</programlisting>
|
|
|
|
<para>And finally, <filename>linux_sysent.c</filename> contains
|
|
structure describing the system entry table, used to actually
|
|
dispatch a syscall, e.g.:</para>
|
|
|
|
<programlisting>{ 0, (sy_call_t *)linux_fork, AUE_FORK, NULL, 0, 0 }, /* 2 = linux_fork */
{ AS(close_args), (sy_call_t *)close, AUE_CLOSE, NULL, 0, 0 }, /* 6 = close */</programlisting>
|
|
|
|
<para>As you can see, <function>linux_fork</function> is implemented
in the Linuxulator itself, so its definition is of the
<literal>STD</literal> type, and it has no arguments, which is shown
by the dummy argument structure.  On the other hand,
<function>close</function> is just an alias for the real &os;
&man.close.2;, so it has no <literal>linux</literal>-prefixed
arguments structure associated with it, and in the system entry table
it is not prefixed with <literal>linux</literal> because it calls the
real &man.close.2; in the kernel.</para>
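<para>How such a system entry table is used for dispatch can be shown
with a reduced user-space sketch.  The structure below keeps only two
of the fields of the real table entries, the handler bodies are empty
stubs, <function>dispatch</function> and <function>sys_close</function>
are hypothetical names, and every slot other than 2 and 6 is filled
with <function>nosys</function> purely as a placeholder.</para>

<programlisting><![CDATA[#include <errno.h>
#include <stddef.h>

typedef int sy_call_t(void *args);

/* Reduced table entry: argument count and handler only. */
struct sysent {
    int sy_narg;
    sy_call_t *sy_call;
};

/* Stub handlers standing in for the real implementations. */
static int nosys(void *args) { (void)args; return (ENOSYS); }
static int linux_fork(void *args) { (void)args; return (0); }
static int sys_close(void *args) { (void)args; return (0); }

static struct sysent linux_sysent[] = {
    { 0, nosys },        /* 0 = placeholder */
    { 0, nosys },        /* 1 = placeholder */
    { 0, linux_fork },   /* 2 = linux_fork */
    { 0, nosys },        /* 3 = placeholder */
    { 0, nosys },        /* 4 = placeholder */
    { 0, nosys },        /* 5 = placeholder */
    { 1, sys_close },    /* 6 = close (native implementation) */
};

/* Look up the handler by syscall number and call it. */
int
dispatch(unsigned int code, void *args)
{
    size_t n = sizeof(linux_sysent) / sizeof(linux_sysent[0]);

    if (code >= n)
        return (nosys(args));
    return (linux_sysent[code].sy_call(args));
}]]></programlisting>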
|
|
</sect3>
|
|
|
|
<sect3 xml:id="dummy-syscalls">
|
|
<title>Dummy syscalls</title>
|
|
|
|
<para>The &linux; emulation layer is not complete, as some syscalls are
not implemented properly and some are not implemented at all.  The
emulation layer employs a facility to mark unimplemented syscalls with
the <literal>DUMMY</literal> macro.  These dummy definitions reside in
<filename>linux_dummy.c</filename> in the form of
<literal>DUMMY(syscall);</literal>, which is then translated to
various syscall auxiliary files, and the implementation consists of
printing a message saying that this syscall is not implemented.  The
<literal>UNIMPL</literal> prototype is not used because we want to be
able to identify the name of the syscall that was called, in order to
know which syscalls are more important to implement.</para>
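<para>A macro of this kind can be approximated in user-space C as
shown here.  This is not the macro from
<filename>linux_dummy.c</filename>: the generated functions take no
arguments, the message goes to <literal>stderr</literal>, and the
syscall names <literal>example_one</literal> and
<literal>example_two</literal> are invented for the sketch.  The
trailing <literal>struct dummy_hack</literal> declaration only exists
so that each use can end with a semicolon.</para>

<programlisting><![CDATA[#include <errno.h>
#include <stdio.h>

/*
 * Generate a stub named linux_<s> that reports the syscall by name
 * and fails with ENOSYS.
 */
#define DUMMY(s)                                                    \
int                                                                 \
linux_ ## s(void)                                                   \
{                                                                   \
    fprintf(stderr, "linux: syscall %s not implemented\n", #s);     \
    return (ENOSYS);                                                \
}                                                                   \
struct dummy_hack

DUMMY(example_one);
DUMMY(example_two);]]></programlisting>

<para>Because the name is stringified inside the stub, the message
identifies which syscall a program tried to use, which a single shared
<function>nosys</function> handler cannot do.</para>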
|
|
</sect3>
|
|
</sect2>
|
|
|
|
<sect2 xml:id="signal-handling">
|
|
<title>Signal handling</title>
|
|
|
|
<para>Signal handling is done generally in the &os; kernel for all
binary compatibility layers, with a call to a compat-dependent layer.
The &linux; compatibility layer defines the
<function>linux_sendsig</function> routine for this purpose.</para>
|
|
|
|
<sect3 xml:id="linux-sendsig">
|
|
<title>&linux; sendsig</title>
|
|
|
|
<para>This routine first checks whether the signal has been installed
with <literal>SA_SIGINFO</literal>, in which case it calls the
<function>linux_rt_sendsig</function> routine instead.  Otherwise, it
allocates (or reuses an already existing) signal handler context, then
builds a list of arguments for the signal handler.  It translates the
signal number based on the signal translation table, assigns a
handler, and translates the sigset.  Then it saves the context for the
<function>sigreturn</function> routine (various registers, the
translated trap number, and the signal mask).  Finally, it copies out
the signal context to userspace and prepares the context for the
actual signal handler to run.</para>
|
|
</sect3>
|
|
|
|
<sect3 xml:id="linux-rt-sendsig">
|
|
<title>linux_rt_sendsig</title>
|
|
|
|
<para>This routine is similar to <function>linux_sendsig</function>;
only the signal context preparation is different.  It adds
<literal>siginfo</literal>, <literal>ucontext</literal>, and some
&posix; parts.  It might be worth considering whether those two
functions could be merged, with the benefit of less code duplication
and possibly even faster execution.</para>
|
|
</sect3>
|
|
|
|
<sect3 xml:id="linux-sigreturn">
|
|
<title>linux_sigreturn</title>
|
|
|
|
<para>This syscall is used for return from the signal handler. It does
|
|
some security checks and restores the original process context. It
|
|
also unmasks the signal in process signal mask.</para>
|
|
</sect3>
|
|
</sect2>
|
|
|
|
<sect2 xml:id="ptrace">
|
|
<title>Ptrace</title>
|
|
|
|
<para>Many &unix; derivatives implement the &man.ptrace.2; syscall in
order to allow various tracing and debugging features.  This facility
enables the tracing process to obtain various information about the
traced process, like register dumps, any memory from the process
address space, etc., and also to trace the process, e.g., by stepping
over an instruction or between system entries (syscalls and traps).
&man.ptrace.2; also lets you set various information in the traced
process (registers, etc.).  &man.ptrace.2; is a &unix;-wide standard
implemented in most &unix;es around the world.</para>
|
|
|
|
<para>&linux; emulation in &os; implements the &man.ptrace.2; facility
in <filename>linux_ptrace.c</filename>.  This file contains the
routines for converting registers between &linux; and &os; and the
actual &man.ptrace.2; syscall emulation.  The syscall is a long switch
block that implements the &os; counterpart of every &man.ptrace.2;
command.  The &man.ptrace.2; commands are mostly equal between &linux;
and &os;, so usually just a small modification is needed.  For
example, <literal>PT_GETREGS</literal> in &linux; operates on direct
data while &os; uses a pointer to the data, so after performing a
(native) &man.ptrace.2; syscall, a copyout must be done to preserve
&linux; semantics.</para>
|
|
|
|
<para>The &man.ptrace.2; implementation in Linuxulator has some known
|
|
weaknesses. There have been panics seen when using
|
|
<command>strace</command> (which is a &man.ptrace.2; consumer) in the
|
|
Linuxulator environment. Also <literal>PT_SYSCALL</literal> is not
|
|
implemented.</para>
|
|
</sect2>
|
|
|
|
<sect2 xml:id="traps">
|
|
<title>Traps</title>
|
|
|
|
<para>Whenever a &linux; process running in the emulation layer traps,
the trap itself is handled transparently, with the only exception
being the trap translation.  &linux; and &os; differ in opinion on what
a trap is, so this is dealt with here.  The code is actually very
short:</para>
|
|
|
|
<programlisting>static int
translate_traps(int signal, int trap_code)
{

	if (signal != SIGBUS)
		return signal;

	switch (trap_code) {

	case T_PROTFLT:
	case T_TSSFLT:
	case T_DOUBLEFLT:
	case T_PAGEFLT:
		return SIGSEGV;

	default:
		return signal;
	}
}</programlisting>
|
|
</sect2>
|
|
|
|
<sect2 xml:id="stack-fixup">
|
|
<title>Stack fixup</title>
|
|
|
|
<para>The RTLD run-time link-editor expects so-called AUX tags on the
stack during an <function>execve</function>, so a fixup must be done
to ensure this.  Of course, every RTLD system is different, so the
emulation layer must provide its own stack fixup routine to do this.
So does the Linuxulator.  The <function>elf_linux_fixup</function>
routine simply copies out AUX tags to the stack and adjusts the stack
of the user space process to point right after those tags.  This way
RTLD works as expected.</para>
|
|
</sect2>
|
|
|
|
<sect2 xml:id="aout-support">
|
|
<title>A.OUT support</title>
|
|
|
|
<para>The &linux; emulation layer on i386 also supports &linux; A.OUT
binaries.  Pretty much everything described in the previous sections
must be implemented for A.OUT support (besides trap translation and
signal sending).  The support for A.OUT binaries is no longer
maintained; in particular, the 2.6 emulation does not work with it.
This does not cause any problem, as the linux-base in ports probably
does not support A.OUT binaries at all.  This support will probably be
removed in the future.  Most of the code necessary for loading &linux;
A.OUT binaries is in the <filename>imgact_linux.c</filename>
file.</para>
|
|
</sect2>
|
|
</sect1>
|
|
|
|
<sect1 xml:id="mi">
|
|
<title>&linux; emulation layer -MI part</title>
|
|
|
|
<para>This section covers the machine-independent part of the
Linuxulator: the emulation infrastructure needed for &linux; 2.6
emulation, the thread local storage (TLS) implementation (on i386),
and futexes. It closes with a brief look at some syscalls.</para>
|
|
|
|
<sect2 xml:id="nptl-desc">
|
|
<title>Description of NPTL</title>
|
|
|
|
<para>One of the major areas of progress in the development of &linux; 2.6
was threading. Prior to 2.6, &linux; threading support was
implemented in the <application>linuxthreads</application> library,
which was a partial implementation of &posix; threading. Each thread
was implemented as a separate process, created using the
<function>clone</function> syscall so that the processes shared the
address space (and other things). The main weaknesses of this
approach were that every thread had a different PID, signal handling
was broken (from the pthreads perspective), etc. Performance was not
very good either (use of <literal>SIGUSR</literal> signals for
thread synchronization, kernel resource consumption, etc.). To
overcome these problems, a new threading system was developed and
named NPTL.</para>
|
|
|
|
<para>The NPTL library focused on two things, but a third came along with
them and is usually considered part of NPTL. The two things were the
embedding of threads into the process structure and futexes. The
third was TLS, which is not directly required by NPTL, but the whole
NPTL userland library depends on it. These improvements yielded much
better performance and standards conformance. NPTL is the standard
threading library on &linux; systems these days.</para>
|
|
|
|
<para>The &os; Linuxulator implementation approaches NPTL in three main
areas: TLS, futexes, and PID mangling, which is meant to simulate
&linux; threads. The following sections describe each of these
areas.</para>
|
|
</sect2>
|
|
|
|
<sect2 xml:id="linux26-emu">
|
|
<title>&linux; 2.6 emulation infrastructure</title>
|
|
|
|
<para>These sections deal with the way &linux; threads are managed and
|
|
how we simulate that in &os;.</para>
|
|
|
|
<sect3 xml:id="linux26-runtime">
|
|
<title>Runtime determining of 2.6 emulation</title>
|
|
|
|
<para>The &linux; emulation layer in &os; supports runtime setting of
the emulated version. This is done via &man.sysctl.8;, namely
<literal>compat.linux.osrelease</literal>, which is set to 2.4.2 by
default (as of April 2007). For all &linux; versions before 2.6, the
setting only determines what &man.uname.1; outputs. It is different
with 2.6 emulation, where setting this &man.sysctl.8; affects the
runtime behaviour of the emulation layer. When set to 2.6.x, it sets
the value of <literal>linux_use_linux26</literal>, while setting it to
anything else keeps that variable unset. This variable (plus
per-prison variables of the very same kind) determines whether the 2.6
infrastructure (mainly PID mangling) is used in the code or not.
The version setting is system-wide and affects all &linux;
processes. The &man.sysctl.8; should not be changed while any
&linux; binary is running, as switching the infrastructure underneath
a running process can break it.</para>
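<para>The decision described above can be sketched in a few lines of C.
This is only an illustration: the helper name
<function>osrelease_selects_26</function> and the exact matching rule are
assumptions, and the kernel may well parse the string differently.</para>

```c
#include <string.h>

/*
 * Hypothetical helper, not the kernel's code: report whether an
 * osrelease string such as "2.6.16" would select the 2.6
 * infrastructure.  Assumption: "2.6" itself or anything starting
 * with "2.6." counts as a 2.6 release.
 */
static int
osrelease_selects_26(const char *osrelease)
{

	return (strcmp(osrelease, "2.6") == 0 ||
	    strncmp(osrelease, "2.6.", 4) == 0);
}
```

<para>On a running system the version would be switched with, for
example, <command>sysctl compat.linux.osrelease=2.6.16</command>, before
any &linux; binary is started.</para>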
|
|
</sect3>
|
|
|
|
<sect3 xml:id="linux-proc-thread">
|
|
<title>&linux; processes and thread identifiers</title>
|
|
|
|
<para>The semantics of &linux; threading are a little confusing and
use entirely different nomenclature from &os;. A process in
&linux; consists of a <literal>struct task</literal> embedding two
identifier fields: PID and TGID. The PID is <emphasis>not</emphasis>
a process ID; it is a thread ID. The TGID identifies a thread
group, in other words a process. For a single-threaded process the
PID equals the TGID.</para>
|
|
|
|
<para>A thread in NPTL is just an ordinary process that happens to
have a TGID not equal to its PID and a group leader not equal to
itself (and shared VM etc., of course). Everything else happens in
the same way as for an ordinary process. There is no separation of
shared state into some external structure as there is in &os;. This
creates some duplication of information and possible data
inconsistency. The &linux; kernel seems to use task -> group
information in some places and task information elsewhere; this is
not very consistent and looks error-prone.</para>
|
|
|
|
<para>Every NPTL thread is created by a call to the
|
|
<function>clone</function> syscall with a specific set of flags
|
|
(more in the next subsection). The NPTL implements strict
|
|
1:1 threading.</para>
|
|
|
|
<para>In &os; we emulate NPTL threads with ordinary &os; processes that
share VM space, etc., and the PID gymnastics are just mimicked in the
emulation-specific structure attached to the process. That
structure looks like:</para>
|
|
|
|
<programlisting>struct linux_emuldata {
	pid_t pid;

	int *child_set_tid;	/* in clone(): Child's TID to set on clone */
	int *child_clear_tid;	/* in clone(): Child's TID to clear on exit */

	struct linux_emuldata_shared *shared;

	int pdeath_signal;	/* parent death signal */

	LIST_ENTRY(linux_emuldata) threads;	/* list of linux threads */
};</programlisting>
|
|
|
|
<para>The <varname>pid</varname> field identifies the &os; process to
which this structure is attached. The
<varname>child_set_tid</varname> and
<varname>child_clear_tid</varname> fields hold the addresses to which
the TID is copied out when a process is created and when it exits. The
<varname>shared</varname> pointer points to a structure shared
among threads. The <varname>pdeath_signal</varname> variable
identifies the parent death signal, and the
<varname>threads</varname> entry links this structure into the list
of threads. The <literal>linux_emuldata_shared</literal>
structure looks like:</para>
|
|
|
|
<programlisting>struct linux_emuldata_shared {

	int refs;

	pid_t group_pid;

	LIST_HEAD(, linux_emuldata) threads;	/* head of list of linux threads */
};</programlisting>
|
|
|
|
<para>The <varname>refs</varname> is a reference counter being used
|
|
to determine when we can free the structure to avoid memory leaks.
|
|
The <varname>group_pid</varname> is to identify PID ( = TGID) of the
|
|
whole process ( = thread group). The <varname>threads</varname>
|
|
pointer is the head of the list of threads in the process.</para>
|
|
|
|
<para>The <literal>linux_emuldata</literal> structure can be obtained
|
|
from the process using <function>em_find</function>. The prototype
|
|
of the function is:</para>
|
|
|
|
<programlisting>struct linux_emuldata *em_find(struct proc *, int locked);</programlisting>
|
|
|
|
<para>Here, <varname>proc</varname> is the process we want the emuldata
structure from, and the <varname>locked</varname> parameter determines
whether we want to lock or not. The accepted values are
<literal>EMUL_DOLOCK</literal> and <literal>EMUL_DONTLOCK</literal>.
More about locking later.</para>
|
|
</sect3>
|
|
|
|
<sect3 xml:id="pid-mangling">
|
|
<title>PID mangling</title>
|
|
|
|
<para>Because &os; and &linux; have different views of what a process
ID and a thread ID are, we have to translate between the two
somehow. We do it by PID mangling. This means that we fake what a
PID (= TGID) and a TID (= PID) is between the kernel and userland.
The rule of thumb is that in the kernel (in the Linuxulator)
<literal>PID = PID</literal> and
<literal>TGID = shared -> group_pid</literal>, while to userland we
present <literal>PID = shared -> group_pid</literal> and
<literal>TID = proc -> p_pid</literal>.
The <varname>pid</varname> member of the
<literal>linux_emuldata</literal> structure is a &os; PID.</para>
|
|
|
|
<para>The above mainly affects the <function>getpid</function>,
<function>getppid</function>, and <function>gettid</function>
syscalls, where we use the PID or the TGID as appropriate. When
copying out TIDs for <varname>child_clear_tid</varname> and
<varname>child_set_tid</varname> we copy out the &os; PID.</para>
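<para>The mapping presented to userland can be sketched as follows. The
structures are reduced stand-ins for the ones shown earlier, and all
<literal>sketch_</literal> and <literal>example_</literal> names are
invented for this illustration; this is not the Linuxulator's actual
code.</para>

```c
#include <sys/types.h>

/* Simplified stand-ins for the structures described in the text. */
struct sketch_shared {
	pid_t group_pid;		/* PID (= TGID) of the thread group */
};

struct sketch_proc {
	pid_t p_pid;			/* FreeBSD PID of this process */
	struct sketch_shared *shared;	/* state shared by the group */
};

/* What Linux userland sees as getpid(): the thread group ID. */
static pid_t
sketch_getpid(const struct sketch_proc *p)
{

	return (p->shared->group_pid);
}

/* What Linux userland sees as gettid(): the FreeBSD PID. */
static pid_t
sketch_gettid(const struct sketch_proc *p)
{

	return (p->p_pid);
}

/* Example: a group led by FreeBSD PID 100 with one extra thread. */
static struct sketch_shared example_group = { 100 };
static struct sketch_proc example_leader = { 100, &example_group };
static struct sketch_proc example_thread = { 101, &example_group };
```

<para>In this example both processes report the same PID to userland,
while the second one reports a TID that differs from it, which is how
an NPTL thread looks from the &linux; side.</para>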
|
|
</sect3>
|
|
|
|
<sect3 xml:id="clone-syscall">
|
|
<title>Clone syscall</title>
|
|
|
|
<para>The <function>clone</function> syscall is the way threads are
|
|
created in &linux;. The syscall prototype looks like this:</para>
|
|
|
|
<programlisting>int linux_clone(l_int flags, void *stack, void *parent_tidptr, int dummy,
|
|
void * child_tidptr);</programlisting>
|
|
|
|
<para>The <varname>flags</varname> parameter tells the syscall how
exactly the process should be cloned. As described above, &linux;
can create processes sharing various things independently; for
example, two processes can share file descriptors but not VM, etc.
The lowest byte of the <varname>flags</varname> parameter is the exit
signal of the newly created process. The <varname>stack</varname>
parameter, if non-<literal>NULL</literal>, tells where the thread
stack is; if it is <literal>NULL</literal>, we are supposed to
copy-on-write the calling process stack (i.e. do what the normal
&man.fork.2; routine does). The <varname>parent_tidptr</varname>
parameter is used as an address for copying out the process PID (i.e.
thread ID) once the process is sufficiently instantiated but is
not runnable yet. The <varname>dummy</varname> parameter is here
because of the very strange calling convention of this syscall on
i386: it uses the registers directly and does not let the compiler
do it, which results in the need for a dummy parameter. The
<varname>child_tidptr</varname> parameter is used as an address
for copying out the PID once the process has finished forking and when
the process exits.</para>
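<para>Splitting the <varname>flags</varname> word into its two parts can
be sketched like this. The numeric values are the ones &linux; defines
for <function>clone</function>; the <literal>SKETCH_</literal> names and
the two helper functions are invented for this illustration.</para>

```c
/* Values as defined by Linux for clone(2); the names are local. */
#define SKETCH_CSIGNAL		0x000000ffUL	/* exit signal mask */
#define SKETCH_CLONE_VM		0x00000100UL	/* share address space */
#define SKETCH_CLONE_FILES	0x00000400UL	/* share the fd table */

/* The exit signal lives in the lowest byte of the flags word. */
static int
clone_exit_signal(unsigned long flags)
{

	return ((int)(flags & SKETCH_CSIGNAL));
}

/* Everything above the lowest byte describes what is shared. */
static unsigned long
clone_sharing_flags(unsigned long flags)
{

	return (flags & ~SKETCH_CSIGNAL);
}
```

<para>For instance, a caller that passes
<literal>CLONE_VM | 17</literal> asks for a shared address space and for
signal 17 (<literal>SIGCHLD</literal> on &linux; i386) to be delivered
when the child exits.</para>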
|
|
|
|
<para>The syscall itself proceeds by setting the corresponding &os;
flags depending on the flags passed in. For example,
<literal>CLONE_VM</literal> maps to RFMEM (sharing of VM), etc.
The only nit here is <literal>CLONE_FS</literal> and
<literal>CLONE_FILES</literal>, because &os; does not allow setting
these separately, so we fake it by not setting RFFDG (copying of the
fd table and other fs information) if either of them is set. This
does not cause any problems, because those flags are always set
together. After setting the flags, the process is forked using
the internal <function>fork1</function> routine, which is
instructed not to put the new process on a run queue, i.e. not to
make it runnable. After the forking is done, we possibly reparent the
newly created process to emulate <literal>CLONE_PARENT</literal>
semantics. The next part is creating the emulation data. Threads in
&linux; do not signal their parents, so we set the exit signal to 0 to
disable this. After that, <varname>child_set_tid</varname> and
<varname>child_clear_tid</varname> are set, enabling the
functionality later in the code. At this point we copy out the PID
to the address specified by <varname>parent_tidptr</varname>. The
process stack is set by simply rewriting the thread frame
<varname>%esp</varname> register (<varname>%rsp</varname> on amd64).
The next part is setting up TLS for the newly created process. After
this, &man.vfork.2; semantics might be emulated, and finally the newly
created process is put on a run queue and its PID is returned to the
parent process as the <function>clone</function> return
value.</para>
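<para>The flag translation at the start of this sequence can be sketched
as below. The <literal>SK_CLONE_*</literal> values are the &linux; ones,
but the <literal>SK_RF*</literal> constants are illustrative stand-ins
and not the real &os; values. The handling of signal handler sharing and
of the stopped child is an assumption added to make the sketch
complete.</para>

```c
/* Linux clone(2) flag values; the names are local to this sketch. */
#define SK_CLONE_VM		0x00000100UL
#define SK_CLONE_FS		0x00000200UL
#define SK_CLONE_FILES		0x00000400UL
#define SK_CLONE_SIGHAND	0x00000800UL

/* Stand-ins for the FreeBSD fork flags; NOT the real values. */
#define SK_RFFDG	0x01	/* copy the fd table */
#define SK_RFPROC	0x02	/* create a new process */
#define SK_RFMEM	0x04	/* share the address space */
#define SK_RFSIGSHARE	0x08	/* share signal handlers (assumption) */
#define SK_RFSTOPPED	0x10	/* keep the child off the run queue */

/* Translate Linux clone flags into the flags handed to the fork code. */
static int
clone_to_rfork(unsigned long flags)
{
	int ff = SK_RFPROC | SK_RFSTOPPED;

	if (flags & SK_CLONE_VM)
		ff |= SK_RFMEM;
	if (flags & SK_CLONE_SIGHAND)
		ff |= SK_RFSIGSHARE;
	/* FS and FILES cannot be split: copy only if neither is set. */
	if ((flags & (SK_CLONE_FS | SK_CLONE_FILES)) == 0)
		ff |= SK_RFFDG;
	return (ff);
}
```

<para>A plain fork-like call (no sharing flags) therefore ends up
copying the descriptor table, while a thread-like call that sets both
<literal>CLONE_FS</literal> and <literal>CLONE_FILES</literal> shares
it.</para>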
|
|
|
|
<para>The <function>clone</function> syscall can be, and in fact is,
used for emulating the classic &man.fork.2; and &man.vfork.2; syscalls.
Newer glibc, when running on a 2.6 kernel, uses
<function>clone</function> to implement &man.fork.2; and
&man.vfork.2;.</para>
|
|
</sect3>
|
|
|
|
<sect3 xml:id="locking">
|
|
<title>Locking</title>
|
|
|
|
<para>The locking is implemented per-subsystem because we do not
expect a lot of contention on these locks. There are two locks:
<literal>emul_lock</literal>, used to protect manipulation of
<literal>linux_emuldata</literal>, and
<literal>emul_shared_lock</literal>, used to protect manipulation of
<literal>linux_emuldata_shared</literal>. The
<literal>emul_lock</literal> is a nonsleepable blocking mutex, while
<literal>emul_shared_lock</literal> is a sleepable blocking
<literal>sx_lock</literal>. Because of the per-subsystem locking we
can coalesce some locks, and that is why
<function>em_find</function> offers non-locking access.</para>
|
|
</sect3>
|
|
</sect2>
|
|
|
|
<sect2 xml:id="tls">
|
|
<title>TLS</title>
|
|
|
|
<para>This section deals with TLS also known as thread local
|
|
storage.</para>
|
|
|
|
<sect3 xml:id="trheading-intro">
|
|
<title>Introduction to threading</title>
|
|
|
|
<para>Threads in computer science are entities within a process that
can be scheduled independently from each other. The threads in a
process share process-wide data (file descriptors, etc.) but also
have their own stack for their own data. Sometimes there is a need
for process-wide data specific to a given thread; imagine the name of
the thread in execution or something like that. The traditional
&unix; threading API, <application>pthreads</application>, provides
a way to do it via &man.pthread.key.create.3;,
&man.pthread.setspecific.3;, and &man.pthread.getspecific.3;, where a
thread can create a key to the thread local data and use
&man.pthread.setspecific.3; or &man.pthread.getspecific.3; to
manipulate those data. You can easily see that this is not the most
comfortable way this could be accomplished. So various producers of
C/C++ compilers introduced a better way. They defined a new modifier
keyword (<literal>__thread</literal> in GCC) that specifies that a
variable is thread specific. A new method of accessing such
variables was developed as well (at least on i386).</para>

<para>The <application>pthreads</application> method tends
to be implemented in userspace as a trivial lookup table, and the
performance of such a solution is not very good. The new method
therefore uses (on i386) segment registers to address a segment where
the TLS area is stored, so accessing a thread variable is just a
matter of prefixing the address with the segment register and thus
addressing via it. The segment registers used are usually
<varname>%gs</varname> and <varname>%fs</varname>, acting as segment
selectors. Every thread has its own area where the thread local data
are stored, and the segment must be loaded on every context switch.
This method is very fast and is used almost exclusively in the whole
i386 &unix; world. Both &os; and &linux; implement this approach and
it yields very good results. The only drawback is the need to reload
the segment on every context switch, which can slow down context
switches. &os; tries to avoid this overhead by using only 1 segment
descriptor for this, while &linux; uses 3. Interestingly, almost
nothing uses more than 1 descriptor (only
<application>Wine</application> seems to use 2), so &linux; pays this
unnecessary price for context switches.</para>
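<para>From the programmer's point of view the compiler-supported method
is just a storage class. The following minimal sketch assumes a compiler
that supports the <literal>__thread</literal> keyword (GCC and Clang do);
the variable and function names are invented for the example.</para>

```c
/*
 * Every thread gets its own copy of this counter.  No key creation
 * and no lookup call is needed; the compiler emits a segment-relative
 * access instead.
 */
static __thread int tls_counter;

/* Increment the calling thread's private counter and return it. */
static int
tls_counter_bump(void)
{

	return (++tls_counter);
}
```

<para>A second thread calling <function>tls_counter_bump</function>
would start counting from its own zero-initialized copy, independently
of the first thread.</para>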
|
|
</sect3>
|
|
|
|
<sect3 xml:id="i386-segs">
|
|
<title>Segments on i386</title>
|
|
|
|
<para>The i386 architecture implements so-called segments. A
segment is a description of an area of memory: the base address
(bottom) of the memory area, the end of it (ceiling), type,
protection, etc. The memory described by a segment can be accessed
using segment selector registers (<varname>%cs</varname>,
<varname>%ds</varname>, <varname>%ss</varname>,
<varname>%es</varname>, <varname>%fs</varname>,
<varname>%gs</varname>). For example, let us suppose we have a
segment whose base address is 0x1234 (and some length) and this
code:</para>
|
|
|
|
<programlisting>mov %edx,%gs:0x10</programlisting>
|
|
|
|
<para>This will store the content of the <varname>%edx</varname>
register at memory location 0x1244 (the segment base 0x1234 plus the
offset 0x10). Some segment registers have a special use; for example,
<varname>%cs</varname> is used for the code segment and
<varname>%ss</varname> is used for the stack segment, but
<varname>%fs</varname> and <varname>%gs</varname> are generally
unused. Segments are stored either in the global GDT table or in a
local LDT table. The LDT is accessed via an entry in the GDT. The
LDT can store more types of segments and can be per process.
Both tables define up to 8191 entries.</para>
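<para>The address arithmetic and the layout of a segment selector can be
sketched in C. This is a simplified model (byte-sized accesses only,
limit given as the highest valid offset); the function names are
invented for the illustration.</para>

```c
#include <stdint.h>

/*
 * Compute the linear address of a one-byte access at offset off in a
 * segment.  limit is the highest valid offset.  Returns -1 when the
 * access falls outside the segment (the CPU would raise a fault).
 */
static int64_t
segment_linear(uint32_t base, uint32_t limit, uint32_t off)
{

	if (off > limit)
		return (-1);
	return ((int64_t)base + off);
}

/*
 * Build a segment selector: a 13-bit descriptor index, one bit that
 * selects the table (0 = GDT, 1 = LDT) and a 2-bit privilege level.
 */
static uint16_t
segment_selector(unsigned index, unsigned ldt, unsigned rpl)
{

	return ((uint16_t)((index << 3) | ((ldt & 1) << 2) | (rpl & 3)));
}
```

<para>With a base of 0x1234 the offset 0x10 maps to 0x1244, matching
the example above. The 13-bit index field of the selector is what bounds
the number of descriptors a table can hold.</para>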
|
|
</sect3>
|
|
|
|
<sect3 xml:id="linux-i386">
|
|
<title>Implementation on &linux; i386</title>
|
|
|
|
<para>There are two main ways of setting up TLS in &linux;. It can be
set when cloning a process using the <function>clone</function>
syscall, or a process can call <function>set_thread_area</function>.
When a process passes the <literal>CLONE_SETTLS</literal> flag to
<function>clone</function>, the kernel expects the memory pointed to
by the <varname>%esi</varname> register to contain a &linux; user
space representation of a segment, which gets translated to the
machine representation of a segment and loaded into a GDT slot. The
GDT slot can be specified with a number, or -1 can be used, meaning
that the system itself should choose the first free slot. In
practice, the vast majority of programs use only one TLS entry and
do not care about the number of the entry. We exploit this in the
emulation and in fact depend on it.</para>
|
|
</sect3>
|
|
|
|
<sect3 xml:id="tls-emu">
|
|
<title>Emulation of &linux; TLS</title>
|
|
|
|
<sect4 xml:id="tls-i386">
|
|
<title>i386</title>
|
|
|
|
<para>Loading of TLS for the current thread happens by calling
<function>set_thread_area</function>, while loading TLS for a
second process in <function>clone</function> is done in a
separate block in <function>clone</function>. Those two pieces of
code are very similar. The only difference is the actual loading of
the GDT segment, which happens on the next context switch for the
newly created process, while <function>set_thread_area</function>
must load it directly.</para>

<para>The code basically does the following. It copies
the &linux; form of the segment descriptor from userland. The code
checks the number of the descriptor, but because this differs
between &os; and &linux; we fake it a little. We only support
indexes of 6, 3, and -1. The 6 is the genuine &linux; number, 3 is
the genuine &os; one, and -1 means autoselection. Then we set the
descriptor number to the constant 3 and copy this out to
userspace. We rely on the userspace process using the number from
the descriptor, but this works most of the time (we have never seen
a case where this did not work) as the userspace process typically
passes in 1. Then we convert the descriptor from the &linux; form
to a machine dependent form (i.e. operating system independent
form) and copy this to the &os; defined segment descriptor.
Finally we can load it. We assign the descriptor to the thread's PCB
(process control block) and load the <varname>%gs</varname>
segment using <function>load_gs</function>. This loading must be
done in a critical section so that nothing can interrupt us.
The <literal>CLONE_SETTLS</literal> case works exactly like this,
except that the loading using <function>load_gs</function> is not
performed. The segment used for this (segment number 3) is
shared for this use between &os; processes and &linux; processes,
so the &linux; emulation layer does not add any overhead over
plain &os;.</para>
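<para>The descriptor number check can be sketched as follows. The
constant and function names are invented, and the error value (0, the
index of the unusable null descriptor) is a choice made for this sketch
only.</para>

```c
#define SK_LINUX_TLS_ENTRY	6	/* number Linux programs may pass */
#define SK_FREEBSD_TLS_ENTRY	3	/* slot FreeBSD actually uses */

/*
 * Validate the entry number requested by userland and return the
 * number that is written back to the user space descriptor.
 * Returns 0 for a request we do not support.
 */
static int
tls_entry_fixup(int requested)
{

	if (requested != SK_LINUX_TLS_ENTRY &&
	    requested != SK_FREEBSD_TLS_ENTRY && requested != -1)
		return (0);
	/* Whatever was asked for, the single FreeBSD slot is used. */
	return (SK_FREEBSD_TLS_ENTRY);
}
```

<para>Collapsing every accepted request onto one slot is what lets the
emulation share the descriptor with native &os; processes.</para>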
|
|
</sect4>
|
|
|
|
<sect4 xml:id="tls-amd64">
|
|
<title>amd64</title>
|
|
|
|
<para>The amd64 implementation is similar to the i386 one, but there
was initially no 32-bit segment descriptor used for this purpose
(hence not even native 32-bit TLS users worked), so we had to add
such a segment and implement its loading on every context switch
(when a flag signaling use of 32-bit is set). Apart from this, the
TLS loading is exactly the same; only the segment numbers are
different, and the descriptor format and the loading differ
slightly.</para>
|
|
</sect4>
|
|
</sect3>
|
|
</sect2>
|
|
|
|
<sect2 xml:id="futexes">
|
|
<title>Futexes</title>
|
|
|
|
<sect3 xml:id="sync-intro">
|
|
<title>Introduction to synchronization</title>
|
|
|
|
<para>Threads need some kind of synchronization, and &posix; provides
several mechanisms: mutexes for mutual exclusion, read-write locks for
mutual exclusion with a biased ratio of reads and writes, and
condition variables for signaling a status change. It is interesting
to note that the &posix; threading API lacks support for semaphores.
The implementations of these synchronization routines depend heavily
on the type of threading support we have. In the pure 1:M (userspace)
model the implementation can be done solely in userspace and thus be
very fast (the condition variables will probably end up being
implemented using signals, i.e. not fast) and simple. In the 1:1
model, the situation is also quite clear: the threads must be
synchronized using kernel facilities (which is very slow because a
syscall must be performed). The mixed M:N scenario either combines
the first and second approaches or relies solely on the
kernel.</para>

<para>Thread synchronization is a vital part of thread-enabled
programming, and its performance can affect the resulting program a
lot. Recent benchmarks on the &os; operating system showed that an
improved sx_lock implementation yielded a 40% speedup in
<firstterm>ZFS</firstterm> (a heavy sx user). This is in-kernel code,
but it shows clearly how important the performance of synchronization
primitives is.</para>
|
|
|
|
<para>Threaded programs should be written with as little contention on
locks as possible. Otherwise, instead of doing useful work, a
thread just waits on a lock. As a result, most well-written
threaded programs show little lock contention.</para>
|
|
</sect3>
|
|
|
|
<sect3 xml:id="futex-intro">
|
|
<title>Futexes introduction</title>
|
|
|
|
<para>&linux; implements 1:1 threading, i.e. it has to use in-kernel
synchronization primitives. As stated earlier, well-written threaded
programs have little lock contention. So a typical lock/unlock
sequence could be performed as two atomic operations, an increment
and a decrement of the mutex reference counter, which is very fast,
as presented by the following example:</para>
|
|
|
|
<programlisting>pthread_mutex_lock(&amp;mutex);
....
pthread_mutex_unlock(&amp;mutex);</programlisting>
|
|
|
|
<para>1:1 threading forces us to perform two syscalls for those mutex
|
|
calls, which is very slow.</para>
|
|
|
|
<para>The solution &linux; 2.6 implements is called futexes.
Futexes implement the check for contention in userspace and call
kernel primitives only in case of contention. Thus the typical
case takes place without any kernel intervention. This yields a
reasonably fast and flexible implementation of synchronization
primitives.</para>
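<para>The idea of checking for contention in userspace can be sketched
with C11 atomics. This is not how any particular library implements its
mutexes; the structure and function names are invented, and instead of
really entering the kernel the sketch only counts how often the slow
path would have been needed.</para>

```c
#include <stdatomic.h>

struct sketch_lock {
	atomic_int state;	/* 0 = free, 1 = held */
	int slow_paths;		/* times the kernel would be needed */
};

/* Try to take the lock; returns 1 on success, 0 when contended. */
static int
sketch_trylock(struct sketch_lock *l)
{
	int expected = 0;

	/* Fast path: one atomic compare-and-swap, no syscall. */
	if (atomic_compare_exchange_strong(&l->state, &expected, 1))
		return (1);
	/* Contended: a real lock would now sleep with FUTEX_WAIT. */
	l->slow_paths++;
	return (0);
}

/* Release the lock; returns the previous state (1 if it was held). */
static int
sketch_unlock(struct sketch_lock *l)
{

	/* A real lock would issue FUTEX_WAKE if there were waiters. */
	return (atomic_exchange(&l->state, 0));
}

/* Example lock; static storage starts out zeroed, i.e. free. */
static struct sketch_lock example_lock;
```

<para>As long as no second thread wants the lock, locking and unlocking
cost one atomic operation each and <varname>slow_paths</varname> stays
at zero. The plain <varname>slow_paths</varname> counter is not itself
thread-safe; it exists only to make the slow path visible in this
sketch.</para>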
|
|
</sect3>
|
|
|
|
<sect3 xml:id="futex-api">
|
|
<title>Futex API</title>
|
|
|
|
<para>The futex syscall looks like this:</para>
|
|
|
|
<programlisting>int futex(void *uaddr, int op, int val, struct timespec *timeout, void *uaddr2, int val3);</programlisting>
|
|
|
|
<para>In this example <varname>uaddr</varname> is an address of the
|
|
mutex in userspace, <varname>op</varname> is an operation we are
|
|
about to perform and the other parameters have per-operation
|
|
meaning.</para>
|
|
|
|
<para>Futexes implement the following operations:</para>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para><literal>FUTEX_WAIT</literal></para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><literal>FUTEX_WAKE</literal></para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><literal>FUTEX_FD</literal></para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><literal>FUTEX_REQUEUE</literal></para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><literal>FUTEX_CMP_REQUEUE</literal></para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><literal>FUTEX_WAKE_OP</literal></para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<sect4 xml:id="futex-wait">
|
|
<title>FUTEX_WAIT</title>
|
|
|
|
<para>This operation verifies that the value <varname>val</varname>
is stored at address <varname>uaddr</varname>. If not,
<literal>EWOULDBLOCK</literal> is
returned; otherwise the thread is queued on the futex and gets
suspended. If the argument <varname>timeout</varname> is
non-<literal>NULL</literal>, it specifies the maximum time for the
sleep; otherwise the sleep is unbounded.</para>
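<para>The value check at the start of this operation can be sketched as
below. The function name is invented, and the sketch leaves out
everything a real implementation needs around it (locking, the atomic
read of user memory, and the sleep itself).</para>

```c
#include <errno.h>

/*
 * First step of a FUTEX_WAIT-like operation: only go to sleep if the
 * futex word still holds the value the caller expects.  Returns
 * EWOULDBLOCK when the word has changed, 0 when the caller would be
 * queued and suspended.
 */
static int
futex_wait_check(const int *uaddr, int val)
{

	if (*uaddr != val)
		return (EWOULDBLOCK);
	return (0);
}

/* Stand-in for a futex word living in user space. */
static int example_futex_word = 1;
```

<para>This check is what closes the window between a thread deciding in
userspace that it must wait and the kernel actually putting it to
sleep: if another thread changed the word in between, the caller simply
retries instead of sleeping.</para>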
|
|
</sect4>
|
|
|
|
<sect4 xml:id="futex-wake">
|
|
<title>FUTEX_WAKE</title>
|
|
|
|
<para>This operation takes a futex at <varname>uaddr</varname>
and wakes up the first <varname>val</varname> threads queued
on this futex.</para>
|
|
</sect4>
|
|
|
|
<sect4 xml:id="futex-fd">
|
|
<title>FUTEX_FD</title>
|
|
|
|
<para>This operation associates a file descriptor with a given
futex.</para>
|
|
</sect4>
|
|
|
|
<sect4 xml:id="futex-requeue">
|
|
<title>FUTEX_REQUEUE</title>
|
|
|
|
<para>This operation takes <varname>val</varname> threads
|
|
queued on futex at <varname>uaddr</varname>, wakes them up,
|
|
and takes <varname>val2</varname> next threads and requeues them
|
|
on futex at <varname>uaddr2</varname>.</para>
|
|
</sect4>
|
|
|
|
<sect4 xml:id="futex-cmp-requeue">
|
|
<title>FUTEX_CMP_REQUEUE</title>
|
|
|
|
<para>This operation does the same as
<literal>FUTEX_REQUEUE</literal>, but first it checks that
the value stored at <varname>uaddr</varname> equals
<varname>val3</varname>.</para>
|
|
</sect4>
|
|
|
|
<sect4 xml:id="futex-wake-op">
|
|
<title>FUTEX_WAKE_OP</title>
|
|
|
|
<para>This operation performs an atomic operation, encoded in
<varname>val3</varname> (which packs the operation, its argument,
and a comparison), on the value at <varname>uaddr2</varname>. Then
it wakes up <varname>val</varname> threads on the futex at
<varname>uaddr</varname>, and if the comparison encoded in
<varname>val3</varname> holds for the old value at
<varname>uaddr2</varname>, it also wakes up
<varname>val2</varname> threads on the
futex at <varname>uaddr2</varname>.</para>
|
|
|
|
<para>The operations implemented in
|
|
<literal>FUTEX_WAKE_OP</literal>:</para>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para><literal>FUTEX_OP_SET</literal></para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><literal>FUTEX_OP_ADD</literal></para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><literal>FUTEX_OP_OR</literal></para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><literal>FUTEX_OP_ANDN</literal></para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><literal>FUTEX_OP_XOR</literal></para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<note>
|
|
<para>There is no <varname>val2</varname> parameter in the
|
|
futex prototype. The <varname>val2</varname> is taken from the
|
|
<varname>struct timespec *timeout</varname> parameter
|
|
for operations <literal>FUTEX_REQUEUE</literal>,
|
|
<literal>FUTEX_CMP_REQUEUE</literal> and
|
|
<literal>FUTEX_WAKE_OP</literal>.</para>
|
|
</note>
|
|
</sect4>
|
|
</sect3>
|
|
|
|
<sect3 xml:id="futex-emu">
|
|
<title>Futex emulation in &os;</title>
|
|
|
|
<para>The futex emulation in &os; is taken from NetBSD and further
|
|
extended by us. It is placed in <filename>linux_futex.c</filename>
|
|
and <filename>linux_futex.h</filename> files. The
|
|
<literal>futex</literal> structure looks like:</para>
|
|
|
|
<programlisting>struct futex {
	void *f_uaddr;
	int f_refcount;

	LIST_ENTRY(futex) f_list;

	TAILQ_HEAD(lf_waiting_proc, waiting_proc) f_waiting_proc;
};</programlisting>
|
|
|
|
<para>And the structure <literal>waiting_proc</literal> is:</para>
|
|
|
|
<programlisting>struct waiting_proc {

	struct thread *wp_t;

	struct futex *wp_new_futex;

	TAILQ_ENTRY(waiting_proc) wp_list;
};</programlisting>
|
|
|
|
<sect4 xml:id="futex-get">
|
|
<title>futex_get / futex_put</title>
|
|
|
|
<para>A futex is obtained using the <function>futex_get</function>
|
|
function, which searches a linear list of futexes and returns the
|
|
found one or creates a new futex. When releasing a futex from the
|
|
use we call the <function>futex_put</function> function, which
|
|
decreases a reference counter of the futex and if the refcount
|
|
reaches zero it is released.</para>
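<para>The get/put pair can be sketched with a plain singly linked list.
This is a userland illustration with invented <literal>sk_</literal>
names; it uses <function>malloc</function> instead of the kernel
allocator and omits all locking.</para>

```c
#include <stdlib.h>

struct sk_futex {
	void *f_uaddr;			/* user address of the futex */
	int f_refcount;			/* number of current users */
	struct sk_futex *f_next;	/* next futex on the list */
};

static struct sk_futex *sk_futex_list;	/* linear list of futexes */

/* Find the futex for uaddr, or create it; takes one reference. */
static struct sk_futex *
sk_futex_get(void *uaddr)
{
	struct sk_futex *f;

	for (f = sk_futex_list; f != NULL; f = f->f_next) {
		if (f->f_uaddr == uaddr) {
			f->f_refcount++;
			return (f);
		}
	}
	f = malloc(sizeof(*f));
	if (f == NULL)
		return (NULL);
	f->f_uaddr = uaddr;
	f->f_refcount = 1;
	f->f_next = sk_futex_list;
	sk_futex_list = f;
	return (f);
}

/* Drop one reference; returns 1 if the futex was freed, else 0. */
static int
sk_futex_put(struct sk_futex *f)
{
	struct sk_futex **pp;

	if (--f->f_refcount > 0)
		return (0);
	/* Last user is gone: unlink the futex and release it. */
	for (pp = &sk_futex_list; *pp != NULL; pp = &(*pp)->f_next) {
		if (*pp == f) {
			*pp = f->f_next;
			break;
		}
	}
	free(f);
	return (1);
}

/* Stand-in for a futex word living in user space. */
static int sk_example_word;
```

<para>Two lookups of the same address return the same object with a
reference count of two, and only the second
<function>sk_futex_put</function> actually frees it.</para>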
|
|
</sect4>
|
|
|
|
<sect4 xml:id="futex-sleep">
|
|
<title>futex_sleep</title>
|
|
|
|
<para>When a futex queues a thread for sleeping, it creates a
<literal>waiting_proc</literal> structure and puts this structure
on the list inside the futex structure; then it just performs a
&man.tsleep.9; to suspend the thread. The sleep can time out.
After &man.tsleep.9; returns (the thread was woken up or it timed
out), the <literal>waiting_proc</literal> structure is removed
from the list and destroyed. All this is done in the
<function>futex_sleep</function> function. If we were woken up
by <function>futex_wake</function> and
<varname>wp_new_futex</varname> is set, we go back to sleep on that
new futex. This way the actual requeueing is done in this
function.</para>
|
|
</sect4>
|
|
|
|
<sect4 xml:id="futex-wake-2">
|
|
<title>futex_wake</title>
|
|
|
|
<para>Waking up a thread sleeping on a futex is performed in the
<function>futex_wake</function> function. First, this function
mimics the strange &linux; behaviour of waking up N threads
for all operations except the REQUEUE operations, which are
performed on N+1 threads. This usually does not
make any difference, as we are waking up all threads. Next, the
function wakes up n threads in a loop, and after this it checks
whether there is a new futex for requeueing. If so, we requeue up to
n2 threads on the new futex. This cooperates with
<function>futex_sleep</function>.</para>
|
|
</sect4>
|
|
|
|
<sect4 xml:id="futex-wake-op-2">
|
|
<title>futex_wake_op</title>
|
|
|
|
<para>The <literal>FUTEX_WAKE_OP</literal> operation is quite
complicated. First we obtain the two futexes at addresses
<varname>uaddr</varname> and <varname>uaddr2</varname>, then we
perform the atomic operation using <varname>val3</varname> and
<varname>uaddr2</varname>. Then <varname>val</varname> waiters
on the first futex are woken up, and if the atomic operation
condition holds we wake up <varname>val2</varname> (i.e.
<varname>timeout</varname>) waiters on the second futex.</para>
|
|
</sect4>
|
|
|
|
<sect4 xml:id="futex-atomic-op">
|
|
<title>futex atomic operation</title>
|
|
|
|
<para>The atomic operation takes two parameters,
<varname>encoded_op</varname> and <varname>uaddr</varname> (for
<literal>FUTEX_WAKE_OP</literal> the caller passes
<varname>uaddr2</varname> here).
The encoded operation encodes the operation itself,
the comparison, the operation argument, and the comparison argument.
The pseudocode for the operation is:</para>
|
|
|
|
<programlisting>oldval = *uaddr2
|
|
*uaddr2 = oldval OP oparg</programlisting>
|
|
|
|
<para>This is done atomically. First the number
at <varname>uaddr</varname> is copied in and the operation is
performed. The code handles page faults, and if no page fault occurs,
<varname>oldval</varname> is compared to the
<varname>cmparg</varname> argument using the
<varname>cmp</varname> comparator.</para>
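<para>Decoding and applying an encoded operation can be sketched as
follows. The bit layout and the operation and comparison codes follow
the &linux; <literal>FUTEX_OP</literal> encoding; the
<literal>sk_</literal> names are invented. The sketch is
<emphasis>not</emphasis> atomic, works on ordinary memory instead of
user memory, and ignores the flag that turns the operation argument
into a shift count.</para>

```c
#include <stdint.h>

/* Operation and comparison codes as defined by Linux. */
enum { SK_OP_SET, SK_OP_ADD, SK_OP_OR, SK_OP_ANDN, SK_OP_XOR };
enum { SK_CMP_EQ, SK_CMP_NE, SK_CMP_LT, SK_CMP_LE, SK_CMP_GT, SK_CMP_GE };

/* Pack the four fields the way the Linux FUTEX_OP() macro does. */
#define SK_FUTEX_OP(op, oparg, cmp, cmparg)				\
	((((uint32_t)(op) & 0xf) << 28) |				\
	 (((uint32_t)(cmp) & 0xf) << 24) |				\
	 (((uint32_t)(oparg) & 0xfff) << 12) |				\
	 ((uint32_t)(cmparg) & 0xfff))

/* Sign-extend the low 12 bits of v. */
static int
sk_sext12(uint32_t v)
{

	v &= 0xfff;
	return ((v & 0x800) ? (int)v - 0x1000 : (int)v);
}

/*
 * Apply encoded_op to *uaddr.  Returns 1 if the comparison holds for
 * the old value, 0 if it does not, -1 for an unknown code.
 */
static int
sk_futex_atomic_op(uint32_t encoded_op, int *uaddr)
{
	int op = (encoded_op >> 28) & 0x7;
	int cmp = (encoded_op >> 24) & 0xf;
	int oparg = sk_sext12(encoded_op >> 12);
	int cmparg = sk_sext12(encoded_op);
	int oldval = *uaddr;

	switch (op) {
	case SK_OP_SET:  *uaddr = oparg; break;
	case SK_OP_ADD:  *uaddr = oldval + oparg; break;
	case SK_OP_OR:   *uaddr = oldval | oparg; break;
	case SK_OP_ANDN: *uaddr = oldval & ~oparg; break;
	case SK_OP_XOR:  *uaddr = oldval ^ oparg; break;
	default: return (-1);
	}
	switch (cmp) {
	case SK_CMP_EQ: return (oldval == cmparg);
	case SK_CMP_NE: return (oldval != cmparg);
	case SK_CMP_LT: return (oldval < cmparg);
	case SK_CMP_LE: return (oldval <= cmparg);
	case SK_CMP_GT: return (oldval > cmparg);
	case SK_CMP_GE: return (oldval >= cmparg);
	default: return (-1);
	}
}

/* Stand-in for the futex word at uaddr2. */
static int sk_wake_op_word = 5;
```

<para>Note that the comparison is made against the value the word held
<emphasis>before</emphasis> the operation was applied, which is what
lets a caller both modify a word and learn whether waiters on it need
to be woken.</para>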
|
|
</sect4>
|
|
|
|
<sect4 xml:id="futex-locking">
|
|
<title>Futex locking</title>
|
|
|
|
<para>The futex implementation uses two locks: an
<function>sx_lock</function> protecting the futex list and a global
lock (either Giant or another <function>sx_lock</function>). Every
operation is performed locked from the start to the very end.</para>
|
|
</sect4>
|
|
</sect3>
|
|
</sect2>
|
|
|
|
<sect2 xml:id="syscall-impl">
|
|
<title>Various syscalls implementation</title>
|
|
|
|
<para>In this section I am going to describe some smaller syscalls that
are worth mentioning because their implementation is not obvious or
because they are interesting from another point of view.</para>
|
|
|
|
<sect3 xml:id="syscall-at">
|
|
<title>*at family of syscalls</title>
|
|
|
|
<para>During development of the &linux; 2.6.16 kernel, the *at syscalls
were added. Those syscalls (<function>openat</function> for example)
work exactly like their at-less counterparts with the slight
exception of the <varname>dirfd</varname> parameter. This
parameter changes where the file on which the syscall is to be
performed is looked up. When the <varname>filename</varname>
parameter is absolute, <varname>dirfd</varname> is ignored, but when
the path to the file is relative, it comes into play. The
<varname>dirfd</varname> parameter names the directory relative to
which the relative pathname is resolved. It is either a file
descriptor of some directory or
<literal>AT_FDCWD</literal>. So, for example, the
<function>openat</function> syscall can be used like this:</para>
|
|
|
|
<programlisting>file descriptor 123 = /tmp/foo/, current working directory = /tmp/

openat(123, "/tmp/bah", flags, mode)	/* opens /tmp/bah */
openat(123, "bah", flags, mode)		/* opens /tmp/foo/bah */
openat(AT_FDCWD, "bah", flags, mode)	/* opens /tmp/bah */
openat(stdio, "bah", flags, mode)	/* returns error because stdio is not a directory */</programlisting>
|
|
|
|
<para>This infrastructure is necessary to avoid races when opening
files outside the working directory. Imagine that a process
consists of two threads, thread A and thread B. Thread A issues
<literal>open("/tmp/foo/bah", flags, mode)</literal> and, before
the call returns, it gets preempted and thread B runs. Thread B
does not care about the needs of thread A and renames or removes
<filename>/tmp/foo/</filename>. This is a race. To avoid it, we
can open <filename>/tmp/foo</filename> and use it as
<varname>dirfd</varname> for the <function>openat</function>
syscall. This also enables the user to implement per-thread
working directories.</para>
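<para>From user space, the pattern described above can be sketched
as follows. This is a minimal example using the standard
<function>open</function> and <function>openat</function> calls;
the helper name <function>open_in_dir</function> is hypothetical
and error handling is kept to a minimum.</para>

<programlisting><![CDATA[#define _POSIX_C_SOURCE 200809L

#include <sys/types.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: open "name" relative to the directory "dir".
 * Returns a file descriptor, or -1 with errno set.
 */
static int
open_in_dir(const char *dir, const char *name, int flags, mode_t mode)
{
	int dirfd, fd, saved_errno;

	/*
	 * Pin the directory.  The descriptor keeps referring to the same
	 * directory even if another thread renames it afterwards.
	 */
	dirfd = open(dir, O_RDONLY | O_DIRECTORY);
	if (dirfd == -1)
		return (-1);

	/* A relative name is resolved against dirfd, not against the cwd. */
	fd = openat(dirfd, name, flags, mode);

	/* Preserve the errno of openat() across close(). */
	saved_errno = errno;
	close(dirfd);
	errno = saved_errno;
	return (fd);
}]]></programlisting>

<para>A thread that keeps such a directory descriptor open for
longer, instead of closing it right away as this helper does, can
use it as its own private working directory.</para>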
|
|
|
|
<para>The &linux; family of *at syscalls contains:
|
|
<function>linux_openat</function>,
|
|
<function>linux_mkdirat</function>,
|
|
<function>linux_mknodat</function>,
|
|
<function>linux_fchownat</function>,
|
|
<function>linux_futimesat</function>,
|
|
<function>linux_fstatat64</function>,
|
|
<function>linux_unlinkat</function>,
|
|
<function>linux_renameat</function>,
|
|
<function>linux_linkat</function>,
|
|
<function>linux_symlinkat</function>,
|
|
<function>linux_readlinkat</function>,
|
|
<function>linux_fchmodat</function> and
|
|
<function>linux_faccessat</function>. All of these are implemented
using the modified &man.namei.9; routine and a simple
wrapping layer.</para>
|
|
|
|
<sect4 xml:id="implementation">
|
|
<title>Implementation</title>
|
|
|
|
<para>The implementation is done by altering the &man.namei.9;
routine (described above) to take an additional parameter,
<varname>dirfd</varname>, in its <literal>nameidata</literal>
structure. This parameter specifies the starting point of the
pathname lookup, instead of the current working directory being
used every time. The resolution of <varname>dirfd</varname> from
a file descriptor number to a vnode is done in the native *at
syscalls. When <varname>dirfd</varname> is
<literal>AT_FDCWD</literal>, the <varname>dvp</varname> entry in
the <literal>nameidata</literal> structure is
<literal>NULL</literal>. When <varname>dirfd</varname> is a
different number, we obtain the file for this file descriptor,
check whether the file is valid, and, if a vnode is attached to
it, get that vnode. Then we check that this vnode is a directory.
In the actual &man.namei.9; routine we simply substitute the
<varname>dvp</varname> vnode for the <varname>dp</varname>
variable, which determines the starting point. The &man.namei.9;
routine is not called directly but through a chain of functions on
various levels. For example, <function>openat</function> goes
like this:</para>
|
|
|
|
<programlisting>openat() --> kern_openat() --> vn_open() --> namei()</programlisting>
|
|
|
|
<para>For this reason <function>kern_open</function> and
<function>vn_open</function> must be altered to incorporate
the additional <varname>dirfd</varname> parameter. No compat
layer is created for them because there are not many users of
these functions and the users can be easily converted. This
general implementation enables &os; to implement its own *at
syscalls. This is being discussed right now.</para>
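<para>The <varname>dirfd</varname> resolution described in this
section can be modelled in a few lines of C. This is a userland
sketch under stated assumptions, not the kernel code: every type
and function prefixed with <literal>sk_</literal> is hypothetical,
the descriptor table is a plain array, and the error numbers
returned are illustrative choices.</para>

<programlisting><![CDATA[#include <errno.h>
#include <stddef.h>

#define SK_AT_FDCWD	(-100)	/* stand-in for AT_FDCWD */

enum sk_vtype { SK_VREG, SK_VDIR };

struct sk_vnode { enum sk_vtype v_type; };
struct sk_file { struct sk_vnode *f_vnode; };	/* NULL: no vnode attached */
struct sk_fdtable { struct sk_file **files; int nfiles; };

/*
 * Resolve dirfd to the vnode that the lookup should start from.
 * Returns 0 on success and stores the vnode in *dvpp; a NULL vnode
 * means "start at the current working directory".
 */
static int
sk_resolve_dirfd(const struct sk_fdtable *fdt, int dirfd,
    struct sk_vnode **dvpp)
{
	struct sk_file *fp;

	*dvpp = NULL;
	if (dirfd == SK_AT_FDCWD)
		return (0);
	if (dirfd < 0 || dirfd >= fdt->nfiles || fdt->files[dirfd] == NULL)
		return (EBADF);		/* not a valid open file */
	fp = fdt->files[dirfd];
	if (fp->f_vnode == NULL)
		return (EINVAL);	/* file without a vnode */
	if (fp->f_vnode->v_type != SK_VDIR)
		return (ENOTDIR);	/* vnode is not a directory */
	*dvpp = fp->f_vnode;
	return (0);
}

/* Pick the starting point: dvp when set, otherwise the cwd vnode. */
static struct sk_vnode *
sk_start_vnode(struct sk_vnode *dvp, struct sk_vnode *cwd)
{
	return (dvp != NULL ? dvp : cwd);
}]]></programlisting>

<para>Keeping the descriptor-to-vnode step separate from the choice
of the starting vnode mirrors the split in the text: the *at
syscalls do the former, and the lookup routine only sees a vnode
or <literal>NULL</literal>.</para>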
|
|
</sect4>
|
|
</sect3>
|
|
|
|
<sect3 xml:id="ioctl">
|
|
<title>Ioctl</title>
|
|
|
|
<para>The ioctl interface is quite fragile due to its generality.
Devices differ between &linux; and &os;, so some care must be taken
to get ioctl emulation right. The ioctl handling is implemented in
<filename>linux_ioctl.c</filename>, where the
<function>linux_ioctl</function> function is defined. This
function simply iterates over sets of ioctl handlers to find a
handler that implements a given command. The ioctl syscall has
three parameters: the file descriptor, the command, and an
argument. The command is a 16-bit number, which in theory is
divided into the high 8 bits, determining the class of the ioctl
command, and the low 8 bits, which are the actual command within
the given set. The emulation takes advantage of this division.
We implement handlers for each set, like
<function>sound_handler</function> or
<function>disk_handler</function>. Each handler has a maximum
command and a minimum command defined, which are used to determine
which handler is used. There are slight problems with this
approach because &linux; does not use the set division
consistently, so sometimes ioctls for a different set are inside a
set they should not belong to (SCSI generic ioctls inside the
cdrom set, etc.). &os; currently does not implement many &linux;
ioctls (compared to NetBSD, for example), but the plan is to port
those from NetBSD. The trend is to use &linux; ioctls even in the
native &os; drivers because this makes porting applications
easier.</para>
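<para>The range-based handler lookup can be sketched as follows.
This is a simplified userland model: all names prefixed with
<literal>sk_</literal> are hypothetical, the two command ranges are
made up for the example, and returning <literal>EINVAL</literal>
for an unhandled command is an assumption of this sketch.</para>

<programlisting><![CDATA[#include <errno.h>
#include <stddef.h>

typedef int sk_ioctl_fn(int fd, unsigned int cmd, void *arg);

struct sk_ioctl_handler {
	sk_ioctl_fn	*func;	/* handler for this set */
	unsigned int	 low;	/* minimum command handled */
	unsigned int	 high;	/* maximum command handled */
};

/* Split a 16-bit command into its class and its number in the set. */
static unsigned int
sk_cmd_class(unsigned int cmd)
{
	return ((cmd >> 8) & 0xff);
}

static unsigned int
sk_cmd_nr(unsigned int cmd)
{
	return (cmd & 0xff);
}

/* Two toy handlers; each records which set was chosen in *arg. */
static int
sk_sound_handler(int fd, unsigned int cmd, void *arg)
{
	(void)fd; (void)cmd;
	*(int *)arg = 1;
	return (0);
}

static int
sk_disk_handler(int fd, unsigned int cmd, void *arg)
{
	(void)fd; (void)cmd;
	*(int *)arg = 2;
	return (0);
}

/* Made-up ranges: class 0x50 for "sound", class 0x53 for "disk". */
static const struct sk_ioctl_handler sk_handlers[] = {
	{ sk_sound_handler, 0x5000, 0x50ff },
	{ sk_disk_handler,  0x5300, 0x53ff },
};

/* Call the first handler whose [low, high] range contains cmd. */
static int
sk_ioctl_dispatch(const struct sk_ioctl_handler *h, size_t n, int fd,
    unsigned int cmd, void *arg)
{
	size_t i;

	cmd &= 0xffff;
	for (i = 0; i < n; i++)
		if (cmd >= h[i].low && cmd <= h[i].high)
			return (h[i].func(fd, cmd, arg));
	return (EINVAL);	/* no handler implements this command */
}]]></programlisting>

<para>Because the lookup is driven purely by numeric ranges, a
command that &linux; placed in the <quote>wrong</quote> set lands
in the handler for that set, and that handler has to deal with
it.</para>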
|
|
</sect3>
|
|
|
|
<sect3 xml:id="debugging">
|
|
<title>Debugging</title>
|
|
|
|
<para>Every syscall should be debuggable. For this purpose we
introduce a small infrastructure. We have the
<literal>ldebug</literal> facility, which tells whether a given
syscall should be debugged (settable via a sysctl). For printing
we have the <literal>LMSG</literal> and <literal>ARGS</literal>
macros. These wrap a format string so that debugging messages
have a uniform form.</para>
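<para>To illustrate the idea, here is a userland approximation of
such macros. It is a sketch only: the exact kernel definitions may
differ, all names carry an <literal>sk_</literal> or
<literal>SK_</literal> prefix to mark them as hypothetical, the
debug map is a plain array instead of a sysctl-controlled bitmap,
and the process id is a fixed stand-in value.</para>

<programlisting><![CDATA[#include <stdio.h>

#define SK_NSYSCALLS	4
enum { SK_SYS_open = 0, SK_SYS_close = 1 };

static int  sk_debug_map[SK_NSYSCALLS];	/* non-zero: debug this syscall */
static long sk_pid = 42;		/* stand-in for the process id */

/* Should the named syscall be debugged? */
#define sk_ldebug(name)		(sk_debug_map[SK_SYS_ ## name] != 0)

/* Wrap a format string: uniform prefix, syscall name, newline. */
#define SK_ARGS(nm, fmt)	"linux(%ld): " #nm "(" fmt ")\n", sk_pid
/* Same prefix for a free-form message. */
#define SK_LMSG(fmt)		"linux(%ld): " fmt "\n", sk_pid

/*
 * Format a trace line for a hypothetical open() call into buf.
 * Returns 0 when debugging of "open" is off, otherwise the
 * snprintf() result.
 */
static int
sk_trace_open(char *buf, size_t len, const char *path, unsigned int flags)
{
	if (!sk_ldebug(open))
		return (0);
	return (snprintf(buf, len, SK_ARGS(open, "%s, 0x%x"), path, flags));
}]]></programlisting>

<para>Each macro expands to a format string followed by the process
id argument, so the caller only supplies the syscall-specific part
of the format and its arguments.</para>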
|
|
</sect3>
|
|
</sect2>
|
|
</sect1>
|
|
|
|
<sect1 xml:id="conclusion">
|
|
<title>Conclusion</title>
|
|
|
|
<sect2 xml:id="results">
|
|
<title>Results</title>
|
|
|
|
<para>As of April 2007 the &linux; emulation layer is capable of
emulating the &linux; 2.6.16 kernel quite well. The remaining
problems concern futexes, the unfinished *at family of syscalls,
problematic signal delivery, missing <function>epoll</function>
and <function>inotify</function>, and probably some bugs we have
not discovered yet. Despite this, we are capable of running
basically all the &linux; programs included in the &os; Ports
Collection with Fedora Core 4 at 2.6.16, and there are some
rudimentary reports of success with Fedora Core 6 at 2.6.16. The
Fedora Core 6 linux_base was recently committed, enabling further
testing of the emulation layer and giving us more hints about
where to focus our effort in implementing missing
functionality.</para>
|
|
|
|
<para>We are able to run the most used applications like
<package>www/linux-firefox</package>,
<package>www/linux-opera</package>,
<package>net-im/skype</package> and some games from
the Ports Collection. Some of the programs misbehave under 2.6
emulation, but this is currently under investigation and will
hopefully be fixed soon. The only big application that is known
not to work is the &linux; &java; Development Kit, and this is
because it requires the <function>epoll</function> facility,
which is not directly related to the &linux; kernel 2.6.</para>
|
|
|
|
<para>We hope to enable 2.6.16 emulation by default some time after
&os; 7.0 is released, at least to expose the 2.6 emulation parts to
wider testing. Once this is done we can switch to the
Fedora Core 6 linux_base, which is the ultimate plan.</para>
|
|
</sect2>
|
|
|
|
<sect2 xml:id="future-work">
|
|
<title>Future work</title>
|
|
|
|
<para>Future work should focus on fixing the remaining issues with
futexes, implementing the rest of the *at family of syscalls,
fixing signal delivery, and possibly implementing the
<function>epoll</function> and <function>inotify</function>
facilities.</para>
|
|
|
|
<para>We hope to be able to run the most important programs
flawlessly soon, so that we can switch to the 2.6 emulation by
default and make Fedora Core 6 the default linux_base, because
the currently used Fedora Core 4 is no longer supported.</para>
|
|
|
|
<para>Another possible goal is to share our code with NetBSD and
DragonFly BSD. NetBSD has some support for 2.6 emulation, but it
is far from finished and not really tested. DragonFly BSD has
expressed some interest in porting the 2.6 improvements.</para>
|
|
|
|
<para>Generally, as &linux; develops we would like to keep up with
its development by implementing newly added syscalls.
<function>splice</function> comes to mind first. Some already
implemented syscalls are also heavily crippled,
<function>mremap</function> for example. Some performance
improvements can also be made, such as finer-grained
locking.</para>
|
|
</sect2>
|
|
|
|
<sect2 xml:id="team">
|
|
<title>Team</title>
|
|
|
|
<para>I cooperated on this project with (in alphabetical order):</para>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>&a.jhb.email;</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>&a.kib.email;</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>Emmanuel Dreyfus</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>Scot Hetzel</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>&a.jkim.email;</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>&a.netchild.email;</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>&a.ssouhlal.email;</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>Li Xiao</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>&a.davidxu.email;</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para>I would like to thank all those people for their advice, code
|
|
reviews and general support.</para>
|
|
</sect2>
|
|
</sect1>
|
|
|
|
<sect1 xml:id="literatures">
|
|
<title>Literature</title>
|
|
|
|
<orderedlist>
|
|
<listitem>
|
|
<para>Marshall Kirk McKusick and George V. Neville-Neil. The
Design and Implementation of the &os; Operating System.
Addison-Wesley, 2005.</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><uri xlink:href="http://www.FreeBSD.org">http://www.FreeBSD.org</uri></para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><uri xlink:href="http://tldp.org">http://tldp.org</uri></para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><uri xlink:href="http://www.linux.org">http://www.linux.org</uri></para>
|
|
</listitem>
|
|
</orderedlist>
|
|
</sect1>
|
|
</article>
|