<?xml version="1.0" encoding="iso-8859-1"?>
|
|
<!--
|
|
The FreeBSD Documentation Project
|
|
|
|
Copyright (c) 2002 Sergey Lyubka <devnull@uptsoft.com>
|
|
All rights reserved
|
|
Copyright (c) 2014 Sergio Andrés Gómez del Real <Sergio.G.delReal@gmail.com>
|
|
All rights reserved
|
|
$FreeBSD$
|
|
-->
|
|
|
|
<chapter xmlns="http://docbook.org/ns/docbook"
  xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0"
  xml:id="boot">
|
|
|
|
<info>
|
|
<title>Bootstrapping and Kernel Initialization</title>
|
|
|
|
<authorgroup>
|
|
<author>
|
|
<personname>
|
|
<firstname>Sergey</firstname>
|
|
<surname>Lyubka</surname>
|
|
</personname>
|
|
|
|
<contrib>Contributed by </contrib>
|
|
</author>
|
|
<!-- devnull@uptsoft.com 12 Jun 2002 -->
|
|
</authorgroup>
|
|
|
|
<authorgroup>
|
|
<author>
|
|
<personname>
|
|
<firstname>Sergio Andrés</firstname>
|
|
<surname>Gómez del Real</surname>
|
|
</personname>
|
|
|
|
<contrib>Updated and enhanced by </contrib>
|
|
</author>
|
|
<!-- Sergio.G.DelReal@gmail.com Jan 2014 -->
|
|
</authorgroup>
|
|
</info>
|
|
|
|
<sect1 xml:id="boot-synopsis">
|
|
<title>Synopsis</title>
|
|
|
|
<indexterm><primary>BIOS</primary></indexterm>
|
|
<indexterm><primary>firmware</primary></indexterm>
|
|
<indexterm><primary>POST</primary></indexterm>
|
|
<indexterm><primary>IA-32</primary></indexterm>
|
|
<indexterm><primary>booting</primary></indexterm>
|
|
<indexterm><primary>system initialization</primary></indexterm>
|
|
<para>This chapter is an overview of the boot and system
|
|
initialization processes, from the <acronym>BIOS</acronym> (firmware)
<acronym>POST</acronym> to the creation of the first user process. Since the initial
steps of system startup are very architecture-dependent, the
|
|
IA-32 architecture is used as an example.</para>
|
|
|
|
<para>The &os; boot process can be surprisingly complex. After
|
|
control is passed from the <acronym>BIOS</acronym>, a considerable amount of
|
|
low-level configuration must be done before the kernel can be
|
|
loaded and executed. This setup must be done in a simple and
|
|
flexible manner, allowing the user a great deal of customization
|
|
possibilities.</para>
|
|
</sect1>
|
|
|
|
<sect1 xml:id="boot-overview">
|
|
<title>Overview</title>
|
|
|
|
<para>The boot process is an extremely machine-dependent
|
|
activity. Not only must code be written for every computer
|
|
architecture, but there may also be multiple types of booting on
|
|
the same architecture. For example, a directory listing of
|
|
<filename>/usr/src/sys/boot</filename>
|
|
reveals a great amount of architecture-dependent code. There is
|
|
a directory for each of the various supported architectures. In
|
|
the x86-specific <filename>i386</filename>
|
|
directory, there are subdirectories for different boot standards
|
|
like <filename>mbr</filename> (Master Boot Record),
|
|
<filename>gpt</filename> (<acronym>GUID</acronym> Partition
|
|
Table), and <filename>efi</filename> (Extensible Firmware
|
|
Interface). Each boot standard has its own conventions and data
|
|
structures. The example that follows shows booting an x86
|
|
computer from an <acronym>MBR</acronym> hard drive with the &os;
|
|
<filename>boot0</filename> multi-boot loader stored in the very
|
|
first sector. That boot code starts the &os; three-stage boot
|
|
process.</para>
|
|
|
|
<para>The key to understanding this process is that it is a series
|
|
of stages of increasing complexity. These stages are
|
|
<filename>boot1</filename>, <filename>boot2</filename>, and
|
|
<filename>loader</filename> (see &man.boot.8; for more detail).
|
|
The boot system executes each stage in sequence. The last
|
|
stage, <filename>loader</filename>, is responsible for loading
|
|
the &os; kernel. Each stage is examined in the following
|
|
sections.</para>
|
|
|
|
<para>Here is an example of the output generated by the
|
|
different boot stages. Actual output
|
|
may differ from machine to machine:</para>
|
|
|
|
<informaltable frame="none" pgwide="0">
|
|
<tgroup cols="2">
|
|
<tbody>
|
|
<row>
|
|
<entry>&os; Component</entry>
|
|
<entry>Output (may vary)</entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry><literal>boot0</literal></entry>
|
|
<entry><screen>F1 FreeBSD
F2 BSD
F5 Disk 2</screen></entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry><literal>boot2</literal>
|
|
<footnote><para>This prompt will appear if the user
|
|
presses a key just after selecting an OS to boot
|
|
at the <literal>boot0</literal>
|
|
stage.</para></footnote></entry>
|
|
<entry><screen>&gt;&gt;FreeBSD/i386 BOOT
Default: 1:ad(1,a)/boot/loader
boot:</screen></entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry><filename>loader</filename></entry>
|
|
<entry><screen>BTX loader 1.00 BTX version is 1.02
Consoles: internal video/keyboard
BIOS drive C: is disk0
BIOS 639kB/2096064kB available memory

FreeBSD/x86 bootstrap loader, Revision 1.1
Console internal video/keyboard
(root@snap.freebsd.org, Thu Jan 16 22:18:05 UTC 2014)
Loading /boot/defaults/loader.conf
/boot/kernel/kernel text=0xed9008 data=0x117d28+0x176650 syms=[0x8+0x137988+0x8+0x1515f8]</screen></entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry>kernel</entry>
|
|
<entry><screen>Copyright (c) 1992-2013 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 10.0-RELEASE #0 r260789: Thu Jan 16 22:34:59 UTC 2014
root@snap.freebsd.org:/usr/obj/usr/src/sys/GENERIC amd64
FreeBSD clang version 3.3 (tags/RELEASE_33/final 183502) 20130610</screen></entry>
|
|
</row>
|
|
</tbody>
|
|
</tgroup>
|
|
</informaltable>
|
|
</sect1>
|
|
|
|
<sect1 xml:id="boot-bios">
|
|
<title>The <acronym>BIOS</acronym></title>
|
|
|
|
<para>When the computer powers on, the processor's registers are
|
|
set to some predefined values. One of the registers is the
|
|
<emphasis>instruction pointer</emphasis> register, and its value
|
|
after a power on is well defined: it is a 32-bit value of
|
|
<literal>0xfffffff0</literal>. The instruction pointer register
|
|
(also known as the Program Counter) points to code to be
|
|
executed by the processor. Another important register is the
|
|
<literal>cr0</literal> 32-bit control register, and its value
|
|
just after a reboot is <literal>0</literal>. One of
|
|
<literal>cr0</literal>'s bits, the PE (Protection Enabled) bit,
|
|
indicates whether the processor is running in 32-bit protected
|
|
mode or 16-bit real mode. Since this bit is cleared at boot
|
|
time, the processor boots in 16-bit real mode. Real mode means,
|
|
among other things, that linear and physical addresses are
|
|
identical. The reason for the processor not to start
|
|
immediately in 32-bit protected mode is backwards compatibility.
|
|
In particular, the boot process relies on the services provided
|
|
by the <acronym>BIOS</acronym>, and the <acronym>BIOS</acronym>
|
|
itself works in legacy, 16-bit code.</para>
|
|
|
|
<para>The value of <literal>0xfffffff0</literal> is slightly less
|
|
than 4 GB, so unless the machine has 4 GB of physical
|
|
memory, it cannot point to a valid memory address. The
|
|
computer's hardware translates this address so that it points to
|
|
a <acronym>BIOS</acronym> memory block.</para>
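<para>As a quick check of <quote>slightly less than 4 GB</quote>, the
following Python sketch computes the gap between the reset address and
the top of the 32-bit address space. The constant names are chosen here
for illustration only and do not come from the &os; sources.</para>

<programlisting language="python"># Illustrative arithmetic only; the names below are hypothetical.
RESET_VECTOR = 0xfffffff0      # instruction pointer value after power-on
FOUR_GIB = 2 ** 32             # first address beyond the 32-bit address space

bytes_below_top = FOUR_GIB - RESET_VECTOR
print(hex(RESET_VECTOR), "is", bytes_below_top, "bytes below 4 GB")</programlisting>

<para>The difference is 16, so execution starts only 16 bytes below the
4 GB boundary.</para>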
|
|
|
|
<para>The <acronym>BIOS</acronym> (Basic Input Output
|
|
System) is a chip on the motherboard that has a relatively small
|
|
amount of read-only memory (<acronym>ROM</acronym>). This
|
|
memory contains various low-level routines that are specific to
|
|
the hardware supplied with the motherboard. The processor will
|
|
first jump to the address <literal>0xfffffff0</literal>, which really resides in
|
|
the <acronym>BIOS</acronym>'s memory. Usually this address
|
|
contains a jump instruction to the <acronym>BIOS</acronym>'s
|
|
<acronym>POST</acronym> routines.</para>
|
|
|
|
<para>The <acronym>POST</acronym> (Power On Self Test)
|
|
is a set of routines including the memory check, system bus
|
|
check, and other low-level initialization so the
|
|
<acronym>CPU</acronym> can set up the computer properly. An
important step of this stage is determining the boot device.
|
|
Modern <acronym>BIOS</acronym> implementations permit the
|
|
selection of a boot device, allowing booting from a floppy,
|
|
<acronym>CD-ROM</acronym>, hard disk, or other devices.</para>
|
|
|
|
<para>The very last thing in the <acronym>POST</acronym> is the
|
|
<literal>INT 0x19</literal> instruction. The
|
|
<literal>INT 0x19</literal> handler reads 512 bytes from the
|
|
first sector of the boot device into memory at address
|
|
<literal>0x7c00</literal>. The term
|
|
<emphasis>first sector</emphasis> originates from hard drive
|
|
architecture, where the magnetic platter is divided into a number
of concentric tracks. Tracks are numbered, and every track is
divided into a number (typically 63) of sectors. Track numbers
start at 0, but sector numbers start from 1. Track 0 is the
outermost on the magnetic platter, and sector 1, the first sector,
|
|
has a special purpose. It is also called the
|
|
<acronym>MBR</acronym>, or Master Boot Record. The remaining
|
|
sectors on the first track are never used.</para>
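<para>The numbering rules above can be illustrated with the usual
conversion from a cylinder, head and sector triple to a zero-based
linear sector number. This is only a sketch under an assumed geometry
of 16 heads and 63 sectors per track; neither the function name nor the
geometry is taken from the &os; sources.</para>

<programlisting language="python"># Assumed geometry for illustration; real drives report their own values.
HEADS_PER_CYLINDER = 16
SECTORS_PER_TRACK = 63


def chs_to_lba(cylinder, head, sector):
    """Convert a CHS triple to a zero-based linear sector number."""
    if sector == 0:
        raise ValueError("CHS sector numbers start at 1")
    return (cylinder * HEADS_PER_CYLINDER + head) * SECTORS_PER_TRACK + (sector - 1)


# Track 0, head 0, sector 1 is the very first sector on the disk.
print(chs_to_lba(0, 0, 1))
# The first sector of the next track follows the sectors of track 0.
print(chs_to_lba(0, 1, 1))</programlisting>

<para>Because sector numbers start at 1 while track numbers start at 0,
the conversion subtracts one from the sector number only.</para>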
|
|
|
|
<para>This sector is our boot-sequence starting point. As we will
|
|
see, this sector contains a copy of our
|
|
<filename>boot0</filename> program. A jump is made by the
|
|
<acronym>BIOS</acronym> to address <literal>0x7c00</literal> so
|
|
it starts executing.</para>
|
|
</sect1>
|
|
|
|
<sect1 xml:id="boot-boot0">
|
|
<title>The Master Boot Record (<literal>boot0</literal>)</title>
|
|
|
|
<indexterm><primary>MBR</primary></indexterm>
|
|
|
|
<para>After control is received from the <acronym>BIOS</acronym>
|
|
at memory address <literal>0x7c00</literal>,
|
|
<filename>boot0</filename> starts executing. It is the first
|
|
piece of code under &os; control. The task of
|
|
<filename>boot0</filename> is quite simple: scan the partition
|
|
table and let the user choose which partition to boot from. The
|
|
Partition Table is a special, standard data structure embedded
|
|
in the <acronym>MBR</acronym> (hence embedded in
|
|
<filename>boot0</filename>) describing the four standard PC
|
|
<quote>partitions</quote>
|
|
<footnote>
|
|
<para><link xlink:href="http://en.wikipedia.org/wiki/Master_boot_record"></link></para></footnote>.
|
|
<filename>boot0</filename> resides in the filesystem as
|
|
<filename>/boot/boot0</filename>. It is a small 512-byte file,
|
|
and it is exactly what &os;'s installation procedure wrote to
|
|
the hard disk's <acronym>MBR</acronym> if you chose the <quote>bootmanager</quote>
|
|
option at installation time. Indeed,
|
|
<filename>boot0</filename> <emphasis>is</emphasis> the
|
|
<acronym>MBR</acronym>.</para>
|
|
|
|
<para>As mentioned previously, the <literal>INT 0x19</literal>
|
|
instruction causes the <literal>INT 0x19</literal> handler to
|
|
load an <acronym>MBR</acronym> (<filename>boot0</filename>) into
|
|
memory at address <literal>0x7c00</literal>. The source file
|
|
for <filename>boot0</filename> can be found in
|
|
<filename>sys/boot/i386/boot0/boot0.S</filename>, which is an
|
|
awesome piece of code written by Robert Nordier.</para>
|
|
|
|
<para>A special structure starting from offset
|
|
<literal>0x1be</literal> in the <acronym>MBR</acronym> is called
|
|
the <emphasis>partition table</emphasis>. It has four records
|
|
of 16 bytes each, called <emphasis>partition records</emphasis>,
|
|
which represent how the hard disk is partitioned, or, in &os;'s
|
|
terminology, sliced. One byte of those 16 says whether a
|
|
partition (slice) is bootable or not. Exactly one record must
|
|
have that flag set, otherwise <filename>boot0</filename>'s code
|
|
will refuse to proceed.</para>
|
|
|
|
<para>A partition record has the following fields:</para>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>the 1-byte filesystem type</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>the 1-byte bootable flag</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>the 6-byte descriptor in CHS format</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>the 8-byte descriptor in LBA format</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para>A partition record descriptor contains information about
|
|
where exactly the partition resides on the drive. Both
|
|
descriptors, <acronym>LBA</acronym> and <acronym>CHS</acronym>,
|
|
describe the same information, but in different ways:
|
|
<acronym>LBA</acronym> (Logical Block Addressing) has the
|
|
starting sector for the partition and the partition's length,
|
|
while <acronym>CHS</acronym> (Cylinder Head Sector) has
|
|
coordinates for the first and last sectors of the partition.
|
|
The partition table ends with the special signature
|
|
<literal>0xaa55</literal>.</para>
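<para>The layout described above can be sketched in a few lines of
Python that decode the four records of a 512-byte sector image. The
byte positions used inside each record follow the conventional
<acronym>MBR</acronym> layout (flag, first <acronym>CHS</acronym>
triple, type, last <acronym>CHS</acronym> triple, then the two
<acronym>LBA</acronym> words); the function names and the sample image
are hypothetical and serve only as an illustration.</para>

<programlisting language="python">PARTITION_TABLE_OFFSET = 0x1be   # start of the four 16-byte records
RECORD_SIZE = 16
SIGNATURE_OFFSET = 0x1fe         # last two bytes of the sector


def parse_record(record):
    """Split one 16-byte partition record into its fields."""
    return {
        "bootable": record[0] == 0x80,
        "chs_first": tuple(record[1:4]),
        "type": record[4],
        "chs_last": tuple(record[5:8]),
        "lba_start": int.from_bytes(record[8:12], "little"),
        "lba_length": int.from_bytes(record[12:16], "little"),
    }


def parse_mbr(sector):
    """Return the four partition records of a 512-byte MBR image."""
    if len(sector) != 512:
        raise ValueError("an MBR is exactly one 512-byte sector")
    if int.from_bytes(sector[SIGNATURE_OFFSET:], "little") != 0xaa55:
        raise ValueError("missing 0xaa55 signature")
    records = []
    for index in range(4):
        start = PARTITION_TABLE_OFFSET + index * RECORD_SIZE
        records.append(parse_record(sector[start:start + RECORD_SIZE]))
    return records


# Build a synthetic image with one active slice of type 0xa5.
image = bytearray(512)
image[0x1be:0x1ce] = (bytes([0x80, 1, 1, 0, 0xa5, 15, 63, 100])
                      + (63).to_bytes(4, "little")
                      + (204800).to_bytes(4, "little"))
image[0x1fe:0x200] = b"\x55\xaa"

slices = parse_mbr(bytes(image))
print(slices[0])</programlisting>

<para>The three unused records decode as type 0 with the bootable flag
clear, which is how an empty slot appears in the table.</para>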
|
|
|
|
<para>The <acronym>MBR</acronym> must fit into 512 bytes, a single
|
|
disk sector. This program uses low-level <quote>tricks</quote>
|
|
like taking advantage of the side effects of certain
|
|
instructions and reusing register values from previous
|
|
operations to make the most out of the fewest possible
|
|
instructions. Care must also be taken when handling the
|
|
partition table, which is embedded in the <acronym>MBR</acronym>
|
|
itself. For these reasons, be very careful when modifying
|
|
<filename>boot0.S</filename>.</para>
|
|
|
|
<para>Note that the <filename>boot0.S</filename> source file
|
|
is assembled <quote>as is</quote>: instructions are translated
|
|
one by one to binary, with no additional information (no
|
|
<acronym>ELF</acronym> file format, for example). This kind of
|
|
low-level control is achieved at link time through special
|
|
control flags passed to the linker. For example, the text
|
|
section of the program is set to be located at address
|
|
<literal>0x600</literal>. In practice this means that
|
|
<filename>boot0</filename> must be loaded to memory address
|
|
<literal>0x600</literal> in order to function properly.</para>
|
|
|
|
<para>It is worth looking at the <filename>Makefile</filename> for
|
|
<filename>boot0</filename>
|
|
(<filename>sys/boot/i386/boot0/Makefile</filename>), as it
|
|
defines some of the run-time behavior of
|
|
<filename>boot0</filename>. For instance, if a terminal
|
|
connected to the serial port (COM1) is used for I/O, the macro
|
|
<literal>SIO</literal> must be defined
|
|
(<literal>-DSIO</literal>). <literal>-DPXE</literal> enables
|
|
boot through <acronym>PXE</acronym> by pressing
|
|
<keycap>F6</keycap>. Additionally, the program defines a set of
|
|
<emphasis>flags</emphasis> that allow further modification of
|
|
its behavior. All of this is illustrated in the
|
|
<filename>Makefile</filename>. For example, look at the
|
|
linker directives which command the linker to start the text
|
|
section at address <literal>0x600</literal>, and to build the
|
|
output file <quote>as is</quote> (strip out any file
|
|
formatting):</para>
|
|
|
|
<figure xml:id="boot-boot0-makefile-as-is">
|
|
<title><filename>sys/boot/i386/boot0/Makefile</filename></title>
|
|
|
|
<programlisting>      BOOT_BOOT0_ORG?=0x600
      LDFLAGS=-e start -Ttext ${BOOT_BOOT0_ORG} \
      -Wl,-N,-S,--oformat,binary</programlisting>
|
|
</figure>
|
|
|
|
<para>Let us now start our study of the <acronym>MBR</acronym>, or
|
|
<filename>boot0</filename>, starting where execution
|
|
begins.</para>
|
|
|
|
<note>
|
|
<para>Some modifications have been made to some instructions in
|
|
favor of better exposition. For example, some macros are
|
|
expanded, and some macro tests are omitted when the result of
|
|
the test is known. This applies to all of the code examples
|
|
shown.</para>
|
|
</note>
|
|
|
|
<figure xml:id="boot-boot0-entrypoint">
|
|
<title><filename>sys/boot/i386/boot0/boot0.S</filename></title>
|
|
|
|
<programlisting>start:
      cld                       # String ops inc
      xorw %ax,%ax              # Zero
      movw %ax,%es              # Address
      movw %ax,%ds              #  data
      movw %ax,%ss              # Set up
      movw $0x7c00,%sp          #  stack</programlisting>
|
|
</figure>
|
|
|
|
<para>This first block of code is the entry point of the program.
|
|
It is where the <acronym>BIOS</acronym> transfers control.
|
|
First, it makes sure that the string operations autoincrement
|
|
their pointer operands (the <literal>cld</literal> instruction)
|
|
<footnote>
|
|
<para>When in doubt, we refer the reader to the official Intel
|
|
manuals, which describe the exact semantics for each
|
|
instruction: <link xlink:href="http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html"></link>.</para></footnote>.
|
|
Then, as it makes no assumption about the state of the segment
|
|
registers, it initializes them. Finally, it sets the stack
|
|
pointer register (<literal>%sp</literal>) to address
|
|
<literal>0x7c00</literal>, so we have a working stack.</para>
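<para>Placing the stack pointer at <literal>0x7c00</literal> works
because the x86 stack grows toward lower addresses, away from the code
that was loaded at that address. The toy model below, which is not
taken from <filename>boot0.S</filename> and uses hypothetical names,
shows where a pushed word ends up.</para>

<programlisting language="python"># Toy model of a 16-bit stack; not taken from boot0.S.
memory = bytearray(0x10000)
sp = 0x7c00                        # value loaded into %sp by the entry code


def pushw(value):
    """Model pushw: decrement %sp by two, then store a 16-bit word."""
    global sp
    sp = (sp - 2) % 0x10000
    memory[sp:sp + 2] = value.to_bytes(2, "little")


pushw(0x0080)                      # e.g. saving %dx with a drive number in %dl
print(hex(sp))                     # prints 0x7bfe</programlisting>

<para>The pushed word occupies <literal>0x7bfe</literal> and
<literal>0x7bff</literal>, so the sector loaded at
<literal>0x7c00</literal> is left untouched.</para>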
|
|
|
|
<para>The next block is responsible for the relocation and
|
|
subsequent jump to the relocated code.</para>
|
|
|
|
<figure xml:id="boot-boot0-relocation">
|
|
<title><filename>sys/boot/i386/boot0/boot0.S</filename></title>
|
|
|
|
<programlisting>      movw $0x7c00,%si          # Source
      movw $0x600,%di           # Destination
      movw $512,%cx             # Byte count
      rep                       # Relocate
      movsb                     #  code
      movw %di,%bp              # Address variables
      movb $16,%cl              # Bytes to clear
      rep                       # Zero
      stosb                     #  them
      incb -0xe(%di)            # Set the S field to 1
      jmp main-0x7c00+0x600     # Jump to relocated code</programlisting>
|
|
</figure>
|
|
|
|
<para>Because <filename>boot0</filename> is loaded by the
|
|
<acronym>BIOS</acronym> to address <literal>0x7c00</literal>, it
|
|
copies itself to address <literal>0x600</literal> and then
|
|
transfers control there (recall that it was linked to execute at
|
|
address <literal>0x600</literal>). The source address,
|
|
<literal>0x7c00</literal>, is copied to register
|
|
<literal>%si</literal>. The destination address,
<literal>0x600</literal>, is copied to register <literal>%di</literal>.
|
|
The number of bytes to copy, <literal>512</literal> (the
|
|
program's size), is copied to register <literal>%cx</literal>.
|
|
Next, the <literal>rep</literal> instruction repeats the
|
|
instruction that follows, that is, <literal>movsb</literal>, the
|
|
number of times dictated by the <literal>%cx</literal> register.
|
|
The <literal>movsb</literal> instruction copies the byte pointed
|
|
to by <literal>%si</literal> to the address pointed to by
|
|
<literal>%di</literal>. This is repeated another 511 times. On
|
|
each repetition, both the source and destination registers,
|
|
<literal>%si</literal> and <literal>%di</literal>, are
|
|
incremented by one. Thus, upon completion of the 512-byte copy,
|
|
<literal>%di</literal> has the value
|
|
<literal>0x600</literal>+<literal>512</literal>=
|
|
<literal>0x800</literal>, and <literal>%si</literal> has the
|
|
value <literal>0x7c00</literal>+<literal>512</literal>=
|
|
<literal>0x7e00</literal>; we have thus completed the code
|
|
<emphasis>relocation</emphasis>.</para>
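<para>The effect of the <literal>rep</literal>/<literal>movsb</literal>
pair can be modelled in Python as a plain byte-copy loop. This is a
simplified sketch: the variable names mirror the registers, the constant
names <literal>LOAD</literal> and <literal>ORIGIN</literal> are chosen
for this example, and the copied data is a stand-in rather than the real
<filename>boot0</filename> image.</para>

<programlisting language="python">LOAD = 0x7c00      # where the BIOS placed the sector
ORIGIN = 0x600     # where boot0 was linked to run

memory = bytearray(0x10000)
memory[LOAD:LOAD + 512] = bytes(range(256)) * 2    # stand-in for the boot0 image

si, di, cx = LOAD, ORIGIN, 512
while cx != 0:                 # rep: repeat until %cx reaches zero
    memory[di] = memory[si]    # movsb: copy one byte ...
    si += 1                    # ... and advance both pointers (cld was executed)
    di += 1
    cx -= 1

print(hex(si), hex(di))        # prints 0x7e00 0x800</programlisting>

<para>After the loop, the 512 bytes at <literal>0x600</literal> are
identical to those at <literal>0x7c00</literal>, and both pointer
registers hold the end addresses given above.</para>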
|
|
|
|
<para>Next, the destination register
|
|
<literal>%di</literal> is copied to <literal>%bp</literal>.
|
|
<literal>%bp</literal> gets the value <literal>0x800</literal>.
|
|
The value <literal>16</literal> is copied to
|
|
<literal>%cl</literal> in preparation for a new string operation
|
|
(like our previous <literal>movsb</literal>). Now,
|
|
<literal>stosb</literal> is executed 16 times. This instruction
|
|
copies a <literal>0</literal> value to the address pointed to by
|
|
the destination register (<literal>%di</literal>, which is
|
|
<literal>0x800</literal>), and increments it. This is repeated
|
|
another 15 times, so <literal>%di</literal> ends up with value
|
|
<literal>0x810</literal>. Effectively, this clears the address
|
|
range <literal>0x800</literal>-<literal>0x80f</literal>. This
|
|
range is used as a (fake) partition table for writing the
|
|
<acronym>MBR</acronym> back to disk. Finally, the sector field
|
|
for the <acronym>CHS</acronym> addressing of this fake partition
|
|
is given the value 1 and a jump is made to the main function
|
|
from the relocated code. Note that until this jump to the
|
|
relocated code, any reference to an absolute address was
|
|
avoided.</para>
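<para>The address arithmetic behind the clearing loop and the
<literal>incb -0xe(%di)</literal> instruction can be verified with a
short Python model. It is a sketch only: memory is pre-filled with
<literal>0xff</literal> to make the cleared range visible, and the
variable names simply mirror the registers.</para>

<programlisting language="python">memory = bytearray(b"\xff" * 0x10000)   # pretend memory holds leftover garbage
al = 0                                  # %ax was zeroed by the entry code
di = 0x800                              # value left in %di by the relocation
bp = di                                 # movw %di,%bp

cl = 16
while cl != 0:                          # rep stosb: store %al, advance %di
    memory[di] = al
    di += 1
    cl -= 1

sector_field = di - 0xe                 # address used by incb -0xe(%di)
memory[sector_field] += 1

print(hex(di), hex(sector_field), sector_field - bp)</programlisting>

<para>The incremented byte lies at <literal>0x802</literal>, that is,
at offset 2 of the 16-byte record starting at
<literal>0x800</literal>, which is where the sector number of the first
<acronym>CHS</acronym> triple is stored.</para>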
|
|
|
|
<para>The following code block tests whether the drive number
|
|
provided by the <acronym>BIOS</acronym> should be used, or
|
|
the one stored in <filename>boot0</filename>.</para>
|
|
|
|
<figure xml:id="boot-boot0-drivenumber">
|
|
<title><filename>sys/boot/i386/boot0/boot0.S</filename></title>
|
|
|
|
<programlisting>main:
      testb $SETDRV,-69(%bp)    # Set drive number?
      jnz disable_update        # Yes
      testb %dl,%dl             # Drive number valid?
      js save_curdrive          # Possibly (0x80 set)</programlisting>
|
|
</figure>
|
|
|
|
<para>This code tests the <literal>SETDRV</literal> bit
|
|
(<literal>0x20</literal>) in the <emphasis>flags</emphasis>
|
|
variable. Recall that register <literal>%bp</literal> points to
|
|
address location <literal>0x800</literal>, so the test is performed
on the <emphasis>flags</emphasis> variable at address
|
|
<literal>0x800</literal>-<literal>69</literal>=
|
|
<literal>0x7bb</literal>. This is an example of the type of
|
|
modifications that can be done to <filename>boot0</filename>.
|
|
The <literal>SETDRV</literal> flag is not set by default, but it
|
|
can be set in the <filename>Makefile</filename>. When set, the
|
|
drive number stored in the <acronym>MBR</acronym> is used
|
|
instead of the one provided by the <acronym>BIOS</acronym>. We
|
|
assume the defaults, and that the <acronym>BIOS</acronym>
|
|
provided a valid drive number, so we jump to
|
|
<literal>save_curdrive</literal>.</para>
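<para>Both tests in this block reduce to simple bit checks, which the
following Python sketch reproduces. The helper names and the sample
values passed to them are hypothetical; only the
<literal>SETDRV</literal> mask, the offset and the value of
<literal>%bp</literal> come from the text.</para>

<programlisting language="python">SETDRV = 0x20                  # bit mask given in the text
bp = 0x800                     # %bp after relocation
flags_address = bp - 69        # operand of testb $SETDRV,-69(%bp)


def use_stored_drive(flags):
    """True when the SETDRV bit is set, i.e. when jnz would be taken."""
    return (flags | SETDRV) == flags


def bios_drive_looks_valid(dl):
    """True when bit 7 of %dl is set, i.e. when js would be taken."""
    return dl >= 0x80


print(hex(flags_address))      # prints 0x7bb
print(use_stored_drive(0x80), bios_drive_looks_valid(0x80))</programlisting>

<para>With a flags byte in which <literal>SETDRV</literal> is clear and
a <acronym>BIOS</acronym> drive number of <literal>0x80</literal>, the
first branch is not taken and the second one is, which matches the path
followed in the text.</para>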
|
|
|
|
<para>The next block saves the drive number provided by the
|
|
<acronym>BIOS</acronym>, and calls <literal>putn</literal> to
|
|
print a new line on the screen.</para>
|
|
|
|
<figure xml:id="boot-boot0-savedrivenumber">
|
|
<title><filename>sys/boot/i386/boot0/boot0.S</filename></title>
|
|
|
|
<programlisting>save_curdrive:
      movb %dl, (%bp)           # Save drive number
      pushw %dx                 # Also in the stack
#ifdef TEST /* test code, print internal bios drive */
      rolb $1, %dl
      movw $drive, %si
      call putkey
#endif
      callw putn                # Print a newline</programlisting>
|
|
</figure>
|
|
|
|
<para>Note that we assume <varname>TEST</varname> is not defined,
|
|
so the conditional code in it is not assembled and will not
|
|
appear in our executable <filename>boot0</filename>.</para>
|
|
|
|
<para>Our next block implements the actual scanning of the
|
|
partition table. It prints to the screen the partition type for
|
|
each of the four entries in the partition table. It compares
|
|
each type with a list of well-known operating system file
|
|
systems. Examples of recognized partition types are
|
|
<acronym>NTFS</acronym> (&windows;, ID 0x7),
|
|
<literal>ext2fs</literal> (&linux;, ID 0x83), and, of course,
|
|
<literal>ffs</literal>/<literal>ufs2</literal> (&os;, ID 0xa5).
|
|
The implementation is fairly simple.</para>
|
|
|
|
<figure xml:id="boot-boot0-partition-scan">
|
|
<title><filename>sys/boot/i386/boot0/boot0.S</filename></title>
|
|
|
|
<programlisting>      movw $(partbl+0x4),%bx    # Partition table (+4)
      xorw %dx,%dx              # Item number

read_entry:
      movb %ch,-0x4(%bx)        # Zero active flag (ch == 0)
      btw %dx,_FLAGS(%bp)       # Entry enabled?
      jnc next_entry            # No
      movb (%bx),%al            # Load type
      test %al, %al             # skip empty partition
      jz next_entry
      movw $bootable_ids,%di    # Lookup tables
      movb $(TLEN+1),%cl        # Number of entries
      repne                     # Locate
      scasb                     #  type
      addw $(TLEN-1), %di       # Adjust
      movb (%di),%cl            # Partition
      addw %cx,%di              #  description
      callw putx                # Display it

next_entry:
      incw %dx                  # Next item
      addb $0x10,%bl            # Next entry
      jnc read_entry            # Till done</programlisting>
|
|
</figure>
|
|
|
|
<para>It is important to note that the active flag for each entry
|
|
is cleared, so after the scanning, <emphasis>no</emphasis>
|
|
partition entry is active in our memory copy of
|
|
<filename>boot0</filename>. Later, the active flag will be set
|
|
for the selected partition. This ensures that only one active
|
|
partition exists if the user chooses to write the changes back
|
|
to disk.</para>
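<para>The scan loop has no explicit counter: it ends when adding
<literal>0x10</literal> to the low byte of <literal>%bx</literal>
overflows. The Python model below illustrates why this happens after
exactly four records. It assumes that the relocated partition table
sits at <literal>0x600</literal>+<literal>0x1be</literal>, uses a
made-up table, and leaves out the per-entry enable check and the type
lookup.</para>

<programlisting language="python">ORIGIN = 0x600
partbl = ORIGIN + 0x1be          # partition table inside the relocated copy

memory = bytearray(0x10000)
memory[partbl + 0x00] = 0x80     # first record: stale active flag ...
memory[partbl + 0x04] = 0xa5     # ... and the FreeBSD type
memory[partbl + 0x14] = 0x83     # second record: Linux type

bx = partbl + 4                  # movw $(partbl+0x4),%bx
entries_seen = 0
types_found = []
while True:
    entries_seen += 1
    memory[bx - 4] = 0           # movb %ch,-0x4(%bx): clear the active flag
    if memory[bx] != 0:          # test %al,%al / jz: skip empty records
        types_found.append(hex(memory[bx]))
    low = bx % 0x100 + 0x10      # addb $0x10,%bl touches the low byte only
    if low > 0xff:               # carry set, so jnc is not taken
        break
    bx = bx - bx % 0x100 + low

print(entries_seen, types_found)</programlisting>

<para>The low byte runs through <literal>0xc2</literal>,
<literal>0xd2</literal>, <literal>0xe2</literal> and
<literal>0xf2</literal>; the next addition carries out of the byte, so
the loop stops after the fourth record with every active flag
cleared.</para>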
|
|
|
|
<para>The next block tests for other drives. At startup,
|
|
the <acronym>BIOS</acronym> writes the number of drives present
|
|
in the computer to address <literal>0x475</literal>. If there
|
|
are any other drives present, <filename>boot0</filename> prints
|
|
the current drive to screen. The user may command
|
|
<filename>boot0</filename> to scan partitions on another drive
|
|
later.</para>
|
|
|
|
<figure xml:id="boot-boot0-test-drives">
|
|
<title><filename>sys/boot/i386/boot0/boot0.S</filename></title>
|
|
|
|
<programlisting>      popw %ax                  # Drive number
      subb $0x7f,%al            # Does next
      cmpb 0x475,%al            #  drive exist? (from BIOS?)
      jb print_drive            # Yes
      decw %ax                  # Already drive 0?
      jz print_prompt           # Yes</programlisting>
|
|
</figure>
|
|
|
|
<para>We make the assumption that a single drive is present, so
|
|
the jump to <literal>print_drive</literal> is not performed. We
|
|
also assume nothing strange happened, so we jump to
|
|
<literal>print_prompt</literal>.</para>
|
|
|
|
<para>This next block just prints out a prompt followed by the
|
|
default option:</para>
|
|
|
|
<figure xml:id="boot-boot0-prompt">
|
|
<title><filename>sys/boot/i386/boot0/boot0.S</filename></title>
|
|
|
|
<programlisting>print_prompt:
      movw $prompt,%si          # Display
      callw putstr              #  prompt
      movb _OPT(%bp),%dl        # Display
      decw %si                  #  default
      callw putkey              #  key
      jmp start_input           # Skip beep</programlisting>
|
|
</figure>
|
|
|
|
<para>Finally, a jump is performed to
|
|
<literal>start_input</literal>, where the
|
|
<acronym>BIOS</acronym> services are used to start a timer and
to read user input from the keyboard; if the timer expires,
|
|
the default option will be selected:</para>
|
|
|
|
<figure xml:id="boot-boot0-start-input">
|
|
<title><filename>sys/boot/i386/boot0/boot0.S</filename></title>
|
|
|
|
<programlisting>start_input:
      xorb %ah,%ah              # BIOS: Get
      int $0x1a                 #  system time
      movw %dx,%di              # Ticks when
      addw _TICKS(%bp),%di      #  timeout
read_key:
      movb $0x1,%ah             # BIOS: Check
      int $0x16                 #  for keypress
      jnz got_key               # Have input
      xorb %ah,%ah              # BIOS: int 0x1a, 00
      int $0x1a                 #  get system time
      cmpw %di,%dx              # Timeout?
      jb read_key               # No</programlisting>
|
|
</figure>
|
|
|
|
<para>An interrupt is requested with number
|
|
<literal>0x1a</literal> and argument <literal>0</literal> in
|
|
register <literal>%ah</literal>. The <acronym>BIOS</acronym>
|
|
has a predefined set of services, requested by applications as
|
|
software-generated interrupts through the <literal>int</literal>
|
|
instruction and receiving arguments in registers (in this case,
|
|
<literal>%ah</literal>). Here, particularly, we are requesting
|
|
the number of clock ticks since last midnight; this counter is
maintained by the <acronym>BIOS</acronym> timer interrupt handler,
which is driven by the programmable interval timer
(<acronym>PIT</acronym>) rather than by the <acronym>RTC</acronym>
(Real Time Clock). The <acronym>BIOS</acronym> leaves the timer at
its default rate of about 18.2 Hz. When the request is satisfied, a
32-bit result is returned by the <acronym>BIOS</acronym> in
registers <literal>%cx</literal> and <literal>%dx</literal>
(low-order word in <literal>%dx</literal>). This result (the
<literal>%dx</literal> part) is copied to register
<literal>%di</literal>, and the value of the
<varname>TICKS</varname> variable is added to
<literal>%di</literal>. This variable resides in
<filename>boot0</filename> at offset <literal>_TICKS</literal>
(a negative value) from register <literal>%bp</literal> (which,
recall, points to <literal>0x800</literal>). The default value
of this variable is <literal>0xb6</literal> (182 in decimal).
Now, the idea is that <filename>boot0</filename> constantly
requests the time from the <acronym>BIOS</acronym>, and when the
value returned in register <literal>%dx</literal> is no longer
below the value stored in <literal>%di</literal>, the time is up
and the default selection will be made. Since the counter advances
18.2 times per second, this condition will be met after 10
|
|
seconds (this default behavior can be changed in the
|
|
<filename>Makefile</filename>). Until this time has passed,
|
|
<filename>boot0</filename> continually asks the
|
|
<acronym>BIOS</acronym> for any user input; this is done through
|
|
<literal>int 0x16</literal>, argument <literal>1</literal> in
|
|
<literal>%ah</literal>.</para>
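<para>The timeout arithmetic can be checked with a small Python model.
It is a sketch only: the function names and the starting tick count are
hypothetical, while the tick rate and the default
<varname>TICKS</varname> value are the ones quoted in the text.</para>

<programlisting language="python">TICKS_PER_SECOND = 18.2          # tick rate quoted in the text
TICKS = 0xb6                     # default timeout value stored in boot0


def deadline(now):
    """Model addw _TICKS(%bp),%di on a 16-bit register."""
    return (now + TICKS) % 0x10000


def timed_out(now, limit):
    """cmpw %di,%dx then jb read_key: keep polling while %dx is below %di."""
    return now >= limit


start = 1000                     # hypothetical low word of the BIOS tick count
limit = deadline(start)
print(TICKS / TICKS_PER_SECOND)  # seconds until the default is chosen
print(timed_out(start + 181, limit), timed_out(start + 182, limit))</programlisting>

<para>Dividing the 182 ticks by the tick rate gives the 10 second
delay mentioned above, and the polling loop ends on the first reading
that reaches the stored limit.</para>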
|
|
|
|
<para>Whether a key was pressed or the time expired, subsequent
|
|
code validates the selection. Based on the selection, the
|
|
register <literal>%si</literal> is set to point to the
|
|
appropriate partition entry in the partition table. This new
|
|
selection overrides the previous default one. Indeed, it
|
|
becomes the new default. Finally, the ACTIVE flag of the
|
|
selected partition is set. If it was enabled at compile time,
|
|
the in-memory version of <filename>boot0</filename> with these
|
|
modified values is written back to the <acronym>MBR</acronym> on
|
|
disk. We leave the details of this implementation to the
|
|
reader.</para>
|
|
|
|
<para>We now end our study with the last code block from the
|
|
<filename>boot0</filename> program:</para>
|
|
|
|
<figure xml:id="boot-boot0-check-bootable">
|
|
<title><filename>sys/boot/i386/boot0/boot0.S</filename></title>
|
|
|
|
<programlisting>      movw $0x7c00,%bx          # Address for read
      movb $0x2,%ah             # Read sector
      callw intx13              #  from disk
      jc beep                   # If error
      cmpw $0xaa55,0x1fe(%bx)   # Bootable?
      jne beep                  # No
      pushw %si                 # Save ptr to selected part.
      callw putn                # Leave some space
      popw %si                  # Restore, next stage uses it
      jmp *%bx                  # Invoke bootstrap</programlisting>
|
|
</figure>
|
|
|
|
<para>Recall that <literal>%si</literal> points to the selected
|
|
partition entry. This entry tells us where the partition begins
|
|
on disk. We assume, of course, that the partition selected is
|
|
actually a &os; slice.</para>
|
|
|
|
<note>
|
|
<para>From now on, we will favor the use of the technically
|
|
more accurate term <quote>slice</quote> rather than
|
|
<quote>partition</quote>.</para>
|
|
</note>
|
|
|
|
<para>The transfer buffer is set to <literal>0x7c00</literal>
|
|
(register <literal>%bx</literal>), and a read for the first
|
|
sector of the &os; slice is requested by calling
|
|
<literal>intx13</literal>. We assume that everything went okay,
|
|
so a jump to <literal>beep</literal> is not performed. In
|
|
particular, the new sector read must end with the magic sequence
|
|
<literal>0xaa55</literal>. Finally, the value in
<literal>%si</literal> (the pointer to the selected partition
table entry) is preserved for use by the next stage, and a jump is
|
|
performed to address <literal>0x7c00</literal>, where execution
|
|
of our next stage (the just-read block) is started.</para>
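<para>The bootable check compares a 16-bit word, so the byte order on
disk matters. The following Python sketch, with a hypothetical helper
name and synthetic sector contents, shows the comparison that
<literal>cmpw $0xaa55,0x1fe(%bx)</literal> performs on the sector just
read.</para>

<programlisting language="python">BOOT_SIGNATURE = 0xaa55
SIGNATURE_OFFSET = 0x1fe


def is_bootable(sector):
    """Compare the last 16-bit word of the sector with the signature."""
    word = int.from_bytes(sector[SIGNATURE_OFFSET:SIGNATURE_OFFSET + 2], "little")
    return word == BOOT_SIGNATURE


good = bytes(510) + b"\x55\xaa"   # little-endian: 0x55 is stored first
bad = bytes(512)
print(is_bootable(good), is_bootable(bad))   # prints True False</programlisting>

<para>Since x86 is little-endian, the signature appears on disk as the
byte <literal>0x55</literal> followed by
<literal>0xaa</literal>.</para>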
|
|
</sect1>
|
|
|
|
<sect1 xml:id="boot-boot1">
|
|
<title><literal>boot1</literal> Stage</title>
|
|
|
|
<para>So far we have gone through the following sequence:</para>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>The <acronym>BIOS</acronym> did some early hardware
|
|
initialization, including the <acronym>POST</acronym>. The
|
|
<acronym>MBR</acronym> (<filename>boot0</filename>) was
|
|
loaded from absolute disk sector one to address
|
|
<literal>0x7c00</literal>. Execution control was passed to
|
|
that location.</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para><filename>boot0</filename> relocated itself to the
|
|
location it was linked to execute
|
|
(<literal>0x600</literal>), followed by a jump to continue
|
|
execution at the appropriate place. Finally,
|
|
<filename>boot0</filename> loaded the first disk sector from
|
|
the &os; slice to address <literal>0x7c00</literal>.
|
|
Execution control was passed to that location.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para><filename>boot1</filename> is the next step in the
|
|
boot-loading sequence. It is the first of three boot stages.
|
|
Note that we have been dealing exclusively
|
|
with disk sectors. Indeed, the <acronym>BIOS</acronym> loads
|
|
the absolute first sector, while <filename>boot0</filename>
|
|
loads the first sector of the &os; slice. Both loads are to
|
|
address <literal>0x7c00</literal>. We can conceptually think of
|
|
these disk sectors as containing the files
|
|
<filename>boot0</filename> and <filename>boot1</filename>,
|
|
respectively, but in reality this is not entirely true for
|
|
<filename>boot1</filename>. Strictly speaking, unlike
|
|
<filename>boot0</filename>, <filename>boot1</filename> is not
|
|
part of the boot blocks
|
|
<footnote>
|
|
<para>There is a file <filename>/boot/boot1</filename>, but it
|
|
      is not written to the beginning of the &os; slice.
|
|
Instead, it is concatenated with <filename>boot2</filename>
|
|
to form <filename>boot</filename>, which
|
|
<emphasis>is</emphasis> written to the beginning of the &os;
|
|
slice and read at boot time.</para></footnote>.
|
|
Instead, a single, full-blown file, <filename>boot</filename>
|
|
(<filename>/boot/boot</filename>), is what ultimately is
|
|
written to disk. This file is a combination of
|
|
<filename>boot1</filename>, <filename>boot2</filename> and the
|
|
<literal>Boot Extender</literal> (or <acronym>BTX</acronym>).
|
|
This single file is greater in size than a single sector
|
|
(greater than 512 bytes). Fortunately,
|
|
<filename>boot1</filename> occupies <emphasis>exactly</emphasis>
|
|
the first 512 bytes of this single file, so when
|
|
<filename>boot0</filename> loads the first sector of the &os;
|
|
slice (512 bytes), it is actually loading
|
|
<filename>boot1</filename> and transferring control to
|
|
it.</para>
|
|
|
|
<para>The main task of <filename>boot1</filename> is to load the
|
|
next boot stage. This next stage is somewhat more complex. It
|
|
is composed of a server called the <quote>Boot Extender</quote>,
|
|
or <acronym>BTX</acronym>, and a client, called
|
|
<filename>boot2</filename>. As we will see, the last boot
|
|
stage, <filename>loader</filename>, is also a client of the
|
|
<acronym>BTX</acronym> server.</para>
|
|
|
|
<para>Let us now look in detail at what exactly is done by
|
|
<filename>boot1</filename>, starting like we did for
|
|
<filename>boot0</filename>, at its entry point:</para>
|
|
|
|
<figure xml:id="boot-boot1-entry">
|
|
<title><filename>sys/boot/i386/boot2/boot1.S</filename></title>
|
|
|
|
<programlisting>start:
|
|
jmp main</programlisting>
|
|
</figure>
|
|
|
|
<para>The entry point at <literal>start</literal> simply jumps
|
|
past a special data area to the label <literal>main</literal>,
|
|
which in turn looks like this:</para>
|
|
|
|
<figure xml:id="boot-boot1-main">
|
|
<title><filename>sys/boot/i386/boot2/boot1.S</filename></title>
|
|
|
|
<programlisting>main:
|
|
cld # String ops inc
|
|
xor %cx,%cx # Zero
|
|
mov %cx,%es # Address
|
|
mov %cx,%ds # data
|
|
mov %cx,%ss # Set up
|
|
mov $start,%sp # stack
|
|
mov %sp,%si # Source
|
|
mov $0x700,%di # Destination
|
|
incb %ch # Word count
|
|
rep # Copy
|
|
movsw # code</programlisting>
|
|
</figure>
|
|
|
|
<para>Just like <filename>boot0</filename>, this
|
|
code relocates <filename>boot1</filename>,
|
|
this time to memory address <literal>0x700</literal>. However,
|
|
unlike <filename>boot0</filename>, it does not jump there.
|
|
<filename>boot1</filename> is linked to execute at
|
|
address <literal>0x7c00</literal>, effectively where it was
|
|
loaded in the first place. The reason for this relocation will
|
|
be discussed shortly.</para>
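    <para>To make the copy loop concrete, the following Python sketch
    models what <literal>rep movsw</literal> does with the register
    values set up above. It is a toy model only: the 64 KB
    <literal>memory</literal> array and the filler bytes standing in
    for <filename>boot1</filename> are assumptions made for
    illustration. Note how incrementing <literal>%ch</literal> turns a
    zeroed <literal>%cx</literal> into <literal>0x100</literal>, that
    is, 256 words or 512 bytes.</para>

```python
SRC = 0x7C00      # where boot1 was loaded (the value of $start)
DST = 0x700       # relocation target
WORD = 2          # movsw moves one 16-bit word per iteration

cx = 0x0000       # xor %cx,%cx
cx += 0x0100      # incb %ch: incrementing the high byte adds 0x100

memory = bytearray(0x10000)                      # toy 64 KB address space
memory[SRC:SRC + 512] = bytes(range(256)) * 2    # stand-in for boot1's code

si, di = SRC, DST
while cx != 0:                                   # rep movsw with DF clear (cld)
    memory[di:di + WORD] = memory[si:si + WORD]
    si += WORD
    di += WORD
    cx -= 1

print(hex(si), hex(di))  # prints: 0x7e00 0x900
```

    <para>After the loop, both copies of the 512 bytes exist in the
    model, which mirrors the fact that <filename>boot1</filename>
    keeps executing from its original location.</para>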
|
|
|
|
<para>Next comes a loop that looks for the &os; slice. Although
|
|
<filename>boot0</filename> loaded <filename>boot1</filename>
|
|
from the &os; slice, no information was passed to it about this
|
|
<footnote>
|
|
<para>Actually we did pass a pointer to the slice entry in
|
|
register <literal>%si</literal>. However,
|
|
<filename>boot1</filename> does not assume that it was
|
|
loaded by <filename>boot0</filename> (perhaps some other
|
|
<acronym>MBR</acronym> loaded it, and did not pass this
|
|
information), so it assumes nothing.</para></footnote>,
|
|
so <filename>boot1</filename> must rescan the
|
|
partition table to find where the &os; slice starts. Therefore
|
|
it rereads the <acronym>MBR</acronym>:</para>
|
|
|
|
<figure xml:id="boot-boot1-find-freebsd">
|
|
<title><filename>sys/boot/i386/boot2/boot1.S</filename></title>
|
|
|
|
<programlisting> mov $part4,%si # Partition
|
|
cmpb $0x80,%dl # Hard drive?
|
|
jb main.4 # No
|
|
movb $0x1,%dh # Block count
|
|
callw nread # Read MBR</programlisting>
|
|
</figure>
|
|
|
|
<para>In the code above, register <literal>%dl</literal>
|
|
maintains information about the boot device. This is passed on
|
|
by the <acronym>BIOS</acronym> and preserved by the
|
|
<acronym>MBR</acronym>. Numbers <literal>0x80</literal> and
|
|
    greater tell us that we are dealing with a hard drive, so a
|
|
call is made to <literal>nread</literal>, where the
|
|
<acronym>MBR</acronym> is read. Arguments to
|
|
<literal>nread</literal> are passed through
|
|
<literal>%si</literal> and <literal>%dh</literal>. The memory
|
|
address at label <literal>part4</literal> is copied to
|
|
<literal>%si</literal>. This memory address holds a
|
|
<quote>fake partition</quote> to be used by
|
|
<literal>nread</literal>. The following is the data in the fake
|
|
partition:</para>
|
|
|
|
<figure xml:id="boot-boot2-make-fake-partition">
|
|
      <title><filename>sys/boot/i386/boot2/boot1.S</filename></title>
|
|
|
|
<programlisting> part4:
|
|
.byte 0x80, 0x00, 0x01, 0x00
|
|
.byte 0xa5, 0xfe, 0xff, 0xff
|
|
.byte 0x00, 0x00, 0x00, 0x00
|
|
.byte 0x50, 0xc3, 0x00, 0x00</programlisting>
|
|
</figure>
|
|
|
|
<para>In particular, the <acronym>LBA</acronym> for this fake
|
|
partition is hardcoded to zero. This is used as an argument to
|
|
the <acronym>BIOS</acronym> for reading absolute sector one from
|
|
the hard drive. Alternatively, CHS addressing could be used.
|
|
In this case, the fake partition holds cylinder 0, head 0 and
|
|
sector 1, which is equivalent to absolute sector one.</para>
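    <para>The fields of the fake partition can be checked by decoding
    its 16 bytes by hand. The sketch below assumes the conventional
    <acronym>MBR</acronym> partition entry layout (flag, starting CHS,
    type, ending CHS, starting LBA, sector count); the function and
    dictionary key names are invented for this example.</para>

```python
PART4 = bytes([
    0x80, 0x00, 0x01, 0x00,   # flag, start head, start sector/cyl-high, start cyl-low
    0xA5, 0xFE, 0xFF, 0xFF,   # type, end head, end sector/cyl-high, end cyl-low
    0x00, 0x00, 0x00, 0x00,   # starting LBA (little-endian)
    0x50, 0xC3, 0x00, 0x00,   # sector count (little-endian)
])


def decode_entry(entry: bytes) -> dict:
    """Decode one 16-byte MBR partition table entry."""
    cyl_high, sector = divmod(entry[2], 64)   # top 2 bits, low 6 bits
    return {
        "flag": entry[0],
        "head": entry[1],
        "sector": sector,
        "cylinder": cyl_high * 256 + entry[3],
        "type": entry[4],
        "lba": int.from_bytes(entry[8:12], "little"),
        "count": int.from_bytes(entry[12:16], "little"),
    }


fake = decode_entry(PART4)
print(fake)
```

    <para>The decoded entry shows a starting LBA of 0 and a starting
    CHS address of cylinder 0, head 0, sector 1, both of which name
    the first sector of the disk.</para>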
|
|
|
|
<para>Let us now proceed to take a look at
|
|
<literal>nread</literal>:</para>
|
|
|
|
<figure xml:id="boot-boot1-nread">
|
|
<title><filename>sys/boot/i386/boot2/boot1.S</filename></title>
|
|
|
|
<programlisting>nread:
|
|
mov $0x8c00,%bx # Transfer buffer
|
|
mov 0x8(%si),%ax # Get
|
|
mov 0xa(%si),%cx # LBA
|
|
push %cs # Read from
|
|
callw xread.1 # disk
|
|
jnc return # If success, return</programlisting>
|
|
</figure>
|
|
|
|
<para>Recall that <literal>%si</literal> points to the fake
|
|
partition. The word
|
|
<footnote>
|
|
<para>In the context of 16-bit real mode, a word is 2
|
|
bytes.</para></footnote>
|
|
at offset <literal>0x8</literal> is copied to register
|
|
<literal>%ax</literal> and word at offset <literal>0xa</literal>
|
|
to <literal>%cx</literal>. They are interpreted by the
|
|
<acronym>BIOS</acronym> as the lower 4-byte value denoting the
|
|
LBA to be read (the upper four bytes are assumed to be zero).
|
|
Register <literal>%bx</literal> holds the memory address where
|
|
    the <acronym>MBR</acronym> will be loaded. The instruction
    pushing <literal>%cs</literal> onto the stack deserves a
    comment. <literal>xread.1</literal> ends with a far return, so
    <literal>nread</literal> pushes <literal>%cs</literal> to make
    its near call look like a far call. Within
    <filename>boot1</filename> alone a near call would suffice.
    However, as we will see shortly, <filename>boot2</filename>, in
    conjunction with the <acronym>BTX</acronym> server, also uses
    <literal>xread.1</literal>. This mechanism will be discussed in
    the next section.</para>
|
|
|
|
<para>The code at <literal>xread.1</literal> further calls
|
|
the <literal>read</literal> function, which actually calls the
|
|
<acronym>BIOS</acronym> asking for the disk sector:</para>
|
|
|
|
<figure xml:id="boot-boot1-xread1">
|
|
<title><filename>sys/boot/i386/boot2/boot1.S</filename></title>
|
|
|
|
<programlisting>xread.1:
|
|
pushl $0x0 # absolute
|
|
push %cx # block
|
|
push %ax # number
|
|
push %es # Address of
|
|
push %bx # transfer buffer
|
|
xor %ax,%ax # Number of
|
|
movb %dh,%al # blocks to
|
|
push %ax # transfer
|
|
push $0x10 # Size of packet
|
|
mov %sp,%bp # Packet pointer
|
|
callw read # Read from disk
|
|
lea 0x10(%bp),%sp # Clear stack
|
|
lret # To far caller</programlisting>
|
|
</figure>
|
|
|
|
<para>Note the long return instruction at the end of this block.
|
|
This instruction pops out the <literal>%cs</literal> register
|
|
pushed by <literal>nread</literal>, and returns. Finally,
|
|
<literal>nread</literal> also returns.</para>
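    <para>The sequence of pushes in <literal>xread.1</literal> builds
    a 16-byte packet on the stack. The following Python sketch lays
    out the same bytes; it assumes the packet follows the layout of
    the <acronym>BIOS</acronym> extended disk read interface (size,
    block count, buffer offset and segment, 64-bit block number), and
    the helper names are invented for illustration.</para>

```python
def word(value: int) -> bytes:
    """Encode a 16-bit value the way a real-mode push stores it."""
    return value.to_bytes(2, "little")


def build_packet(ax: int, cx: int, es: int, bx: int, dh: int) -> bytes:
    """Lay out the 16-byte packet that xread.1 builds on the stack."""
    pushes = [
        (0).to_bytes(4, "little"),  # pushl $0x0: upper 32 bits of the LBA
        word(cx),                   # push %cx:   LBA bits 16..31
        word(ax),                   # push %ax:   LBA bits 0..15
        word(es),                   # push %es:   buffer segment
        word(bx),                   # push %bx:   buffer offset
        word(dh),                   # push %ax:   block count (from %dh)
        word(0x10),                 # push $0x10: packet size
    ]
    # The stack grows downward, so the last push sits at the lowest address.
    return b"".join(reversed(pushes))


# Reading the MBR: LBA 0, one block, buffer at 0000:8c00.
packet = build_packet(ax=0, cx=0, es=0, bx=0x8C00, dh=1)
print(packet.hex())
```

    <para>Because <literal>%sp</literal> points at the most recently
    pushed word, copying it to <literal>%bp</literal> yields a pointer
    to the start of the packet, and adding <literal>0x10</literal> to
    it afterwards discards all 16 bytes at once.</para>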
|
|
|
|
<para>With the <acronym>MBR</acronym> loaded to memory, the actual
|
|
loop for searching the &os; slice begins:</para>
|
|
|
|
<figure xml:id="boot-boot1-find-part">
|
|
<title><filename>sys/boot/i386/boot2/boot1.S</filename></title>
|
|
|
|
<programlisting> mov $0x1,%cx # Two passes
|
|
main.1:
|
|
mov $0x8dbe,%si # Partition table
|
|
movb $0x1,%dh # Partition
|
|
main.2:
|
|
cmpb $0xa5,0x4(%si) # Our partition type?
|
|
jne main.3 # No
|
|
jcxz main.5 # If second pass
|
|
testb $0x80,(%si) # Active?
|
|
jnz main.5 # Yes
|
|
main.3:
|
|
add $0x10,%si # Next entry
|
|
incb %dh # Partition
|
|
cmpb $0x5,%dh # In table?
|
|
jb main.2 # Yes
|
|
dec %cx # Do two
|
|
jcxz main.1 # passes</programlisting>
|
|
</figure>
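    <para>The control flow of this loop is easier to follow in a
    higher-level form. The sketch below models the two passes in
    Python: the first pass accepts only an active &os; slice, and the
    second accepts any &os; slice. The function names and the sample
    partition table are made up for illustration.</para>

```python
FREEBSD_TYPE = 0xA5
ENTRY_SIZE = 0x10


def find_slice(table: bytes):
    """Return the 1-based slice number chosen by the two-pass scan, or None."""
    for require_active in (True, False):          # pass 1, then pass 2
        for number in range(1, 5):                # %dh runs from 1 to 4
            entry = table[(number - 1) * ENTRY_SIZE:number * ENTRY_SIZE]
            if entry[4] != FREEBSD_TYPE:          # cmpb $0xa5,0x4(%si)
                continue
            # testb $0x80,(%si): for a single byte, bit 7 is set
            # exactly when the value is at least 0x80.
            if not require_active or entry[0] >= 0x80:
                return number
    return None


def make_entry(flag: int, ptype: int) -> bytes:
    """Build a made-up 16-byte entry with only flag and type filled in."""
    raw = bytearray(ENTRY_SIZE)
    raw[0], raw[4] = flag, ptype
    return bytes(raw)


# Slice 2 is FreeBSD but inactive; slice 3 is FreeBSD and active.
table = (make_entry(0, 0x83) + make_entry(0, 0xA5)
         + make_entry(0x80, 0xA5) + make_entry(0, 0))
print(find_slice(table))  # prints: 3
```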
|
|
|
|
<para>If a &os; slice is identified, execution continues at
|
|
<literal>main.5</literal>. Note that when a &os; slice is found
|
|
<literal>%si</literal> points to the appropriate entry in the
|
|
partition table, and <literal>%dh</literal> holds the partition
|
|
number. We assume that a &os; slice is found, so we continue
|
|
execution at <literal>main.5</literal>:</para>
|
|
|
|
<figure xml:id="boot-boot1-main5">
|
|
<title><filename>sys/boot/i386/boot2/boot1.S</filename></title>
|
|
|
|
<programlisting>main.5:
|
|
mov %dx,0x900 # Save args
|
|
movb $0x10,%dh # Sector count
|
|
callw nread # Read disk
|
|
mov $0x9000,%bx # BTX
|
|
mov 0xa(%bx),%si # Get BTX length and set
|
|
add %bx,%si # %si to start of boot2.bin
|
|
mov $0xc000,%di # Client page 2
|
|
mov $0xa200,%cx # Byte
|
|
sub %si,%cx # count
|
|
rep # Relocate
|
|
movsb # client</programlisting>
|
|
</figure>
|
|
|
|
<para>Recall that at this point, register <literal>%si</literal>
|
|
points to the &os; slice entry in the <acronym>MBR</acronym>
|
|
partition table, so a call to <literal>nread</literal> will
|
|
effectively read sectors at the beginning of this partition.
|
|
The argument passed on register <literal>%dh</literal> tells
|
|
<literal>nread</literal> to read 16 disk sectors. Recall that
|
|
the first 512 bytes, or the first sector of the &os; slice,
|
|
coincides with the <filename>boot1</filename> program. Also
|
|
recall that the file written to the beginning of the &os;
|
|
slice is not <filename>/boot/boot1</filename>, but
|
|
<filename>/boot/boot</filename>. Let us look at the size of
|
|
these files in the filesystem:</para>
|
|
|
|
<screen xml:id="boot-boot1-filesize">-r--r--r-- 1 root wheel 512B Jan 8 00:15 /boot/boot0
|
|
-r--r--r-- 1 root wheel 512B Jan 8 00:15 /boot/boot1
|
|
-r--r--r-- 1 root wheel 7.5K Jan 8 00:15 /boot/boot2
|
|
-r--r--r-- 1 root wheel 8.0K Jan 8 00:15 /boot/boot</screen>
|
|
|
|
<para>Both <filename>boot0</filename> and
|
|
<filename>boot1</filename> are 512 bytes each, so they fit
|
|
<emphasis>exactly</emphasis> in one disk sector.
|
|
<filename>boot2</filename> is much bigger, holding both
|
|
the <acronym>BTX</acronym> server and the <filename>boot2</filename> client.
|
|
Finally, a file called simply <filename>boot</filename> is 512
|
|
bytes larger than <filename>boot2</filename>. This file is a
|
|
concatenation of <filename>boot1</filename> and
|
|
<filename>boot2</filename>. As already noted,
|
|
<filename>boot0</filename> is the file written to the absolute
|
|
first disk sector (the <acronym>MBR</acronym>), and
|
|
    <filename>boot</filename> is the file written starting at the first
    sector of the &os; slice; <filename>boot1</filename> and
|
|
<filename>boot2</filename> are <emphasis>not</emphasis> written
|
|
to disk. The command used to concatenate
|
|
<filename>boot1</filename> and <filename>boot2</filename> into a
|
|
single <filename>boot</filename> is merely
|
|
<command>cat boot1 boot2 > boot</command>.</para>
|
|
|
|
<para>So <filename>boot1</filename> occupies exactly the first 512
|
|
bytes of <filename>boot</filename> and, because
|
|
<filename>boot</filename> is written to the first sector of the
|
|
&os; slice, <filename>boot1</filename> fits exactly in this
|
|
first sector. Because <literal>nread</literal> reads the first
|
|
16 sectors of the &os; slice, it effectively reads the entire
|
|
<filename>boot</filename> file
|
|
<footnote>
|
|
<para>512*16=8192 bytes, exactly the size of
|
|
<filename>boot</filename></para></footnote>.
|
|
We will see more details about how <filename>boot</filename> is
|
|
formed from <filename>boot1</filename> and
|
|
<filename>boot2</filename> in the next section.</para>
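    <para>The size arithmetic is easy to verify. The following sketch
    uses zero-filled placeholders instead of the real files and
    assumes that the <literal>7.5K</literal> shown for
    <filename>boot2</filename> means exactly 7680 bytes, which is
    consistent with <filename>boot</filename> being 512 bytes
    larger.</para>

```python
SECTOR = 512

boot1 = bytes(512)        # placeholder contents, size as listed
boot2 = bytes(7680)       # 7.5K, assumed to be exactly 7680 bytes
boot = boot1 + boot2      # what `cat boot1 boot2 > boot` produces

sectors, remainder = divmod(len(boot), SECTOR)
print(len(boot), sectors, remainder)  # prints: 8192 16 0
```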
|
|
|
|
<para>Recall that <literal>nread</literal> uses memory address
|
|
<literal>0x8c00</literal> as the transfer buffer to hold the
|
|
sectors read. This address is conveniently chosen. Indeed,
|
|
because <filename>boot1</filename> belongs to the first 512
|
|
bytes, it ends up in the address range
|
|
<literal>0x8c00</literal>-<literal>0x8dff</literal>. The 512
|
|
    bytes that follow (range
    <literal>0x8e00</literal>-<literal>0x8fff</literal>) are used to
|
|
store the <emphasis>bsdlabel</emphasis>
|
|
<footnote>
|
|
<para>Historically known as <quote>disklabel</quote>. If you
|
|
ever wondered where &os; stored this information, it is in
|
|
      this region. See &man.bsdlabel.8;.</para></footnote>.</para>
|
|
|
|
<para>Starting at address <literal>0x9000</literal> is the
|
|
beginning of the <acronym>BTX</acronym> server, and immediately
|
|
following is the <filename>boot2</filename> client. The
|
|
<acronym>BTX</acronym> server acts as a kernel, and executes in
|
|
protected mode in the most privileged level. In contrast, the
|
|
<acronym>BTX</acronym> clients (<filename>boot2</filename>, for
|
|
example), execute in user mode. We will see how this is
|
|
accomplished in the next section. The code after the call to
|
|
<literal>nread</literal> locates the beginning of
|
|
<filename>boot2</filename> in the memory buffer, and copies it
|
|
to memory address <literal>0xc000</literal>. This is because
|
|
the <acronym>BTX</acronym> server arranges
|
|
<filename>boot2</filename> to execute in a segment starting at
|
|
<literal>0xa000</literal>. We explore this in detail in the
|
|
following section.</para>
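    <para>The addresses mentioned so far fit together as shown in the
    sketch below. It only restates numbers from the text (the transfer
    buffer, the sector size, the user segment base and the copy
    target); the variable names are chosen for this example.</para>

```python
BUFFER = 0x8C00          # nread transfer buffer
SECTOR = 512
USER_SEGMENT = 0xA000    # base of the segment boot2 executes in
BOOT2_COPY = 0xC000      # where boot1 copies the boot2 client

layout = {
    "boot1": BUFFER,                  # first sector of boot
    "bsdlabel": BUFFER + SECTOR,      # second sector
    "btx": BUFFER + 2 * SECTOR,       # BTX server, followed by boot2
}
boot2_offset = BOOT2_COPY - USER_SEGMENT

for name, address in layout.items():
    print(f"{name:9}{address:#06x}")
print(f"boot2 sits at offset {boot2_offset:#x} within the user segment")
```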
|
|
|
|
<para>The last code block of <filename>boot1</filename> enables
|
|
access to memory above 1MB
|
|
<footnote>
|
|
<para>This is necessary for legacy reasons. Interested
|
|
readers should see <link
|
|
xlink:href="http://en.wikipedia.org/wiki/A20_line"/>.</para></footnote>
|
|
and concludes with a jump to the starting point of the
|
|
<acronym>BTX</acronym> server:</para>
|
|
|
|
<figure xml:id="boot-boot1-seta20">
|
|
<title><filename>sys/boot/i386/boot2/boot1.S</filename></title>
|
|
|
|
<programlisting>seta20:
|
|
cli # Disable interrupts
|
|
seta20.1:
|
|
dec %cx # Timeout?
|
|
jz seta20.3 # Yes
|
|
|
|
inb $0x64,%al # Get status
|
|
testb $0x2,%al # Busy?
|
|
jnz seta20.1 # Yes
|
|
movb $0xd1,%al # Command: Write
|
|
outb %al,$0x64 # output port
|
|
seta20.2:
|
|
inb $0x64,%al # Get status
|
|
testb $0x2,%al # Busy?
|
|
jnz seta20.2 # Yes
|
|
movb $0xdf,%al # Enable
|
|
outb %al,$0x60 # A20
|
|
seta20.3:
|
|
sti # Enable interrupts
|
|
jmp 0x9010 # Start BTX</programlisting>
|
|
</figure>
|
|
|
|
<para>Note that right before the jump, interrupts are
|
|
enabled.</para>
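    <para>The effect of the A20 gate can be illustrated with a little
    address arithmetic. The sketch below is a simplified model, not a
    description of the keyboard controller: it assumes that with A20
    disabled, address bit 20 is forced to zero, so real-mode addresses
    at or above 1MB wrap around to low memory.</para>

```python
def linear(segment: int, offset: int, a20_enabled: bool) -> int:
    """Real-mode linear address, with address bit 20 forced low if A20 is off."""
    address = segment * 16 + offset
    if not a20_enabled:
        # The largest real-mode address is 0x10ffef, so clearing bit 20
        # is the same as wrapping at 1 MB.
        address %= 0x100000
    return address


print(hex(linear(0xFFFF, 0x0010, a20_enabled=False)))  # prints: 0x0
print(hex(linear(0xFFFF, 0x0010, a20_enabled=True)))   # prints: 0x100000
```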
|
|
</sect1>
|
|
|
|
<sect1 xml:id="btx-server">
|
|
<title>The <acronym>BTX</acronym> Server</title>
|
|
|
|
<para>Next in our boot sequence is the
|
|
<acronym>BTX</acronym> Server. Let us quickly remember how we
|
|
got here:</para>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>The <acronym>BIOS</acronym> loads the absolute sector
|
|
one (the <acronym>MBR</acronym>, or
|
|
<filename>boot0</filename>), to address
|
|
<literal>0x7c00</literal> and jumps there.</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para><filename>boot0</filename> relocates itself to
|
|
<literal>0x600</literal>, the address it was linked to
|
|
execute, and jumps over there. It then reads the first
|
|
sector of the &os; slice (which consists of
|
|
<filename>boot1</filename>) into address
|
|
<literal>0x7c00</literal> and jumps over there.</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para><filename>boot1</filename> loads the first 16 sectors
|
|
of the &os; slice into address <literal>0x8c00</literal>.
|
|
          These 16 sectors, or 8192 bytes, are the whole file
|
|
<filename>boot</filename>. The file is a
|
|
concatenation of <filename>boot1</filename> and
|
|
<filename>boot2</filename>. <filename>boot2</filename>, in
|
|
turn, contains the <acronym>BTX</acronym> server and the
|
|
<filename>boot2</filename> client. Finally, a jump is made
|
|
to address <literal>0x9010</literal>, the entry point of the
|
|
<acronym>BTX</acronym> server.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para>Before studying the <acronym>BTX</acronym> Server in detail,
|
|
let us further review how the single, all-in-one
|
|
<filename>boot</filename> file is created. The way
|
|
<filename>boot</filename> is built is defined in its
|
|
<filename>Makefile</filename>
|
|
(<filename>/usr/src/sys/boot/i386/boot2/Makefile</filename>).
|
|
Let us look at the rule that creates the
|
|
<filename>boot</filename> file:</para>
|
|
|
|
<figure xml:id="boot-boot1-make-boot">
|
|
<title><filename>sys/boot/i386/boot2/Makefile</filename></title>
|
|
|
|
<programlisting> boot: boot1 boot2
|
|
cat boot1 boot2 > boot</programlisting>
|
|
</figure>
|
|
|
|
<para>This tells us that <filename>boot1</filename> and
|
|
<filename>boot2</filename> are needed, and the rule simply
|
|
concatenates them to produce a single file called
|
|
<filename>boot</filename>. The rules for creating
|
|
<filename>boot1</filename> are also quite simple:</para>
|
|
|
|
<figure xml:id="boot-boot1-make-boot1">
|
|
<title><filename>sys/boot/i386/boot2/Makefile</filename></title>
|
|
|
|
<programlisting> boot1: boot1.out
|
|
objcopy -S -O binary boot1.out boot1
|
|
|
|
boot1.out: boot1.o
|
|
ld -e start -Ttext 0x7c00 -o boot1.out boot1.o</programlisting>
|
|
</figure>
|
|
|
|
<para>To apply the rule for creating
|
|
<filename>boot1</filename>, <filename>boot1.out</filename> must
|
|
be resolved. This, in turn, depends on the existence of
|
|
<filename>boot1.o</filename>. This last file is simply the
|
|
result of assembling our familiar <filename>boot1.S</filename>,
|
|
without linking. Now, the rule for creating
|
|
<filename>boot1.out</filename> is applied. This tells us that
|
|
<filename>boot1.o</filename> should be linked with
|
|
<literal>start</literal> as its entry point, and starting at
|
|
address <literal>0x7c00</literal>. Finally,
|
|
<filename>boot1</filename> is created from
|
|
<filename>boot1.out</filename> applying the appropriate rule.
|
|
This rule is the <filename>objcopy</filename> command applied to
|
|
<filename>boot1.out</filename>. Note the flags passed to
|
|
<filename>objcopy</filename>: <literal>-S</literal> tells it to
|
|
strip all relocation and symbolic information;
|
|
<literal>-O binary</literal> indicates the output format, that
|
|
is, a simple, unformatted binary file.</para>
|
|
|
|
<para>Having <filename>boot1</filename>, let us take a look at how
|
|
<filename>boot2</filename> is constructed:</para>
|
|
|
|
<figure xml:id="boot-boot1-make-boot2">
|
|
<title><filename>sys/boot/i386/boot2/Makefile</filename></title>
|
|
|
|
<programlisting> boot2: boot2.ld
|
|
@set -- `ls -l boot2.ld`; x=$$((7680-$$5)); \
|
|
echo "$$x bytes available"; test $$x -ge 0
|
|
dd if=boot2.ld of=boot2 obs=7680 conv=osync
|
|
|
|
boot2.ld: boot2.ldr boot2.bin ../btx/btx/btx
|
|
btxld -v -E 0x2000 -f bin -b ../btx/btx/btx -l boot2.ldr \
|
|
-o boot2.ld -P 1 boot2.bin
|
|
|
|
boot2.ldr:
|
|
dd if=/dev/zero of=boot2.ldr bs=512 count=1
|
|
|
|
boot2.bin: boot2.out
|
|
objcopy -S -O binary boot2.out boot2.bin
|
|
|
|
boot2.out: ../btx/lib/crt0.o boot2.o sio.o
|
|
ld -Ttext 0x2000 -o boot2.out
|
|
|
|
boot2.o: boot2.s
|
|
${CC} ${ACFLAGS} -c boot2.s
|
|
|
|
boot2.s: boot2.c boot2.h ${.CURDIR}/../../common/ufsread.c
|
|
${CC} ${CFLAGS} -S -o boot2.s.tmp ${.CURDIR}/boot2.c
|
|
        sed -e '/align/d' -e '/nop/d' &lt; boot2.s.tmp > boot2.s
|
|
rm -f boot2.s.tmp
|
|
|
|
boot2.h: boot1.out
|
|
${NM} -t d ${.ALLSRC} | awk '/([0-9])+ T xread/ \
|
|
{ x = $$1 - ORG1; \
|
|
printf("#define XREADORG %#x\n", REL1 + x) }' \
|
|
ORG1=`printf "%d" ${ORG1}` \
|
|
REL1=`printf "%d" ${REL1}` > ${.TARGET}</programlisting>
|
|
</figure>
|
|
|
|
<para>The mechanism for building <filename>boot2</filename> is
|
|
far more elaborate. Let us point out the most relevant facts.
|
|
The dependency list is as follows:</para>
|
|
|
|
<figure xml:id="boot-boot1-make-boot2-more">
|
|
<title><filename>sys/boot/i386/boot2/Makefile</filename></title>
|
|
|
|
<programlisting> boot2: boot2.ld
|
|
boot2.ld: boot2.ldr boot2.bin ${BTXDIR}/btx/btx
|
|
boot2.bin: boot2.out
|
|
boot2.out: ${BTXDIR}/lib/crt0.o boot2.o sio.o
|
|
boot2.o: boot2.s
|
|
boot2.s: boot2.c boot2.h ${.CURDIR}/../../common/ufsread.c
|
|
boot2.h: boot1.out</programlisting>
|
|
</figure>
|
|
|
|
<para>Note that initially there is no header file
|
|
<filename>boot2.h</filename>, but its creation depends on
|
|
<filename>boot1.out</filename>, which we already have. The rule
|
|
for its creation is a bit terse, but the important thing is that
|
|
the output, <filename>boot2.h</filename>, is something like
|
|
this:</para>
|
|
|
|
<figure xml:id="boot-boot1-make-boot2h">
|
|
<title><filename>sys/boot/i386/boot2/boot2.h</filename></title>
|
|
|
|
<programlisting>
|
|
#define XREADORG 0x725</programlisting>
|
|
</figure>
|
|
|
|
<para>Recall that <filename>boot1</filename> was relocated (i.e.,
|
|
copied from <literal>0x7c00</literal> to
|
|
<literal>0x700</literal>). This relocation will now make sense,
|
|
because as we will see, the <acronym>BTX</acronym> server
|
|
reclaims some memory, including the space where
|
|
<filename>boot1</filename> was originally loaded. However, the
|
|
<acronym>BTX</acronym> server needs access to
|
|
<filename>boot1</filename>'s <literal>xread</literal> function;
|
|
this function, according to the output of
|
|
<filename>boot2.h</filename>, is at location
|
|
<literal>0x725</literal>. Indeed, the
|
|
<acronym>BTX</acronym> server uses the
|
|
<literal>xread</literal> function from
|
|
<filename>boot1</filename>'s relocated code. This function is
|
|
    now accessible from within the <filename>boot2</filename>
|
|
client.</para>
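    <para>The calculation performed by the <filename>boot2.h</filename>
    rule can be sketched in a few lines of Python. The names
    <literal>ORG1</literal> and <literal>REL1</literal> come from the
    rule itself; their values are assumed here to be the link address
    (<literal>0x7c00</literal>) and the relocation address
    (<literal>0x700</literal>) discussed earlier, and the link-time
    address of <literal>xread</literal> is a hypothetical value chosen
    to reproduce the output shown above.</para>

```python
ORG1 = 0x7C00   # address boot1 is linked at (assumed value)
REL1 = 0x700    # address boot1 relocates itself to (assumed value)


def relocated(symbol_address: int) -> int:
    """Translate a link-time boot1 address to its relocated location."""
    return REL1 + (symbol_address - ORG1)


# 0x7c25 is a hypothetical link-time address for xread, chosen so the
# result matches the XREADORG value shown above.
print(f"#define XREADORG {relocated(0x7C25):#x}")  # prints: #define XREADORG 0x725
```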
|
|
|
|
<para>We next build <filename>boot2.s</filename> from files
|
|
<filename>boot2.h</filename>, <filename>boot2.c</filename> and
|
|
<filename>/usr/src/sys/boot/common/ufsread.c</filename>. The
|
|
rule for this is to compile the code in
|
|
<filename>boot2.c</filename> (which includes
|
|
<filename>boot2.h</filename> and <filename>ufsread.c</filename>)
|
|
into assembly code. Having <filename>boot2.s</filename>, the
|
|
next rule assembles <filename>boot2.s</filename>, creating the
|
|
object file <filename>boot2.o</filename>. The
|
|
next rule directs the linker to link various files
|
|
(<filename>crt0.o</filename>,
|
|
<filename>boot2.o</filename> and <filename>sio.o</filename>).
|
|
Note that the output file, <filename>boot2.out</filename>, is
|
|
linked to execute at address <literal>0x2000</literal>. Recall
|
|
that <filename>boot2</filename> will be executed in user mode,
|
|
within a special user segment set up by the
|
|
<acronym>BTX</acronym> server. This segment starts at
|
|
<literal>0xa000</literal>. Also, remember that the
|
|
<filename>boot2</filename> portion of <filename>boot</filename>
|
|
was copied to address <literal>0xc000</literal>, that is, offset
|
|
<literal>0x2000</literal> from the start of the user segment, so
|
|
<filename>boot2</filename> will work properly when we transfer
|
|
control to it. Next, <filename>boot2.bin</filename> is created
|
|
from <filename>boot2.out</filename> by stripping its symbols and
|
|
    format information; <filename>boot2.bin</filename> is a <emphasis>raw</emphasis>
|
|
binary. Now, note that a file <filename>boot2.ldr</filename> is
|
|
created as a 512-byte file full of zeros. This space is
|
|
reserved for the bsdlabel.</para>
|
|
|
|
<para>Now that we have files <filename>boot1</filename>,
|
|
<filename>boot2.bin</filename> and
|
|
<filename>boot2.ldr</filename>, only the
|
|
<acronym>BTX</acronym> server is missing before creating the
|
|
all-in-one <filename>boot</filename> file. The
|
|
<acronym>BTX</acronym> server is located in
|
|
<filename>/usr/src/sys/boot/i386/btx/btx</filename>; it has its
|
|
own <filename>Makefile</filename> with its own set of rules for
|
|
building. The important thing to notice is that it is also
|
|
compiled as a <emphasis>raw</emphasis> binary, and that it is
|
|
linked to execute at address <literal>0x9000</literal>. The
|
|
details can be found in
|
|
<filename>/usr/src/sys/boot/i386/btx/btx/Makefile</filename>.</para>
|
|
|
|
<para>Having the files that comprise the <filename>boot</filename>
|
|
program, the final step is to <emphasis>merge</emphasis> them.
|
|
This is done by a special program called
|
|
<filename>btxld</filename> (source located in
|
|
<filename>/usr/src/usr.sbin/btxld</filename>). Some arguments
|
|
to this program include the name of the output file
|
|
(<filename>boot</filename>), its entry point
|
|
(<literal>0x2000</literal>) and its file format
|
|
(raw binary). The various files are
|
|
finally merged by this utility into the file
|
|
<filename>boot</filename>, which consists of
|
|
<filename>boot1</filename>, <filename>boot2</filename>, the
|
|
<literal>bsdlabel</literal> and the
|
|
<acronym>BTX</acronym> server. This file, which takes
|
|
exactly 16 sectors, or 8192 bytes, is what is
|
|
actually written to the beginning of the &os; slice
|
|
    during installation. Let us now proceed to study the
|
|
<acronym>BTX</acronym> server program.</para>
|
|
|
|
<para>The <acronym>BTX</acronym> server prepares a simple
|
|
environment and switches from 16-bit real mode to 32-bit
|
|
protected mode, right before passing control to the client.
|
|
This includes initializing and updating the following data
|
|
structures:</para>
|
|
|
|
<indexterm><primary>virtual v86 mode</primary></indexterm>
|
|
<itemizedlist>
|
|
<listitem>
|
|
        <para>The
          <literal>Interrupt Vector Table (IVT)</literal> is modified. The
|
|
<acronym>IVT</acronym> provides exception and interrupt
|
|
handlers for Real-Mode code.</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>The <literal>Interrupt Descriptor Table (IDT)</literal>
|
|
is created. Entries are provided for processor exceptions,
|
|
          hardware interrupts, two system calls and the V86 interface.
|
|
The IDT provides exception and interrupt handlers for
|
|
Protected-Mode code.</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>A <literal>Task-State Segment (TSS)</literal> is
|
|
created. This is necessary because the processor works in
|
|
the <emphasis>least</emphasis> privileged level when
|
|
executing the client (<filename>boot2</filename>), but in
|
|
the <emphasis>most</emphasis> privileged level when
|
|
executing the <acronym>BTX</acronym> server.</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>The <acronym>GDT</acronym> (Global Descriptor Table) is
|
|
set up. Entries (descriptors) are provided for
|
|
supervisor code and data, user code and data, and real-mode
|
|
code and data.
|
|
<footnote>
|
|
<para>Real-mode code and data are necessary when switching
|
|
back to real mode from protected mode, as suggested by
|
|
the Intel manuals.</para></footnote></para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para>Let us now start studying the actual implementation. Recall
|
|
that <filename>boot1</filename> made a jump to address
|
|
<literal>0x9010</literal>, the <acronym>BTX</acronym> server's
|
|
entry point. Before studying program execution there,
|
|
note that the <acronym>BTX</acronym> server has a special header
|
|
at address range <literal>0x9000-0x900f</literal>, right before
|
|
its entry point. This header is defined as follows:</para>
|
|
|
|
<figure xml:id="btx-header">
|
|
<title><filename>sys/boot/i386/btx/btx/btx.S</filename></title>
|
|
|
|
<programlisting>start: # Start of code
|
|
/*
|
|
* BTX header.
|
|
*/
|
|
btx_hdr: .byte 0xeb # Machine ID
|
|
.byte 0xe # Header size
|
|
.ascii "BTX" # Magic
|
|
.byte 0x1 # Major version
|
|
.byte 0x2 # Minor version
|
|
.byte BTX_FLAGS # Flags
|
|
.word PAG_CNT-MEM_ORG>>0xc # Paging control
|
|
.word break-start # Text size
|
|
.long 0x0 # Entry address</programlisting>
|
|
</figure>
|
|
|
|
<para>Note the first two bytes are <literal>0xeb</literal> and
|
|
<literal>0xe</literal>. In the IA-32 architecture, these two
|
|
bytes are interpreted as a relative jump past the header into
|
|
the entry point, so in theory, <filename>boot1</filename> could
|
|
jump here (address <literal>0x9000</literal>) instead of address
|
|
<literal>0x9010</literal>. Note that the last field in the
|
|
<acronym>BTX</acronym> header is a pointer to the client's
|
|
(<filename>boot2</filename>) entry point. This field is patched
|
|
at link time.</para>
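    <para>Both observations about the header can be checked with a
    small sketch. The code below rebuilds the 16 header bytes in
    Python, using placeholder values for the flags, paging control,
    text size and entry address fields (the real values depend on the
    build), and computes where the short jump encoded by the first two
    bytes would land.</para>

```python
BTX_BASE = 0x9000

# Header bytes as laid out in btx_hdr; flags, paging control, text size
# and entry address are made-up placeholder values.
header = (
    bytes([0xEB, 0x0E])                 # jmp short, displacement 0x0e
    + b"BTX"                            # magic
    + bytes([0x01, 0x02])               # major, minor version
    + bytes([0x00])                     # flags (placeholder)
    + (0).to_bytes(2, "little")         # paging control (placeholder)
    + (0x1000).to_bytes(2, "little")    # text size (placeholder)
    + (0).to_bytes(4, "little")         # entry address (placeholder)
)

opcode, displacement = header[0], header[1]
# A short jump is relative to the address of the next instruction,
# which is two bytes past the start of the jump itself.
target = BTX_BASE + 2 + displacement
text_size = int.from_bytes(header[0xA:0xC], "little")

print(len(header), hex(target), hex(text_size))  # prints: 16 0x9010 0x1000
```

    <para>The text size field lies at offset <literal>0xa</literal>,
    which is the field <filename>boot1</filename> reads to find where
    the <filename>boot2</filename> client begins.</para>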
|
|
|
|
<para>Immediately following the header is the
|
|
<acronym>BTX</acronym> server's entry point:</para>
|
|
|
|
<figure xml:id="btx-init">
|
|
<title><filename>sys/boot/i386/btx/btx/btx.S</filename></title>
|
|
|
|
<programlisting>/*
|
|
* Initialization routine.
|
|
*/
|
|
init: cli # Disable interrupts
|
|
xor %ax,%ax # Zero/segment
|
|
mov %ax,%ss # Set up
|
|
mov $0x1800,%sp # stack
|
|
mov %ax,%es # Address
|
|
mov %ax,%ds # data
|
|
pushl $0x2 # Clear
|
|
popfl # flags</programlisting>
|
|
</figure>
|
|
|
|
<para>This code disables interrupts, sets up a working stack
|
|
(starting at address <literal>0x1800</literal>) and clears the
|
|
flags in the EFLAGS register. Note that the
|
|
<literal>popfl</literal> instruction pops out a doubleword (4
|
|
bytes) from the stack and places it in the EFLAGS register.
|
|
Because the value actually popped is <literal>2</literal>, the
|
|
    EFLAGS register is effectively cleared (IA-32 requires that bit
    1 of the EFLAGS register always be 1).</para>
|
|
|
|
<para>Our next code block clears (sets to <literal>0</literal>)
|
|
the memory range <literal>0x5e00-0x8fff</literal>. This range
|
|
is where the various data structures will be created:</para>
|
|
|
|
<figure xml:id="btx-clear-mem">
|
|
<title><filename>sys/boot/i386/btx/btx/btx.S</filename></title>
|
|
|
|
<programlisting>/*
|
|
* Initialize memory.
|
|
*/
|
|
mov $0x5e00,%di # Memory to initialize
|
|
mov $(0x9000-0x5e00)/2,%cx # Words to zero
|
|
rep # Zero-fill
|
|
stosw # memory</programlisting>
|
|
</figure>
|
|
|
|
<para>Recall that <filename>boot1</filename> was originally loaded
|
|
to address <literal>0x7c00</literal>, so, with this memory
|
|
    initialization, that copy effectively disappeared. However,
|
|
also recall that <filename>boot1</filename> was relocated to
|
|
<literal>0x700</literal>, so <emphasis>that</emphasis> copy is
|
|
still in memory, and the <acronym>BTX</acronym> server will make
|
|
use of it.</para>
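    <para>A quick calculation confirms which copy of
    <filename>boot1</filename> survives. The sketch below models only
    the address range of the zero-fill loop; the helper name is
    invented for this example.</para>

```python
START = 0x5E00
END = 0x9000                    # first address that is not cleared
words = (END - START) // 2      # value loaded into %cx


def cleared(address: int) -> bool:
    """True if the zero-fill loop overwrites this address."""
    return address in range(START, END)


# 0x7c00 is the original load address of boot1, 0x700 its relocated copy.
print(hex(words), cleared(0x7C00), cleared(0x700))  # prints: 0x1900 True False
```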
|
|
|
|
<para>Next, the real-mode <acronym>IVT</acronym> (Interrupt Vector
|
|
    Table) is updated. The <acronym>IVT</acronym> is an array of
|
|
segment/offset pairs for exception and interrupt handlers. The
|
|
<acronym>BIOS</acronym> normally maps hardware interrupts to
|
|
interrupt vectors <literal>0x8</literal> to
|
|
<literal>0xf</literal> and <literal>0x70</literal> to
|
|
<literal>0x77</literal> but, as will be seen, the 8259A
|
|
Programmable Interrupt Controller, the chip controlling the
|
|
actual mapping of hardware interrupts to interrupt vectors, is
|
|
programmed to remap these interrupt vectors from
|
|
<literal>0x8-0xf</literal> to <literal>0x20-0x27</literal> and
|
|
from <literal>0x70-0x77</literal> to
|
|
<literal>0x28-0x2f</literal>. Thus, interrupt handlers are
|
|
provided for interrupt vectors <literal>0x20-0x2f</literal>.
|
|
The reason the <acronym>BIOS</acronym>-provided handlers are not
|
|
used directly is that they work in 16-bit real mode, but not in
|
|
32-bit protected mode. Processor mode will be switched to
|
|
32-bit protected mode shortly. However, the
|
|
<acronym>BTX</acronym> server sets up a mechanism to effectively
|
|
use the handlers provided by the <acronym>BIOS</acronym>:</para>
|
|
|
|
<figure xml:id="btx-ivt">
|
|
<title><filename>sys/boot/i386/btx/btx/btx.S</filename></title>
|
|
|
|
<programlisting>/*
|
|
* Update real mode IDT for reflecting hardware interrupts.
|
|
*/
|
|
mov $intr20,%bx # Address first handler
|
|
mov $0x10,%cx # Number of handlers
|
|
mov $0x20*4,%di # First real mode IDT entry
|
|
init.0: mov %bx,(%di) # Store IP
|
|
inc %di # Address next
|
|
inc %di # entry
|
|
stosw # Store CS
|
|
add $4,%bx # Next handler
|
|
loop init.0 # Next IRQ</programlisting>
|
|
</figure>
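<para>The layout that this loop produces can be modelled in C.
The sketch below is an assumption-laden illustration, not a
translation of <filename>btx.S</filename>: low memory is
represented by a byte array, the function names are invented, and
<literal>0x9100</literal> is a made-up stand-in for the address of
<literal>intr20</literal>. It only relies on what the listing
shows: each real-mode entry is 4 bytes (offset, then segment), the
first entry written is vector <literal>0x20</literal>, and
consecutive handlers are 4 bytes apart:</para>

<programlisting><![CDATA[#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define IVT_ENTRY_SIZE 4    /* 2-byte offset followed by 2-byte segment */
#define FIRST_VECTOR   0x20
#define NUM_HANDLERS   0x10
#define HANDLER_STRIDE 4    /* bytes between consecutive handler stubs */

/* Fill entries 0x20-0x2f in a byte array that models low memory. */
static void ivt_fill(uint8_t *mem, uint16_t first_handler, uint16_t segment)
{
    uint16_t handler = first_handler;

    for (int i = 0; i < NUM_HANDLERS; i++) {
        uint8_t *entry = mem + (FIRST_VECTOR + i) * IVT_ENTRY_SIZE;

        entry[0] = (uint8_t)(handler & 0xff); /* IP, little-endian */
        entry[1] = (uint8_t)(handler >> 8);
        entry[2] = (uint8_t)(segment & 0xff); /* CS, little-endian */
        entry[3] = (uint8_t)(segment >> 8);
        handler += HANDLER_STRIDE;
    }
}

/* Read back the handler offset stored for a vector. */
static uint16_t ivt_offset(const uint8_t *mem, int vector)
{
    const uint8_t *entry = mem + vector * IVT_ENTRY_SIZE;

    return (uint16_t)(entry[0] | (entry[1] << 8));
}

int main(void)
{
    static uint8_t mem[256 * IVT_ENTRY_SIZE];

    ivt_fill(mem, 0x9100, 0); /* 0x9100: hypothetical intr20 address */
    assert(ivt_offset(mem, 0x2f) == 0x9100 + 15 * HANDLER_STRIDE);
    printf("vector 0x20 -> %#x\n", (unsigned)ivt_offset(mem, 0x20));
    printf("vector 0x2f -> %#x\n", (unsigned)ivt_offset(mem, 0x2f));
    return 0;
}]]></programlisting>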
|
|
|
|
<para>The next block creates the <acronym>IDT</acronym> (Interrupt
|
|
Descriptor Table). The <acronym>IDT</acronym> is analogous, in
|
|
protected mode, to the <acronym>IVT</acronym> in real mode.
|
|
That is, the <acronym>IDT</acronym> describes the various
|
|
exception and interrupt handlers used when the processor is
|
|
executing in protected mode. In essence, it also consists of an
|
|
array of segment/offset pairs, although the structure is
|
|
somewhat more complex, because segments in protected mode are
|
|
different than in real mode, and various protection mechanisms
|
|
apply:</para>
|
|
|
|
<figure xml:id="btx-idt">
|
|
<title><filename>sys/boot/i386/btx/btx/btx.S</filename></title>
|
|
|
|
<programlisting>/*
|
|
* Create IDT.
|
|
*/
|
|
mov $0x5e00,%di # IDT's address
|
|
mov $idtctl,%si # Control string
|
|
init.1: lodsb # Get entry
|
|
cbw # count
|
|
xchg %ax,%cx # as word
|
|
jcxz init.4 # If done
|
|
lodsb # Get segment
|
|
xchg %ax,%dx # P:DPL:type
|
|
lodsw # Get control
|
|
xchg %ax,%bx # set
|
|
lodsw # Get handler offset
|
|
mov $SEL_SCODE,%dh # Segment selector
|
|
init.2: shr %bx # Handle this int?
|
|
jnc init.3 # No
|
|
mov %ax,(%di) # Set handler offset
|
|
mov %dh,0x2(%di) # and selector
|
|
mov %dl,0x5(%di) # Set P:DPL:type
|
|
add $0x4,%ax # Next handler
|
|
init.3: lea 0x8(%di),%di # Next entry
|
|
loop init.2 # Till set done
|
|
jmp init.1 # Continue</programlisting>
|
|
</figure>
|
|
|
|
<para>Each entry in the <acronym>IDT</acronym> is 8 bytes long.
|
|
Besides the segment/offset information, they also describe the
|
|
segment type, privilege level, and whether the segment is
|
|
present in memory or not. The construction is such that
|
|
interrupt vectors from <literal>0</literal> to
|
|
<literal>0xf</literal> (exceptions) are handled by function
|
|
<literal>intx00</literal>; vector <literal>0x10</literal> (also
|
|
an exception) is handled by <literal>intx10</literal>; hardware
|
|
interrupts, which are later configured to start at interrupt
|
|
vector <literal>0x20</literal> all the way to interrupt vector
|
|
<literal>0x2f</literal>, are handled by function
|
|
<literal>intx20</literal>. Lastly, interrupt vector
|
|
<literal>0x30</literal>, which is used for system calls, is
|
|
handled by <literal>intx30</literal>, and vectors
|
|
<literal>0x31</literal> and <literal>0x32</literal> are handled
|
|
by <literal>intx31</literal>. It must be noted that only
|
|
descriptors for interrupt vectors <literal>0x30</literal>,
|
|
<literal>0x31</literal> and <literal>0x32</literal> are given
|
|
privilege level 3, the same privilege level as the
|
|
<filename>boot2</filename> client, which means the client can
|
|
execute a software-generated interrupt to these vectors through
|
|
the <literal>int</literal> instruction without failing (this is
|
|
the way <filename>boot2</filename> uses the services provided by
|
|
the <acronym>BTX</acronym> server). Also, note that
|
|
<emphasis>only</emphasis> software-generated interrupts are
|
|
protected from code executing in lesser privilege levels.
|
|
Hardware-generated interrupts and processor-generated exceptions
|
|
are <emphasis>always</emphasis> handled adequately, regardless
|
|
of the actual privileges involved.</para>
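<para>The fields written by the loop, and the privilege check
applied to the <literal>int</literal> instruction, can be sketched
in C as follows. This is a simplified model under stated
assumptions: the function names are invented, the selector value
<literal>0x08</literal> and the handler offsets are placeholders,
and the bytes <literal>0x8e</literal> and <literal>0xee</literal>
are the standard IA-32 encodings of a present 32-bit interrupt
gate with DPL 0 and DPL 3, used here for illustration:</para>

<programlisting><![CDATA[#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Write the fields that the loop stores into one 8-byte IDT entry. */
static void gate_set(uint8_t gate[8], uint16_t offset, uint8_t selector,
    uint8_t p_dpl_type)
{
    gate[0] = (uint8_t)(offset & 0xff); /* handler offset, bits 0-7 */
    gate[1] = (uint8_t)(offset >> 8);   /* handler offset, bits 8-15 */
    gate[2] = selector;                 /* low byte of the code selector */
    gate[5] = p_dpl_type;               /* present bit, DPL and gate type */
}

/* Extract the Descriptor Privilege Level (bits 5-6 of byte 5). */
static int gate_dpl(const uint8_t gate[8])
{
    return (gate[5] >> 5) & 0x3;
}

/* Return 1 if code running at level cpl may reach the gate with "int". */
static int gate_int_allowed(const uint8_t gate[8], int cpl)
{
    return cpl <= gate_dpl(gate);
}

int main(void)
{
    uint8_t hw[8] = {0}, sys[8] = {0};

    gate_set(hw, 0x1234, 0x08, 0x8e);  /* DPL 0, like the hardware vectors */
    gate_set(sys, 0x5678, 0x08, 0xee); /* DPL 3, like vector 0x30 */
    assert(!gate_int_allowed(hw, 3) && gate_int_allowed(sys, 3));
    printf("hw gate:  dpl=%d, int from level 3 allowed=%d\n",
        gate_dpl(hw), gate_int_allowed(hw, 3));
    printf("sys gate: dpl=%d, int from level 3 allowed=%d\n",
        gate_dpl(sys), gate_int_allowed(sys, 3));
    return 0;
}]]></programlisting>

<para>The check models only software-generated interrupts. As
noted above, hardware interrupts and processor exceptions are
delivered regardless of the gate's DPL.</para>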
|
|
|
|
<para>The next step is to initialize the <acronym>TSS</acronym>
|
|
(Task-State Segment). The <acronym>TSS</acronym> is a hardware
|
|
feature that helps the operating system or executive software
|
|
implement multitasking functionality through process
|
|
abstraction. The IA-32 architecture demands the creation and
|
|
use of <emphasis>at least</emphasis> one <acronym>TSS</acronym>
|
|
if multitasking facilities are used or different privilege
|
|
levels are defined. Because the <filename>boot2</filename>
|
|
client is executed in privilege level 3, but the
|
|
<acronym>BTX</acronym> server runs in privilege level 0, a
|
|
<acronym>TSS</acronym> must be defined:</para>
|
|
|
|
<figure xml:id="btx-tss">
|
|
<title><filename>sys/boot/i386/btx/btx/btx.S</filename></title>
|
|
|
|
<programlisting>/*
|
|
* Initialize TSS.
|
|
*/
|
|
init.4: movb $_ESP0H,TSS_ESP0+1(%di) # Set ESP0
|
|
movb $SEL_SDATA,TSS_SS0(%di) # Set SS0
|
|
movb $_TSSIO,TSS_MAP(%di) # Set I/O bit map base</programlisting>
|
|
</figure>
|
|
|
|
<para>Note that a value is given for the Privilege Level 0 stack
|
|
pointer and stack segment in the <acronym>TSS</acronym>. This is needed because,
|
|
if an interrupt or exception is received while executing
|
|
<filename>boot2</filename> in Privilege Level 3, a change to
|
|
Privilege Level 0 is automatically performed by the processor,
|
|
so a new working stack is needed. Finally, the I/O Map Base
|
|
Address field of the <acronym>TSS</acronym> is given a value, which is a 16-bit
|
|
offset from the beginning of the <acronym>TSS</acronym> to the I/O Permission
|
|
Bitmap and the Interrupt Redirection Bitmap.</para>
|
|
|
|
<para>After the <acronym>IDT</acronym> and <acronym>TSS</acronym> are created, the processor is ready to
|
|
switch to protected mode. This is done in the next
|
|
block:</para>
|
|
|
|
<figure xml:id="btx-prot">
|
|
<title><filename>sys/boot/i386/btx/btx/btx.S</filename></title>
|
|
|
|
<programlisting>/*
|
|
* Bring up the system.
|
|
*/
|
|
mov $0x2820,%bx # Set protected mode
|
|
callw setpic # IRQ offsets
|
|
lidt idtdesc # Set IDT
|
|
lgdt gdtdesc # Set GDT
|
|
mov %cr0,%eax # Switch to protected
|
|
inc %ax # mode
|
|
mov %eax,%cr0 #
|
|
ljmp $SEL_SCODE,$init.8 # To 32-bit code
|
|
.code32
|
|
init.8: xorl %ecx,%ecx # Zero
|
|
movb $SEL_SDATA,%cl # To 32-bit
|
|
movw %cx,%ss # stack</programlisting>
|
|
</figure>
|
|
|
|
<para>First, a call is made to <literal>setpic</literal> to
|
|
program the 8259A <acronym>PIC</acronym> (Programmable Interrupt Controller).
|
|
This chip is connected to multiple hardware interrupt sources.
|
|
Upon receiving an interrupt from a device, it
|
|
signals the processor with the appropriate interrupt vector.
|
|
This can be customized so that specific interrupts are
|
|
associated with specific interrupt vectors, as explained before.
|
|
Next, the <acronym>IDTR</acronym> (Interrupt Descriptor Table Register) and
|
|
<acronym>GDTR</acronym> (Global Descriptor Table Register) are loaded with the
|
|
instructions <literal>lidt</literal> and <literal>lgdt</literal>, respectively. These registers are
|
|
loaded with the base address and limit of the <acronym>IDT</acronym> and
|
|
<acronym>GDT</acronym>. The following three instructions set the Protection Enable
|
|
(PE) bit of the <literal>%cr0</literal> register. This
|
|
effectively switches the processor to
|
|
32-bit protected mode. Next, a long jump is made to
|
|
<literal>init.8</literal> using segment selector SEL_SCODE,
|
|
which selects the Supervisor Code Segment. The processor is
|
|
effectively executing in CPL 0, the most privileged level, after
|
|
this jump. Finally, the Supervisor Data Segment is selected for
|
|
the stack by assigning the segment selector SEL_SDATA to the
|
|
<literal>%ss</literal> register. This data segment also has a
|
|
privilege level of <literal>0</literal>.</para>
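<para>The value <literal>0x2820</literal> passed to
<literal>setpic</literal> encodes both vector bases. The following
C sketch shows the resulting mapping from IRQ lines to interrupt
vectors. It assumes, consistently with the remapping described
earlier, that the low byte is the base for IRQs 0-7 and the high
byte is the base for IRQs 8-15; the function names are
hypothetical:</para>

<programlisting><![CDATA[#include <assert.h>
#include <stdio.h>

/* Low byte: vector base for IRQ 0-7. High byte: base for IRQ 8-15. */
static unsigned pic_master_base(unsigned bx) { return bx & 0xff; }
static unsigned pic_slave_base(unsigned bx)  { return (bx >> 8) & 0xff; }

/* Interrupt vector delivered for a hardware IRQ line (0-15). */
static unsigned irq_to_vector(unsigned bx, unsigned irq)
{
    if (irq < 8)
        return pic_master_base(bx) + irq;
    return pic_slave_base(bx) + (irq - 8);
}

int main(void)
{
    const unsigned bx = 0x2820; /* value loaded into %bx before setpic */
    const unsigned irqs[] = { 0, 7, 8, 15 };

    assert(irq_to_vector(bx, 15) == 0x2f);
    for (unsigned i = 0; i < sizeof(irqs) / sizeof(irqs[0]); i++)
        printf("IRQ %2u -> vector %#x\n", irqs[i],
            irq_to_vector(bx, irqs[i]));
    return 0;
}]]></programlisting>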
|
|
|
|
<para>Our last code block is responsible for loading the
|
|
<acronym>TR</acronym> (Task Register) with the segment selector for the <acronym>TSS</acronym> we created
|
|
earlier, and setting the User Mode environment before passing
|
|
execution control to the <filename>boot2</filename>
|
|
client.</para>
|
|
|
|
<figure xml:id="btx-end">
|
|
<title><filename>sys/boot/i386/btx/btx/btx.S</filename></title>
|
|
|
|
<programlisting>/*
|
|
* Launch user task.
|
|
*/
|
|
movb $SEL_TSS,%cl # Set task
|
|
ltr %cx # register
|
|
movl $0xa000,%edx # User base address
|
|
movzwl %ss:BDA_MEM,%eax # Get free memory
|
|
shll $0xa,%eax # To bytes
|
|
subl $ARGSPACE,%eax # Less arg space
|
|
subl %edx,%eax # Less base
|
|
movb $SEL_UDATA,%cl # User data selector
|
|
pushl %ecx # Set SS
|
|
pushl %eax # Set ESP
|
|
push $0x202 # Set flags (IF set)
|
|
push $SEL_UCODE # Set CS
|
|
pushl btx_hdr+0xc # Set EIP
|
|
pushl %ecx # Set GS
|
|
pushl %ecx # Set FS
|
|
pushl %ecx # Set DS
|
|
pushl %ecx # Set ES
|
|
pushl %edx # Set EAX
|
|
movb $0x7,%cl # Set remaining
|
|
init.9: push $0x0 # general
|
|
loop init.9 # registers
|
|
popa # and initialize
|
|
popl %es # Initialize
|
|
popl %ds # user
|
|
popl %fs # segment
|
|
popl %gs # registers
|
|
iret # To user mode</programlisting>
|
|
</figure>
|
|
|
|
<para>Note that the client's environment includes a stack segment
|
|
selector and stack pointer (registers <literal>%ss</literal> and
|
|
<literal>%esp</literal>). Indeed, once the <acronym>TR</acronym> is loaded with
|
|
the <acronym>TSS</acronym> segment selector (instruction
|
|
<literal>ltr</literal>), the stack pointer is calculated and
|
|
pushed onto the stack along with the stack's segment selector.
|
|
Next, the value <literal>0x202</literal> is pushed onto the
|
|
stack; it is the value that the EFLAGS will get when control is
|
|
passed to the client. Also, the User Mode code segment selector
|
|
and the client's entry point are pushed. Recall that this entry
|
|
point is patched in the <acronym>BTX</acronym> header at link time. Finally,
|
|
segment selectors (stored in register <literal>%ecx</literal>)
|
|
for the segment registers
|
|
<literal>%gs, %fs, %ds and %es</literal> are pushed onto the
|
|
stack, along with the value at <literal>%edx</literal>
|
|
(<literal>0xa000</literal>). Keep in mind the various values
|
|
that have been pushed onto the stack (they will be popped out
|
|
shortly). Next, values for the remaining general purpose
|
|
registers are also pushed onto the stack (note the
|
|
<literal>loop</literal> that pushes the value
|
|
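<literal>0</literal> seven times). Now the values are popped
off the stack. First, the <literal>popa</literal> instruction
pops eight doublewords: the seven zeros and, last, the value of
<literal>%edx</literal> that was pushed just before them. They
are loaded into the general purpose registers in the order
<literal>%edi, %esi, %ebp, %ebx, %edx, %ecx, %eax</literal>; the
doubleword that corresponds to <literal>%esp</literal> is
discarded, and <literal>%eax</literal> receives
<literal>0xa000</literal>.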
|
|
Then, the various segment selectors pushed are popped into the
|
|
various segment registers. Five values still remain on the
|
|
stack. They are popped when the <literal>iret</literal>
|
|
instruction is executed. This instruction first pops
|
|
the value that was pushed from the <acronym>BTX</acronym> header. This value is a
|
|
pointer to <filename>boot2</filename>'s entry point. It is
|
|
placed in the register <literal>%eip</literal>, the instruction
|
|
pointer register. Next, the segment selector for the User
|
|
Code Segment is popped and copied to register
|
|
<literal>%cs</literal>. Remember that
|
|
this segment's privilege level is 3, the least privileged
|
|
level. This means that we must provide values for the stack of
|
|
this privilege level. This is why the processor, besides
|
|
further popping the value for the EFLAGS register, does two more
|
|
pops out of the stack. These values go to the stack
|
|
pointer (<literal>%esp</literal>) and the stack segment
|
|
(<literal>%ss</literal>). Now, execution continues at
|
|
<literal>boot2</literal>'s entry point.</para>
|
|
|
|
<para>It is important to note how the User Code Segment is
|
|
defined. This segment's <emphasis>base address</emphasis> is
|
|
set to <literal>0xa000</literal>. This means that code memory
|
|
addresses are <emphasis>relative</emphasis> to address 0xa000;
|
|
if code being executed is fetched from address
|
|
<literal>0x2000</literal>, the <emphasis>actual</emphasis>
|
|
memory addressed is
|
|
<literal>0xa000+0x2000=0xc000</literal>.</para>
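<para>The address translation and the initial user stack pointer
computed in the <quote>Launch user task</quote> block can be
reproduced with a small C sketch. Several values here are
assumptions made for illustration: <literal>ARGSPACE</literal> is
taken to be <literal>0x1000</literal>, the 640 KB of base memory
is a typical but hypothetical <acronym>BIOS</acronym> report, and
the function names are invented:</para>

<programlisting><![CDATA[#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define USER_BASE 0xa000u /* base address of the user segments */
#define ARGSPACE  0x1000u /* assumed size of the argument area */

/* Linear address reached by an offset relative to the user segments. */
static uint32_t user_linear(uint32_t offset)
{
    return USER_BASE + offset;
}

/* Initial user %esp: free memory, less argument space, less the base. */
static uint32_t user_initial_esp(uint16_t base_mem_kb)
{
    uint32_t bytes = (uint32_t)base_mem_kb << 10; /* KB to bytes */

    return bytes - ARGSPACE - USER_BASE;
}

int main(void)
{
    uint32_t esp = user_initial_esp(640); /* hypothetical 640 KB */

    assert(user_linear(0x2000) == 0xc000);
    printf("offset 0x2000 -> linear %#x\n", (unsigned)user_linear(0x2000));
    printf("initial esp %#x -> linear %#x\n", (unsigned)esp,
        (unsigned)user_linear(esp));
    return 0;
}]]></programlisting>

<para>Subtracting the base when computing the stack pointer keeps
the top of the client's stack, once translated through the segment
base, just below the argument area at the end of base
memory.</para>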
|
|
</sect1>
|
|
|
|
<sect1 xml:id="boot2">
|
|
<title><application>boot2</application> Stage</title>
|
|
|
|
<para><literal>boot2</literal> defines an important structure,
|
|
<literal>struct bootinfo</literal>. This structure is
|
|
initialized by <literal>boot2</literal> and passed to the
|
|
loader, and then further to the kernel. Some fields of this
structure are set by <literal>boot2</literal>, the rest by the
loader. This structure, among other information, contains the
kernel filename, <acronym>BIOS</acronym> hard disk geometry, <acronym>BIOS</acronym> drive number for
the boot device, physical memory available, and the <literal>envp</literal>
pointer. The definition for it is:</para>
|
|
|
|
<programlisting><filename>/usr/include/machine/bootinfo.h:</filename>
|
|
struct bootinfo {
|
|
u_int32_t bi_version;
|
|
u_int32_t bi_kernelname; /* represents a char * */
|
|
u_int32_t bi_nfs_diskless; /* struct nfs_diskless * */
|
|
/* End of fields that are always present. */
|
|
#define bi_endcommon bi_n_bios_used
|
|
u_int32_t bi_n_bios_used;
|
|
u_int32_t bi_bios_geom[N_BIOS_GEOM];
|
|
u_int32_t bi_size;
|
|
u_int8_t bi_memsizes_valid;
|
|
u_int8_t bi_bios_dev; /* bootdev BIOS unit number */
|
|
u_int8_t bi_pad[2];
|
|
u_int32_t bi_basemem;
|
|
u_int32_t bi_extmem;
|
|
u_int32_t bi_symtab; /* struct symtab * */
|
|
u_int32_t bi_esymtab; /* struct symtab * */
|
|
/* Items below only from advanced bootloader */
|
|
u_int32_t bi_kernend; /* end of kernel space */
|
|
u_int32_t bi_envp; /* environment */
|
|
u_int32_t bi_modulep; /* preloaded modules */
|
|
};</programlisting>
|
|
|
|
<para><literal>boot2</literal> enters an infinite loop
waiting for user input, then calls <function>load()</function>.
If the user does not press a key, the loop is broken by a
timeout, and <function>load()</function> loads the default
file (<filename>/boot/loader</filename>). The functions
<function>ino_t lookup(char *filename)</function> and
<function>int xfsread(ino_t inode, void *buf, size_t
nbyte)</function> are used to read the content of a file into
memory. <filename>/boot/loader</filename> is an <acronym>ELF</acronym> binary, but
its <acronym>ELF</acronym> header is preceded by <filename>a.out</filename>'s <literal>struct
exec</literal> structure. <function>load()</function> scans the
loader's <acronym>ELF</acronym> header, loads the content of
<filename>/boot/loader</filename> into memory, and passes
execution to the loader's entry point:</para>
|
|
|
|
<programlisting><filename>sys/boot/i386/boot2/boot2.c:</filename>
|
|
    __exec((caddr_t)addr, RB_BOOTINFO | (opts &amp; RBX_MASK),
	   MAKEBOOTDEV(dev_maj[dsk.type], 0, dsk.slice, dsk.unit, dsk.part),
	   0, 0, 0, VTOP(&amp;bootinfo));</programlisting>
|
|
</sect1>
|
|
|
|
<sect1 xml:id="boot-loader">
|
|
<title><application>loader</application> Stage</title>
|
|
|
|
<para><application>loader</application> is a <acronym>BTX</acronym> client as well.
|
|
It is not described here in detail; there is a comprehensive
manual page written by Mike Smith, &man.loader.8;. The underlying
|
|
mechanisms and <acronym>BTX</acronym> were discussed above.</para>
|
|
|
|
<para>The main task for the loader is to boot the kernel. Once
the kernel is loaded into memory, the loader calls
it:</para>
|
|
|
|
<programlisting><filename>sys/boot/common/boot.c:</filename>
|
|
/* Call the exec handler from the loader matching the kernel */
|
|
module_formats[km->m_loader]->l_exec(km);</programlisting>
|
|
</sect1>
|
|
|
|
<sect1 xml:id="boot-kernel">
|
|
<title>Kernel Initialization</title>
|
|
|
|
<para>Let us take a look at the command that links the kernel.
|
|
This will help identify the exact location where the loader
|
|
passes execution to the kernel. This location is the kernel's
|
|
actual entry point.</para>
|
|
|
|
<programlisting><filename>sys/conf/Makefile.i386:</filename>
|
|
ld -elf -Bdynamic -T /usr/src/sys/conf/ldscript.i386 -export-dynamic \
|
|
-dynamic-linker /red/herring -o kernel -X locore.o \
|
|
&lt;lots of kernel .o files&gt;</programlisting>
|
|
|
|
<indexterm><primary>ELF</primary></indexterm>
|
|
<para>A few interesting things can be seen here. First, the
|
|
kernel is an ELF dynamically linked binary, but the dynamic
|
|
linker for the kernel is <filename>/red/herring</filename>, which is
|
|
definitely a bogus file. Second, taking a look at the file
|
|
<filename>sys/conf/ldscript.i386</filename> gives an idea about
|
|
what <application>ld</application> options are used when
|
|
linking a kernel. Reading through the first few lines, the
|
|
string</para>
|
|
|
|
<programlisting><filename>sys/conf/ldscript.i386:</filename>
|
|
ENTRY(btext)</programlisting>
|
|
|
|
<para>says that a kernel's entry point is the symbol <literal>btext</literal>.
|
|
This symbol is defined in <filename>locore.s</filename>:</para>
|
|
|
|
<programlisting><filename>sys/i386/i386/locore.s:</filename>
|
|
.text
|
|
/**********************************************************************
|
|
*
|
|
* This is where the bootblocks start us, set the ball rolling...
|
|
*
|
|
*/
|
|
NON_GPROF_ENTRY(btext)</programlisting>
|
|
|
|
<para>First, the register EFLAGS is set to a predefined value of
|
|
0x00000002. Then all the segment registers are
|
|
initialized:</para>
|
|
|
|
<programlisting><filename>sys/i386/i386/locore.s:</filename>
|
|
/* Don't trust what the BIOS gives for eflags. */
|
|
pushl $PSL_KERNEL
|
|
popfl
|
|
|
|
/*
|
|
* Don't trust what the BIOS gives for %fs and %gs. Trust the bootstrap
|
|
* to set %cs, %ds, %es and %ss.
|
|
*/
|
|
mov %ds, %ax
|
|
mov %ax, %fs
|
|
mov %ax, %gs</programlisting>
|
|
|
|
<para><literal>btext</literal> calls the routines
<function>recover_bootinfo()</function>,
<function>identify_cpu()</function>, and
<function>create_pagetables()</function>, which are also defined
|
|
in <filename>locore.s</filename>. Here is a description of what
|
|
they do:</para>
|
|
|
|
<informaltable frame="none" pgwide="1">
|
|
<tgroup cols="2" align="left">
|
|
<tbody>
|
|
<row>
|
|
<entry><function>recover_bootinfo</function></entry>
|
|
<entry>This routine parses the parameters to the kernel
|
|
passed from the bootstrap. The kernel may have been
|
|
booted in 3 ways: by the loader, described above, by the
|
|
old disk boot blocks, or by the old diskless boot
|
|
procedure. This function determines the booting method,
|
|
and stores the <literal>struct bootinfo</literal>
|
|
structure into the kernel memory.</entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry><function>identify_cpu</function></entry>
|
|
<entry>This function tries to find out what CPU it is
|
|
running on, storing the value found in a variable
|
|
<varname>_cpu</varname>.</entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry><function>create_pagetables</function></entry>
|
|
<entry>This function allocates and fills out a Page Table
|
|
Directory at the top of the kernel memory area.</entry>
|
|
</row>
|
|
</tbody>
|
|
</tgroup>
|
|
</informaltable>
|
|
|
|
<para>The next step is to enable VME, if the CPU supports
|
|
it:</para>
|
|
|
|
<programlisting> testl $CPUID_VME, R(_cpu_feature)
|
|
jz 1f
|
|
movl %cr4, %eax
|
|
orl $CR4_VME, %eax
|
|
movl %eax, %cr4</programlisting>
|
|
|
|
<para>Then, paging is enabled:</para>
|
|
|
|
<programlisting>/* Now enable paging */
|
|
movl R(_IdlePTD), %eax
|
|
movl %eax,%cr3 /* load ptd addr into mmu */
|
|
movl %cr0,%eax /* get control word */
|
|
orl $CR0_PE|CR0_PG,%eax /* enable paging */
|
|
movl %eax,%cr0 /* and let's page NOW! */</programlisting>
|
|
|
|
<para>Because paging is now enabled, a jump is needed to
continue execution in the virtualized address space. The next
three lines of code perform it:</para>
|
|
|
|
<programlisting> pushl $begin /* jump to high virtualized address */
|
|
ret
|
|
|
|
/* now running relocated at KERNBASE where the system is linked to run */
|
|
begin:</programlisting>
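<para>The bit manipulation and the address shift involved can be
illustrated with a short C sketch. It is not kernel code: the
function names are invented, and <literal>KERNBASE</literal> is
assumed to be <literal>0xc0000000</literal>, a value that depends
on the kernel configuration. The <literal>CR0_PE</literal> and
<literal>CR0_PG</literal> values are the architectural bit
positions 0 and 31 of <literal>%cr0</literal>. The
<literal>linked_to_phys()</literal> helper models what the
<literal>R()</literal> macro used in the listings above appears to
do, namely subtracting <literal>KERNBASE</literal> from a linked
address so that it can be used before paging is on:</para>

<programlisting><![CDATA[#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define CR0_PE   0x00000001u /* protection enable (bit 0) */
#define CR0_PG   0x80000000u /* paging enable (bit 31) */
#define KERNBASE 0xc0000000u /* assumed kernel virtual base address */

/* Value written back to %cr0 to turn paging on. */
static uint32_t cr0_enable_paging(uint32_t cr0)
{
    return cr0 | CR0_PE | CR0_PG;
}

/* Physical address of a symbol linked to run above KERNBASE. */
static uint32_t linked_to_phys(uint32_t linked)
{
    return linked - KERNBASE;
}

/* Virtual address at which a physical kernel address is mapped. */
static uint32_t phys_to_linked(uint32_t phys)
{
    return phys + KERNBASE;
}

int main(void)
{
    uint32_t begin = 0xc0100400u; /* hypothetical linked address of begin */

    assert(phys_to_linked(linked_to_phys(begin)) == begin);
    printf("cr0: %#" PRIx32 " -> %#" PRIx32 "\n", (uint32_t)0x11,
        cr0_enable_paging(0x11));
    printf("begin: linked %#" PRIx32 ", physical %#" PRIx32 "\n", begin,
        linked_to_phys(begin));
    return 0;
}]]></programlisting>

<para>Before the jump, the processor is still fetching
instructions from low (physical) addresses. Pushing the linked
address of <literal>begin</literal> and executing
<literal>ret</literal> moves execution to the high mapping.</para>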
|
|
|
|
<para>The function <function>init386()</function> is called with
a pointer to the first free physical page, followed by
<function>mi_startup()</function>. <function>init386()</function>
|
|
is an architecture dependent initialization function, and
|
|
<function>mi_startup()</function> is an architecture independent
|
|
one (the 'mi_' prefix stands for Machine Independent). The
|
|
kernel never returns from <function>mi_startup()</function>, and
|
|
by calling it, the kernel finishes booting:</para>
|
|
|
|
<programlisting><filename>sys/i386/i386/locore.s:</filename>
|
|
movl physfree, %esi
|
|
pushl %esi /* value of first for init386(first) */
|
|
call _init386 /* wire 386 chip for unix operation */
|
|
call _mi_startup /* autoconfiguration, mountroot etc */
|
|
hlt /* never returns to here */</programlisting>
|
|
|
|
<sect2>
|
|
<title><function>init386()</function></title>
|
|
|
|
<para><function>init386()</function> is defined in
|
|
<filename>sys/i386/i386/machdep.c</filename> and performs
|
|
low-level initialization specific to the i386 chip. The
|
|
switch to protected mode was performed by the loader. The
|
|
loader has created the very first task, in which the kernel
|
|
continues to operate. Before looking at the code, consider
|
|
the tasks the processor must complete to initialize protected
|
|
mode execution:</para>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>Initialize the kernel tunable parameters, passed from
|
|
the bootstrapping program.</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>Prepare the GDT.</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>Prepare the IDT.</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>Initialize the system console.</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>Initialize DDB, if it is compiled into the
kernel.</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>Initialize the TSS.</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>Prepare the LDT.</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>Set up proc0's pcb.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<indexterm><primary>parameters</primary></indexterm>
|
|
<para><function>init386()</function> initializes the tunable
|
|
parameters passed from bootstrap by setting the environment
|
|
pointer (envp) and calling <function>init_param1()</function>.
|
|
The envp pointer has been passed from the loader in the
|
|
<literal>bootinfo</literal> structure:</para>
|
|
|
|
<programlisting><filename>sys/i386/i386/machdep.c:</filename>
|
|
kern_envp = (caddr_t)bootinfo.bi_envp + KERNBASE;
|
|
|
|
/* Init basic tunables, hz etc */
|
|
init_param1();</programlisting>
|
|
|
|
<para><function>init_param1()</function> is defined in
|
|
<filename>sys/kern/subr_param.c</filename>. That file has a
|
|
number of sysctls, and two functions,
|
|
<function>init_param1()</function> and
|
|
<function>init_param2()</function>, that are called from
|
|
<function>init386()</function>:</para>
|
|
|
|
<programlisting><filename>sys/kern/subr_param.c:</filename>
|
|
hz = HZ;
|
|
	TUNABLE_INT_FETCH("kern.hz", &amp;hz);</programlisting>
|
|
|
|
<para><literal>TUNABLE_&lt;typename&gt;_FETCH</literal> is used to fetch the value
|
|
from the environment:</para>
|
|
|
|
<programlisting><filename>/usr/src/sys/sys/kernel.h:</filename>
|
|
#define TUNABLE_INT_FETCH(path, var) getenv_int((path), (var))</programlisting>
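<para>The idea of fetching a tunable from the environment can be
sketched in user-space C. The code below is not the kernel's
<function>getenv_int()</function>; the
<literal>sketch_</literal> functions are hypothetical, and the
environment is assumed to be a block of NUL-terminated
<literal>name=value</literal> strings that ends with an empty
string:</para>

<programlisting><![CDATA[#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Look up name in a block of NUL-terminated "name=value" strings that
 * ends with an empty string. Returns the value, or NULL if not found.
 */
static const char *sketch_getenv(const char *env, const char *name)
{
    size_t len = strlen(name);

    for (const char *p = env; *p != '\0'; p += strlen(p) + 1) {
        if (strncmp(p, name, len) == 0 && p[len] == '=')
            return p + len + 1;
    }
    return NULL;
}

/* Overwrite *var only when the variable exists; return 1 if it did. */
static int sketch_getenv_int(const char *env, const char *name, int *var)
{
    const char *value = sketch_getenv(env, name);

    if (value == NULL)
        return 0;
    *var = (int)strtol(value, NULL, 0);
    return 1;
}

int main(void)
{
    /* Hypothetical environment handed over by the loader. */
    static const char env[] = "kern.hz=250\0kern.maxusers=64\0";
    int hz = 100; /* stands in for the compiled-in default HZ */

    sketch_getenv_int(env, "kern.hz", &hz);
    assert(hz == 250);
    printf("hz=%d\n", hz);
    return 0;
}]]></programlisting>

<para>Because the variable is assigned its default first and only
overwritten when the name is found, a missing tunable simply
leaves the compiled-in value in place, which mirrors the
<literal>hz = HZ</literal> assignment followed by the fetch shown
above.</para>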
|
|
|
|
<para>Sysctl <literal>kern.hz</literal> is the number of system clock
ticks per second. Additionally, these sysctls are set by
|
|
<function>init_param1()</function>: <literal>kern.maxswzone,
|
|
kern.maxbcache, kern.maxtsiz, kern.dfldsiz, kern.maxdsiz,
|
|
kern.dflssiz, kern.maxssiz, kern.sgrowsiz</literal>.</para>
|
|
|
|
<indexterm>
|
|
<primary>Global Descriptors Table (GDT)</primary>
|
|
</indexterm>
|
|
|
|
<para>Then <function>init386()</function> prepares the Global
|
|
Descriptors Table (GDT). Every task on an x86 is running in
|
|
its own virtual address space, and this space is addressed by
|
|
a segment:offset pair. Say, for instance, the current
|
|
instruction to be executed by the processor lies at CS:EIP,
|
|
then the linear virtual address for that instruction would be
|
|
<quote>the virtual address of code segment CS</quote> + EIP.
|
|
For convenience, segments begin at virtual address 0 and end
at a 4 GB boundary. Therefore, the instruction's linear
virtual address for this example would just be the value of
EIP. Segment registers such as CS, DS, etc. are the selectors,
i.e., indexes, into the GDT (to be more precise, an index is not a
|
|
selector itself, but the INDEX field of a selector).
|
|
FreeBSD's GDT holds descriptors for 15 selectors per
|
|
CPU:</para>
|
|
|
|
<programlisting><filename>sys/i386/i386/machdep.c:</filename>
|
|
union descriptor gdt[NGDT * MAXCPU]; /* global descriptor table */
|
|
|
|
<filename>sys/i386/include/segments.h:</filename>
|
|
/*
|
|
* Entries in the Global Descriptor Table (GDT)
|
|
*/
|
|
#define GNULL_SEL 0 /* Null Descriptor */
|
|
#define GCODE_SEL 1 /* Kernel Code Descriptor */
|
|
#define GDATA_SEL 2 /* Kernel Data Descriptor */
|
|
#define GPRIV_SEL 3 /* SMP Per-Processor Private Data */
|
|
#define GPROC0_SEL 4 /* Task state process slot zero and up */
|
|
#define GLDT_SEL 5 /* LDT - eventually one per process */
|
|
#define GUSERLDT_SEL 6 /* User LDT */
|
|
#define GTGATE_SEL 7 /* Process task switch gate */
|
|
#define GBIOSLOWMEM_SEL 8 /* BIOS low memory access (must be entry 8) */
|
|
#define GPANIC_SEL 9 /* Task state to consider panic from */
|
|
#define GBIOSCODE32_SEL 10 /* BIOS interface (32bit Code) */
|
|
#define GBIOSCODE16_SEL 11 /* BIOS interface (16bit Code) */
|
|
#define GBIOSDATA_SEL 12 /* BIOS interface (Data) */
|
|
#define GBIOSUTIL_SEL 13 /* BIOS interface (Utility) */
|
|
#define GBIOSARGS_SEL 14 /* BIOS interface (Arguments) */</programlisting>
|
|
|
|
<para>Note that those #defines are not selectors themselves, but
just the INDEX field of a selector, so they are exactly the
indices of the GDT. For example, an actual selector for the
kernel code (GCODE_SEL) has the value 0x08.</para>
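<para>How an index becomes a selector can be shown with a few
lines of C. The sketch uses the IA-32 selector layout (INDEX in
bits 3-15, table indicator in bit 2, requested privilege level in
bits 0-1). The function names are hypothetical and only mimic
what selector-building macros such as <literal>GSEL()</literal>
are expected to compute:</para>

<programlisting><![CDATA[#include <assert.h>
#include <stdio.h>

#define SEL_KPL 0 /* kernel privilege level */
#define SEL_UPL 3 /* user privilege level */
#define SEL_LDT 4 /* table indicator bit: descriptor is in the LDT */

#define GCODE_SEL 1 /* kernel code descriptor index */
#define GDATA_SEL 2 /* kernel data descriptor index */

/* Build a GDT selector from a descriptor index and an RPL. */
static unsigned gdt_selector(unsigned index, unsigned rpl)
{
    return (index << 3) | rpl;
}

/* Build an LDT selector: same layout with the table indicator set. */
static unsigned ldt_selector(unsigned index, unsigned rpl)
{
    return (index << 3) | SEL_LDT | rpl;
}

/* Recover the INDEX field from a selector. */
static unsigned selector_index(unsigned sel)
{
    return sel >> 3;
}

int main(void)
{
    assert(gdt_selector(GCODE_SEL, SEL_KPL) == 0x08);
    printf("kernel code selector: %#x\n", gdt_selector(GCODE_SEL, SEL_KPL));
    printf("kernel data selector: %#x\n", gdt_selector(GDATA_SEL, SEL_KPL));
    printf("LDT index 3, user:    %#x\n", ldt_selector(3, SEL_UPL));
    return 0;
}]]></programlisting>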
|
|
|
|
<indexterm><primary>Interrupt Descriptor Table
|
|
(IDT)</primary></indexterm>
|
|
<para>The next step is to initialize the Interrupt Descriptor
|
|
Table (IDT). This table is referenced by the processor when a
|
|
software or hardware interrupt occurs. For example, to make a
|
|
system call, a user application issues the
|
|
<literal>INT 0x80</literal> instruction. This is a software
|
|
interrupt, so the processor's hardware looks up a record with
|
|
index 0x80 in the IDT. This record points to the routine that
|
|
handles this interrupt; in this particular case, this will be
|
|
the kernel's syscall gate. The IDT may have a maximum of 256
|
|
(0x100) records. The kernel allocates NIDT records for the
|
|
IDT, where NIDT is the maximum (256):</para>
|
|
|
|
<programlisting><filename>sys/i386/i386/machdep.c:</filename>
|
|
static struct gate_descriptor idt0[NIDT];
|
|
struct gate_descriptor *idt = &amp;idt0[0]; /* interrupt descriptor table */</programlisting>
|
|
|
|
<para>For each interrupt, an appropriate handler is set. The
|
|
syscall gate for <literal>INT 0x80</literal> is set as
|
|
well:</para>
|
|
|
|
<programlisting><filename>sys/i386/i386/machdep.c:</filename>
|
|
	setidt(0x80, &amp;IDTVEC(int0x80_syscall),
|
|
SDT_SYS386TGT, SEL_UPL, GSEL(GCODE_SEL, SEL_KPL));</programlisting>
|
|
|
|
<para>So when a userland application issues the
|
|
<literal>INT 0x80</literal> instruction, control will transfer
|
|
to the function <function>_Xint0x80_syscall</function>, which
|
|
is in the kernel code segment and will be executed with
|
|
supervisor privileges.</para>
|
|
|
|
<para>Console and DDB are then initialized:</para>
|
|
<indexterm><primary>DDB</primary></indexterm>
|
|
|
|
<programlisting><filename>sys/i386/i386/machdep.c:</filename>
|
|
cninit();
|
|
/* skipped */
|
|
#ifdef DDB
|
|
kdb_init();
|
|
	if (boothowto &amp; RB_KDB)
|
|
Debugger("Boot flags requested debugger");
|
|
#endif</programlisting>
|
|
|
|
<para>The Task State Segment is another x86 protected mode
|
|
structure. The TSS is used by the hardware to store task
|
|
information when a task switch occurs.</para>
|
|
|
|
<para>The Local Descriptors Table is used to reference userland
|
|
code and data. Several selectors are defined to point to the
|
|
LDT: they are the system call gates and the user code and data
|
|
selectors:</para>
|
|
|
|
<programlisting><filename>/usr/include/machine/segments.h:</filename>
|
|
#define LSYS5CALLS_SEL 0 /* forced by intel BCS */
|
|
#define LSYS5SIGR_SEL 1
|
|
#define L43BSDCALLS_SEL 2 /* notyet */
|
|
#define LUCODE_SEL 3
|
|
#define LSOL26CALLS_SEL 4 /* Solaris >= 2.6 system call gate */
|
|
#define LUDATA_SEL 5
|
|
/* separate stack, es,fs,gs sels ? */
|
|
/* #define LPOSIXCALLS_SEL 5*/ /* notyet */
|
|
#define LBSDICALLS_SEL 16 /* BSDI system call gate */
|
|
#define NLDT (LBSDICALLS_SEL + 1)</programlisting>
|
|
|
|
<para>Next, proc0's Process Control Block
|
|
(<literal>struct pcb</literal>) structure is initialized.
|
|
proc0 is a <literal>struct proc</literal> structure that
|
|
describes a kernel process. It is always present while the
|
|
kernel is running, therefore it is declared as global:</para>
|
|
|
|
<programlisting><filename>sys/kern/kern_init.c:</filename>
|
|
struct proc proc0;</programlisting>
|
|
|
|
<para>The structure <literal>struct pcb</literal> is a part of a
|
|
proc structure. It is defined in
|
|
<filename>/usr/include/machine/pcb.h</filename> and has a
|
|
process's information specific to the i386 architecture, such
|
|
as register values.</para>
|
|
</sect2>
|
|
|
|
<sect2>
|
|
<title><function>mi_startup()</function></title>
|
|
|
|
<para>This function performs a bubble sort of all the system
|
|
initialization objects and then calls the entry of each object
|
|
one by one:</para>
|
|
|
|
<programlisting><filename>sys/kern/init_main.c:</filename>
|
|
for (sipp = sysinit; *sipp; sipp++) {
|
|
|
|
/* ... skipped ... */
|
|
|
|
/* Call function */
|
|
(*((*sipp)->func))((*sipp)->udata);
|
|
/* ... skipped ... */
|
|
}</programlisting>
|
|
|
|
<para>Although the sysinit framework is described in the <link
|
|
xlink:href="&url.doc.langbase;/books/developers-handbook">Developers'
|
|
Handbook</link>, its internals are discussed here.</para>
|
|
|
|
<indexterm><primary>sysinit objects</primary></indexterm>
|
|
<para>Every system initialization object (sysinit object) is
|
|
created by calling a SYSINIT() macro. Let us take as an example
|
|
an <literal>announce</literal> sysinit object. This object
|
|
prints the copyright message:</para>
|
|
|
|
<programlisting><filename>sys/kern/init_main.c:</filename>
|
|
static void
|
|
print_caddr_t(void *data __unused)
|
|
{
|
|
printf("%s", (char *)data);
|
|
}
|
|
SYSINIT(announce, SI_SUB_COPYRIGHT, SI_ORDER_FIRST, print_caddr_t, copyright)</programlisting>
|
|
|
|
<para>The subsystem ID for this object is
<literal>SI_SUB_COPYRIGHT</literal> (0x0800001), which comes right
after <literal>SI_SUB_CONSOLE</literal> (0x0800000).  The
copyright message is therefore printed first, just after
console initialization.</para>
|
|
|
|
<para>Let us take a look at what exactly the macro
|
|
<literal>SYSINIT()</literal> does. It expands to a
|
|
<literal>C_SYSINIT()</literal> macro. The
|
|
<literal>C_SYSINIT()</literal> macro then expands to a static
|
|
<literal>struct sysinit</literal> structure declaration with
|
|
another <literal>DATA_SET</literal> macro call:</para>
|
|
|
|
<programlisting><filename>/usr/include/sys/kernel.h:</filename>
|
|
#define C_SYSINIT(uniquifier, subsystem, order, func, ident)    \
    static struct sysinit uniquifier ## _sys_init = {           \
        subsystem,                                              \
        order,                                                  \
        func,                                                   \
        ident                                                   \
    };                                                          \
    DATA_SET(sysinit_set,uniquifier ## _sys_init);

#define SYSINIT(uniquifier, subsystem, order, func, ident)      \
    C_SYSINIT(uniquifier, subsystem, order,                     \
    (sysinit_cfunc_t)(sysinit_nfunc_t)func, (void *)ident)</programlisting>
|
|
|
|
<para>The <literal>DATA_SET()</literal> macro expands to a
|
|
<literal>MAKE_SET()</literal>, and that macro is the point
|
|
where all the sysinit magic is hidden:</para>
|
|
|
|
<programlisting><filename>/usr/include/sys/linker_set.h:</filename>
|
|
#define MAKE_SET(set, sym) \
|
|
static void const * const __set_##set##_sym_##sym = &amp;sym; \
|
|
__asm(".section .set." #set ",\"aw\""); \
|
|
__asm(".long " #sym); \
|
|
__asm(".previous")
|
|
#endif
|
|
#define TEXT_SET(set, sym) MAKE_SET(set, sym)
|
|
#define DATA_SET(set, sym) MAKE_SET(set, sym)</programlisting>
|
|
|
|
<para>In our case, the following declaration will occur:</para>
|
|
|
|
<programlisting>static struct sysinit announce_sys_init = {
|
|
SI_SUB_COPYRIGHT,
|
|
SI_ORDER_FIRST,
|
|
(sysinit_cfunc_t)(sysinit_nfunc_t) print_caddr_t,
|
|
(void *) copyright
|
|
};
|
|
|
|
static void const *const __set_sysinit_set_sym_announce_sys_init =
|
|
&amp;announce_sys_init;
|
|
__asm(".section .set.sysinit_set" ",\"aw\"");
|
|
__asm(".long " "announce_sys_init");
|
|
__asm(".previous");</programlisting>
|
|
|
|
<para>The first <literal>__asm</literal> statement creates an ELF
section named <literal>.set.sysinit_set</literal> within the
kernel's executable.  The contributions from all object files
are merged at kernel link time.  The second
<literal>__asm</literal> places one 32-bit value in that section:
the address of the <literal>announce_sys_init</literal>
structure.  The third <literal>__asm</literal> statement marks
the end of the section's contents by switching back to the
previous section.  If a directive with the same section name
occurred before, the content, i.e., the 32-bit value, is appended
to the existing section, thus forming an array of 32-bit
pointers.</para>
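
<para>The same idea can be tried in user space.  The sketch below
is an analogue under stated assumptions, not the mechanism shown
above: it relies on the GCC/Clang <literal>section</literal> and
<literal>used</literal> attributes and on the
<literal>__start_</literal> and <literal>__stop_</literal> symbols
that ELF linkers such as GNU ld provide for sections whose names
are valid C identifiers.  The set name
<literal>demo_set</literal>, the macro
<literal>DEMO_SET_ADD()</literal>, and the structure are
hypothetical.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical object type collected into the set. */
struct demo_item {
	const char *name;
};

/*
 * Hypothetical DEMO_SET_ADD() macro: store a pointer to `sym` in an ELF
 * section named demo_set.  The linker concatenates every contribution to
 * that section, which yields a contiguous array of pointers.
 */
#define DEMO_SET_ADD(sym)						\
	static struct demo_item *demo_set_ptr_##sym			\
	    __attribute__((section("demo_set"), used)) = &sym

/* Section bounds; provided by the linker, not defined in any source file. */
extern struct demo_item *__start_demo_set[];
extern struct demo_item *__stop_demo_set[];

static struct demo_item first = { "first" };
static struct demo_item second = { "second" };

DEMO_SET_ADD(first);
DEMO_SET_ADD(second);

/* Number of pointers the linker gathered into the demo_set section. */
static size_t
demo_set_count(void)
{
	return ((size_t)(__stop_demo_set - __start_demo_set));
}
```
]]></programlisting>

<para>Each <literal>DEMO_SET_ADD()</literal> use contributes one
pointer, so iterating from <literal>__start_demo_set</literal> to
<literal>__stop_demo_set</literal> visits every registered object
without any central table listing them.</para>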
|
|
|
|
<para>Running <application>objdump</application> on a kernel
binary reveals several such small sections:</para>
|
|
|
|
<screen>&prompt.user; <userinput>objdump -h /kernel</userinput>
|
|
7 .set.cons_set 00000014 c03164c0 c03164c0 002154c0 2**2
|
|
CONTENTS, ALLOC, LOAD, DATA
|
|
8 .set.kbddriver_set 00000010 c03164d4 c03164d4 002154d4 2**2
|
|
CONTENTS, ALLOC, LOAD, DATA
|
|
9 .set.scrndr_set 00000024 c03164e4 c03164e4 002154e4 2**2
|
|
CONTENTS, ALLOC, LOAD, DATA
|
|
10 .set.scterm_set 0000000c c0316508 c0316508 00215508 2**2
|
|
CONTENTS, ALLOC, LOAD, DATA
|
|
11 .set.sysctl_set 0000097c c0316514 c0316514 00215514 2**2
|
|
CONTENTS, ALLOC, LOAD, DATA
|
|
12 .set.sysinit_set 00000664 c0316e90 c0316e90 00215e90 2**2
|
|
CONTENTS, ALLOC, LOAD, DATA</screen>
|
|
|
|
<para>This screen dump shows that the size of the
<literal>.set.sysinit_set</literal> section is 0x664 bytes, so
<literal>0x664/sizeof(void *)</literal> sysinit objects are
compiled into the kernel, where a pointer occupies 4 bytes on
i386.  The other sections, such as
<literal>.set.sysctl_set</literal>, represent other linker
sets.</para>
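
<para>The division can be written out as a small helper.  The
function name is hypothetical, and the pointer size of 4 bytes is
the i386 assumption used throughout this chapter.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of pointer-sized entries in a linker set section, given the
 * section size reported by objdump and the pointer size of the target.
 */
static size_t
demo_set_entries(size_t section_bytes, size_t pointer_bytes)
{
	return (section_bytes / pointer_bytes);
}
```
]]></programlisting>

<para>For the dump above,
<literal>demo_set_entries(0x664, 4)</literal> evaluates to 409,
since 0x664 is 1636 in decimal.</para>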
|
|
|
|
<para>By defining a variable of type <literal>struct
linker_set</literal>, the content of the
<literal>.set.sysinit_set</literal> section is
<quote>collected</quote> into that variable:</para>
|
|
|
|
<programlisting><filename>sys/kern/init_main.c:</filename>
|
|
extern struct linker_set sysinit_set; /* XXX */</programlisting>
|
|
|
|
<para>The <literal>struct linker_set</literal> is defined as
|
|
follows:</para>
|
|
|
|
<programlisting><filename>/usr/include/sys/linker_set.h:</filename>
|
|
struct linker_set {
|
|
int ls_length;
|
|
void *ls_items[1]; /* really ls_length of them, trailing NULL */
|
|
};</programlisting>
|
|
|
|
<para>The first member holds the number of sysinit objects, and
the second member is a NULL-terminated array of pointers to
them.</para>
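
<para>A structure of this shape can be walked either by its length
member or by scanning for the terminating NULL.  The sketch below
assumes a user-space copy of the layout;
<literal>struct demo_linker_set</literal> and the
<literal>demo_</literal> functions are hypothetical, and a C99
flexible array member replaces the one-element array for
portability.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Same layout idea as struct linker_set, with a flexible array member. */
struct demo_linker_set {
	int ls_length;
	void *ls_items[];	/* ls_length pointers plus a trailing NULL */
};

/* Build a set holding copies of the n pointers in items[]. */
static struct demo_linker_set *
demo_set_build(void **items, int n)
{
	struct demo_linker_set *ls;
	int i;

	ls = malloc(sizeof(*ls) + ((size_t)n + 1) * sizeof(void *));
	if (ls == NULL)
		return (NULL);
	ls->ls_length = n;
	for (i = 0; i < n; i++)
		ls->ls_items[i] = items[i];
	ls->ls_items[n] = NULL;		/* terminator */
	return (ls);
}

/* Count entries by scanning for the NULL terminator, ignoring ls_length. */
static int
demo_set_walk(const struct demo_linker_set *ls)
{
	void *const *pp;
	int n = 0;

	for (pp = ls->ls_items; *pp != NULL; pp++)
		n++;
	return (n);
}
```
]]></programlisting>

<para>Both views agree on a well-formed set: the count returned by
<function>demo_set_walk()</function> equals
<literal>ls_length</literal>.</para>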
|
|
|
|
<para>Returning to the <function>mi_startup()</function>
discussion, it should now be clear how the sysinit objects
are organized.  The <function>mi_startup()</function>
function sorts them and calls each one.  The very last object is
the system scheduler:</para>
|
|
|
|
<programlisting><filename>/usr/include/sys/kernel.h:</filename>
|
|
enum sysinit_sub_id {
|
|
SI_SUB_DUMMY = 0x0000000, /* not executed; for linker*/
|
|
SI_SUB_DONE = 0x0000001, /* processed*/
|
|
SI_SUB_CONSOLE = 0x0800000, /* console*/
|
|
SI_SUB_COPYRIGHT = 0x0800001, /* first use of console*/
|
|
...
|
|
SI_SUB_RUN_SCHEDULER = 0xfffffff /* scheduler: no return*/
|
|
};</programlisting>
|
|
|
|
<para>The system scheduler sysinit object is defined in the file
|
|
<filename>sys/vm/vm_glue.c</filename>, and the entry point for
|
|
that object is <function>scheduler()</function>. That
|
|
function is actually an infinite loop, and it represents a
|
|
process with PID 0, the swapper process. The proc0 structure,
|
|
mentioned before, is used to describe it.</para>
|
|
|
|
<para>The first user process, called <emphasis>init</emphasis>,
|
|
is created by the sysinit object
|
|
<literal>init</literal>:</para>
|
|
|
|
<programlisting><filename>sys/kern/init_main.c:</filename>
|
|
static void
|
|
create_init(const void *udata __unused)
|
|
{
|
|
int error;
|
|
int s;
|
|
|
|
s = splhigh();
|
|
error = fork1(&amp;proc0, RFFDG | RFPROC, &amp;initproc);
|
|
if (error)
|
|
panic("cannot fork init: %d\n", error);
|
|
initproc->p_flag |= P_INMEM | P_SYSTEM;
|
|
cpu_set_fork_handler(initproc, start_init, NULL);
|
|
remrunqueue(initproc);
|
|
splx(s);
|
|
}
|
|
SYSINIT(init,SI_SUB_CREATE_INIT, SI_ORDER_FIRST, create_init, NULL)</programlisting>
|
|
|
|
<para>The <function>create_init()</function> function allocates a
new process by calling <function>fork1()</function>, but does not
mark it runnable.  When this new process is scheduled for
execution by the scheduler,
<function>start_init()</function> is called.  That
function is defined in <filename>init_main.c</filename>.  It
tries to load and exec the <filename>init</filename> binary,
probing <filename>/sbin/init</filename> first, then
<filename>/sbin/oinit</filename>,
<filename>/sbin/init.bak</filename>, and finally
<filename>/stand/sysinstall</filename>:</para>
|
|
|
|
<programlisting><filename>sys/kern/init_main.c:</filename>
|
|
static char init_path[MAXPATHLEN] =
|
|
#ifdef INIT_PATH
|
|
__XSTRING(INIT_PATH);
|
|
#else
|
|
"/sbin/init:/sbin/oinit:/sbin/init.bak:/stand/sysinstall";
|
|
#endif</programlisting>
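
<para>Probing the candidates means splitting the colon-separated
list and trying each component in turn.  The helper below sketches
only the splitting step in user space.  The function name and
interface are hypothetical and are not taken from
<function>start_init()</function>.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the idx-th colon-separated component of `list` into `buf`.
 * Returns 1 on success, 0 if there is no such component or if it does
 * not fit into buflen bytes including the terminating NUL.
 */
static int
demo_path_component(const char *list, size_t idx, char *buf, size_t buflen)
{
	const char *start = list, *end;
	size_t len;

	for (;;) {
		end = strchr(start, ':');
		len = (end != NULL) ? (size_t)(end - start) : strlen(start);
		if (idx == 0)
			break;
		if (end == NULL)
			return (0);	/* ran out of components */
		start = end + 1;
		idx--;
	}
	if (len + 1 > buflen)
		return (0);
	memcpy(buf, start, len);
	buf[len] = '\0';
	return (1);
}
```
]]></programlisting>

<para>A caller would request component 0, 1, 2, and so on, attempt
to execute each path, and stop at the first success or when the
helper returns 0.</para>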
|
|
</sect2>
|
|
</sect1>
|
|
</chapter>
|