Add a chapter about the FreeBSD Boot Process and Kernel
Initialization. Much like the rest of the Developer's Handbook, this needs a lot of work, but is better than nothing. PR: docs/39471
parent ea98ed6770
commit 6800086bfb
Notes: svn2git 2020-12-08 03:00:23 +00:00
svn path=/head/; revision=13723
2 changed files with 1940 additions and 0 deletions
							
								
								
									
970  en_US.ISO8859-1/books/arch-handbook/boot/chapter.sgml  Normal file
							
						
						
									
										970
									
								
								en_US.ISO8859-1/books/arch-handbook/boot/chapter.sgml
									
										
									
									
									
										Normal file
									
								
							|  | @ -0,0 +1,970 @@ | |||
| <!-- | ||||
| The FreeBSD Documentation Project | ||||
| 
 | ||||
| Copyright (c) 2002 Sergey Lyubka <devnull@uptsoft.com> | ||||
| All rights reserved | ||||
| $FreeBSD$ | ||||
| --> | ||||
| 
 | ||||
| <chapter id="boot"> | ||||
|   <chapterinfo> | ||||
|     <authorgroup> | ||||
|       <author> | ||||
|         <firstname>Sergey</firstname> | ||||
| 	<surname>Lyubka</surname> | ||||
| 	<contrib>Contributed by </contrib> | ||||
|       </author> <!-- devnull@uptsoft.com  12 Jun 2002 --> | ||||
|     </authorgroup> | ||||
|   </chapterinfo> | ||||
|   <title>Bootstrapping and kernel initialization</title> | ||||
| 
 | ||||
|   <sect1> | ||||
|     <title>Synopsis</title> | ||||
| 
 | ||||
|     <para>This chapter is an overview of the boot and system | ||||
|       initialization process, starting from the BIOS (firmware) POST, | ||||
|       to the first user process creation.  Since the initial steps of | ||||
|       system startup are very architecture dependent, the IA-32 | ||||
|       architecture is used as an example.</para> | ||||
|   </sect1> | ||||
| 
 | ||||
|   <sect1> | ||||
|     <title>Overview</title> | ||||
| 
 | ||||
|     <para>A computer running FreeBSD can boot by several methods, | ||||
|       although the most common method, booting from a harddisk where | ||||
|       the OS is installed, will be discussed here.  The boot process | ||||
|       is divided into several steps:</para> | ||||
| 
 | ||||
|     <itemizedlist> | ||||
|       <listitem><para>BIOS POST</para></listitem> | ||||
|       <listitem><para>boot0 stage</para></listitem> | ||||
|       <listitem><para>boot2 stage</para></listitem> | ||||
|       <listitem><para>loader stage</para></listitem> | ||||
|       <listitem><para>kernel initialization</para></listitem> | ||||
|     </itemizedlist> | ||||
| 
 | ||||
|     <para>The boot0 and boot2 stages are also referred to as | ||||
|       <emphasis>bootstrap stages 1 and 2</emphasis> in &man.boot.8; as | ||||
|       the first steps in FreeBSD's 3-stage bootstrapping procedure. | ||||
|       Various information is printed on the screen at each stage, so | ||||
|       visually you may recognize them using the table that follows. | ||||
|       Please note that the actual data may differ from machine to | ||||
|       machine:</para> | ||||
| 
 | ||||
|     <informaltable> | ||||
|       <tgroup cols="2"> | ||||
|         <tbody> | ||||
|           <row> | ||||
|             <entry><para>may vary</para></entry> <entry><para>BIOS | ||||
|             (firmware) messages</para></entry> | ||||
|           </row> | ||||
|           <row> | ||||
|             <entry><para> | ||||
| <screen>F1    FreeBSD | ||||
| F2    BSD | ||||
| F5    Disk 2</screen> | ||||
|             </para></entry> | ||||
|             <entry><para>boot0</para></entry> | ||||
|           </row> | ||||
|           <row> | ||||
|             <entry><para> | ||||
| <screen>>>FreeBSD/i386 BOOT | ||||
| Default: 1:ad(1,a)/boot/loader | ||||
| boot:</screen> | ||||
|             </para></entry> | ||||
| 
 | ||||
|             <entry><para>boot2<footnote><para>This prompt will appear | ||||
|               if the user presses a key just after selecting an OS to | ||||
|               boot at the boot0 | ||||
|               stage.</para></footnote></para></entry> | ||||
|           </row> | ||||
|           <row> | ||||
|             <entry><para> | ||||
| <screen>BTX loader 1.0 BTX version is 1.01 | ||||
| BIOS drive A: is disk0 | ||||
| BIOS drive C: is disk1 | ||||
| BIOS 639kB/64512kB available memory | ||||
| FreeBSD/i386 bootstrap loader, Revision 0.8 | ||||
| Console internal video/keyboard | ||||
| (jkh@bento.freebsd.org, Mon Nov 20 11:41:23 GMT 2000) | ||||
| /kernel text=0x1234 data=0x2345 syms=[0x4+0x3456]  | ||||
| Hit [Enter] to boot immediately, or any other key for command prompt | ||||
| Booting [kernel] in 9 seconds..._</screen> | ||||
|             </para></entry> | ||||
|             <entry><para>loader</para></entry> | ||||
|           </row> | ||||
|           <row> | ||||
|             <entry><para> | ||||
| <screen>Copyright (c) 1992-2002 The FreeBSD Project. | ||||
| Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994 | ||||
|         The Regents of the University of California. All rights reserved. | ||||
| FreeBSD 4.6-RC #0: Sat May  4 22:49:02 GMT 2002 | ||||
|     devnull@kukas:/usr/obj/usr/src/sys/DEVNULL | ||||
| Timecounter "i8254"  frequency 1193182 Hz</screen> | ||||
|             </para></entry> | ||||
|             <entry><para>kernel</para></entry> | ||||
|           </row> | ||||
|         </tbody> | ||||
|       </tgroup> | ||||
|     </informaltable> | ||||
| 
 | ||||
|   </sect1> | ||||
| 
 | ||||
|   <sect1> | ||||
|     <title>BIOS POST</title> | ||||
| 
 | ||||
|     <para>When the PC powers on, the processor's registers are set | ||||
|       to some predefined values.  One of the registers is the | ||||
|       <emphasis>instruction pointer</emphasis> register, and its value | ||||
|       after a power on is well defined: it is a 32-bit value of | ||||
|       0xfffffff0.  The instruction pointer register points to code to | ||||
|       be executed by the processor.  Another register is the | ||||
|       <literal>cr0</literal> 32-bit control register.  One of cr0's | ||||
|       bits, the PE (Protection Enable) bit, indicates whether the | ||||
|       processor is running in protected or real mode.  Since this bit | ||||
|       is cleared at boot time, the processor boots in real mode.  Real | ||||
|       mode means, among other things, that linear and physical | ||||
|       addresses are identical.</para> | ||||
| 
 | ||||
|     <para>The value of 0xfffffff0 is slightly less than 4 GB, so unless | ||||
|       the machine has 4 GB of physical memory, it cannot point to a valid | ||||
|       memory address.  The computer's hardware translates this address | ||||
|       so that it points to a BIOS memory block.</para> | ||||
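As a quick check of the arithmetic, the following C sketch computes how far the IA-32 reset address 0xfffffff0 lies below the 4 GB boundary. The names `RESET_VECTOR` and `bytes_below_4g` are made up for this illustration and do not come from any FreeBSD source file.

```c
#include <stdint.h>

/* Address held by the instruction pointer after power-on on IA-32. */
#define RESET_VECTOR 0xfffffff0u

/* Number of bytes between an address and the 4 GB boundary (2^32). */
uint64_t
bytes_below_4g(uint32_t addr)
{
	return ((uint64_t)1 << 32) - addr;
}
```

Calling `bytes_below_4g(RESET_VECTOR)` yields 16, which is why this location has room only for a jump into the BIOS code proper.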
| 
 | ||||
|     <para>BIOS stands for <emphasis>Basic Input Output | ||||
|       System</emphasis>, and it is a chip on the motherboard that has | ||||
|       a relatively small amount of read-only memory (ROM).  This | ||||
|       memory contains various low-level routines that are specific to | ||||
|       the hardware supplied with the motherboard.  So, the processor | ||||
|       will first jump to the address 0xfffffff0, which really resides | ||||
|       in the BIOS's memory.  Usually this address contains a jump | ||||
|       instruction to the BIOS's POST routines.</para> | ||||
| 
 | ||||
|     <para>POST stands for <emphasis>Power On Self Test</emphasis>. | ||||
|       This is a set of routines including the memory check, system bus | ||||
|       check and other low-level initialization so that the CPU can set | ||||
|       up the computer properly.  An important step at this stage is | ||||
|       determining the boot device.  All modern BIOSes allow the boot | ||||
|       device to be set manually, so you can boot from a floppy, | ||||
|       CD-ROM, harddisk, etc.</para> | ||||
| 
 | ||||
|     <para>The very last thing in the POST is the <literal>INT | ||||
|       0x19</literal> instruction.  That instruction reads 512 bytes | ||||
|       from the first sector of the boot device into memory at address | ||||
|       0x7c00.  The term <emphasis>first sector</emphasis> originates | ||||
|       from harddrive architecture, where the magnetic plate is divided | ||||
|       into a number of cylindrical tracks.  Tracks are numbered, and | ||||
|       every track is divided into a number (usually 64) of sectors. | ||||
|       Track number 0 is the outermost on the magnetic plate, and | ||||
|       sector 1, the first sector, has a special meaning (tracks, or | ||||
|       cylinders, are numbered starting from 0, but sectors are | ||||
|       numbered starting from 1).  It is also called the Master Boot | ||||
|       Record, or MBR.  The remaining | ||||
|       sectors on the first track are never used <footnote><para>Some | ||||
|       utilities such as &man.disklabel.8; may store the information in | ||||
|       this area, mostly in the second | ||||
|       sector.</para></footnote>.</para> | ||||
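The numbering convention described above (cylinders from 0, sectors from 1) can be sketched in C as a conversion from a cylinder/head/sector triple to a zero-based linear sector number. This is only an illustration: the `chs_geometry` structure, the function name, and the head count are assumptions introduced here, since the text does not discuss heads and real geometry values come from the BIOS.

```c
#include <stdint.h>

/* Hypothetical disk geometry; real values are reported by the BIOS. */
struct chs_geometry {
	uint32_t heads;			/* tracks per cylinder */
	uint32_t sectors_per_track;	/* sectors on each track */
};

/*
 * Convert a cylinder/head/sector triple to a zero-based linear sector
 * number.  Cylinders and heads count from 0, sectors count from 1.
 */
uint32_t
chs_to_lba(const struct chs_geometry *g, uint32_t c, uint32_t h, uint32_t s)
{
	return (c * g->heads + h) * g->sectors_per_track + (s - 1);
}
```

With any geometry, cylinder 0, head 0, sector 1 maps to linear sector 0, which is where the MBR lives.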
| 
 | ||||
|   </sect1> | ||||
| 
 | ||||
|   <sect1> | ||||
|     <title>boot0 stage</title> | ||||
| 
 | ||||
|     <para>Take a look at the file <filename>/boot/boot0</filename>. | ||||
|       This is a small 512-byte file, and it is exactly what FreeBSD's | ||||
|       installation procedure wrote to your harddisk's MBR if you chose | ||||
|       the "bootmanager" option at installation time.</para> | ||||
| 
 | ||||
|     <para>As mentioned previously, the <literal>INT 0x19</literal> | ||||
|       instruction loads an MBR, i.e. the <filename>boot0</filename> | ||||
|       content, into the memory at address 0x7c00.  Taking a look at | ||||
|       the file <filename>sys/boot/i386/boot0/boot0.s</filename> | ||||
|       gives an idea of what is happening there: this is the boot | ||||
|       manager, which is an awesome piece of code written by Robert | ||||
|       Nordier.</para> | ||||
| 
 | ||||
|     <para>The MBR, or, <filename>boot0</filename>, has a special | ||||
|       structure starting from offset 0x1be, called the | ||||
|       <emphasis>partition table</emphasis>.  It has 4 records of 16 | ||||
|       bytes each, called <emphasis>partition records</emphasis>, which | ||||
|       represent how the harddisk(s) are partitioned, or, in FreeBSD's | ||||
|       terminology, sliced.  One byte of those 16 says whether a | ||||
|       partition (slice) is bootable or not.  Exactly one record must | ||||
|       have that flag set, otherwise <filename>boot0</filename>'s code | ||||
|       will refuse to proceed.</para> | ||||
| 
 | ||||
|     <para>A partition record has the following fields:</para> | ||||
| 
 | ||||
|     <itemizedlist> | ||||
|       <listitem><para>the 1-byte filesystem type</para></listitem> | ||||
|       <listitem><para>the 1-byte bootable flag</para></listitem> | ||||
|       <listitem><para>the 6 byte descriptor in CHS | ||||
|         format</para></listitem> | ||||
|       <listitem><para>the 8 byte descriptor in LBA | ||||
|         format</para></listitem> | ||||
|     </itemizedlist> | ||||
| 
 | ||||
|     <para>A partition record descriptor has the information about | ||||
|       where exactly the partition resides on the drive.  Both | ||||
|       descriptors, LBA and CHS, describe the same information, but in | ||||
|       different ways: LBA (Logical Block Addressing) has the starting | ||||
|       sector for the partition and the partition's length, while CHS | ||||
|       (Cylinder Head Sector) has coordinates for the first and last | ||||
|       sectors of the partition.</para> | ||||
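To make the record layout concrete, here is a minimal C sketch that decodes one partition record from a 512-byte MBR image. It assumes the conventional on-disk byte order of the PC partition record (flag at byte 0, type at byte 4, LBA start at bytes 8-11, sector count at bytes 12-15, all little-endian); the structure and function names are illustrative and are not taken from boot0.

```c
#include <stdint.h>

#define MBR_SIZE	512	/* one sector */
#define PTABLE_OFFSET	0x1be	/* start of the partition table */
#define PTABLE_ENTRIES	4
#define PENTRY_SIZE	16
#define FLAG_ACTIVE	0x80	/* value of the bootable flag */

/* Decoded view of one 16-byte partition record (illustrative names). */
struct part_record {
	uint8_t  flag;		/* FLAG_ACTIVE if bootable, 0 otherwise */
	uint8_t  type;		/* filesystem (partition) type */
	uint32_t lba_start;	/* first sector of the partition */
	uint32_t lba_count;	/* length of the partition in sectors */
};

/* Read a little-endian 32-bit value from a byte buffer. */
uint32_t
le32(const uint8_t *p)
{
	return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
	    (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Decode record number idx (0..3, not checked) from an MBR image. */
struct part_record
read_part_record(const uint8_t mbr[MBR_SIZE], int idx)
{
	const uint8_t *e = mbr + PTABLE_OFFSET + idx * PENTRY_SIZE;
	struct part_record r;

	r.flag = e[0];			/* byte 0: bootable flag */
					/* bytes 1-3: CHS of first sector */
	r.type = e[4];			/* byte 4: type */
					/* bytes 5-7: CHS of last sector */
	r.lba_start = le32(e + 8);	/* bytes 8-11: starting sector */
	r.lba_count = le32(e + 12);	/* bytes 12-15: length in sectors */
	return r;
}
```

The CHS fields are skipped because the LBA pair already carries the same information in a simpler form.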
| 
 | ||||
|     <para>The boot manager scans the partition table and prints the | ||||
|       menu on the screen so the user can select what disk and what | ||||
|       slice to boot.  When the user presses an appropriate key, | ||||
|       <filename>boot0</filename> performs the following | ||||
|       actions:</para> | ||||
| 
 | ||||
|     <itemizedlist> | ||||
|       <listitem><para>modifies the bootable flag for the selected | ||||
|         partition to make it bootable, and clears the | ||||
|         previous</para></listitem> | ||||
| 
 | ||||
|       <listitem><para>saves itself to disk to remember what partition | ||||
|         (slice) has been selected, so that it is used as the default | ||||
|         on the next boot</para></listitem> | ||||
| 
 | ||||
|       <listitem><para>loads the first sector of the selected partition | ||||
|         (slice) into memory and jumps there</para></listitem> | ||||
|     </itemizedlist> | ||||
| 
 | ||||
|     <para>What kind of data should reside on the very first sector of | ||||
|       a bootable partition (slice), in our case, a FreeBSD slice?  As | ||||
|       you may have already guessed, it is | ||||
|       <filename>boot2</filename>.</para> | ||||
| 
 | ||||
|   </sect1> | ||||
| 
 | ||||
|   <sect1> | ||||
|     <title>boot2 stage</title> | ||||
| 
 | ||||
|     <para>You might wonder why boot2 comes after boot0, and not | ||||
|       boot1.  Actually, there is a 512-byte file called | ||||
|       <filename>boot1</filename> in the directory | ||||
|       <filename>/boot</filename> as well.  It is used for booting from | ||||
|       a floppy.  When booting from a floppy, | ||||
|       <filename>boot1</filename> plays the same role as | ||||
|       <filename>boot0</filename> for a harddisk: it locates boot2 and | ||||
|       runs it.</para> | ||||
| 
 | ||||
|     <para>You may have noticed that a file | ||||
|       <filename>/boot/mbr</filename> exists as well.  It is a | ||||
|       simplified version of boot0.  The code in | ||||
|       <filename>mbr</filename> does not provide a menu for the user; | ||||
|       it just blindly boots the partition marked active.</para> | ||||
| 
 | ||||
|     <para>The code implementing boot2 resides in | ||||
|       <filename>sys/boot/i386/boot2/</filename>, and the executable | ||||
|       itself is in <filename>/boot</filename>.  The files boot0 and | ||||
|       boot2 that are in <filename>/boot</filename> are not used by the | ||||
|       bootstrap, but by utilities such as | ||||
|       <application>boot0cfg</application>.  The actual position for | ||||
|       boot0 is in the MBR.  For boot2 it is the beginning of a | ||||
|       bootable FreeBSD slice.  These locations are not under the | ||||
|       filesystem's control, so they are invisible to commands like | ||||
|       <application>ls</application>.</para> | ||||
| 
 | ||||
|     <para>The main task for boot2 is to load the file | ||||
|       <filename>/boot/loader</filename>, which is the third stage in | ||||
|       the bootstrapping procedure.  The code in boot2 cannot use any | ||||
|       services like <function>open()</function> and | ||||
|       <function>read()</function>, since the kernel is not yet loaded. | ||||
|       It must scan the harddisk, knowing about the filesystem | ||||
|       structure, find the file <filename>/boot/loader</filename>, read | ||||
|       it into memory using a BIOS service, and then pass the execution | ||||
|       to the loader's entry point.</para> | ||||
| 
 | ||||
|     <para>Besides that, boot2 prompts for user input so the loader can | ||||
|       be booted from a different disk, unit, slice and partition.</para> | ||||
| 
 | ||||
|     <para>The boot2 binary is created in a special way:</para> | ||||
|     <programlisting><filename>sys/boot/i386/boot2/Makefile</filename> | ||||
| boot2: boot2.ldr boot2.bin ${BTX}/btx/btx | ||||
| 	btxld -v -E ${ORG2} -f bin -b ${BTX}/btx/btx -l boot2.ldr \ | ||||
| 		-o boot2.ld -P 1 boot2.bin</programlisting> | ||||
| 
 | ||||
|     <para>This Makefile snippet shows that &man.btxld.8; is used to | ||||
|       link the binary.  BTX, which stands for BooT eXtender, is a | ||||
|       piece of code that provides a protected mode environment for the | ||||
|       program, called the client, that it is linked with.  So boot2 is | ||||
|       a BTX client, i.e. it uses the service provided by BTX.</para> | ||||
| 
 | ||||
|     <para>The <application>btxld</application> utility is the linker. | ||||
|       It links two binaries together.  The difference between | ||||
|       &man.btxld.8; and &man.ld.1; is that | ||||
|       <application>ld</application> usually links object files into a | ||||
|       shared object or executable, while | ||||
|       <application>btxld</application> links an object file with the | ||||
|       BTX, producing the binary file suitable to be put on the | ||||
|       beginning of the partition for the system boot.</para> | ||||
| 
 | ||||
|     <para>boot0 passes the execution to BTX's entry point.  BTX then | ||||
|       switches the processor to protected mode, and prepares a simple | ||||
|       environment before calling the client.  This includes:</para> | ||||
| 
 | ||||
|     <itemizedlist> | ||||
|       <listitem><para>virtual v86 mode.  That means, the BTX is a v86 | ||||
|         monitor.  Real mode instructions like pushf, popf, cli, sti, if | ||||
|         called by the client, will work.</para></listitem> | ||||
| 
 | ||||
|       <listitem><para>Interrupt Descriptor Table (IDT) is set up so | ||||
|         all hardware interrupts are routed to the default BIOS's | ||||
|         handlers, and interrupt 0x30 is set up to be the syscall | ||||
|         gate.</para></listitem> | ||||
| 
 | ||||
|       <listitem><para>Two system calls: <function>exec</function> and | ||||
|         <function>exit</function>, are defined:</para> | ||||
| 
 | ||||
|     <programlisting><filename>sys/boot/i386/btx/lib/btxsys.s:</filename> | ||||
| 		.set INT_SYS,0x30		# Interrupt number | ||||
| # | ||||
| # System call: exit | ||||
| # | ||||
| __exit: 	xorl %eax,%eax			# BTX system | ||||
| 		int $INT_SYS			#  call 0x0 | ||||
| # | ||||
| # System call: exec | ||||
| # | ||||
| __exec: 	movl $0x1,%eax			# BTX system | ||||
| 		int $INT_SYS			#  call 0x1</programlisting></listitem> | ||||
|     </itemizedlist> | ||||
| 
 | ||||
|     <para>BTX creates a Global Descriptor Table (GDT):</para> | ||||
| 
 | ||||
|     <programlisting><filename>sys/boot/i386/btx/btx/btx.s:</filename> | ||||
| gdt:		.word 0x0,0x0,0x0,0x0		# Null entry | ||||
| 		.word 0xffff,0x0,0x9a00,0xcf	# SEL_SCODE | ||||
| 		.word 0xffff,0x0,0x9200,0xcf	# SEL_SDATA | ||||
| 		.word 0xffff,0x0,0x9a00,0x0	# SEL_RCODE | ||||
| 		.word 0xffff,0x0,0x9200,0x0	# SEL_RDATA | ||||
| 		.word 0xffff,MEM_USR,0xfa00,0xcf# SEL_UCODE | ||||
| 		.word 0xffff,MEM_USR,0xf200,0xcf# SEL_UDATA | ||||
| 		.word _TSSLM,MEM_TSS,0x8900,0x0 # SEL_TSS</programlisting> | ||||
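Each GDT entry above is written as four 16-bit words. The following C sketch unpacks such an entry according to the standard IA-32 segment descriptor layout, which makes it easier to see what the numbers mean. The structure and function names are invented for this example, and 0xa000 is used for MEM_USR, the value this chapter quotes for it.

```c
#include <stdint.h>

/* Fields of an IA-32 segment descriptor, unpacked for inspection. */
struct seg_desc {
	uint32_t base;		/* linear base address of the segment */
	uint32_t limit;		/* 20-bit limit, in bytes or 4 KB pages */
	unsigned dpl;		/* descriptor privilege level, 0..3 */
	unsigned present;	/* P bit */
	unsigned granular;	/* G bit: limit counts 4 KB pages if set */
};

/* Unpack a descriptor given as four 16-bit words, lowest word first. */
struct seg_desc
decode_desc(uint16_t w0, uint16_t w1, uint16_t w2, uint16_t w3)
{
	struct seg_desc d;

	/* Limit: bits 0-15 in word 0, bits 16-19 in the low nibble of word 3. */
	d.limit = (uint32_t)w0 | ((uint32_t)(w3 & 0x000f) << 16);
	/* Base: word 1, then the low byte of word 2, then the high byte of word 3. */
	d.base = (uint32_t)w1 | ((uint32_t)(w2 & 0x00ff) << 16) |
	    ((uint32_t)(w3 & 0xff00) << 16);
	/* Access byte is the high byte of word 2: P, DPL (2 bits), S, type. */
	d.dpl = (w2 >> 13) & 0x3;
	d.present = (w2 >> 15) & 0x1;
	/* Granularity flag is bit 7 of word 3. */
	d.granular = (w3 >> 7) & 0x1;
	return d;
}
```

For example, `decode_desc(0xffff, 0xa000, 0xfa00, 0xcf)` (the SEL_UCODE entry with MEM_USR taken as 0xa000) gives base 0xa000, limit 0xfffff with page granularity, and DPL 3, while the SEL_SCODE entry decodes to base 0 and DPL 0.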
| 
 | ||||
|     <para>The client's code and data start from address MEM_USR | ||||
|       (0xa000), and a selector (SEL_UCODE) points to the client's code | ||||
|       segment.  The SEL_UCODE descriptor has Descriptor Privilege | ||||
|       Level (DPL) 3, which is the lowest privilege level.  But the | ||||
|       <literal>INT 0x30</literal> instruction handler resides in a | ||||
|       segment pointed to by the SEL_SCODE (supervisor code) selector, | ||||
|       as shown from the code that creates an IDT:</para> | ||||
| 
 | ||||
|   <programlisting>		mov $SEL_SCODE,%dh		# Segment selector | ||||
| init.2: 	shr %bx				# Handle this int? | ||||
| 		jnc init.3			# No | ||||
| 		mov %ax,(%di)			# Set handler offset | ||||
| 		mov %dh,0x2(%di)		#  and selector | ||||
| 		mov %dl,0x5(%di)		# Set P:DPL:type | ||||
| 		add $0x4,%ax			# Next handler</programlisting> | ||||
| 
 | ||||
|     <para>So, when the client calls <function>__exec()</function>, the | ||||
|       code will be executed with the highest privileges.  This allows | ||||
|       the kernel to change the protected mode data structures, such as | ||||
|       page tables, GDT, IDT, etc later, if needed.</para> | ||||
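The IDT setup code stores a handler offset, a code segment selector, and a P:DPL:type byte into each 8-byte gate. A rough C equivalent, following the standard IA-32 gate layout, might look like the sketch below. The function name is hypothetical, and the attribute byte used in the usage note is only an example value, not one read from BTX.

```c
#include <stdint.h>
#include <string.h>

#define GATE_SIZE 8	/* size of one IDT entry in protected mode */

/*
 * Fill in one IDT gate.  `selector` names the code segment that holds
 * the handler, `offset` is the handler address within that segment,
 * and `attr` is the P:DPL:type byte.
 */
void
set_gate(uint8_t gate[GATE_SIZE], uint32_t offset, uint16_t selector,
    uint8_t attr)
{
	memset(gate, 0, GATE_SIZE);
	gate[0] = offset & 0xff;		/* offset bits 0-7 */
	gate[1] = (offset >> 8) & 0xff;		/* offset bits 8-15 */
	gate[2] = selector & 0xff;		/* selector, low byte */
	gate[3] = selector >> 8;		/* selector, high byte */
	gate[5] = attr;				/* P:DPL:type */
	gate[6] = (offset >> 16) & 0xff;	/* offset bits 16-23 */
	gate[7] = (offset >> 24) & 0xff;	/* offset bits 24-31 */
}
```

A gate created with selector 0x8 (the first descriptor after the null entry, SEL_SCODE in the table above) and, say, an attribute byte of 0xee (present, DPL 3, 32-bit interrupt gate) can be invoked from privilege level 3 but runs its handler in the supervisor code segment.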
| 
 | ||||
|     <para>boot2 defines an important structure, <literal>struct | ||||
|       bootinfo</literal>.  This structure is initialized by boot2 and | ||||
|       passed to the loader, and then further to the kernel.  Some | ||||
|       fields of this structure are set by boot2, the rest by the | ||||
|       loader.  This structure, among other information, contains the | ||||
|       kernel filename, BIOS harddisk geometry, BIOS drive number for | ||||
|       boot device, physical memory available, <literal>envp</literal> | ||||
|       pointer etc.  The definition for it is:</para> | ||||
| 
 | ||||
|     <programlisting><filename>/usr/include/machine/bootinfo.h</filename> | ||||
| struct bootinfo { | ||||
| 	u_int32_t	bi_version; | ||||
| 	u_int32_t	bi_kernelname;		/* represents a char * */ | ||||
| 	u_int32_t	bi_nfs_diskless;	/* struct nfs_diskless * */ | ||||
| 				/* End of fields that are always present. */ | ||||
| #define	bi_endcommon	bi_n_bios_used | ||||
| 	u_int32_t	bi_n_bios_used; | ||||
| 	u_int32_t	bi_bios_geom[N_BIOS_GEOM]; | ||||
| 	u_int32_t	bi_size; | ||||
| 	u_int8_t	bi_memsizes_valid; | ||||
| 	u_int8_t	bi_bios_dev;		/* bootdev BIOS unit number */ | ||||
| 	u_int8_t	bi_pad[2]; | ||||
| 	u_int32_t	bi_basemem; | ||||
| 	u_int32_t	bi_extmem; | ||||
| 	u_int32_t	bi_symtab;		/* struct symtab * */ | ||||
| 	u_int32_t	bi_esymtab;		/* struct symtab * */ | ||||
| 				/* Items below only from advanced bootloader */ | ||||
| 	u_int32_t	bi_kernend;		/* end of kernel space */ | ||||
| 	u_int32_t	bi_envp;		/* environment */ | ||||
| 	u_int32_t	bi_modulep;		/* preloaded modules */ | ||||
| };</programlisting> | ||||
| 
 | ||||
|   <para>boot2 enters into an infinite loop waiting for user input, | ||||
|     then calls <function>load()</function>.  If the user does not | ||||
|     press anything, the loop breaks on a timeout, so | ||||
|     <function>load()</function> will load the default file | ||||
|     (<filename>/boot/loader</filename>).  Functions <function>ino_t | ||||
|     lookup(char *filename)</function> and <function>int xfsread(ino_t | ||||
|     inode, void *buf, size_t nbyte)</function> are used to read the | ||||
|     content of a file into memory.  <filename>/boot/loader</filename> | ||||
|     is an ELF binary whose ELF header is preceded by | ||||
|     a.out's <literal>struct exec</literal> structure. | ||||
|     <function>load()</function> scans the loader's ELF header, loads | ||||
|     the content of <filename>/boot/loader</filename> into memory, and | ||||
|     passes the execution to the loader's entry:</para> | ||||
| 
 | ||||
|   <programlisting><filename>sys/boot/i386/boot2/boot2.c:</filename> | ||||
|     __exec((caddr_t)addr, RB_BOOTINFO | (opts &amp; RBX_MASK), | ||||
| 	   MAKEBOOTDEV(dev_maj[dsk.type], 0, dsk.slice, dsk.unit, dsk.part), | ||||
| 	   0, 0, 0, VTOP(&amp;bootinfo));</programlisting> | ||||
| 
 | ||||
|   </sect1> | ||||
| 
 | ||||
|   <sect1> | ||||
|     <title><application>loader</application> stage</title> | ||||
| 
 | ||||
|     <para><application>loader</application> is a BTX client as well. | ||||
|       I will not describe it here in detail, since there is a | ||||
|       comprehensive manpage written by Mike Smith, &man.loader.8;.  The | ||||
|       underlying mechanisms and BTX were discussed above.</para> | ||||
| 
 | ||||
|     <para>The main task for the loader is to boot the kernel.  Once | ||||
|       the kernel is loaded into memory, it is called by the | ||||
|       loader:</para> | ||||
| 
 | ||||
|   <programlisting><filename>sys/boot/common/boot.c:</filename> | ||||
|     /* Call the exec handler from the loader matching the kernel */ | ||||
|     module_formats[km->m_loader]->l_exec(km);</programlisting> | ||||
|   </sect1> | ||||
| 
 | ||||
|   <sect1> | ||||
|     <title>Kernel initialization</title> | ||||
| 
 | ||||
|     <para>Where exactly does the loader pass execution, i.e. what is | ||||
|       the kernel's actual entry point?  Let us take a | ||||
|       look at the command that links the kernel:</para> | ||||
| 
 | ||||
|     <programlisting><filename>sys/conf/Makefile.i386:</filename> | ||||
| ld -elf -Bdynamic -T /usr/src/sys/conf/ldscript.i386  -export-dynamic \ | ||||
| -dynamic-linker /red/herring -o kernel -X locore.o \ | ||||
| &lt;lots of kernel .o files&gt;</programlisting> | ||||
| 
 | ||||
|     <para>A few interesting things can be seen in this line.  First, | ||||
|       the kernel is an ELF dynamically linked binary, but the dynamic | ||||
|       linker for the kernel is <filename>/red/herring</filename>, which | ||||
|       is definitely a bogus file.  Second, taking a look at the file | ||||
|       <filename>sys/conf/ldscript.i386</filename> gives an idea about | ||||
|       what <application>ld</application> options are used when | ||||
|       linking a kernel.  Reading through the first few lines, the | ||||
|       string</para> | ||||
| 
 | ||||
|   <programlisting><filename>sys/conf/ldscript.i386:</filename> | ||||
| ENTRY(btext)</programlisting> | ||||
| 
 | ||||
|     <para>says that a kernel's entry point is the symbol `btext'. | ||||
|       This symbol is defined in <filename>locore.s</filename>:</para> | ||||
| 
 | ||||
|   <programlisting><filename>sys/i386/i386/locore.s:</filename> | ||||
| 	.text | ||||
| /********************************************************************** | ||||
|  * | ||||
|  * This is where the bootblocks start us, set the ball rolling... | ||||
|  * | ||||
|  */ | ||||
| NON_GPROF_ENTRY(btext)</programlisting> | ||||
| 
 | ||||
|     <para>First, the EFLAGS register is set to a | ||||
|       predefined value of 0x00000002, and then the %fs and %gs | ||||
|       segment registers are initialized:</para> | ||||
| 
 | ||||
|     <programlisting><filename>sys/i386/i386/locore.s</filename> | ||||
| /* Don't trust what the BIOS gives for eflags. */ | ||||
| 	pushl	$PSL_KERNEL | ||||
| 	popfl | ||||
| 
 | ||||
| /* | ||||
|  * Don't trust what the BIOS gives for %fs and %gs.  Trust the bootstrap | ||||
|  * to set %cs, %ds, %es and %ss. | ||||
|  */ | ||||
| 	mov	%ds, %ax | ||||
| 	mov	%ax, %fs | ||||
| 	mov	%ax, %gs</programlisting> | ||||
| 
 | ||||
|     <para>btext calls the routines | ||||
|       <function>recover_bootinfo()</function>, | ||||
|       <function>identify_cpu()</function>, | ||||
|       <function>create_pagetables()</function>, which are also defined | ||||
|       in <filename>locore.s</filename>.  Here is a description of what | ||||
|       they do:</para> | ||||
| 
 | ||||
|     <informaltable> | ||||
|       <tgroup cols=2 align=left> | ||||
|       <tbody> | ||||
|         <row> | ||||
|           <entry><function>recover_bootinfo</function></entry> | ||||
| 
 | ||||
|           <entry>This routine parses the parameters to the kernel | ||||
|             passed from the bootstrap.  The kernel may have been | ||||
|             booted in 3 ways: by the loader, described above, by the | ||||
|             old disk boot blocks, and by the old diskless boot | ||||
|             procedure.  This function determines the booting method, | ||||
|             and stores the <literal>struct bootinfo</literal> | ||||
|             structure into the kernel memory.</entry> | ||||
|         </row> | ||||
|         <row> | ||||
|           <entry><function>identify_cpu</function></entry> <entry>This | ||||
|           function tries to find out what CPU it is running on, | ||||
|           storing the value found in a variable | ||||
|           <varname>_cpu</varname>.</entry> | ||||
|         </row> | ||||
|         <row> | ||||
|           <entry><function>create_pagetables</function></entry> | ||||
|           <entry>This function allocates and fills out a Page Table Directory | ||||
|           at the top of the kernel memory area.</entry> | ||||
|         </row> | ||||
|       </tbody> | ||||
|       </tgroup> | ||||
|     </informaltable> | ||||
|     <para>The next step is enabling VME, if the CPU supports it:</para> | ||||
| 
 | ||||
|     <programlisting>	testl	$CPUID_VME, R(_cpu_feature) | ||||
| 	jz	1f | ||||
| 	movl	%cr4, %eax | ||||
| 	orl	$CR4_VME, %eax | ||||
| 	movl	%eax, %cr4</programlisting> | ||||
| 
 | ||||
|     <para>Then, enabling paging:</para> | ||||
|     <programlisting>/* Now enable paging */ | ||||
| 	movl	R(_IdlePTD), %eax | ||||
| 	movl	%eax,%cr3			/* load ptd addr into mmu */ | ||||
| 	movl	%cr0,%eax			/* get control word */ | ||||
| 	orl	$CR0_PE|CR0_PG,%eax		/* enable paging */ | ||||
| 	movl	%eax,%cr0			/* and let's page NOW! */</programlisting> | ||||
| 
 | ||||
|     <para>Because paging is now enabled, the next three lines of code | ||||
|       perform a jump to continue the execution in the virtualized | ||||
|       address space:</para> | ||||
| 
 | ||||
|     <programlisting>	pushl	$begin				/* jump to high virtualized address */ | ||||
| 	ret | ||||
| 
 | ||||
| /* now running relocated at KERNBASE where the system is linked to run */ | ||||
| begin:</programlisting> | ||||
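The paging switch shown earlier comes down to setting two bits in the cr0 control register. The C sketch below models that step on a plain integer. The bit positions are the architectural ones for IA-32 (PE is bit 0, PG is bit 31); the helper function names are assumptions made for this example, and real code has to use the privileged `movl` instructions from the listing to touch cr0 itself.

```c
#include <stdint.h>

#define CR0_PE	0x00000001u	/* Protection Enable (bit 0) */
#define CR0_PG	0x80000000u	/* Paging (bit 31) */

/* Value to write back to cr0 in order to turn paging on. */
uint32_t
cr0_enable_paging(uint32_t cr0)
{
	return cr0 | CR0_PE | CR0_PG;
}

/* Paging is only in effect when both PE and PG are set. */
int
cr0_paging_enabled(uint32_t cr0)
{
	return (cr0 & (CR0_PE | CR0_PG)) == (CR0_PE | CR0_PG);
}
```

All other bits of the register are preserved, which mirrors the read, `orl`, write sequence in the assembly.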
| 
 | ||||
|     <para>The function <function>init386()</function> is called with | ||||
|       a pointer to the first free physical page, and after that | ||||
|       <function>mi_startup()</function> is called. | ||||
|       <function>init386()</function> is an architecture dependent | ||||
|       initialization function, and <function>mi_startup()</function> | ||||
|       is an architecture independent one (the 'mi_' prefix stands for | ||||
|       Machine Independent).  The kernel never returns from | ||||
|       <function>mi_startup()</function>, and by calling it, the | ||||
|       kernel finishes booting:</para> | ||||
| 
 | ||||
|     <programlisting><filename>sys/i386/i386/locore.s:</filename> | ||||
| 	movl	physfree, %esi | ||||
| 	pushl	%esi				/* value of first for init386(first) */ | ||||
| 	call	_init386			/* wire 386 chip for unix operation */ | ||||
| 	call	_mi_startup			/* autoconfiguration, mountroot etc */ | ||||
| 	hlt		/* never returns to here */</programlisting> | ||||
| 
 | ||||
|     <sect2> | ||||
|       <title><function>init386()</function></title> | ||||
| 
 | ||||
|       <para><function>init386()</function> is defined in | ||||
|         <filename>sys/i386/i386/machdep.c</filename> and performs | ||||
|         low-level initialization, specific to the i386 chip.  The | ||||
|         switch to protected mode was performed by the loader.  The | ||||
|         loader has created the very first task, in which the kernel | ||||
|         continues to operate.  Before going straight to the | ||||
|         code, I will enumerate the tasks the processor must complete | ||||
|         to initialize protected mode execution:</para> | ||||
| 
 | ||||
|       <itemizedlist> | ||||
|         <listitem><para>Initialize the kernel tunable parameters, passed from | ||||
|         the bootstrapping program.</para></listitem> | ||||
|         <listitem><para>Prepare the GDT.</para></listitem> | ||||
|         <listitem><para>Prepare the IDT.</para></listitem> | ||||
|         <listitem><para>Initialize the system console.</para></listitem> | ||||
|         <listitem><para>Initialize the DDB, if it is compiled into the kernel. | ||||
|         </para></listitem> | ||||
|         <listitem><para>Initialize the TSS.</para></listitem> | ||||
|         <listitem><para>Prepare the LDT.</para></listitem> | ||||
|         <listitem><para>Set up proc0's pcb.</para></listitem> | ||||
| 
 | ||||
|       </itemizedlist> | ||||
| 
 | ||||
|       <para>What <function>init386()</function> first does is | ||||
|         initialize the tunable parameters passed from the bootstrap.  This | ||||
|         is done by setting the environment pointer (envp) and calling | ||||
|         <function>init_param1()</function>.  The envp pointer has been | ||||
|         passed from the loader in the <literal>bootinfo</literal> | ||||
|         structure:</para> | ||||
| 
 | ||||
|       <programlisting><filename>sys/i386/i386/machdep.c:</filename> | ||||
| 		kern_envp = (caddr_t)bootinfo.bi_envp + KERNBASE; | ||||
| 
 | ||||
| 	/* Init basic tunables, hz etc */ | ||||
| 	init_param1();</programlisting> | ||||
| 
 | ||||
|       <para><function>init_param1()</function> is defined in | ||||
|         <filename>sys/kern/subr_param.c</filename>.  That file has a | ||||
|         number of sysctls, and two functions, | ||||
|         <function>init_param1()</function> and | ||||
|         <function>init_param2()</function>, that are called from | ||||
|         <function>init386()</function>:</para> | ||||
| 
 | ||||
|       <programlisting><filename>sys/kern/subr_param.c</filename> | ||||
| 	hz = HZ; | ||||
| 	TUNABLE_INT_FETCH("kern.hz", &hz);</programlisting> | ||||
| 
 | ||||
|       <para>TUNABLE_<typename>_FETCH is used to fetch the value | ||||
|         from the environment:</para> | ||||
| 
 | ||||
|     <programlisting><filename>/usr/src/sys/sys/kernel.h</filename> | ||||
| #define	TUNABLE_INT_FETCH(path, var)	getenv_int((path), (var)) | ||||
| </programlisting> | ||||
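<para>The idea behind fetching an integer tunable can be sketched in plain C.  This is only an illustration, not the kernel's implementation: the names <literal>sk_getenv()</literal> and <literal>sk_getenv_int()</literal> are hypothetical, and the environment is assumed to be a sequence of NUL-terminated "name=value" strings ended by an empty string.  As in the <literal>kern.hz</literal> example, the caller's default is left untouched when the variable is not set.</para>

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical model of the environment handed over by the loader:
 * "name=value" strings, each NUL-terminated, with an empty string
 * marking the end.  This layout is an assumption of the sketch.
 */
static const char *
sk_getenv(const char *env, const char *name)
{
	size_t len = strlen(name);

	for (; *env != '\0'; env += strlen(env) + 1) {
		/* Match the whole name, followed directly by '='. */
		if (strncmp(env, name, len) == 0 && env[len] == '=')
			return env + len + 1;
	}
	return NULL;
}

/*
 * Store the tunable in *var and return 1.  If the variable is unset
 * or not a number, leave *var untouched and return 0.
 */
static int
sk_getenv_int(const char *env, const char *name, int *var)
{
	const char *value = sk_getenv(env, name);
	char *end;
	long parsed;

	assert(var != NULL);
	if (value == NULL || *value == '\0')
		return 0;
	parsed = strtol(value, &end, 0);	/* base 0 accepts 0x... too */
	if (*end != '\0')
		return 0;
	*var = (int)parsed;
	return 1;
}
```

<para>A caller would first assign the compiled-in default, for example <literal>int hz = 100;</literal>, and then call <literal>sk_getenv_int(env, "kern.hz", &hz)</literal>, so a missing variable simply keeps the default.</para>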
| 
 | ||||
|       <para>The "kern.hz" sysctl is the system clock tick rate.  Along with | ||||
|         this, the following sysctls are set by | ||||
|         <function>init_param1()</function>: <literal>kern.maxswzone, | ||||
|         kern.maxbcache, kern.maxtsiz, kern.dfldsiz, kern.dflssiz, | ||||
|         kern.maxssiz, kern.sgrowsiz</literal>.</para> | ||||
| 
 | ||||
|       <para>Then <function>init386()</function> prepares the Global | ||||
|         Descriptor Table (GDT).  Every task on an x86 runs in | ||||
|         its own virtual address space, and this space is addressed by | ||||
|         a segment:offset pair.  Say, for instance, the current | ||||
|         instruction to be executed by the processor lies at CS:EIP; | ||||
|         then the linear virtual address of that instruction is | ||||
|         "the virtual address of code segment CS" + EIP.  For | ||||
|         convenience, segments begin at virtual address 0 and end at | ||||
|         the 4 GB boundary.  Therefore, the instruction's linear virtual | ||||
|         address in this example is just the value of EIP. | ||||
|         Segment registers such as CS, DS etc. hold selectors, | ||||
|         i.e. indexes into the GDT (to be more precise, the index is not | ||||
|         the selector itself, but the INDEX field of a selector). | ||||
|         FreeBSD's GDT holds descriptors for 15 selectors per | ||||
|         CPU:</para> | ||||
| 
 | ||||
|       <programlisting><filename>sys/i386/i386/machdep.c:</filename> | ||||
| union descriptor gdt[NGDT * MAXCPU];	/* global descriptor table */ | ||||
| 
 | ||||
| <filename>sys/i386/include/segments.h:</filename> | ||||
| /* | ||||
|  * Entries in the Global Descriptor Table (GDT) | ||||
|  */ | ||||
| #define	GNULL_SEL	0	/* Null Descriptor */ | ||||
| #define	GCODE_SEL	1	/* Kernel Code Descriptor */ | ||||
| #define	GDATA_SEL	2	/* Kernel Data Descriptor */ | ||||
| #define	GPRIV_SEL	3	/* SMP Per-Processor Private Data */ | ||||
| #define	GPROC0_SEL	4	/* Task state process slot zero and up */ | ||||
| #define	GLDT_SEL	5	/* LDT - eventually one per process */ | ||||
| #define	GUSERLDT_SEL	6	/* User LDT */ | ||||
| #define	GTGATE_SEL	7	/* Process task switch gate */ | ||||
| #define	GBIOSLOWMEM_SEL	8	/* BIOS low memory access (must be entry 8) */ | ||||
| #define	GPANIC_SEL	9	/* Task state to consider panic from */ | ||||
| #define GBIOSCODE32_SEL	10	/* BIOS interface (32bit Code) */ | ||||
| #define GBIOSCODE16_SEL	11	/* BIOS interface (16bit Code) */ | ||||
| #define GBIOSDATA_SEL	12	/* BIOS interface (Data) */ | ||||
| #define GBIOSUTIL_SEL	13	/* BIOS interface (Utility) */ | ||||
| #define GBIOSARGS_SEL	14	/* BIOS interface (Arguments) */</programlisting> | ||||
| 
 | ||||
|       <para>Note that those #defines are not selectors themselves, but | ||||
|         just the INDEX field of a selector, so they are exactly the | ||||
|         indices into the GDT.  For example, the actual selector for the | ||||
|         kernel code (GCODE_SEL) has the value 0x08.</para> | ||||
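<para>The relation between a GDT index and a selector value can be checked with a few lines of C.  The helper names below are hypothetical and only model the x86 selector layout (INDEX in bits 15..3, table indicator in bit 2, requested privilege level in bits 1..0); they are not taken from the kernel headers.</para>

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for this sketch, modelling the x86 selector layout. */
enum { SK_RPL_KERNEL = 0, SK_RPL_USER = 3 };	/* requested privilege level */
enum { SK_TABLE_GDT = 0, SK_TABLE_LDT = 1 };	/* table indicator bit */

/* Build a selector: bits 15..3 = INDEX, bit 2 = table, bits 1..0 = RPL. */
static uint16_t
sk_make_selector(unsigned index, unsigned table, unsigned rpl)
{
	assert(index < 8192 && table <= 1 && rpl <= 3);
	return (uint16_t)((index << 3) | (table << 2) | rpl);
}

/* Recover the INDEX field, i.e. the slot in the descriptor table. */
static unsigned
sk_selector_index(uint16_t sel)
{
	return sel >> 3;
}
```

<para>With index 1 (the value of GCODE_SEL), the GDT and kernel privilege, <literal>sk_make_selector()</literal> yields 0x08, which matches the value quoted in the text.</para>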
| 
 | ||||
|       <para>The next step is to initialize the Interrupt Descriptor | ||||
|         Table (IDT).  This table is referenced by the processor | ||||
|         when a software or hardware interrupt occurs.  For example, to | ||||
|         make a system call, a user application issues the <literal>INT | ||||
|         0x80</literal> instruction.  This is a software interrupt, so | ||||
|         the processor's hardware looks up the record with index 0x80 in | ||||
|         the IDT.  This record points to the routine that handles this | ||||
|         interrupt; in this particular case, it is the kernel's | ||||
|         syscall gate.  The IDT may have a maximum of 256 (0x100) | ||||
|         records.  The kernel allocates NIDT records for the IDT, where | ||||
|         NIDT is the maximum (256):</para> | ||||
| 
 | ||||
|       <programlisting><filename>sys/i386/i386/machdep.c:</filename> | ||||
| static struct gate_descriptor idt0[NIDT]; | ||||
| struct gate_descriptor *idt = &idt0[0];	/* interrupt descriptor table */ | ||||
| </programlisting> | ||||
| 
 | ||||
|       <para>For each interrupt, an appropriate handler is set.  The | ||||
|         syscall gate for <literal>INT 0x80</literal> is set as | ||||
|         well:</para> | ||||
| 
 | ||||
|       <programlisting><filename>sys/i386/i386/machdep.c:</filename> | ||||
|  	setidt(0x80, &IDTVEC(int0x80_syscall), | ||||
| 			SDT_SYS386TGT, SEL_UPL, GSEL(GCODE_SEL, SEL_KPL));</programlisting> | ||||
| 
 | ||||
|       <para>So when a userland application issues the <literal>INT | ||||
|         0x80</literal> instruction, control will transfer to the | ||||
|         function <function>_Xint0x80_syscall</function>, which is in | ||||
|         the kernel code segment and will be executed with supervisor | ||||
|         privileges.</para> | ||||
| 
 | ||||
|       <para>Console and DDB are then initialized:</para> | ||||
| 
 | ||||
|       <programlisting><filename>sys/i386/i386/machdep.c:</filename> | ||||
| 	cninit(); | ||||
| /* skipped */ | ||||
| #ifdef DDB | ||||
| 	kdb_init(); | ||||
| 	if (boothowto & RB_KDB) | ||||
| 		Debugger("Boot flags requested debugger"); | ||||
| #endif</programlisting> | ||||
| 
 | ||||
|       <para>The Task State Segment (TSS) is another x86 protected mode | ||||
|         structure.  The TSS is used by the hardware to store task | ||||
|         information when a task switch occurs.</para> | ||||
| 
 | ||||
|       <para>The Local Descriptor Table (LDT) is used to reference userland | ||||
|         code and data.  Several selectors are defined to point to the | ||||
|         LDT: the system call gates and the user code and data | ||||
|         selectors:</para> | ||||
| 
 | ||||
|       <programlisting><filename>/usr/include/machine/segments.h</filename> | ||||
| #define	LSYS5CALLS_SEL	0	/* forced by intel BCS */ | ||||
| #define	LSYS5SIGR_SEL	1 | ||||
| #define	L43BSDCALLS_SEL	2	/* notyet */ | ||||
| #define	LUCODE_SEL	3 | ||||
| #define LSOL26CALLS_SEL	4	/* Solaris >= 2.6 system call gate */ | ||||
| #define	LUDATA_SEL	5 | ||||
| /* separate stack, es,fs,gs sels ? */ | ||||
| /* #define	LPOSIXCALLS_SEL	5*/	/* notyet */ | ||||
| #define LBSDICALLS_SEL	16	/* BSDI system call gate */ | ||||
| #define NLDT		(LBSDICALLS_SEL + 1) | ||||
| </programlisting> | ||||
| 
 | ||||
|     <para>Next, proc0's Process Control Block (<literal>struct | ||||
|       pcb</literal>) structure is initialized.  proc0 is a | ||||
|       <literal>struct proc</literal> structure that describes a kernel | ||||
|       process.  It is always present while the kernel is running, | ||||
|       therefore it is declared as global:</para> | ||||
| 
 | ||||
|     <programlisting><filename>sys/kern/kern_init.c:</filename> | ||||
|     struct	proc proc0;</programlisting> | ||||
| 
 | ||||
|     <para>The structure <literal>struct pcb</literal> is a part of the | ||||
|       proc structure.  It is defined in | ||||
|       <filename>/usr/include/machine/pcb.h</filename> and holds the | ||||
|       process information specific to the i386 architecture, such as | ||||
|       register values.</para> | ||||
| 
 | ||||
|     </sect2> | ||||
| 
 | ||||
|     <sect2> | ||||
|       <title><function>mi_startup()</function></title> | ||||
| 
 | ||||
|       <para>This function performs a bubble sort of all the system | ||||
|         initialization objects and then calls the entry of each object | ||||
|         one by one:</para> | ||||
| 
 | ||||
|       <programlisting><filename>sys/kern/init_main.c:</filename> | ||||
| 	for (sipp = sysinit; *sipp; sipp++) { | ||||
| 
 | ||||
| 		/* ... skipped ... */ | ||||
| 
 | ||||
| 		/* Call function */ | ||||
| 		(*((*sipp)->func))((*sipp)->udata); | ||||
| 		/* ... skipped ... */ | ||||
| 	}</programlisting> | ||||
| 
 | ||||
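<para>The sort-then-call pattern can be sketched outside the kernel as follows.  This is a simplified illustration under stated assumptions: <literal>struct sk_sysinit</literal>, <literal>sk_sort()</literal>, <literal>sk_run()</literal> and the recording callback <literal>sk_record()</literal> are hypothetical names, the record holds only an ordering key pair, a function and its argument, and the sort is a simple exchange sort in the spirit of the bubble sort mentioned above.</para>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a sysinit record. */
struct sk_sysinit {
	unsigned subsystem;		/* major ordering key */
	unsigned order;			/* minor key within a subsystem */
	void (*func)(const void *);	/* entry point to call */
	const void *udata;		/* argument passed to func */
};

/* Demo callback: append the first character of udata to a log. */
static char sk_log[32];

static void
sk_record(const void *udata)
{
	size_t n = strlen(sk_log);

	if (n + 1 < sizeof(sk_log))
		sk_log[n] = *(const char *)udata;
}

/* Exchange sort of a NULL-terminated pointer array by (subsystem, order). */
static void
sk_sort(struct sk_sysinit **set)
{
	struct sk_sysinit **sipp, **xipp, *save;

	for (sipp = set; *sipp != NULL; sipp++) {
		for (xipp = sipp + 1; *xipp != NULL; xipp++) {
			if ((*sipp)->subsystem < (*xipp)->subsystem ||
			    ((*sipp)->subsystem == (*xipp)->subsystem &&
			    (*sipp)->order <= (*xipp)->order))
				continue;	/* pair already in order */
			save = *sipp;
			*sipp = *xipp;
			*xipp = save;
		}
	}
}

/* Sort the set, then call each entry in order; returns the call count. */
static size_t
sk_run(struct sk_sysinit **set)
{
	struct sk_sysinit **sipp;
	size_t ncalled = 0;

	assert(set != NULL);
	sk_sort(set);
	for (sipp = set; *sipp != NULL; sipp++) {
		(*(*sipp)->func)((*sipp)->udata);
		ncalled++;
	}
	return ncalled;
}
```

<para>Given three records registered in arbitrary order, <literal>sk_run()</literal> calls them by ascending subsystem ID, so an entry with the highest ID always runs last.</para>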
|     <para>Although the sysinit framework is described in the | ||||
|       Developers' Handbook, I will discuss its internals here.</para> | ||||
| 
 | ||||
|     <para>Every system initialization object (sysinit object) is | ||||
|       created by invoking the SYSINIT() macro.  Let us take as an | ||||
|       example the <literal>announce</literal> sysinit object.  This | ||||
|       object prints the copyright message:</para> | ||||
| 
 | ||||
|     <programlisting><filename>sys/kern/init_main.c:</filename> | ||||
| static void | ||||
| print_caddr_t(void *data __unused) | ||||
| { | ||||
| 	printf("%s", (char *)data); | ||||
| } | ||||
| SYSINIT(announce, SI_SUB_COPYRIGHT, SI_ORDER_FIRST, print_caddr_t, copyright)</programlisting> | ||||
| 
 | ||||
|     <para>The subsystem ID for this object is SI_SUB_COPYRIGHT | ||||
|       (0x0800001), which comes right after the SI_SUB_CONSOLE | ||||
|       (0x0800000).  So, the copyright message will be printed out | ||||
|       first, just after the console initialization.</para> | ||||
| 
 | ||||
|     <para>Let us take a look at what exactly the macro | ||||
|       <literal>SYSINIT()</literal> does.  It expands to a | ||||
|       <literal>C_SYSINIT()</literal> macro.  The | ||||
|       <literal>C_SYSINIT()</literal> macro then expands to a static | ||||
|       <literal>struct sysinit</literal> structure declaration with | ||||
|       another <literal>DATA_SET</literal> macro call:</para> | ||||
|       <programlisting><filename>/usr/include/sys/kernel.h:</filename> | ||||
| #define C_SYSINIT(uniquifier, subsystem, order, func, ident)	\ | ||||
| 	static struct sysinit uniquifier ## _sys_init = {	\ | ||||
| 		subsystem,					\ | ||||
| 		order,						\ | ||||
| 		func,						\ | ||||
| 		ident						\ | ||||
| 	};							\ | ||||
| 	DATA_SET(sysinit_set,uniquifier ## _sys_init); | ||||
| 
 | ||||
| #define	SYSINIT(uniquifier, subsystem, order, func, ident)	\ | ||||
| 	C_SYSINIT(uniquifier, subsystem, order,			\ | ||||
| 	(sysinit_cfunc_t)(sysinit_nfunc_t)func, (void *)ident)</programlisting> | ||||
| 
 | ||||
|     <para>The <literal>DATA_SET()</literal> macro expands to | ||||
|       <literal>MAKE_SET()</literal>, and that macro is the point where | ||||
|       all the sysinit magic is hidden:</para> | ||||
| 
 | ||||
|     <programlisting><filename>/usr/include/linker_set.h</filename> | ||||
| #define MAKE_SET(set, sym)						\ | ||||
| 	static void const * const __set_##set##_sym_##sym = &sym;	\ | ||||
| 	__asm(".section .set." #set ",\"aw\"");				\ | ||||
| 	__asm(".long " #sym);						\ | ||||
| 	__asm(".previous") | ||||
| #endif | ||||
| #define TEXT_SET(set, sym) MAKE_SET(set, sym) | ||||
| #define DATA_SET(set, sym) MAKE_SET(set, sym)</programlisting> | ||||
| 
 | ||||
|     <para>In our case, the following declaration will occur:</para> | ||||
| 
 | ||||
|     <programlisting>static struct sysinit announce_sys_init = { | ||||
| 	SI_SUB_COPYRIGHT, | ||||
| 	SI_ORDER_FIRST, | ||||
| 	(sysinit_cfunc_t)(sysinit_nfunc_t)  print_caddr_t, | ||||
| 	(void *) copyright | ||||
| }; | ||||
| 
 | ||||
| static void const *const __set_sysinit_set_sym_announce_sys_init = | ||||
|     &announce_sys_init; | ||||
| __asm(".section .set.sysinit_set" ",\"aw\""); | ||||
| __asm(".long " "announce_sys_init"); | ||||
| __asm(".previous");</programlisting> | ||||
| 
 | ||||
|     <para>The first <literal>__asm</literal> statement creates | ||||
|       an ELF section within the kernel's executable.  This happens | ||||
|       at kernel link time.  The section has the name | ||||
|       ".set.sysinit_set".  The content of this section is one 32-bit | ||||
|       value, the address of the announce_sys_init structure, and that is | ||||
|       what the second <literal>__asm</literal> emits.  The third | ||||
|       <literal>__asm</literal> statement marks the end of the section. | ||||
|       If a directive with the same section name occurred before, the | ||||
|       content, i.e. the 32-bit value, is appended to the existing | ||||
|       section, thus forming an array of 32-bit pointers.</para> | ||||
| 
 | ||||
|     <para>Running <application>objdump</application> on a kernel | ||||
|       binary, you may notice the presence of such small sections:</para> | ||||
| 
 | ||||
|     <screen>&prompt.user; <userinput>objdump -h /kernel</userinput> | ||||
|   7 .set.cons_set 00000014  c03164c0  c03164c0  002154c0  2**2 | ||||
|                   CONTENTS, ALLOC, LOAD, DATA | ||||
|   8 .set.kbddriver_set 00000010  c03164d4  c03164d4  002154d4  2**2 | ||||
|                   CONTENTS, ALLOC, LOAD, DATA | ||||
|   9 .set.scrndr_set 00000024  c03164e4  c03164e4  002154e4  2**2 | ||||
|                   CONTENTS, ALLOC, LOAD, DATA | ||||
|  10 .set.scterm_set 0000000c  c0316508  c0316508  00215508  2**2 | ||||
|                   CONTENTS, ALLOC, LOAD, DATA | ||||
|  11 .set.sysctl_set 0000097c  c0316514  c0316514  00215514  2**2 | ||||
|                   CONTENTS, ALLOC, LOAD, DATA | ||||
|  12 .set.sysinit_set 00000664  c0316e90  c0316e90  00215e90  2**2 | ||||
|                   CONTENTS, ALLOC, LOAD, DATA</screen> | ||||
| 
 | ||||
|     <para>This screen dump shows that the size of the .set.sysinit_set | ||||
|       section is 0x664 bytes, so <literal>0x664/sizeof(void | ||||
|       *)</literal> sysinit objects are compiled into the kernel.  The | ||||
|       other sections such as <literal>.set.sysctl_set</literal> | ||||
|       represent other linker sets.</para> | ||||
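<para>As a worked example of that arithmetic, the helper below divides a section size by the pointer size.  The function name is hypothetical, and the 4-byte pointer size is an assumption that holds for i386 but not necessarily for the machine compiling this sketch.</para>

```c
#include <assert.h>
#include <stddef.h>

/*
 * Pointer size on i386.  This is an assumption of the sketch; the host
 * compiling this code may well use 8-byte pointers.
 */
#define SK_I386_POINTER_SIZE 4u

/* Number of pointer-sized entries in a linker set section of the given size. */
static size_t
sk_linker_set_entries(size_t section_bytes)
{
	assert(section_bytes % SK_I386_POINTER_SIZE == 0);
	return section_bytes / SK_I386_POINTER_SIZE;
}
```

<para>For the 0x664-byte section shown in the dump this gives 409 entries.</para>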
| 
 | ||||
|     <para>By defining a variable of type <literal>struct | ||||
|       linker_set</literal>, the content of the | ||||
|       <literal>.set.sysinit_set</literal> section is "collected" | ||||
|       into that variable:</para> | ||||
|       <programlisting><filename>sys/kern/init_main.c:</filename> | ||||
|       extern struct linker_set sysinit_set; /* XXX */</programlisting> | ||||
| 
 | ||||
|     <para>The <literal>struct linker_set</literal> is defined as | ||||
|       follows:</para> | ||||
| 
 | ||||
|     <programlisting><filename>/usr/include/linker_set.h:</filename> | ||||
|   struct linker_set { | ||||
| 	int	ls_length; | ||||
| 	void	*ls_items[1];		/* really ls_length of them, trailing NULL */ | ||||
| };</programlisting> | ||||
| 
 | ||||
|     <para>The first member holds the number of sysinit | ||||
|       objects, and the second member is a NULL-terminated array of | ||||
|       pointers to them.</para> | ||||
| 
 | ||||
|     <para>Returning to the <function>mi_startup()</function> | ||||
|       discussion, it should now be clear how the sysinit objects are | ||||
|       organized.  The <function>mi_startup()</function> function | ||||
|       sorts them and calls each.  The very last object is the system | ||||
|       scheduler:</para> | ||||
| 
 | ||||
|     <programlisting><filename>/usr/include/sys/kernel.h:</filename> | ||||
| enum sysinit_sub_id { | ||||
| 	SI_SUB_DUMMY		= 0x0000000,	/* not executed; for linker*/ | ||||
| 	SI_SUB_DONE		= 0x0000001,	/* processed*/ | ||||
| 	SI_SUB_CONSOLE		= 0x0800000,	/* console*/ | ||||
| 	SI_SUB_COPYRIGHT	= 0x0800001,	/* first use of console*/ | ||||
| ... | ||||
| 	SI_SUB_RUN_SCHEDULER	= 0xfffffff	/* scheduler: no return*/ | ||||
| };</programlisting> | ||||
| 
 | ||||
|     <para>The system scheduler sysinit object is defined in the file | ||||
|       <filename>sys/vm/vm_glue.c</filename>, and the entry point for | ||||
|       that object is <function>scheduler()</function>.  That function | ||||
|       is actually an infinite loop, and it represents a process with | ||||
|       PID 0, the swapper process.  The proc0 structure, mentioned | ||||
|       before, is used to describe it.</para> | ||||
| 
 | ||||
|     <para>The first user process, called <emphasis>init</emphasis>, is | ||||
|       created by the sysinit object "init":</para> | ||||
| 
 | ||||
|     <programlisting><filename>sys/kern/init_main.c:</filename> | ||||
| static void | ||||
| create_init(const void *udata __unused) | ||||
| { | ||||
| 	int error; | ||||
| 	int s; | ||||
| 
 | ||||
| 	s = splhigh(); | ||||
| 	error = fork1(&proc0, RFFDG | RFPROC, &initproc); | ||||
| 	if (error) | ||||
| 		panic("cannot fork init: %d\n", error); | ||||
| 	initproc->p_flag |= P_INMEM | P_SYSTEM; | ||||
| 	cpu_set_fork_handler(initproc, start_init, NULL); | ||||
| 	remrunqueue(initproc); | ||||
| 	splx(s); | ||||
| } | ||||
| SYSINIT(init,SI_SUB_CREATE_INIT, SI_ORDER_FIRST, create_init, NULL)</programlisting> | ||||
| 
 | ||||
|   <para><function>create_init()</function> allocates a new process | ||||
|     by calling <function>fork1()</function>, but does not mark it | ||||
|     runnable.  When this new process is scheduled for execution by the | ||||
|     scheduler, <function>start_init()</function> will be called. | ||||
|     That function is defined in <filename>init_main.c</filename>.  It | ||||
|     tries to load and exec the <filename>init</filename> binary, | ||||
|     probing <filename>/sbin/init</filename> first, then | ||||
|     <filename>/sbin/oinit</filename>, | ||||
|     <filename>/sbin/init.bak</filename>, and finally | ||||
|     <filename>/stand/sysinstall</filename>:</para> | ||||
| 
 | ||||
|   <programlisting><filename>sys/kern/init_main.c:</filename> | ||||
| static char init_path[MAXPATHLEN] = | ||||
| #ifdef	INIT_PATH | ||||
|     __XSTRING(INIT_PATH); | ||||
| #else | ||||
|     "/sbin/init:/sbin/oinit:/sbin/init.bak:/stand/sysinstall"; | ||||
| #endif</programlisting> | ||||
| 
 | ||||
|   </sect2> | ||||
| </sect1> | ||||
| 
 | ||||
| </chapter> | ||||
| 
 | ||||
| <!--  | ||||
|      Local Variables: | ||||
|      mode: sgml | ||||
|      sgml-declaration: "../chapter.decl" | ||||
|      sgml-indent-data: t | ||||
|      sgml-omittag: nil | ||||
|      sgml-always-quote-attributes: t | ||||
|      sgml-parent-document: ("../book.sgml" "part" "chapter") | ||||
|      End: | ||||
| --> | ||||
							
								
								
									
										970
									
								
								en_US.ISO8859-1/books/developers-handbook/boot/chapter.sgml
									
										
									
									
									
										Normal file
									
								
							
							
						
						
									
							|  | @ -0,0 +1,970 @@ | |||
| <!-- | ||||
| The FreeBSD Documentation Project | ||||
| 
 | ||||
| Copyright (c) 2002 Sergey Lyubka <devnull@uptsoft.com> | ||||
| All rights reserved | ||||
| $FreeBSD$ | ||||
| --> | ||||
| 
 | ||||
| <chapter id="boot"> | ||||
|   <chapterinfo> | ||||
|     <authorgroup> | ||||
|       <author> | ||||
|         <firstname>Sergey</firstname> | ||||
| 	<surname>Lyubka</surname> | ||||
| 	<contrib>Contributed by </contrib> | ||||
|       </author> <!-- devnull@uptsoft.com  12 Jun 2002 --> | ||||
|     </authorgroup> | ||||
|   </chapterinfo> | ||||
|   <title>Bootstrapping and kernel initialization</title> | ||||
| 
 | ||||
|   <sect1> | ||||
|     <title>Synopsis</title> | ||||
| 
 | ||||
|     <para>This chapter is an overview of the boot and system | ||||
|       initialization process, starting from the BIOS (firmware) POST, | ||||
|       to the first user process creation.  Since the initial steps of | ||||
|       system startup are very architecture dependent, the IA-32 | ||||
|       architecture is used as an example.</para> | ||||
|   </sect1> | ||||
| 
 | ||||
|   <sect1> | ||||
|     <title>Overview</title> | ||||
| 
 | ||||
|     <para>A computer running FreeBSD can boot by several methods, | ||||
|       although only the most common method, booting from a harddisk | ||||
|       where the OS is installed, is discussed here.  The boot process | ||||
|       is divided into several steps:</para> | ||||
| 
 | ||||
|     <itemizedlist> | ||||
|       <listitem><para>BIOS POST</para></listitem> | ||||
|       <listitem><para>boot0 stage</para></listitem> | ||||
|       <listitem><para>boot2 stage</para></listitem> | ||||
|       <listitem><para>loader stage</para></listitem> | ||||
|       <listitem><para>kernel initialization</para></listitem> | ||||
|     </itemizedlist> | ||||
| 
 | ||||
|     <para>The boot0 and boot2 stages are also referred to as | ||||
|       <emphasis>bootstrap stages 1 and 2</emphasis> in &man.boot.8;, as | ||||
|       the first steps in FreeBSD's 3-stage bootstrapping procedure. | ||||
|       Various information is printed on the screen at each stage, so | ||||
|       you may recognize the stages visually using the table that follows. | ||||
|       Please note that the actual data may differ from machine to | ||||
|       machine:</para> | ||||
| 
 | ||||
|     <informaltable> | ||||
|       <tgroup cols="2"> | ||||
|         <tbody> | ||||
|           <row> | ||||
|             <entry><para>may vary</para></entry> <entry><para>BIOS | ||||
|             (firmware) messages</para></entry> | ||||
|           </row> | ||||
|           <row> | ||||
|             <entry><para> | ||||
| <screen>F1    FreeBSD | ||||
| F2    BSD | ||||
| F5    Disk 2</screen> | ||||
|             </para></entry> | ||||
|             <entry><para>boot0</para></entry> | ||||
|           </row> | ||||
|           <row> | ||||
|             <entry><para> | ||||
| <screen>>>FreeBSD/i386 BOOT | ||||
| Default: 1:ad(1,a)/boot/loader | ||||
| boot:</screen> | ||||
|             </para></entry> | ||||
| 
 | ||||
|             <entry><para>boot2<footnote><para>This prompt will appear | ||||
|               if the user presses a key just after selecting an OS to | ||||
|               boot at the boot0 | ||||
|               stage.</para></footnote></para></entry> | ||||
|           </row> | ||||
|           <row> | ||||
|             <entry><para> | ||||
| <screen>BTX loader 1.0 BTX version is 1.01 | ||||
| BIOS drive A: is disk0 | ||||
| BIOS drive C: is disk1 | ||||
| BIOS 639kB/64512kB available memory | ||||
| FreeBSD/i386 bootstrap loader, Revision 0.8 | ||||
| Console internal video/keyboard | ||||
| (jkh@bento.freebsd.org, Mon Nov 20 11:41:23 GMT 2000) | ||||
| /kernel text=0x1234 data=0x2345 syms=[0x4+0x3456]  | ||||
| Hit [Enter] to boot immediately, or any other key for command prompt | ||||
| Booting [kernel] in 9 seconds..._</screen> | ||||
|             </para></entry> | ||||
|             <entry><para>loader</para></entry> | ||||
|           </row> | ||||
|           <row> | ||||
|             <entry><para> | ||||
| <screen>Copyright (c) 1992-2002 The FreeBSD Project. | ||||
| Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994 | ||||
|         The Regents of the University of California. All rights reserved. | ||||
| FreeBSD 4.6-RC #0: Sat May  4 22:49:02 GMT 2002 | ||||
|     devnull@kukas:/usr/obj/usr/src/sys/DEVNULL | ||||
| Timecounter "i8254"  frequency 1193182 Hz</screen> | ||||
|             </para></entry> | ||||
|             <entry><para>kernel</para></entry> | ||||
|           </row> | ||||
|         </tbody> | ||||
|       </tgroup> | ||||
|     </informaltable> | ||||
| 
 | ||||
|   </sect1> | ||||
| 
 | ||||
|   <sect1> | ||||
|     <title>BIOS POST</title> | ||||
| 
 | ||||
|     <para>When the PC powers on, the processor's registers are set | ||||
|       to some predefined values.  One of the registers is the | ||||
|       <emphasis>instruction pointer</emphasis> register, and its value | ||||
|       after a power on is well defined: it is a 32-bit value of | ||||
|       0xfffffff0.  The instruction pointer register points to code to | ||||
|       be executed by the processor.  Another register is the | ||||
|       <literal>cr0</literal> 32-bit control register.  One of cr0's | ||||
|       bits, the PE (Protection Enable) bit, indicates whether the | ||||
|       processor is running in protected or real mode.  Since this bit | ||||
|       is cleared at boot time, the processor boots in real mode. | ||||
|       Real mode means, among other things, that linear and physical | ||||
|       addresses are identical.</para> | ||||
| 
 | ||||
|     <para>The value of 0xfffffff0 is slightly less than 4 GB, so unless | ||||
|       the machine has 4 GB of physical memory, it cannot point to a valid | ||||
|       memory address.  The computer's hardware translates this address | ||||
|       so that it points to a BIOS memory block.</para> | ||||
| 
 | ||||
|     <para>BIOS stands for <emphasis>Basic Input Output | ||||
|       System</emphasis>, and it is a chip on the motherboard that has | ||||
|       a relatively small amount of read-only memory (ROM).  This | ||||
|       memory contains various low-level routines that are specific to | ||||
|       the hardware supplied with the motherboard.  So, the processor | ||||
|       will first jump to the address 0xfffffff0, which really resides | ||||
|       in the BIOS's memory.  Usually this address contains a jump | ||||
|       instruction to the BIOS's POST routines.</para> | ||||
| 
 | ||||
|     <para>POST stands for <emphasis>Power On Self Test</emphasis>. | ||||
|       This is a set of routines including the memory check, system bus | ||||
|       check and other low-level checks so that the CPU can initialize | ||||
|       the computer properly.  The important step at this stage is | ||||
|       determining the boot device.  All modern BIOSes allow the boot | ||||
|       device to be set manually, so you can boot from a floppy, | ||||
|       CD-ROM, harddisk etc.</para> | ||||
| 
 | ||||
|     <para>The very last thing in the POST is the <literal>INT | ||||
|       0x19</literal> instruction.  Its handler reads 512 bytes | ||||
|       from the first sector of the boot device into memory at address | ||||
|       0x7c00.  The term <emphasis>first sector</emphasis> originates | ||||
|       from harddrive architecture, where the magnetic plate is divided | ||||
|       into a number of cylindrical tracks.  Tracks are numbered, and | ||||
|       every track is divided into a number of (usually 64) sectors.  Track | ||||
|       number 0 is the outermost on the magnetic plate, and sector 1, | ||||
|       the first sector (tracks, or cylinders, are numbered starting | ||||
|       from 0, but sectors are numbered starting from 1), has a special | ||||
|       meaning.  It is also called the Master Boot Record, or MBR.  The | ||||
|       remaining sectors on the first track are never used<footnote><para>Some | ||||
|       utilities such as &man.disklabel.8; may store information in | ||||
|       this area, mostly in the second | ||||
|       sector.</para></footnote>.</para> | ||||
| 
 | ||||
|   </sect1> | ||||
| 
 | ||||
|   <sect1> | ||||
|     <title>boot0 stage</title> | ||||
| 
 | ||||
|     <para>Take a look at the file <filename>/boot/boot0</filename>. | ||||
|       This is a small 512-byte file, and it is exactly what FreeBSD's | ||||
|       installation procedure wrote to your harddisk's MBR if you chose | ||||
|       the "bootmanager" option at installation time.</para> | ||||
| 
 | ||||
|     <para>As mentioned previously, the <literal>INT 0x19</literal> | ||||
|       instruction loads an MBR, i.e. the <filename>boot0</filename> | ||||
|       content, into memory at address 0x7c00.  The file | ||||
|       <filename>sys/boot/i386/boot0/boot0.s</filename> gives an | ||||
|       idea of what happens there: this is the boot | ||||
|       manager, an awesome piece of code written by Robert | ||||
|       Nordier.</para> | ||||
| 
 | ||||
|     <para>The MBR, or, <filename>boot0</filename>, has a special | ||||
|       structure starting from offset 0x1be, called the | ||||
|       <emphasis>partition table</emphasis>.  It has 4 records of 16 | ||||
|       bytes each, called <emphasis>partition records</emphasis>, which | ||||
|       represent how the harddisk(s) are partitioned, or, in FreeBSD's | ||||
|       terminology, sliced.  One byte of those 16 says whether a | ||||
|       partition (slice) is bootable or not.  Exactly one record must | ||||
|       have that flag set, otherwise <filename>boot0</filename>'s code | ||||
|       will refuse to proceed.</para> | ||||
| 
 | ||||
|     <para>A partition record has the following fields:</para> | ||||
| 
 | ||||
|     <itemizedlist> | ||||
|       <listitem><para>the 1-byte filesystem type</para></listitem> | ||||
|       <listitem><para>the 1-byte bootable flag</para></listitem> | ||||
|       <listitem><para>the 6-byte descriptor in CHS | ||||
|         format</para></listitem> | ||||
|       <listitem><para>the 8-byte descriptor in LBA | ||||
|         format</para></listitem> | ||||
|     </itemizedlist> | ||||
| 
 | ||||
|     <para>A partition record descriptor has the information about | ||||
|       where exactly the partition resides on the drive.  Both | ||||
|       descriptors, LBA and CHS, describe the same information, but in | ||||
|       different ways: LBA (Logical Block Addressing) has the starting | ||||
|       sector for the partition and the partition's length, while CHS | ||||
|       (Cylinder Head Sector) has coordinates for the first and last | ||||
|       sectors of the partition.</para> | ||||
| 
 | ||||
|     <para>The boot manager scans the partition table and prints the | ||||
|       menu on the screen so the user can select what disk and what | ||||
|       slice to boot.  When the user presses an appropriate key, | ||||
|       <filename>boot0</filename> performs the following | ||||
|       actions:</para> | ||||
| 
 | ||||
|     <itemizedlist> | ||||
|       <listitem><para>modifies the bootable flag for the selected | ||||
|         partition to make it bootable, and clears the | ||||
|         previous</para></listitem> | ||||
| 
 | ||||
|       <listitem><para>saves itself to disk to remember what partition | ||||
|         (slice) has been selected, so that it is used as the default on the | ||||
|         next boot</para></listitem> | ||||
| 
 | ||||
|       <listitem><para>loads the first sector of the selected partition | ||||
|         (slice) into memory and jumps there</para></listitem> | ||||
|     </itemizedlist> | ||||
| 
 | ||||
|     <para>What kind of data should reside on the very first sector of | ||||
|       a bootable partition (slice), in our case, a FreeBSD slice?  As | ||||
|       you may have already guessed, it is | ||||
|       <filename>boot2</filename>.</para> | ||||
| 
 | ||||
|   </sect1> | ||||
| 
 | ||||
|   <sect1> | ||||
|     <title>boot2 stage</title> | ||||
| 
 | ||||
|     <para>You might wonder why boot2 comes after boot0, and not | ||||
|       boot1.  Actually, there is a 512-byte file called | ||||
|       <filename>boot1</filename> in the directory | ||||
|       <filename>/boot</filename> as well.  It is used for booting from | ||||
|       a floppy.  When booting from a floppy, | ||||
|       <filename>boot1</filename> plays the same role as | ||||
|       <filename>boot0</filename> for a harddisk: it locates boot2 and | ||||
|       runs it.</para> | ||||
| 
 | ||||
|     <para>You may have noticed that a file | ||||
|       <filename>/boot/mbr</filename> exists as well.  It is a | ||||
|       simplified version of boot0.  The code in | ||||
|       <filename>mbr</filename> does not provide a menu for the user; | ||||
|       it simply boots the partition marked active.</para> | ||||
| 
 | ||||
|     <para>The code implementing boot2 resides in | ||||
|       <filename>sys/boot/i386/boot2/</filename>, and the executable | ||||
|       itself is in <filename>/boot</filename>.  The files boot0 and | ||||
|       boot2 that are in <filename>/boot</filename> are not used by the | ||||
|       bootstrap, but by utilities such as | ||||
|       <application>boot0cfg</application>.  The actual position for | ||||
|       boot0 is in the MBR.  For boot2 it is the beginning of a | ||||
|       bootable FreeBSD slice.  These locations are not under the | ||||
|       filesystem's control, so they are invisible to commands like | ||||
|       <application>ls</application>.</para> | ||||
| 
 | ||||
|     <para>The main task for boot2 is to load the file | ||||
|       <filename>/boot/loader</filename>, which is the third stage in | ||||
|       the bootstrapping procedure.  The code in boot2 cannot use any | ||||
|       services like <function>open()</function> and | ||||
|       <function>read()</function>, since the kernel is not yet loaded. | ||||
|       It must scan the hard disk, knowing about the filesystem | ||||
|       structure, find the file <filename>/boot/loader</filename>, read | ||||
|       it into memory using a BIOS service, and then pass execution | ||||
|       to the loader's entry point.</para> | ||||
| 
 | ||||
|     <para>Besides that, boot2 prompts for user input so the loader can | ||||
|       be booted from a different disk, unit, slice and partition.</para> | ||||
| 
 | ||||
|     <para>The boot2 binary is created in a special way:</para> | ||||
|     <programlisting><filename>sys/boot/i386/boot2/Makefile</filename> | ||||
| boot2: boot2.ldr boot2.bin ${BTX}/btx/btx | ||||
| 	btxld -v -E ${ORG2} -f bin -b ${BTX}/btx/btx -l boot2.ldr \ | ||||
| 		-o boot2.ld -P 1 boot2.bin</programlisting> | ||||
| 
 | ||||
|     <para>This Makefile snippet shows that &man.btxld.8; is used to | ||||
|       link the binary.  BTX, which stands for BooT eXtender, is a | ||||
|       piece of code that provides a protected mode environment for the | ||||
|       program, called the client, that it is linked with.  So boot2 is | ||||
|       a BTX client, i.e. it uses the services provided by BTX.</para> | ||||
| 
 | ||||
|     <para>The <application>btxld</application> utility is the linker. | ||||
|       It links two binaries together.  The difference between | ||||
|       &man.btxld.8; and &man.ld.1; is that | ||||
|       <application>ld</application> usually links object files into a | ||||
|       shared object or executable, while | ||||
|       <application>btxld</application> links an object file with | ||||
|       BTX, producing a binary file suitable for placement at the | ||||
|       beginning of the partition for the system boot.</para> | ||||
| 
 | ||||
|     <para>boot0 passes the execution to BTX's entry point.  BTX then | ||||
|       switches the processor to protected mode, and prepares a simple | ||||
|       environment before calling the client.  This includes:</para> | ||||
| 
 | ||||
|     <itemizedlist> | ||||
|       <listitem><para>virtual 8086 (v86) mode.  That is, BTX acts as a v86 | ||||
|         monitor.  Real mode instructions like pushf, popf, cli and sti, if | ||||
|         executed by the client, will work.</para></listitem> | ||||
| 
 | ||||
|       <listitem><para>An Interrupt Descriptor Table (IDT) is set up so | ||||
|         all hardware interrupts are routed to the default BIOS | ||||
|         handlers, and interrupt 0x30 is set up to be the syscall | ||||
|         gate.</para></listitem> | ||||
| 
 | ||||
|       <listitem><para>Two system calls: <function>exec</function> and | ||||
|         <function>exit</function>, are defined:</para> | ||||
| 
 | ||||
|     <programlisting><filename>sys/boot/i386/btx/lib/btxsys.s:</filename> | ||||
| 		.set INT_SYS,0x30		# Interrupt number | ||||
| # | ||||
| # System call: exit | ||||
| # | ||||
| __exit: 	xorl %eax,%eax			# BTX system | ||||
| 		int $INT_SYS			#  call 0x0 | ||||
| # | ||||
| # System call: exec | ||||
| # | ||||
| __exec: 	movl $0x1,%eax			# BTX system | ||||
| 		int $INT_SYS			#  call 0x1</programlisting></listitem> | ||||
|     </itemizedlist> | ||||
| 
 | ||||
|     <para>BTX creates a Global Descriptor Table (GDT):</para> | ||||
| 
 | ||||
|     <programlisting><filename>sys/boot/i386/btx/btx/btx.s:</filename> | ||||
| gdt:		.word 0x0,0x0,0x0,0x0		# Null entry | ||||
| 		.word 0xffff,0x0,0x9a00,0xcf	# SEL_SCODE | ||||
| 		.word 0xffff,0x0,0x9200,0xcf	# SEL_SDATA | ||||
| 		.word 0xffff,0x0,0x9a00,0x0	# SEL_RCODE | ||||
| 		.word 0xffff,0x0,0x9200,0x0	# SEL_RDATA | ||||
| 		.word 0xffff,MEM_USR,0xfa00,0xcf# SEL_UCODE | ||||
| 		.word 0xffff,MEM_USR,0xf200,0xcf# SEL_UDATA | ||||
| 		.word _TSSLM,MEM_TSS,0x8900,0x0 # SEL_TSS</programlisting> | ||||
| 
 | ||||
|     <para>The client's code and data start from address MEM_USR | ||||
|       (0xa000), and a selector (SEL_UCODE) points to the client's code | ||||
|       segment.  The SEL_UCODE descriptor has Descriptor Privilege | ||||
|       Level (DPL) 3, which is the lowest privilege level.  But the | ||||
|       <literal>INT 0x30</literal> instruction handler resides in a | ||||
|       segment pointed to by the SEL_SCODE (supervisor code) selector, | ||||
|       as shown from the code that creates an IDT:</para> | ||||
| 
 | ||||
|   <programlisting>		mov $SEL_SCODE,%dh		# Segment selector | ||||
| init.2: 	shr %bx				# Handle this int? | ||||
| 		jnc init.3			# No | ||||
| 		mov %ax,(%di)			# Set handler offset | ||||
| 		mov %dh,0x2(%di)		#  and selector | ||||
| 		mov %dl,0x5(%di)		# Set P:DPL:type | ||||
| 		add $0x4,%ax			# Next handler</programlisting> | ||||
| 
 | ||||
|     <para>So, when the client calls <function>__exec()</function>, the | ||||
|       code will be executed with the highest privileges.  This allows | ||||
|       the kernel to change the protected mode data structures, such as | ||||
|       page tables, GDT, IDT, etc. later, if needed.</para> | ||||
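<para>To see where the privilege levels come from, the four 16-bit words of each GDT entry in the listing above can be decoded by hand.  The sketch below does this in C, assuming the standard IA-32 segment descriptor layout; <literal>struct seg_info</literal> and <function>decode_desc()</function> are illustrative names only, not part of BTX.</para>

```c
#include <stdint.h>

/* Selected fields of an IA-32 segment descriptor (hypothetical type). */
struct seg_info {
	uint32_t base;		/* linear base address */
	uint32_t limit;		/* 20-bit limit field */
	unsigned dpl;		/* descriptor privilege level, 0..3 */
	unsigned present;	/* P bit */
	unsigned page_gran;	/* G bit: limit counted in 4 KB pages */
};

/*
 * Decode a descriptor given as four little-endian 16-bit words, in the
 * order they appear after the .word directive.
 */
struct seg_info
decode_desc(uint16_t w0, uint16_t w1, uint16_t w2, uint16_t w3)
{
	struct seg_info si;
	unsigned access = (w2 >> 8) & 0xff;	/* P, DPL, S, type */

	si.limit = (uint32_t)w0 | ((uint32_t)(w3 & 0x000f) << 16);
	si.base = (uint32_t)w1 | ((uint32_t)(w2 & 0x00ff) << 16) |
	    ((uint32_t)(w3 & 0xff00) << 16);
	si.present = (access >> 7) & 1;
	si.dpl = (access >> 5) & 3;
	si.page_gran = (w3 >> 7) & 1;
	return (si);
}
```

<para>Feeding it the SEL_UCODE words, with MEM_USR taken as 0xa000, yields base 0xa000 and DPL 3, while the SEL_SCODE words yield base 0 and DPL 0.</para>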
| 
 | ||||
|     <para>boot2 defines an important structure, <literal>struct | ||||
|       bootinfo</literal>.  This structure is initialized by boot2 and | ||||
|       passed to the loader, and then further to the kernel.  Some | ||||
|       fields of this structure are set by boot2, the rest by the | ||||
|       loader.  This structure, among other information, contains the | ||||
|       kernel filename, BIOS hard disk geometry, BIOS drive number for | ||||
|       the boot device, physical memory available, the <literal>envp</literal> | ||||
|       pointer, etc.  The definition for it is:</para> | ||||
| 
 | ||||
|     <programlisting><filename>/usr/include/machine/bootinfo.h</filename> | ||||
| struct bootinfo { | ||||
| 	u_int32_t	bi_version; | ||||
| 	u_int32_t	bi_kernelname;		/* represents a char * */ | ||||
| 	u_int32_t	bi_nfs_diskless;	/* struct nfs_diskless * */ | ||||
| 				/* End of fields that are always present. */ | ||||
| #define	bi_endcommon	bi_n_bios_used | ||||
| 	u_int32_t	bi_n_bios_used; | ||||
| 	u_int32_t	bi_bios_geom[N_BIOS_GEOM]; | ||||
| 	u_int32_t	bi_size; | ||||
| 	u_int8_t	bi_memsizes_valid; | ||||
| 	u_int8_t	bi_bios_dev;		/* bootdev BIOS unit number */ | ||||
| 	u_int8_t	bi_pad[2]; | ||||
| 	u_int32_t	bi_basemem; | ||||
| 	u_int32_t	bi_extmem; | ||||
| 	u_int32_t	bi_symtab;		/* struct symtab * */ | ||||
| 	u_int32_t	bi_esymtab;		/* struct symtab * */ | ||||
| 				/* Items below only from advanced bootloader */ | ||||
| 	u_int32_t	bi_kernend;		/* end of kernel space */ | ||||
| 	u_int32_t	bi_envp;		/* environment */ | ||||
| 	u_int32_t	bi_modulep;		/* preloaded modules */ | ||||
| };</programlisting> | ||||
| 
 | ||||
|   <para>boot2 enters an infinite loop waiting for user input, | ||||
|     then calls <function>load()</function>.  If the user does not | ||||
|     press anything, the loop is broken by a timeout, so | ||||
|     <function>load()</function> will load the default file | ||||
|     (<filename>/boot/loader</filename>).  The functions <function>ino_t | ||||
|     lookup(char *filename)</function> and <function>int xfsread(ino_t | ||||
|     inode, void *buf, size_t nbyte)</function> are used to read the | ||||
|     content of a file into memory.  <filename>/boot/loader</filename> | ||||
|     is an ELF binary, but its ELF header is preceded by | ||||
|     a.out's <literal>struct exec</literal> structure. | ||||
|     <function>load()</function> scans the loader's ELF header, loads | ||||
|     the content of <filename>/boot/loader</filename> into memory, and | ||||
|     passes execution to the loader's entry point:</para> | ||||
| 
 | ||||
|   <programlisting><filename>sys/boot/i386/boot2/boot2.c:</filename> | ||||
|     __exec((caddr_t)addr, RB_BOOTINFO | (opts & RBX_MASK), | ||||
| 	   MAKEBOOTDEV(dev_maj[dsk.type], 0, dsk.slice, dsk.unit, dsk.part), | ||||
| 	   0, 0, 0, VTOP(&bootinfo));</programlisting> | ||||
| 
 | ||||
|   </sect1> | ||||
| 
 | ||||
|   <sect1> | ||||
|     <title><application>loader</application> stage</title> | ||||
| 
 | ||||
|     <para><application>loader</application> is a BTX client as well. | ||||
|       I will not describe it here in detail; there is a comprehensive | ||||
|       manpage written by Mike Smith, &man.loader.8;.  The underlying | ||||
|       mechanisms and BTX were discussed above.</para> | ||||
| 
 | ||||
|     <para>The main task for the loader is to boot the kernel.  Once | ||||
|       the kernel is loaded into memory, it is called by the | ||||
|       loader:</para> | ||||
| 
 | ||||
|   <programlisting><filename>sys/boot/common/boot.c:</filename> | ||||
|     /* Call the exec handler from the loader matching the kernel */ | ||||
|     module_formats[km->m_loader]->l_exec(km);</programlisting> | ||||
|   </sect1> | ||||
| 
 | ||||
|   <sect1> | ||||
|     <title>Kernel initialization</title> | ||||
| 
 | ||||
|     <para>Where exactly does the loader pass execution, | ||||
|       i.e. what is the kernel's actual entry point?  Let us take a | ||||
|       look at the command that links the kernel:</para> | ||||
| 
 | ||||
|     <programlisting><filename>sys/conf/Makefile.i386:</filename> | ||||
| ld -elf -Bdynamic -T /usr/src/sys/conf/ldscript.i386  -export-dynamic \ | ||||
| -dynamic-linker /red/herring -o kernel -X locore.o \ | ||||
| &lt;lots of kernel .o files&gt;</programlisting> | ||||
| 
 | ||||
|     <para>A few interesting things can be seen in this line.  First, | ||||
|       the kernel is an ELF dynamically linked binary, but the dynamic | ||||
|       linker for the kernel is <filename>/red/herring</filename>, which is | ||||
|       definitely a bogus file.  Second, taking a look at the file | ||||
|       <filename>sys/conf/ldscript.i386</filename> gives an idea about | ||||
|       what <application>ld</application> options are used when | ||||
|       linking a kernel.  Reading through the first few lines, the | ||||
|       string</para> | ||||
| 
 | ||||
|   <programlisting><filename>sys/conf/ldscript.i386:</filename> | ||||
| ENTRY(btext)</programlisting> | ||||
| 
 | ||||
|     <para>says that the kernel's entry point is the symbol `btext'. | ||||
|       This symbol is defined in <filename>locore.s</filename>:</para> | ||||
| 
 | ||||
|   <programlisting><filename>sys/i386/i386/locore.s:</filename> | ||||
| 	.text | ||||
| /********************************************************************** | ||||
|  * | ||||
|  * This is where the bootblocks start us, set the ball rolling... | ||||
|  * | ||||
|  */ | ||||
| NON_GPROF_ENTRY(btext)</programlisting> | ||||
| 
 | ||||
|     <para>First, the EFLAGS register is set to a | ||||
|       predefined value of 0x00000002, and then the %fs and %gs segment | ||||
|       registers are initialized from %ds:</para> | ||||
| 
 | ||||
|     <programlisting><filename>sys/i386/i386/locore.s</filename> | ||||
| /* Don't trust what the BIOS gives for eflags. */ | ||||
| 	pushl	$PSL_KERNEL | ||||
| 	popfl | ||||
| 
 | ||||
| /* | ||||
|  * Don't trust what the BIOS gives for %fs and %gs.  Trust the bootstrap | ||||
|  * to set %cs, %ds, %es and %ss. | ||||
|  */ | ||||
| 	mov	%ds, %ax | ||||
| 	mov	%ax, %fs | ||||
| 	mov	%ax, %gs</programlisting> | ||||
| 
 | ||||
|     <para>btext calls the routines | ||||
|       <function>recover_bootinfo()</function>, | ||||
|       <function>identify_cpu()</function>, | ||||
|       <function>create_pagetables()</function>, which are also defined | ||||
|       in <filename>locore.s</filename>.  Here is a description of what | ||||
|       they do:</para> | ||||
| 
 | ||||
|     <informaltable> | ||||
|       <tgroup cols=2 align=left> | ||||
|       <tbody> | ||||
|         <row> | ||||
|           <entry><function>recover_bootinfo</function></entry> | ||||
| 
 | ||||
|           <entry>This routine parses the parameters to the kernel | ||||
|             passed from the bootstrap.  The kernel may have been | ||||
|             booted in 3 ways: by the loader, described above, by the | ||||
|             old disk boot blocks, and by the old diskless boot | ||||
|             procedure.  This function determines the booting method, | ||||
|             and stores the <literal>struct bootinfo</literal> | ||||
|             structure into the kernel memory.</entry> | ||||
|         </row> | ||||
|         <row> | ||||
|           <entry><function>identify_cpu</function></entry> <entry>This | ||||
|           function tries to find out what CPU it is running on, and | ||||
|           stores the value found in the variable | ||||
|           <varname>_cpu</varname>.</entry> | ||||
|         </row> | ||||
|         <row> | ||||
|           <entry><function>create_pagetables</function></entry> | ||||
|           <entry>This function allocates and fills out a Page Table Directory | ||||
|           at the top of the kernel memory area.</entry> | ||||
|         </row> | ||||
|       </tbody> | ||||
|       </tgroup> | ||||
|     </informaltable> | ||||
|     <para>The next step is to enable VME, if the CPU supports it:</para> | ||||
| 
 | ||||
|     <programlisting>	testl	$CPUID_VME, R(_cpu_feature) | ||||
| 	jz	1f | ||||
| 	movl	%cr4, %eax | ||||
| 	orl	$CR4_VME, %eax | ||||
| 	movl	%eax, %cr4</programlisting> | ||||
| 
 | ||||
|     <para>Then, enabling paging:</para> | ||||
|     <programlisting>/* Now enable paging */ | ||||
| 	movl	R(_IdlePTD), %eax | ||||
| 	movl	%eax,%cr3			/* load ptd addr into mmu */ | ||||
| 	movl	%cr0,%eax			/* get control word */ | ||||
| 	orl	$CR0_PE|CR0_PG,%eax		/* enable paging */ | ||||
| 	movl	%eax,%cr0			/* and let's page NOW! */</programlisting> | ||||
| 
 | ||||
|     <para>Because paging is now enabled, the next three lines of code | ||||
|       perform a jump so that execution continues in the virtualized | ||||
|       address space:</para> | ||||
| 
 | ||||
|     <programlisting>	pushl	$begin				/* jump to high virtualized address */ | ||||
| 	ret | ||||
| 
 | ||||
| /* now running relocated at KERNBASE where the system is linked to run */ | ||||
| begin:</programlisting> | ||||
| 
 | ||||
|     <para>The function <function>init386()</function> is called with | ||||
|       a pointer to the first free physical page, and after that | ||||
|       <function>mi_startup()</function> is called.  <function>init386()</function> | ||||
|       is an architecture dependent initialization function, and | ||||
|       <function>mi_startup()</function> is an architecture independent | ||||
|       one (the 'mi_' prefix stands for Machine Independent).  The | ||||
|       kernel never returns from <function>mi_startup()</function>, and | ||||
|       by calling it, the kernel finishes booting:</para> | ||||
| 
 | ||||
|     <programlisting><filename>sys/i386/i386/locore.s:</filename> | ||||
| 	movl	physfree, %esi | ||||
| 	pushl	%esi				/* value of first for init386(first) */ | ||||
| 	call	_init386			/* wire 386 chip for unix operation */ | ||||
| 	call	_mi_startup			/* autoconfiguration, mountroot etc */ | ||||
| 	hlt		/* never returns to here */</programlisting> | ||||
| 
 | ||||
|     <sect2> | ||||
|       <title><function>init386()</function></title> | ||||
| 
 | ||||
|       <para><function>init386()</function> is defined in | ||||
|         <filename>sys/i386/i386/machdep.c</filename> and performs | ||||
|         low-level initialization specific to the i386 chip.  The | ||||
|         switch to protected mode was performed by the loader.  The | ||||
|         loader has created the very first task, in which the kernel | ||||
|         continues to operate.  Before going straight to the | ||||
|         code, I will enumerate the tasks that must be completed | ||||
|         to initialize protected mode execution:</para> | ||||
| 
 | ||||
|       <itemizedlist> | ||||
|         <listitem><para>Initialize the kernel tunable parameters, passed from | ||||
|         the bootstrapping program.</para></listitem> | ||||
|         <listitem><para>Prepare the GDT.</para></listitem> | ||||
|         <listitem><para>Prepare the IDT.</para></listitem> | ||||
|         <listitem><para>Initialize the system console.</para></listitem> | ||||
|         <listitem><para>Initialize the DDB, if it is compiled into the kernel. | ||||
|         </para></listitem> | ||||
|         <listitem><para>Initialize the TSS.</para></listitem> | ||||
|         <listitem><para>Prepare the LDT.</para></listitem> | ||||
|         <listitem><para>Set up proc0's pcb.</para></listitem> | ||||
| 
 | ||||
|       </itemizedlist> | ||||
| 
 | ||||
|       <para>The first thing <function>init386()</function> does is | ||||
|         initialize the tunable parameters passed from the bootstrap.  This | ||||
|         is done by setting the environment pointer (envp) and calling | ||||
|         <function>init_param1()</function>.  The envp pointer has been | ||||
|         passed from the loader in the <literal>bootinfo</literal> | ||||
|         structure:</para> | ||||
| 
 | ||||
|       <programlisting><filename>sys/i386/i386/machdep.c:</filename> | ||||
| 		kern_envp = (caddr_t)bootinfo.bi_envp + KERNBASE; | ||||
| 
 | ||||
| 	/* Init basic tunables, hz etc */ | ||||
| 	init_param1();</programlisting> | ||||
| 
 | ||||
|       <para><function>init_param1()</function> is defined in | ||||
|         <filename>sys/kern/subr_param.c</filename>.  That file has a | ||||
|         number of sysctls, and two functions, | ||||
|         <function>init_param1()</function> and | ||||
|         <function>init_param2()</function>, that are called from | ||||
|         <function>init386()</function>:</para> | ||||
| 
 | ||||
|       <programlisting><filename>sys/kern/subr_param.c</filename> | ||||
| 	hz = HZ; | ||||
| 	TUNABLE_INT_FETCH("kern.hz", &hz);</programlisting> | ||||
| 
 | ||||
|       <para>TUNABLE_&lt;typename&gt;_FETCH is used to fetch the value | ||||
|         from the environment:</para> | ||||
| 
 | ||||
|     <programlisting><filename>/usr/src/sys/sys/kernel.h</filename> | ||||
| #define	TUNABLE_INT_FETCH(path, var)	getenv_int((path), (var)) | ||||
| </programlisting> | ||||
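<para>What such a fetch amounts to can be sketched in a few lines of C.  The sketch assumes that the environment handed over by the loader is a sequence of NUL-terminated <literal>name=value</literal> strings ending with an empty string; <function>env_lookup()</function> and <function>env_fetch_int()</function> are hypothetical stand-ins and not the kernel's <function>getenv_int()</function> implementation.</para>

```c
#include <stdlib.h>
#include <string.h>

/* Return a pointer to the value of `name', or NULL if it is not set. */
static const char *
env_lookup(const char *env, const char *name)
{
	size_t n = strlen(name);

	/* Walk the strings until the terminating empty string. */
	for (const char *p = env; *p != '\0'; p += strlen(p) + 1)
		if (strncmp(p, name, n) == 0 && p[n] == '=')
			return (p + n + 1);
	return (NULL);
}

/*
 * Store the integer value of `name' in *var and return 1.  If the
 * variable is missing or malformed, leave *var untouched and return 0,
 * so that the compiled-in default survives.
 */
int
env_fetch_int(const char *env, const char *name, int *var)
{
	const char *val = env_lookup(env, name);
	char *end;
	long v;

	if (val == NULL)
		return (0);
	v = strtol(val, &end, 0);	/* base 0 accepts decimal and 0x hex */
	if (end == val || *end != '\0')
		return (0);
	*var = (int)v;
	return (1);
}
```

<para>This mirrors the pattern used for <literal>kern.hz</literal>: the variable is first set to its default and is only overwritten when the environment supplies a value.</para>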
| 
 | ||||
|       <para>The sysctl <literal>kern.hz</literal> is the system clock tick | ||||
|         rate.  Along with this, the following sysctls are set by | ||||
|         <function>init_param1()</function>: <literal>kern.maxswzone, | ||||
|         kern.maxbcache, kern.maxtsiz, kern.dfldsiz, kern.dflssiz, | ||||
|         kern.maxssiz, kern.sgrowsiz</literal>.</para> | ||||
| 
 | ||||
|       <para>Then <function>init386()</function> prepares the Global | ||||
|         Descriptor Table (GDT).  Every task on an x86 is running in | ||||
|         its own virtual address space, and this space is addressed by | ||||
|         a segment:offset pair.  Say, for instance, the current | ||||
|         instruction to be executed by the processor lies at CS:EIP; | ||||
|         then the linear virtual address for that instruction would be | ||||
|         "the virtual address of code segment CS" + EIP.  For | ||||
|         convenience, segments begin at virtual address 0 and end at a | ||||
|         4 GB boundary.  Therefore, the instruction's linear virtual | ||||
|         address for this example would just be the value of EIP. | ||||
|         Segment registers such as CS, DS etc. are the selectors, | ||||
|         i.e. indexes, into the GDT (to be more precise, an index is not a | ||||
|         selector itself, but the INDEX field of a selector). | ||||
|         FreeBSD's GDT holds descriptors for 15 selectors per | ||||
|         CPU:</para> | ||||
| 
 | ||||
|       <programlisting><filename>sys/i386/i386/machdep.c:</filename> | ||||
| union descriptor gdt[NGDT * MAXCPU];	/* global descriptor table */ | ||||
| 
 | ||||
| <filename>sys/i386/include/segments.h:</filename> | ||||
| /* | ||||
|  * Entries in the Global Descriptor Table (GDT) | ||||
|  */ | ||||
| #define	GNULL_SEL	0	/* Null Descriptor */ | ||||
| #define	GCODE_SEL	1	/* Kernel Code Descriptor */ | ||||
| #define	GDATA_SEL	2	/* Kernel Data Descriptor */ | ||||
| #define	GPRIV_SEL	3	/* SMP Per-Processor Private Data */ | ||||
| #define	GPROC0_SEL	4	/* Task state process slot zero and up */ | ||||
| #define	GLDT_SEL	5	/* LDT - eventually one per process */ | ||||
| #define	GUSERLDT_SEL	6	/* User LDT */ | ||||
| #define	GTGATE_SEL	7	/* Process task switch gate */ | ||||
| #define	GBIOSLOWMEM_SEL	8	/* BIOS low memory access (must be entry 8) */ | ||||
| #define	GPANIC_SEL	9	/* Task state to consider panic from */ | ||||
| #define GBIOSCODE32_SEL	10	/* BIOS interface (32bit Code) */ | ||||
| #define GBIOSCODE16_SEL	11	/* BIOS interface (16bit Code) */ | ||||
| #define GBIOSDATA_SEL	12	/* BIOS interface (Data) */ | ||||
| #define GBIOSUTIL_SEL	13	/* BIOS interface (Utility) */ | ||||
| #define GBIOSARGS_SEL	14	/* BIOS interface (Arguments) */</programlisting> | ||||
| 
 | ||||
|       <para>Note that those #defines are not selectors themselves, but | ||||
|         just the INDEX field of a selector, so they are exactly the | ||||
|         indices of the GDT.  For example, an actual selector for the | ||||
|         kernel code (GCODE_SEL) has the value 0x08.</para> | ||||
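<para>The relation between an index and a selector can be written down directly.  The following sketch assumes the standard IA-32 selector layout (INDEX in bits 3 to 15, the table indicator in bit 2, the requested privilege level in bits 0 and 1); <function>make_selector()</function> and <function>selector_index()</function> are names invented for this example.</para>

```c
#include <stdint.h>

/*
 * Build a selector from a descriptor table index.  `use_ldt' selects
 * the LDT instead of the GDT, and `rpl' is the requested privilege
 * level (0 for the kernel, 3 for userland).
 */
uint16_t
make_selector(unsigned index, int use_ldt, unsigned rpl)
{
	return ((uint16_t)((index << 3) | (use_ldt ? 4u : 0u) | (rpl & 3u)));
}

/* Recover the INDEX field from a selector. */
unsigned
selector_index(uint16_t sel)
{
	return (sel >> 3);
}
```

<para>With index 1 (GCODE_SEL), the GDT and privilege level 0, this gives 0x08, the value quoted above.</para>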
| 
 | ||||
|       <para>The next step is to initialize the Interrupt Descriptor | ||||
|         Table (IDT).  This table is referenced by the processor | ||||
|         when a software or hardware interrupt occurs.  For example, to | ||||
|         make a system call, a user application issues the <literal>INT | ||||
|         0x80</literal> instruction.  This is a software interrupt, so | ||||
|         the processor's hardware looks up a record with index 0x80 in | ||||
|         the IDT.  This record points to the routine that handles this | ||||
|         interrupt; in this particular case, this will be the kernel's | ||||
|         syscall gate.  The IDT may have a maximum of 256 (0x100) | ||||
|         records.  The kernel allocates NIDT records for the IDT, where | ||||
|         NIDT is the maximum (256):</para> | ||||
| 
 | ||||
|       <programlisting><filename>sys/i386/i386/machdep.c:</filename> | ||||
| static struct gate_descriptor idt0[NIDT]; | ||||
| struct gate_descriptor *idt = &idt0[0];	/* interrupt descriptor table */ | ||||
| </programlisting> | ||||
| 
 | ||||
|       <para>For each interrupt, an appropriate handler is set.  The | ||||
|         syscall gate for <literal>INT 0x80</literal> is set as | ||||
|         well:</para> | ||||
| 
 | ||||
|       <programlisting><filename>sys/i386/i386/machdep.c:</filename> | ||||
|  	setidt(0x80, &IDTVEC(int0x80_syscall), | ||||
| 			SDT_SYS386TGT, SEL_UPL, GSEL(GCODE_SEL, SEL_KPL));</programlisting> | ||||
| 
 | ||||
|       <para>So when a userland application issues the <literal>INT | ||||
|         0x80</literal> instruction, control will transfer to the | ||||
|         function <function>_Xint0x80_syscall</function>, which is in | ||||
|         the kernel code segment and will be executed with supervisor | ||||
|         privileges.</para> | ||||
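<para>What ends up in an IDT record can be illustrated with a small C sketch that packs an 8-byte gate descriptor.  It assumes the standard IA-32 gate layout and that the trap gate type has the numeric value 15; <function>pack_gate()</function> is a hypothetical helper and not the kernel's <function>setidt()</function>.</para>

```c
#include <stdint.h>

/*
 * Fill an 8-byte IA-32 gate descriptor: handler offset split into two
 * halves, the code segment selector, and the P:DPL:type byte.
 */
void
pack_gate(uint8_t gate[8], uint32_t handler, uint16_t selector,
    unsigned type, unsigned dpl)
{
	gate[0] = (uint8_t)(handler & 0xff);		/* offset 7..0 */
	gate[1] = (uint8_t)((handler >> 8) & 0xff);	/* offset 15..8 */
	gate[2] = (uint8_t)(selector & 0xff);		/* selector low */
	gate[3] = (uint8_t)(selector >> 8);		/* selector high */
	gate[4] = 0;					/* unused here */
	gate[5] = (uint8_t)(0x80 | ((dpl & 3) << 5) | (type & 0x0f));
	gate[6] = (uint8_t)((handler >> 16) & 0xff);	/* offset 23..16 */
	gate[7] = (uint8_t)((handler >> 24) & 0xff);	/* offset 31..24 */
}
```

<para>The gate's own privilege level of 3 is what lets userland issue the interrupt at all, while the selector stored in the gate (0x08, the kernel code segment) decides that the handler runs with supervisor privileges.</para>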
| 
 | ||||
|       <para>Console and DDB are then initialized:</para> | ||||
| 
 | ||||
|       <programlisting><filename>sys/i386/i386/machdep.c:</filename> | ||||
| 	cninit(); | ||||
| /* skipped */ | ||||
| #ifdef DDB | ||||
| 	kdb_init(); | ||||
| 	if (boothowto & RB_KDB) | ||||
| 		Debugger("Boot flags requested debugger"); | ||||
| #endif</programlisting> | ||||
| 
 | ||||
|       <para>The Task State Segment (TSS) is another x86 protected mode | ||||
|         structure.  The TSS is used by the hardware to store task | ||||
|         information when a task switch occurs.</para> | ||||
| 
 | ||||
|       <para>The Local Descriptor Table (LDT) is used to reference userland | ||||
|         code and data.  Several selectors are defined to point to the | ||||
|         LDT: the system call gates and the user code and data | ||||
|         selectors:</para> | ||||
| 
 | ||||
|       <programlisting><filename>/usr/include/machine/segments.h</filename> | ||||
| #define	LSYS5CALLS_SEL	0	/* forced by intel BCS */ | ||||
| #define	LSYS5SIGR_SEL	1 | ||||
| #define	L43BSDCALLS_SEL	2	/* notyet */ | ||||
| #define	LUCODE_SEL	3 | ||||
| #define LSOL26CALLS_SEL	4	/* Solaris >= 2.6 system call gate */ | ||||
| #define	LUDATA_SEL	5 | ||||
| /* separate stack, es,fs,gs sels ? */ | ||||
| /* #define	LPOSIXCALLS_SEL	5*/	/* notyet */ | ||||
| #define LBSDICALLS_SEL	16	/* BSDI system call gate */ | ||||
| #define NLDT		(LBSDICALLS_SEL + 1) | ||||
| </programlisting> | ||||
| 
 | ||||
|     <para>Next, proc0's Process Control Block (<literal>struct | ||||
|       pcb</literal>) structure is initialized.  proc0 is a | ||||
|       <literal>struct proc</literal> structure that describes a kernel | ||||
|       process.  It is always present while the kernel is running, | ||||
|       therefore it is declared as global:</para> | ||||
| 
 | ||||
|     <programlisting><filename>sys/kern/kern_init.c:</filename> | ||||
|     struct	proc proc0;</programlisting> | ||||
| 
 | ||||
|     <para>The structure <literal>struct pcb</literal> is a part of a | ||||
|       proc structure.  It is defined in | ||||
|       <filename>/usr/include/machine/pcb.h</filename> and holds | ||||
|       process information specific to the i386 architecture, such as | ||||
|       register values.</para> | ||||
| 
 | ||||
|     </sect2> | ||||
| 
 | ||||
|     <sect2> | ||||
|       <title><function>mi_startup()</function></title> | ||||
| 
 | ||||
|       <para>This function performs a bubble sort of all the system | ||||
|         initialization objects and then calls the entry of each object | ||||
|         one by one:</para> | ||||
| 
 | ||||
|       <programlisting><filename>sys/kern/init_main.c:</filename> | ||||
| 	for (sipp = sysinit; *sipp; sipp++) { | ||||
| 
 | ||||
| 		/* ... skipped ... */ | ||||
| 
 | ||||
| 		/* Call function */ | ||||
| 		(*((*sipp)->func))((*sipp)->udata); | ||||
| 		/* ... skipped ... */ | ||||
| 	}</programlisting> | ||||
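<para>The sort-then-call pattern can be tried out in isolation.  The sketch below uses a simple exchange sort keyed on subsystem and then order, in the same spirit as the loop above; <literal>struct init_obj</literal>, <function>run_inits()</function> and the tracing helper are made up for the example and are much simpler than the kernel's <literal>struct sysinit</literal> handling.</para>

```c
#include <stddef.h>

/* A stripped-down stand-in for a sysinit object (hypothetical type). */
struct init_obj {
	unsigned subsystem;		/* major ordering key */
	unsigned order;			/* minor ordering key */
	void (*func)(void *);		/* initialization routine */
	void *udata;			/* argument passed to func */
};

/* Records the order in which the routines were actually called. */
char init_trace[16];
size_t init_trace_len;

/* Example routine: append the first character of udata to the trace. */
void
trace_init(void *udata)
{
	if (init_trace_len < sizeof(init_trace) - 1)
		init_trace[init_trace_len++] = *(const char *)udata;
}

/* Sort a NULL-terminated array of objects, then call each in turn. */
void
run_inits(struct init_obj **set)
{
	struct init_obj **a, **b, *tmp;

	for (a = set; *a != NULL; a++) {
		for (b = a + 1; *b != NULL; b++) {
			if ((*a)->subsystem < (*b)->subsystem)
				continue;
			if ((*a)->subsystem == (*b)->subsystem &&
			    (*a)->order <= (*b)->order)
				continue;
			tmp = *a;	/* out of order: exchange */
			*a = *b;
			*b = tmp;
		}
	}
	for (a = set; *a != NULL; a++)
		(*a)->func((*a)->udata);
}
```

<para>Because the objects are sorted before any of them runs, the order in which they were registered does not matter; only the subsystem and order values do.</para>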
| 
 | ||||
|     <para>Although the sysinit framework is described in the | ||||
|       Developers' Handbook, I will discuss the internals of it.</para> | ||||
| 
 | ||||
|     <para>Every system initialization object (sysinit object) is | ||||
|       created by calling a SYSINIT() macro.  Let us take as an example the | ||||
|       <literal>announce</literal> sysinit object.  This object prints | ||||
|       the copyright message:</para> | ||||
| 
 | ||||
|     <programlisting><filename>sys/kern/init_main.c:</filename> | ||||
| static void | ||||
| print_caddr_t(void *data __unused) | ||||
| { | ||||
| 	printf("%s", (char *)data); | ||||
| } | ||||
| SYSINIT(announce, SI_SUB_COPYRIGHT, SI_ORDER_FIRST, print_caddr_t, copyright)</programlisting> | ||||
| 
 | ||||
|     <para>The subsystem ID for this object is SI_SUB_COPYRIGHT | ||||
|       (0x0800001), which comes right after the SI_SUB_CONSOLE | ||||
|       (0x0800000).  So, the copyright message will be printed out | ||||
|       first, just after the console initialization.</para> | ||||
| 
 | ||||
|     <para>Let us take a look at what exactly the macro | ||||
|       <literal>SYSINIT()</literal> does.  It expands to a | ||||
|       <literal>C_SYSINIT()</literal> macro.  The | ||||
|       <literal>C_SYSINIT()</literal> macro then expands to a static | ||||
|       <literal>struct sysinit</literal> structure declaration with | ||||
|       another <literal>DATA_SET</literal> macro call:</para> | ||||
|       <programlisting><filename>/usr/include/sys/kernel.h:</filename> | ||||
| #define C_SYSINIT(uniquifier, subsystem, order, func, ident)	\ | ||||
| 	static struct sysinit uniquifier ## _sys_init = {	\ | ||||
| 		subsystem,					\ | ||||
| 		order,						\ | ||||
| 		func,						\ | ||||
| 		ident						\ | ||||
| 	};							\ | ||||
| 	DATA_SET(sysinit_set,uniquifier ## _sys_init); | ||||
| 
 | ||||
| #define	SYSINIT(uniquifier, subsystem, order, func, ident)	\ | ||||
| 	C_SYSINIT(uniquifier, subsystem, order,			\ | ||||
| 	(sysinit_cfunc_t)(sysinit_nfunc_t)func, (void *)ident)</programlisting> | ||||
| 
 | ||||
|     <para>The <literal>DATA_SET()</literal> macro expands to a | ||||
|       <literal>MAKE_SET()</literal>, and that macro is the point where | ||||
|       all the sysinit magic is hidden:</para> | ||||
| 
 | ||||
|     <programlisting><filename>/usr/include/linker_set.h</filename> | ||||
| #define MAKE_SET(set, sym)						\ | ||||
| 	static void const * const __set_##set##_sym_##sym = &sym;	\ | ||||
| 	__asm(".section .set." #set ",\"aw\"");				\ | ||||
| 	__asm(".long " #sym);						\ | ||||
| 	__asm(".previous") | ||||
| #endif | ||||
| #define TEXT_SET(set, sym) MAKE_SET(set, sym) | ||||
| #define DATA_SET(set, sym) MAKE_SET(set, sym)</programlisting> | ||||
| 
 | ||||
|     <para>In our case, the following declaration will occur:</para> | ||||
| 
 | ||||
|     <programlisting>static struct sysinit announce_sys_init = { | ||||
| 	SI_SUB_COPYRIGHT, | ||||
| 	SI_ORDER_FIRST, | ||||
| 	(sysinit_cfunc_t)(sysinit_nfunc_t)  print_caddr_t, | ||||
| 	(void *) copyright | ||||
| }; | ||||
| 
 | ||||
| static void const *const __set_sysinit_set_sym_announce_sys_init = | ||||
|     &announce_sys_init; | ||||
| __asm(".section .set.sysinit_set" ",\"aw\""); | ||||
| __asm(".long " "announce_sys_init"); | ||||
| __asm(".previous");</programlisting> | ||||
| 
 | ||||
|     <para>The first <literal>__asm</literal> instruction will create | ||||
|       an ELF section within the kernel's executable.  This will happen | ||||
|       at kernel link time.  The section will have the name | ||||
|       ".set.sysinit_set".  The content of this section is one 32-bit | ||||
|       value, the address of the announce_sys_init structure, and that is | ||||
|       what the second <literal>__asm</literal> emits.  The third | ||||
|       <literal>__asm</literal> instruction marks the end of the section. | ||||
|       If a directive with the same section name occurred before, the | ||||
|       content, i.e. the 32-bit value, will be appended to the existing | ||||
|       section, thus forming an array of 32-bit pointers.</para> | ||||
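<para>The same idea can be reproduced in a user program.  Note that the sketch below swaps in a different mechanism from the <literal>__asm</literal> directives shown above: it relies on the section attribute of GCC-compatible compilers and on the <literal>__start_</literal> and <literal>__stop_</literal> symbols that GNU-style linkers generate for ELF sections whose names are valid C identifiers.  All names in it (<literal>demo_set</literal>, <literal>DEMO_SET_ADD</literal> and so on) are hypothetical.</para>

```c
#include <stddef.h>

/* The kind of object collected in the set (hypothetical type). */
struct demo_init {
	const char *name;
	int order;
};

/* Place a pointer to `sym' into the ELF section named demo_set. */
#define DEMO_SET_ADD(sym)						\
	static const struct demo_init *const demo_ptr_##sym		\
	    __attribute__((section("demo_set"), used)) = &sym

static const struct demo_init first = { "first", 1 };
static const struct demo_init second = { "second", 2 };
DEMO_SET_ADD(first);
DEMO_SET_ADD(second);

/* Section bounds, provided by the linker (an assumption of this sketch). */
extern const struct demo_init *const __start_demo_set[];
extern const struct demo_init *const __stop_demo_set[];

/* Number of pointers the linker gathered into the section. */
size_t
demo_set_count(void)
{
	return ((size_t)(__stop_demo_set - __start_demo_set));
}

/* Sum the order fields by walking the section like an array. */
int
demo_set_order_sum(void)
{
	const struct demo_init *const *p;
	int sum = 0;

	for (p = __start_demo_set; p < __stop_demo_set; p++)
		sum += (*p)->order;
	return (sum);
}
```

<para>Each use of the macro contributes one pointer, and the linker concatenates the contributions from all object files, so the section can be walked as an array without any central registration list.</para>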
| 
 | ||||
|     <para>Running <application>objdump</application> on a kernel | ||||
|       binary, you may notice the presence of such small sections:</para> | ||||
| 
 | ||||
|     <screen>&prompt.user; <userinput>objdump -h /kernel</userinput> | ||||
|   7 .set.cons_set 00000014  c03164c0  c03164c0  002154c0  2**2 | ||||
|                   CONTENTS, ALLOC, LOAD, DATA | ||||
|   8 .set.kbddriver_set 00000010  c03164d4  c03164d4  002154d4  2**2 | ||||
|                   CONTENTS, ALLOC, LOAD, DATA | ||||
|   9 .set.scrndr_set 00000024  c03164e4  c03164e4  002154e4  2**2 | ||||
|                   CONTENTS, ALLOC, LOAD, DATA | ||||
|  10 .set.scterm_set 0000000c  c0316508  c0316508  00215508  2**2 | ||||
|                   CONTENTS, ALLOC, LOAD, DATA | ||||
|  11 .set.sysctl_set 0000097c  c0316514  c0316514  00215514  2**2 | ||||
|                   CONTENTS, ALLOC, LOAD, DATA | ||||
|  12 .set.sysinit_set 00000664  c0316e90  c0316e90  00215e90  2**2 | ||||
|                   CONTENTS, ALLOC, LOAD, DATA</screen> | ||||
| 
 | ||||
|     <para>This screen dump shows that the size of the | ||||
|       <literal>.set.sysinit_set</literal> section is 0x664 bytes, so | ||||
|       <literal>0x664/sizeof(void *)</literal> sysinit objects are | ||||
|       compiled into the kernel.  The other sections, such as | ||||
|       <literal>.set.sysctl_set</literal>, represent other linker | ||||
|       sets.</para> | ||||
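<para>The division above can be worked through in a few lines.  This is a small sketch that assumes 4-byte pointers, as on IA-32, and models them with <literal>uint32_t</literal> so that the result does not depend on the build host; the function name <literal>sysinit_object_count</literal> is hypothetical.</para>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Size of .set.sysinit_set as reported by objdump above. */
#define SYSINIT_SET_SIZE	0x664u

/*
 * Each entry in the section is one pointer.  A pointer is assumed to be
 * 4 bytes wide (IA-32), modeled here with uint32_t.
 */
size_t
sysinit_object_count(size_t section_size)
{
	/* The section holds whole pointers only. */
	assert(section_size % sizeof(uint32_t) == 0);
	return (section_size / sizeof(uint32_t));
}
```

<para>For the kernel shown above, <literal>sysinit_object_count(SYSINIT_SET_SIZE)</literal> evaluates to 409, since 0x664 is 1636 in decimal.</para>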
| 
 | ||||
|     <para>By declaring a variable of type <literal>struct | ||||
|       linker_set</literal>, the content of the | ||||
|       <literal>.set.sysinit_set</literal> section is "collected" | ||||
|       into that variable:</para> | ||||
|       <programlisting><filename>sys/kern/init_main.c:</filename> | ||||
|       extern struct linker_set sysinit_set; /* XXX */</programlisting> | ||||
| 
 | ||||
|     <para>The <literal>struct linker_set</literal> is defined as | ||||
|       follows:</para> | ||||
| 
 | ||||
|     <programlisting><filename>/usr/include/sys/linker_set.h:</filename> | ||||
| struct linker_set { | ||||
| 	int	ls_length; | ||||
| 	void	*ls_items[1];		/* really ls_length of them, trailing NULL */ | ||||
| };</programlisting> | ||||
| 
 | ||||
|     <para>The first member holds the number of sysinit objects, and | ||||
|       the second member is a NULL-terminated array of pointers to | ||||
|       them.</para> | ||||
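<para>A structure of this shape can be walked either by its length or by its NULL terminator.  The sketch below rebuilds such a set on the heap in user space; the helpers <literal>demo_set_create</literal> and <literal>demo_set_walk</literal> are hypothetical, and in the kernel the set is produced at link time rather than allocated.</para>

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Same layout as the struct linker_set shown above. */
struct linker_set {
	int	ls_length;
	void	*ls_items[1];	/* really ls_length of them, trailing NULL */
};

/* Hypothetical helper: build a set on the heap from n pointers. */
struct linker_set *
demo_set_create(void *const *items, int n)
{
	struct linker_set *set;
	void **slot;
	int i;

	assert(n >= 0);
	/* ls_items[1] already provides room for the trailing NULL. */
	set = malloc(sizeof(*set) + (size_t)n * sizeof(void *));
	if (set == NULL)
		return (NULL);
	set->ls_length = n;
	slot = set->ls_items;
	for (i = 0; i < n; i++)
		slot[i] = items[i];
	slot[n] = NULL;
	return (set);
}

/* Count the entries by following the array up to its NULL terminator. */
int
demo_set_walk(const struct linker_set *set)
{
	void *const *ipp;
	int seen = 0;

	for (ipp = set->ls_items; *ipp != NULL; ipp++)
		seen++;
	return (seen);
}
```

<para>For a well-formed set, the value returned by <literal>demo_set_walk()</literal> equals <literal>ls_length</literal>.</para>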
| 
 | ||||
|     <para>Returning to the <function>mi_startup()</function> | ||||
|       discussion, it should now be clear how the sysinit objects are | ||||
|       organized.  The <function>mi_startup()</function> function | ||||
|       sorts them and calls each one.  The very last object is the | ||||
|       system scheduler:</para> | ||||
| 
 | ||||
|     <programlisting><filename>/usr/include/sys/kernel.h:</filename> | ||||
| enum sysinit_sub_id { | ||||
| 	SI_SUB_DUMMY		= 0x0000000,	/* not executed; for linker*/ | ||||
| 	SI_SUB_DONE		= 0x0000001,	/* processed*/ | ||||
| 	SI_SUB_CONSOLE		= 0x0800000,	/* console*/ | ||||
| 	SI_SUB_COPYRIGHT	= 0x0800001,	/* first use of console*/ | ||||
| ... | ||||
| 	SI_SUB_RUN_SCHEDULER	= 0xfffffff	/* scheduler: no return*/ | ||||
| };</programlisting> | ||||
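<para>The sort-then-call behaviour described above can be modeled in user space.  The following is a simplified sketch and not the kernel's <function>mi_startup()</function>: the structure <literal>demo_sysinit</literal>, the recorder <literal>demo_record</literal> and the use of <function>qsort()</function> are assumptions made for illustration.  Objects are ordered by subsystem ID first and by order within a subsystem second, and entries marked as dummy or done are skipped.</para>

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_SUB_DUMMY	0x0000000u	/* not executed */
#define DEMO_SUB_DONE	0x0000001u	/* already processed */

/* Hypothetical, reduced version of struct sysinit. */
struct demo_sysinit {
	unsigned int	subsystem;	/* e.g. the value of SI_SUB_COPYRIGHT */
	unsigned int	order;		/* position within the subsystem */
	void		(*func)(const void *);
	const void	*udata;
};

static char	demo_log[16];		/* zero-initialized, stays terminated */
static size_t	demo_log_len;

/* Example init routine: appends the first character of udata to the log. */
void
demo_record(const void *udata)
{
	if (demo_log_len < sizeof(demo_log) - 1)
		demo_log[demo_log_len++] = *(const char *)udata;
}

const char *
demo_log_get(void)
{
	return (demo_log);
}

/* Compare two array slots: by subsystem first, then by order. */
static int
demo_sysinit_cmp(const void *a, const void *b)
{
	const struct demo_sysinit *x = *(const struct demo_sysinit *const *)a;
	const struct demo_sysinit *y = *(const struct demo_sysinit *const *)b;

	if (x->subsystem != y->subsystem)
		return (x->subsystem < y->subsystem ? -1 : 1);
	if (x->order != y->order)
		return (x->order < y->order ? -1 : 1);
	return (0);
}

/* Sort the pointer array, then call every runnable object in turn. */
void
demo_startup(const struct demo_sysinit **set, size_t n)
{
	size_t i;

	if (n == 0)
		return;
	assert(set != NULL);
	qsort(set, n, sizeof(*set), demo_sysinit_cmp);
	for (i = 0; i < n; i++) {
		if (set[i]->subsystem == DEMO_SUB_DUMMY ||
		    set[i]->subsystem == DEMO_SUB_DONE)
			continue;
		set[i]->func(set[i]->udata);
	}
}
```

<para>Because the scheduler's subsystem ID is the largest value in the enumeration, an object registered with it is always called last, whatever its position in the unsorted set.</para>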
| 
 | ||||
|     <para>The system scheduler sysinit object is defined in the file | ||||
|       <filename>sys/vm/vm_glue.c</filename>, and the entry point for | ||||
|       that object is <function>scheduler()</function>.  That function | ||||
|       is actually an infinite loop, and it represents a process with | ||||
|       PID 0, the swapper process.  The proc0 structure, mentioned | ||||
|       before, is used to describe it.</para> | ||||
| 
 | ||||
|     <para>The first user process, called <emphasis>init</emphasis>, is | ||||
|       created by the sysinit object "init":</para> | ||||
| 
 | ||||
|     <programlisting><filename>sys/kern/init_main.c:</filename> | ||||
| static void | ||||
| create_init(const void *udata __unused) | ||||
| { | ||||
| 	int error; | ||||
| 	int s; | ||||
| 
 | ||||
| 	s = splhigh(); | ||||
| 	error = fork1(&proc0, RFFDG | RFPROC, &initproc); | ||||
| 	if (error) | ||||
| 		panic("cannot fork init: %d\n", error); | ||||
| 	initproc->p_flag |= P_INMEM | P_SYSTEM; | ||||
| 	cpu_set_fork_handler(initproc, start_init, NULL); | ||||
| 	remrunqueue(initproc); | ||||
| 	splx(s); | ||||
| } | ||||
| SYSINIT(init,SI_SUB_CREATE_INIT, SI_ORDER_FIRST, create_init, NULL)</programlisting> | ||||
| 
 | ||||
|   <para>The <function>create_init()</function> function allocates a | ||||
|     new process by calling <function>fork1()</function>, but does not | ||||
|     mark it runnable.  When the scheduler selects this new process for | ||||
|     execution, <function>start_init()</function> is called.  That | ||||
|     function is defined in <filename>init_main.c</filename>.  It | ||||
|     tries to load and exec the <filename>init</filename> binary, | ||||
|     probing <filename>/sbin/init</filename> first, then | ||||
|     <filename>/sbin/oinit</filename>, | ||||
|     <filename>/sbin/init.bak</filename>, and finally | ||||
|     <filename>/stand/sysinstall</filename>:</para> | ||||
| 
 | ||||
|   <programlisting><filename>sys/kern/init_main.c:</filename> | ||||
| static char init_path[MAXPATHLEN] = | ||||
| #ifdef	INIT_PATH | ||||
|     __XSTRING(INIT_PATH); | ||||
| #else | ||||
|     "/sbin/init:/sbin/oinit:/sbin/init.bak:/stand/sysinstall"; | ||||
| #endif</programlisting> | ||||
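<para>The probing order follows from the colon-separated layout of <literal>init_path</literal>.  The sketch below shows one way to pick out the candidates in order; it runs in user space, and <literal>demo_init_candidate</literal> is a hypothetical helper, not the code used by <function>start_init()</function>.</para>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the idx-th colon-separated component of path_list into buf.
 * Returns 1 on success, 0 if there is no such component, if it is
 * empty, or if it does not fit into buf.
 */
int
demo_init_candidate(const char *path_list, size_t idx, char *buf,
    size_t buflen)
{
	const char *start = path_list;
	const char *end;
	size_t len;

	assert(path_list != NULL && buf != NULL);
	for (;;) {
		end = strchr(start, ':');
		len = (end != NULL) ? (size_t)(end - start) : strlen(start);
		if (idx == 0)
			break;
		if (end == NULL)
			return (0);	/* ran out of components */
		start = end + 1;
		idx--;
	}
	if (len == 0 || len >= buflen)
		return (0);
	memcpy(buf, start, len);
	buf[len] = '\0';
	return (1);
}
```

<para>Calling the helper with increasing values of <literal>idx</literal> until it returns 0 yields the candidates in the same order in which they are listed in the string.</para>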
| 
 | ||||
|   </sect2> | ||||
| </sect1> | ||||
| 
 | ||||
| </chapter> | ||||
| 
 | ||||
| <!--  | ||||
|      Local Variables: | ||||
|      mode: sgml | ||||
|      sgml-declaration: "../chapter.decl" | ||||
|      sgml-indent-data: t | ||||
|      sgml-omittag: nil | ||||
|      sgml-always-quote-attributes: t | ||||
|      sgml-parent-document: ("../book.sgml" "part" "chapter") | ||||
|      End: | ||||
| --> | ||||