Describes the bootstrapping part of Linux, and related architectural mechanisms and technologies.
This is part two of the slides; succeeding slides may contain errata for these slides.
3. Agenda
• Virtual Memory
• From architectural view
• Unfortunately, this presentation again does not reach
the main part of the kernel!
• Appendices
• Source code-level overview of the bootstrapping process
• Linker Scripts
• Inline Assemblers
• Spaces, tabs, blank lines, and comments in the quoted
source code are (implicitly) omitted.
• Omitted effective lines are denoted by … or […]
4. Scope of the last presentation : x86
• Real Mode (16-bit)
• Boot sector, setup_header, and
16-bit entry point
• C-Language main function
• Retrieving memory information
• Transition to the protected
mode
• Protected Mode (32-bit)
• 32-bit(/64-bit) entry point,
preparing for decompression,
calling decompression code
• (EFI-Stub) efi_main (entry point
from UEFI)
• EFI call functions
• Protected Mode/Long Mode
• The beginning of the main
kernel
arch
x86
boot
header.S
main.c
memory.c
pm.c
pmjump.S
compressed
head_32.S
head_64.S
eboot.c
efi_stub_32.S
efi_stub_64.S
kernel
head_32.S
head_64.S
* The …_32.S files are used by the 32-bit kernel, and the …_64.S files by the 64-bit kernel.
5. Scope of the last presentation : ARM
• Compressed
• Entry point
• Decompressing function
• Actual decompressing
algorithm is in
lib/decompress_*.c
• Building a FDT from ATAGS for
compatibility
(CONFIG_ARM_ATAG_DTB_C
OMPAT)
• Decompressed
• The beginning of the main
kernel
arch
arm
boot
compressed
head.S
decompress.c
atags_to_fdt.c
kernel
head.S
6. Follow-ups for the last presentation
• x86 assembly language
• What about instructions with ≥3 operands?
• (e.g.) imul
• Multiply EBX by 19 (0x13) and store the result in EAX
• Therefore,
                 Operand order              Example
AT&T             Source, Destination        imul $0x13, %ebx, %eax
                 [[Op4,] Op3,] Op2, Op1
Intel            Destination, Source        IMUL EAX, EBX, 13h
                 Op1, Op2 [, Op3 [, Op4]]
7. Follow-ups for the last presentation
• Multiple Relocations?
• The conclusion is “at most once” (in x86 arch)
• ELF relocation may follow the decompression, so the kernel
may be relocated twice in this sense.
• See the relocation part in this presentation.
8. x86 Architecture : Segmentation
• 6 Segment Registers (16-bit registers)
• Code Segment Register: CS
• Data Segment Register: DS, ES, FS, GS
• Stack Segment Register: SS
• Real mode : 20-bit address space
• Linear address = physical address
• The size of each segment is 64 KB (16-bit)
• The segment register supplies the upper 16 bits of the segment's 20-bit
base address (base = register × 16)
• Protected mode : 32-bit/36-bit physical address space
• Logical –(Segmentation)-> Linear –(Paging)-> Physical
• The base and limit of each segment are stored in the descriptor table
• The segment registers point to entries in the table
• Long mode : 48-bit linear address space
• For CS, DS, ES, and SS, the base is always 0 and the limit is ignored.
• For FS and GS, the base can be set by the descriptor or through an MSR
(for > 32-bit addresses)
9. So what? (p.32)
9
vmlinux
boot/compressed/vmlinux.bin
(1a) Strip symbols
vmlinux.bin.xz
(2a) Concatenate and compress (gzip, bzip2,
lzma, lzo, lz4)
piggy.o
(3) mkpiggy (piggy-back)
Make an object that contains the compressed image
piggy.o*.o
boot/compressed/vmlinux
(4) Link with the other objects in
boot/compressed
(Decompressing codes)
(5) Transform it into a simple binary
boot/vmlinux.bin
boot/vmlinux.binboot/setup.bin
(6) Concatenate with real-mode
setup code, headers, and CRC32 CRC
boot/bzImage
(1b) Make relocation information
(2b) Append the original size info (except for gzip) vmlinux.bin.xz
vmlinux.relocs
Size
Errata?
11. Virtual Memory
• The address visible to a task is “virtualized,” i.e.
translated by hardware to a certain physical address
when it is actually accessed.
• The hardware mechanism to translate the address is called
MMU (memory management unit).
• Aim / Benefit
• Using a larger memory area than the machine is physically
equipped with
• Memory swapping, sparse memory areas
• Isolating tasks' memory areas so that different applications
cannot touch (read or write) each other's memory
• Not only between user tasks but also between the kernel and tasks
• Abstracting the memory resources
• Providing contiguous memory area even if there is no physically
contiguous memory area available.
• User programs can run with certain addresses regardless of the
physical addresses where they are actually running.
12. Two ways to virtual memory
• Paging
• Dividing the memory area into chunks (“pages”) with a
certain small size, and defining a map from each chunk
to its physical location
• A different task may have a different map of the
memory
• Some overhead (in both speed and memory) to
translate addresses and to hold the map
• Segmentation
• The address is considered to be an offset inside a certain
segment of memory
• Less overhead (just adding an offset), but impossible to
achieve swapping
14. Architecture and VM Capability
• x86
• Capable of paging
• The 16-bit and 32-bit modes have a segmentation feature
• 64-bit mode has a very limited segmentation feature
• Because almost no one is using the segmentation feature
effectively!
• (See “flat model” described in a later slide)
• ARM
• Some CPU series have an MMU and are capable of paging
• “A” series
• Some CPU series only have an MPU (memory protection unit)
• “R” series
• No MMU
• “M” series (MPU is optional)
15. Focusing on paging…
• How does it work?
1. A memory instruction is issued with a virtual address.
2. The CPU (MMU) looks for the virtual address in the TLB (Translation Lookaside Buffer).
3. If an entry exists (TLB hit), use the physical address in the TLB entry.
4. TLB miss!
• Hardware TLB: the CPU traverses the page table to find the physical address for the virtual address.
• Software TLB: the CPU calls the handler (the kernel's role) and asks it to fill in a TLB entry corresponding to the virtual address; the handler traverses the page table.
5. Present? Yes → use the physical address (and it may be remembered in the TLB).
6. No → page fault! Call the handler.
16. How far should hardware do?
• TLB (Translation Lookaside Buffer)
• Cache of “virtual-to-physical” mappings.
• Limited number of entries.
• Hardware-controlled TLB
• When TLB misses occur, the CPU traverses page tables
• The format for the page table is defined by the architecture.
• x86 and ARM
• Software-controlled TLB
• When a TLB miss occurs, software (typically, the OS kernel)
traverses the page tables, and reports the result (the translated
physical address) by filling in a TLB entry.
• Any type of page table may be used (a hash-based PT, for
example)
• But Linux uses almost the same format even for this type of architecture
• PowerPC
17. Multilevel Page Table (tree-like)
• Typical structure of page table
• The first-level page table consists of entries that point to
next-level page tables. The index into it comes from the most
significant bits of the virtual address.
• Of course, the next page table’s address is a physical address.
• The entries in the leaf page table denote physical
addresses.
[Figure: first-level → second-level → third-level page tables; the entries of the third-level (leaf) table hold physical addresses]
19. x86-64
• Currently, only 48-bit in a linear address is effective.
• 64-bit address is sign-extension of the 48-bit address.
• Supports up to 52 bits for physical addresses
• %cr3 register : the physical address for the current
PML4 table
• mov ~~, %cr3 switches the page table (flushing TLB)
• Four level
• One entry in PML4 table corresponds to 512 GB of
virtual memory, an entry in PDP table to 1 GB, and so on.
• Each entry is 8 byte
• Each table has 512 entries
• Thus, each table is 4 KB = 1 page.
20. Large Table
• One page occupies one entry in TLB
• If one process uses 1 GB of memory, it uses 256K
pages.
• i.e. If TLB does not have 256K entries (and usually it
doesn’t), TLB misses are inevitable
• x86_64 supports three types of page size
• 4 KB (normal)
• 2 MB
• 1 GB (!)
• The disadvantage is that a larger page requires physically
contiguous memory of the same size as the page
An entry in a higher-level page table directly
contains a physical address.
22. Linux kernel usage
• Large Page
• The kernel mapping
• The kernel creates a straight mapping of physical memory in the
kernel virtual address area
• This area is created at boot time, and never changes after that
• 1 GB and 2 MB pages are used
• Hugetlbfs
• Explicit use from user applications
• Transparent Huge Pages
• Implicit (transparent) use of large pages for user applications
23. ARM
• ARM
• Two memory architecture
• VMSA (Virtual Memory System Architecture) : MMU
• PMSA (Protected Memory System Architecture) : MPU
• VMSA
• Two page table formats
• Short descriptor table
• Up to two-level lookup
• 32-bit PA (*By supersection, 40-bit can be output)
• Long descriptor table
• Up to three-level lookup
• 40-bit PA
• Fixed size of page tables
24. Names in Linux
• Linux uses several arch-independent type names
for page table entries
• pgd_t, pud_t, pmd_t, pte_t
• Each type is one for an entry in a table of the corresponding
level
Architecture (& Config) Lv pgd_t pud_t pmd_t pte_t
x86_64 4 PML4E PDPTE PDE PTE
i386 (PAE) 3 PDPTE - PDE PTE
i386 2 PDE - - PTE
ARM (LPAE) 3 1st-lv. Desc. - 2nd-lv. Desc. 3rd-lv. Desc.
ARM 2 1st-lv. Desc. - - 2nd-lv. Desc.
ARM64 (64KB page) 2 1st-lv. Desc. - - 2nd-lv. Desc.
ARM64 3 1st-lv. Desc. - 2nd-lv. Desc. 3rd-lv. Desc.
(*)AArch64 supports four-level page tables, thus 48-bit VA.
25. Notes
• PAE (i386)
• Physical Address Extension
• For those who want to enjoy >4GB of memory in 32-bit mode.
• Virtual address remains 32-bit, but can map to any physical
address (< 64-bit)
• The size of each entry is extended to 64-bit
• CONFIG_X86_PAE
• LPAE
• Large Physical Address Extension
• Almost the same as PAE on i386
• “The current implementation limits the output address range
to 40 bits”
• Each entry is extended to 64-bit (long-descriptor translation table
format)
• CONFIG_ARM_LPAE
29. Page Attributes
• Pages can have attributes
• Used for memory protection
• Used for demand paging
• Used for COW (copy-on-write)
• Attributes
• Read / Write
• User / Privileged
• But where?
• In the page table entry corresponding to a page
• However, a page table entry is basically a physical
pointer, i.e. a 32-bit entry is occupied by 32-bit physical
pointer…
30. Page Attributes
• The lower bits in page table entries
• The start address of a page/page table is aligned!
• The lower bits are always zero.
[Figure: page-table-entry layouts]
• x86_64 PTE: bit 63 = XD; bits 62:52 ignored; bits 51:32 and 31:12 = physical address; low bits include G, PAT, D, A, PCD, PWT, U/S, R/W, P
• ARM (short descriptor, small page): bits 31:12 = physical address; attribute bits include nG, S, AP[2], TEX[2:0], AP[1:0], C, B, XN, with bit 1 = 1
31. Page Attributes Comparison

                          x86_64                 ARM (short)
Enabled?                  Present (P)            Descriptor type (bits 1 & 0)
RO or RW?                 Read/Write (RW)        AP[2:1] or AP[2:0]
Privileged only or any?   User/Supervisor (US)   AP[2:1] or AP[2:0]
Write-through?            PWT                    TEX[2:0], C, B
Cachable?                 PCD                    TEX[2:0], C, B
Accessed?                 Accessed (A)           AP[0] (*configurable)
Dirty?                    Dirty (D)              N/A
Memory Type               PAT                    TEX[0], C, B (*configurable)
Global                    Global (G)             Not Global (nG)
Executable?               Execute-Disable (XD)   Execute-Never (XN)
Sharable?                 (PAT)                  Sharable (S)
32. PowerPC Example [PowerPC 440]
• TLB is filled by software
• Search (tlbsx instruction), read/write (tlbre, tlbwe instructions)
[Figure: PowerPC 440 TLB entry format]
• Word 0: Effective Page Number [0:21], V (Valid), TS, SIZE, TPAR, TID
• Word 1 (PAR1): Real Page Number [0:21], ERPN
• Word 2 (PAR2): Reserved, U3–U0, W, I, M, G, E, UX/UW/UR, SX/SW/SR
• Attributes
• V : Valid
• SIZE : Page Size (4n KB, where n in
{0,1,2,3,4,5,7,9,10})
• U : User-defined storage attribute
• W: Write-through
• I: Caching Inhibited
• M: Memory coherency required
• G: Guarded
• E: Endian
• UX, UW, UR: User
executable, writable,
readable
• SX, SW, SR: Supervisor
executable, writable,
readable
• TPAR, PAR1, PAR2: Parity
33. Before the kernel starts…
• x86 (32-bit)
• Paging is disabled
• kernel/head_32.S creates a page table and turns on
paging
• x86 (64-bit)
• compressed/head_64.S creates an identity-mapped (virtual =
physical) page table for the first 4 GB
• Long mode requires paging to be enabled.
• kernel/head_64.S creates a better page table
• ARM
• kernel/head.S creates a page table and turns on paging
37. Real mode kernel (from p.45)
• header.S
• Boot sector code which is no longer used
• Contains setup_header
• Prepares stack and BSS to run C programs
• Jumps into the C program (main.c)
• main.c
• Copies setup_header into “zeropage”
• Sets up the early console
• Initializes the heap
• Checks the CPU (64-bit capable, for the 64-bit kernel?)
• Collects HW information by querying the BIOS, and stores the
results in “zeropage”
• Finally transitions to protected mode, and jumps into the
“protected-mode kernel”
39. Wait, how is the header code placed
at the beginning of the kernel?
• The linker concatenates multiple object files
• The positions in the resulting binary are not guaranteed
unless an order is given to the linker
• The linker script (.ld/.lds/.lds.S) dictates the positions
to the linker!
• Since you are quite likely to use the C preprocessor on a
linker script, files with the extension “.lds.S” are first
processed by the preprocessor, then passed to the linker.
• Passing a linker script with “-T” overrides the default
linker script
• The default linker script can be displayed with “ld --verbose"
40. LD script (1)
arch
x86
boot
setup.ld
compressed
vmlinux.lds.S
kernel
vmlinux.lds.S
/*
* setup.ld
*
* Linker script for the i386 setup code
*/
OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386")
OUTPUT_ARCH(i386)
ENTRY(_start)
SECTIONS
{
. = 0;
.bstext : { *(.bstext) }
.bsdata : { *(.bsdata) }
. = 495;
.header : { *(.header) }
.entrytext : { *(.entrytext) }
.inittext : { *(.inittext) }
.initdata : { *(.initdata) }
__end_init = .;
Specifies the output format (identical
to --oformat option)
OUTPUT_FORMAT(default, big, little)
Specifies the output architecture
Specifies the entry point symbol
(identical to -e option)
41. LD script (2)
/*
* setup.ld
*
* Linker script for the i386 setup code
*/
OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386")
OUTPUT_ARCH(i386)
ENTRY(_start)
SECTIONS
{
. = 0;
.bstext : { *(.bstext) }
.bsdata : { *(.bsdata) }
. = 495;
.header : { *(.header) }
.entrytext : { *(.entrytext) }
.inittext : { *(.inittext) }
.initdata : { *(.initdata) }
__end_init = .;
arch
x86
boot
setup.ld
compressed
vmlinux.lds.S
kernel
vmlinux.lds.S
Specifies how the sections are output
. means the current position
Assigning to . sets the
current position
Put the .bstext section at the current
position, i.e. at the address 0.
Put the .bsdata section after
the .bstext section.
42. bstext section…?
.code16
.section ".bstext", "ax"
.global bootsect_start
bootsect_start:
#ifdef CONFIG_EFI_STUB
# "MZ", MS-DOS header
.byte 0x4d
.byte 0x5a
#endif
# Normalize the start address
ljmp $BOOTSEG, $start2
start2:
movw %cs, %ax
movw %ax, %ds
movw %ax, %es
movw %ax, %ss
xorw %sp, %sp
sti
cld
movw $bugger_off_msg, %si
jmp msg_loop
arch
x86
boot
header.S
main.c
memory.c
pm.c
pmjump.S
compressed
head_32.S
head_64.S
eboot.c
efi_stub_32.S
efi_stub_64.S
kernel
head_32.S
head_64.S
Here it is!
[Notes]
.code16 = Assemble the following code as
16-bit code.
.section name[, flags] = Switches to the named section.
<flags> (excerpted)
• “a” : allocatable (loaded into memory when
executed)
• “w” : writable
• “x” : executable
.globl/.global symbol = Makes the symbol global
(visible from other objects)
43. LD script (3)
/*
* setup.ld
*
* Linker script for the i386 setup code
*/
OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386")
OUTPUT_ARCH(i386)
ENTRY(_start)
SECTIONS
{
. = 0;
.bstext : { *(.bstext) }
.bsdata : { *(.bsdata) }
. = 495;
.header : { *(.header) }
.entrytext : { *(.entrytext) }
.inittext : { *(.inittext) }
.initdata : { *(.initdata) }
__end_init = .;
arch
x86
boot
setup.ld
compressed
vmlinux.lds.S
kernel
vmlinux.lds.S
Specifies how the sections are output
Set the current position to 495
Places the header section at the
address 495
Declares a symbol “__end_init” that
refers to the current position
(the end of .initdata section)
45. LD script (5)
• To be precise,
• Output a section the name of which is “.bstext”
• The output section contains all of the input section “.bstext”
• The input and output need not be 1-to-1
• The output section “.text” contains all of the input section “.text”,
and then all of the sections the names of which start with “.text.”
• Creates the new symbols “_text” and “_etext” which denote the
beginning and ending of the output section “.text”, respectively.
.bstext : { *(.bstext) }
.text : {
_text = .; /* Text */
*(.text)
*(.text.*)
_etext = . ;
}
46. LD script (6)
. = ALIGN(16);
.data : { *(.data*) }
.signature : {
setup_sig = .;
LONG(0x5a5aaa55)
}
...
/DISCARD/ : { *(.note*) }
/*
* The ASSERT() sink to . is intentional, for
binutils 2.14 compatibility:
*/
. = ASSERT(_end <= 0x8000, "Setup too big!");
. = ASSERT(hdr == 0x1f1, "The setup header has
the wrong offset!");
/* Necessary for the very-old-loader check to
work... */
. = ASSERT(__end_init <= 5*512, "init sections
too big!");
}
arch
x86
boot
setup.ld
compressed
vmlinux.lds.S
kernel
vmlinux.lds.S
[Usage]
# Check signature at end of setup
cmpl $0x5a5aaa55, setup_sig
jne setup_bad
Align to the 16 byte boundary
Discard the sections .note*
Put this long value at the
current position
Assertions!
47. Column: align and balign
• LD’s ALIGN(x) returns the x-byte aligned address
• x must be power of two
• = (current + x – 1) & ~ (x – 1)
• GNU Assembler has two pseudo ops for alignment
• .align x, fill, max
• .balign x, fill, max
• Both align to the boundary specified by x, but the meaning of x differs by target…
• The skipped bytes are filled with fill (zero or a nop)
• The maximum number of bytes to be skipped can be specified
with max.
                         .align (x = 4)            .balign (x = 4)
i386 (ELF), sparc, etc.  Align to 4 bytes          Align to 4 bytes
ppc, i386 (a.out), arm   Align to 16 bytes (2^4)   Align to 4 bytes
48. COFF Stuffs
arch
x86
boot
header.S
main.c
memory.c
pm.c
pmjump.S
compressed
head_32.S
head_64.S
eboot.c
efi_stub_32.S
efi_stub_64.S
kernel
head_32.S
head_64.S
#ifdef CONFIG_EFI_STUB
.org 0x3c
# Offset to the PE header.
.long pe_header
#endif /* CONFIG_EFI_STUB */
.section ".bsdata", "a"
bugger_off_msg:
.ascii "Direct floppy boot is not supported. "
.ascii "Use a boot loader program instead.\r\n"
...
.byte 0
#ifdef CONFIG_EFI_STUB
pe_header:
.ascii "PE"
.word 0
coff_header:
#ifdef CONFIG_X86_32
.word 0x14c # i386
#else
.word 0x8664 # x86-64
#endif
[Notes]
.org location, fill = Set the current position to
location in the current section (filling the skipped
bytes with fill)
.ascii string = Put the string (w/o zero termination)
at the current position (cf. .asciz)
.byte val, .word val, .long val, .quad val
= Put the 1/2/4/8-byte value(s)
49. Real mode kernel (p.45)
• header.S
• Boot sector code which is no longer used
• Contains setup_header
• Prepares stack and BSS to run C programs
• Jumps into the C program (main.c)
• main.c
• Copies setup_header into “zeropage”
• Sets up the early console
• Initializes the heap
• Checks the CPU (64-bit capable, for the 64-bit kernel?)
• Collects HW information by querying the BIOS, and stores the
results in “zeropage”
• Finally transitions to protected mode, and jumps into the
“protected-mode kernel”
50. Entry point (2nd sector)
arch
x86
boot
header.S
main.c
memory.c
pm.c
pmjump.S
compressed
head_32.S
head_64.S
eboot.c
efi_stub_32.S
efi_stub_64.S
kernel
head_32.S
head_64.S
.section ".header", "a"
.globl sentinel
sentinel: .byte 0xff, 0xff /* Used to detect broken loaders */
.globl hdr
hdr:
setup_sects: .byte 0 /* Filled in by build.c */
root_flags: .word ROOT_RDONLY
syssize: .long 0 /* Filled in by build.c */
ram_size: .word 0 /* Obsolete */
vid_mode: .word SVGA_MODE
root_dev: .word 0 /* Filled in by build.c */
boot_flag: .word 0xAA55
# offset 512, entry point
.globl _start
_start:
.byte 0xeb # short (2-byte) jump
.byte start_of_setup-1f
1:
.ascii "HdrS" # header signature
.word 0x020d # header version number (>= 0x0105)
[Layout figure: .bstext and .bsdata from offset 0; .header at offset 495]
To prevent the assembler
from accidentally
producing a 3-byte jump
52. Column: Local Symbol in GAS
• Local symbols
• Symbols that can be used temporarily
• Format is N: (where N is a positive integer)
• To refer to the local symbols, use Nf or Nb.
• Nf refers to the next local label N.
• Nb refers to the most recently declared local label N.
• According to the GNU assembler manual, these symbols are
internally transformed into the following format:
• LN^BO
• ^B is Ctrl-B (0x02), O is a serial number
• For the 44th “3:”, “L3^B44” is used.
• Dollar local symbols (I haven’t seen this)
.byte start_of_setup-1f
1:
1:
jmp 1b
53. Get prepared to C (stack)
arch
x86
boot
header.S
main.c
memory.c
pm.c
pmjump.S
compressed
head_32.S
head_64.S
eboot.c
efi_stub_32.S
efi_stub_64.S
kernel
head_32.S
head_64.S
.section ".entrytext", "ax"
start_of_setup:
# Force %es = %ds
movw %ds, %ax
movw %ax, %es
cld
movw %ss, %dx
cmpw %ax, %dx # %ds == %ss?
movw %sp, %dx
je 2f # -> assume %sp is reasonably set
# Invalid %ss, make up a new stack
movw $_end, %dx
testb $CAN_USE_HEAP, loadflags
jz 1f
movw heap_end_ptr, %dx
1: addw $STACK_SIZE, %dx
jnc 2f
xorw %dx, %dx # Prevent wraparound
2: # Now %dx should point to the end of our stack space
andw $~3, %dx # dword align (might as well...)
jnz 3f
movw $0xfffc, %dx # Make sure we're not zero
3: movw %ax, %ss
movzwl %dx, %esp # Clear upper half of %esp
If %ds == %ss, %sp is
assumed to be properly set
by the loader
If not, sets up a new stack.
The address is _end +
STACK_SIZE (512 bytes) or
heap_end_ptr + STACK_SIZE (if
CAN_USE_HEAP is set)
54. In other words,
• Set the stack segment to be the same as %DS
• Allocate 512 bytes for the stack
unsigned short stack;
if (%ds != %ss) {
if (hdr.loadflags & CAN_USE_HEAP) {
stack = hdr.heap_end_ptr + STACK_SIZE;
} else {
stack = _end + STACK_SIZE;
}
if (carried over) { /* stack >= 0x10000 */
stack = 0;
}
}
/* Align to 4-byte */
stack &= ~3;
if (stack == 0)
stack = 0xfffc; /* – 4 */
%ss = %ds;
%esp = stack;
55. Get prepared to C
(CS fix and BSS clear)
55
sti # Now we should have a working stack
# We will have entered with %cs = %ds+0x20, normalize %cs so
# it is on par with the other segments.
pushw %ds
pushw $6f
lretw
6:
# Check signature at end of setup
cmpl $0x5a5aaa55, setup_sig
jne setup_bad
# Zero the bss
movw $__bss_start, %di
movw $_end+3, %cx
xorl %eax, %eax
subw %di, %cx
shrw $2, %cx
rep; stosl
# Jump to C code (should not return)
calll main
arch
x86
boot
header.S
main.c
memory.c
pm.c
pmjump.S
compressed
head_32.S
head_64.S
eboot.c
efi_stub_32.S
efi_stub_64.S
kernel
head_32.S
head_64.S
$6f is the address of the
forward local label 6:, i.e. its
offset within the setup code.
Signature check
Fill the bss by zero.
“rep; stosl” (string
instruction) fills the
memory from %es:%di
for %cx DWORDs with %eax.
56. [Column] Calling conventions
• 16 bit (name unknown)
• Arguments: %ax, %dx, %cx
• Return value: %ax
• 32 bit (cdecl)
• Arguments: pushed on the stack (in the reversed order of the
arguments)
• Caller-saved: %eax, %ecx, and %edx
• Callee-saved: the others
• Return value: %eax (for int)
• 64 bit (amd64 / System V)
• Arguments: %rdi, %rsi, %rdx, %rcx, %r8, %r9
• Caller-saved: all registers other than the callee-saved ones
• Callee-saved: %rbp, %rbx, %r12 to %r15
• Return value: %rax
f(2, 5, 9, 11);
[Stack figure, from top of stack: (return address), 2, 5, 9, 11]
57. Real mode kernel (p.45)
• header.S
• Boot sector code which is no longer used
• Contains setup_header
• Prepares stack and BSS to run C programs
• Jumps into the C program (main.c)
• main.c
• Copies setup_header into “zeropage”
• Sets up the early console
• Initializes the heap
• Checks the CPU (64-bit capable, for the 64-bit kernel?)
• Collects HW information by querying the BIOS, and stores the
results in “zeropage”
• Finally transitions to protected mode, and jumps into the
“protected-mode kernel”
58. main
void main(void)
{
/* First, copy the boot header into the "zeropage" */
copy_boot_params();
/* Initialize the early-boot console */
console_init();
...
/* End of heap check */
init_heap();
/* Make sure we have all the proper CPU support */
if (validate_cpu()) {
...
}
set_bios_mode();
detect_memory();
keyboard_init();
query_mca();
query_ist();
...
/* Set the video mode */
set_video();
/* Do the last things and invoke protected mode */
go_to_protected_mode();
}
arch
x86
boot
header.S
main.c
memory.c
pm.c
pmjump.S
compressed
head_32.S
head_64.S
eboot.c
efi_stub_32.S
efi_stub_64.S
kernel
head_32.S
head_64.S
59. Copy to zeropage
• Very simple
• The omitted part is for compatibility with the
old command-line parameter protocol
(located at a certain fixed address)
arch
x86
boot
header.S
main.c
memory.c
pm.c
pmjump.S
compressed
head_32.S
head_64.S
eboot.c
efi_stub_32.S
efi_stub_64.S
kernel
head_32.S
head_64.S
struct boot_params boot_params __attribute__((aligned(16)));
...
static void copy_boot_params(void)
{
...
BUILD_BUG_ON(sizeof boot_params != 4096);
memcpy(&boot_params.hdr, &hdr, sizeof hdr);
...
}
60. Set up the serial console
• Parses the command-line parameters
in a very ad hoc way to find the
serial configuration
• Finds “earlyprintk” and checks whether it has
one of the following formats
• “serial,0x3f8,115200”
• “serial,ttyS0,115200”
• “ttyS0,115200”
• Finds “console” and looks for “uart8250,io,…” or “uart,io,…”
• If any serial config is found, sets it up using I/O ports
void console_init(void)
{
parse_earlyprintk();
if (!early_serial_base)
parse_console_uart8250();
}
arch
x86
boot
header.S
main.c
early_serial_console.c
memory.c
pm.c
pmjump.S
compressed
head_32.S
head_64.S
eboot.c
efi_stub_32.S
efi_stub_64.S
kernel
head_32.S
head_64.S
61. Puts and putchar
• By BIOS call and serial I/O ports
void __attribute__((section(".inittext"))) putchar(int ch)
{
if (ch == '\n')
putchar('\r'); /* \n -> \r\n */
bios_putchar(ch);
if (early_serial_base != 0)
serial_putchar(ch);
}
void __attribute__((section(".inittext"))) puts(const char *str)
{
while (*str)
putchar(*str++);
}
arch
x86
boot
header.S
main.c
tty.c
memory.c
pm.c
pmjump.S
compressed
head_32.S
head_64.S
eboot.c
efi_stub_32.S
efi_stub_64.S
kernel
head_32.S
head_64.S
[Notes] GCC extension __attribute__
section(section) : locate the function/variable in
the specified section.
62. Serial and BIOS putchar
static void __attribute__((section(".inittext")))
serial_putchar(int ch)
{
unsigned timeout = 0xffff;
while ((inb(early_serial_base + LSR) & XMTRDY) == 0 && --timeout)
cpu_relax();
outb(ch, early_serial_base + TXR);
}
static void __attribute__((section(".inittext"))) bios_putchar(int
ch)
{
struct biosregs ireg;
initregs(&ireg);
ireg.bx = 0x0007;
ireg.cx = 0x0001;
ireg.ah = 0x0e;
ireg.al = ch;
intcall(0x10, &ireg, NULL);
}
arch
x86
boot
header.S
main.c
tty.c
memory.c
pm.c
pmjump.S
compressed
head_32.S
head_64.S
eboot.c
efi_stub_32.S
efi_stub_64.S
kernel
head_32.S
head_64.S
Put a char on a
serial line by
using I/O ports
(IN and OUT
instructions)
Put a char on
VGA by BIOS
Call (INT 0x10,
AH = 0x0e)
63. BIOS Call
• A BIOS call is invoked by using an INT instruction
• Requires assembly-language support
• Parameters and return values are passed in a certain set
of registers
• The INT instruction only takes an immediate operand as the interrupt
number.
• C prototype:
• struct biosregs holds all the
general registers,
the data segment registers,
and the flag register
void intcall(u8 int_no, const struct biosregs *ireg,
struct biosregs *oreg);
void initregs(struct biosregs *reg)
{
memset(reg, 0, sizeof *reg);
reg->eflags |= X86_EFLAGS_CF;
reg->ds = ds();
reg->es = ds();
reg->fs = fs();
reg->gs = gs();
}
64. BIOS Call Impl. (1)
arch
x86
boot
header.S
main.c
bioscall.S
memory.c
pm.c
pmjump.S
compressed
head_32.S
head_64.S
eboot.c
efi_stub_32.S
efi_stub_64.S
kernel
head_32.S
head_64.S
.code16
.section ".inittext","ax"
.globl intcall
...
intcall:
cmpb %al, 3f
je 1f
movb %al, 3f
jmp 1f /* Synchronize pipeline */
1:
...
/* Actual INT */
.byte 0xcd /* INT opcode */
3: .byte 0
...
void intcall(u8 int_no, const struct biosregs
*ireg, struct biosregs *oreg);
(int_no is passed in %ax, ireg in %dx, and oreg in %cx)
Checks the current operand of the INT
instruction, and rewrites (self-modifies)
the interrupt number if it differs.
65. BIOS Call Impl. (2)
1:
/* Save state */
pushfl
pushw %fs
pushw %gs
pushal
/* Copy input state to stack frame */
subw $44, %sp
movw %dx, %si
movw %sp, %di
movw $11, %cx
rep; movsd
/* Pop full state from the stack */
popal
popw %gs
popw %fs
popw %es
popw %ds
popfl
/* Actual INT */
.byte 0xcd /* INT opcode */
3: .byte 0
[Stack figure: below the saved caller state (pushfl, %fs, %gs, pushal), a 44-byte copy of struct biosregs *ireg is built, then popped into the actual registers (popal, %gs, %fs, %es, %ds, popfl) right before the INT]
66. BIOS Call Impl. (3)
/* Push full state to the stack */
pushfl
pushw %ds
pushw %es
pushw %fs
pushw %gs
pushal
...
(Restore %ds, %sp, etc.)
...
/* Copy output state from stack frame */
movw 68(%esp), %di /* Original %cx == 3rd
argument */
andw %di, %di
jz 4f
movw %sp, %si
movw $11, %cx
rep; movsd
/* Restore state and return */
popal
popw %gs
popw %fs
popfl
retl
[Stack figure: after the INT, the full register state (pushfl, %ds, %es, %fs, %gs, pushal) is pushed again and copied out to *oreg, then the caller's registers are restored]
67. Inline assembly
• A quick way to use assembly language inside C
source code
• For example, when you want to disable interrupts, put
asm("cli");
into your C code.
• GCC’s extended inline assembly enables
far more features (and is more complicated)
• => Described twenty or so slides later!
static inline void outb(u8 v, u16 port)
{
asm volatile("outb %0,%1" : : "a" (v), "dN" (port));
}
68. Initialize the heap
char *HEAP = _end;
char *heap_end = _end; /* Default end of heap = no heap */
...
static void init_heap(void)
{
char *stack_end;
if (boot_params.hdr.loadflags & CAN_USE_HEAP) {
asm("leal %P1(%%esp),%0"
: "=r" (stack_end) : "i" (-STACK_SIZE));
heap_end = (char *)
((size_t)boot_params.hdr.heap_end_ptr +
0x200);
if (heap_end > stack_end)
heap_end = stack_end;
} else {
/* Boot protocol 2.00 only, no heap available */
puts("WARNING: Ancient bootloader, some functionality "
"may be limited!\n");
}
}
arch
x86
boot
header.S
main.c
memory.c
pm.c
pmjump.S
compressed
head_32.S
head_64.S
eboot.c
efi_stub_32.S
efi_stub_64.S
kernel
head_32.S
head_64.S
Stores %esp − STACK_SIZE into stack_end
(heap_end is then capped at stack_end)
69. When is the heap used?
• Heap allocation function is very simple
• And the calls for GET_HEAP exist only in the video
code files.
static inline char *__get_heap(size_t s, size_t a, size_t n)
{
char *tmp;
HEAP = (char *)(((size_t)HEAP+(a-1)) & ~(a-1));
tmp = HEAP;
HEAP += s*n;
return tmp;
}
#define GET_HEAP(type, n) \
((type *)__get_heap(sizeof(type),__alignof__(type),(n)))
saved.data = GET_HEAP(u16, saved.x*saved.y);
(boot/video.c)
70. Retrieving memory info.
• As described in the last presentation,
detect_memory tries 3 methods
int detect_memory(void)
{
...
if (detect_memory_e820() > 0)
err = 0;
if (!detect_memory_e801())
err = 0;
if (!detect_memory_88())
err = 0;
return err;
}
arch
x86
boot
header.S
main.c
memory.c
pm.c
pmjump.S
compressed
head_32.S
head_64.S
eboot.c
efi_stub_32.S
efi_stub_64.S
kernel
head_32.S
head_64.S
71. Memory Information [from p.48]
• AX = 0xe820, INT 0x15 [detect_memory_e820()]
• INPUT
• AX = 0xe820
• CX = size of the buffer
• EDX = “SMAP” (0x534d4150 / Signature)
• EBX = Continuation value
• ES:DI = address for the buffer
• OUTPUT
• CF = 0 if successful, 1 otherwise
• CX = the number of bytes returned
• EBX = Continuation value
• Each call returns information for one range
• To get information for the next range, give the continuation value returned in
the previous call
• The range information is returned by the following structure
• Stored in boot_params.e820_map (struct e820entry[128])
struct e820entry {
	__u64 addr;	/* start of memory segment */
	__u64 size;	/* size of memory segment */
	__u32 type;	/* type of memory segment */
} __attribute__((packed));
(arch/x86/include/uapi/asm/e820.h)
74. Go To Protected Mode
void go_to_protected_mode(void)
{
/* Hook before leaving real mode, also disables interrupts */
realmode_switch_hook();
/* Enable the A20 gate */
if (enable_a20()) {
puts("A20 gate not responding, unable to boot...n");
die();
}
/* Reset coprocessor (IGNNE#) */
reset_coprocessor();
/* Mask all interrupts in the PIC */
mask_all_interrupts();
/* Actual transition to protected mode... */
setup_idt();
setup_gdt();
protected_mode_jump(boot_params.hdr.code32_start,
(u32)&boot_params + (ds() << 4));
}
75. Go To PM: Details (1)
• Call the hook if set
• Otherwise, disable interrupts and NMI.
75
static void realmode_switch_hook(void)
{
if (boot_params.hdr.realmode_swtch) {
asm volatile("lcallw *%0"
: : "m" (boot_params.hdr.realmode_swtch)
: "eax", "ebx", "ecx", "edx");
} else {
asm volatile("cli");
outb(0x80, 0x70); /* Disable NMI */
io_delay();
}
}
If a hook is set in realmode_swtch, call the hook.
Otherwise, write 0x80 to port 0x70 (the CMOS controller!).
For historical reasons, the "NMI disable" bit is located in the
CMOS controller.
76. Go To PM: Details (2)
• Enable A20 line (20th bit of the address bus)
• In the initial state, the bit is masked (always 0)
• For compatibility with the program that expects address
wraparound in 1MB
• Some programs expect the address 0xFFFFF + 1 = 0x00000
• To use the full 32-bit address space, this mask must be disabled.
• Many ways to do it
• But which way works depends on the firmware
• The famous one would be via the keyboard controller
port!
• Linux tries several ways, and several times
76
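With A20 masked, only the low 20 address bits reach the bus, so 0xFFFFF + 1 wraps around to 0. A tiny model of that masking:

```c
#include <stdint.h>

/* Physical address seen on the bus when the A20 line is masked:
 * every bit above bit 19 is forced to zero, so addresses wrap
 * around at 1 MB, as pre-286 software expected. */
static uint32_t a20_masked(uint32_t addr)
{
    return addr & ((1u << 20) - 1);
}
```

This wraparound is exactly what the kernel's a20_test exploits: it writes through one segment and checks whether the write aliases 1 MB higher.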
77. Go To PM: Details (3)
77
int enable_a20(void)
{...
while (loops--) {
if (a20_test_short())
return 0;
/* Next, try the BIOS (INT 0x15, AX=0x2401) */
enable_a20_bios();
if (a20_test_short())
return 0;
/* Try enabling A20 through the keyboard controller */
kbc_err = empty_8042();
if (a20_test_short())
return 0; /* BIOS worked, but with delayed reaction */
if (!kbc_err) {
enable_a20_kbc();
if (a20_test_long())
return 0;
}
/* Finally, try enabling the "fast A20 gate" */
enable_a20_fast();
if (a20_test_long())
return 0;
}
...
}
*Tries 100000 times at most
78. Go To PM: Details (4)
78
/*
* Reset IGNNE# if asserted in the FPU.
*/
static void reset_coprocessor(void)
{
outb(0, 0xf0);
io_delay();
outb(0, 0xf1);
io_delay();
}
/*
* Disable all interrupts at the legacy PIC.
*/
static void mask_all_interrupts(void)
{
outb(0xff, 0xa1); /* Mask all interrupts on the secondary PIC */
io_delay();
outb(0xfb, 0x21); /* Mask all but cascade on the primary PIC */
io_delay();
}
The PIC (8259) is the oldest, legacy interrupt controller
79. Go To PM: Details (5)
• IDT (Interrupt Descriptor Table)
• Describes the exception/interrupt handlers
(and task gate, etc.)
• At this time, no IDT is installed.
• null_idt contains information for the address and size for the
IDT, both of which are zero.
• LIDT instruction takes an argument that is a pointer to the
information.
79
static void setup_idt(void)
{
static const struct gdt_ptr null_idt = {0, 0};
asm volatile("lidtl %0" : : "m" (null_idt));
}
IDT register operand layout: len (16-bit), address (32-bit) → IDT
80. Go To PM: Details (6)
• GDT (Global Descriptor Table)
• Describes the segment information
80
static void setup_gdt(void)
{
static const u64 boot_gdt[] __attribute__((aligned(16))) = {
/* CS: code, read/execute, 4 GB, base 0 */
[GDT_ENTRY_BOOT_CS] = GDT_ENTRY(0xc09b, 0, 0xfffff),
/* DS: data, read/write, 4 GB, base 0 */
[GDT_ENTRY_BOOT_DS] = GDT_ENTRY(0xc093, 0, 0xfffff),
/* TSS: 32-bit tss, 104 bytes, base 4096 */
/* We only have a TSS here to keep Intel VT happy;
we don't actually use it for anything. */
[GDT_ENTRY_BOOT_TSS] = GDT_ENTRY(0x0089, 4096, 103),
};
static struct gdt_ptr gdt;
gdt.len = sizeof(boot_gdt)-1;
gdt.ptr = (u32)&boot_gdt + (ds() << 4);
asm volatile("lgdtl %0" : : "m" (gdt));
}
GDT register operand layout: len (16-bit), address (32-bit) → boot_gdt (the GDT)
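GDT_ENTRY packs base, limit, and the 16 flag bits into one 64-bit descriptor with the fields interleaved. The sketch below reproduces that bit layout in plain C (written from the kernel's GDT_ENTRY macro; treat it as an illustration, not the kernel source):

```c
#include <stdint.h>

/* Pack a segment descriptor the way the kernel's GDT_ENTRY(flags,
 * base, limit) macro does: limit and base are split across the
 * 64-bit entry, with the flag bits (type, S, DPL, P, AVL, L, D/B, G)
 * interleaved in between. */
static uint64_t gdt_entry(uint16_t flags, uint32_t base, uint32_t limit)
{
    return ((uint64_t)(base  & 0xff000000u) << 32) |  /* base 31:24  */
           ((uint64_t)(flags & 0x0000f0ffu) << 40) |  /* flag bits   */
           ((uint64_t)(limit & 0x000f0000u) << 32) |  /* limit 19:16 */
           ((uint64_t)(base  & 0x00ffffffu) << 16) |  /* base 23:0   */
           ((uint64_t)(limit & 0x0000ffffu));         /* limit 15:0  */
}
```

For example, the boot CS entry GDT_ENTRY(0xc09b, 0, 0xfffff) packs to 0x00cf9b000000ffff.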
81. x86 Architecture: GDT
• GDT
• Each entry has 8 byte
• Offset, limit, and attributes
• DPL: Descriptor privileged level (0-3: 0 is the most privileged)
• When the processor executes code in a code segment, the
current privilege level (CPL) equals the DPL of that code
segment. It can access data segments whose DPL >= CPL.
81
Descriptor layout (8 bytes):
• Bytes 0–1: Limit 15:00
• Bytes 2–3: Base Address 15:00
• Byte 4: Base 23:16
• Byte 5: Type, S, DPL, P
• Byte 6: Limit 19:16, AVL, L, D/B, G
• Byte 7: Base 31:24
[GDT_ENTRY_BOOT_CS] = GDT_ENTRY(0xc09b, 0, 0xfffff),
[GDT_ENTRY_BOOT_DS] = GDT_ENTRY(0xc093, 0, 0xfffff),
Access byte 0x9b = present, DPL 0, code, execute/read
Access byte 0x93 = present, DPL 0, data, read/write
82. x86 Architecture:
Flat Model and Segment Register
• Although the segment feature is available in 32-bit, the
common use is called “Flat Model.”
• Uses a single segment from zero to 2^32 − 1
• To be precise, different segments are required for code/data and
privileged/user mode.
• Linux uses four segments: KERNEL_CS, KERNEL_DS, USER_CS,
USER_DS
• During boot time, BOOT_CS and BOOT_DS are used (as defined in
the previous slide)
• Segment Register (Selector)
• If CS is to select BOOT_CS, CS = (index of BOOT_CS) << 3;
• GDT_ENTRY_BOOT_CS = (Index of BOOT_CS) = 2, then CS = 16
• The constants BOOT_CS = 16, BOOT_DS = 24.
• Note the difference between “ENTRY” and the actual value.
82
Selector layout (16 bits): Index (bits 15:3), TI (bit 2), RPL (bits 1:0)
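The selector layout above can be sketched as a small packing helper (an illustrative function, not kernel code):

```c
#include <stdint.h>

/* Segment selector: descriptor index in bits 15:3, TI (table
 * indicator: 0 = GDT, 1 = LDT) in bit 2, RPL (requested privilege
 * level) in bits 1:0. */
static uint16_t selector(uint16_t index, int ti, int rpl)
{
    return (uint16_t)((index << 3) | ((ti & 1) << 2) | (rpl & 3));
}
```

With GDT_ENTRY_BOOT_CS = 2 and GDT_ENTRY_BOOT_DS = 3, this yields the __BOOT_CS = 16 and __BOOT_DS = 24 values quoted above.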
83. Go To PM: Details (7)
• Call the assembler part (no return)
83
protected_mode_jump(boot_params.hdr.code32_start,
(u32)&boot_params + (ds() << 4));
GLOBAL(protected_mode_jump)
movl %edx, %esi # Pointer to boot_params table
xorl %ebx, %ebx
movw %cs, %bx
shll $4, %ebx
addl %ebx, 2f
jmp 1f # Short jump to serialize on 386/486
1:
movw $__BOOT_DS, %cx
movw $__BOOT_TSS, %di
movl %cr0, %edx
orb $X86_CR0_PE, %dl # Protected mode
movl %edx, %cr0
# Transition to 32-bit mode
.byte 0x66, 0xea # ljmpl opcode
2: .long in_pm32 # offset
.word __BOOT_CS # segment
ENDPROC(protected_mode_jump)
(The two arguments arrive in %eax and %edx under the regparm calling
convention: %eax = code32_start, %edx = the address of boot_params.)
The "addl %ebx, 2f" patches the jump target:
*(uint32_t *)2f += cs() << 4;
(i.e. the physical address of in_pm32)
[Notes]
In real mode, physical address = (Segment Register << 4) + Offset
To enter protected mode, set PE bit
in %cr0 register.
Then a 32-bit far jump (not expressible in real-mode assembly)
transfers control to 32-bit code.
85. Extended inline assembler (1)
• GCC’s extended inline assembly language
• Input/output operands can be specified for the assembler
• Assembler template
• The actual assembly language with templates that will be
substituted by the output/input operands
• Output operands
• List of C variables modified by the assembler template
• Input operands
• List of C expressions read by the instructions in the assembler
template.
• Clobber
• List of registers/values to be changed by the assembler template
(other than the output operands)
85
asm [volatile] (assembler template : [ output operands [ : input operands [ : clobber ]]])
static inline void outb(u8 v, u16 port)
{
asm volatile("outb %0,%1" : : "a" (v), "dN" (port));
}
Assembler template Input operands
86. Extended inline assembler (2)
• Assembler template
• Basically, the same as the standalone assembly language
• %n (n is zero or positive integer) refers to the (n+1)-th
operand in the input and output operands.
• If the character “%” is to be used (to specify a certain
register “%ebx,” for example), “%%” must be used.
• Other than the number, the name can be used to specify
an operand. (%[symbolicname] refers to the operand
with the name [symbolicname])
• To use multiple instructions, separate them with “;” or “\n”
86
static inline void outb(u8 v, u16 port)
{
asm volatile("outb %0,%1" : : "a" (v), "dN" (port));
}
87. Extended inline assembler (3)
• Input operands
• Comma-separated list of C expressions prefixed with
constraints
• A constraint specifies how the expression is passed to
the assembler template.
• When multiple constraints are specified, the compiler
selects the most efficient one.
87
asm volatile("outb %0,%1" : : "a" (v), "dN" (port));
constraint
C expression
Constraints (generic)
• ‘m’ : Memory operand
• ‘r’ : General register
• ‘i’ : Immediate integer
• ‘0’–‘9’ : The same place as operand n
Constraints (x86-specific)
• ‘a’ ‘b’ ‘c’ ‘d’ : A, B, C, D register
• ‘S’ ‘D’ : SI, DI register
• ‘N’ : Unsigned 8-bit integer (for in/out instructions)
• ‘A’ : EDX:EAX (32-bit), RDX:RAX (64-bit)
[SymbolicName] “Constraints” (C Expression),…
88. Extended inline assembler (4)
• In this example,
• The value of v (u8) is stored in %al register
• The value of port (u16) is stored in %dx register or used
as 8-bit immediate.
• This function is declared as “inline,” so if this function is called
with a constant value as port which is less than 256, the “N”
constraint may be used.
• Then, the instruction(s) in the assembler templates are
executed.
• The resulting assembly language will be
88
asm volatile("outb %0,%1" : : "a" (v), "dN" (port));
outb:
movl 8(%esp), %edx
movl 4(%esp), %eax
outb %al,%dx
ret
89. Extended inline assembler (5)
• Output operands
• Comma-separated list of C variables prefixed with constraints
• Constraints should be prefixed with “=” or “+”
• “+” means that the variable is used as both an input and an
output operand.
• The “&” (early-clobber) modifier allocates a register different
from the input operands (for templates with multiple
instructions, this modifier may be necessary)
• After the instruction(s) in the assembler template is executed,
the value of A register (%al) is stored to the variable v.
89
[SymbolicName] “=Constraints” (C Variable),…
static inline u8 inb(u16 port)
{
u8 v;
asm volatile("inb %1,%0" : "=a" (v) : "dN" (port));
return v;
}
90. Extended inline assembler (6)
• Clobber
• The list of registers/values modified by the instructions
• The output registers need not be specified here.
• The most common clobber is “memory”
• This means that the memory contents may be changed as side
effects, thus all the variables should be written back to the
memory before the assembler, and should be read again from the
memory after the assembler.
• “cc” : Condition (flags) registers
90
void *memcpy(void *dest, const void *src, size_t n)
{
int d0, d1, d2;
asm volatile(
"rep ; movsl\n\t"
"movl %4,%%ecx\n\t"
"rep ; movsb\n\t"
: "=&c" (d0), "=&D" (d1), "=&S" (d2)
: "0" (n >> 2), "g" (n & 3), "1" (dest), "2" (src)
: "memory");
return dest;
}
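The constraint trick in this memcpy is the word/byte split: n >> 2 goes to %ecx for rep movsl, and n & 3 leftover bytes are copied by rep movsb. The same split in plain C (a model of the idea, not the kernel's code):

```c
#include <stddef.h>
#include <string.h>  /* memcmp used in the usage test */

/* Copy n bytes as (n >> 2) four-byte chunks plus (n & 3) trailing
 * bytes, mirroring what the rep movsl / rep movsb pair does. */
static void *memcpy_split(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;
    size_t words = n >> 2;  /* what the "0" constraint loads into %ecx */
    size_t tail  = n & 3;   /* what "%4" reloads for rep movsb */
    for (size_t i = 0; i < words * 4; i++)
        d[i] = s[i];
    for (size_t i = 0; i < tail; i++)
        d[words * 4 + i] = s[words * 4 + i];
    return dest;
}
```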
92. Extended inline assembler (7)
• Examples (appeared in the previous slides)
• Example 1
• Stores %esp – STACK_SIZE to stack_end
• P in “%P1” is an operand modifier (not listed in the GCC
documentation at the time)
92
asm("leal %P1(%%esp),%0"
: "=r" (stack_end) : "i" (-STACK_SIZE));
• With “%P1”: leal -512(%esp),%eax
• With “%1”: leal $-512(%esp),%eax
• With “%c1” (“constant expression with no punctuation”): leal -512(%esp),%eax
93. Extended inline assembler (8)
• Example 2
• Far-calls the address (the value of
boot_params.hdr.realmode_swtch)
• The registers eax, ebx, ecx, and edx will be changed in
this call.
• Example 3
93
asm volatile("lcallw *%0"
: : "m" (boot_params.hdr.realmode_swtch)
: "eax", "ebx", "ecx", "edx");
static const struct gdt_ptr null_idt = {0, 0};
asm volatile("lidtl %0" : : "m" (null_idt));
setup_idt:
lidtl null_idt.1378
ret
95. Extended inline assembler (10)
• The key point
• The context is the stack
• The switched task resumes at “1:”. (just after “jmp
__switch_to”)
• The “__switch_to” function is called with a “jmp”
instruction, not a “call” instruction.
• Anyway
• The template does not use %n (number), but %[name]
style. (too many parameters)
95
asm volatile(...
"movl %%esp,%[prev_sp]nt" /* save ESP */
...
/* output parameters */
: [prev_sp] "=m" (prev->thread.sp),
99. Protected-Mode Kernel (p.54)
• arch/x86/boot/compressed/head_{32,64}.S
• Goal: Decompresses the kernel (vmlinux.gz/.bz2/.xz…)
and start the kernel
• Relocates the decompressing code (if relocatable and
loaded at a different address)
• Enables paging and enters the long-mode (in head_64.S)
• Clears the BSS, and prepares the heap and stack
• Decompresses the kernel
• Relocates if required
• RANDOMIZED_BASE or RELOCATABLE (in 32-bit)
99
102. Entry point (32-bit)
102
.text
__HEAD
ENTRY(startup_32)
#ifdef CONFIG_EFI_STUB
jmp preferred_addr
...
preferred_addr:
#endif
cld
testb $(1<<6), BP_loadflags(%esi)
jnz 1f
cli
movl $__BOOT_DS, %eax
movl %eax, %ds
movl %eax, %es
movl %eax, %fs
movl %eax, %gs
movl %eax, %ss
1:
.section ".head.text","ax"
If KEEP_SEGMENTS is set in loadflags in
boot_params, do not reload the segment registers.
103. Protected-Mode Protocol (p.53)
• Starts at the top of the protected mode kernel
• Usually loaded at 0x100000 (1MB)
• Can be at any position if compiled as relocatable
• Should be at the same position as specified in the compile
time if compiled as not relocatable
• Used in “linux” module in GRUB2
• [Protocol] At the entry point,
• The loaded GDT must have __BOOT_CS (0x10 / execute and
read) and __BOOT_DS (0x18 / read and write)
• %cs must be __BOOT_CS
• %ds, %es, and %ss must be __BOOT_DS
• Interrupts must be disabled
• %esi must be the address for struct boot_params
• %ebp, %edi, and %ebx must be zero.
103
104. Protected-Mode Kernel (p.54)
• arch/x86/boot/compressed/head_{32,64}.S
• Goal: Decompresses the kernel (vmlinux.gz/.bz2/.xz…)
and start the kernel
• Relocates the decompressing code (if relocatable and
loaded at a different address)
• Enables paging and enters the long-mode (in head_64.S)
• Decompresses the kernel
• Relocates if required
• RANDOMIZED_BASE or RELOCATABLE (in 32-bit)
104
105. Where are we?
105
leal (BP_scratch+4)(%esi), %esp
call 1f
1: popl %ebp
subl $1b, %ebp
• The call instruction pushes the return
address onto the stack
• The return address should be the next instruction after
the call instruction, i.e. 1f
• The immediately following “pop” retrieves the return address from
the stack, i.e. the absolute physical address of 1f
• Subtracting 1f (in this case, 1b) from the address (%ebp)
calculates the offset between the actual address and the
compile-time address (0-based, as seen in lds).
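The arithmetic behind the call/pop trick is a single subtraction: the popped value is the run-time address of label 1, and $1b is its link-time address. A trivial C sketch of that computation (addresses invented for illustration):

```c
#include <stdint.h>

/* %ebp = (run-time address of label 1) - (link-time address of
 * label 1). The result is the offset at which the image was
 * actually loaded, relative to its 0-based link address. */
static uint32_t load_offset(uint32_t runtime_label, uint32_t linktime_label)
{
    return runtime_label - linktime_label;
}
```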
107. Determine where to
decompress
• If CONFIG_RELOCATABLE
• The current position
(BP_kernel_alignment-
aligned)
• Default: 2MB-align
• If it is less than
LOAD_PHYSICAL_ADDR,
LOAD_PHYSICAL_ADDR is
used
• If not
CONFIG_RELOCATABLE
• LOAD_PHYSICAL_ADDR is
used
• Now %ebx is the target
address
107
#ifdef CONFIG_RELOCATABLE
movl %ebp, %ebx
movl BP_kernel_alignment(%esi), %eax
decl %eax
addl %eax, %ebx
notl %eax
andl %eax, %ebx
cmpl $LOAD_PHYSICAL_ADDR, %ebx
jge 1f
#endif
movl $LOAD_PHYSICAL_ADDR, %ebx
1:
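The decl/addl/notl/andl sequence is an align-up by kernel_alignment, followed by a floor at LOAD_PHYSICAL_ADDR. A C sketch of the same computation (the LOAD_PHYSICAL_ADDR value here is an assumed default for illustration):

```c
#include <stdint.h>

#define LOAD_PHYSICAL_ADDR 0x1000000u  /* assumed compile-time default */

/* Mirror the asm: %ebx = align_up(%ebp, align), floored at
 * LOAD_PHYSICAL_ADDR when the aligned address is too low.
 * 'align' must be a power of two (default 2 MB). */
static uint32_t target_addr(uint32_t loaded_at, uint32_t align)
{
    uint32_t ebx = (loaded_at + align - 1) & ~(align - 1);
    if (ebx < LOAD_PHYSICAL_ADDR)
        ebx = LOAD_PHYSICAL_ADDR;
    return ebx;
}
```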
108. Copy the decompression code
• Copy the area from the
head of PM kernel
(startup_32) to just before
the head of bss.
• The code copies the kernel
backwards in case of
overlapping
108
addl $z_extract_offset, %ebx
leal boot_stack_end(%ebx), %esp
pushl $0
popfl
pushl %esi
leal (_bss-4)(%ebp), %esi
leal (_bss-4)(%ebx), %edi
movl $(_bss - startup_32), %ecx
shrl $2, %ecx
std
rep movsl
cld
popl %esi
[Figure: the PM kernel at %ebp is copied to %ebx (the target address
plus z_extract_offset); vmlinux will be decompressed at the target
address]
109. Jump to the relocated address
• Jump to the copied
decompression code
• The decompression
code is the end in the
PM kernel
• Just after the
compressed kernel
image
• Clears the BSS
109
leal relocated(%ebx), %eax
jmp *%eax
ENDPROC(startup_32)
.text
relocated:
xorl %eax, %eax
leal _bss(%ebx), %edi
leal _ebss(%ebx), %ecx
subl %edi, %ecx
shrl $2, %ecx
rep stosl
[Figure: the relocated kernel, containing the "relocated" label, sits
above %ebx, where vmlinux will be decompressed]
110. Why z_extract_offset?
• The PM kernel contains the compressed kernel
image
• The relocating (copying) code is located at the head in
PM kernel
• The decompression code is located at the tail in the PM
kernel
• The decompression code after relocation is safe because
z_extract_offset + the compressed image size is larger
than the decompressed image size
110
[Figure: the PM kernel (head | compressed | decomp) is relocated
upward by z_extract_offset, so that the decompressed image plus its
work area does not overwrite the decompression code]
111. Fix up the absolute addresses
• The decompression code is built with -fPIC
(position independent code), and so fixing up the
absolute addresses is achieved by modifying the
addresses in GOT (Global Offset Table).
111
/*
* Adjust our own GOT
*/
leal _got(%ebx), %edx
leal _egot(%ebx), %ecx
1:
cmpl %ecx, %edx
jae 2f
addl %ebx, (%edx)
addl $4, %edx
jmp 1b
2:
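The loop walks the GOT between _got and _egot and adds the load offset (%ebx) to each 32-bit entry. A C model with an invented table:

```c
#include <stddef.h>
#include <stdint.h>

/* Add the load offset to every GOT slot, as the asm loop does for
 * each 4-byte entry between _got and _egot. */
static void adjust_got(uint32_t *got, size_t entries, uint32_t offset)
{
    for (size_t i = 0; i < entries; i++)
        got[i] += offset;
}
```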
112. Protected-Mode Kernel (p.54)
• arch/x86/boot/compressed/head_{32,64}.S
• Goal: Decompresses the kernel (vmlinux.gz/.bz2/.xz…)
and start the kernel
• Relocates the decompressing code (if relocatable and
loaded at a different address)
• Enables paging and enters the long-mode (in head_64.S)
• Decompresses the kernel
• Relocates if required
• RANDOMIZED_BASE or RELOCATABLE (in 32-bit)
112
113. Call the decompression routine
• Call the decompress_kernel function in C
• asmlinkage __visible void *decompress_kernel(void
*rmode, memptr heap, unsigned char *input_data,
unsigned long input_len, unsigned char *output,
unsigned long output_len)
113
pushl $z_output_len /* decompressed length */
leal z_extract_offset_negative(%ebx), %ebp
pushl %ebp /* output address */
pushl $z_input_len /* input_len */
leal input_data(%ebx), %eax
pushl %eax /* input_data */
leal boot_heap(%ebx), %eax
pushl %eax /* heap area */
pushl %esi /* real mode pointer */
call decompress_kernel /* returns kernel location in %eax */
[Figure: %esi points to boot_params (BP); the relocated code sits at
%ebx; the output address for the decompressed vmlinux is %ebx −
z_extract_offset]
115. Choosing the destination
• The choose_kernel_location function
• If KASLR is enabled, it computes some random output
address (aslr.c)
• Otherwise, it just returns the output parameter
115
116. Decompressing the kernel
• The decompress function does
everything
• The implementation is located at
lib/decompress_*.c
116
#ifdef CONFIG_KERNEL_GZIP
#include "../../../../lib/decompress_inflate.c"
#endif
#ifdef CONFIG_KERNEL_BZIP2
#include "../../../../lib/decompress_bunzip2.c"
#endif
#ifdef CONFIG_KERNEL_XZ
#include "../../../../lib/decompress_unxz.c"
#endif
...
arch
x86
boot
compressed
head_32.S
head_64.S
misc.c
kernel
head_32.S
head_64.S
117. Load the ELF
• parse_elf
• Parse the ELF header and locate the contents according
to the program header (p_paddr)
• If relocatable, p_paddr is adjusted by the
offset between the actual load address and
the compile-time load address.
117
for (i = 0; i < ehdr.e_phnum; i++) {
...
switch (phdr->p_type) {
case PT_LOAD:
#ifdef CONFIG_RELOCATABLE
dest = output;
dest += (phdr->p_paddr - LOAD_PHYSICAL_ADDR);
#else
dest = (void *)(phdr->p_paddr);
#endif
memcpy(dest,
output + phdr->p_offset,
phdr->p_filesz);
break;
...
}
}
typedef struct elf32_phdr{
Elf32_Word p_type;
Elf32_Off p_offset;
Elf32_Addr p_vaddr;
Elf32_Addr p_paddr;
Elf32_Word p_filesz;
Elf32_Word p_memsz;
Elf32_Word p_flags;
Elf32_Word p_align;
} Elf32_Phdr;
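The destination computation for a PT_LOAD segment can be isolated as follows (a sketch; the LOAD_PHYSICAL_ADDR value is an assumed default, and the helper name is invented):

```c
#include <stdint.h>

#define LOAD_PHYSICAL_ADDR 0x1000000u  /* assumed compile-time default */

/* Where a PT_LOAD segment lands: its link-time p_paddr, shifted by
 * the difference between the actual output address and
 * LOAD_PHYSICAL_ADDR when the kernel is relocatable. */
static uint32_t load_dest(uint32_t output, uint32_t p_paddr, int relocatable)
{
    if (relocatable)
        return output + (p_paddr - LOAD_PHYSICAL_ADDR);
    return p_paddr;
}
```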
118. Protected-Mode Kernel (p.54)
• arch/x86/boot/compressed/head_{32,64}.S
• Goal: Decompresses the kernel (vmlinux.gz/.bz2/.xz…)
and start the kernel
• Relocates the decompressing code (if relocatable and
loaded at a different address)
• Enables paging and enters the long-mode (in head_64.S)
• Decompresses the kernel
• Relocates if required
• RANDOMIZED_BASE or RELOCATABLE (in 32-bit)
118
119. Relocate the kernel image
• Relocation information (generated by the “relocs”
tool) is appended just after the ELF image
• The relocation information is a collection of addresses to
the absolute addresses in the kernel code
• These addresses are all expressed by kernel virtual addresses
[Figure: vmlinux (ELF), followed by the 32-bit relocation addresses
(terminated by 0) and then the 64-bit relocation addresses
(terminated by 0)]
$ objdump -adr vmlinux
…
c1086910 <vfs_llseek>:
c1086910: 55 push %ebp
...
c1086919: bb 60 63 08 c1 mov $0xc1086360,%ebx
c108691a: R_386_32 no_llseek
120. Calculate deltas
• __START_KERNEL_map
• In 32-bit, PAGE_OFFSET (default: 0xC0000000)
• In 64-bit, 0xffffffff80000000
120
static void handle_relocations(void *output, unsigned long
output_len)
{
...
unsigned long min_addr = (unsigned long)output;
...
/* Difference between the compile-time physical address
   and the actual physical address */
delta = min_addr - LOAD_PHYSICAL_ADDR;
...
/* The offset from the kernel virtual address to the
   physical address */
map = delta - __START_KERNEL_map;
...
121. Apply the relocation
121
for (reloc = output + output_len - sizeof(*reloc); *reloc; reloc--) {
int extended = *reloc;
extended += map;
ptr = (unsigned long)extended;
if (ptr < min_addr || ptr > max_addr)
error("32-bit relocation outside of kernel!n");
*(uint32_t *)ptr += delta;
}
#ifdef CONFIG_X86_64
for (reloc--; *reloc; reloc--) {
long extended = *reloc;
extended += map;
ptr = (unsigned long)extended;
if (ptr < min_addr || ptr > max_addr)
error("64-bit relocation outside of kernel!n");
*(uint64_t *)ptr += delta;
}
#endif
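The fixup arithmetic can be modeled in C: each recorded kernel-virtual address is converted via map into a location inside the loaded image, and the 32-bit word found there is adjusted by delta. The sketch below simulates the image as a byte buffer; all values in the test are invented, and 'map' here converts to a buffer offset rather than an absolute pointer.

```c
#include <stdint.h>
#include <string.h>

/* Apply one 32-bit relocation: 'vaddr' is the kernel-virtual address
 * recorded by the relocs tool, 'map' converts it (here) into an
 * offset within the image buffer, and 'delta' is the load
 * displacement added to the word at that location. memcpy is used to
 * avoid unaligned/aliasing issues in this model. */
static void apply_reloc32(uint8_t *image, uint32_t vaddr, uint32_t map,
                          uint32_t delta)
{
    uint32_t off = vaddr + map;  /* wraps mod 2^32, like the kernel math */
    uint32_t word;
    memcpy(&word, image + off, sizeof(word));
    word += delta;
    memcpy(image + off, &word, sizeof(word));
}
```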
122. OK, go to the entry point
• The entry point is always at the head of the kernel
• decompress_kernel returns the “output”
• The assembly code jumps into the entry point
122
asmlinkage __visible void *decompress_kernel(...)
{
...
output = choose_kernel_location(input_data, input_len,
output, output_len);
...
return output;
}
/*
* Jump to the decompressed kernel.
*/
xorl %ebx, %ebx
jmp *%eax