Describes the bootstrapping part of Linux, and related architectural mechanisms and technologies.
This is part two of the slides; succeeding slides may contain errata for these slides.
3. Agenda
• Virtual Memory
• From architectural view
• Unfortunately, this presentation again does not reach
the main part of the kernel!
• Appendices
• Source code-level overview of the bootstrapping process
• Linker Scripts
• Inline Assemblers
• Spaces, tabs, blank lines, and comments in the quoted
source code are (implicitly) omitted.
• Omitted effective lines are denoted by … or […]
4. Scope of the last presentation : x86
• Real Mode (16-bit)
• Boot sector, setup_header, and
16-bit entry point
• C-Language main function
• Retrieving memory information
• Transition to the protected
mode
• Protected Mode (32-bit)
• 32-bit(/64-bit) entry point,
preparing for decompression,
calling decompression code
• (EFI-Stub) efi_main (entry point
from UEFI)
• EFI call functions
• Protected Mode/Long Mode
• The beginning of the main
kernel
arch
x86
boot
header.S
main.c
memory.c
pm.c
pmjump.S
compressed
head_32.S
head_64.S
eboot.c
efi_stub_32.S
efi_stub_64.S
kernel
head_32.S
head_64.S
* The …_32.S files are used by the 32-bit kernel, and the …_64.S files by the 64-bit kernel.
5. Scope of the last presentation : ARM
• Compressed
• Entry point
• Decompressing function
• Actual decompressing
algorithm is in
lib/decompress_*.c
• Building a FDT from ATAGS for
compatibility
(CONFIG_ARM_ATAG_DTB_C
OMPAT)
• Decompressed
• The beginning of the main
kernel
arch
arm
boot
compressed
head.S
decompress.c
atags_to_fdt.c
kernel
head.S
6. Follow-ups for the last presentation
• x86 assembly language
• What about instructions with ≥3 operands?
• (e.g.) imul
• Multiply EBX by 19 (0x13) and store the result in EAX
• Therefore,
                 Operand order              Example
AT&T             Source, Destination        imul $0x13, %ebx, %eax
                 [[Op4,] Op3,] Op2, Op1
Intel            Destination, Source        IMUL EAX, EBX, 13h
                 Op1, Op2 [, Op3 [, Op4]]
7. Follow-ups for the last presentation
• Multiple Relocations?
• The conclusion is “at most once” (in x86 arch)
• ELF relocation may follow the decompression, so the kernel
may be relocated twice in this sense.
• See the relocation part in this presentation.
8. x86 Architecture : Segmentation
• 6 Segment Registers (16-bit registers)
• Code Segment Register: CS
• Data Segment Register: DS, ES, FS, GS
• Stack Segment Register: SS
• Real mode : 20-bit address space
• Linear address = physical address
• The size of each segment is 64 KB (16-bit)
• The segment register supplies the upper 16 bits of the segment's 20-bit
base address (base = register × 16)
• Protected mode : 32-bit/36-bit physical address space
• Logical –(Segmentation)-> Linear –(Paging)-> Physical
• The base and limit of each segment are stored in the descriptor table
• The segment registers point to entries in the table
• Long mode : 48-bit linear address space
• For CS, DS, ES, and SS, the base is always 0 and the limit is ignored.
• For FS and GS, the base can be set by the descriptor or through an MSR
(for > 32-bit addresses)
9. So what? (p.32)
9
vmlinux
boot/compressed/vmlinux.bin
(1a) Strip symbols
vmlinux.bin.xz
(2a) Concatenate and compress (gzip, bzip2,
lzma, lzo, lz4)
piggy.o
(3) mkpiggy (piggy-back)
Make an object that contains the compressed image
piggy.o*.o
boot/compressed/vmlinux
(4) Link with the other objects in
boot/compressed
(Decompressing codes)
(5) Transform it into a simple binary
boot/vmlinux.bin
boot/vmlinux.binboot/setup.bin
(6) Concatenate with real-mode
setup code, headers, and CRC32 CRC
boot/bzImage
(1b) Make relocation information
(2b) Append the original size info (except for gzip) vmlinux.bin.xz
vmlinux.relocs
Size
Errata?
11. Virtual Memory
• The address visible to a task is “virtualized,” i.e.
translated by hardware to a certain physical address
when it is actually accessed.
• The hardware mechanism to translate the address is called
MMU (memory management unit).
• Aim / Benefit
• Using a larger memory area than the machine is physically
equipped with
• Memory swapping, sparse memory areas
• Isolating tasks' memory areas so that different applications
cannot touch (read or write) each other's memory
• Not only between user tasks but also between the kernel and tasks
• Abstracting the memory resources
• Providing contiguous memory area even if there is no physically
contiguous memory area available.
• User programs can run with certain addresses regardless of the
physical addresses where they are actually running.
12. Two ways to virtual memory
• Paging
• Dividing the memory area into chunks (“pages”) with a
certain small size, and defining a map from each chunk
to its physical location
• A different task may have a different map of the
memory
• Some overhead (in both speed and memory) to
translate addresses and to hold the map
• Segmentation
• The address is considered to be an offset inside a certain
segment of memory
• Less overhead (just adding an offset), but impossible to
achieve swapping
14. Architecture and VM Capability
• x86
• Capable of paging
• The 16-bit and 32-bit modes have a segmentation feature
• 64-bit mode has a very limited segmentation feature
• Because almost no one is using the segmentation feature
effectively!
• (See “flat model” described in a later slide)
• ARM
• Some CPU series have an MMU and are capable of paging
• “A” series
• Some CPU series only have an MPU (memory protection unit)
• “R” series
• No MMU
• “M” series (MPU is optional)
15. Focusing on paging…
• How does it work?
1. A memory instruction is issued with a virtual address.
2. The CPU (MMU) looks for the virtual address in the TLB (Translation Lookaside Buffer).
3. If an entry exists (TLB hit), use the physical address in the TLB entry.
4. TLB miss!
• Hardware TLB: the CPU traverses the page table to find the physical address for the virtual address.
• Software TLB: the CPU calls the handler (the kernel's role) and asks it to fill in a TLB entry corresponding to the virtual address; the handler traverses the page table.
5. Present? Yes → use the physical address (and it may be remembered in the TLB).
6. No → page fault! Call the handler.
16. How far should hardware do?
• TLB (Translation Lookaside Buffer)
• Cache of “virtual-to-physical” mappings.
• Limited number of entries.
• Hardware-controlled TLB
• When TLB misses occur, the CPU traverses page tables
• The format for the page table is defined by the architecture.
• x86 and ARM
• Software-controlled TLB
• When a TLB miss occurs, software (typically, the OS kernel)
traverses the page tables, and reports the result (the translated
physical address) by filling in a TLB entry.
• Any type of page table may be used (a hash-based PT, for
example)
• But Linux uses almost the same format even for this type of architecture
• PowerPC
17. Multilevel Page Table (tree-like)
• Typical structure of page table
• The first-level page table consists of entries that point to
next-level page tables. The index into it comes from the most
significant bits of the virtual address.
• Of course, the next page table’s address is a physical address.
• The entries in the leaf page table denote physical
addresses.
[Figure: first-level → second-level → third-level page tables; the entries of the third-level (leaf) table hold physical addresses]
19. x86-64
• Currently, only 48-bit in a linear address is effective.
• 64-bit address is sign-extension of the 48-bit address.
• Supports up to 52 bits for physical addresses
• %cr3 register : the physical address for the current
PML4 table
• mov ~~, %cr3 switches the page table (flushing TLB)
• Four level
• One entry in PML4 table corresponds to 512 GB of
virtual memory, an entry in PDP table to 1 GB, and so on.
• Each entry is 8 byte
• Each table has 512 entries
• Thus, each table is 4 KB = 1 page.
20. Large Table
• One page occupies one entry in TLB
• If one process uses 1 GB of memory, it uses 256K
pages.
• i.e. If TLB does not have 256K entries (and usually it
doesn’t), TLB misses are inevitable
• x86_64 supports three types of page size
• 4 KB (normal)
• 2 MB
• 1 GB (!)
• The disadvantage is that a larger page requires physically
contiguous memory of the same size as the page
An entry in a higher-level page table directly
contains a physical address.
22. Linux kernel usage
• Large Page
• The kernel mapping
• The kernel creates a straight mapping of physical memory in the
kernel virtual address area
• This area is created at boot time, and never changes after that
• 1 GB and 2 MB pages are used
• Hugetlbfs
• Explicit use from user applications
• Transparent Huge Pages
• Implicit (transparent) use of large pages for user applications
23. ARM
• ARM
• Two memory architecture
• VMSA (Virtual Memory System Architecture) : MMU
• PMSA (Protected Memory System Architecture) : MPU
• VMSA
• Two page table formats
• Short descriptor table
• Up to two-level lookup
• 32-bit PA (*By supersection, 40-bit can be output)
• Long descriptor table
• Up to three-level lookup
• 40-bit PA
• Fixed size of page tables
24. Names in Linux
• Linux uses several arch-independent type names
for page table entries
• pgd_t, pud_t, pmd_t, pte_t
• Each type is one for an entry in a table of the corresponding
level
Architecture (& Config) Lv pgd_t pud_t pmd_t pte_t
x86_64 4 PML4E PDPTE PDE PTE
i386 (PAE) 3 PDPTE - PDE PTE
i386 2 PDE - - PTE
ARM (LPAE) 3 1st-lv. Desc. - 2nd-lv. Desc. 3rd-lv. Desc.
ARM 2 1st-lv. Desc. - - 2nd-lv. Desc.
ARM64 (64KB page) 2 1st-lv. Desc. - - 2nd-lv. Desc.
ARM64 3 1st-lv. Desc. - 2nd-lv. Desc. 3rd-lv. Desc.
(*)AArch64 supports four-level page tables, thus 48-bit VA.
25. Notes
• PAE (i386)
• Physical Address Extension
• For those who want to enjoy >4GB of memory in 32-bit mode.
• Virtual address remains 32-bit, but can map to any physical
address (< 64-bit)
• The size of each entry is extended to 64-bit
• CONFIG_X86_PAE
• LPAE
• Large Physical Address Extension
• Almost the same as PAE on i386
• “The current implementation limits the output address range
to 40 bits”
• Each entry is extended to 64-bit (long-descriptor translation table
format)
• CONFIG_ARM_LPAE
29. Page Attributes
• Pages can have attributes
• Used for memory protection
• Used for demand paging
• Used for COW (copy-on-write)
• Attributes
• Read / Write
• User / Privileged
• But where?
• In the page table entry corresponding to a page
• However, a page table entry is basically a physical
pointer, i.e. a 32-bit entry is occupied by 32-bit physical
pointer…
30. Page Attributes
• The lower bits in page table entries
• The start address of a page/page table is aligned!
• The lower bits are always zero.
[Figure: page-table-entry layouts]
• x86_64 PTE: bit 63 = XD; bits 62:52 ignored; bits 51:32 and 31:12 = physical address; low bits include G, PAT, D, A, PCD, PWT, U/S, R/W, P
• ARM (short descriptor, small page): bits 31:12 = physical address; attribute bits include nG, S, AP[2], TEX[2:0], AP[1:0], C, B, XN, with bit 1 = 1
31. Page Attributes Comparison

                          x86_64                 ARM (short)
Enabled?                  Present (P)            Descriptor type (bits 1 & 0)
RO or RW?                 Read/Write (RW)        AP[2:1] or AP[2:0]
Privileged only or any?   User/Supervisor (US)   AP[2:1] or AP[2:0]
Write-through?            PWT                    TEX[2:0], C, B
Cachable?                 PCD                    TEX[2:0], C, B
Accessed?                 Accessed (A)           AP[0] (*configurable)
Dirty?                    Dirty (D)              N/A
Memory Type               PAT                    TEX[0], C, B (*configurable)
Global                    Global (G)             Not Global (nG)
Executable?               Execute-Disable (XD)   Execute-Never (XN)
Sharable?                 (PAT)                  Sharable (S)
32. PowerPC Example [PowerPC 440]
• TLB is filled by software
• Search (tlbsx instruction), read/write (tlbre, tlbwe instructions)
[Figure: PowerPC 440 TLB entry format]
• Word 0: Effective Page Number [0:21], V (Valid), TS, SIZE, TPAR, TID
• Word 1 (PAR1): Real Page Number [0:21], ERPN
• Word 2 (PAR2): Reserved, U3–U0, W, I, M, G, E, UX/UW/UR, SX/SW/SR
• Attributes
• V : Valid
• SIZE : Page Size (4n KB, where n in
{0,1,2,3,4,5,7,9,10})
• U : User-defined storage attribute
• W: Write-through
• I: Caching Inhibited
• M: Memory coherency required
• G: Guarded
• E: Endian
• UX, UW, UR: User
executable, writable,
readable
• SX, SW, SR: Supervisor
executable, writable,
readable
• TPAR, PAR1, PAR2: Parity
33. Before the kernel starts…
• x86 (32-bit)
• Paging is disabled
• kernel/head_32.S creates a page table and turns on
paging
• x86 (64-bit)
• compressed/head_64.S creates an identity-mapped (virtual =
physical) page table for the first 4 GB
• Long mode requires paging to be enabled.
• kernel/head_64.S creates a better page table
• ARM
• kernel/head.S creates a page table and turns on paging
37. Real mode kernel (from p.45)
• header.S
• Boot sector code which is no longer used
• Contains setup_header
• Prepares stack and BSS to run C programs
• Jumps into the C program (main.c)
• main.c
• Copies setup_header into “zeropage”
• Sets up the early console
• Initializes the heap
• Checks the CPU (64-bit capable, for the 64-bit kernel?)
• Collects HW information by querying the BIOS, and stores the
results in “zeropage”
• Finally transitions to protected mode, and jumps into the
“protected-mode kernel”
39. Wait, how is the header code placed
at the beginning of the kernel?
• The linker concatenates multiple object files
• The positions in the resulting binary are not guaranteed
unless an order is given to the linker
• The linker script (.ld/.lds/.lds.S) dictates the positions
to the linker!
• Since you are quite likely to use the C preprocessor on a
linker script, files with the extension “.lds.S” are first
processed by the preprocessor, then passed to the linker.
• Passing a linker script with “-T” overrides the default
linker script
• The default linker script can be displayed with “ld --verbose"
40. LD script (1)
arch
x86
boot
setup.ld
compressed
vmlinux.lds.S
kernel
vmlinux.lds.S
/*
* setup.ld
*
* Linker script for the i386 setup code
*/
OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386")
OUTPUT_ARCH(i386)
ENTRY(_start)
SECTIONS
{
. = 0;
.bstext : { *(.bstext) }
.bsdata : { *(.bsdata) }
. = 495;
.header : { *(.header) }
.entrytext : { *(.entrytext) }
.inittext : { *(.inittext) }
.initdata : { *(.initdata) }
__end_init = .;
Specifies the output format (identical
to --oformat option)
OUTPUT_FORMAT(default, big, little)
Specifies the output architecture
Specifies the entry point symbol
(identical to -e option)
41. LD script (2)
/*
* setup.ld
*
* Linker script for the i386 setup code
*/
OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386")
OUTPUT_ARCH(i386)
ENTRY(_start)
SECTIONS
{
. = 0;
.bstext : { *(.bstext) }
.bsdata : { *(.bsdata) }
. = 495;
.header : { *(.header) }
.entrytext : { *(.entrytext) }
.inittext : { *(.inittext) }
.initdata : { *(.initdata) }
__end_init = .;
arch
x86
boot
setup.ld
compressed
vmlinux.lds.S
kernel
vmlinux.lds.S
Specifies how the sections are output
. means the current position
Assigning to . sets the
current position
Put the .bstext section at the current
position, i.e. at the address 0.
Put the .bsdata section after
the .bstext section.
42. bstext section…?
.code16
.section ".bstext", "ax"
.global bootsect_start
bootsect_start:
#ifdef CONFIG_EFI_STUB
# "MZ", MS-DOS header
.byte 0x4d
.byte 0x5a
#endif
# Normalize the start address
ljmp $BOOTSEG, $start2
start2:
movw %cs, %ax
movw %ax, %ds
movw %ax, %es
movw %ax, %ss
xorw %sp, %sp
sti
cld
movw $bugger_off_msg, %si
jmp msg_loop
arch
x86
boot
header.S
main.c
memory.c
pm.c
pmjump.S
compressed
head_32.S
head_64.S
eboot.c
efi_stub_32.S
efi_stub_64.S
kernel
head_32.S
head_64.S
Here it is!
[Notes]
.code16 = Assemble the following code as
16-bit code.
.section name[, flags] = Switches to the named section.
<flags> (excerpted)
• “a” : allocatable (loaded into memory when
executed)
• “w” : writable
• “x” : executable
.globl/.global symbol = Makes the symbol global
(visible from other objects)
43. LD script (3)
/*
* setup.ld
*
* Linker script for the i386 setup code
*/
OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386")
OUTPUT_ARCH(i386)
ENTRY(_start)
SECTIONS
{
. = 0;
.bstext : { *(.bstext) }
.bsdata : { *(.bsdata) }
. = 495;
.header : { *(.header) }
.entrytext : { *(.entrytext) }
.inittext : { *(.inittext) }
.initdata : { *(.initdata) }
__end_init = .;
arch
x86
boot
setup.ld
compressed
vmlinux.lds.S
kernel
vmlinux.lds.S
Specifies how the sections are output
Set the current position to 495
Places the header section at the
address 495
Declares a symbol “__end_init” that
refers to the current position
(the end of .initdata section)
45. LD script (5)
• To be precise,
• Output a section the name of which is “.bstext”
• The output section contains all of the input section “.bstext”
• The input and output need not be 1-to-1
• The output section “.text” contains all of the input section “.text”,
and then all of the sections the names of which start with “.text.”
• Creates the new symbols “_text” and “_etext” which denote the
beginning and ending of the output section “.text”, respectively.
.bstext : { *(.bstext) }
.text : {
_text = .; /* Text */
*(.text)
*(.text.*)
_etext = . ;
}
46. LD script (6)
. = ALIGN(16);
.data : { *(.data*) }
.signature : {
setup_sig = .;
LONG(0x5a5aaa55)
}
...
/DISCARD/ : { *(.note*) }
/*
* The ASSERT() sink to . is intentional, for
binutils 2.14 compatibility:
*/
. = ASSERT(_end <= 0x8000, "Setup too big!");
. = ASSERT(hdr == 0x1f1, "The setup header has
the wrong offset!");
/* Necessary for the very-old-loader check to
work... */
. = ASSERT(__end_init <= 5*512, "init sections
too big!");
}
arch
x86
boot
setup.ld
compressed
vmlinux.lds.S
kernel
vmlinux.lds.S
[Usage]
# Check signature at end of setup
cmpl $0x5a5aaa55, setup_sig
jne setup_bad
Align to the 16 byte boundary
Discard the sections .note*
Put this long value at the
current position
Assertions!
47. Column: align and balign
• LD’s ALIGN(x) returns the x-byte aligned address
• x must be power of two
• = (current + x – 1) & ~ (x – 1)
• GNU Assembler has two pseudo ops for alignment
• .align x, fill, max
• .balign x, fill, max
• Both align to the boundary specified by x, but the meaning of x differs by target…
• The skipped bytes are filled with fill (zero or a nop)
• The maximum number of bytes to be skipped can be specified
with max.
                         .align (x = 4)            .balign (x = 4)
i386 (ELF), sparc, etc.  Align to 4 bytes          Align to 4 bytes
ppc, i386 (a.out), arm   Align to 16 bytes (2^4)   Align to 4 bytes
48. COFF Stuffs
arch
x86
boot
header.S
main.c
memory.c
pm.c
pmjump.S
compressed
head_32.S
head_64.S
eboot.c
efi_stub_32.S
efi_stub_64.S
kernel
head_32.S
head_64.S
#ifdef CONFIG_EFI_STUB
.org 0x3c
# Offset to the PE header.
.long pe_header
#endif /* CONFIG_EFI_STUB */
.section ".bsdata", "a"
bugger_off_msg:
.ascii "Direct floppy boot is not supported. "
.ascii "Use a boot loader program instead.\r\n"
...
.byte 0
#ifdef CONFIG_EFI_STUB
pe_header:
.ascii "PE"
.word 0
coff_header:
#ifdef CONFIG_X86_32
.word 0x14c # i386
#else
.word 0x8664 # x86-64
#endif
[Notes]
.org location, fill = Set the current position to
location in the current section (filling the skipped
bytes with fill)
.ascii string = Put the string (w/o zero termination)
at the current position (cf. .asciz)
.byte val, .word val, .long val, .quad val
= Put the 1/2/4/8-byte value(s)
49. Real mode kernel (p.45)
• header.S
• Boot sector code which is no longer used
• Contains setup_header
• Prepares stack and BSS to run C programs
• Jumps into the C program (main.c)
• main.c
• Copies setup_header into “zeropage”
• Sets up the early console
• Initializes the heap
• Checks the CPU (64-bit capable, for the 64-bit kernel?)
• Collects HW information by querying the BIOS, and stores the
results in “zeropage”
• Finally transitions to protected mode, and jumps into the
“protected-mode kernel”
50. Entry point (2nd sector)
arch
x86
boot
header.S
main.c
memory.c
pm.c
pmjump.S
compressed
head_32.S
head_64.S
eboot.c
efi_stub_32.S
efi_stub_64.S
kernel
head_32.S
head_64.S
.section ".header", "a"
.globl sentinel
sentinel: .byte 0xff, 0xff /* Used to detect broken loaders */
.globl hdr
hdr:
setup_sects: .byte 0 /* Filled in by build.c */
root_flags: .word ROOT_RDONLY
syssize: .long 0 /* Filled in by build.c */
ram_size: .word 0 /* Obsolete */
vid_mode: .word SVGA_MODE
root_dev: .word 0 /* Filled in by build.c */
boot_flag: .word 0xAA55
# offset 512, entry point
.globl _start
_start:
.byte 0xeb # short (2-byte) jump
.byte start_of_setup-1f
1:
.ascii "HdrS" # header signature
.word 0x020d # header version number (>= 0x0105)
[Layout figure: .bstext and .bsdata from offset 0; .header at offset 495]
To prevent the assembler
from accidentally
producing a 3-byte jump
52. Column: Local Symbol in GAS
• Local symbols
• Symbols that can be used temporarily
• Format is N: (where N is a positive integer)
• To refer to the local symbols, use Nf or Nb.
• Nf refers to the next local label N.
• Nb refers to the most recently declared local label N.
• According to the GNU assembler manual, these symbols are
internally transformed into the following format:
• LN^BO
• ^B is Ctrl-B (0x02), O is a serial number
• For the 44th “3:”, “L3^B44” is used.
• Dollar local symbols (I haven’t seen this)
.byte start_of_setup-1f
1:
1:
jmp 1b
53. Get prepared to C (stack)
arch
x86
boot
header.S
main.c
memory.c
pm.c
pmjump.S
compressed
head_32.S
head_64.S
eboot.c
efi_stub_32.S
efi_stub_64.S
kernel
head_32.S
head_64.S
.section ".entrytext", "ax"
start_of_setup:
# Force %es = %ds
movw %ds, %ax
movw %ax, %es
cld
movw %ss, %dx
cmpw %ax, %dx # %ds == %ss?
movw %sp, %dx
je 2f # -> assume %sp is reasonably set
# Invalid %ss, make up a new stack
movw $_end, %dx
testb $CAN_USE_HEAP, loadflags
jz 1f
movw heap_end_ptr, %dx
1: addw $STACK_SIZE, %dx
jnc 2f
xorw %dx, %dx # Prevent wraparound
2: # Now %dx should point to the end of our stack space
andw $~3, %dx # dword align (might as well...)
jnz 3f
movw $0xfffc, %dx # Make sure we're not zero
3: movw %ax, %ss
movzwl %dx, %esp # Clear upper half of %esp
If %ds == %ss, %sp is
assumed to be properly set
by the loader
If not, sets up a new stack.
The address is _end +
STACK_SIZE (512 bytes) or
heap_end_ptr + STACK_SIZE (if
CAN_USE_HEAP is set)
54. In other words,
• Set the stack segment to be the same as %DS
• Allocate 512 bytes for the stack
unsigned short stack;
if (%ds != %ss) {
if (hdr.loadflags & CAN_USE_HEAP) {
stack = hdr.heap_end_ptr + STACK_SIZE;
} else {
stack = _end + STACK_SIZE;
}
if (carried over) { /* stack >= 0x10000 */
stack = 0;
}
}
/* Align to 4-byte */
stack &= ~3;
if (stack == 0)
stack = 0xfffc; /* – 4 */
%ss = %ds;
%esp = stack;
55. Get prepared to C
(CS fix and BSS clear)
55
sti # Now we should have a working stack
# We will have entered with %cs = %ds+0x20, normalize %cs so
# it is on par with the other segments.
pushw %ds
pushw $6f
lretw
6:
# Check signature at end of setup
cmpl $0x5a5aaa55, setup_sig
jne setup_bad
# Zero the bss
movw $__bss_start, %di
movw $_end+3, %cx
xorl %eax, %eax
subw %di, %cx
shrw $2, %cx
rep; stosl
# Jump to C code (should not return)
calll main
arch
x86
boot
header.S
main.c
memory.c
pm.c
pmjump.S
compressed
head_32.S
head_64.S
eboot.c
efi_stub_32.S
efi_stub_64.S
kernel
head_32.S
head_64.S
$6f is the address of the
forward local label 6:, i.e. its
offset within the setup code.
Signature check
Fill the bss by zero.
“rep; stosl” (string
instruction) fills the
memory from %es:%di
for %cx DWORDs with %eax.
56. [Column] Calling conventions
• 16 bit (name unknown)
• Arguments: %ax, %dx, %cx
• Return value: %ax
• 32 bit (cdecl)
• Arguments: pushed on the stack (in the reversed order of the
arguments)
• Caller-saved: %eax, %ecx, and %edx
• Callee-saved: the others
• Return value: %eax (for int)
• 64 bit (amd64 / System V)
• Arguments: %rdi, %rsi, %rdx, %rcx, %r8, %r9
• Caller-saved: all registers other than the callee-saved ones
• Callee-saved: %rbp, %rbx, %r12 to %r15
• Return value: %rax
f(2, 5, 9, 11);
[Stack figure, from top of stack: (return address), 2, 5, 9, 11]
57. Real mode kernel (p.45)
• header.S
• Boot sector code which is no longer used
• Contains setup_header
• Prepares stack and BSS to run C programs
• Jumps into the C program (main.c)
• main.c
• Copies setup_header into “zeropage”
• Sets up the early console
• Initializes the heap
• Checks the CPU (64-bit capable, for the 64-bit kernel?)
• Collects HW information by querying the BIOS, and stores the
results in “zeropage”
• Finally transitions to protected mode, and jumps into the
“protected-mode kernel”
58. main
void main(void)
{
/* First, copy the boot header into the "zeropage" */
copy_boot_params();
/* Initialize the early-boot console */
console_init();
...
/* End of heap check */
init_heap();
/* Make sure we have all the proper CPU support */
if (validate_cpu()) {
...
}
set_bios_mode();
detect_memory();
keyboard_init();
query_mca();
query_ist();
...
/* Set the video mode */
set_video();
/* Do the last things and invoke protected mode */
go_to_protected_mode();
}
arch
x86
boot
header.S
main.c
memory.c
pm.c
pmjump.S
compressed
head_32.S
head_64.S
eboot.c
efi_stub_32.S
efi_stub_64.S
kernel
head_32.S
head_64.S
59. Copy to zeropage
• Very simple
• The omitted part is for compatibility with the
old command-line parameter protocol
(located at a certain fixed address)
arch
x86
boot
header.S
main.c
memory.c
pm.c
pmjump.S
compressed
head_32.S
head_64.S
eboot.c
efi_stub_32.S
efi_stub_64.S
kernel
head_32.S
head_64.S
struct boot_params boot_params __attribute__((aligned(16)));
...
static void copy_boot_params(void)
{
...
BUILD_BUG_ON(sizeof boot_params != 4096);
memcpy(&boot_params.hdr, &hdr, sizeof hdr);
...
}
60. Set up the serial console
• Parses the command-line parameters
in a very ad hoc way to find the
serial configuration
• Finds “earlyprintk” and checks whether it has
one of the following formats
• “serial,0x3f8,115200”
• “serial,ttyS0,115200”
• “ttyS0,115200”
• Finds “console” and looks for “uart8250,io,…” or “uart,io,…”
• If any serial config is found, sets it up using I/O ports
void console_init(void)
{
parse_earlyprintk();
if (!early_serial_base)
parse_console_uart8250();
}
arch
x86
boot
header.S
main.c
early_serial_console.c
memory.c
pm.c
pmjump.S
compressed
head_32.S
head_64.S
eboot.c
efi_stub_32.S
efi_stub_64.S
kernel
head_32.S
head_64.S
61. Puts and putchar
• By BIOS call and serial I/O ports
void __attribute__((section(".inittext"))) putchar(int ch)
{
if (ch == '\n')
putchar('\r'); /* \n -> \r\n */
bios_putchar(ch);
if (early_serial_base != 0)
serial_putchar(ch);
}
void __attribute__((section(".inittext"))) puts(const char *str)
{
while (*str)
putchar(*str++);
}
arch
x86
boot
header.S
main.c
tty.c
memory.c
pm.c
pmjump.S
compressed
head_32.S
head_64.S
eboot.c
efi_stub_32.S
efi_stub_64.S
kernel
head_32.S
head_64.S
[Notes] GCC extension __attribute__
section(section) : locate the function/variable in
the specified section.
62. Serial and BIOS putchar
static void __attribute__((section(".inittext")))
serial_putchar(int ch)
{
unsigned timeout = 0xffff;
while ((inb(early_serial_base + LSR) & XMTRDY) == 0 && --timeout)
cpu_relax();
outb(ch, early_serial_base + TXR);
}
static void __attribute__((section(".inittext"))) bios_putchar(int
ch)
{
struct biosregs ireg;
initregs(&ireg);
ireg.bx = 0x0007;
ireg.cx = 0x0001;
ireg.ah = 0x0e;
ireg.al = ch;
intcall(0x10, &ireg, NULL);
}
arch
x86
boot
header.S
main.c
tty.c
memory.c
pm.c
pmjump.S
compressed
head_32.S
head_64.S
eboot.c
efi_stub_32.S
efi_stub_64.S
kernel
head_32.S
head_64.S
Put a char on a
serial line by
using I/O ports
(IN and OUT
instructions)
Put a char on
VGA by BIOS
Call (INT 0x10,
AH = 0x0e)
63. BIOS Call
• A BIOS call is invoked by using an INT instruction
• Requires assembly-language support
• Parameters and return values are passed in a certain set
of registers
• The INT instruction only takes an immediate operand as the interrupt
number.
• C prototype:
• struct biosregs holds all the
general registers,
the data segment registers,
and the flag register
void intcall(u8 int_no, const struct biosregs *ireg,
struct biosregs *oreg);
void initregs(struct biosregs *reg)
{
memset(reg, 0, sizeof *reg);
reg->eflags |= X86_EFLAGS_CF;
reg->ds = ds();
reg->es = ds();
reg->fs = fs();
reg->gs = gs();
}
64. BIOS Call Impl. (1)
arch
x86
boot
header.S
main.c
bioscall.S
memory.c
pm.c
pmjump.S
compressed
head_32.S
head_64.S
eboot.c
efi_stub_32.S
efi_stub_64.S
kernel
head_32.S
head_64.S
.code16
.section ".inittext","ax"
.globl intcall
...
intcall:
cmpb %al, 3f
je 1f
movb %al, 3f
jmp 1f /* Synchronize pipeline */
1:
...
/* Actual INT */
.byte 0xcd /* INT opcode */
3: .byte 0
...
void intcall(u8 int_no, const struct biosregs
*ireg, struct biosregs *oreg);
(int_no is passed in %ax, ireg in %dx, and oreg in %cx)
Checks the current operand of the INT
instruction, and rewrites (self-modifies)
the interrupt number if it differs.
65. BIOS Call Impl. (2)
1:
/* Save state */
pushfl
pushw %fs
pushw %gs
pushal
/* Copy input state to stack frame */
subw $44, %sp
movw %dx, %si
movw %sp, %di
movw $11, %cx
rep; movsd
/* Pop full state from the stack */
popal
popw %gs
popw %fs
popw %es
popw %ds
popfl
/* Actual INT */
.byte 0xcd /* INT opcode */
3: .byte 0
[Stack figure: below the saved caller state (pushfl, %fs, %gs, pushal), a 44-byte copy of struct biosregs *ireg is built, then popped into the actual registers (popal, %gs, %fs, %es, %ds, popfl) right before the INT]
66. BIOS Call Impl. (3)
/* Push full state to the stack */
pushfl
pushw %ds
pushw %es
pushw %fs
pushw %gs
pushal
...
(Restore %ds, %sp, etc.)
...
/* Copy output state from stack frame */
movw 68(%esp), %di /* Original %cx == 3rd
argument */
andw %di, %di
jz 4f
movw %sp, %si
movw $11, %cx
rep; movsd
/* Restore state and return */
popal
popw %gs
popw %fs
popfl
retl
[Stack figure: after the INT, the full register state (pushfl, %ds, %es, %fs, %gs, pushal) is pushed again and copied out to *oreg, then the caller's registers are restored]
67. Inline assembly
• A quick way to use assembly language inside C
source code
• For example, when you want to disable interrupts, put
asm("cli");
into your C code.
• GCC’s extended inline assembly enables
far more features (and is more complicated)
• => Described twenty or so slides later!
static inline void outb(u8 v, u16 port)
{
asm volatile("outb %0,%1" : : "a" (v), "dN" (port));
}
68. Initialize the heap
char *HEAP = _end;
char *heap_end = _end; /* Default end of heap = no heap */
...
static void init_heap(void)
{
char *stack_end;
if (boot_params.hdr.loadflags & CAN_USE_HEAP) {
asm("leal %P1(%%esp),%0"
: "=r" (stack_end) : "i" (-STACK_SIZE));
heap_end = (char *)
((size_t)boot_params.hdr.heap_end_ptr +
0x200);
if (heap_end > stack_end)
heap_end = stack_end;
} else {
/* Boot protocol 2.00 only, no heap available */
puts("WARNING: Ancient bootloader, some functionality "
"may be limited!\n");
}
}
arch
x86
boot
header.S
main.c
memory.c
pm.c
pmjump.S
compressed
head_32.S
head_64.S
eboot.c
efi_stub_32.S
efi_stub_64.S
kernel
head_32.S
head_64.S
Stores %esp − STACK_SIZE into stack_end
(heap_end is then capped at stack_end)
69. When is the heap used?
• Heap allocation function is very simple
• And the calls for GET_HEAP exist only in the video
code files.
static inline char *__get_heap(size_t s, size_t a, size_t n)
{
char *tmp;
HEAP = (char *)(((size_t)HEAP+(a-1)) & ~(a-1));
tmp = HEAP;
HEAP += s*n;
return tmp;
}
#define GET_HEAP(type, n) \
((type *)__get_heap(sizeof(type),__alignof__(type),(n)))
saved.data = GET_HEAP(u16, saved.x*saved.y);
(boot/video.c)
70. Retrieving memory info.
• As described in the last presentation,
detect_memory tries 3 methods
int detect_memory(void)
{
...
if (detect_memory_e820() > 0)
err = 0;
if (!detect_memory_e801())
err = 0;
if (!detect_memory_88())
err = 0;
return err;
}
arch
x86
boot
header.S
main.c
memory.c
pm.c
pmjump.S
compressed
head_32.S
head_64.S
eboot.c
efi_stub_32.S
efi_stub_64.S
kernel
head_32.S
head_64.S
71. Memory Information [from p.48]
• AX = 0xe820, INT 0x15 [detect_memory_e820()]
• INPUT
• AX = 0xe820
• CX = size of the buffer
• EDX = “SMAP” (0x534d4150 / Signature)
• EBX = Continuation value
• ES:DI = address for the buffer
• OUTPUT
• CF = 0 if successful, 1 otherwise
• CX = the number of bytes returned
• EBX = Continuation value
• Each call returns information for one range
• To get information for the next range, give the continuation value returned in
the previous call
• The range information is returned by the following structure
• Stored in boot_params.e820_map (struct e820entry[128])
struct e820entry {
	__u64 addr;	/* start of memory segment */
	__u64 size;	/* size of memory segment */
	__u32 type;	/* type of memory segment */
} __attribute__((packed));
(arch/x86/include/uapi/asm/e820.h)
74. Go To Protected Mode
void go_to_protected_mode(void)
{
/* Hook before leaving real mode, also disables interrupts */
realmode_switch_hook();
/* Enable the A20 gate */
if (enable_a20()) {
puts("A20 gate not responding, unable to boot...n");
die();
}
/* Reset coprocessor (IGNNE#) */
reset_coprocessor();
/* Mask all interrupts in the PIC */
mask_all_interrupts();
/* Actual transition to protected mode... */
setup_idt();
setup_gdt();
protected_mode_jump(boot_params.hdr.code32_start,
(u32)&boot_params + (ds() << 4));
}
75. Go To PM: Details (1)
• Call the hook if set
• Otherwise, disable interrupts and NMI.
75
static void realmode_switch_hook(void)
{
if (boot_params.hdr.realmode_swtch) {
asm volatile("lcallw *%0"
: : "m" (boot_params.hdr.realmode_swtch)
: "eax", "ebx", "ecx", "edx");
} else {
asm volatile("cli");
outb(0x80, 0x70); /* Disable NMI */
io_delay();
}
}
If a hook is set in realmode_swtch, call the hook.
Otherwise, write 0x80 to port 0x70 (the CMOS controller!).
For historical reasons, the "NMI disable" bit is located in the
CMOS controller.
76. Go To PM: Details (2)
• Enable A20 line (20th bit of the address bus)
• In the initial state, the bit is masked (always 0)
• For compatibility with the program that expects address
wraparound in 1MB
• Some programs expect the address 0xFFFFF + 1 = 0x00000
• To use the full 32-bit address space, this mask must be disabled.
• Many ways to do it
• But which way works depends on the firmware
• The famous one would be via the keyboard controller
port!
• Linux tries several ways, and several times
76
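With A20 masked, only the low 20 address bits reach the bus, so 0xFFFFF + 1 wraps around to 0. A tiny model of that masking:

```c
#include <stdint.h>

/* Physical address seen on the bus when the A20 line is masked:
 * every bit above bit 19 is forced to zero, so addresses wrap
 * around at 1 MB, as pre-286 software expected. */
static uint32_t a20_masked(uint32_t addr)
{
    return addr & ((1u << 20) - 1);
}
```

This wraparound is exactly what the kernel's a20_test exploits: it writes through one segment and checks whether the write aliases 1 MB higher.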
77. Go To PM: Details (3)
77
int enable_a20(void)
{...
while (loops--) {
if (a20_test_short())
return 0;
/* Next, try the BIOS (INT 0x15, AX=0x2401) */
enable_a20_bios();
if (a20_test_short())
return 0;
/* Try enabling A20 through the keyboard controller */
kbc_err = empty_8042();
if (a20_test_short())
return 0; /* BIOS worked, but with delayed reaction */
if (!kbc_err) {
enable_a20_kbc();
if (a20_test_long())
return 0;
}
/* Finally, try enabling the "fast A20 gate" */
enable_a20_fast();
if (a20_test_long())
return 0;
}
...
}
*Tries 100000 times at most
78. Go To PM: Details (4)
78
/*
* Reset IGNNE# if asserted in the FPU.
*/
static void reset_coprocessor(void)
{
outb(0, 0xf0);
io_delay();
outb(0, 0xf1);
io_delay();
}
/*
* Disable all interrupts at the legacy PIC.
*/
static void mask_all_interrupts(void)
{
outb(0xff, 0xa1); /* Mask all interrupts on the secondary PIC */
io_delay();
outb(0xfb, 0x21); /* Mask all but cascade on the primary PIC */
io_delay();
}
The PIC (8259) is the oldest, legacy interrupt controller
79. Go To PM: Details (5)
• IDT (Interrupt Descriptor Table)
• Describes the exception/interrupt handlers
(and task gate, etc.)
• At this time, no IDT is installed.
• null_idt contains information for the address and size for the
IDT, both of which are zero.
• LIDT instruction takes an argument that is a pointer to the
information.
79
static void setup_idt(void)
{
static const struct gdt_ptr null_idt = {0, 0};
asm volatile("lidtl %0" : : "m" (null_idt));
}
IDT register operand layout: len (16-bit), address (32-bit) → IDT
80. Go To PM: Details (6)
• GDT (Global Descriptor Table)
• Describes the segment information
80
static void setup_gdt(void)
{
static const u64 boot_gdt[] __attribute__((aligned(16))) = {
/* CS: code, read/execute, 4 GB, base 0 */
[GDT_ENTRY_BOOT_CS] = GDT_ENTRY(0xc09b, 0, 0xfffff),
/* DS: data, read/write, 4 GB, base 0 */
[GDT_ENTRY_BOOT_DS] = GDT_ENTRY(0xc093, 0, 0xfffff),
/* TSS: 32-bit tss, 104 bytes, base 4096 */
/* We only have a TSS here to keep Intel VT happy;
we don't actually use it for anything. */
[GDT_ENTRY_BOOT_TSS] = GDT_ENTRY(0x0089, 4096, 103),
};
static struct gdt_ptr gdt;
gdt.len = sizeof(boot_gdt)-1;
gdt.ptr = (u32)&boot_gdt + (ds() << 4);
asm volatile("lgdtl %0" : : "m" (gdt));
}
GDT register operand layout: len (16-bit), address (32-bit) → boot_gdt (the GDT)
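GDT_ENTRY packs base, limit, and the 16 flag bits into one 64-bit descriptor with the fields interleaved. The sketch below reproduces that bit layout in plain C (written from the kernel's GDT_ENTRY macro; treat it as an illustration, not the kernel source):

```c
#include <stdint.h>

/* Pack a segment descriptor the way the kernel's GDT_ENTRY(flags,
 * base, limit) macro does: limit and base are split across the
 * 64-bit entry, with the flag bits (type, S, DPL, P, AVL, L, D/B, G)
 * interleaved in between. */
static uint64_t gdt_entry(uint16_t flags, uint32_t base, uint32_t limit)
{
    return ((uint64_t)(base  & 0xff000000u) << 32) |  /* base 31:24  */
           ((uint64_t)(flags & 0x0000f0ffu) << 40) |  /* flag bits   */
           ((uint64_t)(limit & 0x000f0000u) << 32) |  /* limit 19:16 */
           ((uint64_t)(base  & 0x00ffffffu) << 16) |  /* base 23:0   */
           ((uint64_t)(limit & 0x0000ffffu));         /* limit 15:0  */
}
```

For example, the boot CS entry GDT_ENTRY(0xc09b, 0, 0xfffff) packs to 0x00cf9b000000ffff.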
81. x86 Architecture: GDT
• GDT
• Each entry has 8 byte
• Offset, limit, and attributes
• DPL: Descriptor privileged level (0-3: 0 is the most privileged)
• When the processor executes code in a code segment, the
current privilege level (CPL) equals the DPL of that code
segment. It can access data segments whose DPL >= CPL.
81
Descriptor layout (8 bytes):
• Bytes 0–1: Limit 15:00
• Bytes 2–3: Base Address 15:00
• Byte 4: Base 23:16
• Byte 5: Type, S, DPL, P
• Byte 6: Limit 19:16, AVL, L, D/B, G
• Byte 7: Base 31:24
[GDT_ENTRY_BOOT_CS] = GDT_ENTRY(0xc09b, 0, 0xfffff),
[GDT_ENTRY_BOOT_DS] = GDT_ENTRY(0xc093, 0, 0xfffff),
Access byte 0x9b = present, DPL 0, code, execute/read
Access byte 0x93 = present, DPL 0, data, read/write
82. x86 Architecture:
Flat Model and Segment Register
• Although the segment feature is available in 32-bit, the
common use is called “Flat Model.”
• Uses a single segment from zero to 2^32 − 1
• To be precise, different segments are required for code/data and
privileged/user mode.
• Linux uses four segments: KERNEL_CS, KERNEL_DS, USER_CS,
USER_DS
• During boot time, BOOT_CS and BOOT_DS are used (as defined in
the previous slide)
• Segment Register (Selector)
• If CS is to select BOOT_CS, CS = (index of BOOT_CS) << 3;
• GDT_ENTRY_BOOT_CS = (Index of BOOT_CS) = 2, then CS = 16
• The constants BOOT_CS = 16, BOOT_DS = 24.
• Note the difference between “ENTRY” and the actual value.
82
Selector layout (16 bits): Index (bits 15:3), TI (bit 2), RPL (bits 1:0)
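The selector layout above can be sketched as a small packing helper (an illustrative function, not kernel code):

```c
#include <stdint.h>

/* Segment selector: descriptor index in bits 15:3, TI (table
 * indicator: 0 = GDT, 1 = LDT) in bit 2, RPL (requested privilege
 * level) in bits 1:0. */
static uint16_t selector(uint16_t index, int ti, int rpl)
{
    return (uint16_t)((index << 3) | ((ti & 1) << 2) | (rpl & 3));
}
```

With GDT_ENTRY_BOOT_CS = 2 and GDT_ENTRY_BOOT_DS = 3, this yields the __BOOT_CS = 16 and __BOOT_DS = 24 values quoted above.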
83. Go To PM: Details (7)
• Call the assembler part (no return)
83
protected_mode_jump(boot_params.hdr.code32_start,
(u32)&boot_params + (ds() << 4));
GLOBAL(protected_mode_jump)
movl %edx, %esi # Pointer to boot_params table
xorl %ebx, %ebx
movw %cs, %bx
shll $4, %ebx
addl %ebx, 2f
jmp 1f # Short jump to serialize on 386/486
1:
movw $__BOOT_DS, %cx
movw $__BOOT_TSS, %di
movl %cr0, %edx
orb $X86_CR0_PE, %dl # Protected mode
movl %edx, %cr0
# Transition to 32-bit mode
.byte 0x66, 0xea # ljmpl opcode
2: .long in_pm32 # offset
.word __BOOT_CS # segment
ENDPROC(protected_mode_jump)
(The two arguments arrive in %eax and %edx under the regparm calling
convention: %eax = code32_start, %edx = the address of boot_params.)
The "addl %ebx, 2f" patches the jump target:
*(uint32_t *)2f += cs() << 4;
(i.e. the physical address of in_pm32)
[Notes]
In real mode, physical address = (Segment Register << 4) + Offset
To enter protected mode, set PE bit
in %cr0 register.
Then a 32-bit far jump (not expressible in real-mode assembly)
transfers control to 32-bit code.
85. Extended inline assembler (1)
• GCC’s extended inline assembly language
• Input/output operands can be specified for the assembler
• Assembler template
• The actual assembly language with templates that will be
substituted by the output/input operands
• Output operands
• List of C variables modified by the assembler template
• Input operands
• List of C expressions read by the instructions in the assembler
template.
• Clobber
• List of registers/values to be changed by the assembler template
(other than the output operands)
85
asm [volatile] (assembler template : [ output operands [ : input operands [ : clobber ]]])
static inline void outb(u8 v, u16 port)
{
asm volatile("outb %0,%1" : : "a" (v), "dN" (port));
}
Assembler template Input operands
86. Extended inline assembler (2)
• Assembler template
• Basically, the same as the standalone assembly language
• %n (n is zero or positive integer) refers to the (n+1)-th
operand in the input and output operands.
• If the character “%” is to be used (to specify a certain
register “%ebx,” for example), “%%” must be used.
• Other than the number, the name can be used to specify
an operand. (%[symbolicname] refers to the operand
with the name [symbolicname])
• To use multiple instructions, separate them with “;” or “\n”
86
static inline void outb(u8 v, u16 port)
{
asm volatile("outb %0,%1" : : "a" (v), "dN" (port));
}
87. Extended inline assembler (3)
• Input operands
• Comma-separated list of C expressions prefixed with
constraints
• A constraint specifies how the expression is passed to
the assembler template.
• When multiple constraints are specified, the compiler
selects the most efficient one.
87
asm volatile("outb %0,%1" : : "a" (v), "dN" (port));
constraint
C expression
Constraints (generic)
• ‘m’ : Memory operand
• ‘r’ : General register
• ‘i’ : Immediate integer
• ‘0’–‘9’ : The same place as operand n
Constraints (x86-specific)
• ‘a’ ‘b’ ‘c’ ‘d’ : A, B, C, D register
• ‘S’ ‘D’ : SI, DI register
• ‘N’ : Unsigned 8-bit integer (for in/out instructions)
• ‘A’ : EDX:EAX (32-bit), RDX:RAX (64-bit)
[SymbolicName] “Constraints” (C Expression),…
88. Extended inline assembler (4)
• In this example,
• The value of v (u8) is stored in %al register
• The value of port (u16) is stored in %dx register or used
as 8-bit immediate.
• This function is declared as “inline,” so if this function is called
with a constant value as port which is less than 256, the “N”
constraint may be used.
• Then, the instruction(s) in the assembler templates are
executed.
• The resulting assembly language will be
88
asm volatile("outb %0,%1" : : "a" (v), "dN" (port));
outb:
movl 8(%esp), %edx
movl 4(%esp), %eax
outb %al,%dx
ret
89. Extended inline assembler (5)
• Output operands
• Comma-separated list of C variables prefixed with constraints
• Constraints should be prefixed with “=” or “+”
• “+” means that the variable is used as both an input and an
output operand.
• The “&” (early-clobber) modifier allocates a register different
from the input operands (for templates with multiple
instructions, this modifier may be necessary)
• After the instruction(s) in the assembler template is executed,
the value of A register (%al) is stored to the variable v.
89
[SymbolicName] “=Constraints” (C Variable),…
static inline u8 inb(u16 port)
{
u8 v;
asm volatile("inb %1,%0" : "=a" (v) : "dN" (port));
return v;
}
90. Extended inline assembler (6)
• Clobber
• The list of registers/values modified by the instructions
• The output registers need not be specified here.
• The most common clobber is “memory”
• This means that the memory contents may be changed as side
effects, thus all the variables should be written back to the
memory before the assembler, and should be read again from the
memory after the assembler.
• “cc” : Condition (flags) registers
90
void *memcpy(void *dest, const void *src, size_t n)
{
int d0, d1, d2;
asm volatile(
"rep ; movsl\n\t"
"movl %4,%%ecx\n\t"
"rep ; movsb\n\t"
: "=&c" (d0), "=&D" (d1), "=&S" (d2)
: "0" (n >> 2), "g" (n & 3), "1" (dest), "2" (src)
: "memory");
return dest;
}
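The constraint trick in this memcpy is the word/byte split: n >> 2 goes to %ecx for rep movsl, and n & 3 leftover bytes are copied by rep movsb. The same split in plain C (a model of the idea, not the kernel's code):

```c
#include <stddef.h>
#include <string.h>  /* memcmp used in the usage test */

/* Copy n bytes as (n >> 2) four-byte chunks plus (n & 3) trailing
 * bytes, mirroring what the rep movsl / rep movsb pair does. */
static void *memcpy_split(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;
    size_t words = n >> 2;  /* what the "0" constraint loads into %ecx */
    size_t tail  = n & 3;   /* what "%4" reloads for rep movsb */
    for (size_t i = 0; i < words * 4; i++)
        d[i] = s[i];
    for (size_t i = 0; i < tail; i++)
        d[words * 4 + i] = s[words * 4 + i];
    return dest;
}
```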
92. Extended inline assembler (7)
• Examples (appeared in the previous slides)
• Example 1
• Stores %esp – STACK_SIZE to stack_end
• P in “%P1” is an operand modifier (not listed in the GCC
documentation at the time)
92
asm("leal %P1(%%esp),%0"
: "=r" (stack_end) : "i" (-STACK_SIZE));
• With “%P1”: leal -512(%esp),%eax
• With “%1”: leal $-512(%esp),%eax
• With “%c1” (“constant expression with no punctuation”): leal -512(%esp),%eax
93. Extended inline assembler (8)
• Example 2
• Far-calls the address (the value of
boot_params.hdr.realmode_swtch)
• The registers eax, ebx, ecx, and edx will be changed in
this call.
• Example 3
93
asm volatile("lcallw *%0"
: : "m" (boot_params.hdr.realmode_swtch)
: "eax", "ebx", "ecx", "edx");
static const struct gdt_ptr null_idt = {0, 0};
asm volatile("lidtl %0" : : "m" (null_idt));
setup_idt:
lidtl null_idt.1378
ret
95. Extended inline assembler (10)
• The key point
• The context is the stack
• The switched task resumes at “1:”. (just after “jmp
__switch_to”)
• The “__switch_to” function is called with a “jmp”
instruction, not a “call” instruction.
• Anyway
• The template does not use %n (number), but %[name]
style. (too many parameters)
95
asm volatile(...
"movl %%esp,%[prev_sp]nt" /* save ESP */
...
/* output parameters */
: [prev_sp] "=m" (prev->thread.sp),
99. Protected-Mode Kernel (p.54)
• arch/x86/boot/compressed/head_{32,64}.S
• Goal: Decompresses the kernel (vmlinux.gz/.bz2/.xz…)
and start the kernel
• Relocates the decompressing code (if relocatable and
loaded at a different address)
• Enables paging and enters the long-mode (in head_64.S)
• Clears the BSS, and prepares the heap and stack
• Decompresses the kernel
• Relocates if required
• RANDOMIZED_BASE or RELOCATABLE (in 32-bit)
99
102. Entry point (32-bit)
102
.text
__HEAD
ENTRY(startup_32)
#ifdef CONFIG_EFI_STUB
jmp preferred_addr
...
preferred_addr:
#endif
cld
testb $(1<<6), BP_loadflags(%esi)
jnz 1f
cli
movl $__BOOT_DS, %eax
movl %eax, %ds
movl %eax, %es
movl %eax, %fs
movl %eax, %gs
movl %eax, %ss
1:
.section ".head.text","ax"
If KEEP_SEGMENTS is set in loadflags in
boot_params, do not reload the segment registers.
103. Protected-Mode Protocol (p.53)
• Starts at the top of the protected mode kernel
• Usually loaded at 0x100000 (1MB)
• Can be at any position if compiled as relocatable
• Should be at the same position as specified in the compile
time if compiled as not relocatable
• Used in “linux” module in GRUB2
• [Protocol] At the entry point,
• The loaded GDT must have __BOOT_CS (0x10 / execute and
read) and __BOOT_DS (0x18 / read and write)
• %cs must be __BOOT_CS
• %ds, %es, and %ss must be __BOOT_DS
• Interrupts must be disabled
• %esi must be the address for struct boot_params
• %ebp, %edi, and %ebx must be zero.
103
104. Protected-Mode Kernel (p.54)
• arch/x86/boot/compressed/head_{32,64}.S
• Goal: Decompresses the kernel (vmlinux.gz/.bz2/.xz…)
and start the kernel
• Relocates the decompressing code (if relocatable and
loaded at a different address)
• Enables paging and enters the long-mode (in head_64.S)
• Decompresses the kernel
• Relocates if required
• RANDOMIZED_BASE or RELOCATABLE (in 32-bit)
104
105. Where are we?
105
leal (BP_scratch+4)(%esi), %esp
call 1f
1: popl %ebp
subl $1b, %ebp
• The call instruction pushes the return
address onto the stack
• The return address should be the next instruction after
the call instruction, i.e. 1f
• The immediately following “pop” retrieves the return address from
the stack, i.e. the absolute physical address of 1f
• Subtracting 1f (in this case, 1b) from the address (%ebp)
calculates the offset between the actual address and the
compile-time address (0-based, as seen in lds).
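The arithmetic behind the call/pop trick is a single subtraction: the popped value is the run-time address of label 1, and $1b is its link-time address. A trivial C sketch of that computation (addresses invented for illustration):

```c
#include <stdint.h>

/* %ebp = (run-time address of label 1) - (link-time address of
 * label 1). The result is the offset at which the image was
 * actually loaded, relative to its 0-based link address. */
static uint32_t load_offset(uint32_t runtime_label, uint32_t linktime_label)
{
    return runtime_label - linktime_label;
}
```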
107. Determine where to
decompress
• If CONFIG_RELOCATABLE
• The current position
(BP_kernel_alignment-
aligned)
• Default: 2MB-align
• If it is less than
LOAD_PHYSICAL_ADDR,
LOAD_PHYSICAL_ADDR is
used
• If not
CONFIG_RELOCATABLE
• LOAD_PHYSICAL_ADDR is
used
• Now %ebx is the target
address
107
#ifdef CONFIG_RELOCATABLE
movl %ebp, %ebx
movl BP_kernel_alignment(%esi), %eax
decl %eax
addl %eax, %ebx
notl %eax
andl %eax, %ebx
cmpl $LOAD_PHYSICAL_ADDR, %ebx
jge 1f
#endif
movl $LOAD_PHYSICAL_ADDR, %ebx
1:
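The decl/addl/notl/andl sequence is an align-up by kernel_alignment, followed by a floor at LOAD_PHYSICAL_ADDR. A C sketch of the same computation (the LOAD_PHYSICAL_ADDR value here is an assumed default for illustration):

```c
#include <stdint.h>

#define LOAD_PHYSICAL_ADDR 0x1000000u  /* assumed compile-time default */

/* Mirror the asm: %ebx = align_up(%ebp, align), floored at
 * LOAD_PHYSICAL_ADDR when the aligned address is too low.
 * 'align' must be a power of two (default 2 MB). */
static uint32_t target_addr(uint32_t loaded_at, uint32_t align)
{
    uint32_t ebx = (loaded_at + align - 1) & ~(align - 1);
    if (ebx < LOAD_PHYSICAL_ADDR)
        ebx = LOAD_PHYSICAL_ADDR;
    return ebx;
}
```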
108. Copy the decompression code
• Copy the area from the
head of PM kernel
(startup_32) to just before
the head of bss.
• The code copies the kernel
backwards in case of
overlapping
108
addl $z_extract_offset, %ebx
leal boot_stack_end(%ebx), %esp
pushl $0
popfl
pushl %esi
leal (_bss-4)(%ebp), %esi
leal (_bss-4)(%ebx), %edi
movl $(_bss - startup_32), %ecx
shrl $2, %ecx
std
rep movsl
cld
popl %esi
[Figure: the PM kernel at %ebp is copied to %ebx (the target address
plus z_extract_offset); vmlinux will be decompressed at the target
address]
109. Jump to the relocated address
• Jump to the copied
decompression code
• The decompression
code is the end in the
PM kernel
• Just after the
compressed kernel
image
• Clears the BSS
109
leal relocated(%ebx), %eax
jmp *%eax
ENDPROC(startup_32)
.text
relocated:
xorl %eax, %eax
leal _bss(%ebx), %edi
leal _ebss(%ebx), %ecx
subl %edi, %ecx
shrl $2, %ecx
rep stosl
[Figure: the relocated kernel, containing the "relocated" label, sits
above %ebx, where vmlinux will be decompressed]
110. Why z_extract_offset?
• The PM kernel contains the compressed kernel
image
• The relocating (copying) code is located at the head in
PM kernel
• The decompression code is located at the tail in the PM
kernel
• The decompression code after relocation is safe because
z_extract_offset + the compressed image size is larger
than the decompressed image size
110
[Figure: the PM kernel (head | compressed | decomp) is relocated
upward by z_extract_offset, so that the decompressed image plus its
work area does not overwrite the decompression code]
111. Fix up the absolute addresses
• The decompression code is built with -fPIC
(position independent code), and so fixing up the
absolute addresses is achieved by modifying the
addresses in GOT (Global Offset Table).
111
/*
* Adjust our own GOT
*/
leal _got(%ebx), %edx
leal _egot(%ebx), %ecx
1:
cmpl %ecx, %edx
jae 2f
addl %ebx, (%edx)
addl $4, %edx
jmp 1b
2:
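The loop walks the GOT between _got and _egot and adds the load offset (%ebx) to each 32-bit entry. A C model with an invented table:

```c
#include <stddef.h>
#include <stdint.h>

/* Add the load offset to every GOT slot, as the asm loop does for
 * each 4-byte entry between _got and _egot. */
static void adjust_got(uint32_t *got, size_t entries, uint32_t offset)
{
    for (size_t i = 0; i < entries; i++)
        got[i] += offset;
}
```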
112. Protected-Mode Kernel (p.54)
• arch/x86/boot/compressed/head_{32,64}.S
• Goal: Decompresses the kernel (vmlinux.gz/.bz2/.xz…)
and start the kernel
• Relocates the decompressing code (if relocatable and
loaded at a different address)
• Enables paging and enters the long-mode (in head_64.S)
• Decompresses the kernel
• Relocates if required
• RANDOMIZED_BASE or RELOCATABLE (in 32-bit)
112
113. Call the decompression routine
• Call the decompress_kernel function in C
• asmlinkage __visible void *decompress_kernel(void
*rmode, memptr heap, unsigned char *input_data,
unsigned long input_len, unsigned char *output,
unsigned long output_len)
113
pushl $z_output_len /* decompressed length */
leal z_extract_offset_negative(%ebx), %ebp
pushl %ebp /* output address */
pushl $z_input_len /* input_len */
leal input_data(%ebx), %eax
pushl %eax /* input_data */
leal boot_heap(%ebx), %eax
pushl %eax /* heap area */
pushl %esi /* real mode pointer */
call decompress_kernel /* returns kernel location in %eax */
[Figure: %esi points to boot_params (BP); the relocated code sits at
%ebx; the output address for the decompressed vmlinux is %ebx −
z_extract_offset]
115. Choosing the destination
• The choose_kernel_location function
• If KASLR is enabled, it computes some random output
address (aslr.c)
• Otherwise, it just returns the output parameter
115
116. Decompressing the kernel
• The decompress function does
everything
• The implementation is located at
lib/decompress_*.c
116
#ifdef CONFIG_KERNEL_GZIP
#include "../../../../lib/decompress_inflate.c"
#endif
#ifdef CONFIG_KERNEL_BZIP2
#include "../../../../lib/decompress_bunzip2.c"
#endif
#ifdef CONFIG_KERNEL_XZ
#include "../../../../lib/decompress_unxz.c"
#endif
...
arch
x86
boot
compressed
head_32.S
head_64.S
misc.c
kernel
head_32.S
head_64.S
117. Load the ELF
• parse_elf
• Parse the ELF header and locate the contents according
to the program header (p_paddr)
• If relocatable, p_paddr is adjusted by the
offset between the actual load address and
the compile-time load address.
117
for (i = 0; i < ehdr.e_phnum; i++) {
...
switch (phdr->p_type) {
case PT_LOAD:
#ifdef CONFIG_RELOCATABLE
dest = output;
dest += (phdr->p_paddr - LOAD_PHYSICAL_ADDR);
#else
dest = (void *)(phdr->p_paddr);
#endif
memcpy(dest,
output + phdr->p_offset,
phdr->p_filesz);
break;
...
}
}
typedef struct elf32_phdr{
Elf32_Word p_type;
Elf32_Off p_offset;
Elf32_Addr p_vaddr;
Elf32_Addr p_paddr;
Elf32_Word p_filesz;
Elf32_Word p_memsz;
Elf32_Word p_flags;
Elf32_Word p_align;
} Elf32_Phdr;
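The destination computation for a PT_LOAD segment can be isolated as follows (a sketch; the LOAD_PHYSICAL_ADDR value is an assumed default, and the helper name is invented):

```c
#include <stdint.h>

#define LOAD_PHYSICAL_ADDR 0x1000000u  /* assumed compile-time default */

/* Where a PT_LOAD segment lands: its link-time p_paddr, shifted by
 * the difference between the actual output address and
 * LOAD_PHYSICAL_ADDR when the kernel is relocatable. */
static uint32_t load_dest(uint32_t output, uint32_t p_paddr, int relocatable)
{
    if (relocatable)
        return output + (p_paddr - LOAD_PHYSICAL_ADDR);
    return p_paddr;
}
```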
118. Protected-Mode Kernel (p.54)
• arch/x86/boot/compressed/head_{32,64}.S
• Goal: Decompresses the kernel (vmlinux.gz/.bz2/.xz…)
and start the kernel
• Relocates the decompressing code (if relocatable and
loaded at a different address)
• Enables paging and enters the long-mode (in head_64.S)
• Decompresses the kernel
• Relocates if required
• RANDOMIZED_BASE or RELOCATABLE (in 32-bit)
118
119. Relocate the kernel image
• Relocation information (generated by the “relocs”
tool) is appended just after the ELF image
• The relocation information is a collection of addresses to
the absolute addresses in the kernel code
• These addresses are all expressed by kernel virtual addresses
[Figure: vmlinux (ELF), followed by the 32-bit relocation addresses
(terminated by 0) and then the 64-bit relocation addresses
(terminated by 0)]
$ objdump -adr vmlinux
…
c1086910 <vfs_llseek>:
c1086910: 55 push %ebp
...
c1086919: bb 60 63 08 c1 mov $0xc1086360,%ebx
c108691a: R_386_32 no_llseek
120. Calculate deltas
• __START_KERNEL_map
• In 32-bit, PAGE_OFFSET (default: 0xC0000000)
• In 64-bit, 0xffffffff80000000
120
static void handle_relocations(void *output, unsigned long
output_len)
{
...
unsigned long min_addr = (unsigned long)output;
...
/* Difference between the compile-time physical address
   and the actual physical address */
delta = min_addr - LOAD_PHYSICAL_ADDR;
...
/* The offset from the kernel virtual address to the
   physical address */
map = delta - __START_KERNEL_map;
...
121. Apply the relocation
121
for (reloc = output + output_len - sizeof(*reloc); *reloc; reloc--) {
int extended = *reloc;
extended += map;
ptr = (unsigned long)extended;
if (ptr < min_addr || ptr > max_addr)
error("32-bit relocation outside of kernel!n");
*(uint32_t *)ptr += delta;
}
#ifdef CONFIG_X86_64
for (reloc--; *reloc; reloc--) {
long extended = *reloc;
extended += map;
ptr = (unsigned long)extended;
if (ptr < min_addr || ptr > max_addr)
error("64-bit relocation outside of kernel!n");
*(uint64_t *)ptr += delta;
}
#endif
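The fixup arithmetic can be modeled in C: each recorded kernel-virtual address is converted via map into a location inside the loaded image, and the 32-bit word found there is adjusted by delta. The sketch below simulates the image as a byte buffer; all values in the test are invented, and 'map' here converts to a buffer offset rather than an absolute pointer.

```c
#include <stdint.h>
#include <string.h>

/* Apply one 32-bit relocation: 'vaddr' is the kernel-virtual address
 * recorded by the relocs tool, 'map' converts it (here) into an
 * offset within the image buffer, and 'delta' is the load
 * displacement added to the word at that location. memcpy is used to
 * avoid unaligned/aliasing issues in this model. */
static void apply_reloc32(uint8_t *image, uint32_t vaddr, uint32_t map,
                          uint32_t delta)
{
    uint32_t off = vaddr + map;  /* wraps mod 2^32, like the kernel math */
    uint32_t word;
    memcpy(&word, image + off, sizeof(word));
    word += delta;
    memcpy(image + off, &word, sizeof(word));
}
```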
122. OK, go to the entry point
• The entry point is always at the head of the kernel
• decompress_kernel returns the “output”
• The assembly code jumps into the entry point
122
asmlinkage __visible void *decompress_kernel(...)
{
...
output = choose_kernel_location(input_data, input_len,
output, output_len);
...
return output;
}
/*
* Jump to the decompressed kernel.
*/
xorl %ebx, %ebx
jmp *%eax