Booting Process (2)
Taku Shimosawa
For the new Linux kernel book
Materials
• http://www.slideshare.net/shimosawa/
Agenda
• Virtual Memory
• From architectural view
• Unfortunately, this presentation again does not enter
the main part of the kernel!
• Appendices
• Source code-level overview of the bootstrapping process
• Linker Scripts
• Inline Assemblers
• Spaces, tabs, blank lines, and comments are silently
omitted from the quoted source code.
• Omitted lines of actual code are marked with … or […]
Scope of the last presentation : x86
• Real Mode (16-bit)
• Boot sector, setup_header, and
16-bit entry point
• C-Language main function
• Retrieving memory information
• Transition to the protected
mode
• Protected Mode (32-bit)
• 32-bit(/64-bit) entry point,
preparing for decompression,
calling decompression code
• (EFI-Stub) efi_main (entry point
from UEFI)
• EFI call functions
• Protected Mode/Long Mode
• The beginning of the main
kernel
arch/
  x86/
    boot/
      header.S
      main.c
      memory.c
      pm.c
      pmjump.S
      compressed/
        head_32.S
        head_64.S
        eboot.c
        efi_stub_32.S
        efi_stub_64.S
    kernel/
      head_32.S
      head_64.S
* The …_32.S files are used only in the 32-bit kernel, and the …_64.S files only in the 64-bit kernel.
Scope of the last presentation : ARM
• Compressed
• Entry point
• Decompressing function
• Actual decompressing algorithm is in lib/decompress_*.c
• Building an FDT from ATAGS for compatibility
(CONFIG_ARM_ATAG_DTB_COMPAT)
• Decompressed
• The beginning of the main
kernel
arch/
  arm/
    boot/
      compressed/
        head.S
        decompress.c
        atags_to_fdt.c
    kernel/
      head.S
Follow-ups for the last presentation
• x86 assembly language
• What about instructions with three or more operands?
• (e.g.) imul
• Multiply EBX by 19 (0x13) and store the result in EAX
                 AT&T                        Intel
Operand order    Source, Destination         Destination, Source
Example          imul $0x13, %ebx, %eax      IMUL EAX, EBX, 13h
• Therefore,
                 AT&T                        Intel
Operand order    Source, Destination         Destination, Source
                 [[Op4,] Op3,] Op2, Op1      Op1, Op2 [, Op3 [, Op4]]
Follow-ups for the last presentation
• Multiple Relocations?
• The conclusion is “at most once” (in x86 arch)
• ELF relocation may follow the decompression, so the kernel
may be relocated twice in this sense.
• See the relocation part in this presentation.
x86 Architecture : Segmentation
• 6 Segment Registers (16-bit registers)
• Code Segment Register: CS
• Data Segment Register: DS, ES, FS, GS
• Stack Segment Register: SS
• Real mode : 20-bit address space
• Linear address = Physical address
• The size of each segment is 64K (16-bit)
• The segment register holds the upper 16 bits of the segment's 20-bit base
address (base = segment register x 16)
• Protected mode : 32-bit/36-bit physical address space
• Logical -(Segmentation)-> Linear -(Paging)-> Physical
(erratum: an earlier version of this slide listed paging before segmentation)
• The base (offset) and limit are stored in the descriptor table
• The segment registers point to an entry in the table
• Long mode : 48-bit linear address space
• For CS, DS, ES, and SS, the offset is always 0, and the limit is ignored.
• For FS and GS, the offset can be set by the descriptor or through an MSR
(for > 32-bit addresses)
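The real-mode rule above can be written down in a few lines of C. This is only a sketch: the function name real_mode_linear is made up for illustration, and the 20-bit wrap-around assumes the A20 line is masked as on the original 8086.

```c
#include <assert.h>
#include <stdint.h>

/* Real mode: linear address = segment * 16 + offset.
 * The result is truncated to 20 bits, assuming A20 is masked
 * (an assumption of this sketch, not something the slide states). */
static inline uint32_t real_mode_linear(uint16_t segment, uint16_t offset)
{
    return (((uint32_t)segment << 4) + offset) & 0xFFFFFu;
}
```

For example, segment 0x07c0 with offset 0 and segment 0x0000 with offset 0x7c00 both name linear address 0x7c00, which is why the boot sector code can "normalize" its segment registers.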
So what? (p.32) [Errata?]
vmlinux
(1a) Strip symbols -> boot/compressed/vmlinux.bin
(1b) Make relocation information -> vmlinux.relocs
(2a) Concatenate and compress (gzip, bzip2, lzma, lzo, lz4) -> vmlinux.bin.xz
(2b) Append the original size info (except for gzip) -> vmlinux.bin.xz + Size
(3) mkpiggy (piggy-back): make an object that contains the compressed image -> piggy.o
(4) Link piggy.o with the other objects (*.o, decompressing code) in boot/compressed -> boot/compressed/vmlinux
(5) Transform it into a simple binary -> boot/vmlinux.bin
(6) Concatenate boot/setup.bin (real-mode setup code, headers) with boot/vmlinux.bin and a CRC32 -> boot/bzImage
4. Virtual Memory
Segmentation and Paging
Virtual Memory
• The address visible to a task is “virtualized,” i.e.
translated by hardware to a certain physical address
when it is actually accessed.
• The hardware mechanism to translate the address is called
MMU (memory management unit).
• Aim / Benefit
• Using a larger memory area than the machine is actually
equipped with.
• Memory swapping, sparse memory areas
• Isolating tasks’ memory areas so that different applications
cannot touch (read or write) each other’s memory
• Not only between user tasks but between the kernel and tasks
• Abstracting the memory resources
• Providing contiguous memory area even if there is no physically
contiguous memory area available.
• User programs can run with certain addresses regardless of the
physical addresses where they are actually running.
Two ways to virtual memory
• Paging
• Dividing the memory area into chunks (“pages”) with a
certain small size, and defining a map from each chunk
to its physical location
• A different task may have a different map of the
memory
• Some overhead (both in speed and memory) to
translate and hold the map
• Segmentation
• The address is considered to be an offset inside a certain
segment of memory
• Less overhead (just adding an offset), but impossible to
achieve swapping
Illustrated
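[Figure: virtual memory mapped onto physical memory in two ways.
Paging: a page table maps virtual pages to physical pages (VA 1 -> PA 4, 2 -> 1, 3 -> 3, 5 -> 2).
Segmentation: a segment descriptor maps segment 1 to the physical range that starts at 2 and ends at 4.]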
Architecture and VM Capability
• x86
• Capable of paging
• 16-bit and 32-bit modes have the segmentation feature
• 64-bit mode has a very limited segmentation feature
• Because almost no one is using the segmentation feature
effectively!
• (See “flat model” described in a later slide)
• ARM
• Some CPU series have an MMU and are capable of paging
• “A” series
• Some CPU series have only an MPU (memory protection unit)
• “R” series
• No MMU
• “M” series (MPU is optional)
Focusing on paging…
• How does it work?
• A memory instruction is issued with a virtual address
• The CPU (MMU) looks for the virtual address in the TLB (Translation
Lookaside Buffer). Does it exist?
  • Yes: use the physical address in the TLB entry
  • No: TLB miss!
    • Software TLB: call the handler, and ask it to fill in a TLB entry
    corresponding to the virtual address (kernel's role)
    • Hardware TLB: traverse the page table to find the physical address
    for the virtual address. Present?
      • Yes: use the physical address, and (maybe) remember it in the TLB
      • No: page fault! Call the handler (kernel's role)
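The flow above can be mimicked with a toy model. The following is a minimal sketch, not how any real MMU is built: it assumes a single-level page table of 16 pages, a 4-slot direct-mapped TLB, and 4 KB pages, and every name in it (struct mmu, mmu_translate, and so on) is invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12   /* 4 KB pages */
#define TLB_SLOTS  4    /* tiny on purpose */
#define NUM_PAGES  16   /* size of the toy single-level page table */

struct tlb_entry { int valid; uint32_t vpn, pfn; };

struct mmu {
    struct tlb_entry tlb[TLB_SLOTS];
    int32_t page_table[NUM_PAGES];   /* frame number, or -1 if not present */
    unsigned tlb_hits, tlb_misses;
};

static inline void mmu_init(struct mmu *m)
{
    for (int i = 0; i < TLB_SLOTS; i++) m->tlb[i].valid = 0;
    for (int i = 0; i < NUM_PAGES; i++) m->page_table[i] = -1;
    m->tlb_hits = m->tlb_misses = 0;
}

/* Returns 0 and stores the physical address, or -1 on a page fault. */
static inline int mmu_translate(struct mmu *m, uint32_t va, uint32_t *pa)
{
    uint32_t vpn = va >> PAGE_SHIFT;
    uint32_t off = va & ((1u << PAGE_SHIFT) - 1);
    struct tlb_entry *e = &m->tlb[vpn % TLB_SLOTS];

    if (e->valid && e->vpn == vpn) {          /* TLB hit */
        m->tlb_hits++;
        *pa = (e->pfn << PAGE_SHIFT) | off;
        return 0;
    }
    m->tlb_misses++;                          /* TLB miss: walk the table */
    if (vpn >= NUM_PAGES || m->page_table[vpn] < 0)
        return -1;                            /* not present: page fault */
    e->valid = 1;                             /* remember it in the TLB */
    e->vpn = vpn;
    e->pfn = (uint32_t)m->page_table[vpn];
    *pa = (e->pfn << PAGE_SHIFT) | off;
    return 0;
}
```

Whether the table walk runs in hardware or in a kernel handler does not change this logic; only who executes the "walk the table" part differs.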
How much should the hardware do?
• TLB (Translation Lookaside Buffer)
• Cache of “virtual-to-physical” mappings.
• Limited number of entries.
• Hardware-controlled TLB
• When TLB misses occur, the CPU traverses page tables
• The format for the page table is defined by the architecture.
• x86 and ARM
• Software-controlled TLB
• When TLB misses occur, the software (typically, the OS kernel)
traverses page tables, and reports the result (translated physical
address) by filling in some entry in the TLB.
• Any type of page tables may be used (hash-based PT, for
example)
• But Linux uses almost the same format for this type of architecture
• PowerPC
Multilevel Page Table (tree-like)
• Typical structure of page table
• The first-level page table consists of entries that point to
the next-level page table. The index is taken from the most
significant bits of the virtual address.
• Of course, the next page table’s address is physical.
• The entries in the leaf page table denote the physical
addresses.
[Figure: first-level page table -> second-level page table -> third-level page table -> physical address]
x86-64 example
Resolving 0x00000004200310a5
= 00000000 00000000 00000000 00000100
00100000 00000011 00010000 10100101 (binary)
[Figure: CR3 -> PML4 Table (entries 0 to 511, entry 0 used) -> Page Directory Pointer Table (entry 16) -> Page Directory Table (entry 256) -> Page Table (entry 49, containing 0x1234567000) -> physical address 0x12345670a5. Each entry is 64 bits wide.]
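The table indices in this example are just 9-bit slices of the virtual address. A small sketch that reproduces them (the helper names are made up for illustration and are not kernel macros):

```c
#include <assert.h>
#include <stdint.h>

#define IDX_MASK 0x1ffULL   /* 9 bits: 512 entries per table */

static inline unsigned pml4_index(uint64_t va)  { return (unsigned)((va >> 39) & IDX_MASK); }
static inline unsigned pdpt_index(uint64_t va)  { return (unsigned)((va >> 30) & IDX_MASK); }
static inline unsigned pd_index(uint64_t va)    { return (unsigned)((va >> 21) & IDX_MASK); }
static inline unsigned pt_index(uint64_t va)    { return (unsigned)((va >> 12) & IDX_MASK); }
static inline unsigned page_offset(uint64_t va) { return (unsigned)(va & 0xfffULL); }

/* Last step of the walk: frame address from the leaf entry plus the offset. */
static inline uint64_t phys_4k(uint64_t frame, uint64_t va)
{
    return frame | page_offset(va);
}
```

With va = 0x00000004200310a5 the indices come out as 0, 16, 256, and 49, and a leaf entry pointing at 0x1234567000 yields 0x12345670a5, matching the figure.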
x86-64
• Currently, only 48 bits of a linear address are effective.
• A 64-bit address is the sign extension of the 48-bit address.
• Supports up to 52 bits for physical addresses
• %cr3 register : the physical address of the current
PML4 table
• mov ~~, %cr3 switches the page table (flushing the TLB)
• Four levels
• One entry in the PML4 table corresponds to 512 GB of
virtual memory, an entry in the PDP table to 1 GB, and so on.
• Each entry is 8 bytes
• Each table has 512 entries
• Thus, each table is 4 KB = 1 page.
Large Table
• One page occupies one entry in TLB
• If one process uses 1 GB of memory, it uses 256K
pages.
• i.e. If TLB does not have 256K entries (and usually it
doesn’t), TLB misses are inevitable
• x86_64 supports three types of page size
• 4 KB (normal)
• 2 MB
• 1 GB (!)
• The disadvantage is that a larger page requires
contiguous physical memory of the same size as the
page size.
x86-64 example (2MB page)
Resolving 0x00000004200310a5
= 00000000 00000000 00000000 00000100
00100000 00000011 00010000 10100101 (binary)
An entry in a higher-level page table directly
contains a physical address.
[Figure: CR3 -> PML4 Table (entry 0) -> Page Directory Pointer Table (entry 16) -> Page Directory Table (entry 256, containing 0x1234400000) -> physical address 0x12344310a5. Each entry is 64 bits wide.]
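With a 2 MB page the walk stops one level early and the low 21 bits of the address become the offset. A sketch of that last step, plus the page-count arithmetic behind the "1 GB uses 256K pages" remark (both helper names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* 2 MB page: the page directory entry holds the page base,
 * and bits [20:0] of the virtual address are the offset. */
static inline uint64_t phys_2m(uint64_t page_base, uint64_t va)
{
    return page_base | (va & ((1ULL << 21) - 1));
}

/* Number of pages (and thus TLB entries) needed to cover a region. */
static inline uint64_t pages_needed(uint64_t bytes, uint64_t page_size)
{
    return (bytes + page_size - 1) / page_size;
}
```

Covering 1 GB takes 262144 pages of 4 KB, 512 pages of 2 MB, or a single 1 GB page, which is the whole argument for large pages in terms of TLB pressure.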
Linux kernel usage
• Large Page
• The kernel mapping
• The kernel creates a direct ("straight") mapping of physical memory in the
kernel virtual address area
• This area is created during boot, and never changes after that
• 1 GB and 2 MB pages are used
• Hugetlbfs
• Explicit use from user applications
• Transparent Huge Pages
• Implicit (transparent) use of large pages for user applications
ARM
• ARM
• Two memory architectures
• VMSA (Virtual Memory System Architecture) : MMU
• PMSA (Protected Memory System Architecture) : MPU
• VMSA
• Two page table formats
• Short descriptor table
• Up to two-level lookup
• 32-bit PA (*with supersections, a 40-bit PA can be output)
• Long descriptor table
• Up to three-level lookup
• 40-bit PA
• Fixed size of page tables
Names in Linux
• Linux uses several arch-independent type names
for page table entries
• pgd_t, pud_t, pmd_t, pte_t
• Each type represents an entry in the table of the corresponding
level
Architecture (& Config)  Lv  pgd_t          pud_t  pmd_t          pte_t
x86_64                   4   PML4E          PDPTE  PDE            PTE
i386 (PAE)               3   PDPTE          -      PDE            PTE
i386                     2   PDE            -      -              PTE
ARM (LPAE)               3   1st-lv. Desc.  -      2nd-lv. Desc.  3rd-lv. Desc.
ARM                      2   1st-lv. Desc.  -      -              2nd-lv. Desc.
ARM64 (64KB page)        2   1st-lv. Desc.  -      -              2nd-lv. Desc.
ARM64                    3   1st-lv. Desc.  -      2nd-lv. Desc.  3rd-lv. Desc.
(*) AArch64 supports four-level page tables, thus 48-bit VA.
Notes
• PAE (i386)
• Physical Address Extension
• For those who want to enjoy >4GB of memory in 32-bit mode.
• Virtual address remains 32-bit, but can map to any physical
address (< 64-bit)
• The size of each entry is extended to 64-bit
• CONFIG_X86_PAE
• LPAE
• Large Physical Address Extension
• Almost the same as PAE in i386
• “The current implementation limits the output address range
to 40 bits”
• Each entry is extended to 64-bit (long-descriptor translation table
format)
• CONFIG_ARM_LPAE
ARM example (Short-descriptor)
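Resolving 0x200310a5
= 00100000 00000011 00010000 10100101 (binary)
[Figure: TTBR0 -> 1st Level Table (entries 0 to 4095, entry 512 used) -> 2nd Level Table (entries 0 to 255, entry 49 containing 0x12345000) -> physical address 0x123450a5. Each entry is 32 bits wide.]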
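The same slicing works for the ARM short-descriptor format, only with a 12-bit first-level index and an 8-bit second-level index. A sketch with made-up helper names:

```c
#include <assert.h>
#include <stdint.h>

/* ARM short-descriptor format, 4 KB small page */
static inline unsigned arm_l1_index(uint32_t va) { return va >> 20; }          /* bits [31:20] */
static inline unsigned arm_l2_index(uint32_t va) { return (va >> 12) & 0xffu; } /* bits [19:12] */

static inline uint32_t arm_small_page_pa(uint32_t page_base, uint32_t va)
{
    return page_base | (va & 0xfffu);   /* bits [11:0] are the offset */
}
```

For 0x200310a5 this gives first-level index 512 and second-level index 49, as in the figure.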
Quick Chart
Each cell shows: VA range used as index / table size (entry size x n) / size represented by each entry.
                     1st Level              2nd Level              3rd Level              4th Level
Intel 64-bit         [47:39]                [38:30]                [29:21]                [20:12]
                     4 KB (64 bit x 512)    4 KB (64 bit x 512)    4 KB (64 bit x 512)    4 KB (64 bit x 512)
                     512 GB / Entry         1 GB / Entry           2 MB / Entry           4 KB / Entry
Intel PAE            [31:30]                [29:21]                [20:12]
                     32 B (64 bit x 4)      4 KB (64 bit x 512)    4 KB (64 bit x 512)
                     1 GB / Entry           2 MB / Entry           4 KB / Entry
Intel 32-bit         [31:22]                [21:12]
                     4 KB (32 bit x 1024)   4 KB (32 bit x 1024)
                     4 MB / Entry           4 KB / Entry
ARM LPAE             [31:30]                [29:21]                [20:12]
                     32 B (64 bit x 4)      4 KB (64 bit x 512)    4 KB (64 bit x 512)
                     1 GB / Entry           2 MB / Entry           4 KB / Entry
ARM 32-bit           [31:20]                [19:12]
                     16 KB (32 bit x 4096)  1 KB (32 bit x 256)
                     1 MB / Entry           4 KB / Entry
ARM64 (4KB granule)  [38:30]                [29:21]                [20:12]
                     4 KB (64 bit x 512)    4 KB (64 bit x 512)    4 KB (64 bit x 512)
                     1 GB / Entry           2 MB / Entry           4 KB / Entry
Page size supported (by HW)
• x86_64
• 1 GB, 2 MB, 4 KB
• i386 (PAE)
• 2 MB, 4 KB
• i386
• 4 MB, 4 KB
• ARM
• 16 MB(*), 1 MB, 64 KB, 4 KB
• ARM (LPAE)
• 1 GB, 2 MB, 4 KB
• ARM64
• 1 GB, 2 MB, 4 KB (for 4KB translation granule)
• 32 MB, 16 KB (for 16KB translation granule)
• 512 MB, 64 KB (for 64KB translation granule)
(*) Depends on implementation
Page Attributes
• Pages can have attributes
• Used for memory protection
• Used for demand paging
• Used for COW (copy-on-write)
• Attributes
• Read / Write
• User / Privileged
• But where?
• In the page table entry corresponding to a page
• However, a page table entry is basically a physical
pointer, i.e. a 32-bit entry is occupied by 32-bit physical
pointer…
Page Attributes
• The lower bits in page table entries
• The start address of a page/page table is aligned!
• The lower bits are always zero.
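[Figure: page table entry layouts, from the most significant bit to the least significant bit]
x86_64 (64 bits): XD (bit 63) | Ignored | Physical Address [51:32] | Physical Address [31:12] | Ignored | G | PAT | D | A | PCD | PWT | US | RW | P (bit 0)
ARM (short descriptor, 32 bits): Physical Address [31:12] | nG | S | AP2 | TEX | AP | C | B | 1 | XN (bit 0)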
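Because the low 12 bits of a page-aligned address are always zero, an entry can carry the address and the attributes in one word. A minimal sketch for the x86_64 layout; the PTE_* names and helpers are made up for illustration (they are not the kernel's own macros), while the bit positions follow the architecture's page table entry format.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_P    (1ULL << 0)    /* Present */
#define PTE_RW   (1ULL << 1)    /* Read/Write */
#define PTE_US   (1ULL << 2)    /* User/Supervisor */
#define PTE_PWT  (1ULL << 3)    /* Write-through */
#define PTE_PCD  (1ULL << 4)    /* Cache disable */
#define PTE_A    (1ULL << 5)    /* Accessed */
#define PTE_D    (1ULL << 6)    /* Dirty */
#define PTE_PAT  (1ULL << 7)    /* Page attribute table */
#define PTE_G    (1ULL << 8)    /* Global */
#define PTE_XD   (1ULL << 63)   /* Execute-disable */
#define PTE_ADDR_MASK 0x000ffffffffff000ULL   /* physical address bits [51:12] */

/* Combine a page-aligned physical address with attribute bits. */
static inline uint64_t pte_make(uint64_t phys, uint64_t flags)
{
    return (phys & PTE_ADDR_MASK) | flags;
}

/* Recover the physical address by masking the attribute bits away. */
static inline uint64_t pte_addr(uint64_t pte)
{
    return pte & PTE_ADDR_MASK;
}
```

A present, writable, non-executable mapping of the frame at 0x1234567000 is then the single 64-bit value 0x8000001234567003.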
Page Attributes Comparison
                         x86_64                 ARM (short)
Enabled?                 Present (P)            Desc type (Bits 1 & 0)
RO or RW?                Read/Write (RW)        AP [2:1] or AP [2:0]
Privileged only or any?  User/Supervisor (US)   (same AP bits)
Write-through?           PWT                    TEX[2:0], B, C
Cacheable?               PCD                    (same TEX, B, C bits)
Accessed?                Accessed (A)           AP[0] (*configurable)
Dirty?                   Dirty (D)              N/A
Memory Type              PAT                    TEX[0], B, C (*configurable)
Global                   Global (G)             Not Global (nG)
Executable?              Execute-Disable (XD)   Execute-Never (XN)
Shareable?               (PAT)                  Shareable (S)
PowerPC Example [PowerPC 440]
• TLB is filled by software
• Search (tlbsx instruction), R/W (tlbre, tlbwe instructions)
[Figure: TLB entry layout in three words.
Word 0: Effective Page Number [0:21], V, TS, SIZE, TPAR, TID
Word 1: Real Page Number [0:21], PAR1, ERPN
Word 2: PAR2, Reserved, U3-U0, W, I, M, G, E, UX, UW, UR, SX, SW, SR]
• Attributes
• V : Valid
• SIZE : Page Size (4^n KB, where n in
{0,1,2,3,4,5,7,9,10})
• U : User-defined storage attribute
• W: Write-through
• I: Caching Inhibited
• M: Memory coherency required
• G: Guarded
• E: Endian
• UX, UW, UR: User
executable, writable,
readable
• SX, SW, SR: Supervisor
executable, writable,
readable
• TPAR, PAR1, PAR2: Parity
Before the kernel starts…
• x86 (32-bit)
• Paging is disabled
• kernel/head_32.S creates a page table and turns on
paging
• x86 (64-bit)
• compressed/head_64.S creates an identity-mapped (virtual =
physical) page table for the first 4 GB
• Long mode requires paging to be enabled.
• kernel/head_64.S creates a better page table
• ARM
• kernel/head.S creates a page table and turns on paging
Virtual memory mapping
[Figure: virtual memory mapping onto physical memory.
i386 virtual: LOWMEM is mapped at PAGE_OFFSET (0xC0000000), up to ~896 MB.
x86_64 virtual: PAGE_OFFSET (0xFFFF880000000000) and __START_KERNEL_map (0xFFFFFFFF80000000).]
A. Booting in x86
By looking into the source code
A-1. Real Mode
Plenty of assembler code, LD script, and inline assembly
language
Real mode kernel (from p.45)
• header.S
• Boot sector code which is no longer used
• Contains setup_header
• Prepares stack and BSS to run C programs
• Jumps into the C program (main.c)
• main.c
• Copies setup_header into “zeropage”
• Sets up the early console
• Initializes the heap
• Checks the CPUs (64-bit capable for 64-bit kernel?)
• Collects HW information by querying the BIOS, and stores the
results in “zeropage”
• Finally transitions to protected mode, and jumps into the
“protected-mode kernel”
Boot sector (Useless)
(arch/x86/boot/header.S)
    .global bootsect_start
bootsect_start:
#ifdef CONFIG_EFI_STUB
    # "MZ", MS-DOS header
    .byte 0x4d
    .byte 0x5a
#endif
    # Normalize the start address
    ljmp $BOOTSEG, $start2
start2:
    movw %cs, %ax
    movw %ax, %ds
    movw %ax, %es
    movw %ax, %ss
    xorw %sp, %sp
    sti
    cld
    movw $bugger_off_msg, %si
    jmp msg_loop
[Notes]
• ljmp: normalizes CS to BOOTSEG (0x7c0); movw %ds, %cs is not allowed.
• xorw %sp, %sp: the stack starts at 0x17c00
• sti: enables interrupts (cf. cli)
• cld: resets the direction for string instructions (clears the DF flag; cf. std)
• msg_loop: shows the message "Direct floppy boot is not supported."
Wait, how is the header code placed
at the beginning of the kernel?
• The linker concatenates multiple object files
• The positions in the resulting binary are not guaranteed
unless the linker is told the order
• The linker script (.ld/.lds/.lds.S) tells the linker
where to place each part!
• As it is quite likely for you to use the C preprocessor for
the linker script, files with the extension “.lds.S” are first
processed by the preprocessor, and then passed to the linker.
• Passing a linker script with “-T” overrides the default
linker script
• The default linker script can be displayed with
“ld --verbose”
LD script (1)
(arch/x86/boot/setup.ld; the other linker scripts are arch/x86/boot/compressed/vmlinux.lds.S and arch/x86/kernel/vmlinux.lds.S)
/*
 * setup.ld
 *
 * Linker script for the i386 setup code
 */
OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386")
OUTPUT_ARCH(i386)
ENTRY(_start)
SECTIONS
{
    . = 0;
    .bstext : { *(.bstext) }
    .bsdata : { *(.bsdata) }
    . = 495;
    .header : { *(.header) }
    .entrytext : { *(.entrytext) }
    .inittext : { *(.inittext) }
    .initdata : { *(.initdata) }
    __end_init = .;
[Notes]
• OUTPUT_FORMAT(default, big, little): specifies the output format (identical
to the --oformat option)
• OUTPUT_ARCH: specifies the output architecture
• ENTRY: specifies the entry point symbol (identical to the -e option)
LD script (2)
• SECTIONS specifies how the sections are output
• . means the current position
• Assigning to . sets the current position
• .bstext : { *(.bstext) } puts the .bstext section at the current
position, i.e. at address 0.
• .bsdata : { *(.bsdata) } puts the .bsdata section after
the .bstext section.
bstext section…?
(arch/x86/boot/header.S)
    .code16
    .section ".bstext", "ax"
    .global bootsect_start
bootsect_start:
#ifdef CONFIG_EFI_STUB
    # "MZ", MS-DOS header
    .byte 0x4d
    .byte 0x5a
#endif
    # Normalize the start address
    ljmp $BOOTSEG, $start2
start2:
    movw %cs, %ax
    movw %ax, %ds
    movw %ax, %es
    movw %ax, %ss
    xorw %sp, %sp
    sti
    cld
    movw $bugger_off_msg, %si
    jmp msg_loop
Here it is!
[Notes]
.code16 = Specify the binary for the following
code as 16-bit binary.
.section name[, flags] = Starts the section.
<flags> (excerpted)
• “a” : allocatable (loaded to memory when
executed)
• “w” : writable
• “x” : executable
.globl/.global symbol = Makes the symbol global
(Can be seen from other objects)
LD script (3)
Specifies how the sections are output
Set the current position to 495
Places the header section at the
address 495
Declares a symbol “__end_init” that
refers to the current position
(the end of .initdata section)
LD script (4)
[Layout of the output produced by setup.ld]
0:     .bstext
       .bsdata
495:   .header
       .entrytext
       .inittext
       .initdata
xxxx:  __end_init
LD script (5)
• To be precise,
.bstext : { *(.bstext) }
• Outputs a section named “.bstext”
• The output section contains all of the input sections named “.bstext”
• The input and output need not be 1-to-1
.text : {
    _text = .; /* Text */
    *(.text)
    *(.text.*)
    _etext = . ;
}
• The output section “.text” contains all of the input sections named “.text”,
and then all of the sections whose names start with “.text.”
• Creates the new symbols “_text” and “_etext”, which denote the
beginning and end of the output section “.text”, respectively.
LD script (6)
(arch/x86/boot/setup.ld)
    . = ALIGN(16);
    .data : { *(.data*) }
    .signature : {
        setup_sig = .;
        LONG(0x5a5aaa55)
    }
    ...
    /DISCARD/ : { *(.note*) }
    /*
     * The ASSERT() sink to . is intentional, for binutils 2.14 compatibility:
     */
    . = ASSERT(_end <= 0x8000, "Setup too big!");
    . = ASSERT(hdr == 0x1f1, "The setup header has the wrong offset!");
    /* Necessary for the very-old-loader check to work... */
    . = ASSERT(__end_init <= 5*512, "init sections too big!");
}
[Usage]
    # Check signature at end of setup
    cmpl $0x5a5aaa55, setup_sig
    jne setup_bad
[Notes]
• ALIGN(16): align to the 16-byte boundary
• LONG(...): put this long value at the current position
• /DISCARD/: discard the sections .note*
• ASSERT(...): assertions!
Column: align and balign
• LD’s ALIGN(x) returns the current position aligned up to an x-byte boundary
• x must be a power of two
• = (current + x - 1) & ~(x - 1)
• GNU Assembler has two pseudo-ops for alignment
• .align x, fill, max
• .balign x, fill, max
• Both align to the boundary specified by x, but what x means depends on the target
• The skipped bytes are filled with fill (zero or nop)
• The maximum number of bytes to be skipped can be specified
with max.
                          .align (x = 4)            .balign (x = 4)
i386 (elf), sparc, etc.   Align to 4 bytes          Align to 4 bytes
ppc, i386 (a.out), arm    Align to 16 bytes (2^4)   Align to 4 bytes
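The ALIGN formula is easy to check in C. This is a sketch with invented helper names; align_up follows the formula quoted on the slide, and p2align_bytes models the targets where the .align operand is a power-of-two exponent.

```c
#include <assert.h>
#include <stdint.h>

/* Round pos up to the next multiple of x; x must be a power of two. */
static inline uintptr_t align_up(uintptr_t pos, uintptr_t x)
{
    return (pos + x - 1) & ~(x - 1);
}

/* On targets where ".align n" means 2^n bytes, this is the byte boundary. */
static inline uintptr_t p2align_bytes(unsigned n)
{
    return (uintptr_t)1 << n;
}
```

An already aligned position is left unchanged, which is why the formula adds x - 1 rather than x.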
COFF Stuffs
(arch/x86/boot/header.S)
#ifdef CONFIG_EFI_STUB
    .org 0x3c
    # Offset to the PE header.
    .long pe_header
#endif /* CONFIG_EFI_STUB */
    .section ".bsdata", "a"
bugger_off_msg:
    .ascii "Direct floppy boot is not supported. "
    .ascii "Use a boot loader program instead.\r\n"
    ...
    .byte 0
#ifdef CONFIG_EFI_STUB
pe_header:
    .ascii "PE"
    .word 0
coff_header:
#ifdef CONFIG_X86_32
    .word 0x14c # i386
#else
    .word 0x8664 # x86-64
#endif
[Notes]
.org location, fill = Set the current position to
location in the current section (filling the skipped
bytes with fill)
.ascii string = Put the string (w/o zero termination)
at the current position (cf. .asciz)
.byte val, .word val, .long val, .quad val
= Put the 1/2/4/8-byte value(s)
Entry point (2nd sector)
(arch/x86/boot/header.S)
    .section ".header", "a"
    .globl sentinel
sentinel: .byte 0xff, 0xff /* Used to detect broken loaders */
    .globl hdr
hdr:
setup_sects: .byte 0 /* Filled in by build.c */
root_flags: .word ROOT_RDONLY
syssize: .long 0 /* Filled in by build.c */
ram_size: .word 0 /* Obsolete */
vid_mode: .word SVGA_MODE
root_dev: .word 0 /* Filled in by build.c */
boot_flag: .word 0xAA55
    # offset 512, entry point
    .globl _start
_start:
    .byte 0xeb # short (2-byte) jump
    .byte start_of_setup-1f
1:
    .ascii "HdrS" # header signature
    .word 0x020d # header version number (>= 0x0105)
[Notes]
• Layout: .bstext and .bsdata start at offset 0; .header starts at offset 495
• The jump is written as raw bytes to prevent the assembler
from accidentally producing a 3-byte jump
Setup_header
• “.header” section starts at 495
• 2-byte sentinel is located at the beginning.
• Struct setup_header begins at 497 (=0x1f1)
struct setup_header {
    __u8 setup_sects;
    __u16 root_flags;
    __u32 syssize;
    __u16 ram_size;
    __u16 vid_mode;
    __u16 root_dev;
    __u16 boot_flag;
    __u16 jump;
    __u32 header;
    __u16 version;
    __u32 realmode_swtch;
    ... (arch/x86/include/uapi/asm/bootparam.h)
[Figure: the boot sector occupies 0x0000 to 0x01ff, with struct setup_header starting at 0x1f1; the setup code starts at 0x0200.]
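The offsets can be double-checked with offsetof. The sketch below copies only the first few fields into a stand-in struct (setup_header_head is a made-up name, and fixed-width stdint types replace __u8/__u16/__u32); it assumes a GCC-compatible compiler for the packed attribute.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the beginning of struct setup_header (hypothetical name).
 * Packed, so that no padding is inserted after the 1-byte first field. */
struct setup_header_head {
    uint8_t  setup_sects;
    uint16_t root_flags;
    uint32_t syssize;
    uint16_t ram_size;
    uint16_t vid_mode;
    uint16_t root_dev;
    uint16_t boot_flag;
    uint16_t jump;
    uint32_t header;
    uint16_t version;
} __attribute__((packed));

#define HDR_OFFSET 0x1f1   /* where hdr is placed in the image */
#define FILE_OFFSET(field) \
    (HDR_OFFSET + offsetof(struct setup_header_head, field))
```

With this layout boot_flag (0xAA55) lands at 0x1fe, the last two bytes of the first sector, and jump lands at 0x200, which agrees with the "offset 512, entry point" comment in header.S.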
Column: Local Symbol in GAS
• Local symbols
• Symbols that can be used temporarily
• Format is N: (where N is a positive integer)
• To refer to the local symbols, use Nf or Nb.
• Nf refers to the next local label N.
• Nb refers to the most recently declared local label N.
• According to GNU assembler manual, these symbols are
internally transformed to the following format:
• LN^BO
• ^B is Ctrl-B (0x02), O is a serial number
• For 44th 3, “L3^B44” is used.
• Dollar local symbols (I haven’t seen this)
.byte start_of_setup-1f
1:
1:
jmp 1b
Get prepared to C (stack)
(arch/x86/boot/header.S)
    .section ".entrytext", "ax"
start_of_setup:
    # Force %es = %ds
    movw %ds, %ax
    movw %ax, %es
    cld
    movw %ss, %dx
    cmpw %ax, %dx # %ds == %ss?
    movw %sp, %dx
    je 2f # -> assume %sp is reasonably set
    # Invalid %ss, make up a new stack
    movw $_end, %dx
    testb $CAN_USE_HEAP, loadflags
    jz 1f
    movw heap_end_ptr, %dx
1:  addw $STACK_SIZE, %dx
    jnc 2f
    xorw %dx, %dx # Prevent wraparound
2:  # Now %dx should point to the end of our stack space
    andw $~3, %dx # dword align (might as well...)
    jnz 3f
    movw $0xfffc, %dx # Make sure we're not zero
3:  movw %ax, %ss
    movzwl %dx, %esp # Clear upper half of %esp
[Notes]
• If %ds == %ss, %sp is assumed to be properly set by the loader
• If not, a new stack is set up. Its end address is _end +
STACK_SIZE (512 bytes), or heap_end_ptr + STACK_SIZE if
CAN_USE_HEAP is set
In other words,
• Set the stack segment to the same value as %ds
• Allocate 512 bytes for the stack
unsigned short stack = %sp;
if (%ds != %ss) {
    if (hdr.loadflags & CAN_USE_HEAP) {
        stack = hdr.heap_end_ptr + STACK_SIZE;
    } else {
        stack = _end + STACK_SIZE;
    }
    if (carried over) { /* stack >= 0x10000 */
        stack = 0;
    }
}
/* Align to 4 bytes */
stack &= ~3;
if (stack == 0)
    stack = 0xfffc; /* -4 */
%ss = %ds;
%esp = stack;
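The pseudocode above can be turned into a plain C function for experimenting. This is a sketch only: pick_stack is a made-up name, the register values are passed in as arguments, STACK_SIZE is the 512 bytes quoted on the slide, and CAN_USE_HEAP is taken to be bit 7 of loadflags as defined in the x86 boot protocol.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_SIZE   512    /* per the slide */
#define CAN_USE_HEAP 0x80   /* loadflags bit 7 */

/* Returns the value that would be loaded into %sp. */
static inline uint16_t pick_stack(uint16_t ds, uint16_t ss, uint16_t sp,
                                  uint8_t loadflags, uint16_t heap_end_ptr,
                                  uint16_t end)
{
    uint16_t stack = sp;                 /* trust the loader if %ds == %ss */

    if (ds != ss) {                      /* otherwise make up a new stack */
        uint32_t top = (loadflags & CAN_USE_HEAP) ? heap_end_ptr : end;
        top += STACK_SIZE;
        stack = (top > 0xffff) ? 0 : (uint16_t)top;   /* prevent wraparound */
    }
    stack &= (uint16_t)~3u;              /* dword align */
    if (stack == 0)
        stack = 0xfffc;                  /* make sure it is not zero */
    return stack;
}
```

The wraparound case is the interesting one: a heap that ends at 0xfe00 would put the stack top at 0x10000, which does not fit in 16 bits, so the code falls back to 0xfffc.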
Get prepared to C
(CS fix and BSS clear)
(arch/x86/boot/header.S)
    sti # Now we should have a working stack
    # We will have entered with %cs = %ds+0x20, normalize %cs so
    # it is on par with the other segments.
    pushw %ds
    pushw $6f
    lretw
6:
    # Check signature at end of setup
    cmpl $0x5a5aaa55, setup_sig
    jne setup_bad
    # Zero the bss
    movw $__bss_start, %di
    movw $_end+3, %cx
    xorl %eax, %eax
    subw %di, %cx
    shrw $2, %cx
    rep; stosl
    # Jump to C code (should not return)
    calll main
[Notes]
• $6f is the address of label 6, i.e. its offset from the start of the
boot sector; lretw pops it together with the pushed %ds, so %cs becomes %ds.
• Signature check
• Fill the bss with zeros.
“rep; stosl” (a string instruction) fills the
memory from %es:%di
for %cx DWORDs with %eax.
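The count loaded into %cx is (_end + 3 - __bss_start) >> 2, i.e. the BSS size rounded up to whole DWORDs. A C sketch of the same arithmetic (the function names are invented, and a byte array stands in for the real-mode segment):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Number of 32-bit stores that "rep; stosl" performs. */
static inline uint16_t bss_dwords(uint16_t bss_start, uint16_t end)
{
    return (uint16_t)((uint16_t)(end + 3 - bss_start) >> 2);
}

/* Zero the region the way the assembly does: whole DWORDs only. */
static inline void zero_bss(uint8_t *mem, uint16_t bss_start, uint16_t end)
{
    memset(mem + bss_start, 0, (size_t)bss_dwords(bss_start, end) * 4);
}
```

Note that because of the rounding, up to three bytes past _end may be cleared as well.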
[Column] Calling conventions
• 16 bit (name unknown)
• Arguments: %ax, %dx, %cx
• Return value: %ax
• 32 bit (cdecl)
• Arguments: pushed on the stack (in the reversed order of the
arguments)
• Caller-saved: %eax, %ecx, and %edx
• Callee-saved: the others
• Return value: %eax (for int)
• 64 bit (amd64)
• Arguments: %rdi, %rsi, %rdx, %rcx, %r8, %r9
• Caller-saved: all registers other than the callee-saved ones
• Callee-saved: %rbp, %rbx, %r12 to %r15
• Return value: %rax
[Figure: stack for f(2, 5, 9, 11) under cdecl, from higher to lower addresses: 11, 9, 5, 2, (return address)]
main
(arch/x86/boot/main.c)
void main(void)
{
    /* First, copy the boot header into the "zeropage" */
    copy_boot_params();
    /* Initialize the early-boot console */
    console_init();
    ...
    /* End of heap check */
    init_heap();
    /* Make sure we have all the proper CPU support */
    if (validate_cpu()) {
        ...
    }
    set_bios_mode();
    detect_memory();
    keyboard_init();
    query_mca();
    query_ist();
    ...
    /* Set the video mode */
    set_video();
    /* Do the last things and invoke protected mode */
    go_to_protected_mode();
}
Copy to zeropage
• Very simple
• The omitted part is for compatibility with
the old command-line protocol
(command line located at a fixed address)
(arch/x86/boot/main.c)
struct boot_params boot_params __attribute__((aligned(16)));
...
static void copy_boot_params(void)
{
    ...
    BUILD_BUG_ON(sizeof boot_params != 4096);
    memcpy(&boot_params.hdr, &hdr, sizeof hdr);
    ...
}
Set up the serial console
• Parses the command-line parameters
in a very ad hoc way to find the
serial configuration
• Finds “earlyprintk” and checks whether it is in one of
the following formats
• “serial,0x3f8,115200”
• “serial,ttyS0,115200”
• “ttyS0,115200”
• Finds “console” and looks for “uart8250,io,…” or “uart,io,…”
• If any serial config is found, sets it up using I/O ports
(arch/x86/boot/early_serial_console.c)
void console_init(void)
{
    parse_earlyprintk();
    if (!early_serial_base)
        parse_console_uart8250();
}
Puts and putchar
• By BIOS call and serial I/O ports
(arch/x86/boot/tty.c)
void __attribute__((section(".inittext"))) putchar(int ch)
{
    if (ch == '\n')
        putchar('\r'); /* \n -> \r\n */
    bios_putchar(ch);
    if (early_serial_base != 0)
        serial_putchar(ch);
}
void __attribute__((section(".inittext"))) puts(const char *str)
{
    while (*str)
        putchar(*str++);
}
[Notes] GCC extension __attribute__
section(section) : locate the function/variable in
the specified section.
Serial and BIOS putchar
(arch/x86/boot/tty.c)
static void __attribute__((section(".inittext")))
serial_putchar(int ch)
{
    unsigned timeout = 0xffff;
    while ((inb(early_serial_base + LSR) & XMTRDY) == 0 && --timeout)
        cpu_relax();
    outb(ch, early_serial_base + TXR);
}
static void __attribute__((section(".inittext"))) bios_putchar(int ch)
{
    struct biosregs ireg;
    initregs(&ireg);
    ireg.bx = 0x0007;
    ireg.cx = 0x0001;
    ireg.ah = 0x0e;
    ireg.al = ch;
    intcall(0x10, &ireg, NULL);
}
[Notes]
• serial_putchar: puts a char on a serial line by using I/O ports
(IN and OUT instructions)
• bios_putchar: puts a char on VGA by a BIOS call (INT 0x10,
AH = 0x0e)
BIOS Call
• A BIOS call is invoked by using an INT instruction
• Requires assembly-language support
• Parameters and return values are passed in a certain set
of registers
• The INT instruction only takes an immediate for the interrupt
number.
• C prototype:
void intcall(u8 int_no, const struct biosregs *ireg,
             struct biosregs *oreg);
• struct biosregs has all the general registers,
the data segment registers, and the flag register
void initregs(struct biosregs *reg)
{
    memset(reg, 0, sizeof *reg);
    reg->eflags |= X86_EFLAGS_CF;
    reg->ds = ds();
    reg->es = ds();
    reg->fs = fs();
    reg->gs = gs();
}
BIOS Call Impl. (1)
64
(arch/x86/boot/bioscall.S)
.code16
.section ".inittext","ax"
.globl intcall
...
intcall:
cmpb %al, 3f
je 1f
movb %al, 3f
jmp 1f /* Synchronize pipeline */
1:
...
/* Actual INT */
.byte 0xcd /* INT opcode */
3: .byte 0
...
void intcall(u8 int_no, const struct biosregs *ireg, struct biosregs *oreg);
(The arguments arrive in registers: int_no in %ax, ireg in %dx, oreg in %cx)
Checks the current operand of the INT instruction, and rewrites (self-modifies) the interrupt number if it is different.
BIOS Call Impl. (2)
65
1:
/* Save state */
pushfl
pushw %fs
pushw %gs
pushal
/* Copy input state to stack frame */
subw $44, %sp
movw %dx, %si
movw %sp, %di
movw $11, %cx
rep; movsd
/* Pop full state from the stack */
popal
popw %gs
popw %fs
popw %es
popw %ds
popfl
/* Actual INT */
.byte 0xcd /* INT opcode */
3: .byte 0
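[Figure: stack layout. The caller's registers (EFLAGS, FS, GS, EAX, ECX, …, EDI) are saved on the stack first; below them sits a 44-byte copy of struct biosregs *ireg (fields: EFLAGS, FS, GS, DS, ES, EAX, …, EDI), which is then popped into the registers.]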
BIOS Call Impl. (3)
66
/* Push full state to the stack */
pushfl
pushw %ds
pushw %es
pushw %fs
pushw %gs
pushal
...
(Restore %ds, %sp, etc.)
...
/* Copy output state from stack frame */
movw 68(%esp), %di /* Original %cx == 3rd argument */
andw %di, %di
jz 4f
movw %sp, %si
movw $11, %cx
rep; movsd
/* Restore state and return */
popal
popw %gs
popw %fs
popfl
retl
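[Figure: stack layout after the INT. The full register state (EFLAGS, FS, GS, DS, ES, EAX, …, EDI) is pushed and copied to *oreg; the caller's saved registers (EFLAGS, FS, GS, EAX, ECX, …, EDI) are then restored.]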
Inline assembly
• A quick way to use assembly language inside C source code
• For example, when you want to disable interrupts, put the following into your C code:
asm ("cli");
• GCC’s extended inline assembly language enables far more features (and is more complicated)
• => Described about twenty slides later!
67
static inline void outb(u8 v, u16 port)
{
asm volatile("outb %0,%1" : : "a" (v), "dN" (port));
}
Initialize the heap
68
char *HEAP = _end;
char *heap_end = _end; /* Default end of heap = no heap */
...
static void init_heap(void)
{
char *stack_end;
if (boot_params.hdr.loadflags & CAN_USE_HEAP) {
asm("leal %P1(%%esp),%0"
: "=r" (stack_end) : "i" (-STACK_SIZE));
heap_end = (char *)
((size_t)boot_params.hdr.heap_end_ptr +
0x200);
if (heap_end > stack_end)
heap_end = stack_end;
} else {
/* Boot protocol 2.00 only, no heap available */
puts("WARNING: Ancient bootloader, some functionality "
"may be limited!\n");
}
}
(arch/x86/boot/main.c)
The asm statement stores %esp - STACK_SIZE in stack_end; heap_end is then capped at stack_end.
When is the heap used?
• Heap allocation function is very simple
• And the calls for GET_HEAP exist only in the video
code files.
69
static inline char *__get_heap(size_t s, size_t a, size_t n)
{
char *tmp;
HEAP = (char *)(((size_t)HEAP+(a-1)) & ~(a-1));
tmp = HEAP;
HEAP += s*n;
return tmp;
}
#define GET_HEAP(type, n) \
((type *)__get_heap(sizeof(type),__alignof__(type),(n)))
saved.data = GET_HEAP(u16, saved.x*saved.y);
(boot/video.c)
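The allocation scheme above is a plain bump allocator: align the cursor, hand out the block, advance the cursor. A minimal user-space sketch of the same idea follows. The names arena, heap_cur, heap_limit, and bump_alloc are hypothetical stand-ins (not kernel identifiers), and the bounds check is an addition for the sketch; the quoted __get_heap does not check anything itself.

```c
#include <stddef.h>
#include <stdint.h>

static char arena[256];                         /* hypothetical stand-in for the area after _end */
static char *heap_cur = arena;                  /* plays the role of HEAP */
static char *heap_limit = arena + sizeof arena; /* plays the role of heap_end */

/* Round the cursor up to alignment a (a power of two), then carve out s*n bytes. */
static void *bump_alloc(size_t s, size_t a, size_t n)
{
    uintptr_t p = ((uintptr_t)heap_cur + (a - 1)) & ~(uintptr_t)(a - 1);

    if (p + s * n > (uintptr_t)heap_limit)
        return NULL;                            /* sketch-only check, not in __get_heap */
    heap_cur = (char *)(p + s * n);
    return (void *)p;
}
```

Nothing is ever freed individually in this sketch; the design only works because the boot code needs a few short-lived buffers.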
Retrieving memory info.
• As described in the last presentation,
detect_memory tries 3 methods
70
int detect_memory(void)
{
...
if (detect_memory_e820() > 0)
err = 0;
if (!detect_memory_e801())
err = 0;
if (!detect_memory_88())
err = 0;
return err;
}
(arch/x86/boot/memory.c)
Memory Information [from p.48]
• AX = 0xe820, INT 0x15 [detect_memory_e820()]
• INPUT
• AX = 0xe820
• CX = size of the buffer
• EDX = “SMAP” (0x534d4150 / Signature)
• EBX = Continuation value
• ES:DI = address for the buffer
• OUTPUT
• CF = 0 if successful, 1 otherwise
• CX = Number of bytes returned
• EBX = Continuation value
• Each call returns information for one range
• To get information for the next range, pass the continuation value returned by the previous call
• The range information is returned by the following structure
• Stored in boot_params.e820_map (struct e820entry[128])
71
struct e820entry {
__u64 addr; /* start of memory segment */
__u64 size; /* size of memory segment */
__u32 type; /* type of memory segment */
} __attribute__((packed));
(arch/x86/include/uapi/asm/e820.h)
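Once the map is filled in, a consumer typically walks the entries and looks at the type field. The sketch below sums the usable ranges. It is an illustration only: struct e820entry_sketch mirrors the structure above with standard C types, and usable_bytes and E820_TYPE_RAM are names assumed for this example (type 1 is the E820 value for usable RAM).

```c
#include <stddef.h>
#include <stdint.h>

#define E820_TYPE_RAM 1 /* assumed name; 1 marks usable RAM in an E820 map */

/* Same layout as struct e820entry above: 8 + 8 + 4 = 20 bytes, packed. */
struct e820entry_sketch {
    uint64_t addr; /* start of memory segment */
    uint64_t size; /* size of memory segment */
    uint32_t type; /* type of memory segment */
} __attribute__((packed));

/* Sum the sizes of all entries reported as usable RAM. */
static uint64_t usable_bytes(const struct e820entry_sketch *map, size_t n)
{
    uint64_t total = 0;

    for (size_t i = 0; i < n; i++)
        if (map[i].type == E820_TYPE_RAM)
            total += map[i].size;
    return total;
}
```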
E820
72
static int detect_memory_e820(void)
{
int count = 0;
struct biosregs ireg, oreg;
struct e820entry *desc = boot_params.e820_map;
static struct e820entry buf; /* static so it is zeroed */
initregs(&ireg);
ireg.ax = 0xe820;
ireg.cx = sizeof buf;
ireg.edx = SMAP;
ireg.di = (size_t)&buf;
do {
intcall(0x15, &ireg, &oreg);
ireg.ebx = oreg.ebx; /* for next iteration... */
if (oreg.eflags & X86_EFLAGS_CF)
break;
...
*desc++ = buf;
count++;
} while (ireg.ebx && count < ARRAY_SIZE(boot_params.e820_map));
return boot_params.e820_entries = count;
}
(arch/x86/boot/memory.c)
Video
• Smells like chaos
73
Go To Protected Mode
74
void go_to_protected_mode(void)
{
/* Hook before leaving real mode, also disables interrupts */
realmode_switch_hook();
/* Enable the A20 gate */
if (enable_a20()) {
puts("A20 gate not responding, unable to boot...\n");
die();
}
/* Reset coprocessor (IGNNE#) */
reset_coprocessor();
/* Mask all interrupts in the PIC */
mask_all_interrupts();
/* Actual transition to protected mode... */
setup_idt();
setup_gdt();
protected_mode_jump(boot_params.hdr.code32_start,
(u32)&boot_params + (ds() << 4));
}
(arch/x86/boot/pm.c)
Go To PM: Details (1)
• Call the hook if set
• Otherwise, disable interrupts and NMI.
75
static void realmode_switch_hook(void)
{
if (boot_params.hdr.realmode_swtch) {
asm volatile("lcallw *%0"
: : "m" (boot_params.hdr.realmode_swtch)
: "eax", "ebx", "ecx", "edx");
} else {
asm volatile("cli");
outb(0x80, 0x70); /* Disable NMI */
io_delay();
}
}
If a hook is set in realmode_swtch, call the hook.
Otherwise, write 0x80 to port 0x70 (the CMOS controller!!). For historical reasons, the “NMI disable” bit is located in the CMOS controller.
(arch/x86/boot/pm.c)
Go To PM: Details (2)
• Enable the A20 line (address bit 20, i.e. the 21st line of the address bus)
• In the initial state, the bit is masked (always 0)
• This is for compatibility with programs that expect addresses to wrap around at 1MB
• Some programs expect the address 0xFFFFF + 1 = 0x00000
• To use memory above 1MB, this mask must be disabled.
• Many ways to do it
• But which way works depends on the firmware
• The famous one would be via the keyboard controller port!
• Linux tries several ways, and several times
76
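The wraparound that the A20 mask preserves is easy to see with a little address arithmetic. The following is a minimal sketch, not kernel code; real_mode_linear and apply_a20_mask are names made up for illustration.

```c
#include <stdint.h>

/* Real-mode linear address: (segment << 4) + offset, at most 0x10FFEF. */
static uint32_t real_mode_linear(uint16_t seg, uint16_t off)
{
    return ((uint32_t)seg << 4) + off;
}

/* With the A20 gate disabled, address bit 20 is forced to 0. */
static uint32_t apply_a20_mask(uint32_t linear, int a20_enabled)
{
    return a20_enabled ? linear : (linear & ~(UINT32_C(1) << 20));
}
```

For example, FFFF:0010 is linear 0x100000; with A20 masked it lands on address 0, and with A20 enabled it reaches the first byte above 1MB.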
Go To PM: Details (3)
77
int enable_a20(void)
{...
while (loops--) {
if (a20_test_short())
return 0;
/* Next, try the BIOS (INT 0x15, AX=0x2401) */
enable_a20_bios();
if (a20_test_short())
return 0;
/* Try enabling A20 through the keyboard controller */
kbc_err = empty_8042();
if (a20_test_short())
return 0; /* BIOS worked, but with delayed reaction */
if (!kbc_err) {
enable_a20_kbc();
if (a20_test_long())
return 0;
}
/* Finally, try enabling the "fast A20 gate" */
enable_a20_fast();
if (a20_test_long())
return 0;
}
...
}
(arch/x86/boot/a20.c)
*Tries 100000 times at most
Go To PM: Details (4)
78
/*
* Reset IGNNE# if asserted in the FPU.
*/
static void reset_coprocessor(void)
{
outb(0, 0xf0);
io_delay();
outb(0, 0xf1);
io_delay();
}
/*
* Disable all interrupts at the legacy PIC.
*/
static void mask_all_interrupts(void)
{
outb(0xff, 0xa1); /* Mask all interrupts on the secondary PIC */
io_delay();
outb(0xfb, 0x21); /* Mask all but cascade on the primary PIC */
io_delay();
}
The most legacy interrupt controller
(arch/x86/boot/pm.c)
Go To PM: Details (5)
• IDT (Interrupt Descriptor Table)
• Describes the exception/interrupt handlers
(and task gate, etc.)
• At this time, no IDT is installed.
• null_idt contains information for the address and size for the
IDT, both of which are zero.
• LIDT instruction takes an argument that is a pointer to the
information.
79
static void setup_idt(void)
{
static const struct gdt_ptr null_idt = {0, 0};
asm volatile("lidtl %0" : : "m" (null_idt));
}
(arch/x86/boot/pm.c)
[Figure: the LIDT operand consists of len (16-bit) followed by address (32-bit), describing the IDT]
Go To PM: Details (6)
• GDT (Global Descriptor Table)
• Describes the segment information
80
(arch/x86/boot/pm.c)
static void setup_gdt(void)
{
static const u64 boot_gdt[] __attribute__((aligned(16))) = {
/* CS: code, read/execute, 4 GB, base 0 */
[GDT_ENTRY_BOOT_CS] = GDT_ENTRY(0xc09b, 0, 0xfffff),
/* DS: data, read/write, 4 GB, base 0 */
[GDT_ENTRY_BOOT_DS] = GDT_ENTRY(0xc093, 0, 0xfffff),
/* TSS: 32-bit tss, 104 bytes, base 4096 */
/* We only have a TSS here to keep Intel VT happy;
we don't actually use it for anything. */
[GDT_ENTRY_BOOT_TSS] = GDT_ENTRY(0x0089, 4096, 103),
};
static struct gdt_ptr gdt;
gdt.len = sizeof(boot_gdt)-1;
gdt.ptr = (u32)&boot_gdt + (ds() << 4);
asm volatile("lgdtl %0" : : "m" (gdt));
}
[Figure: the LGDT operand consists of len (16-bit) followed by address (32-bit), pointing at boot_gdt (GDT)]
x86 Architecture: GDT
• GDT
• Each entry is 8 bytes
• Base, limit, and attributes
• DPL: Descriptor privilege level (0-3: 0 is the most privileged)
• When a processor executes code in a code segment, the current privilege level (CPL) is the same as the DPL of that code segment. It can access data segments with DPL >= CPL.
81
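Descriptor layout (two 32-bit words):
Offset 4: Base 31:24 (bits 31-24), G D L * (bits 23-20), Limit 19:16 (bits 19-16), P (bit 15), DPL (bits 14-13), S (bit 12), Type (bits 11-8), Base 23:16 (bits 7-0)
Offset 0: Base Address 15:00 (bits 31-16), Limit 15:00 (bits 15-0)
[GDT_ENTRY_BOOT_CS] = GDT_ENTRY(0xc09b, 0, 0xfffff),
[GDT_ENTRY_BOOT_DS] = GDT_ENTRY(0xc093, 0, 0xfffff),
Type 0xb = Code, Execute/Read (accessed)
Type 0x3 = Data, Read/Write (accessed)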
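To see how the three arguments end up in the 8-byte layout, the packing can be written out as plain bit arithmetic. This is a sketch under assumptions: gdt_entry, gdt_type, and gdt_dpl are hypothetical helper names, and the shifts simply follow the descriptor layout on this slide rather than quoting the kernel's GDT_ENTRY macro.

```c
#include <stdint.h>

/* Pack flags (access byte + G/D/L/* nibble), base, and limit into a descriptor. */
static uint64_t gdt_entry(uint64_t flags, uint64_t base, uint64_t limit)
{
    return ((base  & 0xff000000ULL) << 32) | /* Base 31:24  -> bits 63:56 */
           ((flags & 0x0000f0ffULL) << 40) | /* attributes  -> bits 55:52 and 47:40 */
           ((limit & 0x000f0000ULL) << 32) | /* Limit 19:16 -> bits 51:48 */
           ((base  & 0x00ffffffULL) << 16) | /* Base 23:0   -> bits 39:16 */
            (limit & 0x0000ffffULL);         /* Limit 15:0  -> bits 15:0  */
}

/* Extract the Type and DPL fields from a packed descriptor. */
static unsigned gdt_type(uint64_t e) { return (unsigned)((e >> 40) & 0xf); }
static unsigned gdt_dpl(uint64_t e)  { return (unsigned)((e >> 45) & 0x3); }
```

With the boot values, gdt_entry(0xc09b, 0, 0xfffff) gives 0x00cf9b000000ffff, the familiar flat 4 GB code descriptor.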
x86 Architecture:
Flat Model and Segment Register
• Although segmentation is available in 32-bit mode, the common usage is called the “Flat Model.”
• Uses a single segment from zero to 2^32 - 1
• To be precise, different segments are required for code/data and privileged/user mode.
• Linux uses four segments: KERNEL_CS, KERNEL_DS, USER_CS, USER_DS
• During boot time, BOOT_CS and BOOT_DS are used (as defined in the previous slide)
• Segment Register (Selector)
• If CS is to select BOOT_CS, CS = (index of BOOT_CS) << 3;
• GDT_ENTRY_BOOT_CS = (index of BOOT_CS) = 2, so CS = 16
• The constants are BOOT_CS = 16 and BOOT_DS = 24.
• Note the difference between “ENTRY” and the actual value.
82
[Figure: segment selector layout: Index (bits 15-3), TI (bit 2), RPL (bits 1-0)]
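The relation between a GDT index and the value loaded into a segment register can be written as a one-line helper. This is only a sketch; make_selector is a hypothetical name, and TI = 0 (GDT) with RPL = 0 is assumed for the boot-time selectors.

```c
#include <stdint.h>

/* Build a selector: Index in bits 15-3, TI in bit 2, RPL in bits 1-0. */
static uint16_t make_selector(unsigned index, unsigned ti, unsigned rpl)
{
    return (uint16_t)((index << 3) | ((ti & 1u) << 2) | (rpl & 3u));
}
```

make_selector(2, 0, 0) yields 16 and make_selector(3, 0, 0) yields 24, matching BOOT_CS and BOOT_DS.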
Go To PM: Details (7)
• Call the assembler part (no return)
83
protected_mode_jump(boot_params.hdr.code32_start,
(u32)&boot_params + (ds() << 4));
GLOBAL(protected_mode_jump)
movl %edx, %esi # Pointer to boot_params table
xorl %ebx, %ebx
movw %cs, %bx
shll $4, %ebx
addl %ebx, 2f
jmp 1f # Short jump to serialize on 386/486
1:
movw $__BOOT_DS, %cx
movw $__BOOT_TSS, %di
movl %cr0, %edx
orb $X86_CR0_PE, %dl # Protected mode
movl %edx, %cr0
# Transition to 32-bit mode
.byte 0x66, 0xea # ljmpl opcode
2: .long in_pm32 # offset
.word __BOOT_CS # segment
ENDPROC(protected_mode_jump)
(arch/x86/boot/pmjump.S)
Arguments: the entry point (code32_start) is passed in %ax (%eax), and the boot_params pointer in %dx (%edx).
addl %ebx, 2f is equivalent to *(uint32_t *)2f += cs() << 4; (this yields the physical address of in_pm32)
[Notes] In real mode, physical address = (Segment Register << 4) + Offset
To enter protected mode, set the PE bit in the %cr0 register, then perform a 32-bit far jump (emitted as raw bytes because it is not expressible in real-mode asm).
Go To PM: Detail (8)
84
.code32
.section ".text32","ax"
GLOBAL(in_pm32)
# Set up data segments for flat 32-bit mode
movl %ecx, %ds
movl %ecx, %es
movl %ecx, %fs
movl %ecx, %gs
movl %ecx, %ss
...
addl %ebx, %esp
...
ltr %di
...
xorl %ecx, %ecx
xorl %edx, %edx
xorl %ebx, %ebx
xorl %ebp, %ebp
xorl %edi, %edi
...
lldt %cx
...
jmpl *%eax
ENDPROC(in_pm32)
(arch/x86/boot/pmjump.S)
protected_mode_jump(boot_params.hdr.code32_start,
(u32)&boot_params + (ds() << 4));
* (Omitted) Task register and LDT (Local Descriptor Table)
Extended inline assembler (1)
• GCC’s extended inline assembly language
• Input/output operands can be specified for the assembler
• Assembler template
• The actual assembly language with templates that will be
substituted by the output/input operands
• Output operands
• List of C variables modified by the assembler template
• Input operands
• List of C expressions read by the instructions in the assembler
template.
• Clobber
• List of registers/values to be changed by the assembler template
(other than the output operands)
85
asm [volatile] (assembler template : [ output operands [ : input operands [ : clobber ]]])
static inline void outb(u8 v, u16 port)
{
asm volatile("outb %0,%1" : : "a" (v), "dN" (port));
}
Assembler template Input operands
Extended inline assembler (2)
• Assembler template
• Basically, the same as the standalone assembly language
• %n (n is zero or positive integer) refers to the (n+1)-th
operand in the input and output operands.
• If the character “%” is to be used (to specify a certain
register “%ebx,” for example), “%%” must be used.
• Other than the number, the name can be used to specify
an operand. (%[symbolicname] refers to the operand
with the name [symbolicname])
• To use multiple instructions, use “;” or “\n”
86
static inline void outb(u8 v, u16 port)
{
asm volatile("outb %0,%1" : : "a" (v), "dN" (port));
}
Extended inline assembler (3)
• Input operands
• Comma-separated list of C expressions prefixed with constraints
• A constraint specifies how the expression is passed to the assembler template.
• When multiple constraints are specified, the compiler selects the most efficient one.
87
asm volatile("outb %0,%1" : : "a" (v), "dN" (port));
In "a" (v), "a" is the constraint and (v) is the C expression.
Constraint : Meaning
‘m’ : Memory operand
‘r’ : General register
‘i’ : Immediate integer
‘0’ – ‘9’ : The same place as the operand with that number
Constraint (x86-specific) : Meaning
‘a’ ‘b’ ‘c’ ‘d’ : A, B, C, D register
‘S’ ‘D’ : SI, DI register
‘N’ : Unsigned 8-bit integer (for in/out instructions)
‘A’ : EDX:EAX (32-bit), RDX/RAX (64-bit)
General form: [SymbolicName] “Constraints” (C Expression),…
Extended inline assembler (4)
• In this example,
• The value of v (u8) is stored in %al register
• The value of port (u16) is stored in %dx register or used
as 8-bit immediate.
• This function is declared as “inline,” so if this function is called
with a constant value as port which is less than 256, the “N”
constraint may be used.
• Then, the instruction(s) in the assembler templates are
executed.
• The resulting assembly language will be
88
asm volatile("outb %0,%1" : : "a" (v), "dN" (port));
outb:
movl 8(%esp), %edx
movl 4(%esp), %eax
outb %al,%dx
ret
Extended inline assembler (5)
• Output operands
• Comma-separated list of C variables prefixed with constraints
• Constraints should be prefixed with “=” or “+”
• “+” means that the variable is used as both an input and an output operand.
• The “&” modifier allocates a different register from the input operands (this may be necessary when the template has multiple instructions)
• In the example below, after the instruction(s) in the assembler template are executed, the value of the A register (%al) is stored to the variable v.
89
[SymbolicName] “=Constraints” (C Variable),…
static inline u8 inb(u16 port)
{
u8 v;
asm volatile("inb %1,%0" : "=a" (v) : "dN" (port));
return v;
}
Extended inline assembler (6)
• Clobber
• The list of registers/values modified by the instructions
• The output registers need not be specified here.
• The most common clobber is “memory”
• This means that the memory contents may be changed as side
effects, thus all the variables should be written back to the
memory before the assembler, and should be read again from the
memory after the assembler.
• “cc” : Condition (flags) registers
90
void *memcpy(void *dest, const void *src, size_t n)
{
int d0, d1, d2;
asm volatile(
"rep ; movsl\n\t"
"movl %4,%%ecx\n\t"
"rep ; movsb\n\t"
: "=&c" (d0), "=&D" (d1), "=&S" (d2)
: "0" (n >> 2), "g" (n & 3), "1" (dest), "2" (src)
: "memory");
return dest;
}
(?)
91
void *memcpy(void *dest, const void *src, size_t n)
{
int d0, d1, d2;
asm volatile(
"rep ; movsl\n\t"
"movl %4,%%ecx\n\t"
"rep ; movsb\n\t"
: "=&c" (d0), "=&D" (d1), "=&S" (d2)
: "0" (n >> 2), "g" (n & 3), "1" (dest), "2" (src)
: "memory");
return dest;
}
asm volatile(
"rep ; movsl\n\t"
"movl %2,%%ecx\n\t"
"rep ; movsb\n\t"
: "+&D" (dest)
: "c" (n >> 2), "g" (n & 3), "S" (src)
: "memory");
Extended inline assembler (7)
• Examples (appeared in the previous slides)
• Example 1
• Stores %esp – STACK_SIZE to stack_end
• P in “%P1” is a modifier (but it could not be found in the documentation)
• With “%P1”: leal -512(%esp),%eax
• With “%1”: leal $-512(%esp),%eax (not valid, because of the “$” prefix)
• With “%c1” (“constant expression with no punctuation”): leal -512(%esp),%eax
92
asm("leal %P1(%%esp),%0"
: "=r" (stack_end) : "i" (-STACK_SIZE));
Extended inline assembler (8)
• Example 2
• Far-calls the address (the value of
boot_params.hdr.realmode_swtch)
• The registers eax, ebx, ecx, and edx will be changed in
this call.
• Example 3
93
asm volatile("lcallw *%0"
: : "m" (boot_params.hdr.realmode_swtch)
: "eax", "ebx", "ecx", "edx");
static const struct gdt_ptr null_idt = {0, 0};
asm volatile("lidtl %0" : : "m" (null_idt));
setup_idt:
lidtl null_idt.1378
ret
Extended inline assembler (9)
94
#define switch_to(prev, next, last) \
do { \
unsigned long ebx, ecx, edx, esi, edi; \
asm volatile("pushfl\n\t" /* save flags */ \
"pushl %%ebp\n\t" /* save EBP */ \
"movl %%esp,%[prev_sp]\n\t" /* save ESP */ \
"movl %[next_sp],%%esp\n\t" /* restore ESP */ \
"movl $1f,%[prev_ip]\n\t" /* save EIP */ \
"pushl %[next_ip]\n\t" /* restore EIP */ \
__switch_canary \
"jmp __switch_to\n" /* regparm call */ \
"1:\t" \
"popl %%ebp\n\t" /* restore EBP */ \
"popfl\n" /* restore flags */ \
/* output parameters */ \
: [prev_sp] "=m" (prev->thread.sp), \
[prev_ip] "=m" (prev->thread.ip), \
"=a" (last), \
/* clobbered output registers: */ \
"=b" (ebx), "=c" (ecx), "=d" (edx), \
"=S" (esi), "=D" (edi) \
__switch_canary_oparam \
/* input parameters: */ \
: [next_sp] "m" (next->thread.sp), \
[next_ip] "m" (next->thread.ip), \
/* regparm parameters for __switch_to(): */ \
[prev] "a" (prev), \
[next] "d" (next) \
__switch_canary_iparam \
: /* reloaded segment registers */ \
"memory"); \
} while (0)
arch/x86/include/asm/switch_to.h
Extended inline assembler (10)
• The key point
• The context is the stack
• The switched task resumes at “1:”. (just after “jmp
__switch_to”)
• The “__switch_to” function is called with a “jmp”
instruction, not a “call” instruction.
• Anyway
• The template does not use %n (number), but %[name]
style. (too many parameters)
95
asm volatile(...
"movl %%esp,%[prev_sp]\n\t" /* save ESP */ \
...
/* output parameters */ \
: [prev_sp] "=m" (prev->thread.sp),
Exercise: RDTSC
• RDTSC instruction
• Input : None
• Output : EDX (Higher 32-bit), EAX (Lower 32-bit)
96
unsigned long long rdtsc(void)
{
unsigned int high, low; /* EDX and EAX are 32-bit registers */
asm volatile("rdtsc" : "=d" (high), "=a" (low));
return ((unsigned long long)high << 32) | low;
}
Answer: rdtscll
97
#define rdtscll(val) \
((val) = __native_read_tsc())
static __always_inline unsigned long long __native_read_tsc(void)
{
DECLARE_ARGS(val, low, high);
asm volatile("rdtsc" : EAX_EDX_RET(val, low, high));
return EAX_EDX_VAL(val, low, high);
}
#ifdef CONFIG_X86_64
#define DECLARE_ARGS(val, low, high) unsigned low, high
#define EAX_EDX_VAL(val, low, high) ((low) | ((u64)(high) << 32))
#define EAX_EDX_ARGS(val, low, high) "a" (low), "d" (high)
#define EAX_EDX_RET(val, low, high) "=a" (low), "=d" (high)
#else
#define DECLARE_ARGS(val, low, high) unsigned long long val
#define EAX_EDX_VAL(val, low, high) (val)
#define EAX_EDX_ARGS(val, low, high) "A" (val)
#define EAX_EDX_RET(val, low, high) "=A" (val)
#endif
A-2. Protected Mode
Again, full of the assembly code!
98
Protected-Mode Kernel (p.54)
• arch/x86/boot/compressed/head_{32,64}.S
• Goal: Decompress the kernel (vmlinux.gz/.bz2/.xz…) and start it
• Relocates the decompressing code (if relocatable and loaded at a different address)
• Enables paging and enters long mode (in head_64.S)
• Clears the BSS, and prepares the heap and stack
• Decompresses the kernel
• Relocates if required
• RANDOMIZE_BASE or RELOCATABLE (in 32-bit)
99
LD script?
100
Linker scripts: arch/x86/boot/setup.ld, arch/x86/boot/compressed/vmlinux.lds.S, arch/x86/kernel/vmlinux.lds.S
(quoted below: arch/x86/boot/compressed/vmlinux.lds.S)
...
#ifdef CONFIG_X86_64
OUTPUT_ARCH(i386:x86-64)
ENTRY(startup_64)
#else
OUTPUT_ARCH(i386)
ENTRY(startup_32)
#endif
SECTIONS
{
/* Be careful parts of head_64.S
* assume startup_32 is at address 0.
*/
. = 0;
.head.text : {
_head = . ;
HEAD_TEXT
_ehead = . ;
}
.rodata..compressed : {
*(.rodata..compressed)
}
...
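[Figure: section layout of the compressed image, from address 0 upward, with the boundary symbols]
0 (_head): .head.text
(_ehead): .rodata..compressed
(_text): .text
(_etext, _rodata): .rodata
(_erodata, _got): .got
(_egot, _data): .data
(_edata, _bss): .bss
(_ebss, _pgtable): .pgtable (64-bit only)
(_epgtable, _end)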
mkpiggy
• Section “.rodata..compressed” consists of the
compressed kernel (vmlinux)
101
printf(".section \".rodata..compressed\",\"a\",@progbits\n");
printf(".globl z_input_len\n");
printf("z_input_len = %lu\n", ilen);
printf(".globl z_output_len\n");
printf("z_output_len = %lu\n", (unsigned long)olen);
printf(".globl z_extract_offset\n");
printf("z_extract_offset = 0x%lx\n", offs);
/* z_extract_offset_negative allows simplification of head_32.S */
printf(".globl z_extract_offset_negative\n");
printf("z_extract_offset_negative = -0x%lx\n", offs);
printf(".globl input_data, input_data_end\n");
printf("input_data:\n");
printf(".incbin \"%s\"\n", argv[1]);
printf("input_data_end:\n");
(arch/x86/boot/compressed/mkpiggy.c)
Entry point (32-bit)
102
.text
__HEAD
ENTRY(startup_32)
#ifdef CONFIG_EFI_STUB
jmp preferred_addr
...
preferred_addr:
#endif
cld
testb $(1<<6), BP_loadflags(%esi)
jnz 1f
cli
movl $__BOOT_DS, %eax
movl %eax, %ds
movl %eax, %es
movl %eax, %fs
movl %eax, %gs
movl %eax, %ss
1:
(__HEAD expands to: .section ".head.text","ax")
(arch/x86/boot/compressed/head_32.S)
If KEEP_SEGMENTS (bit 6) is set in loadflags in boot_params, do not reload the segments.
Protected-Mode Protocol (p.53)
• Starts at the top of the protected mode kernel
• Usually loaded at 0x100000 (1MB)
• Can be at any position if compiled as relocatable
• Should be at the same position as specified in the compile
time if compiled as not relocatable
• Used in “linux” module in GRUB2
• [Protocol] At the entry point,
• The loaded GDT must have __BOOT_CS (0x10 / execute and
read) and __BOOT_DS(0x18 / read and write)
• %cs must be __BOOT_CS
• %ds, %es, and %ss must be __BOOT_DS
• Interrupts must be disabled
• %esi must be the address for struct boot_params
• %ebp, %edi, and %ebx must be zero.
103
Where are we?
105
leal (BP_scratch+4)(%esi), %esp
call 1f
1: popl %ebp
subl $1b, %ebp
(arch/x86/boot/compressed/head_32.S)
• The call instruction pushes the return
address onto the stack
• The return address should be the next instruction after
the call instruction, i.e. 1f
• The immediate “pop” pops the return address from the
stack, i.e. the absolute physical address for 1f
• Subtracting 1f (in this case, 1b) from the address (%ebp)
calculates the offset between the actual address and the
compile-time address (0-based, as seen in lds).
Memory View
106
[Figure: memory view, lower to higher addresses. As loaded: the RM kernel with boot_params (BP, pointed to by %esi), then the PM kernel (head + compressed image, starting at %ebp). Goal: vmlinux (decompressed) at LOAD_PHYSICAL_ADDR, with the relocated kernel (including the compressed image) placed z_extract_offset above its start.]
(mkpiggy.c)
offs = (olen > ilen) ? olen - ilen : 0;
offs += olen >> 12; /* Add 8 bytes for each 32K block */
offs += 64*1024 + 128; /* Add 64K + 128 bytes slack */
offs = (offs+4095) & ~4095; /* Round to a 4K boundary */
...
printf("z_extract_offset = 0x%lx\n", offs);
(arch/x86/include/asm/boot.h)
#define LOAD_PHYSICAL_ADDR ((CONFIG_PHYSICAL_START \
+ (CONFIG_PHYSICAL_ALIGN - 1)) \
& ~(CONFIG_PHYSICAL_ALIGN - 1))
*Default: 0x1000000
Determine where to decompress
• If CONFIG_RELOCATABLE
• The current position (aligned up to BP_kernel_alignment)
• Default: 2MB-align
• If it is less than LOAD_PHYSICAL_ADDR, LOAD_PHYSICAL_ADDR is used
• If not CONFIG_RELOCATABLE
• LOAD_PHYSICAL_ADDR is used
• Now %ebx is the target address
107
#ifdef CONFIG_RELOCATABLE
movl %ebp, %ebx
movl BP_kernel_alignment(%esi), %eax
decl %eax
addl %eax, %ebx
notl %eax
andl %eax, %ebx
cmpl $LOAD_PHYSICAL_ADDR, %ebx
jge 1f
#endif
movl $LOAD_PHYSICAL_ADDR, %ebx
1:
(arch/x86/boot/compressed/head_32.S)
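The address selection on this slide can be restated in C to make the rounding explicit. This is a sketch, not a translation used anywhere in the kernel: choose_target and its parameter names are hypothetical, the alignment is assumed to be a power of two, and the comparison is unsigned here whereas the quoted assembly uses jge.

```c
#include <stdint.h>

/* Pick the decompression target: align the load address up, but never
 * go below load_physical_addr; non-relocatable kernels always use it. */
static uint32_t choose_target(uint32_t loaded_at, uint32_t align,
                              uint32_t load_physical_addr, int relocatable)
{
    if (relocatable) {
        /* decl/addl/notl/andl in the assembly: round up to a multiple of align */
        uint32_t target = (loaded_at + (align - 1)) & ~(align - 1);

        if (target >= load_physical_addr)
            return target;
    }
    return load_physical_addr;
}
```

For instance, a relocatable kernel loaded at 0x1234567 with 2MB alignment ends up with a target of 0x1400000, while one loaded at 0x100000 falls back to the 0x1000000 default.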
Copy the decompression code
• Copy the area from the
head of PM kernel
(startup_32) to just before
the head of bss.
• The code copies the kernel
backwards in case of
overlapping
108
addl $z_extract_offset, %ebx
leal boot_stack_end(%ebx), %esp
pushl $0
popfl
pushl %esi
leal (_bss-4)(%ebp), %esi
leal (_bss-4)(%ebx), %edi
movl $(_bss - startup_32), %ecx
shrl $2, %ecx
std
rep movsl
cld
popl %esi
[Figure: the PM kernel at %ebp is copied to the relocated position at %ebx, which lies z_extract_offset above the start of vmlinux (decompressed)]
Jump to the relocated address
• Jump to the copied decompression code
• The decompression code is at the end of the PM kernel
• Just after the compressed kernel image
• Clears the BSS
109
leal relocated(%ebx), %eax
jmp *%eax
ENDPROC(startup_32)
.text
relocated:
xorl %eax, %eax
leal _bss(%ebx), %edi
leal _ebss(%ebx), %ecx
subl %edi, %ecx
shrl $2, %ecx
rep stosl
[Figure: %ebx points to the relocated kernel, placed above the start of vmlinux (decompressed); execution continues at the label “relocated” inside it]
Why z_extract_offset?
• The PM kernel contains the compressed kernel
image
• The relocating (copying) code is located at the head in
PM kernel
• The decompression code is located at the tail in the PM
kernel
• The decompression code after relocation is safe because
z_extract_offset + the compressed image size is larger
than the decompressed image size
110
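[Figure: before and after relocation. The PM kernel (head | compressed | decomp) is moved so that it starts z_extract_offset above the start of the decompressed image; labels: decompressed, z_extract_offset, work area]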
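The safety argument can be checked numerically by re-running the mkpiggy.c arithmetic quoted earlier. The sketch below wraps those four lines in a function; extract_offset is a hypothetical name, and the sample sizes (4 MiB compressed, 16 MiB decompressed) are made up for illustration.

```c
/* Same arithmetic as the mkpiggy.c excerpt: ilen = compressed size,
 * olen = decompressed size, result = z_extract_offset. */
static unsigned long extract_offset(unsigned long ilen, unsigned long olen)
{
    unsigned long offs = (olen > ilen) ? olen - ilen : 0;

    offs += olen >> 12;            /* Add 8 bytes for each 32K block */
    offs += 64 * 1024 + 128;       /* Add 64K + 128 bytes slack */
    offs = (offs + 4095) & ~4095UL; /* Round to a 4K boundary */
    return offs;
}
```

With ilen = 0x400000 and olen = 0x1000000 this gives 0xC12000, so the compressed image ends at 0x1012000 relative to the output buffer, beyond the end of the decompressed image at 0x1000000.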
Fix up the absolute addresses
• The decompression code is built with -fPIC
(position independent code), and so fixing up the
absolute addresses is achieved by modifying the
addresses in GOT (Global Offset Table).
111
/*
* Adjust our own GOT
*/
leal _got(%ebx), %edx
leal _egot(%ebx), %ecx
1:
cmpl %ecx, %edx
jae 2f
addl %ebx, (%edx)
addl $4, %edx
jmp 1b
2:
[Figure: %ebx points to the start of the relocated kernel]
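In C terms, the GOT fix-up loop above adds the load address to every 4-byte GOT slot. The following is a sketch only; adjust_got and its parameters are hypothetical names, with got and egot standing for the _got and _egot boundaries after relocation.

```c
#include <stdint.h>

/* Add the actual load address to each entry in [got, egot). */
static void adjust_got(uint32_t *got, uint32_t *egot, uint32_t load_base)
{
    for (uint32_t *p = got; p < egot; p++)
        *p += load_base; /* addl %ebx, (%edx) */
}
```

Because the image is linked at address 0, each GOT entry initially holds an offset, and adding the base turns it into the real address.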
Call the decompression routine
• Call the decompress_kernel function in C
• asmlinkage __visible void *decompress_kernel(void
*rmode, memptr heap, unsigned char *input_data,
unsigned long input_len, unsigned char *output,
unsigned long output_len)
113
pushl $z_output_len /* decompressed length */
leal z_extract_offset_negative(%ebx), %ebp
pushl %ebp /* output address */
pushl $z_input_len /* input_len */
leal input_data(%ebx), %eax
pushl %eax /* input_data */
leal boot_heap(%ebx), %eax
pushl %eax /* heap area */
pushl %esi /* real mode pointer */
call decompress_kernel /* returns kernel location in %eax */
[Figure: %esi points to boot_params (BP); %ebx points to the relocated kernel, z_extract_offset above the output address of vmlinux (decompressed)]
Decompressing
114
asmlinkage __visible void *decompress_kernel(...)
{
...
output = choose_kernel_location(input_data, input_len,
output, output_len);
...
#ifndef CONFIG_RELOCATABLE
if ((unsigned long)output != LOAD_PHYSICAL_ADDR)
error("Wrong destination address");
#endif
debug_putstr("\nDecompressing Linux... ");
decompress(input_data, input_len, NULL, NULL, output, NULL, error);
parse_elf(output);
handle_relocations(output, output_len);
debug_putstr("done.\nBooting the kernel.\n");
return output;
}
(arch/x86/boot/compressed/misc.c)
Choosing the destination
• The choose_kernel_location function
• If KASLR is enabled, it computes some random output
address (aslr.c)
• Otherwise, it just returns the output parameter
115
Decompressing the kernel
• The decompress function does
everything
• The implementation is located at
lib/decompress_*.c
116
#ifdef CONFIG_KERNEL_GZIP
#include "../../../../lib/decompress_inflate.c"
#endif
#ifdef CONFIG_KERNEL_BZIP2
#include "../../../../lib/decompress_bunzip2.c"
#endif
#ifdef CONFIG_KERNEL_XZ
#include "../../../../lib/decompress_unxz.c"
#endif
...
(arch/x86/boot/compressed/misc.c)
Load the ELF
• parse_elf
• Parse the ELF header and locate the contents according
to the program header (p_paddr)
• If relocatable, p_paddr is offset by the actual load address.
117
for (i = 0; i < ehdr.e_phnum; i++) {
...
switch (phdr->p_type) {
case PT_LOAD:
#ifdef CONFIG_RELOCATABLE
dest = output;
dest += (phdr->p_paddr - LOAD_PHYSICAL_ADDR);
#else
dest = (void *)(phdr->p_paddr);
#endif
memcpy(dest,
output + phdr->p_offset,
phdr->p_filesz);
break;
...
}
}
typedef struct elf32_phdr{
Elf32_Word p_type;
Elf32_Off p_offset;
Elf32_Addr p_vaddr;
Elf32_Addr p_paddr;
Elf32_Word p_filesz;
Elf32_Word p_memsz;
Elf32_Word p_flags;
Elf32_Word p_align;
} Elf32_Phdr;
Relocate the kernel image
• Relocation information (generated by the “relocs”
tool) is appended just after the ELF image
• The relocation information is a list of the locations in the kernel code that hold absolute addresses
• These locations are all expressed as kernel virtual addresses
[Figure: vmlinux (ELF) is followed by a 0 terminator, the 64-bit relocation addresses, another 0 terminator, and the 32-bit relocation addresses at the very end]
$ objdump -adr vmlinux
…
c1086910 <vfs_llseek>:
c1086910: 55 push %ebp
...
c1086919: bb 60 63 08 c1 mov $0xc1086360,%ebx
c108691a: R_386_32 no_llseek
Calculate deltas
• __START_KERNEL_map
• In 32-bit, PAGE_OFFSET (default: 0xC0000000)
• In 64-bit, 0xffffffff80000000
120
static void handle_relocations(void *output, unsigned long
output_len)
{
...
unsigned long min_addr = (unsigned long)output;
...
delta = min_addr - LOAD_PHYSICAL_ADDR;
...
map = delta - __START_KERNEL_map;
...
delta: the difference between the compile-time physical address and the actual physical address
map: the offset from a kernel virtual address to the (actual) physical address
Apply the relocation
121
for (reloc = output + output_len - sizeof(*reloc); *reloc; reloc--) {
int extended = *reloc;
extended += map;
ptr = (unsigned long)extended;
if (ptr < min_addr || ptr > max_addr)
error("32-bit relocation outside of kernel!\n");
*(uint32_t *)ptr += delta;
}
#ifdef CONFIG_X86_64
for (reloc--; *reloc; reloc--) {
long extended = *reloc;
extended += map;
ptr = (unsigned long)extended;
if (ptr < min_addr || ptr > max_addr)
error("64-bit relocation outside of kernel!\n");
*(uint64_t *)ptr += delta;
}
#endif
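The backward walk over the 32-bit table can be tried in user space with a simulated image. In this sketch the byte array `image` stands in for the memory range starting at `min_addr`, and the table is a separate array whose first element is the zero terminator. `apply_relocs32` is a hypothetical helper that follows the structure of the quoted loop; it is not the kernel function, and it reports an out-of-range entry with -1 instead of calling error().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Walk a zero-terminated relocation table backwards, starting at `last`.
 * Each entry is a kernel virtual address; adding `map` gives the current
 * location of the site, and the 32-bit value stored there grows by `delta`.
 * Returns the number of patched sites, or -1 if an entry falls outside
 * the simulated image [min_addr, min_addr + image_len).
 */
static int apply_relocs32(unsigned char *image, size_t image_len,
                          uint32_t min_addr, uint32_t delta, uint32_t map,
                          const uint32_t *last)
{
    int count = 0;

    assert(image != NULL && last != NULL);
    assert(image_len >= sizeof(uint32_t));

    for (const uint32_t *reloc = last; *reloc; reloc--) {
        uint32_t ptr = *reloc + map;
        uint32_t value;

        if (ptr < min_addr || ptr - min_addr > image_len - sizeof(value))
            return -1;

        /* memcpy avoids unaligned access through a cast pointer. */
        memcpy(&value, image + (ptr - min_addr), sizeof(value));
        value += delta;
        memcpy(image + (ptr - min_addr), &value, sizeof(value));
        count++;
    }
    return count;
}
```

For an image loaded at 0x01200000 with delta 0x00200000 and map 0x40200000, a table of `{0, 0xc1000004, 0xc100000c}` walked from its last element patches the words at offsets 4 and 12 of the image and returns 2.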
OK, go to the entry point
• The entry point is always at the head of the kernel
• decompress_kernel returns the “output”
• The assembly code jumps into the entry point
122
asmlinkage __visible void *decompress_kernel(...)
{
...
output = choose_kernel_location(input_data, input_len,
output, output_len);
...
return output;
}
/*
* Jump to the decompressed kernel.
*/
xorl %ebx, %ebx
jmp *%eax
Next
• Go on to startup_32/startup_64
123

Linux Kernel Booting Process (2) - For NLKB

  • 1. Booting Process (2) Taku Shimosawa Pour le livre nouveau du Linux noyau 1
  • 3. Agenda • Virtual Memory • From architectural view • Unfortunately, this presentation again does not enter the main part of the kernel! • Appendices • Source code-level overview of the bootstrapping process • Linker Scripts • Inline Assemblers • There are (implicitly) omitted spaces, tabs, white lines, comments in the quoted source code. • The omitted effective lines are denoted by … or […] 3
  • 4. Scope of the last presentation : x86 • Real Mode (16-bit) • Boot sector, setup_header, and 16-bit entry point • C-Language main function • Retrieving memory information • Transition to the protected mode • Protected Mode (32-bit) • 32-bit(/64-bit) entry point, preparing for decompression, calling decompression code • (EFI-Stub) efi_main (entry point from UEFI) • EFI call functions • Protected Mode/Long Mode • The beginning of the main kernel 4  arch  x86  boot  header.S  main.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S * The …_32.S files are used in the 32-bit kernel, and …_64.S files are not. Vice versa.
  • 5. Scope of the last presentation : ARM • Compressed • Entry point • Decompressing function • Actual decompressing algorithm is in lib/decompress_*.c • Building a FDT from ATAGS for compatibility (CONFIG_ARM_ATAG_DTB_C OMPAT) • Decompressed • The beginning of the main kernel 5  arch  arm  boot  compressed  head.S  decompress.c  atags_to_fdt.c  kernel  head.S
  • 6. Follow-ups for the last presentation • x86 assembly language • What if instuructions with ≥3 operands? • (e.g.) imul • Multiply EBX by 19(0x13) and substitute the result to EAX • Therefore, 6 AT&T Intel Operand Order Source, Destination Destination, Source AT&T Intel Example imul $0x13, %ebx, %eax IMUL EAX, EBX, 13h AT&T Intel Operand Order Source, Destination [[Op4,] Op3,] Op2, Op1 Destination, Source Op1, Op2 [, Op3, [Op4]]
  • 7. Follow-ups for the last presentation • Multiple Relocations? • The conclusion is “at most once” (in x86 arch) • ELF relocation may follow the decompression, so the kernel may be relocated twice in this sense. • See the relocation part in this presentation. 7
  • 8. x86 Architecture : Segmentation • 6 Segment Registers (16-bit registers) • Code Segment Register: CS • Data Segment Register: DS, ES, FS, GS • Stack Segment Register: SS • Real mode : 20-bit address space • Linear address = Physical address • The size of each segment is 64K (16-bit) • The segment register denotes the higher 16-bit offset in 20-bit address space for the segment • Protected mode : 32-bit/36-bit physical address space • Virtual –(Paging)-> Linear –(Segmentation)-> Physical • The offset and limit are stored in the descriptor table • The segment registers points to the entry in the table • Long mode : 48-bit physical address space • For CS, DS, ES, and SS, the offset is always 0, the limit is ignored. • For FS and GS, the offset can be set by the descriptor or through MSR (for > 32-bit addresses) 8 Logical – (Segmentation) -> Linear –(Paging)-> Physical Errata
  • 9. So what? (p.32) 9 vmlinux boot/compressed/vmlinux.bin (1a) Strip symbols vmlinux.bin.xz (2a) Concatenate and compress (gzip, bzip2, lzma, lzo, lz4) piggy.o (3) mkpiggy (piggy-back) Make an object that contains the compressed image piggy.o*.o boot/compressed/vmlinux (4) Link with the other objects in boot/compressed (Decompressing codes) (5) Transform it into a simple binary boot/vmlinux.bin boot/vmlinux.binboot/setup.bin (6) Concatenate with real-mode setup code, headers, and CRC32 CRC boot/bzImage (1b) Make relocation information (2b) Append the original size info (except for gzip) vmlinux.bin.xz vmlinux.relocs Size Errata?
  • 11. Virtual Memory • The address visible to a task is “virtualized,” i.e. translated by hardware to a certain physical address when it is actually accessed. • The hardware mechanism to translate the address is called MMU (memory management unit). • Aim / Benefit • Using larger memory area than the machine actually is equipped with. • Memory swapping, sparse memory areas • Isolating tasks’ memory area so that the different applications cannot touch (read or write) the each other’s memory • Not only between user tasks but between the kernel and tasks • Abstracting the memory resources • Providing contiguous memory area even if there is no physically contiguous memory area available. • User programs can run with certain addresses regardless of the physical addresses where they are actually running. 11
  • 12. Two ways to virtual memory • Paging • Dividing the memory area into chunks (“pages”) with a certain small size, and defining a map from each chunk to its physical location • A different task may have a different map of the memory • Several overhead (both in speed and memory) to translate and hold the map • Segmentation • The address is considered to be an offset inside a certain segment of memory • Less overhead (just adding an offset), but impossible to achieve swapping 12
  • 13. Illustrated 13 1 Segment 1 2 3 5 4 3 1 2 4 VA PA 1 4 2 1 3 3 5 2 2 ~ 4 Seg Star t End 1 2 4 1 Virtual Memory Physical Memory Page Table Segment Desc. Paging Segmentation
  • 14. Architecture and VM Capability • x86 • Capable of paging • 16-bit and 32-bit has segmentation feature • 64-bit mode has a very limited segmentation feature • Because almost no one is using the segmentation feature effectively! • (See “flat model” described in a later slide) • ARM • Some CPU series has MMU, and is capable of paging • “A” series • Some CPU series only has MPU (memory protection unit) • “R” series • No MMU • “M” series (MPU is optional) 14
  • 15. Focusing on paging… • How it works? 15 Memory instruction with a virtual address CPU (MMU) looks for for the virtual address in TLB (Translation Lookaside Buffer) Does it exist? Use the physical address in the TLB entry TLB Miss! Call the handler, and ask it to fill in a TLB entry corresponding to the virtual address Traverse the page table to find the physical address for the virtual address Present? Use the physical address (May) remember it in TLB Page fault! Call the handler. Kernel’s Role Software TLBHardware TLB Yes No Yes No
  • 16. How far should hardware do? • TLB (Translation Lookaside Buffer) • Cache of “virtual-to-physical” mappings. • Limited number of entries. • Hardware-controlled TLB • When TLB misses occur, the CPU traverses page tables • The format for the page table is defined by the architecture. • x86 and ARM • Software-controlled TLB • When TLB misses occur, the software (typically, the OS kernel) traverses page tables, and tell the result (translated physical address) by filling in some entry in TLB. • Any type of page tables may be used (hash-based PT, for example) • But Linux uses almost the same format for this type of architecture • PowerPC 16
  • 17. Multilevel Page Table (tree-like) • Typical structure of page table • The first-level page table consists of entries that point to another level page table. The index is some of the most significant bits of the virtual memory. • Of course, the next page table’s address is physical. • The entries in the leaf page table denotes the physical addresses. 17 Next level page table Third level page table Phys address Phys address … First-level page table Second-level page table Third-level page table
  • 18. x86-64 example 18 Resolving 0x00000004200310a5 = 00000000 00000000 00000000 00000100 00100000 00000011 00010000 10100101 (2) PML4 Table 0 511 Page Directory Pointer Table 0 511 16 0 256 Page Directory Table 511 0x1234567000 0 49 Page Table 511 0x12345670a5 CR3 64 bits
  • 19. x86-64 • Currently, only 48-bit in a linear address is effective. • 64-bit address is sign-extension of the 48-bit address. • Supports up to 52 bits for physical addresses • %cr3 register : the physical address for the current PML4 table • mov ~~, %cr3 switches the page table (flushing TLB) • Four level • One entry in PML4 table corresponds to 512 GB of virtual memory, an entry in PDP table to 1 GB, and so on. • Each entry is 8 byte • Each table has 512 entries • Thus, each table is 4 KB = 1 page. 19
  • 20. Large Table • One page occupies one entry in TLB • If one process uses 1 GB of memory, it uses 256K pages. • i.e. If TLB does not have 256K entries (and usually it doesn’t), TLB misses are inevitable • x86_64 supports three types of page size • 4 KB (normal) • 2 MB • 1 GB (!) • The disadvantage is that larger page requires contiguous physical memory of the same size as the page size. 20 An entry in higher-level page table directly contains a physical address.
  • 21. x86-64 example (2MB page) 21 Resolving 0x00000004200310a5 = 00000000 00000000 00000000 00000100 00100000 00000011 00010000 10100101 PML4 Table 0 511 Page Directory Pointer Table 0 511 16 0x1234400000 0 256 Page Directory Table 511 0x12344310a5 CR3 64 bits
  • 22. Linux kernel usage • Large Page • The kernel mapping • The kernel creates straight-mapping of physical memory in the kernel virtual address area • This area is created in booting, and never changes after that • 1GB, 2MB pages are used • Hugetlbfs • Explicit use from user applications • Transparent Huge Pages • Implicit (transparent) use of large pages for user applications 22
  • 23. ARM • ARM • Two memory architecture • VMSA (Virtual Memory System Architecture) : MMU • PMSA (Protected Memory System Architecture) : MPU • VMSA • Two page table formats • Short descriptor table • Up to two-level lookup • 32-bit PA (*By supersection, 40-bit can be output) • Long descriptor table • Up to three-level lookup • 40-bit PA • Fixed size of page tables 23
  • 24. Names in Linux • Linux uses several arch-independent type names for page table entries • pgd_t, pud_t, pmd_t, pte_t • Each type is one for an entry in a table of the corresponding level 24 Architecture (& Config) Lv pgd_t pud_t pmd_t pte_t x86_64 4 PML4E PDPTE PDE PTE i386 (PAE) 3 PDPTE - PDE PTE i386 2 PDE - - PTE ARM (LPAE) 3 1st-lv. Desc. - 2nd-lv. Desc. 3rd-lv. Desc. ARM 2 1st-lv. Desc. - - 2nd-lv. Desc. ARM64 (64KB page) 2 1st-lv. Desc. - - 2nd-lv. Desc. ARM64 3 1st-lv. Desc. - 2nd-lv. Desc. 3rd-lv. Desc. (*)AArch64 supports four-level page tables, thus 48-bit VA.
  • 25. Notes • PAE (i386) • Physical Address Extension • For those who want to enjoy >4GB of memory in 32-bit mode. • Virtual address remains 32-bit, but can map to any physical address (< 64-bit) • The size of each entry is extended to 64-bit • CONFIG_X86_PAE • LPAE • Logical Physical Address Extension • Almost the same as PAE in i386 • “The current implementation limits the output address range to 40 bits” • Each entry is extended to 64-bit (long-descriptor translation table format) • CONFIG_ARM_LPAE 25
  • 26. ARM example (Short-descriptor) 26 Resolving 0x200310a5 = 00100000 00000011 00010000 10100101 (2) 1st Level Table 0 4095 0x12345000 0 255 49 0x123450a5 TTBR0 2nd Level Table 32 bits 512
  • 27. Quick Chart 27 1st Level 2nd Level 3rd Level 4th Level Intel 64-bit [47:39] [38:30] [29:21] [20:12] 4 KB (64 bit x 512) 512 GB/Entry 1 GB / Entry 2 MB / Entry 4 KB / Entry PAE [31:30] [29:21] [20:12] 256 B (64 bit x 4) 4 KB (64 bit x 512) 1 GB / Entry 2 MB / Entry 4 KB / Entry 32-bit [31:22] [21:12] 4 KB (32 bit x 1024) 4 MB / Entry 4 KB / Entry ARM LPAE [31:30] [29:21] [20:12] 256 B (64 bit x 4) 4 KB (64 bit x 512) 1 GB / Entry 2 MB / Entry 4 KB / Entry 32-bit [31:20] [19:12] 16 KB (32 bit x 4096) 1 KB (32 bit x 256) 1 MB / Entry 4 KB / Entry ARM 64 4KB granule [38:30] [29:21] [20:12] 4 KB (64 bit x 512) 1 GB / Entry 2 MB / Entry 4 KB / Entry VA Range used as index Table size (entry size x n) Size represented by each entry
  • 28. Page size supported (by HW) • x86_64 • 1 GB, 2 MB, 4 KB • i386 (PAE) • 2 MB, 4 KB • i386 • 4 MB, 4 KB • ARM • 16 MB(*), 1 MB, 64 KB, 4 KB • ARM (LPAE) • 1 GB, 2 MB, 4 KB • ARM64 • 1 GB, 2 MB, 4 KB (for 4KB translation granule) • 32 MB, 16 KB (for 16KB translation granule) • 512 MB, 64 KB (for 64KB translation granule) 28 (*) Depends on implementation
  • 29. Page Attributes • Pages can have attributes • Used for memory protection • Used for demand paging • Used for COW (copy-on-write) • Attributes • Read / Write • User / Privileged • But where? • In the page table entry corresponding to a page • However, a page table entry is basically a physical pointer, i.e. a 32-bit entry is occupied by 32-bit physical pointer… 29
  • 30. Page Attributes • The lower bits in page table entries • The start address of a page/page table is aligned! • The lower bits are always zero. 30 Ignored Physical Address [31:12] 3252 XD 63 Physical Address [51:32] G Igno red PAT D PCD PWT US RW PA 31 9 0 Physical Address [31:12] C B 1 XN APTEX AP2 S nG x86_64 ARM (short descriptor)
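Because the low 12 bits of a 4 KB-aligned address are always zero, an entry can carry the address and the attributes in the same word. The sketch below illustrates this for an x86_64 entry mapping a 4 KB page; the macro and function names are hypothetical (the kernel uses its own `_PAGE_*` definitions), and only a few of the bits from the figure are shown.

```c
#include <stdint.h>
#include <stdbool.h>

/* A few attribute bits of an x86_64 page table entry. */
#define PTE_P   (1ULL << 0)    /* Present */
#define PTE_RW  (1ULL << 1)    /* Read/Write */
#define PTE_US  (1ULL << 2)    /* User/Supervisor */
#define PTE_XD  (1ULL << 63)   /* Execute-Disable */

/* Physical address bits [51:12]; the rest is attributes or ignored. */
#define PTE_ADDR_MASK 0x000ffffffffff000ULL

/* Pack a page-aligned physical address and attribute bits into one entry. */
static inline uint64_t pte_make(uint64_t phys, uint64_t flags)
{
    return (phys & PTE_ADDR_MASK) | flags;
}

/* Recover the physical address by masking the attribute bits away. */
static inline uint64_t pte_phys(uint64_t pte)
{
    return pte & PTE_ADDR_MASK;
}

/* A page is writable only if it is present and RW is set. */
static inline bool pte_writable(uint64_t pte)
{
    return (pte & (PTE_P | PTE_RW)) == (PTE_P | PTE_RW);
}
```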
  • 31. Page Attributes Comparison 31 x86_64 ARM (short) Enabled? Present (P) Desc type (Bits 1 & 0) RO or RW? Read/Write (RW) AP [2:1] or AP [2:0] Privileged only or any? User/Supervisor (US) Write-through? PWT TEX[2:0], B, C Cachable? PCD Accessed? Accessed (A) AP[0] (*configurable) Dirty? Dirty (D) N/A Memory Type PAT TEX[0], B, C (*configurable) Global Global (G) Not Global (nG) Executable? Execute-Disable (XD) Execute-Never (XN) Sharable? (PAT) Sharable (S)
  • 32. PowerPC Example [PowerPC 440] • TLB is filled by software • Search (tlbsx instruction), R/W (tlbre, tlbwe instructions) 32 32220 Effective Page Number [0:21] TS V SIZE TPAR TID 40 Real Page Number [0:21] 0 PA R1 ERPN PA R2 0 Reserved U3-U0 W I M G E X W R X W R U S • Attributes • V : Valid • SIZE : Page Size (4^n KB, where n in {0,1,2,3,4,5,7,9,10}) • U : User-defined storage attribute • W: Write-through • I: Caching Inhibited • M: Memory coherency required • G: Guarded • E: Endian • UX, UW, UR: User executable, writable, readable • SX, SW, SR: Supervisor executable, writable, readable • TPAR, PAR1, PAR2: Parity
  • 33. Before the kernel starts… • x86 (32-bit) • Paging is disabled • kernel/head_32.S creates a page table and turns on paging • x86 (64-bit) • compressed/head_64.S creates an identity-mapped (virtual = physical) page table for the first 4 GB • Long mode requires paging to be enabled. • kernel/head_64.S creates a better page table • ARM • kernel/head.S creates a page table and turns on paging 33
  • 34. Virtual memory mapping 34 x86_64 Virtuali386 Virtual Physical LOWMEM PAGE_OFFSET (0xC0000000) Up to ~896 MB PAGE_OFFSET (0xFFFF8800 00000000) __START_KERNEL_map (0xFFFFFFFF 80000000)
  • 35. A. Booting in x86 By looking into the source codes 35
  • 36. A-1. Real Mode Plenty of assembler code, LD script, and inline assembly language 36
  • 37. Real mode kernel (from p.45) • header.S • Boot sector code which is no longer used • Contains setup_header • Prepares the stack and BSS to run C programs • Jumps into the C program (main.c) • main.c • Copies setup_header into “zeropage” • Sets up the early console • Initializes the heap • Checks the CPU (64-bit capable for a 64-bit kernel?) • Collects HW information by querying the BIOS, and stores the results in “zeropage” • Finally switches to protected mode, and jumps into the “protected-mode kernel” 37
  • 38. Boot sector (Useless) 38  arch  x86  boot  header.S  main.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S .global bootsect_start bootsect_start: #ifdef CONFIG_EFI_STUB # "MZ", MS-DOS header .byte 0x4d .byte 0x5a #endif # Normalize the start address ljmp $BOOTSEG, $start2 start2: movw %cs, %ax movw %ax, %ds movw %ax, %es movw %ax, %ss xorw %sp, %sp sti cld movw $bugger_off_msg, %si jmp msg_loop Normalize CS to BOOTSEG (0x7c0). movw %ds, %cs is not allowed. stack starts at 0x17c00 Enable interrupts cf. cli Reset directions for string instructions (Clear DF Flag) cf. std Show the message "Direct floppy boot is not supported. "
  • 39. Wait, how is the header code placed at the beginning of the kernel? • The linker concatenates multiple object files • Their positions in the resulting binary are not guaranteed unless the linker is told the order • The linker script (.ld/.lds/.lds.S) tells the linker where to place them! • Since it is quite common to use the C preprocessor in a linker script, files with the extension “.lds.S” are first run through the preprocessor, and then passed to the linker. • Passing a linker script with “-T” overrides the default linker script • The default linker script can be displayed with “ld --verbose” 39
  • 40. LD script (1) 40  arch  x86  boot  setup.ld  compressed  vmlinux.lds.S  kernel  vmlinux.lds.S /* * setup.ld * * Linker script for the i386 setup code */ OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386") OUTPUT_ARCH(i386) ENTRY(_start) SECTIONS { . = 0; .bstext : { *(.bstext) } .bsdata : { *(.bsdata) } . = 495; .header : { *(.header) } .entrytext : { *(.entrytext) } .inittext : { *(.inittext) } .initdata : { *(.initdata) } __end_init = .; Specifies the output format (identical to --oformat option) OUTPUT_FORMAT(default, big, little) Specifies the output architecture Specifies the entry point symbol (identical to -e option)
  • 41. LD script (2) 41 /* * setup.ld * * Linker script for the i386 setup code */ OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386") OUTPUT_ARCH(i386) ENTRY(_start) SECTIONS { . = 0; .bstext : { *(.bstext) } .bsdata : { *(.bsdata) } . = 495; .header : { *(.header) } .entrytext : { *(.entrytext) } .inittext : { *(.inittext) } .initdata : { *(.initdata) } __end_init = .;  arch  x86  boot  setup.ld  compressed  vmlinux.lds.S  kernel  vmlinux.lds.S Specifies how the sections are output. “.” means the current position. Assigning to “.” sets the current position. Put the .bstext section at the current position, i.e. at the address 0. Put the .bsdata section after the .bstext section.
  • 42. bstext section…? 42 .code16 .section ".bstext", "ax" .global bootsect_start bootsect_start: #ifdef CONFIG_EFI_STUB # "MZ", MS-DOS header .byte 0x4d .byte 0x5a #endif # Normalize the start address ljmp $BOOTSEG, $start2 start2: movw %cs, %ax movw %ax, %ds movw %ax, %es movw %ax, %ss xorw %sp, %sp sti cld movw $bugger_off_msg, %si jmp msg_loop  arch  x86  boot  header.S  main.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S Here it is! [Notes] .code16 = Specify the binary for the following code as 16-bit binary. .section name[, flags] = Starts the section. <flags> (excerpted) • “a” : allocatable (loaded to memory when executed) • “w” : writable • “x” : executable .globl/.global symbol = Makes the symbol global (Can be seen from other objects)
  • 43. LD script (3) 43 /* * setup.ld * * Linker script for the i386 setup code */ OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386") OUTPUT_ARCH(i386) ENTRY(_start) SECTIONS { . = 0; .bstext : { *(.bstext) } .bsdata : { *(.bsdata) } . = 495; .header : { *(.header) } .entrytext : { *(.entrytext) } .inittext : { *(.inittext) } .initdata : { *(.initdata) } __end_init = .;  arch  x86  boot  setup.ld  compressed  vmlinux.lds.S  kernel  vmlinux.lds.S Specifies how the sections are output Set the current position to 495 Places the header section at the address 495 Declares a symbol “__end_init” that refers to the current position (the end of .initdata section)
  • 44. LD script (4) 44 /* * setup.ld * * Linker script for the i386 setup code */ OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386") OUTPUT_ARCH(i386) ENTRY(_start) SECTIONS { . = 0; .bstext : { *(.bstext) } .bsdata : { *(.bsdata) } . = 495; .header : { *(.header) } .entrytext : { *(.entrytext) } .inittext : { *(.inittext) } .initdata : { *(.initdata) } __end_init = .;  arch  x86  boot  setup.ld  compressed  vmlinux.lds.S  kernel  vmlinux.lds.S .bstext .bsdata 0 495 .header .entrytext .inittext .initdataxxxx __end_init
  • 45. LD script (5) • To be precise, • Output a section the name of which is “.bstext” • The output section contains all of the input section “.bstext” • The input and output need not be 1-to-1 • The output section “.text” contains all of the input section “.text”, and then all of the sections the names of which start with “.text.” • Creates the new symbols “_text” and “_etext” which denote the beginning and ending of the output section “.text”, respectively. 45 .bstext : { *(.bstext) } .text : { _text = .; /* Text */ *(.text) *(.text.*) _etext = . ; }
  • 46. LD script (6) 46 . = ALIGN(16); .data : { *(.data*) } .signature : { setup_sig = .; LONG(0x5a5aaa55) } ... /DISCARD/ : { *(.note*) } /* * The ASSERT() sink to . is intentional, for binutils 2.14 compatibility: */ . = ASSERT(_end <= 0x8000, "Setup too big!"); . = ASSERT(hdr == 0x1f1, "The setup header has the wrong offset!"); /* Necessary for the very-old-loader check to work... */ . = ASSERT(__end_init <= 5*512, "init sections too big!"); }  arch  x86  boot  setup.ld  compressed  vmlinux.lds.S  kernel  vmlinux.lds.S [Usage] # Check signature at end of setup cmpl $0x5a5aaa55, setup_sig jne setup_bad Align to the 16 byte boundary Discard the sections .note* Put this long value at the current position Assertions!
  • 47. Column: align and balign • LD’s ALIGN(x) returns the x-byte aligned address • x must be a power of two • = (current + x – 1) & ~(x – 1) • GNU Assembler has two pseudo-ops for alignment • .align x, fill, max • .balign x, fill, max • Both align to a boundary specified by x, but what x means depends on the target (see the table) • The skipped bytes are filled with fill (zero or nop) • The maximum number of bytes to be skipped can be specified with max. 47 .align (x = 4) .balign (x = 4) i386 (elf), sparc, etc. Align to 4 byte Align to 4 byte ppc, i386 (a.out), arm Align to 16 byte (2^4) Align to 4 byte
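The ALIGN formula on this slide is easy to check in C. A minimal sketch, assuming x is a power of two as the slide requires (`align_up` is a name chosen here for illustration):

```c
#include <stdint.h>

/*
 * Round addr up to the next multiple of x (x must be a power of two).
 * Adding x - 1 pushes any unaligned address past the boundary, and
 * clearing the low bits with ~(x - 1) snaps it back onto the boundary.
 */
static inline uint32_t align_up(uint32_t addr, uint32_t x)
{
    return (addr + x - 1) & ~(x - 1);
}
```

For example, `align_up(495, 16)` yields 496, while an already aligned value such as 496 is returned unchanged.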
  • 48. COFF Stuffs 48  arch  x86  boot  header.S  main.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S #ifdef CONFIG_EFI_STUB .org 0x3c # Offset to the PE header. .long pe_header #endif /* CONFIG_EFI_STUB */ .section ".bsdata", "a" bugger_off_msg: .ascii "Direct floppy boot is not supported. " .ascii "Use a boot loader program instead.\r\n" ... .byte 0 #ifdef CONFIG_EFI_STUB pe_header: .ascii "PE" .word 0 coff_header: #ifdef CONFIG_X86_32 .word 0x14c # i386 #else .word 0x8664 # x86-64 #endif [Notes] .org location, fill = Set the current position to location in the current section (filling the skipped bytes with fill) .ascii string = Put the string (w/o zero termination) at the current position (cf. .asciz) .byte val, .word val, .long val, .quad val = Put the 1/2/4/8-byte value(s)
  • 49. Real mode kernel (p.45) • header.S • Boot sector code which is no longer used • Contains setup_header • Prepares the stack and BSS to run C programs • Jumps into the C program (main.c) • main.c • Copies setup_header into “zeropage” • Sets up the early console • Initializes the heap • Checks the CPU (64-bit capable for a 64-bit kernel?) • Collects HW information by querying the BIOS, and stores the results in “zeropage” • Finally switches to protected mode, and jumps into the “protected-mode kernel” 49
  • 50. Entry point (2nd sector) 50  arch  x86  boot  header.S  main.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S .section ".header", "a" .globl sentinel sentinel: .byte 0xff, 0xff /* Used to detect broken loaders */ .globl hdr hdr: setup_sects: .byte 0 /* Filled in by build.c */ root_flags: .word ROOT_RDONLY syssize: .long 0 /* Filled in by build.c */ ram_size: .word 0 /* Obsolete */ vid_mode: .word SVGA_MODE root_dev: .word 0 /* Filled in by build.c */ boot_flag: .word 0xAA55 # offset 512, entry point .globl _start _start: .byte 0xeb # short (2-byte) jump .byte start_of_setup-1f 1: .ascii "HdrS" # header signature .word 0x020d # header version number (>= 0x0105) .bstext .bsdata 0 495 .header The jump is entered as raw bytes to prevent the assembler from accidentally producing a 3-byte jump
  • 51. Setup_header • “.header” section starts at 495 • 2-byte sentinel is located at the beginning. • struct setup_header begins at 497 (=0x1f1) 51 struct setup_header { __u8 setup_sects; __u16 root_flags; __u32 syssize; __u16 ram_size; __u16 vid_mode; __u16 root_dev; __u16 boot_flag; __u16 jump; __u32 header; __u16 version; __u32 realmode_swtch; ... (arch/x86/include/uapi/asm/bootparam.h) Setup code Boot Sector 0x0000 0x0200 0x1f1
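The offsets on this slide can be double-checked with `offsetof`. The sketch below models the first bytes of the image as one packed struct; `struct boot_sector_model` and its padding field are hypothetical names introduced only for this check, and the field list stops at `version`.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of the image start: 495 bytes of .bstext/.bsdata, then .header. */
struct boot_sector_model {
    uint8_t  bstext_bsdata[495];  /* boot sector code and data */
    uint16_t sentinel;            /* 0xffff, at offset 495 */
    uint8_t  setup_sects;         /* first field of setup_header */
    uint16_t root_flags;
    uint32_t syssize;
    uint16_t ram_size;
    uint16_t vid_mode;
    uint16_t root_dev;
    uint16_t boot_flag;           /* 0xAA55, last two bytes of sector 0 */
    uint16_t jump;                /* _start, first bytes of sector 1 */
    uint32_t header;              /* "HdrS" */
    uint16_t version;
} __attribute__((packed));

/* Same check as the linker script's ASSERT(hdr == 0x1f1, ...). */
_Static_assert(offsetof(struct boot_sector_model, setup_sects) == 0x1f1,
               "setup_header must start at 0x1f1");
```

With this layout `boot_flag` lands at 0x1fe, the classic boot signature position, and `jump` at 0x200, which is why the entry point is at offset 512.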
  • 52. Column: Local Symbol in GAS • Local symbols • Symbols that can be used temporarily • Format is N: (where N is a positive integer) • To refer to the local symbols, use Nf or Nb. • Nf refers to the next local label N. • Nb refers to the most recently declared local label N. • According to GNU assembler manual, these symbols are internally transformed to the following format: • LN^BO • ^B is Ctrl-B (0x02), O is a serial number • For 44th 3, “L3^B44” is used. • Dollar local symbols (I haven’t seen this) 52 .byte start_of_setup-1f 1: 1: jmp 1b
  • 53. Get prepared to C (stack) 53  arch  x86  boot  header.S  main.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S .section ".entrytext", "ax" start_of_setup: # Force %es = %ds movw %ds, %ax movw %ax, %es cld movw %ss, %dx cmpw %ax, %dx # %ds == %ss? movw %sp, %dx je 2f # -> assume %sp is reasonably set # Invalid %ss, make up a new stack movw $_end, %dx testb $CAN_USE_HEAP, loadflags jz 1f movw heap_end_ptr, %dx 1: addw $STACK_SIZE, %dx jnc 2f xorw %dx, %dx # Prevent wraparound 2: # Now %dx should point to the end of our stack space andw $~3, %dx # dword align (might as well...) jnz 3f movw $0xfffc, %dx # Make sure we're not zero 3: movw %ax, %ss movzwl %dx, %esp # Clear upper half of %esp If %ds == %ss, %sp is assumed to be properly set by the loader If not, sets up a new stack. The address is _end + STACK_SIZE (512 byte) or heap_end_ptr + STACK_SIZE (if CAN_USE_HEAP is set)
  • 54. In other words, • Set the stack segment to the same value as %DS • Allocate 512 bytes for the stack 54 unsigned short stack = %sp; /* kept if %ds == %ss */ if (%ds != %ss) { if (hdr.loadflags & CAN_USE_HEAP) { stack = hdr.heap_end_ptr + STACK_SIZE; } else { stack = _end + STACK_SIZE; } if (carried over) { /* stack >= 0x10000 */ stack = 0; } } /* Align to 4-byte */ stack &= ~3; if (stack == 0) stack = 0xfffc; /* – 4 */ %ss = %ds; %esp = stack;
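The pseudo-code above can be turned into ordinary C for experimenting with the corner cases. This is only a model, assuming STACK_SIZE is 512 as stated on the slide; the function name and its parameters are invented here and stand in for the registers and header fields.

```c
#include <stdint.h>
#include <stdbool.h>

#define STACK_SIZE 512  /* value quoted on the slide */

/* Returns the value that would end up in %sp. */
static inline uint16_t pick_stack_pointer(uint16_t ds, uint16_t ss, uint16_t sp,
                                          uint16_t end, uint16_t heap_end_ptr,
                                          bool can_use_heap)
{
    uint16_t stack = sp;  /* trusted as-is when %ss == %ds */

    if (ds != ss) {
        /* Compute in 32 bits so the 16-bit carry is visible. */
        uint32_t top = (can_use_heap ? heap_end_ptr : end) + (uint32_t)STACK_SIZE;
        stack = (top > 0xffff) ? 0 : (uint16_t)top;  /* carry: wrap to 0 */
    }

    stack &= 0xfffc;      /* dword align */
    if (stack == 0)
        stack = 0xfffc;   /* never leave %sp at zero */
    return stack;
}
```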
  • 55. Get prepared to C (CS fix and BSS clear) 55 sti # Now we should have a working stack # We will have entered with %cs = %ds+0x20, normalize %cs so # it is on par with the other segments. pushw %ds pushw $6f lretw 6: # Check signature at end of setup cmpl $0x5a5aaa55, setup_sig jne setup_bad # Zero the bss movw $__bss_start, %di movw $_end+3, %cx xorl %eax, %eax subw %di, %cx shrw $2, %cx rep; stosl # Jump to C code (should not return) calll main  arch  x86  boot  header.S  main.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S $6f is the address of the 6f, which is the offset from the boot sector. Signature check Fill the bss by zero. “rep; stosl” (string instruction) fills the memory from %es:%di for %cx DWORDs with %eax.
  • 56. [Column] Calling conventions • 16 bit (GCC regparm(3), selected with -mregparm=3) • Arguments: %ax, %dx, %cx • Return value: %ax • 32 bit (cdecl) • Arguments: pushed on the stack (in the reversed order of the arguments) • Caller-saved: %eax, %ecx, and %edx • Callee-saved: the others • Return value: %eax (for int) • 64 bit (amd64) • Arguments: %rdi, %rsi, %rdx, %rcx, %r8, %r9 • Caller-saved: all registers other than the callee-saved ones. • Callee-saved: %rbp, %rbx, %r12 to %r15 • Return value: %rax (%eax for int) 56 f(2, 5, 9, 11); 11 9 5 2 (return address) stack
  • 57. Real mode kernel (p.45) • header.S • Boot sector code which is no longer used • Contains setup_header • Prepares the stack and BSS to run C programs • Jumps into the C program (main.c) • main.c • Copies setup_header into “zeropage” • Sets up the early console • Initializes the heap • Checks the CPU (64-bit capable for a 64-bit kernel?) • Collects HW information by querying the BIOS, and stores the results in “zeropage” • Finally switches to protected mode, and jumps into the “protected-mode kernel” 57
  • 58. main 58 void main(void) { /* First, copy the boot header into the "zeropage" */ copy_boot_params(); /* Initialize the early-boot console */ console_init(); ... /* End of heap check */ init_heap(); /* Make sure we have all the proper CPU support */ if (validate_cpu()) { ... } set_bios_mode(); detect_memory(); keyboard_init(); query_mca(); query_ist(); ... /* Set the video mode */ set_video(); /* Do the last things and invoke protected mode */ go_to_protected_mode(); }  arch  x86  boot  header.S  main.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S
  • 59. Copy to zeropage • Very simple • The omitted part is for compatibility with the old command-line parameter protocol (located at a fixed address) 59  arch  x86  boot  header.S  main.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S struct boot_params boot_params __attribute__((aligned(16))); ... static void copy_boot_params(void) { ... BUILD_BUG_ON(sizeof boot_params != 4096); memcpy(&boot_params.hdr, &hdr, sizeof hdr); ... }
  • 60. Set up the serial console • Parse the command line parameter in a very ad-hoc way, and find the serial configuration • Find “earlyprintk” and check whether it is in one of the following formats • “serial,0x3f8,115200” • “serial,ttyS0,115200” • “ttyS0,115200” • Find “console” and look for “uart8250,io,…” or “uart,io,…” • If any serial config is found, set it up using I/O ports 60 void console_init(void) { parse_earlyprintk(); if (!early_serial_base) parse_console_uart8250(); }  arch  x86  boot  header.S  main.c  early_serial_console.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S
  • 61. Puts and putchar • By BIOS call and serial I/O ports 61 void __attribute__((section(".inittext"))) putchar(int ch) { if (ch == '\n') putchar('\r'); /* \n -> \r\n */ bios_putchar(ch); if (early_serial_base != 0) serial_putchar(ch); } void __attribute__((section(".inittext"))) puts(const char *str) { while (*str) putchar(*str++); }  arch  x86  boot  header.S  main.c  tty.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S [Notes] GCC extension __attribute__ section(section) : locate the function/variable in the specified section.
  • 62. Serial and BIOS putchar 62 static void __attribute__((section(".inittext"))) serial_putchar(int ch) { unsigned timeout = 0xffff; while ((inb(early_serial_base + LSR) & XMTRDY) == 0 && --timeout) cpu_relax(); outb(ch, early_serial_base + TXR); } static void __attribute__((section(".inittext"))) bios_putchar(int ch) { struct biosregs ireg; initregs(&ireg); ireg.bx = 0x0007; ireg.cx = 0x0001; ireg.ah = 0x0e; ireg.al = ch; intcall(0x10, &ireg, NULL); }  arch  x86  boot  header.S  main.c  tty.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S Put a char on a serial line by using I/O ports (IN and OUT instructions) Put a char on VGA by BIOS Call (INT 0x10, AH = 0x0e)
  • 63. BIOS Call • A BIOS call is invoked by using an INT instruction • Requires assembly-language support • Parameters and return values are passed in a certain set of registers • The INT instruction only takes an immediate for the interrupt number. • C prototype: void intcall(u8 int_no, const struct biosregs *ireg, struct biosregs *oreg); • struct biosregs has all the general registers, the data segment registers, and the flags register 63 void initregs(struct biosregs *reg) { memset(reg, 0, sizeof *reg); reg->eflags |= X86_EFLAGS_CF; reg->ds = ds(); reg->es = ds(); reg->fs = fs(); reg->gs = gs(); }
  • 64. BIOS Call Impl. (1) 64  arch  x86  boot  header.S  main.c  bioscall.S  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S .code16 .section ".inittext","ax" .globl intcall ... intcall: cmpb %al, 3f je 1f movb %al, 3f jmp 1f /* Synchronize pipeline */ 1: ... /* Actual INT */ .byte 0xcd /* INT opcode */ 3: .byte 0 ... void intcall(u8 int_no, const struct biosregs *ireg, struct biosregs *oreg); ax dx cx Checks the current operand of the INT instruction, and rewrite (self-modify) the interrupt number if different.
  • 65. BIOS Call Impl. (2) 65 1: /* Save state */ pushfl pushw %fs pushw %gs pushal /* Copy input state to stack frame */ subw $44, %sp movw %dx, %si movw %sp, %di movw $11, %cx rep; movsd /* Pop full state from the stack */ popal popw %gs popw %fs popw %es popw %ds popfl /* Actual INT */ .byte 0xcd /* INT opcode */ 3: .byte 0 EFLAGS FS GS EAX ECX EDI … stack EFLAGS FS GS DS ES EAX … EDI Copy of struct biosregs *ireg (44 bytes) Registers Registers
  • 66. BIOS Call Impl. (3) 66 /* Push full state to the stack */ pushfl pushw %ds pushw %es pushw %fs pushw %gs pushal ... (Restore %ds, %sp, etc.) ... /* Copy output state from stack frame */ movw 68(%esp), %di /* Original %cx == 3rd argument */ andw %di, %di jz 4f movw %sp, %si movw $11, %cx rep; movsd /* Restore state and return */ popal popw %gs popw %fs popfl retl EFLAGS FS GS EAX ECX EDI … stack EFLAGS FS GS DS ES EAX … EDI Registers *oregs Registers
  • 67. Inline assembly • A quick way to use assembly language inside C source code • For example, when you want to disable interrupts, put asm("cli"); into your C code. • GCC’s extended inline assembly language enables far more features (and is more complicated) • => Described about twenty slides later! 67 static inline void outb(u8 v, u16 port) { asm volatile("outb %0,%1" : : "a" (v), "dN" (port)); }
  • 68. Initialize the heap 68 char *HEAP = _end; char *heap_end = _end; /* Default end of heap = no heap */ ... static void init_heap(void) { char *stack_end; if (boot_params.hdr.loadflags & CAN_USE_HEAP) { asm("leal %P1(%%esp),%0" : "=r" (stack_end) : "i" (-STACK_SIZE)); heap_end = (char *) ((size_t)boot_params.hdr.heap_end_ptr + 0x200); if (heap_end > stack_end) heap_end = stack_end; } else { /* Boot protocol 2.00 only, no heap available */ puts("WARNING: Ancient bootloader, some functionality " "may be limited!\n"); } }  arch  x86  boot  header.S  main.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S Assign %esp – STACK_SIZE to stack_end heap_end stack_end
  • 69. When is the heap used? • Heap allocation function is very simple • And the calls for GET_HEAP exist only in the video code files. 69 static inline char *__get_heap(size_t s, size_t a, size_t n) { char *tmp; HEAP = (char *)(((size_t)HEAP+(a-1)) & ~(a-1)); tmp = HEAP; HEAP += s*n; return tmp; } #define GET_HEAP(type, n) ((type *)__get_heap(sizeof(type),__alignof__(type),(n))) saved.data = GET_HEAP(u16, saved.x*saved.y); (boot/video.c)
  • 70. Retrieving memory info. • As described in the last presentation, detect_memory tries 3 methods 70 int detect_memory(void) { ... if (detect_memory_e820() > 0) err = 0; if (!detect_memory_e801()) err = 0; if (!detect_memory_88()) err = 0; return err; }  arch  x86  boot  header.S  main.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S
  • 71. Memory Information [from p.48] • AX = 0xe820, INT 0x15 [detect_memory_e820()] • INPUT • AX = 0xe820 • CX = size of the buffer • EDX = “SMAP” (0x534d4150 / Signature) • EBX = Continuation value • ES:DI = address for the buffer • OUTPUT • CF = 0 if successful, 1 otherwise • CX = number of bytes returned • EBX = Continuation value • Each call returns information for one range • To get information for the next range, give the continuation value returned in the previous call • The range information is returned in the following structure • Stored in boot_params.e820_map (struct e820entry[128]) 71 struct e820entry { __u64 addr; /* start of memory segment */ __u64 size; /* size of memory segment */ __u32 type; /* type of memory segment */ } __attribute__((packed)); (arch/x86/include/uapi/asm/e820.h)
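Once the map has been collected, a consumer typically walks the array and looks at the type field. The sketch below sums the usable RAM; it re-declares the structure with standard fixed-width types, uses type value 1 for usable RAM as defined by the E820 interface, and the sample map is made-up data for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

#define E820_TYPE_RAM 1  /* usable RAM */

/* Same layout as the structure on the slide: 8 + 8 + 4 = 20 bytes. */
struct e820entry {
    uint64_t addr;  /* start of memory segment */
    uint64_t size;  /* size of memory segment */
    uint32_t type;  /* type of memory segment */
} __attribute__((packed));

/* Hypothetical map: low memory, a reserved hole, and RAM above 1 MB. */
static const struct e820entry sample_map[] = {
    { 0x00000000, 0x0009fc00, 1 },
    { 0x0009fc00, 0x00000400, 2 },
    { 0x00100000, 0x3ff00000, 1 },
};

/* Add up the sizes of all entries marked as usable RAM. */
static inline uint64_t e820_usable_bytes(const struct e820entry *map, size_t n)
{
    uint64_t total = 0;

    for (size_t i = 0; i < n; i++)
        if (map[i].type == E820_TYPE_RAM)
            total += map[i].size;
    return total;
}
```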
  • 72. E820 72 static int detect_memory_e820(void) { int count = 0; struct biosregs ireg, oreg; struct e820entry *desc = boot_params.e820_map; static struct e820entry buf; /* static so it is zeroed */ initregs(&ireg); ireg.ax = 0xe820; ireg.cx = sizeof buf; ireg.edx = SMAP; ireg.di = (size_t)&buf; do { intcall(0x15, &ireg, &oreg); ireg.ebx = oreg.ebx; /* for next iteration... */ if (oreg.eflags & X86_EFLAGS_CF) break; ... *desc++ = buf; count++; } while (ireg.ebx && count < ARRAY_SIZE(boot_params.e820_map)); return boot_params.e820_entries = count; }  arch  x86  boot  header.S  main.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S
  • 74. Go To Protected Mode 74 void go_to_protected_mode(void) { /* Hook before leaving real mode, also disables interrupts */ realmode_switch_hook(); /* Enable the A20 gate */ if (enable_a20()) { puts("A20 gate not responding, unable to boot...\n"); die(); } /* Reset coprocessor (IGNNE#) */ reset_coprocessor(); /* Mask all interrupts in the PIC */ mask_all_interrupts(); /* Actual transition to protected mode... */ setup_idt(); setup_gdt(); protected_mode_jump(boot_params.hdr.code32_start, (u32)&boot_params + (ds() << 4)); }  arch  x86  boot  header.S  main.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S
  • 75. Go To PM: Details (1) • Call the hook if set • Otherwise, disable interrupts and NMI. 75 static void realmode_switch_hook(void) { if (boot_params.hdr.realmode_swtch) { asm volatile("lcallw *%0" : : "m" (boot_params.hdr.realmode_swtch) : "eax", "ebx", "ecx", "edx"); } else { asm volatile("cli"); outb(0x80, 0x70); /* Disable NMI */ io_delay(); } } If a hook is set in realmode_swtch, call the hook. Otherwise, write 0x80 to port 0x70 (CMOS Controller!!) (For historical reasons, the “NMI disable” bit is located in the CMOS controller)  arch  x86  boot  header.S  main.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S
  • 76. Go To PM: Details (2) • Enable the A20 line (bit 20 of the address bus, i.e. the 21st address line) • In the initial state, the bit is masked (always 0) • For compatibility with programs that expect addresses to wrap around at 1 MB • Some programs expect the address 0xFFFFF + 1 = 0x00000 • To use the full 32-bit address space, this mask must be disabled. • Many ways to do it • But which way works depends on the firmware • The famous one would be via the keyboard controller port! • Linux tries several ways, and several times 76
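The wraparound is easiest to see with real-mode address arithmetic, where the physical address is (segment << 4) + offset. A minimal model, with a hypothetical function name, that forces bit 20 to zero while the A20 gate is closed:

```c
#include <stdint.h>
#include <stdbool.h>

/* Physical address produced by a real-mode segment:offset pair. */
static inline uint32_t real_mode_phys(uint16_t seg, uint16_t off, bool a20_enabled)
{
    uint32_t addr = ((uint32_t)seg << 4) + off;

    if (!a20_enabled)
        addr &= ~(1u << 20);  /* A20 masked: bit 20 is always 0 */
    return addr;
}
```

With the gate closed, 0xFFFF:0x0010 (linear 0x100000) wraps to physical address 0, whereas with A20 enabled it reaches the first byte above 1 MB.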
  • 77. Go To PM: Details (3) 77 int enable_a20(void) {... while (loops--) { if (a20_test_short()) return 0; /* Next, try the BIOS (INT 0x15, AX=0x2401) */ enable_a20_bios(); if (a20_test_short()) return 0; /* Try enabling A20 through the keyboard controller */ kbc_err = empty_8042(); if (a20_test_short()) return 0; /* BIOS worked, but with delayed reaction */ if (!kbc_err) { enable_a20_kbc(); if (a20_test_long()) return 0; } /* Finally, try enabling the "fast A20 gate" */ enable_a20_fast(); if (a20_test_long()) return 0; } ... }  arch  x86  boot  header.S  main.c  memory.c  pm.c  a20.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S *The loop is retried at most A20_ENABLE_LOOPS (255) times; the 100000 limit (MAX_8042_LOOPS) applies to polling the keyboard controller
  • 78. Go To PM: Details (4) 78 /* * Reset IGNNE# if asserted in the FPU. */ static void reset_coprocessor(void) { outb(0, 0xf0); io_delay(); outb(0, 0xf1); io_delay(); } /* * Disable all interrupts at the legacy PIC. */ static void mask_all_interrupts(void) { outb(0xff, 0xa1); /* Mask all interrupts on the secondary PIC */ io_delay(); outb(0xfb, 0x21); /* Mask all but cascade on the primary PIC */ io_delay(); } The most legacy interrupt controller  arch  x86  boot  header.S  main.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S
  • 79. Go To PM: Details (5) • IDT (Interrupt Descriptor Table) • Describes the exception/interrupt handlers (and task gate, etc.) • At this time, no IDT is installed. • null_idt contains information for the address and size for the IDT, both of which are zero. • LIDT instruction takes an argument that is a pointer to the information. 79 static void setup_idt(void) { static const struct gdt_ptr null_idt = {0, 0}; asm volatile("lidtl %0" : : "m" (null_idt)); }  arch  x86  boot  header.S  main.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S len (16-bit) address (32-bit) IDT
  • 80. Go To PM: Details (6) • GDT (Global Descriptor Table) • Describes the segment information 80  arch  x86  boot  header.S  main.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S static void setup_gdt(void) { static const u64 boot_gdt[] __attribute__((aligned(16))) = { /* CS: code, read/execute, 4 GB, base 0 */ [GDT_ENTRY_BOOT_CS] = GDT_ENTRY(0xc09b, 0, 0xfffff), /* DS: data, read/write, 4 GB, base 0 */ [GDT_ENTRY_BOOT_DS] = GDT_ENTRY(0xc093, 0, 0xfffff), /* TSS: 32-bit tss, 104 bytes, base 4096 */ /* We only have a TSS here to keep Intel VT happy; we don't actually use it for anything. */ [GDT_ENTRY_BOOT_TSS] = GDT_ENTRY(0x0089, 4096, 103), }; static struct gdt_ptr gdt; gdt.len = sizeof(boot_gdt)-1; gdt.ptr = (u32)&boot_gdt + (ds() << 4); asm volatile("lgdtl %0" : : "m" (gdt)); } len (16-bit) address (32-bit) boot_gdt (GDT)
  • 81. x86 Architecture: GDT • GDT • Each entry is 8 bytes • Base, limit, and attributes • DPL: Descriptor privilege level (0-3: 0 is the most privileged) • When a processor executes code in a code segment, the current privilege level (CPL) is the same as the DPL of the code segment. It can access data segments with DPL >= CPL. 81 G D L * Limit 19:16 P DPL S Type Base 23:16 Base 31:24 Base Address 15:00 Limit 15:00 [GDT_ENTRY_BOOT_CS] = GDT_ENTRY(0xc09b, 0, 0xfffff), [GDT_ENTRY_BOOT_DS] = GDT_ENTRY(0xc093, 0, 0xfffff), Type b = Code, Execute/Read (accessed) Type 3 = Data, R/W (accessed) 08121315162024 0 4
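To see how the three arguments of GDT_ENTRY end up in the 8-byte layout of the figure, the packing can be written as a plain function. This is a sketch that follows the field positions in the diagram, not the kernel's macro itself; `gdt_entry`, `gdt_type`, and `gdt_dpl` are names chosen here.

```c
#include <stdint.h>

/* Pack flags (G/D/L/AVL in the top nibble, P/DPL/S/Type in the low byte),
 * a 32-bit base, and a 20-bit limit into one segment descriptor. */
static inline uint64_t gdt_entry(uint16_t flags, uint32_t base, uint32_t limit)
{
    return ((uint64_t)(base  & 0xff000000u) << 32) |  /* base 31:24  -> bits 63:56 */
           ((uint64_t)(flags & 0xf0ffu)     << 40) |  /* attributes  -> bits 55:52, 47:40 */
           ((uint64_t)(limit & 0x000f0000u) << 32) |  /* limit 19:16 -> bits 51:48 */
           ((uint64_t)(base  & 0x00ffffffu) << 16) |  /* base 23:0   -> bits 39:16 */
            (uint64_t)(limit & 0x0000ffffu);          /* limit 15:0  -> bits 15:0 */
}

/* 4-bit Type field (bits 43:40). */
static inline unsigned gdt_type(uint64_t desc)
{
    return (unsigned)(desc >> 40) & 0xf;
}

/* 2-bit DPL field (bits 46:45). */
static inline unsigned gdt_dpl(uint64_t desc)
{
    return (unsigned)(desc >> 45) & 0x3;
}
```

Under these assumptions the boot code segment, `gdt_entry(0xc09b, 0, 0xfffff)`, evaluates to 0x00cf9b000000ffff, the familiar flat 4 GB code descriptor.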
  • 82. x86 Architecture: Flat Model and Segment Register • Although the segment feature is available in 32-bit mode, the common usage is the so-called “Flat Model.” • Uses a single segment from zero to 2^32 – 1 • To be precise, different segments are required for code/data and privileged/user mode. • Linux uses four segments: KERNEL_CS, KERNEL_DS, USER_CS, USER_DS • During boot time, BOOT_CS and BOOT_DS are used (as defined in the previous slide) • Segment Register (Selector) • If CS is to select BOOT_CS, CS = (index of BOOT_CS) << 3; • GDT_ENTRY_BOOT_CS = (Index of BOOT_CS) = 2, then CS = 16 • The constants are __BOOT_CS = 16 and __BOOT_DS = 24. • Note the difference between “ENTRY” (the table index) and the actual selector value. 82 T I RPLIndex 02315
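The selector layout in the figure (index in bits 15:3, table indicator in bit 2, RPL in bits 1:0) translates directly into shifts. A small sketch with hypothetical helper names:

```c
#include <stdint.h>

/* Build a selector from a descriptor index, table indicator (0 = GDT,
 * 1 = LDT), and requested privilege level. */
static inline uint16_t make_selector(unsigned index, unsigned ti, unsigned rpl)
{
    return (uint16_t)((index << 3) | ((ti & 1u) << 2) | (rpl & 3u));
}

/* Recover the descriptor index from a selector. */
static inline unsigned selector_index(uint16_t sel)
{
    return sel >> 3;
}
```

Index 2 with TI = 0 and RPL = 0 gives 16, and index 3 gives 24, which are the selector values quoted on the slide.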
  • 83. Go To PM: Details (7) • Call the assembler part (no return) 83 protected_mode_jump(boot_params.hdr.code32_start, (u32)&boot_params + (ds() << 4)); GLOBAL(protected_mode_jump) movl %edx, %esi # Pointer to boot_params table xorl %ebx, %ebx movw %cs, %bx shll $4, %ebx addl %ebx, 2f jmp 1f # Short jump to serialize on 386/486 1: movw $__BOOT_DS, %cx movw $__BOOT_TSS, %di movl %cr0, %edx orb $X86_CR0_PE, %dl # Protected mode movl %edx, %cr0 # Transition to 32-bit mode .byte 0x66, 0xea # ljmpl opcode 2: .long in_pm32 # offset .word __BOOT_CS # segment ENDPROC(protected_mode_jump)  arch  x86  boot  header.S  main.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S dx ax *(uint32_t *)2f += cs() << 4; (phys addr of in_pm32) [Notes] In real mode, physical address = (Segment Register << 4) + Offset To enter protected mode, set PE bit in %cr0 register. And 32-bit far jump operation (not expressible in real mode asm)
  • 84. Go To PM: Details (8) • (arch/x86/boot/pmjump.S) .code32 .section ".text32","ax" GLOBAL(in_pm32) # Set up data segments for flat 32-bit mode movl %ecx, %ds movl %ecx, %es movl %ecx, %fs movl %ecx, %gs movl %ecx, %ss ... addl %ebx, %esp ... ltr %di ... xorl %ecx, %ecx xorl %edx, %edx xorl %ebx, %ebx xorl %ebp, %ebp xorl %edi, %edi ... lldt %cx ... jmpl *%eax ENDPROC(in_pm32) • %eax still holds the first argument of protected_mode_jump(boot_params.hdr.code32_start, (u32)&boot_params + (ds() << 4)), so jmpl *%eax enters the 32-bit entry point. • * (Omitted) Task register and LDT (Local Descriptor Table)
  • 85. Extended inline assembler (1) • GCC’s extended inline assembly language • Input/output operands can be specified for the assembler • Assembler template • The actual assembly language with templates that will be substituted by the output/input operands • Output operands • List of C variables modified by the assembler template • Input operands • List of C expressions read by the instructions in the assembler template. • Clobber • List of registers/values to be changed by the assembler template (other than the output operands) 85 asm [volatile] (assembler template : [ output operands [ : input operands [ : clobber ]]]) static inline void outb(u8 v, u16 port) { asm volatile("outb %0,%1" : : "a" (v), "dN" (port)); } Assembler template Input operands
  • 86. Extended inline assembler (2) • Assembler template • Basically the same as standalone assembly language • %n (n is zero or a positive integer) refers to the (n+1)-th operand, counting the output operands first and then the input operands. • To write a literal “%” (to name a specific register such as “%ebx”), use “%%”. • An operand can also be referred to by name instead of by number: %[symbolicname] refers to the operand declared with the name [symbolicname]. • To use multiple instructions, separate them with “;” or “\n” static inline void outb(u8 v, u16 port) { asm volatile("outb %0,%1" : : "a" (v), "dN" (port)); }
  • 87. Extended inline assembler (3) • Input operands • Comma-separated list of C expressions prefixed with constraints: [SymbolicName] "Constraints" (C expression), … • A constraint specifies how the expression is passed to the assembler template. • When multiple constraints are specified, the compiler selects the most efficient one. • Example: asm volatile("outb %0,%1" : : "a" (v), "dN" (port)); here "a" and "dN" are the constraints, and v and port are the C expressions. • Common constraints: ‘m’ = memory operand; ‘r’ = general register; ‘i’ = immediate integer; ‘0’ - ‘9’ = the same place as the operand with that number • x86-specific constraints: ‘a’ ‘b’ ‘c’ ‘d’ = A, B, C, D register; ‘S’ ‘D’ = SI, DI register; ‘N’ = unsigned 8-bit integer (for in/out instructions); ‘A’ = EDX:EAX (32-bit), RDX/RAX (64-bit)
  • 88. Extended inline assembler (4) • In this example, • The value of v (u8) is stored in %al register • The value of port (u16) is stored in %dx register or used as 8-bit immediate. • This function is declared as “inline,” so if this function is called with a constant value as port which is less than 256, the “N” constraint may be used. • Then, the instruction(s) in the assembler templates are executed. • The resulting assembly language will be 88 asm volatile("outb %0,%1" : : "a" (v), "dN" (port)); outb: movl 8(%esp), %edx movl 4(%esp), %eax outb %al,%dx ret
  • 89. Extended inline assembler (5) • Output operands • Comma-separated list of C variables prefixed with constraints • Constraints should be prefixed with “=“ or “+” • “+” means that the variable is used as a both input and output operand. • “&” constraint allocates a different register from the input operands (for multiple instructions, this constraint may be necessary) • After the instruction(s) in the assembler template is executed, the value of A register (%al) is stored to the variable v. 89 [SymbolicName] “=Constraints” (C Variable),… static inline u8 inb(u16 port) { u8 v; asm volatile("inb %1,%0" : "=a" (v) : "dN" (port)); return v; }
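To show the “+” modifier together with symbolic operand names, here is a small sketch that is not taken from the kernel. The function name add_u32 is hypothetical; the inline assembly is only compiled on x86 with a GCC-compatible compiler, and a plain C fallback is used elsewhere.

```c
#include <assert.h>
#include <stdint.h>

/* Adds addend to acc using one x86 instruction where available. */
static inline uint32_t add_u32(uint32_t acc, uint32_t addend)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* "+r": acc is both read and written by the template.
     * "cc": addl modifies the flags register. */
    __asm__("addl %[src], %[dst]"
            : [dst] "+r" (acc)
            : [src] "r" (addend)
            : "cc");
#else
    acc += addend;  /* portable fallback for other compilers/architectures */
#endif
    return acc;
}
```

Because acc is declared with “+r”, no separate input operand is needed for it; with “=r” the compiler would be free to assume the old value is never read.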
  • 90. Extended inline assembler (6) • Clobber • The list of registers/values modified by the instructions • Output registers need not be listed here. • The most common clobber is “memory” • It tells the compiler that memory contents may change as a side effect, so every variable cached in a register must be written back to memory before the assembler code and read again from memory afterwards. • “cc”: the condition (flags) register void *memcpy(void *dest, const void *src, size_t n) { int d0, d1, d2; asm volatile( "rep ; movsl\n\t" "movl %4,%%ecx\n\t" "rep ; movsb\n\t" : "=&c" (d0), "=&D" (d1), "=&S" (d2) : "0" (n >> 2), "g" (n & 3), "1" (dest), "2" (src) : "memory"); return dest; }
  • 91. (?) • Kernel version: void *memcpy(void *dest, const void *src, size_t n) { int d0, d1, d2; asm volatile( "rep ; movsl\n\t" "movl %4,%%ecx\n\t" "rep ; movsb\n\t" : "=&c" (d0), "=&D" (d1), "=&S" (d2) : "0" (n >> 2), "g" (n & 3), "1" (dest), "2" (src) : "memory"); return dest; } • Would this shorter variant do (?): asm volatile( "rep ; movsl\n\t" "movl %2,%%ecx\n\t" "rep ; movsb\n\t" : "+&D" (dest) : "c" (n >> 2), "g" (n & 3), "S" (src) : "memory");
  • 92. Extended inline assembler (7) • Examples (appeared in the previous slides) • Example 1: asm("leal %P1(%%esp),%0" : "=r" (stack_end) : "i" (-STACK_SIZE)); • Stores %esp - STACK_SIZE to stack_end • P in “%P1” is an operand modifier (I could not find it in the documentation) • With “%P1”: leal -512(%esp),%eax • With “%1”: leal $-512(%esp),%eax (the “$” prefix makes this invalid as a displacement) • With “%c1” (“constant expression with no punctuation”): leal -512(%esp),%eax
  • 93. Extended inline assembler (8) • Example 2 • Far-calls the address (the value of boot_params.hdr.realmode_swtch) • The registers eax, ebx, ecx, and edx will be changed in this call. • Example 3 93 asm volatile("lcallw *%0" : : "m" (boot_params.hdr.realmode_swtch) : "eax", "ebx", "ecx", "edx"); static const struct gdt_ptr null_idt = {0, 0}; asm volatile("lidtl %0" : : "m" (null_idt)); setup_idt: lidtl null_idt.1378 ret
  • 94. Extended inline assembler (9) • (arch/x86/include/asm/switch_to.h) #define switch_to(prev, next, last) do { unsigned long ebx, ecx, edx, esi, edi; asm volatile("pushfl\n\t" /* save flags */ "pushl %%ebp\n\t" /* save EBP */ "movl %%esp,%[prev_sp]\n\t" /* save ESP */ "movl %[next_sp],%%esp\n\t" /* restore ESP */ "movl $1f,%[prev_ip]\n\t" /* save EIP */ "pushl %[next_ip]\n\t" /* restore EIP */ __switch_canary "jmp __switch_to\n" /* regparm call */ "1:\t" "popl %%ebp\n\t" /* restore EBP */ "popfl\n" /* restore flags */ /* output parameters */ : [prev_sp] "=m" (prev->thread.sp), [prev_ip] "=m" (prev->thread.ip), "=a" (last), /* clobbered output registers: */ "=b" (ebx), "=c" (ecx), "=d" (edx), "=S" (esi), "=D" (edi) __switch_canary_oparam /* input parameters: */ : [next_sp] "m" (next->thread.sp), [next_ip] "m" (next->thread.ip), /* regparm parameters for __switch_to(): */ [prev] "a" (prev), [next] "d" (next) __switch_canary_iparam : /* reloaded segment registers */ "memory"); } while (0) • (The line-continuation backslashes of the macro are omitted here.)
  • 95. Extended inline assembler (10) • The key points • The context lives on the stack: switching %esp switches the task • The switched-in task resumes at “1:” (just after “jmp __switch_to”) • The “__switch_to” function is entered with a “jmp” instruction, not a “call” instruction. • Also note • The template uses the %[name] style rather than %n (numbers) because there are too many operands. asm volatile(... "movl %%esp,%[prev_sp]\n\t" /* save ESP */ ... /* output parameters */ : [prev_sp] "=m" (prev->thread.sp),
  • 96. Exercise: RDTSC • RDTSC instruction • Input: None • Output: EDX (higher 32 bits), EAX (lower 32 bits) • Fill in the body of: unsigned long long rdtsc(void) { } • Answer: unsigned int high, low; asm volatile("rdtsc" : "=d" (high), "=a" (low)); return ((unsigned long long)high << 32) | low;
  • 97. Answer: rdtscll 97 #define rdtscll(val) ((val) = __native_read_tsc()) static __always_inline unsigned long long __native_read_tsc(void) { DECLARE_ARGS(val, low, high); asm volatile("rdtsc" : EAX_EDX_RET(val, low, high)); return EAX_EDX_VAL(val, low, high); } #ifdef CONFIG_X86_64 #define DECLARE_ARGS(val, low, high) unsigned low, high #define EAX_EDX_VAL(val, low, high) ((low) | ((u64)(high) << 32)) #define EAX_EDX_ARGS(val, low, high) "a" (low), "d" (high) #define EAX_EDX_RET(val, low, high) "=a" (low), "=d" (high) #else #define DECLARE_ARGS(val, low, high) unsigned long long val #define EAX_EDX_VAL(val, low, high) (val) #define EAX_EDX_ARGS(val, low, high) "A" (val) #define EAX_EDX_RET(val, low, high) "=A" (val) #endif
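The way the two halves are joined on 64-bit (the EAX_EDX_VAL macro above) can be written as an ordinary function. This is only a portable sketch of that one expression; combine_edx_eax is a hypothetical name and no RDTSC instruction is executed here.

```c
#include <assert.h>
#include <stdint.h>

/* Join the EAX (low) and EDX (high) halves into one 64-bit counter value.
 * Both halves must be widened to 64 bits before shifting. */
static inline uint64_t combine_edx_eax(uint32_t low, uint32_t high)
{
    return (uint64_t)low | ((uint64_t)high << 32);
}
```

Holding the halves in variables narrower than 32 bits would silently truncate the counter, which is why the operand types matter as much as the constraints.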
  • 98. A-2. Protected Mode Again, full of the assembly code! 98
  • 99. Protected-Mode Kernel (p.54) • arch/x86/boot/compressed/head_{32,64}.S • Goal: Decompress the kernel (vmlinux.gz/.bz2/.xz…) and start it • Relocates the decompressing code (if relocatable and loaded at a different address) • Enables paging and enters long mode (in head_64.S) • Clears the BSS, and prepares the heap and stack • Decompresses the kernel • Relocates if required • RANDOMIZE_BASE or RELOCATABLE (in 32-bit)
  • 100. LD script? • Linker scripts: arch/x86/boot/setup.ld, arch/x86/boot/compressed/vmlinux.lds.S, arch/x86/kernel/vmlinux.lds.S • (arch/x86/boot/compressed/vmlinux.lds.S) ... #ifdef CONFIG_X86_64 OUTPUT_ARCH(i386:x86-64) ENTRY(startup_64) #else OUTPUT_ARCH(i386) ENTRY(startup_32) #endif SECTIONS { /* Be careful parts of head_64.S * assume startup_32 is at address 0. */ . = 0; .head.text : { _head = . ; HEAD_TEXT _ehead = . ; } .rodata..compressed : { *(.rodata..compressed) } ... • [Figure: section layout starting at address 0: .head.text (_head to _ehead), .rodata..compressed, .text (_text to _etext), .rodata (_rodata to _erodata), .got (_got to _egot), .data (_data to _edata), .bss (_bss to _ebss), .pgtable (64-bit only; _pgtable to _epgtable), then _end]
  • 101. mkpiggy • Section “.rodata..compressed” consists of the compressed kernel (vmlinux) • (arch/x86/boot/compressed/mkpiggy.c) printf(".section \".rodata..compressed\",\"a\",@progbits\n"); printf(".globl z_input_len\n"); printf("z_input_len = %lu\n", ilen); printf(".globl z_output_len\n"); printf("z_output_len = %lu\n", (unsigned long)olen); printf(".globl z_extract_offset\n"); printf("z_extract_offset = 0x%lx\n", offs); /* z_extract_offset_negative allows simplification of head_32.S */ printf(".globl z_extract_offset_negative\n"); printf("z_extract_offset_negative = -0x%lx\n", offs); printf(".globl input_data, input_data_end\n"); printf("input_data:\n"); printf(".incbin \"%s\"\n", argv[1]); printf("input_data_end:\n");
  • 102. Entry point (32-bit) • (arch/x86/boot/compressed/head_32.S) .text __HEAD ENTRY(startup_32) #ifdef CONFIG_EFI_STUB jmp preferred_addr ... preferred_addr: #endif cld testb $(1<<6), BP_loadflags(%esi) jnz 1f cli movl $__BOOT_DS, %eax movl %eax, %ds movl %eax, %es movl %eax, %fs movl %eax, %gs movl %eax, %ss 1: • __HEAD expands to .section ".head.text","ax" • If KEEP_SEGMENTS (bit 6) is set in loadflags in boot_params, the segment registers are not reloaded.
  • 103. Protected-Mode Protocol (p.53) • Starts at the top of the protected mode kernel • Usually loaded at 0x100000 (1MB) • Can be at any position if compiled as relocatable • Should be at the same position as specified in the compile time if compiled as not relocatable • Used in “linux” module in GRUB2 • [Protocol] At the entry point, • The loaded GDT must have __BOOT_CS (0x10 / execute and read) and __BOOT_DS(0x18 / read and write) • %cs must be __BOOT_CS • %ds, %es, and %ss must be __BOOT_DS • Interrupts must be disabled • %esi must be the address for struct boot_params • %ebp, %edi, and %ebx must be zero. 103
  • 105. Where are we? • (arch/x86/boot/compressed/head_32.S) leal (BP_scratch+4)(%esi), %esp call 1f 1: popl %ebp subl $1b, %ebp • The call instruction pushes the return address onto the stack • The return address is the address of the instruction following the call, i.e. the label 1 • The immediately following “pop” takes that return address off the stack, i.e. the absolute physical address of label 1 • Subtracting the link-time address of label 1 (written $1b because the label is now behind the current instruction) from %ebp gives the offset between the actual address and the compile-time address (0-based, as seen in the lds).
  • 106. Memory View • [Figure: memory view, toward higher addresses: RM kernel and BP (pointed to by %esi); PM kernel = head + compressed (at %ebp); goal: vmlinux (decompressed), with the relocated kernel (head + compressed) placed at z_extract_offset] • z_extract_offset (mkpiggy.c): offs = (olen > ilen) ? olen - ilen : 0; offs += olen >> 12; /* Add 8 bytes for each 32K block */ offs += 64*1024 + 128; /* Add 64K + 128 bytes slack */ offs = (offs+4095) & ~4095; /* Round to a 4K boundary */ ... printf("z_extract_offset = 0x%lx\n", offs); • LOAD_PHYSICAL_ADDR (arch/x86/include/asm/boot.h): #define LOAD_PHYSICAL_ADDR ((CONFIG_PHYSICAL_START + (CONFIG_PHYSICAL_ALIGN - 1)) & ~(CONFIG_PHYSICAL_ALIGN - 1)) * Default: 0x1000000
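The offset computation quoted from mkpiggy.c can be tried out as a standalone function. This is a sketch that wraps the same four statements; the function name compute_extract_offset is hypothetical, and the image sizes used below are made-up values.

```c
#include <assert.h>

/* ilen: compressed size, olen: decompressed size (both in bytes). */
static unsigned long compute_extract_offset(unsigned long ilen, unsigned long olen)
{
    unsigned long offs;

    offs = (olen > ilen) ? olen - ilen : 0;  /* room for the size difference */
    offs += olen >> 12;                      /* per-block overhead, as in mkpiggy.c */
    offs += 64 * 1024 + 128;                 /* 64K + 128 bytes of slack */
    offs = (offs + 4095) & ~4095UL;          /* round up to a 4K boundary */
    return offs;
}
```

For a hypothetical 4,000,000-byte compressed image that expands to 16,000,000 bytes, the result is 12,070,912 (0xB83000), a little more than the size difference.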
  • 107. Determine where to decompress • If CONFIG_RELOCATABLE • Use the current position, rounded up to BP_kernel_alignment • Default: 2MB alignment • If the result is less than LOAD_PHYSICAL_ADDR, LOAD_PHYSICAL_ADDR is used • If not CONFIG_RELOCATABLE • LOAD_PHYSICAL_ADDR is used • Now %ebx is the target address • (arch/x86/boot/compressed/head_32.S) #ifdef CONFIG_RELOCATABLE movl %ebp, %ebx movl BP_kernel_alignment(%esi), %eax decl %eax addl %eax, %ebx notl %eax andl %eax, %ebx cmpl $LOAD_PHYSICAL_ADDR, %ebx jge 1f #endif movl $LOAD_PHYSICAL_ADDR, %ebx 1:
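The same decision can be written in C to make the rounding step explicit. This is a minimal sketch: align_up and choose_decompress_base are hypothetical names, the alignment is assumed to be a power of two, and an unsigned comparison is used where the assembly uses jge.

```c
#include <assert.h>
#include <stdint.h>

/* Round addr up to the next multiple of align (align must be a power of two). */
static inline uint32_t align_up(uint32_t addr, uint32_t align)
{
    return (addr + (align - 1)) & ~(align - 1);
}

/* Model of the code above: returns the value that ends up in %ebx. */
static inline uint32_t choose_decompress_base(uint32_t loaded_at, uint32_t alignment,
                                              uint32_t load_physical_addr, int relocatable)
{
    if (relocatable) {
        uint32_t base = align_up(loaded_at, alignment);
        if (base >= load_physical_addr)
            return base;
    }
    return load_physical_addr;
}
```

A kernel loaded at 0x1234567 with 2MB alignment would therefore be decompressed at 0x1400000, while one loaded at 1MB falls back to the default 0x1000000.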
  • 108. Copy the decompression code • Copy the area from the head of the PM kernel (startup_32) to just before the start of the bss. • The code copies backwards (from the highest word down) in case the source and destination overlap addl $z_extract_offset, %ebx leal boot_stack_end(%ebx), %esp pushl $0 popfl pushl %esi leal (_bss-4)(%ebp), %esi leal (_bss-4)(%ebx), %edi movl $(_bss - startup_32), %ecx shrl $2, %ecx std rep movsl cld popl %esi • [Figure: PM kernel at %ebp; relocated copy at %ebx, z_extract_offset above the start of the decompressed vmlinux]
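Why the direction matters can be seen in a small C model of the std / rep movsl sequence. This is a sketch only; copy_words_backward is a hypothetical name, and it assumes the destination lies at a higher address than the source, as in the relocation above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy nwords 32-bit words from src to dst, starting with the last word.
 * Safe when dst > src and the two ranges overlap, because every source
 * word is read before the copy can overwrite it. */
static void copy_words_backward(uint32_t *dst, const uint32_t *src, size_t nwords)
{
    for (size_t i = nwords; i-- > 0; )
        dst[i] = src[i];
}
```

Copying {1, 2, 3, 4} two words upward within the same buffer gives {1, 2, 1, 2, 3, 4}; a forward copy would have overwritten words 3 and 4 before reading them.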
  • 109. Jump to the relocated address • Jump to the copied decompression code • The decompression code is at the end of the PM kernel • Just after the compressed kernel image • Clear the BSS leal relocated(%ebx), %eax jmp *%eax ENDPROC(startup_32) .text relocated: xorl %eax, %eax leal _bss(%ebx), %edi leal _ebss(%ebx), %ecx subl %edi, %ecx shrl $2, %ecx rep stosl • [Figure: relocated kernel at %ebx above the area for the decompressed vmlinux; the label relocated lies in the relocated kernel]
  • 110. Why z_extract_offset? • The PM kernel contains the compressed kernel image • The relocating (copying) code is located at the head of the PM kernel • The decompression code is located at the tail of the PM kernel • Decompressing after relocation is safe because z_extract_offset plus the compressed image size is larger than the decompressed image size, so the decompressed output stays behind the compressed data that is still to be read. • [Figure: before relocation: head, compressed, decomp; after relocation: decompressed output area (work area) starting z_extract_offset below the relocated head, compressed, decomp]
  • 111. Fix up the absolute addresses • The decompression code is built with -fPIC (position independent code), and so fixing up the absolute addresses is achieved by modifying the addresses in GOT (Global Offset Table). 111 /* * Adjust our own GOT */ leal _got(%ebx), %edx leal _egot(%ebx), %ecx 1: cmpl %ecx, %edx jae 2f addl %ebx, (%edx) addl $4, %edx jmp 1b 2: %ebx Relocated kernel
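The GOT loop translates almost directly into C. The following is a sketch of the same idea with hypothetical names (adjust_got, load_base); the real code works on the linker-provided symbols _got and _egot rather than on an ordinary array.

```c
#include <assert.h>
#include <stdint.h>

/* Add the load address to every 32-bit GOT slot in [got, egot). */
static void adjust_got(uint32_t *got, uint32_t *egot, uint32_t load_base)
{
    for (uint32_t *entry = got; entry < egot; entry++)
        *entry += load_base;   /* link-time (0-based) address -> actual address */
}
```

Since the image is linked at address 0, each slot initially holds a plain offset, and adding the value of %ebx turns it into a usable absolute address.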
  • 113. Call the decompression routine • Call the decompress_kernel function in C • asmlinkage __visible void *decompress_kernel(void *rmode, memptr heap, unsigned char *input_data, unsigned long input_len, unsigned char *output, unsigned long output_len) 113 pushl $z_output_len /* decompressed length */ leal z_extract_offset_negative(%ebx), %ebp pushl %ebp /* output address */ pushl $z_input_len /* input_len */ leal input_data(%ebx), %eax pushl %eax /* input_data */ leal boot_heap(%ebx), %eax pushl %eax /* heap area */ pushl %esi /* real mode pointer */ call decompress_kernel /* returns kernel location in %eax */ BP Relocated vmlinux (decompressed) %ebx z_extract_offset %esi
  • 114. Decompressing • (arch/x86/boot/compressed/misc.c) asmlinkage __visible void *decompress_kernel(...) { ... output = choose_kernel_location(input_data, input_len, output, output_len); ... #ifndef CONFIG_RELOCATABLE if ((unsigned long)output != LOAD_PHYSICAL_ADDR) error("Wrong destination address"); #endif debug_putstr("\nDecompressing Linux... "); decompress(input_data, input_len, NULL, NULL, output, NULL, error); parse_elf(output); handle_relocations(output, output_len); debug_putstr("done.\nBooting the kernel.\n"); return output; }
  • 115. Choosing the destination • The choose_kernel_location function • If KASLR is enabled, it computes some random output address (aslr.c) • Otherwise, it just returns the output parameter 115
  • 116. Decompressing the kernel • The decompress function does all the work • The implementation is located in lib/decompress_*.c and is included directly into arch/x86/boot/compressed/misc.c: #ifdef CONFIG_KERNEL_GZIP #include "../../../../lib/decompress_inflate.c" #endif #ifdef CONFIG_KERNEL_BZIP2 #include "../../../../lib/decompress_bunzip2.c" #endif #ifdef CONFIG_KERNEL_XZ #include "../../../../lib/decompress_unxz.c" #endif ...
  • 117. Load the ELF • parse_elf • Parses the ELF header and copies each loadable segment to the location given by its program header (p_paddr) • If relocatable, p_paddr is offset by the difference between the actual load address and LOAD_PHYSICAL_ADDR. for (i = 0; i < ehdr.e_phnum; i++) { ... switch (phdr->p_type) { case PT_LOAD: #ifdef CONFIG_RELOCATABLE dest = output; dest += (phdr->p_paddr - LOAD_PHYSICAL_ADDR); #else dest = (void *)(phdr->p_paddr); #endif memcpy(dest, output + phdr->p_offset, phdr->p_filesz); break; ... } } typedef struct elf32_phdr{ Elf32_Word p_type; Elf32_Off p_offset; Elf32_Addr p_vaddr; Elf32_Addr p_paddr; Elf32_Word p_filesz; Elf32_Word p_memsz; Elf32_Word p_flags; Elf32_Word p_align; } Elf32_Phdr;
  • 119. Relocate the kernel image • Relocation information (generated by the “relocs” tool) is appended just after the ELF image • The relocation information is a list of the locations in the kernel image that hold absolute addresses • These locations are all expressed as kernel virtual addresses • [Figure: vmlinux (ELF), then 0, the 64-bit relocation addresses, 0, and the 32-bit relocation addresses at the very end] $ objdump -adr vmlinux … c1086910 <vfs_llseek>: c1086910: 55 push %ebp ... c1086919: bb 60 63 08 c1 mov $0xc1086360,%ebx c108691a: R_386_32 no_llseek
  • 120. Calculate deltas • __START_KERNEL_map • In 32-bit, PAGE_OFFSET (default: 0xC0000000) • In 64-bit, 0xffffffff80000000 static void handle_relocations(void *output, unsigned long output_len) { ... unsigned long min_addr = (unsigned long)output; ... delta = min_addr - LOAD_PHYSICAL_ADDR; ... map = delta - __START_KERNEL_map; ... • delta: the difference between the compile-time physical address and the actual physical address • map: the offset that converts a kernel virtual address into the actual physical address
  • 121. Apply the relocation for (reloc = output + output_len - sizeof(*reloc); *reloc; reloc--) { int extended = *reloc; extended += map; ptr = (unsigned long)extended; if (ptr < min_addr || ptr > max_addr) error("32-bit relocation outside of kernel!\n"); *(uint32_t *)ptr += delta; } #ifdef CONFIG_X86_64 for (reloc--; *reloc; reloc--) { long extended = *reloc; extended += map; ptr = (unsigned long)extended; if (ptr < min_addr || ptr > max_addr) error("64-bit relocation outside of kernel!\n"); *(uint64_t *)ptr += delta; } #endif
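The interplay of delta and map is easier to follow on a toy image. The sketch below models the 32-bit loop in user space under several assumptions: the function name apply_relocs32 and its parameters are hypothetical, the relocation table is passed as an array with a count instead of being walked backwards to a zero terminator, and the byte buffer stands in for physical memory starting at actual_base.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 0 on success, -1 if a relocation points outside the image. */
static int apply_relocs32(uint8_t *image, size_t len,
                          uint32_t actual_base,  /* where the image really is */
                          uint32_t link_base,    /* LOAD_PHYSICAL_ADDR */
                          uint32_t kernel_map,   /* __START_KERNEL_map */
                          const uint32_t *relocs, size_t nrelocs)
{
    uint32_t delta = actual_base - link_base;
    uint32_t map = delta - kernel_map;           /* wraps modulo 2^32 */

    if (len < sizeof(uint32_t))
        return -1;
    for (size_t i = 0; i < nrelocs; i++) {
        uint32_t phys = relocs[i] + map;         /* virtual -> actual physical */
        uint32_t word;

        if (phys < actual_base || phys - actual_base > len - sizeof(word))
            return -1;
        memcpy(&word, image + (phys - actual_base), sizeof(word));
        word += delta;                           /* patch the absolute address */
        memcpy(image + (phys - actual_base), &word, sizeof(word));
    }
    return 0;
}
```

With link_base = 0x1000000, kernel_map = 0xC0000000 and actual_base = 0x1400000, a relocation entry 0xC1000010 selects offset 0x10 in the image, and a stored address 0xC1000020 there becomes 0xC1400020.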
  • 122. OK, go to the entry point • The entry point is always at the head of the kernel • decompress_kernel returns the “output” • The assembly code jumps into the entry point 122 asmlinkage __visible void *decompress_kernel(...) { ... output = choose_kernel_location(input_data, input_len, output, output_len); ... return output; } /* * Jump to the decompressed kernel. */ xorl %ebx, %ebx jmp *%eax
  • 123. Next • Go on to startup_32/startup_64 123