Getting to userspace - and back!
Until now, SnowflakeOS ran entirely in kernel mode, or ring 0. Now, the time has
come for it to move on to better places, those of userland, also called ring 3.
The transition to having processes roam free in ring 3 was mostly made in the
series of commits from here to there, and I encourage readers[1] to check them out.
What’s a process, anyway?
Well, as of right now, a process is described by the following structure:
typedef struct _proc_t {
    struct _proc_t* next;
    uint32_t pid;
    // Length of program memory in number of pages
    uint32_t stack_len;
    uint32_t code_len;
    uintptr_t directory;
    uintptr_t kernel_stack;
    registers_t registers;
} process_t;
I’ll explain. A process consists of executable code, a stack, and an execution
context. There’s also a second stack for execution in the kernel; I’ll get back to it.
The executable code along with the stack is stored in physical memory somewhere,
of course, but it needs to be mapped through paging to be accessible (if only to
write the code to the physical page!), so what we store is a page directory. It
maps addresses 0x00000000+ to the pages containing our code, and addresses
below our kernel to our stack pages. It also maps the kernel exactly as the
initial kernel page directory did; it’s a copy. Notice that once the code is
copied to physical memory, the page directory suffices to reference it; physical
memory never moves. We only keep track of the size occupied by the code and stack,
though I haven’t made these things dynamic yet: both are single 4 MiB pages.
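To make the layout concrete, here is a minimal sketch of how such a directory could be filled in with 4 MiB pages. The names (pd_index, pde_user_4m, map_process) are hypothetical, and the kernel base of 0xC0000000 is an assumption; the flag bits are the standard x86 page directory entry bits.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_4M_SHIFT 22            /* 4 MiB = 1 << 22 bytes */
#define PDE_PRESENT   (1u << 0)
#define PDE_WRITABLE  (1u << 1)
#define PDE_USER      (1u << 2)     /* accessible from ring 3 */
#define PDE_4M        (1u << 7)     /* page size bit, requires CR4.PSE */
#define KERNEL_BASE   0xC0000000u   /* assumed kernel virtual base */

/* Index of the page directory entry covering a virtual address. */
static inline uint32_t pd_index(uint32_t vaddr) {
    return vaddr >> PAGE_4M_SHIFT;
}

/* Build a user-accessible 4 MiB entry for a 4 MiB-aligned physical frame. */
static inline uint32_t pde_user_4m(uint32_t phys) {
    assert((phys & ((1u << PAGE_4M_SHIFT) - 1)) == 0);
    return phys | PDE_PRESENT | PDE_WRITABLE | PDE_USER | PDE_4M;
}

/* Fill in the two user mappings: code at address 0, stack in the last
 * 4 MiB below the kernel. Kernel entries (index 768 and up) are assumed
 * to be copied from the initial directory elsewhere. */
static void map_process(uint32_t directory[1024],
                        uint32_t code_phys, uint32_t stack_phys) {
    directory[pd_index(0x00000000u)] = pde_user_4m(code_phys);
    directory[pd_index(KERNEL_BASE) - 1] = pde_user_4m(stack_phys);
}
```

With these assumptions the code lands in entry 0 and the stack in entry 767, right below the first kernel entry.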
The execution context consists of basically one thing: the process’s registers, to
the surprise of no one who’s written assembly before. Indeed, registers do not
only include working registers like eax et al.; there’s also eip storing the
address of the next instruction to be executed, esp storing the stack pointer,
eflags… The page directory can be considered context too, as it
may hold dynamically allocated memory (not yet implemented).
Let’s switch to it
With that said, starting a process in usermode is simply a matter of switching to
a correctly set up page directory, and pointing eip and esp to the right place!
Resuming an interrupted process works the same way, except that its registers are
restored beforehand.
[Edit: this was a pretty poor way of resuming a process, see the next post
for a better one]
The way to do these things is a bit convoluted though, as we need to iret (aka
interrupt return) to our code. We can’t just jump to it, the reason being that we
need to change privilege level (from 0 to 3):
asm volatile (
    "push $0x23\n"    // user data segment selector, loaded into %ss
    "mov %0, %%eax\n"
    "push %%eax\n"    // %esp
    "push $512\n"     // %eflags with IF (bit 9) set so interrupts stay enabled
    "push $0x1B\n"    // user code segment selector, loaded into %cs
    "mov %1, %%eax\n"
    "push %%eax\n"    // %eip
    "iret\n"
    : /* outputs: none */
    : "r" (esp_val), "r" (eip_val) /* inputs %0 and %1, in any register */
    : "%eax" /* registers clobbered by hand in there */
);
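The magic numbers pushed above are easier to read once decoded. The sketch below unpacks a segment selector into its standard x86 fields (descriptor index, table indicator, requested privilege level) and checks the interrupt flag in an eflags value. The helper names are hypothetical; that 0x1B and 0x23 point at GDT entries 3 and 4 follows from the encoding, not from anything SnowflakeOS-specific.

```c
#include <assert.h>
#include <stdint.h>

/* An x86 segment selector: index in bits 3..15, table indicator in bit 2,
 * requested privilege level (RPL) in bits 0..1. */
typedef struct {
    uint16_t index;   /* descriptor index in the GDT (or LDT) */
    uint8_t  use_ldt; /* 0 = GDT, 1 = LDT */
    uint8_t  rpl;     /* requested privilege level, 0..3 */
} selector_t;

static selector_t decode_selector(uint16_t raw) {
    selector_t s;
    s.index   = (uint16_t)(raw >> 3);
    s.use_ldt = (uint8_t)((raw >> 2) & 1);
    s.rpl     = (uint8_t)(raw & 3);
    return s;
}

/* Build a GDT selector from a descriptor index and a privilege level. */
static uint16_t make_selector(uint16_t index, uint8_t rpl) {
    assert(rpl <= 3);
    return (uint16_t)((index << 3) | rpl);
}

#define EFLAGS_IF (1u << 9) /* interrupt enable flag */

static int interrupts_enabled(uint32_t eflags) {
    return (eflags & EFLAGS_IF) != 0;
}
```

So 0x1B is GDT entry 3 with RPL 3, 0x23 is GDT entry 4 with RPL 3, and 512 is exactly the IF bit.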
Getting control back
How do we get back to kernel mode execution once we’ve made the jump? The
answer is twofold: through syscalls and scheduling - interrupts in both cases.
Because right now scheduling is handled through syscalls in SnowflakeOS, I’ll
describe only those. In our kernel, calling int $48 with eax set to n triggers
the nth syscall: process execution stops, the privilege level changes to 0 and
execution then resumes in the syscall handler. Because the kernel is mapped into
our process’s address space, no page directory switch is needed here. However, one
thing needs to be changed, and it’s the stack: we can’t just pollute the process’s
stack, and x86 gives us (forces us to use) a mechanism to switch stacks as part
of the jump to ring 0. So what’s usually done, and what I did, is allocate
memory for a kernel stack per process and, before switching to the execution of
that process, set the “stack-to-be-switched-to” variable (in the TSS, which is
referenced by a GDT entry) to that stack.
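As a rough illustration of that last step, here is a sketch of pointing the TSS at a process’s kernel stack. Only the first fields of the 32-bit TSS are declared; the function name, the 0x10 kernel data selector and the choice of storing the stack’s base address (rather than its top) are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Leading fields of the 32-bit TSS; the rest of the structure is omitted
 * here because only esp0 and ss0 matter for the ring 3 -> ring 0 switch. */
typedef struct {
    uint32_t prev_tss;
    uint32_t esp0; /* stack pointer loaded when entering ring 0 */
    uint32_t ss0;  /* stack segment loaded alongside it */
} tss_head_t;

#define KERNEL_DATA_SELECTOR 0x10u /* assumed: GDT entry 2, RPL 0 */

/* Point the TSS at a process's kernel stack before switching to it.
 * The stack grows downwards, so esp0 is the address just past its end. */
static void tss_set_kernel_stack(tss_head_t* tss,
                                 uint32_t stack_base, uint32_t stack_size) {
    assert(tss != 0);
    tss->esp0 = stack_base + stack_size;
    tss->ss0  = KERNEL_DATA_SELECTOR;
}
```

The scheduler would call something like this right before the iret into the process, so that the next int $48 lands on that process’s own kernel stack.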
Right now syscall 0 is yield, and it switches execution to the process pointed
to by the next pointer in the process_t structure. It’s cooperative
multitasking; preemptive multitasking will come later.
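Stripped of the register saving and the actual switch, the scheduling decision behind yield is tiny. Below is a sketch of it on a simplified process structure; proc_t, proc_add and proc_yield are hypothetical names, and keeping the list circular is an assumption about how next is used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for process_t: only what the scheduler needs. */
typedef struct proc {
    struct proc* next;
    uint32_t pid;
} proc_t;

static proc_t* current = NULL;

/* Insert a process right after the current one, keeping the list circular. */
static void proc_add(proc_t* p) {
    if (current == NULL) {
        p->next = p;
        current = p;
    } else {
        p->next = current->next;
        current->next = p;
    }
}

/* The decision made by syscall 0: hand the CPU to the next process.
 * A real kernel would save the caller's registers before this and
 * restore the new process's registers after it. */
static proc_t* proc_yield(void) {
    assert(current != NULL);
    current = current->next;
    return current;
}
```

With two processes in the list, successive calls simply alternate between them, which is all cooperative multitasking needs.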
Collateral damage
kmalloc is dead, hail the new kmalloc
In the process of implementing that, I got rid of liballoc and replaced it with
a dead-simple mechanism: I now map a certain zone after the kernel, and kmalloc
advances a pointer through it with each allocation. Sure, it doesn’t allow freeing
memory, but it has a great advantage: the previous kmalloc needed to modify
kernel page tables dynamically, and those changes aren’t automatically reflected
across all copies of the initial page directory! It’s important because when a
context switch happens while in kernel mode, for instance, the previous kernel
stack must remain mapped, as there are no stack switches when the privilege level
doesn’t change.
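For reference, a pointer-bumping allocator of this kind fits in a few lines. This is a sketch rather than the SnowflakeOS code: the static array stands in for the zone mapped after the kernel, and bump_alloc, HEAP_SIZE and the alignment parameter are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HEAP_SIZE (64 * 1024)

/* Stands in for the zone mapped after the kernel. */
static uint8_t heap[HEAP_SIZE];
static size_t heap_used = 0;

/* Allocate size bytes at an offset that is a multiple of align (a power
 * of two). The offset is aligned relative to the start of the zone, so
 * the zone itself is assumed to start on a suitably aligned address.
 * Returns NULL when the zone is exhausted; nothing is ever freed. */
static void* bump_alloc(size_t size, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    size_t start = (heap_used + align - 1) & ~(align - 1);
    if (start > HEAP_SIZE || size > HEAP_SIZE - start) {
        return NULL;
    }
    heap_used = start + size;
    return &heap[start];
}
```

Because the whole zone is mapped once, up front, allocating never touches the page tables, which is exactly the property that matters here.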
I’ll reintroduce a cool allocator with the userspace libc; the kernel
will stay simple and only grant processes a page at a time. I’m not sure how
I’ll distribute the virtual address space; in fact, I’ll have to think about it.
Same way I did the kernel heap, probably, but right in the middle of
0x0-0xC0000000.
Implementing useful system calls
The real fun begins now that things are in working order: syscalls!
I started with yield to get started with multitasking, and quickly implemented
exit, though testing that last one took a while: I spent hours debugging
assembly in Bochs to get yield to work.
Only then did I start implementing a putchar syscall. First, I needed to make my
TTY work in processes, which involved mapping it somewhere; I decided to put it
in my kernel heap. Tada, I can print again! I should have started
by doing that, it would have made a lot of debugging easier; but then again I’ve
learned a lot of assembly without it. Implementing the syscall was trivial
after that, and I was able to get my first “Hello world” from userspace.
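On the kernel side, a handful of syscalls like these can be routed through a small table indexed by eax. The sketch below simulates that on a plain register structure. Only yield being syscall 0 comes from this post; the numbers for exit and putchar, passing the character in ebx, and every name in the code are assumptions, and a buffer stands in for the TTY.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Subset of the saved registers that the handlers look at. */
typedef struct {
    uint32_t eax; /* syscall number */
    uint32_t ebx; /* first argument (assumed convention) */
} regs_t;

typedef void (*syscall_fn)(regs_t*);

/* Only SYS_YIELD = 0 is from the post; the other numbers are made up. */
enum { SYS_YIELD = 0, SYS_EXIT = 1, SYS_PUTCHAR = 2, SYS_COUNT };

static char tty_buffer[128]; /* stand-in for the real TTY */
static size_t tty_len = 0;
static int yield_calls = 0;
static int exit_calls = 0;

static void sys_yield(regs_t* r) { (void)r; yield_calls++; }
static void sys_exit(regs_t* r) { (void)r; exit_calls++; }
static void sys_putchar(regs_t* r) {
    if (tty_len < sizeof(tty_buffer)) {
        tty_buffer[tty_len++] = (char)r->ebx;
    }
}

static const syscall_fn syscall_table[SYS_COUNT] = {
    [SYS_YIELD] = sys_yield,
    [SYS_EXIT] = sys_exit,
    [SYS_PUTCHAR] = sys_putchar,
};

/* Called from the int $48 handler: returns 0 on success and -1 for an
 * unknown syscall number. */
static int syscall_dispatch(regs_t* r) {
    assert(r != NULL);
    if (r->eax >= SYS_COUNT) {
        return -1;
    }
    syscall_table[r->eax](r);
    return 0;
}
```

Rejecting out-of-range numbers matters because eax comes straight from userspace and cannot be trusted.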
I tried my hand at a wait syscall to pause a process for some time, but I
approached it the wrong way, or at least in a way that Bochs liked, but QEMU did
not: waiting for time to pass in the syscall handler. In QEMU, timer interrupts
don’t trigger then, so time never passes :( I’ll get back to it from a scheduler
perspective.