OSTEP: Ch6: Mechanism: Limited Direct Execution: Library


20th August 2020 at 2:19pm

Crux	How to efficiently virtualize the CPU with control?
Solution	Limited Direct Execution

When running user programs, the OS has to achieve two things:

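Restrict what user programs can do: for example, a user program cannot access the underlying hardware directly, and reading or writing a file must pass the permission checks of the OS's file system.
Keep control of the system: for example, once the OS starts running a user program, it must still be able to make the CPU run OS code again.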

Restricting what user programs can do

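To enforce uniform scheduling, security, and so on, the OS does not let user programs perform core operations directly. The simplest example is that a user program cannot access the disk device directly; it must operate on the file system through the APIs the OS provides (open(), read(), etc.), and access is granted only after a permission check. Without this restriction, the OS could not provide the file-system abstraction at all.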

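This is why the CPU runs in two modes, user mode and kernel mode. The core operations mentioned above must be performed in kernel mode by the OS's own code. When a user program needs one of them, it invokes a system call (syscall for short) provided by the OS. A syscall looks much like an ordinary function call, but invoking it switches the processor into kernel mode to carry out the operation.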

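This requires hardware support. The code behind a syscall is itself resident in memory at some entry address, but for safety a user program is not allowed to jump there directly (it does not even need to know the address), so the hardware must transfer control on its behalf. At boot, the OS initializes a trap table (also called a trap handler table) that tells the CPU the addresses of its trap handlers. Each syscall is assigned a number: to make a syscall, the user program places that number in a register (or at a specified location on the stack) and executes a special trap instruction; the CPU jumps to the trap handler recorded in the trap table, and the OS's handler checks that the number is valid before running the corresponding code. Note that the kernel has its own stack (the kernel stack) and register values, so moving from the user program into a syscall also involves saving and restoring context. In outline: the hardware saves the user program's registers onto its kernel stack, switches to kernel mode, and jumps to the trap handler; when the OS is done, it executes a return-from-trap instruction, which restores those registers, switches back to user mode, and resumes the program at the instruction after the trap.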

Keeping control of the system

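While the CPU is running an application, it is by definition not running OS code. So how can the OS regain control instead of letting the application run forever? There are two approaches: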

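Cooperative: wait until the application makes a syscall, at which point control returns to the OS. Some early operating systems were designed this way, and they also provided an explicit yield syscall so that an application could hand control back even when it had no other syscall to make.
Non-cooperative (also called preemptive): the OS and hardware together implement a timer interrupt that fires at a fixed interval (every few milliseconds) and hands control back to the OS, which can then decide whether to keep running the current process or switch to another. This is what lets the OS give each process a processor time slice.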

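With the timer interrupt, the whole flow is as follows (k-stack denotes a process's kernel stack, proc_t its process structure). At boot, the OS initializes the trap table and starts the interrupt timer. While process A runs in user mode, the timer interrupts; the hardware saves A's user registers onto A's k-stack, switches to kernel mode, and jumps to the trap handler. If the OS decides to switch to process B, it calls switch(), which saves A's kernel registers into proc_t(A), restores B's kernel registers from proc_t(B), and switches to B's k-stack. The OS then returns from trap: the hardware restores B's user registers from B's k-stack, switches to user mode, and resumes B.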

Comparing this flow with the system-call flow above shows that context switching is at the core of both. It is needed in two situations:

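When entering kernel mode from user mode: save the user process's registers and stack state, and switch to the kernel's registers and stack.
When switching processes: save the current process's context and restore the context of the process about to run.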

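Note that both the hardware and the OS save register state, at different levels. On a trap or timer interrupt, the hardware implicitly saves the running process's user registers onto its kernel stack; during a process switch, the OS explicitly saves the kernel registers into the current process's structure (proc_t) and restores those of the next process, and it also keeps track of any other per-process state it needs.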

ASIDE: KEY PROCESS TERMS

The CPU should support at least two modes of execution: a restricted user mode and a privileged (non-restricted) kernel mode.
Typical user applications run in user mode, and use a system call to trap into the kernel to request operating system services.
The trap instruction saves register state carefully, changes the hardware status to kernel mode, and jumps into the OS at a pre-specified destination recorded in the trap table.
When the OS finishes servicing a system call, it returns to the user program via another special return-from-trap instruction, which reduces privilege and returns control to the instruction after the trap that jumped into the OS.
The trap tables must be set up by the OS at boot time, and the OS must make sure that they cannot be readily modified by user programs. All of this is part of the limited direct execution protocol, which runs programs efficiently but without loss of OS control.
Once a program is running, the OS must use hardware mechanisms to ensure the user program does not run forever, namely the timer interrupt. This approach is a non-cooperative approach to CPU scheduling.
Sometimes the OS, during a timer interrupt or system call, might wish to switch from running the current process to a different one, a low-level technique known as a context switch.