* Regular function calls

Material from courses in compilers or architecture.

Assume some code calls a function in your own code, or a library function
like one in libc:

    x = foo(a, b, c);

At compile time, the compiler knows what fxn you're in now (let's say
main).  And it knows where the code for the fxn "foo" is, namely what
address to "jump" to, to execute the instructions of function "foo".

With shared libraries, the final linking to fxns from those libraries is
done only when you execute a program that uses shared libs, often by a
"runtime linker" (which is a sort of linker, like the 'ld' that's part of
the C compiler toolchain), see e.g., /etc/ld.so.

Right before we can execute 'foo', we need to do a few things:

1. Know where we are now, so we can come back there: the exact mem addr,
   e.g., in main().

2. Preserve the current state of the program: the current addr (which is
   in the Program Counter, or PC reg), and the current scope of the
   function that is executing (e.g., the values of all automatic
   variables).

3. In prep for calling function foo, need to "store" the values of the
   parameters being passed to foo, so they'll be available when foo()
   runs.  In C, we use "pass by value", which means that you don't pass a
   reference/ptr to an object, but a copy of the object's value.  This
   ensures that the variable inside 'foo' and the variable outside it are
   distinct: if foo changes the variable 'a' that's passed to it, foo's
   caller will NOT see that change.  This means that we have to allocate
   space for all params passed on to foo.

4. At compile time, the compiler determines how much space is needed for
   all of the params passed to foo: sizeof(a) + sizeof(b) + sizeof(c).

5. The compiler also needs to reserve space for the return address, back
   to main.  That'll be "sizeof(void*)" -- namely the size of a memory
   addr.

6. The compiler also has to reserve space for the return value 'x', or
   sizeof(x).

7. The compiler will determine that a stack frame of the right size N
   (for points 4, 5, and 6 above) is needed.
   So right before foo executes, the stack pointer (SP) register will be
   adjusted by that size N.  This is the act of "pushing" a new stack
   frame.

8. Most stacks in the STACK segment are virtual memory in the addr space
   of the user process, and the STACK often starts at high mem addrs and
   "grows" towards smaller mem addrs.  So before foo() can run, the
   compiled code does "SP = SP - N".

9. Next, the compiler will have inserted instrux to copy the values of
   params a, b, and c, as well as the current value of the PC, into that
   stack frame, relative to the newly adjusted SP reg.

10. The program can now execute the instruction "JSR foo" (whatever the
    actual mem addr of foo is in the current program).

11. Now the first instruction of 'foo' is executing, and all of the
    params it was expecting to have passed to it are already allocated
    and have their proper values/content.

12. Now foo executes, ....., until it gets to the end of the fxn or an
    explicit "return" statement.

13. The compiler will have embedded instructions, so that a C "return"
    statement translates to:
    - copy the return value to the location in the stack frame reserved
      for the retval.
    - adjust the stack pointer back by N mem addrs (this is "popping" the
      stack frame).
    - copy foo's retval (from wherever it is in mem) to the mem location
      of variable 'x' in the caller.
    - execute a return instruction, such as a GOTO or JMP, to the return
      addr that is in the current stack frame, which was saved from
      before (before we jumped to 'foo').  This adjusts the PC back to
      the very next instruction after foo() returns.

    x = foo(a, b, c);
      ^ // this is now the next instruction

Sometimes the compiler will insert code after you're back from foo, to
copy the value of 'x' from the previous stack frame to the local stack
frame.

Note: all this happens in the same addr space of the same process, and
thus the process has the same access to all of its own (virt) mem
throughout.
* Virtual memory and process memory segments

Each process has its own virtual memory that's different from other
processes' virtual memory.

Modern processors are 64 bit, which means that in principle they can
address up to 2^64 bytes of RAM (a lot).  For simplicity, the examples in
this course assume a 32-bit processor, which can address 2^32 bytes = 4GB
of RAM.

With a 32-bit CPU, a process's virt mem image looks like this, listed from
the lowest addresses to the highest:

1. TEXT segment, at the lowest addresses, which contains a read-only,
   executable set of pages that correspond to the actual binary on
   storage media.

2. Optional STATICS readonly segment, which includes 'const' data and
   other constants, including C "double quoted" strings.

3. Next usually comes the HEAP segment, which is for dynamic mem
   allocations, such as malloc(3).  As you alloc more and more, the HEAP
   segment can grow: the OS will grow it as needed, or the process (or the
   malloc lib) can invoke special syscalls to extend the HEAP segment,
   calls like brk(2) or sbrk(2).

4. Usually what's in b/t, here, is a fairly large region.  STACK and HEAP
   can grow towards each other here.  In theory they could run into each
   other, but in practice the OS notices that and will abort the program
   with an error (e.g., SEGV).  Also, most OSs limit how much the HEAP
   and STACK can grow (see ulimit and the {get,set}rlimit syscalls).

   Usually mmap(2) regions and shared libraries will be mapped (as
   readonly/executable segments) in this range.

   The rest is an UN-allocated region of memory pages.  Note: these pages
   are not "allocated but uninitialized"; they have never been allocated
   at all.  Meaning: there's no virtual-to-physical mapping of those
   pages for this process.  ANY attempt to access any byte of an
   unallocated page will be trapped by the MMU, which interrupts the CPU
   (read: the OS will abort your program with SEGV, and core dump).

5. Usually at the far end of the addr space (the highest addresses) is
   the STACK segment.  It "grows" towards lower-numbered addresses; the
   OS extends it automatically, as needed.
Every valid mapping of virtual-to-physical also includes access flags:
Read, Write, or eXecute.  Any attempt to violate the page protections
results in a SEGV.

Q: How does phys mem get allocated to virt pages?  Wouldn't we run out of
   phys mem?
A: More later (when we discuss mem mgmt), but yes, the OS has to map v2p
   and p2v all the time, keeping only active pages mapped.  Any page that
   is not in active use is either preserved on the file system (if it is
   a readonly page) or stored in the 'swap' partition or "paging file".

* System Calls

Syscalls are different: the code for them is not in your process or in a
library, but rather inside the OS.  Kernel memory and state are protected;
no process can access them directly.

So how can we invoke some code INSIDE the kernel, but keep the same neat
interface as a function call:

    fd = open(name, flags, mode); // open(2) syscall

Q: How do we switch from running in user mode to kernel mode?

Q: How do we pass on the params to the kernel, and get back a retval?
A: Must have a shared medium to exchange these params b/t the user and
   the kernel.  Why not store the params in a file as a shared medium?
   B/c it's SLOW.  Why not store the params in RAM and give the kernel a
   virt addr?  Clearly the kernel can translate a process's virt addr into
   the physical page and copy those bytes...  But that's still too slow.
A: We use the fastest shared medium we can get, and that's storing
   syscall params in CPU registers.  Param 1 -> R1, param 2 -> R2, etc.
   This is esp. useful when you have plenty of spare general purpose
   registers to use.

Q: How do we tell the kernel which syscall to run?  What if we pass a
   "string name"?  That'll be too slow and cumbersome, b/c you have to
   pass a variable num of bytes.
A: A fixed number that fits neatly in one integer or even smaller.  Every
   syscall has a FIXED, well known number, that's hard coded inside
   (1) the OS, (2) libc, and (3) compilers (for ALL languages).
So at compile time, we know what the number for each syscall is, and hard
code it into the program.  Now, store this syscall num into yet another
register, let's say R5.  All of the above still happens in userland.

Now, when you invoke a syscall like open(2), what you're really executing
is some wrapper inside libc, that does the following:

- store the syscall number in a register
- store all params in registers
- tell the kernel to run the syscall; this is done using a special
  purpose interrupt.  On 32-bit x86 Linux this is historically "int 80h"
  (old MS-DOS used "int 21h" in the same way).
- When a user program invokes an interrupt... you cause your OWN program
  to stop running: whatever the CPU was running at that time is
  suspended, and an interrupt handler corresponding to the interrupt
  number is invoked.  These interrupt handlers are KERNEL functions.

Note: the syscall interrupt is unprivileged: any process can execute it.
Most interrupts are protected so only the kernel can invoke/accept them.

End of user mode code (for now)

Note: some system calls have a lot of parameters, for example, select(2).
Some CPU architectures don't have "enough" general purpose registers to
use for passing both the syscall num + all the arguments.  In that case,
the libc wrapper puts the syscall args into some mem location of the user
process (could be stack, heap, anywhere), and instead passes to the kernel
(in a register) the starting addr of the mem where all the syscall's args
are.

#####################

Now we're running inside the kernel, and we've just invoked the interrupt
handler that's associated with system calls.

syscall_interrupt_handler() {

    First, preserve the current state of the CPU (that'd be whatever the
    user process did, so we can resume it when we return from the
    syscall).  The user process is de-scheduled, i.e., its state changes
    from RUNNING to READY or WAITING (may depend on the syscall itself).

    Handler has to find out which syscall is asked to run and its
    parameters.
    It gets them from the registers' values that were stored by the
    syscall wrapper in user land.

    The kernel maintains a system call "table" or array of all system
    calls, indexed by their number.  The array includes info like how
    many params the syscall needs, and the fxn ptr inside the kernel to
    execute.

    Handler now prepares a "stack frame" in kernel space.  It copies the
    syscall params from registers into the frame.  It reserves space for
    the return value (in kernel), and then it "jumps" to the fxn ptr of
    the syscall by its number, for example

        syscall 1: call sys_read(...)
        syscall 2: call sys_write(...)
        etc.  all the way to 300-400 syscalls (linux has that many).

    The code to prepare the stack frame and jump to it is usually written
    in assembly, for efficiency, and also b/c you don't have a C runtime
    env running yet, not until you jump to sys_*().

    At this point, the handler is "done" without an explicit return():
    instead, it jumped to the syscall entry point inside the kernel, like
    sys_read, sys_open, etc.
}

Now the syscall is running in the kernel, with full privileges; it can do
anything it wants, while the process is suspended.

(skip what happens now -- subject of next few lectures).

When the syscall is done, it'll have resulted in some "side effects".  For
example, if the syscall is read(2), and it succeeded, then it'll have
copied some data into a user buffer provided by the user process.

Finally, when the sys_*() is ready to return, special assembly code is
invoked (which got compiled in when the kernel code was built), to:

- store the ret val from the syscall into a register.  Ret val can be
     0: usually means success
    >0: e.g., how many bytes read/written
    <0: an error occurred, like ENOMEM, EIO, EPERM, EACCES, many others.

The kernel knows which process asked to invoke the syscall, b/c that
process had to be suspended.

Once we have the retval and the kernel code is done, the end-of-syscall
code will also re-schedule the user process (i.e., put it back into READY
state).
At the end, the syscall code yield()s to the scheduler.  Now the scheduler
is invoked, and picks the next process to run...

... eventually the user process that invoked the syscall is scheduled, and
it resumes execution exactly where it stopped (right after the syscall
interrupt INSIDE the libc wrapper).

End of kernel mode code

#####################

Now we're back in user mode, running the process.  The libc wrapper does
this:

- pick up the retval from the register
- if retval is <0, then set errno=abs(retval) and return -1 from the
  syscall wrapper
- else return retval from the syscall wrapper

This permits user programs to do:

    fd = open(...);
    if (fd < 0) {
        perror("open"); // prints an err msg based on global errno var
        exit(1);
    }

Lesson: syscalls are expensive; they slow down both the process and the
OS.  Use them sparingly, and when you do, use them efficiently.  For
example, suppose you're reading some file: if you read(2) only one byte at
a time, you'd be wasting a lot of effort for little gain.