This is a modified version of an assignment that ran first in Fall 2015 at JHU. See http://gaming.jhu.edu/~phf/2015/fall/cs318/assignment-uthreads.shtml and the Changelog below. If you are interested in using this assignment in your course, please contact me for reference implementations and test cases!

The first half of this assignment is “Assignments for an xv6 OS Class: Miniature pthreads Library”.

Overview

Oh yes, we’re back! Now that you have a working uthr library, it’s time for you to kick off the training wheels! By that, I mean, essentially, mv kernel kernel.provided and make things work again!

As before, this is almost entirely taken from Carnegie Mellon’s Operating Systems class. This assignment has been modified from its original version. It has been reformatted to fit this semester.

You’ll start with the xv6 from nwf’s git repository at commit a7934c7cd236bab8e431e7c9cde93256dd40371b for this assignment. This commit includes no test cases, but does include some useful fixes to the kernel code. Any additional test cases will be distributed via the course Piazza page.

System Calls

The kernel system calls you have to support are as described in the previous assignment. To quickly review:

Scheduling

  • void yield(int tid) will invoke the scheduler again with a hint that it would be good for tid to be on core soon (ideally, now), giving up the remainder of this thread’s timeslice. Note that if there are no other running threads, the kernel may return to this one without context switching. If tid is negative, the kernel will simply run the next RUNNABLE thread. Yielding to one’s self is not an error, but may return immediately rather than actually entering the scheduler.

  • void desch(int *guard) will put the current thread to sleep (that is, suspend its execution) if *guard is nonzero. If the process exit()-s or is subject to kill() while this thread remains so suspended, this thread will be destroyed, too.

  • int mkrun(int tid) should awaken a thread that is sleeping in desch(). If there is no such thread, it will return a negative number; otherwise, it will return 0.

    mkrun(tid) will return failure if there is no thread identified by tid, and will return success if tid exists and was asleep due to desch(). For the purpose of this exercise, its return value is undefined in other cases (though obviously the kernel blob does something). It is not an error to call mkrun() on a thread that is not sleeping due to desch(), but the return value is not specified (it could be viewed as success – the target thread is, after all, now not blocked having desch()-ed itself… but it could also be viewed as failure – the target thread wasn’t in desch() to be made runnable, now was it?).

    A thread’s invocation of desch() is atomic with respect to any mkrun() targeting that thread. That is, the decision to deschedule happens entirely before or entirely after any decision to wake up; dually, the decision to wake up happens entirely before or entirely after any decision to deschedule. (A sketch of the intended usage pattern follows.)
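To get a feel for how desch() and mkrun() compose, here is a minimal userland sketch of the guard protocol, assuming the syscall stubs from the previous assignment are declared in user.h; everything other than desch(), mkrun(), and the guard idea itself is invented for this example.

    #include "types.h"
    #include "user.h"

    // Thread A parks itself; thread B signals it.  The guard protocol,
    // together with the kernel's atomicity guarantee above, makes the
    // race between the two benign.  (This sketch assumes a_tid was
    // published before thread B runs.)
    volatile int guard = 1;
    int a_tid;

    void
    thread_a(void *arg)
    {
      desch((int *)&guard);  // sleeps only if guard is still nonzero
      // ...we resume here once thread B has signaled...
    }

    void
    thread_b(void *arg)
    {
      guard = 0;             // publish "condition satisfied" first
      mkrun(a_tid);          // if A already desch()-ed, wake it; if A has
                             // not yet called desch(), A will see that
                             // guard == 0 and decline to sleep at all
    }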

Threads

  • int gettid(void) will return the current thread’s identifier.

  • int tspawn(void *stktop, void (*f)(void *), void *a) starts a new thread on the stack indicated by stktop, running the function f with argument a. If it is unable to do so, it will return a negative number; otherwise, it will return the thread identifier of the spawned thread (somewhat like fork()). tspawn() requires two words below stktop. (A sketch of a typical wrapper follows this list.)

  • void texit(int *where, int what) ends the execution of the current thread after validating that where is a valid pointer and atomically committing what to where.
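For concreteness, a typical library wrapper around tspawn() looks something like the following sketch; malloc() is xv6’s stock userland allocator, while STACKSIZE and spawn_thread() are names invented here.

    #include "types.h"
    #include "user.h"

    #define STACKSIZE 4096   // illustrative; your uthr library may differ

    int
    spawn_thread(void (*f)(void *), void *a)
    {
      char *stk = malloc(STACKSIZE);
      if(stk == 0)
        return -1;
      // tspawn() wants two words of headroom below stktop, and those two
      // words land inside the allocation, so pointing stktop at the very
      // top of the allocation is fine.
      return tspawn(stk + STACKSIZE, f, a);
    }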

Miscellany

  • exit() and kill() will terminate (or schedule for prompt termination) all threads in a process. Thankfully, mechanism already exists in xv6 for signaling this.

  • wait() returns only when a child process has no more threads in it; that is, when all threads have exited (or been killed).

  • fork() and exec() are permitted to fail if there is more than one thread running in the current process.

Memory Safety Problems

One of the biggest challenges of a multi-threaded system is that the kernel and userland can now race each other. In stock xv6, there’s only one CPU accessing a given (non-kernel) page at a time, so the only locking that has to happen is between kernel functions.

We punted on this in the shm assignment by telling you to make the kernel consider shared memory segments invalid userland pointers, which was sort of OK; maybe the kernel doesn’t have a whole lot of business looking at your shared memory anyway, and if you want to read() or write() (for example) something in shared memory, you can use a bounce buffer that isn’t in shared memory.

Unfortunately, with our thread model, we can no longer punt: all memory in a process is shared between threads. There’s just no way around it any more. Representative of various classes of challenge are “boring” syscalls like sleep(), “straightforward” calls like sbrk() and close(), “asynchronously memory-accessing” calls like write(), and “memory-access-with-parsing” calls like open(). Let’s look in detail at some of the challenges before we talk about any solutions.

Use-Site Validation

It is informative to begin by thinking about the following case, assuming that sbrk(0) returns a page-aligned value.

Thread (CPU) 1                 | Thread (CPU) 2
-------------------------------+-------------------------------
Call ptr = sbrk(0)             |
Call write(0,ptr,1)            |
                               | Call sbrk(-PGSIZE)
Load sz = proc->sz             |
                               | sz = proc->sz
Check that ptr < sz            |
                               | growproc(-PGSIZE)
                               | proc->sz = sz-PGSIZE
Do file stuff…                 |
                               | Return to userland
Read from checked ptr          |

What happens? One answer (there are, sadly, others; keep reading) is that CPU 1 panic()-s because it takes a page fault while in the kernel! Well that stinks! What happened?

We have introduced, by adding threads to the system, a Time-of-Check to Time-of-Use (“TOCTOU”) window wherein our check (that ptr was below proc->sz, which it was!) is no longer relevant by the time we use the data we checked. Any fix will require careful auditing of the entire xv6 syscall vocabulary to determine when proc->sz is checked vs. when these observations are actually depended upon. Thankfully, there are not many syscalls!

File Descriptors

A similar race happens between two threads accessing the collection of open file descriptors: a read() could race with a close(), for example. If we were not careful, read() could end up with a pointer to a closed file structure, and madness would ensue.

One must also guard the management of proc->cwd, but thankfully, that is easy.

Buffer Content Validity

In many cases, because it “knows” that there is no possibility of concurrent mutation, the xv6 kernel checks that a buffer is “real” (backed by memory) and then proceeds to perform (potentially) multiple operations on that buffer. (See, for example, the behavior of namex(): the path parameter might be a validated pointer to user memory via, e.g., sys_unlink and nameiparent.) This all works fine when pages are not accessible by other mutators, but we run into problems here, now that things can change out from under us! This is even worse than the earlier problems because the other things that might “interfere” are threads running userland code, rather than being in the kernel… so we’re gonna have to get creative.

Other CPUs’ TLBs

There’s another problem, though, I’m sad to say: growproc() knows how to shrink a process and knows how to kfree() the frame that it just extracted from a pte, but… it’s not safe to do that any more. In a multi-processor system, that other CPU is going to have that PTE cached in its TLB and so your freed memory is now exposed to userland and OOPS! In tabular form,

Thread (CPU) 1                          | Thread (CPU) 2
----------------------------------------+----------------------------------------
Call ptr = sbrk(0) - PGSIZE             |
x = *ptr; this pulls the PTE            |
backing ptr into the TLB                |
                                        | Call sbrk(-PGSIZE)
                                        | growproc(-PGSIZE) changes proc->pgdir
                                        | to remove the PTE backing ptr and
                                        | kfree()-s that page
                                        | growproc() calls switchuvm() to flush
                                        | the TLB of this CPU!
Read and write above ptr to             |
snoop the kernel’s internals!           |

In particular, this process would have free access to kalloc()’s free list, which would enable it with relatively little effort to take over the kernel. (Details left as an exercise for the reader, but the basic idea is: you can, with high confidence, control what the kalloc()-after-next will return, since the freshly freed page is quite likely to be at the head of the free list. If you can cause sbrk() to map a page of kernel code for you…) Isn’t that scary?!

Memory Safety Solutions

What’s to be done? Let’s start with the most terrifying: the TLB shootdown issue. Broadly speaking, two things conspired to produce that problem:

  1. the same address space (pgdir) mapped on multiple CPUs, and

  2. a lack of synchronization between CPUs when the address space was updated, which would have ensured that there were no other references to the removed page given to kfree().

We’re going to require different things of different students: Undergraduate students are going to take a large hammer approach and eliminate the problem by preventing multiple CPUs from sharing a context at once. Graduate students will be required to live with multiple CPUs running in the same address space, and so will attack the problem by addressing the lack of synchronization. That is, only graduate kernels will actually offer the benefit of multiple cores to multi-threaded processes. (Though enthusiastic undergrads are invited, having solved the problem using our stipulated approach first, to submit a delta showing the second approach. We will grade the first and will look at the second for a little extra credit.)

Undergraduates: Two-Tier Scheduling

While real kernels schedule threads and let the process components come along for the ride, we’re going to stick to a historically real approach of scheduling processes and threads within processes. That is, just like the current xv6 scheduler, your scheduler will pick a process to bring on core and prevent any other CPU from scheduling that process while it’s claimed by this one. Having made that decision, the scheduler will then pick, in a round-robin fashion, a thread from the runnable threads within this process. (There should be no running threads at this point, right?) It is up to you to decide how often to switch between processes versus threads within a process; options range from “run every thread before moving to the next process” to “run only one thread before considering a different process”. Note that your scheduler should not favor some subset of threads in a process, so even if it is switching processes before exhausting all threads within a process, every thread should still get a chance to run even assuming none of the other threads ever voluntarily sleep.
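Here is one possible shape for that loop, strictly as a sketch: p->threads, p->lastrun, thr, NTHRPERPROC, and the TRUNNABLE/TRUNNING states are all names invented for this example, and a global thread pool (see Allocations below) would change the iteration details.

    // A sketch of a two-tier scheduler() inner loop.  (Assumes the usual
    // xv6 kernel headers: types.h, defs.h, param.h, proc.h, spinlock.h.)
    void
    scheduler(void)
    {
      struct proc *p;
      struct thread *t;
      int i;

      for(;;){
        sti();
        acquire(&ptable.lock);
        for(p = ptable.proc; p < &ptable.proc[NPROC]; p++){
          if(p->state != RUNNABLE)        // tier 1: claim a process
            continue;
          // Tier 2: round-robin over this process's threads, starting
          // just past whichever thread ran last time.
          for(i = 1; i <= NTHRPERPROC; i++){
            t = p->threads[(p->lastrun + i) % NTHRPERPROC];
            if(t == 0 || t->state != TRUNNABLE)
              continue;
            p->lastrun = (p->lastrun + i) % NTHRPERPROC;
            proc = p;
            thr = t;                      // per-CPU current-thread pointer
            switchuvm(p);
            t->state = TRUNNING;
            swtch(&cpu->scheduler, t->context);
            switchkvm();                  // also flushes this CPU's TLB
            proc = 0;
            thr = 0;
            break;                        // one thread per visit; tune this
          }
        }
        release(&ptable.lock);
      }
    }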

Interestingly, this approach also extends naturally to handling our buffer validity problems: it suffices to prevent intra-process inter-thread scheduling while manipulating a buffer. This gives the buffer manipulation code the illusion that nothing has changed, because there will be no concurrent mutators in this address space! Isn’t that cute?

It is OK if you make certain other operations done by the kernel prevent intra-process inter-thread scheduling, for some period of time, even if they do not (and should not) prevent inter-process scheduling. Please try to keep the number of affected subsystems to a minimum. Part of keeping this to a minimum will involve allowing inter-thread (esp. intra-process) scheduling to occur even with, e.g., an sbrk() call in progress. (That is, we expect the kernel to be maximally “preemptive” – it should allow a thread that is running purely userland code to be scheduled and run normally even while other threads (in the same process or not) are contending over various resources and going out to disk and… so on.)

Two-tier scheduling does not, however, completely solve the TOCTOU and open files issues; see below.

Graduates: CPU Synchronization Tricks

Alright, we’re going to actually solve this problem, so that multi-threaded workloads actually get a benefit from having multiple cores available!

One option – one frequently employed in the real world, even – is to use the LAPIC and things called Inter-Processor Interrupts to cause other processors to promptly evict entries from their TLB. You are welcome to take this approach and ignore what follows, but please carefully document anything and everything you study to write your code.

However, this being xv6, we’re never really in all that much of a hurry, so it’s OK if reducing proc->sz takes a while, especially if it keeps code simpler. Besides, how often does a multi-threaded application reduce its memory footprint? One way to do this is to borrow a technique from some really cool technology called Read-Copy-Update Mutual Exclusion (“RCU”). The chain of reasoning goes like this:

  1. Imagine we knew when all CPUs in the system had flushed their TLBs. Concretely, imagine a magic function wait_everyone_tlb_flush(void) that somehow knows just how long to wait until this happens.

  2. If we had removed the PTE before that, but not yet freed the page, then once wait_everyone_tlb_flush() returns we would know, by construction, that no CPU still has that entry cached in its TLB.

  3. When do processors running xv6 flush their TLBs? Rarely, but: at least every time they return to the scheduler() function, which calls switchkvm() when it takes a process off-core.

  4. How could we know when every CPU has gone through scheduler()? What if the CPUs formed a “bucket-brigade ring”, passing threads to the next one in line?

  5. Ah, so a thread could put itself into the bucket-brigade ring at the “next” CPU, saying “my stop is my current CPU”. (Think carefully about any atomicity requirements on “current” and “next”!) Then, around the ring it goes until the current CPU says “Oh, look, a thread for me!” and makes it run again. In tabular form,

    CPU 1                         | CPU 2                          | CPU 3
    ------------------------------+--------------------------------+------------------------------
    Decreasing sz, want to        |                                |
    kfree(page)                   |                                |
    Thread x puts itself into the |                                |
    ring at CPU 2 and goes        |                                |
    off-core.                     |                                |
    scheduler() someone else      |                                |
                                  | Timer interrupt! trap() calls  |
                                  | yield(), which gets us to      |
                                  | scheduler()                    |
                                  | Pass x to CPU 3                |
                                  |                                | scheduler(): pass x to CPU 1
    scheduler() sees the thread   |                                |
    come home and makes it        |                                |
    runnable again                |                                |

    After this point, the thread will come on core again and will be secure in the knowledge that all other CPUs have flushed their TLBs.

So: if we had a function like waitscheds(...) and a cooperatively modified scheduler(), then we’d just have to

  1. Remove the PTE from the process’s page directory

  2. wait for waitscheds(...) to return

  3. call kfree().

Isn’t that awesome?

You probably want to write and test this RCU ring stand-alone!
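For instance, the ring might look something like this sketch; struct parcel, brigade, brigade_pass(), mythread(), and the thread states are all invented names, and the locking is deliberately simplistic (everything hides behind ptable.lock).

    // Each CPU has an "inbox" of parcels riding the ring, guarded by
    // ptable.lock.  scheduler() calls brigade_pass() once per trip through
    // its loop, at a point where this CPU's TLB has already been flushed.
    // (Assumes the usual xv6 kernel headers.)
    struct parcel {
      struct thread *t;        // the sleeping thread riding the ring
      int homecpu;             // "my stop is my current CPU"
      struct parcel *next;
    };

    static struct parcel *brigade[NCPU];

    // Called from scheduler() with ptable.lock held.
    void
    brigade_pass(int mycpu)
    {
      struct parcel *p, *rest = brigade[mycpu];
      int nextcpu = (mycpu + 1) % ncpu;

      brigade[mycpu] = 0;
      while((p = rest) != 0){
        rest = p->next;
        if(p->homecpu == mycpu){
          p->t->state = TRUNNABLE;   // came home: every CPU has flushed
        } else {
          p->next = brigade[nextcpu];
          brigade[nextcpu] = p;      // pass it along the ring
        }
      }
    }

    // Called by a thread that needs to wait out one full ring trip.
    void
    waitscheds(void)
    {
      struct parcel p;

      acquire(&ptable.lock);
      p.t = mythread();              // hypothetical current-thread accessor
      p.homecpu = cpunum();
      p.next = brigade[(cpunum() + 1) % ncpu];
      brigade[(cpunum() + 1) % ncpu] = &p;
      p.t->state = TSLEEPING;
      sched();                       // off-core; the on-stack parcel stays
      release(&ptable.lock);         //   valid because we sleep until it
    }                                //   comes home and we are woken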

P.S. There may be some funny locking involved! In particular, the thread must go to sleep when it puts itself on the bucket brigade, and so…

P.S. Do we have to do this RCU game when we’re cleaning up a page directory? Why or why not?

P.S. The story told here is a per-freed-page story. Is there something better you can think of?

P.S. In a real system, these kinds of bucket brigades often have a callback function (typically the common void f(void *) and void *a pattern) that’s run on each CPU during (the equivalent of) schedule(). The description above elided these callbacks as we do not need them (if we had them, we’d just be using return;). You may, however, find that callbacks are useful for testing the bucket brigade or that they could be interesting for other reasons.

Graduate Students: Buffer Safety

Despite the tricks above, we are still faced with the problem that userspace memory could be mutated by multiple threads running on multiple CPUs. What should we do? There are at least two solutions I can think of:

  • One answer is to use bounce-buffers: that is, to make a private copy of a userland string and then use that, rather than the userland copy. The copy operation itself still potentially races with other threads, but thereafter all the kernel’s attention is focused on the in-kernel copy.

    This will involve auditing and modification of sysfile.c but not too much else!

  • A combination of the RCU-esque trick outlined above and the undergrad one-thread-per-process approach could also be viable! If we could temporarily declare our intention to run only one thread per process, and ensure that the other CPUs knew… we’d be in good shape!

    This will likely involve the introduction and use of two new kernel primitive operations, maybe named something like gost() and gomt() which transition the current process between single-threaded and multi-threaded modes. These will probably need to use xchg() internally to ensure that CPU store buffers are flushed, as with release(). gost() will, obviously, take some time.

    This approach is probably less ideal than the above, but perhaps involves less machinery.

Note that buffers that are treated opaquely by the kernel, such as file data for write() and the landing area for read(), do not need this kind of careful handling. That is, userland can only hurt itself if it races one thread scribbling over buf against another thread’s write(fd, buf, n), right?
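As a concrete example of the bounce-buffer option above, parsing-type syscalls might copy their path argument through something like this; copypath() and its limit are invented for this sketch, while fetchstr() is the stock xv6 validator.

    // Copy a NUL-terminated userland string into a kernel buffer before
    // anyone parses it.  Even though fetchstr() validated the string, a
    // sibling thread may scribble on it afterwards -- but the kernel only
    // ever looks at kbuf, and the explicit bound keeps the copy safe.
    static int
    copypath(uint uaddr, char *kbuf, int max)
    {
      char *s;
      int i;

      if(fetchstr(uaddr, &s) < 0)    // in-range and NUL-terminated... then
        return -1;
      for(i = 0; i < max; i++){
        kbuf[i] = s[i];              // ...racily copied, bounded regardless
        if(kbuf[i] == 0)
          return 0;
      }
      return -1;                     // too long (or the NUL raced away)
    }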

Everyone: TOCTOU and ofiles

OK, finally, the simplest problems. There are a few approaches here; the most readily available options for the proc->sz problem, perhaps, are:

  • Narrow the TOCTOU window so that the check and use may be made atomic. While possibly the most efficient option, this will require modification in some or all of console.c, file.c, fs.c, and pipe.c, among others. Notably, it will require the introduction of so-called “safe-copy” primitives within the kernel.

  • Serialize all uses of and changes to proc->sz. This is less efficient, since it implies a great deal more contention between threads, but it is by far more expedient. Unfortunately, the sole encapsulated primitive offered by xv6 for serialization, spinlock-s, is not sufficient, as a thread cannot sleep with a spinlock held. You will have to use spinlock-s to build something better; a sketch of one such construction follows this list.

    (Graduate students should note that a gost()-like approach here is not acceptable, since often the TOC and TOU are separated by blocking operations like disk access and we would dearly love to let some other thread onto the core!)
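One shape for that “something better” is a sleep-lock in the style of what later xv6 revisions ship: a spinlock guards the flag, and sleep()/wakeup() park waiters instead of spinning, so the holder is free to block on the disk. struct slock and its functions are names invented for this sketch; sleep(), wakeup(), and the spinlock calls are stock xv6.

    struct slock {
      struct spinlock lk;   // protects `locked`
      int locked;
    };

    void
    slock_init(struct slock *s, char *name)
    {
      initlock(&s->lk, name);
      s->locked = 0;
    }

    void
    slock_acquire(struct slock *s)
    {
      acquire(&s->lk);
      while(s->locked)
        sleep(s, &s->lk);   // atomically drops s->lk while asleep
      s->locked = 1;
      release(&s->lk);
    }

    void
    slock_release(struct slock *s)
    {
      acquire(&s->lk);
      s->locked = 0;
      wakeup(s);
      release(&s->lk);
    }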

For proc->ofiles, we again have a few options:

  • Make the ofile array per-thread instead of per-process. While not ideal, it’s at least one option to consider! (Maybe if we are serious about this option, we would introduce a way of passing files between threads.) We will forbid this approach this time around.

  • Take advantage of xv6’s built-in file reference counting: have a small critical section around the process’s ofile array which bumps the reference count, then drops the lock and thereafter refers to the open file structure obtained from the ofile array directly, rather than going back through ofile. Don’t forget to drop the reference when done! (A sketch of this pattern follows the list.) Graduate submissions should aim for this and expect a few points’ loss if the solution falls short of this level of minimal serialization.

  • Serialize all calls within a process through the sysfile interfaces. As might be imagined, this is far simpler, if slower. Again, acquire() will be insufficient for the serialization at hand, as file operations often block. Undergraduate submissions will receive full credit for a correct solution of this flavor.

  • Serialize sysfile calls and all manipulations of proc->sz in the same way, as very often the two interact. This is slower still, since some pairs of operations, like sys_close() and sbrk(), will now somewhat artificially serialize.
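The reference-counting bullet above might look like the following sketch; fileget() and p->ofilelock are invented here, while filedup() and fileclose() are stock xv6.

    // Grab a stable reference to an open file: bump the refcount inside
    // a short critical section, then use the returned pointer without
    // ever touching p->ofile again.  Callers fileclose() it when done.
    struct file*
    fileget(struct proc *p, int fd)
    {
      struct file *f = 0;

      acquire(&p->ofilelock);
      if(fd >= 0 && fd < NOFILE && p->ofile[fd])
        f = filedup(p->ofile[fd]);
      release(&p->ofilelock);
      return f;   // may be 0: fd was bad or we raced a close()
    }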

proc->cwd can be simply managed by sprinkling some spinlocks in the appropriate places; note that the only thing that needs to be guarded in sys_chdir is the replacement of proc->cwd with ip. If one grabs the old value of proc->cwd prior to that replacement as oldip, locks may be dropped before operating on oldip.
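In code, the tail of sys_chdir() might become something like this; proc->cwdlock is a hypothetical per-process spinlock, and the surrounding namei()/ilock() dance is unchanged from stock xv6.

    struct inode *oldip;

    acquire(&proc->cwdlock);
    oldip = proc->cwd;
    proc->cwd = ip;
    release(&proc->cwdlock);
    iput(oldip);   // may block on disk, so do it after dropping the lock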

Threads

Errant Behavior

Upstream xv6 kills a process when it misbehaves (e.g. pagefaults). Your new thread-aware kernel should kill only a thread when it pagefaults, not the whole process.

Thread Identifiers

As before, thread identifiers and process identifiers still share a namespace, and the first thread in a process has the same identifier as the process itself.

Kernel Stacks

Each thread will require its own kernel stack. Unfortunately, xv6 at present does not free a kstack until wait(); that’s sort of OK for processes, but it’s not OK now that we have tspawn() and texit() and nothing like twait(). You will have to design a mechanism for reclaiming kstacks promptly. An ideal mechanism will bound the number of stale kstacks in the system by some constant (even if I vary NTHR). (0 would be really good, but, while possible, is perhaps tricky to achieve. 1 is good and has a cute solution. NCPU would be fine and might be arguably better than 1 on a real system, which xv6 is not exactly. NPROC would be OK but strange. NTHR may appear to be a constant, but it isn’t.)

Allocations

This is xv6 and we haven’t asked you to write an allocator and we’re not going to give you one. Your system should support NPROC (i.e. 64) processes and a global pool of NTHR = 256 threads, which are to be allocated to processes on a first-come-first-served basis.

You should be able to handle the case when one process has grabbed all 256 threads (though admittedly, it would take a modified init to do that).

You must gracefully handle the “out of threads” case in both fork() and tspawn() (i.e. must not panic() the system) but note that graceful handling need not include attempting to get more! (This is just in line with fork() already handling the out-of-processes case!)

The Assignment Itself

Part 1: Study, Design, Log (15%)

The usual request for a detailed README with your initial plans, your software’s evolution, and all that good stuff.

Part 2: xxvv66: xv6 with threads (70%)

Replace the shipped kernel blob we gave you with something that works for your thread library and behold all that you have made.

Grade Breakdown

In an effort to reduce the surprise and potential stress of being graded, this section’s large impact will be broken up as follows:

  • Automated testing: coming up to a shell and passing usertests on both 1 and 2 CPU configurations will be worth 30% of this section (21% of the total grade). Note that usertests does not exercise any threading at all, and the upstream xv6 reliably passes it. You should be regression testing early and often!

  • Additional automated testing: passing all of 318_ts1.c, 318_dmr.c, and 318_lots_thr.c, on both 1 and 2 CPU configurations, will be worth 40% of this section (or 28% of the total grade).

    Note that none of these tests use your thread library! However, you should be using your thread library and your new knowledge to test parts of the kernel: probing the scheduler with spinlocks is a good way to exercise yield(). Semaphores exercise desch() and mkrun(). Thread tests exercise a whole lot of pieces!

  • Hidden test cases will count for another 10% of this section (7% total).

    None of these tests will use your thread library either! They might use my thread library, but that’s not your problem.

  • Manual audit of the submission will count for the remaining 20% of this section (14% total).

Part 3: Test Cases! (15%)

This assignment is so very complicated that I am going to insist that you write your own test cases. The ones given to you are a good place to start, and of course you’re welcome to share test cases with each other, but you must write your own tests, ideally that meaningfully test different things. Why yes, I might be crowd-sourcing a threaded xv6 test harness, why do you ask?

Please document what each test is supposed to test in your README and/or at the top of each test file (maybe a quick summary in README and more verbose in the file itself?). Note that good tests come in a few flavors:

  • Regression-like limited probes that touch minimal subsets of functionality and attempt to probe very specific conditions. Ideally, these tests reliably hit their target. (Lots of things in usertests, and things like 318_ts1.c or 318_dmr.c.)

  • Stress tests which do something to exhaust, or nearly-exhaust, some resource(s) and then try to exercise some condition(s). (usertests does this a few times)

  • Reliability tests, sometimes also called Continuous Hours of Operation (“CHO”) tests: fling lots of work at the system and see what happens. Keep it up for as long as possible. Sometimes CHO tests try to recreate (sped up?) real conditions, or sometimes they’re stress-test like, keeping the system low on resources, or sometimes they repeatedly take the system in and out of resource scarcity.

  • Random testing: given a palette of possible operations, do a random sequence of them! This includes things like “fuzzing” and “monkey testing”.

In many of these cases, the goal is to probe at the failure cases of the system. The success paths get lots of exercise. The failure paths, on the other hand…

Please additionally document your contributions to the communal test pool – linking to posts on Piazza is probably sufficient. Any tests posted to Piazza must have your name(s) in a comment near the top; any code derived from tests on Piazza must leave the original authors’ names there and add your own.

Do not feel obligated to use the TAPish test macros – I certainly will not judge you for not liking my style – though you may find them convenient, especially in combination with the runme driver. A shell-script could get you pretty good regression testing! (You are also welcome to develop your own TAPish test “framework”!)

Incidentally, some collected ideas of things to test:

  • Wild pointer handling. Should kill a thread, not the process, and not the kernel!

  • kill() a thread ID that isn’t a process ID. Right thing happen?

  • yield()! Make sure that a few processes can cheerfully bounce back and forth “forever” yield()-ing to each other. Can you yield to yourself? ( ;) )

  • Graduates: that RCU bucket brigade. You can even write tests for this before you have any thread support at all – just push processes around the ring at first.

  • tspawn() until you can’t any more, making sure that allocation failure is graceful!

  • tspawn() until you can’t any more, then (have another process) fork()! (That is: test what happens when there are free struct proc but no free struct thread!)

  • TOCTOU vulnerabilities and TLB shootdown: try to crash your kernel by rapidly wiggling sbrk() and having threads race the kernel!

  • What happens, anyway, if you race a write() and a for-loop paving over the same buffer (possibly repeatedly? backwards?). For best results, try with CPUS=2.

  • Memory leaks! How big can you push sbrk() after the system has just started up? Bring it back down and go do a bunch of things and wait for them all to finish and clean up. Can you still push sbrk() that high? No? Something’s gone missing! Find it!

  • “Continuous Hours of Operation” style tests: spawn a whole lot of tests at once and keep them going! Keep the kernel and all CPUs as busy as possible with as many different tasks as you can think of! Terrify the kernel with all of your userland-y demands!

One more clever trick someone once taught me, by the way, is to change the tick frequency. You’ll get a lot of different interleavings of threads if you switch at different speeds!
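In stock x86 xv6 the tick rate is set in lapic.c’s lapicinit(); shrinking the timer’s initial countdown value makes timer interrupts (and hence preemptions) more frequent:

    lapicw(TICR, 10000000);   // stock value; try, say, 1000000 for roughly
                              // 10x as many ticks and new interleavings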

Suggested Plan Of Attack

  1. Mechanically separate struct proc into struct proc and struct thread. Initially, allow each struct proc to have exactly one struct thread (and each struct thread to belong to only one struct proc, but that part isn’t surprising). A skeletal split is sketched after this list.

    You will probably want to revise the “per-CPU” data by either adding a notion of “current thread” or by replacing the existing “proc” with the same. Note that if you wish to change the size of the per-CPU data, vm.c’s seginit() function will need to know.

    Make the requisite changes to trap.c for errant behavior, even if just in skeletal form. (Maybe fatal traps become infinite loops, since we don’t know how to kill just the current thread yet?) (Though while here, you can probably start to guess what’s in texit().)

    Chase compilation failures through the whole system until you have something that reliably passes usertests. Congratulations, you have just made the foundation of all subsequent efforts!

  2. Change the data structure! Allow (but do not give a mechanism for creating!) more than one thread in a process. You will encounter ample food for thought in this process; make note of every question you encounter!

    Graduate students probably want to change what scheduler() scans! Undergrad students can write an initial version of their two-tier scheduler now.

    Chase compilation failures again and then debug until usertests is happy.

    Either here or at the previous step, it will probably do you some good to make procdump() more than superficially aware of this new threading thing. Calling procdump() at choice moments can be a huge boon to debugging! (And hitting Ctrl-P on the keyboard is always satisfying!)

  3. Sometime around now might be the right time to add desch and mkrun. 318_dmr.c should help you get going.

  4. Think about how to guard sz (and ofile). You can add those guards now and the system should continue to pass usertests! (The guards will be guarding against something that cannot happen!)

  5. Graduate students: think about that RCU bucket brigade. Get it working (maybe add some system calls and cprintf()-s around the system to observe that things are behaving as you expect).

    (This is an excellent opportunity to demonstrate your test-writing skills as part of the test-case part of the assignment.)

  6. Graduate students: wire up the RCU bucket brigade where it needs to go. The system should continue to reliably pass usertests.

  7. New kstack reclaim: design and move your kernel to use a new technique. As per usual, usertests should continue to pass!

    The kfree(p->kstack) from wait() should likely be gone, now.

  8. The moment of truth! Add tspawn() support! 318_ts1.c and 318_lots_thr.c should help you get on your feet and believe that things are working! (It’s OK not to have texit() yet!)

  9. Finish the assignment. (That means texit(), yield(), gettid() and everything I might have left out).

  10. Test, test, and test some more. (Rather like Portal!) While here, marvel at what you are doing: you have added parallel threads (concurrent, too, in the case of grad students) with shared-by-default memory to UNIX. This, historically, is a big f****** deal, as Vice President Joe Biden might say.
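For step 1 of the plan, a skeletal split might look like the following; every field shown is a suggestion only, and how a process names its threads (array, list, pool indices) is entirely your call.

    // Per-address-space state stays in struct proc; per-execution state
    // moves to struct thread.  (Field names here are illustrative; keep
    // whatever subset of stock xv6's struct proc makes sense on each side.)
    struct thread {
      int tid;                     // thread id (same namespace as pids)
      enum threadstate state;      // e.g. TUNUSED, TRUNNABLE, TRUNNING, ...
      char *kstack;                // per-thread kernel stack
      struct trapframe *tf;        // per-thread trap frame
      struct context *context;     // per-thread registers for swtch()
      void *chan;                  // per-thread sleep channel
      struct proc *proc;           // owning process
      struct thread *next;         // siblings within the process
    };

    struct proc {
      int pid;
      uint sz;                     // size of process memory (bytes)
      pde_t *pgdir;                // page table: shared by all threads
      struct proc *parent;
      int killed;
      struct file *ofile[NOFILE];  // open files: shared
      struct inode *cwd;           // current directory: shared
      char name[16];
      struct thread *threads;      // this process's threads, drawn from
                                   //   the global NTHR pool
    };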

Good luck!

Recall one last bit of advice: if something really isn’t working, get up and stretch. Go for a walk around the building. Relax. Do other work. Come back later and you’ll see things you didn’t see before.

Changelog