This is a modified version of an assignment that ran first in Fall 2015 at JHU. See http://gaming.jhu.edu/~phf/2015/fall/cs318/assignment-uthreads.shtml and the Changelog below. If you are interested in using this assignment in your course, please contact me for reference implementations and test cases!
The first half of this assignment is Assignments for an xv6 OS Class: Miniature pthreads Library.
Oh yes, we’re back! Now that you have a working
uthr library, it’s time
for you to kick off the training wheels! By that, I mean, essentially,
kernel kernel.provided and make things work again!
You’ll start with the xv6 from nwf’s git repository at commit a7934c7cd236bab8e431e7c9cde93256dd40371b for this assignment. This commit includes no test cases, but does include some useful fixes to the kernel code. Any additional test cases will be distributed via the course Piazza page.
The kernel system calls you have to support are as described in the previous assignment. To quickly review:
void yield(int tid)will invoke the scheduler again with a hint that it would be good for
tidto be on core soon (ideally, now), giving up the remainder of this thread’s timeslice. Note that if there are no other running threads, the kernel may return to this one without context switching. If
tidis negative, the kernel will simply run the next
RUNNABLEthread. Yielding to one’s self is not an error, but may return immediately rather than actually entering the scheduler.
void desch(int *guard)will put the current thread to sleep (that is, suspend its execution) if
*guardis nonzero. If the process
exit()-s or is subject to
kill()while this thread remains so suspended, this thread will be destroyed, too.
int mkrun(int tid)should awaken a thread that is sleeping in
desch(). If there is no such thread, it will return a negative number; otherwise, it will return
mkrun(tid)will return failure if there is no thread identified by
tid, and will return success if
tidexists and was asleep due to
desch(). For the purpose of this exercise, its return value is undefined in other cases (though obviously the kernel blob does something). It is not an error to call
mkrun()on a thread that is not sleeping due to
desch(), but the return value is not specified (it could be viewed as success – the target thread is, after all, now not blocked having
desch()-ed itself… but it could also be viewed as failure – the target thread wasn’t in
desch()to be made runnable, now was it?).
A thread’s invocation of
desch()is atomic with respect to any
mkrun()targeting that thread. That is, the decision to go to deschedule happens entirely before or entirely after any decision to wake up; dually, the decision to wake up happens entirely before or entirely after any decision to deschedule.
int gettid(void)will return the current thread’s identifier.
int tspawn(void *stktop, void (*f)(void *), void *a)starts a new thread running on the stack indicated by
stktoprunning the function
a. If it is unable to do so, it will return a negative number; otherwise, it will return the thread identifier of the spawned thread (somewhat like
tspawn()requires two words below
void texit(int *where, int what)ends the execution of the current thread after validating that
whereis a valid pointer and atomically committing
kill()will terminate (or schedule for prompt termination) all threads in a process. Thankfully, mechanism already exists in xv6 for signaling this.
wait()returns only when a child process has no more threads in it; that is, when all threads have exited (or been killed).
exec()are permitted to fail if there are more than one thread running in the current process.
One of the biggest challenges of a multi-threaded system is that the kernel
and userland can now race each other. In stock
xv6, there’s only one
CPU accessing a given (non-kernel) page at a time, so the only locking that
has to happen is between kernel functions.
We punted on this in the
shm assignment by telling you to make the
kernel consider shared memory segments invalid userland pointers, which was
sort of OK; maybe the kernel doesn’t have a whole lot of business looking at
your shared memory anyway, and if you want to
example) something in shared memory, you can use a bounce buffer that
isn’t in shared memory.
Unfortunately, with our thread model, we can no longer punt: all memory in
a process is shared between threads. There’s just no way around it any
more. Representative of various classes of challenge are “boring” syscalls
sleep(), “straightforward” calls like
“asynchronously memory-accessing” calls like
“memory-access-with-parsing” calls like
open(). Let’s look in detail
at some of the challenges before we talk about any solutions.
It is informative to begin by thinking about the following case, assuming
sbrk(0) returns a page-aligned value.
Thread (CPU) 1
Thread (CPU) 2
ptr = sbrk(0)
sz = proc->sz
sz = proc->sz
ptr < sz
proc->sz = sz-PGSIZE
Do file stuff…
Return to userland
Read from checked
What happens? One answer (there are, sadly, others; keep reading) is that
panic()-s because it takes a page fault while in the kernel! Well
that stinks! What happened?
We have introduced, by adding threads to the system, a Time-Of-Check to
Time-of-Use (“TOCTOU”) window
wherein our check (that
ptr was below
proc->sz, which it was!) is
no longer relevant by the time we use the data we checked. Any fix will
require careful auditing of the entire xv6 syscall vocabulary to determine
proc->sz is checked vs. when these observations are actually
depended upon. Thankfully, there are not many syscalls!
A similar race happens between two threads accessing the collection of open
file descriptors: a
read() could race with a
close(), for example.
If we were not careful,
read() could end up with a pointer to a closed
file structure, and madness would ensue.
One must also guard the management of
proc->cwd, but thankfully, that is
In many cases, because it “knows” that there is no possibility of concurrent
mutation, the xv6 kernel checks that a buffer is “real” (backed by memory)
and then proceeds to engage (potentially) multiple operations of that buffer.
(See, for example, the behavior of
path parameter might
be a validated pointer to user memory via, e.g.,
nameiparent.) This all works fine when pages are not accessible by
other mutators, but we run into problems here, now that things can change
out from under us! This is even worse than the earlier problems because the
other things that might “interfere” are threads running userland code,
rather than being in the kernel… so we’re gonna have to get creative.
There’s another problem, though, I’m sad to say:
growproc() knows how
to shrink a process and knows how to
kfree() the frame that it just
extracted from a
pte, but… it’s not safe to do that any more. In a
multi-processor system, that other CPU is going to have that PTE cached in
its TLB and so your freed memory is now exposed to userland and OOPS!
In tabular form,
Thread (CPU) 1
Thread (CPU) 2
ptr = sbrk(0) - PGSIZE
x = *ptr; this pulls the PTE backing
ptrinto the TLB
proc->pgdirto remove the PTE backing
kfree()-s that page
switchuvm()to flush the TLB of this CPU!
Read and write above
ptrto snoop the kernel’s internals!
In particular, this process would have free access to
list. which would enable it with relatively little effort to take over the
kernel. (Details left as an exercise for the reader, but the basic idea is:
you can, with high confidence, control what the
return, since the freshly freed page is quite likely to be at the head of
the free list. If you can cause
sbrk() to map a page of kernel code for
you…) Isn’t that scary?!
What’s to be done? Let’s start with the most terrifying, TLB shootdown issues. Broadly speaking, there are two things we needed to get that problem:
the same address space (
pgdir) mapped on multiple CPUs, and
a lack of synchronization between CPUs when the address space was updated, which would have ensured that there were no other references to the removed page given to
We’re going to require different things of different students: Undergraduate students are going to take a large hammer approach and eliminate the problem by preventing multiple CPUs from sharing a context at once. Graduate students will be required to live with multiple CPUs running in the same address space, and so will attack the problem by addressing the lack of synchronization. That is, only graduate kernels will actually offer the benefit of multiple cores to multi-threaded processes. (Though enthusiastic undergrads are invited, having solved the problem using our stipulated approach first, to submit a delta showing the second approach. We will grade the first and will look at the second for a little extra credit.)
While real kernels schedule threads and let the process components come along for the ride, we’re going to stick to a historically real approach of scheduling processes and threads within processes. That is, just like the current xv6 scheduler, your scheduler will pick a process to bring on core and prevent any other CPU from scheduling that process while it’s claimed by this one. Having made that decision, the scheduler will then pick, in a round-robin fashion, a thread from the runnable threads within this process. (There should be no running threads at this point, right?) It is up to you to decide how often to switch between processes versus threads within a process; options range from “run every thread before moving to the next process” to “run only one thread before considering a different process”. Note that your scheduler should not favor some subset of threads in a process, so even if it is switching processes before exhausting all threads within a process, every thread should still get a chance to run even assuming none of the other threads ever voluntarily sleep.
Interestingly, this approach also extends naturally to handling our buffer validity problems: it suffices to prevent intra-process inter-thread scheduling while manipulating a buffer. This gives the buffer manipulation code the illusion that nothing has changed, because there will be no concurrent mutators in this address space! Isn’t that cute?
It is OK if you make certain other operations done by the kernel prevent
intra-process inter-thread scheduling, for some period of time, even if they
do not (and should not) prevent inter-process scheduling. Please try to
keep the number of affected subsystems to a minimum. Part of keeping this
to a minimum will involve allowing inter-thread (esp. intra-process)
scheduling to occur even with, e.g., an
sbrk() call in progress. (That
is, we expect the kernel to be maximally “preemptive” – it should allow a
thread that is running purely userland code to be scheduled and run
normally even while other threads (in the same process or not) are
contending over various resources and going out to disk and… so on.)
Two-tier scheduling does not, however, completely solve the TOCTOU and open files issues; see below.
Alright, we’re going to actually solve this problem, so that multi-threaded workloads actually get a benefit from having multiple cores available!
One option – one frequently employed in the real world, even – is to use the LAPIC and things called Inter-Processor Interrupts to cause other processors to promptly evict entries from their TLB. You are welcome to take this approach and ignore what follows, but please carefully document anything and everything you study to write your code.
However, this being
xv6, we’re never really in all that much of a hurry,
so it’s OK if reducing
proc->sz takes a while, especially if it keeps
code simpler. Besides, how often does a multi-threaded application reduce
its memory footprint. One way to do this is to borrow a technique from
some really cool technology called Read-Copy-Update Mutual Exclusion
(“RCU”). The chain of
reasoning goes like this:
Imagine we knew when all CPUs in the system had flushed their TLBs. Concretely, imagine a magic function
wait_everyone_tlb_flush(void)that somehow knows just how long to wait until this happens.
If we had removed the PTE before that, but not freed the page, after
wait_everyone_tlb_flush()returns, we would know that no CPU would have that entry cached in their TLBs, by construction.
When do processors running
xv6flush their TLBs? Rarely, but: at least every time they return to the
scheduler()function, which calls
switchkvm()when it takes a process off-core.
How could we know when every CPU has gone through
scheduler()? What if the CPUs formed a “bucket-brigade ring”, passing threads to the next one in line?
Ah, so a thread could put itself into the bucket-brigade ring at the “next” CPU, saying “my stop is my current CPU”. (Think carefully about any atomicity requirements on “current” and “next”!) Then, around the ring it goes until the current CPU says “Oh, look, a thread for me!” and makes it run again. In tabular form,
sz, want to
Thread x puts itself into the ring at CPU 2 and goes off-core.
yield()gets us to
Pass x to CPU 3
schedule()pass x to CPU 1
schedule()sees the thread come home and makes it runnable again
After this point, the thread will come on core again and will be secure in the knowledge that all other CPUs have flushed their TLBs.
So: if we had a function like
waitscheds(...) and a coopratively
scheduler(), then we’d just have to
Remove the PTE from the process’s page directory
Isn’t that awesome?
You probably want to write and test this RCU ring stand-alone!
P.S. There may be some funny locking involved! In particular, the thread must go to sleep when it puts itself on the bucket brigade, and so…
P.S. Do we have to do this RCU game when we’re cleaning up a page directory? Why or why not?
P.S. The story told here is a per-freed-page story. Is there something better you can think of?
P.S. In a real system, these kinds of bucket brigades often have a
callback function (typically the common
void f(void *) and
pattern) that’s run on each CPU during (the equivalent of)
The description above elided these callbacks as we do not need them (if we
had them, we’d just be using
return;). You may, however, find that
callbacks are useful for testing the bucket brigade or that they could be
interesting for other reasons.
Despite the tricks above, we are still faced with the problem that userspace memory could be mutated by multiple threads running on multiple CPUs. What should we do? There are at least two solutions I can think of:
One answer is to use bounce-buffers: that is, to make a private copy of a userland string and then use that, rather than the userland copy. The copy operation inherently potentially races with other threads, but thereafter all the kernel’s attention is focused on the in-kernel copy.
This will involve auditing and modification of
sysfile.cbut not too much else!
A combination of the RCU-esque trick outlined above and the undergrad one-thread-per-process approach could also be viable! If we could temporarily declare our intention to run only one thread per process, and ensure that the other CPUs knew… we’d be in good shape!
This will likely involve the introduction and use of two new kernel primitive operations, maybe named something like
gomt()which transition the current process between single-threaded and multi-threaded modes. These will probably need to use
xchg()internally to ensure that CPU store buffers are flushed, as with
gost()will, obviously, take some time.
This approach is probably less ideal than the above, but involves maybe less machinery.
Note that buffers that are treated opaquely by the kernel, such as file
write() and the landing area for
read(), do not need this
kind of careful handling. That is, userland can only hurt itself if it
races threads writing to
write(fd, buf, n), right?
OK, finally, the simplest problems. There are a few approaches available,
but the most readily available options for the
Narrow the TOCTOU window so that the check and use may be made atomic. While possibly the most efficient option, this required modification in some or all of
pipe.c, among others. Notably, it required the introduction of so-called “safe-copy” primitives within the kernel.
Serialize all uses of and changes to
proc->sz. This is less efficient, since it implies a great deal more contention between threads, but it is by far more expedient. Unfortuately, the sole encapsulated primitive offered by
spinlock-s, is not sufficient, as a thread cannot sleep with a spinlock held. You will have to use
spinlock-s to build something better.
(Graduate students should note that a
gost()-like approach here is not acceptable, since often the TOC and TOU are separated by blocking operations like disk access and we would dearly love to let some other thread onto the core!)
proc->ofiles, we again have a few options:
ofilearray per-thread instead of per-process. While not ideal, it’s at least one option to consider! (Maybe if we are serious about this option, we would introduce a way of passing files between threads.) We will forbid this approach this time around.
Take advantage of
xv6’s built-in file reference counting and have a small critical section around the process’s
ofilearray which bumps the reference count then drops the lock and refers to the open file structure obtained from the
ofilearray directly, rather than subsequently through
ofile. Don’t forget to drop the reference when done! Graduate submissions should aim for this and expect a few points loss if the solution falls short of this level of minimal serialization.
Serialize all calls within a process through the
sysfileinterfaces. As might be imagined, this is far simpler, if slower. Again,
acquire()will be insufficient for the serialization at hand, as file operations often block. Undergraduate submissions will receive full credit for a correct solution of this flavor.
sysfilecalls and all manipulations of
proc->szin the same way, as very often the two interact. This is slower still, since some pairs of operations, like
sbrk(), will now somewhat artificially serialize.
proc->cwd can be simply managed by sprinking some spinlocks in the
appropriate places; note that the only thing that needs to be guarded in
sys_chdir is the replacement of
ip. If one
grabs the old value of
proc->cwd prior to that replacement as
locks may be dropped before operating on
xv6 kills a process when it misbehaves (e.g. pagefaults).
Your new thread-aware kernel should kill only a thread when it pagefaults,
not the whole process.
As before, thread identifiers and process identifiers are still in the same namespace, and the first thread in a process will have equal identifiers.
Each thread will require its own kernel stack. Unfortunately,
present does not free a kstack until
wait(); that’s sort of OK for
processes, but it’s not OK now that we have
twait(). You will have to design a mechanism for
reclaiming kstacks promptly. An ideal mechanism will bound the number of
stale kstacks in the system by some constant (even if I vary
(0 would be really good, but, while possible, is perhaps tricky to achieve.
1 is good and has a cute solution.
NCPU would be fine and might be
arguably better than 1 on a real system, which xv6 is not exactly.
would be OK but strange.
NTHR may appear to be a constant, but it
This is xv6 and we haven’t asked you to write an allocator and we’re not
going to give you one. Your system should support
processes and a global pool of
NTHR = 256 threads, which are to be
allocated to processes on a first-come-first-served basis.
You should be able to handle the case when one process has grabbed all 256 threads (though admittedly, it would take a modified init to do that).
You must gracefully handle the “out of threads” case in both
tspawn() (i.e. must not
panic() the system) but note that
graceful handling need not include attempting to get more! (This is just in
fork() already handling the out-of-processes case!)
The usual request for a detailed
README with your initial plans, your
software’s evolution, and all that good stuff.
Replace the shipped kernel blob we gave you with something that works for your thread library and behold all that you have made.
In an effort to reduce the surprise and potential stress of being graded, this section’s large impact will be broken up as follows:
Automated testing: coming up to a shell and passing
userteston both 1 and 2 CPU configurations will be worth 30% of this section (21% of the total grade). Note that
usertestdoes not exercise any threading at all, and the upstream xv6 reliably passes it. You should be regression testing early and often!
Additional automated testing: passing all of
318_lots_thr.c, on both 1 and 2 CPU configurations, will be worth 40% of this section (or 28% of the total grade).
Note that none of these tests use your thread library! However, you should be using your thread library and your new knowledge to test parts of the kernel: probing the scheduler with spinlocks is a good way to exercise
yield(). Semaphores exercise
mkrun(). Thread tests exercise a whole lot of pieces!
Hidden test cases will count for another 10% of this section (7% total).
None of these tests will use your thread library either! They might use my thread library, but that’s not your problem.
Manual audit of the submission will count for the remaining 20% of this section (14% total).
This assignment is so very complicated that I am going to insist that you write your own test cases. The ones given to you are a good place to start, and of course you’re welcome to share test cases with each other, but you must write your own tests, ideally that meaningfully test different things. Why yes, I might be crowd-sourcing a threaded xv6 test harness, why do you ask?
Please document what each test is supposed to test in your
and/or at the top of each test file (maybe a quick summary in
and more verbose in the file itself?). Note that good tests come in a few
Regression-like limited probes that touch minimal subsets of functionality and attempt to probe very specific conditions. Ideally, these tests reliably hit their target. (Lots of things in
usertests, and things like
Stress tests which do something to exhaust, or nearly-exhaust, some resource(s) and then try to exercise some condition(s). (
usertestsdoes this a few times)
Reliability tests, sometimes also called Continuous Hours of Operation (“CHO”) tests: fling lots of work at the system and see what happens. Keep it up for as long as possible. Sometimes CHO tests try to recreate (sped up?) real conditions, or sometimes they’re stress-test like, keeping the system low on resources, or sometimes they repeatedly take the system in and out of resource scarcity.
Random testing: given a palette of possible operations, do a random sequence of them! This includes things like “fuzzing” and “monkey testing”.
In many of these cases, the goal is to probe at the failure cases of the system. The success paths get lots of exercise. The failure paths, on the other hand…
Please additionally document your contributions to the communal test pool – linking to posts on Piazza is probably sufficient. Any tests posted to Piazza must have your name(s) in a comment near the top; any code derived from tests on Piazza must leave the original authors’ names there and add your own.
Do not feel obligated to use the TAPish test macros – I certainly will not
judge you for not liking my style – though you may find them convenient,
especially in combination with the
runme driver. A shell-script could
get you pretty good regression testing! (You are also welcome to develop
your own TAPish test “framework”!)
Incidentally, some collected ideas of things to test:
Wild pointer handling. Should kill a thread, not the process, and not the kernel!
kill()a thread ID that isn’t a process ID. Right thing happen?
yield()! Make sure that a few processes can cheerfully bounce back and forth “forever”
yield()-ing to each other. Can you yield to yourself? (
Graduates: that RCU bucket brigade. You can even write tests for this before you have any thread support at all – just push processes around the ring at first.
tspawn()until you can’t any more, making sure that allocation failure is graceful!
tspawn()until you can’t any more, then (have another process)
fork()! (That is: test what happens when there are free
struct procbut no free
TOCTOU vulnerabilities and TLB shootdown: try to crash your kernel by rapidly wiggling
sbrk()and having threads race the kernel!
What happens, anyway, if you race a
for-loop paving over the same buffer (possibly repeatedly? backwards?). For best results, try with
Memory leaks! How big can you push sbrk() after the system has just started up? Bring it back down and go do a bunch of things and wait for them all to finish and clean up. Can you still push sbrk() that high? No? Something’s gone missing! Find it!
“Continuous Hours of Operation” style tests: spawn a whole lot of tests at once and keep them going! Keep the kernel and all CPUs as busy as possible with as many different tasks as you can think of! Terrify the kernel with all of your userland-y demands!
One more clever trick someone once taught me, by the way, is to change the tick frequency. You’ll get a lot of different interleavings of threads if you switch at different speeds!
struct thread. Initially, allow each
struct procto have exactly one
struct thread, (and for each
struct threadto belong to only one
struct proc, but that part isn’t surprising).
You will probably want to revise the “per-CPU” data by either adding a notion of “current thread” or by replacing the existing “proc” with the same. Note that if you wish to change the size of the per-CPU data, vm.c’s
seginit()function will need to know.
Make the requisite changes to
trap.cfor errant behavior, even if just in skeletal form. (Maybe fatal traps become infinite loops, since we don’t know how to kill just the current thread yet?) (Though while here, you can probably start to guess what’s in
Chase compilation failures through the whole system until you have something that reliably passes
usertests. Congratulations, you have just made the foundation of all subsequent efforts!
Change the data structure! Allow (but do not give a mechanism for creating!) more than one thread in a process. You will encounter ample food for thought in this process; make note of every question you encounter!
Graduate students probably want to change what
schedule()scans! Undergrad students can write an initial version of their two-tier scheduler now.
Chase compilation failures again and then debug until
Either here or at the previous step, it will probably do you some good to make
procdump()more than superficially aware of this new threading thing. Calling
procdump()at choice moments can be a huge boon to debugging! (And hitting
Ctrl-Pon the keyboard is always satisfying!)
Sometime around now might be the right time to add
318_dmr.cshould help you get going.
Think about how to guard
ofile). You can add those guards now and the system should continue to pass
usertests! (The guards will be guarding against something that cannot happen!)
Graduate students: think about that RCU bucket brigade. Get it working (maybe add some system calls and
cprintf()-s around the system to observe that things are behaving as you expect).
(This is an excellent opportunity to demonstrate your test-writing skills as part of the test-case part of the assignment.)
Graduate students: wire up the RCU bucket brigade where it needs to go. The system should continue to reliably pass
New kstack reclaim: design and move your kernel to use a new technique. As per usual,
usertestsshould continue to pass!
wait()should likely be gone, now.
The moment of truth! Add
318_lots_thr.cshould help you get on your feet and believe that things are working! (It’s OK not to have
Finish the assignment. (That means
gettid()and everything I might have left out).
Test, test, and test some more. (Rather like Portal!) While here, marvel at what you are doing: you have added parallel threads (concurrent, too, in the case of grad students) with shared-by-default memory to UNIX. This, historically, is a big f****** deal, as Vice President Joe Biden might say.
Recall that one last bit of advice: if something really isn’t working, get up and stretch. Go for a walk around the building. Relax. Do other work. Come back later and you’ll see things you didn’t see before.
Unlike the CMU original version, this version uses
tspawn()instead of a much more raw
thread_fork()which spawns a new thread with an identical register file, requiring the use of assembler in the analog of
void texit(int *where, int what)in line with Assignments for an xv6 OS Class: Miniature pthreads Library.
Added commentary about