#######################################################
Assignments for an xv6 OS Class: Kernel Thread Support
#######################################################

.. sidebar:: Contents

   .. contents::
      :depth: 2
      :local:

This is a modified version of an assignment that ran first in Fall 2015 at
JHU.  See http://gaming.jhu.edu/~phf/2015/fall/cs318/assignment-uthreads.shtml
and the `Changelog`_ below.  If you are interested in using this assignment
in your course, please contact me for reference implementations and test
cases!

The first half of this assignment is :doc:`osassign-uthreads`.

Overview
########

Oh yes, we're back!  Now that you have a working ``uthr`` library, it's time
for you to take off the training wheels!  By that, I mean, essentially,
``mv kernel kernel.provided`` and make things work again!

As before, this is almost entirely taken from `Carnegie Mellon's Operating
Systems class `_.  *This assignment has been modified from its original
version.  It has been* `reformatted to fit this semester `_.

You'll start with the xv6 from `nwf's git repository `_ at commit
a7934c7cd236bab8e431e7c9cde93256dd40371b for this assignment.  This commit
includes no test cases, but does include some useful fixes to the kernel
code.  Any additional test cases will be distributed via the course Piazza
page.

System Calls
============

The kernel system calls you have to support are *as described* in the
:doc:`previous assignment `.  To quickly review:

Scheduling
----------

* ``void yield(int tid)`` will invoke the scheduler again with a *hint* that
  it would be good for ``tid`` to be on core *soon* (ideally, *now*), giving
  up the remainder of this thread's timeslice.  Note that if there are no
  other runnable threads, the kernel may return to this one without context
  switching.  If ``tid`` is negative, the kernel will simply run the next
  ``RUNNABLE`` thread.  Yielding to one's self is *not an error*, but *may*
  return immediately rather than actually entering the scheduler.

* ``void desch(int *guard)`` will put the current thread to sleep (that is,
  suspend its execution) if ``*guard`` is nonzero.  If the process
  ``exit()``-s or is subject to ``kill()`` while this thread remains so
  suspended, this thread will be destroyed, too.

* ``int mkrun(int tid)`` should awaken a thread that is sleeping in
  ``desch()``.  If there is no such thread, it will return a negative number;
  otherwise, it will return ``0``.  ``mkrun(tid)`` will return failure if
  there is no thread identified by ``tid``, and will return success if
  ``tid`` exists and was asleep due to ``desch()``.  For the purpose of this
  exercise, its return value is undefined in other cases (though obviously
  the kernel blob does *something*).  It is not an error to call ``mkrun()``
  on a thread that is not sleeping due to ``desch()``, but the return value
  is not specified (it could be viewed as success -- the target thread is,
  after all, now not blocked having ``desch()``-ed itself... but it could
  also be viewed as failure -- the target thread wasn't in ``desch()`` to be
  made runnable, now was it?).

A thread's invocation of ``desch()`` is *atomic* with respect to any
``mkrun()`` targeting that thread.  That is, the decision to deschedule
happens *entirely before* or *entirely after* any decision to wake up;
dually, the decision to wake up happens *entirely before* or *entirely
after* any decision to deschedule.
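To make that guarantee concrete, here is the usage pattern ``desch()`` and
``mkrun()`` are designed for, as a userland sketch against the calls above
(the names ``parked``, ``sleeper_tid``, ``block()``, and ``unblock()`` are
made up for illustration and are not part of the assignment):

.. code-block:: c

   /* Shared user memory between two cooperating threads. */
   static volatile int parked;   /* nonzero: the sleeper intends to sleep   */
   static int sleeper_tid;       /* the sleeper's gettid(), published early */

   /* Sleeper side: park until somebody pokes us. */
   void
   block(void)
   {
     parked = 1;
     /* desch() examines *guard atomically with respect to any mkrun()
      * aimed at us: either it observes 0 and returns immediately, or it
      * suspends us and a later mkrun(sleeper_tid) wakes us.  Either way,
      * the wakeup below cannot be lost. */
     desch((int *)&parked);
   }

   /* Waker side: release the sleeper (or keep it from sleeping at all). */
   void
   unblock(void)
   {
     parked = 0;
     mkrun(sleeper_tid);   /* fine even if the sleeper never actually slept */
   }

This is, presumably, the pattern your ``uthr`` library already leans on; the
atomicity guarantee is what keeps the ``parked = 0`` / ``mkrun()`` pair from
slipping into the gap between the sleeper's check and its suspension.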
Threads
-------

* ``int gettid(void)`` will return the current thread's identifier.

* ``int tspawn(void *stktop, void (*f)(void *), void *a)`` starts a new
  thread on the stack indicated by ``stktop``, running the function ``f``
  with argument ``a``.  If it is unable to do so, it will return a negative
  number; otherwise, it will return the thread identifier of the spawned
  thread (somewhat like ``fork()``).  ``tspawn()`` requires two words
  *below* ``stktop``.

* ``void texit(int *where, int what)`` ends the execution of the current
  thread after validating that ``where`` is a valid pointer and *atomically*
  committing ``what`` to ``where``.

Miscellany
----------

* ``exit()`` and ``kill()`` will terminate (or schedule for prompt
  termination) *all* threads in a process.  Thankfully, a mechanism already
  exists in xv6 for signaling this.

* ``wait()`` returns only when a child *process* has no more *threads* in
  it; that is, when *all threads* have exited (or been killed).

* ``fork()`` and ``exec()`` are permitted to fail if there is more than one
  thread running in the current process.

Memory Safety Problems
######################

One of the biggest challenges of a multi-threaded system is that the kernel
and userland can now race each other.  In stock ``xv6``, there's only one
CPU accessing a given (non-kernel) page at a time, so the only locking that
has to happen is between kernel functions.  We punted on this in the ``shm``
assignment by telling you to make the kernel consider shared memory segments
invalid userland pointers, which was sort of OK; maybe the kernel doesn't
have a whole lot of business looking at your shared memory anyway, and if
you want to ``read()`` or ``write()`` (for example) something in shared
memory, you can use a **bounce buffer** that isn't in shared memory.
Unfortunately, with our thread model, we can no longer punt: *all* memory in
a process is shared between threads.  There's just no way around it any
more.

Representative of various classes of challenge are "boring" syscalls like
``sleep()``, "straightforward" calls like ``sbrk()`` and ``close()``,
"asynchronously memory-accessing" calls like ``write()``, and
"memory-access-with-parsing" calls like ``open()``.  Let's look in detail at
some of the challenges before we talk about any solutions.

Use-Site Validation
===================

It is informative to begin by thinking about the following case, assuming
that ``sbrk(0)`` returns a *page-aligned* value.

+---------------------------+--------------------------+
| Thread (CPU) 1            | Thread (CPU) 2           |
+===========================+==========================+
| Call ``ptr = sbrk(0)``    |                          |
+---------------------------+--------------------------+
| Call ``write(0,ptr,1)``   | Call ``sbrk(-PGSIZE)``   |
+---------------------------+--------------------------+
| Load ``sz = proc->sz``    | ``sz = proc->sz``        |
+---------------------------+--------------------------+
| Check that ``ptr < sz``   |                          |
+---------------------------+--------------------------+
|                           | ``growproc(-PGSIZE)``    |
+---------------------------+--------------------------+
|                           | ``proc->sz = sz-PGSIZE`` |
+---------------------------+--------------------------+
| Do file stuff...          | Return to userland       |
+---------------------------+--------------------------+
| Read from checked ``ptr`` |                          |
+---------------------------+--------------------------+

What happens?  One answer (there are, sadly, others; keep reading) is that
CPU 1 ``panic()``-s because it takes a page fault while in the kernel!
Well, that stinks!  What happened?
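For concreteness, the check CPU 1 made above is the one in xv6's
pointer-argument helper in ``syscall.c``, which in the upstream tree looks
roughly like this (paraphrased; your revision may differ slightly):

.. code-block:: c

   // Fetch the nth word-sized system call argument as a pointer to a
   // block of memory of size bytes.  Check that the pointer lies within
   // the process address space.
   int
   argptr(int n, char **pp, int size)
   {
     int i;

     if(argint(n, &i) < 0)
       return -1;
     if((uint)i >= proc->sz || (uint)i+size > proc->sz)
       return -1;
     *pp = (char*)i;
     return 0;
   }

The ``proc->sz`` comparison is exactly the "Check that ``ptr < sz``" row of
the table; nothing stops another thread from shrinking ``proc->sz`` between
that comparison and the eventual dereference.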
We have introduced, by adding threads to the system, a `Time-Of-Check to
Time-of-Use ("TOCTOU") `_ window wherein our check (that ``ptr`` was below
``proc->sz``, which it *was*!) is no longer relevant by the time we *use*
the data we checked.  Any fix will require careful auditing of the entire
xv6 syscall vocabulary to determine when ``proc->sz`` is checked vs. when
these observations are actually depended upon.  Thankfully, there are not
many syscalls!

File Descriptors
================

A similar race happens between two threads accessing the collection of open
file descriptors: a ``read()`` could race with a ``close()``, for example.
If we were not careful, ``read()`` could end up with a pointer to a closed
file structure, and madness would ensue.  One must also guard the management
of ``proc->cwd``, but thankfully, that is easy.

Buffer Content Validity
=======================

In many cases, because it "knows" that there is no possibility of concurrent
mutation, the xv6 kernel checks that a buffer is "real" (backed by memory)
and then proceeds to perform (potentially) *multiple* operations on that
buffer.  (See, for example, the behavior of ``namex()``: the ``path``
parameter might be a validated pointer to user memory, arriving via, e.g.,
``sys_unlink`` and ``nameiparent``.)  This all works fine when pages are not
accessible by other mutators, but we run into problems here, now that things
can change out from under us!  This is even worse than the earlier problems
because the other things that might "interfere" are threads running
*userland* code, rather than being in the kernel... so we're gonna have to
get creative.

Other CPUs' TLBs
================

There's another problem, though, I'm sad to say: ``growproc()`` knows how to
shrink a process and knows how to ``kfree()`` the frame that it just
extracted from a ``pte``, but... it's not safe to do that any more.  In a
multi-*processor* system, another CPU may well have that PTE cached in its
TLB, and so your freed memory is now exposed to userland and **OOPS**!  In
tabular form,

+------------------------------------+-----------------------------------+
| Thread (CPU) 1                     | Thread (CPU) 2                    |
+====================================+===================================+
| Call ``ptr = sbrk(0) - PGSIZE``    |                                   |
+------------------------------------+-----------------------------------+
| ``x = *ptr``; this pulls the       |                                   |
| PTE backing ``ptr`` into the TLB   |                                   |
+------------------------------------+-----------------------------------+
|                                    | Call ``sbrk(-PGSIZE)``            |
+------------------------------------+-----------------------------------+
|                                    | ``growproc(-PGSIZE)`` changes     |
|                                    | ``proc->pgdir`` to remove the PTE |
|                                    | backing ``ptr`` and ``kfree()``-s |
|                                    | that page                         |
+------------------------------------+-----------------------------------+
|                                    | ``growproc()`` calls              |
|                                    | ``switchuvm()`` to flush the TLB  |
|                                    | of **this** CPU!                  |
+------------------------------------+-----------------------------------+
| Read and write above ``ptr`` to    |                                   |
| snoop the kernel's internals!      |                                   |
+------------------------------------+-----------------------------------+

In particular, this process would have free access to ``kalloc()``'s *free
list*, which would enable it with relatively little effort to take over the
kernel.  (Details left as an exercise for the reader, but the basic idea is:
you can, with high confidence, control what the ``kalloc()``-after-next will
return, since the freshly freed page is quite likely to be at the head of
the free list.
If you can cause ``sbrk()`` to map a page of kernel code for you...)  Isn't
that **scary**?!

Memory Safety Solutions
#######################

What's to be done?  Let's start with the most terrifying, TLB shootdown
issues.  Broadly speaking, two things were needed to get us into that
problem:

1. the same address space (``pgdir``) mapped on multiple CPUs, and

2. a lack of synchronization between CPUs when the address space was
   updated, which would have ensured that there were no other references to
   the removed page given to ``kfree()``.

We're going to require different things of different students:

**Undergraduate students** are going to take a *large hammer* approach and
eliminate the problem by preventing multiple CPUs from sharing a context at
once.

**Graduate students** will be required to live with multiple CPUs running in
the same address space, and so will attack the problem by addressing the
lack of synchronization.  That is, only graduate kernels will actually offer
the benefit of multiple cores to multi-threaded processes.

(Though enthusiastic undergrads are invited, *having solved the problem
using our stipulated approach first*, to submit a delta showing the second
approach.  We will grade the first and will look at the second for a little
extra credit.)

Undergraduates: Two-Tier Scheduling
===================================

While real kernels schedule *threads* and let the *process* components come
along for the ride, we're going to stick to a *historically real* approach
of scheduling *processes* and *threads within processes*.  That is, just
like the current xv6 scheduler, your scheduler will pick a *process* to
bring on core and prevent any other CPU from scheduling that process while
it's claimed by this one.  Having made that decision, the scheduler will
*then* pick, in a round-robin fashion, a thread from the runnable threads
within this process.  (There should be no *running* threads at this point,
right?)  It is up to you to decide how often to switch between processes
versus threads within a process; options range from "run every thread before
moving to the next process" to "run only one thread before considering a
different process".  Note that your scheduler should not favor some subset
of threads in a process: even if it switches processes before exhausting all
threads within a process, every thread should still get a chance to run,
even if none of the other threads ever voluntarily sleeps.

Interestingly, this approach also extends naturally to handling our buffer
validity problems: it suffices to *prevent intra-process inter-thread
scheduling* while manipulating a buffer.  This gives the buffer manipulation
code the illusion that nothing has changed, because there will be no
concurrent mutators in this address space!  Isn't that cute?

It is *OK* if you make certain other operations done by the kernel prevent
intra-process inter-thread scheduling, for some period of time, even if they
do not (and should not) prevent inter-process scheduling.  Please try to
keep the number of affected subsystems to a minimum.  Part of keeping this
to a minimum will involve allowing inter-thread (esp. intra-process)
scheduling to occur even with, e.g., an ``sbrk()`` call in progress.  (That
is, we expect the kernel to be maximally "preemptive" -- it should allow a
thread that is running *purely userland code* to be scheduled and run
normally even while other threads (in the same process or not) are
contending over various resources and going out to disk and... so on.)
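To give a feel for the two tiers, here is one possible shape for the pick,
written against a *hypothetical* post-split ``ptable`` (``struct thread``,
``p->threads``, ``T_RUNNABLE``, ``CLAIMED``, and ``curthread`` are all
assumed names for illustration, not code you have been given):

.. code-block:: c

   void
   scheduler(void)
   {
     struct proc *p;
     struct thread *t;

     for(;;){
       sti();                         // enable interrupts on this CPU
       acquire(&ptable.lock);
       for(p = ptable.proc; p < &ptable.proc[NPROC]; p++){
         if(p->state != RUNNABLE)     // outer tier: pick a process...
           continue;
         p->state = CLAIMED;          // ...and claim it so no other CPU will

         // Inner tier: round-robin over this process's runnable threads.
         // (A real version must cope with the process exiting, threads
         // changing state underneath us, and so on.)
         for(t = p->threads; t != 0; t = t->pnext){
           if(t->state != T_RUNNABLE)
             continue;
           curthread = t;             // per-CPU "current thread"
           switchuvm(p);              // post-split, this also needs t's kstack
           t->state = T_RUNNING;
           swtch(&cpu->scheduler, t->context);
           switchkvm();               // thread came back; on kernel pgdir again
           curthread = 0;
         }

         p->state = RUNNABLE;         // give the process back
       }
       release(&ptable.lock);
     }
   }

This particular sketch runs every runnable thread once before moving on,
i.e. the first of the two extremes described above; sliding toward the other
extreme just means breaking out of the inner loop earlier.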
Two-tier scheduling does *not*, however, completely solve the TOCTOU and
open files issues; see below.

Graduates: CPU Synchronization Tricks
=====================================

Alright, we're going to actually *solve* this problem, so that
multi-threaded workloads actually get a benefit from having multiple cores
available!

One option -- one frequently employed in the real world, even -- is to use
the LAPIC and things called Inter-Processor Interrupts to cause other
processors to promptly evict entries from their TLB.  *You are welcome to
take this approach and ignore what follows, but please* **carefully
document** *anything and everything you study to write your code*.

However, this being ``xv6``, we're never really in all that much of a hurry,
so it's OK if reducing ``proc->sz`` takes a while, especially if it keeps
code simpler.  Besides, how often does a multi-threaded application *reduce*
its memory footprint?

**One way** to do this is to borrow a technique from some really cool
technology called `Read-Copy-Update Mutual Exclusion ("RCU") `_.  The chain
of reasoning goes like this:

1. Imagine we knew when all CPUs in the system had flushed their TLBs.
   Concretely, imagine a magic function ``wait_everyone_tlb_flush(void)``
   that somehow knows just how long to wait until this happens.

2. If we had removed the PTE before that, but not freed the page, then
   *after* ``wait_everyone_tlb_flush()`` returns, we would know that no CPU
   could have that entry cached in its TLB, by construction.

3. When do processors running ``xv6`` flush their TLBs?  Rarely, but: *at
   least every time* they return to the ``scheduler()`` function, which
   calls ``switchkvm()`` when it takes a process off-core.

4. How could we know when *every* CPU has gone through ``scheduler()``?
   What if the CPUs formed a "bucket-brigade ring", passing threads to the
   next one in line?

5. Ah, so a thread could put itself *into* the bucket-brigade ring at the
   "next" CPU, saying "my stop is my current CPU".  (**Think carefully**
   about any atomicity requirements on "current" and "next"!)  Then, around
   the ring it goes until the current CPU says "Oh, look, a thread for me!"
   and makes it run again.

In tabular form,

+----------------------+----------------------+----------------------+
| CPU 1                | CPU 2                | CPU 3                |
+======================+======================+======================+
| Decreasing ``sz``,   |                      |                      |
| want to              |                      |                      |
| ``kfree(page)``      |                      |                      |
+----------------------+----------------------+----------------------+
| Thread x puts itself |                      |                      |
| into the ring at CPU |                      |                      |
| 2 and goes off-core. |                      |                      |
+----------------------+----------------------+----------------------+
| ``scheduler()``      |                      |                      |
| runs someone else    |                      |                      |
+----------------------+----------------------+----------------------+
|                      | Timer interrupt!     |                      |
|                      | ``trap()`` calls     |                      |
|                      | ``yield()``, which   |                      |
|                      | gets us to           |                      |
|                      | ``scheduler()``      |                      |
+----------------------+----------------------+----------------------+
|                      | Pass x to CPU 3      |                      |
+----------------------+----------------------+----------------------+
|                      |                      | ``scheduler()``      |
|                      |                      | passes x to CPU 1    |
+----------------------+----------------------+----------------------+
| ``scheduler()`` sees |                      |                      |
| the thread come home |                      |                      |
| and makes it         |                      |                      |
| runnable again       |                      |                      |
+----------------------+----------------------+----------------------+

After this point, the thread will come on core again and will be secure in
the knowledge that all other CPUs have flushed their TLBs.
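Here is one possible shape for the scheduler-side half of that bucket
brigade (everything here -- ``struct tlbring``, ``homecpu``, ``ringnext``,
``ringpush()``, ``makerunnable()``, and the deliberately simplistic locking
-- is an assumed name or simplification for illustration only):

.. code-block:: c

   struct tlbring {
     struct spinlock lock;
     struct thread *waiters;    // singly linked through t->ringnext
   };

   struct tlbring ring[NCPU];

   // Called from scheduler() on each pass, at a point where this CPU has
   // just reloaded %cr3 (e.g. right after switchkvm()) and therefore holds
   // no stale user-space TLB entries.
   void
   tlbring_pass(int mycpu)
   {
     struct thread *t, *next, *list;

     acquire(&ring[mycpu].lock);
     list = ring[mycpu].waiters;
     ring[mycpu].waiters = 0;
     release(&ring[mycpu].lock);

     for(t = list; t != 0; t = next){
       next = t->ringnext;
       if(t->homecpu == mycpu)
         makerunnable(t);                  // full lap: every CPU has flushed
       else
         ringpush((mycpu + 1) % ncpu, t);  // not ours: pass it along
     }
   }

The other half, named ``waitscheds()`` in the next paragraph, would record
``homecpu``, push the current thread onto the *next* CPU's ring, and go to
sleep until ``makerunnable()`` brings it back.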
So: if we had a function like ``waitscheds(...)`` and a cooperatively
modified ``scheduler()``, then we'd just have to

1. remove the PTE from the process's page directory,

2. wait for ``waitscheds(...)`` to return, and

3. call ``kfree()``.

Isn't that awesome?  **You probably want to write and test this RCU ring
stand-alone**!

P.S. There may be some funny locking involved!  In particular, the thread
must go to sleep when it puts itself on the bucket brigade, and so...

P.S. Do we have to play this RCU game when we're cleaning up a page
directory?  Why or why not?

P.S. The story told here is a per-freed-page story.  Is there something
better you can think of?

P.S. In a real system, these kinds of bucket brigades often have a
*callback* function (typically the common ``void f(void *)`` and ``void *a``
pattern) that's run on *each CPU* during (the equivalent of)
``scheduler()``.  The description above elided these callbacks as we do not
*need* them (if we had them, we'd just be using ``return;``).  You may,
however, find that callbacks are useful for testing the bucket brigade or
that they could be interesting for other reasons.

Graduate Students: Buffer Safety
================================

Despite the tricks above, we are still faced with the problem that userspace
memory could be mutated by multiple threads running on multiple CPUs.  What
should we do?  There are **at least** two solutions I can think of:

* One answer is to use **bounce buffers**: that is, to make a *private
  copy* of a userland string and then use *that*, rather than the userland
  copy.  The *copy* operation itself still potentially races with other
  threads, but thereafter all the kernel's attention is focused on the
  in-kernel copy.  This will involve auditing and modification of
  ``sysfile.c`` but not too much else!

* A *combination* of the RCU-esque trick outlined above and the undergrad
  one-thread-per-process approach could also be viable!  If we could
  *temporarily* declare our intention to run only one thread per process,
  *and ensure that the other CPUs knew*... we'd be in good shape!  This will
  likely involve the introduction and use of two new kernel primitive
  operations, maybe named something like ``gost()`` and ``gomt()``, which
  transition the current process between single-threaded and multi-threaded
  modes.  These will probably need to use ``xchg()`` internally to ensure
  that CPU store buffers are flushed, as with ``release()``.  ``gost()``
  will, obviously, take some time.  This approach is probably less ideal
  than the above, but perhaps involves less machinery.

Note that buffers that are treated *opaquely* by the kernel, such as file
data for ``write()`` and the landing area for ``read()``, do not need this
kind of careful handling.  That is, userland can only hurt itself if it
races threads writing to ``buf`` against a ``write(fd, buf, n)``, right?

Everyone: TOCTOU and ofiles
===========================

OK, finally, the simplest problems.

There are a few approaches available, but the most readily available options
for the ``proc->sz`` problem, perhaps, are:

* Narrow the TOCTOU window so that the check and use may be made atomic.
  While possibly the most *efficient* option, this requires modification in
  some or all of ``console.c``, ``file.c``, ``fs.c``, and ``pipe.c``, among
  others.  Notably, it requires the introduction of so-called "safe-copy"
  primitives within the kernel.

* Serialize all uses of and changes to ``proc->sz``.  This is less
  efficient, since it implies a great deal more contention between threads,
  but it is far more expedient.

Unfortunately, the sole encapsulated primitive offered by ``xv6`` for
serialization, ``spinlock``-s, is not sufficient, as a thread cannot sleep
with a spinlock held.  You will have to use ``spinlock``-s to build
something better.

(Graduate students should note that a ``gost()``-like approach here is *not
acceptable*, since often the TOC and TOU are separated by blocking
operations like disk access and we would dearly love to let some other
thread onto the core!)
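For reference, the flavor of "something better" being hinted at is a *sleep
lock*: a lock whose holder is allowed to block, built from a ``spinlock``
plus ``sleep()``/``wakeup()``.  Newer revisions of xv6 ship a very similar
``sleeplock.c``; a minimal sketch (names mine, not provided code) looks like
this:

.. code-block:: c

   struct slock {
     struct spinlock lk;   // protects 'locked'
     int locked;
   };

   void
   slock_acquire(struct slock *s)
   {
     acquire(&s->lk);
     while(s->locked)
       sleep(s, &s->lk);   // atomically releases lk and sleeps; reacquired on wake
     s->locked = 1;
     release(&s->lk);
   }

   void
   slock_release(struct slock *s)
   {
     acquire(&s->lk);
     s->locked = 0;
     wakeup(s);
     release(&s->lk);
   }

The spinlock is held only for the handful of instructions around the flag,
so other threads (and other CPUs) keep running while the lock holder is off
blocking on the disk.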
For ``proc->ofile``, we again have a few options:

* Make the ``ofile`` array per-*thread* instead of per-*process*.  While not
  ideal, it's at least one option to consider!  (Maybe if we were serious
  about this option, we would introduce a way of passing files between
  threads.)  We will **forbid** this approach this time around.

* Take advantage of ``xv6``'s built-in file reference counting: have a small
  critical section around the process's ``ofile`` array which bumps the
  reference count, then drop the lock and thereafter refer to the open file
  structure obtained from ``ofile`` directly, rather than going back through
  ``ofile``.  Don't forget to drop the reference when done!  **Graduate**
  submissions should aim for this and should expect to lose a few points if
  the solution falls short of this level of minimal serialization.

* Serialize all calls within a process through the ``sysfile`` interfaces.
  As might be imagined, this is far simpler, if slower.  Again,
  ``acquire()`` will be insufficient for the serialization at hand, as file
  operations often block.  **Undergraduate** submissions will receive full
  credit for a correct solution of this flavor.

* Serialize ``sysfile`` calls *and* all manipulations of ``proc->sz`` in the
  same way, as very often the two interact.  This is slower still, since
  some pairs of operations, like ``sys_close()`` and ``sbrk()``, will now
  somewhat artificially serialize.

``proc->cwd`` can be simply managed by sprinkling some spinlocks in the
appropriate places; note that the only thing that needs to be guarded in
``sys_chdir`` is the *replacement* of ``proc->cwd`` with ``ip``.  If one
grabs the old value of ``proc->cwd`` prior to that replacement as ``oldip``,
locks may be dropped before operating on ``oldip``.

Threads
#######

Errant Behavior
===============

Upstream ``xv6`` kills a *process* when it misbehaves (e.g., page faults).
Your new thread-aware kernel should kill only a *thread* when it page
faults, not the whole process.

Thread Identifiers
==================

As before, *thread identifiers* and *process identifiers* are still in the
same namespace, and the first thread in a process will have a thread
identifier equal to its process identifier.

Kernel Stacks
=============

Each thread will require its own kernel stack.  Unfortunately, ``xv6`` at
present does not free a kstack until ``wait()``; that's sort of OK for
processes, but it's not OK now that we have ``tspawn()`` and ``texit()`` and
nothing like ``twait()``.  You will have to design a mechanism for
reclaiming kstacks *promptly*.  An ideal mechanism will bound the number of
stale kstacks in the system by some *constant* (even if I vary ``NTHR``).
(0 would be really good, but, while possible, is perhaps tricky to achieve.
1 is good and has a cute solution.  ``NCPU`` would be fine and might be
arguably better than 1 on a real system, which xv6 is not exactly.
``NPROC`` would be OK but strange.  ``NTHR`` may appear to be a constant,
but it isn't.)
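To show the flavor of one such family of mechanisms (a sketch only; the
field names, the thread states, and even the placement are assumptions, and
plenty of other designs are acceptable): defer the ``kfree()`` to a context
that is *not* running on the doomed stack, for example the scheduler, which
by the time ``swtch()`` returns is back on its own per-CPU stack.

.. code-block:: c

   // In scheduler(), just after swtch() has returned control to us and t
   // is the thread we just switched away from:
   if(t->state == T_ZOMBIE && t->kstack != 0){
     kfree(t->kstack);        // safe: we are on the per-CPU scheduler stack
     t->kstack = 0;
     t->state = T_UNUSED;     // the slot may now be reused by tspawn()/fork()
   }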
Allocations
===========

This is xv6: we haven't asked you to write an allocator, and we're not going
to give you one.  Your system should support ``NPROC`` (i.e. ``64``)
processes and a *global pool* of ``NTHR = 256`` threads, which are to be
allocated to processes on a first-come-first-served basis.  You should be
able to handle the case where one process has grabbed all 256 threads
(though admittedly, it would take a modified init to do that).  You must
gracefully handle the "out of threads" case in both ``fork()`` and
``tspawn()`` (i.e. must not ``panic()`` the system), but note that graceful
handling need not include attempting to get more!  (This is just in line
with ``fork()`` already handling the out-of-processes case!)

The Assignment Itself
#####################

Part 1: Study, Design, Log (15%)
================================

The usual request for a detailed ``README`` with your initial plans, your
software's evolution, and all that good stuff.

Part 2: xxvv66: xv6 with threads (70%)
======================================

Replace the shipped kernel blob we gave you with something that works for
your thread library and *behold all that you have made*.

Grade Breakdown
---------------

In an effort to reduce the surprise and potential stress of being graded,
this section's large impact will be broken up as follows:

* Automated testing: coming up to a shell and passing ``usertests`` on both
  1- and 2-CPU configurations will be worth 30% of this section (21% of the
  total grade).  Note that ``usertests`` does not exercise any threading at
  all, and upstream xv6 reliably passes it.  You should be regression
  testing early and often!

* Additional automated testing: passing all of ``318_ts1.c``, ``318_dmr.c``,
  and ``318_lots_thr.c``, on both 1- and 2-CPU configurations, will be worth
  40% of this section (or 28% of the total grade).  **Note that none of
  these tests use your thread library**!  However, you should be using your
  thread library and your new knowledge to test parts of the kernel:
  probing the scheduler with spinlocks is a good way to exercise
  ``yield()``.  Semaphores exercise ``desch()`` and ``mkrun()``.  Thread
  tests exercise a whole lot of pieces!

* Hidden test cases will count for another 10% of this section (7% total).
  **None of these tests will use your thread library either**!  They might
  use *my* thread library, but that's not your problem.

* Manual audit of the submission will count for the remaining 20% of this
  section (14% total).

Part 3: Test Cases! (15%)
=========================

This assignment is *so very complicated* that I am going to insist that you
write your own test cases.  The ones given to you are a good place to start,
and of course you're welcome to share test cases with each other, but you
must write **your own** tests, ideally ones that *meaningfully test
different things*.  Why yes, I might be crowd-sourcing a threaded xv6 test
harness, why do you ask?

Please **document** what each test is supposed to test in your ``README``
and/or at the top of each test file (maybe a quick summary in ``README`` and
a more verbose description in the file itself?).

Note that good tests come in a few flavors:

* Regression-like *limited* probes that touch minimal subsets of
  functionality and attempt to probe very specific conditions.  Ideally,
  these tests *reliably* hit their target.  (Lots of things in
  ``usertests``, and things like ``318_ts1.c`` or ``318_dmr.c``.)

* Stress tests, which do something to exhaust, or nearly exhaust, some
  resource(s) and then try to exercise some condition(s).
  (``usertests`` does this a few times)

* Reliability tests, sometimes also called Continuous Hours of Operation
  ("CHO") tests: fling lots of work at the system and see what happens.
  Keep it up for as long as possible.  Sometimes CHO tests try to recreate
  (sped-up?) real conditions; sometimes they're stress-test-like, keeping
  the system low on resources; sometimes they repeatedly take the system
  into and out of resource scarcity.

* Random testing: given a palette of possible operations, do a random
  sequence of them!  This includes things like "fuzzing" and "monkey
  testing".

In many of these cases, the goal is to probe at the *failure cases* of the
system.  The *success* paths get lots of exercise.  The *failure* paths, on
the other hand...

Please *additionally document* your contributions to the communal test pool
-- linking to posts on Piazza is probably sufficient.  Any tests posted to
Piazza **must** have your name(s) in a comment near the top; any code
derived from tests on Piazza **must leave the original authors' names there
and add your own**.

Do not feel obligated to use the TAPish test macros -- I certainly will not
judge you for not liking my style -- though you may find them convenient,
especially in combination with the ``runme`` driver.  A shell script could
get you pretty good regression testing!  (You are also welcome to develop
your own TAPish test "framework"!)

Incidentally, some collected ideas of things to test:

* Wild pointer handling.  Should kill a thread, not the process, and not
  the kernel!

* ``kill()`` a thread ID that isn't a process ID.  Does the right thing
  happen?

* ``yield()``!  Make sure that a few processes can cheerfully bounce back
  and forth "forever" ``yield()``-ing to each other.  Can you yield to
  yourself?  ( ``;)`` )

* Graduates: that RCU bucket brigade.  You can even write tests for this
  before you have any thread support at all -- just push processes around
  the ring at first.

* ``tspawn()`` until you can't any more, making sure that allocation
  failure is graceful!

* ``tspawn()`` until you can't any more, *then* (have another process)
  ``fork()``!  (That is: test what happens when there are free ``struct
  proc`` but no free ``struct thread``!)

* TOCTOU vulnerabilities and TLB shootdown: try to crash your kernel by
  rapidly wiggling ``sbrk()`` and having threads race the kernel!

* What *happens*, anyway, if you race a ``write()`` against a ``for``-loop
  paving over the same buffer (possibly repeatedly? backwards?)?  For best
  results, try with ``CPUS=2``.

* Memory leaks!  How big can you push ``sbrk()`` after the system has just
  started up?  Bring it back down and go do a bunch of things and wait for
  them all to finish and clean up.  Can you still push ``sbrk()`` that
  high?  No?  Something's gone missing!  Find it!

* "Continuous Hours of Operation" style tests: spawn a whole lot of tests
  at once and keep them going!  Keep the kernel and all CPUs as busy as
  possible with as many different tasks as you can think of!  Terrify the
  kernel with all of your userland-y demands!

One more clever trick someone once taught me, by the way, is to *change the
tick frequency*.  You'll get a lot of different interleavings of threads if
you switch at different speeds!
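To give a feel for the sort of raw, library-free test the automated tests
are (a sketch only, not one of the provided ``318_*.c`` files; the thread
count and stack size are arbitrary):

.. code-block:: c

   #include "types.h"
   #include "stat.h"
   #include "user.h"

   #define NT    4
   #define STKSZ 4096

   static volatile int done[NT];

   static void
   child(void *a)
   {
     volatile int *flag = a;
     texit((int *)flag, 1);    // atomically publish completion and die
   }

   int
   main(void)
   {
     int i, tids[NT];

     for(i = 0; i < NT; i++){
       char *stk = malloc(STKSZ);
       if(stk == 0){ printf(1, "FAIL: malloc\n"); exit(); }
       tids[i] = tspawn(stk + STKSZ, child, (void *)&done[i]);
       if(tids[i] < 0){ printf(1, "FAIL: tspawn\n"); exit(); }
     }
     for(i = 0; i < NT; i++)
       while(done[i] == 0)
         yield(tids[i]);        // hint the scheduler at the laggard
     printf(1, "ok: all %d threads exited\n", NT);
     exit();
   }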
Suggested Plan Of Attack
########################

1. Mechanically separate ``struct proc`` into ``struct proc`` and ``struct
   thread``.  Initially, allow each ``struct proc`` to have *exactly one*
   ``struct thread`` (and each ``struct thread`` to belong to only one
   ``struct proc``, but that part isn't surprising).  (A sketch of one
   possible split appears after this list.)

   You will probably want to revise the "per-CPU" data by either adding a
   notion of "current thread" or by replacing the existing "proc" with the
   same.  Note that if you wish to change the size of the per-CPU data,
   vm.c's ``seginit()`` function will need to know.

   Make the requisite changes to ``trap.c`` for errant behavior, even if
   just in skeletal form.  (Maybe fatal traps become infinite loops, since
   we don't know how to kill just the current thread yet?)  (Though while
   here, you can probably start to guess what's in ``texit()``.)

   Chase compilation failures through the whole system until you have
   something that reliably passes ``usertests``.  Congratulations, you have
   just made the foundation of all subsequent efforts!

2. Change the data structure!  Allow (but do not give a mechanism for
   creating!) more than one thread in a process.  You will encounter ample
   food for thought in this process; make note of every question you
   encounter!

   **Graduate** students probably want to change what ``scheduler()``
   scans!  **Undergrad** students can write an initial version of their
   two-tier scheduler now.

   Chase compilation failures again and then debug until ``usertests`` is
   happy.

   Either here or at the previous step, it will probably do you some good
   to make ``procdump()`` more than superficially aware of this new
   threading thing.  Calling ``procdump()`` at choice moments can be a huge
   boon to debugging!  (And hitting ``Ctrl-P`` on the keyboard is always
   satisfying!)

3. Sometime around now might be the right time to add ``desch`` and
   ``mkrun``.  ``318_dmr.c`` should help you get going.

4. Think about how to guard ``sz`` (and ``ofile``).  You can *add* those
   guards now and the system should continue to pass ``usertests``!  (The
   guards will be guarding against something that cannot happen yet!)

5. **Graduate** students: think about that RCU bucket brigade.  Get it
   working (maybe add some system calls and ``cprintf()``-s around the
   system to observe that things are behaving as you expect).  (This is an
   excellent opportunity to demonstrate your test-writing skills as *part
   of* the test-case part of the assignment.)

6. **Graduate** students: wire up the RCU bucket brigade where it needs to
   go.  The system should continue to reliably pass ``usertests``.

7. New kstack reclaim: design and move your kernel to use a new technique.
   As per usual, ``usertests`` should continue to pass!  The
   ``kfree(p->kstack)`` from ``wait()`` should likely *be gone* now.

8. The moment of truth!  Add ``tspawn()`` support!  ``318_ts1.c`` and
   ``318_lots_thr.c`` should help you get on your feet and believe that
   things are working!  (It's OK not to have ``texit()`` yet!)

9. Finish the assignment.  (That means ``texit()``, ``yield()``,
   ``gettid()``, and everything I might have left out.)

10. Test, test, and test some more.  (Rather like Portal!)

    While here, marvel at what you are doing: you have added *concurrent
    threads* (*parallel*, too, in the case of grad students) with
    *shared-by-default* memory to UNIX.  This, historically, is a big
    f****** deal, as Vice President Joe Biden might say.
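As promised in step 1, here is a sketch of one way the split could look
(every field name here is an assumption; your layout can and probably will
differ):

.. code-block:: c

   // Per-thread state: everything tied to a single flow of control.
   struct thread {
     enum tstate state;            // T_UNUSED, T_EMBRYO, T_RUNNABLE, ...
     int tid;
     char *kstack;                 // bottom of this thread's kernel stack
     struct trapframe *tf;         // trap frame for current syscall/interrupt
     struct context *context;      // swtch() here to run this thread
     void *chan;                   // if non-zero, sleeping on chan
     int killed;
     struct proc *proc;            // owning process
   };

   // Per-process state: everything shared by all of a process's threads.
   struct proc {
     pde_t *pgdir;                 // page table (shared by every thread)
     uint sz;                      // size of process memory (bytes)
     int pid;
     struct proc *parent;
     struct file *ofile[NOFILE];   // open files
     struct inode *cwd;            // current directory
     struct thread *threads;       // this process's threads, drawn from the
                                   // global pool of NTHR
     char name[16];
   };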
Good luck!
##########

Recall that one last bit of advice: if something really isn't working, get
up and stretch.  Go for a walk around the building.  Relax.  Do other work.
Come back later and you'll see things you didn't see before.

Changelog
#########

* Unlike the CMU original version, this version uses ``tspawn()`` instead of
  a much more raw ``thread_fork()``, which spawns a new thread with an
  *identical register file*, requiring the use of assembler in the analog of
  ``uthr_create()``.

* Updated to ``void texit(int *where, int what)`` in line with
  :doc:`osassign-uthreads`.

* Added commentary about ``proc->cwd``.