Introduction

Currently, the only caches available to Plan 9 systems are the in-kernel image cache and CFS. Both are very loose caches and do little to reduce the latency of small operations (they help mostly for large operations, where they can avoid streaming data back from the server).

Here is a proposal for a way for cache controllers or servers to notify clients or caches of invalidation events. It allows caches to know (up to latency) that they are up to date with the server state. The proposal does not alter the 9P protocol at all (in contrast to Op); instead it relies on additional aname-s to provide a control channel. We refer to the protocol run over the cache control aname as a side-protocol to emphasize that it is spoken beside an ordinary 9P stream.

This protocol has been designed to let caches hide latency as well as save bandwidth. We hope to allow caching of directories and to reduce or eliminate altogether the need to stat() files on the server to check cache validity. Caches are purely event-driven; the protocol does not call for timeouts of any sort. Caches here remain write-through for simplicity.

The protocol is engineered to be small and simple. Some optimizations have been considered, but above all the emphasis has been on getting the fundamental approach right and allowing some extensibility through the namespace exported by the cache controller.

Existing Caches: CFS

CFS is currently a single-threaded, synchronous cache. When cfs starts up, it makes no assumptions about the validity of elements in the cache. Its behavior is simple: directories are not cached and cache status is only validated on Topen requests:

  • For directories (files with QTDIR), pass the operation to the server.

  • For directory-specific operations (Twalk, Tcreate, Tremove), pass the operation to the server.

  • For Twrite and Twstat messages, pass the operation to the server and update the cache.

  • For Topen, pass the operation and mark the in-memory version with the resulting QID.

  • If the request is a Tread,
    • Find the QID in cache; if the version reported by the server’s Ropen mismatches, throw out the cached data.

    • If the data are not present, forward the request to the server, collect the results, cache them, and respond to the client.

    • Otherwise, answer from cache.

There is also some special handling of QTAPPEND files.
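In outline, the dispatch is something like the following sketch in Plan 9 C. Fcall and the message-type constants come from fcall.h, but the helper functions (passthrough, cacheupdate, cacheread) are hypothetical:

    #include <u.h>
    #include <libc.h>
    #include <fcall.h>

    extern void passthrough(Fcall*);   /* forward to the server verbatim */
    extern void cacheupdate(Fcall*);   /* mirror a change into the cache */
    extern void cacheread(Fcall*);     /* serve from cache if qid matches */

    void
    dispatch(Fcall *t)
    {
        switch(t->type){
        case Twalk:
        case Tcreate:
        case Tremove:
            passthrough(t);     /* directory operations: never cached */
            break;
        case Twrite:
        case Twstat:
            passthrough(t);     /* write-through... */
            cacheupdate(t);     /* ...then update the cached copy */
            break;
        case Topen:
            passthrough(t);     /* remember the qid from the Ropen */
            break;
        case Tread:
            cacheread(t);       /* fall back to the server on a miss */
            break;
        default:
            passthrough(t);
            break;
        }
    }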

The Callback Scheme

If the server (or a cache controller) published an append-only list of all changes (let’s call it “the journal”) to the filesystem, then:

  • The state of the filesystem could be described by the size of this stream.

  • A cache’s contents may be described as a subset of the state given by an offset into this stream. (Several offsets may work if the changes are orthogonal to the contents of the cache; however, the caches themselves can track their offset and so always have a dividing line between the “seen” and “unseen” changes.)

The notion of callbacks comes about by allowing reads to the journal to block. If a client is “behind” the leading edge of the log, then reads should return immediately. When (if) a client catches up to the leading edge, the first read should return immediately with zero bytes available, informing the client of its caught-up state. Subsequent reads from this client at the edge should block until more data are appended to the journal.
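A cache’s journal reader might look like the following sketch; the zero-byte read marks the caught-up state, and the very next read blocks. The helpers apply and recover are assumptions (recovery is discussed below):

    #include <u.h>
    #include <libc.h>

    extern void apply(uchar*, long);   /* invalidate the qids named */
    extern void recover(int);          /* journal truncated; see below */

    void
    followjournal(int jfd)
    {
        uchar buf[8192];
        long n;

        for(;;){
            n = read(jfd, buf, sizeof buf);
            if(n < 0){
                recover(jfd);   /* the server returned Rerror */
                continue;
            }
            if(n == 0)
                continue;       /* caught up; the next read blocks */
            apply(buf, n);
        }
    }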

In this way we have added asynchronous callbacks to 9P without altering the protocol, in particular without altering its inherently client-driven nature. There is precedent within Plan 9 for files with quasi-blocking semantics as above: the usb audio driver’s mixer file behaves similarly. See usb(4).

While the cache is not caught up, it may block client requests or may revert to a Tstat-based validation behavior.

Contents of the Callback Journal

A sufficient, though much larger than necessary, journal would consist of all operations given to the filesystem. This has several unpleasant features, but it would, in theory, suffice.

9P already offers us a convenient, guaranteed-unique, small tag on every entity served by a file server: the qid. If the journal held only the qid of every altered (written, wstat-ed, or removed) file, caches would know to invalidate their cached copies of content from that qid. The only information exposed is, possibly, the rate of file operations on the server; not even user information travels over this channel, making it very unlikely to be a meaningful leak.

Directories may also be safely cached: they have qids, of course. They are modified only by Tcreate and Tremove messages. Note that Tremove should journal an update to both the directory itself and the file removed, as caches may have discarded one but held the other.
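One plausible record format is simply the 13-byte qid encoding 9P already uses on the wire (type[1] vers[4] path[8]); the decoder below is a sketch under that assumption, using the GBIT macros from fcall.h:

    #include <u.h>
    #include <libc.h>
    #include <fcall.h>

    enum { Qidsz = 13 };    /* type[1] vers[4] path[8] */

    /* Decode one journal record into a Qid. */
    void
    getjqid(uchar *p, Qid *q)
    {
        q->type = GBIT8(p);
        q->vers = GBIT32(p+1);
        q->path = GBIT64(p+5);
    }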

ReQID operations

Simply enumerating mutated QIDs would allow caches to invalidate their contents but, if a file is open and becomes invalid, the cache must revert to being pass-through and will have to fetch the new version of the file anew. This happens because the cache is not informed of the new qid and so cannot relabel the contents it already holds (if we go with typed journals; see below) or may subsequently receive for the open file. We would therefore like the journal to contain pairs of qids, e.g. (old, new), allowing a cache to re-read an entire file (or its suffix, if QTAPPEND) and bring itself back up to date, so that it does not have to re-download the contents on a subsequent Topen.
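Under the same assumed encoding, a reqid record would simply be two qids back to back; the struct and its name are ours:

    /* Sketch of an (old, new) reqid record; Qid is from <libc.h>.
     * A cache re-reads the file and relabels the contents it holds
     * from `old` to `new`. */
    typedef struct Reqid Reqid;
    struct Reqid {
        Qid old;    /* qid the cache may currently hold */
        Qid new;    /* qid under which to relabel the re-read contents */
    };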

Todo

I’m not really sure what to do about this! The cache controller can re-walk fids to find out the new QIDs, but that’s going to be annoying. This argues against interposition, or suggests an extension to 9P whereby mutation events return the new QID.

Filtered Callbacks

However, even a journal of all qids changing on a server might be rather large and mostly uninteresting to any given cache. We could imagine that the server/cache controller could be selective and advance only the logs of caches that had ever read from the qid in question.

If the cache identifies itself (quasi-)uniquely to the cache controller, the cache controller may keep per-cache lists of qids and provide many filtered journals, each containing only those qids. Note that these journals should continue to be updated even after a cache has disconnected, under the assumption that it will reconnect soon. This implies that the cache controller will store some duplicated state in the journals, but we can easily guarantee that this state will not grow without bound.

Server-side Journal Size Management

Holding all journal data forever is unwise. The cache controller is free to forget (and return Rerror messages for) sufficiently old journal entries. Caches must interpret this as a signal that they should revert to Tstat behavior. (Optionally, they may seek forward through the tail of their journal and preemptively invalidate entries in the cache; this may be useful to reclaim space, or it may not be useful at all.)

The correct recovery procedure, upon an Rerror from the per-cache journal file, is to mark the cache as stale, to seek to the end of the file, and to dispatch a read, as if there had been no interruption. All entries in the cache are now stale, but the journal is caught up, and all subsequently revalidated files will be shot down as appropriate. At this point, the cache may revert to cfs-like behavior, validating cache contents with Topen (if the client requests a Topen) or with synthesized Tstat messages (on other client requests).
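A sketch of that recovery in Plan 9 C; markstale is hypothetical:

    #include <u.h>
    #include <libc.h>

    extern void markstale(void);   /* flag every cache entry stale */

    /* Recover after an Rerror from the per-cache journal: mark the
     * cache stale, seek to the journal's end, and let the reader
     * loop resume as if there had been no interruption. */
    void
    recover(int jfd)
    {
        markstale();
        seek(jfd, 0, 2);   /* 2: seek relative to end of file */
    }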

The controller may delete the filtered journal file, as well as its backing store material. In this case, the correct procedure for the cache is to re-open the file and resume reads from offset zero, marking the cache invalid as above. Deletions may happen even while the cache is connected, if the controller becomes sufficiently resource-starved to merit dropping the client’s filter.

Journal Rollover

File sizes in 9P are rather large, but it should be possible to roll journal files over (back to offset 0). One option is simple deletion of the journal file, but that is somewhat kludgy and results in total invalidation of the cache.

Todo

Perhaps sequence-number arithmetic can save us?
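For what it’s worth, the usual trick looks like this: reduce journal positions to a fixed-width counter that is allowed to wrap, and compare them modulo 2^32. A sketch, assuming 32-bit counters:

    /* Nonzero iff serial number a precedes b modulo 2^32; valid as
     * long as a and b are less than 2^31 apart. */
    int
    serialbefore(ulong a, ulong b)
    {
        return (long)(a - b) < 0;
    }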

Alternate Uses As A Notification Channel

As pointed out by Uriel at IWP9 2007, this protocol may have uses beyond cache control: in particular, it naturally provides an arm/notify/rearm API similar to Linux’s inotify. Even for servers supporting live queries (à la BeOS), this protocol may still have a place as the back channel notifying clients that their queries have been updated.

The Journal Side Protocol

9P servers may export multiple file systems. We propose a synthetic file system to be exported alongside the others, possibly with the aname suffix /cc (for cache control). Within this file system, we expose a directory called “journals” (segregated in case we later wish to add additional functionality to the cache control stream, e.g. locks and leases. Later, later.). The files inside it serve to name caches and identify their connections.

A cache wishing to play the filtered-callback game first looks up or generates its probably-unique identifier (sufficiently large random numbers will suffice). It then carries out authentication as usual and Tattach-es to the /cc system. It then opens the file bearing its name in /journals on the /cc export, creating it if it does not exist. Finally, it Tattach-es to the corresponding normal file system. (Because authentication has taken place, the permissions on this file may be locked to the credentials that were used to authenticate, meaning that cross-user journal snooping is not possible.)
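Under the naming assumptions above (the mount point /n/cc is ours, and error handling is elided), startup might look like:

    #include <u.h>
    #include <libc.h>

    /* Sketch: authenticate, attach the cc aname, then open (or
     * create) our journal file.  fd is an open connection to the
     * server; cacheid is the cache's probably-unique name. */
    int
    openjournal(int fd, char *cacheid)
    {
        char name[128];
        int afd, jfd;

        afd = fauth(fd, "cc");
        /* ... run the auth protocol over afd as usual ... */
        mount(fd, afd, "/n/cc", MREPL, "cc");
        snprint(name, sizeof name, "/n/cc/journals/%s", cacheid);
        jfd = open(name, OREAD);
        if(jfd < 0)
            jfd = create(name, OREAD, 0400|DMAPPEND);
        return jfd;
    }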

Filtering is believed to be useful on large installations, and it closes the minimal information leak of the server’s mutation rate; for initial implementations, however, the file name /journal in /cc is reserved for an unfiltered journal. If exposed, this file indicates that the controller will provide an unfiltered journal, possibly in addition to filtered ones. The choice is then the cache’s.

The cache controller uses the open file descriptor to the journal file to identify the operations of this cache and maintain the server-side qid filter. Entries will be taken off the filter list as they are shot down (placed into the journal) and of course the filter may be entirely discarded if the cache controller removes the journal file.

Journals are marked QTAPPEND and may only be opened for reading by the clients. A cache may delete its filtered journal file to indicate that the controller may discard its filter; this may be a polite action if a cache is reformatted.

Note that because the cache control protocol is simply a file system, it should be straightforward to mount it (without caching, of course) and watch the journal changes happen in parallel to the cache’s operations. This should facilitate debugging.

Experimental Followons

Typed Journals

It may make sense to indicate in the journal which kind of update triggered the insertion as well as some other metadata. For example, a Tremove should cause the cache to dump all of its data for a file, but a Twstat may merely mean that the cache needs to Tstat the file again. If Twrite journal entries include the new qid and the region of the file updated, that may also prove useful for large files or avoid flushing cache state for regions not touched.
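A typed record might look like the following; the tags, field names, and layout are all assumptions:

    /* Sketch of a typed journal record; Qid is from <libc.h>. */
    enum { Jremove, Jwstat, Jwrite, Jcreate };   /* record tags */

    typedef struct Jrec Jrec;
    struct Jrec {
        uchar  type;      /* which operation produced this entry */
        Qid    old;       /* qid being invalidated */
        Qid    new;       /* new qid (Jwrite, Jwstat); rearms the filter */
        uvlong offset;    /* Jwrite only: start of the region touched */
        ulong  count;     /* Jwrite only: length of that region */
    };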

This may dramatically increase the record size of journal entries. Note also that some of these refinements mean we lose the explicit arm/notify/rearm behavior of the caching protocol: as described above, a Twrite entry would implicitly rearm the journal on the new qid.

The cache controller should always have the option of reverting to a general flush notification, to deal with, for example, a large number of writes all over the file. It is also possible for the cache controller to “fudge” the journal a bit: if a large number of updates to a qid (or qid chain, à la Twrite) are pending and not yet read by the cache, they could be replaced in their entirety by a single flush message, freeing up controller state.

Musings about volume-like bundles

Since the server/cache controller knows the full path taken to reach a given file, it should be possible to emulate something like AFS volume callbacks by informing the caches that a given qid will be invalidated by an alias in the journal. Under this scheme, an arbitrary collection of qids can be given a single qid handle for invalidation.

Explicitly, consider a fid that has been walked into /bin/386. Since this directory contains lots of files and is updated only rarely, we would like that the cache controller not have to keep a large amount of state here. Upon entering /, the fid’s server-side metadata would be labeled with the bulk-invalidate qid for /, upon entering /bin this would be replaced with that of /bin, etc. If an open is made for reading, the server must return the qid of the actual file, but must notify the cache of the aliasing effect (the server is required to notify the client of the alias before it may notify the client of invalidations of the bulk address). If any update opcodes (write/unlink/wstat/create/…) take place based on this fid, the bulk invalidation qid is written to the journals, and the world proceeds from there.

I propose that aliasing records be handed back not in the journal file itself, but perhaps in a file named /journal-alias/${CACHENAME} in the /cc side-protocol file system. All records in this file are two qids wide. Let’s (arbitrarily) mandate that the order is qid-returned-from-Topen/Twalk followed by the aliasing qid. That is, the second one is what will end up in the journal and is the handle to the whole “volume”.
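Concretely, an alias record would be (a sketch; the struct name is ours):

    /* Two-qid alias record, in the mandated order. */
    typedef struct Jalias Jalias;
    struct Jalias {
        Qid file;      /* qid returned from Topen/Twalk */
        Qid volume;    /* aliasing qid: what will appear in the journal */
    };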

Note that having this be a separate file means that it may be optional for caches and servers to support it. If the protocol is to open the aliases file first, servers may enforce that caches partake in this protocol by rejecting their attempts to create journal files (this kind of mandatory support for aliasing may help reduce server overhead).

Each server-side accessed file should point at its invalidation qid and be hashed by invalidation qid. This means that the server can easily flush the right qid and can also flush invalidation records for all files on a volume once a flush has taken place.

Musings about cache flush notifications

It may be advantageous to a given client to notify the server when it has flushed a file from cache, to keep the server from (falsely) flushing files as part of its memory reclaim procedure. This also helps the server as it avoids entering the more aggressive phases of memory reclaim in the first place.

I propose a write-only, QTAPPEND file evict in the root of the /cc directory into which a cache may write a qid it no longer cares about. This file does not need to be per-client as it is a client->server communication pathway.
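Writing an eviction notice could be as simple as the following sketch, reusing the 13-byte qid encoding assumed earlier (the PBIT macros are from fcall.h, and the path /n/cc/evict follows our naming assumptions):

    #include <u.h>
    #include <libc.h>
    #include <fcall.h>

    /* Tell the controller we no longer hold this qid. */
    void
    evict(Qid q)
    {
        uchar buf[13];
        int fd;

        PBIT8(buf, q.type);
        PBIT32(buf+1, q.vers);
        PBIT64(buf+5, q.path);
        fd = open("/n/cc/evict", OWRITE);
        if(fd >= 0){
            write(fd, buf, sizeof buf);
            close(fd);
        }
    }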

Other Questions

  • Coherency? Current thoughts are to just avoid the problem by having clients mount filesystems without caching to get coherent access.

  • Write-back, not write-through?

  • Distributed cache coherency?

Recommendations for 9P2010

As much as we try to make the protocol entirely orthogonal to 9P2000, there are some small changes which would help us immensely while, at the same time, not forcing anyone to play this (or any other) game.

Interaction with QTCTL / QTDECENT

There is a proposal to add a new qid type flag, QTDECENT, to denote that a file is cachable (the contrast, of course, is a file synthesized by a device driver). The original proposal was for a QTCTL, but the sense has been flipped so as to have “failsafe” defaults. I believe that this would be a beneficial addition to 9P2010.

QTDECENT would let the generic cache controller shim discussed above know which files it could reasonably attempt to cache. It could then pass on QTDECENT to clients or caches that did not speak the cache control protocol, such as the current cfs. Prior to QTDECENT, it would have been necessary to carefully offer cache control only on true file servers or emulations of true file servers. With QTDECENT, the cache controller shim may be placed on any exported file system, again subject to the constraint that it sees all traffic to the files or is itself speaking the cache control side protocol with the exporter.

The journal files should not be marked QTDECENT, but this does not seem to be a correctness issue so much as one of taste.

Note

For completeness: synthetic/“indecent” objects are currently typically marked by qid.version == 0, but this is undocumented and nowhere enforced. An explicit flag seems better.

Mutation operations should return the new QID

Certain mutation operations do not return the new QID when they are carried out:
  • Rcreate does carry the QID of the file created but not that of the directory containing it.

  • Rremove does not carry the new QID of the directory containing the removed file.

  • Rwstat does not return the new QID of the file being modified, nor that of the directory containing it.

  • Rwrite does not carry the new QID of the file being modified.

We suggest that 9P2010 responses carry these five additional QID fields so that our intermediating controller does not have to re-walk fids to discover the new QIDs.
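In the notation of intro(5), the extended responses might read as follows; the trailing qid fields are this proposal’s additions, not 9P2000’s:

    Rcreate: size[4] Rcreate tag[2] qid[13] iounit[4] dirqid[13]
    Rremove: size[4] Rremove tag[2] dirqid[13]
    Rwstat:  size[4] Rwstat tag[2] qid[13] dirqid[13]
    Rwrite:  size[4] Rwrite tag[2] count[4] qid[13]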

Interaction with propagation of QIDs to root

There has been a proposal that all mutations propagate QID version changes up to the root. This has many advantages, including very fast exact delta computation. There are a few possible mechanisms to support intermediators such as ours:

  • Don’t. That is, require the cache controller and server to be the same body of code.

  • The operations described above could return lists of QIDs.

  • Servers playing the propagate-to-root game publish, in another journal-like file, a list of all reQIDs that could not be described in 9P2000 (or 9P2010).
    • The intermediator(s) can subscribe to this. Clients may (but need not) be allowed to subscribe, but will probably prefer to get their updates filtered by the intermediator(s).

Of the three, the last is the most attractive to us: there is no danger that the R-messages would overflow the maximal message size, and it avoids mingling the already complicated server code with the complicated cache controller code.

Acknowledgements

Much of this is not entirely original thought but was developed with the help and gentle prodding of David Eckhardt (de0u@andrew.cmu.edu). This document originally appeared on the Carnegie Mellon University 15-412 class wiki.

Publications

IWP9 2007 WiP Presentation

I (nwf) gave a brief presentation at the WiP session of IWP9 2007. It is available in dvi form here. This is mostly for archival purposes; the slides are basically a subset of this page.

IWP9 2009

We (Venkatesh Srinivas and nwf) wrote a paper for IWP9 2009 talking about a summer 2009 research implementation of jccfs. The code base for this project is officially housed in a Mercurial tree at http://www.grex.org/~vsrinivas/src/cfs.

A version of the paper with the errata below applied is available in pdf form here. The submitted version is available in pdf form here.

Errata

The versions of this paper hosted on this site have had all (known) errata corrected.

  1. The characterization of cfs(4) given in section 2.1 is incorrect. cfs(4) merely forwards Topen messages and collects the file’s QID from the Ropen. The first paragraph should read:

    cfs(4) is an on-disk cache intended for use by Plan 9 terminals. It copies data from read messages into an on-disk cache. For subsequent read operations, if the data are already present, cfs responds with cached data. On each Open, cfs implicitly collects stat information from the server to check the validity of cached contents. cfs does not cache directory contents nor, by extension, does it attempt to hide latency for walk operations. Every Walk, Stat, and Read on a directory is simply passed through to the server.

  2. The printed paper does not have footnote markers.

  3. The printed paper, in section 4, suggests that a table is in a section titled “Execution Characteristics”. No such section exists; instead the table follows immediately below, on the facing page.