######################################
Internet Relay Chat Invisible Encoding
######################################

.. sidebar:: Contents

   .. contents::
      :depth: 2
      :local:

This is a set of notes devoted to a hack for carrying structured metadata
over IRC.  The protocol is a collaboration between Nathaniel Filardo and
Glenn Willen, with input from many others on Freenode's #cslounge (we love
you all).

There is a list of implementations of various subsets of the IRCIE universe
at `Implementations`_.  If somebody just sent you here, you might start
looking there.  Note that not all implementations support all metadata, but
ideally they should understand each other's common subset just fine.

The technicalities of the encoding are discussed in `Encoding`_.  That
section also contains the authoritative list of "well known type"
identifiers for use with IRCIE.

Currently, there are definitions for a few types of data:

* `Instances`_ -- Add instance labels to messages, making IRC a little more like Zephyr.
* `Miscellaneous Metadata`_ -- Some small things, including robot indicators.

.. Not yet
.. * `Comic Chat`_ -- Carry Microsoft's Comic Chat data in a way they never anticipated.

========
Encoding
========

Message Protocol
################

Invisible Coding
================

Here's the hack, which we lovingly refer to as "IRC invisible (en)coding"
(IRCIE).  One uses the characters ``^B``, ``^C``, ``^O``, ``^V``, and
``^_``, which are IRC formatting control characters, to encode various data
in a way that most clients won't ever see.

The mapping is as follows:

+-----------+---------------------+------------+-------+
| Character | IRC Control meaning | ASCII      | Value |
+===========+=====================+============+=======+
| ``^B``    | Bold                | 0x02 (STX) | 0     |
+-----------+---------------------+------------+-------+
| ``^C``    | Set color           | 0x03 (ETX) | 1     |
+-----------+---------------------+------------+-------+
| ``^O``    | Kill formatting     | 0x0F (SI)  | 2     |
+-----------+---------------------+------------+-------+
| ``^V``    | Reverse video       | 0x16 (SYN) | 3     |
+-----------+---------------------+------------+-------+
| ``^_``    | Underline           | 0x1F (US)  | 4     |
+-----------+---------------------+------------+-------+

We avoid the use of ``^Cdd,dd`` (the color formatting control sequence), as
some clients have a bad habit of dropping the ``^C`` and printing the digits
instead.  Further, we avoid space and tab, as these may become visible when
interspersed with the formatting control characters, which would defeat the
invisibility of the encoding.  We do not use ``^F`` ("blink") for reasons
long since forgotten -- perhaps there was a misbehaving popular client?
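
As a minimal sketch of the mapping above (Python is used for illustration
only; the names ``SYMBOLS``, ``to_symbols`` and ``from_symbols`` are
hypothetical and not taken from any IRCIE implementation), base-5 digits can
be converted to and from the five control bytes like this:

```python
# The five control bytes in value order 0..4: ^B, ^C, ^O, ^V, ^_.
SYMBOLS = bytes([0x02, 0x03, 0x0F, 0x16, 0x1F])


def to_symbols(digits):
    """Map base-5 digits (each 0..4) to the corresponding control bytes."""
    digits = list(digits)
    if not all(0 <= d <= 4 for d in digits):
        raise ValueError("digits must be in the range 0..4")
    return bytes(SYMBOLS[d] for d in digits)


def from_symbols(data):
    """Map control bytes back to base-5 digits.

    Raises ValueError if a byte is not one of the five symbols.
    """
    return [SYMBOLS.index(b) for b in data]
```

Everything else in the encoding can be described in terms of these digit
sequences.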

Framing
=======

All protocol messages are embedded at the end of an IRC message.  This makes
for easier parsing and simplifies the protocol design (there is no risk of
accidentally "finding" a message in the middle of another message, for
example).  However, for compatibility with the many real-world clients that
(contrary to the specification) expect a CTCP message to span an entire IRC
message, we allow protocol messages at the "logical end" of IRC messages
such as CTCP ACTION, that is, just before the closing ``^A``.  See
`Protocol Examples`_ below.

For maximal extensibility, we start by defining an outer protocol which
allows for multiple inner protocols to coexist, using type and length
coding; we hope this will allow for compatibility, should future protocol
designers (or future versions of this protocol) want to use the same
invisible-character trick that we use here.

We signal the start of a protocol message with the lead-in character ``^O``,
followed by the empty string as our type tag, followed by the tag terminator
``^O``.  Type/version tags should consist of sequences of characters other
than ``^O``.   The effect of this is to sandwich a series of formatting
control codes between two kill formatting codes, an extremely unlikely thing
to be generated accidentally.  We consider it virtually certain that no
legitimate IRC message would have such strings of characters without
intervening printable text; we therefore see this as an ideal lead-in
sequence for embedded messages, and get the possibility of adding more
message types for free.

Next we encode the length of the message, in bytes, using the "L" code
described below.  The encoded length covers only the message itself (i.e.
the part after the length field, up to and not including the final ``^O``).
This length is sometimes called the "meta length" or just "MetaL".

Next comes the message itself, in the TLV format described below.

Finally, a message MUST end with ``^O`` which is outside the TLV encodings.
(This resets unaware clients' idea of the current text style, which will be
totally confused by this point.  This is a courtesy to potentially buggy
clients which fail to reset formatting codes at the end of a line, as well
as another convenient marker unlikely to be sent by normal clients.)
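
The framing rules above can be sketched as follows.  This is an
illustration under stated assumptions, not a reference implementation:
``frame`` and ``unframe`` are hypothetical names, ``OFFSETS`` reflects the
L encoding as inferred from the worked examples in this document, and the
CTCP handling simply treats a message that starts and ends with ``^A`` as
having its logical end just before the final ``^A``.

```python
SYMBOLS = bytes([0x02, 0x03, 0x0F, 0x16, 0x1F])  # ^B ^C ^O ^V ^_ = 0..4
OFFSETS = [0, 5, 30, 155]  # assumed first value for suffix widths 1..4
KILL = b"\x0f"  # ^O


def encode_l(n):
    """L-encode n (0..779): a width prefix, then the base-5 suffix."""
    if not 0 <= n <= 779:
        raise ValueError("length out of range")
    width = max(w for w in range(1, 5) if OFFSETS[w - 1] <= n)
    rest, digits = n - OFFSETS[width - 1], []
    for _ in range(width):
        rest, digit = divmod(rest, 5)
        digits.insert(0, digit)
    return bytes(SYMBOLS[d] for d in [width - 1] + digits)


def decode_l(data):
    """Return (value, symbols consumed); ValueError if malformed."""
    if not data or data[0] not in SYMBOLS[:4]:
        raise ValueError("bad or reserved L prefix")
    width = SYMBOLS.index(data[0]) + 1
    suffix = data[1:1 + width]
    if len(suffix) != width:
        raise ValueError("truncated L suffix")
    rest = 0
    for b in suffix:
        rest = rest * 5 + SYMBOLS.index(b)
    return OFFSETS[width - 1] + rest, 1 + width


def frame(payload):
    """Wrap already-encoded records in the header, MetaL and footer."""
    return KILL + KILL + encode_l(len(payload)) + payload + KILL


def unframe(line):
    """Split an IRC message into (visible text, payload or None)."""
    ctcp = len(line) >= 2 and line[:1] == b"\x01" and line[-1:] == b"\x01"
    body = line[:-1] if ctcp else line  # frame sits before the closing ^A
    start = body.find(KILL + KILL)
    while start != -1 and body.endswith(KILL):
        try:
            n, used = decode_l(body[start + 2:])
        except ValueError:
            n, used = None, 0
        begin = start + 2 + used
        # Accept only if MetaL accounts for every byte up to the footer.
        if n is not None and begin + n + 1 == len(body):
            payload = body[begin:begin + n]
            if all(b in SYMBOLS for b in payload):
                return body[:start] + (b"\x01" if ctcp else b""), payload
        start = body.find(KILL + KILL, start + 1)
    return line, None
```

Requiring the MetaL to account for every byte up to the trailing ``^O`` is
what makes an accidental match unlikely; a message that merely contains
``^O^O`` somewhere is returned unchanged.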

TLV Encoding
============

Entries inside the protocol stream are TLV encoded.  The type is a
two-symbol base-5 number, most significant symbol first (25 possibilities,
the so-called "T encoding"); the length is a length-prefixed, modified
base-5 number (the "L encoding", described below).  The remainder of the
record is then interpreted according to its type tag.
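
A minimal sketch of the T encoding follows.  The function names are
hypothetical; the most-significant-symbol-first order is taken from the
worked examples in this document, where type 5 is written ``^C^B`` and type
15 is written ``^V^B``.

```python
SYMBOLS = bytes([0x02, 0x03, 0x0F, 0x16, 0x1F])  # ^B ^C ^O ^V ^_ = 0..4


def encode_t(value):
    """Encode a type tag (0..24) as two base-5 symbols, high digit first."""
    if not 0 <= value < 25:
        raise ValueError("T-encoded values must be in the range 0..24")
    high, low = divmod(value, 5)
    return bytes([SYMBOLS[high], SYMBOLS[low]])


def decode_t(pair):
    """Decode exactly two symbols back into a value in the range 0..24."""
    if len(pair) != 2:
        raise ValueError("a T-encoded value is exactly two symbols")
    return SYMBOLS.index(pair[0]) * 5 + SYMBOLS.index(pair[1])
```

For instance, ``encode_t(3)`` yields ``^B^V``, the type tag used for
head-of-frame flags.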

Error Handling
--------------

A client receiving a message with a record type it does not understand MUST
discard that record and MUST process all other records in the message, if
any.  Clients MAY inform the user of unrecognized messages but SHOULD allow
the user to disable this notification easily.

A client receiving a malformed record SHOULD assume that it has mistakenly
identified this protocol and cease processing the message; recovery would be
difficult due to the lack of (supposed) length information.  If the
corruption is only visible late in the stream, after one or more records
have already been processed, clients SHOULD inform their users of the error
but again SHOULD allow the user to disable this notification easily.
Clients MAY cease processing this protocol for a given channel after such an
error, again informing the user of their decisions.
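
The two rules above (skip unknown record types, give up on malformed
records) might be sketched like this.  ``parse_records`` is a hypothetical
helper, ``known_types`` is whatever set of type numbers the client
implements, and ``decode_l`` uses the L encoding as inferred from the
worked examples in this document.

```python
SYMBOLS = bytes([0x02, 0x03, 0x0F, 0x16, 0x1F])  # ^B ^C ^O ^V ^_ = 0..4
OFFSETS = [0, 5, 30, 155]  # assumed first value for suffix widths 1..4


def decode_l(data):
    """Return (value, symbols consumed); ValueError if malformed."""
    if not data or data[0] not in SYMBOLS[:4]:
        raise ValueError("bad or reserved L prefix")
    width = SYMBOLS.index(data[0]) + 1
    suffix = data[1:1 + width]
    if len(suffix) != width:
        raise ValueError("truncated L suffix")
    rest = 0
    for b in suffix:
        rest = rest * 5 + SYMBOLS.index(b)
    return OFFSETS[width - 1] + rest, 1 + width


def parse_records(payload, known_types):
    """Split a frame payload into records.

    Returns (records, unknown): records is a list of (type, value) pairs
    for understood types, unknown lists the type numbers that were skipped.
    Raises ValueError on a malformed record, in which case the caller
    should stop treating the message as IRCIE.
    """
    records, unknown, pos = [], [], 0
    while pos < len(payload):
        head = payload[pos:pos + 2]
        if len(head) != 2:
            raise ValueError("truncated type tag")
        rtype = SYMBOLS.index(head[0]) * 5 + SYMBOLS.index(head[1])
        length, used = decode_l(payload[pos + 2:])
        start = pos + 2 + used
        if start + length > len(payload):
            raise ValueError("record overruns the frame")
        if rtype in known_types:
            records.append((rtype, payload[start:start + length]))
        else:
            unknown.append(rtype)  # discard this record, keep processing
        pos = start + length
    return records, unknown
```

This sketch validates the whole payload before returning anything.  A
client that wants to report corruption found late in the stream, after
earlier records were already acted upon, would instead hand out records one
at a time.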

L encoding
----------

L encodings have two components: a fixed one-symbol prefix and a
variable-length suffix.  The prefix is the biased length of the suffix,
expressed as a single base-5 symbol.  Note that we reserve the maximal
prefix value for future expansion.

The protocol defined here uses a bias value of 1, so that a "0" for the
first symbol of the L encoding indicates that the next one (1) symbol is to
be consumed as part of the length, a "1" indicates the next two, and so on.

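The suffix is "modified" base-5 in that each suffix width starts counting
where the previous one left off: one-symbol suffixes cover 0 through 4,
two-symbol suffixes 5 through 29, three-symbol suffixes 30 through 154, and
four-symbol suffixes 155 through 779.  This scheme thus supports all length
values ranging from 0 (which is two bytes long, namely ``^B^B``) to 779
(which is five bytes long, namely ``^V^_^_^_^_``).  This is considered
sufficient for all practical uses of this protocol.  In particular, 779 is
more than the usual line lengths tolerated by some ircds.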

This encoding was chosen because it gives short sequences short length
fields while still being relatively easy to parse.
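
A sketch of an L codec follows.  It assumes that each suffix width starts
counting where the previous one left off (offsets 0, 5, 30 and 155), which
is consistent with the endpoints above and with the worked examples
elsewhere in this document (13 is ``^C^C^V``, 8 is ``^C^B^V``).  The names
``encode_l``, ``decode_l`` and ``OFFSETS`` are hypothetical.

```python
SYMBOLS = bytes([0x02, 0x03, 0x0F, 0x16, 0x1F])  # ^B ^C ^O ^V ^_ = 0..4
# Assumed smallest value representable with a suffix of 1, 2, 3, 4 symbols.
OFFSETS = [0, 5, 30, 155]


def encode_l(n):
    """Encode a length (0..779) as a one-symbol prefix plus suffix."""
    if not 0 <= n <= 779:
        raise ValueError("length out of range")
    # Use the widest suffix whose range has already started at n.
    width = max(w for w in range(1, 5) if OFFSETS[w - 1] <= n)
    rest, digits = n - OFFSETS[width - 1], []
    for _ in range(width):
        rest, digit = divmod(rest, 5)
        digits.insert(0, digit)
    # Prefix is the suffix width minus the bias of 1.
    return bytes(SYMBOLS[d] for d in [width - 1] + digits)


def decode_l(data):
    """Return (length, symbols consumed); ValueError if malformed."""
    if not data or data[0] not in SYMBOLS[:4]:
        raise ValueError("bad or reserved L prefix")
    width = SYMBOLS.index(data[0]) + 1
    suffix = data[1:1 + width]
    if len(suffix) != width:
        raise ValueError("truncated L suffix")
    rest = 0
    for b in suffix:
        rest = rest * 5 + SYMBOLS.index(b)
    return OFFSETS[width - 1] + rest, 1 + width
```

Returning the number of symbols consumed lets a parser step directly to the
start of the value that follows the length field.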

Assigned Types
##############

Currently the following message types are assigned:

+-------+----------------------------------------------------+---------------------------+
| Value | Purpose                                            | See                       |
+=======+====================================================+===========================+
| 0-2   | RESERVED IETFNG PROTOCOL EXTENSIONS                |                           |
+-------+----------------------------------------------------+---------------------------+
| 3     | Head-of-frame Flags (special handling)             | `Miscellaneous Metadata`_ |
+-------+----------------------------------------------------+---------------------------+
| 4     | Continuation Flags                                 | `Miscellaneous Metadata`_ |
+-------+----------------------------------------------------+---------------------------+
| 5     | Instance label, absolute encoding, Huffman table 1 | `Instances`_              |
+-------+----------------------------------------------------+---------------------------+
| 6     | Teledildonics Protocol Transport                   | ...                       |
+-------+----------------------------------------------------+---------------------------+
| 7-14  | RESERVED, GLOBAL ASSIGNMENT                        |                           |
+-------+----------------------------------------------------+---------------------------+
| 15    | OTR advertisement message                          | `Miscellaneous Metadata`_ |
+-------+----------------------------------------------------+---------------------------+
| 16    | Miscellaneous Message Flags (deprecated)           | `Miscellaneous Metadata`_ |
+-------+----------------------------------------------------+---------------------------+
| 17    | MS Comic Chat Data (Re-)Encoding                   | (not yet)                 |
+-------+----------------------------------------------------+---------------------------+
| 18-19 | RESERVED, GLOBAL ASSIGNMENT                        |                           |
+-------+----------------------------------------------------+---------------------------+
| 20-24 | EXPERIMENTAL, LOCAL ASSIGNMENT                     |                           |
+-------+----------------------------------------------------+---------------------------+

Infelicities
############

Note that some ircds will actively filter out formatting codes if channel
operators request said behavior, rendering this protocol unable to function.
This is unfortunate, but perhaps understandable given the potential for
abuse.  One may be able to convince server devs that messages such as ours
which do not in fact alter the formatting may be worth passing, but that
seems somewhat unlikely.

=========
Instances
=========

What Are Instances?
###################

Instances are essentially "threads" of conversation.  The name comes from
`Zephyr <http://en.wikipedia.org/wiki/Zephyr_(software)>`_, where messages
are routed by a triplet of "recipient", "class", and "instance".  In those
terms, IRC has only private messages (with no class or instance) and public
classes (channels); it has nothing corresponding to instances.

Design Goals?
#############

First, we partition the set of clients into "aware" and "unaware" by whether
or not (respectively) they are playing this game.  We want the following
behaviors:

  * Unaware clients remain unaware of instances, and users receive all
    messages on all instances within the channel.

  * Unaware client users do not see any real difference in messages (no
    control characters, ugly strings, etc.)

  * Aware clients may filter ("punt" and "unpunt") instances, as well as
    send messages on them.

Instance Labels
###############

Huffman Table 1
===============

This is a 5-ary Huffman tree whose output is to be T encoded::

  ( ( rsoit )
    ( gb<>- )
    ( mane. )
    ( ( Ch()= )
      ( U@HG# )
      ( &j+NB )
      ( MFL;: )
      ( ^~Q?Z ) )
    ( ( 'ufp/ )
      ( ldcv_ )
      ( STARE )
      ( I O ( wWkqx ) ( DPyXY ) ( KVJz" ) )
      ( ( 01234 ) ( 56789 ) ( %*,|! ) ( `$\{} ) ( [] ) ) ) )

This is to say that, as examples, ``r`` encodes as ``00``, ``I`` encodes as
``430``, and ``,`` encodes as ``4422``.
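
One way to turn the tree into code is to transcribe it as nested lists and
walk it once to build a lookup table.  The sketch below does that; the
names ``TREE``, ``build_codes``, ``encode_label`` and ``decode_label`` are
hypothetical.  A one-character string is a leaf, a longer string is
shorthand for a node whose children are its characters, and digits are
assigned left to right starting from 0.

```python
SYMBOLS = bytes([0x02, 0x03, 0x0F, 0x16, 0x1F])  # ^B ^C ^O ^V ^_ = 0..4

TREE = [
    "rsoit",
    "gb<>-",
    "mane.",
    ["Ch()=", "U@HG#", "&j+NB", "MFL;:", "^~Q?Z"],
    [
        "'ufp/",
        "ldcv_",
        "STARE",
        ["I", "O", "wWkqx", "DPyXY", 'KVJz"'],
        ["01234", "56789", "%*,|!", "`$\\{}", "[]"],
    ],
]


def build_codes(node, prefix=()):
    """Return a dict mapping each character to its tuple of base-5 digits."""
    if isinstance(node, str) and len(node) == 1:
        return {node: prefix}
    codes = {}
    for digit, child in enumerate(node):
        codes.update(build_codes(child, prefix + (digit,)))
    return codes


CODES = build_codes(TREE)


def encode_label(text):
    """Huffman-encode a label into control bytes (KeyError if unencodable)."""
    return bytes(SYMBOLS[d] for ch in text for d in CODES[ch])


def decode_label(data):
    """Decode control bytes back into a label; ValueError if malformed."""
    out, node = [], TREE
    for b in data:
        digit = SYMBOLS.index(b)
        if digit >= len(node):
            raise ValueError("code not assigned in the table")
        node = node[digit]
        if isinstance(node, str) and len(node) == 1:
            out.append(node)
            node = TREE
    if node is not TREE:
        raise ValueError("label ends in the middle of a code")
    return "".join(out)
```

With this table, ``encode_label("test")`` produces the eight symbols
``^B^_^O^V^B^C^B^_``.  Characters that do not appear in the tree, such as
space, raise ``KeyError`` in this sketch.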

Instance Continuation Messages
==============================


Empty instance declarations are used to mean "see last instance tag" rather
than "no instance" (which can be obtained by leaving out an instance tag
altogether).  If a client finds that it must split a message, this gives a
much shorter, fixed-width protocol message which may be placed on subsequent
lines.  Clients which see an Instance Continuation Message but have not seen
an Instance Label previously (as might be the case after joining a channel)
SHOULD display the message as if it had no instance and MAY indicate to the
user that this downgrade has occurred.

Note that while it is in theory possible to use an ICM when multiple
independent messages are sent on the same instance, clients SHOULD NOT use
this functionality, as it increases the risk that an aware client will be
unable to correctly route an instanced message. Clients MUST NOT use an ICM
if more than 60 seconds have elapsed since sending a message with an
Instance Label or if they have witnessed a JOIN.

Note that if an Instance Continuation Message and an Instance Label appear
in the same protocol message, aware clients MUST consider the label
authoritative and MAY inform the user that there was an error in the
message.  Conforming clients MUST NOT generate messages with both ICM and IL
entries.
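
The sender-side and receiver-side rules for ICMs can be condensed into two
small decision functions.  This is only a sketch: the function names are
hypothetical, and it assumes the client tracks, per sender and channel, the
last label seen and the time since its own last labelled message.

```python
ICM_WINDOW_SECONDS = 60


def may_send_icm(seconds_since_label, join_seen):
    """Sender side: may the next line use an ICM instead of a full label?

    seconds_since_label is None if no labelled message has been sent yet;
    join_seen is True if a JOIN was witnessed since that message.
    """
    if seconds_since_label is None or join_seen:
        return False
    return seconds_since_label <= ICM_WINDOW_SECONDS


def resolve_instance(label, has_icm, last_label):
    """Receiver side: return (instance or None, note for the user or None).

    label is the decoded Instance Label in this message (or None), has_icm
    says whether an empty instance record was present, and last_label is
    the most recent label seen from the same sender (or None).
    """
    if label is not None:
        # The label is authoritative; label plus ICM is a sender error.
        note = "label and continuation in one message" if has_icm else None
        return label, note
    if has_icm:
        if last_label is None:
            return None, "continuation without a known label"
        return last_label, None
    return None, None
```

The notes returned by ``resolve_instance`` correspond to the MAY clauses
above; a client is free to drop them silently.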

Protocol Examples
=================

1. To send the instance label "test", the full message works out to be

   +---------+---------+----------------------+
   | Header            | ``^O^O``             |
   +---------+---------+----------------------+
   | MetaL             | ``^C^C^V`` (13)      |
   +---------+---------+----------------------+
   | First Record                             |
   +---------+---------+----------------------+
   |         |  Type   | ``^C^B`` (5)         |
   +---------+---------+----------------------+
   |         | Length  | ``^C^B^V`` (8)       |
   +---------+---------+----------------------+
   |         | Value   | ``^B^_^O^V^B^C^B^_`` |
   +---------+---------+----------------------+
   | Footer            | ``^O``               |
   +---------+---------+----------------------+

   for a total of 19 bytes sent (11 of which are protocol overhead).

2. A CTCP ACTION message such as ``^AACTION barfs on the floor.^A`` may be
   instance tagged with label "test" by sending ``^AACTION barfs on the
   floor.^O^O^C^C^V^C^B^C^B^V^B^_^O^V^B^C^B^_^O^A``.

3. An instance continuation message is rendered as

   +---------+---------+--------------+
   | Header            | ``^O^O``     |
   +---------+---------+--------------+
   | MetaL             | ``^B^_`` (4) |
   +---------+---------+--------------+
   | First Record                     |
   +---------+---------+--------------+
   |         |  Type   | ``^C^B`` (5) |
   +---------+---------+--------------+
   |         | Length  | ``^B^B`` (0) |
   +---------+---------+--------------+
   |         | Value   |              |
   +---------+---------+--------------+
   | Footer            | ``^O``       |
   +---------+---------+--------------+
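
The labelled message from the first example can be rebuilt mechanically.
The sketch below is illustrative only: ``from_caret``, ``record`` and
``frame`` are hypothetical helpers, and the L encoding offsets are inferred
from the worked examples rather than quoted from an implementation.

```python
SYMBOLS = bytes([0x02, 0x03, 0x0F, 0x16, 0x1F])  # ^B ^C ^O ^V ^_ = 0..4
OFFSETS = [0, 5, 30, 155]  # assumed first value for suffix widths 1..4


def from_caret(text):
    """Turn caret notation such as '^O^O' into raw bytes (ASCII only)."""
    out, i = bytearray(), 0
    while i < len(text):
        if text[i] == "^" and i + 1 < len(text):
            out.append(ord(text[i + 1]) ^ 0x40)  # ^B -> 0x02, ^_ -> 0x1F
            i += 2
        else:
            out.append(ord(text[i]))
            i += 1
    return bytes(out)


def encode_t(value):
    """Two base-5 symbols, most significant first."""
    return bytes([SYMBOLS[value // 5], SYMBOLS[value % 5]])


def encode_l(n):
    """Width prefix followed by the offset base-5 suffix."""
    width = max(w for w in range(1, 5) if OFFSETS[w - 1] <= n)
    rest, digits = n - OFFSETS[width - 1], []
    for _ in range(width):
        rest, digit = divmod(rest, 5)
        digits.insert(0, digit)
    return bytes(SYMBOLS[d] for d in [width - 1] + digits)


def record(rtype, value):
    """One TLV record: type, length of value, value."""
    return encode_t(rtype) + encode_l(len(value)) + value


def frame(payload):
    """Header, MetaL, payload, footer."""
    return b"\x0f\x0f" + encode_l(len(payload)) + payload + b"\x0f"


label_value = from_caret("^B^_^O^V^B^C^B^_")  # "test", as in example 1
tagged = frame(record(5, label_value))
assert tagged == from_caret("^O^O^C^C^V^C^B^C^B^V^B^_^O^V^B^C^B^_^O")
assert len(tagged) == 19  # 8 bytes of label, 11 bytes of overhead

# The CTCP ACTION form places the frame just before the closing ^A.
action = b"\x01ACTION barfs on the floor." + tagged + b"\x01"
```

The two ``assert`` lines check the sketch against the byte sequence and the
byte count given in the first example.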

.. .. == WTF? ==
.. .. 
.. .. Comic Chat is an old piece of software from Microsoft Research which renders IRC sessions as comics.  While it tries to largely
.. .. automagically determine the layout and appearance of characters, occasionally it needs to send some metadata into the channel,
.. .. which can be annoying to people not playing that game.  This seemed like a prime opportunity for IRCIE.
.. .. 
.. .. See [[Misc/ComicChat]] for some more discussion of the protocol being re-encoded here.
.. .. 
.. .. == IRCIE Encoding ==
.. .. 
.. .. Unfortunately, `# Appears as` lines are going to encode into entirely blank lines from the perspective of unaware clients.
.. .. That's probably OK.
.. .. 
.. .. The MSCC IRCIE messages shall begin with one type tag byte, T-encoded.  A value of 0 indicates that the message is
.. .. an `# Appears as` line, with the payload (e.g. `ANNA`) coded using [[../Instances]] Huffman Table 1.  A value of 1
.. .. indicates that it is the body of a `(#...)` declaration (omitting the parens and the initial `#`) coded as follows:
.. .. 
.. ..  * G and E values can be thought as encoding numerals between 0 and 4096.  Fixed width, six symbol field.  Max value indicates "not present".
.. ..  * R is either present or not.  Fixed width, one symbol field.  0 for not present, 1 for present.  Other values reserved.
.. ..  * M codes values between 1 and 5.  Fixed width, one symbol field.  Value of 0 indicates not present.
.. ..  * T carries arbitrary data, and so we encode the entire comma-separated string using the Huffman Table 1 from [[../Instances]], prefixed by a single symbol to carry present/not as with R.
.. .. 
.. .. This requires 6+6+1+1+1+n = 15+n bytes, which isn't too bad.
.. .. 
.. .. == External Links ==
.. .. 
.. ..  * http://en.wikipedia.org/wiki/Microsoft_Comic_Chat

======================
Miscellaneous Metadata
======================

Head Of Frame Flags
###################

Sometimes, all you want is a regex.  Head-of-frame flags are intended to be
scanned by agents not interested in decoding the entirety of IRCIE (as well
as those which are).  They MUST immediately follow the sigil and MetaL
header fields, and MUST be repeated at the head of every fragment (see
below) of a message.  HOF flags' definitions violate the encoding
abstraction and are encoded directly as symbols in the stream, to facilitate
this parsing.  Fields are filled in "left to right" ("most significant
position" first); fields not present read as their defaulted value.

+----------+-------+------------------------+
| Position | Value | Signal                 |
+==========+=======+========================+
| 0        | 0     | Not a bot.             |
+          +-------+------------------------+
|          | 1     | Bot/automated message. |
+          +-------+------------------------+
|          | 2-4   | Bot flag: reserved.    |
+----------+-------+------------------------+

Automated response flags are there to help bots avoid chatter-spam
(responding to other bots, possibly circularly).  Concretely, bots that do
not otherwise implement IRCIE should append ``^O^O^C^B^B^B^V^B^C^C^O`` to
their messages and should use this regex to recognize bot flags encoded in
arbitrary IRCIE TLV streams (here ``^X`` denotes a single control
character, and ``[^C^O^V^_]`` is a character class, not a negation):
``^O^O(^B|^C.|^O..|^V...|^_....).^B^V(^B|^C.|^O..|^V...|^_....).[^C^O^V^_].*^O$``.
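
As an illustration, the regex above can be transcribed for Python's ``re``
module, operating on raw bytes.  The names ``BOT_TAG``, ``BOT_FLAG`` and
``is_bot_message`` are hypothetical; the pattern itself is a direct
transliteration with each ``^X`` replaced by the corresponding byte escape.

```python
import re

# The fixed tag a non-IRCIE bot appends: HOF flags record with value 1.
BOT_TAG = b"\x0f\x0f\x03\x02\x02\x02\x16\x02\x03\x03\x0f"

# An L-encoded field: a prefix, as many extra symbols as it calls for,
# then one more symbol (the bias of 1).
L_FIELD = rb"(?:\x02|\x03.|\x0f..|\x16...|\x1f....)."

BOT_FLAG = re.compile(
    rb"\x0f\x0f" + L_FIELD      # header and MetaL
    + rb"\x02\x16" + L_FIELD    # type 3 (head-of-frame flags) and length
    + rb"[\x03\x0f\x16\x1f]"    # position 0 is non-zero: bot or reserved
    + rb".*\x0f$",              # rest of the frame and the footer
    re.DOTALL,
)


def is_bot_message(line):
    """Return True if the line carries a non-zero bot flag."""
    return BOT_FLAG.search(line) is not None
```

A bot would then skip any incoming line for which ``is_bot_message``
returns ``True`` and append ``BOT_TAG`` to each line it sends.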

OTR Advertisement Messages
##########################

Reserved for use with Off The Record messaging when being used on an IRC
transport.  For compatibility with the extant OTR protocol, the payload of
this message is a list of supported OTR versions.  Each version is
T-encoded, using two bytes, and the list is formed by concatenation.  (Note
that some of the older OTR documentation does not admit that there are
multiple tags!)

+------------+-----------------+
| List value | Signal          |
+============+=================+
| 0          | RESERVED        |
+------------+-----------------+
| 1          | OTR V 1         |
+------------+-----------------+
| 2          | OTR V 2         |
+------------+-----------------+
| 3-24       | RESERVED        |
+------------+-----------------+

For example:

+---------+---------+--------------------+
| Header            | ``^O^O``           |
+---------+---------+--------------------+
| MetaL             | ``^C^B^V`` (8)     |
+---------+---------+--------------------+
| First Record                           |
+---------+---------+--------------------+
|         |  Type   | ``^V^B`` (15)      |
+---------+---------+--------------------+
|         | Length  | ``^B^_`` (4)       |
+---------+---------+--------------------+
|         | Value   | ``^B^O^B^C`` (2,1) |
+---------+---------+--------------------+
| Footer            | ``^O``             |
+---------+---------+--------------------+
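
The value field of such a record is simple enough to sketch directly.  The
helper names below are hypothetical, and skipping reserved list values on
decode is an assumption of this sketch rather than a stated rule.

```python
SYMBOLS = bytes([0x02, 0x03, 0x0F, 0x16, 0x1F])  # ^B ^C ^O ^V ^_ = 0..4
KNOWN_VERSIONS = {1, 2}  # list values 0 and 3..24 are reserved


def encode_otr_versions(versions):
    """Concatenate the two-symbol T encoding of each OTR version."""
    out = bytearray()
    for v in versions:
        if v not in KNOWN_VERSIONS:
            raise ValueError("reserved OTR list value: %r" % (v,))
        out += bytes([SYMBOLS[v // 5], SYMBOLS[v % 5]])
    return bytes(out)


def decode_otr_versions(value):
    """Return the advertised versions, skipping reserved list values."""
    if len(value) % 2:
        raise ValueError("payload is not a whole number of T-encoded pairs")
    pairs = (value[i:i + 2] for i in range(0, len(value), 2))
    decoded = [SYMBOLS.index(a) * 5 + SYMBOLS.index(b) for a, b in pairs]
    return [v for v in decoded if v in KNOWN_VERSIONS]
```

Here ``encode_otr_versions([2, 1])`` gives ``^B^O^B^C``, the value shown in
the example table.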

Miscellaneous Message Flags
###########################

The message flags type is reserved for carrying a bitset for flagged
attributes of the message to which it is attached.  This may be thought of
as an infinite bitstring transmitted without leading zeros.  The value is
encoded numerically using T-encoding with a leading 1 which is ignored (to
signal the leftmost boundary of bits to be interpreted).  Clients receiving
bits beyond their interpretation are obligated to ignore these bits.

No MMFs are defined at present; the sole user of this feature was moved to
the Head-of-frame flags above.  Nonetheless, the type is reserved should we
ever want it back.

Continuation Flags
##################

A single character flag, marking fragmented IRC messages.  Each IRC message
may contain AT MOST one continuation flag message.  The messages are
T-encoded as follows; other values are reserved.

+-------+------------------------------+
| Value | Meaning                      |
+=======+==============================+
| 0     | Begin a split message set    |
+-------+------------------------------+
| 1     | Continue a split message set |
+-------+------------------------------+
| 2     | End a split message set      |
+-------+------------------------------+

When processing a stream of messages, a "begin split message set" indicates
that the transmitting client has had to fragment the IRC line in order to
satisfy the IRC server.  It will then produce a stream of zero or more
"continue split"-tagged messages followed by exactly one "end split"
message.

The semantics of processing Continuation Flags are as if the message had not
been fragmented.  That is, the client will accumulate lines in a fragmented
set, and will then parse all other IRCIE messages contained in the
accumulated sets.  Head-of-frame flags MUST be identical across fragments
and are only parsed once after reassembly.

A receiving client must also treat any message without a continuation flag
as having been preceded by an end if the sender has an open begin.  That is,
a message that is either IRCIE-untagged or IRCIE-tagged without a
continuation flag, arriving while a continue or an end is expected, should
cause the client to pretend that it has seen an end prior to the triggering
message.  The same behavior is expected if a client drops from the server
before sending the ending message.

Clients should avoid changing IRC nicknames while emitting continuations,
but receiving clients MAY choose to defensively track such changes.

It is an error to send continue or end messages without first sending a
beginning, and clients SHOULD discard the continuation flags in this case.
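
The receive-side behavior described in this section amounts to a small
per-sender state machine, sketched below.  The ``Reassembler`` class is
hypothetical, and several details are assumptions of the sketch rather than
protocol rules: state is keyed by nickname, fragments are joined with no
separator, and a second "begin" while a set is open is treated like any
other implied end.

```python
BEGIN, CONTINUE, END = 0, 1, 2  # continuation flag values


class Reassembler:
    """Collect fragmented messages per sender."""

    def __init__(self):
        self.pending = {}  # sender -> list of fragments in an open set

    def feed(self, sender, text, flag=None):
        """Process one message; return the list of completed messages."""
        done = []
        fragments = self.pending.get(sender)
        if flag in (CONTINUE, END) and fragments is None:
            # Continue/end without a begin: discard the flag.
            return [text]
        if flag in (None, BEGIN) and fragments is not None:
            # Pretend an end was seen before the triggering message.
            done.append("".join(self.pending.pop(sender)))
        if flag is None:
            done.append(text)
        elif flag == BEGIN:
            self.pending[sender] = [text]
        else:
            fragments.append(text)
            if flag == END:
                done.append("".join(self.pending.pop(sender)))
        return done

    def drop(self, sender):
        """Sender left the server: flush an open set as if it had ended."""
        fragments = self.pending.pop(sender, None)
        return ["".join(fragments)] if fragments is not None else []
```

A client would call ``feed`` with the visible text and decoded flag of each
incoming message, call ``drop`` on QUIT, and run the remaining IRCIE
processing on whatever comes back.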

===============
Implementations
===============


* There is a general purpose, Perl IRCIE codec and instancer plugin for irssi available at
  http://hydra-www.ietfng.org/gitweb/?p=instirc;a=summary

* There is an emacs ``erc`` instances plugin at http://www.talchas.net/erc-ircie.el

* There is some early work on an instance-specific plugin for XChat, written
  in C, at http://adrake.org/blog/software/instancedxchat


===================
Commentary vs. CTCP
===================

This protocol is awful, in a sense.  It duplicates a huge amount of
machinery (signaling, message framing, dispatch) already present for CTCP
(see http://www.irchelp.org/irchelp/rfc/ctcpspec.html and
http://www.invlogic.com/irc/ctcp.html) and in a sense ought not exist.
However, as of this writing, most IRC clients do not properly parse CTCP and
choke if either multiple CTCP messages are sent in one IRC line (which is
legal) or if the CTCP message does not span the entire IRC line (also
legal).  Further, clients tend to be bad about ignoring CTCPs they don't
understand, coughing up user-visible noise instead.

If IRC clients were to properly understand CTCP, instance labels might
instead be encoded by introducing a new CTCP command meaning "the next IRC
message or CTCP ACTION from this client will be on instance ...", sent at
the head of the IRC line in question.  A similar CTCP may be defined for
instance continuation messages, with the same semantics as the message
defined here.  Done correctly, such a CTCP vocabulary would be more flexible
than the protocol defined here, allowing one IRC line to carry multiple
messages bound for different instances.