---
no_site_title: true
title: "Preserves: an Expressive Data Language"
---

Tony Garnock-Jones <tonyg@leastfixedpoint.com>  
August 2019. Version 0.0.6.

  [sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt
  [spki]: http://world.std.com/~cme/html/spki.html
  [varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
  [erlang-map]: http://erlang.org/doc/reference_manual/data_types.html#map
  [abnf]: https://tools.ietf.org/html/rfc7405

This document proposes a data model and serialization format called
*Preserves*.

Preserves supports *records* with user-defined *labels*. This relieves
the confusion caused by encoding records as dictionaries, seen in most
data languages in use on the web. It also allows Preserves to easily
represent the *labelled sums of products* as seen in many functional
programming languages.

Preserves also supports the usual suite of atomic and compound data
types, in particular including *binary* data as a distinct type from
text strings. Its *annotations* allow separation of data from metadata
such as [comments](conventions.html#comments), trace information, and
provenance information.

Finally, Preserves defines precisely how to *compare* two values.
Comparison is based on the data model, not on syntax or on data
structures of any particular implementation language.

## Starting with Semantics

Taking inspiration from functional programming, we start with a
definition of the *values* that we want to work with and give them
meaning independent of their syntax.

Our `Value`s fall into two broad categories: *atomic* and *compound*
data.

                          Value = Atom
                                | Compound

                           Atom = Boolean
                                | Float
                                | Double
                                | SignedInteger
                                | String
                                | ByteString
                                | Symbol

                       Compound = Record
                                | Sequence
                                | Set
                                | Dictionary

**Total order.**<a name="total-order"></a> As we go, we will
incrementally specify a total order over `Value`s. Two values of the
same kind are compared using kind-specific rules. The ordering among
values of different kinds is essentially arbitrary, but having a total
order is convenient for many tasks, so we define it as
follows:[^ordering-by-syntax]

            (Values)        Atom < Compound

            (Compounds)     Record < Sequence < Set < Dictionary

            (Atoms)         Boolean < Float < Double < SignedInteger
                              < String < ByteString < Symbol

  [^ordering-by-syntax]: The observant reader may note that the
    ordering here is the same as that implied by the tagging scheme
    used in the concrete binary syntax for `Value`s.

**Equivalence.**<a name="equivalence"></a> Two `Value`s are equal if
neither is less than the other according to the total order.

### Signed integers.

A `SignedInteger` is a signed integer of arbitrary width.
`SignedInteger`s are compared as mathematical integers.

### Unicode strings.

A `String` is a sequence of Unicode
[code-point](http://www.unicode.org/glossary/#code_point)s. `String`s
are compared lexicographically, code-point by
code-point.[^utf8-is-awesome]

  [^utf8-is-awesome]: Happily, the design of UTF-8 is such that this
    gives the same result as a lexicographic byte-by-byte comparison
    of the UTF-8 encoding of a string!

### Binary data.

A `ByteString` is a sequence of octets. `ByteString`s are compared
lexicographically.

### Symbols.

Programming languages like Lisp and Prolog frequently use string-like
values called *symbols*. Here, a `Symbol` is, like a `String`, a
sequence of Unicode code-points representing an identifier of some
kind. `Symbol`s are also compared lexicographically by code-point.

### Booleans.

There are two `Boolean`s, “false” and “true”. The “false” value is
less than the “true” value.

### IEEE floating-point values.

`Float`s and `Double`s are single- and double-precision IEEE 754
floating-point values, respectively. `Float`s, `Double`s and
`SignedInteger`s are disjoint; by the rules [above](#total-order),
every `Float` is less than every `Double`, and every `SignedInteger`
is greater than both. Two `Float`s or two `Double`s are to be ordered
by the `totalOrder` predicate defined in section 5.10 of
[IEEE Std 754-2008](https://dx.doi.org/10.1109/IEEESTD.2008.4610935).
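
As a rough illustration of the ordering that `totalOrder` induces on
`Double`s, the following sketch maps each double to an integer key
whose natural ordering matches it. The helper name
`double_total_order_key` is hypothetical, and the bit manipulation is
a common technique rather than something this specification
prescribes.

```python
import struct

def double_total_order_key(x: float) -> int:
    """Map a double to an integer that sorts the way totalOrder does."""
    # Reinterpret the 64-bit IEEE 754 pattern as an unsigned integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    if bits >> 63:
        # Sign bit set: larger magnitudes must sort earlier, so flip all bits.
        return bits ^ 0xFFFFFFFFFFFFFFFF
    # Sign bit clear: set the top bit so these sort after every negative.
    return bits | (1 << 63)
```

Under this key, negative zero sorts immediately below positive zero,
which ordinary floating-point `<` does not distinguish.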

### Records.

A `Record` is a *labelled* tuple of `Value`s, the record's *fields*. A
label can be any `Value`, but is usually a `Symbol`.[^extensibility]
[^iri-labels] `Record`s are compared lexicographically: first by
label, then by field sequence.

  [^extensibility]: The [Racket](https://racket-lang.org/) programming
    language defines
    “[prefab](http://docs.racket-lang.org/guide/define-struct.html#(part._prefab-struct))”
    structure types, which map well to our `Record`s. Racket supports
    record extensibility by encoding record supertypes into record
    labels as specially-formatted lists.

  [^iri-labels]: It is occasionally (but seldom) necessary to
    interpret such `Symbol` labels as UTF-8 encoded IRIs. Where a
    label can be read as a relative IRI, it is notionally interpreted
    with respect to the IRI
    `urn:uuid:6bf094a6-20f1-4887-ada7-46834a9b5b34`; where a label can
    be read as an absolute IRI, it stands for that IRI; and otherwise,
    it cannot be read as an IRI at all, and so the label simply stands
    for itself—for its own `Value`.

### Sequences.

A `Sequence` is a sequence of `Value`s. `Sequence`s are compared
lexicographically.

### Sets.

A `Set` is an unordered finite set of `Value`s. It contains no
duplicate values, following the [equivalence relation](#equivalence)
induced by the total order on `Value`s. Two `Set`s are compared by
sorting their elements ascending using the [total order](#total-order)
and comparing the resulting `Sequence`s.

### Dictionaries.

A `Dictionary` is an unordered finite collection of pairs of `Value`s.
Each pair comprises a *key* and a *value*. Keys in a `Dictionary` are
pairwise distinct. Instances of `Dictionary` are compared by
lexicographic comparison of the sequences resulting from ordering each
`Dictionary`'s pairs in ascending order by key.
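
To make the rules of this section concrete, here is a minimal sketch
of a sort key that follows the total order. The mapping from `Value`s
to Python types is an assumption made for illustration only: `Symbol`
and `Record` are hypothetical classes, a Python `float` stands for a
`Double` (single-precision `Float` is omitted), and a tuple, a
`frozenset` and a `dict` stand for `Sequence`, `Set` and `Dictionary`.

```python
import struct
from dataclasses import dataclass

@dataclass(frozen=True)
class Symbol:            # hypothetical stand-in for a Preserves Symbol
    name: str

@dataclass(frozen=True)
class Record:            # hypothetical stand-in for a Preserves Record
    label: object
    fields: tuple

def _double_key(x: float) -> int:
    # Integer key that sorts doubles the way IEEE 754 totalOrder does.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return bits ^ 0xFFFFFFFFFFFFFFFF if bits >> 63 else bits | (1 << 63)

def sort_key(v):
    """Return a (kind rank, kind-specific key) tuple for a value."""
    if isinstance(v, bool):        # checked before int: bool subclasses int
        return (0, v)
    if isinstance(v, float):       # treated as Double; rank 1 (Float) unused
        return (2, _double_key(v))
    if isinstance(v, int):
        return (3, v)
    if isinstance(v, str):         # Python compares str code point by code point
        return (4, v)
    if isinstance(v, bytes):
        return (5, v)
    if isinstance(v, Symbol):
        return (6, v.name)
    if isinstance(v, Record):      # label first, then the field sequence
        return (7, (sort_key(v.label), tuple(map(sort_key, v.fields))))
    if isinstance(v, tuple):
        return (8, tuple(map(sort_key, v)))
    if isinstance(v, frozenset):   # sort elements, then compare as a sequence
        return (9, tuple(sorted(map(sort_key, v))))
    if isinstance(v, dict):        # order pairs by key, then compare pairwise
        return (10, tuple(sorted((sort_key(k), sort_key(x)) for k, x in v.items())))
    raise TypeError(f"not a supported value: {v!r}")
```

Two values are then equivalent exactly when their keys are equal. Note
that Python itself treats `1`, `1.0` and `True` as equal when hashing,
so a `frozenset` or `dict` cannot hold them side by side; a faithful
implementation would need its own container types.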

## Textual Syntax

Now that we have discussed `Value`s and their meanings, we may turn to
techniques for *representing* `Value`s for communication or storage.

In this section, we use [case-sensitive ABNF][abnf] to define a
textual syntax that is easy for people to read and
write.[^json-superset] Most of the examples in this document are
written using this syntax. In the following section, we will define an
equivalent compact machine-readable syntax.

  [^json-superset]: The grammar of the textual syntax is a superset of
    JSON, with the slightly unusual feature that `true`, `false`, and
    `null` are all read as `Symbol`s, and that `SignedInteger`s are
    never read as `Double`s.

### Character set.

[ABNF][abnf] allows easy definition of US-ASCII-based languages.
However, Preserves is a Unicode-based language. Therefore, we
reinterpret ABNF as a grammar for recognising sequences of Unicode
code points.

Textual syntax for a `Value` *SHOULD* be encoded using UTF-8 where
possible.

### Whitespace.

Whitespace is defined as any number of spaces, tabs, carriage returns,
line feeds, or commas.

                ws = *(%x20 / %x09 / newline / ",")
           newline = CR / LF

### Grammar.

Standalone documents may have trailing whitespace.

          Document = Value ws

Any `Value` may be preceded by whitespace.

             Value = ws (Record / Collection / Atom / Compact)
        Collection = Sequence / Dictionary / Set
              Atom = Boolean / Float / Double / SignedInteger /
                     String / ByteString / Symbol

Each `Record` is an angle-bracket enclosed grouping of its
label-`Value` followed by its field-`Value`s.

            Record = "<" Value *Value ws ">"

`Sequence`s are enclosed in square brackets. `Dictionary` values are
curly-brace-enclosed colon-separated pairs of values. `Set`s are
written either as one or more values enclosed in curly braces, or zero
or more values enclosed by the tokens `#set{` and
`}`.[^printing-collections] It is an error for a set to contain
duplicate elements or for a dictionary to contain duplicate keys.

          Sequence = "[" *Value ws "]"
        Dictionary = "{" *(Value ws ":" Value) ws "}"
               Set = %s"#set{" *Value ws "}" / "{" 1*Value ws "}"

  [^printing-collections]: **Implementation note.** When implementing
    printing of `Value`s using the textual syntax, consider supporting
    (a) optional pretty-printing with indentation, (b) optional
    JSON-compatible print mode for that subset of `Value` that is
    compatible with JSON, and (c) optional submodes for no commas,
    commas separating, and commas terminating elements or key/value
    pairs within a collection.

`Boolean`s are the simple literal strings `#true` and `#false`.

           Boolean = %s"#true" / %s"#false"

Numeric data follow the
[JSON grammar](https://tools.ietf.org/html/rfc8259#section-6), with
the addition of a trailing “f” distinguishing `Float` from `Double`
values. `Float`s and `Double`s always have a fractional part, an
exponent part, or both, whereas `SignedInteger`s never have
either.[^reading-and-writing-floats-accurately]
[^arbitrary-precision-signedinteger]

             Float = flt %i"f"
            Double = flt
     SignedInteger = int

          digit1-9 = %x31-39
               nat = %x30 / ( digit1-9 *DIGIT )
               int = ["-"] nat
              frac = "." 1*DIGIT
               exp = %i"e" ["-"/"+"] 1*DIGIT
               flt = int (frac exp / frac / exp)
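
The numeric productions above translate fairly directly into regular
expressions. The sketch below classifies a token as `Float`, `Double`
or `SignedInteger`; the function name `classify_number` is
hypothetical, and the sketch only recognises tokens, leaving
conversion to numeric values aside.

```python
import re

# int = ["-"] nat, where nat has no leading zeros.
_INT = r"-?(?:0|[1-9][0-9]*)"
# flt = int (frac exp / frac / exp); the exponent marker is case-insensitive.
_FLT = _INT + r"(?:\.[0-9]+(?:[eE][-+]?[0-9]+)?|[eE][-+]?[0-9]+)"

_FLOAT_RE = re.compile(_FLT + r"[fF]")   # Float = flt %i"f"
_DOUBLE_RE = re.compile(_FLT)            # Double = flt
_INT_RE = re.compile(_INT)               # SignedInteger = int

def classify_number(token: str) -> str:
    """Name the kind of numeric literal, or raise ValueError."""
    if _FLOAT_RE.fullmatch(token):
        return "Float"
    if _DOUBLE_RE.fullmatch(token):
        return "Double"
    if _INT_RE.fullmatch(token):
        return "SignedInteger"
    raise ValueError(f"not a numeric literal: {token!r}")
```

For instance, `1.0f` is classified as a `Float`, `1e3` as a `Double`,
and `42` as a `SignedInteger`, while `01` and `1.` are rejected.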

  [^reading-and-writing-floats-accurately]: **Implementation note.**
    Your language's standard library likely has a good routine for
    converting between decimal notation and IEEE 754 floating-point.
    However, if not, or if you are interested in the challenges of
    accurately reading and writing floating point numbers, see the
    excellent matched pair of 1990 papers by Clinger and Steele &
    White, and a recent follow-up by Jaffer:

    Clinger, William D. ‘How to Read Floating Point Numbers
    Accurately’. In Proc. PLDI. White Plains, New York, 1990.
    <https://doi.org/10.1145/93542.93557>.

    Steele, Guy L., Jr., and Jon L. White. ‘How to Print
    Floating-Point Numbers Accurately’. In Proc. PLDI. White Plains,
    New York, 1990. <https://doi.org/10.1145/93542.93559>.

    Jaffer, Aubrey. ‘Easy Accurate Reading and Writing of
    Floating-Point Numbers’. ArXiv:1310.8121 [Cs], 27 October 2013.
    <http://arxiv.org/abs/1310.8121>.

  [^arbitrary-precision-signedinteger]: **Implementation note.** Be
    aware when implementing reading and writing of `SignedInteger`s
    that the data model *requires* arbitrary-precision integers. Your
    I/O routines must not truncate precision either when reading or
    writing a `SignedInteger`.

`String`s are,
[as in JSON](https://tools.ietf.org/html/rfc8259#section-7), possibly
escaped text surrounded by double quotes. The escaping rules are the
same as for JSON.[^string-json-correspondence] [^escaping-surrogate-pairs]

            String = %x22 *char %x22
              char = unescaped / %x7C / escape (escaped / %x22 / %s"u" 4HEXDIG)
         unescaped = %x20-21 / %x23-5B / %x5D-7B / %x7D-10FFFF
            escape = %x5C              ; \
           escaped = ( %x5C /          ; \    reverse solidus U+005C
                       %x2F /          ; /    solidus         U+002F
                       %x62 /          ; b    backspace       U+0008
                       %x66 /          ; f    form feed       U+000C
                       %x6E /          ; n    line feed       U+000A
                       %x72 /          ; r    carriage return U+000D
                       %x74 )          ; t    tab             U+0009

  [^string-json-correspondence]: The grammar for `String` has the same
    effect as the
    [JSON](https://tools.ietf.org/html/rfc8259#section-7) grammar for
    `string`. Some auxiliary definitions (e.g. `escaped`) are lifted
    largely unmodified from the text of RFC 8259.

  [^escaping-surrogate-pairs]: In particular, note JSON's rules around
    the use of surrogate pairs for code points not in the Basic
    Multilingual Plane. We encourage implementations to avoid escaping
    such characters when producing output, and instead to rely on the
    UTF-8 encoding of the entire document to handle them correctly.

A `ByteString` may be written in any of three different forms.

The first is similar to a `String`, but prepended with a hash sign
`#`. In addition, only Unicode code points overlapping with printable
7-bit ASCII are permitted unescaped inside such a `ByteString`; other
byte values must be escaped by prepending a two-digit hexadecimal
value with `\x`.

        ByteString = "#" %x22 *binchar %x22
           binchar = binunescaped / escape (escaped / %x22 / %s"x" 2HEXDIG)
      binunescaped = %x20-21 / %x23-5B / %x5D-7E

The second is as a sequence of pairs of hexadecimal digits interleaved
with whitespace and surrounded by `#hex{` and `}`.

       ByteString =/ %s"#hex{" *(ws / 2HEXDIG) ws "}"

The third is as a sequence of
[Base64](https://tools.ietf.org/html/rfc4648) characters, interleaved
with whitespace and surrounded by `#base64{` and `}`. Plain and
URL-safe Base64 characters are allowed.

       ByteString =/ %s"#base64{" *(ws / base64char) ws "}"
        base64char = %x41-5A / %x61-7A / %x30-39 / "+" / "/" / "-" / "_" / "="
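
As an illustration of the second form, the following sketch decodes a
`#hex{...}` literal into bytes. The function name
`parse_hex_bytestring` is hypothetical, and the sketch assumes the
whole literal is already available as a single string.

```python
import string

# Characters the grammar treats as whitespace (note the comma).
WS = " \t\r\n,"

def parse_hex_bytestring(text: str) -> bytes:
    """Decode a '#hex{...}' literal; digits must come in adjacent pairs."""
    if not (text.startswith("#hex{") and text.endswith("}")):
        raise ValueError("not a #hex{...} literal")
    body = text[len("#hex{"):-1]
    out = bytearray()
    i = 0
    while i < len(body):
        if body[i] in WS:
            i += 1
            continue
        pair = body[i:i + 2]
        if len(pair) != 2 or any(c not in string.hexdigits for c in pair):
            raise ValueError(f"expected two hex digits at offset {i}")
        out.append(int(pair, 16))
        i += 2
    return bytes(out)
```

Whitespace may separate pairs of digits but may not split a pair, so
`#hex{68 65, 6c6C 6f}` is accepted while `#hex{6 8}` is not.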

A `Symbol` may be written in a “bare” form[^cf-sexp-token] so long as
it conforms to certain restrictions on the characters appearing in the
symbol. Alternatively, it may be written in a quoted form. The quoted
form is much the same as the syntax for `String`s, including embedded
escape syntax, except using a bar or pipe character (`|`) instead of a
double quote mark.

            Symbol = symstart *symcont / "|" *symchar "|"
          symstart = ALPHA / sympunct / symustart
           symcont = ALPHA / sympunct / symustart / symucont / DIGIT / "-"
          sympunct = "~" / "!" / "$" / "%" / "^" / "&" / "*" /
                     "?" / "_" / "=" / "+" / "/" / "."
           symchar = unescaped / %x22 / escape (escaped / %x7C / %s"u" 4HEXDIG)
         symustart = <any code point greater than 127 whose Unicode
                      category is Lu, Ll, Lt, Lm, Lo, Mn, Mc, Me,
                      Pc, Po, Sc, Sm, Sk, So, or Co>
          symucont = <any code point greater than 127 whose Unicode
                      category is Nd, Nl, No, or Pd>

  [^cf-sexp-token]: Compare with the [SPKI S-expression][sexp.txt]
    definition of “token representation”, and with the
    [R6RS definition of identifiers](http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-7.html#node_sec_4.2.4).

Finally, any `Value` may be represented by escaping from the textual
syntax to the [compact binary syntax](#compact-binary-syntax) by
prefixing a `ByteString` containing the binary representation of the
`Value` with `#value`.[^rationale-switch-to-binary] [^no-literal-binary-in-text]

           Compact = %s"#value" ws ByteString

  [^rationale-switch-to-binary]: **Rationale.** The textual syntax
    cannot express every `Value`: specifically, it cannot express the
    several million floating-point NaNs, or the two floating-point
    Infinities. Since the compact binary format for `Value`s expresses
    each `Value` with precision, embedding binary `Value`s solves the
    problem.

  [^no-literal-binary-in-text]: Every text is ultimately physically
    stored as bytes; therefore, it might seem possible to escape to
    the raw binary form of compact binary encoding from within a
    piece of textual syntax. However, while bytes must be involved in
    any *representation* of text, the text *itself* is logically a
    sequence of *code points* and is not *intrinsically* a binary
    structure at all. It would be incoherent to expect to be able to
    access the representation of the text from within the text itself.

### Annotations.

**Syntax.** When written down, a `Value` may have an associated
sequence of *annotations* carrying “out-of-band” contextual metadata
about the value. Each annotation is, in turn, a `Value`, and may
itself have annotations.

            Value =/ ws "@" Value Value

Each annotation is preceded by `@`; the underlying annotated value
follows its annotations. Here we extend only the syntactic nonterminal
named “`Value`” without altering the semantic class of `Value`s.

**Equivalence.** Annotations appear within syntax denoting a `Value`;
however, the annotations are not part of the denoted value. They are
only part of the syntax. Annotations do not play a part in
equivalences and orderings of `Value`s.

Reflective tools such as debuggers, user interfaces, and message
routers and relays---tools which process `Value`s generically---may
use annotated inputs to tailor their operation, or may insert
annotations in their outputs. By contrast, in ordinary programs, as a
rule of thumb, the presence, absence or content of an annotation
should not change the control flow or output of the program.
Annotations are data *describing* `Value`s, and are not in the domain
of any specific application of `Value`s. That is, an annotation will
almost never cause a non-reflective program to do anything observably
different.

## Compact Binary Syntax

A `Repr` is a binary-syntax encoding, or representation, of either

 - a `Value`,
 - a “placeholder” for a `Value`, or
 - an annotation on a `Repr`.

Each `Repr` comprises one or more bytes describing the kind of
represented information and the length of the representation, followed
by the encoded details.

For a value `v`, we write `[[v]]` for the `Repr` of `v`.

### Type and Length representation.

Each `Repr` takes one of three possible forms:

 - (A) type-specific form, used for simple values such as `Boolean`s
   or `Float`s, for placeholders, and for introducing annotations.

 - (B) a variable-length form with length specified up-front, used for
   compound and variable-length atomic data structures when their
   sizes are known at the time serialization begins.

 - (C) a variable-length streaming form with unknown or unpredictable
   length, used in cases when serialization begins before the number
   of elements or bytes in the corresponding `Value` is known.

Applications may choose between formats B and C depending on their
needs at serialization time.

#### The lead byte.

Every `Repr` starts with a *lead byte*, constructed by
`leadbyte(t,n,m)`, where `t`,`n`∈{0,1,2,3} and 0≤`m`<16:

    leadbyte(t,n,m) = [t*64 + n*16 + m]

The arguments `t`, `n` and `m` describe the rest of the
representation.[^some-encodings-unused]

  [^some-encodings-unused]: Some encodings are unused. All such
    encodings are reserved for future versions of this specification.

| `t` | `n` | `m` | Meaning |
| --- | --- | --- | ------- |
|  0  |  0  | 0–3 | (format A) An `Atom` with fixed-length binary representation |
|  0  |  0  | 4   | (format C) Stream end |
|  0  |  0  | 5   | (format A) Annotation |
|  0  |  1  |     | (format A) Placeholder for an application-specific `Value` |
|  0  |  2  |     | (format C) Stream start |
|  0  |  3  |     | (format A) Certain small `SignedInteger`s |
|  1  |     |     | (format B) An `Atom` with variable-length binary representation |
|  2  |     |     | (format B) A `Compound` with variable-length representation |
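
The lead byte packs `t` into the top two bits, `n` into the next two,
and `m` into the low four. A minimal sketch of the construction and
its inverse follows; `split_leadbyte` is a hypothetical helper name
that does not appear elsewhere in this document.

```python
def leadbyte(t: int, n: int, m: int) -> bytes:
    """Build the one-byte sequence leadbyte(t,n,m) = [t*64 + n*16 + m]."""
    if not (0 <= t <= 3 and 0 <= n <= 3 and 0 <= m <= 15):
        raise ValueError("t and n must be in 0..3 and m in 0..15")
    return bytes([t * 64 + n * 16 + m])

def split_leadbyte(b: int) -> tuple:
    """Recover (t, n, m) from a lead byte given as an integer 0..255."""
    return (b >> 6) & 0x3, (b >> 4) & 0x3, b & 0xF
```

A decoder could dispatch on the result of `split_leadbyte` according
to the table above.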

#### Encoding data of type-specific length (format A).

Each type of data defines its own rules for this format.

#### Encoding data of known length (format B).

Format B is used where the length `l` of the `Value` to be encoded is
known when serialization begins. Format B `Repr`s use `m` in
`leadbyte` to encode `l`. The length counts *bytes* for atomic
`Value`s, but counts *contained values* for compound `Value`s.

 - A length `l` between 0 and 14 is represented using `leadbyte` with
   `m=l`.
 - A length of 15 or greater is represented by `m=15` and additional
   bytes describing the length following the lead byte.

The function `header(t,n,m)` yields an appropriate sequence of bytes
describing a `Repr`'s type and length when `t`, `n` and `m` are
appropriate non-negative integers:

    header(t,n,m) =    leadbyte(t,n,m)                 when m < 15
                    or leadbyte(t,n,15) ++ varint(m)   otherwise

The additional length bytes are formatted as
[base 128 varints][varint]. We write `varint(m)` for the
varint-encoding of `m`. Quoting the [Google Protocol Buffers][varint]
definition,

> Each byte in a varint, except the last byte, has the most
> significant bit (msb) set – this indicates that there are further
> bytes to come. The lower 7 bits of each byte are used to store the
> two's complement representation of the number in groups of 7 bits,
> least significant group first.

The following table illustrates varint-encoding.

| Number, `m` | `m` in binary, grouped into 7-bit chunks | `varint(m)` bytes |
| ------ | ------------------- | ------------ |
| 15 | `0001111` | 15 |
| 300 | `0000010 0101100` | 172 2 |
| 1000000000 | `0000011 1011100 1101011 0010100 0000000` | 128 148 235 220 3 |
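
The definitions of `varint` and `header` can be sketched as follows.
This is an illustrative rendering under the definitions above, not a
reference implementation; it handles non-negative lengths only.

```python
def varint(m: int) -> bytes:
    """Base-128 varint: 7 bits per byte, least significant group first."""
    if m < 0:
        raise ValueError("lengths are non-negative")
    out = bytearray()
    while True:
        low7, m = m & 0x7F, m >> 7
        if m == 0:
            out.append(low7)          # last byte: msb clear
            return bytes(out)
        out.append(low7 | 0x80)       # more bytes follow: msb set

def header(t: int, n: int, m: int) -> bytes:
    """leadbyte(t,n,m) when m < 15, else leadbyte(t,n,15) ++ varint(m)."""
    if m < 15:
        return bytes([t * 64 + n * 16 + m])
    return bytes([t * 64 + n * 16 + 15]) + varint(m)
```

With these definitions, `varint(300)` yields the bytes 172 and 2,
matching the table.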

#### Streaming data of unknown length (format C).

A `Repr` where the length of the `Value` to be encoded is variable and
not known at the time serialization of the `Value` starts is encoded
by a single Stream Start (“open”) byte, followed by zero or more
*chunks*, followed by a matching Stream End (“close”) byte:

     open(t,n) = leadbyte(0,2, t*4 + n) = [0x20 + t*4 + n]
       close() = leadbyte(0,0, 4)       = [0x04]

For a format C `Repr` of an atomic `Value`, each chunk is to be a
format B `Repr` of a `ByteString`, no matter the type of the overall
`Value`. Annotations are not allowed on these individual chunks.

For a format C `Repr` of a compound `Value`, each chunk is to be a
single `Repr`, which may itself be annotated.

Each chunk within a format C `Repr` *MUST* have non-zero length.
Software that decodes `Repr`s *MUST* reject `Repr`s that include
zero-length chunks.

### Records.

Format B (known length):

    [[ <L F_1...F_m> ]] = header(2,0,m+1) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]]

For `m` fields, `m+1` is supplied to `header`, to account for the
encoding of the record label.

Format C (streaming):

    [[ <L F_1...F_m> ]] = open(2,0) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]] ++ close()

Applications *SHOULD* prefer the known-length format for encoding
`Record`s.

### Placeholders.

Applications may define an interpretation for numbered *placeholders*
in the binary syntax, mapping each *placeholder number* `n` to a
specific `Value`. For example, a placeholder number may be assigned
for a frequently-used `Record` label.

A `Value` `v` for which placeholder number `n` has been assigned may
be tersely encoded as

    [[v]] = header(0,1,n)  when n is a placeholder number for v

**Examples.** A protocol may choose to assign placeholder
number 4 to the symbol `void`, making

    [[void]] = header(0,1,4) = [0x14]
    [[<void>]] = header(2,0,1) ++ [[void]] = [0x81, 0x14]

or it may map symbol `person` to placeholder number 102, making

    [[person]] = header(0,1,102) = [0x1F, 0x66]

and so

    [[<person "Dr" "Elizabeth" "Blackwell">]]
      = header(2,0,4) ++ [[person]] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]
      =          [0x84, 0x1F, 0x66] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]

for format B, or

    open(2,0) ++ [[person]] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ close()
       = [0x28, 0x1F, 0x66] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ [0x04]

for format C.
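
The record and placeholder encodings above can be reproduced with a
short sketch. The function names are hypothetical, and labels and
fields are passed in as already-encoded `Repr`s so that the sketch
stays independent of the other encoding rules.

```python
def varint(m: int) -> bytes:
    out = bytearray()
    while True:
        low7, m = m & 0x7F, m >> 7
        if m == 0:
            out.append(low7)
            return bytes(out)
        out.append(low7 | 0x80)

def header(t: int, n: int, m: int) -> bytes:
    if m < 15:
        return bytes([t * 64 + n * 16 + m])
    return bytes([t * 64 + n * 16 + 15]) + varint(m)

def placeholder(number: int) -> bytes:
    """Repr of the Value assigned to the given placeholder number."""
    return header(0, 1, number)

def record_known_length(label: bytes, fields: list) -> bytes:
    """Format B: the count covers the label plus every field."""
    return header(2, 0, len(fields) + 1) + label + b"".join(fields)

def record_streaming(label: bytes, fields: list) -> bytes:
    """Format C: open(2,0), then label and fields, then close()."""
    return bytes([0x20 + 2 * 4 + 0]) + label + b"".join(fields) + bytes([0x04])
```

For example, `record_known_length(placeholder(4), [])` yields the two
bytes `0x81 0x14` given for `<void>` above.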

### Sequences, Sets and Dictionaries.

Format B (known length):

            [[ [X_1...X_m] ]] = header(2,1,m)   ++ [[X_1]] ++...++ [[X_m]]
        [[ #set{E_1...E_m} ]] = header(2,2,m)   ++ [[E_1]] ++...++ [[E_m]]
    [[ {K_1:V_1...K_m:V_m} ]] = header(2,3,m*2) ++ [[K_1]] ++ [[V_1]] ++...
                                                ++ [[K_m]] ++ [[V_m]]

Note that `m*2` is given to `header` for a `Dictionary`, since there
are two `Value`s in each key-value pair.

Format C (streaming):

            [[ [X_1...X_m] ]] = open(2,1) ++ [[X_1]] ++...++ [[X_m]] ++ close()
        [[ #set{E_1...E_m} ]] = open(2,2) ++ [[E_1]] ++...++ [[E_m]] ++ close()
    [[ {K_1:V_1...K_m:V_m} ]] = open(2,3) ++ [[K_1]] ++ [[V_1]] ++...
                                          ++ [[K_m]] ++ [[V_m]] ++ close()

Applications may use whichever format suits their needs on a
case-by-case basis.

There is *no* ordering requirement on the `E_i` elements or
`K_i`/`V_i` pairs.[^no-sorting-rationale] They may appear in any
order. However, the `E_i` and `K_i` *MUST* be pairwise distinct.
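
A minimal sketch of both formats for the three collection kinds
follows. The function names are hypothetical, and elements are
supplied as already-encoded `Repr`s. The sketch does not check that
set elements or dictionary keys are pairwise distinct; a real encoder
would have to enforce that on the underlying `Value`s.

```python
def varint(m: int) -> bytes:
    out = bytearray()
    while True:
        low7, m = m & 0x7F, m >> 7
        if m == 0:
            out.append(low7)
            return bytes(out)
        out.append(low7 | 0x80)

def header(t: int, n: int, m: int) -> bytes:
    if m < 15:
        return bytes([t * 64 + n * 16 + m])
    return bytes([t * 64 + n * 16 + 15]) + varint(m)

def _compound(n: int, reprs: list, streaming: bool) -> bytes:
    body = b"".join(reprs)
    if streaming:
        # open(2,n) ++ chunks ++ close()
        return bytes([0x20 + 2 * 4 + n]) + body + bytes([0x04])
    return header(2, n, len(reprs)) + body

def encode_sequence(items: list, streaming: bool = False) -> bytes:
    return _compound(1, items, streaming)

def encode_set(elements: list, streaming: bool = False) -> bytes:
    return _compound(2, elements, streaming)

def encode_dictionary(pairs: list, streaming: bool = False) -> bytes:
    # Flatten (key, value) pairs so the count passed to header is m*2.
    flat = [r for pair in pairs for r in pair]
    return _compound(3, flat, streaming)
```

Here `encode_sequence([b"\x31", b"\x32", b"\x33", b"\x34"])` produces
`94 31 32 33 34`, and the same call with `streaming=True` produces
`29 31 32 33 34 04`.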

  [^no-sorting-rationale]: In the BitTorrent encoding format,
    [bencoding](http://www.bittorrent.org/beps/bep_0003.html#bencoding),
    dictionary key/value pairs must be sorted by key. This is a
    necessary step for ensuring serialization of `Value`s is
    canonical. We do not require that key/value pairs (or set
    elements) be in sorted order for serialized `Value`s, because (a)
    where canonicalization is used for cryptographic signatures, it is
    more reliable to simply retain the exact binary form of the signed
    document than to depend on canonical de- and re-serialization, and
    (b) sorting keys or elements makes no sense in streaming
    serialization formats.

    However, a quality implementation may wish to offer the programmer
    the option of serializing with set elements and dictionary keys in
    sorted order.

### SignedIntegers.

Format B/A (known length/fixed-size):

    [[ x ]] when x ∈ SignedInteger = header(1,0,m) ++ intbytes(x)  if x<-3 ∨ 13≤x
                                     header(0,3,x+16)              if -3≤x<0
                                     header(0,3,x)                 if 0≤x<13

Integers in the range [-3,12] are compactly represented using format A
because they are so frequently used. Other integers are represented
using format B.

Format C *MUST NOT* be used for `SignedInteger`s.

The function `intbytes(x)` gives the big-endian two's-complement
binary representation of `x`, taking exactly as many whole bytes as
needed to unambiguously identify the value and its sign. The length
`m` supplied to `header` is `|intbytes(x)|`, the number of bytes in
that representation. The most-significant bit in the first byte in
`intbytes(x)` <!-- for `x`≠0 --> is the sign bit.[^zero-intbytes]

  [^zero-intbytes]: The value 0 needs zero bytes to identify the
    value, so `intbytes(0)` is the empty byte string. Non-zero values
    need at least one byte.

For example,

    [[   -257 ]] = 42 FE FF    [[     -3 ]] = 3D       [[    128 ]] = 42 00 80
    [[   -256 ]] = 42 FF 00    [[     -2 ]] = 3E       [[    255 ]] = 42 00 FF
    [[   -255 ]] = 42 FF 01    [[     -1 ]] = 3F       [[    256 ]] = 42 01 00
    [[   -254 ]] = 42 FF 02    [[      0 ]] = 30       [[  32767 ]] = 42 7F FF
    [[   -129 ]] = 42 FF 7F    [[      1 ]] = 31       [[  32768 ]] = 43 00 80 00
    [[   -128 ]] = 41 80       [[     12 ]] = 3C       [[  65535 ]] = 43 00 FF FF
    [[   -127 ]] = 41 81       [[     13 ]] = 41 0D    [[  65536 ]] = 43 01 00 00
    [[     -4 ]] = 41 FC       [[    127 ]] = 41 7F    [[ 131072 ]] = 43 02 00 00
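
The rules for `SignedInteger`s can be sketched as below. The name
`encode_signed_integer` is hypothetical; `intbytes` follows the
definition above, relying on Python's arbitrary-precision integers.

```python
def varint(m: int) -> bytes:
    out = bytearray()
    while True:
        low7, m = m & 0x7F, m >> 7
        if m == 0:
            out.append(low7)
            return bytes(out)
        out.append(low7 | 0x80)

def intbytes(x: int) -> bytes:
    """Shortest big-endian two's-complement form; empty for zero."""
    if x == 0:
        return b""
    magnitude_bits = x.bit_length() if x > 0 else (~x).bit_length()
    nbytes = (magnitude_bits + 1 + 7) // 8    # one extra bit for the sign
    return x.to_bytes(nbytes, "big", signed=True)

def encode_signed_integer(x: int) -> bytes:
    if -3 <= x < 0:
        return bytes([0x30 + x + 16])         # header(0,3,x+16)
    if 0 <= x < 13:
        return bytes([0x30 + x])              # header(0,3,x)
    payload = intbytes(x)
    m = len(payload)
    if m < 15:
        return bytes([0x40 + m]) + payload    # header(1,0,m)
    return bytes([0x4F]) + varint(m) + payload
```

For example, `encode_signed_integer(-257)` yields `42 FE FF` and
`encode_signed_integer(13)` yields `41 0D`, as in the table above.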

### Strings, ByteStrings and Symbols.

Syntax for these three types varies only in the value of `n` supplied
to `header` and `open`. In each case, the payload following the header
is a binary sequence; for `String` and `Symbol`, it is a UTF-8
encoding of the `Value`'s code points, while for `ByteString` it is
the raw data contained within the `Value` unmodified.

Format B (known length):

              [[ S ]] = header(1,n,m) ++ encode(S)
              where m = |encode(S)|
    and (n,encode(S)) = (1,utf8(S))  if S ∈ String
                        (2,S)        if S ∈ ByteString
                        (3,utf8(S))  if S ∈ Symbol

To stream a `String`, `ByteString` or `Symbol`, emit `open(1,n)` and
then a sequence of zero or more format B chunks, followed by
`close()`. Every chunk must be a `ByteString`, and no chunk may be
annotated.

While the overall content of a streamed `String` or `Symbol` must be
valid UTF-8, individual chunks do not have to conform to UTF-8.
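
The following sketch illustrates both formats for the three
string-like types. The function names are hypothetical. The streaming
helper takes the payload already split into byte chunks and rejects
empty chunks, in line with the rule that every chunk has non-zero
length.

```python
def varint(m: int) -> bytes:
    out = bytearray()
    while True:
        low7, m = m & 0x7F, m >> 7
        if m == 0:
            out.append(low7)
            return bytes(out)
        out.append(low7 | 0x80)

def header(t: int, n: int, m: int) -> bytes:
    if m < 15:
        return bytes([t * 64 + n * 16 + m])
    return bytes([t * 64 + n * 16 + 15]) + varint(m)

def encode_string(s: str) -> bytes:
    payload = s.encode("utf-8")
    return header(1, 1, len(payload)) + payload

def encode_bytestring(b: bytes) -> bytes:
    return header(1, 2, len(b)) + b

def encode_symbol(name: str) -> bytes:
    payload = name.encode("utf-8")
    return header(1, 3, len(payload)) + payload

def stream_atom(n: int, chunks: list) -> bytes:
    """Format C: open(1,n), each chunk as a format B ByteString, close()."""
    out = bytearray([0x20 + 1 * 4 + n])
    for chunk in chunks:
        if not chunk:
            raise ValueError("chunks must have non-zero length")
        out += encode_bytestring(chunk)
    out.append(0x04)
    return bytes(out)
```

A streamed `String` is then `stream_atom(1, chunks)`, where the
concatenation of `chunks` is the UTF-8 encoding of the string; chunk
boundaries may fall in the middle of a multi-byte code point.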

### Fixed-length Atoms.

Fixed-length atoms all use format A, and do not have a length
representation. They repurpose the bits that format B `Repr`s use to
specify lengths. Applications *MUST NOT* use format C with `open(0,n)`
for any `n`.

#### Booleans.

    [[ #false ]] = header(0,0,0) = [0x00]
    [[  #true ]] = header(0,0,1) = [0x01]

#### Floats and Doubles.

    [[ F ]] when F ∈ Float  = header(0,0,2) ++ binary32(F)
    [[ D ]] when D ∈ Double = header(0,0,3) ++ binary64(D)

The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
8-byte IEEE 754 binary representations of `F` and `D`, respectively.
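
In a language with a standard packing routine, these encodings are
short. The sketch below uses Python's `struct` module; the function
names are hypothetical. Note that Python's own `float` is always
double precision, so `encode_float` rounds its argument to single
precision, and `struct.pack` raises `OverflowError` for finite values
too large for `binary32`.

```python
import struct

def encode_float(f: float) -> bytes:
    """header(0,0,2) followed by the big-endian binary32 form."""
    return b"\x02" + struct.pack(">f", f)

def encode_double(d: float) -> bytes:
    """header(0,0,3) followed by the big-endian binary64 form."""
    return b"\x03" + struct.pack(">d", d)
```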

### Annotations.

To annotate a `Repr` `r` with some `Value` `v`, prepend `r` with
`[0x05] ++ [[v]]`.

For example, the `Repr` corresponding to textual syntax `@a@b[]`,
i.e. an empty sequence annotated with two symbols, `a` and `b`, is

    [[ @a @b [] ]]
      = [0x05] ++ [[a]] ++ [0x05] ++ [[b]] ++ [[ [] ]]
      = [0x05, 0x71, 0x61, 0x05, 0x71, 0x62, 0x90]
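
Prepending annotations can be sketched in a few lines. The function
name `annotate` is hypothetical, and both the annotated `Repr` and the
annotations are supplied already encoded.

```python
def annotate(repr_bytes: bytes, annotation_reprs: list) -> bytes:
    """Prefix a Repr with [0x05] ++ [[v]] for each annotation v, in order."""
    prefix = b"".join(b"\x05" + a for a in annotation_reprs)
    return prefix + repr_bytes
```

With `[[a]] = 71 61`, `[[b]] = 71 62` and `[[ [] ]] = 90`, the call
`annotate(b"\x90", [b"\x71\x61", b"\x71\x62"])` reproduces the byte
sequence shown above.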

## Examples

### Simple examples.

<!-- TODO: Give some examples of large and small Preserves, perhaps -->
<!-- translated from various JSON blobs floating around the internet. -->

For the following examples, imagine an application that maps
placeholder number 0 to symbol `discard`, 1 to `capture`, and 2 to
`observe`.

| Value                                             | Encoded byte sequence                                                               |
|---------------------------------------------------|-------------------------------------------------------------------------------------|
| `<capture <discard>>`                             | 82 11 81 10                                                                         |
| `<observe <speak <discard> <capture <discard>>>>` | 82 12 83 75 's' 'p' 'e' 'a' 'k' 81 10 82 11 81 10                                   |
| `[1 2 3 4]` (format B)                            | 94 31 32 33 34                                                                      |
| `[1 2 3 4]` (format C)                            | 29 31 32 33 34 04                                                                   |
| `[-2 -1 0 1]`                                     | 94 3E 3F 30 31                                                                      |
| `"hello"` (format B)                              | 55 'h' 'e' 'l' 'l' 'o'                                                              |
| `"hello"` (format C, 2 chunks)                    | 25 62 'h' 'e' 63 'l' 'l' 'o' 04                                                     |
| `"hello"` (format C, 5 chunks)                    | 25 61 'h' 61 'e' 61 'l' 61 'l' 61 'o' 04                                            |
| `["hello" there #"world" [] #set{} #true #false]` | 97 55 'h' 'e' 'l' 'l' 'o' 75 't' 'h' 'e' 'r' 'e' 65 'w' 'o' 'r' 'l' 'd' 90 A0 01 00 |
| `-257`                                            | 42 FE FF                                                                            |
| `-1`                                              | 3F                                                                                  |
| `0`                                               | 30                                                                                  |
| `1`                                               | 31                                                                                  |
| `255`                                             | 42 00 FF                                                                            |
| `1.0f`                                            | 02 3F 80 00 00                                                                      |
| `1.0`                                             | 03 3F F0 00 00 00 00 00 00                                                          |
| `-1.202e300`                                      | 03 FE 3C B7 B7 59 BF 04 26                                                          |
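
The `Float` and `Double` rows can be reproduced with a few lines of Python. This is a minimal sketch, assuming only what the appendix tables state (lead byte `02` or `03` followed by the big-endian IEEE 754 bits); the function names `encode_float` and `encode_double` are hypothetical and not part of any Preserves library.

```python
import struct

def encode_float(x: float) -> bytes:
    """Lead byte 02, then the 32-bit IEEE 754 value, big-endian."""
    return bytes([0x02]) + struct.pack(">f", x)

def encode_double(x: float) -> bytes:
    """Lead byte 03, then the 64-bit IEEE 754 value, big-endian."""
    return bytes([0x03]) + struct.pack(">d", x)

print(encode_float(1.0).hex(" "))   # 02 3f 80 00 00
print(encode_double(1.0).hex(" "))  # 03 3f f0 00 00 00 00 00 00
```

Note that `struct.pack(">f", x)` rounds `x` to single precision, so a value that is not exactly representable as a `Float` will not round-trip unchanged.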

The next example uses a non-`Symbol` label for a record.[^extensibility2] The `Record`

    <[titled person 2 thing 1] 101 "Blackwell" <date 1821 2 3> "Dr">

encodes to

    85                              ;; Record, generic, 4+1
      95                              ;; Sequence, 5
        76 74 69 74 6C 65 64            ;; Symbol, "titled"
        76 70 65 72 73 6F 6E            ;; Symbol, "person"
        32                              ;; SignedInteger, "2"
        75 74 68 69 6E 67               ;; Symbol, "thing"
        31                              ;; SignedInteger, "1"
      41 65                           ;; SignedInteger, "101"
      59 42 6C 61 63 6B 77 65 6C 6C   ;; String, "Blackwell"
      84                              ;; Record, generic, 3+1
        74 64 61 74 65                  ;; Symbol, "date"
        42 07 1D                        ;; SignedInteger, "1821"
        32                              ;; SignedInteger, "2"
        33                              ;; SignedInteger, "3"
      52 44 72                        ;; String, "Dr"
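
The byte listing above can be checked mechanically. The following is a rough sketch of a format B encoder covering only the kinds of `Value` used in this example. The `Symbol` and `Record` classes and the `encode` and `header` functions are hypothetical names chosen for illustration, and lengths of 15 or more (which need a varint) are deliberately left out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Symbol:          # hypothetical stand-in for a Preserves Symbol
    name: str

@dataclass(frozen=True)
class Record:          # hypothetical stand-in for a Preserves Record
    label: object
    fields: tuple

def header(major: int, count: int) -> bytes:
    """Combine a major lead-byte value (e.g. 0x80) with a short length."""
    if count >= 15:
        raise NotImplementedError("lengths of 15 or more need a varint")
    return bytes([major | count])

def encode(v) -> bytes:
    if isinstance(v, bool):                      # check before int: bool is an int subclass
        return bytes([0x01 if v else 0x00])
    if isinstance(v, int):
        if -3 <= v <= 12:                        # small integers: 30..3C and 3D..3F
            return bytes([0x30 | (v & 0x0F)])
        bits = (v if v >= 0 else ~v).bit_length()
        body = v.to_bytes(bits // 8 + 1, "big", signed=True)
        return header(0x40, len(body)) + body
    if isinstance(v, str):
        body = v.encode("utf-8")
        return header(0x50, len(body)) + body
    if isinstance(v, Symbol):
        body = v.name.encode("utf-8")
        return header(0x70, len(body)) + body
    if isinstance(v, list):
        return header(0x90, len(v)) + b"".join(encode(x) for x in v)
    if isinstance(v, Record):                    # count includes the label
        parts = [encode(v.label)] + [encode(f) for f in v.fields]
        return header(0x80, len(parts)) + b"".join(parts)
    raise TypeError(f"unsupported value: {v!r}")

blackwell = Record(
    [Symbol("titled"), Symbol("person"), 2, Symbol("thing"), 1],
    (101, "Blackwell", Record(Symbol("date"), (1821, 2, 3)), "Dr"),
)
print(encode(blackwell).hex(" "))
```

The sketch always chooses the shortest two's-complement body for a `SignedInteger`, which agrees with the rows shown in this section.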

  [^extensibility2]: It happens to line up with Racket's
    representation of a record label for an inheritance hierarchy
    where `titled` extends `person` extends `thing`:

        (struct date (year month day) #:prefab)
        (struct thing (id) #:prefab)
        (struct person thing (name date-of-birth) #:prefab)
        (struct titled person (title) #:prefab)

    For more detail on Racket's representations of record labels, see
    [the Racket documentation for `make-prefab-struct`](http://docs.racket-lang.org/reference/structutils.html#%28def._%28%28quote._~23~25kernel%29._make-prefab-struct%29%29).

---

### JSON examples.

The examples from
[RFC 8259](https://tools.ietf.org/html/rfc8259#section-13) read as
valid Preserves, though the JSON literals `true`, `false` and `null`
read as `Symbol`s. The first example:

    {
      "Image": {
          "Width":  800,
          "Height": 600,
          "Title":  "View from 15th Floor",
          "Thumbnail": {
              "Url":    "http://www.example.com/image/481989943",
              "Height": 125,
              "Width":  100
          },
          "Animated" : false,
          "IDs": [116, 943, 234, 38793]
        }
    }

encodes to binary as follows:

    B2
      55 "Image"
      BC
        55 "Width"    42 03 20
        55 "Title"    5F 14 "View from 15th Floor"
        58 "Animated" 75 "false"
        56 "Height"   42 02 58
        59 "Thumbnail"
          B6
            55 "Width"  41 64
            53 "Url"    5F 26 "http://www.example.com/image/481989943"
            56 "Height" 41 7D
        53 "IDs"
          94
            41 74
            42 03 AF
            42 00 EA
            43 00 97 89

and the second example:

    [
      {
         "precision": "zip",
         "Latitude":  37.7668,
         "Longitude": -122.3959,
         "Address":   "",
         "City":      "SAN FRANCISCO",
         "State":     "CA",
         "Zip":       "94107",
         "Country":   "US"
      },
      {
         "precision": "zip",
         "Latitude":  37.371991,
         "Longitude": -122.026020,
         "Address":   "",
         "City":      "SUNNYVALE",
         "State":     "CA",
         "Zip":       "94085",
         "Country":   "US"
      }
    ]

encodes to binary as follows:

    92
      BF 10
        59 "precision"  53 "zip"
        58 "Latitude"   03 40 42 E2 26 80 9D 49 52
        59 "Longitude"  03 C0 5E 99 56 6C F4 1F 21
        57 "Address"    50
        54 "City"       5D "SAN FRANCISCO"
        55 "State"      52 "CA"
        53 "Zip"        55 "94107"
        57 "Country"    52 "US"
      BF 10
        59 "precision"  53 "zip"
        58 "Latitude"   03 40 42 AF 9D 66 AD B4 03
        59 "Longitude"  03 C0 5E 81 AA 4F CA 42 AF
        57 "Address"    50
        54 "City"       59 "SUNNYVALE"
        55 "State"      52 "CA"
        53 "Zip"        55 "94085"
        57 "Country"    52 "US"

## Security Considerations

**Empty chunks.** Chunks of zero length are prohibited in streamed
(format C) `Repr`s. However, a malicious or broken encoder may include
them nonetheless. This opens up a possibility for denial-of-service:
an attacker may begin streaming a `String`, for example, and then send
an endless sequence of zero-length chunks, appearing to make progress
without actually doing so. Implementations *MUST* reject zero-length
chunks when decoding, and *MUST NOT* produce them when encoding.
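
As an illustration of the decoding requirement, here is a hedged sketch of a reader for a streamed (format C) `String` that refuses empty chunks. It assumes the lead bytes listed in the appendix (`25` to start a `String` stream, `6x` for each `ByteString` chunk, `04` to end the stream); the name `read_streamed_string` is hypothetical, and chunk lengths of 15 or more are not handled.

```python
END_STREAM = 0x04

def read_streamed_string(data: bytes) -> str:
    """Decode a format C String such as 25 62 'h' 'e' 63 'l' 'l' 'o' 04."""
    if not data or data[0] != 0x25:
        raise ValueError("expected Start Stream lead byte 0x25 for a String")
    pos = 1
    chunks = []
    while True:
        if pos >= len(data):
            raise ValueError("input ended before End Stream")
        lead = data[pos]
        pos += 1
        if lead == END_STREAM:
            break
        if (lead & 0xF0) != 0x60:
            raise ValueError("each chunk of a String stream must be a ByteString")
        length = lead & 0x0F
        if length == 15:
            raise NotImplementedError("varint chunk lengths omitted from this sketch")
        if length == 0:
            # The denial-of-service case described above: reject, do not skip.
            raise ValueError("zero-length chunk in streamed Repr")
        if pos + length > len(data):
            raise ValueError("truncated chunk")
        chunks.append(data[pos:pos + length])
        pos += length
    return b"".join(chunks).decode("utf-8")
```

Rejecting the empty chunk outright, rather than silently skipping it, means that every chunk a decoder accepts contributes at least one byte of real progress.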

**Whitespace.** Similarly, the textual format for `Value`s allows
arbitrary whitespace in many positions. In streaming transfer
situations, consider optional restrictions on the amount of
consecutive whitespace that may appear in a serialized `Value`.

**Annotations.** Likewise, in modes where a `Value` is being read
while annotations are skipped, an endless sequence of annotations may
give an illusion of progress.

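**Canonical form for cryptographic hashing and signing.** As
specified, neither the textual nor the compact binary encoding rules
for `Value`s force canonical serializations. The same `Value` may be
serialized to different binary `Repr`s (for example, format B versus
format C), so a hash or signature computed over one serialization
need not match one computed over another.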
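
A small sketch makes the consequence concrete. It hashes two valid serializations of the `String` `"hello"`, one in format B and one in format C, using the lead bytes from the appendix. SHA-256 is chosen here purely for illustration; the specification does not prescribe a hash function.

```python
import hashlib

# Format B: lead byte 55 (String, length 5) followed by the UTF-8 bytes.
format_b = bytes([0x55]) + b"hello"

# Format C: Start Stream (25), two ByteString chunks, End Stream (04).
format_c = bytes([0x25, 0x62]) + b"he" + bytes([0x63]) + b"llo" + bytes([0x04])

digest_b = hashlib.sha256(format_b).hexdigest()
digest_c = hashlib.sha256(format_c).hexdigest()

# Both byte strings denote the same Value, yet the digests differ.
print(digest_b == digest_c)  # False
```

Any scheme that hashes or signs `Value`s therefore has to fix a single serialization per `Value` by some additional convention.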

## Acknowledgements

The use of low-order bits of each lead byte for the length of short
values is inspired by a similar feature of [CBOR](http://cbor.io/).

The treatment of commas as whitespace in the text syntax is inspired
by the same feature of [EDN](https://github.com/edn-format/edn).

The text syntax for `Boolean`s, `Symbol`s, and `ByteString`s is
directly inspired by [Racket](https://racket-lang.org/)'s lexical
syntax.

## Appendix. Table of lead byte values

     00 - False
     01 - True
     02 - Float
     03 - Double
     04 - End stream
     05 - Annotation
    (0x)  RESERVED 06-0F
     1x - Placeholder
     2x - Start Stream
     3x - Small integers 0..12,-3..-1

     4x - SignedInteger
     5x - String
     6x - ByteString
     7x - Symbol

     8x - Record
     9x - Sequence
     Ax - Set
     Bx - Dictionary

    (Cx)  RESERVED C0-CF
    (Dx)  RESERVED D0-DF
    (Ex)  RESERVED E0-EF
    (Fx)  RESERVED F0-FF

## Appendix. Bit fields within lead byte values

     tt nn mmmm  contents
     ---------- ---------

     00 00 0000  False
     00 00 0001  True
     00 00 0010  Float, 32 bits big-endian binary
     00 00 0011  Double, 64 bits big-endian binary
     00 00 0100  End Stream (to match a previous Start Stream)
     00 00 0101  Annotation; two more Reprs follow

     00 01 mmmm  Placeholder; m is the placeholder number

     00 10 ttnn  Start Stream <tt,nn>
                   When tt = 00 --> error
                             01 --> each chunk is a ByteString
                             10 --> each chunk is a single encoded Value
                             11 --> error (RESERVED)

     00 11 xxxx  Small integers 0..12,-3..-1

     01 00 mmmm  SignedInteger, big-endian binary
     01 01 mmmm  String, UTF-8 binary
     01 10 mmmm  ByteString
     01 11 mmmm  Symbol, UTF-8 binary

     10 00 mmmm  Record
     10 01 mmmm  Sequence
     10 10 mmmm  Set
     10 11 mmmm  Dictionary

     11 nn mmmm  error, RESERVED

Where `mmmm` appears, interpret it as an unsigned 4-bit number `m`. If
`m`<15, let `l`=`m`. Otherwise, `m`=15; let `l` be the result of
decoding the varint that follows.

Then, if `ttnn`=`0001`, `l` is the placeholder number; otherwise, `l`
is the length of the body that follows, counted in bytes for `tt`=`01`
and in `Repr`s for `tt`=`10`.
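
The two paragraphs above can be sketched as a short header reader. This is an illustrative sketch only: the function names are hypothetical, and the varint is assumed to be the base-128, least-significant-group-first encoding described at the [varint] link.

```python
def read_varint(data: bytes, pos: int):
    """Read a base-128 varint starting at pos; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value, pos
        shift += 7

def read_header(data: bytes, pos: int = 0):
    """Split a lead byte into tt, nn and l; return (tt, nn, l, unit, next_pos)."""
    lead = data[pos]
    tt, nn, m = lead >> 6, (lead >> 4) & 0b11, lead & 0x0F
    # mmmm is only a number for placeholders (ttnn = 0001) and for tt = 01 or 10.
    if not ((tt, nn) == (0, 1) or tt in (1, 2)):
        raise ValueError(f"lead byte {lead:02X} carries no mmmm field")
    pos += 1
    if m < 15:
        l = m
    else:
        l, pos = read_varint(data, pos)
    if (tt, nn) == (0, 1):
        unit = "placeholder"   # l is the placeholder number
    elif tt == 1:
        unit = "bytes"         # l counts bytes of body
    else:
        unit = "reprs"         # l counts nested Reprs
    return tt, nn, l, unit, pos

print(read_header(bytes.fromhex("5F14")))  # (1, 1, 20, 'bytes', 2)
print(read_header(bytes.fromhex("BF10")))  # (2, 3, 16, 'reprs', 2)
```

The two inputs are the headers used in the JSON examples: `5F 14` introduces the 20-byte `String` `"View from 15th Floor"`, and `BF 10` introduces a `Dictionary` of 16 `Repr`s, that is, 8 key/value pairs.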

<!-- Heading to visually offset the footnotes from the main document: -->
## Notes
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
+								---
-												Proper layouting

											
										
										
											2019-08-18 21:08:55 +00:00
+								no_site_title: true
 								title: "Preserves: an Expressive Data Language"
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
+								---
-												Trim and improve

											
										
										
											2018-09-24 11:59:22 +00:00
+								Tony Garnock-Jones <tonyg@leastfixedpoint.com>
-												Angle bracket S-exprs for Records!

											
										
										
											2019-08-11 22:54:57 +00:00
+								August 2019. Version 0.0.6.
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
 								  [sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt
 								  [spki]: http://world.std.com/~cme/html/spki.html
 								  [varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
 								  [erlang-map]: http://erlang.org/doc/reference_manual/data_types.html#map
-												WIP from the early hours of this morning, adding textual syntax

											
										
										
											2018-09-27 10:42:55 +00:00
+								  [abnf]: https://tools.ietf.org/html/rfc7405
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
-												Trim and improve

											
										
										
											2018-09-24 11:59:22 +00:00
+								This document proposes a data model and serialization format called
 								*Preserves*.
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
-												Delete misleading, incorrect, or unnecessary text

											
										
										
											2018-11-08 12:35:50 +00:00
+								Preserves supports *records* with user-defined *labels*. This relieves
 								the confusion caused by encoding records as dictionaries, seen in most
 								data languages in use on the web. It also allows Preserves to easily
 								represent the *labelled sums of products* as seen in many functional
 								programming languages.
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
-												Trim and improve

											
										
										
											2018-09-24 11:59:22 +00:00
+								Preserves also supports the usual suite of atomic and compound data
 								types, in particular including *binary* data as a distinct type from
-												Initial draft text re annotations

											
										
										
											2019-07-03 23:33:37 +00:00
+								text strings. Its *annotations* allow separation of data from metadata
-												Split out inessential text from the spec

											
										
										
											2019-08-18 16:51:26 +00:00
+								such as [comments](conventions.html#comments), trace information, and
 								provenance information.
-												Trim and improve

											
										
										
											2018-09-24 11:59:22 +00:00
-												Minor print layout tweaks, and minor content fixes

											
										
										
											2018-09-24 15:08:48 +00:00
+								Finally, Preserves defines precisely how to *compare* two values.
 								Comparison is based on the data model, not on syntax or on data
 								structures of any particular implementation language.
-												Trim and improve

											
										
										
											2018-09-24 11:59:22 +00:00
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
+								## Starting with Semantics
 								Taking inspiration from functional programming, we start with a
 								definition of the *values* that we want to work with and give them
-												Delete misleading, incorrect, or unnecessary text

											
										
										
											2018-11-08 12:35:50 +00:00
+								meaning independent of their syntax.
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
-												Minor print layout tweaks, and minor content fixes

											
										
										
											2018-09-24 15:08:48 +00:00
+								Our `Value`s fall into two broad categories: *atomic* and *compound*
 								data.
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
+								                          Value = Atom
 								                                | Compound
-												Fixes

											
										
										
											2018-09-23 21:44:43 +00:00
+								                           Atom = Boolean
 								                                | Float
 								                                | Double
 								                                | SignedInteger
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
+								                                | String
 								                                | ByteString
 								                                | Symbol
 								                       Compound = Record
 								                                | Sequence
 								                                | Set
 								                                | Dictionary
 								**Total order.**<a name="total-order"></a> As we go, we will
 								incrementally specify a total order over `Value`s. Two values of the
 								same kind are compared using kind-specific rules. The ordering among
 								values of different kinds is essentially arbitrary, but having a total
 								order is convenient for many tasks, so we define it as
 								follows:[^ordering-by-syntax]
-												Fixes

											
										
										
											2018-09-23 21:44:43 +00:00
+								            (Values)        Atom < Compound
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
 								            (Compounds)     Record < Sequence < Set < Dictionary
-												Fixes

											
										
										
											2018-09-23 21:44:43 +00:00
+								            (Atoms)         Boolean < Float < Double < SignedInteger
 								                              < String < ByteString < Symbol
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
 								  [^ordering-by-syntax]: The observant reader may note that the
-												Major revision of binary syntax: placeholders; annotations; forbid empty format-C chunks

											
										
										
											2019-07-14 02:20:22 +00:00
+								    ordering here is the same as that implied by the tagging scheme
 								    used in the concrete binary syntax for `Value`s.
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
 								**Equivalence.**<a name="equivalence"></a> Two `Value`s are equal if
 								neither is less than the other according to the total order.
 								### Signed integers.
 								A `SignedInteger` is a signed integer of arbitrary width.
-												WIP from the early hours of this morning, adding textual syntax

											
										
										
											2018-09-27 10:42:55 +00:00
+								`SignedInteger`s are compared as mathematical integers.
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
 								### Unicode strings.
 								A `String` is a sequence of Unicode
-												Minor print layout tweaks, and minor content fixes

											
										
										
											2018-09-24 15:08:48 +00:00
+								[code-point](http://www.unicode.org/glossary/#code_point)s. `String`s
 								are compared lexicographically, code-point by
-												WIP from the early hours of this morning, adding textual syntax

											
										
										
											2018-09-27 10:42:55 +00:00
+								code-point.[^utf8-is-awesome]
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
 								  [^utf8-is-awesome]: Happily, the design of UTF-8 is such that this
 								    gives the same result as a lexicographic byte-by-byte comparison
 								    of the UTF-8 encoding of a string!
 								### Binary data.
-												Delete misleading, incorrect, or unnecessary text

											
										
										
											2018-11-08 12:35:50 +00:00
+								A `ByteString` is a sequence of octets. `ByteString`s are compared
 								lexicographically.
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
-												Minor print layout tweaks, and minor content fixes

											
										
										
											2018-09-24 15:08:48 +00:00
+								### Symbols.
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
 								Programming languages like Lisp and Prolog frequently use string-like
 								values called *symbols*. Here, a `Symbol` is, like a `String`, a
-												Minor print layout tweaks, and minor content fixes

											
										
										
											2018-09-24 15:08:48 +00:00
+								sequence of Unicode code-points representing an identifier of some
-												WIP from the early hours of this morning, adding textual syntax

											
										
										
											2018-09-27 10:42:55 +00:00
+								kind. `Symbol`s are also compared lexicographically by code-point.
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
 								### Booleans.
-												Delete misleading, incorrect, or unnecessary text

											
										
										
											2018-11-08 12:35:50 +00:00
+								There are two `Boolean`s, “false” and “true”. The “false” value is
 								less-than the “true” value.
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
 								### IEEE floating-point values.
-												Delete misleading, incorrect, or unnecessary text

											
										
										
											2018-11-08 12:35:50 +00:00
+								`Float`s and `Double`s are single- and double-precision IEEE 754
 								floating-point values, respectively. `Float`s, `Double`s and
 								`SignedInteger`s are disjoint; by the rules [above](#total-order),
 								every `Float` is less than every `Double`, and every `SignedInteger`
 								is greater than both. Two `Float`s or two `Double`s are to be ordered
 								by the `totalOrder` predicate defined in section 5.10 of
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
+								[IEEE Std 754-2008](https://dx.doi.org/10.1109/IEEESTD.2008.4610935).
 								### Records.
-												Delete misleading, incorrect, or unnecessary text

											
										
										
											2018-11-08 12:35:50 +00:00
+								A `Record` is a *labelled* tuple of `Value`s, the record's *fields*. A
 								label can be any `Value`, but is usually a `Symbol`.[^extensibility]
 								[^iri-labels] `Record`s are compared lexicographically: first by
 								label, then by field sequence.
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
 								  [^extensibility]: The [Racket](https://racket-lang.org/) programming
 								    language defines
-												Tweaks; python mapping

											
										
										
											2018-09-24 17:34:07 +00:00
+								    “[prefab](http://docs.racket-lang.org/guide/define-struct.html#(part._prefab-struct))”
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
+								    structure types, which map well to our `Record`s. Racket supports
 								    record extensibility by encoding record supertypes into record
 								    labels as specially-formatted lists.
 								  [^iri-labels]: It is occasionally (but seldom) necessary to
 								    interpret such `Symbol` labels as UTF-8 encoded IRIs. Where a
 								    label can be read as a relative IRI, it is notionally interpreted
 								    with respect to the IRI
 								    `urn:uuid:6bf094a6-20f1-4887-ada7-46834a9b5b34`; where a label can
 								    be read as an absolute IRI, it stands for that IRI; and otherwise,
 								    it cannot be read as an IRI at all, and so the label simply stands
-												Trim and improve

											
										
										
											2018-09-24 11:59:22 +00:00
+								    for itself—for its own `Value`.
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
 								### Sequences.
-												Delete misleading, incorrect, or unnecessary text

											
										
										
											2018-11-08 12:35:50 +00:00
+								A `Sequence` is a sequence of `Value`s. `Sequence`s are compared
 								lexicographically.
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
 								### Sets.
 								A `Set` is an unordered finite set of `Value`s. It contains no
 								duplicate values, following the [equivalence relation](#equivalence)
 								induced by the total order on `Value`s. Two `Set`s are compared by
-												Trim and improve

											
										
										
											2018-09-24 11:59:22 +00:00
+								sorting their elements ascending using the [total order](#total-order)
-												WIP from the early hours of this morning, adding textual syntax

											
										
										
											2018-09-27 10:42:55 +00:00
+								and comparing the resulting `Sequence`s.
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
-												Trim and improve

											
										
										
											2018-09-24 11:59:22 +00:00
+								### Dictionaries.
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
-												Trim and improve

											
										
										
											2018-09-24 11:59:22 +00:00
+								A `Dictionary` is an unordered finite collection of pairs of `Value`s.
-												Delete misleading, incorrect, or unnecessary text

											
										
										
											2018-11-08 12:35:50 +00:00
+								Each pair comprises a *key* and a *value*. Keys in a `Dictionary` are
 								pairwise distinct. Instances of `Dictionary` are compared by
-												Trim and improve

											
										
										
											2018-09-24 11:59:22 +00:00
+								lexicographic comparison of the sequences resulting from ordering each
-												WIP from the early hours of this morning, adding textual syntax

											
										
										
											2018-09-27 10:42:55 +00:00
+								`Dictionary`'s pairs in ascending order by key.
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
-												WIP from the early hours of this morning, adding textual syntax

											
										
										
											2018-09-27 10:42:55 +00:00
+								## Textual Syntax
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
 								Now we have discussed `Value`s and their meanings, we may turn to
 								techniques for *representing* `Value`s for communication or storage.
-												WIP from the early hours of this morning, adding textual syntax

											
										
										
											2018-09-27 10:42:55 +00:00
+								In this section, we use [case-sensitive ABNF][abnf] to define a
 								textual syntax that is easy for people to read and
 								write.[^json-superset] Most of the examples in this document are
 								written using this syntax. In the following section, we will define an
 								equivalent compact machine-readable syntax.
 								  [^json-superset]: The grammar of the textual syntax is a superset of
 								    JSON, with the slightly unusual feature that `true`, `false`, and
 								    `null` are all read as `Symbol`s, and that `SignedInteger`s are
 								    never read as `Double`s.
-												Cosmetic.

											
										
										
											2019-07-03 23:35:56 +00:00
+								### Character set.
-												WIP from the early hours of this morning, adding textual syntax

											
										
										
											2018-09-27 10:42:55 +00:00
 								[ABNF][abnf] allows easy definition of US-ASCII-based languages.
 								However, Preserves is a Unicode-based language. Therefore, we
 								reinterpret ABNF as a grammar for recognising sequences of Unicode
 								code points.
 								Textual syntax for a `Value` *SHOULD* be encoded using UTF-8 where
 								possible.
-												Cosmetic.

											
										
										
											2019-07-03 23:35:56 +00:00
+								### Whitespace.
-												WIP from the early hours of this morning, adding textual syntax

											
										
										
											2018-09-27 10:42:55 +00:00
 								Whitespace is defined as any number of spaces, tabs, carriage returns,
-												Remove comments, in prep for annotations replacing them

											
										
										
											2019-07-01 20:31:49 +00:00
+								line feeds, or commas.
-												WIP from the early hours of this morning, adding textual syntax

											
										
										
											2018-09-27 10:42:55 +00:00
-												Remove comments, in prep for annotations replacing them

											
										
										
											2019-07-01 20:31:49 +00:00
+								                ws = *(%x20 / %x09 / newline / ",")
-												WIP from the early hours of this morning, adding textual syntax

											
										
										
											2018-09-27 10:42:55 +00:00
+								           newline = CR / LF
-												Cosmetic.

											
										
										
											2019-07-03 23:35:56 +00:00
+								### Grammar.
-												WIP from the early hours of this morning, adding textual syntax

											
										
										
											2018-09-27 10:42:55 +00:00
-												Delete misleading, incorrect, or unnecessary text

											
										
										
											2018-11-08 12:35:50 +00:00
+								Standalone documents may have trailing whitespace.
-												WIP from the early hours of this morning, adding textual syntax

											
										
										
											2018-09-27 10:42:55 +00:00
 								          Document = Value ws
 								Any `Value` may be preceded by whitespace.
 								             Value = ws (Record / Collection / Atom / Compact)
 								        Collection = Sequence / Dictionary / Set
 								              Atom = Boolean / Float / Double / SignedInteger /
 								                     String / ByteString / Symbol
-												Angle bracket S-exprs for Records!

											
										
										
											2019-08-11 22:54:57 +00:00
+								Each `Record` is an angle-bracket enclosed grouping of its
 								label-`Value` followed by its field-`Value`s.
-												WIP from the early hours of this morning, adding textual syntax

											
										
										
											2018-09-27 10:42:55 +00:00
-												Angle bracket S-exprs for Records!

											
										
										
											2019-08-11 22:54:57 +00:00
+								            Record = "<" Value *Value ws ">"
-												WIP from the early hours of this morning, adding textual syntax

											
										
										
											2018-09-27 10:42:55 +00:00
 								`Sequence`s are enclosed in square brackets. `Dictionary` values are
 								curly-brace-enclosed colon-separated pairs of values. `Set`s are
-												Delete misleading, incorrect, or unnecessary text

											
										
										
											2018-11-08 12:35:50 +00:00
+								written either as one or more values enclosed in curly braces, or zero
 								or more values enclosed by the tokens `#set{` and
-												Clarify no-duplicates in syntaxes.

											
										
										
											2019-08-18 12:56:13 +00:00
+								`}`.[^printing-collections] It is an error for a set to contain
 								duplicate elements or for a dictionary to contain duplicate keys.
-												WIP from the early hours of this morning, adding textual syntax

											
										
										
											2018-09-27 10:42:55 +00:00
 								          Sequence = "[" *Value ws "]"
 								        Dictionary = "{" *(Value ws ":" Value) ws "}"
 								               Set = %s"#set{" *Value ws "}" / "{" 1*Value ws "}"
-												More TODOs in the text; initial textual reader in Racket

											
										
										
											2018-09-27 18:25:28 +00:00
+								  [^printing-collections]: **Implementation note.** When implementing
 								    printing of `Value`s using the textual syntax, consider supporting
 								    (a) optional pretty-printing with indentation, (b) optional
 								    JSON-compatible print mode for that subset of `Value` that is
 								    compatible with JSON, and (c) optional submodes for no commas,
 								    commas separating, and commas terminating elements or key/value
 								    pairs within a collection.
-												WIP from the early hours of this morning, adding textual syntax

											
										
										
											2018-09-27 10:42:55 +00:00
+								`Boolean`s are the simple literal strings `#true` and `#false`.
 								           Boolean = %s"#true" / %s"#false"
 								Numeric data follow the
 								[JSON grammar](https://tools.ietf.org/html/rfc8259#section-6), with
-												Fancy quotes

											
										
										
											2019-08-18 15:51:59 +00:00
+								the addition of a trailing “f” distinguishing `Float` from `Double`
-												WIP from the early hours of this morning, adding textual syntax

											
										
										
											2018-09-27 10:42:55 +00:00
+								values. `Float`s and `Double`s always have either a fractional part or
-												Handle a couple of TODOs

											
										
										
											2018-09-27 12:34:32 +00:00
+								an exponent part, where `SignedInteger`s never have
 								either.[^reading-and-writing-floats-accurately]
 								[^arbitrary-precision-signedinteger]
-												WIP from the early hours of this morning, adding textual syntax

											
										
										
											2018-09-27 10:42:55 +00:00
 								             Float = flt %i"f"
 								            Double = flt
 								     SignedInteger = int
 								          digit1-9 = %x31-39
 								               nat = %x30 / ( digit1-9 *DIGIT )
 								               int = ["-"] nat
 								              frac = "." 1*DIGIT
 								               exp = %i"e" ["-"/"+"] 1*DIGIT
 								               flt = int (frac exp / frac / exp)
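As a rough illustration of the grammar above, the following Python sketch classifies a numeric token as a `Float`, `Double`, or `SignedInteger`. The regular expression and the name `classify_number` are assumptions made for this example and are not part of the specification.

```python
import re

# Transcription of the ABNF above: int, optional frac, optional exp, optional "f".
_NUMBER = re.compile(
    r"-?(?:0|[1-9][0-9]*)"        # int  = ["-"] nat
    r"(?P<frac>\.[0-9]+)?"        # frac = "." 1*DIGIT
    r"(?P<exp>[eE][-+]?[0-9]+)?"  # exp  = %i"e" ["-"/"+"] 1*DIGIT
    r"(?P<f>[fF])?"               # trailing %i"f" marks a Float
)


def classify_number(token: str) -> str:
    """Return "Float", "Double" or "SignedInteger" for a numeric token."""
    m = _NUMBER.fullmatch(token)
    if m is None:
        raise ValueError(f"not a numeric token: {token!r}")
    has_frac_or_exp = m.group("frac") is not None or m.group("exp") is not None
    if m.group("f") is not None:
        # Float = flt %i"f", and flt requires a fractional or exponent part.
        if not has_frac_or_exp:
            raise ValueError(f"Float needs a fraction or exponent: {token!r}")
        return "Float"
    return "Double" if has_frac_or_exp else "SignedInteger"
```

Note that a token such as `1f` is rejected here, because `flt` requires a fractional part or an exponent part before the trailing “f”.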
 								  [^reading-and-writing-floats-accurately]: **Implementation note.**
 								    Your language's standard library likely has a good routine for
 								    converting between decimal notation and IEEE 754 floating-point.
 								    However, if not, or if you are interested in the challenges of
 								    accurately reading and writing floating point numbers, see the
 								    excellent matched pair of 1990 papers by Clinger and Steele &
 								    White, and a recent follow-up by Jaffer:
 								    Clinger, William D. ‘How to Read Floating Point Numbers
 								    Accurately’. In Proc. PLDI. White Plains, New York, 1990.
 								    <https://doi.org/10.1145/93542.93557>.
 								    Steele, Guy L., Jr., and Jon L. White. ‘How to Print
 								    Floating-Point Numbers Accurately’. In Proc. PLDI. White Plains,
 								    New York, 1990. <https://doi.org/10.1145/93542.93559>.
 								    Jaffer, Aubrey. ‘Easy Accurate Reading and Writing of
 								    Floating-Point Numbers’. ArXiv:1310.8121 [Cs], 27 October 2013.
 								    <http://arxiv.org/abs/1310.8121>.
 								  [^arbitrary-precision-signedinteger]: **Implementation note.** Be
 								    aware when implementing reading and writing of `SignedInteger`s
 								    that the data model *requires* arbitrary-precision integers. Your
 								    I/O routines must not truncate precision either when reading or
 								    writing a `SignedInteger`.
 								`String`s are,
 								[as in JSON](https://tools.ietf.org/html/rfc8259#section-7), possibly
 								escaped text surrounded by double quotes. The escaping rules are the
 								same as for JSON.[^string-json-correspondence] [^escaping-surrogate-pairs]
 								            String = %x22 *char %x22
 								              char = unescaped / %x7C / escape (escaped / %x22 / %s"u" 4HEXDIG)
 								         unescaped = %x20-21 / %x23-5B / %x5D-7B / %x7D-10FFFF
 								            escape = %x5C              ; \
 								           escaped = ( %x5C /          ; \    reverse solidus U+005C
 								                       %x2F /          ; /    solidus         U+002F
 								                       %x62 /          ; b    backspace       U+0008
 								                       %x66 /          ; f    form feed       U+000C
 								                       %x6E /          ; n    line feed       U+000A
 								                       %x72 /          ; r    carriage return U+000D
 								                       %x74 )          ; t    tab             U+0009
 								  [^string-json-correspondence]: The grammar for `String` has the same
 								    effect as the
 								    [JSON](https://tools.ietf.org/html/rfc8259#section-7) grammar for
 								    `string`. Some auxiliary definitions (e.g. `escaped`) are lifted
 								    largely unmodified from the text of RFC 8259.
 								  [^escaping-surrogate-pairs]: In particular, note JSON's rules around
 								    the use of surrogate pairs for code points not in the Basic
 								    Multilingual Plane. We encourage implementations to avoid escaping
 								    such characters when producing output, and instead to rely on the
 								    UTF-8 encoding of the entire document to handle them correctly.
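Because the `String` syntax matches JSON's, an existing JSON string printer can serve as a starting point. The sketch below uses Python's standard `json.dumps`; the wrapper name `format_string` is an assumption for illustration. Passing `ensure_ascii=False` leaves code points outside the Basic Multilingual Plane unescaped, in line with the recommendation above.

```python
import json


def format_string(text: str) -> str:
    """Print a Python str using JSON string syntax.

    ensure_ascii=False keeps non-ASCII code points as-is instead of
    emitting \\uXXXX escapes (and surrogate pairs) for them.
    """
    return json.dumps(text, ensure_ascii=False)
```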
 								A `ByteString` may be written in any of three different forms.
 								The first is similar to a `String`, but prepended with a hash sign
 								`#`. In addition, only Unicode code points overlapping with printable
 								7-bit ASCII are permitted unescaped inside such a `ByteString`; other
 								byte values must be escaped by prepending a two-digit hexadecimal
 								value with `\x`.
 								        ByteString = "#" %x22 *binchar %x22
 								           binchar = binunescaped / escape (escaped / %x22 / %s"x" 2HEXDIG)
 								      binunescaped = %x20-21 / %x23-5B / %x5D-7E
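A minimal sketch of a printer for this first form follows. The function name `format_bytestring` is hypothetical, and the sketch assumes that the named escapes (`\b`, `\f`, `\n`, `\r`, `\t`) denote the corresponding ASCII control bytes, as they do for `String`s.

```python
# Bytes that have a named escape in the grammar (assumed to map as in String).
_NAMED_ESCAPES = {
    0x5C: "\\\\",  # reverse solidus
    0x22: '\\"',   # double quote
    0x08: "\\b",
    0x0C: "\\f",
    0x0A: "\\n",
    0x0D: "\\r",
    0x09: "\\t",
}


def format_bytestring(data: bytes) -> str:
    """Print bytes in the string-like ByteString form."""
    out = ['#"']
    for b in data:
        if b in _NAMED_ESCAPES:
            out.append(_NAMED_ESCAPES[b])
        elif 0x20 <= b <= 0x7E:
            # binunescaped: printable ASCII other than the double quote and
            # the reverse solidus, both handled by the branch above.
            out.append(chr(b))
        else:
            out.append(f"\\x{b:02x}")  # two-digit hexadecimal escape
    out.append('"')
    return "".join(out)
```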
 								The second is as a sequence of pairs of hexadecimal digits interleaved
 								with whitespace and surrounded by `#hex{` and `}`.
 								       ByteString =/ %s"#hex{" *(ws / 2HEXDIG) ws "}"
 								The third is as a sequence of
 								[Base64](https://tools.ietf.org/html/rfc4648) characters, interleaved
 								with whitespace and surrounded by `#base64{` and `}`. Plain and
 								URL-safe Base64 characters are allowed.
 								       ByteString =/ %s"#base64{" *(ws / base64char) ws "}"
 								        base64char = %x41-5A / %x61-7A / %x30-39 / "+" / "/" / "-" / "_" / "="
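The following sketch decodes the text between the braces of the `#hex{...}` and `#base64{...}` forms. The function names are hypothetical. It treats only ASCII whitespace as `ws`, and it tolerates missing `=` padding in Base64 input; both are assumptions of this example rather than statements of the specification.

```python
import base64


def parse_hex_body(body: str) -> bytes:
    """Decode the text between `#hex{` and `}`.

    bytes.fromhex skips ASCII whitespace between byte pairs and rejects
    whitespace that splits a pair, matching *(ws / 2HEXDIG).
    """
    return bytes.fromhex(body)


def parse_base64_body(body: str) -> bytes:
    """Decode the text between `#base64{` and `}`."""
    compact = "".join(body.split())                       # drop whitespace
    compact = compact.replace("-", "+").replace("_", "/")  # URL-safe to plain
    compact = compact.rstrip("=")
    compact += "=" * (-len(compact) % 4)                   # restore padding
    return base64.b64decode(compact, validate=True)
```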
 								A `Symbol` may be written in a “bare” form[^cf-sexp-token] so long as
 								it conforms to certain restrictions on the characters appearing in the
 								symbol. Alternatively, it may be written in a quoted form. The quoted
 								form is much the same as the syntax for `String`s, including embedded
 								escape syntax, except using a bar or pipe character (`|`) instead of a
 								double quote mark.
 								            Symbol = symstart *symcont / "|" *symchar "|"
 								          symstart = ALPHA / sympunct / symustart
 								           symcont = ALPHA / sympunct / symustart / symucont / DIGIT / "-"
 								          sympunct = "~" / "!" / "$" / "%" / "^" / "&" / "*" /
 								                     "?" / "_" / "=" / "+" / "/" / "."
 								           symchar = unescaped / %x22 / escape (escaped / %x7C / %s"u" 4HEXDIG)
 								         symustart = <any code point greater than 127 whose Unicode
 								                      category is Lu, Ll, Lt, Lm, Lo, Mn, Mc, Me,
 								                      Pc, Po, Sc, Sm, Sk, So, or Co>
 								          symucont = <any code point greater than 127 whose Unicode
 								                      category is Nd, Nl, No, or Pd>
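A printer has to decide whether a given `Symbol` can be written bare or needs the quoted `|...|` form. The sketch below transcribes `symstart` and `symcont` using Python's `unicodedata` module; the helper names are hypothetical, and the Unicode categories reported depend on the Unicode version shipped with the Python runtime.

```python
import unicodedata

_SYMPUNCT = set("~!$%^&*?_=+/.")
_USTART = {"Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Mc", "Me",
           "Pc", "Po", "Sc", "Sm", "Sk", "So", "Co"}
_UCONT = {"Nd", "Nl", "No", "Pd"}


def _is_symstart(ch: str) -> bool:
    # symstart = ALPHA / sympunct / symustart
    if ("A" <= ch <= "Z") or ("a" <= ch <= "z") or ch in _SYMPUNCT:
        return True
    return ord(ch) > 127 and unicodedata.category(ch) in _USTART


def _is_symcont(ch: str) -> bool:
    # symcont = ALPHA / sympunct / symustart / symucont / DIGIT / "-"
    if _is_symstart(ch) or ch == "-" or ("0" <= ch <= "9"):
        return True
    return ord(ch) > 127 and unicodedata.category(ch) in _UCONT


def is_bare_symbol(name: str) -> bool:
    """True if `name` matches symstart *symcont and so needs no quoting."""
    return (bool(name)
            and _is_symstart(name[0])
            and all(_is_symcont(ch) for ch in name[1:]))
```

A symbol for which `is_bare_symbol` returns false, such as one containing a space or starting with a digit, would be written in the quoted form instead.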
 								  [^cf-sexp-token]: Compare with the [SPKI S-expression][sexp.txt]
 								    definition of “token representation”, and with the
 								    [R6RS definition of identifiers](http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-7.html#node_sec_4.2.4).
 								Finally, any `Value` may be represented by escaping from the textual
 								syntax to the [compact binary syntax](#compact-binary-syntax) by
 								prefixing a `ByteString` containing the binary representation of the
 								`Value` with `#value`.[^rationale-switch-to-binary] [^no-literal-binary-in-text]
 								           Compact = %s"#value" ws ByteString
 								  [^rationale-switch-to-binary]: **Rationale.** The textual syntax
 								    cannot express every `Value`: specifically, it cannot express the
 								    several million floating-point NaNs, or the two floating-point
 								    Infinities. Since the compact binary format for `Value`s expresses
 								    each `Value` with precision, embedding binary `Value`s solves the
 								    problem.
 								  [^no-literal-binary-in-text]: Every text is ultimately physically
 								    stored as bytes; therefore, it might seem possible to escape to
 								    the raw binary form of compact binary encoding from within a
 								    piece of textual syntax. However, while bytes must be involved in
 								    any *representation* of text, the text *itself* is logically a
 								    sequence of *code points* and is not *intrinsically* a binary
 								    structure at all. It would be incoherent to expect to be able to
 								    access the representation of the text from within the text itself.
 								### Annotations.
 								**Syntax.** When written down, a `Value` may have an associated
 								sequence of *annotations* carrying “out-of-band” contextual metadata
 								about the value. Each annotation is, in turn, a `Value`, and may
 								itself have annotations.
 								            Value =/ ws "@" Value Value
 								Each annotation is preceded by `@`; the underlying annotated value
 								follows its annotations. Here we extend only the syntactic nonterminal
 								named “`Value`” without altering the semantic class of `Value`s.
 								**Equivalence.** Annotations appear within syntax denoting a `Value`;
 								however, the annotations are not part of the denoted value. They are
 								only part of the syntax. Annotations do not play a part in
 								equivalences and orderings of `Value`s.
 								Reflective tools such as debuggers, user interfaces, and message
 								routers and relays---tools which process `Value`s generically---may
 								use annotated inputs to tailor their operation, or may insert
 								annotations in their outputs. By contrast, in ordinary programs, as a
 								rule of thumb, the presence, absence or content of an annotation
 								should not change the control flow or output of the program.
 								Annotations are data *describing* `Value`s, and are not in the domain
 								of any specific application of `Value`s. That is, an annotation will
 								almost never cause a non-reflective program to do anything observably
 								different.
 								## Compact Binary Syntax
 								A `Repr` is a binary-syntax encoding, or representation, of either
 								 - a `Value`,
 								 - a “placeholder” for a `Value`, or
 								 - an annotation on a `Repr`.
 								Each `Repr` comprises one or more bytes describing the kind of
 								represented information and the length of the representation, followed
 								by the encoded details.
 								For a value `v`, we write `[[v]]` for the `Repr` of `v`.
 								### Type and Length representation.
 								Each `Repr` takes one of three possible forms:
 								 - (A) a type-specific form, used for simple values such as `Boolean`s
 								   or `Float`s, for placeholders, and for introducing annotations.
 								 - (B) a variable-length form with length specified up-front, used for
 								   compound and variable-length atomic data structures when their
 								   sizes are known at the time serialization begins.
 								 - (C) a variable-length streaming form with unknown or unpredictable
 								   length, used in cases when serialization begins before the number
 								   of elements or bytes in the corresponding `Value` is known.
 								Applications may choose between formats B and C depending on their
 								needs at serialization time.
 								#### The lead byte.
 								Every `Repr` starts with a *lead byte*, constructed by
 								`leadbyte(t,n,m)`, where `t`,`n`∈{0,1,2,3} and 0≤`m`<16:
 								    leadbyte(t,n,m) = [t*64 + n*16 + m]
 								The arguments `t`, `n` and `m` describe the rest of the
 								representation.[^some-encodings-unused]
 								  [^some-encodings-unused]: Some encodings are unused. All such
 								    encodings are reserved for future versions of this specification.
 								| `t` | `n` | `m` | Meaning |
 								| --- | --- | --- | ------- |
 								|  0  |  0  | 0–3 | (format A) An `Atom` with fixed-length binary representation |
 								|  0  |  0  | 4   | (format C) Stream end |
 								|  0  |  0  | 5   | (format A) Annotation |
 								|  0  |  1  |     | (format A) Placeholder for an application-specific `Value` |
 								|  0  |  2  |     | (format C) Stream start |
 								|  0  |  3  |     | (format A) Certain small `SignedInteger`s |
 								|  1  |     |     | (format B) An `Atom` with variable-length binary representation |
 								|  2  |     |     | (format B) A `Compound` with variable-length representation |
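Read in the decoding direction, the table above amounts to splitting the lead byte into its three bit fields and dispatching on them. The sketch below uses hypothetical function names, and it labels every combination not listed in the table, including all lead bytes with `t`=3, as reserved.

```python
def split_lead_byte(b: int) -> tuple:
    """Invert leadbyte(t,n,m) = t*64 + n*16 + m."""
    if not 0 <= b <= 255:
        raise ValueError("lead byte out of range")
    return b >> 6, (b >> 4) & 3, b & 15


def describe_lead_byte(b: int) -> str:
    t, n, m = split_lead_byte(b)
    if t == 0:
        if n == 0:
            if m <= 3:
                return "format A: fixed-length Atom"
            if m == 4:
                return "format C: stream end"
            if m == 5:
                return "format A: annotation"
            return "reserved"
        if n == 1:
            return "format A: placeholder"
        if n == 2:
            return "format C: stream start"
        return "format A: small SignedInteger"
    if t == 1:
        return "format B: variable-length Atom"
    if t == 2:
        return "format B: Compound"
    return "reserved"
```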
 								#### Encoding data of type-specific length (format A).
 								Each type of data defines its own rules for this format.
 								#### Encoding data of known length (format B).
 								Format B is used where the length `l` of the `Value` to be encoded is
 								known when serialization begins. Format B `Repr`s use `m` in
 								`leadbyte` to encode `l`. The length counts *bytes* for atomic
 								`Value`s, but counts *contained values* for compound `Value`s.
 								 - A length `l` between 0 and 14 is represented using `leadbyte` with
 								   `m=l`.
 								 - A length of 15 or greater is represented by `m=15` and additional
 								   bytes describing the length following the lead byte.
 								The function `header(t,n,m)` yields an appropriate sequence of bytes
 								describing a `Repr`'s type and length when `t`, `n` and `m` are
 								appropriate non-negative integers:
 								    header(t,n,m) =    leadbyte(t,n,m)                 when m < 15
 								                    or leadbyte(t,n,15) ++ varint(m)   otherwise
 								The additional length bytes are formatted as
 								[base 128 varints][varint]. We write `varint(m)` for the
 								varint-encoding of `m`. Quoting the [Google Protocol Buffers][varint]
 								definition,
 								> Each byte in a varint, except the last byte, has the most
 								> significant bit (msb) set – this indicates that there are further
 								> bytes to come. The lower 7 bits of each byte are used to store the
 								> two's complement representation of the number in groups of 7 bits,
 								> least significant group first.
 								The following table illustrates varint-encoding.
 								| Number, `m` | `m` in binary, grouped into 7-bit chunks | `varint(m)` bytes |
 								| ------ | ------------------- | ------------ |
 								| 15 | `0001111` | 15 |
 								| 300 | `0000010 0101100` | 172 2 |
 								| 1000000000 | `0000011 1011100 1101011 0010100 0000000` | 128 148 235 220 3 |
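The definitions of `leadbyte`, `varint` and `header` translate almost directly into code. The following Python sketch is one possible transcription, not a reference implementation; it returns `bytes` objects and raises `ValueError` for arguments outside the ranges given above.

```python
def leadbyte(t: int, n: int, m: int) -> bytes:
    """leadbyte(t,n,m) = [t*64 + n*16 + m]"""
    if not (0 <= t < 4 and 0 <= n < 4 and 0 <= m < 16):
        raise ValueError("leadbyte argument out of range")
    return bytes([t * 64 + n * 16 + m])


def varint(m: int) -> bytes:
    """Base 128 varint: 7-bit groups, least significant group first."""
    if m < 0:
        raise ValueError("varint requires a non-negative integer")
    out = bytearray()
    while True:
        group = m & 0x7F
        m >>= 7
        if m:
            out.append(group | 0x80)  # msb set: more bytes follow
        else:
            out.append(group)         # last byte: msb clear
            return bytes(out)


def header(t: int, n: int, m: int) -> bytes:
    """Lead byte alone when m < 15, else lead byte with m=15 plus a varint."""
    if m < 15:
        return leadbyte(t, n, m)
    return leadbyte(t, n, 15) + varint(m)
```

For example, `varint(300)` yields the two bytes 172 and 2, matching the table above.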
 								#### Streaming data of unknown length (format C).
 								A `Repr` where the length of the `Value` to be encoded is variable and
 								not known at the time serialization of the `Value` starts is encoded
 								by a single Stream Start (“open”) byte, followed by zero or more
 								*chunks*, followed by a matching Stream End (“close”) byte:
 								     open(t,n) = leadbyte(0,2, t*4 + n) = [0x20 + t*4 + n]
 								       close() = leadbyte(0,0, 4)       = [0x04]
 								For a format C `Repr` of an atomic `Value`, each chunk is to be a
 								format B `Repr` of a `ByteString`, no matter the type of the overall
 								`Value`. Annotations are not allowed on these individual chunks.
 								For a format C `Repr` of a compound `Value`, each chunk is to be a
 								single `Repr`, which may itself be annotated.
 								Each chunk within a format C `Repr` *MUST* have non-zero length.
 								Software that decodes `Repr`s *MUST* reject `Repr`s that include
 								zero-length chunks.
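As an illustration of format C for compound data, the sketch below wraps an iterable of already-encoded chunks between the open and close bytes. The function names are hypothetical. Because the chunks may come from a generator, the number of elements need not be known up front, and an empty chunk is rejected as required above.

```python
from typing import Iterable


def open_stream(t: int, n: int) -> bytes:
    """open(t,n) = leadbyte(0,2, t*4 + n) = [0x20 + t*4 + n]"""
    if not (0 <= t < 4 and 0 <= n < 4):
        raise ValueError("open argument out of range")
    return bytes([0x20 + t * 4 + n])


def close_stream() -> bytes:
    """close() = leadbyte(0,0,4) = [0x04]"""
    return bytes([0x04])


def stream_compound(n: int, chunks: Iterable[bytes]) -> bytes:
    """Format C Repr of a compound Value (t=2) with subtype n.

    Each chunk is assumed to be one complete, already-encoded Repr.
    """
    out = bytearray(open_stream(2, n))
    for chunk in chunks:
        if len(chunk) == 0:
            raise ValueError("zero-length chunk in format C Repr")
        out += chunk
    out += close_stream()
    return bytes(out)
```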
 								### Records.
 								Format B (known length):
 								    [[ <L F_1...F_m> ]] = header(2,0,m+1) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]]
 								For `m` fields, `m+1` is supplied to `header`, to account for the
 								encoding of the record label.
 								Format C (streaming):
 								    [[ <L F_1...F_m> ]] = open(2,0) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]] ++ close()
 								Applications *SHOULD* prefer the known-length format for encoding
 								`Record`s.
 								### Placeholders.
 								Applications may define an interpretation for numbered *placeholders*
 								in the binary syntax, mapping each *placeholder number* `n` to a
 								specific `Value`. For example, a placeholder number may be assigned
 								for a frequently-used `Record` label.
 								A `Value` `v` for which placeholder number `n` has been assigned may
 								be tersely encoded as
 								    [[v]] = header(0,1,n)  when n is a placeholder number for v
 								**Examples.** A protocol may choose to assign placeholder
 								number 4 to the symbol `void`, making
 								    [[void]] = header(0,1,4) = [0x14]
 								    [[<void>]] = header(2,0,1) ++ [[void]] = [0x81, 0x14]
 								or it may map symbol `person` to placeholder number 102, making
 								    [[person]] = header(0,1,102) = [0x1F, 0x66]
 								and so
 								    [[<person "Dr" "Elizabeth" "Blackwell">]]
 								      = header(2,0,4) ++ [[person]] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]
 								      =          [0x84, 0x1F, 0x66] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]
 								for format B, or
 								    open(2,0) ++ [[person]] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ close()
 								       = [0x28, 0x1F, 0x66] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ [0x04]
-												Progress

											
										
										
											2018-09-23 21:35:00 +00:00
 								for format C.
-												preserve.md based on codec.md which I'm about to check in
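
The byte values in this worked example can be reproduced with a short Python sketch. The helper names (`varint`, `header`, `open_stream`, `close_stream`, `encode_string`) are hypothetical, and the bit layout is an assumption inferred from the examples in this document: a lead byte of the form `tt nn mmmm`, where `mmmm = 15` signals that a base-128 varint length follows.

```python
def varint(n: int) -> bytes:
    """Base-128 varint, least-significant group first (assumed length format)."""
    out = bytearray()
    while True:
        low, n = n & 0x7F, n >> 7
        if n == 0:
            out.append(low)
            return bytes(out)
        out.append(low | 0x80)


def header(t: int, n: int, m: int) -> bytes:
    """Lead byte `tt nn mmmm`; values of m >= 15 spill into a varint."""
    lead = (t << 6) | (n << 4)
    if m < 15:
        return bytes([lead | m])
    return bytes([lead | 15]) + varint(m)


def open_stream(t: int, n: int) -> bytes:
    """Start Stream lead byte, `00 10 ttnn`."""
    return bytes([0x20 | (t << 2) | n])


def close_stream() -> bytes:
    """End Stream lead byte."""
    return b"\x04"


def encode_string(s: str) -> bytes:
    """Format B String: header(1,1,m) followed by the UTF-8 payload."""
    payload = s.encode("utf-8")
    return header(1, 1, len(payload)) + payload


# Placeholder 102 stands in for the symbol `person`, as in the text.
person = header(0, 1, 102)
fields = b"".join(encode_string(s) for s in ("Dr", "Elizabeth", "Blackwell"))

format_b = header(2, 0, 4) + person + fields
format_c = open_stream(2, 0) + person + fields + close_stream()

print(format_b.hex(" "))
print(format_c.hex(" "))
```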

											
										
										
### Sequences, Sets and Dictionaries.

Format B (known length):

            [[ [X_1...X_m] ]] = header(2,1,m)   ++ [[X_1]] ++...++ [[X_m]]
        [[ #set{E_1...E_m} ]] = header(2,2,m)   ++ [[E_1]] ++...++ [[E_m]]
    [[ {K_1:V_1...K_m:V_m} ]] = header(2,3,m*2) ++ [[K_1]] ++ [[V_1]] ++...
                                                ++ [[K_m]] ++ [[V_m]]

Note that `m*2` is given to `header` for a `Dictionary`, since there
are two `Value`s in each key-value pair.

Format C (streaming):

            [[ [X_1...X_m] ]] = open(2,1) ++ [[X_1]] ++...++ [[X_m]] ++ close()
        [[ #set{E_1...E_m} ]] = open(2,2) ++ [[E_1]] ++...++ [[E_m]] ++ close()
    [[ {K_1:V_1...K_m:V_m} ]] = open(2,3) ++ [[K_1]] ++ [[V_1]] ++...
                                          ++ [[K_m]] ++ [[V_m]] ++ close()

Applications may use whichever format suits their needs on a
case-by-case basis.

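
As a rough illustration of the two formats, the following Python sketch assembles already-encoded elements into a `Sequence` or `Dictionary`. All function names are hypothetical, the lead-byte layout is inferred from the examples in this document, and lengths of 15 or more (which need a varint) are left out for brevity.

```python
def header(t: int, n: int, m: int) -> bytes:
    """Lead byte `tt nn mmmm`; this sketch only handles m < 15."""
    if m >= 15:
        raise NotImplementedError("longer lengths need a varint; omitted here")
    return bytes([(t << 6) | (n << 4) | m])


def open_stream(t: int, n: int) -> bytes:
    return bytes([0x20 | (t << 2) | n])


CLOSE = b"\x04"


def small_int(x: int) -> bytes:
    """Format A encoding for integers in the range [-3, 12]."""
    if not -3 <= x <= 12:
        raise ValueError("only small integers are handled in this sketch")
    return bytes([0x30 | (x + 16 if x < 0 else x)])


def sequence_b(items: list) -> bytes:
    """Format B: the element count is known up front."""
    return header(2, 1, len(items)) + b"".join(items)


def sequence_c(items) -> bytes:
    """Format C: elements are emitted as they arrive, then closed."""
    return open_stream(2, 1) + b"".join(items) + CLOSE


def dictionary_b(pairs: list) -> bytes:
    """Format B dictionary: header counts keys and values separately (m*2)."""
    body = b"".join(key + value for key, value in pairs)
    return header(2, 3, 2 * len(pairs)) + body


one_to_four = [small_int(i) for i in (1, 2, 3, 4)]
print(sequence_b(one_to_four).hex(" "))
print(sequence_c(iter(one_to_four)).hex(" "))
print(dictionary_b([(small_int(1), small_int(2))]).hex(" "))
```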

											
										
										
There is *no* ordering requirement on the `E_i` elements or
`K_i`/`V_i` pairs.[^no-sorting-rationale] They may appear in any
order. However, the `E_i` and `K_i` *MUST* be pairwise distinct.

  [^no-sorting-rationale]: In the BitTorrent encoding format,
    [bencoding](http://www.bittorrent.org/beps/bep_0003.html#bencoding),
    dictionary key/value pairs must be sorted by key. This is a
    necessary step for ensuring serialization of `Value`s is
    canonical. We do not require that key/value pairs (or set
    elements) be in sorted order for serialized `Value`s, because (a)
    where canonicalization is used for cryptographic signatures, it is
    more reliable to simply retain the exact binary form of the signed
    document than to depend on canonical de- and re-serialization, and
    (b) sorting keys or elements makes no sense in streaming
    serialization formats.

    However, a quality implementation may wish to offer the programmer
    the option of serializing with set elements and dictionary keys in
    sorted order.

											
										
										
### SignedIntegers.

Format B/A (known length/fixed-size):

    [[ x ]] when x ∈ SignedInteger = header(1,0,m) ++ intbytes(x)  if x<-3 ∨ 13≤x
                                     header(0,3,x+16)              if -3≤x<0
                                     header(0,3,x)                 if 0≤x<13

Integers in the range [-3,12] are compactly represented using format A
because they are so frequently used. Other integers are represented
using format B.

Format C *MUST NOT* be used for `SignedInteger`s.

The function `intbytes(x)` gives the big-endian two's-complement
binary representation of `x`, taking exactly as many whole bytes as
needed to unambiguously identify the value and its sign, and `m =
|intbytes(x)|`. The most-significant bit in the first byte in
`intbytes(x)` <!-- for `x`≠0 --> is the sign bit.[^zero-intbytes]

  [^zero-intbytes]: The value 0 needs zero bytes to identify the
    value, so `intbytes(0)` is the empty byte string. Non-zero values
    need at least one byte.

For example,

    [[   -257 ]] = 42 FE FF    [[     -3 ]] = 3D       [[    128 ]] = 42 00 80
    [[   -256 ]] = 42 FF 00    [[     -2 ]] = 3E       [[    255 ]] = 42 00 FF
    [[   -255 ]] = 42 FF 01    [[     -1 ]] = 3F       [[    256 ]] = 42 01 00
    [[   -254 ]] = 42 FF 02    [[      0 ]] = 30       [[  32767 ]] = 42 7F FF
    [[   -129 ]] = 42 FF 7F    [[      1 ]] = 31       [[  32768 ]] = 43 00 80 00
    [[   -128 ]] = 41 80       [[     12 ]] = 3C       [[  65535 ]] = 43 00 FF FF
    [[   -127 ]] = 41 81       [[     13 ]] = 41 0D    [[  65536 ]] = 43 01 00 00
    [[     -4 ]] = 41 FC       [[    127 ]] = 41 7F    [[ 131072 ]] = 43 02 00 00


											
										
										
### Strings, ByteStrings and Symbols.

Syntax for these three types varies only in the value of `n` supplied
to `header` and `open`. In each case, the payload following the header
is a binary sequence; for `String` and `Symbol`, it is a UTF-8
encoding of the `Value`'s code points, while for `ByteString` it is
the raw data contained within the `Value` unmodified.

Format B (known length):

              [[ S ]] = header(1,n,m) ++ encode(S)
              where m = |encode(S)|
    and (n,encode(S)) = (1,utf8(S))  if S ∈ String
                        (2,S)        if S ∈ ByteString
                        (3,utf8(S))  if S ∈ Symbol

To stream a `String`, `ByteString` or `Symbol`, emit `open(1,n)` and
then a sequence of zero or more format B chunks, followed by
`close()`. Every chunk must be a `ByteString`, and no chunk may be
annotated.

While the overall content of a streamed `String` or `Symbol` must be
valid UTF-8, individual chunks do not have to conform to UTF-8.

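
The following Python sketch shows both forms side by side. The names `encode_b` and `encode_c` are hypothetical, as are the `STRING`, `BYTESTRING` and `SYMBOL` constants for `n`. The lead-byte layout and the varint used for long payloads are assumptions inferred from the examples in this document. The last call splits a two-byte UTF-8 character across chunks to illustrate that only the concatenated payload has to be valid UTF-8.

```python
STRING, BYTESTRING, SYMBOL = 1, 2, 3  # values of n for header(1,n,m) / open(1,n)


def varint(n: int) -> bytes:
    """Base-128 varint, least-significant group first (assumed length format)."""
    out = bytearray()
    while True:
        low, n = n & 0x7F, n >> 7
        if n == 0:
            out.append(low)
            return bytes(out)
        out.append(low | 0x80)


def header(t: int, n: int, m: int) -> bytes:
    lead = (t << 6) | (n << 4)
    if m < 15:
        return bytes([lead | m])
    return bytes([lead | 15]) + varint(m)


def encode_b(n: int, payload: bytes) -> bytes:
    """Format B: length-prefixed payload (UTF-8 for String and Symbol)."""
    return header(1, n, len(payload)) + payload


def encode_c(n: int, chunks) -> bytes:
    """Format C: open(1,n), then ByteString chunks, then close()."""
    out = bytearray([0x20 | (1 << 2) | n])
    for chunk in chunks:
        if not chunk:
            raise ValueError("zero-length chunks are not allowed")
        out += encode_b(BYTESTRING, chunk)  # every chunk is a ByteString
    out.append(0x04)
    return bytes(out)


print(encode_b(STRING, "hello".encode("utf-8")).hex(" "))
print(encode_c(STRING, [b"he", b"llo"]).hex(" "))

# "é" is C3 A9 in UTF-8; each chunk alone is invalid, the whole is valid.
print(encode_c(STRING, [b"\xc3", b"\xa9"]).hex(" "))
```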

											
										
										
### Fixed-length Atoms.

Fixed-length atoms all use format A, and do not have a length
representation. They repurpose the bits that format B `Repr`s use to
specify lengths. Applications *MUST NOT* use format C with `open(0,n)`
for any `n`.

#### Booleans.

    [[ #false ]] = header(0,0,0) = [0x00]
    [[  #true ]] = header(0,0,1) = [0x01]

#### Floats and Doubles.

    [[ F ]] when F ∈ Float  = header(0,0,2) ++ binary32(F)
    [[ D ]] when D ∈ Double = header(0,0,3) ++ binary64(D)

The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
8-byte IEEE 754 binary representations of `F` and `D`, respectively.

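
In Python, `binary32` and `binary64` correspond to the `>f` and `>d` formats of the standard `struct` module. The sketch below uses hypothetical function names and simply prefixes the lead bytes given above.

```python
import struct


def encode_boolean(value: bool) -> bytes:
    """header(0,0,0) for #false, header(0,0,1) for #true."""
    return b"\x01" if value else b"\x00"


def encode_float(value: float) -> bytes:
    """header(0,0,2) followed by big-endian IEEE 754 binary32."""
    return b"\x02" + struct.pack(">f", value)


def encode_double(value: float) -> bytes:
    """header(0,0,3) followed by big-endian IEEE 754 binary64."""
    return b"\x03" + struct.pack(">d", value)


print(encode_float(1.0).hex(" "))
print(encode_double(1.0).hex(" "))
```

Note that `struct.pack(">f", ...)` rounds a Python float (a double) to single precision, so a `Float` only round-trips exactly when the value is representable in 32 bits.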

											
										
										
### Annotations.

To annotate a `Repr` `r` with some `Value` `v`, prepend `r` with
`[0x05] ++ [[v]]`.

For example, the `Repr` corresponding to textual syntax `@a@b[]`,
i.e. an empty sequence annotated with two symbols, `a` and `b`, is

    [[ @a @b [] ]]
      = [0x05] ++ [[a]] ++ [0x05] ++ [[b]] ++ [[ [] ]]
      = [0x05, 0x71, 0x61, 0x05, 0x71, 0x62, 0x90]

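
A minimal Python sketch of this rule follows. The helper names `annotate` and `encode_symbol` are hypothetical, and symbols longer than 14 bytes (which would need a varint length) are not handled.

```python
def encode_symbol(name: str) -> bytes:
    """Format B Symbol, header(1,3,m); this sketch only handles m < 15."""
    payload = name.encode("utf-8")
    if len(payload) >= 15:
        raise NotImplementedError("longer symbols need a varint length")
    return bytes([0x70 | len(payload)]) + payload


def annotate(repr_bytes: bytes, annotations: list) -> bytes:
    """Prepend [0x05] ++ [[v]] for each annotation, in textual order."""
    prefix = b"".join(b"\x05" + encoded for encoded in annotations)
    return prefix + repr_bytes


empty_sequence = b"\x90"  # header(2,1,0)
annotated = annotate(empty_sequence, [encode_symbol("a"), encode_symbol("b")])
print(annotated.hex(" "))
```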

											
										
										
## Examples

### Simple examples.

<!-- TODO: Give some examples of large and small Preserves, perhaps -->
<!-- translated from various JSON blobs floating around the internet. -->

For the following examples, imagine an application that maps
placeholder number 0 to symbol `discard`, 1 to `capture`, and 2 to
`observe`.

| Value                                             | Encoded byte sequence                                                               |
|---------------------------------------------------|-------------------------------------------------------------------------------------|
| `<capture <discard>>`                             | 82 11 81 10                                                                         |
| `<observe <speak <discard> <capture <discard>>>>` | 82 12 83 75 's' 'p' 'e' 'a' 'k' 81 10 82 11 81 10                                   |
| `[1 2 3 4]` (format B)                            | 94 31 32 33 34                                                                      |
| `[1 2 3 4]` (format C)                            | 29 31 32 33 34 04                                                                   |
| `[-2 -1 0 1]`                                     | 94 3E 3F 30 31                                                                      |
| `"hello"` (format B)                              | 55 'h' 'e' 'l' 'l' 'o'                                                              |
| `"hello"` (format C, 2 chunks)                    | 25 62 'h' 'e' 63 'l' 'l' 'o' 04                                                     |
| `"hello"` (format C, 5 chunks)                    | 25 61 'h' 61 'e' 61 'l' 61 'l' 61 'o' 04                                            |
| `["hello" there #"world" [] #set{} #true #false]` | 97 55 'h' 'e' 'l' 'l' 'o' 75 't' 'h' 'e' 'r' 'e' 65 'w' 'o' 'r' 'l' 'd' 90 A0 01 00 |
| `-257`                                            | 42 FE FF                                                                            |
| `-1`                                              | 3F                                                                                  |
| `0`                                               | 30                                                                                  |
| `1`                                               | 31                                                                                  |
| `255`                                             | 42 00 FF                                                                            |
| `1.0f`                                            | 02 3F 80 00 00                                                                      |
| `1.0`                                             | 03 3F F0 00 00 00 00 00 00                                                          |
| `-1.202e300`                                      | 03 FE 3C B7 B7 59 BF 04 26                                                          |

											
										
										
The next example uses a non-`Symbol` label for a
record.[^extensibility2] The `Record`

    <[titled person 2 thing 1] 101 "Blackwell" <date 1821 2 3> "Dr">

encodes to

    85                              ;; Record, generic, 4+1
      95                            ;; Sequence, 5
        76 74 69 74 6C 65 64        ;; Symbol, "titled"
        76 70 65 72 73 6F 6E        ;; Symbol, "person"
        32                          ;; SignedInteger, "2"
        75 74 68 69 6E 67           ;; Symbol, "thing"
        31                          ;; SignedInteger, "1"
      41 65                         ;; SignedInteger, "101"
      59 42 6C 61 63 6B 77 65 6C 6C ;; String, "Blackwell"
      84                            ;; Record, generic, 3+1
        74 64 61 74 65              ;; Symbol, "date"
        42 07 1D                    ;; SignedInteger, "1821"
        32                          ;; SignedInteger, "2"
        33                          ;; SignedInteger, "3"
      52 44 72                      ;; String, "Dr"

  [^extensibility2]: It happens to line up with Racket's
    representation of a record label for an inheritance hierarchy
    where `titled` extends `person` extends `thing`:

        (struct date (year month day) #:prefab)
        (struct thing (id) #:prefab)
        (struct person thing (name date-of-birth) #:prefab)
        (struct titled person (title) #:prefab)

    For more detail on Racket's representations of record labels, see
    [the Racket documentation for `make-prefab-struct`](http://docs.racket-lang.org/reference/structutils.html#%28def._%28%28quote._~23~25kernel%29._make-prefab-struct%29%29).

											
										
										
---

### JSON examples.

The examples from
[RFC 8259](https://tools.ietf.org/html/rfc8259#section-13) read as
valid Preserves, though the JSON literals `true`, `false` and `null`
read as `Symbol`s. The first example:

    {
      "Image": {
          "Width":  800,
          "Height": 600,
          "Title":  "View from 15th Floor",
          "Thumbnail": {
              "Url":    "http://www.example.com/image/481989943",
              "Height": 125,
              "Width":  100
          },
          "Animated" : false,
          "IDs": [116, 943, 234, 38793]
        }
    }

encodes to binary as follows:

    B2
      55 "Image"
      BC
        55 "Width"    42 03 20
        55 "Title"    5F 14 "View from 15th Floor"
        58 "Animated" 75 "false"
        56 "Height"   42 02 58
        59 "Thumbnail"
          B6
            55 "Width"  41 64
            53 "Url"    5F 26 "http://www.example.com/image/481989943"
            56 "Height" 41 7D
        53 "IDs"    94
                      41 74
                      42 03 AF
                      42 00 EA
                      43 00 97 89

and the second example:

    [
      {
         "precision": "zip",
         "Latitude":  37.7668,
         "Longitude": -122.3959,
         "Address":   "",
         "City":      "SAN FRANCISCO",
         "State":     "CA",
         "Zip":       "94107",
         "Country":   "US"
      },
      {
         "precision": "zip",
         "Latitude":  37.371991,
         "Longitude": -122.026020,
         "Address":   "",
         "City":      "SUNNYVALE",
         "State":     "CA",
         "Zip":       "94085",
         "Country":   "US"
      }
    ]

encodes to binary as follows:

											
										
										
    92
      BF 10
        59 "precision"  53 "zip"
        58 "Latitude"   03 40 42 E2 26 80 9D 49 52
        59 "Longitude"  03 C0 5E 99 56 6C F4 1F 21
        57 "Address"    50
        54 "City"       5D "SAN FRANCISCO"
        55 "State"      52 "CA"
        53 "Zip"        55 "94107"
        57 "Country"    52 "US"
      BF 10
        59 "precision"  53 "zip"
        58 "Latitude"   03 40 42 AF 9D 66 AD B4 03
        59 "Longitude"  03 C0 5E 81 AA 4F CA 42 AF
        57 "Address"    50
        54 "City"       59 "SUNNYVALE"
        55 "State"      52 "CA"
        53 "Zip"        55 "94085"
        57 "Country"    52 "US"

											
										
										
## Security Considerations

**Empty chunks.** Chunks of zero length are prohibited in streamed
(format C) `Repr`s. However, a malicious or broken encoder may include
them nonetheless. This opens up a possibility for denial-of-service:
an attacker may begin streaming a `String`, for example, sending an
endless sequence of zero-length chunks, appearing to make progress but
not actually doing so. Implementations *MUST* reject zero-length
chunks when decoding, and *MUST NOT* produce them when encoding.

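
A decoder can enforce the empty-chunk rule at the point where it reads each chunk header. The Python sketch below is a hypothetical reader for a streamed `String`, `ByteString` or `Symbol` held entirely in memory. The function names and the varint length format are assumptions, and a production decoder would read incrementally from a socket or file instead.

```python
def read_varint(data: bytes, pos: int):
    """Read a base-128 varint (assumed length format); returns (value, new_pos)."""
    result, shift = 0, 0
    while True:
        if pos >= len(data):
            raise ValueError("truncated varint")
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def read_streamed_payload(data: bytes) -> bytes:
    """Concatenate the chunks of a format C String, ByteString or Symbol."""
    if not data or data[0] not in (0x25, 0x26, 0x27):  # open(1,n), n in 1..3
        raise ValueError("not a streamed String, ByteString or Symbol")
    pos = 1
    payload = bytearray()
    while True:
        if pos >= len(data):
            raise ValueError("truncated stream")
        lead = data[pos]
        pos += 1
        if lead == 0x04:                     # close()
            return bytes(payload)
        if lead & 0xF0 != 0x60:              # must be a format B ByteString
            raise ValueError("chunk is not a format B ByteString")
        length = lead & 0x0F
        if length == 15:
            length, pos = read_varint(data, pos)
        if length == 0:
            raise ValueError("zero-length chunk rejected")
        if pos + length > len(data):
            raise ValueError("truncated chunk")
        payload += data[pos:pos + length]
        pos += length


print(read_streamed_payload(bytes.fromhex("25 62 68 65 63 6c 6c 6f 04")))
```

Rejecting the chunk as soon as its length is known means a peer cannot hold the decoder in the loop without supplying payload bytes.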

											
										
										
**Whitespace.** Similarly, the textual format for `Value`s allows
arbitrary whitespace in many positions. In streaming transfer
situations, consider optional restrictions on the amount of
consecutive whitespace that may appear in a serialized `Value`.

**Annotations.** Also similarly, in modes where a `Value` is being
read while annotations are skipped, an endless sequence of annotations
may give an illusion of progress.

**Canonical form for cryptographic hashing and signing.** As
specified, neither the textual nor the compact binary encoding rules
for `Value`s force canonical serializations. Two serializations of the
same `Value` may yield different binary `Repr`s.

											
										
										
## Acknowledgements

The use of low-order bits of each lead byte for the length of short
values is inspired by a similar feature of [CBOR](http://cbor.io/).

The treatment of commas as whitespace in the text syntax is inspired
by the same feature of [EDN](https://github.com/edn-format/edn).

The text syntax for `Boolean`s, `Symbol`s, and `ByteString`s is
directly inspired by [Racket](https://racket-lang.org/)'s lexical
syntax.

											
										
										
											2018-09-24 11:59:22 +00:00
+								## Appendix. Table of lead byte values
 - False
 - True
 - Float
 - Double
-												Major revision of binary syntax: placeholders; annotations; forbid empty format-C chunks

											
										
										
											2019-07-14 02:20:22 +00:00
+- End stream
 - Annotation
 								    (0x)  RESERVED 06-0F
 x - Placeholder
-												Trim and improve

											
										
										
											2018-09-24 11:59:22 +00:00
+x - Start Stream
-												Major revision of binary syntax: placeholders; annotations; forbid empty format-C chunks

											
										
										
											2019-07-14 02:20:22 +00:00
+x - Small integers 0..12,-3..-1
-												Trim and improve

											
										
										
											2018-09-24 11:59:22 +00:00
 x - SignedInteger
 x - String
 x - ByteString
 x - Symbol
-												Major revision of binary syntax: placeholders; annotations; forbid empty format-C chunks

											
										
										
											2019-07-14 02:20:22 +00:00
+x - Record
 x - Sequence
 								     Ax - Set
 								     Bx - Dictionary
-												Trim and improve

											
										
										
											2018-09-24 11:59:22 +00:00
-												Major revision of binary syntax: placeholders; annotations; forbid empty format-C chunks

											
										
										
											2019-07-14 02:20:22 +00:00
+								    (Cx)  RESERVED C0-CF
 								    (Dx)  RESERVED D0-DF
 								    (Ex)  RESERVED E0-EF
-												Trim and improve

											
										
										
											2018-09-24 11:59:22 +00:00
+								    (Fx)  RESERVED F0-FF

## Appendix. Bit fields within lead byte values

     tt nn mmmm  contents
     ---------- ---------

     00 00 0000  False
     00 00 0001  True
     00 00 0010  Float, 32 bits big-endian binary
     00 00 0011  Double, 64 bits big-endian binary
     00 00 0100  End Stream (to match a previous Start Stream)
     00 00 0101  Annotation; two more Reprs follow

     00 01 mmmm  Placeholder; m is the placeholder number

     00 10 ttnn  Start Stream <tt,nn>
                   When tt = 00 --> error
                        tt = 01 --> each chunk is a ByteString
                        tt = 10 --> each chunk is a single encoded Value
                        tt = 11 --> error (RESERVED)

     00 11 xxxx  Small integers 0..12,-3..-1

     01 00 mmmm  SignedInteger, big-endian binary
     01 01 mmmm  String, UTF-8 binary
     01 10 mmmm  ByteString
     01 11 mmmm  Symbol, UTF-8 binary

     10 00 mmmm  Record
     10 01 mmmm  Sequence
     10 10 mmmm  Set
     10 11 mmmm  Dictionary

     11 nn mmmm  error, RESERVED

Where `mmmm` appears, interpret it as an unsigned 4-bit number `m`. If
`m`<15, let `l`=`m`. Otherwise, `m`=15; let `l` be the result of
decoding the varint that follows.

Then, if `ttnn`=`0001`, `l` is the placeholder number; otherwise, `l`
is the length of the body that follows, counted in bytes for `tt`=`01`
and in `Repr`s for `tt`=`10`.
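
The rules above can be sketched as a small header reader. This is an illustrative sketch, not a reference implementation: `decode_varint` and `read_header` are hypothetical names, and the varint layout (7-bit groups, least-significant group first, high bit set on every byte but the last) is assumed to follow the Protocol Buffers encoding that this document links to.

```python
def decode_varint(data, pos):
    """Decode a base-128 varint, least-significant 7-bit group first."""
    result, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:        # high bit clear: this was the last group
            return result, pos
        shift += 7


def read_header(data, pos=0):
    """Split a lead byte into (tt, nn, l, meaning, next_pos).

    Only handles lead bytes that carry an mmmm field: Placeholder
    (ttnn = 0001) and the tt = 01 and tt = 10 families.
    """
    lead = data[pos]
    tt, nn, m = lead >> 6, (lead >> 4) & 0b11, lead & 0x0F
    pos += 1
    if tt == 0b11:
        raise ValueError("reserved lead byte")
    if tt == 0b00 and nn != 0b01:
        raise ValueError("lead byte has no mmmm field")
    if m < 15:
        l = m                       # short form: the value is in the lead byte
    else:
        l, pos = decode_varint(data, pos)   # long form: a varint follows
    if tt == 0b00:
        meaning = "placeholder number"
    elif tt == 0b01:
        meaning = "length in bytes"
    else:
        meaning = "length in Reprs"
    return tt, nn, l, meaning, pos
```

For example, `read_header(b"\x55hello")` returns `(1, 1, 5, "length in bytes", 1)`: lead byte `55` is a `String` whose 5-byte UTF-8 body starts at offset 1.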

<!-- Heading to visually offset the footnotes from the main document: -->
## Notes