---
---
<title>Preserves: an Expressive Data Language</title>
<link rel="stylesheet" href="preserves.css">

# Preserves: an Expressive Data Language

Tony Garnock-Jones <tonyg@leastfixedpoint.com>  
November 2018. Version 0.0.4.

  [sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt
  [spki]: http://world.std.com/~cme/html/spki.html
  [varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
  [erlang-map]: http://erlang.org/doc/reference_manual/data_types.html#map
  [abnf]: https://tools.ietf.org/html/rfc7405

This document proposes a data model and serialization format called
*Preserves*.

Preserves supports *records* with user-defined *labels*. This relieves
the confusion caused by encoding records as dictionaries, seen in most
data languages in use on the web. It also allows Preserves to easily
represent the *labelled sums of products* as seen in many functional
programming languages.

Preserves also supports the usual suite of atomic and compound data
types, in particular including *binary* data as a distinct type from
text strings. Its *annotations* allow separation of data from metadata
such as comments, trace information, and provenance information.

Finally, Preserves defines precisely how to *compare* two values.
Comparison is based on the data model, not on syntax or on data
structures of any particular implementation language.

## Starting with Semantics

Taking inspiration from functional programming, we start with a
definition of the *values* that we want to work with and give them
meaning independent of their syntax.

Our `Value`s fall into two broad categories: *atomic* and *compound*
data.

                          Value = Atom
                                | Compound

                           Atom = Boolean
                                | Float
                                | Double
                                | SignedInteger
                                | String
                                | ByteString
                                | Symbol

                       Compound = Record
                                | Sequence
                                | Set
                                | Dictionary

**Total order.**<a name="total-order"></a> As we go, we will
incrementally specify a total order over `Value`s. Two values of the
same kind are compared using kind-specific rules. The ordering among
values of different kinds is essentially arbitrary, but having a total
order is convenient for many tasks, so we define it as
follows:[^ordering-by-syntax]

            (Values)        Atom < Compound

            (Compounds)     Record < Sequence < Set < Dictionary

            (Atoms)         Boolean < Float < Double < SignedInteger
                              < String < ByteString < Symbol

  [^ordering-by-syntax]: The observant reader may note that the
    ordering here is (almost) the same as that implied by the tagging
    scheme used in the concrete binary syntax for `Value`s. (The
    exception is the syntax for small integers near zero.)

**Equivalence.**<a name="equivalence"></a> Two `Value`s are equal if
neither is less than the other according to the total order.

### Signed integers.

A `SignedInteger` is a signed integer of arbitrary width.
`SignedInteger`s are compared as mathematical integers.

### Unicode strings.

A `String` is a sequence of Unicode
[code-point](http://www.unicode.org/glossary/#code_point)s. `String`s
are compared lexicographically, code-point by
code-point.[^utf8-is-awesome]

  [^utf8-is-awesome]: Happily, the design of UTF-8 is such that this
    gives the same result as a lexicographic byte-by-byte comparison
    of the UTF-8 encoding of a string!
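
The footnote's claim is easy to check experimentally. The following Python sketch relies on the fact that Python compares `str` values code point by code point and `bytes` values octet by octet; the sample strings are arbitrary and chosen only to mix one-, two-, three- and four-byte UTF-8 sequences.

```python
# Arbitrary sample strings covering 1- to 4-byte UTF-8 encodings.
samples = ["z", "a", "é", "日本", "\U0001F600", "ab", "", "\uffff"]

# Order by code point (Python's native str ordering) ...
by_code_point = sorted(samples)
# ... and by the bytes of each string's UTF-8 encoding.
by_utf8_bytes = sorted(samples, key=lambda s: s.encode("utf-8"))

print(by_code_point == by_utf8_bytes)
```

The same experiment with UTF-16 code units would not agree for this input, since U+FFFF would then sort after the surrogate pair encoding U+1F600.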

### Binary data.

A `ByteString` is a sequence of octets. `ByteString`s are compared
lexicographically.

### Symbols.

Programming languages like Lisp and Prolog frequently use string-like
values called *symbols*. Here, a `Symbol` is, like a `String`, a
sequence of Unicode code-points representing an identifier of some
kind. `Symbol`s are also compared lexicographically by code-point.

### Booleans.

There are two `Boolean`s, “false” and “true”. The “false” value is
less than the “true” value.

### IEEE floating-point values.

`Float`s and `Double`s are single- and double-precision IEEE 754
floating-point values, respectively. `Float`s, `Double`s and
`SignedInteger`s are disjoint; by the rules [above](#total-order),
every `Float` is less than every `Double`, and every `SignedInteger`
is greater than both. Two `Float`s or two `Double`s are to be ordered
by the `totalOrder` predicate defined in section 5.10 of
[IEEE Std 754-2008](https://dx.doi.org/10.1109/IEEESTD.2008.4610935).
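
As an illustration only, one common way to obtain the `totalOrder` ordering for binary64 values is to reinterpret the bit pattern as an integer and adjust it so that ordinary integer comparison gives the desired result. The sketch below assumes a platform whose `float` is IEEE 754 binary64; the function name `total_order_key` is ours, not part of any library.

```python
import math
import struct

def total_order_key(x: float) -> int:
    """Map a double to an integer whose natural order follows totalOrder."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    if bits >> 63:
        # Sign bit set: flip every bit, so larger magnitudes sort earlier.
        return bits ^ 0xFFFFFFFFFFFFFFFF
    # Sign bit clear: set the top bit, so these sort above all negatives.
    return bits | (1 << 63)

neg_nan = math.copysign(math.nan, -1.0)
pos_nan = math.copysign(math.nan, 1.0)
values = [1.0, pos_nan, -0.0, math.inf, 0.0, -math.inf, neg_nan, -1.0]
ordered = sorted(values, key=total_order_key)
```

Under this key, negative NaNs sort first, then negative infinity, the negative numbers, negative zero, positive zero, the positive numbers, positive infinity, and finally positive NaNs. The same idea applies to `Float`s using a 32-bit pattern.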

### Records.

A `Record` is a *labelled* tuple of `Value`s, the record's *fields*. A
label can be any `Value`, but is usually a `Symbol`.[^extensibility]
[^iri-labels] `Record`s are compared lexicographically: first by
label, then by field sequence.

  [^extensibility]: The [Racket](https://racket-lang.org/) programming
    language defines
    “[prefab](http://docs.racket-lang.org/guide/define-struct.html#(part._prefab-struct))”
    structure types, which map well to our `Record`s. Racket supports
    record extensibility by encoding record supertypes into record
    labels as specially-formatted lists.

  [^iri-labels]: It is occasionally (but seldom) necessary to
    interpret such `Symbol` labels as UTF-8 encoded IRIs. Where a
    label can be read as a relative IRI, it is notionally interpreted
    with respect to the IRI
    `urn:uuid:6bf094a6-20f1-4887-ada7-46834a9b5b34`; where a label can
    be read as an absolute IRI, it stands for that IRI; and otherwise,
    it cannot be read as an IRI at all, and so the label simply stands
    for itself—for its own `Value`.

### Sequences.

A `Sequence` is a sequence of `Value`s. `Sequence`s are compared
lexicographically.

### Sets.

A `Set` is an unordered finite set of `Value`s. It contains no
duplicate values, following the [equivalence relation](#equivalence)
induced by the total order on `Value`s. Two `Set`s are compared by
sorting their elements ascending using the [total order](#total-order)
and comparing the resulting `Sequence`s.

### Dictionaries.

A `Dictionary` is an unordered finite collection of pairs of `Value`s.
Each pair comprises a *key* and a *value*. Keys in a `Dictionary` are
pairwise distinct. Instances of `Dictionary` are compared by
lexicographic comparison of the sequences resulting from ordering each
`Dictionary`'s pairs in ascending order by key.
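
To make the ordering rules concrete, here is a minimal sketch of a comparison function over a Python rendering of the data model. Everything about the rendering is an assumption made for illustration: `bool`, `float`, `int`, `str`, `bytes`, `list`/`tuple`, `frozenset` and `dict` stand in for `Boolean`, `Double`, `SignedInteger`, `String`, `ByteString`, `Sequence`, `Set` and `Dictionary`; `Symbol` and `Record` are hypothetical wrapper classes; `Float` is not modelled. Dictionary pairs are compared key first, then value, which is our reading of the rule above. Note also that Python itself treats `1`, `1.0` and `True` as equal set elements and dictionary keys, which the data model does not.

```python
import struct
from dataclasses import dataclass
from functools import cmp_to_key

@dataclass(frozen=True)
class Symbol:               # hypothetical wrapper: distinguishes Symbol from String
    name: str

@dataclass(frozen=True)
class Record:               # hypothetical wrapper: a label plus a tuple of fields
    label: object
    fields: tuple

def kind_rank(v) -> int:
    """Position of v's kind in the cross-kind order (rank 1, Float, is unused)."""
    if isinstance(v, bool):             # test bool first: it subclasses int
        return 0
    if isinstance(v, float):            # Double
        return 2
    if isinstance(v, int):              # SignedInteger
        return 3
    if isinstance(v, str):
        return 4
    if isinstance(v, bytes):
        return 5
    if isinstance(v, Symbol):
        return 6
    if isinstance(v, Record):
        return 7
    if isinstance(v, (list, tuple)):
        return 8
    if isinstance(v, frozenset):
        return 9
    if isinstance(v, dict):
        return 10
    raise TypeError(f"not a Value: {v!r}")

def _cmp(a, b) -> int:
    return (a > b) - (a < b)

def _double_key(x: float) -> int:
    # Integer key following IEEE 754 totalOrder for binary64.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return bits ^ 0xFFFFFFFFFFFFFFFF if bits >> 63 else bits | (1 << 63)

def _cmp_seq(xs, ys) -> int:
    # Lexicographic comparison; a proper prefix sorts first.
    for x, y in zip(xs, ys):
        c = compare(x, y)
        if c:
            return c
    return _cmp(len(xs), len(ys))

def _flatten_sorted(d: dict) -> list:
    # Order pairs by key, then flatten to [k1, v1, k2, v2, ...].
    by_key = cmp_to_key(lambda p, q: compare(p[0], q[0]))
    return [x for pair in sorted(d.items(), key=by_key) for x in pair]

def compare(a, b) -> int:
    """Return a negative, zero or positive int as a <, =, > b."""
    ra, rb = kind_rank(a), kind_rank(b)
    if ra != rb:
        return _cmp(ra, rb)
    if ra == 2:
        return _cmp(_double_key(a), _double_key(b))
    if ra in (0, 3, 4, 5):              # Python's native order already matches
        return _cmp(a, b)
    if ra == 6:
        return _cmp(a.name, b.name)
    if ra == 7:
        return compare(a.label, b.label) or _cmp_seq(a.fields, b.fields)
    if ra == 8:
        return _cmp_seq(a, b)
    if ra == 9:
        key = cmp_to_key(compare)
        return _cmp_seq(sorted(a, key=key), sorted(b, key=key))
    return _cmp_seq(_flatten_sorted(a), _flatten_sorted(b))
```

Equivalence then falls out as `compare(a, b) == 0`, in line with the definition of equality given earlier.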

## Textual Syntax

Now that we have discussed `Value`s and their meanings, we may turn to
techniques for *representing* `Value`s for communication or storage.

In this section, we use [case-sensitive ABNF][abnf] to define a
textual syntax that is easy for people to read and
write.[^json-superset] Most of the examples in this document are
written using this syntax. In the following section, we will define an
equivalent compact machine-readable syntax.

  [^json-superset]: The grammar of the textual syntax is a superset of
    JSON, with the slightly unusual feature that `true`, `false`, and
    `null` are all read as `Symbol`s, and that `SignedInteger`s are
    never read as `Double`s.

### Character set

[ABNF][abnf] allows easy definition of US-ASCII-based languages.
However, Preserves is a Unicode-based language. Therefore, we
reinterpret ABNF as a grammar for recognising sequences of Unicode
code points.

Textual syntax for a `Value` *SHOULD* be encoded using UTF-8 where
possible.

### Whitespace

Whitespace is defined as any number of spaces, tabs, carriage returns,
line feeds, or commas.

                ws = *(%x20 / %x09 / newline / ",")
           newline = CR / LF

### Grammar

Standalone documents may have trailing whitespace.

          Document = Value ws

Any `Value` may be preceded by whitespace.

             Value = ws (Record / Collection / Atom / Compact)
        Collection = Sequence / Dictionary / Set
              Atom = Boolean / Float / Double / SignedInteger /
                     String / ByteString / Symbol

Each `Record` is its label-`Value` followed by a parenthesised
grouping of its field-`Value`s. Whitespace is not permitted between
the label and the open-parenthesis.

            Record = Value "(" *Value ws ")"

`Sequence`s are enclosed in square brackets. `Dictionary` values are
curly-brace-enclosed colon-separated pairs of values. `Set`s are
written either as one or more values enclosed in curly braces, or zero
or more values enclosed by the tokens `#set{` and
`}`.[^printing-collections]

          Sequence = "[" *Value ws "]"
        Dictionary = "{" *(Value ws ":" Value) ws "}"
               Set = %s"#set{" *Value ws "}" / "{" 1*Value ws "}"

  [^printing-collections]: **Implementation note.** When implementing
    printing of `Value`s using the textual syntax, consider supporting
    (a) optional pretty-printing with indentation, (b) optional
    JSON-compatible print mode for that subset of `Value` that is
    compatible with JSON, and (c) optional submodes for no commas,
    commas separating, and commas terminating elements or key/value
    pairs within a collection.

As a special case, a record with a single field that is itself a
sequence or dictionary may be written without the parentheses.

           Record =/ Value Sequence
           Record =/ Value Dictionary

`Boolean`s are the simple literal strings `#true` and `#false`.

           Boolean = %s"#true" / %s"#false"

Numeric data follow the
[JSON grammar](https://tools.ietf.org/html/rfc8259#section-6), with
the addition of a trailing "f" distinguishing `Float` from `Double`
values. `Float`s and `Double`s always have either a fractional part or
an exponent part, whereas `SignedInteger`s never have
either.[^reading-and-writing-floats-accurately]
[^arbitrary-precision-signedinteger]

             Float = flt %i"f"
            Double = flt
     SignedInteger = int

          digit1-9 = %x31-39
               nat = %x30 / ( digit1-9 *DIGIT )
               int = ["-"] nat
              frac = "." 1*DIGIT
               exp = %i"e" ["-"/"+"] 1*DIGIT
               flt = int (frac exp / frac / exp)
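
The numeric productions translate almost directly into regular expressions. The following sketch classifies a complete token as `Float`, `Double` or `SignedInteger`; the function name `classify_number` and the use of Python's `re` module are our own choices for illustration, and the sketch does not convert the token to a numeric value.

```python
import re

# Direct transcriptions of the ABNF rules int, frac, exp and flt.
INT = r"-?(?:0|[1-9][0-9]*)"
FRAC = r"\.[0-9]+"
EXP = r"[eE][-+]?[0-9]+"
FLT = rf"{INT}(?:{FRAC}{EXP}|{FRAC}|{EXP})"

def classify_number(token: str):
    """Return the kind of numeric token, or None if it is not numeric."""
    if re.fullmatch(FLT + "[fF]", token):
        return "Float"
    if re.fullmatch(FLT, token):
        return "Double"
    if re.fullmatch(INT, token):
        return "SignedInteger"
    return None

for tok in ["1.0f", "1.0", "-1.202e300", "1821", "01", "1."]:
    print(tok, classify_number(tok))
```

Tokens such as `01` and `1.` are rejected, as in JSON: a leading zero may not be followed by further digits, and a fractional part needs at least one digit.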

  [^reading-and-writing-floats-accurately]: **Implementation note.**
    Your language's standard library likely has a good routine for
    converting between decimal notation and IEEE 754 floating-point.
    However, if not, or if you are interested in the challenges of
    accurately reading and writing floating point numbers, see the
    excellent matched pair of 1990 papers by Clinger and Steele &
    White, and a recent follow-up by Jaffer:

    Clinger, William D. ‘How to Read Floating Point Numbers
    Accurately’. In Proc. PLDI. White Plains, New York, 1990.
    <https://doi.org/10.1145/93542.93557>.

    Steele, Guy L., Jr., and Jon L. White. ‘How to Print
    Floating-Point Numbers Accurately’. In Proc. PLDI. White Plains,
    New York, 1990. <https://doi.org/10.1145/93542.93559>.

    Jaffer, Aubrey. ‘Easy Accurate Reading and Writing of
    Floating-Point Numbers’. ArXiv:1310.8121 [Cs], 27 October 2013.
    <http://arxiv.org/abs/1310.8121>.

  [^arbitrary-precision-signedinteger]: **Implementation note.** Be
    aware when implementing reading and writing of `SignedInteger`s
    that the data model *requires* arbitrary-precision integers. Your
    I/O routines must not truncate precision either when reading or
    writing a `SignedInteger`.

`String`s are,
[as in JSON](https://tools.ietf.org/html/rfc8259#section-7), possibly
escaped text surrounded by double quotes. The escaping rules are the
same as for JSON.[^string-json-correspondence] [^escaping-surrogate-pairs]

            String = %x22 *char %x22
              char = unescaped / %x7C / escape (escaped / %x22 / %s"u" 4HEXDIG)
         unescaped = %x20-21 / %x23-5B / %x5D-7B / %x7D-10FFFF
            escape = %x5C              ; \
           escaped = ( %x5C /          ; \    reverse solidus U+005C
                       %x2F /          ; /    solidus         U+002F
                       %x62 /          ; b    backspace       U+0008
                       %x66 /          ; f    form feed       U+000C
                       %x6E /          ; n    line feed       U+000A
                       %x72 /          ; r    carriage return U+000D
                       %x74 )          ; t    tab             U+0009

  [^string-json-correspondence]: The grammar for `String` has the same
    effect as the
    [JSON](https://tools.ietf.org/html/rfc8259#section-7) grammar for
    `string`. Some auxiliary definitions (e.g. `escaped`) are lifted
    largely unmodified from the text of RFC 8259.

  [^escaping-surrogate-pairs]: In particular, note JSON's rules around
    the use of surrogate pairs for code points not in the Basic
    Multilingual Plane. We encourage implementations to avoid escaping
    such characters when producing output, and instead to rely on the
    UTF-8 encoding of the entire document to handle them correctly.

A `ByteString` may be written in any of three different forms.

The first is similar to a `String`, but prepended with a hash sign
`#`. In addition, only Unicode code points overlapping with printable
7-bit ASCII are permitted unescaped inside such a `ByteString`; other
byte values must be escaped by prepending a two-digit hexadecimal
value with `\x`.

        ByteString = "#" %x22 *binchar %x22
           binchar = binunescaped / escape (escaped / %x22 / %s"x" 2HEXDIG)
      binunescaped = %x20-21 / %x23-5B / %x5D-7E

The second is as a sequence of pairs of hexadecimal digits interleaved
with whitespace and surrounded by `#hex{` and `}`.

       ByteString =/ %s"#hex{" *(ws / 2HEXDIG) ws "}"

The third is as a sequence of
[Base64](https://tools.ietf.org/html/rfc4648) characters, interleaved
with whitespace and surrounded by `#base64{` and `}`. Plain and
URL-safe Base64 characters are allowed.

       ByteString =/ %s"#base64{" *(ws / base64char) ws "}"
        base64char = %x41-5A / %x61-7A / %x30-39 / "+" / "/" / "-" / "_" / "="
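
As a rough guide for implementers, the bodies of the `#hex{...}` and `#base64{...}` forms might be decoded as follows. The function names are hypothetical. The sketch assumes the caller has already isolated the text between the braces, and it takes a lenient view of Base64 padding (any `=` characters are ignored and padding is recomputed), which is an assumption on our part rather than something the grammar prescribes.

```python
import base64
import re

WS = " \t\r\n,"  # whitespace per the `ws` rule, commas included

def decode_hex_body(body: str) -> bytes:
    """Decode the text between `#hex{` and `}`."""
    # Hex digits must come in adjacent pairs; whitespace may not split a pair.
    if not re.fullmatch(r"(?:[ \t\r\n,]|[0-9A-Fa-f]{2})*", body):
        raise ValueError("malformed #hex{} body")
    return bytes.fromhex("".join(ch for ch in body if ch not in WS))

def decode_base64_body(body: str) -> bytes:
    """Decode the text between `#base64{` and `}`."""
    chars = "".join(ch for ch in body if ch not in WS).replace("=", "")
    # Map the URL-safe alphabet onto the plain one before decoding.
    chars = chars.translate(str.maketrans("-_", "+/"))
    padded = chars + "=" * (-len(chars) % 4)
    return base64.b64decode(padded, validate=True)

print(decode_hex_body("68 65,6C\n6c 6F"))
print(decode_base64_body("aGVs bG8="))
```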

A `Symbol` may be written in a "bare" form[^cf-sexp-token] so long as
it conforms to certain restrictions on the characters appearing in the
symbol. Alternatively, it may be written in a quoted form. The quoted
form is much the same as the syntax for `String`s, including embedded
escape syntax, except using a bar or pipe character (`|`) instead of a
double quote mark.

            Symbol = symstart *symcont / "|" *symchar "|"
          symstart = ALPHA / sympunct / symunicode
           symcont = ALPHA / sympunct / symunicode / DIGIT / "-"
          sympunct = "~" / "!" / "$" / "%" / "^" / "&" / "*" /
                     "?" / "_" / "=" / "+" / "<" / ">" / "/" / "."
           symchar = unescaped / %x22 / escape (escaped / %x7C / %s"u" 4HEXDIG)
        symunicode = <any code point greater than 127 whose Unicode
                      category is Lu, Ll, Lt, Lm, Lo, Mn, Mc, Me, Nd,
                      Nl, No, Pd, Pc, Po, Sc, Sm, Sk, So, or Co>

  [^cf-sexp-token]: Compare with the [SPKI S-expression][sexp.txt]
    definition of "token representation", and with the
    [R6RS definition of identifiers](http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-7.html#node_sec_4.2.4).

Finally, any `Value` may be represented by escaping from the textual
syntax to the [compact binary syntax](#compact-binary-syntax) by
prefixing a `ByteString` containing the binary representation of the
`Value` with `#value`.[^rationale-switch-to-binary] [^no-literal-binary-in-text]

           Compact = %s"#value" ws ByteString

  [^rationale-switch-to-binary]: **Rationale.** The textual syntax
    cannot express every `Value`: specifically, it cannot express the
    several million floating-point NaNs, or the two floating-point
    Infinities. Since the compact binary format for `Value`s expresses
    each `Value` with precision, embedding binary `Value`s solves the
    problem.

  [^no-literal-binary-in-text]: Every text is ultimately physically
    stored as bytes; therefore, it might seem possible to escape to
    the raw binary form of compact binary encoding from within a
    piece of textual syntax. However, while bytes must be involved in
    any *representation* of text, the text *itself* is logically a
    sequence of *code points* and is not *intrinsically* a binary
    structure at all. It would be incoherent to expect to be able to
    access the representation of the text from within the text itself.

### Annotations.

When written down, a `Value` may have an associated sequence of
*annotations* carrying “out-of-band” contextual metadata about the
value. Each annotation is, in turn, a `Value`, and may itself have
annotations.

            Value =/ ws "@" Value Value

Each annotation is preceded by `@`; the underlying annotated value
follows its annotations.

Annotations appear within syntax denoting a `Value`; however, the
annotations are not part of the denoted value. They are only part of
the syntax. Annotations do not play a part in equivalences and
orderings of `Value`s.

Reflective tools such as debuggers, user interfaces, and message
routers and relays---tools which process `Value`s generically---may
use annotated inputs to tailor their operation, or may insert
annotations in their outputs. By contrast, in ordinary programs, as a
rule of thumb, the presence, absence or specific value of an
annotation should not change the control flow or output of the
program. Annotations are data *describing* `Value`s, and are not in
the domain of any specific application of `Value`s. That is, an
annotation will almost never cause a non-reflective program to do
anything observably different.

## Compact Binary Syntax

A `Repr` is an encoding, or representation, of a specific `Value`.
Each `Repr` comprises one or more bytes describing first the kind of
represented `Value` and the length of the representation, and then the
encoded details of the `Value` itself.

For a value `v`, we write `[[v]]` for the `Repr` of `v`.

### Type and Length representation

Each `Repr` takes one of three possible forms:

 - (A) a fixed-length form, used for simple values such as `Boolean`s
   or `Float`s.

 - (B) a variable-length form with length specified up-front, used for
   almost all `Record`s as well as for most `Sequence`s and `String`s,
   when their sizes are known at the time serialization begins.

 - (C) a variable-length streaming form with unknown or unpredictable
   length, seldom used for `Record`s, since the number of fields in a
   `Record` is usually statically known, but sometimes used for
   `Sequence`s, `String`s etc., for example when serialization begins
   before the number of elements or bytes in the corresponding `Value`
   is known.

Applications may choose between formats B and C depending on their
needs at serialization time.

#### The lead byte

Every `Repr` starts with a *lead byte*, constructed by
`leadbyte(t,n,m)`, where `t`,`n`∈{0,1,2,3} and 0≤`m`<16:

    leadbyte(t,n,m) = [t*64 + n*16 + m]

The arguments `t` and `n` describe the rest of the
representation:[^some-encodings-unused]

  [^some-encodings-unused]: Some encodings are unused. All such
    encodings are reserved for future versions of this specification.

 - `t`=0, `n`=0 (format A) represents an `Atom` with fixed-length binary representation.
 - `t`=0, `n`=1 (format A) represents certain small `SignedInteger`s.
 - `t`=0, `n`=2 (format C) is a Stream Start byte.
 - `t`=0, `n`=3 (format C) is a Stream End byte.
 - `t`=1 (format B) represents an `Atom` with variable-length binary representation.
 - `t`=2 (format B) represents a `Record`.
 - `t`=3 (format B) represents a `Sequence`, `Set` or `Dictionary`.
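
The lead byte is simply three bit fields packed into one octet: two bits for `t`, two for `n`, and four for `m`. A small sketch of packing, unpacking and classifying lead bytes follows; `split_leadbyte` and `describe` are illustrative names of our own.

```python
def leadbyte(t: int, n: int, m: int) -> bytes:
    """Pack t, n (2 bits each) and m (4 bits) into a single byte."""
    if not (0 <= t < 4 and 0 <= n < 4 and 0 <= m < 16):
        raise ValueError("t, n must be in 0..3 and m in 0..15")
    return bytes([t * 64 + n * 16 + m])

def split_leadbyte(b: int):
    """Recover (t, n, m) from a lead byte given as an integer 0..255."""
    return b >> 6, (b >> 4) & 3, b & 15

def describe(b: int) -> str:
    """Name the broad category selected by t and n, per the list above."""
    t, n, _m = split_leadbyte(b)
    if t == 0:
        return ["fixed-length Atom", "small SignedInteger",
                "Stream Start", "Stream End"][n]
    return {1: "variable-length Atom",
            2: "Record",
            3: "Sequence, Set or Dictionary"}[t]

print(leadbyte(2, 1, 3).hex(), split_leadbyte(0x93), describe(0x93))
```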

#### Encoding data of fixed length (format A)

Each specific type of data defines its own rules for this format.

#### Encoding data of known length (format B)

A `Repr` where the length of the `Value` to be encoded is variable but
known uses the value of `m` in `leadbyte` to encode its length. The
length counts *bytes* for atomic `Value`s, but counts *contained
values* for compound `Value`s.

 - A length `l` between 0 and 14 is represented using `leadbyte` with
   `m=l`.
 - A length of 15 or greater is represented by `m=15` and additional
   bytes describing the length following the lead byte.

The function `header(t,n,m)` yields an appropriate sequence of bytes
describing a `Repr`'s type and length when `t`, `n` and `m` are
appropriate non-negative integers:

    header(t,n,m) =    leadbyte(t,n,m)                 when m < 15
                    or leadbyte(t,n,15) ++ varint(m)   otherwise

The additional length bytes are formatted as
[base 128 varints][varint]. We write `varint(m)` for the
varint-encoding of `m`. Quoting the [Google Protocol Buffers][varint]
definition,

> Each byte in a varint, except the last byte, has the most
> significant bit (msb) set – this indicates that there are further
> bytes to come. The lower 7 bits of each byte are used to store the
> two's complement representation of the number in groups of 7 bits,
> least significant group first.

**Examples.**

 - The varint representation of 15 is just the byte 15.
 - 300 (binary, grouped into 7-bit chunks, `10 0101100`) varint-encodes to the two bytes 172 and 2.
 - 1000000000 (binary `11 1011100 1101011 0010100 0000000`) varint-encodes to bytes 128, 148, 235, 220, and 3.
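
The definitions of `varint` and `header` can be sketched in a few lines of Python. This is one straightforward rendering under the definitions above, not a reference implementation; it reproduces the three varint examples just given.

```python
def varint(m: int) -> bytes:
    """Base 128 varint: 7 bits per byte, least significant group first."""
    if m < 0:
        raise ValueError("lengths are non-negative")
    out = bytearray()
    while m >= 0x80:
        out.append((m & 0x7F) | 0x80)   # msb set: more bytes follow
        m >>= 7
    out.append(m)                       # final byte has msb clear
    return bytes(out)

def leadbyte(t: int, n: int, m: int) -> bytes:
    return bytes([t * 64 + n * 16 + m])

def header(t: int, n: int, m: int) -> bytes:
    """Lead byte, plus a varint length when m does not fit in 4 bits."""
    if m < 15:
        return leadbyte(t, n, m)
    return leadbyte(t, n, 15) + varint(m)

print(list(varint(15)), list(varint(300)), list(varint(1000000000)))
```

A length of exactly 15 therefore costs two bytes: the lead byte with `m=15` followed by the single varint byte 15.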

#### Streaming data of unknown length (format C)

A `Repr` where the length of the `Value` to be encoded is variable and
not known at the time serialization of the `Value` starts is encoded
by a single Stream Start (“open”) byte, followed by zero or more
*chunks*, followed by a matching Stream End (“close”) byte:

     open(t,n) = leadbyte(0,2, t*4 + n)
    close(t,n) = leadbyte(0,3, t*4 + n)

For a `Repr` of a `Value` containing binary data, each chunk is to be
a format B `Repr` of a `ByteString`, no matter the type of the overall
`Repr`.

For a `Repr` of a `Value` containing other `Value`s, each chunk is to
be a single `Repr`.
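
A minimal sketch of the Stream Start and Stream End bytes, together with a helper that wraps already-encoded chunks, might look like this. The trailing underscore in `open_` only avoids shadowing Python's built-in `open`; the helper name `stream` is hypothetical. A real streaming encoder would write the pieces to an output incrementally rather than joining them in memory.

```python
def leadbyte(t: int, n: int, m: int) -> bytes:
    return bytes([t * 64 + n * 16 + m])

def open_(t: int, n: int) -> bytes:
    """Stream Start byte for a value whose known-length form is header(t,n,...)."""
    return leadbyte(0, 2, t * 4 + n)

def close(t: int, n: int) -> bytes:
    """Matching Stream End byte."""
    return leadbyte(0, 3, t * 4 + n)

def stream(t: int, n: int, chunks) -> bytes:
    """Wrap an iterable of already-encoded chunks in open/close bytes."""
    return open_(t, n) + b"".join(chunks) + close(t, n)

# A Sequence (t=3, n=0) whose chunks are the Reprs of 1, 2, 3 and 4.
print(stream(3, 0, [b"\x11", b"\x12", b"\x13", b"\x14"]).hex(" "))
```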

### Records

Format B (known length):

    [[ L(F_1...F_m) ]] = header(2,3,m+1) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]]

For `m` fields, `m+1` is supplied to `header`, to account for the
encoding of the record label.

Format C (streaming):

    [[ L(F_1...F_m) ]] = open(2,3) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]] ++ close(2,3)

Applications *SHOULD* prefer the known-length format for encoding
`Record`s.

#### Application-specific short form for labels

Any given protocol using Preserves may additionally define an
interpretation for `n`∈{0,1,2}, mapping each *short form label
number* `n` to a specific record label. When encoding `m` fields with
short form label number `n`, format B becomes

    header(2,n,m) ++ [[F_1]] ++...++ [[F_m]]

and format C becomes

    open(2,n) ++ [[F_1]] ++...++ [[F_m]] ++ close(2,n)

**Examples.** A protocol may choose to map records labelled `void`
to `n=0`, making

    [[void()]] = header(2,0,0) = [0x80]

or it may map records labelled `person` to short form label number 1,
making

    [[person("Dr", "Elizabeth", "Blackwell")]]
        = header(2,1,3) ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]
        =        [0x93] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]

for format B, or

        = open(2,1) ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ close(2,1)
        =    [0x29] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ [0x39]

for format C.
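
The record encodings above can be sketched as follows. The functions take already-encoded labels and fields as byte strings; `encode_record`, `encode_short_record`, `string_repr` and `symbol_repr` are hypothetical helper names introduced only for this illustration.

```python
def varint(m):
    out = bytearray()
    while m >= 0x80:
        out.append((m & 0x7F) | 0x80)
        m >>= 7
    out.append(m)
    return bytes(out)

def leadbyte(t, n, m):
    return bytes([t * 64 + n * 16 + m])

def header(t, n, m):
    return leadbyte(t, n, m) if m < 15 else leadbyte(t, n, 15) + varint(m)

def open_(t, n):
    return leadbyte(0, 2, t * 4 + n)

def close(t, n):
    return leadbyte(0, 3, t * 4 + n)

def string_repr(s):   # format B String, used here only to build fields
    data = s.encode("utf-8")
    return header(1, 1, len(data)) + data

def symbol_repr(s):   # format B Symbol, used here only to build labels
    data = s.encode("utf-8")
    return header(1, 3, len(data)) + data

def encode_record(label_repr, field_reprs, streaming=False):
    """Generic record: the label is encoded inline and counted in the length."""
    body = label_repr + b"".join(field_reprs)
    if streaming:
        return open_(2, 3) + body + close(2, 3)
    return header(2, 3, len(field_reprs) + 1) + body

def encode_short_record(n, field_reprs, streaming=False):
    """Record whose label is implied by short form label number n."""
    if n not in (0, 1, 2):
        raise ValueError("short form label numbers are 0, 1 and 2")
    body = b"".join(field_reprs)
    if streaming:
        return open_(2, n) + body + close(2, n)
    return header(2, n, len(field_reprs)) + body

fields = [string_repr(s) for s in ("Dr", "Elizabeth", "Blackwell")]
print(encode_short_record(0, []).hex())          # void() with void mapped to 0
print(encode_short_record(1, fields).hex(" "))   # person(...) with person mapped to 1
```

The mapping from label numbers to labels is not carried in the encoding at all, so a decoder needs the same application-specific table to recover the label.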

### Sequences, Sets and Dictionaries

Format B (known length):

            [[ [X_1...X_m] ]] = header(3,0,m)   ++ [[X_1]] ++...++ [[X_m]]
        [[ #set{X_1...X_m} ]] = header(3,1,m)   ++ [[X_1]] ++...++ [[X_m]]
    [[ {K_1:V_1...K_m:V_m} ]] = header(3,2,m*2) ++ [[K_1]] ++ [[V_1]] ++...
                                                ++ [[K_m]] ++ [[V_m]]

Note that `m*2` is given to `header` for a `Dictionary`, since there
are two `Value`s in each key-value pair.

Format C (streaming):

            [[ [X_1...X_m] ]] = open(3,0) ++ [[X_1]] ++...++ [[X_m]] ++ close(3,0)
        [[ #set{X_1...X_m} ]] = open(3,1) ++ [[X_1]] ++...++ [[X_m]] ++ close(3,1)
    [[ {K_1:V_1...K_m:V_m} ]] = open(3,2) ++ [[K_1]] ++ [[V_1]] ++...
                                          ++ [[K_m]] ++ [[V_m]] ++ close(3,2)

Applications may use whichever format suits their needs on a
case-by-case basis.

There is *no* ordering requirement on the `X_i` elements or
`K_i`/`V_i` pairs.[^no-sorting-rationale] They may appear in any
order.

  [^no-sorting-rationale]: In the BitTorrent encoding format,
    [bencoding](http://www.bittorrent.org/beps/bep_0003.html#bencoding),
    dictionary key/value pairs must be sorted by key. This is a
    necessary step for ensuring serialization of `Value`s is
    canonical. We do not require that key/value pairs (or set
    elements) be in sorted order for serialized `Value`s, because (a)
    where canonicalization is used for cryptographic signatures, it is
    more reliable to simply retain the exact binary form of the signed
    document than to depend on canonical de- and re-serialization, and
    (b) sorting keys or elements makes no sense in streaming
    serialization formats.

    However, a quality implementation may wish to offer the programmer
    the option of serializing with set elements and dictionary keys in
    sorted order.

Note that `header(3,3,m)` and `open(3,3)`/`close(3,3)` are unused and reserved.
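
For the known-length forms, the three collection kinds differ only in the value of `n` and in how the count is computed. A sketch, again operating on already-encoded elements and using hypothetical function names:

```python
def varint(m):
    out = bytearray()
    while m >= 0x80:
        out.append((m & 0x7F) | 0x80)
        m >>= 7
    out.append(m)
    return bytes(out)

def header(t, n, m):
    lead = t * 64 + n * 16
    return bytes([lead + m]) if m < 15 else bytes([lead + 15]) + varint(m)

def encode_sequence(elem_reprs):
    return header(3, 0, len(elem_reprs)) + b"".join(elem_reprs)

def encode_set(elem_reprs):
    # Elements may appear in any order; the caller ensures they are distinct.
    return header(3, 1, len(elem_reprs)) + b"".join(elem_reprs)

def encode_dictionary(pair_reprs):
    # pair_reprs is a list of (key_repr, value_repr) tuples; the count
    # given to header is the number of Values, i.e. twice the pair count.
    body = b"".join(k + v for k, v in pair_reprs)
    return header(3, 2, 2 * len(pair_reprs)) + body

print(encode_sequence([b"\x11", b"\x12", b"\x13", b"\x14"]).hex(" "))
print(encode_dictionary([(b"\x55Width", b"\x41\x64")]).hex(" "))
```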

### SignedIntegers

Format B/A (known length/fixed-size):

    [[ x ]] when x ∈ SignedInteger = header(1,0,m) ++ intbytes(x)  if x<-3 ∨ 13≤x
                                     leadbyte(0,1,x+16)            if -3≤x<0
                                     leadbyte(0,1,x)               if 0≤x<13

Integers in the range [-3,12] are compactly represented using format A
because they are so frequently used. Other integers are represented
using format B.

Format C *MUST NOT* be used for `SignedInteger`s.

The function `intbytes(x)` gives the big-endian two's-complement
binary representation of `x`, taking exactly as many whole bytes as
needed to unambiguously identify the value and its sign, and `m =
|intbytes(x)|`. The most-significant bit in the first byte in
`intbytes(x)` <!-- for `x`≠0 --> is the sign bit.[^zero-intbytes]

  [^zero-intbytes]: The value 0 needs zero bytes to identify the
    value, so `intbytes(0)` is the empty byte string. Non-zero values
    need at least one byte.

For example,

    [[   -257 ]] = 42 FE FF    [[     -3 ]] = 1D       [[    128 ]] = 42 00 80
    [[   -256 ]] = 42 FF 00    [[     -2 ]] = 1E       [[    255 ]] = 42 00 FF
    [[   -255 ]] = 42 FF 01    [[     -1 ]] = 1F       [[    256 ]] = 42 01 00
    [[   -254 ]] = 42 FF 02    [[      0 ]] = 10       [[  32767 ]] = 42 7F FF
    [[   -129 ]] = 42 FF 7F    [[      1 ]] = 11       [[  32768 ]] = 43 00 80 00
    [[   -128 ]] = 41 80       [[     12 ]] = 1C       [[  65535 ]] = 43 00 FF FF
    [[   -127 ]] = 41 81       [[     13 ]] = 41 0D    [[  65536 ]] = 43 01 00 00
    [[     -4 ]] = 41 FC       [[    127 ]] = 41 7F    [[ 131072 ]] = 43 02 00 00
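
The following sketch reproduces the examples above. It uses Python's arbitrary-precision `int`, so no truncation occurs. The small-integer cases build the lead byte directly instead of going through a length-encoding helper: for `x = -1` the low four bits are 15, which a length encoder would otherwise read as "length follows". The function names are ours.

```python
def varint(m):
    out = bytearray()
    while m >= 0x80:
        out.append((m & 0x7F) | 0x80)
        m >>= 7
    out.append(m)
    return bytes(out)

def leadbyte(t, n, m):
    return bytes([t * 64 + n * 16 + m])

def header(t, n, m):
    return leadbyte(t, n, m) if m < 15 else leadbyte(t, n, 15) + varint(m)

def intbytes(x: int) -> bytes:
    """Shortest big-endian two's-complement form; empty for zero."""
    if x == 0:
        return b""
    # Bits needed for the magnitude; one more bit is needed for the sign.
    magnitude_bits = (x if x >= 0 else ~x).bit_length()
    return x.to_bytes(magnitude_bits // 8 + 1, "big", signed=True)

def encode_signed_integer(x: int) -> bytes:
    if -3 <= x < 0:
        return leadbyte(0, 1, x + 16)   # format A: 0x1D, 0x1E, 0x1F
    if 0 <= x < 13:
        return leadbyte(0, 1, x)        # format A: 0x10 .. 0x1C
    payload = intbytes(x)
    return header(1, 0, len(payload)) + payload

for x in (-257, -128, -3, 0, 12, 13, 128, 32768):
    print(x, encode_signed_integer(x).hex(" "))
```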

### Strings, ByteStrings and Symbols

Syntax for these three types varies only in the value of `n` supplied
to `header`, `open`, and `close`. In each case, the payload following
the header is a binary sequence; for `String` and `Symbol`, it is a
UTF-8 encoding of the `Value`'s code points, while for `ByteString` it
is the raw data contained within the `Value` unmodified.

Format B (known length):

              [[ S ]] = header(1,n,m) ++ encode(S)
              where m = |encode(S)|
    and (n,encode(S)) = (1,utf8(S))  if S ∈ String
                        (2,S)        if S ∈ ByteString
                        (3,utf8(S))  if S ∈ Symbol

To stream a `String`, `ByteString` or `Symbol`, emit `open(1,n)` and
then a sequence of zero or more format B chunks, followed by
`close(1,n)`. Every chunk must be a `ByteString`.

While the overall content of a streamed `String` or `Symbol` must be
valid UTF-8, individual chunks do not have to conform to UTF-8.
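
Both the known-length and the streaming forms can be sketched briefly. In the sketch below, which uses illustrative function names of our own, `stream_atom` takes the payload already split into byte chunks, which makes it easy to see that a chunk boundary may fall in the middle of a multi-byte UTF-8 sequence.

```python
def varint(m):
    out = bytearray()
    while m >= 0x80:
        out.append((m & 0x7F) | 0x80)
        m >>= 7
    out.append(m)
    return bytes(out)

def leadbyte(t, n, m):
    return bytes([t * 64 + n * 16 + m])

def header(t, n, m):
    return leadbyte(t, n, m) if m < 15 else leadbyte(t, n, 15) + varint(m)

def encode_atom(n: int, payload: bytes) -> bytes:
    """Format B: n is 1 for String, 2 for ByteString, 3 for Symbol."""
    return header(1, n, len(payload)) + payload

def encode_string(s: str) -> bytes:
    return encode_atom(1, s.encode("utf-8"))

def encode_bytestring(b: bytes) -> bytes:
    return encode_atom(2, b)

def encode_symbol(name: str) -> bytes:
    return encode_atom(3, name.encode("utf-8"))

def stream_atom(n: int, chunks) -> bytes:
    """Format C: every chunk is a format B ByteString, whatever n is."""
    body = b"".join(encode_atom(2, chunk) for chunk in chunks)
    return leadbyte(0, 2, 4 + n) + body + leadbyte(0, 3, 4 + n)

print(encode_string("hello").hex(" "))
print(stream_atom(1, [b"he", b"llo"]).hex(" "))
# "é" is C3 A9 in UTF-8; splitting it across two chunks is permitted.
print(stream_atom(1, [b"\xc3", b"\xa9"]).hex(" "))
```

A decoder for the streamed form therefore has to concatenate all chunks before validating or decoding the UTF-8.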

### Fixed-length Atoms

Fixed-length atoms all use format A, and do not have a length
representation. They repurpose the bits that format B `Repr`s use to
specify lengths. Applications *MUST NOT* use format C with
`open(0,n)` or `close(0,n)` for any `n`.

#### Booleans

    [[ #false ]] = header(0,0,0) = [0x00]
    [[  #true ]] = header(0,0,1) = [0x01]

#### Floats and Doubles

    [[ F ]] when F ∈ Float  = header(0,0,2) ++ binary32(F)
    [[ D ]] when D ∈ Double = header(0,0,3) ++ binary64(D)

The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
8-byte IEEE 754 binary representations of `F` and `D`, respectively.
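
With Python's `struct` module, the fixed-length atoms can be sketched directly. The function names are hypothetical. Python has no single-precision type, so `encode_float` rounds its argument to binary32 when packing, and `struct.pack` raises `OverflowError` for finite values too large for that format.

```python
import struct

def encode_boolean(b: bool) -> bytes:
    return bytes([0x01 if b else 0x00])

def encode_float(f: float) -> bytes:
    # Lead byte 0x02, then big-endian IEEE 754 binary32.
    return bytes([0x02]) + struct.pack(">f", f)

def encode_double(d: float) -> bytes:
    # Lead byte 0x03, then big-endian IEEE 754 binary64.
    return bytes([0x03]) + struct.pack(">d", d)

print(encode_float(1.0).hex(" "))
print(encode_double(1.0).hex(" "))
print(encode_double(float("inf")).hex(" "))
```

Infinities and NaNs need no special treatment here, since the binary representation carries the bit pattern unchanged.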

## Examples

### Simple examples

<!-- TODO: Give some examples of large and small Preserves, perhaps -->
<!-- translated from various JSON blobs floating around the internet. -->

For the following examples, imagine an application that maps `Record`
short form label number 0 to label `discard`, 1 to `capture`, and 2 to
`observe`.

| Value                                             | Encoded hexadecimal byte sequence                                    |
|---------------------------------------------------|----------------------------------------------------------------------|
| `capture(discard())`                              | 91 80                                                                |
| `observe(speak(discard(), capture(discard())))`   | A1 B3 75 73 70 65 61 6B 80 91 80                                     |
| `[1 2 3 4]` (format B)                            | C4 11 12 13 14                                                       |
| `[1 2 3 4]` (format C)                            | 2C 11 12 13 14 3C                                                    |
| `[-2 -1 0 1]`                                     | C4 1E 1F 10 11                                                       |
| `"hello"` (format B)                              | 55 68 65 6C 6C 6F                                                    |
| `"hello"` (format C, 2 chunks)                    | 25 62 68 65 63 6C 6C 6F 35                                           |
| `"hello"` (format C, 5 chunks)                    | 25 62 68 65 62 6C 6C 60 60 61 6F 35                                  |
| `["hello" there #"world" [] #set{} #true #false]` | C7 55 68 65 6C 6C 6F 75 74 68 65 72 65 65 77 6F 72 6C 64 C0 D0 01 00 |
| `-257`                                            | 42 FE FF                                                             |
| `-1`                                              | 1F                                                                   |
| `0`                                               | 10                                                                   |
| `1`                                               | 11                                                                   |
| `255`                                             | 42 00 FF                                                             |
| `1.0f`                                            | 02 3F 80 00 00                                                       |
| `1.0`                                             | 03 3F F0 00 00 00 00 00 00                                           |
| `-1.202e300`                                      | 03 FE 3C B7 B7 59 BF 04 26                                           |

The next example uses a non-`Symbol` label for a record.[^extensibility2] The `Record`

    [titled person 2 thing 1](101, "Blackwell", date(1821 2 3), "Dr")

encodes to

    B5                              ;; Record, generic, 4+1
      C5                              ;; Sequence, 5
        76 74 69 74 6C 65 64            ;; Symbol, "titled"
        76 70 65 72 73 6F 6E            ;; Symbol, "person"
        12                              ;; SignedInteger, "2"
        75 74 68 69 6E 67               ;; Symbol, "thing"
        11                              ;; SignedInteger, "1"
      41 65                           ;; SignedInteger, "101"
      59 42 6C 61 63 6B 77 65 6C 6C   ;; String, "Blackwell"
      B4                              ;; Record, generic, 3+1
        74 64 61 74 65                  ;; Symbol, "date"
        42 07 1D                        ;; SignedInteger, "1821"
        12                              ;; SignedInteger, "2"
        13                              ;; SignedInteger, "3"
      52 44 72                        ;; String, "Dr"
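
The nesting in this encoding can be sketched in Python. The `Symbol` and `Record` classes and the `encode` function below are hypothetical names chosen for illustration, and the sketch handles only the value kinds and short lengths that this one example needs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Symbol:
    name: str


@dataclass(frozen=True)
class Record:
    label: object  # any Value may serve as a label, not just a Symbol
    fields: tuple


def header(kind: int, length: int) -> bytes:
    # This example never needs a length of 15 or more, so the varint case is omitted.
    assert 0 <= length < 15
    return bytes([(kind << 4) | length])


def encode(value) -> bytes:
    if isinstance(value, int):
        if -3 <= value <= 12:
            return bytes([0x10 | (value & 0x0F)])
        bits = (value if value >= 0 else ~value).bit_length()
        body = value.to_bytes(bits // 8 + 1, "big", signed=True)
        return header(0x4, len(body)) + body
    if isinstance(value, str):
        utf8 = value.encode("utf-8")
        return header(0x5, len(utf8)) + utf8
    if isinstance(value, Symbol):
        utf8 = value.name.encode("utf-8")
        return header(0x7, len(utf8)) + utf8
    if isinstance(value, list):
        return header(0xC, len(value)) + b"".join(encode(v) for v in value)
    if isinstance(value, Record):
        # Generic Record: the count covers the label plus the fields.
        parts = [value.label, *value.fields]
        return header(0xB, len(parts)) + b"".join(encode(p) for p in parts)
    raise TypeError(f"unsupported value: {value!r}")


S = Symbol
blackwell = Record(
    [S("titled"), S("person"), 2, S("thing"), 1],
    (101, "Blackwell", Record(S("date"), (1821, 2, 3)), "Dr"),
)
encoded = encode(blackwell)
```

The label is itself a `Sequence`, so it is encoded like any other `Value` in the first position after the `B5` lead byte.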

  [^extensibility2]: It happens to line up with Racket's
    representation of a record label for an inheritance hierarchy
    where `titled` extends `person` extends `thing`:

        (struct date (year month day) #:prefab)
        (struct thing (id) #:prefab)
        (struct person thing (name date-of-birth) #:prefab)
        (struct titled person (title) #:prefab)

    For more detail on Racket's representations of record labels, see
    [the Racket documentation for `make-prefab-struct`](http://docs.racket-lang.org/reference/structutils.html#%28def._%28%28quote._~23~25kernel%29._make-prefab-struct%29%29).

---

### JSON examples

The examples from
[RFC 8259](https://tools.ietf.org/html/rfc8259#section-13) read as
valid Preserves, though the JSON literals `true`, `false` and `null`
read as `Symbol`s. The first example:

    {
      "Image": {
          "Width":  800,
          "Height": 600,
          "Title":  "View from 15th Floor",
          "Thumbnail": {
              "Url":    "http://www.example.com/image/481989943",
              "Height": 125,
              "Width":  100
          },
          "Animated" : false,
          "IDs": [116, 943, 234, 38793]
        }
    }

encodes to binary as follows:

    E2
      55 "Image"
      EC
        55 "Width"    42 03 20
        55 "Title"    5F 14 "View from 15th Floor"
        58 "Animated" 75 "false"
        56 "Height"   42 02 58
        59 "Thumbnail"
          E6
            55 "Width"  41 64
            53 "Url"    5F 26 "http://www.example.com/image/481989943"
            56 "Height" 41 7D
        53 "IDs"      C4
                        41 74
                        42 03 AF
                        42 00 EA
                        43 00 97 89

and the second example:

    [
      {
         "precision": "zip",
         "Latitude":  37.7668,
         "Longitude": -122.3959,
         "Address":   "",
         "City":      "SAN FRANCISCO",
         "State":     "CA",
         "Zip":       "94107",
         "Country":   "US"
      },
      {
         "precision": "zip",
         "Latitude":  37.371991,
         "Longitude": -122.026020,
         "Address":   "",
         "City":      "SUNNYVALE",
         "State":     "CA",
         "Zip":       "94085",
         "Country":   "US"
      }
    ]

encodes to binary as follows:

    C2
      EF 10
        59 "precision"  53 "zip"
        58 "Latitude"   03 40 42 E2 26 80 9D 49 52
        59 "Longitude"  03 C0 5E 99 56 6C F4 1F 21
        57 "Address"    50
        54 "City"       5D "SAN FRANCISCO"
        55 "State"      52 "CA"
        53 "Zip"        55 "94107"
        57 "Country"    52 "US"
      EF 10
        59 "precision"  53 "zip"
        58 "Latitude"   03 40 42 AF 9D 66 AD B4 03
        59 "Longitude"  03 C0 5E 81 AA 4F CA 42 AF
        57 "Address"    50
        54 "City"       59 "SUNNYVALE"
        55 "State"      52 "CA"
        53 "Zip"        55 "94085"
        57 "Country"    52 "US"

## Conventions for Common Data Types

The `Value` data type is essentially an S-Expression, able to
represent semi-structured data over `ByteString`, `String`,
`SignedInteger` atoms and so on.[^why-not-spki-sexps]

  [^why-not-spki-sexps]: Rivest's S-Expressions are in many ways
    similar to Preserves. However, while they include binary data and
    sequences, and an obvious equivalence for them exists, they lack
    numbers *per se* as well as any kind of unordered structure such
    as sets or maps. In addition, while "display hints" allow
    labelling of binary data with an intended interpretation, they
    cannot be attached to any other kind of structure, and the "hint"
    itself can only be a binary blob.

However, users need a wide variety of data types for representing
domain-specific values such as various kinds of encoded and normalized
text, calendrical values, machine words, and so on.

Appropriately-labelled `Record`s denote these domain-specific data
types.[^why-dictionaries]

  [^why-dictionaries]: Given `Record`'s existence, it may seem odd
    that `Dictionary`, `Set`, `Float`, etc. are given special
    treatment. Preserves aims to offer a useful basic equivalence
    predicate to programmers, and so if a data type demands a special
    equivalence predicate, as `Dictionary`, `Set` and `Float` all do,
    then the type should be included in the base language. Otherwise,
    it can be represented as a `Record` and treated separately. Both
    `Boolean` and `String` are seeming exceptions: they merit
    inclusion because of their cultural importance.

All of these conventions are optional. They form a layer atop the core
`Value` structure. Non-domain-specific tools do not in general need to
treat them specially.

**Validity.** Many of the labels we will describe in this section come
  with side-conditions on the contents of labelled `Record`s. It is
  possible to construct an instance of `Value` that violates these
  side-conditions without ceasing to be a `Value` or becoming
  unrepresentable. However, we say that such a `Value` is *invalid*
  because it fails to honour the necessary side-conditions.
  Implementations *SHOULD* allow two modes of working: one which
  treats all `Value`s identically, without regard for side-conditions,
  and one which enforces validity (i.e. side-conditions) when reading,
  writing, or constructing `Value`s.

### MIME-type tagged binary data

Many internet protocols use
[media types](https://tools.ietf.org/html/rfc6838) (a.k.a. MIME types)
to indicate the format of some associated binary data. For this
purpose, we define `MIMEData` to be a record labelled `mime` with two
fields, the first being a `Symbol`, the media type, and the second
being a `ByteString`, the binary data.

While each media type may define its own rules for comparing
documents, we define ordering among `MIMEData` *representations* of
such media types following the general rules for ordering of
`Record`s.

**Examples.**

| Value                                      | Encoded hexadecimal byte sequence                                                                                 |
|--------------------------------------------|-------------------------------------------------------------------------------------------------------------------|
| `mime(application/octet-stream #"abcde")`  | B3 74 6D 69 6D 65 7F 18 61 70 70 6C 69 63 61 74 69 6F 6E 2F 6F 63 74 65 74 2D 73 74 72 65 61 6D 65 61 62 63 64 65 |
| `mime(text/plain #"ABC")`                  | B3 74 6D 69 6D 65 7A 74 65 78 74 2F 70 6C 61 69 6E 63 41 42 43                                                    |
| `mime(application/xml #"<xhtml/>")`        | B3 74 6D 69 6D 65 7F 0F 61 70 70 6C 69 63 61 74 69 6F 6E 2F 78 6D 6C 68 3C 78 68 74 6D 6C 2F 3E                   |
| `mime(text/csv #"123,234,345")`            | B3 74 6D 69 6D 65 78 74 65 78 74 2F 63 73 76 6B 31 32 33 2C 32 33 34 2C 33 34 35                                  |

Applications making heavy use of `mime` records may choose to use a
short form label number for the record type. For example, if short
form label number 1 were chosen, the second example above,
`mime(text/plain #"ABC")`, would be encoded with "92" in place of "B3
74 6D 69 6D 65".
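
The difference between the generic and short forms can be sketched in Python. The names `header` and `mime_record` are hypothetical, and which short form label number (if any) stands for `mime` is an application-level agreement that the sketch simply takes as a parameter.

```python
def header(kind: int, length: int) -> bytes:
    """Lead byte with a 4-bit length; lengths of 15 or more use 1111 plus a varint."""
    if length < 15:
        return bytes([(kind << 4) | length])
    out = bytearray([(kind << 4) | 15])
    while True:
        low7, length = length & 0x7F, length >> 7
        out.append(low7 | (0x80 if length else 0))
        if not length:
            return bytes(out)


def mime_record(media_type: str, data: bytes, short_label=None) -> bytes:
    symbol = media_type.encode("utf-8")
    fields = header(0x7, len(symbol)) + symbol + header(0x6, len(data)) + data
    if short_label is None:
        # Generic form: Bx counts the label plus two fields, then the Symbol `mime`.
        return header(0xB, 3) + header(0x7, 4) + b"mime" + fields
    if short_label not in (0, 1, 2):
        raise ValueError("only short form label indices 0, 1 and 2 exist")
    # Short form: lead nibble 8, 9 or A selects the label; the count covers fields only.
    return header(0x8 + short_label, 2) + fields
```

With `short_label=1`, the six bytes `B3 74 6D 69 6D 65` shrink to the single byte `92`, and the two fields are unchanged.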

### Unicode normalization forms

Unicode defines multiple
[normalization forms](http://unicode.org/reports/tr15/) for text.
While no particular normalization form is required for `String`s,
users may need to unambiguously signal or require a particular
normalization form. A `NormalizedString` is a `Record` labelled with
`unicode-normalization` and having two fields, the first of which is a
`Symbol` specifying the normalization form used (e.g. `nfc`, `nfd`,
`nfkc`, `nfkd`), and the second of which is a `String` whose
underlying code point representation *MUST* be normalized according to
the named normalization form.
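
A validity check for this side-condition might look like the following Python sketch, which relies on the standard `unicodedata` module. The function name is hypothetical, and treating any form outside the four listed above as invalid is an assumption of the sketch.

```python
import unicodedata

# Map the Symbol used in the record to the form name expected by `unicodedata`.
FORMS = {"nfc": "NFC", "nfd": "NFD", "nfkc": "NFKC", "nfkd": "NFKD"}


def is_valid_normalized_string(form: str, text: str) -> bool:
    """True if `text` is already normalized according to `form`."""
    python_form = FORMS.get(form)
    if python_form is None:
        return False  # assumption: unknown forms are rejected
    return unicodedata.normalize(python_form, text) == text
```

For example, the precomposed spelling `"p\u00e4ron"` is valid under `nfc` but not under `nfd`, while the decomposed spelling `"pa\u0308ron"` is the other way around.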

### IRIs (URIs, URLs, URNs, etc.)

An `IRI` is a `Record` labelled with `iri` and having one field, a
`String` which is the IRI itself and which *MUST* be a valid absolute
or relative IRI.

### Machine words

The definition of `SignedInteger` captures all integers. However, in
certain circumstances it can be valuable to assert that a number
inhabits a particular range, such as a fixed-width machine word.

A family of labels `i`*n* and `u`*n* for *n* ∈ {8,16,32,64} denote
*n*-bit-wide signed and unsigned range restrictions, respectively.
Records with these labels *MUST* have one field, a `SignedInteger`,
which *MUST* fall within the appropriate range. That is, to be valid,

 - in `i8(`*x*`)`, -128 <= *x* <= 127.
 - in `u8(`*x*`)`, 0 <= *x* <= 255.
 - in `i16(`*x*`)`, -32768 <= *x* <= 32767.
 - etc.
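
The ranges for the whole family follow from *n*. As a minimal Python sketch (the function names are hypothetical, and a record is modelled simply as a label string plus a list of fields):

```python
def word_range(label: str):
    """Return (lowest, highest) for labels i8..i64 and u8..u64, else None."""
    if label[:1] not in ("i", "u") or label[1:] not in ("8", "16", "32", "64"):
        return None
    n = int(label[1:])
    if label[0] == "i":
        return -(1 << (n - 1)), (1 << (n - 1)) - 1
    return 0, (1 << n) - 1


def is_valid_word(label: str, fields) -> bool:
    """Check the side-conditions: exactly one integer field, within range."""
    bounds = word_range(label)
    if bounds is None or len(fields) != 1:
        return False
    x = fields[0]
    if isinstance(x, bool) or not isinstance(x, int):
        return False  # Python's bool is an int subclass, but it is not a SignedInteger
    return bounds[0] <= x <= bounds[1]
```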

### Anonymous Tuples and Unit

A `Tuple` is a `Record` with label `tuple` and zero or more fields,
denoting an anonymous tuple of values.

The 0-ary tuple, `tuple()`, denotes the empty tuple, sometimes called
"unit" or "void" (but *not* e.g. JavaScript's "undefined" value).

### Null and Undefined

Tony Hoare's
"[billion-dollar mistake](https://en.wikipedia.org/wiki/Tony_Hoare#Apologies_and_retractions)"
can be represented with the 0-ary `Record` `null()`. An "undefined"
value can be represented as `undefined()`.

### Dates and Times

Dates, times, moments, and timestamps can be represented with a
`Record` with label `rfc3339` having a single field, a `String`, which
*MUST* conform to one of the `full-date`, `partial-time`, `full-time`,
or `date-time` productions of
[section 5.6 of RFC 3339](https://tools.ietf.org/html/rfc3339#section-5.6).
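
A syntactic check of the field could be sketched in Python as follows. This is a simplification under stated assumptions: it checks only the shape of the four productions, not field ranges such as months or leap seconds, and it accepts only upper-case `T` and `Z`. The function name is hypothetical.

```python
import re

_DATE = r"[0-9]{4}-[0-9]{2}-[0-9]{2}"
_PARTIAL_TIME = r"[0-9]{2}:[0-9]{2}:[0-9]{2}(?:\.[0-9]+)?"
_OFFSET = r"(?:Z|[+-][0-9]{2}:[0-9]{2})"

PRODUCTIONS = {
    "full-date": re.compile(_DATE),
    "partial-time": re.compile(_PARTIAL_TIME),
    "full-time": re.compile(_PARTIAL_TIME + _OFFSET),
    "date-time": re.compile(_DATE + "T" + _PARTIAL_TIME + _OFFSET),
}


def rfc3339_production(text: str):
    """Name of the production whose shape `text` has, or None if it has none."""
    for name, pattern in PRODUCTIONS.items():
        if pattern.fullmatch(text):
            return name
    return None
```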

## Security Considerations

**Empty chunks.** Streamed (format C) `String`s, `ByteString`s and
`Symbol`s may include chunks of zero length. This opens up a
possibility for denial-of-service: an attacker may begin streaming a
string, sending an endless sequence of zero length chunks, appearing
to make progress but not actually doing so. Implementations may
optionally place reasonable limits on the number of consecutive empty
chunks that may appear in a stream, and may even supply an optional
mode that rejects empty chunks entirely.
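
One way such a restriction might look is sketched below in Python. The function names and the default limit of 8 are hypothetical, and for brevity the sketch reads from a complete byte string rather than a live stream, assuming every chunk is a plain (format B) `ByteString`.

```python
def read_varint(data: bytes, pos: int):
    """Decode a little-endian base-128 varint; return (value, next position)."""
    value, shift = 0, 0
    while True:
        if pos >= len(data):
            raise ValueError("truncated varint")
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, pos


def read_streamed_string(data: bytes, max_empty_run: int = 8) -> str:
    if data[:1] != b"\x25":
        raise ValueError("expected Start Stream for a String (25)")
    pos, parts, empty_run = 1, [], 0
    while True:
        if pos >= len(data):
            raise ValueError("truncated stream")
        lead = data[pos]
        pos += 1
        if lead == 0x35:  # End Stream matching the 25 above
            return b"".join(parts).decode("utf-8")
        if lead >> 4 != 0x6:
            raise ValueError("each chunk must be a ByteString")
        length = lead & 0x0F
        if length == 15:
            length, pos = read_varint(data, pos)
        if length == 0:
            empty_run += 1  # no progress was made by this chunk
            if empty_run > max_empty_run:
                raise ValueError("too many consecutive empty chunks")
        else:
            empty_run = 0
        if pos + length > len(data):
            raise ValueError("truncated chunk")
        parts.append(data[pos:pos + length])
        pos += length
```

Setting `max_empty_run=0` gives the stricter mode that rejects empty chunks entirely.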

**Whitespace.** Similarly, the textual format for `Value`s allows
arbitrary whitespace in many positions. In streaming transfer
situations, consider optional restrictions on the amount of
consecutive whitespace that may appear in a serialized `Value`.

**Canonical form for cryptographic hashing and signing.** As
specified, neither the textual nor the compact binary encoding rules
for `Value`s force canonical serializations. Two serializations of the
same `Value` may yield different binary `Repr`s.

## Appendix. Table of lead byte values

     00 - False
     01 - True
     02 - Float
     03 - Double
    (0x)  RESERVED 04-0F
     1x - Small integers 0..12,-3..-1
     2x - Start Stream
     3x - End Stream

     4x - SignedInteger
     5x - String
     6x - ByteString
     7x - Symbol

     8x - short form Record label index 0
     9x - short form Record label index 1
     Ax - short form Record label index 2
     Bx - Record

     Cx - Sequence
     Dx - Set
     Ex - Dictionary
    (Fx)  RESERVED F0-FF

## Appendix. Bit fields within lead byte values

     tt nn mmmm  contents
     ---------- ---------

     00 00 0000  False
     00 00 0001  True
     00 00 0010  Float, 32 bits big-endian binary
     00 00 0011  Double, 64 bits big-endian binary

     00 01 xxxx  Small integers 0..12,-3..-1

     00 10 ttnn  Start Stream <tt,nn>
                   When tt = 00 --> error
                             01 --> each chunk is a ByteString
                             1x --> each chunk is a single encoded Value
     00 11 ttnn  End Stream <tt,nn> (must match preceding Start Stream)

     01 00 mmmm  SignedInteger, big-endian binary
     01 01 mmmm  String, UTF-8 binary
     01 10 mmmm  ByteString
     01 11 mmmm  Symbol, UTF-8 binary

     10 00 mmmm  application-specific Record
     10 01 mmmm  application-specific Record
     10 10 mmmm  application-specific Record
     10 11 mmmm  Record

     11 00 mmmm  Sequence
     11 01 mmmm  Set
     11 10 mmmm  Dictionary

     If mmmm = 1111, a varint(m) follows, giving the length, before
     the body; otherwise, m is the length of the body to follow.
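
The table can be turned into a small classifier. The following Python sketch uses hypothetical names and returns a pair of the kind and the low nibble; for the length-carrying kinds, a low nibble of 15 means that a varint length follows.

```python
FIXED = {0x00: "False", 0x01: "True", 0x02: "Float", 0x03: "Double"}
ATOMS = {0x4: "SignedInteger", 0x5: "String", 0x6: "ByteString", 0x7: "Symbol"}
COMPOUNDS = {0xB: "Record", 0xC: "Sequence", 0xD: "Set", 0xE: "Dictionary"}


def classify_lead_byte(byte: int):
    """Return (kind, argument) for a lead byte, following the bit-field table."""
    high, low = byte >> 4, byte & 0x0F
    if high == 0x0:
        return (FIXED[byte], None) if byte in FIXED else ("RESERVED", None)
    if high == 0x1:
        # Nibbles 0..12 stand for themselves; 13, 14, 15 stand for -3, -2, -1.
        return ("SmallInteger", low if low <= 12 else low - 16)
    if high == 0x2:
        return ("StartStream", low)
    if high == 0x3:
        return ("EndStream", low)
    if high in ATOMS:
        return (ATOMS[high], low)
    if 0x8 <= high <= 0xA:
        return (f"Record (short form label {high - 0x8})", low)
    if high in COMPOUNDS:
        return (COMPOUNDS[high], low)
    return ("RESERVED", None)  # F0-FF
```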

## Appendix. Representing Values in Programming Languages

We have given a definition of `Value` and its semantics, and proposed
a concrete syntax for communicating and storing `Value`s. We now turn
to **suggested** representations of `Value`s as *programming-language
values* for various programming languages.

When designing a language mapping, an important consideration is
roundtripping: serialization after deserialization, and vice versa,
should both be identities.

### JavaScript

 - `Boolean` ↔ `Boolean`
 - `Float` and `Double` ↔ numbers
 - `SignedInteger` ↔ numbers or `BigInt` (see [here](https://developers.google.com/web/updates/2018/05/bigint) and [here](https://github.com/tc39/proposal-bigint))
 - `String` ↔ strings
 - `ByteString` ↔ `Uint8Array`
 - `Symbol` ↔ `Symbol.for(...)`
 - `Record` ↔ `{ "_label": theLabel, "_fields": [field0, ..., fieldN] }`, plus convenience accessors
    - `undefined()` ↔ the undefined value
    - `rfc3339(F)` ↔ `Date`, if `F` matches the `date-time` RFC 3339 production
 - `Sequence` ↔ `Array`
 - `Set` ↔ `{ "_set": M }` where `M` is a `Map` from the elements of the set to `true`
 - `Dictionary` ↔ a `Map`

### Scheme/Racket

 - `Boolean` ↔ booleans
 - `Float` and `Double` ↔ inexact numbers (Racket: single- and double-precision floats)
 - `SignedInteger` ↔ exact numbers
 - `String` ↔ strings
 - `ByteString` ↔ byte vector (Racket: "Bytes")
 - `Symbol` ↔ symbols
 - `Record` ↔ structures (Racket: prefab struct)
 - `Sequence` ↔ lists
 - `Set` ↔ Racket: sets
 - `Dictionary` ↔ Racket: hash-table

### Java

 - `Boolean` ↔ `Boolean`
 - `Float` and `Double` ↔ `Float` and `Double`
 - `SignedInteger` ↔ `Integer`, `Long`, `BigInteger`
 - `String` ↔ `String`
 - `ByteString` ↔ `byte[]`
 - `Symbol` ↔ a simple data class wrapping a `String`
 - `Record` ↔ in a simple implementation, a generic `Record` class; else perhaps a bean mapping?
    - `mime(T B)` ↔ an implementation of `javax.activation.DataSource`?
 - `Sequence` ↔ an implementation of `java.util.List`
 - `Set` ↔ an implementation of `java.util.Set`
 - `Dictionary` ↔ an implementation of `java.util.Map`

### Erlang

 - `Boolean` ↔ `true` and `false`
 - `Float` and `Double` ↔ floats (unsure how Erlang deals with single-precision)
 - `SignedInteger` ↔ integers
 - `String` ↔ pair of `utf8` and a binary
 - `ByteString` ↔ a binary
 - `Symbol` ↔ pair of `atom` and a binary
 - `Record` ↔ triple of `obj`, label, and field list
 - `Sequence` ↔ a list
 - `Set` ↔ a `sets` set
 - `Dictionary` ↔ a [map][erlang-map] (new in Erlang/OTP R17)

This is a somewhat unsatisfactory mapping because: (a) Erlang doesn't
garbage-collect its atoms, meaning that (a.1) representing `Symbol`s
as atoms could lead to denial-of-service and (a.2) representing
`Symbol`-labelled `Record`s as Erlang records must be rejected for the
same reason; (b) even if it did, Erlang's boolean values are atoms,
which would then clash with the `Symbol`s `true` and `false`; and (c)
Erlang has no distinct string type, making for a trilemma where
`String`s are in danger of clashing with `ByteString`s, `Sequence`s,
or `Record`s.

### Python

 - `Boolean` ↔ `True` and `False`
 - `Float` ↔ a `Float` wrapper-class for a double-precision value
 - `Double` ↔ float
 - `SignedInteger` ↔ int and long
 - `String` ↔ `unicode`
 - `ByteString` ↔ `bytes`
 - `Symbol` ↔ a simple data class wrapping a `unicode`
 - `Record` ↔ something like `namedtuple`, but that doesn't care about class identity?
 - `Sequence` ↔ `tuple` (but accept `list` during encoding)
 - `Set` ↔ `frozenset` (but accept `set` during encoding)
 - `Dictionary` ↔ a hashable (immutable) dictionary-like thing (but accept `dict` during encoding)

### Squeak Smalltalk

 - `Boolean` ↔ `true` and `false`
 - `Float` ↔ perhaps a subclass of `Float`?
 - `Double` ↔ `Float`
 - `SignedInteger` ↔ `Integer`
 - `String` ↔ `WideString`
 - `ByteString` ↔ `ByteArray`
 - `Symbol` ↔ `WideSymbol`
 - `Record` ↔ a simple data class
 - `Sequence` ↔ `SequenceableCollection` (usually `OrderedCollection`)
 - `Set` ↔ `Set`
 - `Dictionary` ↔ `Dictionary`

## Appendix. Why not Just Use JSON?

<!-- JSON lacks semantics: JSON syntax doesn't denote anything -->

JSON offers *syntax* for numbers, strings, booleans, null, arrays and
string-keyed maps. However, it suffers from two major problems. First,
it offers no *semantics* for the syntax: it is left to each
implementation to determine how to treat each JSON term. This causes
[interoperability](http://seriot.ch/parsing_json.php) and even
[security](http://web.archive.org/web/20180906202559/http://docs.couchdb.org/en/stable/cve/2017-12635.html)
issues. Second, JSON's lack of support for type tags leads to awkward
and incompatible *encodings* of type information in terms of the fixed
suite of constructors on offer.

There are other minor problems with JSON having to do with its syntax.
Examples include its relative verbosity and its lack of support for
binary data.

### JSON syntax doesn't *mean* anything

When are two JSON values the same? When are they different?
<!-- When is one JSON value "less than" another? -->

The specifications are largely silent on these questions. Different
JSON implementations give different answers.

Specifically, JSON does not:

 - assign any meaning to numbers,[^meaning-ieee-double]
 - determine how strings are to be compared,[^string-key-comparison]
 - determine whether object key ordering is significant,[^json-member-ordering] or
 - determine whether duplicate object keys are permitted, what it
   would mean if they were, or how to determine a duplicate in the
   first place.[^json-key-uniqueness]

In short, JSON syntax doesn't *denote* anything.[^xml-infoset] [^other-formats]

  [^meaning-ieee-double]:
    [Section 6 of RFC 8259](https://tools.ietf.org/html/rfc8259#section-6)
    does go so far as to indicate “good interoperability can be
    achieved” by imagining that parsers are able reliably to
    understand the syntax of numbers as denoting an IEEE 754
    double-precision floating-point value.

  [^string-key-comparison]:
    [Section 8.3 of RFC 8259](https://tools.ietf.org/html/rfc8259#section-8.3)
    suggests that *if* an implementation compares strings used as
    object keys “code unit by code unit”, then it will interoperate
    with *other such implementations*, but neither requires this
    behaviour nor discusses comparisons of strings used in other
    contexts.

  [^json-member-ordering]:
    [Section 4 of RFC 8259](https://tools.ietf.org/html/rfc8259#section-4)
    remarks that “[implementations] differ as to whether or not they
    make the ordering of object members visible to calling software.”

  [^json-key-uniqueness]:
    [Section 4 of RFC 8259](https://tools.ietf.org/html/rfc8259#section-4)
    is the only place in the specification that mentions the issue. It
    explicitly sanctions implementations supporting duplicate keys,
    noting only that “when the names within an object are not unique,
    the behavior of software that receives such an object is
    unpredictable.” Implementations are free to choose any behaviour
    at all in this situation, including signalling an error, or
    discarding all but one of a set of duplicates.

  [^xml-infoset]: The XML world has the concept of
    [XML infoset](https://www.w3.org/TR/xml-infoset/). Loosely
    speaking, XML infoset is the *denotation* of an XML document; the
    *meaning* of the document.

  [^other-formats]: Most other recent data languages are like JSON in
    specifying only a syntax with no associated semantics. While some
    do make a sketch of a semantics, the result is often
    underspecified (e.g. in terms of how strings are to be compared),
    overly machine-oriented (e.g. treating 32-bit integers as
    fundamentally distinct from 64-bit integers and from
    floating-point numbers), overly fine (e.g. giving visibility to
    the order in which map entries are written), or all three.

Some examples:

 - are the JSON values `1`, `1.0`, and `1e0` the same or different?
 - are the JSON values `1.0` and `1.0000000000000001` the same or different?
 - are the JSON strings `"päron"` (UTF-8 `70c3a4726f6e`) and `"päron"`
   (UTF-8 `7061cc88726f6e`) the same or different?
 - are the JSON objects `{"a":1, "b":2}` and `{"b":2, "a":1}` the same
   or different?
 - which, if any, of `{"a":1, "a":2}`, `{"a":1}` and `{"a":2}` are the
   same? Are all three legal?
 - are `{"päron":1}` and `{"päron":1}` the same or different?
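
To see how one particular implementation answers these questions, the following Python sketch feeds the examples to the standard `json` module. The answers it records are those of this one library, and nothing in the JSON specifications obliges other implementations to agree.

```python
import json
import unicodedata

# Numbers: `1` is read as an int, `1.0` and `1e0` as floats.
n1, n2, n3 = (json.loads(text) for text in ("1", "1.0", "1e0"))
numbers_equal = n1 == n2 == n3
number_types = [type(n).__name__ for n in (n1, n2, n3)]

# Precision: both texts round to the same IEEE 754 double.
doubles_equal = json.loads("1.0") == json.loads("1.0000000000000001")

# Strings: precomposed U+00E4 versus "a" followed by combining U+0308.
composed = json.loads('"p\\u00e4ron"')
decomposed = json.loads('"pa\\u0308ron"')
strings_equal = composed == decomposed
strings_equal_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

# Objects: key order is ignored when comparing, and the last duplicate key wins.
order_ignored = json.loads('{"a":1, "b":2}') == json.loads('{"b":2, "a":1}')
duplicate_keys = json.loads('{"a":1, "a":2}')
```

Here the numbers compare equal despite differing in type, the two spellings of "päron" compare unequal until normalized, and `{"a":1, "a":2}` silently reads as `{"a": 2}`.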

### JSON can multiply nicely, but it can't add very well

JSON includes a fixed set of types: numbers, strings, booleans, null,
arrays and string-keyed maps. Domain-specific data must be *encoded*
into these types. For example, dates and email addresses are often
represented as strings with an implicit internal structure.

There is no convention for *labelling* a value as belonging to a
particular category. Instead, JSON-encoded data are often labelled in
an ad-hoc way. Multiple incompatible approaches exist. For example, a
"money" structure containing a `currency` field and an `amount` may be
represented in any number of ways:

    { "_type": "money", "currency": "EUR", "amount": 10 }
    { "type": "money", "value": { "currency": "EUR", "amount": 10 } }
    [ "money", { "currency": "EUR", "amount": 10 } ]
    { "@money": { "currency": "EUR", "amount": 10 } }

This causes particular problems when JSON is used to represent *sum*
or *union* types, such as "either a value or an error, but not both".
Again, multiple incompatible approaches exist.

For example, imagine an API for depositing money in an account. The
response might be either a "success" response indicating the new
balance, or one of a set of possible errors.

Sometimes, a *pair* of values is used, with `null` marking the option
not taken.[^interesting-failure-mode]

    { "ok": { "balance": 210 }, "error": null }
    { "ok": null, "error": "Unauthorized" }

  [^interesting-failure-mode]: What is the meaning of a document where
    both `ok` and `error` are non-null? What might happen when a
    program is presented with such a document?

The branch not chosen is sometimes present, sometimes omitted as if it
were an optional field:

    { "ok": { "balance": 210 } }
    { "error": "Unauthorized" }

Sometimes, an array of a label and a value is used:

    [ "ok", { "balance": 210 } ]
    [ "error", "Unauthorized" ]

Sometimes, the shape of the data is sufficient to distinguish among
the alternatives, and the label is left implicit:

    { "balance": 210 }
    "Unauthorized"

JSON itself does not offer any guidance for which of these options to
choose. In many real cases on the web, poor choices have led to
encodings that are irrecoverably ambiguous.
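
By contrast, when every value carries an explicit label, a consumer can dispatch on the label alone. The Python sketch below models a labelled record as a hypothetical `Record` type with a string standing in for a `Symbol` label; the labels `ok` and `error` mirror the deposit example above and are not defined by any specification.

```python
from typing import Any, NamedTuple, Tuple


class Record(NamedTuple):
    label: str  # stands in for a Symbol label
    fields: Tuple[Any, ...]


def new_balance(response: Record) -> int:
    """Return the balance from an `ok` record, or raise for anything else."""
    if response.label == "ok" and len(response.fields) == 1:
        return response.fields[0]["balance"]
    if response.label == "error" and len(response.fields) == 1:
        raise PermissionError(response.fields[0])
    raise ValueError(f"unexpected response: {response!r}")
```

A response is exactly one record, so the question of what to do when both `ok` and `error` are present cannot arise.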

# Open questions

Q. Should "symbols" instead be URIs? Relative, usually; relative to
what? Some domain-specific base URI?

Q. Literal small integers: are they pulling their weight? They're not
absolutely necessary. They mess up the connection between
value-type-ordering and repr-tag-ordering! (The connection between
*value* ordering and *repr* ordering is already irretrievably messed
up: length prefixes blow lexicographic ordering away, sign bits are
the wrong way around, floats are sign-magnitude, etc etc.)

Q. Should we go for trying to make the data ordering line up with the
encoding ordering? We'd have to only use streaming forms, and avoid
the small integer encoding, and not store record arities, and sort
sets and dictionaries, and mask floats and doubles (perhaps
[like this](https://stackoverflow.com/questions/43299299/sorting-floating-point-values-using-their-byte-representation)),
and pick a specific `NaN`, and I don't know what to do about
SignedIntegers. Perhaps make them more like float formats, with the
byte count acting as a kind of exponent underneath the sign bit.

 - Perhaps define separate additional canonicalization restrictions?
   Doesn't help the ordering, but does help the equivalence.

 - Canonicalization and early-bailout-equivalence-checking are in
   tension with support for streaming values.

Q. The postfix fields in the textual syntax come unannounced: "oh, and
another thing, what you just read is a label, and here are some
fields." This is a problem for interactive reading of textual syntax,
because after a complete term, it needs to see the next character to
tell whether it is an open-parenthesis or not! For this reason, I've
disallowed whitespace between a label `Value` and the open-parenthesis
of the fields. Is this reasonable??

Q. To remain compatible with JSON, portions of the text syntax have to
remain case-insensitive (`%i"..."`). However, non-JSON extensions do
not. There's only one (?) at the moment, the `%i"f"` in `Float`;
should it be changed to case-sensitive?

TODO: Examples of the ordering. `"bzz" < "c" < "caa"`; `#true < 3 < "3" < |3|`

TODO: Probably should add a canonicalized subset. Consider adding
explicit "I promise this is canonical" marker, like a BOM, which
identifies a binary value as (first) binary and (second, optionally)
as canonical. UTF-8 disallows byte `0xFF` from appearing anywhere in a
text; this might be a good candidate for a marker sequence.
((Actually, perhaps `0x10` would be good! It corresponds to DLE, "data
link escape"; it is not a printable ASCII character, and is disallowed
in the textual Preserves grammar; and it is also mnemonic for "version
0", since it is the Preserves binary encoding of the small integer
zero.))

TODO: Remove the special short syntax for application-specific record
label usage? Then perhaps 8x, 9x, Ax and Bx would work for Record,
Sequence, Set and Dictionary, leaving Cx, Dx, Ex and Fx entirely free.

TODO: Forbid empty chunks?

## Notes
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
+								---
 								---
-												Trim and improve

											
										
										
											2018-09-24 11:59:22 +00:00
+								<title>Preserves: an Expressive Data Language</title>
-												Rearrange repo

											
										
										
											2018-09-29 16:26:39 +00:00
+								<link rel="stylesheet" href="preserves.css">
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
-												Trim and improve

											
										
										
											2018-09-24 11:59:22 +00:00
+								# Preserves: an Expressive Data Language
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
-												Trim and improve

											
										
										
											2018-09-24 11:59:22 +00:00
+								Tony Garnock-Jones <tonyg@leastfixedpoint.com>
-												Delete misleading, incorrect, or unnecessary text

											
										
										
											2018-11-08 12:35:50 +00:00
+								November 2018. Version 0.0.4.
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
 								  [sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt
 								  [spki]: http://world.std.com/~cme/html/spki.html
 								  [varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
 								  [erlang-map]: http://erlang.org/doc/reference_manual/data_types.html#map
-												WIP from the early hours of this morning, adding textual syntax

											
										
										
											2018-09-27 10:42:55 +00:00
+								  [abnf]: https://tools.ietf.org/html/rfc7405
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
-												Trim and improve

											
										
										
											2018-09-24 11:59:22 +00:00
+								This document proposes a data model and serialization format called
 								*Preserves*.
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
-												Delete misleading, incorrect, or unnecessary text

											
										
										
											2018-11-08 12:35:50 +00:00
+								Preserves supports *records* with user-defined *labels*. This relieves
 								the confusion caused by encoding records as dictionaries, seen in most
 								data languages in use on the web. It also allows Preserves to easily
 								represent the *labelled sums of products* as seen in many functional
 								programming languages.
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
-												Trim and improve

											
										
										
											2018-09-24 11:59:22 +00:00
+								Preserves also supports the usual suite of atomic and compound data
 								types, in particular including *binary* data as a distinct type from
-												Initial draft text re annotations

											
										
										
											2019-07-03 23:33:37 +00:00
+								text strings. Its *annotations* allow separation of data from metadata
 								such as comments, trace information, and provenance information.
-												Trim and improve

											
										
										
											2018-09-24 11:59:22 +00:00
-												Minor print layout tweaks, and minor content fixes

											
										
										
											2018-09-24 15:08:48 +00:00
+								Finally, Preserves defines precisely how to *compare* two values.
 								Comparison is based on the data model, not on syntax or on data
 								structures of any particular implementation language.
-												Trim and improve

											
										
										
											2018-09-24 11:59:22 +00:00
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
+								## Starting with Semantics
 								Taking inspiration from functional programming, we start with a
 								definition of the *values* that we want to work with and give them
-												Delete misleading, incorrect, or unnecessary text

											
										
										
											2018-11-08 12:35:50 +00:00
+								meaning independent of their syntax.
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
-												Minor print layout tweaks, and minor content fixes

											
										
										
											2018-09-24 15:08:48 +00:00
+								Our `Value`s fall into two broad categories: *atomic* and *compound*
 								data.
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
+								                          Value = Atom
 								                                | Compound
-												Fixes

											
										
										
											2018-09-23 21:44:43 +00:00
+								                           Atom = Boolean
 								                                | Float
 								                                | Double
 								                                | SignedInteger
+								                                | String
 								                                | ByteString
 								                                | Symbol
 								                       Compound = Record
 								                                | Sequence
 								                                | Set
 								                                | Dictionary
 								**Total order.**<a name="total-order"></a> As we go, we will
 								incrementally specify a total order over `Value`s. Two values of the
 								same kind are compared using kind-specific rules. The ordering among
 								values of different kinds is essentially arbitrary, but having a total
 								order is convenient for many tasks, so we define it as
 								follows:[^ordering-by-syntax]
+								            (Values)        Atom < Compound
 								            (Compounds)     Record < Sequence < Set < Dictionary
+								            (Atoms)         Boolean < Float < Double < SignedInteger
 								                              < String < ByteString < Symbol
 								  [^ordering-by-syntax]: The observant reader may note that the
+								    ordering here is (almost) the same as that implied by the tagging
 								    scheme used in the concrete binary syntax for `Value`s. (The
 								    exception is the syntax for small integers near zero.)
 								**Equivalence.**<a name="equivalence"></a> Two `Value`s are equal if
 								neither is less than the other according to the total order.
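The cross-kind part of the ordering can be sketched as a simple ranking of kinds. This is only an illustration: the names `KIND_ORDER`, `KIND_RANK` and `kind_less_than` are hypothetical and not part of Preserves, and values of the same kind still need the kind-specific rules given in the sections that follow.

```python
# Kinds listed from least to greatest, following the (Values), (Atoms) and
# (Compounds) rules: every Atom kind sorts before every Compound kind.
KIND_ORDER = [
    "Boolean", "Float", "Double", "SignedInteger",
    "String", "ByteString", "Symbol",            # Atoms
    "Record", "Sequence", "Set", "Dictionary",   # Compounds
]
KIND_RANK = {kind: rank for rank, kind in enumerate(KIND_ORDER)}


def kind_less_than(kind_a, kind_b):
    """True if every value of kind_a is less than every value of kind_b."""
    return KIND_RANK[kind_a] < KIND_RANK[kind_b]
```

For example, `kind_less_than("Symbol", "Record")` holds because every `Atom` is less than every `Compound`.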
 								### Signed integers.
 								A `SignedInteger` is a signed integer of arbitrary width.
+								`SignedInteger`s are compared as mathematical integers.
 								### Unicode strings.
 								A `String` is a sequence of Unicode
+								[code-point](http://www.unicode.org/glossary/#code_point)s. `String`s
 								are compared lexicographically, code-point by
+								code-point.[^utf8-is-awesome]
 								  [^utf8-is-awesome]: Happily, the design of UTF-8 is such that this
 								    gives the same result as a lexicographic byte-by-byte comparison
 								    of the UTF-8 encoding of a string!
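The correspondence between code-point order and UTF-8 byte order can be checked informally. The sketch below relies on the fact that Python compares `str` values code point by code point; the sample words are arbitrary and chosen only for illustration.

```python
words = ["zebra", "Zebra", "éclair", "eclair", "日本語", "😀", "~tilde", ""]

# Python compares str values code point by code point.
by_code_point = sorted(words)

# Comparing the UTF-8 encodings byte by byte gives the same order.
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))

same_order = by_code_point == by_utf8_bytes
```

Here `same_order` is `True`, as the footnote predicts.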
 								### Binary data.
+								A `ByteString` is a sequence of octets. `ByteString`s are compared
 								lexicographically.
+								### Symbols.
 								Programming languages like Lisp and Prolog frequently use string-like
 								values called *symbols*. Here, a `Symbol` is, like a `String`, a
+								sequence of Unicode code-points representing an identifier of some
+								kind. `Symbol`s are also compared lexicographically by code-point.
 								### Booleans.
+								There are two `Boolean`s, “false” and “true”. The “false” value is
less than the “true” value.
 								### IEEE floating-point values.
+								`Float`s and `Double`s are single- and double-precision IEEE 754
 								floating-point values, respectively. `Float`s, `Double`s and
 								`SignedInteger`s are disjoint; by the rules [above](#total-order),
 								every `Float` is less than every `Double`, and every `SignedInteger`
 								is greater than both. Two `Float`s or two `Double`s are to be ordered
 								by the `totalOrder` predicate defined in section 5.10 of
+								[IEEE Std 754-2008](https://dx.doi.org/10.1109/IEEESTD.2008.4610935).
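One common way to realise `totalOrder` for `Double`s is to compare suitably adjusted bit patterns. The following is a minimal sketch, assuming the usual IEEE 754 binary64 interchange encoding; the helper names `double_bits` and `total_order_key` are hypothetical and are not defined by Preserves or by the standard.

```python
import struct

SIGN_BIT = 1 << 63
MASK = (1 << 64) - 1


def double_bits(x):
    """Return the 64-bit IEEE 754 bit pattern of a Python float."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def total_order_key(bits):
    """Map a 64-bit pattern to an integer that sorts per totalOrder."""
    if bits & SIGN_BIT:
        # Sign bit set: a larger magnitude means a smaller value,
        # so invert all the bits.
        return ~bits & MASK
    # Sign bit clear: these sort above every pattern with the sign bit set.
    return bits | SIGN_BIT
```

Sorting bit patterns by `total_order_key` places negative NaNs first, then negative infinity, the negative finite values, negative zero, positive zero, the positive finite values, positive infinity, and finally positive NaNs. The same idea applies to `Float`s with 32-bit patterns.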
 								### Records.
+								A `Record` is a *labelled* tuple of `Value`s, the record's *fields*. A
 								label can be any `Value`, but is usually a `Symbol`.[^extensibility]
 								[^iri-labels] `Record`s are compared lexicographically: first by
 								label, then by field sequence.
 								  [^extensibility]: The [Racket](https://racket-lang.org/) programming
 								    language defines
+								    “[prefab](http://docs.racket-lang.org/guide/define-struct.html#(part._prefab-struct))”
+								    structure types, which map well to our `Record`s. Racket supports
 								    record extensibility by encoding record supertypes into record
 								    labels as specially-formatted lists.
 								  [^iri-labels]: It is occasionally (but seldom) necessary to
 								    interpret such `Symbol` labels as UTF-8 encoded IRIs. Where a
 								    label can be read as a relative IRI, it is notionally interpreted
 								    with respect to the IRI
 								    `urn:uuid:6bf094a6-20f1-4887-ada7-46834a9b5b34`; where a label can
 								    be read as an absolute IRI, it stands for that IRI; and otherwise,
 								    it cannot be read as an IRI at all, and so the label simply stands
+								    for itself—for its own `Value`.
 								### Sequences.
+								A `Sequence` is a sequence of `Value`s. `Sequence`s are compared
 								lexicographically.
 								### Sets.
 								A `Set` is an unordered finite set of `Value`s. It contains no
 								duplicate values, following the [equivalence relation](#equivalence)
 								induced by the total order on `Value`s. Two `Set`s are compared by
+								sorting their elements ascending using the [total order](#total-order)
+								and comparing the resulting `Sequence`s.
+								### Dictionaries.
+								A `Dictionary` is an unordered finite collection of pairs of `Value`s.
+								Each pair comprises a *key* and a *value*. Keys in a `Dictionary` are
 								pairwise distinct. Instances of `Dictionary` are compared by
+								lexicographic comparison of the sequences resulting from ordering each
+								`Dictionary`'s pairs in ascending order by key.
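The comparison rules for `Sequence`s, `Set`s and `Dictionary`s can be sketched by mapping each value to a nested tuple that Python then compares lexicographically. This is an illustration under stated assumptions, not a reference implementation: it covers only a subset of kinds, it maps them onto Python's `bool`, `int`, `str`, `bytes`, `list`, `frozenset` and `dict`, and the names `order_key`, `less_than` and `equivalent` are hypothetical.

```python
def order_key(value):
    """Map a value to nested tuples whose ordering mirrors the total order."""
    if isinstance(value, bool):        # test bool first: bool is an int subclass
        return (0, value)              # Boolean
    if isinstance(value, int):
        return (3, value)              # SignedInteger
    if isinstance(value, str):
        return (4, value)              # String, compared by code point
    if isinstance(value, bytes):
        return (5, value)              # ByteString
    if isinstance(value, list):
        return (8, tuple(order_key(v) for v in value))           # Sequence
    if isinstance(value, frozenset):
        # Set: sort the elements ascending, then compare as a sequence.
        return (9, tuple(sorted(order_key(v) for v in value)))
    if isinstance(value, dict):
        # Dictionary: order the pairs by key, then compare pair by pair.
        pairs = sorted((order_key(k), order_key(v)) for k, v in value.items())
        return (10, tuple(pairs))
    raise TypeError(f"unsupported value: {value!r}")


def less_than(a, b):
    return order_key(a) < order_key(b)


def equivalent(a, b):
    # Two values are equal if neither is less than the other.
    return not less_than(a, b) and not less_than(b, a)
```

The leading integer in each tuple is the position of the kind in the ordering of kinds, so values of different kinds never reach the kind-specific comparison. One caveat of this mapping is that Python treats `True` and `1` as the same set element or dictionary key, which Preserves does not.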
+								## Textual Syntax
Now that we have discussed `Value`s and their meanings, we may turn to
 								techniques for *representing* `Value`s for communication or storage.
+								In this section, we use [case-sensitive ABNF][abnf] to define a
 								textual syntax that is easy for people to read and
 								write.[^json-superset] Most of the examples in this document are
 								written using this syntax. In the following section, we will define an
 								equivalent compact machine-readable syntax.
 								  [^json-superset]: The grammar of the textual syntax is a superset of
 								    JSON, with the slightly unusual feature that `true`, `false`, and
 								    `null` are all read as `Symbol`s, and that `SignedInteger`s are
 								    never read as `Double`s.
 								### Character set
 								[ABNF][abnf] allows easy definition of US-ASCII-based languages.
 								However, Preserves is a Unicode-based language. Therefore, we
 								reinterpret ABNF as a grammar for recognising sequences of Unicode
 								code points.
 								Textual syntax for a `Value` *SHOULD* be encoded using UTF-8 where
 								possible.
 								### Whitespace
 								Whitespace is defined as any number of spaces, tabs, carriage returns,
+								line feeds, or commas.
+								                ws = *(%x20 / %x09 / newline / ",")
+								           newline = CR / LF
 								### Grammar
+								Standalone documents may have trailing whitespace.
 								          Document = Value ws
 								Any `Value` may be preceded by whitespace.
 								             Value = ws (Record / Collection / Atom / Compact)
 								        Collection = Sequence / Dictionary / Set
 								              Atom = Boolean / Float / Double / SignedInteger /
 								                     String / ByteString / Symbol
 								Each `Record` is its label-`Value` followed by a parenthesised
+								grouping of its field-`Value`s. Whitespace is not permitted between
 								the label and the open-parenthesis.
+								            Record = Value "(" *Value ws ")"
 								`Sequence`s are enclosed in square brackets. `Dictionary` values are
 								curly-brace-enclosed colon-separated pairs of values. `Set`s are
+								written either as one or more values enclosed in curly braces, or zero
 								or more values enclosed by the tokens `#set{` and
 								`}`.[^printing-collections]
 								          Sequence = "[" *Value ws "]"
 								        Dictionary = "{" *(Value ws ":" Value) ws "}"
 								               Set = %s"#set{" *Value ws "}" / "{" 1*Value ws "}"
+								  [^printing-collections]: **Implementation note.** When implementing
 								    printing of `Value`s using the textual syntax, consider supporting
 								    (a) optional pretty-printing with indentation, (b) optional
 								    JSON-compatible print mode for that subset of `Value` that is
 								    compatible with JSON, and (c) optional submodes for no commas,
 								    commas separating, and commas terminating elements or key/value
 								    pairs within a collection.
+								The special cases of records with a single field, which is in turn a
 								sequence or dictionary, may be written omitting the parentheses.
 								           Record =/ Value Sequence
 								           Record =/ Value Dictionary
+								`Boolean`s are the simple literal strings `#true` and `#false`.
 								           Boolean = %s"#true" / %s"#false"
 								Numeric data follow the
 								[JSON grammar](https://tools.ietf.org/html/rfc8259#section-6), with
 								the addition of a trailing "f" distinguishing `Float` from `Double`
 								values. `Float`s and `Double`s always have either a fractional part or
an exponent part, whereas `SignedInteger`s never have
 								either.[^reading-and-writing-floats-accurately]
 								[^arbitrary-precision-signedinteger]
 								             Float = flt %i"f"
 								            Double = flt
 								     SignedInteger = int
 								          digit1-9 = %x31-39
 								               nat = %x30 / ( digit1-9 *DIGIT )
 								               int = ["-"] nat
 								              frac = "." 1*DIGIT
 								               exp = %i"e" ["-"/"+"] 1*DIGIT
 								               flt = int (frac exp / frac / exp)
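The numeric grammar translates fairly directly into regular expressions. The sketch below is one possible transcription; `classify_number` and the pattern names are hypothetical, and the function only classifies a complete token rather than converting it to a number.

```python
import re

INT = r"-?(?:0|[1-9][0-9]*)"        # int  = ["-"] nat
FRAC = r"\.[0-9]+"                  # frac = "." 1*DIGIT
EXP = r"[eE][-+]?[0-9]+"            # exp  = %i"e" ["-"/"+"] 1*DIGIT
FLT = rf"{INT}(?:{FRAC}{EXP}|{FRAC}|{EXP})"

FLOAT_RE = re.compile(rf"{FLT}[fF]")  # Float = flt %i"f"
DOUBLE_RE = re.compile(FLT)           # Double = flt
INTEGER_RE = re.compile(INT)          # SignedInteger = int


def classify_number(token):
    """Return the kind named by a complete numeric token, or None."""
    if FLOAT_RE.fullmatch(token):
        return "Float"
    if DOUBLE_RE.fullmatch(token):
        return "Double"
    if INTEGER_RE.fullmatch(token):
        return "SignedInteger"
    return None
```

Because a `flt` must contain a fractional part or an exponent, a token such as `1f` matches none of the three productions, and `42` can only be a `SignedInteger`.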
+								  [^reading-and-writing-floats-accurately]: **Implementation note.**
 								    Your language's standard library likely has a good routine for
 								    converting between decimal notation and IEEE 754 floating-point.
 								    However, if not, or if you are interested in the challenges of
 								    accurately reading and writing floating point numbers, see the
 								    excellent matched pair of 1990 papers by Clinger and Steele &
 								    White, and a recent follow-up by Jaffer:
 								    Clinger, William D. ‘How to Read Floating Point Numbers
 								    Accurately’. In Proc. PLDI. White Plains, New York, 1990.
 								    <https://doi.org/10.1145/93542.93557>.
 								    Steele, Guy L., Jr., and Jon L. White. ‘How to Print
 								    Floating-Point Numbers Accurately’. In Proc. PLDI. White Plains,
 								    New York, 1990. <https://doi.org/10.1145/93542.93559>.
 								    Jaffer, Aubrey. ‘Easy Accurate Reading and Writing of
 								    Floating-Point Numbers’. ArXiv:1310.8121 [Cs], 27 October 2013.
 								    <http://arxiv.org/abs/1310.8121>.
 								  [^arbitrary-precision-signedinteger]: **Implementation note.** Be
 								    aware when implementing reading and writing of `SignedInteger`s
 								    that the data model *requires* arbitrary-precision integers. Your
 								    I/O routines must not truncate precision either when reading or
 								    writing a `SignedInteger`.
+								`String`s are,
 								[as in JSON](https://tools.ietf.org/html/rfc8259#section-7), possibly
 								escaped text surrounded by double quotes. The escaping rules are the
+								same as for JSON.[^string-json-correspondence] [^escaping-surrogate-pairs]
 								            String = %x22 *char %x22
 								              char = unescaped / %x7C / escape (escaped / %x22 / %s"u" 4HEXDIG)
 								         unescaped = %x20-21 / %x23-5B / %x5D-7B / %x7D-10FFFF
 								            escape = %x5C              ; \
 								           escaped = ( %x5C /          ; \    reverse solidus U+005C
 								                       %x2F /          ; /    solidus         U+002F
 								                       %x62 /          ; b    backspace       U+0008
 								                       %x66 /          ; f    form feed       U+000C
 								                       %x6E /          ; n    line feed       U+000A
 								                       %x72 /          ; r    carriage return U+000D
 								                       %x74 )          ; t    tab             U+0009
 								  [^string-json-correspondence]: The grammar for `String` has the same
 								    effect as the
 								    [JSON](https://tools.ietf.org/html/rfc8259#section-7) grammar for
 								    `string`. Some auxiliary definitions (e.g. `escaped`) are lifted
 								    largely unmodified from the text of RFC 8259.
+								  [^escaping-surrogate-pairs]: In particular, note JSON's rules around
 								    the use of surrogate pairs for code points not in the Basic
 								    Multilingual Plane. We encourage implementations to avoid escaping
 								    such characters when producing output, and instead to rely on the
 								    UTF-8 encoding of the entire document to handle them correctly.
+								A `ByteString` may be written in any of three different forms.
 								The first is similar to a `String`, but prepended with a hash sign
 								`#`. In addition, only Unicode code points overlapping with printable
7-bit ASCII are permitted unescaped inside such a `ByteString`; other
 								byte values must be escaped by prepending a two-digit hexadecimal
 								value with `\x`.
 								        ByteString = "#" %x22 *binchar %x22
 								           binchar = binunescaped / escape (escaped / %x22 / %s"x" 2HEXDIG)
 								      binunescaped = %x20-21 / %x23-5B / %x5D-7E
+								The second is as a sequence of pairs of hexadecimal digits interleaved
+								with whitespace and surrounded by `#hex{` and `}`.
 								       ByteString =/ %s"#hex{" *(ws / 2HEXDIG) ws "}"
 								The third is as a sequence of
 								[Base64](https://tools.ietf.org/html/rfc4648) characters, interleaved
 								with whitespace and surrounded by `#base64{` and `}`. Plain and
 								URL-safe Base64 characters are allowed.
       ByteString =/ %s"#base64{" *(ws / base64char) ws "}"
 								        base64char = %x41-5A / %x61-7A / %x30-39 / "+" / "/" / "-" / "_" / "="
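The `#hex{` and `#base64{` forms can be decoded with standard library routines. The following is a minimal sketch: the function names are hypothetical, the input is assumed to be a single complete token, and the hexadecimal decoder is slightly more lenient than the grammar because it also accepts whitespace between the two digits of a pair.

```python
import base64
import re

# Whitespace per the grammar: space, tab, CR, LF and comma.
WS = re.compile(r"[ \t\r\n,]+")


def parse_hex_bytestring(text):
    """Decode the `#hex{...}` form."""
    if not (text.startswith("#hex{") and text.endswith("}")):
        raise ValueError("not a #hex{...} ByteString")
    digits = WS.sub("", text[len("#hex{"):-1])
    return bytes.fromhex(digits)


def parse_base64_bytestring(text):
    """Decode the `#base64{...}` form, accepting plain and URL-safe alphabets."""
    if not (text.startswith("#base64{") and text.endswith("}")):
        raise ValueError("not a #base64{...} ByteString")
    chars = WS.sub("", text[len("#base64{"):-1])
    # Normalise URL-safe characters to the plain alphabet, then re-pad.
    chars = chars.translate(str.maketrans("-_", "+/")).rstrip("=")
    chars += "=" * (-len(chars) % 4)
    return base64.b64decode(chars, validate=True)
```

Both `parse_hex_bytestring("#hex{48 65 6c 6c 6f}")` and `parse_base64_bytestring("#base64{SGVsbG8=}")` yield the five octets of `Hello`.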
+								A `Symbol` may be written in a "bare" form[^cf-sexp-token] so long as
+								it conforms to certain restrictions on the characters appearing in the
+								symbol. Alternatively, it may be written in a quoted form. The quoted
 								form is much the same as the syntax for `String`s, including embedded
 								escape syntax, except using a bar or pipe character (`|`) instead of a
 								double quote mark.
 								            Symbol = symstart *symcont / "|" *symchar "|"
+								          symstart = ALPHA / sympunct / symunicode
+								           symcont = ALPHA / sympunct / symunicode / DIGIT / "-"
+								          sympunct = "~" / "!" / "$" / "%" / "^" / "&" / "*" /
+								                     "?" / "_" / "=" / "+" / "<" / ">" / "/" / "."
+								           symchar = unescaped / %x22 / escape (escaped / %x7C / %s"u" 4HEXDIG)
+								        symunicode = <any code point greater than 127 whose Unicode
 								                      category is Lu, Ll, Lt, Lm, Lo, Mn, Mc, Me, Nd,
 								                      Nl, No, Pd, Pc, Po, Sc, Sm, Sk, So, or Co>
 								  [^cf-sexp-token]: Compare with the [SPKI S-expression][sexp.txt]
+								    definition of "token representation", and with the
 								    [R6RS definition of identifiers](http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-7.html#node_sec_4.2.4).
+								Finally, any `Value` may be represented by escaping from the textual
 								syntax to the [compact binary syntax](#compact-binary-syntax) by
 								prefixing a `ByteString` containing the binary representation of the
 								`Value` with `#value`.[^rationale-switch-to-binary] [^no-literal-binary-in-text]
 								           Compact = %s"#value" ws ByteString
 								  [^rationale-switch-to-binary]: **Rationale.** The textual syntax
 								    cannot express every `Value`: specifically, it cannot express the
 								    several million floating-point NaNs, or the two floating-point
 								    Infinities. Since the compact binary format for `Value`s expresses
 								    each `Value` with precision, embedding binary `Value`s solves the
 								    problem.
 								  [^no-literal-binary-in-text]: Every text is ultimately physically
 								    stored as bytes; therefore, it might seem possible to escape to
 								    the raw binary form of compact binary encoding from within a
    piece of textual syntax. However, while bytes must be involved in
 								    any *representation* of text, the text *itself* is logically a
 								    sequence of *code points* and is not *intrinsically* a binary
 								    structure at all. It would be incoherent to expect to be able to
 								    access the representation of the text from within the text itself.
+								### Annotations.
 								When written down, a `Value` may have an associated sequence of
 								*annotations* carrying “out-of-band” contextual metadata about the
 								value. Each annotation is, in turn, a `Value`, and may itself have
 								annotations.
 								            Value =/ ws "@" Value Value
 								Each annotation is preceded by `@`; the underlying annotated value
 								follows its annotations.
 								Annotations appear within syntax denoting a `Value`; however, the
 								annotations are not part of the denoted value. They are only part of
 								the syntax. Annotations do not play a part in equivalences and
 								orderings of `Value`s.
 								Reflective tools such as debuggers, user interfaces, and message
 								routers and relays---tools which process `Value`s generically---may
 								use annotated inputs to tailor their operation, or may insert
 								annotations in their outputs. By contrast, in ordinary programs, as a
 								rule of thumb, the presence, absence or specific value of an
 								annotation should not change the control flow or output of the
 								program. Annotations are data *describing* `Value`s, and are not in
 								the domain of any specific application of `Value`s. That is, an
 								annotation will almost never cause a non-reflective program to do
 								anything observably different.
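One way an implementation might keep annotations out of comparisons is to carry them in a wrapper whose equality looks only at the underlying value. The sketch below is an assumption about how this could be done in Python; the `Annotated` class and the `strip_annotations` helper are hypothetical names, not part of Preserves.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotated:
    """A value together with annotations that are ignored by comparison."""
    item: object
    # compare=False keeps annotations out of the generated __eq__ and __hash__.
    annotations: tuple = field(default=(), compare=False)


def strip_annotations(value):
    """Return the underlying value of an Annotated wrapper, or the value itself."""
    return value.item if isinstance(value, Annotated) else value
```

With this arrangement, `Annotated(1, ("a comment",))` and `Annotated(1)` compare equal, while reflective tools can still read the `annotations` field.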
+								## Compact Binary Syntax
+								A `Repr` is an encoding, or representation, of a specific `Value`.
 								Each `Repr` comprises one or more bytes describing first the kind of
 								represented `Value` and the length of the representation, and then the
 								encoded details of the `Value` itself.
For a value `v`, we write `[[v]]` for the `Repr` of `v`.
+								### Type and Length representation
+								Each `Repr` takes one of three possible forms:
+								 - (A) a fixed-length form, used for simple values such as `Boolean`s
 								   or `Float`s.
+								 - (B) a variable-length form with length specified up-front, used for
 								   almost all `Record`s as well as for most `Sequence`s and `String`s,
 								   when their sizes are known at the time serialization begins.
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
-												Progress

											
										
										
											2018-09-23 21:35:00 +00:00
+								 - (C) a variable-length streaming form with unknown or unpredictable
 								   length, used only seldom for `Record`s, since the number of fields
 								   in a `Record` is usually statically known, but sometimes used for
 								   `Sequence`s, `String`s etc., such as in cases when serialization
 								   begins before the number of elements or bytes in the corresponding
 								   `Value` is known.
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
-												Trim and improve

											
										
										
											2018-09-24 11:59:22 +00:00
+								Applications may choose between formats B and C depending on their
-												Progress

											
										
										
											2018-09-23 21:35:00 +00:00
+								needs at serialization time.
#### The lead byte

Every `Repr` starts with a *lead byte*, constructed by
`leadbyte(t,n,m)`, where `t`,`n`∈{0,1,2,3} and 0≤`m`<16:

    leadbyte(t,n,m) = [t*64 + n*16 + m]

The arguments `t` and `n` describe the rest of the
representation:[^some-encodings-unused]

  [^some-encodings-unused]: Some encodings are unused. All such
    encodings are reserved for future versions of this specification.

 - `t`=0, `n`=0 (format A) represents an `Atom` with fixed-length binary representation.
 - `t`=0, `n`=1 (format A) represents certain small `SignedInteger`s.
 - `t`=0, `n`=2 (format C) is a Stream Start byte.
 - `t`=0, `n`=3 (format C) is a Stream End byte.
 - `t`=1 (format B) represents an `Atom` with variable-length binary representation.
 - `t`=2 (format B) represents a `Record`.
 - `t`=3 (format B) represents a `Sequence`, `Set` or `Dictionary`.

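
The lead byte layout can be sketched in Python as follows. This is only an illustration of the formula above: the function name mirrors the text, while the range check and the choice of `bytes` as the return type are assumptions of the sketch, not requirements of Preserves.

```python
def leadbyte(t: int, n: int, m: int) -> bytes:
    """Pack t (2 bits), n (2 bits) and m (4 bits) into a single lead byte."""
    # The range check is an assumption of this sketch; the text only
    # states the permitted ranges of t, n and m.
    if not (0 <= t <= 3 and 0 <= n <= 3 and 0 <= m <= 15):
        raise ValueError("t and n must be in 0..3, m in 0..15")
    return bytes([t * 64 + n * 16 + m])


# A Record (t=2) with short form label number 1 and three fields.
print(leadbyte(2, 1, 3).hex())  # prints "93"
```

Because `t` and `n` each occupy two bits and `m` occupies four, the multiplications by 64 and 16 are equivalent to left shifts by 6 and 4 bits.
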
#### Encoding data of fixed length (format A)

Each specific type of data defines its own rules for this format.

#### Encoding data of known length (format B)

A `Repr` where the length of the `Value` to be encoded is variable but
known uses the value of `m` in `leadbyte` to encode its length. The
length counts *bytes* for atomic `Value`s, but counts *contained
values* for compound `Value`s.

 - A length `l` between 0 and 14 is represented using `leadbyte` with
   `m=l`.
 - A length of 15 or greater is represented by `m=15` and additional
   bytes describing the length following the lead byte.

The function `header(t,n,m)` yields an appropriate sequence of bytes
describing a `Repr`'s type and length when `t`, `n` and `m` are
appropriate non-negative integers:

    header(t,n,m) =    leadbyte(t,n,m)                 when m < 15
                    or leadbyte(t,n,15) ++ varint(m)   otherwise

The additional length bytes are formatted as
[base 128 varints][varint]. We write `varint(m)` for the
varint-encoding of `m`. Quoting the [Google Protocol Buffers][varint]
definition,

> Each byte in a varint, except the last byte, has the most
> significant bit (msb) set – this indicates that there are further
> bytes to come. The lower 7 bits of each byte are used to store the
> two's complement representation of the number in groups of 7 bits,
> least significant group first.

**Examples.**

 - The varint representation of 15 is just the byte 15.
 - 300 (binary, grouped into 7-bit chunks, `10 0101100`) varint-encodes to the two bytes 172 and 2.
 - 1000000000 (binary `11 1011100 1101011 0010100 0000000`) varint-encodes to bytes 128, 148, 235, 220, and 3.

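
A possible Python rendering of `varint` and `header` is sketched below. It follows the definitions above for non-negative lengths only; the function names come from the text, but the code itself is an illustrative assumption rather than a reference implementation.

```python
def leadbyte(t: int, n: int, m: int) -> bytes:
    """Lead byte as defined in the text: [t*64 + n*16 + m]."""
    return bytes([t * 64 + n * 16 + m])


def varint(m: int) -> bytes:
    """Base-128 varint of a non-negative integer, least significant group first."""
    if m < 0:
        raise ValueError("lengths are non-negative")
    out = bytearray()
    while True:
        group = m & 0x7F
        m >>= 7
        if m:
            out.append(group | 0x80)  # more groups follow: set the msb
        else:
            out.append(group)         # last group: msb clear
            return bytes(out)


def header(t: int, n: int, m: int) -> bytes:
    """Lead byte alone for m < 15, otherwise lead byte with m=15 plus varint(m)."""
    if m < 15:
        return leadbyte(t, n, m)
    return leadbyte(t, n, 15) + varint(m)


print(list(varint(300)))  # prints [172, 2]
```
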
#### Streaming data of unknown length (format C)

A `Repr` where the length of the `Value` to be encoded is variable and
not known at the time serialization of the `Value` starts is encoded
by a single Stream Start (“open”) byte, followed by zero or more
*chunks*, followed by a matching Stream End (“close”) byte:

     open(t,n) = leadbyte(0,2, t*4 + n)
    close(t,n) = leadbyte(0,3, t*4 + n)

For a `Repr` of a `Value` containing binary data, each chunk is to be
a format B `Repr` of a `ByteString`, no matter the type of the overall
`Repr`.

For a `Repr` of a `Value` containing other `Value`s, each chunk is to
be a single `Repr`.

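
The framing of a format C `Repr` might be sketched in Python as a generator that wraps already-encoded chunks between the open and close bytes. The names `stream_open`, `stream_close` and `stream` are hypothetical (chosen to avoid shadowing Python's built-in `open`), and the chunks are assumed to be valid `Repr`s supplied by the caller.

```python
from typing import Iterable, Iterator


def leadbyte(t: int, n: int, m: int) -> bytes:
    """Lead byte as defined in the text: [t*64 + n*16 + m]."""
    return bytes([t * 64 + n * 16 + m])


def stream_open(t: int, n: int) -> bytes:
    """open(t,n) from the text."""
    return leadbyte(0, 2, t * 4 + n)


def stream_close(t: int, n: int) -> bytes:
    """close(t,n) from the text."""
    return leadbyte(0, 3, t * 4 + n)


def stream(t: int, n: int, chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Yield the open byte, then each chunk as it becomes available, then the close byte."""
    yield stream_open(t, n)
    for chunk in chunks:  # chunks may come from a lazy source of unknown length
        yield chunk
    yield stream_close(t, n)


# The Sequence [1 2 3 4], streamed: each element is a one-byte small integer.
elements = (bytes([0x10 + i]) for i in range(1, 5))
print(b"".join(stream(3, 0, elements)).hex())  # prints "2c111213143c"
```
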
### Records

Format B (known length):

    [[ L(F_1...F_m) ]] = header(2,3,m+1) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]]

For `m` fields, `m+1` is supplied to `header`, to account for the
encoding of the record label.

Format C (streaming):

    [[ L(F_1...F_m) ]] = open(2,3) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]] ++ close(2,3)

Applications *SHOULD* prefer the known-length format for encoding
`Record`s.

#### Application-specific short form for labels

Any given protocol using Preserves may additionally define an
interpretation for `n`∈{0,1,2}, mapping each *short form label
number* `n` to a specific record label. When encoding `m` fields with
short form label number `n`, format B becomes

    header(2,n,m) ++ [[F_1]] ++...++ [[F_m]]

and format C becomes

    open(2,n) ++ [[F_1]] ++...++ [[F_m]] ++ close(2,n)

**Examples.** A protocol may choose to map records
labelled `void` to `n=0`, making

    [[void()]] = header(2,0,0) = [0x80]

or it may map records labelled `person` to short form label number 1,
making

    [[person("Dr", "Elizabeth", "Blackwell")]]
        = header(2,1,3) ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]
        =        [0x93] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]

for format B, or

        = open(2,1) ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ close(2,1)
        =    [0x29] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ [0x39]

for format C.

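
As a rough Python sketch of format B for `Record`s, the two helpers below build a generic record and a short form record from already-encoded parts. The helper names `record` and `short_record` are hypothetical, and the label and field arguments are assumed to be valid `Repr`s produced elsewhere.

```python
from typing import Sequence


def header(t: int, n: int, m: int) -> bytes:
    """header(t,n,m) from the text, with the varint extension for m >= 15."""
    if m < 15:
        return bytes([t * 64 + n * 16 + m])
    out = bytearray([t * 64 + n * 16 + 15])
    while m >= 0x80:
        out.append((m & 0x7F) | 0x80)
        m >>= 7
    out.append(m)
    return bytes(out)


def record(label_repr: bytes, field_reprs: Sequence[bytes]) -> bytes:
    """Generic record: n=3, and the count includes the label itself."""
    return header(2, 3, len(field_reprs) + 1) + label_repr + b"".join(field_reprs)


def short_record(n: int, field_reprs: Sequence[bytes]) -> bytes:
    """Short form record: n in {0,1,2} stands for an application-chosen label."""
    if n not in (0, 1, 2):
        raise ValueError("short form label numbers are 0, 1 or 2")
    return header(2, n, len(field_reprs)) + b"".join(field_reprs)


# With void mapped to short form label number 0:
print(short_record(0, []).hex())  # prints "80"
```

Passing pre-encoded fields keeps the sketch independent of how the other `Value` types are encoded.
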
### Sequences, Sets and Dictionaries

Format B (known length):

            [[ [X_1...X_m] ]] = header(3,0,m)   ++ [[X_1]] ++...++ [[X_m]]
        [[ #set{X_1...X_m} ]] = header(3,1,m)   ++ [[X_1]] ++...++ [[X_m]]
    [[ {K_1:V_1...K_m:V_m} ]] = header(3,2,m*2) ++ [[K_1]] ++ [[V_1]] ++...
                                                ++ [[K_m]] ++ [[V_m]]

Note that `m*2` is given to `header` for a `Dictionary`, since there
are two `Value`s in each key-value pair.

Format C (streaming):

            [[ [X_1...X_m] ]] = open(3,0) ++ [[X_1]] ++...++ [[X_m]] ++ close(3,0)
        [[ #set{X_1...X_m} ]] = open(3,1) ++ [[X_1]] ++...++ [[X_m]] ++ close(3,1)
    [[ {K_1:V_1...K_m:V_m} ]] = open(3,2) ++ [[K_1]] ++ [[V_1]] ++...
                                          ++ [[K_m]] ++ [[V_m]] ++ close(3,2)

Applications may use whichever format suits their needs on a
case-by-case basis.

There is *no* ordering requirement on the `X_i` elements or
`K_i`/`V_i` pairs.[^no-sorting-rationale] They may appear in any
order.

  [^no-sorting-rationale]: In the BitTorrent encoding format,
    [bencoding](http://www.bittorrent.org/beps/bep_0003.html#bencoding),
    dictionary key/value pairs must be sorted by key. This is a
    necessary step for ensuring serialization of `Value`s is
    canonical. We do not require that key/value pairs (or set
    elements) be in sorted order for serialized `Value`s, because (a)
    where canonicalization is used for cryptographic signatures, it is
    more reliable to simply retain the exact binary form of the signed
    document than to depend on canonical de- and re-serialization, and
    (b) sorting keys or elements makes no sense in streaming
    serialization formats.

    However, a quality implementation may wish to offer the programmer
    the option of serializing with set elements and dictionary keys in
    sorted order.

Note that `header(3,3,m)` and `open(3,3)`/`close(3,3)` are unused and reserved.

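
The format B rules for the three collection types could be sketched in Python as below. The helper names are hypothetical, elements are passed in as already-encoded `Repr`s, and the output simply follows the order of the input, since no ordering is required.

```python
from typing import Sequence, Tuple


def header(t: int, n: int, m: int) -> bytes:
    """header(t,n,m) from the text, with the varint extension for m >= 15."""
    if m < 15:
        return bytes([t * 64 + n * 16 + m])
    out = bytearray([t * 64 + n * 16 + 15])
    while m >= 0x80:
        out.append((m & 0x7F) | 0x80)
        m >>= 7
    out.append(m)
    return bytes(out)


def sequence(item_reprs: Sequence[bytes]) -> bytes:
    """Sequence: n=0, length counts contained values."""
    return header(3, 0, len(item_reprs)) + b"".join(item_reprs)


def set_(item_reprs: Sequence[bytes]) -> bytes:
    """Set: n=1; uniqueness of elements is the caller's responsibility here."""
    return header(3, 1, len(item_reprs)) + b"".join(item_reprs)


def dictionary(pair_reprs: Sequence[Tuple[bytes, bytes]]) -> bytes:
    """Dictionary: n=2; the length is twice the number of key/value pairs."""
    body = b"".join(k + v for k, v in pair_reprs)
    return header(3, 2, len(pair_reprs) * 2) + body


# [1 2 3 4], using the one-byte encodings of the small integers 1..4.
print(sequence([b"\x11", b"\x12", b"\x13", b"\x14"]).hex())  # prints "c411121314"
```
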
### SignedIntegers

Format B/A (known length/fixed-size):

    [[ x ]] when x ∈ SignedInteger = header(1,0,m) ++ intbytes(x)  if x<-3 ∨ 13≤x
                                     header(0,1,x+16)              if -3≤x<0
                                     header(0,1,x)                 if 0≤x<13

Integers in the range [-3,12] are compactly represented using format A
because they are so frequently used. Other integers are represented
using format B.

Format C *MUST NOT* be used for `SignedInteger`s.

The function `intbytes(x)` gives the big-endian two's-complement
binary representation of `x`, taking exactly as many whole bytes as
needed to unambiguously identify the value and its sign, and
`m = |intbytes(x)|`. The most-significant bit in the first byte in
`intbytes(x)` <!-- for `x`≠0 --> is the sign bit.[^zero-intbytes]

  [^zero-intbytes]: The value 0 needs zero bytes to identify the
    value, so `intbytes(0)` is the empty byte string. Non-zero values
    need at least one byte.

For example,

    [[   -257 ]] = 42 FE FF    [[     -3 ]] = 1D       [[    128 ]] = 42 00 80
    [[   -256 ]] = 42 FF 00    [[     -2 ]] = 1E       [[    255 ]] = 42 00 FF
    [[   -255 ]] = 42 FF 01    [[     -1 ]] = 1F       [[    256 ]] = 42 01 00
    [[   -254 ]] = 42 FF 02    [[      0 ]] = 10       [[  32767 ]] = 42 7F FF
    [[   -129 ]] = 42 FF 7F    [[      1 ]] = 11       [[  32768 ]] = 43 00 80 00
    [[   -128 ]] = 41 80       [[     12 ]] = 1C       [[  65535 ]] = 43 00 FF FF
    [[   -127 ]] = 41 81       [[     13 ]] = 41 0D    [[  65536 ]] = 43 01 00 00
    [[     -4 ]] = 41 FC       [[    127 ]] = 41 7F    [[ 131072 ]] = 43 02 00 00

### Strings, ByteStrings and Symbols

Syntax for these three types varies only in the value of `n` supplied
to `header`, `open`, and `close`. In each case, the payload following
the header is a binary sequence; for `String` and `Symbol`, it is a
UTF-8 encoding of the `Value`'s code points, while for `ByteString` it
is the raw data contained within the `Value` unmodified.

Format B (known length):

              [[ S ]] = header(1,n,m) ++ encode(S)
              where m = |encode(S)|
    and (n,encode(S)) = (1,utf8(S))  if S ∈ String
                        (2,S)        if S ∈ ByteString
                        (3,utf8(S))  if S ∈ Symbol

To stream a `String`, `ByteString` or `Symbol`, emit `open(1,n)` and
then a sequence of zero or more format B chunks, followed by
`close(1,n)`. Every chunk must be a `ByteString`.

While the overall content of a streamed `String` or `Symbol` must be
valid UTF-8, individual chunks do not have to conform to UTF-8.

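
Both forms can be sketched in Python as follows. The helper names are hypothetical, and `streamed` assumes the caller has already split the UTF-8 (or raw) payload into byte chunks; as the text permits, a chunk boundary may fall inside a multi-byte UTF-8 sequence.

```python
from typing import Iterable


def header(t: int, n: int, m: int) -> bytes:
    """header(t,n,m) from the text, with the varint extension for m >= 15."""
    if m < 15:
        return bytes([t * 64 + n * 16 + m])
    out = bytearray([t * 64 + n * 16 + 15])
    while m >= 0x80:
        out.append((m & 0x7F) | 0x80)
        m >>= 7
    out.append(m)
    return bytes(out)


def string(s: str) -> bytes:
    payload = s.encode("utf-8")
    return header(1, 1, len(payload)) + payload  # length counts bytes, not code points


def bytestring(b: bytes) -> bytes:
    return header(1, 2, len(b)) + b


def symbol(s: str) -> bytes:
    payload = s.encode("utf-8")
    return header(1, 3, len(payload)) + payload


def streamed(n: int, chunks: Iterable[bytes]) -> bytes:
    """Format C: open(1,n), ByteString chunks, close(1,n)."""
    open_byte = bytes([2 * 16 + (1 * 4 + n)])   # leadbyte(0,2, t*4+n) with t=1
    close_byte = bytes([3 * 16 + (1 * 4 + n)])  # leadbyte(0,3, t*4+n) with t=1
    return open_byte + b"".join(bytestring(c) for c in chunks) + close_byte


print(string("hello").hex())                # prints "5568656c6c6f"
print(streamed(1, [b"he", b"llo"]).hex())   # prints "25626865636c6c6f35"
```
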
### Fixed-length Atoms

Fixed-length atoms all use format A, and do not have a length
representation. They repurpose the bits that format B `Repr`s use to
specify lengths. Applications *MUST NOT* use format C with
`open(0,n)` or `close(0,n)` for any `n`.

#### Booleans

    [[ #false ]] = header(0,0,0) = [0x00]
    [[  #true ]] = header(0,0,1) = [0x01]

#### Floats and Doubles

    [[ F ]] when F ∈ Float  = header(0,0,2) ++ binary32(F)
    [[ D ]] when D ∈ Double = header(0,0,3) ++ binary64(D)

The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
8-byte IEEE 754 binary representations of `F` and `D`, respectively.

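
In Python, the fixed-length atoms could be sketched with the standard `struct` module, whose `">f"` and `">d"` formats produce big-endian IEEE 754 binary32 and binary64 values. The function names are hypothetical.

```python
import struct


def encode_boolean(b: bool) -> bytes:
    """header(0,0,0) for #false, header(0,0,1) for #true."""
    return bytes([0x01 if b else 0x00])


def encode_float(f: float) -> bytes:
    """header(0,0,2) followed by the 4-byte big-endian binary32 form."""
    return bytes([0x02]) + struct.pack(">f", f)


def encode_double(d: float) -> bytes:
    """header(0,0,3) followed by the 8-byte big-endian binary64 form."""
    return bytes([0x03]) + struct.pack(">d", d)


print(encode_float(1.0).hex())   # prints "023f800000"
print(encode_double(1.0).hex())  # prints "033ff0000000000000"
```

Note that Python's `float` is a double, so `encode_float` rounds its argument to single precision when packing.
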
## Examples

### Simple examples

<!-- TODO: Give some examples of large and small Preserves, perhaps -->
<!-- translated from various JSON blobs floating around the internet. -->

For the following examples, imagine an application that maps `Record`
short form label number 0 to label `discard`, 1 to `capture`, and 2 to
`observe`.

| Value                                             | Encoded hexadecimal byte sequence                                    |
|---------------------------------------------------|----------------------------------------------------------------------|
| `capture(discard())`                              | 91 80                                                                |
| `observe(speak(discard(), capture(discard())))`   | A1 B3 75 73 70 65 61 6B 80 91 80                                     |
| `[1 2 3 4]` (format B)                            | C4 11 12 13 14                                                       |
| `[1 2 3 4]` (format C)                            | 2C 11 12 13 14 3C                                                    |
| `[-2 -1 0 1]`                                     | C4 1E 1F 10 11                                                       |
| `"hello"` (format B)                              | 55 68 65 6C 6C 6F                                                    |
| `"hello"` (format C, 2 chunks)                    | 25 62 68 65 63 6C 6C 6F 35                                           |
| `"hello"` (format C, 5 chunks)                    | 25 62 68 65 62 6C 6C 60 60 61 6F 35                                  |
| `["hello" there #"world" [] #set{} #true #false]` | C7 55 68 65 6C 6C 6F 75 74 68 65 72 65 65 77 6F 72 6C 64 C0 D0 01 00 |
| `-257`                                            | 42 FE FF                                                             |
| `-1`                                              | 1F                                                                   |
| `0`                                               | 10                                                                   |
| `1`                                               | 11                                                                   |
| `255`                                             | 42 00 FF                                                             |
| `1.0f`                                            | 02 3F 80 00 00                                                       |
| `1.0`                                             | 03 3F F0 00 00 00 00 00 00                                           |
| `-1.202e300`                                      | 03 FE 3C B7 B7 59 BF 04 26                                           |
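
When reading rows like those above, it can help to split a lead byte back into `t`, `n` and `m`. The Python sketch below does this and names the rough category; the function `classify` and its category strings are hypothetical and cover only the first byte of a `Repr`, not a full decoder.

```python
def classify(lead: int):
    """Split a lead byte into (t, n, m) and describe what it introduces."""
    t, n, m = lead >> 6, (lead >> 4) & 0b11, lead & 0b1111
    if t == 0 and n == 0:
        return ("fixed-atom", m)                  # 0/1 Boolean, 2 Float, 3 Double
    if t == 0 and n == 1:
        return ("small-integer", m if m < 13 else m - 16)
    if t == 0:
        kind = "stream-start" if n == 2 else "stream-end"
        return (kind, (m >> 2, m & 0b11))         # the (t, n) of the streamed value
    kind = {1: "atom", 2: "record", 3: "collection"}[t]
    return (kind, (n, m))                         # m == 15 means a varint length follows


print(classify(0x91))  # prints ('record', (1, 1))
print(classify(0x2C))  # prints ('stream-start', (3, 0))
```
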

The next example uses a non-`Symbol` label for a record.[^extensibility2] The `Record`

    [titled person 2 thing 1](101, "Blackwell", date(1821 2 3), "Dr")

encodes to

    B5                              ;; Record, generic, 4+1
      C5                              ;; Sequence, 5
        76 74 69 74 6C 65 64            ;; Symbol, "titled"
        76 70 65 72 73 6F 6E            ;; Symbol, "person"
        12                              ;; SignedInteger, "2"
        75 74 68 69 6E 67               ;; Symbol, "thing"
        11                              ;; SignedInteger, "1"
      41 65                           ;; SignedInteger, "101"
      59 42 6C 61 63 6B 77 65 6C 6C   ;; String, "Blackwell"
      B4                              ;; Record, generic, 3+1
        74 64 61 74 65                  ;; Symbol, "date"
        42 07 1D                        ;; SignedInteger, "1821"
        12                              ;; SignedInteger, "2"
        13                              ;; SignedInteger, "3"
      52 44 72                        ;; String, "Dr"

  [^extensibility2]: It happens to line up with Racket's
    representation of a record label for an inheritance hierarchy
    where `titled` extends `person` extends `thing`:

        (struct date (year month day) #:prefab)
        (struct thing (id) #:prefab)
        (struct person thing (name date-of-birth) #:prefab)
        (struct titled person (title) #:prefab)

    For more detail on Racket's representations of record labels, see
    [the Racket documentation for `make-prefab-struct`](http://docs.racket-lang.org/reference/structutils.html#%28def._%28%28quote._~23~25kernel%29._make-prefab-struct%29%29).


---

### JSON examples

The examples from
[RFC 8259](https://tools.ietf.org/html/rfc8259#section-13) read as
valid Preserves, though the JSON literals `true`, `false` and `null`
read as `Symbol`s. The first example:

    {
      "Image": {
          "Width":  800,
          "Height": 600,
          "Title":  "View from 15th Floor",
          "Thumbnail": {
              "Url":    "http://www.example.com/image/481989943",
              "Height": 125,
              "Width":  100
          },
          "Animated" : false,
          "IDs": [116, 943, 234, 38793]
        }
    }

encodes to binary as follows:

    E2
      55 "Image"
      EC
        55 "Width"    42 03 20
        55 "Title"    5F 14 "View from 15th Floor"
        58 "Animated" 75 "false"
        56 "Height"   42 02 58
        59 "Thumbnail"
          E6
            55 "Width"  41 64
            53 "Url"    5F 26 "http://www.example.com/image/481989943"
            56 "Height" 41 7D
        53 "IDs"      C4
                        41 74
                        42 03 AF
                        42 00 EA
                        43 00 97 89

 								and the second example:
 								    [
 								      {
 								         "precision": "zip",
 								         "Latitude":  37.7668,
 								         "Longitude": -122.3959,
 								         "Address":   "",
 								         "City":      "SAN FRANCISCO",
 								         "State":     "CA",
 								         "Zip":       "94107",
 								         "Country":   "US"
 								      },
 								      {
 								         "precision": "zip",
 								         "Latitude":  37.371991,
 								         "Longitude": -122.026020,
 								         "Address":   "",
 								         "City":      "SUNNYVALE",
 								         "State":     "CA",
 								         "Zip":       "94085",
 								         "Country":   "US"
 								      }
 								    ]
 								encodes to binary as follows:
 								    C2
 								      EF 10
 "precision"  53 "zip"
 "Latitude"   03 40 42 E2 26 80 9D 49 52
 "Longitude"  03 C0 5E 99 56 6C F4 1F 21
 "Address"    50
 "City"       5D "SAN FRANCISCO"
 "State"      52 "CA"
 "Zip"        55 "94107"
 "Country"    52 "US"
 								      EF 10
 "precision"  53 "zip"
 "Latitude"   03 40 42 AF 9D 66 AD B4 03
 "Longitude"  03 C0 5E 81 AA 4F CA 42 AF
 "Address"    50
 "City"       59 "SUNNYVALE"
 "State"      52 "CA"
 "Zip"        55 "94085"
 "Country"    52 "US"
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
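As a rough cross-check of the integer entries in the JSON example encodings, the following Python sketch encodes a `SignedInteger` as a `4x` lead byte followed by a minimal big-endian two's-complement body. The function name `encode_signed_integer` is hypothetical, and the sketch assumes the integer lies outside the small-integer range -3..12 and that its body is shorter than 15 bytes.

```python
def encode_signed_integer(n: int) -> bytes:
    """Hypothetical helper: lead byte 0x40 | length, then the big-endian body."""
    if -3 <= n <= 12:
        raise ValueError("small integers use a separate one-byte form, not sketched here")
    # Count the magnitude bits, then leave room for the sign bit.
    magnitude_bits = (n if n >= 0 else ~n).bit_length()
    length = magnitude_bits // 8 + 1
    if length >= 15:
        raise ValueError("bodies of 15 or more bytes need a varint length, not sketched here")
    return bytes([0x40 | length]) + n.to_bytes(length, "big", signed=True)


# 234 needs a leading zero byte so that it is not read as a negative number.
print(encode_signed_integer(234).hex(" ").upper())
```

The leading `00` in `42 00 EA` and `43 00 97 89` falls out of the sign-bit rule: without it, the top bit of the body would mark the value as negative.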
## Conventions for Common Data Types

The `Value` data type is essentially an S-Expression, able to
represent semi-structured data over `ByteString`, `String`,
`SignedInteger` atoms and so on.[^why-not-spki-sexps]

  [^why-not-spki-sexps]: Rivest's S-Expressions are in many ways
    similar to Preserves. However, while they include binary data and
    sequences, and an obvious equivalence for them exists, they lack
    numbers *per se* as well as any kind of unordered structure such
    as sets or maps. In addition, while "display hints" allow
    labelling of binary data with an intended interpretation, they
    cannot be attached to any other kind of structure, and the "hint"
    itself can only be a binary blob.

However, users need a wide variety of data types for representing
domain-specific values such as various kinds of encoded and normalized
text, calendrical values, machine words, and so on.

Appropriately-labelled `Record`s denote these domain-specific data
types.[^why-dictionaries]

  [^why-dictionaries]: Given `Record`'s existence, it may seem odd
    that `Dictionary`, `Set`, `Float`, etc. are given special
    treatment. Preserves aims to offer a useful basic equivalence
    predicate to programmers, and so if a data type demands a special
    equivalence predicate, as `Dictionary`, `Set` and `Float` all do,
    then the type should be included in the base language. Otherwise,
    it can be represented as a `Record` and treated separately. Both
    `Boolean` and `String` are seeming exceptions: they merit
    inclusion because of their cultural importance.

All of these conventions are optional. They form a layer atop the core
`Value` structure. Non-domain-specific tools do not in general need to
treat them specially.

**Validity.** Many of the labels we will describe in this section come
  with side-conditions on the contents of labelled `Record`s. It is
  possible to construct an instance of `Value` that violates these
  side-conditions without ceasing to be a `Value` or becoming
  unrepresentable. However, we say that such a `Value` is *invalid*
  because it fails to honour the necessary side-conditions.
  Implementations *SHOULD* allow two modes of working: one which
  treats all `Value`s identically, without regard for side-conditions,
  and one which enforces validity (i.e. side-conditions) when reading,
  writing, or constructing `Value`s.

### MIME-type tagged binary data

Many internet protocols use
[media types](https://tools.ietf.org/html/rfc6838) (a.k.a. MIME types)
to indicate the format of some associated binary data. For this
purpose, we define `MIMEData` to be a record labelled `mime` with two
fields, the first being a `Symbol`, the media type, and the second
being a `ByteString`, the binary data.

While each media type may define its own rules for comparing
documents, we define ordering among `MIMEData` *representations* of
such media types following the general rules for ordering of
`Record`s.

**Examples.**

| Value                                      | Encoded hexadecimal byte sequence                                                                                 |
|--------------------------------------------|-------------------------------------------------------------------------------------------------------------------|
| `mime(application/octet-stream #"abcde")`  | B3 74 6D 69 6D 65 7F 18 61 70 70 6C 69 63 61 74 69 6F 6E 2F 6F 63 74 65 74 2D 73 74 72 65 61 6D 65 61 62 63 64 65 |
| `mime(text/plain #"ABC")`                  | B3 74 6D 69 6D 65 7A 74 65 78 74 2F 70 6C 61 69 6E 63 41 42 43                                                    |
| `mime(application/xml #"<xhtml/>")`        | B3 74 6D 69 6D 65 7F 0F 61 70 70 6C 69 63 61 74 69 6F 6E 2F 78 6D 6C 68 3C 78 68 74 6D 6C 2F 3E                   |
| `mime(text/csv #"123,234,345")`            | B3 74 6D 69 6D 65 78 74 65 78 74 2F 63 73 76 6B 31 32 33 2C 32 33 34 2C 33 34 35                                  |

Applications making heavy use of `mime` records may choose to use a
short form label number for the record type. For example, if short
form label number 1 were chosen, the second example above,
`mime(text/plain #"ABC")`, would be encoded with "92" in place of "B3
74 6D 69 6D 65".
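
The byte sequences in the examples can be reproduced with a short Python sketch. All function names here are hypothetical. The sketch assumes the layout visible in the examples: a lead byte whose low nibble holds the length (or `F` followed by a base-128 varint when the length is 15 or more), `7x` for `Symbol`, `6x` for `ByteString`, `Bx` for a generic `Record` whose count includes the label, and `8x`, `9x`, `Ax` for short form labels 0 to 2.

```python
from typing import Optional


def varint(n: int) -> bytes:
    """Base-128 varint, least significant group first."""
    out = bytearray()
    while True:
        group, n = n & 0x7F, n >> 7
        if n:
            out.append(group | 0x80)  # more groups follow
        else:
            out.append(group)
            return bytes(out)


def header(high_nibble: int, length: int) -> bytes:
    """Lead byte with the length in the low nibble, or 0xF plus a varint."""
    if length < 15:
        return bytes([(high_nibble << 4) | length])
    return bytes([(high_nibble << 4) | 0x0F]) + varint(length)


def encode_symbol(name: str) -> bytes:
    data = name.encode("utf-8")
    return header(0x7, len(data)) + data


def encode_bytestring(data: bytes) -> bytes:
    return header(0x6, len(data)) + data


def encode_mime(media_type: str, body: bytes, short_label: Optional[int] = None) -> bytes:
    fields = encode_symbol(media_type) + encode_bytestring(body)
    if short_label is None:
        # Generic record: the count covers the label plus the two fields.
        return header(0xB, 3) + encode_symbol("mime") + fields
    # Short form: the label is implied by the lead byte, so only fields are counted.
    return header(0x8 + short_label, 2) + fields


print(encode_mime("text/plain", b"ABC").hex(" ").upper())
```

Passing `short_label=1` replaces the six-byte generic prefix with the single byte `92`, which is the saving the paragraph above describes.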

### Unicode normalization forms

Unicode defines multiple
[normalization forms](http://unicode.org/reports/tr15/) for text.
While no particular normalization form is required for `String`s,
users may need to unambiguously signal or require a particular
normalization form. A `NormalizedString` is a `Record` labelled with
`unicode-normalization` and having two fields, the first of which is a
`Symbol` specifying the normalization form used (e.g. `nfc`, `nfd`,
`nfkc`, `nfkd`), and the second of which is a `String` whose
underlying code point representation *MUST* be normalized according to
the named normalization form.
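
One way to check the side-condition on a `NormalizedString` is to normalize the text again and compare. The following Python sketch uses the standard `unicodedata` module; the function name `is_valid_normalized_string` is hypothetical, and representing the form `Symbol` as a plain Python string is an assumption.

```python
import unicodedata

# Map the Symbol names used in the text to the names unicodedata expects.
FORMS = {"nfc": "NFC", "nfd": "NFD", "nfkc": "NFKC", "nfkd": "NFKD"}


def is_valid_normalized_string(form: str, text: str) -> bool:
    """True when `text` is already in the normalization form named by `form`."""
    if form not in FORMS:
        return False
    return unicodedata.normalize(FORMS[form], text) == text


print(is_valid_normalized_string("nfc", "p\u00e4ron"))   # precomposed a-umlaut
print(is_valid_normalized_string("nfc", "pa\u0308ron"))  # a + combining diaeresis
```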

### IRIs (URIs, URLs, URNs, etc.)

An `IRI` is a `Record` labelled with `iri` and having one field, a
`String` which is the IRI itself and which *MUST* be a valid absolute
or relative IRI.

### Machine words

The definition of `SignedInteger` captures all integers. However, in
certain circumstances it can be valuable to assert that a number
inhabits a particular range, such as a fixed-width machine word.

A family of labels `i`*n* and `u`*n* for *n* ∈ {8,16,32,64} denote
*n*-bit-wide signed and unsigned range restrictions, respectively.
Records with these labels *MUST* have one field, a `SignedInteger`,
which *MUST* fall within the appropriate range. That is, to be valid,

 - in `i8(`*x*`)`, -128 <= *x* <= 127.
 - in `u8(`*x*`)`, 0 <= *x* <= 255.
 - in `i16(`*x*`)`, -32768 <= *x* <= 32767.
 - etc.
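
The remaining ranges follow the same two's-complement pattern. As a minimal sketch in Python (the function names `word_range` and `is_valid_word` are hypothetical), the bounds for any of the eight labels can be computed rather than listed:

```python
import re
from typing import Tuple


def word_range(label: str) -> Tuple[int, int]:
    """Inclusive (low, high) bounds for a label such as "i8" or "u64"."""
    match = re.fullmatch(r"([iu])(8|16|32|64)", label)
    if match is None:
        raise ValueError(f"not a machine-word label: {label!r}")
    bits = int(match.group(2))
    if match.group(1) == "i":
        return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return 0, (1 << bits) - 1


def is_valid_word(label: str, x: int) -> bool:
    """Check the range side-condition for a record such as i8(x)."""
    low, high = word_range(label)
    return low <= x <= high


print(word_range("i16"))
```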

### Anonymous Tuples and Unit

A `Tuple` is a `Record` with label `tuple` and zero or more fields,
denoting an anonymous tuple of values.

The 0-ary tuple, `tuple()`, denotes the empty tuple, sometimes called
"unit" or "void" (but *not* e.g. JavaScript's "undefined" value).

### Null and Undefined

Tony Hoare's
"[billion-dollar mistake](https://en.wikipedia.org/wiki/Tony_Hoare#Apologies_and_retractions)"
can be represented with the 0-ary `Record` `null()`. An "undefined"
value can be represented as `undefined()`.

### Dates and Times

Dates, times, moments, and timestamps can be represented with a
`Record` with label `rfc3339` having a single field, a `String`, which
*MUST* conform to one of the `full-date`, `partial-time`, `full-time`,
or `date-time` productions of
[section 5.6 of RFC 3339](https://tools.ietf.org/html/rfc3339#section-5.6).
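
A syntactic check against the four productions can be sketched with regular expressions. This Python sketch is an approximation: the function name `rfc3339_production` is hypothetical, and it checks only the shape of the string, not the field ranges (month 1 to 12, leap seconds and so on) that RFC 3339 also constrains.

```python
import re
from typing import Optional

_DATE = r"[0-9]{4}-[0-9]{2}-[0-9]{2}"
_PARTIAL_TIME = r"[0-9]{2}:[0-9]{2}:[0-9]{2}(?:\.[0-9]+)?"
_OFFSET = r"(?:[Zz]|[+-][0-9]{2}:[0-9]{2})"

PRODUCTIONS = {
    "full-date": _DATE,
    "partial-time": _PARTIAL_TIME,
    "full-time": _PARTIAL_TIME + _OFFSET,
    "date-time": _DATE + r"[Tt]" + _PARTIAL_TIME + _OFFSET,
}


def rfc3339_production(text: str) -> Optional[str]:
    """Name of the first production whose shape matches `text`, or None."""
    for name, pattern in PRODUCTIONS.items():
        if re.fullmatch(pattern, text):
            return name
    return None


print(rfc3339_production("2018-09-24T13:09:26+00:00"))
```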

## Security Considerations

**Empty chunks.** Streamed (format C) `String`s, `ByteString`s and
`Symbol`s may include chunks of zero length. This opens up a
possibility for denial-of-service: an attacker may begin streaming a
string, sending an endless sequence of zero length chunks, appearing
to make progress but not actually doing so. Implementations may place
optional reasonable restrictions on the number of consecutive empty
chunks that may appear in a stream, and may even supply an optional
mode that rejects empty chunks entirely.
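
Such a restriction might be sketched as a wrapper around whatever yields the decoded chunks. In this Python sketch the names `guard_empty_chunks` and `TooManyEmptyChunks` are hypothetical, the default limit of 8 is an arbitrary assumption, and chunks are modelled as `bytes` values.

```python
from typing import Iterable, Iterator


class TooManyEmptyChunks(Exception):
    """Raised when a stream exceeds the allowed run of empty chunks."""


def guard_empty_chunks(chunks: Iterable[bytes],
                       max_consecutive_empty: int = 8) -> Iterator[bytes]:
    """Pass chunks through, failing on too many consecutive empty ones."""
    run = 0
    for chunk in chunks:
        if len(chunk) == 0:
            run += 1
            if run > max_consecutive_empty:
                raise TooManyEmptyChunks(f"{run} consecutive empty chunks")
        else:
            run = 0  # real progress resets the counter
        yield chunk


print(b"".join(guard_empty_chunks([b"ab", b"", b"cd"])))
```

Setting `max_consecutive_empty=0` gives the stricter mode that rejects empty chunks entirely.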

**Whitespace.** Similarly, the textual format for `Value`s allows
arbitrary whitespace in many positions. In streaming transfer
situations, consider optional restrictions on the amount of
consecutive whitespace that may appear in a serialized `Value`.

**Canonical form for cryptographic hashing and signing.** As
specified, neither the textual nor the compact binary encoding rules
for `Value`s force canonical serializations. Two serializations of the
same `Value` may yield different binary `Repr`s.

## Appendix. Table of lead byte values

     00 - False
     01 - True
     02 - Float
     03 - Double
    (0x)  RESERVED 04-0F
     1x - Small integers 0..12,-3..-1
     2x - Start Stream
     3x - End Stream
     4x - SignedInteger
     5x - String
     6x - ByteString
     7x - Symbol
     8x - short form Record label index 0
     9x - short form Record label index 1
     Ax - short form Record label index 2
     Bx - Record
     Cx - Sequence
     Dx - Set
     Ex - Dictionary
    (Fx)  RESERVED F0-FF

## Appendix. Bit fields within lead byte values

     tt nn mmmm  contents
     ---------- ---------

     00 00 0000  False
     00 00 0001  True
     00 00 0010  Float, 32 bits big-endian binary
     00 00 0011  Double, 64 bits big-endian binary

     00 01 xxxx  Small integers 0..12,-3..-1

     00 10 ttnn  Start Stream <tt,nn>
                   When tt = 00 --> error
                             01 --> each chunk is a ByteString
                             1x --> each chunk is a single encoded Value
     00 11 ttnn  End Stream <tt,nn> (must match preceding Start Stream)

     01 00 mmmm  SignedInteger, big-endian binary
     01 01 mmmm  String, UTF-8 binary
     01 10 mmmm  ByteString
     01 11 mmmm  Symbol, UTF-8 binary

     10 00 mmmm  application-specific Record
     10 01 mmmm  application-specific Record
     10 10 mmmm  application-specific Record
     10 11 mmmm  Record

     11 00 mmmm  Sequence
     11 01 mmmm  Set
     11 10 mmmm  Dictionary

     If mmmm = 1111, a varint(m) follows, giving the length, before
     the body; otherwise, m is the length of the body to follow.
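
The bit-field layout lends itself to a small lookup. As an illustrative Python sketch (the names `split_lead_byte` and `describe_lead_byte` are hypothetical), a lead byte can be split into its `tt`, `nn` and `mmmm` fields and mapped back to the row of the table it belongs to:

```python
from typing import Tuple

ATOMS = {0x0: "False", 0x1: "True", 0x2: "Float", 0x3: "Double"}

KINDS = {
    (0, 1): "Small integer",
    (0, 2): "Start Stream",
    (0, 3): "End Stream",
    (1, 0): "SignedInteger",
    (1, 1): "String",
    (1, 2): "ByteString",
    (1, 3): "Symbol",
    (2, 0): "short form Record label index 0",
    (2, 1): "short form Record label index 1",
    (2, 2): "short form Record label index 2",
    (2, 3): "Record",
    (3, 0): "Sequence",
    (3, 1): "Set",
    (3, 2): "Dictionary",
}


def split_lead_byte(byte: int) -> Tuple[int, int, int]:
    """Return the (tt, nn, mmmm) bit fields of a lead byte."""
    return (byte >> 6) & 0x3, (byte >> 4) & 0x3, byte & 0xF


def describe_lead_byte(byte: int) -> str:
    tt, nn, mmmm = split_lead_byte(byte)
    if (tt, nn) == (0, 0):
        return ATOMS.get(mmmm, "RESERVED")  # 04-0F are reserved
    return KINDS.get((tt, nn), "RESERVED")  # F0-FF are reserved


print(describe_lead_byte(0xB3), split_lead_byte(0xB3))
```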

## Appendix. Representing Values in Programming Languages

We have given a definition of `Value` and its semantics, and proposed
a concrete syntax for communicating and storing `Value`s. We now turn
to **suggested** representations of `Value`s as *programming-language
values* for various programming languages.

When designing a language mapping, an important consideration is
roundtripping: serialization after deserialization, and vice versa,
should both be identities.

### JavaScript

 - `Boolean` ↔ `Boolean`
 - `Float` and `Double` ↔ numbers
 - `SignedInteger` ↔ numbers or `BigInt` (see [here](https://developers.google.com/web/updates/2018/05/bigint) and [here](https://github.com/tc39/proposal-bigint))
 - `String` ↔ strings
 - `ByteString` ↔ `Uint8Array`
 - `Symbol` ↔ `Symbol.for(...)`
 - `Record` ↔ `{ "_label": theLabel, "_fields": [field0, ..., fieldN] }`, plus convenience accessors
    - `undefined()` ↔ the undefined value
    - `rfc3339(F)` ↔ `Date`, if `F` matches the `date-time` RFC 3339 production
 - `Sequence` ↔ `Array`
 - `Set` ↔ `{ "_set": M }` where `M` is a `Map` from the elements of the set to `true`
 - `Dictionary` ↔ a `Map`

### Scheme/Racket

 - `Boolean` ↔ booleans
 - `Float` and `Double` ↔ inexact numbers (Racket: single- and double-precision floats)
 - `SignedInteger` ↔ exact numbers
 - `String` ↔ strings
 - `ByteString` ↔ byte vector (Racket: "Bytes")
 - `Symbol` ↔ symbols
 - `Record` ↔ structures (Racket: prefab struct)
 - `Sequence` ↔ lists
 - `Set` ↔ Racket: sets
 - `Dictionary` ↔ Racket: hash-table

### Java

 - `Boolean` ↔ `Boolean`
 - `Float` and `Double` ↔ `Float` and `Double`
 - `SignedInteger` ↔ `Integer`, `Long`, `BigInteger`
 - `String` ↔ `String`
 - `ByteString` ↔ `byte[]`
 - `Symbol` ↔ a simple data class wrapping a `String`
 - `Record` ↔ in a simple implementation, a generic `Record` class; else perhaps a bean mapping?
    - `mime(T B)` ↔ an implementation of `javax.activation.DataSource`?
 - `Sequence` ↔ an implementation of `java.util.List`
 - `Set` ↔ an implementation of `java.util.Set`
 - `Dictionary` ↔ an implementation of `java.util.Map`

### Erlang

 - `Boolean` ↔ `true` and `false`
 - `Float` and `Double` ↔ floats (unsure how Erlang deals with single-precision)
 - `SignedInteger` ↔ integers
 - `String` ↔ pair of `utf8` and a binary
 - `ByteString` ↔ a binary
 - `Symbol` ↔ pair of `atom` and a binary
 - `Record` ↔ triple of `obj`, label, and field list
 - `Sequence` ↔ a list
 - `Set` ↔ a `sets` set
 - `Dictionary` ↔ a [map][erlang-map] (new in Erlang/OTP R17)

This is a somewhat unsatisfactory mapping because: (a) Erlang doesn't
garbage-collect its atoms, meaning that (a.1) representing `Symbol`s
as atoms could lead to denial-of-service and (a.2) representing
`Symbol`-labelled `Record`s as Erlang records must be rejected for the
same reason; (b) even if it did, Erlang's boolean values are atoms,
which would then clash with the `Symbol`s `true` and `false`; and (c)
Erlang has no distinct string type, making for a trilemma where
`String`s are in danger of clashing with `ByteString`s, `Sequence`s,
or `Record`s.

### Python

 - `Boolean` ↔ `True` and `False`
 - `Float` ↔ a `Float` wrapper-class for a double-precision value
 - `Double` ↔ float
 - `SignedInteger` ↔ int and long
 - `String` ↔ `unicode`
 - `ByteString` ↔ `bytes`
 - `Symbol` ↔ a simple data class wrapping a `unicode`
 - `Record` ↔ something like `namedtuple`, but that doesn't care about class identity?
 - `Sequence` ↔ `tuple` (but accept `list` during encoding)
 - `Set` ↔ `frozenset` (but accept `set` during encoding)
 - `Dictionary` ↔ a hashable (immutable) dictionary-like thing (but accept `dict` during encoding)

### Squeak Smalltalk

 - `Boolean` ↔ `true` and `false`
 - `Float` ↔ perhaps a subclass of `Float`?
 - `Double` ↔ `Float`
 - `SignedInteger` ↔ `Integer`
 - `String` ↔ `WideString`
 - `ByteString` ↔ `ByteArray`
 - `Symbol` ↔ `WideSymbol`
 - `Record` ↔ a simple data class
 - `Sequence` ↔ `ArrayedCollection` (usually `OrderedCollection`)
 - `Set` ↔ `Set`
 - `Dictionary` ↔ `Dictionary`

## Appendix. Why not Just Use JSON?

<!-- JSON lacks semantics: JSON syntax doesn't denote anything -->

JSON offers *syntax* for numbers, strings, booleans, null, arrays and
string-keyed maps. However, it suffers from two major problems. First,
it offers no *semantics* for the syntax: it is left to each
implementation to determine how to treat each JSON term. This causes
[interoperability](http://seriot.ch/parsing_json.php) and even
[security](http://web.archive.org/web/20180906202559/http://docs.couchdb.org/en/stable/cve/2017-12635.html)
issues. Second, JSON's lack of support for type tags leads to awkward
and incompatible *encodings* of type information in terms of the fixed
suite of constructors on offer.

There are other minor problems with JSON having to do with its syntax.
Examples include its relative verbosity and its lack of support for
binary data.

### JSON syntax doesn't *mean* anything

When are two JSON values the same? When are they different?
<!-- When is one JSON value "less than" another? -->

The specifications are largely silent on these questions. Different
JSON implementations give different answers.

Specifically, JSON does not:

 - assign any meaning to numbers,[^meaning-ieee-double]
 - determine how strings are to be compared,[^string-key-comparison]
 - determine whether object key ordering is significant,[^json-member-ordering] or
 - determine whether duplicate object keys are permitted, what it
   would mean if they were, or how to determine a duplicate in the
   first place.[^json-key-uniqueness]

In short, JSON syntax doesn't *denote* anything.[^xml-infoset] [^other-formats]

  [^meaning-ieee-double]:
    [Section 6 of RFC 8259](https://tools.ietf.org/html/rfc8259#section-6)
    does go so far as to indicate “good interoperability can be
    achieved” by imagining that parsers are able reliably to
    understand the syntax of numbers as denoting an IEEE 754
    double-precision floating-point value.

  [^string-key-comparison]:
    [Section 8.3 of RFC 8259](https://tools.ietf.org/html/rfc8259#section-8.3)
    suggests that *if* an implementation compares strings used as
    object keys “code unit by code unit”, then it will interoperate
    with *other such implementations*, but neither requires this
    behaviour nor discusses comparisons of strings used in other
    contexts.

  [^json-member-ordering]:
    [Section 4 of RFC 8259](https://tools.ietf.org/html/rfc8259#section-4)
    remarks that “[implementations] differ as to whether or not they
    make the ordering of object members visible to calling software.”

  [^json-key-uniqueness]:
    [Section 4 of RFC 8259](https://tools.ietf.org/html/rfc8259#section-4)
    is the only place in the specification that mentions the issue. It
    explicitly sanctions implementations supporting duplicate keys,
    noting only that “when the names within an object are not unique,
    the behavior of software that receives such an object is
    unpredictable.” Implementations are free to choose any behaviour
    at all in this situation, including signalling an error, or
    discarding all but one of a set of duplicates.

  [^xml-infoset]: The XML world has the concept of
    [XML infoset](https://www.w3.org/TR/xml-infoset/). Loosely
    speaking, XML infoset is the *denotation* of an XML document; the
    *meaning* of the document.

  [^other-formats]: Most other recent data languages are like JSON in
    specifying only a syntax with no associated semantics. While some
    do make a sketch of a semantics, the result is often
    underspecified (e.g. in terms of how strings are to be compared),
    overly machine-oriented (e.g. treating 32-bit integers as
    fundamentally distinct from 64-bit integers and from
    floating-point numbers), overly fine (e.g. giving visibility to
    the order in which map entries are written), or all three.

Some examples:

 - are the JSON values `1`, `1.0`, and `1e0` the same or different?
 - are the JSON values `1.0` and `1.0000000000000001` the same or different?
 - are the JSON strings `"päron"` (UTF-8 `70c3a4726f6e`) and `"päron"`
   (UTF-8 `7061cc88726f6e`) the same or different?
 - are the JSON objects `{"a":1, "b":2}` and `{"b":2, "a":1}` the same
   or different?
 - which, if any, of `{"a":1, "a":2}`, `{"a":1}` and `{"a":2}` are the
   same? Are all three legal?
 - are `{"päron":1}` and `{"päron":1}` the same or different?
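
To see how one particular implementation settles these questions, the following sketch runs several of them through Python's standard `json` module. The answers are choices made by that library, not requirements of the JSON specification, and other implementations may answer differently.

```python
import json
import unicodedata

# Numbers: the Python type is chosen from the syntax alone.
number_types = [type(json.loads(text)).__name__ for text in ("1", "1.0", "1e0")]

# Both literals round to the same IEEE 754 double.
same_double = json.loads("1.0") == json.loads("1.0000000000000001")

# Strings: precomposed and decomposed spellings compare unequal
# code point by code point, until one of them is normalized.
precomposed = json.loads('"p\\u00e4ron"')
decomposed = json.loads('"pa\\u0308ron"')
strings_equal = precomposed == decomposed
equal_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed

# Objects: key order is ignored by dict equality, and the last duplicate wins.
order_ignored = json.loads('{"a":1, "b":2}') == json.loads('{"b":2, "a":1}')
duplicate = json.loads('{"a":1, "a":2}')

print(number_types, same_double, strings_equal, equal_after_nfc,
      order_ignored, duplicate)
```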
 								### JSON can multiply nicely, but it can't add very well
 								JSON includes a fixed set of types: numbers, strings, booleans, null,
 								arrays and string-keyed maps. Domain-specific data must be *encoded*
 								into these types. For example, dates and email addresses are often
 								represented as strings with an implicit internal structure.
 								There is no convention for *labelling* a value as belonging to a
 								particular category. Instead, JSON-encoded data are often labelled in
 								an ad-hoc way. Multiple incompatible approaches exist. For example, a
 								"money" structure containing a `currency` field and an `amount` may be
 								represented in any number of ways:
 								    { "_type": "money", "currency": "EUR", "amount": 10 }
 								    { "type": "money", "value": { "currency": "EUR", "amount": 10 } }
 								    [ "money", { "currency": "EUR", "amount": 10 } ]
 								    { "@money": { "currency": "EUR", "amount": 10 } }
 								This causes particular problems when JSON is used to represent *sum*
 								or *union* types, such as "either a value or an error, but not both".
 								Again, multiple incompatible approaches exist.
 								For example, imagine an API for depositing money in an account. The
 								response might be either a "success" response indicating the new
 								balance, or one of a set of possible errors.
 								Sometimes, a *pair* of values is used, with `null` marking the option
 								not taken.[^interesting-failure-mode]
 								    { "ok": { "balance": 210 }, "error": null }
 								    { "ok": null, "error": "Unauthorized" }
 								  [^interesting-failure-mode]: What is the meaning of a document where
 								    both `ok` and `error` are non-null? What might happen when a
 								    program is presented with such a document?
 								The branch not chosen is sometimes present, sometimes omitted as if it
 								were an optional field:
 								    { "ok": { "balance": 210 } }
 								    { "error": "Unauthorized" }
 								Sometimes, an array of a label and a value is used:
 								    [ "ok", { "balance": 210 } ]
 								    [ "error", "Unauthorized" ]
 								Sometimes, the shape of the data is sufficient to distinguish among
 								the alternatives, and the label is left implicit:
 								    { "balance": 210 }
 								    "Unauthorized"
 								JSON itself does not offer any guidance on which of these options to
 								choose. In many real cases on the web, poor choices have led to
 								encodings that are irrecoverably ambiguous.
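To make the cost concrete, here is a minimal sketch of what a client has to do if it cannot know in advance which of the four conventions above a server uses. The function name `decode_deposit_response` and the `("ok" | "error", payload)` result shape are hypothetical, introduced only for this example.

```python
def decode_deposit_response(doc):
    """Return ("ok", payload) or ("error", payload) for any of the four
    conventions shown above; raise ValueError when the document is ambiguous."""
    # Array of a label and a value: ["ok", {...}] or ["error", "..."]
    if isinstance(doc, list) and len(doc) == 2 and doc[0] in ("ok", "error"):
        return (doc[0], doc[1])
    # Implicit label, error branch: a bare string
    if isinstance(doc, str):
        return ("error", doc)
    if isinstance(doc, dict):
        # Pair of values, with the unused branch either null or omitted
        if "ok" in doc or "error" in doc:
            ok, error = doc.get("ok"), doc.get("error")
            if (ok is None) == (error is None):
                # Both branches set, or neither: the document has no clear meaning.
                raise ValueError("exactly one of 'ok' and 'error' must be non-null")
            return ("ok", ok) if ok is not None else ("error", error)
        # Implicit label, success branch: recognised by shape alone
        if "balance" in doc:
            return ("ok", doc)
    raise ValueError("unrecognised response shape")
```

Even this guesswork fails for the document described in the footnote above, where both `ok` and `error` are non-null: the sketch can only reject it, because nothing in JSON says which branch should win.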
 								# Open questions
 								Q. Should "symbols" instead be URIs? Relative, usually; relative to
 								what? Some domain-specific base URI?
 								Q. Literal small integers: are they pulling their weight? They're not
 								absolutely necessary. They mess up the connection between
 								value-type-ordering and repr-tag-ordering! (The connection between
 								*value* ordering and *repr* ordering is already irretrievably messed
 								up: length prefixes blow lexicographic ordering away, sign bits are
 								the wrong way around, floats are sign-magnitude, etc etc.)
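A small sketch of the length-prefix problem, using a made-up encoding: a single length byte followed by the UTF-8 bytes. This is an assumption for illustration and not the actual Preserves binary format, but any length-prefixed representation behaves the same way.

```python
def encode_with_length_prefix(text: str) -> bytes:
    """Hypothetical encoding: one length byte, then the UTF-8 bytes."""
    data = text.encode("utf-8")
    if len(data) > 255:
        raise ValueError("this toy encoding only handles short strings")
    return bytes([len(data)]) + data

values = ["bzz", "c", "caa"]

# Ordering the values themselves, code point by code point.
by_value = sorted(values)

# Ordering the encoded bytes lexicographically: the length byte is compared
# first, so the shortest string sorts ahead of everything else.
by_encoding = sorted(values, key=encode_with_length_prefix)
```

Here `by_value` is `["bzz", "c", "caa"]` while `by_encoding` is `["c", "bzz", "caa"]`, so comparing encoded bytes gives a different answer from comparing the values.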
 								Q. Should we go for trying to make the data ordering line up with the
 								encoding ordering? We'd have to only use streaming forms, and avoid
 								the small integer encoding, and not store record arities, and sort
 								sets and dictionaries, and mask floats and doubles (perhaps
 								[like this](https://stackoverflow.com/questions/43299299/sorting-floating-point-values-using-their-byte-representation)),
 								and pick a specific `NaN`, and I don't know what to do about
 								SignedIntegers. Perhaps make them more like float formats, with the
 								byte count acting as a kind of exponent underneath the sign bit.
 								 - Perhaps define separate additional canonicalization restrictions?
 								   Doesn't help the ordering, but does help the equivalence.
 								 - Canonicalization and early-bailout-equivalence-checking are in
 								   tension with support for streaming values.
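For the "mask floats and doubles" step, one common trick is sketched below: flip the sign bit of non-negative doubles and invert every bit of negative ones, so that the big-endian bytes sort in numeric order. This is an assumption about what such masking could look like, not a decision about the format. The name `double_sort_key` is hypothetical, and `NaN` is deliberately left out, since picking a specific `NaN` is a separate open point.

```python
import struct

SIGN_BIT = 1 << 63
ALL_BITS = 0xFFFFFFFFFFFFFFFF

def double_sort_key(x: float) -> bytes:
    """Map an IEEE 754 double to 8 bytes whose lexicographic order matches
    numeric order (NaN is not handled by this sketch)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    if bits & SIGN_BIT:
        bits ^= ALL_BITS   # negative: invert all bits, reversing their order
    else:
        bits ^= SIGN_BIT   # non-negative: set the top bit, placing them above negatives
    return struct.pack(">Q", bits)

samples = [3.5, 0.0, -1.0, float("inf"), -2.5e10, 1e-300, float("-inf")]
by_bytes = sorted(samples, key=double_sort_key)
by_value = sorted(samples)
```

With these samples `by_bytes` and `by_value` agree. Note that the masked bytes also order `-0.0` strictly before `0.0`, even though the two compare equal as numbers, so a choice would still be needed there.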
 								Q. The postfix fields in the textual syntax come unannounced: "oh, and
 								another thing, what you just read is a label, and here are some
 								fields." This is a problem for interactive reading of textual syntax,
 								because after a complete term, it needs to see the next character to
 								tell whether it is an open-parenthesis or not! For this reason, I've
 								disallowed whitespace between a label `Value` and the open-parenthesis
 								of the fields. Is this reasonable?
 								Q. To remain compatible with JSON, portions of the text syntax have to
 								remain case-insensitive (`%i"..."`). However, non-JSON extensions do
 								not. There's only one (?) at the moment, the `%i"f"` in `Float`;
 								should it be changed to case-sensitive?
 								TODO: Examples of the ordering. `"bzz" < "c" < "caa"`; `#true < 3 < "3" < |3|`
 								TODO: Probably should add a canonicalized subset. Consider adding
 								explicit "I promise this is canonical" marker, like a BOM, which
 								identifies a binary value as (first) binary and (second, optionally)
 								as canonical. UTF-8 disallows byte `0xFF` from appearing anywhere in a
 								text; this might be a good candidate for a marker sequence.
 								((Actually, perhaps `0x10` would be good! It corresponds to DLE, "data
 								link escape"; it is not a printable ASCII character, and is disallowed
 								in the textual Preserves grammar; and it is also mnemonic for "version
 								0", since it is the Preserves binary encoding of the small integer
 								zero.))
 								TODO: Remove the special short syntax for application-specific record
 								label usage? Then perhaps 8x, 9x, Ax and Bx would work for Record,
 								Sequence, Set and Dictionary, leaving Cx, Dx, Ex and Fx entirely free.
 								TODO: Forbid empty chunks?
 								## Notes