preserves/preserves-binary.md

---
no_site_title: true
title: "Preserves: Binary Syntax"
---

Tony Garnock-Jones <tonyg@leastfixedpoint.com>
{{ site.version_date }}. Version {{ site.version }}.

  [varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
  [LEB128]: https://en.wikipedia.org/wiki/LEB128
  [canonical]: canonical-binary.html

*Preserves* is a data model, with associated serialization formats. This
document defines one of those formats: a binary syntax for `Value`s from
the [Preserves data model](preserves.html) that is easy for computer
software to read and write. An [equivalent human-readable text
syntax](preserves-text.html) also exists.

## Machine-Oriented Binary Syntax

A `Repr` is a binary-syntax encoding, or representation, of a `Value`.
For a value `v`, we write `«v»` for the `Repr` of v.

### Type and Length representation.

Each `Repr` starts with a tag byte, describing the kind of information
represented. Depending on the tag, a length indicator, further encoded
information, and/or an ending tag may follow.

    tag                          (simple atomic data)
    tag ++ length ++ binarydata  (floats, doubles, integers, strings, symbols, and binary)
    tag ++ repr ++ ... ++ endtag (compound data)

The unique end tag is byte value `0x84`.

If present after a tag, the length of a following piece of binary data
is formatted as a [base 128 varint][varint].[^see-also-leb128] We
write `varint(m)` for the varint-encoding of `m`. Quoting the
[Google Protocol Buffers][varint] definition,

  [^see-also-leb128]: Also known as [LEB128][] encoding, for unsigned
    integers. Varints and LEB128-encoded integers differ only for
    negative numbers, which cannot appear as length indicators and are
    thus not used in Preserves.

> Each byte in a varint, except the last byte, has the most
> significant bit (msb) set – this indicates that there are further
> bytes to come. The lower 7 bits of each byte are used to store the
> two's complement representation of the number in groups of 7 bits,
> least significant group first.

For example, `varint(15)` is `[0x0F]`, and `varint(1000000000)` is `[0x80, 0x94, 0xeb, 0xdc,
0x03]`.

It is an error for a varint-encoded `m` in a `Repr` to be anything
other than the unique shortest encoding for that `m`. That is, a
varint-encoding of `m` *MUST NOT* end in `0` unless `m`=0.

### Records, Sequences, Sets and Dictionaries.

          «<L F_1...F_m>» = [0xB4] ++ «L» ++ «F_1» ++...++ «F_m» ++ [0x84]
            «[X_1...X_m]» = [0xB5] ++ «X_1» ++...++ «X_m» ++ [0x84]
           «#{E_1...E_m}» = [0xB6] ++ «E_1» ++...++ «E_m» ++ [0x84]
    «{K_1:V_1...K_m:V_m}» = [0xB7] ++ «K_1» ++ «V_1» ++...++ «K_m» ++ «V_m» ++ [0x84]

There is *no* ordering requirement on the `E_i` elements or
`K_i`/`V_i` pairs.[^no-sorting-rationale] They may appear in any
order. However, the `E_i` and `K_i` *MUST* be pairwise distinct. In
addition, implementations *SHOULD* default to writing set elements and
dictionary key/value pairs in order sorted lexicographically by their
`Repr`s[^not-sorted-semantically], and *MAY* offer the option of
serializing in some other implementation-defined order.

  [^no-sorting-rationale]: In the BitTorrent encoding format,
    [bencoding](http://www.bittorrent.org/beps/bep_0003.html#bencoding),
    dictionary key/value pairs must be sorted by key. This is a
    necessary step for ensuring serialization of `Value`s is
    canonical. We encourage, but do not require that key/value pairs (or set
    elements) be in sorted order for serialized `Value`s; however, a
    [canonical form][canonical] for `Repr`s does exist where a sorted
    ordering is required.

  [^not-sorted-semantically]: It's important to note that the sort
    ordering for writing out set elements and dictionary key/value pairs
    is *not* the same as the sort ordering implied by the semantic
    ordering of those elements or keys. For example, the `Repr` of a
    negative number sorts lexicographically *after* the `Repr` of zero,
    despite being semantically *less than* zero.

    **Rationale**. This is for ease-of-implementation reasons: not all
    languages can easily represent sorted sets or sorted dictionaries,
    but encoding and then sorting byte strings is much more likely to
    be within easy reach.

### SignedIntegers, Strings, ByteStrings and Symbols.

    «S» = [0xB0] ++ varint(|intbytes(S)|) ++ intbytes(S)  if S ∈ SignedInteger
          [0xB1] ++ varint(|utf8(S)|) ++ utf8(S)          if S ∈ String
          [0xB2] ++ varint(|S|) ++ S                      if S ∈ ByteString
          [0xB3] ++ varint(|utf8(S)|) ++ utf8(S)          if S ∈ Symbol

For `String` and `Symbol`, the data following the tag and length is a
UTF-8 encoding of the `Value`. For `ByteString`, it is the raw data
contained within the `Value` unmodified. For `SignedInteger`, it is
the big-endian two's-complement binary representation of the number,
taking exactly as many whole bytes as needed to unambiguously identify
the value and its sign. `intbytes(0)` is special-cased to be the empty
byte sequence;[^empty-intbytes-sequence] otherwise, the
most-significant bit of the first byte is the sign bit. (See
[SignedInteger examples](#signedinteger-examples) in the appendix
below.)

  [^empty-intbytes-sequence]: Without the special case of
    `intbytes(0)` yielding the empty byte sequence, no input to
    `intbytes` would result in an empty sequence. In principle, either
    `0` or `-1` could be special-cased to the empty sequence; here, we
    arbitrarily choose `0`.

### Booleans.

    «#f» = [0x80]
    «#t» = [0x81]

### Floats and Doubles.

    «F» = [0x87, 0x04] ++ binary32(F)  if F ∈ Float
    «D» = [0x87, 0x08] ++ binary64(D)  if D ∈ Double

The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
8-byte IEEE 754 binary representations of `F` and `D`, respectively.

### Embeddeds.

    «#!V» = [0x86] ++ «V»

The `Repr` of an `Embedded` is the `Repr` of a `Value` chosen to
represent the denoted object, prefixed with `[0x86]`.

### Annotations.

    «@W V» = [0x85] ++ «W» «V»

Each annotation `W` precedes the `Value` `V` it
annotates.[^annotation-syntax-rationale] Both `W` and `V` *MAY*
themselves be further annotated. Implementations *SHOULD* default to
omitting annotations from `Repr`s. See [examples in the
appendix](#annotation-examples).

  [^annotation-syntax-rationale]: **Rationale.** The syntax for
    annotated values is unlike any other compound syntax defined here.
    It allows parsers to *statelessly* skip annotations: while `0x85` is
    the next input byte, skip it, skip a single `Repr`, and then try
    again; otherwise, parse the input stream as usual.

    By contrast, one might imagine using a parenthesized syntax like the syntax of
    compound data. In that case, parsers would have to remember
    information about nesting of annotations. This would not necessarily
    be an issue for recursive ("DOM"-like) parsers, but streaming
    ("SAX"-like) parsers would have to either take on complexity
    internally or force it on their clients, even when those clients did
    not care about annotations.

## Security Considerations

**Annotations.** In modes where a `Value` is being read while
annotations are skipped, an endless series of annotations may give an
illusion of progress.

**Canonical form for cryptographic hashing and signing.** No canonical
*textual* encoding of a `Value` is specified. However, a [canonical
form][canonical] exists for binary encoded `Value`s, and
implementations *SHOULD* produce canonical binary encodings by
default; however, an implementation *MAY* permit two serializations of
the same `Value` to yield different binary `Repr`s.

## Appendix. Autodetection of textual or binary syntax

Every tag byte in a binary Preserves `Document` falls within the range
[`0x80`, `0xBF`]. These bytes, interpreted as UTF-8, are *continuation
bytes*, and will never occur as the first byte of a UTF-8 encoding. This
means no binary-encoded document can be misinterpreted as valid UTF-8.

Conversely, a UTF-8 document must start with a valid scalar value,
meaning in particular that it must not start with a byte in the range
[`0x80`, `0xBF`]. This means that no UTF-8 encoded textual-syntax
Preserves document can be misinterpreted as a binary-syntax document.

Examination of the top two bits of the first byte of a document gives
its syntax: if the top two bits are `10`, it should be interpreted as
a binary-syntax document; otherwise, it should be interpreted as text.

## Appendix. Table of tag values

     80 - False
     81 - True
     84 - End marker
     85 - Annotation
     86 - Embedded
     87 - Float and Double

     B0 - Integer
     B1 - String
     B2 - ByteString
     B3 - Symbol
     B4 - Record
     B5 - Sequence
     B6 - Set
     B7 - Dictionary

All tags fall in the range [`0x80`, `0xBF`].

Tag values `82`, `83`, `88`...`AF`, and `B8`...`BF` are **reserved**.

## Appendix. Binary SignedInteger representation

Languages that provide fixed-width machine word types may find the
following table useful in encoding and decoding binary `SignedInteger`
values.

| Integer range                              | Bytes required | Encoding (hex)                                    |
| ---                                        | ---            | ---                                               |
| 0                                          | 2              | `B0` `00`                                         |
| -2<sup>7</sup> ≤ n < 2<sup>7</sup> (i8)    | 3              | `B0` `01` `XX`                                    |
| -2<sup>15</sup> ≤ n < 2<sup>15</sup> (i16) | 4              | `B0` `02` `XX` `XX`                               |
| -2<sup>23</sup> ≤ n < 2<sup>23</sup> (i24) | 5              | `B0` `03` `XX` `XX` `XX`                          |
| -2<sup>31</sup> ≤ n < 2<sup>31</sup> (i32) | 6              | `B0` `04` `XX` `XX` `XX` `XX`                     |
| -2<sup>39</sup> ≤ n < 2<sup>39</sup> (i40) | 7              | `B0` `05` `XX` `XX` `XX` `XX` `XX`                |
| -2<sup>47</sup> ≤ n < 2<sup>47</sup> (i48) | 8              | `B0` `06` `XX` `XX` `XX` `XX` `XX` `XX`           |
| -2<sup>55</sup> ≤ n < 2<sup>55</sup> (i56) | 9              | `B0` `07` `XX` `XX` `XX` `XX` `XX` `XX` `XX`      |
| -2<sup>63</sup> ≤ n < 2<sup>63</sup> (i64) | 10             | `B0` `08` `XX` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |

## Appendix. Examples

### <a id="signedinteger-examples"></a>SignedInteger examples

      «-257» = B0 02 FE FF     «-2» = B0 01 FE       «255» = B0 02 00 FF
      «-256» = B0 02 FF 00     «-1» = B0 01 FF       «256» = B0 02 01 00
      «-255» = B0 02 FF 01      «0» = B0 00        «32767» = B0 02 7F FF
      «-129» = B0 02 FF 7F      «1» = B0 01 01     «32768» = B0 03 00 80 00
      «-128» = B0 01 80       «127» = B0 01 7F     «65535» = B0 03 00 FF FF
      «-127» = B0 01 81       «128» = B0 02 00 80  «65536» = B0 03 01 00 00

      «87112285931760246646623899502532662132736»
        = B0 12 01 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00
                00 00

### <a id="annotation-examples"></a>Annotation examples

The `Repr` corresponding to textual syntax `@a@b[]`, i.e. an empty
sequence annotated with two symbols, `a` and `b`, is

    «@a @b []» = [0x85] ++ «a» ++ [0x85] ++ «b» ++ «[]»
               = [0x85, 0xB3, 0x01, 0x61, 0x85, 0xB3, 0x01, 0x62, 0xB5, 0x84]

Annotations may themselves be annotated. Here, `c` is annotated with
`b`, which itself is annotated with `a`:

    «@ @a b c» = [0x85] ++ [0x85] ++ «a» ++ «b» ++ «c»>

<!-- Heading to visually offset the footnotes from the main document: -->
## Notes