---
no_site_title: true
title: "Preserves: Binary Syntax"
---

Tony Garnock-Jones <tonyg@leastfixedpoint.com>  
{{ site.version_date }}. Version {{ site.version }}.

  [varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
  [LEB128]: https://en.wikipedia.org/wiki/LEB128
  [canonical]: canonical-binary.html

*Preserves* is a data model, with associated serialization formats. This
document defines one of those formats: a binary syntax for `Value`s from
the [Preserves data model](preserves.html) that is easy for computer
software to read and write. An [equivalent human-readable text
syntax](preserves-text.html) also exists.

## Machine-Oriented Binary Syntax

A `Repr` is a binary-syntax encoding, or representation, of a `Value`.
For a value `v`, we write `«v»` for the `Repr` of v.

### Type and Length representation.

Each `Repr` starts with a tag byte, describing the kind of information
represented. Depending on the tag, a length indicator, further encoded
information, and/or an ending tag may follow.

    tag                          (simple atomic data)
    tag ++ length ++ binarydata  (floats, doubles, integers, strings, symbols, and binary)
    tag ++ repr ++ ... ++ endtag (compound data)

The unique end tag is byte value `0x84`.

If present after a tag, the length of a following piece of binary data
is formatted as a [base 128 varint][varint].[^see-also-leb128] We
write `varint(m)` for the varint-encoding of `m`. Quoting the
[Google Protocol Buffers][varint] definition,

  [^see-also-leb128]: Also known as [LEB128][] encoding, for unsigned
    integers. Varints and LEB128-encoded integers differ only for
    negative numbers, which cannot appear as length indicators and are
    thus not used in Preserves.

> Each byte in a varint, except the last byte, has the most
> significant bit (msb) set – this indicates that there are further
> bytes to come. The lower 7 bits of each byte are used to store the
> two's complement representation of the number in groups of 7 bits,
> least significant group first.

For example, `varint(15)` is `[0x0F]`, and `varint(1000000000)` is `[0x80, 0x94, 0xeb, 0xdc,
0x03]`.

It is an error for a varint-encoded `m` in a `Repr` to be anything
other than the unique shortest encoding for that `m`. That is, a
varint-encoding of `m` *MUST NOT* end in `0` unless `m`=0.

### Records, Sequences, Sets and Dictionaries.

          «<L F_1...F_m>» = [0xB4] ++ «L» ++ «F_1» ++...++ «F_m» ++ [0x84]
            «[X_1...X_m]» = [0xB5] ++ «X_1» ++...++ «X_m» ++ [0x84]
           «#{E_1...E_m}» = [0xB6] ++ «E_1» ++...++ «E_m» ++ [0x84]
    «{K_1:V_1...K_m:V_m}» = [0xB7] ++ «K_1» ++ «V_1» ++...++ «K_m» ++ «V_m» ++ [0x84]

There is *no* ordering requirement on the `E_i` elements or
`K_i`/`V_i` pairs.[^no-sorting-rationale] They may appear in any
order. However, the `E_i` and `K_i` *MUST* be pairwise distinct. In
addition, implementations *SHOULD* default to writing set elements and
dictionary key/value pairs in order sorted lexicographically by their
`Repr`s[^not-sorted-semantically], and *MAY* offer the option of
serializing in some other implementation-defined order.

  [^no-sorting-rationale]: In the BitTorrent encoding format,
    [bencoding](http://www.bittorrent.org/beps/bep_0003.html#bencoding),
    dictionary key/value pairs must be sorted by key. This is a
    necessary step for ensuring serialization of `Value`s is
    canonical. We encourage, but do not require that key/value pairs (or set
    elements) be in sorted order for serialized `Value`s; however, a
    [canonical form][canonical] for `Repr`s does exist where a sorted
    ordering is required.

  [^not-sorted-semantically]: It's important to note that the sort
    ordering for writing out set elements and dictionary key/value
    pairs is *not* the same as the sort ordering implied by the
    semantic ordering of those elements or keys. For example, the
    `Repr` of a negative number very far from zero will start with
    byte that is *greater* than the byte which starts the `Repr` of
    zero, making it sort lexicographically later by `Repr`, despite
    being semantically *less than* zero.

    **Rationale**. This is for ease-of-implementation reasons: not all
    languages can easily represent sorted sets or sorted dictionaries,
    but encoding and then sorting byte strings is much more likely to
    be within easy reach.

### SignedIntegers.

    «x» = [0xB0] ++ varint(|intbytes(x)|) ++ intbytes(x)  if x ∈ SignedInteger

The function `intbytes(x)` gives the big-endian two's-complement
binary representation of `x`, taking exactly as many whole bytes as
needed to unambiguously identify the value and its sign. The value 0
needs zero bytes to identify the value; non-zero values need at least
one byte, and the most-significant bit in the first byte is the sign
bit. See the [examples in the appendix below](#signedinteger-examples).

### Strings, ByteStrings and Symbols.

    «S» = [0xB1] ++ varint(|utf8(S)|) ++ utf8(S)  if S ∈ String
          [0xB2] ++ varint(|S|) ++ S              if S ∈ ByteString
          [0xB3] ++ varint(|utf8(S)|) ++ utf8(S)  if S ∈ Symbol

Syntax for these three types varies only in the tag used. For `String`
and `Symbol`, the data following the tag is a UTF-8 encoding of the
`Value`, while for `ByteString` it is the raw data contained within the
`Value` unmodified.

### Booleans.

    «#f» = [0x80]
    «#t» = [0x81]

### Floats and Doubles.

    «F» = [0x87, 0x04] ++ binary32(F)  if F ∈ Float
    «D» = [0x87, 0x08] ++ binary64(D)  if D ∈ Double

The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
8-byte IEEE 754 binary representations of `F` and `D`, respectively.

### Embeddeds.

    «#!V» = [0x86] ++ «V»

The `Repr` of an `Embedded` is the `Repr` of a `Value` chosen to
represent the denoted object, prefixed with `[0x86]`.

### Annotations.

    «@W V» = [0x85] ++ «W» «V»

Each annotation `W` precedes the `Value` `V` it annotates; both `W` and
`V` *MAY* themselves be further annotated. Implementations *SHOULD*
default to omitting annotations from `Repr`s. See [examples in the
appendix](#annotation-examples).

## Security Considerations

**Annotations.** In modes where a `Value` is being read while
annotations are skipped, an endless series of annotations may give an
illusion of progress.

**Canonical form for cryptographic hashing and signing.** No canonical
*textual* encoding of a `Value` is specified. However, a [canonical
form][canonical] exists for binary encoded `Value`s, and
implementations *SHOULD* produce canonical binary encodings by
default; however, an implementation *MAY* permit two serializations of
the same `Value` to yield different binary `Repr`s.

## Appendix. Autodetection of textual or binary syntax

Every tag byte in a binary Preserves `Document` falls within the range
[`0x80`, `0xBF`]. These bytes, interpreted as UTF-8, are *continuation
bytes*, and will never occur as the first byte of a UTF-8 encoding. This
means no binary-encoded document can be misinterpreted as valid UTF-8.

Conversely, a UTF-8 document must start with a valid scalar value,
meaning in particular that it must not start with a byte in the range
[`0x80`, `0xBF`]. This means that no UTF-8 encoded textual-syntax
Preserves document can be misinterpreted as a binary-syntax document.

Examination of the top two bits of the first byte of a document gives
its syntax: if the top two bits are `10`, it should be interpreted as
a binary-syntax document; otherwise, it should be interpreted as text.

## Appendix. Table of tag values

     80 - False
     81 - True
     84 - End marker
     85 - Annotation
     86 - Embedded
     87 - Float and Double

     B0 - Integer
     B1 - String
     B2 - ByteString
     B3 - Symbol
     B4 - Record
     B5 - Sequence
     B6 - Set
     B7 - Dictionary

All tags fall in the range [`0x80`, `0xBF`].

Tag values `82`, `83`, `88`...`AF`, and `B8`...`BF` are **reserved**.

## Appendix. Binary SignedInteger representation

Languages that provide fixed-width machine word types may find the
following table useful in encoding and decoding binary `SignedInteger`
values.

| Integer range                              | Bytes required | Encoding (hex)                                    |
| ---                                        | ---            | ---                                               |
| 0                                          | 2              | `B0` `00`                                         |
| -2<sup>7</sup> ≤ n < 2<sup>7</sup> (i8)    | 3              | `B0` `01` `XX`                                    |
| -2<sup>15</sup> ≤ n < 2<sup>15</sup> (i16) | 4              | `B0` `02` `XX` `XX`                               |
| -2<sup>23</sup> ≤ n < 2<sup>23</sup> (i24) | 5              | `B0` `03` `XX` `XX` `XX`                          |
| -2<sup>31</sup> ≤ n < 2<sup>31</sup> (i32) | 6              | `B0` `04` `XX` `XX` `XX` `XX`                     |
| -2<sup>39</sup> ≤ n < 2<sup>39</sup> (i40) | 7              | `B0` `05` `XX` `XX` `XX` `XX` `XX`                |
| -2<sup>47</sup> ≤ n < 2<sup>47</sup> (i48) | 8              | `B0` `06` `XX` `XX` `XX` `XX` `XX` `XX`           |
| -2<sup>55</sup> ≤ n < 2<sup>55</sup> (i56) | 9              | `B0` `07` `XX` `XX` `XX` `XX` `XX` `XX` `XX`      |
| -2<sup>63</sup> ≤ n < 2<sup>63</sup> (i64) | 10             | `B0` `08` `XX` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |

## Appendix. Examples

### <a id="signedinteger-examples"></a>SignedInteger examples

      «-257» = B0 02 FE FF     «-2» = B0 01 FE       «255» = B0 02 00 FF
      «-256» = B0 02 FF 00     «-1» = B0 01 FF       «256» = B0 02 01 00
      «-255» = B0 02 FF 01      «0» = B0 00        «32767» = B0 02 7F FF
      «-129» = B0 02 FF 7F      «1» = B0 01 01     «32768» = B0 03 00 80 00
      «-128» = B0 01 80       «127» = B0 01 7F     «65535» = B0 03 00 FF FF
      «-127» = B0 01 81       «128» = B0 02 00 80  «65536» = B0 03 01 00 00

      «87112285931760246646623899502532662132736»
        = B0 12 01 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00
                00 00

### <a id="annotation-examples"></a>Annotation examples

The `Repr` corresponding to textual syntax `@a@b[]`, i.e. an empty
sequence annotated with two symbols, `a` and `b`, is

    «@a @b []» = [0x85] ++ «a» ++ [0x85] ++ «b» ++ «[]»
               = [0x85, 0xB3, 0x01, 0x61, 0x85, 0xB3, 0x01, 0x62, 0xB5, 0x84]

Annotations may themselves be annotated. Here, `c` is annotated with
`b`, which itself is annotated with `a`:

    «@ @a b c» = [0x85] ++ [0x85] ++ «a» ++ «b» ++ «c»>

<!-- Heading to visually offset the footnotes from the main document: -->
## Notes