--- no_site_title: true title: "Preserves: Binary Syntax" --- Tony Garnock-Jones {{ site.version_date }}. Version {{ site.version }}. [varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints [LEB128]: https://en.wikipedia.org/wiki/LEB128 [canonical]: canonical-binary.html *Preserves* is a data model, with associated serialization formats. This document defines one of those formats: a binary syntax for `Value`s from the [Preserves data model](preserves.html) that is easy for computer software to read and write. An [equivalent human-readable text syntax](preserves-text.html) also exists. ## Machine-Oriented Binary Syntax A `Repr` is a binary-syntax encoding, or representation, of a `Value`. For a value `v`, we write `«v»` for the `Repr` of v. ### Type and Length representation. Each `Repr` starts with a tag byte, describing the kind of information represented. Depending on the tag, a length indicator, further encoded information, and/or an ending tag may follow. tag (simple atomic data) tag ++ length ++ binarydata (floats, doubles, integers, strings, symbols, and binary) tag ++ repr ++ ... ++ endtag (compound data) The unique end tag is byte value `0x84`. If present after a tag, the length of a following piece of binary data is formatted as a [base 128 varint][varint].[^see-also-leb128] We write `varint(m)` for the varint-encoding of `m`. Quoting the [Google Protocol Buffers][varint] definition, [^see-also-leb128]: Also known as [LEB128][] encoding, for unsigned integers. Varints and LEB128-encoded integers differ only for negative numbers, which cannot appear as length indicators and are thus not used in Preserves. > Each byte in a varint, except the last byte, has the most > significant bit (msb) set – this indicates that there are further > bytes to come. The lower 7 bits of each byte are used to store the > two's complement representation of the number in groups of 7 bits, > least significant group first. For example, `varint(15)` is `[0x0F]`, and `varint(1000000000)` is `[0x80, 0x94, 0xeb, 0xdc, 0x03]`. It is an error for a varint-encoded `m` in a `Repr` to be anything other than the unique shortest encoding for that `m`. That is, a varint-encoding of `m` *MUST NOT* end in `0` unless `m`=0. ### Records, Sequences, Sets and Dictionaries. «» = [0xB4] ++ «L» ++ «F_1» ++...++ «F_m» ++ [0x84] «[X_1...X_m]» = [0xB5] ++ «X_1» ++...++ «X_m» ++ [0x84] «#{E_1...E_m}» = [0xB6] ++ «E_1» ++...++ «E_m» ++ [0x84] «{K_1:V_1...K_m:V_m}» = [0xB7] ++ «K_1» ++ «V_1» ++...++ «K_m» ++ «V_m» ++ [0x84] There is *no* ordering requirement on the `E_i` elements or `K_i`/`V_i` pairs.[^no-sorting-rationale] They may appear in any order. However, the `E_i` and `K_i` *MUST* be pairwise distinct. In addition, implementations *SHOULD* default to writing set elements and dictionary key/value pairs in order sorted lexicographically by their `Repr`s[^not-sorted-semantically], and *MAY* offer the option of serializing in some other implementation-defined order. [^no-sorting-rationale]: In the BitTorrent encoding format, [bencoding](http://www.bittorrent.org/beps/bep_0003.html#bencoding), dictionary key/value pairs must be sorted by key. This is a necessary step for ensuring serialization of `Value`s is canonical. We encourage, but do not require that key/value pairs (or set elements) be in sorted order for serialized `Value`s; however, a [canonical form][canonical] for `Repr`s does exist where a sorted ordering is required. [^not-sorted-semantically]: It's important to note that the sort ordering for writing out set elements and dictionary key/value pairs is *not* the same as the sort ordering implied by the semantic ordering of those elements or keys. For example, the `Repr` of a negative number very far from zero will start with byte that is *greater* than the byte which starts the `Repr` of zero, making it sort lexicographically later by `Repr`, despite being semantically *less than* zero. **Rationale**. This is for ease-of-implementation reasons: not all languages can easily represent sorted sets or sorted dictionaries, but encoding and then sorting byte strings is much more likely to be within easy reach. ### SignedIntegers. «x» = [0xB0] ++ varint(|intbytes(x)|) ++ intbytes(x) if x ∈ SignedInteger The function `intbytes(x)` gives the big-endian two's-complement binary representation of `x`, taking exactly as many whole bytes as needed to unambiguously identify the value and its sign. The value 0 needs zero bytes to identify the value; non-zero values need at least one byte, and the most-significant bit in the first byte is the sign bit. See the [examples in the appendix below](#signedinteger-examples). ### Strings, ByteStrings and Symbols. «S» = [0xB1] ++ varint(|utf8(S)|) ++ utf8(S) if S ∈ String [0xB2] ++ varint(|S|) ++ S if S ∈ ByteString [0xB3] ++ varint(|utf8(S)|) ++ utf8(S) if S ∈ Symbol Syntax for these three types varies only in the tag used. For `String` and `Symbol`, the data following the tag is a UTF-8 encoding of the `Value`, while for `ByteString` it is the raw data contained within the `Value` unmodified. ### Booleans. «#f» = [0x80] «#t» = [0x81] ### Floats and Doubles. «F» = [0x87, 0x04] ++ binary32(F) if F ∈ Float «D» = [0x87, 0x08] ++ binary64(D) if D ∈ Double The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and 8-byte IEEE 754 binary representations of `F` and `D`, respectively. ### Embeddeds. «#!V» = [0x86] ++ «V» The `Repr` of an `Embedded` is the `Repr` of a `Value` chosen to represent the denoted object, prefixed with `[0x86]`. ### Annotations. «@W V» = [0x85] ++ «W» «V» Each annotation `W` precedes the `Value` `V` it annotates; both `W` and `V` *MAY* themselves be further annotated. Implementations *SHOULD* default to omitting annotations from `Repr`s. See [examples in the appendix](#annotation-examples). ## Security Considerations **Annotations.** In modes where a `Value` is being read while annotations are skipped, an endless series of annotations may give an illusion of progress. **Canonical form for cryptographic hashing and signing.** No canonical *textual* encoding of a `Value` is specified. However, a [canonical form][canonical] exists for binary encoded `Value`s, and implementations *SHOULD* produce canonical binary encodings by default; however, an implementation *MAY* permit two serializations of the same `Value` to yield different binary `Repr`s. ## Appendix. Autodetection of textual or binary syntax Every tag byte in a binary Preserves `Document` falls within the range [`0x80`, `0xBF`]. These bytes, interpreted as UTF-8, are *continuation bytes*, and will never occur as the first byte of a UTF-8 encoding. This means no binary-encoded document can be misinterpreted as valid UTF-8. Conversely, a UTF-8 document must start with a valid scalar value, meaning in particular that it must not start with a byte in the range [`0x80`, `0xBF`]. This means that no UTF-8 encoded textual-syntax Preserves document can be misinterpreted as a binary-syntax document. Examination of the top two bits of the first byte of a document gives its syntax: if the top two bits are `10`, it should be interpreted as a binary-syntax document; otherwise, it should be interpreted as text. ## Appendix. Table of tag values 80 - False 81 - True 84 - End marker 85 - Annotation 86 - Embedded 87 - Float and Double B0 - Integer B1 - String B2 - ByteString B3 - Symbol B4 - Record B5 - Sequence B6 - Set B7 - Dictionary All tags fall in the range [`0x80`, `0xBF`]. Tag values `82`, `83`, `88`...`AF`, and `B8`...`BF` are **reserved**. ## Appendix. Binary SignedInteger representation Languages that provide fixed-width machine word types may find the following table useful in encoding and decoding binary `SignedInteger` values. | Integer range | Bytes required | Encoding (hex) | | --- | --- | --- | | 0 | 2 | `B0` `00` | | -27 ≤ n < 27 (i8) | 3 | `B0` `01` `XX` | | -215 ≤ n < 215 (i16) | 4 | `B0` `02` `XX` `XX` | | -223 ≤ n < 223 (i24) | 5 | `B0` `03` `XX` `XX` `XX` | | -231 ≤ n < 231 (i32) | 6 | `B0` `04` `XX` `XX` `XX` `XX` | | -239 ≤ n < 239 (i40) | 7 | `B0` `05` `XX` `XX` `XX` `XX` `XX` | | -247 ≤ n < 247 (i48) | 8 | `B0` `06` `XX` `XX` `XX` `XX` `XX` `XX` | | -255 ≤ n < 255 (i56) | 9 | `B0` `07` `XX` `XX` `XX` `XX` `XX` `XX` `XX` | | -263 ≤ n < 263 (i64) | 10 | `B0` `08` `XX` `XX` `XX` `XX` `XX` `XX` `XX` `XX` | ## Appendix. Examples ### SignedInteger examples «-257» = B0 02 FE FF «-2» = B0 01 FE «255» = B0 02 00 FF «-256» = B0 02 FF 00 «-1» = B0 01 FF «256» = B0 02 01 00 «-255» = B0 02 FF 01 «0» = B0 00 «32767» = B0 02 7F FF «-129» = B0 02 FF 7F «1» = B0 01 01 «32768» = B0 03 00 80 00 «-128» = B0 01 80 «127» = B0 01 7F «65535» = B0 03 00 FF FF «-127» = B0 01 81 «128» = B0 02 00 80 «65536» = B0 03 01 00 00 «87112285931760246646623899502532662132736» = B0 12 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ### Annotation examples The `Repr` corresponding to textual syntax `@a@b[]`, i.e. an empty sequence annotated with two symbols, `a` and `b`, is «@a @b []» = [0x85] ++ «a» ++ [0x85] ++ «b» ++ «[]» = [0x85, 0xB3, 0x01, 0x61, 0x85, 0xB3, 0x01, 0x62, 0xB5, 0x84] Annotations may themselves be annotated. Here, `c` is annotated with `b`, which itself is annotated with `a`: «@ @a b c» = [0x85] ++ [0x85] ++ «a» ++ «b» ++ «c»> ## Notes