---
no_site_title: true
title: "Preserves: Binary Syntax"
---
Tony Garnock-Jones <tonyg@leastfixedpoint.com>
{{ site.version_date }}. Version {{ site.version }}.
[LEB128]: https://en.wikipedia.org/wiki/LEB128
[argdata]: https://github.com/NuxiNL/argdata
[canonical]: canonical-binary.html
[google-varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
[vlq]: https://en.wikipedia.org/wiki/Variable-length_quantity
*Preserves* is a data model, with associated serialization formats. This
document defines one of those formats: a binary syntax for `Value`s from
the [Preserves data model](preserves.html) that is easy for computer
software to read and write. An [equivalent human-readable text
syntax](preserves-text.html) also exists.
## Machine-Oriented Binary Syntax
A `Repr` is a binary-syntax encoding, or representation, of a `Value`.
For a value `v`, we write `«v»` for the `Repr` of `v`.
### Type and Length representation.
Each `Repr` starts with a tag byte, describing the kind of information
represented.
However, inspired by [argdata][], a `Repr` does *not* describe its own
length. Instead, the surrounding context must supply the expected length
of the `Repr`.
As a consequence, `Repr`s for `Compound` values store the lengths of
their contained values. Each contained `Value` is represented as a
length in bytes followed by its own `Repr`. Implementations use each
stored length to decide when to stop reading the following `Repr`.
<a id="varint"></a> Each length is stored as an [argdata][]-compatible
big-endian base 128 *varint*.[^see-also-leb128] Each byte of a varint
stores seven bits of the length. All bytes have a clear upper bit,
except the final byte, which has the upper bit set. We write
`len(m)` for the varint-encoding of a non-negative integer `m`,
defined recursively as follows:

    len(m) = e(m, 128)
      where e(v, d) = [v + d]                           if v < 128
                      e(v / 128, 0) ++ [(v % 128) + d]  if v ≥ 128

[^see-also-leb128]: Argdata's length representation is very close to
[Variable-length quantity (VLQ)][VLQ] encoding, differing only in
the flipped interpretation of the high bit of each byte. It is
big-endian, unlike [LEB128][] encoding ([as used by
Google][google-varint] in protobufs).
We write `len(|r|)` for the varint-encoding of the length of `Repr` `r`.
The following table illustrates varint-encoding.
| Number, `m` | `m` in binary, grouped into 7-bit chunks | `len(m)` bytes |
|-------------|-------------------------------------------|-----------------|
| 15 | `0001111` | 143 |
| 300 | `0000010 0101100` | 2 172 |
| 1000000000 | `0000011 1011100 1101011 0010100 0000000` | 3 92 107 20 128 |
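
The recursive definition of `len(m)` can also be written as a loop. The following is a minimal Python sketch, not part of the specification; the names `encode_varint` and `decode_varint` are illustrative assumptions. The decoder accepts overlong encodings and leaves any limit on leading `0` bytes to the caller.

```python
def encode_varint(m: int) -> bytes:
    """Big-endian base-128 varint; only the final byte has its high bit set."""
    if m < 0:
        raise ValueError("lengths are non-negative")
    out = [(m % 128) + 128]      # final byte: low seven bits, high bit set
    m //= 128
    while m > 0:
        out.append(m % 128)      # earlier bytes: high bit clear
        m //= 128
    return bytes(reversed(out))


def decode_varint(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Return (value, next_pos), reading a varint from data starting at pos."""
    value = 0
    while True:
        byte = data[pos]         # raises IndexError if the input is truncated
        pos += 1
        value = value * 128 + (byte & 0x7F)
        if byte & 0x80:          # a set high bit marks the final byte
            return value, pos
```

Applied to the rows of the table, `encode_varint(300)` yields the two bytes `2 172`.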
There is no requirement that a varint-encoded `m` in a `Repr` be the unique shortest encoding
for that `m`.[^overlong-varint] However, implementations *SHOULD* use the shortest encoding
wherever possible when writing, and *SHOULD* reject excessively long encodings when reading
encoded values.[^excessively-long-varint]
[^overlong-varint]: **Implementation note.** The spec permits overlong length encodings to
reduce wasted activity in resource-constrained situations. If an implementation is in
anything other than a very low-level language, it is likely to be able to use
[IOList](./conventions.html#iolists)-style data structures to avoid unnecessary copying.
[^excessively-long-varint]: As a guideline, reject more than eight leading `0` bytes in a
varint.
### Records, Sequences, Sets and Dictionaries.

    «<L F_1...F_m>»       = [0xA7] ++ seq(«L», «F_1», ..., «F_m»)
    «[X_1...X_m]»         = [0xA8] ++ seq(«X_1», ..., «X_m»)
    «#{E_1...E_m}»        = [0xA9] ++ seq(«E_1», ..., «E_m»)
    «{K_1:V_1...K_m:V_m}» = [0xAA] ++ seq(«K_1», «V_1», ..., «K_m», «V_m»)

    seq(R_1, ..., R_m)    = len(|R_1|) ++ R_1 ++...++ len(|R_m|) ++ R_m

There is *no* ordering requirement on the `E_i` elements or
`K_i`/`V_i` pairs.[^no-sorting-rationale] They may appear in any
order. However, the `E_i` and `K_i` *MUST* be pairwise distinct. In
addition, implementations *SHOULD* default to writing set elements and
dictionary key/value pairs in order sorted lexicographically by their
`Repr`s[^not-sorted-semantically], and *MAY* offer the option of
serializing in some other implementation-defined order.
[^no-sorting-rationale]: In the BitTorrent encoding format,
[bencoding](http://www.bittorrent.org/beps/bep_0003.html#bencoding),
dictionary key/value pairs must be sorted by key. This is a
necessary step for ensuring serialization of `Value`s is
canonical. We do not require that key/value pairs (or set
elements) be in sorted order for serialized `Value`s; however, a
[canonical form][canonical] for `Repr`s does exist where a sorted
ordering is required.
[^not-sorted-semantically]: It's important to note that the sort
ordering for writing out set elements and dictionary key/value
pairs is *not* the same as the sort ordering implied by the
semantic ordering of those elements or keys. For example, the
`Repr` of a negative number very far from zero will start with a
byte that is *greater* than the byte which starts the `Repr` of
zero, making it sort lexicographically later by `Repr`, despite
being semantically *less than* zero.
**Rationale**. This is for ease-of-implementation reasons: not all
languages can easily represent sorted sets or sorted dictionaries,
but encoding and then sorting byte strings is much more likely to
be within easy reach.
No sentinel marks the end of a sequence of length-prefixed `Repr`s.
During decoding, use the length of the containing `Repr` to decide when
to stop expecting more contained `Repr`s.
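
As an illustration of `seq` and of decoding without a sentinel, here is a small Python sketch. It works on already-encoded `Repr`s as `bytes`; the function names are assumptions made for this example, and the set writer does not check that elements are pairwise distinct.

```python
def varint(m: int) -> bytes:
    """Varint-encode a non-negative length (final byte has its high bit set)."""
    out = [(m % 128) + 128]
    m //= 128
    while m > 0:
        out.append(m % 128)
        m //= 128
    return bytes(reversed(out))


def seq(reprs: list[bytes]) -> bytes:
    """seq(R_1, ..., R_m): each Repr preceded by its varint-encoded length."""
    return b"".join(varint(len(r)) + r for r in reprs)


def encode_sequence(reprs: list[bytes]) -> bytes:
    return b"\xA8" + seq(reprs)


def encode_set(reprs: list[bytes]) -> bytes:
    # Sort by Repr bytes, the default order suggested for writers.
    return b"\xA9" + seq(sorted(reprs))


def split_items(body: bytes) -> list[bytes]:
    """Split the bytes following a compound's tag into the contained Reprs."""
    items, pos = [], 0
    while pos < len(body):       # the container's length is the only stop signal
        length = 0
        while True:              # read one varint length
            byte = body[pos]
            pos += 1
            length = length * 128 + (byte & 0x7F)
            if byte & 0x80:
                break
        if pos + length > len(body):
            raise ValueError("contained Repr overruns its container")
        items.append(body[pos:pos + length])
        pos += length
    return items
```

Python compares `bytes` objects lexicographically by unsigned byte value, so `sorted` gives the `Repr` ordering directly.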
### SignedIntegers.

    «x» when x ∈ SignedInteger = [0xA3] ++ intbytes(x)

The function `intbytes(x)` gives the big-endian two's-complement binary
representation of `x`, taking exactly as many whole bytes as needed to
unambiguously identify the value and its sign. As a special case,
`intbytes(0)` is the empty byte sequence. The most-significant bit in
the first byte in `intbytes(x)` (for `x`≠0) is the sign
bit.[^zero-intbytes] Every `SignedInteger` *MUST* be represented with
its shortest possible encoding.
For example,

    «87112285931760246646623899502532662132736»
      = A3 01 00 00 00 00 00 00 00
           00 00 00 00 00 00 00 00
           00 00

    «-257» = A3 FE FF    «-3» = A3 FD      «128» = A3 00 80
    «-256» = A3 FF 00    «-2» = A3 FE      «255» = A3 00 FF
    «-255» = A3 FF 01    «-1» = A3 FF      «256» = A3 01 00
    «-254» = A3 FF 02     «0» = A3       «32767» = A3 7F FF
    «-129» = A3 FF 7F     «1» = A3 01    «32768» = A3 00 80 00
    «-128» = A3 80       «12» = A3 0C    «65535» = A3 00 FF FF
    «-127» = A3 81       «13» = A3 0D    «65536» = A3 01 00 00
      «-4» = A3 FC      «127» = A3 7F   «131072» = A3 02 00 00

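
One possible way to compute `intbytes(x)` in Python is sketched below. It relies on the built-in `int.bit_length` and `int.to_bytes`; the wrapper name `encode_signed_integer` is an assumption for this example.

```python
def intbytes(x: int) -> bytes:
    """Shortest big-endian two's-complement form of x; empty for zero."""
    if x == 0:
        return b""
    # Bits needed for the magnitude: x itself if positive, ~x (= -x - 1) if negative.
    magnitude_bits = (x if x > 0 else ~x).bit_length()
    n = (magnitude_bits + 1 + 7) // 8   # +1 for the sign bit, rounded up to whole bytes
    return x.to_bytes(n, "big", signed=True)


def encode_signed_integer(x: int) -> bytes:
    return b"\xA3" + intbytes(x)
```

For instance, `encode_signed_integer(128)` needs a leading `00` byte so that the sign bit stays clear, matching `A3 00 80` in the examples.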
[^zero-intbytes]: The value 0 needs zero bytes to identify the
value, so `intbytes(0)` is the empty byte string. Non-zero values
need at least one byte.
### Strings, ByteStrings and Symbols.
Syntax for these three types varies only in the tag used. For `String`
and `Symbol`, the data following the tag is a UTF-8 encoding of the
`Value`'s code points, while for `ByteString` it is the raw data
contained within the `Value` unmodified.

    «S» = [0xA4] ++ utf8(S)  if S ∈ String
          [0xA5] ++ S        if S ∈ ByteString
          [0xA6] ++ utf8(S)  if S ∈ Symbol

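
A brief Python sketch of the three encodings follows. Representing a `Symbol` by its name as a Python `str` is an assumption of this example, as are the function names.

```python
def encode_string(s: str) -> bytes:
    return b"\xA4" + s.encode("utf-8")


def encode_bytestring(data: bytes) -> bytes:
    return b"\xA5" + bytes(data)     # raw bytes, copied unmodified


def encode_symbol(name: str) -> bytes:
    return b"\xA6" + name.encode("utf-8")
```

None of these `Repr`s carries a length or terminator of its own; the enclosing context supplies the length.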
### Booleans.

    «#f» = [0xA0]
    «#t» = [0xA1]

### Floats and Doubles.

    «F» when F ∈ Float  = [0xA2] ++ binary32(F)
    «D» when D ∈ Double = [0xA2] ++ binary64(D)

The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
8-byte IEEE 754 binary representations of `F` and `D`, respectively.
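
Because both kinds of value share tag `0xA2`, a reader tells them apart by the length of the payload. A minimal Python sketch using the standard `struct` module is given here; the function names are illustrative assumptions.

```python
import struct


def encode_float(f: float) -> bytes:
    return b"\xA2" + struct.pack(">f", f)    # big-endian IEEE 754 binary32


def encode_double(d: float) -> bytes:
    return b"\xA2" + struct.pack(">d", d)    # big-endian IEEE 754 binary64


def decode_float_or_double(repr_: bytes) -> float:
    if repr_[:1] != b"\xA2":
        raise ValueError("not a Float or Double Repr")
    payload = repr_[1:]
    if len(payload) == 4:
        return struct.unpack(">f", payload)[0]
    if len(payload) == 8:
        return struct.unpack(">d", payload)[0]
    raise ValueError("payload must be 4 or 8 bytes long")
```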
### Embeddeds.
The `Repr` of an `Embedded` is the `Repr` of a `Value` chosen to
represent the denoted object, prefixed with `[0xBF]`.

    «#!V» = [0xBF] ++ «V»

### Annotations.
To annotate a `Repr` `r` with some sequence of `Value`s `[v_1, ...,
v_m]`, surround `r` as follows:

    [0xBE] ++ len(|r|) ++ r ++ len(|«v_1»|) ++ «v_1» ++...++ len(|«v_m»|) ++ «v_m»

The `Repr` `r` *MUST NOT* already have annotations; that is, it must not begin with `0xBE`.
For example, the `Repr` corresponding to textual syntax `@a @b []`, i.e.
an empty sequence annotated with two symbols, `a` and `b`, is

    «@a @b []»
      = [0xBE] ++ len(|«[]»|) ++ «[]» ++ len(|«a»|) ++ «a» ++ len(|«b»|) ++ «b»
      = [0xBE, 0x81, 0xA8, 0x82, 0xA6, 0x61, 0x82, 0xA6, 0x62]

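
The wrapping rule can be sketched in Python as follows, operating on already-encoded `Repr`s. The names `varint` and `annotate` are assumptions made for this example.

```python
def varint(m: int) -> bytes:
    """Varint-encode a non-negative length (final byte has its high bit set)."""
    out = [(m % 128) + 128]
    m //= 128
    while m > 0:
        out.append(m % 128)
        m //= 128
    return bytes(reversed(out))


def annotate(r: bytes, annotation_reprs: list[bytes]) -> bytes:
    """Wrap Repr r with the given annotation Reprs, in order."""
    if r[:1] == b"\xBE":
        raise ValueError("Repr is already annotated")
    out = b"\xBE" + varint(len(r)) + r
    for a in annotation_reprs:
        out += varint(len(a)) + a
    return out
```

Calling `annotate(b"\xA8", [b"\xA6a", b"\xA6b"])` reproduces the nine bytes of the worked example.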
## Security Considerations
**Annotations.** In modes where a `Value` is being read while
annotations are skipped, an endless sequence of annotations may give an
illusion of progress.
**Overlong varints.** The binary format allows (but discourages) overlong [varint](#varint)s.
Consider optional restrictions on the number of redundant leading `0` bytes accepted when
reading a varint.
**Canonical form for cryptographic hashing and signing.** No canonical
textual encoding of a `Value` is specified. A
[canonical form][canonical] exists for binary encoded `Value`s, and
implementations *SHOULD* produce canonical binary encodings by
default; however, an implementation *MAY* permit two serializations of
the same `Value` to yield different binary `Repr`s.
## Acknowledgements
The exclusion of lengths from `Repr`s, placing lengths instead ahead of
contained values in sequences, is inspired by [argdata][].
## Appendix. Autodetection of textual or binary syntax
Every tag byte in a binary Preserves `Repr` falls within the range
[`0x80`, `0xBF`]. These bytes, interpreted as UTF-8, are *continuation
bytes*, and will never occur as the first byte of a UTF-8 encoded code
point. This means no binary-encoded `Repr` can be misinterpreted as
valid UTF-8.
Conversely, a UTF-8 `Document` must start with a valid codepoint,
meaning in particular that it must not start with a byte in the range
[`0x80`, `0xBF`]. This means that no UTF-8 encoded textual-syntax
Preserves `Document` can be misinterpreted as a binary-syntax `Repr`.
Examination of the top two bits of the first byte of an encoded `Value`
gives its syntax: if the top two bits are `10`, it should be interpreted
as a binary-syntax `Repr`; otherwise, it should be interpreted as text.
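
The top-two-bits test amounts to a single mask and compare. A short Python sketch, with `detect_syntax` as a hypothetical name:

```python
def detect_syntax(first_byte: int) -> str:
    """Classify an encoded Value by the first byte of its encoding."""
    # 0xC0 selects the top two bits; 0x80 is the bit pattern 10xxxxxx.
    return "binary" if (first_byte & 0xC0) == 0x80 else "text"
```

Every byte from `0x80` to `0xBF` is classified as binary, while ASCII bytes and UTF-8 lead bytes such as `0xC3` are classified as text.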
**Streaming.** Autodetection is still possible when streaming an
undetermined number of `Value`s across, say, a TCP/IP connection:
- If the text syntax is to be used for the connection, simply start
writing each `Document` one after the other. Documents for `Atom`s
*MUST* be separated from their neighbours by whitespace; in general,
whitespace *SHOULD* be used to separate adjacent documents.
Specifically, whitespace separating adjacent documents *SHOULD* be
ASCII newline (10).
- If the binary syntax is to be used for the connection, start the
connection with byte `0xA8` (sequence). After the initial byte, send
each value `v` as `len(|«v»|) ++ «v»`. A side effect of this approach
is that the entire stream, when complete, is a valid `Sequence`
`Repr`.
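
The binary streaming convention might be sketched in Python as below. The writer functions are hypothetical names, and `out` stands for any object with a `write(bytes)` method, such as a socket file or an `io.BytesIO`.

```python
import io


def varint(m: int) -> bytes:
    """Varint-encode a non-negative length (final byte has its high bit set)."""
    out = [(m % 128) + 128]
    m //= 128
    while m > 0:
        out.append(m % 128)
        m //= 128
    return bytes(reversed(out))


def start_binary_stream(out) -> None:
    out.write(b"\xA8")                       # the stream opens like a Sequence


def send_value(out, repr_: bytes) -> None:
    out.write(varint(len(repr_)) + repr_)    # len(|«v»|) ++ «v»


buf = io.BytesIO()
start_binary_stream(buf)
send_value(buf, b"\xA1")         # «#t»
send_value(buf, b"\xA3\x01")     # «1»
stream_bytes = buf.getvalue()
```

Here `stream_bytes` is exactly the `Repr` of the sequence `[#t 1]`.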
## Appendix. Table of tag values

    (8x) RESERVED 80-8F
    (9x) RESERVED 90-9F

    A0 - False
    A1 - True
    A2 - Float or Double (length disambiguates)
    A3 - SignedIntegers (0 is encoded with no bytes at all)
    A4 - String (no trailing NUL is added)
    A5 - ByteString
    A6 - Symbol
    A7 - Record
    A8 - Sequence
    A9 - Set
    AA - Dictionary

    (Ax) RESERVED AB-AF
    (Bx) RESERVED B0-BD

    BE - Annotations. {BE Lval val Lann0 ann0 Lann1 ann1 ...}
    BF - Embedded

## Appendix. Binary SignedInteger representation
Languages that provide fixed-width machine word types may find the
following table useful in encoding and decoding binary `SignedInteger`
values.
| Integer range | Bytes required | Encoding (hex) |
| --- | --- | --- |
| 0 | 1 | `A3` |
| -2<sup>7</sup> ≤ n < 2<sup>7</sup> (i8) | 2 | `A3` `XX` |
| -2<sup>15</sup> ≤ n < 2<sup>15</sup> (i16) | 3 | `A3` `XX` `XX` |
| -2<sup>23</sup> ≤ n < 2<sup>23</sup> (i24) | 4 | `A3` `XX` `XX` `XX` |
| -2<sup>31</sup> ≤ n < 2<sup>31</sup> (i32) | 5 | `A3` `XX` `XX` `XX` `XX` |
| -2<sup>39</sup> ≤ n < 2<sup>39</sup> (i40) | 6 | `A3` `XX` `XX` `XX` `XX` `XX` |
| -2<sup>47</sup> ≤ n < 2<sup>47</sup> (i48) | 7 | `A3` `XX` `XX` `XX` `XX` `XX` `XX` |
| -2<sup>55</sup> ≤ n < 2<sup>55</sup> (i56) | 8 | `A3` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
| -2<sup>63</sup> ≤ n < 2<sup>63</sup> (i64) | 9 | `A3` `XX` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
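
For the reading direction, a decoder can sign-extend the payload and, if desired, refuse encodings that are not the shortest. The Python sketch below uses `int.from_bytes`; the function name and the `strict` flag are assumptions of this example, and rejecting non-shortest input is one possible policy rather than something the table prescribes.

```python
def decode_signed_integer(repr_: bytes, strict: bool = True) -> int:
    if repr_[:1] != b"\xA3":
        raise ValueError("not a SignedInteger Repr")
    payload = repr_[1:]
    if strict and payload:
        first = payload[0]
        second_high = len(payload) > 1 and payload[1] >= 0x80
        # A leading 00 is needed only when the next byte has its high bit set;
        # a leading FF is needed only when the next byte has its high bit clear.
        redundant = (first == 0x00 and not second_high) or (
            first == 0xFF and second_high)
        if redundant:
            raise ValueError("SignedInteger is not in its shortest form")
    # An empty payload decodes to 0.
    return int.from_bytes(payload, "big", signed=True)
```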
<!-- Heading to visually offset the footnotes from the main document: -->
## Notes