2022-06-18 17:11:08 +00:00
---
no_site_title: true
title: "Preserves: Binary Syntax"
---

Tony Garnock-Jones <tonyg@leastfixedpoint.com>
{{ site.version_date }}. Version {{ site.version }}.

[LEB128]: https://en.wikipedia.org/wiki/LEB128
[argdata]: https://github.com/NuxiNL/argdata
[canonical]: canonical-binary.html
[google-varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
[vlq]: https://en.wikipedia.org/wiki/Variable-length_quantity

*Preserves* is a data model, with associated serialization formats. This
document defines one of those formats: a binary syntax for `Value`s from
the [Preserves data model](preserves.html) that is easy for computer
software to read and write. An [equivalent human-readable text
syntax](preserves-text.html) also exists.

## Machine-Oriented Binary Syntax

A `Repr` is a binary-syntax encoding, or representation, of a `Value`.
For a value `v`, we write `«v»` for the `Repr` of `v`.

### Type and Length representation.

Each `Repr` starts with a tag byte, describing the kind of information
represented.

However, inspired by [argdata][], a `Repr` does *not* describe its own
length. Instead, the surrounding context must supply the expected length
of the `Repr`.

As a consequence, `Repr`s for `Compound` values store the lengths of
their contained values. Each contained `Value` is represented as a
length in bytes followed by its own `Repr`. Implementations use each
stored length to decide when to stop reading the following `Repr`.

<a id="varint"></a> Each length is stored as an [argdata][]-compatible
big-endian base 128 *varint*.[^see-also-leb128] Each byte of a varint
stores seven bits of the length. All bytes have a clear upper bit,
except the final byte, which has the upper bit set. We write
`len(m)` for the varint-encoding of a non-negative integer `m`,
defined recursively as follows, where `/` is integer division
(rounding down) and `%` is the remainder:

    len(m) = e(m, 128)
      where e(v, d) = [v + d]                          if v < 128
                      e(v / 128, 0) ++ [(v % 128) + d] if v ≥ 128

[^see-also-leb128]: Argdata's length representation is very close to
    [Variable-length quantity (VLQ)][VLQ] encoding, differing only in
    the flipped interpretation of the high bit of each byte. It is
    big-endian, unlike [LEB128][] encoding ([as used by
    Google][google-varint] in protobufs).

We write `len(|r|)` for the varint-encoding of the length of `Repr` `r`.

The following table illustrates varint-encoding.

| Number, `m` | `m` in binary, grouped into 7-bit chunks  | `len(m)` bytes  |
|-------------|-------------------------------------------|-----------------|
| 15          | `0001111`                                 | 143             |
| 300         | `0000010 0101100`                         | 2 172           |
| 1000000000  | `0000011 1011100 1101011 0010100 0000000` | 3 92 107 20 128 |

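The recursive definition of `len(m)` can be unrolled into a loop. The following is a minimal Python sketch, not a reference implementation; the function names `encode_varint` and `decode_varint` are illustrative and do not come from any Preserves library.

```python
def encode_varint(m: int) -> bytes:
    """Encode a non-negative integer as a big-endian base-128 varint.

    Every byte carries seven bits; only the final byte has its high bit set.
    """
    if m < 0:
        raise ValueError("varint lengths must be non-negative")
    out = [(m % 128) + 128]      # final byte: low seven bits, high bit set
    m //= 128
    while m > 0:
        out.append(m % 128)      # earlier bytes: high bit clear
        m //= 128
    return bytes(reversed(out))  # most significant chunk first


def decode_varint(data: bytes, offset: int = 0) -> tuple[int, int]:
    """Decode a varint starting at `offset`; return (value, next_offset)."""
    value = 0
    while True:
        if offset >= len(data):
            raise ValueError("truncated varint")
        byte = data[offset]
        offset += 1
        value = value * 128 + (byte & 0x7F)
        if byte & 0x80:          # a set high bit marks the final byte
            return value, offset


print(list(encode_varint(300)))  # the second row of the table above
```

The decoder accepts leading `0` bytes, so an overlong encoding such as `0 0 143` decodes to the same value as `143`.
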
There is no requirement that a varint-encoded `m` in a `Repr` be the
unique shortest encoding for that `m`.[^overlong-varint] However,
implementations *SHOULD* use the shortest encoding wherever possible
when writing, and *MAY* reject encodings with more than eight leading
`0` bytes when reading encoded values.

[^overlong-varint]: **Implementation note.** The spec permits overlong length encodings to
    reduce wasted activity in resource-constrained situations. If an implementation is in
    anything other than a very low-level language, it is likely to be able to use
    [IOList](./conventions.html#iolists)-style data structures to avoid unnecessary copying.

### Records, Sequences, Sets and Dictionaries.

    «<L F_1...F_m>»       = [0xA7] ++ seq(«L», «F_1», ..., «F_m»)
    «[X_1...X_m]»         = [0xA8] ++ seq(«X_1», ..., «X_m»)
    «#{E_1...E_m}»        = [0xA9] ++ seq(«E_1», ..., «E_m»)
    «{K_1:V_1...K_m:V_m}» = [0xAA] ++ seq(«K_1», «V_1», ..., «K_m», «V_m»)

    seq(R_1, ..., R_m)    = len(|R_1|) ++ R_1 ++...++ len(|R_m|) ++ R_m

There is *no* ordering requirement on the `E_i` elements or
`K_i`/`V_i` pairs.[^no-sorting-rationale] They may appear in any
order. However, the `E_i` and `K_i` *MUST* be pairwise distinct. In
addition, implementations *SHOULD* default to writing set elements and
dictionary key/value pairs in order sorted lexicographically by their
`Repr`s[^not-sorted-semantically], and *MAY* offer the option of
serializing in some other implementation-defined order.

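As an illustration of the recommended default order, here is a hedged Python sketch that builds set and dictionary `Repr`s from already-encoded members. The helper names are hypothetical. Two simplifications are assumptions of this sketch rather than statements of the specification: dictionary pairs are ordered by the `Repr` of the key, and the duplicate check only catches byte-identical `Repr`s, which is weaker than the pairwise-distinct requirement on `Value`s.

```python
def encode_varint(m: int) -> bytes:
    """Big-endian base-128 varint; only the final byte has its high bit set."""
    out = [(m % 128) + 128]
    m //= 128
    while m > 0:
        out.append(m % 128)
        m //= 128
    return bytes(reversed(out))


def seq(reprs) -> bytes:
    """seq(R_1, ..., R_m): each Repr preceded by its varint-encoded length."""
    return b"".join(encode_varint(len(r)) + r for r in reprs)


def encode_set(element_reprs) -> bytes:
    """Tag 0xA9 followed by the elements, sorted bytewise by Repr."""
    ordered = sorted(element_reprs)  # bytes compare lexicographically
    if any(a == b for a, b in zip(ordered, ordered[1:])):
        raise ValueError("set elements must be pairwise distinct")
    return b"\xA9" + seq(ordered)


def encode_dictionary(pair_reprs) -> bytes:
    """Tag 0xAA followed by key/value Reprs, ordered by key Repr (assumption)."""
    ordered = sorted(pair_reprs, key=lambda kv: kv[0])
    keys = [k for k, _ in ordered]
    if any(a == b for a, b in zip(keys, keys[1:])):
        raise ValueError("dictionary keys must be pairwise distinct")
    return b"\xAA" + seq(r for kv in ordered for r in kv)


# The set #{-1 0 1}: «0» = A3, «1» = A3 01, «-1» = A3 FF.
print(encode_set([b"\xA3\xFF", b"\xA3", b"\xA3\x01"]).hex(" "))
```

In the example, `«-1»` is written last because its `Repr` is bytewise greatest, even though -1 is the smallest of the three integers.
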

[^no-sorting-rationale]: In the BitTorrent encoding format,
    [bencoding](http://www.bittorrent.org/beps/bep_0003.html#bencoding),
    dictionary key/value pairs must be sorted by key. This is a
    necessary step for ensuring serialization of `Value`s is
    canonical. We do not require that key/value pairs (or set
    elements) be in sorted order for serialized `Value`s; however, a
    [canonical form][canonical] for `Repr`s does exist where a sorted
    ordering is required.

[^not-sorted-semantically]: It's important to note that the sort
    ordering for writing out set elements and dictionary key/value
    pairs is *not* the same as the sort ordering implied by the
    semantic ordering of those elements or keys. For example, the
    `Repr` of a negative number very far from zero will start with a
    byte that is *greater* than the byte which starts the `Repr` of
    zero, making it sort lexicographically later by `Repr`, despite
    being semantically *less than* zero.

    **Rationale**. This is for ease-of-implementation reasons: not all
    languages can easily represent sorted sets or sorted dictionaries,
    but encoding and then sorting byte strings is much more likely to
    be within easy reach.

No sentinel marks the end of a sequence of length-prefixed `Repr`s.
During decoding, use the length of the containing `Repr` to decide when
to stop expecting more contained `Repr`s.

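The decoding rule can be sketched as a loop that runs until the containing `Repr`'s bytes are used up. This is an illustrative Python sketch only; `split_seq` is a hypothetical helper name, and it expects the bytes that follow the tag of a compound `Repr`.

```python
def decode_varint(data: bytes, offset: int) -> tuple[int, int]:
    """Decode a big-endian base-128 varint; return (value, next_offset)."""
    value = 0
    while True:
        if offset >= len(data):
            raise ValueError("truncated varint")
        byte = data[offset]
        offset += 1
        value = value * 128 + (byte & 0x7F)
        if byte & 0x80:  # a set high bit marks the final byte
            return value, offset


def split_seq(body: bytes) -> list[bytes]:
    """Split the body of a compound Repr into its contained Reprs.

    There is no end marker: the loop stops when the body is exhausted.
    """
    items = []
    offset = 0
    while offset < len(body):
        length, offset = decode_varint(body, offset)
        end = offset + length
        if end > len(body):
            raise ValueError("contained Repr overruns its container")
        items.append(body[offset:end])
        offset = end
    return items


# «[#t #f]» = A8 81 A1 81 A0; drop the tag byte, then split the rest.
sequence_repr = bytes.fromhex("a881a181a0")
print(split_seq(sequence_repr[1:]))
```

Each returned item is itself a complete `Repr` whose length is now known, so it can be decoded recursively in the same way.
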

### SignedIntegers.

    «x» when x ∈ SignedInteger = [0xA3] ++ intbytes(x)

The function `intbytes(x)` gives the big-endian two's-complement binary
representation of `x`, taking exactly as many whole bytes as needed to
unambiguously identify the value and its sign. As a special case,
`intbytes(0)` is the empty byte sequence. The most-significant bit in
the first byte in `intbytes(x)` (for `x`≠0) is the sign
bit.[^zero-intbytes] Every `SignedInteger` *MUST* be represented with
its shortest possible encoding.

For example,

    «87112285931760246646623899502532662132736»
      = A3 01 00 00 00 00 00 00 00
           00 00 00 00 00 00 00 00
           00 00

    «-257» = A3 FE FF    «-3» = A3 FD      «128» = A3 00 80
    «-256» = A3 FF 00    «-2» = A3 FE      «255» = A3 00 FF
    «-255» = A3 FF 01    «-1» = A3 FF      «256» = A3 01 00
    «-254» = A3 FF 02     «0» = A3       «32767» = A3 7F FF
    «-129» = A3 FF 7F     «1» = A3 01    «32768» = A3 00 80 00
    «-128» = A3 80       «12» = A3 0C    «65535» = A3 00 FF FF
    «-127» = A3 81       «13» = A3 0D    «65536» = A3 01 00 00
      «-4» = A3 FC      «127» = A3 7F   «131072» = A3 02 00 00

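One way to compute `intbytes(x)` in Python is sketched below, relying on the standard `int.bit_length`, `int.to_bytes` and `int.from_bytes` methods. The names `encode_integer` and `decode_integer` are illustrative. Rejecting non-shortest encodings on input is a choice made by this sketch, prompted by the *MUST* above.

```python
def intbytes(x: int) -> bytes:
    """Shortest big-endian two's-complement form of x; empty for zero."""
    if x == 0:
        return b""
    # Magnitude bits, excluding the sign bit; ~x equals -x - 1 for negatives.
    magnitude_bits = x.bit_length() if x > 0 else (~x).bit_length()
    n = magnitude_bits // 8 + 1  # whole bytes, leaving room for the sign bit
    return x.to_bytes(n, "big", signed=True)


def encode_integer(x: int) -> bytes:
    """«x» = [0xA3] ++ intbytes(x)."""
    return b"\xA3" + intbytes(x)


def decode_integer(repr_: bytes) -> int:
    """Decode a SignedInteger Repr whose length is given by len(repr_)."""
    if repr_[:1] != b"\xA3":
        raise ValueError("not a SignedInteger Repr")
    x = int.from_bytes(repr_[1:], "big", signed=True)
    if encode_integer(x) != repr_:
        raise ValueError("SignedInteger is not in its shortest encoding")
    return x


for value in (-257, -128, 0, 128, 32768):
    print(value, encode_integer(value).hex(" ").upper())
```
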

[^zero-intbytes]: The value 0 needs zero bytes to identify the
    value, so `intbytes(0)` is the empty byte string. Non-zero values
    need at least one byte.

### Strings, ByteStrings and Symbols.

Syntax for these three types varies only in the tag used. For `String`
and `Symbol`, the data following the tag is a UTF-8 encoding of the
`Value`'s code points, while for `ByteString` it is the raw data
contained within the `Value` unmodified.

    «S» = [0xA4] ++ utf8(S)  if S ∈ String
          [0xA5] ++ S        if S ∈ ByteString
          [0xA6] ++ utf8(S)  if S ∈ Symbol

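A brief Python sketch of the three cases follows, using hypothetical function names. No length or terminator is written, because the surrounding context supplies the length of each `Repr`.

```python
def encode_string(s: str) -> bytes:
    """«S» for a String: tag 0xA4 followed by the UTF-8 bytes."""
    return b"\xA4" + s.encode("utf-8")


def encode_bytestring(data: bytes) -> bytes:
    """«S» for a ByteString: tag 0xA5 followed by the raw bytes."""
    return b"\xA5" + bytes(data)


def encode_symbol(name: str) -> bytes:
    """«S» for a Symbol: tag 0xA6 followed by the UTF-8 bytes."""
    return b"\xA6" + name.encode("utf-8")


print(encode_symbol("a").hex(" ").upper())
```
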

### Booleans.

    «#f» = [0xA0]
    «#t» = [0xA1]

### Floats and Doubles.

    «F» when F ∈ Float  = [0xA2] ++ binary32(F)
    «D» when D ∈ Double = [0xA2] ++ binary64(D)

The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
8-byte IEEE 754 binary representations of `F` and `D`, respectively.

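Because both types share the tag `0xA2`, a reader tells them apart by the length of the `Repr`. The sketch below uses Python's standard `struct` module, whose `>f` and `>d` formats produce big-endian IEEE 754 binary32 and binary64 values; the function names are illustrative.

```python
import struct


def encode_float(f: float) -> bytes:
    """«F» for a Float: tag 0xA2 followed by 4 big-endian bytes."""
    return b"\xA2" + struct.pack(">f", f)


def encode_double(d: float) -> bytes:
    """«D» for a Double: tag 0xA2 followed by 8 big-endian bytes."""
    return b"\xA2" + struct.pack(">d", d)


def decode_float_or_double(repr_: bytes) -> tuple[str, float]:
    """Return ("Float", value) or ("Double", value), chosen by body length."""
    if repr_[:1] != b"\xA2":
        raise ValueError("not a Float or Double Repr")
    body = repr_[1:]
    if len(body) == 4:
        return "Float", struct.unpack(">f", body)[0]
    if len(body) == 8:
        return "Double", struct.unpack(">d", body)[0]
    raise ValueError("Float/Double body must be 4 or 8 bytes long")


print(encode_double(1.0).hex(" ").upper())
```
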

### Embeddeds.

The `Repr` of an `Embedded` is the `Repr` of a `Value` chosen to
represent the denoted object, prefixed with `[0xBF]`.

    «#!V» = [0xBF] ++ «V»

### Annotations.

To annotate a `Repr` `r` with some sequence of `Value`s `[v_1, ...,
v_m]`, surround `r` as follows:

    [0xBE] ++ len(|r|) ++ r ++ len(|«v_1»|) ++ «v_1» ++...++ len(|«v_m»|) ++ «v_m»

The `Repr` `r` *MUST NOT* already have annotations; that is, it must not begin with `0xBE`.

For example, the `Repr` corresponding to textual syntax `@a@b[]`, i.e.
an empty sequence annotated with two symbols, `a` and `b`, is

    «@a @b []»
      = [0xBE] ++ len(|«[]»|) ++ «[]» ++ len(|«a»|) ++ «a» ++ len(|«b»|) ++ «b»
      = [0xBE, 0x81, 0xA8, 0x82, 0xA6, 0x61, 0x82, 0xA6, 0x62]

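The worked example can be reproduced with a few lines of Python. This is a sketch under the assumption that the annotations are passed in already encoded as `Repr`s; `annotate` is a hypothetical helper name.

```python
def encode_varint(m: int) -> bytes:
    """Big-endian base-128 varint; only the final byte has its high bit set."""
    out = [(m % 128) + 128]
    m //= 128
    while m > 0:
        out.append(m % 128)
        m //= 128
    return bytes(reversed(out))


def annotate(r: bytes, annotation_reprs: list[bytes]) -> bytes:
    """Wrap Repr `r` with annotations, each given as its own Repr."""
    if r[:1] == b"\xBE":
        raise ValueError("the annotated Repr must not itself begin with 0xBE")
    out = b"\xBE" + encode_varint(len(r)) + r
    for a in annotation_reprs:
        out += encode_varint(len(a)) + a
    return out


empty_sequence = b"\xA8"                 # «[]»
symbol_a, symbol_b = b"\xA6a", b"\xA6b"  # «a» and «b»
print(annotate(empty_sequence, [symbol_a, symbol_b]).hex(" ").upper())
```
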

## Security Considerations

**Annotations.** In modes where a `Value` is being read while
annotations are skipped, an endless sequence of annotations may give an
illusion of progress.

**Overlong varints.** The binary format allows (but discourages)
overlong [varint](#varint)s. Because every `Repr` has a bound on its
length from its surrounding context, this is not a denial-of-service
vector *per se*; however, implementations may wish to consider optional
restrictions on the number of redundant leading `0` bytes accepted when
reading a varint.

**Canonical form for cryptographic hashing and signing.** No canonical
textual encoding of a `Value` is specified. A
[canonical form][canonical] exists for binary encoded `Value`s, and
implementations *SHOULD* produce canonical binary encodings by
default; however, an implementation *MAY* permit two serializations of
the same `Value` to yield different binary `Repr`s.

## Acknowledgements

The exclusion of lengths from `Repr`s, placing lengths instead ahead of
contained values in sequences, is inspired by [argdata][].

## Appendix. Autodetection of textual or binary syntax

Every tag byte in a binary Preserves `Repr` falls within the range
[`0x80`, `0xBF`]. These bytes, interpreted as UTF-8, are *continuation
bytes*, and will never occur as the first byte of a UTF-8 encoded code
point. This means no binary-encoded `Repr` can be misinterpreted as
valid UTF-8.

Conversely, a UTF-8 `Document` must start with a valid code point,
meaning in particular that it must not start with a byte in the range
[`0x80`, `0xBF`]. This means that no UTF-8 encoded textual-syntax
Preserves `Document` can be misinterpreted as a binary-syntax `Repr`.

Examination of the top two bits of the first byte of an encoded `Value`
gives its syntax: if the top two bits are `10`, it should be interpreted
as a binary-syntax `Repr`; otherwise, it should be interpreted as text.

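The top-two-bits test amounts to a single mask and compare. A minimal Python sketch, with an illustrative function name:

```python
def is_binary_syntax(first_byte: int) -> bool:
    """True when the top two bits are 10, i.e. the byte is in 0x80..0xBF."""
    return (first_byte & 0xC0) == 0x80


print(is_binary_syntax(0xA8))      # a binary Sequence tag
print(is_binary_syntax(ord("[")))  # the first byte of a textual sequence
```

Bytes at or above `0xC0`, which begin multi-byte UTF-8 code points, fail the test just as ASCII bytes do, so non-ASCII text is still classified as text.
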

**Streaming.** Autodetection is still possible when streaming an
undetermined number of `Value`s across, say, a TCP/IP connection:

- If the text syntax is to be used for the connection, simply start
  writing each `Document` one after the other. Documents for `Atom`s
  *MUST* be separated from their neighbours by whitespace; in general,
  whitespace *SHOULD* be used to separate adjacent documents.
  Specifically, whitespace separating adjacent documents *SHOULD* be
  ASCII newline (10).

- If the binary syntax is to be used for the connection, start the
  connection with byte `0xA8` (sequence). After the initial byte, send
  each value `v` as `len(|«v»|) ++ «v»`. A side effect of this approach
  is that the entire stream, when complete, is a valid `Sequence`
  `Repr`.

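The binary streaming convention can be sketched as a writer that emits the `0xA8` byte once and then length-prefixes every value. The example below is an assumption-laden illustration: it writes to any file-like object with a `write` method, takes values already encoded as `Repr`s, and uses the hypothetical name `write_binary_stream`.

```python
import io


def encode_varint(m: int) -> bytes:
    """Big-endian base-128 varint; only the final byte has its high bit set."""
    out = [(m % 128) + 128]
    m //= 128
    while m > 0:
        out.append(m % 128)
        m //= 128
    return bytes(reversed(out))


def write_binary_stream(out, reprs) -> None:
    """Start the stream with 0xA8, then send each Repr with its length."""
    out.write(b"\xA8")
    for r in reprs:
        out.write(encode_varint(len(r)) + r)


buffer = io.BytesIO()  # stands in for a socket or pipe
write_binary_stream(buffer, [b"\xA1", b"\xA0"])  # #t followed by #f
print(buffer.getvalue().hex(" ").upper())
```

The bytes written for `#t` then `#f` are the same as `«[#t #f]»`, which is the side effect described in the second bullet.
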

## Appendix. Table of tag values

    (8x) RESERVED 80-8F
    (9x) RESERVED 90-9F

    A0 - False
    A1 - True
    A2 - Float or Double (length disambiguates)
    A3 - SignedIntegers (0 is encoded with no bytes at all)
    A4 - String (no trailing NUL is added)
    A5 - ByteString
    A6 - Symbol

    A7 - Record
    A8 - Sequence
    A9 - Set
    AA - Dictionary

    (Ax) RESERVED AB-AF

    (Bx) RESERVED B0-BD
    BE - Annotations. {BE Lval val Lann0 ann0 Lann1 ann1 ...}
    BF - Embedded

## Appendix. Binary SignedInteger representation

Languages that provide fixed-width machine word types may find the
following table useful in encoding and decoding binary `SignedInteger`
values.

| Integer range | Bytes required (including tag) | Encoding (hex) |
| --- | --- | --- |
| 0 | 1 | `A3` |
| -2<sup>7</sup> ≤ n < 2<sup>7</sup> (i8) | 2 | `A3` `XX` |
| -2<sup>15</sup> ≤ n < 2<sup>15</sup> (i16) | 3 | `A3` `XX` `XX` |
| -2<sup>23</sup> ≤ n < 2<sup>23</sup> (i24) | 4 | `A3` `XX` `XX` `XX` |
| -2<sup>31</sup> ≤ n < 2<sup>31</sup> (i32) | 5 | `A3` `XX` `XX` `XX` `XX` |
| -2<sup>39</sup> ≤ n < 2<sup>39</sup> (i40) | 6 | `A3` `XX` `XX` `XX` `XX` `XX` |
| -2<sup>47</sup> ≤ n < 2<sup>47</sup> (i48) | 7 | `A3` `XX` `XX` `XX` `XX` `XX` `XX` |
| -2<sup>55</sup> ≤ n < 2<sup>55</sup> (i56) | 8 | `A3` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
| -2<sup>63</sup> ≤ n < 2<sup>63</sup> (i64) | 9 | `A3` `XX` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |

<!-- Heading to visually offset the footnotes from the main document: -->
## Notes