2022-06-18 17:11:08 +00:00
|
|
|
|
---
|
|
|
|
|
no_site_title: true
|
|
|
|
|
title: "Preserves: Binary Syntax"
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
Tony Garnock-Jones <tonyg@leastfixedpoint.com>
|
|
|
|
|
{{ site.version_date }}. Version {{ site.version }}.
|
|
|
|
|
|
|
|
|
|
[varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
|
|
|
|
|
[LEB128]: https://en.wikipedia.org/wiki/LEB128
|
|
|
|
|
[canonical]: canonical-binary.html
|
|
|
|
|
|
|
|
|
|
*Preserves* is a data model, with associated serialization formats. This
|
|
|
|
|
document defines one of those formats: a binary syntax for `Value`s from
|
|
|
|
|
the [Preserves data model](preserves.html) that is easy for computer
|
|
|
|
|
software to read and write. An [equivalent human-readable text
|
|
|
|
|
syntax](preserves-text.html) also exists.
|
|
|
|
|
|
|
|
|
|
## Machine-Oriented Binary Syntax
|
|
|
|
|
|
|
|
|
|
A `Repr` is a binary-syntax encoding, or representation, of a `Value`.
|
|
|
|
|
For a value `v`, we write `«v»` for the `Repr` of v.
|
|
|
|
|
|
|
|
|
|
### Type and Length representation.
|
|
|
|
|
|
|
|
|
|
Each `Repr` starts with a tag byte, describing the kind of information
|
|
|
|
|
represented. Depending on the tag, a length indicator, further encoded
|
|
|
|
|
information, and/or an ending tag may follow.
|
|
|
|
|
|
2023-03-15 15:24:02 +00:00
|
|
|
|
tag (simple atomic data)
|
2024-01-27 10:34:51 +00:00
|
|
|
|
tag ++ length ++ binarydata (doubles, integers, strings, symbols, and binary)
|
2023-10-14 22:33:41 +00:00
|
|
|
|
tag ++ repr ++ ... ++ endtag (compound data)
|
2022-06-18 17:11:08 +00:00
|
|
|
|
|
|
|
|
|
The unique end tag is byte value `0x84`.
|
|
|
|
|
|
|
|
|
|
If present after a tag, the length of a following piece of binary data
|
|
|
|
|
is formatted as a [base 128 varint][varint].[^see-also-leb128] We
|
|
|
|
|
write `varint(m)` for the varint-encoding of `m`. Quoting the
|
|
|
|
|
[Google Protocol Buffers][varint] definition,
|
|
|
|
|
|
|
|
|
|
[^see-also-leb128]: Also known as [LEB128][] encoding, for unsigned
|
|
|
|
|
integers. Varints and LEB128-encoded integers differ only for
|
2023-03-15 15:24:02 +00:00
|
|
|
|
negative numbers, which cannot appear as length indicators and are
|
|
|
|
|
thus not used in Preserves.
|
2022-06-18 17:11:08 +00:00
|
|
|
|
|
|
|
|
|
> Each byte in a varint, except the last byte, has the most
|
|
|
|
|
> significant bit (msb) set – this indicates that there are further
|
|
|
|
|
> bytes to come. The lower 7 bits of each byte are used to store the
|
|
|
|
|
> two's complement representation of the number in groups of 7 bits,
|
|
|
|
|
> least significant group first.
|
|
|
|
|
|
2023-03-15 15:24:02 +00:00
|
|
|
|
For example, `varint(15)` is `[0x0F]`, and `varint(1000000000)` is `[0x80, 0x94, 0xeb, 0xdc,
|
|
|
|
|
0x03]`.
|
2022-06-18 17:11:08 +00:00
|
|
|
|
|
|
|
|
|
It is an error for a varint-encoded `m` in a `Repr` to be anything
|
|
|
|
|
other than the unique shortest encoding for that `m`. That is, a
|
|
|
|
|
varint-encoding of `m` *MUST NOT* end in `0` unless `m`=0.
|
|
|
|
|
|
|
|
|
|
### Records, Sequences, Sets and Dictionaries.
|
|
|
|
|
|
|
|
|
|
«<L F_1...F_m>» = [0xB4] ++ «L» ++ «F_1» ++...++ «F_m» ++ [0x84]
|
|
|
|
|
«[X_1...X_m]» = [0xB5] ++ «X_1» ++...++ «X_m» ++ [0x84]
|
|
|
|
|
«#{E_1...E_m}» = [0xB6] ++ «E_1» ++...++ «E_m» ++ [0x84]
|
|
|
|
|
«{K_1:V_1...K_m:V_m}» = [0xB7] ++ «K_1» ++ «V_1» ++...++ «K_m» ++ «V_m» ++ [0x84]
|
|
|
|
|
|
|
|
|
|
There is *no* ordering requirement on the `E_i` elements or
|
|
|
|
|
`K_i`/`V_i` pairs.[^no-sorting-rationale] They may appear in any
|
|
|
|
|
order. However, the `E_i` and `K_i` *MUST* be pairwise distinct. In
|
|
|
|
|
addition, implementations *SHOULD* default to writing set elements and
|
|
|
|
|
dictionary key/value pairs in order sorted lexicographically by their
|
|
|
|
|
`Repr`s[^not-sorted-semantically], and *MAY* offer the option of
|
|
|
|
|
serializing in some other implementation-defined order.
|
|
|
|
|
|
|
|
|
|
[^no-sorting-rationale]: In the BitTorrent encoding format,
|
|
|
|
|
[bencoding](http://www.bittorrent.org/beps/bep_0003.html#bencoding),
|
|
|
|
|
dictionary key/value pairs must be sorted by key. This is a
|
|
|
|
|
necessary step for ensuring serialization of `Value`s is
|
2023-03-15 15:24:02 +00:00
|
|
|
|
canonical. We encourage, but do not require that key/value pairs (or set
|
2022-06-18 17:11:08 +00:00
|
|
|
|
elements) be in sorted order for serialized `Value`s; however, a
|
|
|
|
|
[canonical form][canonical] for `Repr`s does exist where a sorted
|
|
|
|
|
ordering is required.
|
|
|
|
|
|
|
|
|
|
[^not-sorted-semantically]: It's important to note that the sort
|
2023-10-14 23:21:59 +00:00
|
|
|
|
ordering for writing out set elements and dictionary key/value pairs
|
|
|
|
|
is *not* the same as the sort ordering implied by the semantic
|
|
|
|
|
ordering of those elements or keys. For example, the `Repr` of a
|
|
|
|
|
negative number sorts lexicographically *after* the `Repr` of zero,
|
|
|
|
|
despite being semantically *less than* zero.
|
2022-06-18 17:11:08 +00:00
|
|
|
|
|
|
|
|
|
**Rationale**. This is for ease-of-implementation reasons: not all
|
|
|
|
|
languages can easily represent sorted sets or sorted dictionaries,
|
|
|
|
|
but encoding and then sorting byte strings is much more likely to
|
|
|
|
|
be within easy reach.
|
|
|
|
|
|
2023-10-16 23:34:21 +00:00
|
|
|
|
### SignedIntegers, Strings, ByteStrings and Symbols.
|
|
|
|
|
|
|
|
|
|
«S» = [0xB0] ++ varint(|intbytes(S)|) ++ intbytes(S) if S ∈ SignedInteger
|
|
|
|
|
[0xB1] ++ varint(|utf8(S)|) ++ utf8(S) if S ∈ String
|
|
|
|
|
[0xB2] ++ varint(|S|) ++ S if S ∈ ByteString
|
|
|
|
|
[0xB3] ++ varint(|utf8(S)|) ++ utf8(S) if S ∈ Symbol
|
|
|
|
|
|
|
|
|
|
For `String` and `Symbol`, the data following the tag and length is a
|
|
|
|
|
UTF-8 encoding of the `Value`. For `ByteString`, it is the raw data
|
|
|
|
|
contained within the `Value` unmodified. For `SignedInteger`, it is
|
|
|
|
|
the big-endian two's-complement binary representation of the number,
|
|
|
|
|
taking exactly as many whole bytes as needed to unambiguously identify
|
|
|
|
|
the value and its sign. `intbytes(0)` is special-cased to be the empty
|
|
|
|
|
byte sequence;[^empty-intbytes-sequence] otherwise, the
|
|
|
|
|
most-significant bit of the first byte is the sign bit. (See
|
|
|
|
|
[SignedInteger examples](#signedinteger-examples) in the appendix
|
|
|
|
|
below.)
|
2023-10-16 22:31:10 +00:00
|
|
|
|
|
|
|
|
|
[^empty-intbytes-sequence]: Without the special case of
|
|
|
|
|
`intbytes(0)` yielding the empty byte sequence, no input to
|
|
|
|
|
`intbytes` would result in an empty sequence. In principle, either
|
|
|
|
|
`0` or `-1` could be special-cased to the empty sequence; here, we
|
|
|
|
|
arbitrarily choose `0`.
|
2022-06-18 17:11:08 +00:00
|
|
|
|
|
|
|
|
|
### Booleans.
|
|
|
|
|
|
|
|
|
|
«#f» = [0x80]
|
|
|
|
|
«#t» = [0x81]
|
|
|
|
|
|
2024-01-27 10:34:51 +00:00
|
|
|
|
### Doubles.
|
2022-06-18 17:11:08 +00:00
|
|
|
|
|
2023-03-15 15:24:02 +00:00
|
|
|
|
«D» = [0x87, 0x08] ++ binary64(D) if D ∈ Double
|
2022-06-18 17:11:08 +00:00
|
|
|
|
|
2024-01-27 10:34:51 +00:00
|
|
|
|
The function `binary64(D)` yields the big-endian 8-byte IEEE 754 binary
|
|
|
|
|
representation of `D`.
|
2022-06-18 17:11:08 +00:00
|
|
|
|
|
|
|
|
|
### Embeddeds.
|
|
|
|
|
|
2024-02-05 21:38:49 +00:00
|
|
|
|
«#:V» = [0x86] ++ «V»
|
2023-03-15 15:24:02 +00:00
|
|
|
|
|
2022-06-18 17:11:08 +00:00
|
|
|
|
The `Repr` of an `Embedded` is the `Repr` of a `Value` chosen to
|
|
|
|
|
represent the denoted object, prefixed with `[0x86]`.
|
|
|
|
|
|
|
|
|
|
### Annotations.
|
|
|
|
|
|
2023-10-14 22:33:41 +00:00
|
|
|
|
«@W V» = [0x85] ++ «W» «V»
|
2022-06-18 17:11:08 +00:00
|
|
|
|
|
2023-10-14 23:16:46 +00:00
|
|
|
|
Each annotation `W` precedes the `Value` `V` it
|
|
|
|
|
annotates.[^annotation-syntax-rationale] Both `W` and `V` *MAY*
|
|
|
|
|
themselves be further annotated. Implementations *SHOULD* default to
|
|
|
|
|
omitting annotations from `Repr`s. See [examples in the
|
2023-10-14 22:49:09 +00:00
|
|
|
|
appendix](#annotation-examples).
|
2022-06-18 17:11:08 +00:00
|
|
|
|
|
2023-10-14 23:16:46 +00:00
|
|
|
|
[^annotation-syntax-rationale]: **Rationale.** The syntax for
|
|
|
|
|
annotated values is unlike any other compound syntax defined here.
|
|
|
|
|
It allows parsers to *statelessly* skip annotations: while `0x85` is
|
|
|
|
|
the next input byte, skip it, skip a single `Repr`, and then try
|
|
|
|
|
again; otherwise, parse the input stream as usual.
|
|
|
|
|
|
|
|
|
|
By contrast, one might imagine using a parenthesized syntax like the syntax of
|
|
|
|
|
compound data. In that case, parsers would have to remember
|
|
|
|
|
information about nesting of annotations. This would not necessarily
|
|
|
|
|
be an issue for recursive ("DOM"-like) parsers, but streaming
|
|
|
|
|
("SAX"-like) parsers would have to either take on complexity
|
|
|
|
|
internally or force it on their clients, even when those clients did
|
|
|
|
|
not care about annotations.
|
|
|
|
|
|
2022-06-18 17:11:08 +00:00
|
|
|
|
## Security Considerations
|
|
|
|
|
|
|
|
|
|
**Annotations.** In modes where a `Value` is being read while
|
2023-10-14 22:33:41 +00:00
|
|
|
|
annotations are skipped, an endless series of annotations may give an
|
2022-06-18 17:11:08 +00:00
|
|
|
|
illusion of progress.
|
|
|
|
|
|
|
|
|
|
**Canonical form for cryptographic hashing and signing.** No canonical
|
2023-03-15 15:24:02 +00:00
|
|
|
|
*textual* encoding of a `Value` is specified. However, a [canonical
|
|
|
|
|
form][canonical] exists for binary encoded `Value`s, and
|
2022-06-18 17:11:08 +00:00
|
|
|
|
implementations *SHOULD* produce canonical binary encodings by
|
|
|
|
|
default; however, an implementation *MAY* permit two serializations of
|
|
|
|
|
the same `Value` to yield different binary `Repr`s.
|
|
|
|
|
|
|
|
|
|
## Appendix. Autodetection of textual or binary syntax
|
|
|
|
|
|
|
|
|
|
Every tag byte in a binary Preserves `Document` falls within the range
|
|
|
|
|
[`0x80`, `0xBF`]. These bytes, interpreted as UTF-8, are *continuation
|
2023-10-13 12:01:21 +00:00
|
|
|
|
bytes*, and will never occur as the first byte of a UTF-8 encoding. This
|
|
|
|
|
means no binary-encoded document can be misinterpreted as valid UTF-8.
|
2022-06-18 17:11:08 +00:00
|
|
|
|
|
2023-10-13 12:01:21 +00:00
|
|
|
|
Conversely, a UTF-8 document must start with a valid scalar value,
|
2022-06-18 17:11:08 +00:00
|
|
|
|
meaning in particular that it must not start with a byte in the range
|
|
|
|
|
[`0x80`, `0xBF`]. This means that no UTF-8 encoded textual-syntax
|
|
|
|
|
Preserves document can be misinterpreted as a binary-syntax document.
|
|
|
|
|
|
|
|
|
|
Examination of the top two bits of the first byte of a document gives
|
|
|
|
|
its syntax: if the top two bits are `10`, it should be interpreted as
|
|
|
|
|
a binary-syntax document; otherwise, it should be interpreted as text.
|
|
|
|
|
|
|
|
|
|
## Appendix. Table of tag values
|
|
|
|
|
|
|
|
|
|
80 - False
|
|
|
|
|
81 - True
|
|
|
|
|
84 - End marker
|
2023-10-14 22:33:41 +00:00
|
|
|
|
85 - Annotation
|
2022-06-18 17:11:08 +00:00
|
|
|
|
86 - Embedded
|
2024-01-27 10:34:51 +00:00
|
|
|
|
87 - Double
|
2022-06-18 17:11:08 +00:00
|
|
|
|
|
2023-03-15 15:24:02 +00:00
|
|
|
|
B0 - Integer
|
2022-06-18 17:11:08 +00:00
|
|
|
|
B1 - String
|
|
|
|
|
B2 - ByteString
|
|
|
|
|
B3 - Symbol
|
|
|
|
|
B4 - Record
|
|
|
|
|
B5 - Sequence
|
|
|
|
|
B6 - Set
|
|
|
|
|
B7 - Dictionary
|
|
|
|
|
|
2023-10-13 12:21:40 +00:00
|
|
|
|
All tags fall in the range [`0x80`, `0xBF`].
|
|
|
|
|
|
2023-10-14 22:33:41 +00:00
|
|
|
|
Tag values `82`, `83`, `88`...`AF`, and `B8`...`BF` are **reserved**.
|
2023-10-13 12:21:40 +00:00
|
|
|
|
|
2022-06-18 17:11:08 +00:00
|
|
|
|
## Appendix. Binary SignedInteger representation
|
|
|
|
|
|
|
|
|
|
Languages that provide fixed-width machine word types may find the
|
|
|
|
|
following table useful in encoding and decoding binary `SignedInteger`
|
|
|
|
|
values.
|
|
|
|
|
|
2023-03-15 19:50:49 +00:00
|
|
|
|
| Integer range | Bytes required | Encoding (hex) |
|
|
|
|
|
| --- | --- | --- |
|
|
|
|
|
| 0 | 2 | `B0` `00` |
|
2023-03-15 15:24:02 +00:00
|
|
|
|
| -2<sup>7</sup> ≤ n < 2<sup>7</sup> (i8) | 3 | `B0` `01` `XX` |
|
|
|
|
|
| -2<sup>15</sup> ≤ n < 2<sup>15</sup> (i16) | 4 | `B0` `02` `XX` `XX` |
|
|
|
|
|
| -2<sup>23</sup> ≤ n < 2<sup>23</sup> (i24) | 5 | `B0` `03` `XX` `XX` `XX` |
|
|
|
|
|
| -2<sup>31</sup> ≤ n < 2<sup>31</sup> (i32) | 6 | `B0` `04` `XX` `XX` `XX` `XX` |
|
|
|
|
|
| -2<sup>39</sup> ≤ n < 2<sup>39</sup> (i40) | 7 | `B0` `05` `XX` `XX` `XX` `XX` `XX` |
|
|
|
|
|
| -2<sup>47</sup> ≤ n < 2<sup>47</sup> (i48) | 8 | `B0` `06` `XX` `XX` `XX` `XX` `XX` `XX` |
|
|
|
|
|
| -2<sup>55</sup> ≤ n < 2<sup>55</sup> (i56) | 9 | `B0` `07` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
|
|
|
|
|
| -2<sup>63</sup> ≤ n < 2<sup>63</sup> (i64) | 10 | `B0` `08` `XX` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
|
2022-06-18 17:11:08 +00:00
|
|
|
|
|
2023-10-14 22:33:41 +00:00
|
|
|
|
## Appendix. Examples
|
|
|
|
|
|
2023-10-14 22:49:09 +00:00
|
|
|
|
### <a id="signedinteger-examples"></a>SignedInteger examples
|
2023-10-13 12:21:40 +00:00
|
|
|
|
|
|
|
|
|
«-257» = B0 02 FE FF «-2» = B0 01 FE «255» = B0 02 00 FF
|
|
|
|
|
«-256» = B0 02 FF 00 «-1» = B0 01 FF «256» = B0 02 01 00
|
|
|
|
|
«-255» = B0 02 FF 01 «0» = B0 00 «32767» = B0 02 7F FF
|
|
|
|
|
«-129» = B0 02 FF 7F «1» = B0 01 01 «32768» = B0 03 00 80 00
|
|
|
|
|
«-128» = B0 01 80 «127» = B0 01 7F «65535» = B0 03 00 FF FF
|
|
|
|
|
«-127» = B0 01 81 «128» = B0 02 00 80 «65536» = B0 03 01 00 00
|
|
|
|
|
|
|
|
|
|
«87112285931760246646623899502532662132736»
|
|
|
|
|
= B0 12 01 00 00 00 00 00 00 00
|
|
|
|
|
00 00 00 00 00 00 00 00
|
|
|
|
|
00 00
|
|
|
|
|
|
2023-10-14 22:33:41 +00:00
|
|
|
|
### <a id="annotation-examples"></a>Annotation examples
|
|
|
|
|
|
|
|
|
|
The `Repr` corresponding to textual syntax `@a@b[]`, i.e. an empty
|
|
|
|
|
sequence annotated with two symbols, `a` and `b`, is
|
|
|
|
|
|
|
|
|
|
«@a @b []» = [0x85] ++ «a» ++ [0x85] ++ «b» ++ «[]»
|
|
|
|
|
= [0x85, 0xB3, 0x01, 0x61, 0x85, 0xB3, 0x01, 0x62, 0xB5, 0x84]
|
|
|
|
|
|
|
|
|
|
Annotations may themselves be annotated. Here, `c` is annotated with
|
|
|
|
|
`b`, which itself is annotated with `a`:
|
|
|
|
|
|
|
|
|
|
«@ @a b c» = [0x85] ++ [0x85] ++ «a» ++ «b» ++ «c»>
|
|
|
|
|
|
2022-06-18 17:11:08 +00:00
|
|
|
|
<!-- Heading to visually offset the footnotes from the main document: -->
|
|
|
|
|
## Notes
|