389 lines
18 KiB
Markdown
389 lines
18 KiB
Markdown
---
|
|
no_site_title: true
|
|
title: "Preserves: Binary Syntax"
|
|
---
|
|
|
|
Tony Garnock-Jones <tonyg@leastfixedpoint.com>
|
|
{{ site.version_date }}. Version {{ site.version }}.
|
|
|
|
[LEB128]: https://en.wikipedia.org/wiki/LEB128
|
|
[argdata]: https://github.com/NuxiNL/argdata
|
|
[canonical]: canonical-binary.html
|
|
[google-varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
|
|
[vlq]: https://en.wikipedia.org/wiki/Variable-length_quantity
|
|
|
|
*Preserves* is a data model, with associated serialization formats. This
|
|
document defines one of those formats: a binary syntax for `Value`s from
|
|
the [Preserves data model](preserves.html) that is easy for computer
|
|
software to read and write. An [equivalent human-readable text
|
|
syntax](preserves-text.html) also exists.
|
|
|
|
## Machine-Oriented Binary Syntax
|
|
|
|
A `Repr` is a binary-syntax encoding, or representation, of a `Value`.
|
|
|
|
### Type and Length representation.
|
|
|
|
Each `Repr` starts with a tag byte, describing the kind of information
|
|
represented. The expected length of the `Repr` is always available
|
|
from the surrounding context: either from a containing encoded value, or
|
|
from the overall container of the data, which could be a file, an HTTP
|
|
message, a UDP packet, etc.
|
|
|
|
### Atomic Values.
|
|
|
|
**Booleans.** The "false" boolean's `Repr` is just tag `0xA0`; "true" is
|
|
`0xA1`.
|
|
|
|
**Floats and Doubles.** Both `Float` and `Double` values are represented
|
|
as tag `0xA2` followed by big-endian 4- or 8-byte IEEE 754 binary
|
|
representations of the values, respectively.
|
|
|
|
**SignedIntegers.** A `SignedInteger` encodes as tag `0xA3` followed by
|
|
a big-endian two's-complement binary representation of the value, taking
|
|
at least as many whole bytes as needed to unambiguously identify the
|
|
value and its sign. Zero may be represented as the tag alone, with no
|
|
following bytes. The most-significant bit in the first byte after the
|
|
tag is the sign bit.[^zero-intbytes] The shortest possible encoding
|
|
*SHOULD* be used.[^overlong-signedinteger]
|
|
|
|
[^zero-intbytes]: The value 0 needs zero bytes to identify the value,
|
|
so `intbytes(0)` can be the empty byte string. Non-zero values need
|
|
at least one byte.
|
|
|
|
[^overlong-signedinteger]: **Implementation note.** The spec permits
|
|
overlong `SignedInteger` encodings to allow e.g. construction of
|
|
`Repr`s by filling in partially-completed templates, which can be
|
|
useful in resource-constrained situations.
|
|
|
|
**Strings.** A `String` encodes as tag `0xA4` followed by the UTF-8
|
|
encoding of the string, with an additional trailing `NUL` (0) byte. The
|
|
`NUL` byte *MUST NOT* be treated as part of the `String`: it exists to
|
|
permit zero-copy C interoperability.[^zero-copy-c-string-interop]
|
|
|
|
[^zero-copy-c-string-interop]: Some care must still be taken when
|
|
passing `String` `Repr`s directly to a C-style ABI, since `String`s
|
|
may contain the zero Unicode code point, which C library routines
|
|
will usually misinterpret as an end-of-string marker.
|
|
|
|
**ByteStrings.** A `ByteString` encodes as tag `0xA5` followed by the
|
|
bytes themselves.
|
|
|
|
**Symbols.** A `Symbol` encodes as tag `0xA6` followed by the UTF-8
|
|
encoding of the symbol's code points.
|
|
|
|
### Compound Values.
|
|
|
|
`Repr`s for `Compound` values store the lengths of their contained
|
|
values. Each contained `Value` is converted to a `Repr` and stored as
|
|
the length of the `Repr` in bytes followed by the `Repr` itself.
|
|
Implementations use each stored length to decide when to stop reading
|
|
the associated `Repr`. Similarly, no sentinel marks the end of a
|
|
sequence of length-prefixed `Repr`s. Implementations use the length of
|
|
the containing `Repr`, known from the surrounding context, to decide
|
|
when to stop expecting more contained `Repr`s.
|
|
|
|
<a id="varint"></a> Each length is stored as an [argdata][]-compatible
|
|
big-endian base 128 *varint*.[^see-also-leb128] Each byte of a varint
|
|
stores seven bits of the length. All bytes have a clear upper bit,
|
|
except the final byte, which has the upper bit set. Implementations
|
|
*SHOULD* use the shortest encoding for a varint, and *MUST NOT* produce
|
|
an encoded varint with more than nine leading `0`
|
|
bytes.[^overlong-varint] [^nine-leading-varint-zeroes]
|
|
|
|
[^see-also-leb128]: Argdata's length representation is very close to
|
|
[Variable-length quantity (VLQ)][VLQ] encoding, differing only in
|
|
the flipped interpretation of the high bit of each byte. It is
|
|
big-endian, unlike [LEB128][] encoding ([as used by
|
|
Google][google-varint] in protobufs).
|
|
|
|
[^overlong-varint]: **Implementation note.** The spec permits overlong
|
|
length encodings to reduce wasted activity in resource-constrained
|
|
situations. If an implementation is in anything other than a very
|
|
low-level language, it is likely to be able to use
|
|
[IOList](./conventions.html#iolists)-style data structures to avoid
|
|
unnecessary copying.
|
|
|
|
[^nine-leading-varint-zeroes]: Nine leading zero bytes, plus one
|
|
non-zero byte, equals ten bytes in total. Each byte of varint yields
|
|
7 bits of usable length indicator, so ten bytes gives 70 bits, while
|
|
nine would only give 63, not quite enough for a 64-bit value. Of
|
|
course, it may be some time before an encoder legitimately needs to
|
|
use a 64-bit length indicator, let alone in a resource-constrained
|
|
situation.
|
|
|
|
**Records.** A `Record` is encoded as tag `0xA7` followed by the
|
|
length-prefixed encodings of its label and fields.
|
|
|
|
**Sequences.** A `Sequence` is encoded as tag `0xA8` followed by the
|
|
length-prefixed encodings of its members.
|
|
|
|
**Sets.** A `Set` is encoded like a `Sequence`, but with tag `0xA9`, and
|
|
in some arbitrary order.
|
|
|
|
**Dictionaries.** A `Dictionary` encodes as tag `0xAA` followed by the
|
|
length-prefixed keys and values, in an alternating key/value sequence.
|
|
|
|
There is *no* ordering requirement on the elements of sets or the key/value pairs of
|
|
dictionaries.[^no-sorting-rationale] However, elements of sets and keys in dictionaries *MUST*
|
|
be pairwise distinct. In addition, implementations *SHOULD* default to writing set elements in
|
|
order sorted lexicographically by their `Repr`s and *SHOULD* default to writing dictionary
|
|
key/value pairs in order sorted lexicographically by the `Repr`s of their
|
|
keys[^not-sorted-semantically]. Implementations *MAY* offer the option of serializing in some
|
|
other implementation-defined order.
|
|
|
|
[^no-sorting-rationale]: In the BitTorrent encoding format,
|
|
[bencoding](http://www.bittorrent.org/beps/bep_0003.html#bencoding),
|
|
dictionary key/value pairs must be sorted by key. This is a
|
|
necessary step for ensuring serialization of `Value`s is
|
|
canonical. We do not require that key/value pairs (or set
|
|
elements) be in sorted order for serialized `Value`s; however, a
|
|
[canonical form][canonical] for `Repr`s does exist where a sorted
|
|
ordering is required.
|
|
|
|
[^not-sorted-semantically]: It's important to note that the sort
|
|
ordering for writing out set elements and dictionary key/value
|
|
pairs is *not* the same as the sort ordering implied by the
|
|
semantic ordering of those elements or keys. For example, the
|
|
`Repr` of a negative number very far from zero will start with a
|
|
byte that is *greater* than the byte which starts the `Repr` of
|
|
zero, making it sort lexicographically later by `Repr`, despite
|
|
being semantically *less than* zero.
|
|
|
|
**Rationale**. This is for ease-of-implementation reasons: not all
|
|
languages can easily represent sorted sets or sorted dictionaries,
|
|
but encoding and then sorting byte strings is much more likely to
|
|
be within easy reach.
|
|
|
|
### Embedded Values.
|
|
|
|
Embedded values are encoded as tag `0xAB` followed by the encoding of
|
|
some `Value` chosen to represent the denoted embedded object.
|
|
|
|
### Annotations.
|
|
|
|
The encoding of a sequence of annotations for a `Repr` uses tag `0xBF`,
|
|
followed by the length-prefixed `Repr`, followed by the length-prefixed
|
|
encoded annotations, in order. The `Repr` *MUST NOT* already have
|
|
annotations (must not begin with `0xBF`), and there *MUST* be at least
|
|
one `Value` in the sequence following the `Repr`.
|
|
|
|
## Examples (normative)
|
|
|
|
We write `«v»` for the `Repr` of some `Value` `v`, and `varint(|«v»|)` for
|
|
the varint-encoded length of the `Repr` of `v`.
|
|
|
|
### Varints (length representations).
|
|
|
|
The following table illustrates varint-encoding.
|
|
|
|
| Number, `m` | `m` in binary, grouped into 7-bit chunks | `varint(m)` bytes |
|
|
|-------------|-------------------------------------------|-------------------|
|
|
| 15 | `0001111` | 143 |
|
|
| 300 | `0000010 0101100` | 2 172 |
|
|
| 1000000000 | `0000011 1011100 1101011 0010100 0000000` | 3 92 107 20 128 |
|
|
|
|
### Atoms.
|
|
|
|
«#f» = A0 «#t» = A1
|
|
|
|
«0.123f» = A2 3D FB E7 6D «0.123» = A2 3F BF 7C ED 91 68 72 B0
|
|
|
|
«-257» = A3 FE FF «-3» = A3 FD «128» = A3 00 80
|
|
«-256» = A3 FF 00 «-2» = A3 FE «255» = A3 00 FF
|
|
«-255» = A3 FF 01 «-1» = A3 FF «256» = A3 01 00
|
|
«-254» = A3 FF 02 «0» = A3 «32767» = A3 7F FF
|
|
«-129» = A3 FF 7F «1» = A3 01 «32768» = A3 00 80 00
|
|
«-128» = A3 80 «12» = A3 0C «65535» = A3 00 FF FF
|
|
«-127» = A3 81 «13» = A3 0D «65536» = A3 01 00 00
|
|
«-4» = A3 FC «127» = A3 7F «131072» = A3 02 00 00
|
|
|
|
«87112285931760246646623899502532662132736»
|
|
= A3 01 00 00 00 00 00 00 00
|
|
00 00 00 00 00 00 00 00
|
|
00 00
|
|
|
|
«""» = A4 00 «||» = A6
|
|
«"a"» = A4 61 00 «|a|» = A6 61
|
|
«"hello"» = A4 68 65 6C 6C 6F 00 «|hello|» = A6 68 65 6C 6C 6F
|
|
|
|
«#[]» = A5
|
|
«#[AQ==]» = A5 01
|
|
«#[ATAyMDMwNDA1]» = A5 01 02 03 04 05
|
|
|
|
### Compounds.
|
|
|
|
«<window 100 120 500 300>» = A7 87 A6 77696E646F77
|
|
82 A3 64
|
|
82 A3 78
|
|
83 A3 01F4
|
|
83 A3 012C
|
|
|
|
«["zzzz(...192 zs...)zzzz"]»
|
|
(a length-1 sequence containing a length-200 string)
|
|
= A8 01CA A4 7A7A7A7A (... 192 repetitions of 7A ...)
|
|
7A7A7A7A 00
|
|
|
|
«[H, He, Li, Be, B, C, N, O, F, Ne]»
|
|
= A8 82A648 83A64865 83A64C69 83A64265
|
|
82A642 82A643 82A64E 82A64F 82A646 83A64E65
|
|
|
|
«#{H He Li Be B C N O F Ne}»
|
|
= A9 82A642 (B)
|
|
83A64265 (Be)
|
|
82A643 (C)
|
|
82A646 (F)
|
|
82A648 (H)
|
|
83A64865 (He)
|
|
83A64C69 (Li)
|
|
82A64E (N)
|
|
83A64E65 (Ne)
|
|
82A64F (O)
|
|
|
|
«{H: 1.0080f, He: 4.0026f, Li: 6.94f, Be: 9.0122f,
|
|
B: 10.81f, C: 12.011f, N: 14.007f, O: 15.999f,
|
|
F: 18.998f, Ne: 20.180f}»
|
|
= AA 82A642 85A2412CF5C3 (B: 10.81f)
|
|
83A64265 85A2411031F9 (Be: 9.0122f)
|
|
82A643 85A241402D0E (C: 12.011f)
|
|
82A646 85A24197FBE7 (F: 18.998f)
|
|
82A648 85A23F810625 (H: 1.0080f)
|
|
83A64865 85A24080154D (He: 4.0026f)
|
|
83A64C69 85A240DE147B (Li: 6.94f)
|
|
82A64E 85A241601CAC (N: 14.007f)
|
|
83A64E65 85A241A170A4 (Ne: 20.180f)
|
|
82A64F 85A2417FFBE7 (O: 15.999f)
|
|
|
|
«[[H 1.0080f] [He 4.0026f] [Li 6.94f] [Be 9.0122f]
|
|
[B 10.81f] [C 12.011f] [N 14.007f] [O 15.999f]
|
|
[F 18.998f] [Ne 20.180f]]»
|
|
= A8 8A A8 82A648 85A23F810625 ([H 1.0080f])
|
|
8B A8 83A64865 85A24080154D ([He 4.0026f])
|
|
8B A8 83A64C69 85A240DE147B ([Li 6.94f])
|
|
8B A8 83A64265 85A2411031F9 ([Be 9.0122f])
|
|
8A A8 82A642 85A2412CF5C3 ([B 10.81f])
|
|
8A A8 82A643 85A241402D0E ([C 12.011f])
|
|
8A A8 82A64E 85A241601CAC ([N 14.007f])
|
|
8A A8 82A64F 85A2417FFBE7 ([O 15.999f])
|
|
8A A8 82A646 85A24197FBE7 ([F 18.998f])
|
|
8B A8 83A64E65 85A241A170A4 ([Ne 20.180f])
|
|
|
|
### Annotations.
|
|
|
|
The `Repr` corresponding to textual syntax `@a@b[]`, i.e. an empty sequence annotated with two
|
|
symbols, `a` and `b`, is
|
|
|
|
«@a @b []»
|
|
= [0xBF] ++ varint(|«[]»|) ++ «[]»
|
|
++ varint(|«a»|) ++ «a»
|
|
++ varint(|«b»|) ++ «b»
|
|
= [0xBF, 0x81, 0xA8, 0x82, 0xA6, 0x61, 0x82, 0xA6, 0x62]
|
|
|
|
## Security Considerations
|
|
|
|
**Overlong varints.** The binary format allows (but discourages)
|
|
overlong [varint](#varint)s. Because every `Repr` has a bound on its
|
|
length from its surrounding context, this is not a denial-of-service
|
|
vector *per se*; however, implementations may wish to consider optional
|
|
restrictions on the number of redundant leading `0` bytes accepted when
|
|
reading a varint.
|
|
|
|
**Overlong SignedIntegers.** Similarly, implementations may wish to
|
|
consider optional restrictions on the number of redundant leading `0xFF`
|
|
or `0x00` bytes accepted when reading a `SignedInteger`.
|
|
|
|
**Canonical form for cryptographic hashing and signing.** No canonical
|
|
textual encoding of a `Value` is specified. A
|
|
[canonical form][canonical] exists for binary encoded `Value`s, and
|
|
implementations *SHOULD* produce canonical binary encodings by
|
|
default; however, an implementation *MAY* permit two serializations of
|
|
the same `Value` to yield different binary `Repr`s.
|
|
|
|
## Acknowledgements
|
|
|
|
The exclusion of lengths from `Repr`s, placing lengths instead ahead of
|
|
contained values in sequences, is inspired by [argdata][], as is the
|
|
inclusion of a `NUL` byte in `String` `Repr`s for C interoperability.
|
|
|
|
## Appendix. Summary of syntax
|
|
|
|
{% include cheatsheet-binary.md %}
|
|
|
|
## Appendix. Autodetection of textual or binary syntax
|
|
|
|
Every tag byte in a binary Preserves `Repr` falls within the range
|
|
[`0x80`, `0xBF`]. These bytes, interpreted as UTF-8, are *continuation
|
|
bytes*, and will never occur as the first byte of a UTF-8 encoded code
|
|
point. This means no binary-encoded `Repr` can be misinterpreted as
|
|
valid UTF-8.
|
|
|
|
Conversely, a UTF-8 `Document` must start with a valid codepoint,
|
|
meaning in particular that it must not start with a byte in the range
|
|
[`0x80`, `0xBF`]. This means that no UTF-8 encoded textual-syntax
|
|
Preserves `Document` can be misinterpreted as a binary-syntax `Repr`.
|
|
|
|
Examination of the top two bits of the first byte of an encoded `Value`
|
|
gives its syntax: if the top two bits are `10`, it should be interpreted
|
|
as a binary-syntax `Repr`; otherwise, it should be interpreted as text.
|
|
|
|
**Streaming.** Autodetection is still possible when streaming an
|
|
undetermined number of `Value`s across, say, a TCP/IP connection:
|
|
|
|
- If the text syntax is to be used for the connection, simply start
|
|
writing each `Document` one after the other. Documents for `Atom`s
|
|
are in general ambiguous if not separated from their neighbours by
|
|
whitespace; whitespace *SHOULD* be used to separate adjacent
|
|
documents. Specifically, whitespace separating adjacent documents
|
|
*SHOULD* be ASCII newline (10).
|
|
|
|
- If the binary syntax is to be used for the connection, start the
|
|
connection with byte `0xA8` (sequence). After the initial byte, send
|
|
each value `v` as `varint(|«v»|) ++ «v»`. A side effect of this approach
|
|
is that the entire stream, when complete, is a valid `Sequence`
|
|
`Repr`.
|
|
|
|
## Appendix. Table of tag values
|
|
|
|
(8x) RESERVED 80-8F
|
|
(9x) RESERVED 90-9F
|
|
|
|
A0 - False
|
|
A1 - True
|
|
A2 - Float or Double (length disambiguates)
|
|
A3 - SignedIntegers (0 may be encoded with no bytes at all)
|
|
A4 - String (a trailing NUL is added)
|
|
A5 - ByteString
|
|
A6 - Symbol
|
|
|
|
A7 - Record
|
|
A8 - Sequence
|
|
A9 - Set
|
|
AA - Dictionary
|
|
|
|
AB - Embedded
|
|
|
|
(Ax) RESERVED AC-AF
|
|
(Bx) RESERVED B0-BE
|
|
BF - Annotations. {BF Lval val Lann0 ann0 Lann1 ann1 ...}
|
|
|
|
## Appendix. Binary SignedInteger representation
|
|
|
|
Languages that provide fixed-width machine word types may find the
|
|
following table useful in encoding and decoding binary `SignedInteger`
|
|
values.
|
|
|
|
| Integer range | Bytes required | Encoding (hex) |
|
|
| --- | --- | --- |
|
|
| 0 | 1 | `A3` |
|
|
| -2<sup>7</sup> ≤ n < 2<sup>7</sup> (i8) | 2 | `A3` `XX` |
|
|
| -2<sup>15</sup> ≤ n < 2<sup>15</sup> (i16) | 3 | `A3` `XX` `XX` |
|
|
| -2<sup>23</sup> ≤ n < 2<sup>23</sup> (i24) | 4 | `A3` `XX` `XX` `XX` |
|
|
| -2<sup>31</sup> ≤ n < 2<sup>31</sup> (i32) | 5 | `A3` `XX` `XX` `XX` `XX` |
|
|
| -2<sup>39</sup> ≤ n < 2<sup>39</sup> (i40) | 6 | `A3` `XX` `XX` `XX` `XX` `XX` |
|
|
| -2<sup>47</sup> ≤ n < 2<sup>47</sup> (i48) | 7 | `A3` `XX` `XX` `XX` `XX` `XX` `XX` |
|
|
| -2<sup>55</sup> ≤ n < 2<sup>55</sup> (i56) | 8 | `A3` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
|
|
| -2<sup>63</sup> ≤ n < 2<sup>63</sup> (i64) | 9 | `A3` `XX` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
|
|
|
|
<!-- Heading to visually offset the footnotes from the main document: -->
|
|
## Notes
|