preserves/preserves-binary.md

389 lines
18 KiB
Markdown
Raw Permalink Normal View History

2022-06-18 17:11:08 +00:00
---
no_site_title: true
title: "Preserves: Binary Syntax"
---
Tony Garnock-Jones <tonyg@leastfixedpoint.com>
{{ site.version_date }}. Version {{ site.version }}.
[LEB128]: https://en.wikipedia.org/wiki/LEB128
[argdata]: https://github.com/NuxiNL/argdata
2022-06-18 17:11:08 +00:00
[canonical]: canonical-binary.html
[google-varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
[vlq]: https://en.wikipedia.org/wiki/Variable-length_quantity
2022-06-18 17:11:08 +00:00
*Preserves* is a data model, with associated serialization formats. This
document defines one of those formats: a binary syntax for `Value`s from
the [Preserves data model](preserves.html) that is easy for computer
software to read and write. An [equivalent human-readable text
syntax](preserves-text.html) also exists.
## Machine-Oriented Binary Syntax
A `Repr` is a binary-syntax encoding, or representation, of a `Value`.
### Type and Length representation.
Each `Repr` starts with a tag byte, describing the kind of information
represented. The expected length of the `Repr` is always available
2022-06-13 13:50:15 +00:00
from the surrounding context: either from a containing encoded value, or
from the overall container of the data, which could be a file, an HTTP
message, a UDP packet, etc.
2022-06-18 17:11:08 +00:00
### Atomic Values.
**Booleans.** The "false" boolean's `Repr` is just tag `0xA0`; "true" is
`0xA1`.
**Floats and Doubles.** Both `Float` and `Double` values are represented
as tag `0xA2` followed by big-endian 4- or 8-byte IEEE 754 binary
representations of the values, respectively.
**SignedIntegers.** A `SignedInteger` encodes as tag `0xA3` followed by
a big-endian two's-complement binary representation of the value, taking
at least as many whole bytes as needed to unambiguously identify the
value and its sign. Zero may be represented as the tag alone, with no
following bytes. The most-significant bit in the first byte after the
tag is the sign bit.[^zero-intbytes] The shortest possible encoding
*SHOULD* be used.[^overlong-signedinteger]
[^zero-intbytes]: The value 0 needs zero bytes to identify the value,
so `intbytes(0)` can be the empty byte string. Non-zero values need
at least one byte.
[^overlong-signedinteger]: **Implementation note.** The spec permits
overlong `SignedInteger` encodings to allow e.g. construction of
`Repr`s by filling in partially-completed templates, which can be
useful in resource-constrained situations.
**Strings.** A `String` encodes as tag `0xA4` followed by the UTF-8
encoding of the string, with an additional trailing `NUL` (0) byte. The
`NUL` byte *MUST NOT* be treated as part of the `String`: it exists to
permit zero-copy C interoperability.[^zero-copy-c-string-interop]
[^zero-copy-c-string-interop]: Some care must still be taken when
passing `String` `Repr`s directly to a C-style ABI, since `String`s
may contain the zero Unicode code point, which C library routines
will usually misinterpret as an end-of-string marker.
**ByteStrings.** A `ByteString` encodes as tag `0xA5` followed by the
bytes themselves.
**Symbols.** A `Symbol` encodes as tag `0xA6` followed by the UTF-8
encoding of the symbol's code points.
### Compound Values.
`Repr`s for `Compound` values store the lengths of their contained
values. Each contained `Value` is converted to a `Repr` and stored as
the length of the `Repr` in bytes followed by the `Repr` itself.
Implementations use each stored length to decide when to stop reading
the associated `Repr`. Similarly, no sentinel marks the end of a
sequence of length-prefixed `Repr`s. Implementations use the length of
the containing `Repr`, known from the surrounding context, to decide
when to stop expecting more contained `Repr`s.
2022-06-18 17:11:08 +00:00
<a id="varint"></a> Each length is stored as an [argdata][]-compatible
big-endian base 128 *varint*.[^see-also-leb128] Each byte of a varint
stores seven bits of the length. All bytes have a clear upper bit,
2022-06-20 15:05:47 +00:00
except the final byte, which has the upper bit set. Implementations
*SHOULD* use the shortest encoding for a varint, and *MUST NOT* produce
an encoded varint with more than nine leading `0`
bytes.[^overlong-varint] [^nine-leading-varint-zeroes]
2022-06-18 17:11:08 +00:00
[^see-also-leb128]: Argdata's length representation is very close to
[Variable-length quantity (VLQ)][VLQ] encoding, differing only in
the flipped interpretation of the high bit of each byte. It is
big-endian, unlike [LEB128][] encoding ([as used by
Google][google-varint] in protobufs).
2022-06-18 17:11:08 +00:00
2022-06-20 15:05:47 +00:00
[^overlong-varint]: **Implementation note.** The spec permits overlong
length encodings to reduce wasted activity in resource-constrained
situations. If an implementation is in anything other than a very
low-level language, it is likely to be able to use
[IOList](./conventions.html#iolists)-style data structures to avoid
unnecessary copying.
[^nine-leading-varint-zeroes]: Nine leading zero bytes, plus one
non-zero byte, equals ten bytes in total. Each byte of varint yields
7 bits of usable length indicator, so ten bytes gives 70 bits, while
nine would only give 63, not quite enough for a 64-bit value. Of
course, it may be some time before an encoder legitimately needs to
use a 64-bit length indicator, let alone in a resource-constrained
situation.
**Records.** A `Record` is encoded as tag `0xA7` followed by the
length-prefixed encodings of its label and fields.
2022-06-18 17:11:08 +00:00
**Sequences.** A `Sequence` is encoded as tag `0xA8` followed by the
length-prefixed encodings of its members.
**Sets.** A `Set` is encoded like a `Sequence`, but with tag `0xA9`, and
in some arbitrary order.
2022-06-18 17:11:08 +00:00
**Dictionaries.** A `Dictionary` encodes as tag `0xAA` followed by the
length-prefixed keys and values, in an alternating key/value sequence.
2022-07-10 11:21:31 +00:00
There is *no* ordering requirement on the elements of sets or the key/value pairs of
dictionaries.[^no-sorting-rationale] However, elements of sets and keys in dictionaries *MUST*
be pairwise distinct. In addition, implementations *SHOULD* default to writing set elements in
order sorted lexicographically by their `Repr`s and *SHOULD* default to writing dictionary
key/value pairs in order sorted lexicographically by the `Repr`s of their
keys[^not-sorted-semantically]. Implementations *MAY* offer the option of serializing in some
other implementation-defined order.
2022-06-18 17:11:08 +00:00
[^no-sorting-rationale]: In the BitTorrent encoding format,
[bencoding](http://www.bittorrent.org/beps/bep_0003.html#bencoding),
dictionary key/value pairs must be sorted by key. This is a
necessary step for ensuring serialization of `Value`s is
canonical. We do not require that key/value pairs (or set
elements) be in sorted order for serialized `Value`s; however, a
[canonical form][canonical] for `Repr`s does exist where a sorted
ordering is required.
[^not-sorted-semantically]: It's important to note that the sort
ordering for writing out set elements and dictionary key/value
pairs is *not* the same as the sort ordering implied by the
semantic ordering of those elements or keys. For example, the
`Repr` of a negative number very far from zero will start with a
2022-06-18 17:11:08 +00:00
byte that is *greater* than the byte which starts the `Repr` of
zero, making it sort lexicographically later by `Repr`, despite
being semantically *less than* zero.
**Rationale**. This is for ease-of-implementation reasons: not all
languages can easily represent sorted sets or sorted dictionaries,
but encoding and then sorting byte strings is much more likely to
be within easy reach.
### Embedded Values.
Embedded values are encoded as tag `0xAB` followed by the encoding of
some `Value` chosen to represent the denoted embedded object.
2022-06-18 17:11:08 +00:00
### Annotations.
The encoding of a sequence of annotations for a `Repr` uses tag `0xBF`,
followed by the length-prefixed `Repr`, followed by the length-prefixed
encoded annotations, in order. The `Repr` *MUST NOT* already have
annotations (must not begin with `0xBF`), and there *MUST* be at least
one `Value` in the sequence following the `Repr`.
## Examples (normative)
We write `«v»` for the `Repr` of some `Value` `v`, and `varint(|«v»|)` for
the varint-encoded length of the `Repr` of `v`.
2022-06-19 12:36:05 +00:00
### Varints (length representations).
The following table illustrates varint-encoding.
| Number, `m` | `m` in binary, grouped into 7-bit chunks | `varint(m)` bytes |
|-------------|-------------------------------------------|-------------------|
| 15 | `0001111` | 143 |
| 300 | `0000010 0101100` | 2 172 |
| 1000000000 | `0000011 1011100 1101011 0010100 0000000` | 3 92 107 20 128 |
2022-06-19 12:36:05 +00:00
2022-06-19 19:10:42 +00:00
### Atoms.
«#f» = A0 «#t» = A1
«0.123f» = A2 3D FB E7 6D «0.123» = A2 3F BF 7C ED 91 68 72 B0
«-257» = A3 FE FF «-3» = A3 FD «128» = A3 00 80
«-256» = A3 FF 00 «-2» = A3 FE «255» = A3 00 FF
«-255» = A3 FF 01 «-1» = A3 FF «256» = A3 01 00
«-254» = A3 FF 02 «0» = A3 «32767» = A3 7F FF
«-129» = A3 FF 7F «1» = A3 01 «32768» = A3 00 80 00
«-128» = A3 80 «12» = A3 0C «65535» = A3 00 FF FF
«-127» = A3 81 «13» = A3 0D «65536» = A3 01 00 00
«-4» = A3 FC «127» = A3 7F «131072» = A3 02 00 00
«87112285931760246646623899502532662132736»
= A3 01 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00
00 00
«""» = A4 00 «||» = A6
«"a"» = A4 61 00 «|a|» = A6 61
«"hello"» = A4 68 65 6C 6C 6F 00 «|hello|» = A6 68 65 6C 6C 6F
«#[]» = A5
«#[AQ==]» = A5 01
«#[ATAyMDMwNDA1]» = A5 01 02 03 04 05
### Compounds.
«<window 100 120 500 300>» = A7 87 A6 77696E646F77
82 A3 64
82 A3 78
83 A3 01F4
83 A3 012C
«["zzzz(...192 zs...)zzzz"]»
(a length-1 sequence containing a length-200 string)
= A8 01CA A4 7A7A7A7A (... 192 repetitions of 7A ...)
7A7A7A7A 00
«[H, He, Li, Be, B, C, N, O, F, Ne]»
= A8 82A648 83A64865 83A64C69 83A64265
82A642 82A643 82A64E 82A64F 82A646 83A64E65
«#{H He Li Be B C N O F Ne}»
= A9 82A642 (B)
83A64265 (Be)
82A643 (C)
82A646 (F)
82A648 (H)
83A64865 (He)
83A64C69 (Li)
82A64E (N)
83A64E65 (Ne)
82A64F (O)
«{H: 1.0080f, He: 4.0026f, Li: 6.94f, Be: 9.0122f,
B: 10.81f, C: 12.011f, N: 14.007f, O: 15.999f,
F: 18.998f, Ne: 20.180f}»
= AA 82A642 85A2412CF5C3 (B: 10.81f)
83A64265 85A2411031F9 (Be: 9.0122f)
82A643 85A241402D0E (C: 12.011f)
82A646 85A24197FBE7 (F: 18.998f)
82A648 85A23F810625 (H: 1.0080f)
83A64865 85A24080154D (He: 4.0026f)
83A64C69 85A240DE147B (Li: 6.94f)
82A64E 85A241601CAC (N: 14.007f)
83A64E65 85A241A170A4 (Ne: 20.180f)
82A64F 85A2417FFBE7 (O: 15.999f)
«[[H 1.0080f] [He 4.0026f] [Li 6.94f] [Be 9.0122f]
[B 10.81f] [C 12.011f] [N 14.007f] [O 15.999f]
[F 18.998f] [Ne 20.180f]]»
= A8 8A A8 82A648 85A23F810625 ([H 1.0080f])
8B A8 83A64865 85A24080154D ([He 4.0026f])
8B A8 83A64C69 85A240DE147B ([Li 6.94f])
8B A8 83A64265 85A2411031F9 ([Be 9.0122f])
8A A8 82A642 85A2412CF5C3 ([B 10.81f])
8A A8 82A643 85A241402D0E ([C 12.011f])
8A A8 82A64E 85A241601CAC ([N 14.007f])
8A A8 82A64F 85A2417FFBE7 ([O 15.999f])
8A A8 82A646 85A24197FBE7 ([F 18.998f])
8B A8 83A64E65 85A241A170A4 ([Ne 20.180f])
2022-06-19 12:36:05 +00:00
### Annotations.
The `Repr` corresponding to textual syntax `@a@b[]`, i.e. an empty sequence annotated with two
symbols, `a` and `b`, is
2022-06-18 17:11:08 +00:00
«@a @b []»
= [0xBF] ++ varint(|«[]»|) ++ «[]»
++ varint(|«a»|) ++ «a»
++ varint(|«b»|) ++ «b»
= [0xBF, 0x81, 0xA8, 0x82, 0xA6, 0x61, 0x82, 0xA6, 0x62]
2022-06-18 17:11:08 +00:00
## Security Considerations
**Overlong varints.** The binary format allows (but discourages)
overlong [varint](#varint)s. Because every `Repr` has a bound on its
length from its surrounding context, this is not a denial-of-service
vector *per se*; however, implementations may wish to consider optional
restrictions on the number of redundant leading `0` bytes accepted when
reading a varint.
2022-06-19 19:19:04 +00:00
**Overlong SignedIntegers.** Similarly, implementations may wish to
consider optional restrictions on the number of redundant leading `0xFF`
or `0x00` bytes accepted when reading a `SignedInteger`.
2022-06-18 17:11:08 +00:00
**Canonical form for cryptographic hashing and signing.** No canonical
textual encoding of a `Value` is specified. A
[canonical form][canonical] exists for binary encoded `Value`s, and
implementations *SHOULD* produce canonical binary encodings by
default; however, an implementation *MAY* permit two serializations of
the same `Value` to yield different binary `Repr`s.
## Acknowledgements
The exclusion of lengths from `Repr`s, placing lengths instead ahead of
2022-06-12 08:10:35 +00:00
contained values in sequences, is inspired by [argdata][], as is the
inclusion of a `NUL` byte in `String` `Repr`s for C interoperability.
2022-06-20 15:05:47 +00:00
## Appendix. Summary of syntax
{% include cheatsheet-binary.md %}
2022-06-18 17:11:08 +00:00
## Appendix. Autodetection of textual or binary syntax
Every tag byte in a binary Preserves `Repr` falls within the range
2022-06-18 17:11:08 +00:00
[`0x80`, `0xBF`]. These bytes, interpreted as UTF-8, are *continuation
bytes*, and will never occur as the first byte of a UTF-8 encoded code
point. This means no binary-encoded `Repr` can be misinterpreted as
2022-06-18 17:11:08 +00:00
valid UTF-8.
Conversely, a UTF-8 `Document` must start with a valid codepoint,
2022-06-18 17:11:08 +00:00
meaning in particular that it must not start with a byte in the range
[`0x80`, `0xBF`]. This means that no UTF-8 encoded textual-syntax
Preserves `Document` can be misinterpreted as a binary-syntax `Repr`.
2022-06-18 17:11:08 +00:00
Examination of the top two bits of the first byte of an encoded `Value`
gives its syntax: if the top two bits are `10`, it should be interpreted
as a binary-syntax `Repr`; otherwise, it should be interpreted as text.
**Streaming.** Autodetection is still possible when streaming an
undetermined number of `Value`s across, say, a TCP/IP connection:
- If the text syntax is to be used for the connection, simply start
writing each `Document` one after the other. Documents for `Atom`s
2022-06-13 13:53:16 +00:00
are in general ambiguous if not separated from their neighbours by
whitespace; whitespace *SHOULD* be used to separate adjacent
documents. Specifically, whitespace separating adjacent documents
*SHOULD* be ASCII newline (10).
- If the binary syntax is to be used for the connection, start the
connection with byte `0xA8` (sequence). After the initial byte, send
each value `v` as `varint(|«v»|) ++ «v»`. A side effect of this approach
is that the entire stream, when complete, is a valid `Sequence`
`Repr`.
2022-06-18 17:11:08 +00:00
## Appendix. Table of tag values
(8x) RESERVED 80-8F
(9x) RESERVED 90-9F
A0 - False
A1 - True
A2 - Float or Double (length disambiguates)
2022-06-19 15:25:40 +00:00
A3 - SignedIntegers (0 may be encoded with no bytes at all)
A4 - String (a trailing NUL is added)
A5 - ByteString
A6 - Symbol
A7 - Record
A8 - Sequence
A9 - Set
AA - Dictionary
AB - Embedded
(Ax) RESERVED AC-AF
(Bx) RESERVED B0-BE
BF - Annotations. {BF Lval val Lann0 ann0 Lann1 ann1 ...}
2022-06-18 17:11:08 +00:00
## Appendix. Binary SignedInteger representation
Languages that provide fixed-width machine word types may find the
following table useful in encoding and decoding binary `SignedInteger`
values.
| Integer range | Bytes required | Encoding (hex) |
| --- | --- | --- |
| 0 | 1 | `A3` |
| -2<sup>7</sup> ≤ n < 2<sup>7</sup> (i8) | 2 | `A3` `XX` |
| -2<sup>15</sup> ≤ n < 2<sup>15</sup> (i16) | 3 | `A3` `XX` `XX` |
| -2<sup>23</sup> ≤ n < 2<sup>23</sup> (i24) | 4 | `A3` `XX` `XX` `XX` |
2022-06-18 17:11:08 +00:00
| -2<sup>31</sup> ≤ n < 2<sup>31</sup> (i32) | 5 | `A3` `XX` `XX` `XX` `XX` |
| -2<sup>39</sup> ≤ n < 2<sup>39</sup> (i40) | 6 | `A3` `XX` `XX` `XX` `XX` `XX` |
| -2<sup>47</sup> ≤ n < 2<sup>47</sup> (i48) | 7 | `A3` `XX` `XX` `XX` `XX` `XX` `XX` |
| -2<sup>55</sup> ≤ n < 2<sup>55</sup> (i56) | 8 | `A3` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
| -2<sup>63</sup> ≤ n < 2<sup>63</sup> (i64) | 9 | `A3` `XX` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
2022-06-18 17:11:08 +00:00
<!-- Heading to visually offset the footnotes from the main document: -->
## Notes