preserves/preserves-binary.md

257 lines
11 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

---
no_site_title: true
title: "Preserves: Binary Syntax"
---
Tony Garnock-Jones <tonyg@leastfixedpoint.com>
{{ site.version_date }}. Version {{ site.version }}.
[varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
[LEB128]: https://en.wikipedia.org/wiki/LEB128
[canonical]: canonical-binary.html
*Preserves* is a data model, with associated serialization formats. This
document defines one of those formats: a binary syntax for `Value`s from
the [Preserves data model](preserves.html) that is easy for computer
software to read and write. An [equivalent human-readable text
syntax](preserves-text.html) also exists.
## Machine-Oriented Binary Syntax
A `Repr` is a binary-syntax encoding, or representation, of a `Value`.
For a value `v`, we write `«v»` for the `Repr` of v.
### Type and Length representation.
Each `Repr` starts with a tag byte, describing the kind of information
represented. Depending on the tag, a length indicator, further encoded
information, and/or an ending tag may follow.
tag (simple atomic data and small integers)
tag ++ binarydata (most integers)
tag ++ length ++ binarydata (large integers, strings, symbols, and binary)
tag ++ repr ++ ... ++ endtag (compound data)
The unique end tag is byte value `0x84`.
If present after a tag, the length of a following piece of binary data
is formatted as a [base 128 varint][varint].[^see-also-leb128] We
write `varint(m)` for the varint-encoding of `m`. Quoting the
[Google Protocol Buffers][varint] definition,
[^see-also-leb128]: Also known as [LEB128][] encoding, for unsigned
integers. Varints and LEB128-encoded integers differ only for
signed integers, which are not used in Preserves.
> Each byte in a varint, except the last byte, has the most
> significant bit (msb) set this indicates that there are further
> bytes to come. The lower 7 bits of each byte are used to store the
> two's complement representation of the number in groups of 7 bits,
> least significant group first.
The following table illustrates varint-encoding.
| Number, `m` | `m` in binary, grouped into 7-bit chunks | `varint(m)` bytes |
| ------ | ------------------- | ------------ |
| 15 | `0001111` | 15 |
| 300 | `0000010 0101100` | 172 2 |
| 1000000000 | `0000011 1011100 1101011 0010100 0000000` | 128 148 235 220 3 |
It is an error for a varint-encoded `m` in a `Repr` to be anything
other than the unique shortest encoding for that `m`. That is, a
varint-encoding of `m` *MUST NOT* end in `0` unless `m`=0.
### Records, Sequences, Sets and Dictionaries.
«<L F_1...F_m>» = [0xB4] ++ «L» ++ «F_1» ++...++ «F_m» ++ [0x84]
«[X_1...X_m]» = [0xB5] ++ «X_1» ++...++ «X_m» ++ [0x84]
«#{E_1...E_m}» = [0xB6] ++ «E_1» ++...++ «E_m» ++ [0x84]
«{K_1:V_1...K_m:V_m}» = [0xB7] ++ «K_1» ++ «V_1» ++...++ «K_m» ++ «V_m» ++ [0x84]
There is *no* ordering requirement on the `E_i` elements or
`K_i`/`V_i` pairs.[^no-sorting-rationale] They may appear in any
order. However, the `E_i` and `K_i` *MUST* be pairwise distinct. In
addition, implementations *SHOULD* default to writing set elements and
dictionary key/value pairs in order sorted lexicographically by their
`Repr`s[^not-sorted-semantically], and *MAY* offer the option of
serializing in some other implementation-defined order.
[^no-sorting-rationale]: In the BitTorrent encoding format,
[bencoding](http://www.bittorrent.org/beps/bep_0003.html#bencoding),
dictionary key/value pairs must be sorted by key. This is a
necessary step for ensuring serialization of `Value`s is
canonical. We do not require that key/value pairs (or set
elements) be in sorted order for serialized `Value`s; however, a
[canonical form][canonical] for `Repr`s does exist where a sorted
ordering is required.
[^not-sorted-semantically]: It's important to note that the sort
ordering for writing out set elements and dictionary key/value
pairs is *not* the same as the sort ordering implied by the
semantic ordering of those elements or keys. For example, the
`Repr` of a negative number very far from zero will start with
byte that is *greater* than the byte which starts the `Repr` of
zero, making it sort lexicographically later by `Repr`, despite
being semantically *less than* zero.
**Rationale**. This is for ease-of-implementation reasons: not all
languages can easily represent sorted sets or sorted dictionaries,
but encoding and then sorting byte strings is much more likely to
be within easy reach.
### SignedIntegers.
«x» when x ∈ SignedInteger = [0xB0] ++ varint(m) ++ intbytes(x) if ¬(-3≤x≤12) ∧ m>16
([0xA0] + m - 1) ++ intbytes(x) if ¬(-3≤x≤12) ∧ m≤16
([0xA0] + x) if (-3≤x≤-1)
([0x90] + x) if ( 0≤x≤12)
where m = |intbytes(x)|
Integers in the range [-3,12] are compactly represented with tags
between `0x90` and `0x9F` because they are so frequently used.
Integers up to 16 bytes long are represented with a single-byte tag
encoding the length of the integer. Larger integers are represented
with an explicit varint length. Every `SignedInteger` *MUST* be
represented with its shortest possible encoding.
The function `intbytes(x)` gives the big-endian two's-complement
binary representation of `x`, taking exactly as many whole bytes as
needed to unambiguously identify the value and its sign, and `m =
|intbytes(x)|`. The most-significant bit in the first byte in
`intbytes(x)` <!-- for `x`≠0 --> is the sign bit.[^zero-intbytes] For
example,
«87112285931760246646623899502532662132736»
= B0 12 01 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00
00 00
«-257» = A1 FE FF «-3» = 9D «128» = A1 00 80
«-256» = A1 FF 00 «-2» = 9E «255» = A1 00 FF
«-255» = A1 FF 01 «-1» = 9F «256» = A1 01 00
«-254» = A1 FF 02 «0» = 90 «32767» = A1 7F FF
«-129» = A1 FF 7F «1» = 91 «32768» = A2 00 80 00
«-128» = A0 80 «12» = 9C «65535» = A2 00 FF FF
«-127» = A0 81 «13» = A0 0D «65536» = A2 01 00 00
«-4» = A0 FC «127» = A0 7F «131072» = A2 02 00 00
[^zero-intbytes]: The value 0 needs zero bytes to identify the
value, so `intbytes(0)` is the empty byte string. Non-zero values
need at least one byte.
### Strings, ByteStrings and Symbols.
Syntax for these three types varies only in the tag used. For `String`
and `Symbol`, the data following the tag is a UTF-8 encoding of the
`Value`'s code points, while for `ByteString` it is the raw data
contained within the `Value` unmodified.
«S» = [0xB1] ++ varint(|utf8(S)|) ++ utf8(S) if S ∈ String
[0xB2] ++ varint(|S|) ++ S if S ∈ ByteString
[0xB3] ++ varint(|utf8(S)|) ++ utf8(S) if S ∈ Symbol
### Booleans.
«#f» = [0x80]
«#t» = [0x81]
### Floats and Doubles.
«F» when F ∈ Float = [0x82] ++ binary32(F)
«D» when D ∈ Double = [0x83] ++ binary64(D)
The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
8-byte IEEE 754 binary representations of `F` and `D`, respectively.
### Embeddeds.
The `Repr` of an `Embedded` is the `Repr` of a `Value` chosen to
represent the denoted object, prefixed with `[0x86]`.
«#!V» = [0x86] ++ «V»
### Annotations.
To annotate a `Repr` `r` with some `Value` `v`, prepend `r` with
`[0x85] ++ «v»`. For example, the `Repr` corresponding to textual
syntax `@a@b[]`, i.e. an empty sequence annotated with two symbols,
`a` and `b`, is
«@a @b []»
= [0x85] ++ «a» ++ [0x85] ++ «b» ++ «[]»
= [0x85, 0xB3, 0x01, 0x61, 0x85, 0xB3, 0x01, 0x62, 0xB5, 0x84]
## Security Considerations
**Annotations.** In modes where a `Value` is being read while
annotations are skipped, an endless sequence of annotations may give an
illusion of progress.
**Canonical form for cryptographic hashing and signing.** No canonical
textual encoding of a `Value` is specified. A
[canonical form][canonical] exists for binary encoded `Value`s, and
implementations *SHOULD* produce canonical binary encodings by
default; however, an implementation *MAY* permit two serializations of
the same `Value` to yield different binary `Repr`s.
## Appendix. Autodetection of textual or binary syntax
Every tag byte in a binary Preserves `Document` falls within the range
[`0x80`, `0xBF`]. These bytes, interpreted as UTF-8, are *continuation
bytes*, and will never occur as the first byte of a UTF-8 encoded code
point. This means no binary-encoded document can be misinterpreted as
valid UTF-8.
Conversely, a UTF-8 document must start with a valid codepoint,
meaning in particular that it must not start with a byte in the range
[`0x80`, `0xBF`]. This means that no UTF-8 encoded textual-syntax
Preserves document can be misinterpreted as a binary-syntax document.
Examination of the top two bits of the first byte of a document gives
its syntax: if the top two bits are `10`, it should be interpreted as
a binary-syntax document; otherwise, it should be interpreted as text.
## Appendix. Table of tag values
80 - False
81 - True
82 - Float
83 - Double
84 - End marker
85 - Annotation
86 - Embedded
(8x) RESERVED 87-8F
9x - Small integers 0..12,-3..-1
An - Medium integers, (n+1) bytes long
B0 - Large integers, variable length
B1 - String
B2 - ByteString
B3 - Symbol
B4 - Record
B5 - Sequence
B6 - Set
B7 - Dictionary
## Appendix. Binary SignedInteger representation
Languages that provide fixed-width machine word types may find the
following table useful in encoding and decoding binary `SignedInteger`
values.
| Integer range | Bytes required | Encoding (hex) |
| --- | --- | --- |
| -3 ≤ n ≤ 12 | 1 | `9X` |
| -2<sup>7</sup> ≤ n < 2<sup>7</sup> (i8) | 2 | `A0` `XX` |
| -2<sup>15</sup> ≤ n < 2<sup>15</sup> (i16) | 3 | `A1` `XX` `XX` |
| -2<sup>23</sup> ≤ n < 2<sup>23</sup> (i24) | 4 | `A2` `XX` `XX` `XX` |
| -2<sup>31</sup> ≤ n < 2<sup>31</sup> (i32) | 5 | `A3` `XX` `XX` `XX` `XX` |
| -2<sup>39</sup> ≤ n < 2<sup>39</sup> (i40) | 6 | `A4` `XX` `XX` `XX` `XX` `XX` |
| -2<sup>47</sup> ≤ n < 2<sup>47</sup> (i48) | 7 | `A5` `XX` `XX` `XX` `XX` `XX` `XX` |
| -2<sup>55</sup> ≤ n < 2<sup>55</sup> (i56) | 8 | `A6` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
| -2<sup>63</sup> ≤ n < 2<sup>63</sup> (i64) | 9 | `A7` `XX` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
<!-- Heading to visually offset the footnotes from the main document: -->
## Notes