preserves/preserves-binary.md

257 lines
11 KiB
Markdown
Raw Normal View History

2022-06-18 17:11:08 +00:00
---
no_site_title: true
title: "Preserves: Binary Syntax"
---
Tony Garnock-Jones <tonyg@leastfixedpoint.com>
{{ site.version_date }}. Version {{ site.version }}.
[varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
[LEB128]: https://en.wikipedia.org/wiki/LEB128
[canonical]: canonical-binary.html
*Preserves* is a data model, with associated serialization formats. This
document defines one of those formats: a binary syntax for `Value`s from
the [Preserves data model](preserves.html) that is easy for computer
software to read and write. An [equivalent human-readable text
syntax](preserves-text.html) also exists.
## Machine-Oriented Binary Syntax
A `Repr` is a binary-syntax encoding, or representation, of a `Value`.
For a value `v`, we write `«v»` for the `Repr` of v.
### Type and Length representation.
Each `Repr` starts with a tag byte, describing the kind of information
represented. Depending on the tag, a length indicator, further encoded
information, and/or an ending tag may follow.
tag (simple atomic data and small integers)
tag ++ binarydata (most integers)
tag ++ length ++ binarydata (large integers, strings, symbols, and binary)
tag ++ repr ++ ... ++ endtag (compound data)
The unique end tag is byte value `0x84`.
If present after a tag, the length of a following piece of binary data
is formatted as a [base 128 varint][varint].[^see-also-leb128] We
write `varint(m)` for the varint-encoding of `m`. Quoting the
[Google Protocol Buffers][varint] definition,
[^see-also-leb128]: Also known as [LEB128][] encoding, for unsigned
integers. Varints and LEB128-encoded integers differ only for
signed integers, which are not used in Preserves.
> Each byte in a varint, except the last byte, has the most
> significant bit (msb) set this indicates that there are further
> bytes to come. The lower 7 bits of each byte are used to store the
> two's complement representation of the number in groups of 7 bits,
> least significant group first.
The following table illustrates varint-encoding.
| Number, `m` | `m` in binary, grouped into 7-bit chunks | `varint(m)` bytes |
| ------ | ------------------- | ------------ |
| 15 | `0001111` | 15 |
| 300 | `0000010 0101100` | 172 2 |
| 1000000000 | `0000011 1011100 1101011 0010100 0000000` | 128 148 235 220 3 |
It is an error for a varint-encoded `m` in a `Repr` to be anything
other than the unique shortest encoding for that `m`. That is, a
varint-encoding of `m` *MUST NOT* end in `0` unless `m`=0.
### Records, Sequences, Sets and Dictionaries.
«<L F_1...F_m>» = [0xB4] ++ «L» ++ «F_1» ++...++ «F_m» ++ [0x84]
«[X_1...X_m]» = [0xB5] ++ «X_1» ++...++ «X_m» ++ [0x84]
«#{E_1...E_m}» = [0xB6] ++ «E_1» ++...++ «E_m» ++ [0x84]
«{K_1:V_1...K_m:V_m}» = [0xB7] ++ «K_1» ++ «V_1» ++...++ «K_m» ++ «V_m» ++ [0x84]
There is *no* ordering requirement on the `E_i` elements or
`K_i`/`V_i` pairs.[^no-sorting-rationale] They may appear in any
order. However, the `E_i` and `K_i` *MUST* be pairwise distinct. In
addition, implementations *SHOULD* default to writing set elements and
dictionary key/value pairs in order sorted lexicographically by their
`Repr`s[^not-sorted-semantically], and *MAY* offer the option of
serializing in some other implementation-defined order.
[^no-sorting-rationale]: In the BitTorrent encoding format,
[bencoding](http://www.bittorrent.org/beps/bep_0003.html#bencoding),
dictionary key/value pairs must be sorted by key. This is a
necessary step for ensuring serialization of `Value`s is
canonical. We do not require that key/value pairs (or set
elements) be in sorted order for serialized `Value`s; however, a
[canonical form][canonical] for `Repr`s does exist where a sorted
ordering is required.
[^not-sorted-semantically]: It's important to note that the sort
ordering for writing out set elements and dictionary key/value
pairs is *not* the same as the sort ordering implied by the
semantic ordering of those elements or keys. For example, the
`Repr` of a negative number very far from zero will start with
byte that is *greater* than the byte which starts the `Repr` of
zero, making it sort lexicographically later by `Repr`, despite
being semantically *less than* zero.
**Rationale**. This is for ease-of-implementation reasons: not all
languages can easily represent sorted sets or sorted dictionaries,
but encoding and then sorting byte strings is much more likely to
be within easy reach.
### SignedIntegers.
«x» when x ∈ SignedInteger = [0xB0] ++ varint(m) ++ intbytes(x) if ¬(-3≤x≤12) ∧ m>16
([0xA0] + m - 1) ++ intbytes(x) if ¬(-3≤x≤12) ∧ m≤16
([0xA0] + x) if (-3≤x≤-1)
([0x90] + x) if ( 0≤x≤12)
where m = |intbytes(x)|
Integers in the range [-3,12] are compactly represented with tags
between `0x90` and `0x9F` because they are so frequently used.
Integers up to 16 bytes long are represented with a single-byte tag
encoding the length of the integer. Larger integers are represented
with an explicit varint length. Every `SignedInteger` *MUST* be
represented with its shortest possible encoding.
The function `intbytes(x)` gives the big-endian two's-complement
binary representation of `x`, taking exactly as many whole bytes as
needed to unambiguously identify the value and its sign, and `m =
|intbytes(x)|`. The most-significant bit in the first byte in
`intbytes(x)` <!-- for `x`≠0 --> is the sign bit.[^zero-intbytes] For
example,
«87112285931760246646623899502532662132736»
= B0 12 01 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00
00 00
«-257» = A1 FE FF «-3» = 9D «128» = A1 00 80
«-256» = A1 FF 00 «-2» = 9E «255» = A1 00 FF
«-255» = A1 FF 01 «-1» = 9F «256» = A1 01 00
«-254» = A1 FF 02 «0» = 90 «32767» = A1 7F FF
«-129» = A1 FF 7F «1» = 91 «32768» = A2 00 80 00
«-128» = A0 80 «12» = 9C «65535» = A2 00 FF FF
«-127» = A0 81 «13» = A0 0D «65536» = A2 01 00 00
«-4» = A0 FC «127» = A0 7F «131072» = A2 02 00 00
[^zero-intbytes]: The value 0 needs zero bytes to identify the
value, so `intbytes(0)` is the empty byte string. Non-zero values
need at least one byte.
### Strings, ByteStrings and Symbols.
Syntax for these three types varies only in the tag used. For `String`
and `Symbol`, the data following the tag is a UTF-8 encoding of the
`Value`'s code points, while for `ByteString` it is the raw data
contained within the `Value` unmodified.
«S» = [0xB1] ++ varint(|utf8(S)|) ++ utf8(S) if S ∈ String
[0xB2] ++ varint(|S|) ++ S if S ∈ ByteString
[0xB3] ++ varint(|utf8(S)|) ++ utf8(S) if S ∈ Symbol
### Booleans.
«#f» = [0x80]
«#t» = [0x81]
### Floats and Doubles.
«F» when F ∈ Float = [0x82] ++ binary32(F)
«D» when D ∈ Double = [0x83] ++ binary64(D)
The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
8-byte IEEE 754 binary representations of `F` and `D`, respectively.
### Embeddeds.
The `Repr` of an `Embedded` is the `Repr` of a `Value` chosen to
represent the denoted object, prefixed with `[0x86]`.
«#!V» = [0x86] ++ «V»
### Annotations.
To annotate a `Repr` `r` with some `Value` `v`, prepend `r` with
`[0x85] ++ «v»`. For example, the `Repr` corresponding to textual
syntax `@a@b[]`, i.e. an empty sequence annotated with two symbols,
`a` and `b`, is
«@a @b []»
= [0x85] ++ «a» ++ [0x85] ++ «b» ++ «[]»
= [0x85, 0xB3, 0x01, 0x61, 0x85, 0xB3, 0x01, 0x62, 0xB5, 0x84]
## Security Considerations
**Annotations.** In modes where a `Value` is being read while
annotations are skipped, an endless sequence of annotations may give an
illusion of progress.
**Canonical form for cryptographic hashing and signing.** No canonical
textual encoding of a `Value` is specified. A
[canonical form][canonical] exists for binary encoded `Value`s, and
implementations *SHOULD* produce canonical binary encodings by
default; however, an implementation *MAY* permit two serializations of
the same `Value` to yield different binary `Repr`s.
## Appendix. Autodetection of textual or binary syntax
Every tag byte in a binary Preserves `Document` falls within the range
[`0x80`, `0xBF`]. These bytes, interpreted as UTF-8, are *continuation
bytes*, and will never occur as the first byte of a UTF-8 encoded code
point. This means no binary-encoded document can be misinterpreted as
valid UTF-8.
Conversely, a UTF-8 document must start with a valid codepoint,
meaning in particular that it must not start with a byte in the range
[`0x80`, `0xBF`]. This means that no UTF-8 encoded textual-syntax
Preserves document can be misinterpreted as a binary-syntax document.
Examination of the top two bits of the first byte of a document gives
its syntax: if the top two bits are `10`, it should be interpreted as
a binary-syntax document; otherwise, it should be interpreted as text.
## Appendix. Table of tag values
80 - False
81 - True
82 - Float
83 - Double
84 - End marker
85 - Annotation
86 - Embedded
(8x) RESERVED 87-8F
9x - Small integers 0..12,-3..-1
An - Medium integers, (n+1) bytes long
B0 - Large integers, variable length
B1 - String
B2 - ByteString
B3 - Symbol
B4 - Record
B5 - Sequence
B6 - Set
B7 - Dictionary
## Appendix. Binary SignedInteger representation
Languages that provide fixed-width machine word types may find the
following table useful in encoding and decoding binary `SignedInteger`
values.
| Integer range | Bytes required | Encoding (hex) |
| --- | --- | --- |
| -3 ≤ n ≤ 12 | 1 | `9X` |
| -2<sup>7</sup> ≤ n < 2<sup>7</sup> (i8) | 2 | `A0` `XX` |
| -2<sup>15</sup> ≤ n < 2<sup>15</sup> (i16) | 3 | `A1` `XX` `XX` |
| -2<sup>23</sup> ≤ n < 2<sup>23</sup> (i24) | 4 | `A2` `XX` `XX` `XX` |
| -2<sup>31</sup> ≤ n < 2<sup>31</sup> (i32) | 5 | `A3` `XX` `XX` `XX` `XX` |
| -2<sup>39</sup> ≤ n < 2<sup>39</sup> (i40) | 6 | `A4` `XX` `XX` `XX` `XX` `XX` |
| -2<sup>47</sup> ≤ n < 2<sup>47</sup> (i48) | 7 | `A5` `XX` `XX` `XX` `XX` `XX` `XX` |
| -2<sup>55</sup> ≤ n < 2<sup>55</sup> (i56) | 8 | `A6` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
| -2<sup>63</sup> ≤ n < 2<sup>63</sup> (i64) | 9 | `A7` `XX` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
<!-- Heading to visually offset the footnotes from the main document: -->
## Notes